Re: Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread Saisai Shao
Writing RDD-based applications using PySpark brings in additional overheads: Spark runs on the JVM whereas your Python code runs in the Python runtime, so data has to be communicated between the JVM world and the Python world, and this requires additional serialization/deserialization and IPC. Also

Fw: Significant performance difference for same spark job in scala vs pyspark

2016-05-05 Thread pratik gawande
Hello, I am new to Spark. For one of my jobs I am seeing a significant performance difference when it is run in PySpark vs Scala. Could you please let me know if this is known and whether Scala is preferred over Python for writing Spark jobs? Also, the DAG visualization shows completely different DAGs for Scala and

unsubscribe

2016-05-05 Thread Brindha Sengottaiyan

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
But why? Any specific reason behind it? I am aware that we can persist the dataframes, but before proceeding I would like to know the storage level of my DFs. I am working on performance tuning of my Spark jobs and am looking for storage-level APIs like the ones RDDs have. Thanks, Divya On 6 May 2016 at 11:16,

Re: [Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Ted Yu
I am afraid there is no such API. When persisting, you can specify StorageLevel: def persist(newLevel: StorageLevel): this.type = { Can you tell us your use case? Thanks On Thu, May 5, 2016 at 8:06 PM, Divya Gehlot wrote: > Hi, > How can I get and set storage
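A minimal sketch of what this implies (Scala, Spark 1.5 API; the source table name is a placeholder): the storage level is chosen once, at persist time, and there is no getter to read it back from the DataFrame afterwards.

  import org.apache.spark.storage.StorageLevel

  val df = sqlContext.table("events")           // hypothetical source table
  df.persist(StorageLevel.MEMORY_AND_DISK_SER)  // choose the level explicitly here
  df.count()                                    // first action materializes the cache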

[Spark 1.5.2 ]-how to set and get Storage level for Dataframe

2016-05-05 Thread Divya Gehlot
Hi, How can I get and set the storage level for DataFrames, as I can for RDDs, as mentioned in the following book link: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-caching.html Thanks, Divya

Disable parquet metadata summary in

2016-05-05 Thread Bijay Kumar Pathak
Hi, How can we disable writing _common_metadata while saving a DataFrame in Parquet format in PySpark? I tried to set the property using the command below, but it didn't help. sparkContext._jsc.hadoopConfiguration().set("parquet.enable.summary-metadata", "false") Thanks, Bijay
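One thing that is sometimes suggested for this (a sketch only, not a confirmed fix for this report) is to pass the property with the spark.hadoop. prefix at submit time, e.g. --conf spark.hadoop.parquet.enable.summary-metadata=false, so that it is copied into the Hadoop configuration; the Scala equivalent in a shell session would be:

  // Set the Parquet writer property on the SparkContext's Hadoop configuration
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

  // Hypothetical DataFrame and output path; no _metadata/_common_metadata summaries expected
  df.write.parquet("/tmp/events_parquet")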

How long should logistic regression take on this data?

2016-05-05 Thread Bibudh Lahiri
Hi, I am doing the following exercise: I have 100 million labeled records (total 2.7 GB data) in LibSVM (sparse) format, split across 200 files on HDFS (each file ~14 MB), so each file has about 500K records. Only 50K of these 100 million are labeled as "positive", and the rest are all

Re: H2O + Spark Streaming?

2016-05-05 Thread ndjido
Sure! Check the following working example: https://github.com/h2oai/qcon2015/tree/master/05-spark-streaming/ask-craig-streaming-app Cheers. Ardo Sent from my iPhone > On 05 May 2016, at 17:26, diplomatic Guru wrote: > > Hello all, I was wondering if it is

Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick
Answering my own question. I filtered out the keys from the output file by overriding MultipleOutputFormat.generateActualKey to return the empty string. -- Nick class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat { @Override protected
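A sketch of the same idea in Scala (String keys and values assumed; routing each key to its own file via generateFileNameForKeyValue is an assumption about what the original class does, and returning null rather than an empty string also keeps the key-value separator out of the lines):

  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class ValueOnlyTextOutputFormat extends MultipleTextOutputFormat[String, String] {
    // Drop the key entirely so only the value is written to each line
    override def generateActualKey(key: String, value: String): String = null
    // Assumed: one output file per key
    override def generateFileNameForKeyValue(key: String, value: String, name: String): String = key
  }

  // Hypothetical usage on an RDD[(String, String)]:
  // pairRdd.saveAsHadoopFile("/tmp/out", classOf[String], classOf[String],
  //   classOf[ValueOnlyTextOutputFormat])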

SortWithinPartitions on DataFrame

2016-05-05 Thread Darshan Singh
Hi, I have a dataframe df1 and I partitioned it by col1,col2 and persisted it. Then I created a new dataframe df2. val df2 = df1.sortWithinPartitions("col1","col2","col3") df1.persist() df2.persist() df1.count() df2.count() Now I expect that any group by statement using the "col1","col2","col3"
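One way to see whether the aggregation actually reuses that layout (a sketch using the column names from the message above) is to compare the physical plans and look for extra Exchange/Sort operators:

  df1.groupBy("col1", "col2", "col3").count().explain()
  df2.groupBy("col1", "col2", "col3").count().explain()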

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
Does that code even compile? I'm assuming eventLogJson.foreach is supposed to be eventLogJson.foreachRDD ? I'm also confused as to why you're repartitioning to 1 partition. Is your streaming job lagging behind (especially given that you're basically single-threading it by repartitioning to 1
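For reference, a minimal sketch of the foreachRDD shape being referred to (eventLogJson is the DStream name from the original post; the output path is hypothetical, and the repartition(1) is left out):

  eventLogJson.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      // write each batch in parallel instead of funnelling it through a single partition
      rdd.saveAsTextFile(s"hdfs:///eventlogs/batch-${System.currentTimeMillis()}")
    }
  }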

Re: Writing output of key-value Pair RDD

2016-05-05 Thread Afshartous, Nick
Thanks, I got the example below working, though it writes both the keys and values to the output file. Is there any way to write just the values? -- Nick String[] strings = { "Abcd", "Azlksd", "whhd", "wasc", "aDxa" }; sc.parallelize(Arrays.asList(strings))

Re: Content-based Recommendation Engine

2016-05-05 Thread Chan Yi Sheng(Eason)
Dear Sree, here's a simple content-based recommendation engine I built using LDA. Demo site: http://54.183.251.139:8080/ Github link: https://github.com/easonchan1213/LDA_RecEngine Cheers, Chan Yi Sheng 2016-05-05 17:34 GMT+01:00 Sree Eedupuganti : > Can anyone share the

Re: Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi David, Thank you for your response. Before inserting into Cassandra, I had already checked that the data was missing at the HDFS stage (my second step is to load data from HDFS and then insert it into Cassandra). Can you send me a link about this bug in 0.8.2? Thank you! Jerry On Thu, May 5, 2016 at 12:38

Re: Spark Streaming, Batch interval, Windows length and Sliding Interval settings

2016-05-05 Thread Mich Talebzadeh
Thanks Ryan for the correction. Posted to the wrong user list :( Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Accessing JSON array in Spark SQL

2016-05-05 Thread Michael Armbrust
Use df.selectExpr to evaluate complex expressions (instead of just column names). On Thu, May 5, 2016 at 11:53 AM, Xinh Huynh wrote: > Hi, > > I am having trouble accessing an array element in JSON data with a > dataframe. Here is the schema: > > val json1 = """{"f1":"1",

Accessing JSON array in Spark SQL

2016-05-05 Thread Xinh Huynh
Hi, I am having trouble accessing an array element in JSON data with a dataframe. Here is the schema: val json1 = """{"f1":"1", "f1a":[{"f2":"2"}]}""" val rdd1 = sc.parallelize(List(json1)) val df1 = sqlContext.read.json(rdd1) df1.printSchema() root |-- f1: string (nullable = true) |-- f1a:
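Putting the selectExpr suggestion from the reply above together with this schema, a sketch (array indexing and struct field access are both accepted inside selectExpr):

  df1.selectExpr("f1", "f1a[0].f2").show()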

mesos cluster mode

2016-05-05 Thread satish saley
Hi, The Spark documentation says that "cluster mode is currently not supported for Mesos clusters." But below that we can see a Mesos example with cluster mode. I don't have a Mesos cluster to try it out. Which one is true? Shall I interpret it as "cluster mode is currently not supported for Mesos clusters* for

Content-based Recommendation Engine

2016-05-05 Thread Sree Eedupuganti
Can anyone share the code for a content-based recommendation engine to make recommendations to the user based on e-mail subject? -- Best Regards, Sreeharsha Eedupuganti Data Engineer innData Analytics Private Limited

Re: Missing data in Kafka Consumer

2016-05-05 Thread Cody Koeninger
That's not much information to go on. Any relevant code sample or log messages? On Thu, May 5, 2016 at 11:18 AM, Jerry wrote: > Hi, > > Does anybody give me an idea why the data is lost at the Kafka Consumer > side? I use Kafka 0.8.2 and Spark (streaming) version is

Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi, Can anybody give me an idea why data is lost on the Kafka Consumer side? I use Kafka 0.8.2 and the Spark (Streaming) version is 1.5.2. Sometimes I found that I could not receive the same number of records that the Kafka producer sent. For example, I sent 1000 records to the Kafka broker via the Kafka producer and

Re: groupBy and store in parquet

2016-05-05 Thread Xinh Huynh
Hi Michal, Why is your solution so slow? Is it from the file IO caused by storing in a temp file as JSON and then reading it back in and writing it as Parquet? How are you getting "events" in the first place? Do you have the original Kafka messages as an RDD[String]? Then how about: 1. Start
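For comparison, one common pattern for this kind of job (a sketch only, not necessarily what the truncated list above goes on to propose; it assumes events is an RDD[String] of JSON documents that each carry an eventType field):

  val df = sqlContext.read.json(events)              // events: RDD[String] of raw JSON
  df.write.partitionBy("eventType").parquet("/tmp/events_by_type")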

Individual DStream Checkpointing in Spark Streaming

2016-05-05 Thread Akash Mishra
Hi *, I am a little confused about checkpointing of the Spark StreamingContext versus checkpointing of an individual DStream. E.g.: JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1)); jssc.checkpoint("hdfs://...") will start checkpointing the DStream operations, configuration &
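For comparison, a sketch of the two knobs in question (Scala API, the Java calls are analogous; the source stream, directory and intervals are placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("checkpoint-demo")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("hdfs:///checkpoints/app")          // context-level checkpoint directory

  val lines = ssc.socketTextStream("localhost", 9999)
  val counts = lines.map(w => (w, 1L)).reduceByKeyAndWindow(_ + _, Seconds(30))
  counts.checkpoint(Seconds(10))                     // per-DStream checkpoint interval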

Could we use Sparkling Water Lib with Spark Streaming

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark Streaming for online prediction?

H2O + Spark Streaming?

2016-05-05 Thread diplomatic Guru
Hello all, I was wondering if it is possible to use H2O with Spark Streaming for online prediction?

Re: DeepSpark: where to start

2016-05-05 Thread Jason Nerothin
Just so that there is no confusion, there is a Spark user interface project called DeepSense that is actually useful: http://deepsense.io I am not affiliated with them in any way... On Thu, May 5, 2016 at 9:42 AM, Joice Joy wrote: > What the heck, I was already beginning

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-05 Thread sunday2000
Hi, I built spark 1.6.1 on a Linux Red Hat 2.6.32-279.el6.x86_64 server, with JDK jdk1.8.0_91. ------ Original message ------ From: "Divya Gehlot"; Date: Thu, May 5, 2016, 10:41; To: "sunday2000" <2314476...@qq.com>; Subject:

Re: DeepSpark: where to start

2016-05-05 Thread Joice Joy
What the heck, I was already beginning to like it. On Thu, May 5, 2016 at 12:31 PM, Mark Vervuurt wrote: > Well, you got me fooled as well ;) > Had it on my todo list to dive into this new component... > > Mark > > > On 5 May 2016, at 07:06, Derek Chan

Re: package for data quality in Spark 1.5.2

2016-05-05 Thread Mich Talebzadeh
OK, thanks, let me check it. So your primary storage layer is HBase with Phoenix as a tool. Sounds interesting. I will get back to you on this. Dr Mich Talebzadeh LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Fwd: package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ I am looking for something similar to the above solution. -- Forwarded message -- From: "Divya Gehlot" Date: May 5, 2016 6:51 PM Subject: package for data
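A minimal sketch of the kind of DataFrame-based check that post describes (orders and customers are hypothetical DataFrames; Spark 1.5 API):

  import org.apache.spark.sql.functions.col

  // Rule 1: no null customer ids
  val nullIds = customers.filter(col("customer_id").isNull).count()

  // Rule 2: every order references an existing customer (foreign-key style check)
  val orphans = orders
    .join(customers, orders("customer_id") === customers("customer_id"), "left_outer")
    .filter(customers("customer_id").isNull)
    .count()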

Re: package for data quality in Spark 1.5.2

2016-05-05 Thread Mich Talebzadeh
Hi, Spark is a query tool. It stores data in HDFS, a Hive database or anything else, but it does not have its own generic database. Null values and foreign key constraints belong to the domain of databases. What exactly is the nature of your requirements? Do you want to use Spark as a tool to look at the

package for data quality in Spark 1.5.2

2016-05-05 Thread Divya Gehlot
Hi, Is there any package or project in Spark/Scala which supports data quality checks? For instance, checking null values or foreign key constraints. I would really appreciate it if somebody has already done it and is happy to share, or knows of any open source package. Thanks, Divya

Access S3 bucket using IAM roles

2016-05-05 Thread Jyotiska
Hi, I am trying to access my S3 bucket. Is it possible to access the bucket and the files inside it without using a secret access key and access key id, by using an IAM role? I am able to do the same in boto, where I do not pass the secret key and key id while connecting, but it is able to connect using

Re: groupBy and store in parquet

2016-05-05 Thread Michal Vince
Hi Xinh, For (1) the biggest problem is those null columns, e.g. the DF will have ~1000 columns, so every partition of that DF will have ~1000 columns; one partition can have 996 null columns, which is a big waste of space (in my case more than 80% on average). For (2) I can't really

Re: DeepSpark: where to start

2016-05-05 Thread Mark Vervuurt
Well, you got me fooled as well ;) Had it on my todo list to dive into this new component... Mark > On 5 May 2016, at 07:06, Derek Chan wrote the following: > > The blog post is an April Fool's joke. Read the last line in the post: > >

Re: Mllib using model to predict probability

2016-05-05 Thread ndjido
You can use the BinaryClassificationEvaluator class to get both predicted classes (0/1) and probabilities. Check the following Spark doc: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html. Cheers, Ardo Sent from my iPhone > On 05 May 2016, at 07:59, colin
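A sketch of the standard RDD-based MLlib route to scores rather than 0/1 labels (training and test are assumed to be RDD[LabeledPoint]; for logistic regression, clearThreshold() makes predict() return the raw class-1 probability):

  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  val model = new LogisticRegressionWithLBFGS().run(training)
  model.clearThreshold()   // predict() now returns a probability instead of 0.0/1.0
  val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  println("Area under ROC = " + metrics.areaUnderROC())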

Mllib using model to predict probability

2016-05-05 Thread colin
In 2-class problems, when I use SVM or RandomForest models to do classification, they predict "0" or "1". And when I use ROC to evaluate the model, sometimes I need the probability that a record belongs to "0" or "1". In scikit-learn, every model can do "predict" and "predict_proba", which the last