Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter: http://dev.sortable.com/spark-directparquetoutputcommitter/ On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com wrote: > Hi users, we have a bunch of pyspark jobs that are using S3 for loading / > intermediate steps and final
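
For reference, a minimal sketch of how DirectParquetOutputCommitter was typically enabled on Spark 1.x (the class was removed in Spark 2.0; the app name and S3 paths below are placeholders):

    // Sketch only: enable the direct committer so Parquet output skips the
    // rename-based commit step that is slow and flaky on S3.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf()
      .setAppName("s3-parquet-direct-committer")
      .set("spark.sql.parquet.output.committer.class",
           "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
      // Keep speculation off: a direct committer cannot clean up partial output.
      .set("spark.speculation", "false")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Placeholder bucket/paths.
    sqlContext.read.parquet("s3a://bucket/input")
      .write.parquet("s3a://bucket/output")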

Spark <--> S3 flakiness

2017-05-10 Thread lucas.g...@gmail.com
Hi users, we have a bunch of pyspark jobs that are using S3 for loading / intermediate steps and final output of parquet files. We're running into the following issues on a semi-regular basis: * These are intermittent errors, i.e. we have about 300 jobs that run nightly... and a fairly random but

CSV output with JOBUUID

2017-05-10 Thread Swapnil Shinde
Hello, I am using Spark 2.0.1 and saw that the CSV file format writes output files with the JOBUUID in their names. https://github.com/apache/spark/blob/v2.0.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala#L191 I want to avoid CSV writing the JOBUUID into the file names. Is there any property

Re: [EXT] Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Michael Mansour (CS)
Debugging PySpark is admittedly difficult, and I’ll agree too that the docs can be lacking at times. PySpark docstrings are sometimes just missing or are incomplete. While I don’t write in Scala, I find that the Spark Scala source code and docs can fill in the PySpark gaps, and can only point

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Pavel Klemenkov
This video https://www.youtube.com/watch?v=LQHMMCf2ZWY I think. On Wed, May 10, 2017 at 8:04 PM, lucas.g...@gmail.com wrote: > Any chance of a link to that video :) > > Thanks! > > G > > On 10 May 2017 at 09:49, Holden Karau wrote: > >> So this

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-10 Thread albertoandreotti
Hi, in case you're still struggling with this, I wrote a blog post explaining Spark Joins and UDFs, http://alberto-computerengineering.blogspot.com.ar/2017/05/custom-joins-in-spark-sql-spark-has.html

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread lucas.g...@gmail.com
Any chance of a link to that video :) Thanks! G On 10 May 2017 at 09:49, Holden Karau wrote: > So this Python side pipelining happens in a lot of places which can make > debugging extra challenging. Some people work around this with persist > which breaks the pipelining

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
So this Python side pipelining happens in a lot of places which can make debugging extra challenging. Some people work around this with persist which breaks the pipelining during debugging, but if you're interested in more general Python debugging I've got a YouTube video on the topic which could be

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Pavel Klemenkov
Thanks for the quick answer, Holden! Are there any other tricks with PySpark which are hard to debug using UI or toDebugString? On Wed, May 10, 2017 at 7:18 PM, Holden Karau wrote: > In PySpark the filter and then map steps are combined into a single > transformation from

incremental broadcast join

2017-05-10 Thread Mendelson, Assaf
Hi, it seems that when doing a broadcast join, the entire dataframe is re-sent even if part of it has already been broadcast. Consider the following case: val df1 = ??? val df2 = ??? val df3 = ??? df3.join(broadcast(df1), on=cond, "left_outer") followed by df4.join(broadcast(df1.union(df2),
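
A self-contained sketch of the pattern being described, assuming Spark 2.x; the tiny data frames and column names are made up just to make the plans printable:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    val spark = SparkSession.builder().appName("incremental-broadcast").getOrCreate()
    import spark.implicits._

    val df1 = Seq((1, "a"), (2, "b")).toDF("id", "v")  // already broadcast once
    val df2 = Seq((3, "c")).toDF("id", "v")            // small increment
    val df3 = Seq((1, "x"), (3, "y")).toDF("id", "w")
    val df4 = Seq((2, "z")).toDF("id", "w")

    // First join broadcasts df1.
    df3.join(broadcast(df1), Seq("id"), "left_outer").explain()

    // Second join broadcasts df1.union(df2) as a brand-new relation, so the
    // whole union is shipped again instead of only the df2 delta.
    df4.join(broadcast(df1.union(df2)), Seq("id"), "left_outer").explain()

Comparing the two explain() outputs should show a separate broadcast exchange in each plan rather than any reuse of the earlier broadcast.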

Re: [Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread Holden Karau
In PySpark the filter and then map steps are combined into a single transformation from the JVM point of view. This allows us to avoid copying the data back to Scala in between the filter and the map steps. The debugging experience is certainly much harder in PySpark and I think is an interesting

[Spark Core]: Python and Scala generate different DAGs for identical code

2017-05-10 Thread pklemenkov
This Scala code: scala> val logs = sc.textFile("big_data_specialization/log.txt"). | filter(x => !x.contains("INFO")). | map(x => (x.split("\t")(1), 1)). | reduceByKey((x, y) => x + y) generated obvious lineage: (2) ShuffledRDD[4] at reduceByKey at <console>:27 [] +-(2)
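
For anyone reproducing this in spark-shell (where sc is already defined), the same pipeline with toDebugString makes the whole Scala lineage visible; the input path is just whatever log file is at hand:

    val logs = sc.textFile("big_data_specialization/log.txt")
      .filter(x => !x.contains("INFO"))
      .map(x => (x.split("\t")(1), 1))
      .reduceByKey((x, y) => x + y)

    // Prints the full lineage: the ShuffledRDD from reduceByKey on top of the
    // MapPartitionsRDDs created by map/filter/textFile and the HadoopRDD underneath.
    println(logs.toDebugString)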

[WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2017-05-10 Thread Mendelson, Assaf
Hi all, when running Spark I get the following warning: [WARN] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. I know that in general it is possible to ignore this warning; however, it means that

Re: running spark program on intellij connecting to remote master for cluster

2017-05-10 Thread David Kaczynski
Do you have Spark installed locally on your laptop with IntelliJ? Are you using the SparkLauncher class or your local spark-submit script? A while back, I was trying to submit a spark job from my local workstation to a remote cluster using the SparkLauncher class, but I didn't actually have
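
A hypothetical SparkLauncher sketch for that setup; every path, class name, and master URL below is a placeholder, and a local Spark installation pointed to by setSparkHome is still required because SparkLauncher shells out to it:

    import org.apache.spark.launcher.SparkLauncher

    object LaunchRemoteJob {
      def main(args: Array[String]): Unit = {
        val handle = new SparkLauncher()
          .setSparkHome("/opt/spark")                    // local Spark install
          .setAppResource("target/my-job-assembly.jar")  // jar containing the job's classes
          .setMainClass("com.example.MyJob")
          .setMaster("spark://master-host:7077")
          .setDeployMode("client")
          .startApplication()

        // Wait until the application reaches a terminal state.
        while (!handle.getState.isFinal) Thread.sleep(1000)
        println(s"Final state: ${handle.getState}")
      }
    }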

[Spark Streaming] Unknown delay in event timeline

2017-05-10 Thread Zhiwen Sun
Env: Spark 1.6.2 + Kafka 0.8.2 DirectStream. Spark Streaming job with a 1s interval. The processing time of a micro-batch suddenly jumps to 4s while it is usually 0.4s. When we check where the time is spent, we find an unknown delay in the job. There is no executor computing or shuffle reading. It is about 4s

Why spark.sql.autoBroadcastJoinThreshold not available

2017-05-10 Thread Jone Zhang
I am using Spark 1.6.0 with Java. I want the following SQL to be executed as a broadcast join: *select * from sample join feature* These are my steps: 1. set spark.sql.autoBroadcastJoinThreshold=100M 2. HiveContext.sql("cache lazy table feature as select * from src where ..."), whose result size is only 100K
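
A sketch of one way to check what the planner is doing, assuming Spark 1.6 with a HiveContext as in the question; the where-clause predicate is a placeholder, and the threshold is written as a byte count since 1.6 may not accept a "100M" string:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("broadcast-join-check"))
    val hiveContext = new HiveContext(sc)

    // 100 MB, expressed in bytes.
    hiveContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)

    // Placeholder predicate; note that a lazily cached table may not expose its
    // real size to the planner until something has materialized it.
    hiveContext.sql("cache lazy table feature as select * from src where key < 10")

    // The physical plan reveals whether BroadcastHashJoin was actually chosen.
    hiveContext.sql("select * from sample join feature").explain()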

running spark program on intellij connecting to remote master for cluster

2017-05-10 Thread s t
Hello, I am trying to run Spark code from my laptop with IntelliJ. I have a cluster of 2 nodes and a master. When I start the program from IntelliJ, it fails with errors about missing classes. I am aware that some jars need to be distributed to the workers, but I do not know if it is possible
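
A hedged sketch of one common fix, shipping the packaged jar to the workers via SparkConf.setJars; the master URL and jar path are placeholders, and the jar has to be built before running:

    import org.apache.spark.{SparkConf, SparkContext}

    object RunFromIntelliJ {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("intellij-remote")
          .setMaster("spark://master-host:7077")
          // Without this, executors cannot load the job's classes and fail
          // with ClassNotFoundException.
          .setJars(Seq("target/scala-2.11/my-job_2.11-0.1.jar"))

        val sc = new SparkContext(conf)
        println(sc.parallelize(1 to 100).map(_ * 2).sum())
        sc.stop()
      }
    }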

CrossValidator and stackoverflowError

2017-05-10 Thread issues solution
Hi, when I try to run CrossValidator I get a StackOverflowError. I have already performed all the necessary transformations (StringIndexer, vector assembly) and saved the data frame to HDFS as Parquet. After that I load everything into a new data frame and split it into train and test sets. When I call fit(train_set) I get

URGENT :

2017-05-10 Thread issues solution
Hi, I know you are busy with questions, but I don't understand: 1) why don't we have feature importances in PySpark? 2) why can't we use a cached data frame with cross-validation? 3) why is the documentation not clear when it comes to PySpark? You can understand

features importance

2017-05-10 Thread issues solution
Hi, can someone tell me whether we have feature importances in PySpark 1.6.0? Thanks