Re: Memory problems with simple ETL in Pyspark

2017-04-17 Thread ayan guha
Good to know it worked. If some of the jobs still fail, that can indicate skew in your dataset. You may want to think of a partition-by function. Also, do you still see containers killed by YARN? If so, at what point? You should see something like your app is trying to use x GB while YARN can
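As a hedged illustration of the partition-by idea above, one common form is salting the skewed key before the heavy aggregation; the DataFrame name, the choice of 'id' as the skewed column, and the salt factor are illustrative, not values from the thread:

from pyspark.sql import functions as F

SALT_BUCKETS = 32  # illustrative value

# Spread each hot key across several buckets so a single skewed id
# does not land entirely in one partition before the aggregation.
salted = (alldat_idx
          .withColumn('salt', (F.rand() * SALT_BUCKETS).cast('int'))
          .repartition('id', 'salt'))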

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
The partitions helped! I added repartition() and my function looks like this now:

feature_df = (alldat_idx
              .withColumn('label', alldat_idx['label_val'].cast('double'))
              .groupBy('id', 'label')
              .agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is'))
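The archive cuts the statement off; a self-contained sketch of how the full expression presumably reads, with the repartition() placement and partition count assumed (the DataFrame alldat_idx and columns label_val, id, indexedSegs, collect_list_is are from the mail above):

from pyspark.sql.functions import collect_list, sort_array

feature_df = (alldat_idx
              .repartition(200)  # assumed partition count; tune for the cluster
              .withColumn('label', alldat_idx['label_val'].cast('double'))
              .groupBy('id', 'label')
              .agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is')))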

Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
What I missed is: try increasing the number of partitions using repartition. On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote: > It does not look like a scala vs python thing. How big is your audience data > store? Can it be broadcasted? > > What is the memory footprint you are
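As a quick, hedged illustration of that suggestion (the DataFrame name df and the target of 500 partitions are placeholders, not values from the thread):

# Check how many partitions the input currently has ...
print(df.rdd.getNumPartitions())

# ... and repartition to a larger number so each task holds less data in memory.
df = df.repartition(500)  # illustrative value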

Re: Memory problems with simple ETL in Pyspark

2017-04-15 Thread ayan guha
It does not look like a scala vs python thing. How big is your audience data store? Can it be broadcasted? What is the memory footprint you are seeing? At what point is YARN killing it? Depending on that, you may want to tweak the number of partitions of the input dataset and increase the number of
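If the audience data store is small enough to broadcast, a broadcast join avoids shuffling the large side at all; a minimal sketch, assuming a small lookup DataFrame named audience_df and a large DataFrame named large_df (both hypothetical names, as is the join key):

from pyspark.sql.functions import broadcast

# Ship the small lookup table to every executor instead of shuffling the big side.
joined = large_df.join(broadcast(audience_df), on='id', how='left')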

Memory problems with simple ETL in Pyspark

2017-04-14 Thread Patrick McCarthy
Hello, I'm trying to build an ETL job which takes in 30-100 GB of text data and prepares it for SparkML. I don't speak Scala, so I've been trying to implement it in PySpark on YARN, Spark 2.1. Despite the transformations being fairly simple, the job always fails by running out of executor memory.
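For context on the knobs the replies in this thread discuss, a hedged example of how executor memory, YARN overhead, and shuffle parallelism are commonly raised on Spark 2.1; the values are illustrative, not settings reported in the thread:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('etl-for-sparkml')                            # hypothetical app name
         .config('spark.executor.memory', '8g')                 # heap per executor
         .config('spark.yarn.executor.memoryOverhead', '2048')  # MB of off-heap headroom (Spark 2.1 property name)
         .config('spark.sql.shuffle.partitions', '400')         # partitions after shuffles/groupBy
         .getOrCreate())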