Good to know it worked. If some of the tasks still fail, that can indicate
skew in your dataset; you may want to think about a partitionBy function.
Also, do you still see containers killed by YARN? If so, at what point? You
should see something like your app trying to use x GB while YARN can only allocate y GB.
The partitions helped!
I added repartition() and my function looks like this now:

feature_df = (alldat_idx
              .withColumn('label', alldat_idx['label_val'].cast('double'))
              .groupBy('id', 'label')
              .agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is')))
What I had missed was increasing the number of partitions using repartition().
On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote:
It does not look like a Scala vs Python thing. How big is your audience data
store? Can it be broadcast?
What is the memory footprint you are seeing? At what point is YARN killing
containers? Depending on that, you may want to tweak the number of partitions
of the input dataset and increase the number of
Hello,
I'm trying to build an ETL job which takes in 30-100 GB of text data and
prepares it for SparkML. I don't speak Scala, so I've been trying to
implement it in PySpark on YARN, Spark 2.1.
Despite the transformations being fairly simple, the job always fails by
running out of executor memory.
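When executors die with out-of-memory errors on YARN, the usual knobs in Spark 2.1 are executor memory, the YARN memory overhead, and shuffle parallelism. A sketch of a submit command; the script name and all numbers are hypothetical placeholders to tune against your cluster:

```shell
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  --conf spark.sql.shuffle.partitions=400 \
  etl_job.py
```

memoryOverhead covers off-heap usage (Python worker processes included), which is often what YARN's container kill is actually complaining about for PySpark jobs.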