I have an RDD[(String, MyObj)] that is the result of a join followed by a
map, so it carries no partitioner info. I run reduceByKey on it without
passing any Partitioner or partition count. I have observed that the
aggregated result for a given key is sometimes incorrect, roughly 1 run
out of 5: it looks as if the reduce operation is combining values from two
different keys. There is no configuration change between runs, and I am
scratching my head over this. I verified the results by printing out the
RDD before and after the reduce operation, collecting a subset of keys at
the driver.
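
For reference, a minimal sketch of the pipeline as I run it (MyObj and the
sum inside the reduce function are placeholders standing in for my actual
type and reduce logic):

import org.apache.spark.rdd.RDD

// Placeholder for my actual value type
case class MyObj(total: Long)

val left:  RDD[(String, Long)] = sc.parallelize(Seq(("a", 1L), ("b", 2L)))
val right: RDD[(String, Long)] = sc.parallelize(Seq(("a", 10L), ("b", 20L)))

// join() hash-partitions its output, but the following map()
// (unlike mapValues) drops the partitioner info
val mapped: RDD[(String, MyObj)] =
  left.join(right).map { case (k, (x, y)) => (k, MyObj(x + y)) }

// reduceByKey with no explicit Partitioner or partition count
val reduced = mapped.reduceByKey((a, b) => MyObj(a.total + b.total))

// Verification: collect a subset for a given key at the driver,
// before and after the reduce
mapped.filter(_._1 == "a").collect().foreach(println)
reduced.filter(_._1 == "a").collect().foreach(println)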

Besides the shuffle and storage memory fractions, I use the following options:

sparkConf.set("spark.driver.userClassPathFirst","true")
sparkConf.set("spark.unsafe.offHeap","true")
sparkConf.set("spark.reducer.maxSizeInFlight","128m")
sparkConf.set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
