I have an RDD[(String, MyObj)] produced by a join followed by a map; it carries no partitioner info. I run reduceByKey without passing a Partitioner or a partition count. I have observed that the aggregated result for a given key is sometimes incorrect, roughly 1 run out of 5. It looks as if the reduce operation is combining values from two different keys. Nothing in the configuration changes between runs, so I am scratching my head over this. I verified the results by printing out the RDD before and after the reduce operation and collecting a subset of keys at the driver.
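To make the setup concrete, here is a minimal sketch of the pipeline as described (join + map, then reduceByKey with no partitioner argument). MyObj, its field, and the combine function are hypothetical stand-ins; the real reduce function may differ:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical value type standing in for the real MyObj.
case class MyObj(total: Long)

def aggregate(joined: RDD[(String, MyObj)]): RDD[(String, MyObj)] = {
  // No Partitioner or partition count is passed, matching the report above;
  // reduceByKey then falls back to the default HashPartitioner.
  // The reduce function must be associative and commutative, since Spark
  // applies it in arbitrary order, both map-side and reduce-side.
  joined.reduceByKey((a, b) => MyObj(a.total + b.total))
}
```

For spot-checking a single suspect key at the driver, something like `joined.filter(_._1 == suspectKey).collect()` before and after the reduce narrows down where the cross-key mixing appears.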
Besides the shuffle and storage memory fractions, I use the following options:

sparkConf.set("spark.driver.userClassPathFirst", "true")
sparkConf.set("spark.unsafe.offHeap", "true")
sparkConf.set("spark.reducer.maxSizeInFlight", "128m")
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")