Re: groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Andy Dang
groupByKey() is a wide dependency and will cause a full shuffle. It's advised against using this transformation unless you keys are balanced (well-distributed) and you need a full shuffle. Otherwise, what you want is aggregateByKey() or reduceByKey() (depending on the output). These actions are

groupByKey vs mapPartitions for efficient grouping within a Partition

2017-01-16 Thread Patrick
Hi, Does groupByKey has intelligence associated with it, such that if all the keys resides in the same partition, it should not do the shuffle? Or user should write mapPartitions( scala groupBy code). Which would be more efficient and what are the memory considerations? Thanks