It doesn't. However, if you have a very large number of keys, of which only a small number are very large, you can do one of the following (quick sketches of both follow the list):

A. Use a custom partitioner that counts the number of items per key and avoids putting large keys together. Alternatively, if feasible (and needed), include part of the value together with the key, so that a very large key is split across multiple partitions (but that will likely change the way you need to do your computations).

B. Count by key, and collect to the driver the keys that have lots of values. Then broadcast the set of "large keys", and:
   i. filter out the large keys, and do your regular processing for the many small keys;
   ii. filter out the small keys, and do special processing for the very large keys, exploiting the fact that you can probably hold all of them in memory at any given point (e.g. you can key by value and do random access to retrieve the "key data" for any given key inside the mapPartitions() code).
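To make A concrete, here is a minimal salting sketch, assuming an RDD of (String, Long) pairs and a sum aggregation; the salt factor is a hypothetical tuning knob you would size to the skew. Note this only works when the aggregation is associative and commutative, which is the "alter the way you do your computations" caveat above:

```scala
import org.apache.spark.rdd.RDD
import scala.util.Random

// Split every key into up to `saltFactor` sub-keys so that a hot key's
// values spread across partitions, aggregate per salted key first, then
// combine the partial results per real key.
def saltedSum(rdd: RDD[(String, Long)], saltFactor: Int): RDD[(String, Long)] = {
  rdd
    .map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) } // add salt
    .reduceByKey(_ + _)                  // partial aggregation per (key, salt)
    .map { case ((k, _), v) => (k, v) }  // drop the salt
    .reduceByKey(_ + _)                  // combine partials per real key
}
```

And a minimal sketch of B on a DataFrame; the column names ("key", "value"), the row-count threshold, and the simple sum aggregation are all hypothetical placeholders — the point is the count/collect/broadcast plus the two-way filter:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// df is assumed to have a "key" column to group on and a "value" column
// to aggregate; both names are placeholders.
def splitBySkew(df: DataFrame, threshold: Long): DataFrame = {
  // 1. Count rows per key and pull only the "large" keys to the driver;
  //    by assumption there are few of them, so this collect stays small.
  val hotKeys: Set[String] = df.groupBy("key").count()
    .filter(col("count") > threshold)
    .select("key")
    .collect().map(_.getString(0)).toSet

  // 2. Broadcast the hot-key set so the filters below stay cheap.
  val hotKeysB = df.sqlContext.sparkContext.broadcast(hotKeys)
  val isHot = udf((k: String) => hotKeysB.value.contains(k))

  // 3. Regular groupBy for the many small keys...
  val small = df.filter(!isHot(col("key"))).groupBy("key").agg(sum("value"))

  // ...and separate handling for the few very large keys. Here it is the
  // same aggregation for brevity; in practice this branch is where your
  // special per-hot-key logic would go.
  val large = df.filter(isHot(col("key"))).groupBy("key").agg(sum("value"))

  small.unionAll(large)   // unionAll in Spark 1.x; union in 2.x+
}
```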
There is no universal solution to this problem, but these are good general approaches that should hopefully set you on the right track. (Note: solution A works for RDDs only; solution B can work with DataFrames too.)

Regards,
Virgil.

On Sat, May 21, 2016 at 11:48 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> Hi, I have a DataFrame with heavily skewed data (terabytes) and I am doing
> a groupBy on 8 fields, which unfortunately I can't avoid. I am looking to
> optimize this. I have found that Hive has
>
>     set hive.groupby.skewindata=true;
>
> I don't use Hive; I have a Spark DataFrame. Can we achieve the above in
> Spark? Please guide. Thanks in advance.