It doesn't. However, if you have a very large number of keys, of which only a small number are very large, you can do one of the following (quick sketches of both follow the list):

A. Use a custom partitioner that counts the number of items per key and avoids putting large keys together. Alternatively, if feasible (and needed), include part of the value together with the key, so that a very large key is split across multiple partitions (but that will likely change the way you need to do your computations).

B. Count by key, and collect to the driver the keys that have lots of values. Then broadcast the set of "large keys", and:
   i. filter out the large keys, and do your regular processing for the many small keys;
   ii. filter out the small keys, and do special processing for the very large keys, exploiting the fact that you can probably hold all of them in memory at any given point (e.g. you can key by value and do random access to retrieve the "key data" for any given key inside the mapPartitions() code).
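To make A concrete, here is a minimal salting sketch, assuming an RDD of (String, Long) pairs and a sum aggregation; the salt factor is a hypothetical tuning knob you would size to the skew. Note this only works when the aggregation is associative and commutative, which is the "alter the way you do your computations" caveat above:

```scala
import org.apache.spark.rdd.RDD
import scala.util.Random

// Split every key into up to `saltFactor` sub-keys so that a hot key's
// values spread across partitions, aggregate per salted key first, then
// combine the partial results per real key.
def saltedSum(rdd: RDD[(String, Long)], saltFactor: Int): RDD[(String, Long)] = {
  rdd
    .map { case (k, v) => ((k, Random.nextInt(saltFactor)), v) } // add salt
    .reduceByKey(_ + _)                  // partial aggregation per (key, salt)
    .map { case ((k, _), v) => (k, v) }  // drop the salt
    .reduceByKey(_ + _)                  // combine partials per real key
}
```

And a minimal sketch of B on a DataFrame; the column names ("key", "value"), the row-count threshold, and the simple sum aggregation are all hypothetical placeholders — the point is the count/collect/broadcast plus the two-way filter:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// df is assumed to have a "key" column to group on and a "value" column
// to aggregate; both names are placeholders.
def splitBySkew(df: DataFrame, threshold: Long): DataFrame = {
  // 1. Count rows per key and pull only the "large" keys to the driver;
  //    by assumption there are few of them, so this collect stays small.
  val hotKeys: Set[String] = df.groupBy("key").count()
    .filter(col("count") > threshold)
    .select("key")
    .collect().map(_.getString(0)).toSet

  // 2. Broadcast the hot-key set so the filters below stay cheap.
  val hotKeysB = df.sqlContext.sparkContext.broadcast(hotKeys)
  val isHot = udf((k: String) => hotKeysB.value.contains(k))

  // 3. Regular groupBy for the many small keys...
  val small = df.filter(!isHot(col("key"))).groupBy("key").agg(sum("value"))

  // ...and separate handling for the few very large keys. Here it is the
  // same aggregation for brevity; in practice this branch is where your
  // special per-hot-key logic would go.
  val large = df.filter(isHot(col("key"))).groupBy("key").agg(sum("value"))

  small.unionAll(large)   // unionAll in Spark 1.x; union in 2.x+
}
```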
There is no universal solution to this problem, but these are good general approaches that should hopefully set you on the right track. (Note: solution A works for RDDs only; solution B can work with DataFrames too.)

Regards,
Virgil.

On Sat, May 21, 2016 at 11:48 PM, unk1102 <umesh.ka...@gmail.com> wrote:
> Hi, I have a DataFrame with heavily skewed data (terabytes) and I am doing
> a groupBy on 8 fields, which unfortunately I can't avoid. I am looking to
> optimize this. I have found that Hive has
>
>     set hive.groupby.skewindata=true;
>
> I don't use Hive; I have a Spark DataFrame. Can we achieve the above in
> Spark? Please guide. Thanks in advance.