Thank you again! The distribution over the partitions is quite uniform.
Regarding option #1, how can I increase the number of reducers for the vertex? (A sketch of the settings I am planning to try is at the bottom of this mail, below the quoted thread.)

On Mon, May 25, 2015 at 2:11 PM, Rajesh Balamohan <[email protected]> wrote:

> Forgot to mention another scenario #3 in the earlier mail.
>
> 1. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
> approximately 1.0, you can possibly increase the number of reducers for
> the vertex.
>
> 2. If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is a lot
> less than 0.2 (~20%) and almost all the records are processed by this
> reducer, it could mean data skew. In this case, you might want to
> consider increasing the amount of memory allocated (try increasing the
> container size to check if it helps the situation).
>
> 3. In some cases, the REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS ratio
> might be in between (i.e., 0.3 - 0.8). In such cases, if most of the
> records are processed by this reducer, you might want to check the
> partition logic.
>
> To answer your question: yes, if the counters show that #2 is the case,
> you might want to increase the memory and try it out.
>
> On Mon, May 25, 2015 at 3:25 PM, David Ginzburg <[email protected]> wrote:
>
>> Thank you,
>> It is my understanding that you suspect a skew in the data, and suggest
>> an increase of heap for that single reducer?
>>
>> On Mon, May 25, 2015 at 12:45 PM, Rajesh Balamohan <[email protected]> wrote:
>>
>>> As of today, Tez auto-parallelism can only decrease the number of
>>> reducers allocated. It cannot increase the number of tasks at runtime
>>> (this could come in future releases).
>>>
>>> - If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is
>>> approximately 1.0, you can possibly increase the number of reducers
>>> for the vertex.
>>> - If the ratio of REDUCE_INPUT_GROUPS / REDUCE_INPUT_RECORDS is a lot
>>> less than 0.2 (~20%), this could potentially mean a single reducer is
>>> taking up most of the records. In this case, you might want to
>>> consider increasing the amount of memory allocated (try increasing the
>>> container size to check if it helps the situation).
>>>
>>> ~Rajesh.B
>>>
>>> On Mon, May 25, 2015 at 2:41 PM, David Ginzburg <[email protected]> wrote:
>>>
>>>> Thank you,
>>>> Already tried this, with no effect on the number of reducers.
>>>>
>>>> On Mon, May 25, 2015 at 3:51 AM, [email protected] <[email protected]> wrote:
>>>>
>>>>> When one reducer processes too much data (skew join), can setting
>>>>> hive.tez.auto.reducer.parallelism=true solve this problem?
>>>>>
>>>>> [email protected]
>>>>
>>>
>>> --
>>> ~Rajesh.B
>>
>
> --
> ~Rajesh.B
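
Following up on my question about option #1: below is a minimal sketch of the Hive-on-Tez settings that, as far as I understand, influence how many reducers a vertex gets. The values are placeholders for illustration, not tested recommendations for this job.

    -- Sketch only: settings I understand to influence the reducer count for a
    -- Hive-on-Tez vertex; the values below are placeholders, not recommendations.
    set hive.exec.reducers.bytes.per.reducer=67108864;  -- less data per reducer => more reducers estimated
    set hive.exec.reducers.max=999;                      -- upper bound on the estimated count
    set mapred.reduce.tasks=200;                         -- hard override of the reducer count, bypassing the estimate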
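
For the skew scenario (#2), this is the kind of memory increase I would try first; the sizes are assumptions for my cluster, not general advice.

    -- Sketch only: placeholder sizes for trying a larger container on the skewed reducer.
    set hive.tez.container.size=4096;    -- Tez task container size in MB
    set hive.tez.java.opts=-Xmx3276m;    -- task JVM heap, kept below the container size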
