Re: Insert into dynamic partitions performance
I see. Thanks a lot, that's very helpful!

Daniel

> On 7 Dec 2014, at 09:10, Gopal V wrote:
>
>> On 12/6/14, 10:11 PM, Daniel Haviv wrote:
>>
>> Isn't there a way to make hive allocate more than one reducer for the whole job? Maybe one per partition.
>
> Yes.
>
> hive.optimize.sort.dynamic.partition=true; does nearly that.
>
> It raises the net number of useful reducers to total-num-of-partitions x total-num-buckets.
>
> If you have say, data being written into six hundred partitions with 1 bucket each, it can use anywhere between 1 and 600 reducers (hashcode collisions causing skews, of course).
>
> It's turned off by default, because it really slows down the 1 partition without buckets insert speed.
>
> Cheers,
> Gopal
>
>> On 7 Dec 2014, at 06:06, Gopal V wrote:
>>
>>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>>> Hi,
>>> I'm executing an insert statement that goes over 1TB of data.
>>> The map phase goes well but the reduce stage only used one reducer which becomes a great bottleneck.
>>
>> Are you inserting into a bucketed or sorted table?
>>
>> If the destination table is bucketed + partitioned, you can use the dynamic partition sort optimization to get beyond the single reducer.
>>
>> Cheers,
>> Gopal
Re: Insert into dynamic partitions performance
On 12/6/14, 10:11 PM, Daniel Haviv wrote:
> Isn't there a way to make hive allocate more than one reducer for the whole job? Maybe one per partition.

Yes.

hive.optimize.sort.dynamic.partition=true; does nearly that.

It raises the net number of useful reducers to total-num-of-partitions x total-num-buckets.

If you have say, data being written into six hundred partitions with 1 bucket each, it can use anywhere between 1 and 600 reducers (hashcode collisions causing skews, of course).

It's turned off by default, because it really slows down the 1 partition without buckets insert speed.

Cheers,
Gopal

> On 7 Dec 2014, at 06:06, Gopal V wrote:
>
>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>> Hi,
>> I'm executing an insert statement that goes over 1TB of data.
>> The map phase goes well but the reduce stage only used one reducer which becomes a great bottleneck.
>
> Are you inserting into a bucketed or sorted table?
>
> If the destination table is bucketed + partitioned, you can use the dynamic partition sort optimization to get beyond the single reducer.
>
> Cheers,
> Gopal
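A minimal sketch of the setting Gopal describes, applied to a dynamic-partition insert. The table and column names (events, staging_events, dt) are hypothetical, not from the thread:

```sql
-- Dynamic-partition sort optimization: off by default because it slows
-- down single-partition, unbucketed inserts (see Gopal's note above).
SET hive.optimize.sort.dynamic.partition=true;

-- Allow a fully dynamic partition column in the insert below.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- With ~600 target partitions of 1 bucket each, this insert can now
-- fan out across up to 600 reducers instead of one (hash-collision
-- skew aside), because rows are sorted/routed by partition key.
INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, payload, dt
FROM staging_events;
```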
Re: Insert into dynamic partitions performance
Thanks Gopal,
I don't want to divide my data any further.
Isn't there a way to make hive allocate more than one reducer for the whole job? Maybe one per partition.

Daniel

> On 7 Dec 2014, at 06:06, Gopal V wrote:
>
>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>> Hi,
>> I'm executing an insert statement that goes over 1TB of data.
>> The map phase goes well but the reduce stage only used one reducer which becomes a great bottleneck.
>
> Are you inserting into a bucketed or sorted table?
>
> If the destination table is bucketed + partitioned, you can use the dynamic partition sort optimization to get beyond the single reducer.
>
> Cheers,
> Gopal
Re: Insert into dynamic partitions performance
On 12/6/14, 6:27 AM, Daniel Haviv wrote:
> Hi,
> I'm executing an insert statement that goes over 1TB of data.
> The map phase goes well but the reduce stage only used one reducer which becomes a great bottleneck.

Are you inserting into a bucketed or sorted table?

If the destination table is bucketed + partitioned, you can use the dynamic partition sort optimization to get beyond the single reducer.

Cheers,
Gopal
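A sketch of the kind of destination table Gopal means, both partitioned and bucketed, so the sort optimization has partition x bucket reducer slots to work with. All names here are hypothetical:

```sql
-- Hypothetical destination: partitioned by date AND clustered into
-- buckets. With hive.optimize.sort.dynamic.partition=true, an insert
-- into this table can use up to (num partitions x 8) useful reducers.
CREATE TABLE events (
  user_id BIGINT,
  payload STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 8 BUCKETS
STORED AS ORC;
```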
Insert into dynamic partitions performance
Hi,
I'm executing an insert statement that goes over 1TB of data.
The map phase goes well but the reduce stage uses only one reducer, which becomes a great bottleneck.
I've tried to set the number of reducers to four and added a distribute by clause to the statement, but I'm still using just one reducer.
How can I increase the reducers' parallelism?

Thanks,
Daniel
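A sketch of the kind of statement described above (reducer count forced to four, plus a DISTRIBUTE BY), using hypothetical table and column names; per the report, this alone still collapsed to a single reducer:

```sql
-- Attempted workaround (sketch): request four reducers and spread rows
-- by the partition column before the insert. As described above, the
-- job still ran with only one reducer.
SET mapred.reduce.tasks=4;

INSERT OVERWRITE TABLE events PARTITION (dt)
SELECT user_id, payload, dt
FROM staging_events
DISTRIBUTE BY dt;
```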