Re: Insert into dynamic partitions performance

2014-12-06 Thread Daniel Haviv
I see.
Thanks a lot that's very helpful!

Daniel

> On 7 Dec 2014, at 09:10, Gopal V wrote:
> 
>> On 12/6/14, 10:11 PM, Daniel Haviv wrote:
>> 
>> Isn't there a way to make Hive allocate more than one reducer for the
>> whole job? Maybe one per partition.
> 
> Yes.
> 
> Setting hive.optimize.sort.dynamic.partition=true does nearly that.
> 
> It raises the net number of useful reducers to total-num-of-partitions x
> total-num-buckets.
> 
> If you have, say, data being written into six hundred partitions with one
> bucket each, it can use anywhere between 1 and 600 reducers (with hashcode
> collisions causing skew, of course).
> 
> It's turned off by default because it really slows down inserts into a
> single partition without buckets.
> 
> Cheers,
> Gopal
> 
>> On 7 Dec 2014, at 06:06, Gopal V wrote:
>>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>>>> Hi,
>>>> I'm executing an insert statement that goes over 1TB of data.
>>>> The map phase goes well but the reduce stage only uses one reducer,
>>>> which becomes a serious bottleneck.
>>> 
>>> Are you inserting into a bucketed or sorted table?
>>> 
>>> If the destination table is bucketed + partitioned, you can use the
>>> dynamic partition sort optimization to get beyond the single reducer.
>>> 
>>> Cheers,
>>> Gopal


Re: Insert into dynamic partitions performance

2014-12-06 Thread Gopal V

On 12/6/14, 10:11 PM, Daniel Haviv wrote:
> Isn't there a way to make Hive allocate more than one reducer for the
> whole job? Maybe one per partition.


Yes.

Setting hive.optimize.sort.dynamic.partition=true does nearly that.

It raises the net number of useful reducers to total-num-of-partitions x
total-num-buckets.

If you have, say, data being written into six hundred partitions with one
bucket each, it can use anywhere between 1 and 600 reducers (with hashcode
collisions causing skew, of course).

It's turned off by default because it really slows down inserts into a
single partition without buckets.
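
For reference, a minimal sketch of an insert that uses it (the table and
column names here are illustrative, not from your job):

  -- off by default, per the caveat above
  SET hive.optimize.sort.dynamic.partition=true;

  -- dynamic partition inserts also need these
  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- the dynamic partition column (sale_date) goes last in the SELECT
  INSERT OVERWRITE TABLE sales_by_day PARTITION (sale_date)
  SELECT product_id, amount, sale_date
  FROM raw_sales;

With the optimization on, rows get sorted by the partition (and bucket)
keys before the reducers, so each reducer can stream into its own
partition's files instead of one reducer holding every writer open.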


Cheers,
Gopal


> On 7 Dec 2014, at 06:06, Gopal V wrote:
>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>>> Hi,
>>> I'm executing an insert statement that goes over 1TB of data.
>>> The map phase goes well but the reduce stage only uses one reducer,
>>> which becomes a serious bottleneck.
>> 
>> Are you inserting into a bucketed or sorted table?
>> 
>> If the destination table is bucketed + partitioned, you can use the
>> dynamic partition sort optimization to get beyond the single reducer.
>> 
>> Cheers,
>> Gopal







Re: Insert into dynamic partitions performance

2014-12-06 Thread Daniel Haviv
Thanks Gopal,
I don't want to divide my data any further.

Isn't there a way to make Hive allocate more than one reducer for the whole
job? Maybe one per partition.

Daniel

> On 7 Dec 2014, at 06:06, Gopal V wrote:
> 
>> On 12/6/14, 6:27 AM, Daniel Haviv wrote:
>> Hi,
>> I'm executing an insert statement that goes over 1TB of data.
>> The map phase goes well but the reduce stage only uses one reducer,
>> which becomes a serious bottleneck.
> 
> Are you inserting into a bucketed or sorted table?
> 
> If the destination table is bucketed + partitioned, you can use the dynamic 
> partition sort optimization to get beyond the single reducer.
> 
> Cheers,
> Gopal


Re: Insert into dynamic partitions performance

2014-12-06 Thread Gopal V

On 12/6/14, 6:27 AM, Daniel Haviv wrote:
> Hi,
> I'm executing an insert statement that goes over 1TB of data.
> The map phase goes well but the reduce stage only uses one reducer,
> which becomes a serious bottleneck.


Are you inserting into a bucketed or sorted table?

If the destination table is bucketed + partitioned, you can use the 
dynamic partition sort optimization to get beyond the single reducer.
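
As a sketch of what bucketed + partitioned looks like (table and column
names are made up for illustration, not from your schema):

  CREATE TABLE sales_by_day (
    product_id BIGINT,
    amount     DOUBLE
  )
  PARTITIONED BY (sale_date STRING)
  CLUSTERED BY (product_id) INTO 16 BUCKETS;

With N distinct sale_date values in one insert, the dynamic partition sort
optimization can then spread the write across up to N x 16 reducers.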


Cheers,
Gopal


Insert into dynamic partitions performance

2014-12-06 Thread Daniel Haviv
Hi,
I'm executing an insert statement that goes over 1TB of data.
The map phase goes well but the reduce stage only uses one reducer, which
becomes a serious bottleneck.

I've tried setting the number of reducers to four and adding a DISTRIBUTE BY
clause to the statement, but I'm still using just one reducer.
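
Roughly what I tried (table and column names changed for illustration):

  SET mapred.reduce.tasks=4;

  INSERT OVERWRITE TABLE sales_by_day PARTITION (sale_date)
  SELECT product_id, amount, sale_date
  FROM raw_sales
  DISTRIBUTE BY sale_date;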

How can I increase reducer parallelism?

Thanks,
Daniel