kasakrisz commented on pull request #2231:
URL: https://github.com/apache/hive/pull/2231#issuecomment-829900602
Hi Marta,
Thanks for reviewing this patch.
This is what I found about distributing rows to reducers while I was
debugging:
Let's say we have the following statements:
```
create table acidtbl(a int, b int) clustered by (a) into 2 buckets stored as
orc TBLPROPERTIES ('transactional'='true');
insert ...
delete from acidtbl where a = 1 or a = 3;
```
In this case the plan of the delete statement after ReduceSinkDeDuplication
looks like:
```
TS[0]-FIL[8]-SEL[2]-RS[5]-SEL[6]-FS[7]
```
So with Tez we have one mapper: TS[0]-FIL[8]-SEL[2]-RS[5]
and two reducers, each of which has: SEL[6]-FS[7]
RS[5] has
Partition keys: GenericUDFBridge ==> UDFToInteger (Column[_col0])
Sort keys: Column[_col0]
And maxReducers: 2
where _col0 is the row_id coming from SEL[2].
UDFToInteger(<row_id_type>) extracts the bucket_id field, which is then used
to generate a `reducesink.key` in the RS operator. The key is passed along with
the row to the wrapped `OutputCollector`. In this case that is an
`org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput`. This class
is part of Tez, which I'm not familiar with, but I found that this is where rows
are distributed to reducers by the key coming from RS.
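To make the routing concrete, here is a rough Python sketch of the idea. It is not the Tez code: `RowId`, `partition_for` and `shuffle` are hypothetical names, and the plain modulo on bucket_id is an assumption standing in for whatever the real partitioner does with the key.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the row_id struct; only bucket_id drives partitioning.
RowId = namedtuple("RowId", ["write_id", "bucket_id", "row_id"])


def partition_for(row_id, num_reducers):
    """Pick a reducer from the partition key (the bucket_id)."""
    # Assumption: plain modulo; the real partitioner may hash the key differently.
    return row_id.bucket_id % num_reducers


def shuffle(rows, num_reducers):
    """Group rows by reducer, then order each group by the sort key (_col0)."""
    reducers = defaultdict(list)
    for row in rows:
        reducers[partition_for(row, num_reducers)].append(row)
    # namedtuples compare field by field, which models sorting on the whole row_id
    return {r: sorted(group) for r, group in reducers.items()}


rows = [RowId(1, 1, 0), RowId(1, 0, 1), RowId(1, 0, 0), RowId(1, 1, 1)]
by_reducer = shuffle(rows, num_reducers=2)
```

With 2 buckets and maxReducers 2, each reducer in this sketch ends up with the rows of exactly one bucket, sorted by row_id.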
Hive/Hadoop also has the settings
`hive.exec.reducers.max`/`mapreduce.job.reduces`. These limit maxReducers
in the RS operator. If the table has more buckets than the max reducers, then
the FileSink operator also distributes the rows into different files. If I
understand correctly, this is done by the `multiFileSpray` functionality.
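As I understand it, the effect can be sketched in Python like this. This is only a model of the behaviour, not the FileSink code: `reducer_count` and `spray` are hypothetical names, and both the modulo assignment of buckets to reducers and the one-file-per-bucket layout are assumptions.

```python
from collections import defaultdict


def reducer_count(num_buckets, max_reducers):
    """The reducer count is capped by the max-reducers setting."""
    return min(num_buckets, max_reducers)


def spray(bucket_ids, num_buckets, max_reducers):
    """Return {reducer: {bucket_file: row_count}} for a stream of bucket ids."""
    n = reducer_count(num_buckets, max_reducers)
    layout = defaultdict(lambda: defaultdict(int))
    for bucket in bucket_ids:
        # Assumption: bucket -> reducer by modulo; the reducer then writes
        # each bucket it receives to a separate file.
        layout[bucket % n][bucket] += 1
    return {r: dict(files) for r, files in layout.items()}


# 4 buckets but at most 2 reducers: every reducer writes two bucket files.
layout = spray([0, 1, 2, 3, 0, 2], num_buckets=4, max_reducers=2)
```

In this model a reducer that receives rows for more than one bucket has to keep several output files open at once, which is the situation I think `multiFileSpray` handles.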