[ 
https://issues.apache.org/jira/browse/HIVE-26110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516906#comment-17516906
 ] 

Ádám Szita commented on HIVE-26110:
-----------------------------------

Thanks for finding this [~rbalamohan]. I was able to reproduce this locally, 
and found what was most probably missing from my patch in HIVE-25975 that dealt 
with this feature:

The explain plan lacked this line for the Reduce Sink operator:
{code:java}
Map-reduce partition columns: _col23 (type: bigint) {code}
..so although the SortedDynPartitionOptimizer has taken care of setting the 
Iceberg table's partition column as a KEY column, it didn't mark it as a 
*partition* column in Map/Reduce terms. (I probably missed this because of the 
confusing terminology: as I see it, this is not a Hive table partition, but an 
MR-job term for adjusting the shuffle phase...) This made every reducer write a 
certain amount of rows irrespective of the key distribution (the rows within 
each reducer task were nevertheless sorted on the key column - otherwise the 
ClusteredWriter class would have thrown an exception, as you mentioned earlier), 
and we ended up with
{code:java}
1 <= n <= reducerCount {code}
files for each partition.
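To illustrate the Map/Reduce side of this with a hedged sketch (not Hive code - the class and method names below are made up, though the hash formula matches Hadoop's default HashPartitioner): when the shuffle partitions on the table's partition column, all rows sharing a key land on a single reducer, so each Iceberg partition is written by exactly one task.

```java
import java.util.*;

/** Sketch (assumed names, not Hive internals): how the MR shuffle's partition
 *  columns decide which reducer receives a row, mimicking Hadoop's default
 *  HashPartitioner. */
public class ShuffleSketch {
    // Hadoop's HashPartitioner formula: non-negative hash modulo reducer count.
    static int reducerFor(Object partitionKey, int numReducers) {
        return (partitionKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 4;
        long[] partitionKeys = {10L, 10L, 10L, 20L, 20L}; // e.g. _col23 values

        // With the partition column set on the ReduceSink, every row with the
        // same key is routed to the same reducer, so each table partition is
        // written by exactly one task -> one file per partition.
        Map<Integer, List<Long>> byReducer = new TreeMap<>();
        for (long key : partitionKeys) {
            byReducer.computeIfAbsent(reducerFor(key, numReducers),
                    r -> new ArrayList<>()).add(key);
        }
        System.out.println(byReducer);
    }
}
```

Without the partition column on the ReduceSink, the shuffle spreads rows across reducers regardless of the key, so up to reducerCount tasks can each write a file for the same partition - matching the 1 <= n <= reducerCount observation above.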

Anyhow, what we probably need is to add a partCols.add() invocation beside 
[https://github.com/apache/hive/pull/3060/files#diff-b28bcf13b1a3e2d73139df50ed102fae4625032ea73337aa1c295bb3069e1499R651]
 - that change actually solved the issue in my local repro.
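The gist of that one-line fix can be sketched like this (simplified, assumed names - not Hive's actual ReduceSinkDesc/ExprNodeDesc classes): the partition expression must be registered both as a sort key and as a shuffle partition column.

```java
import java.util.*;

/** Minimal model (assumed names) of the descriptor the optimizer builds:
 *  key columns sort rows within a reducer, partition columns route rows
 *  to reducers during the shuffle. */
public class ReduceSinkSketch {
    final List<String> keyCols = new ArrayList<>();
    final List<String> partCols = new ArrayList<>();

    /** The gist of the fix: put the partition expression in BOTH lists. */
    static ReduceSinkSketch buildFor(String partitionExpr) {
        ReduceSinkSketch rs = new ReduceSinkSketch();
        rs.keyCols.add(partitionExpr);   // already done before the fix (sorting)
        rs.partCols.add(partitionExpr);  // the missing partCols.add() (routing)
        return rs;
    }

    public static void main(String[] args) {
        ReduceSinkSketch rs = buildFor("_col23");
        System.out.println("key=" + rs.keyCols + " part=" + rs.partCols);
    }
}
```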

> bulk insert into partitioned table creates lots of files in iceberg
> -------------------------------------------------------------------
>
>                 Key: HIVE-26110
>                 URL: https://issues.apache.org/jira/browse/HIVE-26110
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>
> E.g., create the web_returns table from TPC-DS in Iceberg format and try to copy 
> over data from the regular table. More like "insert into web_returns_iceberg as 
> select * from web_returns".
> This inserts the data correctly; however, there are a lot of files present in 
> each partition. IMO, dynamic sort optimisation isn't working fine and this 
> causes records not to be grouped in the final phase.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
