[ https://issues.apache.org/jira/browse/IMPALA-12308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Kaszab resolved IMPALA-12308. ----------------------------------- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Implement DIRECTED distribution mode for Iceberg tables > ------------------------------------------------------- > > Key: IMPALA-12308 > URL: https://issues.apache.org/jira/browse/IMPALA-12308 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend > Reporter: Zoltán Borók-Nagy > Assignee: Gabor Kaszab > Priority: Major > Labels: impala-iceberg, performance > Fix For: Impala 4.4.0 > > > Currently there are two distribution modes for JOIN-operators: > * BROADCAST: RHS is delivered to all executors of LHS > * PARTITIONED: both LHS and RHS are shuffled across executors > We implement reading of an Iceberg V2 table (with position delete files) via > an ANTI JOIN operator. LHS is the SCAN operator of the data records, RHS is > the SCAN operator of the delete records. The delete record contain > (file_path, pos) information of the deleted rows. > This means we can invent another distribution mode, just for Iceberg V2 > tables with position deletes: DIRECTED distribution mode. > At scheduling we must save the information about data SCAN operators, i.e. on > which nodes are they going to be executed. The LHS don't need to be shuffled > over the network. > The delete records of RHS can use the scheduling information to transfer > delete records to the hosts that process the corresponding data file. > This minimizes network communication. > We can also add further optimizations to the Iceberg V2 operator > (IcebergDeleteNode): > * Compare the pointers of the file paths instead of doing string compare > * Each tuple in a rowbatch belong to the same file, and positions are in > ascending order > ** Onlyone lookup is needed from the Hash table > ** We can add fast paths to skip testing the whole rowbatch (when the row > batch's position range is outside of the delete position range) -- This message was sent by Atlassian Jira (v8.20.10#820010)