[ https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Kaszab closed IMPALA-12388.
---------------------------------
    Fix Version/s: Not Applicable
       Resolution: Won't Fix

I explored some possible implementations for this. The simplest one was where I unconditionally set the relevant null indicators to true for the position delete related slots. This adds the least overhead on top of the existing logic in terms of performance.

I then started perf verifications on both TPCDS and TPCH, but apparently for some queries this causes an actual perf degradation. In the worst case (a select-only query) this results in a 5% increase in runtime. There were some queries where I observed improvements of around 2-3%, but the overall results weren't convincing enough for me to proceed.

Closing this as won't fix as initial results aren't good enough to proceed.

> Strip file/pos information from tuples once they are not needed
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12388
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12388
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Gabor Kaszab
>            Priority: Major
>              Labels: Performance, impala-iceberg, performance
>             Fix For: Not Applicable
>
>
> When Impala processes Iceberg V2 tables that have position delete files it
> needs to add extra slots to the input tuples (required by the ANTI JOIN
> between data files and delete files):
> * STRING file path
> * BIGINT position
> This makes the row-size larger by 20 bytes. Please note that these 20 bytes
> are only the increase in the tuple memory (12 byte STRING slot plus 8 byte
> BIGINT slot); the file path actually points to a potentially large string
> (100-200 bytes) stored in a heap buffer.
> In the plan fragments of the SCANs we only create a string object per file
> for the file path (and set it in the template tuple), so the situation is not
> that bad, but once we send the rows over the network the STRINGs are
> duplicated per record, which can add substantial network and serialization
> overhead.
> One way to resolve this is to re-materialize the tuples after the Iceberg V2
> scan is done, and only store the interesting slots. This mechanism also saves
> us the 20 bytes per tuple overhead, but the re-materialization cost can be
> high.
> Another, easier solution is to just NULL out the file path and position slots
> once they are not needed anymore.
> Of course, if the user SELECTs the virtual columns {{INPUT_FILE_NAME /
> FILE_POSITION}} we cannot re-materialize / NULL out.
> Given the following plan:
> {noformat}
>            UNION ALL
>           /         \
>          /           \
>      SCAN          V2 ANTI JOIN
>      data files     /        \
>      without       /          \
>      deletes     SCAN         SCAN
>                  data files   delete files
>                  with deletes
> {noformat}
> In the "SCAN data files without deletes" we shouldn't even fill the file
> path / position slots. The latter also saves some computational cost.
> In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file
> path / pos slots once the data records are processed.
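
For readers following the thread, here is a rough sketch of the size argument made in the description. It is a toy model in Python and not Impala code: the Row class, null_out_delete_slots and serialized_size are hypothetical names made up for illustration, and the cost model simply assumes that the fixed-width slots always occupy tuple memory while the file path bytes are copied per row only when the slot is non-NULL. The 12 byte and 8 byte slot sizes are the ones quoted in the issue.

```python
from dataclasses import dataclass
from typing import Optional

# Slot sizes as quoted in the issue description.
STRING_SLOT_BYTES = 12
BIGINT_SLOT_BYTES = 8


@dataclass
class Row:
    """Hypothetical stand-in for a tuple carrying the two extra slots."""
    payload: bytes                 # the columns the query actually needs
    file_path: Optional[str]       # None models a NULL slot
    position: Optional[int]        # None models a NULL slot


def null_out_delete_slots(row: Row) -> Row:
    """Model of the 'easier solution': mark both slots NULL, keep the layout."""
    row.file_path = None
    row.position = None
    return row


def serialized_size(row: Row) -> int:
    """Toy cost model (an assumption, not Impala's row batch format).

    The fixed slots stay in the tuple either way; the string bytes are
    only copied along with the row when the file path slot is non-NULL.
    """
    size = len(row.payload) + STRING_SLOT_BYTES + BIGINT_SLOT_BYTES
    if row.file_path is not None:
        size += len(row.file_path.encode("utf-8"))
    return size


# A 32 byte payload with a 150 byte file path, within the 100-200 byte
# range mentioned in the issue.
row = Row(payload=b"\x00" * 32, file_path="f" * 150, position=7)
size_before = serialized_size(row)
size_after = serialized_size(null_out_delete_slots(row))
print(size_before, size_after)
```

Under this model, NULLing out removes the per-record string copy but keeps the 20 bytes of slot memory, which is the trade-off the description draws against re-materialization. It says nothing about the CPU cost of setting the null indicators, which is what the perf runs above were measuring.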