[ 
https://issues.apache.org/jira/browse/IMPALA-12388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Kaszab closed IMPALA-12388.
---------------------------------
    Fix Version/s: Not Applicable
       Resolution: Won't Fix

I explored some possible implementations for this. The simplest one 
unconditionally sets the relevant null indicators to true for the position 
delete related slots. This adds the least overhead on top of the existing 
logic in terms of performance.
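
As a rough illustration of the "set the null indicators" approach, here is a 
minimal Python sketch. It is not Impala's backend code: the Tuple class, its 
bitmap layout, the slot indexes and the strip_position_delete_slots helper 
are all hypothetical and exist only to show that nulling a slot flips a bit 
and leaves the slot data in place.

```python
class Tuple:
    """Hypothetical row: one null-indicator bit per slot plus the slot values."""

    def __init__(self, num_slots):
        # One bit per slot, rounded up to whole bytes.
        self.null_bits = bytearray((num_slots + 7) // 8)
        self.slots = [None] * num_slots

    def set_null(self, slot_idx):
        self.null_bits[slot_idx // 8] |= 1 << (slot_idx % 8)

    def is_null(self, slot_idx):
        return bool(self.null_bits[slot_idx // 8] & (1 << (slot_idx % 8)))


def strip_position_delete_slots(row, file_path_idx, pos_idx):
    """Unconditionally mark both slots NULL; the slot values are not touched."""
    row.set_null(file_path_idx)
    row.set_null(pos_idx)
    return row


# Slot 0 is a user column, slots 1 and 2 are the file path and position
# (an assumed layout for this example).
row = Tuple(3)
row.slots = [42, "/warehouse/tbl/data/file.parquet", 1007]
strip_position_delete_slots(row, file_path_idx=1, pos_idx=2)
```

Because the update is unconditional, there is no per-row branch, which is why 
this variant is the cheapest one to add on top of the existing logic.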

I then started perf verification on both TPCDS and TPCH, but for some 
queries this brings an actual perf degradation. In the worst case (a 
select-only query) this results in a 5% increase in runtime. There were some 
queries where I observed improvements of around 2-3%, but the overall results 
weren't convincing enough for me to progress.

Closing this as Won't Fix, as the initial results aren't good enough to proceed.

> Strip file/pos information from tuples once they are not needed
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12388
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12388
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Gabor Kaszab
>            Priority: Major
>              Labels: Performance, impala-iceberg, performance
>             Fix For: Not Applicable
>
>
> When Impala processes Iceberg V2 tables that have position delete files, it 
> needs to add extra slots to the input tuples (required by the ANTI JOIN 
> between data files and delete files):
>  * STRING file path
>  * BIGINT position
> This makes the row-size larger by 20 bytes. Please note that these 20 bytes 
> are only the increase in the tuple memory (12 byte STRING slot plus 8 byte 
> BIGINT slot); the file path actually points to a potentially large string 
> (100-200 bytes) stored in a heap buffer.
> In the plan fragments of the SCANs we only create a string object per file 
> for the file path (and set it in the template tuple), so the situation is not 
> that bad, but once we send the rows over the network the STRINGs are getting 
> duplicated per record, which can add substantial network and serialization 
> overhead.
> One way to resolve this is to re-materialize the tuples after the Iceberg V2 
> scan is done, and only store the interesting slots. This mechanism also saves 
> us the 20 bytes per tuple overhead, but the re-materialization cost can be 
> high.
> Another, easier solution is to just NULL-out the file path and position slots 
> once they are not needed anymore.
> Of course if the user SELECTs the virtual column {{INPUT_FILE_NAME / 
> FILE_POSITION}} we cannot re-materialize / NULL out.
> Given the following plan:
> {noformat}
>     UNION ALL
>     /        \
>    /          \
> SCAN          V2 ANTI JOIN
> data files       /      \
> without         /        \
> deletes     SCAN         SCAN
>             data files   delete files
>             with deletes
> {noformat}
> In the "SCAN  data files without deletes" we shouldn't even fill the file 
> path / position slots. The latter also saves some computational cost.
> In our V2 ANTI JOIN operator (IcebergDeleteNode) we can NULL out the file 
> path / pos slots once the data records are processed.
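
The per-row overhead described in the quoted description can be estimated 
with a short Python sketch. The slot sizes (12 byte STRING slot, 8 byte 
BIGINT slot) and the 100-200 byte path length come from the description; the 
function names, the row count and the 150 byte path are assumptions for 
illustration, and the model ignores any compression applied to serialized 
rows.

```python
STRING_SLOT_BYTES = 12  # fixed-size STRING slot in the tuple
BIGINT_SLOT_BYTES = 8   # fixed-size BIGINT slot in the tuple


def extra_bytes_per_row(path_len, serialized):
    """Extra bytes per row caused by the file path and position slots.

    In the scan fragment the path string is shared per file, so only the
    fixed-size slots count. Once rows are serialized, the path bytes are
    duplicated for every row.
    """
    fixed = STRING_SLOT_BYTES + BIGINT_SLOT_BYTES
    return fixed + (path_len if serialized else 0)


def total_overhead_bytes(num_rows, path_len, serialized):
    """Total extra bytes for num_rows rows that all carry the same path length."""
    return num_rows * extra_bytes_per_row(path_len, serialized)


in_scan = extra_bytes_per_row(path_len=150, serialized=False)
on_wire = extra_bytes_per_row(path_len=150, serialized=True)
total = total_overhead_bytes(num_rows=1_000_000, path_len=150, serialized=True)
```

Under these assumptions the overhead grows from 20 bytes per row inside the 
scan fragment to 170 bytes per row once the rows are sent over the network.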



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

