github-matthias-kunter commented on issue #10828:
URL: https://github.com/apache/iceberg/issues/10828#issuecomment-2636810078

   @RussellSpitzer We are also seeing a massive increase in Spark input data size after switching from raw Parquet ingestion to Iceberg table ingestion. It happens only for those jobs/processes that read non-primitive columns (arrays, nested fields). If we leave those columns out in experiments, Iceberg table reads are extremely efficient in the amount of data read, usually by orders of magnitude compared to raw Parquet. Since the source data is stored in S3, this is a significant cost factor.
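   To make the comparison concrete, this is roughly what our experiment looks like (a sketch only; the catalog config and the table/column names are placeholders, not our real setup):

   ```python
   from pyspark.sql import SparkSession

   # Sketch of the comparison; catalog, table, and column names are placeholders.
   spark = (
       SparkSession.builder
       .appName("iceberg-nested-read-comparison")
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.demo.type", "hadoop")
       .config("spark.sql.catalog.demo.warehouse", "s3://our-bucket/warehouse")
       .getOrCreate()
   )

   # Primitive columns only: the input size reported in the Spark UI stays
   # small, often orders of magnitude below the raw-Parquet baseline.
   spark.table("demo.db.events").select("id", "event_time") \
       .write.format("noop").mode("overwrite").save()

   # Same read with a nested column added: the reported input size blows up,
   # even though only one sub-field of the struct is actually needed.
   spark.table("demo.db.events").select("id", "payload.details") \
       .write.format("noop").mode("overwrite").save()
   ```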
   
   As I understand from the conversation above, Iceberg uses copied Spark code to actually read the Parquet files managed by an Iceberg table, but this code does not yet contain the optimizations for columnar reads of nested fields introduced in Spark 3.0. Is there any plan to update those parts of the Iceberg code in the near future?
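   For reference, these are the Spark-side settings I have in mind; whether and how they carry over to Iceberg's read path is exactly my question:

   ```python
   # Nested schema pruning: only the struct fields a query actually touches
   # are read from Parquet. Enabled by default since Spark 3.0.
   spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

   # Vectorized (columnar) reads for nested types, added later in Spark 3.3.
   spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
   ```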



