[I] Provide option to specify user defined schema while reading from iceberg table [iceberg]

via GitHub Thu, 26 Sep 2024 11:42:47 -0700


IgorBerman opened a new issue, #11217:
URL: https://github.com/apache/iceberg/issues/11217


   ### Feature Request / Improvement
   
   Hi,
   Currently, Iceberg does not permit reading from an Iceberg table with a 
user-specified schema. 
   However, in some scenarios this would be highly beneficial.
   Spark's current column pruning doesn't handle deeply nested schemas well 
(think of an array of structs that themselves contain arrays of structs, with 
yet another level or a few more :)
   Think, e.g., of a pageview object that has an array of widgets, where each 
widget has an array of items, and each pageview/widget/item has a lot of 
fields. When working with this object model, one usually needs to explode to 
some level, and here is the catch: since explode is not handled well from a 
column pruning perspective, we effectively read all columns of widget/item 
(basically losing the columnar benefit of the Parquet format and reading 
almost the full schema).
   The Spark community is making progress on this issue, but we are still not 
"there" yet; see, e.g., the rather trivial examples described in 
https://issues.apache.org/jira/browse/SPARK-47230 or 
https://issues.apache.org/jira/browse/SPARK-34956.
   
   When working with files on HDFS, we can provide a partial schema with 
spark.read.schema(partial-schema).parquet(...), thus manually working around 
this inefficiency. However, Iceberg and Spark itself prohibit providing a 
user-defined schema when working with catalog tables.
   I believe lifting this restriction would let users with deeply nested 
schemas benefit from Iceberg.
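   
   As an illustration of the file-based workaround, here is a minimal PySpark 
sketch. The field names (pageview_id, widgets, widget_id, items, item_id), the 
nested-dict notation, and the helper names are all hypothetical and only stand 
in for our real model. The sketch relies on DataFrameReader.schema accepting a 
DDL-formatted string.

```python
# Hypothetical pageview model: only the fields the job actually needs.
# Notation (an assumption for this sketch): dict = struct,
# one-element list = array, string = Spark SQL type name.
PARTIAL_SCHEMA = {
    "pageview_id": "STRING",
    "widgets": [{
        "widget_id": "STRING",
        "items": [{"item_id": "STRING"}],
    }],
}


def to_ddl_type(node):
    """Render a nested spec as a Spark SQL DDL type string."""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return f"ARRAY<{to_ddl_type(node[0])}>"
    fields = ", ".join(f"{name}: {to_ddl_type(child)}" for name, child in node.items())
    return f"STRUCT<{fields}>"


def to_ddl_schema(spec):
    """Render the top-level columns as 'name TYPE, name TYPE'."""
    return ", ".join(f"{name} {to_ddl_type(child)}" for name, child in spec.items())


def read_partial(spark, path):
    """Read Parquet files at `path`, requesting only the columns in PARTIAL_SCHEMA.

    `spark` is an existing SparkSession; `path` is a hypothetical HDFS location.
    """
    return spark.read.schema(to_ddl_schema(PARTIAL_SCHEMA)).parquet(path)
```

   Calling read_partial(spark, "hdfs:///some/path") would then return a 
DataFrame that exposes only the listed nested fields, so a later explode works 
on the trimmed structs rather than on the full widget/item schema.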
   
   One option would be to provide this sub-schema through, e.g., a custom 
option. Of course, one can validate that the sub-schema really is a subset of 
the table's schema (to remove any chance of wrong expectations/errors).
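   
   The validation could look roughly like the sketch below. It is plain 
Python over a made-up nested-dict notation (dict = struct, one-element list = 
array, string = type name), not Iceberg's API, and it matches fields by name 
only, whereas Iceberg tracks columns by field ID, so a real implementation 
would need to resolve IDs as well. All names here are hypothetical.

```python
def is_sub_schema(sub, full):
    """Return True if every field in `sub` exists in `full` with the same shape.

    Simplification for this sketch: fields are matched by name and primitive
    types must be equal (case-insensitive); no type promotion is considered.
    """
    if isinstance(sub, str):
        return isinstance(full, str) and sub.upper() == full.upper()
    if isinstance(sub, list):
        return isinstance(full, list) and is_sub_schema(sub[0], full[0])
    if isinstance(sub, dict):
        if not isinstance(full, dict):
            return False
        return all(
            name in full and is_sub_schema(child, full[name])
            for name, child in sub.items()
        )
    return False


# Hypothetical table schema and two candidate sub-schemas.
TABLE_SCHEMA = {
    "pageview_id": "STRING",
    "url": "STRING",
    "widgets": [{
        "widget_id": "STRING",
        "title": "STRING",
        "items": [{"item_id": "STRING", "price": "DOUBLE"}],
    }],
}
GOOD = {"pageview_id": "STRING", "widgets": [{"items": [{"item_id": "STRING"}]}]}
BAD = {"widgets": [{"items": [{"sku": "STRING"}]}]}  # `sku` is not in the table

print(is_sub_schema(GOOD, TABLE_SCHEMA))  # True
print(is_sub_schema(BAD, TABLE_SCHEMA))   # False
```

   Rejecting the read up front when the check fails keeps a typo in the 
sub-schema from silently producing null columns.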
   
   Based on our internal tests, this might bring performance gains of up to 
30% in those environments (once again, because it permits reading fewer 
columns).
   
   Thanks in advance.
   
   
   
   ### Query engine
   
   Spark
   
   ### Willingness to contribute
   
   - [ ] I can contribute this improvement/feature independently
   - [X] I would be willing to contribute this improvement/feature with 
guidance from the Iceberg community
   - [ ] I cannot contribute this improvement/feature at this time


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

