Hello Devs, We met to discuss progress and next steps on Vectorized read path in Iceberg. Here are my notes from the sync. Feel free to reply with clarifications in case I mis-quoted or missed anything.
*Attendees*: Anjali Norwood Padma Pennumarthy Ryan Blue Samarth Jain Gautam Kowshik *Topics * - Progress on Arrow Based Vectorization Reads - Features being worked on and possible improvements - Pending bottlenecks - Identify things to collaborate on going forward. - Next steps Arrow Vectorized Reader Samarth/Anjali: - Working on Arrow based vectoization [1] - At performance parity between Spark and Iceberg on primitive types except strings. - Planning to do dictionary encoding on strings - New Arrow version gives boost in performance and fixes issues - Vectorized batched Reading of definition levels improves performance - Some checks had to be turned off in arrow to push performance further viz. null check, unsafe memory access - Implemented prefetching of parquet pages, this improves perf on primitives beyond Vanilla spark Ryan: - Arrow version should not tied to spark and have iceberg specific implementation binding so it will work with any reader not just spark. - Add DatasourceV2Strategy to handle nested pruning into Spark upstream. Will coordinate with Apple folks to add their work into Spark. - Need ability to fallback to row based reads for cases where columnar isn't possible. A config option maybe. - Can add options where columnar batches are read into InternalRow and returned to the Datasource. Padma: - Possibly contribute work on arrow back to arrow project. (can punt on this for now to move forward faster on current work) - Was looking into complex type support for Arrow based reads. V1 Vectorized Read Path [2] Gautam: - Been working on V1 vectorized short circuit read path [3]. (this is prolly not as useful once we have full featured support on Arrow based reads) - Will work on getting schema evolution parts working with this reader by getting Projection unit/integration tests working. (this can be contributed back into iceberg repo to unblock this path if we want to have that option till arrow based read is fully capable) *Next steps:* - Unit tests for current Arrow based work. - Provide options to perform vectorized batch reads, Row oriented reads and Internal Row over Batch reads. - Separate Arrow work in Iceberg into it's own sub-module - Dictionary encoding support for strings in Arrow. - Complex type support for Arrow. - File issues for the above and identify how to distribute work between us. [1] https://github.com/apache/incubator-iceberg/tree/vectorized-read [2] https://github.com/apache/incubator-iceberg/pull/462 [3] https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader