Hello Devs,
We met to discuss progress and next steps on Vectorized
read path in Iceberg. Here are my notes from the sync. Feel free to reply
with clarifications in case I mis-quoted or missed anything.
*Attendees*:
Anjali Norwood
Padma Pennumarthy
Ryan Blue
Samarth Jain
Gautam Kowshik
*Topics *
- Progress on Arrow Based Vectorization Reads
- Features being worked on and possible improvements
- Pending bottlenecks
- Identify things to collaborate on going forward.
- Next steps
Arrow Vectorized Reader
Samarth/Anjali:
- Working on Arrow based vectoization [1]
- At performance parity between Spark and Iceberg on primitive types
except strings.
- Planning to do dictionary encoding on strings
- New Arrow version gives boost in performance and fixes issues
- Vectorized batched Reading of definition levels improves performance
- Some checks had to be turned off in arrow to push performance further
viz. null check, unsafe memory access
- Implemented prefetching of parquet pages, this improves perf on
primitives beyond Vanilla spark
Ryan:
- Arrow version should not tied to spark and have iceberg specific
implementation binding so it will work with any reader not just spark.
- Add DatasourceV2Strategy to handle nested pruning into Spark upstream.
Will coordinate with Apple folks to add their work into Spark.
- Need ability to fallback to row based reads for cases where columnar
isn't possible. A config option maybe.
- Can add options where columnar batches are read into InternalRow and
returned to the Datasource.
Padma:
- Possibly contribute work on arrow back to arrow project. (can punt on
this for now to move forward faster on current work)
- Was looking into complex type support for Arrow based reads.
V1 Vectorized Read Path [2]
Gautam:
- Been working on V1 vectorized short circuit read path [3]. (this is
prolly not as useful once we have full featured support on Arrow based
reads)
- Will work on getting schema evolution parts working with this reader
by getting Projection unit/integration tests working. (this can be
contributed back into iceberg repo to unblock this path if we want to have
that option till arrow based read is fully capable)
*Next steps:*
- Unit tests for current Arrow based work.
- Provide options to perform vectorized batch reads, Row oriented reads
and Internal Row over Batch reads.
- Separate Arrow work in Iceberg into it's own sub-module
- Dictionary encoding support for strings in Arrow.
- Complex type support for Arrow.
- File issues for the above and identify how to distribute work between
us.
[1] https://github.com/apache/incubator-iceberg/tree/vectorized-read
[2] https://github.com/apache/incubator-iceberg/pull/462
[3]
https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader