Hello Devs,

We met to discuss progress and next steps on the vectorized read path in
Iceberg. Here are my notes from the sync. Feel free to reply with
clarifications in case I misquoted or missed anything.

*Attendees*:

Anjali Norwood
Padma Pennumarthy
Ryan Blue
Samarth Jain
Gautam Kowshik

*Topics*:
- Progress on Arrow-based vectorized reads
- Features being worked on and possible improvements
- Pending bottlenecks
- Things to collaborate on going forward
- Next steps

Arrow Vectorized Reader

  Samarth/Anjali:

   - Working on Arrow-based vectorization [1].
   - At performance parity between Spark and Iceberg on primitive types
   except strings.
   - Planning to do dictionary encoding on strings.
   - The new Arrow version gives a boost in performance and fixes issues.
   - Vectorized, batched reading of definition levels improves performance.
   - Some checks had to be turned off in Arrow to push performance further,
   viz. null check and unsafe memory access.
   - Implemented prefetching of Parquet pages; this improves perf on
   primitives beyond vanilla Spark.
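
To illustrate the point about batched reading of definition levels, here is a minimal sketch in Python. It is not the Iceberg or Arrow code; the function name `read_batch` and the (level, count) run representation are assumptions made for this example. The idea it shows: for a flat optional column, a whole run of identical definition levels can be handled in one step (a bulk copy of values or a bulk fill of nulls) instead of checking one level per value.

```python
from typing import List, Optional, Tuple


def read_batch(
    def_level_runs: List[Tuple[int, int]],
    values: List[int],
    max_def_level: int = 1,
) -> List[Optional[int]]:
    """Materialize a flat optional column from runs of definition levels.

    Hypothetical helper for illustration only.
    def_level_runs: (level, count) pairs, e.g. as produced by an RLE decoder.
    values: the non-null values, in order (nulls are not stored).
    """
    out: List[Optional[int]] = []
    pos = 0  # index of the next unread non-null value
    for level, count in def_level_runs:
        if level == max_def_level:
            # Every slot in this run is defined: copy the values in bulk.
            out.extend(values[pos:pos + count])
            pos += count
        else:
            # Every slot in this run is null: fill in bulk, consume no values.
            out.extend([None] * count)
    return out
```

For example, the runs `[(1, 3), (0, 2), (1, 1)]` with values `[10, 20, 30, 40]` describe three defined slots, two nulls, and one more defined slot.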


   Ryan:


   - The Arrow version should not be tied to Spark; it should have an
   Iceberg-specific implementation binding so that it works with any
   reader, not just Spark.
   - Add DatasourceV2Strategy to handle nested pruning into Spark upstream.
   Will coordinate with Apple folks to add their work into Spark.
   - Need the ability to fall back to row-based reads for cases where
   columnar isn't possible. A config option, maybe.
   - Can add options where columnar batches are read into InternalRow and
   returned to the Datasource.
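
A rough sketch of the fallback and the rows-over-batch option, in Python for brevity. Nothing here is an existing Iceberg or Spark API: `choose_read_mode`, `rows_over_batch`, the `vectorization_enabled` flag, and the set of supported types are all hypothetical names chosen for this example.

```python
from typing import Dict, Iterator, List, Tuple

# Assumption for this sketch: only these primitive types can be read in batches.
VECTORIZABLE_TYPES = {"int", "long", "float", "double", "boolean", "string"}


def choose_read_mode(schema: Dict[str, str], vectorization_enabled: bool) -> str:
    """Return "batch" when a columnar read is possible, otherwise "row".

    schema maps column name to type name; vectorization_enabled stands in
    for the config option mentioned in the notes.
    """
    if vectorization_enabled and all(t in VECTORIZABLE_TYPES for t in schema.values()):
        return "batch"
    return "row"


def rows_over_batch(batch: Dict[str, List]) -> Iterator[Tuple]:
    """Expose a columnar batch (column name -> values) as an iterator of rows."""
    return zip(*batch.values())
```

With this shape, a schema containing a complex type such as `list<string>` falls back to the row path even when the flag is on, and a consumer that only understands rows can still be fed from a columnar batch.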

  Padma:

   - Possibly contribute work on Arrow back to the Arrow project (can punt
   on this for now to move forward faster on current work).
   - Was looking into complex type support for Arrow-based reads.


V1 Vectorized Read Path [2]

Gautam:

   - Been working on the V1 vectorized short-circuit read path [3]. (This
   is probably not as useful once we have full-featured support for
   Arrow-based reads.)
   - Will work on getting the schema evolution parts working with this
   reader by getting the Projection unit/integration tests working. (This
   can be contributed back into the Iceberg repo to unblock this path if we
   want to have that option until the Arrow-based read is fully capable.)



*Next steps:*

   - Unit tests for current Arrow-based work.
   - Provide options to perform vectorized batch reads, row-oriented reads,
   and InternalRow-over-batch reads.
   - Separate Arrow work in Iceberg into its own sub-module.
   - Dictionary encoding support for strings in Arrow.
   - Complex type support for Arrow.
   - File issues for the above and identify how to distribute work between
   us.




[1]  https://github.com/apache/incubator-iceberg/tree/vectorized-read

[2]  https://github.com/apache/incubator-iceberg/pull/462

[3]  https://github.com/prodeezy/incubator-iceberg/commits/v1-vectorized-reader
