*Vectorization notes (Nov 14)*

Attendees:
- Anjali
- Samarth
- Ryan
- Gautam

Overall things covered:
- Current state of performance
- How to start getting things from the vectorized-read branch into master
- Next steps for complex types

Current performance:
- Reads for dictionary-encoded string columns, including fallback to plain encoding, are around 30% faster than Spark's vectorized reads
- Other primitive types are 5-7% slower
- Currently using Arrow version 0.14.1
  - Upgrading to this version improved performance
  - Shade this version within Iceberg so it doesn't conflict with Spark's dependency

*Things to do:*
- Merge Reader and ArrowVectorizedReaders into one and handle enabling vectorization based on config and projection schema (https://github.com/apache/incubator-iceberg/issues/520) - *Gautam*
- Separate glue code for Spark ColumnVector from Iceberg Arrow (added new issue: https://github.com/apache/incubator-iceberg/issues/648) - *Samarth/Anjali*?
- Separate out Iceberg Arrow code into its own module (https://github.com/apache/incubator-iceberg/issues/522)
- Unit tests for current work.

*Discussion:*

What are next steps?
*Ryan*:
- Aim for vectorization work to make it into master. Work on separating out code into PRs to master:
  - ColumnVector implementations
  - Breaking up type-wise decode implementations
  - Separate out glue code for Iceberg Arrow and Spark ColumnVector
- Make sure licensing of code is honored (e.g. if code was copied from Spark, attribute that contribution accordingly)

Question: Is the smallest unit of task planning a row group?
*Ryan*: Yes. Having said that, there's provision in Spark to read partial batches. Can use row counts in ColumnVector to express how much is valid.

Can we start on complex types?
*Ryan*: Yes, shouldn't be blocked on anything major. Can start with top-level structs right now (struct with 1 level of nesting).

Added a new issue, https://github.com/apache/incubator-iceberg/issues/648; please add this to the milestone https://github.com/apache/incubator-iceberg/milestone/2

Lemme know if there was anything I missed or misquoted.

Regards,
-Gautam.