Iceberg Vectorized Reads Meeting Notes (Nov 14)

Gautam Thu, 14 Nov 2019 17:01:32 -0800

*Vectorization notes (Nov 14)*



Attendees:

   - Anjali
   - Samarth
   - Ryan
   - Gautam


Overall things covered:

   - Current state of performance
   - How to start getting things from vectorized-read branch into master
   - Next steps for complex types



Current performance:

   - Reads of dictionary-encoded string columns, including fallback to
   plain encoding, are around 30% faster than vectorized Spark reads
   - Other primitive types are 5-7% slower
   - Currently using Arrow version 0.14.1
      - Upgrading to this improved performance
      - Shade this version within iceberg so it doesn't conflict with
      Spark's dependency



*Things to do:*



   - Merge Reader and ArrowVectorizedReaders into one and handle enabling
   vectorization based on config and projection schema  (
   https://github.com/apache/incubator-iceberg/issues/520) - *Gautam*
   - Separate glue code for Spark ColumnVector from Iceberg arrow (added
   new issue: https://github.com/apache/incubator-iceberg/issues/648) -
   *Samarth/Anjali*?
   - Separate out iceberg arrow code into its own module (
   https://github.com/apache/incubator-iceberg/issues/522)
   - Unit tests for current work.
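For the first item above (enabling vectorization based on config and projection schema), the decision could look roughly like the sketch below. This is only an illustration under assumptions: the class, enum, and method names are hypothetical and not taken from the Iceberg codebase, and it assumes that only primitive columns are vectorized for now, with everything else falling back to the row-based reader.

```java
// Hypothetical sketch: none of these names come from the Iceberg codebase.
final class VectorizationDecision {

  // Simplified stand-in for the type of a top-level projected column.
  enum ColumnKind { PRIMITIVE, STRUCT, LIST, MAP }

  // Returns true only when the (assumed) config flag is on and every
  // projected column is a primitive. Any complex type in the projection
  // sends the whole scan down the non-vectorized path.
  static boolean useVectorizedRead(boolean enabledByConfig, ColumnKind[] projection) {
    if (!enabledByConfig) {
      return false;
    }
    for (ColumnKind kind : projection) {
      if (kind != ColumnKind.PRIMITIVE) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    ColumnKind[] primitivesOnly = {ColumnKind.PRIMITIVE, ColumnKind.PRIMITIVE};
    ColumnKind[] withStruct = {ColumnKind.PRIMITIVE, ColumnKind.STRUCT};
    System.out.println(useVectorizedRead(true, primitivesOnly));
    System.out.println(useVectorizedRead(true, withStruct));
    System.out.println(useVectorizedRead(false, primitivesOnly));
  }
}
```

Keeping the check in one place would let a merged reader pick the code path once per scan; the rule itself would need to change as complex types gain support.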



*Discussion:*


What are next steps?

*Ryan*:

   - Aim for vectorization work to make it into master. Work on separating
   out code into PRs to master
      - ColumnVector implementations
      - Breaking up Type-wise decode implementations
      - Separate out glue code for iceberg arrow and spark ColumnVector
   - Make sure licensing of code is honored (e.g. if code was copied from
   Spark, attribute that contribution accordingly)

Question: Is the smallest unit of task planning a row group?

*Ryan*: Yes. That said, Spark has a provision for reading partial
batches. Can use row counts in ColumnVector to express how much valid
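
To illustrate the partial-batch idea from the answer above, here is a minimal sketch in plain Java. It does not use Spark or Arrow classes; `IntBatch`, `fillBatch`, and `sumValid` are hypothetical names, and the row group is modelled as a simple int array. The point is only that a fixed-capacity batch can carry a row count marking how many leading entries are valid, so the last batch of a row group can be shorter than the rest. (Spark's `ColumnarBatch` exposes a row count of this kind through `numRows()`.)

```java
// Hypothetical sketch: a fixed-capacity batch with a valid-row count.
final class PartialBatchSketch {

  // Column buffer of fixed capacity; only values[0..numRows) are valid.
  static final class IntBatch {
    final int[] values;
    int numRows;

    IntBatch(int capacity) {
      this.values = new int[capacity];
    }
  }

  // Copies up to one batch of rows from rowGroup, starting at offset
  // (assumed non-negative), and records how many rows were copied.
  // Stale values past numRows are left in place and must be ignored.
  static int fillBatch(int[] rowGroup, int offset, IntBatch batch) {
    int remaining = Math.max(0, rowGroup.length - offset);
    int n = Math.min(batch.values.length, remaining);
    if (n > 0) {
      System.arraycopy(rowGroup, offset, batch.values, 0, n);
    }
    batch.numRows = n;
    return n;
  }

  // Consumers read only the valid prefix of the batch.
  static long sumValid(IntBatch batch) {
    long sum = 0;
    for (int i = 0; i < batch.numRows; i++) {
      sum += batch.values[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] rowGroup = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    IntBatch batch = new IntBatch(4);
    int offset = 0;
    int read;
    while ((read = fillBatch(rowGroup, offset, batch)) > 0) {
      System.out.println("rows=" + batch.numRows + " sum=" + sumValid(batch));
      offset += read;
    }
  }
}
```

With 10 rows and a batch capacity of 4, the final batch holds only 2 valid rows; the buffer is reused, so anything beyond the row count is leftover data from the previous batch.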


Can we start on complex types?

*Ryan*: Yes, shouldn't be blocked on anything major. Can start with
top-level structs right now (struct with 1 level of nesting).



Added a new issue, https://github.com/apache/incubator-iceberg/issues/648;
please add this to the milestone
https://github.com/apache/incubator-iceberg/milestone/2



Lemme know if there was anything I missed or misquoted.



Regards,

-Gautam.
