I agree with the approach of getting this working for primitive types only.
I'll work on a prototype assuming just primitive types for now.

> I don't think that you can mix regular columns and Arrow columns. It has to
> be all one or the other.


I was just curious about this because the vanilla Spark reader (with
vectorization) doesn't support batching on nested fields today, yet it can
still vectorize data that mixes nested and non-nested columns. This isn't
needed for my PoC, but it would be good to know whether we can leverage that
for our implementation. Either way, I'll get to it once this step is done.
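As a reference point for myself, this is roughly what the batch side of
Spark's API looks like: a ColumnarBatch is just an array of ColumnVectors, and
ArrowColumnVector and Spark's internal OnHeapColumnVector both implement that
interface, so the batch itself doesn't care how each column is backed. A
minimal sketch against the Spark 2.4 classes (not Iceberg code):

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class MixedVectorsSketch {
      // Build a two-column batch: one Arrow-backed column, one Spark on-heap
      // column. ColumnarBatch only sees the ColumnVector interface, so the
      // restriction on mixing is about the read path, not the batch API.
      static ColumnarBatch twoColumnBatch(int numRows) {
        IntVector arrowVec = new IntVector("id", new RootAllocator(Long.MAX_VALUE));
        arrowVec.allocateNew(numRows);
        OnHeapColumnVector plainVec = new OnHeapColumnVector(numRows, DataTypes.LongType);
        for (int i = 0; i < numRows; i++) {
          arrowVec.setSafe(i, i);
          plainVec.putLong(i, i * 100L);
        }
        arrowVec.setValueCount(numRows);
        ColumnarBatch batch = new ColumnarBatch(
            new ColumnVector[] { new ArrowColumnVector(arrowVec), plainVec });
        batch.setNumRows(numRows);
        return batch;
      }
    }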


On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com> wrote:

> Replies inline.
>
> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com> wrote:
>
>> Thanks for responding Ryan,
>>
>> Couple of follow up questions on ParquetValueReader for Arrow..
>>
>> I'd like to start by testing Arrow out with readers for primitive types
>> and incrementally add in Struct/Array support; also, ArrowWriter [1]
>> currently doesn't have converters for the map type. How can I default these
>> types to regular materialization while using Arrow-based readers for
>> primitives?
>>
>
> We should look at what Spark does to handle maps.
>
> I think we should get the prototype working with test cases that don't
> have maps, structs, or lists. Just getting primitives working is a good
> start and just won't hit these problems.
>
>
>> Lemme know if this makes sense...
>>
>> - I extend PrimitiveReader (for Arrow) so that it loads primitive types
>> into ArrowColumnVectors of the corresponding column types by iterating over
>> the underlying ColumnIterator *n times*, where n is the batch size.
>>
>
> Sounds good to me. I'm not sure about extending vs wrapping because I'm
> not too familiar with the Arrow APIs.
>
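Here's a rough sketch of the batched primitive read I had in mind (wrapping
rather than extending); the class and the Iterator stand-in for the column
iterator are placeholders, not existing Iceberg APIs:

    import java.util.Iterator;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;

    // Placeholder for the Arrow-backed primitive reader: pull up to batchSize
    // values from the underlying column iterator and land them in one Arrow vector.
    class ArrowIntColumnReader {
      private final BufferAllocator allocator;
      private final int batchSize;

      ArrowIntColumnReader(BufferAllocator allocator, int batchSize) {
        this.allocator = allocator;
        this.batchSize = batchSize;
      }

      // `column` stands in for the per-column value iterator; a null means the
      // definition level said the value is missing.
      ArrowColumnVector readBatch(Iterator<Integer> column) {
        IntVector vector = new IntVector("col", allocator);
        vector.allocateNew(batchSize);
        int rows = 0;
        while (rows < batchSize && column.hasNext()) {
          Integer value = column.next();
          if (value == null) {
            vector.setNull(rows);
          } else {
            vector.setSafe(rows, value);
          }
          rows++;
        }
        vector.setValueCount(rows);
        return new ArrowColumnVector(vector);
      }
    }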
>
>> - Reader.newParquetIterable() maps primitive column types to the newly
>> added ArrowParquetValueReader, but for other types (nested types, etc.) uses
>> the current *InternalRow*-based ValueReaders
>>
>
> Sounds good for primitives, but I would just leave the nested types
> un-implemented for now.
>
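For that mapping, I was thinking of a simple per-column check that only lets
primitive Parquet types onto the Arrow path and fails fast on anything nested
for now; a placeholder sketch, not the actual reader builder:

    import org.apache.parquet.schema.PrimitiveType;
    import org.apache.parquet.schema.Type;

    class VectorizedDispatch {
      // Placeholder: decide per column whether the Arrow path applies. Nested
      // types throw for now, per the plan to leave struct/list/map unimplemented.
      static boolean useArrowPath(Type parquetType) {
        if (!parquetType.isPrimitive()) {
          throw new UnsupportedOperationException(
              "Vectorized read not implemented for nested type: " + parquetType);
        }
        PrimitiveType.PrimitiveTypeName typeName =
            parquetType.asPrimitiveType().getPrimitiveTypeName();
        // Start narrow; more primitive types can be added as converters land.
        return typeName == PrimitiveType.PrimitiveTypeName.INT32
            || typeName == PrimitiveType.PrimitiveTypeName.INT64
            || typeName == PrimitiveType.PrimitiveTypeName.DOUBLE;
      }
    }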
>
>> - Stitch the column vectors together to create a ColumnarBatch (since the
>> *SupportsScanColumnarBatch* mixin currently expects this), *although I'm a
>> bit lost on how the stitching of columns happens currently* .. and how the
>> ArrowColumnVectors could be stitched alongside regular columns that don't
>> have Arrow-based support?
>>
>
> I don't think that you can mix regular columns and Arrow columns. It has
> to be all one or the other. That's why it's easier to start with
> primitives, then add structs, then lists, and finally maps.
>
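That clears it up, thanks. So the stitching would amount to wrapping each
per-column Arrow vector in an ArrowColumnVector and handing the array to a
ColumnarBatch. A sketch of that, assuming the per-file batch is available as
an Arrow VectorSchemaRoot (an assumption on my side, not how the current
Reader works):

    import java.util.List;
    import org.apache.arrow.vector.FieldVector;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    class BatchStitcher {
      // The batch itself holds no data; it only points at the per-column vectors.
      static ColumnarBatch toColumnarBatch(VectorSchemaRoot root) {
        List<FieldVector> fieldVectors = root.getFieldVectors();
        ColumnVector[] columns = new ColumnVector[fieldVectors.size()];
        for (int i = 0; i < columns.length; i++) {
          columns[i] = new ArrowColumnVector(fieldVectors.get(i));
        }
        ColumnarBatch batch = new ColumnarBatch(columns);
        batch.setNumRows(root.getRowCount());
        return batch;
      }
    }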
>
>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so that
>> DataSourceV2ScanExec starts using ColumnarBatch scans
>>
>
> We will probably need two paths. One for columnar batches and one for
> row-based reads. That doesn't need to be done right away and what you
> already have in your working copy makes sense as a start.
>
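Agreed on the two paths. Against the Spark 2.4 DataSourceV2 API that would
look roughly like the shape below, with enableBatchRead() deciding when the
columnar path kicks in; the method bodies and the isPrimitive() helper are
placeholders, not the existing Iceberg Reader:

    import java.util.List;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.sources.v2.reader.InputPartition;
    import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
    import org.apache.spark.sql.types.DataType;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    // Sketch of keeping both read paths side by side.
    abstract class TwoPathReader implements SupportsScanColumnarBatch {

      // Row-based path: stays as it is today.
      @Override
      public abstract List<InputPartition<InternalRow>> planInputPartitions();

      // Columnar path: only consulted when enableBatchRead() returns true.
      @Override
      public abstract List<InputPartition<ColumnarBatch>> planBatchInputPartitions();

      // Fall back to rows unless every projected column is a primitive type.
      @Override
      public boolean enableBatchRead() {
        for (StructField field : readSchema().fields()) {
          if (!isPrimitive(field.dataType())) {
            return false;
          }
        }
        return true;
      }

      abstract boolean isPrimitive(DataType type);
    }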
>
>> That's a lot of questions! :-) but I hope I'm making sense.
>>
>> -Gautam.
>>
>>
>>
>> [1] -
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
