> From a performance viewpoint, this isn't a great solution. The row-by-row
> approach will substantially hurt performance compared to the vectorized
> reader. I've seen speedups of 30% or more when removing row-by-row access, so
> putting a row-by-row adapter in the middle of two vectorized
> representations is pretty costly.

Iceberg doesn't impose this requirement; it is how Spark itself consumes the
rows, one at a time:
https://github.com/apache/spark/blob/branch-2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala#L138
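
To make that concrete, here is a rough (untested) sketch of what that
consumption loop boils down to, using Spark's public ColumnarBatch API;
the method name `consume` is just for illustration:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.vectorized.ColumnarBatch

    // Sketch of the per-batch loop: rowIterator() hands back each row of
    // the columnar batch as a mutable InternalRow view, so downstream
    // operators still see the data one row at a time.
    def consume(batches: Iterator[ColumnarBatch]): Unit = {
      batches.foreach { batch =>
        val rows = batch.rowIterator() // java.util.Iterator[InternalRow]
        while (rows.hasNext) {
          val row: InternalRow = rows.next()
          // operators read fields row by row, e.g. row.getInt(0)
        }
      }
    }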

By exposing Arrow data as Spark’s ColumnarBatch, we should pick up any
benefits from improved execution when Spark is updated.
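
For example, the wrapping amounts to something like the following (a sketch,
not tested; it assumes we already hold a populated Arrow VectorSchemaRoot
from the read path, and `toColumnarBatch` is just an illustrative name):

    import scala.collection.JavaConverters._
    import org.apache.arrow.vector.VectorSchemaRoot
    import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}

    // Wrap each Arrow vector in Spark's ArrowColumnVector (a view over the
    // Arrow buffers, no copy) and assemble them into a ColumnarBatch that
    // Spark can scan, either vectorized or through rowIterator().
    def toColumnarBatch(root: VectorSchemaRoot): ColumnarBatch = {
      val columns: Array[ColumnVector] = root.getFieldVectors.asScala
        .map(v => new ArrowColumnVector(v): ColumnVector)
        .toArray
      val batch = new ColumnarBatch(columns)
      batch.setNumRows(root.getRowCount)
      batch
    }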

On Tue, May 28, 2019 at 12:33 PM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> On Fri, May 24, 2019 at 8:28 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> if the Iceberg Reader were to wrap Arrow or ColumnarBatch behind an
>> Iterator[InternalRow] interface, it would still not work, right? Because it
>> seems to me there is a lot more happening upstream in the operator execution
>> path that would need to be done here.
>> There's already a wrapper to adapt Arrow to ColumnarBatch, as well as an
>> iterator to read a ColumnarBatch as a sequence of InternalRow. That's what
>> we want to take advantage of. You're right that the first thing Spark
>> does is get each row as an InternalRow. But we still get a benefit from
>> vectorizing the data materialization to Arrow itself. Spark execution is
>> not vectorized, but that can be updated in Spark later (I think there's a
>> proposal).
>>
> From a performance viewpoint, this isn't a great solution. The row-by-row
> approach will substantially hurt performance compared to the vectorized
> reader. I've seen speedups of 30% or more when removing row-by-row access, so
> putting a row-by-row adapter in the middle of two vectorized
> representations is pretty costly.
>
> .. Owen

-- 
Ryan Blue
Software Engineer
Netflix
