On Fri, May 24, 2019 at 8:28 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> if the Iceberg Reader were to wrap Arrow or ColumnarBatch behind an
> Iterator[InternalRow] interface, it would still not work, right? Because it
> seems to me there is a lot more going on upstream in the operator execution
> path that would need to be done here.
>
> There’s already a wrapper to adapt Arrow to ColumnarBatch, as well as an
> iterator to read a ColumnarBatch as a sequence of InternalRow. That’s what
> we want to take advantage of. You’re right that the first thing that Spark
> does is to get each row as InternalRow. But we still get a benefit from
> vectorizing the data materialization to Arrow itself. Spark execution is
> not vectorized, but that can be updated in Spark later (I think there’s a
> proposal).
>
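For anyone following along, the adapter path Ryan describes corresponds to
Spark's ArrowColumnVector and ColumnarBatch classes. Here is a minimal
sketch, assuming Spark 2.4-era APIs and an Arrow version matching the one
Spark bundles; the column name "id" and the values are made up for
illustration:

    import java.util.Iterator;

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.spark.sql.catalyst.InternalRow;
    import org.apache.spark.sql.vectorized.ArrowColumnVector;
    import org.apache.spark.sql.vectorized.ColumnVector;
    import org.apache.spark.sql.vectorized.ColumnarBatch;

    public class ArrowToRows {
      public static void main(String[] args) {
        try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             IntVector ids = new IntVector("id", allocator)) {
          // Materialize an Arrow column (this part stays vectorized).
          ids.allocateNew(3);
          for (int i = 0; i < 3; i++) {
            ids.setSafe(i, i * 10);
          }
          ids.setValueCount(3);

          // Wrap the Arrow vector in Spark's adapter and batch container.
          ColumnVector[] columns = { new ArrowColumnVector(ids) };
          ColumnarBatch batch = new ColumnarBatch(columns);
          batch.setNumRows(3);

          // Expose the batch as Iterator<InternalRow> for row-based operators.
          Iterator<InternalRow> rows = batch.rowIterator();
          while (rows.hasNext()) {
            InternalRow row = rows.next();
            System.out.println(row.getInt(0));
          }
        }
      }
    }
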
From a performance viewpoint, this isn't a great solution. The row-by-row
approach will substantially hurt performance compared to the vectorized
reader. I've seen a 30% or greater speedup from removing row-by-row access.
So putting a row-by-row adapter in the middle of two vectorized
representations is pretty costly.
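
To make the cost concrete: once the batch is hidden behind rowIterator(),
every value access goes through a per-row InternalRow wrapper and a virtual
call per field, instead of a tight loop over the column. A rough
illustration, continuing from the batch built in the sketch above (not a
benchmark, just the shape of the two access patterns):

    // Vectorized access: one tight loop over the column vector.
    ColumnVector col = batch.column(0);
    long vectorSum = 0;
    for (int i = 0; i < batch.numRows(); i++) {
      vectorSum += col.getInt(i);
    }

    // Row-by-row access: an InternalRow wrapper and a virtual call
    // per field, per row.
    long rowSum = 0;
    Iterator<InternalRow> rows = batch.rowIterator();
    while (rows.hasNext()) {
      rowSum += rows.next().getInt(0);
    }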

.. Owen
