Hey Gautam,

Wanted to check back with you and see if you had any success running the
benchmark and if you have any numbers to share.



On Fri, Jul 26, 2019 at 4:34 PM Gautam <gautamkows...@gmail.com> wrote:

> Got it. Commented out that module and it works. Was just curious why it
> doesn't work on the master branch either.
>
> On Fri, Jul 26, 2019 at 3:49 PM Daniel Weeks <dwe...@netflix.com> wrote:
>
>> Actually, it looks like the issue is right there in the error... the
>> ErrorProne module is being excluded from the compile stages of the
>> sub-projects here:
>> https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L152
>>
>> However, it is still being applied to the jmh tasks.  I'm not familiar
>> with this module, but you can run the benchmarks by commenting it out here:
>> https://github.com/apache/incubator-iceberg/blob/master/build.gradle#L167
>>
>> We'll need to fix the build to disable it for the jmh tasks.
>>
>> -Dan
>>
>> On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks <dwe...@netflix.com> wrote:
>>
>>> Gautam, you need to have the jmh-core libraries available to run.  I
>>> validated that PR, so I'm guessing I had it configured in my environment.
>>>
>>> I assume there's a way to make that available within gradle, so I'll
>>> take a look.
>>>
>>> On Fri, Jul 26, 2019 at 2:52 PM Gautam <gautamkows...@gmail.com> wrote:
>>>
>>>> This fails on master too, btw. Just wondering if I'm doing
>>>> something wrong trying to run this.
>>>>
>>>> On Fri, Jul 26, 2019 at 2:24 PM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>>> I've been trying to run the jmh benchmarks bundled within the project.
>>>>> I've been running into issues with that... has anyone else hit this? Am I
>>>>> running these incorrectly?
>>>>>
>>>>>
>>>>> bash-3.2$ ./gradlew :iceberg-spark:jmh
>>>>> -PjmhIncludeRegex=IcebergSourceFlatParquetDataFilterBenchmark
>>>>> -PjmhOutputPath=benchmark/iceberg-source-flat-parquet-data-filter-benchmark-result.txt
>>>>> ..
>>>>> ...
>>>>> > Task :iceberg-spark:jmhCompileGeneratedClasses FAILED
>>>>> error: plug-in not found: ErrorProne
>>>>>
>>>>> FAILURE: Build failed with an exception.
>>>>>
>>>>>
>>>>>
>>>>> Is there a config/plugin I need to add to build.gradle?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue <rb...@netflix.com> wrote:
>>>>>
>>>>>> Thanks Gautam!
>>>>>>
>>>>>> We'll start taking a look at your code. What do you think about
>>>>>> creating a branch in the Iceberg repository where we can work on 
>>>>>> improving
>>>>>> it together, before merging it into master?
>>>>>>
>>>>>> Also, you mentioned performance comparisons. Do you have any early
>>>>>> results to share?
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Tue, Jul 23, 2019 at 3:40 PM Gautam <gautamkows...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Folks,
>>>>>>>
>>>>>>> I have checked in a WIP branch [1] with a working version of
>>>>>>> vectorized reads for the Iceberg reader. Here's the diff [2].
>>>>>>>
>>>>>>> *Implementation Notes:*
>>>>>>>  - Iceberg's Reader adds the `SupportsScanColumnarBatch` mixin to
>>>>>>> instruct DataSourceV2ScanExec to use `planBatchPartitions()` instead of
>>>>>>> the usual `planInputPartitions()`, and it returns instances of
>>>>>>> `ColumnarBatch` on each iteration. (There's a minimal sketch of this
>>>>>>> shape right after these notes.)
>>>>>>>  - `ArrowSchemaUtil` contains the Iceberg-to-Arrow type conversion. This
>>>>>>> was copied from [3]. Added by @Daniel Weeks <dwe...@netflix.com>.
>>>>>>> Thanks for that!
>>>>>>>  - `VectorizedParquetValueReaders` contains the ParquetValueReaders used
>>>>>>> for reading/decoding the Parquet row groups (aka page stores, as referred
>>>>>>> to in the code).
>>>>>>>  - `VectorizedSparkParquetReaders` contains the visitor
>>>>>>> implementations to map Parquet types to appropriate value readers. I
>>>>>>> implemented the struct visitor so that the root schema can be mapped
>>>>>>> properly. This has the added benefit of vectorization support for 
>>>>>>> structs,
>>>>>>> so yay!
>>>>>>>  - For the initial version the value readers read an entire row
>>>>>>> group into a single Arrow Field Vector. This, I'd imagine, will require
>>>>>>> tuning to get the batch sizing right, but I've gone with one batch per
>>>>>>> row group for now.
>>>>>>>  - Arrow Field Vectors are wrapped using `ArrowColumnVector` which
>>>>>>> is Spark's ColumnVector implementation backed by Arrow. This is the 
>>>>>>> first
>>>>>>> contact point between Spark and Arrow interfaces.
>>>>>>>  - ArrowColumnVectors are stitched together into a `ColumnarBatch`
>>>>>>> by `ColumnarBatchReader`. This is my replacement for `InternalRowReader`,
>>>>>>> which maps structs to columnar batches. This allows us to have nested
>>>>>>> structs, where each level of nesting becomes a nested columnar batch.
>>>>>>> Lemme know what you think of this approach.
>>>>>>>  - I've added value readers for all supported primitive types listed
>>>>>>> in `AvroDataTest`. There's a corresponding test for the vectorized reader
>>>>>>> under `TestSparkParquetVectorizedReader`.
>>>>>>>  - I haven't fixed all the Checkstyle errors, so you will have to
>>>>>>> turn checkstyle off in build.gradle. Also skip tests while building...
>>>>>>> sorry! :-(
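>>>>>>>
>>>>>>> To save folks a trip through the diff, here's a minimal sketch of the
>>>>>>> reader shape (Spark 2.4 DataSourceV2 APIs; the class name and method
>>>>>>> bodies are illustrative placeholders, not the actual branch code):
>>>>>>>
>>>>>>> import java.util.Collections;
>>>>>>> import java.util.List;
>>>>>>> import org.apache.spark.sql.sources.v2.reader.InputPartition;
>>>>>>> import org.apache.spark.sql.sources.v2.reader.SupportsScanColumnarBatch;
>>>>>>> import org.apache.spark.sql.types.StructType;
>>>>>>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>>>>>>
>>>>>>> // SupportsScanColumnarBatch extends DataSourceReader and supplies a
>>>>>>> // default planInputPartitions() that throws, so the batch path is the
>>>>>>> // only one Spark will take.
>>>>>>> class SketchReader implements SupportsScanColumnarBatch {
>>>>>>>   private final StructType projectedSchema;
>>>>>>>
>>>>>>>   SketchReader(StructType projectedSchema) {
>>>>>>>     this.projectedSchema = projectedSchema;
>>>>>>>   }
>>>>>>>
>>>>>>>   // Reporting the pruned schema here is what lets Spark consume the
>>>>>>>   // batches as-is, with no InternalRow projection layered on top.
>>>>>>>   @Override
>>>>>>>   public StructType readSchema() {
>>>>>>>     return projectedSchema;
>>>>>>>   }
>>>>>>>
>>>>>>>   // DataSourceV2ScanExec calls this instead of planInputPartitions();
>>>>>>>   // each partition's reader then yields one ColumnarBatch per iteration.
>>>>>>>   @Override
>>>>>>>   public List<InputPartition<ColumnarBatch>> planBatchPartitions() {
>>>>>>>     return Collections.emptyList(); // the real impl returns one read task per file split
>>>>>>>   }
>>>>>>> }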
>>>>>>>
>>>>>>> *P.S*. There's some unused code under ArrowReader.java; ignore it. It's
>>>>>>> left over from my previous vectorization implementation, and I've kept it
>>>>>>> around to compare performance.
>>>>>>>
>>>>>>> Lemme know what folks think of the approach. I'm getting this
>>>>>>> working for our scale test benchmark and will report back with numbers.
>>>>>>> Feel free to run your own benchmarks and share.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> -Gautam.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [1] -
>>>>>>> https://github.com/prodeezy/incubator-iceberg/tree/issue-9-support-arrow-based-reading-WIP
>>>>>>> [2] -
>>>>>>> https://github.com/apache/incubator-iceberg/compare/master...prodeezy:issue-9-support-arrow-based-reading-WIP
>>>>>>> [3] -
>>>>>>> https://github.com/apache/incubator-iceberg/blob/72e3485510e9cbec05dd30e2e7ce5d03071f400d/core/src/main/java/org/apache/iceberg/arrow/ArrowSchemaUtil.java
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jul 22, 2019 at 2:33 PM Gautam <gautamkows...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Will do. Doing a bit of housekeeping on the code and also adding
>>>>>>>> more primitive type support.
>>>>>>>>
>>>>>>>> On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah <mch...@palantir.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Would it be possible to put the work-in-progress code in open
>>>>>>>>> source?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *From: *Gautam <gautamkows...@gmail.com>
>>>>>>>>> *Reply-To: *"dev@iceberg.apache.org" <dev@iceberg.apache.org>
>>>>>>>>> *Date: *Monday, July 22, 2019 at 9:46 AM
>>>>>>>>> *To: *Daniel Weeks <dwe...@netflix.com>
>>>>>>>>> *Cc: *Ryan Blue <rb...@netflix.com>, Iceberg Dev List <
>>>>>>>>> dev@iceberg.apache.org>
>>>>>>>>> *Subject: *Re: Approaching Vectorized Reading in Iceberg ..
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That would be great!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks <dwe...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hey Gautam,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We also have a couple people looking into vectorized reading (into
>>>>>>>>> Arrow memory).  I think it would be good for us to get together and 
>>>>>>>>> see if
>>>>>>>>> we can collaborate on a common approach for this.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'll reach out directly and see if we can get together.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Jul 21, 2019 at 10:35 PM Gautam <gautamkows...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Figured this out. I'm returning the ColumnarBatch iterator directly,
>>>>>>>>> without projection, with the schema set appropriately in `readSchema()`.
>>>>>>>>> The empty result was due to valuesRead not being set correctly on the
>>>>>>>>> FileIterator. Fixed that and things are working. Will circle back with
>>>>>>>>> numbers soon.
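>>>>>>>>>
>>>>>>>>> For the archives, here's the gist of the valuesRead fix as a hedged
>>>>>>>>> sketch (field names modeled on Iceberg's ParquetReader.FileIterator;
>>>>>>>>> the batch-decoding step is a stand-in):
>>>>>>>>>
>>>>>>>>> import java.util.Iterator;
>>>>>>>>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>>>>>>>>
>>>>>>>>> abstract class BatchFileIterator implements Iterator<ColumnarBatch> {
>>>>>>>>>   private final long totalValues;  // total row count from the file footer
>>>>>>>>>   private long valuesRead = 0;
>>>>>>>>>
>>>>>>>>>   BatchFileIterator(long totalValues) {
>>>>>>>>>     this.totalValues = totalValues;
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   abstract ColumnarBatch readBatch();  // decodes the next batch of rows
>>>>>>>>>
>>>>>>>>>   @Override
>>>>>>>>>   public boolean hasNext() {
>>>>>>>>>     return valuesRead < totalValues;  // same check the row-based iterator uses
>>>>>>>>>   }
>>>>>>>>>
>>>>>>>>>   @Override
>>>>>>>>>   public ColumnarBatch next() {
>>>>>>>>>     ColumnarBatch batch = readBatch();
>>>>>>>>>     // the row-based iterator advances valuesRead by one row per next();
>>>>>>>>>     // a batch consumes numRows() values at once, and getting this
>>>>>>>>>     // accounting wrong is what made the scan come back empty
>>>>>>>>>     valuesRead += batch.numRows();
>>>>>>>>>     return batch;
>>>>>>>>>   }
>>>>>>>>> }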
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 19, 2019 at 5:22 PM Gautam <gautamkows...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hey Guys,
>>>>>>>>>
>>>>>>>>>            Sorry about the delay on this. I just got back to getting a
>>>>>>>>> basic working implementation of vectorization on primitive types into
>>>>>>>>> Iceberg.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> *Here's what I have so far:*
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have added `ParquetValueReader` implementations for some basic
>>>>>>>>> primitive types that build the respective Arrow vector (`ValueVector`),
>>>>>>>>> viz. `IntVector` for int, `VarCharVector` for strings, and so on.
>>>>>>>>> Underneath each value vector reader there are column iterators that read
>>>>>>>>> from the Parquet page stores (row groups) in chunks. These `ValueVector`-s
>>>>>>>>> are lined up as `ArrowColumnVector`-s (Spark's ColumnVector wrapper backed
>>>>>>>>> by Arrow) and stitched together using a `ColumnarBatchReader` (which, as
>>>>>>>>> the name suggests, serves up ColumnarBatches through the iterator). I've
>>>>>>>>> verified that these pieces work properly with the underlying interfaces.
>>>>>>>>> I've also made changes to Iceberg's `Reader` to implement
>>>>>>>>> `planBatchPartitions()` (adding the `SupportsScanColumnarBatch` mixin to
>>>>>>>>> the reader), so the reader now produces ColumnarBatch instances (instead
>>>>>>>>> of InternalRow). The query planning runtime works fine with these changes.
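>>>>>>>>>
>>>>>>>>> To make the stitching concrete, here's a tiny self-contained sketch
>>>>>>>>> (Arrow plus Spark 2.4 vectorized APIs; the hard-coded values stand in
>>>>>>>>> for the actual Parquet decode step):
>>>>>>>>>
>>>>>>>>> import org.apache.arrow.memory.RootAllocator;
>>>>>>>>> import org.apache.arrow.vector.IntVector;
>>>>>>>>> import org.apache.spark.sql.vectorized.ArrowColumnVector;
>>>>>>>>> import org.apache.spark.sql.vectorized.ColumnVector;
>>>>>>>>> import org.apache.spark.sql.vectorized.ColumnarBatch;
>>>>>>>>>
>>>>>>>>> public class StitchSketch {
>>>>>>>>>   public static ColumnarBatch intBatch(int numRows) {
>>>>>>>>>     RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
>>>>>>>>>
>>>>>>>>>     // 1. decode a column's values into an Arrow ValueVector
>>>>>>>>>     IntVector ints = new IntVector("id", allocator);
>>>>>>>>>     ints.allocateNew(numRows);
>>>>>>>>>     for (int i = 0; i < numRows; i++) {
>>>>>>>>>       ints.set(i, i);  // real code reads from the Parquet page stores
>>>>>>>>>     }
>>>>>>>>>     ints.setValueCount(numRows);
>>>>>>>>>
>>>>>>>>>     // 2. wrap it in Spark's Arrow-backed ColumnVector and stitch the
>>>>>>>>>     // columns into a ColumnarBatch
>>>>>>>>>     ColumnVector[] columns = { new ArrowColumnVector(ints) };
>>>>>>>>>     ColumnarBatch batch = new ColumnarBatch(columns);
>>>>>>>>>     batch.setNumRows(numRows);
>>>>>>>>>     return batch;
>>>>>>>>>   }
>>>>>>>>> }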
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It fails during query execution, though; the bit it's currently
>>>>>>>>> failing at is this line of code:
>>>>>>>>> https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java#L414
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> This code, I think, tries to apply the iterator's schema projection
>>>>>>>>> on the InternalRow instances. This seems to be tightly coupled to
>>>>>>>>> InternalRow, as Spark's catalyst expressions have implemented
>>>>>>>>> UnsafeProjection for InternalRow only. If I take this out and just return
>>>>>>>>> the `Iterator<ColumnarBatch>` iterator I built, it returns an empty result
>>>>>>>>> on the client. I'm guessing this is because Spark is unaware of the
>>>>>>>>> iterator's schema? There's a TODO in the code that says "*remove the
>>>>>>>>> projection by reporting the iterator's schema back to Spark*". Is there a
>>>>>>>>> simple way to communicate that to Spark for my new iterator? Any pointers
>>>>>>>>> on how to get around this?
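>>>>>>>>>
>>>>>>>>> To make the coupling concrete, this is roughly the shape involved (a
>>>>>>>>> hedged sketch, not the verbatim Iceberg code): the catalyst projection
>>>>>>>>> is row-in, row-out, so there's no overload a ColumnarBatch could flow
>>>>>>>>> through.
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.catalyst.InternalRow;
>>>>>>>>> import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
>>>>>>>>> import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
>>>>>>>>> import org.apache.spark.sql.types.DataTypes;
>>>>>>>>> import org.apache.spark.sql.types.StructType;
>>>>>>>>>
>>>>>>>>> public class ProjectionShape {
>>>>>>>>>   public static void main(String[] args) {
>>>>>>>>>     StructType schema = new StructType().add("id", DataTypes.IntegerType);
>>>>>>>>>     // codegen'd catalyst projection: InternalRow in, UnsafeRow out
>>>>>>>>>     UnsafeProjection projection = UnsafeProjection.create(schema);
>>>>>>>>>     InternalRow row = new GenericInternalRow(new Object[] {1});
>>>>>>>>>     System.out.println(projection.apply(row).getInt(0));  // prints 1
>>>>>>>>>   }
>>>>>>>>> }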
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks and Regards,
>>>>>>>>>
>>>>>>>>> -Gautam.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue <rb...@netflix.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Replies inline.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 14, 2019 at 1:11 AM Gautam <gautamkows...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thanks for responding Ryan,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> A couple of follow-up questions on ParquetValueReader for Arrow...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I'd like to start with testing Arrow out with readers for
>>>>>>>>> primitive types and incrementally add in Struct/Array support; also,
>>>>>>>>> ArrowWriter [1] currently doesn't have converters for the map type. How
>>>>>>>>> can I default these types to regular materialization while supporting
>>>>>>>>> Arrow-based reads for primitives?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We should look at what Spark does to handle maps.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think we should get the prototype working with test cases that
>>>>>>>>> don't have maps, structs, or lists. Just getting primitives working is a
>>>>>>>>> good start and won't hit these problems.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Lemme know if this makes sense...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - I extend PrimitiveReader (for Arrow) to load primitive types
>>>>>>>>> into ArrowColumnVectors of the corresponding column types by iterating
>>>>>>>>> over the underlying ColumnIterator *n times*, where n is the batch size.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sounds good to me. I'm not sure about extending vs wrapping
>>>>>>>>> because I'm not too familiar with the Arrow APIs.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Reader.newParquetIterable() maps primitive column types to the
>>>>>>>>> newly added ArrowParquetValueReader, but for other types (nested types,
>>>>>>>>> etc.) uses the current *InternalRow*-based ValueReaders
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sounds good for primitives, but I would just leave the nested
>>>>>>>>> types unimplemented for now.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Stitch the column vectors together to create a ColumnarBatch
>>>>>>>>> (since the *SupportsScanColumnarBatch* mixin currently expects this)...
>>>>>>>>> *although I'm a bit lost on how the stitching of columns happens
>>>>>>>>> currently*, and on how the ArrowColumnVectors could be stitched
>>>>>>>>> alongside regular columns that don't have Arrow-based support?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't think that you can mix regular columns and Arrow columns.
>>>>>>>>> It has to be all one or the other. That's why it's easier to start 
>>>>>>>>> with
>>>>>>>>> primitives, then add structs, then lists, and finally maps.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Reader returns readTasks as *InputPartition<ColumnarBatch>* so
>>>>>>>>> that DataSourceV2ScanExec starts using ColumnarBatch scans
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> We will probably need two paths. One for columnar batches and one
>>>>>>>>> for row-based reads. That doesn't need to be done right away and what 
>>>>>>>>> you
>>>>>>>>> already have in your working copy makes sense as a start.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> That's a lot of questions! :-) but I hope I'm making sense.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Gautam.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] - 
>>>>>>>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Ryan Blue
>>>>>>>>>
>>>>>>>>> Software Engineer
>>>>>>>>>
>>>>>>>>> Netflix
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
