I recently pushed support for vectorized reads of dictionary-encoded Parquet data and wanted to share some benchmark results for string and numeric data types:
Dictionary Encoded VARCHAR column

Benchmark                                                                 Cnt   Score    Error   Units
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceNonVectorized     5  19.974  ± 1.289   s/op
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceVectorized        5  13.960  ± 1.327   s/op
VectorizedDictionaryEncodedStringsBenchmark.readIcebergVectorized5k         5   9.081  ± 0.263   s/op

Non Dictionary Encoded VARCHAR column

Benchmark                                                                 Cnt   Score    Error   Units
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceNonVectorized     5  31.044  ± 1.289   s/op
VectorizedDictionaryEncodedStringsBenchmark.readFileSourceVectorized        5  14.149  ± 1.327   s/op
VectorizedDictionaryEncodedStringsBenchmark.readIcebergVectorized5k         5  14.480  ± 0.263   s/op

Dictionary Encoded with Fallback to Plain Encoding VARCHAR column

Benchmark                                                                       Cnt   Score    Error   Units
VectorizedFallbackToPlainEncodingStringsBenchmark.readFileSourceNonVectorized     5  20.432  ± 1.289   s/op
VectorizedFallbackToPlainEncodingStringsBenchmark.readFileSourceVectorized        5  10.913  ± 1.327   s/op
VectorizedFallbackToPlainEncodingStringsBenchmark.readIcebergVectorized5k         5   7.868  ± 0.263   s/op

BIGINT column

Benchmark                                                                 Cnt   Score    Error   Units
VectorizedDictionaryEncodedLongsBenchmark.readFileSourceNonVectorized       5  20.042  ± 1.629   s/op
VectorizedDictionaryEncodedLongsBenchmark.readFileSourceVectorized          5   6.511  ± 0.241   s/op
VectorizedDictionaryEncodedLongsBenchmark.readIcebergVectorized5k           5   7.010  ± 0.332   s/op

To sum it up, Iceberg vectorized reads for dictionary-encoded string columns, including the fallback to plain encoding, are around 30% faster than vectorized Spark reads. For dictionary-encoded numeric types like BIGINT, we are currently 7% slower.

On Mon, Sep 9, 2019 at 4:55 PM Samarth Jain <samarth.j...@gmail.com> wrote:

> I wanted to share progress made so far with improving the performance of
> the Iceberg Arrow vectorized read path.
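The "around 30% faster" and "7% slower" figures above can be sanity-checked from the table scores. A minimal sketch of that arithmetic (the relative-difference formula is an assumption about how the percentages were derived; the numbers come from the tables above):

```java
public class SpeedupCheck {
    // Relative difference of Iceberg vs. Spark vectorized, as a percentage.
    // Positive means Iceberg is faster (lower s/op).
    static double pctFaster(double sparkSecPerOp, double icebergSecPerOp) {
        return (sparkSecPerOp - icebergSecPerOp) / sparkSecPerOp * 100.0;
    }

    public static void main(String[] args) {
        // Dictionary-encoded VARCHAR: Spark vectorized 13.960 s/op vs Iceberg 9.081 s/op
        System.out.println(pctFaster(13.960, 9.081));
        // Fallback-to-plain VARCHAR: 10.913 s/op vs 7.868 s/op
        System.out.println(pctFaster(10.913, 7.868));
        // BIGINT: Spark 6.511 s/op vs Iceberg 7.010 s/op (negative, i.e. Iceberg slower)
        System.out.println(-pctFaster(6.511, 7.010));
    }
}
```

The two VARCHAR cases work out to roughly 35% and 28% faster, consistent with "around 30%", and the BIGINT case to roughly 7.7% slower.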
>
> BIGINT column
>
> Benchmark                                                                        Cnt   Score    Error   Units
> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized                 5   4.642  ± 1.629   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k                  5   4.311  ± 0.241   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergVectorized5k    5   4.348  ± 0.332   s/op
>
> DECIMAL column
>
> Benchmark                                                             Cnt   Score    Error   Units
> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized      5  22.103  ± 1.928   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k       5  21.385  ± 0.347   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k       5  21.815  ± 0.206   s/op
>
> BIGINT, INT, FLOAT, DOUBLE, DECIMAL, DATE, TIMESTAMP columns
>
> Benchmark                                                             Cnt   Score    Error   Units
> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized      5  36.83   ± 0.76    s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k       5  40.242  ± 0.643   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k       5  37.222  ± 0.368   s/op
>
> VARCHAR column
>
> Benchmark                                                             Cnt   Score    Error   Units
> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized      5  10.946  ± 1.32    s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k       5  13.133  ± 1.327   s/op
> IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k       5  13.923  ± 0.263   s/op
>
> In a nutshell, the read performance for *BIGINT, FLOAT, DOUBLE, DECIMAL,
> DATE and TIMESTAMP* types is the same as Spark's. This holds both for
> benchmarks using an Iceberg table with a single column as well as a table
> schema with multiple such columns.
> For String type data, *the Iceberg Arrow path is ~20% slower than Spark*.
> The current implementation is hot-spotting on copying memory from a heap
> byte array to the Arrow value buffer.
>
> This benchmark was run by building Spark and Iceberg against version 0.14.1
> of Arrow. The following configs were used to improve Arrow read performance:
>
> - Allow unsafe memory access in Arrow (arrow.enable_unsafe_memory_access
>   set to true)
> - Disable the null check when doing a get(index) in Arrow
>   (arrow.enable_null_check_for_get set to false)
>
> I also wanted to point out that I have other optimizations in progress,
> including pre-fetching of Parquet data pages so that the hot code path
> doesn't incur the cost of decompressing the pages. This requires changes in
> the Parquet library, for which I have a couple of PRs open. The PoC I built
> for pre-fetching for BIGINT columns showed the Iceberg Arrow read path to be
> 40% faster than Spark. We have also identified places where we can run a
> tight loop to improve vectorized performance. With these changes, I expect
> Iceberg vectorized read performance to be faster than Spark's.
>
> On Thu, Sep 5, 2019 at 4:40 PM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Nice work, Gautam! Looks like this could be a useful patch before the
>> Arrow read path is ready to go.
>>
>> It's also good to see the performance comparison between Spark's
>> DataSource v2 and v1. We were wondering if the additional projection added
>> in the v2 path was causing v2 to be slower than v1 due to an extra copy,
>> but from your tests it looks like that isn't a problem. My guess is that
>> either v2 is somehow avoiding other work and is faster (unlikely) or that
>> an equivalent projection is getting added in the v1 path automatically by
>> the codegen support for columnar reads. Either way, we know that v2 isn't
>> slower because of that overhead.
>>
>> I have some concerns about merging it right now. Mainly, I'd like to get
>> a release out soon so that we have our first Apache release.
>> Including a vectorized path in that release would delay it, so I'd like
>> to keep vectorization separate for now and follow up with a release that
>> includes vectorization when that code is stable. Does that plan work for
>> you guys?
>>
>> My other concern about the PR is the reason why I think merging it would
>> delay the release. Originally, we used Spark's built-in read support for
>> Parquet that creates InternalRow. But we found that version differences
>> between the Parquet pulled in by Spark and by Iceberg caused runtime
>> errors. We fixed those problems by removing the use of Spark internal
>> classes and shading/relocating Parquet to be able to use our own copy of
>> Parquet. Merging this support would require reverting that change and
>> updating the iceberg-spark-runtime Jar build.
>>
>> It also looks like we will need to invest some time in making sure this
>> read path provides the same guarantees as other readers. From looking at
>> this, I think that this passes a Spark schema to project columns, but that
>> would result in by-name resolution instead of using column IDs. So we will
>> need to fix that up for each file to ensure the right columns are projected
>> after schema changes, like renaming a column.
>>
>> I'm at ApacheCon next week, but I'll take a closer look at this when I am
>> back.
>>
>> rb
>>
>> On Thu, Sep 5, 2019 at 4:59 AM Gautam <gautamkows...@gmail.com> wrote:
>>
>>> I've added unit tests and created a PR for the v1 vectorization work:
>>> https://github.com/apache/incubator-iceberg/pull/452
>>>
>>> I'm sure there's scope for further improvement, so let me know your
>>> feedback on the PR so I can sharpen it further.
>>>
>>> Cheers,
>>> -Gautam.
>>>
>>> On Wed, Sep 4, 2019 at 10:33 PM Mouli Mukherjee
>>> <moulimukher...@gmail.com> wrote:
>>>
>>>> Hi Gautam, this is very exciting to see. It would be great if this was
>>>> available behind a flag if possible.
>>>>
>>>> Best,
>>>> Mouli
>>>>
>>>> On Wed, Sep 4, 2019, 7:01 AM Gautam <gautamkows...@gmail.com> wrote:
>>>>
>>>>> Hello Devs,
>>>>> As some of you know, there's been ongoing work as part of [1] to build
>>>>> Arrow-based vectorization into Iceberg. There's a separate thread on
>>>>> this dev list where that is being discussed, and progress is being
>>>>> tracked in a separate branch [2]. The overall approach there is to
>>>>> build a ground-up Arrow-based implementation of vectorization within
>>>>> Iceberg so that any compute engine using Iceberg would benefit from
>>>>> those optimizations. We feel that is indeed the longer-term solution
>>>>> and the best way forward.
>>>>>
>>>>> Meanwhile, Xabriel & I took to investigating an interim approach where
>>>>> Iceberg could use the current vectorization code built into Spark's
>>>>> Parquet reading, which I will refer to as "*V1 Vectorization*". This is
>>>>> the code path that Spark's DataSourceV1 readers use to read Parquet
>>>>> data. The result is that we have performance parity between Iceberg and
>>>>> Spark's vanilla Parquet reader. We thought we should share this with
>>>>> the larger community so others can benefit from this gain.
>>>>>
>>>>> *What we did*:
>>>>> - Added a new reader, *V1VectorizedReader*, that internally
>>>>> short-circuits to the V1 code path [3], which does most of the setup
>>>>> and work to perform vectorization. It's exactly what vanilla Spark's
>>>>> reader does underneath the DSV1 implementation.
>>>>> - It builds an iterator which expects ColumnarBatches from the Objects
>>>>> returned by the resolving iterator.
>>>>> - We re-organized and optimized code while building *ReadTask*
>>>>> instances, which considerably improved task initiation and planning
>>>>> time.
>>>>> - Setting `*iceberg.read.enableV1VectorizedReader*` to true enables
>>>>> this reader in IcebergSource.
>>>>> - The V1Vectorized reader is an independent class with copied code in
>>>>> some methods, as we didn't want to degrade perf due to
>>>>> inheritance/virtual method calls (we noticed degradation when we did
>>>>> try to re-use code).
>>>>> - I've pushed this code to a separate branch [4] in case others want
>>>>> to give this a try.
>>>>>
>>>>> *The Numbers*:
>>>>>
>>>>> Flat Data 10 files 10M rows each
>>>>>
>>>>> Benchmark                                                               Mode  Cnt   Score    Error   Units
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized     ss    5  63.631  ± 1.300   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized        ss    5  28.322  ± 2.400   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readIceberg                     ss    5  65.862  ± 2.480   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized10k      ss    5  28.199  ± 1.255   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readIcebergV1Vectorized20k      ss    5  29.822  ± 2.848   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.*readIcebergV1Vectorized5k*     ss    5  27.953  ± 0.949   s/op
>>>>>
>>>>> Flat Data Projections 10 files 10M rows each
>>>>>
>>>>> Benchmark                                                                             Mode  Cnt   Score    Error   Units
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionFileSourceNonVectorized     ss    5  11.307  ± 1.791   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.*readWithProjectionFileSourceVectorized*      ss    5   3.480  ± 0.087   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIceberg                     ss    5  11.057  ± 0.236   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized10k      ss    5   3.953  ± 1.592   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized20k      ss    5   3.619  ± 1.305   s/op
>>>>> IcebergSourceFlatParquetDataReadBenchmark.readWithProjectionIcebergV1Vectorized5k       ss    5   4.109  ± 1.734   s/op
>>>>>
>>>>> Filtered Data 500 files 10k rows each
>>>>>
>>>>> Benchmark                                                                           Mode  Cnt  Score    Error   Units
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceNonVectorized     ss    5  2.139  ± 0.719   s/op
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterFileSourceVectorized        ss    5  2.213  ± 0.598   s/op
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.*readWithFilterIcebergNonVectorized*      ss    5  0.144  ± 0.029   s/op
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized100k     ss    5  0.179  ± 0.019   s/op
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized10k      ss    5  0.189  ± 0.046   s/op
>>>>> IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergV1Vectorized5k       ss    5  0.195  ± 0.137   s/op
>>>>>
>>>>> *Perf Notes*:
>>>>> - Iceberg V1 Vectorization's real gain (over the current Iceberg impl)
>>>>> is in flat data scans. Notice how it's almost exactly the same as
>>>>> vanilla Spark vectorization.
>>>>> - Projections work equally well, although nested column projections
>>>>> are still not performing as well; we need to be able to push nested
>>>>> column projections down to Iceberg.
>>>>> - We saw a slight overhead with Iceberg V1 Vectorization on smaller
>>>>> workloads, but this goes away with larger data files.
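A quick way to read the flat-scan numbers quoted above: the Iceberg V1-vectorized score (27.953 ± 0.949 s/op) and Spark's vectorized score (28.322 ± 2.400 s/op) have overlapping error intervals, which is what "performance parity" means here, while the current Iceberg reader (65.862 ± 2.480 s/op) is far outside them. A small sketch of that check (illustrative only; it assumes the ± column is an error half-width around the score):

```java
public class ParityCheck {
    // True when two benchmark scores' error intervals overlap,
    // i.e. the measurements are indistinguishable at the reported error.
    static boolean overlaps(double score1, double err1, double score2, double err2) {
        return Math.abs(score1 - score2) <= err1 + err2;
    }

    public static void main(String[] args) {
        // Spark vectorized vs. Iceberg readIcebergV1Vectorized5k, flat scan (s/op)
        System.out.println(overlaps(28.322, 2.400, 27.953, 0.949)); // true: parity
        // Spark vectorized vs. the current (non-vectorized) Iceberg reader
        System.out.println(overlaps(28.322, 2.400, 65.862, 2.480)); // false: ~2.3x apart
    }
}
```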
>>>>>
>>>>> *Why we think this is useful*:
>>>>> - This approach allows users to benefit from both: 1) Iceberg's
>>>>> metadata filtering and 2) Spark's scan vectorization. This should help
>>>>> with Iceberg adoption.
>>>>> - We think this can be an interim solution (until the Arrow-based impl
>>>>> is fully performant) for those who are currently blocked by the
>>>>> performance difference between Iceberg and Spark's native vectorization
>>>>> for interactive use cases. There's a lot of optimization work and
>>>>> testing gone into V1 vectorization that Iceberg can now benefit from.
>>>>> - In many cases, companies have proprietary implementations of
>>>>> *ParquetFileFormat* that could have extended features like complex type
>>>>> support. Our code can use that at runtime as long as the
>>>>> '*buildReaderWithPartitionValues()*' signature is consistent; if not,
>>>>> the reader can easily be modified to plug their own vectorized reader
>>>>> in.
>>>>> - While profiling the Arrow implementation, I found it difficult to
>>>>> compare bottlenecks due to major differences between the DSv1 and DSv2
>>>>> client-to-source interface paths. This makes it easier to compare
>>>>> numbers and profile code between V1 vectorization and Arrow
>>>>> vectorization, as we now have both paths working behind a single
>>>>> DataSourceV2 path (viz. IcebergSource).
>>>>>
>>>>> *Limitations*:
>>>>> - This implementation is specific to Spark, so other compute
>>>>> frameworks like Presto won't benefit from it.
>>>>> - It doesn't use Iceberg's value reader interface, as it bypasses
>>>>> everything under the task data reading (we added a separate
>>>>> *V1VectorizedTaskDataReader*).
>>>>> - We need to maintain two readers, as adding any code to Reader.java
>>>>> might need changes to the V1Vectorized reader, although we could
>>>>> minimize this with a *ReaderUtils* class.
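The "behind a feature flag" idea discussed in this thread boils down to choosing between two reader implementations when a read task is created. A minimal, hypothetical sketch of that dispatch (the interface and class bodies here are illustrative, not Iceberg's actual API; only the flag name `iceberg.read.enableV1VectorizedReader` comes from the message above):

```java
import java.util.Map;

public class ReaderFactory {
    // Hypothetical stand-in for a per-task reader abstraction.
    interface TaskReader { String name(); }

    static class RowReader implements TaskReader {
        public String name() { return "row-by-row"; }
    }

    static class V1VectorizedReader implements TaskReader {
        public String name() { return "v1-vectorized"; }
    }

    // Pick the reader based on the read option, defaulting to the row
    // reader so existing behavior is unchanged when the flag is absent.
    static TaskReader create(Map<String, String> options) {
        boolean vectorized = Boolean.parseBoolean(
            options.getOrDefault("iceberg.read.enableV1VectorizedReader", "false"));
        return vectorized ? new V1VectorizedReader() : new RowReader();
    }

    public static void main(String[] args) {
        System.out.println(ReaderFactory.create(Map.of()).name()); // row-by-row
        System.out.println(ReaderFactory.create(
            Map.of("iceberg.read.enableV1VectorizedReader", "true")).name()); // v1-vectorized
    }
}
```

Defaulting to the existing reader is what makes this safe to merge as an interim path: users who never set the flag see no behavior change.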
>>>>>
>>>>> I have the code checked in at
>>>>> https://github.com/prodeezy/incubator-iceberg/tree/v1-vectorized-reader .
>>>>> If folks think this is useful and we can keep this as an interim
>>>>> solution behind a feature flag, I can get a PR up with proper unit
>>>>> tests.
>>>>>
>>>>> thanks and regards,
>>>>> -Gautam.
>>>>>
>>>>> [1] - https://github.com/apache/incubator-iceberg/issues/9
>>>>> [2] - https://github.com/apache/incubator-iceberg/tree/vectorized-read
>>>>> [3] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L197
>>>>> [4] - https://github.com/prodeezy/incubator-iceberg/tree/v1-vectorized-reader
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
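A footnote on the two Arrow settings from Samarth's message earlier in the thread: `arrow.enable_unsafe_memory_access` and `arrow.enable_null_check_for_get` are plain JVM system properties, and Arrow Java reads them in static initializers, so they must be set before any Arrow class is loaded. A minimal sketch (the helper class is illustrative, not part of Iceberg; only the property names and values come from the thread):

```java
public class ArrowTuning {
    // Must run before any Arrow class is loaded: Arrow captures these
    // property values in static initializers, so later changes are ignored.
    public static void enableFastPath() {
        // Skip bounds checking on Arrow buffer reads/writes.
        System.setProperty("arrow.enable_unsafe_memory_access", "true");
        // Skip the null check inside vector get(index) calls.
        System.setProperty("arrow.enable_null_check_for_get", "false");
    }

    public static void main(String[] args) {
        enableFastPath();
        System.out.println(System.getProperty("arrow.enable_unsafe_memory_access")); // true
        System.out.println(System.getProperty("arrow.enable_null_check_for_get"));   // false
    }
}
```

In a Spark deployment these would typically be passed as `-D` JVM options on both the driver and the executors rather than set programmatically, so they are in place before any class loads.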