I'm not sure what you are saying; for our implementation, vectorization is the Arrow format. That's how we pass batches to Spark in vectorized mode. The two cannot be separated in the Iceberg code, although I suppose you could implement another columnar in-memory format by extending Spark's ColumnarBatch.
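For concreteness, a minimal sketch of how that handoff looks from the DataSourceV2 side: the reader factory advertises columnar reads, decodes data into Arrow vectors, and wraps them in Spark's ColumnarBatch. This is not Iceberg's actual reader code; the class names (ArrowBatchReaderFactory, ArrowBatchReader) and the openArrowBatches() helper are made up for the example.

import java.io.IOException;
import java.util.Iterator;

import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.connector.read.InputPartition;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.connector.read.PartitionReaderFactory;
import org.apache.spark.sql.vectorized.ArrowColumnVector;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

abstract class ArrowBatchReaderFactory implements PartitionReaderFactory {

  @Override
  public boolean supportColumnarReads(InputPartition partition) {
    // Answering "yes" here is what switches the scan onto the vectorized path:
    // Spark will call createColumnarReader() and expect ColumnarBatch instances.
    return true;
  }

  @Override
  public PartitionReader<InternalRow> createReader(InputPartition partition) {
    throw new UnsupportedOperationException("Row-based reads are not used in this sketch");
  }

  @Override
  public PartitionReader<ColumnarBatch> createColumnarReader(InputPartition partition) {
    // openArrowBatches() is a stand-in for "decode Parquet pages into Arrow vectors".
    return new ArrowBatchReader(openArrowBatches(partition));
  }

  abstract Iterator<VectorSchemaRoot> openArrowBatches(InputPartition partition);

  static class ArrowBatchReader implements PartitionReader<ColumnarBatch> {
    private final Iterator<VectorSchemaRoot> batches;
    private ColumnarBatch current;

    ArrowBatchReader(Iterator<VectorSchemaRoot> batches) {
      this.batches = batches;
    }

    @Override
    public boolean next() {
      if (!batches.hasNext()) {
        return false;
      }
      VectorSchemaRoot root = batches.next();
      // Wrap each Arrow vector in Spark's ArrowColumnVector: no copy and no
      // per-row serialization -- Spark reads straight out of the Arrow buffers.
      ColumnVector[] columns = new ColumnVector[root.getFieldVectors().size()];
      int i = 0;
      for (FieldVector vector : root.getFieldVectors()) {
        columns[i++] = new ArrowColumnVector(vector);
      }
      current = new ColumnarBatch(columns);
      current.setNumRows(root.getRowCount());
      return true;
    }

    @Override
    public ColumnarBatch get() {
      return current;
    }

    @Override
    public void close() throws IOException {
      if (current != null) {
        current.close();
      }
    }
  }
}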
Sent from my iPhone

> On Feb 5, 2022, at 7:06 PM, Mike Zhang <mike.zhang.2001...@gmail.com> wrote:
>
> Thanks Russell! I wonder if the performance gain is mainly from vectorization rather than from using the Arrow format? My understanding of the benefit of Arrow is that it avoids serialization/deserialization, and I'm having a hard time seeing how Iceberg uses Arrow to get that benefit.
>
>> On Sat, Feb 5, 2022 at 5:39 AM Russell Spitzer <russell.spit...@gmail.com> wrote:
>>
>> One thing to note is that we never really go to "RDD" records, since we are always working through the DataFrame API. Spark builds RDDs but expects us to deliver data in one of two ways: row-based (InternalRows) or columnar (Arrow vectors). Columnar reads are generally more efficient and parallelizable; when someone talks about vectorizing Parquet reads, they usually mean columnar reads.
>>
>> While this gives us much better performance (see the various perf test modules in the code base if you would like to run them yourself), Spark is still a row-oriented engine. Spark wants to take advantage of this format, which is why it provides the ColumnarBatch interface, but it still does codegen and other operations on a per-row basis. This means that although we can generally load the data much faster than with row-based loading, Spark still has to work on the data in row format most of the time. There are a variety of projects working to fix this as well.
>>
>>> On Fri, Feb 4, 2022 at 11:01 PM Mike Zhang <mike.zhang.2001...@gmail.com> wrote:
>>>
>>> I am reading the Iceberg code for the Parquet read path and see that Parquet files are read into Arrow format first. I wonder how much performance gain we get by doing that. Take the example of a Spark application with Iceberg: if the Parquet file were read directly into Spark RDD records, shouldn't that be faster than Parquet -> Arrow -> Spark record? Since Iceberg converts to Arrow first today, there must be some benefit to it, so I feel I'm missing something. Can somebody help explain?
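To make the quoted point about Spark being a row-oriented engine concrete, here is a tiny self-contained illustration. The OnHeapColumnVector setup is just a stand-in for a batch produced by a vectorized Parquet/Arrow read; the interesting part is that the batch is consumed one InternalRow at a time, which is roughly what Spark's row-oriented operators do with columnar input.

import java.util.Iterator;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class RowOverColumnarExample {
  public static void main(String[] args) {
    // Build a tiny one-column batch in memory (stand-in for a batch produced
    // by a vectorized read of a Parquet file).
    OnHeapColumnVector ids = new OnHeapColumnVector(3, DataTypes.IntegerType);
    ids.putInt(0, 10);
    ids.putInt(1, 20);
    ids.putInt(2, 30);

    ColumnarBatch batch = new ColumnarBatch(new OnHeapColumnVector[] {ids});
    batch.setNumRows(3);

    // Even though the data was loaded column by column, it is consumed
    // row by row -- the mismatch described in the thread above.
    for (Iterator<InternalRow> it = batch.rowIterator(); it.hasNext(); ) {
      InternalRow row = it.next();
      System.out.println(row.getInt(0));
    }

    batch.close();
  }
}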