msamirkhan commented on pull request #29354: URL: https://github.com/apache/spark/pull/29354#issuecomment-669519533
The [pdf attached to the PR](https://github.com/apache/spark/files/5025167/AvroBenchmarks.pdf) contains the read- and write-time improvements for the commits.

**Read.** The previous read path was `Decoder => GenericDatumReader => AvroDeserializer`. Changing the way the `SpecificInternalRow` is created improves read times, but that change can be made in the `SpecificInternalRow` constructor instead (I will create a new PR based on the patch in https://github.com/apache/spark/pull/29353#issuecomment-669459288). Moving to a native reader changes the path to `Decoder => SparkAvroDatumReader`. One benefit is the ability to "skip" data that is not needed; this shows up in column K of the read benchmarks for the single-column scan cases, as well as in the pruning benchmark.

**Write.** The previous write path was `AvroSerializer => ReflectDatumWriter => Encoder`. Spark doesn't need `ReflectDatumWriter` and can use `GenericDatumWriter` instead. This is a one-line change (https://github.com/apache/spark/pull/29354/commits/515b4a99d3edeb902a6680f78a38f0d3f977528f) that already improves write times significantly (column A, p. 3 of the pdf). Moving to a native writer changes the path to `SparkAvroDatumWriter => Encoder` and improves write times significantly more (columns D-K, p. 3 of the pdf).
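To illustrate why skip-during-decode helps, here is a toy Scala sketch. It is not Spark's or Avro's actual API; the record format (zig-zag varint longs and length-prefixed UTF-8 strings, loosely modeled on Avro's binary encoding) and all names (`SkipReader`, `readThirdField`) are illustrative only. The point is that an unwanted string field can be skipped by reading just its length and advancing the cursor, rather than materializing a `String` for it:

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

// Toy sketch: a decoder that can skip fields it doesn't need,
// in the spirit of the PR's native SparkAvroDatumReader.
object SkipReader {
  // Decode a zig-zag varint long, returning (value, nextPos).
  def readVarLong(buf: Array[Byte], pos: Int): (Long, Int) = {
    var p = pos; var shift = 0; var n = 0L
    var b = buf(p); p += 1
    while ((b & 0x80) != 0) {
      n |= (b & 0x7fL) << shift; shift += 7
      b = buf(p); p += 1
    }
    n |= (b & 0x7fL) << shift
    ((n >>> 1) ^ -(n & 1), p) // zig-zag decode
  }

  // Skip a length-prefixed string WITHOUT building a String object:
  // read the length, then just advance the cursor past the bytes.
  def skipString(buf: Array[Byte], pos: Int): Int = {
    val (len, p) = readVarLong(buf, pos)
    p + len.toInt
  }

  // Read only the third field (a long) of a (string, string, long)
  // record, skipping the two string fields entirely.
  def readThirdField(buf: Array[Byte]): Long = {
    var p = skipString(buf, 0)
    p = skipString(buf, p)
    readVarLong(buf, p)._1
  }

  // Encoder helpers so the sketch is self-contained.
  def writeVarLong(v: Long, out: ArrayBuffer[Byte]): Unit = {
    var n = (v << 1) ^ (v >> 63) // zig-zag encode
    while ((n & ~0x7fL) != 0) {
      out += ((n & 0x7f) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
  }

  def writeString(s: String, out: ArrayBuffer[Byte]): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    writeVarLong(bytes.length.toLong, out)
    out ++= bytes
  }
}
```

A full `GenericDatumReader`-style path would decode both strings into objects only to discard them; the skip path touches each unwanted field's length header and nothing else, which is what the single-column-scan and pruning benchmark columns are measuring.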