msamirkhan commented on pull request #29354: URL: https://github.com/apache/spark/pull/29354#issuecomment-669519533
The [pdf attached to the PR](https://github.com/apache/spark/files/5025167/AvroBenchmarks.pdf) contains the read- and write-time improvements for the commits.

**Read.** The previous read path was `Decoder => GenericDatumReader => AvroDeserializer`. Changing the way the `SpecificInternalRow` is created improves read times, but that change can be made in the `SpecificInternalRow` constructor instead (I will create a new PR based on the patch in https://github.com/apache/spark/pull/29353#issuecomment-669459288). Moving to a native reader changes the path to `Decoder => SparkAvroDatumReader`. One benefit is the ability to "skip" data that is not needed; this shows up in column K of the read benchmarks for the single-column scan cases, as well as in the pruning benchmark.

**Write.** The previous write path was `AvroSerializer => ReflectDatumWriter => Encoder`. Spark doesn't need `ReflectDatumWriter` and can use `GenericDatumWriter` instead. This is a one-line change (https://github.com/apache/spark/pull/29354/commits/515b4a99d3edeb902a6680f78a38f0d3f977528f) that already improves write times significantly (column A, p. 3 of the pdf). Moving to a native writer changes the path to `SparkAvroDatumWriter => Encoder` and improves write times significantly more (columns D-K, p. 3 of the pdf).
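To illustrate why skip-during-decode helps, here is a toy Scala sketch. It is not Spark's or Avro's actual API; the record format (zig-zag varint longs and length-prefixed UTF-8 strings, loosely modeled on Avro's binary encoding) and all names (`SkipReader`, `readThirdField`) are illustrative only. The point is that an unwanted string field can be skipped by reading just its length and advancing the cursor, rather than materializing a `String` for it:

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

// Toy sketch: a decoder that can skip fields it doesn't need,
// in the spirit of the PR's native SparkAvroDatumReader.
object SkipReader {
  // Decode a zig-zag varint long, returning (value, nextPos).
  def readVarLong(buf: Array[Byte], pos: Int): (Long, Int) = {
    var p = pos; var shift = 0; var n = 0L
    var b = buf(p); p += 1
    while ((b & 0x80) != 0) {
      n |= (b & 0x7fL) << shift; shift += 7
      b = buf(p); p += 1
    }
    n |= (b & 0x7fL) << shift
    ((n >>> 1) ^ -(n & 1), p) // zig-zag decode
  }

  // Skip a length-prefixed string WITHOUT building a String object:
  // read the length, then just advance the cursor past the bytes.
  def skipString(buf: Array[Byte], pos: Int): Int = {
    val (len, p) = readVarLong(buf, pos)
    p + len.toInt
  }

  // Read only the third field (a long) of a (string, string, long)
  // record, skipping the two string fields entirely.
  def readThirdField(buf: Array[Byte]): Long = {
    var p = skipString(buf, 0)
    p = skipString(buf, p)
    readVarLong(buf, p)._1
  }

  // Encoder helpers so the sketch is self-contained.
  def writeVarLong(v: Long, out: ArrayBuffer[Byte]): Unit = {
    var n = (v << 1) ^ (v >> 63) // zig-zag encode
    while ((n & ~0x7fL) != 0) {
      out += ((n & 0x7f) | 0x80).toByte
      n >>>= 7
    }
    out += n.toByte
  }

  def writeString(s: String, out: ArrayBuffer[Byte]): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    writeVarLong(bytes.length.toLong, out)
    out ++= bytes
  }
}
```

A full `GenericDatumReader`-style path would decode both strings into objects only to discard them; the skip path touches each unwanted field's length header and nothing else, which is what the single-column-scan and pruning benchmark columns are measuring.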