msamirkhan commented on pull request #29354:
URL: https://github.com/apache/spark/pull/29354#issuecomment-669519533


   The [PDF attached to the PR](https://github.com/apache/spark/files/5025167/AvroBenchmarks.pdf)
shows the read and write time improvements from the commits.
   
   For read, the previous behavior was Decoder => GenericDatumReader => 
AvroDeserializer. Changing the way SpecificInternalRow is created improves 
read times, but that change can be made in the SpecificInternalRow 
constructor instead (I will create a new PR based on the patch in 
https://github.com/apache/spark/pull/29353#issuecomment-669459288). Moving to 
the native reader changes the behavior to Decoder => SparkAvroDatumReader. The 
benefits include the ability to "skip" data that is not needed. This is 
reflected in column K of the read benchmark cases for the single column scan, 
as well as in the pruning benchmark.
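
To illustrate what "skipping" means at the encoding level, here is a minimal, standard-library-only sketch. It is not the SparkAvroDatumReader code from this PR; the schema, `encode_record`, and `read_column` are hypothetical names made up for the example. It relies only on two facts from the Avro binary encoding: longs are zigzag varints, and strings are a long length followed by UTF-8 bytes. Because a string carries its length up front, a reader that does not need the field can seek past it without decoding or allocating anything.

```python
import io

# Hypothetical record schema for illustration: (field name, Avro primitive type).
SCHEMA = [("id", "long"), ("name", "string"), ("payload", "string"), ("score", "long")]


def write_long(out, n):
    """Encode a 64-bit integer as an Avro zigzag varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.write(bytes([byte | 0x80]))  # more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_long(buf):
    """Decode one Avro zigzag varint from a binary stream."""
    acc, shift = 0, 0
    while True:
        b = buf.read(1)
        if not b:
            raise EOFError("truncated varint")
        acc |= (b[0] & 0x7F) << shift
        if not (b[0] & 0x80):
            break
        shift += 7
    return (acc >> 1) ^ -(acc & 1)  # undo zigzag


def encode_record(record, schema=SCHEMA):
    """Write the fields back to back, in schema order, as Avro does for records."""
    out = io.BytesIO()
    for name, typ in schema:
        if typ == "long":
            write_long(out, record[name])
        else:
            raw = record[name].encode("utf-8")
            write_long(out, len(raw))
            out.write(raw)
    return out.getvalue()


def read_column(data, wanted, schema=SCHEMA):
    """Return one field, skipping every field that precedes it."""
    buf = io.BytesIO(data)
    for name, typ in schema:
        if name == wanted:
            if typ == "long":
                return read_long(buf)
            return buf.read(read_long(buf)).decode("utf-8")
        if typ == "long":
            read_long(buf)  # varints have no length prefix, so they must be scanned
        else:
            buf.seek(read_long(buf), io.SEEK_CUR)  # jump over the string bytes
    raise KeyError(wanted)
```

With a record whose `payload` is large, `read_column(data, "score")` never materializes `payload`; it reads the length and seeks past it. A pipeline that first builds a full generic record has to decode every field before the unneeded ones can be dropped.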
   
   For write, the previous behavior was AvroSerializer => ReflectDatumWriter => 
Encoder. Spark doesn't need ReflectDatumWriter and can use GenericDatumWriter 
instead. This is a one-line change 
https://github.com/apache/spark/pull/29354/commits/515b4a99d3edeb902a6680f78a38f0d3f977528f
 and improves the write times significantly (column A, pg 3 of the pdf). Moving 
to the native writer changes the behavior to SparkAvroDatumWriter => Encoder 
and yields a further significant improvement in write times (columns D-K, pg 3 
of the pdf).
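
The shape of the write-path change can be sketched in the same spirit. This is a toy model under stated assumptions, not the code in this PR: `write_via_intermediate` stands in for a two-stage pipeline (convert the row to an intermediate record, then walk that record to drive the encoder), and `write_direct` stands in for a single-stage writer that drives the encoder straight from the row. All names and the schema are hypothetical. Both produce identical bytes; the direct path simply never builds the intermediate object.

```python
import io

# Hypothetical schema for illustration: (field name, Avro primitive type).
SCHEMA = [("id", "long"), ("name", "string")]


def write_long(out, n):
    """Encode a 64-bit integer as an Avro zigzag varint."""
    z = (n << 1) ^ (n >> 63)
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.write(bytes([byte | 0x80]))
        else:
            out.write(bytes([byte]))
            return


def write_string(out, s):
    """Encode a string as a long length followed by its UTF-8 bytes."""
    raw = s.encode("utf-8")
    write_long(out, len(raw))
    out.write(raw)


WRITERS = {"long": write_long, "string": write_string}


def write_via_intermediate(row, schema=SCHEMA):
    """Two stages: build a name-keyed record, then encode it field by field."""
    record = {name: value for (name, _), value in zip(schema, row)}  # extra allocation per row
    out = io.BytesIO()
    for name, typ in schema:
        WRITERS[typ](out, record[name])
    return out.getvalue()


def write_direct(row, schema=SCHEMA):
    """One stage: encode each positional value as it is read from the row."""
    out = io.BytesIO()
    for (_, typ), value in zip(schema, row):
        WRITERS[typ](out, value)
    return out.getvalue()
```

For the row `(7, "spark")`, both functions return the same byte string, so the two paths are interchangeable from the file format's point of view. The difference is only the per-row work done before the encoder is reached.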


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



