Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952

Ah, finally I can reproduce this. It requires allocating the array feature with length 16000; reducing it to 1600 largely relieves the regression.

`com.databricks.spark.avro` is faster only on Spark 2.3. When used with the current master branch, it is no faster than the built-in Avro datasource, so a regression may have been introduced somewhere.

```scala
> "com.databricks.spark.avro - Spark 2.3"

scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
+-------+-------------------+
|summary|         writeTimes|
+-------+-------------------+
|  count|                100|
|   mean| 0.9711099999999999|
| stddev|0.01940836797556013|
|    min|              0.941|
|    max|              1.037|
+-------+-------------------+

scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
+-------+-------------------+
|summary|          readTimes|
+-------+-------------------+
|  count|                100|
|   mean|            0.36022|
| stddev|0.05807476546520342|
|    min|              0.287|
|    max|              0.626|
+-------+-------------------+

> "avro"

scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
+-------+-------------------+
|summary|         writeTimes|
+-------+-------------------+
|  count|                100|
|   mean| 1.7371699999999999|
| stddev|0.03504399976018602|
|    min|              1.695|
|    max|              1.886|
+-------+-------------------+

scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
+-------+-------------------+
|summary|          readTimes|
+-------+-------------------+
|  count|                100|
|   mean|0.32348999999999994|
| stddev|0.06235617714615632|
|    min|              0.263|
|    max|              0.781|
+-------+-------------------+
```
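The `slice(50, 150)` calls above drop the first 50 warm-up iterations and summarize the next 100 timings via `describe()`. As a minimal sketch (plain Scala, no Spark, with illustrative timing values rather than the ones from this run), the same statistics reported by `describe()` can be computed directly; note that Spark's `stddev` is the sample standard deviation (n - 1 denominator):

```scala
// Summarize a window of benchmark timings (seconds), mirroring what
// DataFrame.describe(...) reports: count, mean, stddev, min, max.
// The generated timing values below are illustrative placeholders,
// not measurements from the actual benchmark run.
object TimingSummary {
  case class Stats(count: Int, mean: Double, stddev: Double, min: Double, max: Double)

  def summarize(times: Seq[Double]): Stats = {
    val n = times.length
    val mean = times.sum / n
    // Sample variance, matching Spark's describe()/stddev semantics.
    val variance = times.map(t => math.pow(t - mean, 2)).sum / (n - 1)
    Stats(n, mean, math.sqrt(variance), times.min, times.max)
  }

  def main(args: Array[String]): Unit = {
    // 200 iterations total; keep iterations 50..149, as in slice(50, 150).
    val writeTimes = Seq.tabulate(200)(i => 0.95 + 0.01 * (i % 5))
    val s = summarize(writeTimes.slice(50, 150))
    println(f"count=${s.count} mean=${s.mean}%.4f stddev=${s.stddev}%.4f min=${s.min} max=${s.max}")
  }
}
```

This keeps the comparison honest across runs: discarding the warm-up window avoids JIT and cache effects dominating the mean.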