Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/21952
  
    @dbtsai I didn't use Spark 2.3 when testing databricks-avro; I used current master for it as well. But because a recent change to schema verification (`FileFormat.supportDataType`) causes an incompatibility, I manually skipped the call to `supportDataType`.
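
    To illustrate what was skipped, here is a hypothetical, self-contained sketch of the kind of per-field check that `supportDataType` drives (names and behavior here are assumptions for illustration, not the actual Spark source):

    ```scala
    import org.apache.spark.sql.types._

    // Hypothetical stand-in for FileFormat.supportDataType: recursively
    // decide whether a source claims to support a data type.
    def supportDataType(dt: DataType): Boolean = dt match {
      case ArrayType(elem, _) => supportDataType(elem)
      case MapType(k, v, _)   => supportDataType(k) && supportDataType(v)
      case StructType(fields) => fields.forall(f => supportDataType(f.dataType))
      case NullType           => false // assumed unsupported, for illustration only
      case _                  => true  // leaf types pass in this sketch
    }

    // Schema verification fails fast on the first unsupported field; skipping
    // the call bypasses exactly this kind of loop, which is what let
    // databricks-avro run on current master for this benchmark.
    val schema = new StructType().add("features", ArrayType(DoubleType))
    schema.fields.foreach { f =>
      require(supportDataType(f.dataType),
        s"Avro data source does not support ${f.dataType.catalogString} data type.")
    }
    ```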
    
    So basically I tested both built-in avro and databricks-avro on current master. I think the difference between Spark 2.3 and current master may account for the difference you observed.
    
    Btw, for the following benchmark numbers I changed the array feature length from 16000 to 1600; a sketch of the timing harness follows the numbers.
    
    ```scala
    > "com.databricks.spark.avro"
     
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21102|
    | stddev|0.010737435692590912|
    |    min|               0.195|
    |    max|               0.247|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean| 0.09441999999999999|
    | stddev|0.016021563751722395|
    |    min|                0.07|
    |    max|               0.134|
    +-------+--------------------+
    
    > "avro"
    
    scala> spark.sparkContext.parallelize(writeTimes.slice(50, 150)).toDF("writeTimes").describe("writeTimes").show()
    +-------+--------------------+
    |summary|          writeTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.21445|
    | stddev|0.008952596824329237|
    |    min|               0.201|
    |    max|                0.25|
    +-------+--------------------+
    
    
    scala> spark.sparkContext.parallelize(readTimes.slice(50, 150)).toDF("readTimes").describe("readTimes").show()
    +-------+--------------------+
    |summary|           readTimes|
    +-------+--------------------+
    |  count|                 100|
    |   mean|             0.10792|
    | stddev|0.015983375201386058|
    |    min|                0.08|
    |    max|                0.15|
    +-------+--------------------+
    ```
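
    For reference, a minimal sketch of the timing harness that could produce the `writeTimes`/`readTimes` used above (hypothetical; row counts, paths, and iteration counts are assumptions; assumes spark-shell with its default implicits):

    ```scala
    import scala.collection.mutable.ArrayBuffer
    import scala.util.Random

    // Rows carrying an array feature of length 1600 (reduced from 16000).
    val df = spark.range(1024)
      .map(_ => Seq.fill(1600)(Random.nextDouble()))
      .toDF("features")
      .cache()
    df.count()  // materialize the cache before timing

    val writeTimes = new ArrayBuffer[Double]()
    val readTimes  = new ArrayBuffer[Double]()

    // 200 iterations; slice(50, 150) above keeps 100 warmed-up samples.
    (0 until 200).foreach { i =>
      val path = s"/tmp/avro-bench/$i"

      var start = System.nanoTime()
      df.write.format("avro").save(path)  // or "com.databricks.spark.avro"
      writeTimes += (System.nanoTime() - start) / 1e9

      start = System.nanoTime()
      spark.read.format("avro").load(path).count()  // force a full read
      readTimes += (System.nanoTime() - start) / 1e9
    }
    ```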

