Yuming Wang created PARQUET-1355:
------------------------------------

             Summary: Improve Parquet Binary write performance
                 Key: PARQUET-1355
                 URL: https://issues.apache.org/jira/browse/PARQUET-1355
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
    Affects Versions: 1.10.0
            Reporter: Yuming Wang


*Benchmark code*:
{code:java}
test("Parquet write benchmark") {
  val count = 100 * 1024 * 1024
  val numIters = 5
  withTempPath { path =>
    val benchmark = new Benchmark(s"Parquet write benchmark ${spark.sparkContext.version}", 5)

    Seq("long", "string", "decimal(18, 0)", "decimal(38, 18)", "timestamp").foreach { dt =>
      benchmark.addCase(s"$dt type", numIters = numIters) { iter =>
        spark.range(count).selectExpr(s"cast(id as $dt) as id")
          .write.mode("overwrite").parquet(path.getAbsolutePath)
      }
    }
    benchmark.run()
  }
}
{code}

*Result*:

{noformat}
-- Spark 2.3.3-SNAPSHOT with Parquet 1.8.3

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.3.3-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   10963 / 11344          0.0  2192675973.8       1.0X
string type                                 28423 / 29437          0.0  5684553922.2       0.4X
decimal(18, 0) type                         11558 / 11696          0.0  2311587203.6       0.9X
decimal(38, 18) type                        43858 / 44432          0.0  8771537663.4       0.2X


-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11633 / 12070          0.0  2326572295.8       1.0X
string type                                 31374 / 32178          0.0  6274760187.4       0.4X
decimal(18, 0) type                         13019 / 13294          0.0  2603841925.4       0.9X
decimal(38, 18) type                        50719 / 50983          0.0 10143775007.6       0.2X
{noformat}
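
To make the gap between the two runs easier to read, the best times above can be turned into percentage changes. The following is only a minimal sketch: the class and method names are hypothetical, and the numbers are copied from the "Best" column of the two result tables. Note that the two runs also differ in Spark version, so the difference is not necessarily due to Parquet alone.

```java
public class SlowdownSketch {

    // Percentage change of the newer best time relative to the older one.
    // A positive value means the newer run was slower.
    public static double slowdownPercent(double oldMs, double newMs) {
        return (newMs - oldMs) / oldMs * 100.0;
    }

    public static void main(String[] args) {
        // "Best" times in ms, copied from the result tables above.
        String[] types = {"long", "string", "decimal(18, 0)", "decimal(38, 18)"};
        double[] parquet183 = {10963, 28423, 11558, 43858};   // Spark 2.3.3-SNAPSHOT, Parquet 1.8.3
        double[] parquet1100 = {11633, 31374, 13019, 50719};  // Spark 2.4.0-SNAPSHOT, Parquet 1.10.0

        for (int i = 0; i < types.length; i++) {
            System.out.printf("%-16s %+.1f%%%n",
                types[i], slowdownPercent(parquet183[i], parquet1100[i]));
        }
    }
}
```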


The slowdown is mainly caused by 
[toByteBuffer|https://github.com/apache/parquet-mr/blob/d61d221c9e752ce2cc0da65ede8b55653b3ae21f/parquet-column/src/main/java/org/apache/parquet/io/api/Binary.java#L83].
If {{toByteBuffer}} is not used when comparing {{Binary}} values, the result is:
{noformat}
-- Spark 2.4.0-SNAPSHOT with Parquet 1.10.0

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

Parquet write benchmark 2.4.0-SNAPSHOT:  Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
long type                                   11171 / 11508          0.0  2234189382.0       1.0X
string type                                 30072 / 30290          0.0  6014346455.4       0.4X
decimal(18, 0) type                         12150 / 12239          0.0  2430052708.8       0.9X
decimal(38, 18) type                        44974 / 45423          0.0  8994773738.8       0.2X
{noformat}
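
The idea of comparing without {{toByteBuffer}} can be illustrated with a small standalone sketch. This is not the parquet-mr code and not the actual patch: the class and both helper methods are hypothetical, and unsigned lexicographic ordering is assumed here only for illustration. The sketch shows two comparisons that give the same ordering, one that reads the backing arrays directly and one that first wraps each array in a {{ByteBuffer}}, which costs two extra object allocations per call.

```java
import java.nio.ByteBuffer;

public class BinaryCompareSketch {

    // Hypothetical helper: compares two byte ranges lexicographically,
    // treating each byte as unsigned, without allocating any wrapper object.
    public static int compareDirect(byte[] a, int aOff, int aLen,
                                    byte[] b, int bOff, int bLen) {
        int min = Math.min(aLen, bLen);
        for (int i = 0; i < min; i++) {
            int x = a[aOff + i] & 0xFF;
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter range sorts first.
        return aLen - bLen;
    }

    // Hypothetical helper: same ordering, but every call first wraps both
    // arrays in a ByteBuffer, i.e. two allocations per comparison.
    public static int compareViaByteBuffer(byte[] a, int aOff, int aLen,
                                           byte[] b, int bOff, int bLen) {
        ByteBuffer x = ByteBuffer.wrap(a, aOff, aLen);
        ByteBuffer y = ByteBuffer.wrap(b, bOff, bLen);
        int min = Math.min(x.remaining(), y.remaining());
        for (int i = 0; i < min; i++) {
            int p = x.get(x.position() + i) & 0xFF;
            int q = y.get(y.position() + i) & 0xFF;
            if (p != q) {
                return p - q;
            }
        }
        return x.remaining() - y.remaining();
    }

    public static void main(String[] args) {
        byte[] a = {0x01, (byte) 0x80};
        byte[] b = {0x01, 0x7F};
        // Both variants agree on the sign of the comparison.
        System.out.println(Integer.signum(compareDirect(a, 0, a.length, b, 0, b.length)));
        System.out.println(Integer.signum(compareViaByteBuffer(a, 0, a.length, b, 0, b.length)));
    }
}
```

Whether the allocation is the dominant cost in the real write path is the reporter's observation from the benchmark above; the sketch only shows that the wrapping step is not needed to obtain the ordering.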



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
