[ 
https://issues.apache.org/jira/browse/SPARK-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15726.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 13459
[https://github.com/apache/spark/pull/13459]

> Make DatasetBenchmark fairer among Dataset, DataFrame and RDD
> -------------------------------------------------------------
>
>                 Key: SPARK-15726
>                 URL: https://issues.apache.org/jira/browse/SPARK-15726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Hiroshi Inoue
>            Priority: Minor
>             Fix For: 2.1.0
>
>
> DatasetBenchmark compares the performances of RDD, DataFrame and Dataset 
> while running the same operations.
> In backToBackMap test case, however, only DataFrame implementation executes 
> less work compared to RDD or Dataset implementations. This test case 
> processes Long+String pairs, but the output from the DataFrame implementation 
> does not include String part while RDD or Dataset generates Long+String pairs 
> as output. This difference significantly changes the performance 
> characteristics due to the String manipulation and creation overheads. After 
> the fix RDD outperforms DataFrame, while DataFrame was more than 2x faster 
> than RDD without the fix. Also, the performance gap between DataFrame and 
> Dataset becomes much narrower.
> Of course, this issue does not affect Spark users, but it may confuse Spark 
> developers.
> {quote}
> *// DataFrame*
> val df = spark.range(1, numRows).select($"id".as("l"), 
> $"id".cast(StringType).as("s"))
> var res = df
> {color:blue}res = res.select($"l" + 1 as "l"){color}
> // this should be {color:red}res = res.select($"l" + 1 as "l", $"s"){color} 
> for fair comparison
> *// Dataset* 
> case class Data(l: Long, s: String)
> val func = (d: Data) => Data(d.l + 1, d.s)
> var res = df.as\[Data\]
> res = res.map(func)
> {quote}
> Additionally, I added a new test case named "back-to-back map for primitive". 
> This is almost equivalent with the old behavior of the DataFrame 
> implementation of back-to-back map.
> {quote}
> without fix
> OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
> Intel Xeon E3-12xx v2 (Ivy Bridge)
> back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per 
> Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> RDD                                           2051 / 2077         48.7        
>   20.5       1.0X
> DataFrame                                      755 /  940        132.5        
>    7.5       2.7X
> Dataset                                       6155 / 6680         16.2        
>   61.6       0.3X
> with fix
> back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per 
> Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> RDD                                           2077 / 2259         48.1        
>   20.8       1.0X
> DataFrame                                     3030 / 3310         33.0        
>   30.3       0.7X
> Dataset                                       6504 / 7006         15.4        
>   65.0       0.3X
> back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per 
> Row(ns)   Relative
> ------------------------------------------------------------------------------------------------
> RDD                                           1073 / 1509         93.2        
>   10.7       1.0X
> DataFrame                                      763 /  913        131.0        
>    7.6       1.4X
> Dataset                                       4189 / 4312         23.9        
>   41.9       0.3X
> {quote}
> Note that DatasetBenchmark causes JVM crash in an aggregate test case. This 
> is not related to this issue.
> I already created a jira entry and submited a pull request for the aggregate 
> issue.
> https://issues.apache.org/jira/browse/SPARK-15704
> https://github.com/apache/spark/pull/13446



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to