[ https://issues.apache.org/jira/browse/SPARK-15726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hiroshi Inoue updated SPARK-15726:
----------------------------------
    Description: 
DatasetBenchmark compares the performance of RDD, DataFrame, and Dataset while 
running the same operations.
In the backToBackMap test case, however, the DataFrame implementation does less 
work than the RDD and Dataset implementations. The test case processes 
Long+String pairs, but the output of the DataFrame implementation does not 
include the String part, while the RDD and Dataset implementations generate 
Long+String pairs as output. This difference significantly changes the 
performance characteristics because of the overhead of String manipulation and 
creation: without the fix DataFrame was more than 2x faster than RDD, whereas 
with the fix RDD outperforms DataFrame. The performance gap between DataFrame 
and Dataset also becomes much narrower.

Of course, this issue does not affect Spark users, but it may confuse Spark 
developers.

{quote}
*// DataFrame*
val df = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
var res = df
{color:blue}res = res.select($"l" + 1 as "l"){color}
// this should be {color:red}res = res.select($"l" + 1 as "l", $"s"){color} for a fair comparison

*// Dataset*
case class Data(l: Long, s: String)
val func = (d: Data) => Data(d.l + 1, d.s)
var res = df.as\[Data\]
res = res.map(func)
{quote}
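
For illustration, here is a minimal sketch of how the corrected DataFrame variant lines up with the RDD variant. The SparkSession name spark and the values of numRows and numChains are assumptions for the sketch; the benchmark harness is omitted and the exact code in the patch may differ.

{quote}
// Sketch only; assumes a SparkSession named "spark" (e.g. in spark-shell)
// and hypothetical sizes. The actual DatasetBenchmark code may differ.
import org.apache.spark.sql.types.StringType
import spark.implicits._

val numRows = 100L * 1000 * 1000   // hypothetical row count
val numChains = 10                 // hypothetical number of chained maps

// RDD: every hop rebuilds the full (Long, String) pair
var rddRes = spark.sparkContext.range(1, numRows).map(l => (l, l.toString))
for (_ <- 1 to numChains) rddRes = rddRes.map { case (l, s) => (l + 1, s) }

// DataFrame: with the fix, the String column "s" is carried through every
// select, so the output matches the RDD and Dataset variants
var dfRes = spark.range(1, numRows).select($"id".as("l"), $"id".cast(StringType).as("s"))
for (_ <- 1 to numChains) dfRes = dfRes.select($"l" + 1 as "l", $"s")
{quote}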

Additionally, I added a new test case named "back-to-back map for primitive". 
It is almost equivalent to the old behavior of the DataFrame implementation of 
back-to-back map.
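
A rough sketch of what such a primitive-only back-to-back map might look like is below; the names and sizes are again hypothetical and the actual code in the patch may differ. It chains maps over a bare Long, which is essentially what the DataFrame variant measured before the fix.

{quote}
// Sketch only; same assumptions as above (SparkSession "spark", hypothetical sizes).
import spark.implicits._

val numRows = 100L * 1000 * 1000   // hypothetical row count
val numChains = 10                 // hypothetical number of chained maps

// RDD over bare Longs
var rddRes = spark.sparkContext.range(1, numRows)
for (_ <- 1 to numChains) rddRes = rddRes.map(_ + 1)

// DataFrame over a single Long column
var dfRes = spark.range(1, numRows).toDF("l")
for (_ <- 1 to numChains) dfRes = dfRes.select($"l" + 1 as "l")

// Dataset[Long] with a typed map function
var dsRes = spark.range(1, numRows).as[Long]
for (_ <- 1 to numChains) dsRes = dsRes.map(_ + 1)

// the benchmark then forces evaluation, e.g. with foreach or count
{quote}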

{quote}
without fix
OpenJDK 64-Bit Server VM 1.8.0_91-b14 on Linux 3.10.0-229.el7.x86_64
Intel Xeon E3-12xx v2 (Ivy Bridge)
back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD                                           2051 / 2077         48.7          20.5       1.0X
DataFrame                                      755 /  940        132.5           7.5       2.7X
Dataset                                       6155 / 6680         16.2          61.6       0.3X

with fix
back-to-back map:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD                                           2077 / 2259         48.1          20.8       1.0X
DataFrame                                     3030 / 3310         33.0          30.3       0.7X
Dataset                                       6504 / 7006         15.4          65.0       0.3X

back-to-back map for primitive:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
RDD                                           1073 / 1509         93.2          10.7       1.0X
DataFrame                                      763 /  913        131.0           7.6       1.4X
Dataset                                       4189 / 4312         23.9          41.9       0.3X
{quote}

Note that DatasetBenchmark causes a JVM crash in an aggregate test case; this is 
not related to this issue. I have already created a JIRA entry and submitted a 
pull request for the aggregate issue:
https://issues.apache.org/jira/browse/SPARK-15704
https://github.com/apache/spark/pull/13446



> Make DatasetBenchmark fairer among Dataset, DataFrame and RDD
> -------------------------------------------------------------
>
>                 Key: SPARK-15726
>                 URL: https://issues.apache.org/jira/browse/SPARK-15726
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Hiroshi Inoue
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
