[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-27 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21625


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-25 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197709175
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

Anyway, I updated the results by applying #21631





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197679212
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

@HyukjinKwon I'm fixing this now, but this bug seems similar to 
SPARK-24645. Would it be better to merge this fix with SPARK-24645?





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197679077
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

Yes, I thought I would do that first, but I couldn't because I hit another 
bug when column pruning is disabled:
```
./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
    at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
...
```
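
For anyone who wants to check both settings in one run, the shell session above can be sketched as a small standalone program. This is only an illustration, not code from the PR: the object name `CsvPartitionSumRepro` and its helper methods are hypothetical, and it assumes the pruning flag can be switched at runtime through `spark.conf.set`. Whether either query fails depends on the Spark build in use, so the result is wrapped in `Try` and simply printed.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Hypothetical helper object; the names are illustrative and not part of Spark.
object CsvPartitionSumRepro {

  // Writes ten rows as CSV, partitioned by p = id % 2 (directories p=0 and p=1).
  def writeSample(spark: SparkSession, dir: String): Unit =
    spark.range(10)
      .selectExpr("id % 2 AS p", "id")
      .write.mode("overwrite").partitionBy("p").csv(dir)

  // Runs the query from the report with column pruning switched on or off.
  // Assumption: the flag is a runtime SQL conf, so it can be set per session.
  def sumPartitionColumn(
      spark: SparkSession,
      dir: String,
      pruning: Boolean): Try[Seq[String]] = {
    spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", pruning.toString)
    Try(spark.read.csv(dir).selectExpr("sum(p)").collect().map(_.toString).toSeq)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("CsvPartitionSumRepro")
      .getOrCreate()
    val dir = "/tmp/spark-csv/csv"
    try {
      writeSample(spark, dir)
      // Print a Success or Failure for each setting instead of aborting.
      Seq(true, false).foreach { pruning =>
        println(s"columnPruning=$pruning -> ${sumPartitionColumn(spark, dir, pruning)}")
      }
    } finally {
      spark.stop()
    }
  }
}
```

A `Failure` for one setting and a `Success` for the other would show that the problem is tied to the pruning flag rather than to the data itself.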





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197678386
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

@maropu, if the JIRA blocks this PR, please feel free to set the 
configuration to false and proceed. Technically, that looks like what the 
benchmark originally covered at the time it was merged. Setting it to true 
can be done separately in the JIRA you opened.





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197676313
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

I filed a JIRA: https://issues.apache.org/jira/browse/SPARK-24645





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197676056
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

Oh, I hit a bug in CSV parsing while updating this benchmark:
```
scala> val dir = "/tmp/spark-csv/csv"
scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
    at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
    at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
...
```





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-24 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197652431
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -39,9 +39,11 @@ import org.apache.spark.util.{Benchmark, Utils}
 object DataSourceReadBenchmark {
   val conf = new SparkConf()
 .setAppName("DataSourceReadBenchmark")
-.setIfMissing("spark.master", "local[1]")
+// Since `spark.master` always exists, overrides this value
+.set("spark.master", "local[1]")
--- End diff --

Thank you for fixing this and updating the result, @maropu .
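
The difference between the two calls in the diff above can be sketched in isolation. This is a minimal illustration rather than code from the PR: the object name `MasterOverrideSketch` and the helper `presetConf` are hypothetical, and the preset value `local[*]` stands in for whatever the launcher had already put into `spark.master`.

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch; the names are illustrative and not taken from the PR.
object MasterOverrideSketch {

  // Stand-in for a configuration where spark.master was already set
  // before the benchmark object builds its SparkConf (an assumption).
  def presetConf(): SparkConf =
    new SparkConf(false).set("spark.master", "local[*]")

  def main(args: Array[String]): Unit = {
    // setIfMissing leaves an existing value untouched.
    val kept = presetConf().setIfMissing("spark.master", "local[1]")
    // set always overwrites the value.
    val forced = presetConf().set("spark.master", "local[1]")

    println(s"setIfMissing -> ${kept.get("spark.master")}")   // local[*]
    println(s"set          -> ${forced.get("spark.master")}") // local[1]
  }
}
```

With a value already present, only `set` pins the benchmark to a single local thread, which matches the intent stated in the diff comment.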





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-23 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197628635
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

oh, thanks. I'll update soon.





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21625#discussion_r197627610
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala
 ---
@@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
   }
 }
 
-/*
-Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
-Partitioned Table:   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-

--- End diff --

It seems this was missed in the update.





[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

2018-06-23 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/21625

[SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark 
results 

## What changes were proposed in this pull request?
This PR corrects the default configuration (`spark.master=local[1]`) for the 
benchmarks. It also updates the performance results, measured on an AWS 
`r3.xlarge` instance.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark FixDataSourceReadBenchmark

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21625.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21625


commit 23528200f833f236a83d6b891388b6ec698bcac7
Author: Takeshi Yamamuro 
Date:   2018-06-16T01:48:15Z

Fix



