[GitHub] spark issue #22318: [SPARK-25150][SQL] Fix attribute deduplication in join

2018-09-03 Thread peter-toth
Github user peter-toth commented on the issue:

https://github.com/apache/spark/pull/22318
  
@mgaido91, 2.2 also suffered from this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...

2018-09-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22314
  
@ueshin Just verified in 2.3: this problem does not exist there, because the 
implementation of `nullSafeCodeGen` in 2.3 differs from the one in master. 
However, 2.3 is missing the test cases we added in these PRs. Should we check 
the test cases into the branch? I am afraid that if we ever backport the PR 
that changed `nullSafeCodeGen`, we may introduce this bug. Please advise.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95645/
Test PASSed.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22324
  
**[Test build #95645 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95645/testReport)**
 for PR 22324 at commit 
[`510d729`](https://github.com/apache/spark/commit/510d729b0ed6f83b05a3b0f06c2631163d62ef1a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class FileSourceSuite extends SharedSQLContext `


---




[GitHub] spark pull request #22318: [SPARK-25150][SQL] Fix attribute deduplication in...

2018-09-03 Thread peter-toth
Github user peter-toth commented on a diff in the pull request:

https://github.com/apache/spark/pull/22318#discussion_r214793247
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameJoinSuite.scala ---
@@ -295,4 +295,14 @@ class DataFrameJoinSuite extends QueryTest with 
SharedSQLContext {
   df.join(df, df("id") <=> df("id")).queryExecution.optimizedPlan
 }
   }
+
+  test("SPARK-25150: Attribute deduplication handles attributes in join 
condition properly") {
+val a = spark.range(1, 5)
+val b = spark.range(10)
+val c = b.filter($"id" % 2 === 0)
+
+val r = a.join(b, a("id") === b("id"), "inner").join(c, a("id") === 
c("id"), "inner")
--- End diff --

That simpler join doesn't hit the issue. It is handled by a different rule 
`ResolveNaturalAndUsingJoin`. 


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214787227
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala
 ---
@@ -398,6 +398,24 @@ class FilterPushdownBenchmark extends SparkFunSuite 
with BenchmarkBeforeAndAfter
   }
 }
   }
+
+  test(s"Pushdown benchmark with many filters") {
+val numRows = 1
+val width = 500
+
+withTempPath { dir =>
+  val columns = (1 to width).map(i => s"id c$i")
+  val df = spark.range(1).selectExpr(columns: _*)
+  withTempTable("orcTable", "patquetTable") {
--- End diff --

nit: a typo, `patquetTable`.


---




[GitHub] spark pull request #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Comp...

2018-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22317


---




[GitHub] spark issue #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Compilation...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22317
  
Thanks! merging to master.


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95651 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95651/testReport)**
 for PR 22313 at commit 
[`5c46693`](https://github.com/apache/spark/commit/5c46693e58e0f71fe8e67dce16f4b8c783c80aa6).


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22317: [SPARK-25310][SQL] ArraysOverlap may throw a Compilation...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22317
  
LGTM.


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2817/
Test PASSed.


---




[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

2018-09-03 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/22320#discussion_r214786494
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala 
---
@@ -754,6 +754,47 @@ class HiveDDLSuite
 }
   }
 
+  test("Insert overwrite Hive table should output correct schema") {
+withTable("tbl", "tbl2") {
+  withView("view1") {
+spark.sql("CREATE TABLE tbl(id long)")
+spark.sql("INSERT OVERWRITE TABLE tbl SELECT 4")
+spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
+spark.sql("CREATE TABLE tbl2(ID long)")
+spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
+checkAnswer(spark.table("tbl2"), Seq(Row(4)))
--- End diff --

Please add a schema assertion. We can read the data since 
[SPARK-25132](https://issues.apache.org/jira/browse/SPARK-25132).


---




[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...

2018-09-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22314
  
@ueshin Sure. 


---




[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22314
  
@dilipbiswal Do we need to backport this to 2.3? If so, could you submit a 
backport pr to branch-2.3 please? Thanks!


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22315
  
@dilipbiswal Do we need to backport this to 2.3? If so, could you submit a 
backport pr to branch-2.3 please? Thanks!


---




[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22219#discussion_r214785788
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -3237,6 +3238,28 @@ class Dataset[T] private[sql](
 files.toSet.toArray
   }
 
+  /**
+   * Returns the tuple of the row count and an SeqView that contains all 
rows in this Dataset.
+   *
+   * The SeqView will consume as much memory as the total size of 
serialized results which can be
+   * limited with the config 'spark.driver.maxResultSize'. Rows are 
deserialized when iterating rows
+   * with iterator of returned SeqView. Whether to collect all 
deserialized rows or to iterate them
+   * incrementally can be decided with considering total rows count and 
driver memory.
+   */
+  private[sql] def collectCountAndSeqView(): (Long, SeqView[T, Array[T]]) =
+withAction("collectCountAndSeqView", queryExecution) { plan =>
+  // This projection writes output to a `InternalRow`, which means 
applying this projection is
+  // not thread-safe. Here we create the projection inside this method 
to make `Dataset`
+  // thread-safe.
+  val objProj = GenerateSafeProjection.generate(deserializer :: Nil)
+  val (totalRowCount, internalRowsView) = plan.executeCollectSeqView()
+  (totalRowCount, internalRowsView.map { row =>
+// The row returned by SafeProjection is `SpecificInternalRow`, 
which ignore the data type
+// parameter of its `get` method, so it's safe to use null here.
+objProj(row).get(0, null).asInstanceOf[T]
+  }.asInstanceOf[SeqView[T, Array[T]]])
+}
--- End diff --

If this is a thriftserver-specific issue, can we do the same thing by 
fixing code only in the thriftserver package?
IMHO we'd better avoid modifying code in the sql package as much as 
possible.
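
The lazy deserialization described in the quoted Scaladoc can be illustrated without Spark. This is only a sketch under stated assumptions: `LazyViewSketch`, `deserialize` and `collectCountAndView` are hypothetical names, and a `Vector[String]` stands in for the serialized rows. It is not the PR's implementation.

```scala
object LazyViewSketch {
  // Counts how many rows have been "deserialized" so far.
  var deserialized = 0

  // Hypothetical stand-in for deserializing a single serialized row.
  def deserialize(raw: String): Int = {
    deserialized += 1
    raw.toInt
  }

  // Returns the total row count together with a lazy view over the rows.
  // Building the view does not deserialize anything; rows are deserialized
  // one by one as the view's iterator is consumed.
  def collectCountAndView(rows: Vector[String]) = {
    val view = rows.view.map(deserialize)
    (rows.length.toLong, view)
  }
}
```

With this shape a caller can look at the row count first and then decide whether to materialize everything or to iterate incrementally, which is the trade-off the Scaladoc mentions.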


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95644/
Test PASSed.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #22219: [SPARK-25224][SQL] Improvement of Spark SQL Thrif...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22219#discussion_r214785499
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -641,6 +641,16 @@ object SQLConf {
 .intConf
 .createWithDefault(200)
 
+  val THRIFTSERVER_BATCH_DESERIALIZE_LIMIT =
+buildConf("spark.sql.thriftServer.batchDeserializeLimit")
+  .doc("The maximum number of result rows that can be deserialized at 
one time. " +
+"If the number of result rows exceeds this value, the Thrift 
Server will only use " +
+"'memory of serialized rows' + 'memory of the deserialized rows 
being fetched to the " +
+"client'. Only valid if spark.sql.thriftServer.incrementalCollect 
is false. " +
--- End diff --

nit: `s"client'. Only valid if ${THRIFTSERVER_INCREMENTAL_COLLECT.key} is 
false. " +`
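
A minimal sketch of the point behind this nit, assuming a hypothetical `ConfEntry` case class in place of Spark's config builder: interpolating the other entry's `key` keeps the doc string in sync if that key is ever renamed.

```scala
object ConfDocSketch {
  // Hypothetical stand-in for a config entry; only `key` and `doc` matter here.
  final case class ConfEntry(key: String, doc: String)

  val THRIFTSERVER_INCREMENTAL_COLLECT: ConfEntry =
    ConfEntry("spark.sql.thriftServer.incrementalCollect", "...")

  // The doc string refers to the other entry through its key
  // instead of repeating the literal key name.
  val THRIFTSERVER_BATCH_DESERIALIZE_LIMIT: ConfEntry = ConfEntry(
    "spark.sql.thriftServer.batchDeserializeLimit",
    "The maximum number of result rows that can be deserialized at one time. " +
      s"Only valid if ${THRIFTSERVER_INCREMENTAL_COLLECT.key} is false.")
}
```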


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22319
  
**[Test build #95644 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95644/testReport)**
 for PR 22319 at commit 
[`4791240`](https://github.com/apache/spark/commit/4791240d08c75d5df23332d0059a4b15197d289f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...

2018-09-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22314
  
@ueshin @kiszk @maropu  Thanks a lot.


---




[GitHub] spark pull request #22314: [SPARK-25307][SQL] ArraySort function may return ...

2018-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22314


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread dilipbiswal
Github user dilipbiswal commented on the issue:

https://github.com/apache/spark/pull/22315
  
@gatorsmile Sure.. I will check and add.


---




[GitHub] spark issue #22314: [SPARK-25307][SQL] ArraySort function may return an erro...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22314
  
Thanks! merging to master.


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/22315
  
@dilipbiswal Could we also add the test cases for the other higher-order 
functions, if missing?


---




[GitHub] spark pull request #22315: [SPARK-25308][SQL] ArrayContains function may ret...

2018-09-03 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22315


---




[GitHub] spark pull request #22319: [SPARK-25044][SQL][followup] add back UserDefined...

2018-09-03 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/22319#discussion_r214784141
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/expressions/UserDefinedFunction.scala
 ---
@@ -41,12 +41,16 @@ import org.apache.spark.sql.types.DataType
 case class UserDefinedFunction protected[sql] (
 f: AnyRef,
 dataType: DataType,
-inputTypes: Option[Seq[ScalaReflection.Schema]]) {
+inputTypes: Option[Seq[DataType]]) {
--- End diff --

+1. This is why we added `_nameOption`, `_nullable` and `_deterministic` in the 
2.3 release. 

Please also remove the changes of MimaExcludes.scala made in 
https://github.com/apache/spark/pull/22259
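
For readers unfamiliar with the pattern, here is a rough sketch of how extra state can be attached to a case class without touching its constructor signature. `Udf`, `withName` and `asNonNullable` below are hypothetical, simplified stand-ins and not the actual `UserDefinedFunction` code.

```scala
object UdfCompatSketch {
  // Hypothetical case class whose constructor parameters must stay fixed
  // so that the generated apply/copy/unapply signatures do not change.
  case class Udf(f: AnyRef, dataType: String) {
    // New state is kept in private vars rather than constructor parameters.
    private var _nameOption: Option[String] = None
    private var _nullable: Boolean = true

    def nameOption: Option[String] = _nameOption
    def nullable: Boolean = _nullable

    // Copies the constructor fields and then carries over the extra state.
    private def copyAll(): Udf = {
      val udf = copy()
      udf._nameOption = _nameOption
      udf._nullable = _nullable
      udf
    }

    // Each "setter" returns an updated copy and leaves the original untouched.
    def withName(name: String): Udf = {
      val udf = copyAll()
      udf._nameOption = Option(name)
      udf
    }

    def asNonNullable(): Udf = {
      val udf = copyAll()
      udf._nullable = false
      udf
    }
  }
}
```

One consequence of this design is that the extra fields do not take part in the case class's generated `equals` and `hashCode`.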


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22315
  
Thanks! merging to master.


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread ueshin
Github user ueshin commented on the issue:

https://github.com/apache/spark/pull/22315
  
LGTM.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22324
  
**[Test build #95650 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95650/testReport)**
 for PR 22324 at commit 
[`bc05a35`](https://github.com/apache/spark/commit/bc05a354e375dfb1df6a70a46f28b792f8567fc5).


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2816/
Test PASSed.


---




[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22324#discussion_r214783002
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceSuite.scala
 ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class FileSourceSuite extends SharedSQLContext {
+
+  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
--- End diff --

ok


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22324
  
oh, I see.


---




[GitHub] spark issue #22306: [SPARK-25300][CORE]Unified the configuration parameter `...

2018-09-03 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22306
  
LGTM


---




[GitHub] spark issue #22321: [DOC] Update some outdated links

2018-09-03 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/22321
  
LGTM


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22320
  
**[Test build #95649 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95649/testReport)**
 for PR 22320 at commit 
[`538fea9`](https://github.com/apache/spark/commit/538fea99ed2158316d89f64ce397c4791fbed1f3).


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22320
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22320
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2815/
Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95643/
Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22179
  
**[Test build #95643 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95643/testReport)**
 for PR 22179 at commit 
[`f2fb28d`](https://github.com/apache/spark/commit/f2fb28da3eb272651530b77dbd4ea33511f0727d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class MapHolder `


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214778954
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
 
 for {
   // Combines all convertible filters using `And` to produce a single 
conjunction
-  conjunction <- 
convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+  conjunction <- buildTree(convertibleFilters)
--- End diff --

BTW, Parquet has another issue here due to `.reduceOption(FilterApi.and)`. 
When I ran a benchmark, Parquet seemed unable to handle 1000 filters, 
@cloud-fan.
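
To make the skew concrete, here is a small sketch that compares the depth of a `reduceOption`-style left-deep conjunction with a balanced one. The `Filter` ADT is a hypothetical stand-in for `org.apache.spark.sql.sources.Filter`, and the body of `buildTree` is an assumption, since the quoted diff only shows its call site.

```scala
object FilterTreeSketch {
  // Hypothetical minimal filter ADT, just enough to build conjunctions.
  sealed trait Filter
  final case class LessThan(attribute: String, value: Int) extends Filter
  final case class And(left: Filter, right: Filter) extends Filter

  // Left-deep combination: the tree depth grows linearly with the filter count.
  def skewed(filters: Seq[Filter]): Option[Filter] =
    filters.reduceOption[Filter](And(_, _))

  // Balanced combination (assumed shape): split the filters in half and
  // recurse, so the tree depth grows logarithmically with the filter count.
  def buildTree(filters: Seq[Filter]): Option[Filter] = filters match {
    case Seq() => None
    case Seq(single) => Some(single)
    case _ =>
      val (left, right) = filters.splitAt(filters.length / 2)
      Some(And(buildTree(left).get, buildTree(right).get))
  }

  // Number of nodes on the longest root-to-leaf path.
  def depth(filter: Filter): Int = filter match {
    case And(l, r) => 1 + math.max(depth(l), depth(r))
    case _ => 1
  }
}
```

For 2000 `LessThan` filters, the left-deep tree has depth 2000 while the balanced one has depth 12, which matters for any consumer that walks the tree recursively.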


---




[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

2018-09-03 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22320#discussion_r214778690
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -805,6 +805,80 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
 }
   }
 
+  test("Insert overwrite table command should output correct schema: 
basic") {
+withTable("tbl", "tbl2") {
+  withView("view1") {
+val df = spark.range(10).toDF("id")
+df.write.format("parquet").saveAsTable("tbl")
+spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
+spark.sql("CREATE TABLE tbl2(ID long) USING parquet")
+spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
+val identifier = TableIdentifier("tbl2", Some("default"))
+val location = 
spark.sessionState.catalog.getTableMetadata(identifier).location.toString
+val expectedSchema = StructType(Seq(StructField("ID", LongType, 
true)))
+assert(spark.read.parquet(location).schema == expectedSchema)
+checkAnswer(spark.table("tbl2"), df)
+  }
+}
+  }
+
+  test("Insert overwrite table command should output correct schema: 
complex") {
+withTable("tbl", "tbl2") {
+  withView("view1") {
+val df = spark.range(10).map(x => (x, x.toInt, 
x.toInt)).toDF("col1", "col2", "col3")
+df.write.format("parquet").saveAsTable("tbl")
+spark.sql("CREATE VIEW view1 AS SELECT * FROM tbl")
+spark.sql("CREATE TABLE tbl2(COL1 long, COL2 int, COL3 int) USING 
parquet PARTITIONED " +
+  "BY (COL2) CLUSTERED BY (COL3) INTO 3 BUCKETS")
+spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT COL1, COL2, COL3 
FROM view1")
+val identifier = TableIdentifier("tbl2", Some("default"))
+val location = 
spark.sessionState.catalog.getTableMetadata(identifier).location.toString
+val expectedSchema = StructType(Seq(
+  StructField("COL1", LongType, true),
--- End diff --

Keeping it should be OK.


---




[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

2018-09-03 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/22320#discussion_r214778523
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/DataFrameReaderWriterSuite.scala
 ---
@@ -805,6 +805,80 @@ class DataFrameReaderWriterSuite extends QueryTest 
with SharedSQLContext with Be
 }
   }
 
+  test("Insert overwrite table command should output correct schema: 
basic") {
+withTable("tbl", "tbl2") {
+  withView("view1") {
+val df = spark.range(10).toDF("id")
--- End diff --

This is trivial. As the column name `id` is case-sensitive and used below, 
I would like to show it explicitly.


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214778262
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25306 createFilter should not hang") {
+import org.apache.spark.sql.sources._
+val schema = new StructType(Array(StructField("a", IntegerType, 
nullable = true)))
+val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
+failAfter(2 seconds) {
+  OrcFilters.createFilter(schema, filters)
--- End diff --

I'll choose (2), @cloud-fan .


---




[GitHub] spark pull request #22048: [SPARK-25108][SQL] Fix the show method to display...

2018-09-03 Thread xuejianbest
Github user xuejianbest commented on a diff in the pull request:

https://github.com/apache/spark/pull/22048#discussion_r214778257
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2794,6 +2794,30 @@ private[spark] object Utils extends Logging {
   }
 }
   }
+
+  /**
+   * Regular expression matching full width characters
+   */
+  private val fullWidthRegex = ("""[""" +
+// scalastyle:off nonascii
+"""\u1100-\u115F""" +
+"""\u2E80-\uA4CF""" +
+"""\uAC00-\uD7A3""" +
+"""\uF900-\uFAFF""" +
+"""\uFE10-\uFE19""" +
+"""\uFE30-\uFE6F""" +
+"""\uFF00-\uFF60""" +
+"""\uFFE0-\uFFE6""" +
--- End diff --

> Can you describe them there and put a references to a public unicode 
document?

This regular expression matches on Unicode characters, regardless of the 
specific encoding.
For example, the following string is decoded from GBK bytes instead of UTF-8, 
and the match still works:
```scala
val bytes = Array[Byte](0xd6.toByte, 0xd0.toByte, 0xB9.toByte, 0xFA.toByte)
val s1 = new String(bytes, "gbk")

println(s1) // 中国

val fullWidthRegex = ("""[""" +
  // scalastyle:off nonascii
  """\u1100-\u115F""" +
  """\u2E80-\uA4CF""" +
  """\uAC00-\uD7A3""" +
  """\uF900-\uFAFF""" +
  """\uFE10-\uFE19""" +
  """\uFE30-\uFE6F""" +
  """\uFF00-\uFF60""" +
  """\uFFE0-\uFFE6""" +
  // scalastyle:on nonascii
  """]""").r

println(fullWidthRegex.findAllIn(s1).size) // 2
```
This regular expression was obtained experimentally under a specific font.
I don't understand what you are asking me to do here.
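
For illustration, here is a minimal sketch of how such a regex can feed a 
display-width calculation. It assumes each matched character occupies two 
columns and every other character one. `DisplayWidth` and `displayWidth` are 
hypothetical names, not the code in this PR; the ranges are copied from the 
diff above.

```scala
object DisplayWidth {
  // Ranges copied from the diff above. The \u escapes are left for the regex
  // engine (java.util.regex) to interpret, hence the doubled backslashes.
  private val fullWidthRegex = ("[" +
    "\\u1100-\\u115F" +
    "\\u2E80-\\uA4CF" +
    "\\uAC00-\\uD7A3" +
    "\\uF900-\\uFAFF" +
    "\\uFE10-\\uFE19" +
    "\\uFE30-\\uFE6F" +
    "\\uFF00-\\uFF60" +
    "\\uFFE0-\\uFFE6" +
    "]").r

  // Assumption: a full-width character takes two columns, anything else one,
  // so the width is the string length plus one extra column per match.
  def displayWidth(s: String): Int =
    if (s == null) 0 else s.length + fullWidthRegex.findAllIn(s).size
}
```

A padding routine could then pad each cell by `columnWidth - displayWidth(cell)` 
instead of `columnWidth - cell.length`.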


> How about some additional overheads when calling showString as compared 
to showString w/o this patch?

I tested a Dataset of 100 rows with two columns: an index (0-99) and a 
random string of 100 characters. I then called showString on it with and 
without this patch.
The original showString method (w/o this patch) took about 42ms and the 
patched one about 46ms, so the patch is about 10% slower.


---




[GitHub] spark issue #22325: [SPARK-25318]. Add exception handling when wrapping the ...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22325
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #22325: [SPARK-25318]. Add exception handling when wrappi...

2018-09-03 Thread rezasafi
GitHub user rezasafi opened a pull request:

https://github.com/apache/spark/pull/22325

[SPARK-25318]. Add exception handling when wrapping the input stream during 
the fetch or stage retry in response to a corrupted block

SPARK-4105 provided a solution to the block corruption issue by retrying the 
fetch or the stage. That solution includes a step that wraps the input 
stream with compression and/or encryption. This step is prone to exceptions, 
but the current code does not handle them, which has confused users. This 
change adds exception handling for the wrapping step and also retries the 
fetch if corruption is detected during the wrapping step.
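
As a rough sketch of the idea only, assuming a plain `GZIPInputStream` as a 
stand-in for the compression/encryption wrapping and a caller-supplied 
`fetch` function (`StreamWrapping`, `wrapWithRetry`, and `fetch` are 
hypothetical names, not the code in this patch):

```scala
import java.io.{IOException, InputStream}
import java.util.zip.GZIPInputStream
import scala.annotation.tailrec

object StreamWrapping {
  // Fetches a raw stream and wraps it. If the wrapping step itself throws
  // (for example because the header bytes are corrupted), the raw stream is
  // closed and the fetch is retried until the attempts are used up.
  @tailrec
  def wrapWithRetry(
      fetch: () => InputStream,
      attemptsLeft: Int): Either[IOException, InputStream] = {
    require(attemptsLeft >= 1, "need at least one attempt")
    val raw = fetch()
    val wrapped: Either[IOException, InputStream] =
      try Right(new GZIPInputStream(raw)) // reads and validates the header
      catch {
        case e: IOException =>
          raw.close()
          Left(e)
      }
    wrapped match {
      case Left(_) if attemptsLeft > 1 => wrapWithRetry(fetch, attemptsLeft - 1)
      case result => result
    }
  }
}
```

The point of the sketch is that an exception from the wrapping step is 
treated like any other corruption signal and reported or retried, rather 
than escaping unhandled.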


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rezasafi/spark localcorruption

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22325.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22325


commit cc1c4cdf2bd3b77326f831212c64ede338c807b1
Author: Reza Safi 
Date:   2018-09-04T03:06:33Z

[SPARK-25318]. Add exception handling when wrapping the input stream during 
the the fetch or stage retry in response to a corrupted block




---




[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...

2018-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22321
  
Mind fixing the PR title as well, since we fix other broken links too?


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22240
  
**[Test build #95648 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95648/testReport)**
 for PR 22240 at commit 
[`9b6a47b`](https://github.com/apache/spark/commit/9b6a47bf718309eb0b5a22a0282a5a7c4226e991).


---




[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22321
  
**[Test build #95647 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95647/testReport)**
 for PR 22321 at commit 
[`d9bbf3c`](https://github.com/apache/spark/commit/d9bbf3c4a7be82d66eb643a42c2724cd30ea1ad5).


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2814/
Test PASSed.


---




[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...

2018-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22324#discussion_r214776872
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceSuite.scala
 ---
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
+import org.apache.spark.sql.test.SharedSQLContext
+
+
+class FileSourceSuite extends SharedSQLContext {
+
+  test("SPARK-25237 compute correct input metrics in FileScanRDD") {
--- End diff --

Shall we move this suite into `FileBasedDataSourceSuite`?


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22324
  
We can credit multiple people now, though :-)


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214775155
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
 
 for {
   // Combines all convertible filters using `And` to produce a single 
conjunction
-  conjunction <- 
convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+  conjunction <- buildTree(convertibleFilters)
--- End diff --

For the first question, I don't think Parquet has the same issue, because 
Parquet uses `canMakeFilterOn`, while ORC tries to build a full result (with 
a fresh builder) to check whether it's okay or not.

For the second question, in ORC we already do the first half (`flatMap`) 
when computing `convertibleFilters`, although it could be rewritten with 
`filters.filter`.
```scala
val convertibleFilters = for {
  filter <- filters
  _ <- buildSearchArgument(dataTypeMap, filter, SearchArgumentFactory.newBuilder())
} yield filter
```

And the second half, `reduceOption` (`FilterApi.and` in Parquet, `And` in 
ORC), is what the original ORC code did; it generated a skewed tree with 
exponential time complexity. We need to use `buildTree`.
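
To illustrate the difference, here is a minimal sketch using simplified 
stand-ins for the filter classes. The `Filter`, `LessThan`, and `And` 
definitions below are hypothetical, not Spark's, and this `buildTree` may 
differ from the one in the PR.

```scala
object FilterTrees {
  // Hypothetical stand-ins for org.apache.spark.sql.sources.Filter and friends.
  sealed trait Filter
  case class LessThan(attribute: String, value: Int) extends Filter
  case class And(left: Filter, right: Filter) extends Filter

  // What reduceOption produces: a left-deep (skewed) tree whose depth grows
  // linearly with the number of filters.
  def skewed(filters: Seq[Filter]): Option[Filter] =
    filters.reduceOption(And(_, _))

  // Balanced alternative: split the filters in half recursively, so the
  // depth grows logarithmically instead.
  def buildTree(filters: Seq[Filter]): Option[Filter] = filters match {
    case Seq() => None
    case Seq(single) => Some(single)
    case _ =>
      val (left, right) = filters.splitAt(filters.length / 2)
      // Both halves are non-empty here, so .get is safe.
      Some(And(buildTree(left).get, buildTree(right).get))
  }

  // Number of And levels above the deepest leaf.
  def depth(filter: Filter): Int = filter match {
    case And(l, r) => 1 + math.max(depth(l), depth(r))
    case _ => 0
  }
}
```

Both functions combine the same filters with `And`; only the shape of the 
resulting tree differs, which is what matters when the consumer walks the 
tree recursively.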


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95642/
Test FAILed.


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22240
  
**[Test build #95642 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95642/testReport)**
 for PR 22240 at commit 
[`c61eec3`](https://github.com/apache/spark/commit/c61eec363f78d586070c673e44e9120eb10b83b5).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22179
  
**[Test build #95646 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95646/testReport)**
 for PR 22179 at commit 
[`0d78113`](https://github.com/apache/spark/commit/0d7811348e5746e1a7e1ce887d47ae4ba413c014).


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2813/
Test PASSed.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22324
  
**[Test build #95645 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95645/testReport)**
 for PR 22324 at commit 
[`510d729`](https://github.com/apache/spark/commit/510d729b0ed6f83b05a3b0f06c2631163d62ef1a).


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22324
  
@srowen I reworked this because the original author is inactive. Could you 
check? (BTW, it's OK for the credit of this commit to go to the original author.)


---




[GitHub] spark issue #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFileSize in...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22324
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2812/
Test PASSed.


---




[GitHub] spark pull request #22324: [SPARK-25237][SQL] Remove updateBytesReadWithFile...

2018-09-03 Thread maropu
GitHub user maropu opened a pull request:

https://github.com/apache/spark/pull/22324

[SPARK-25237][SQL] Remove updateBytesReadWithFileSize in FileScanRDD

## What changes were proposed in this pull request?
This PR removes the method `updateBytesReadWithFileSize` from `FileScanRDD` 
because it computes input metrics from the file size, an approach meant for 
Hadoop 2.5 and earlier. Spark no longer supports those versions, so the 
method causes wrong input metric numbers.

This is a rework of #22232.

Closes #22232

## How was this patch tested?
Added `FileSourceSuite` to test this case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/maropu/spark pr22232-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22324.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22324


commit 0f75257b50a611e069d406da8d72225bb4e73b51
Author: dujunling 
Date:   2018-08-25T06:20:35Z

remove updateBytesReadWithFileSize because we use Hadoop FileSystem 
statistics to update the inputMetrics

commit 53dd42c1facebf97044afb22b1f0894ec209f3bb
Author: dujunling 
Date:   2018-08-27T03:26:30Z

add ut

commit 1c326466fbd24c432184be6e53afec93369970c1
Author: dujunling 
Date:   2018-08-27T03:33:46Z

ut

commit 510d729b0ed6f83b05a3b0f06c2631163d62ef1a
Author: Takeshi Yamamuro 
Date:   2018-09-04T01:47:59Z

fix




---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214769029
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25306 createFilter should not hang") {
+import org.apache.spark.sql.sources._
+val schema = new StructType(Array(StructField("a", IntegerType, 
nullable = true)))
+val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
+failAfter(2 seconds) {
+  OrcFilters.createFilter(schema, filters)
--- End diff --

Sure. Something like the test code in the PR description? And marked as 
`ignore(...)` instead of `test(...)`?


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22315
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95638/
Test PASSed.


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22315
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22315
  
**[Test build #95638 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95638/testReport)**
 for PR 22315 at commit 
[`59ddb99`](https://github.com/apache/spark/commit/59ddb993790f4bb0ec920a2b0d897d8052c9f108).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22319
  
**[Test build #95644 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95644/testReport)**
 for PR 22319 at commit 
[`4791240`](https://github.com/apache/spark/commit/4791240d08c75d5df23332d0059a4b15197d289f).


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22319: [SPARK-25044][SQL][followup] add back UserDefinedFunctio...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22319
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2811/
Test PASSed.


---




[GitHub] spark issue #21860: [SPARK-24901][SQL]Merge the codegen of RegularHashMap an...

2018-09-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/21860
  
cc: @cloud-fan @hvanhovell 


---




[GitHub] spark issue #22315: [SPARK-25308][SQL] ArrayContains function may return a e...

2018-09-03 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22315
  
LGTM


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214765115
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
 
 for {
   // Combines all convertible filters using `And` to produce a single 
conjunction
-  conjunction <- 
convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+  conjunction <- buildTree(convertibleFilters)
--- End diff --

In parquet, this is done as
```
filters
  .flatMap(ParquetFilters.createFilter(requiredSchema, _))
  .reduceOption(FilterApi.and)
```

can we follow it?


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214765026
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
 ---
@@ -71,12 +71,24 @@ private[orc] object OrcFilters {
 
 for {
   // Combines all convertible filters using `And` to produce a single 
conjunction
-  conjunction <- 
convertibleFilters.reduceOption(org.apache.spark.sql.sources.And)
+  conjunction <- buildTree(convertibleFilters)
--- End diff --

Does Parquet have the same problem?


---




[GitHub] spark pull request #22313: [SPARK-25306][SQL] Avoid skewed filter trees to s...

2018-09-03 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22313#discussion_r214764993
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
 ---
@@ -383,4 +386,13 @@ class OrcFilterSuite extends OrcTest with 
SharedSQLContext {
   )).get.toString
 }
   }
+
+  test("SPARK-25306 createFilter should not hang") {
+import org.apache.spark.sql.sources._
+val schema = new StructType(Array(StructField("a", IntegerType, 
nullable = true)))
+val filters = (1 to 2000).map(LessThan("a", _)).toArray[Filter]
+failAfter(2 seconds) {
+  OrcFilters.createFilter(schema, filters)
--- End diff --

This test looks tricky... It's a bad practice to assume some code will 
return in a certain time. Can we just add a microbenchmark for it?
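
As a rough illustration of the microbenchmark idea, here is a sketch that 
reports timings instead of asserting on them. `MicroBench` and `bestTimeMs` 
are hypothetical names, not Spark's benchmark utilities.

```scala
object MicroBench {
  // Runs `body` `warmup` times unmeasured, then `iters` times measured, and
  // returns the best wall-clock time in milliseconds. The number is meant to
  // be printed and compared by a human, not asserted on in a unit test.
  def bestTimeMs(warmup: Int, iters: Int)(body: => Unit): Double = {
    require(iters >= 1, "need at least one measured iteration")
    (1 to warmup).foreach(_ => body)
    (1 to iters).map { _ =>
      val start = System.nanoTime()
      body
      (System.nanoTime() - start) / 1e6
    }.min
  }
}
```

A caller would wrap the code under test in the body, for example the 
`OrcFilters.createFilter(schema, filters)` call from the diff above, and 
print the result for a few filter counts.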


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95637/
Test PASSed.


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22313
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22313: [SPARK-25306][SQL] Avoid skewed filter trees to speed up...

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22313
  
**[Test build #95637 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95637/testReport)**
 for PR 22313 at commit 
[`4acbaf8`](https://github.com/apache/spark/commit/4acbaf8be9e572c5cdbc61c49b488e8aef9e646b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22321: [DOC] Update the 'Specifying the Hadoop Version' link in...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22321
  
Thank you for your first contribution, @kisimple . As @kiszk mentioned, 
could you fix those files, too?


---




[GitHub] spark issue #22204: [SPARK-25196][SQL] Analyze column statistics in cached q...

2018-09-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22204
  
ok, I'll do that.


---




[GitHub] spark issue #22204: [SPARK-25196][SQL] Analyze column statistics in cached q...

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22204
  
Thank you, @maropu . BTW, if this PR aims to provide an `ANALYZE` command 
interface to users, could you update the PR content and test cases for that?


---




[GitHub] spark issue #22218: [SPARK-25228][CORE]Add executor CPU time metric.

2018-09-03 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/22218
  
retest this please


---




[GitHub] spark pull request #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22179#discussion_r214762021
  
--- Diff: 
core/src/test/scala/org/apache/spark/serializer/KryoSerializerSuite.scala ---
@@ -412,6 +412,26 @@ class KryoSerializerSuite extends SparkFunSuite with 
SharedSparkContext {
 assert(!ser2.getAutoReset)
   }
 
+  test("ClassCastException when writing a Map after previously " +
--- End diff --

Since this is a bug fix test case, could you add `SPARK-25176` like 
`SPARK-25176 ClassCastException ...`?


---




[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22320#discussion_r214761843
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -69,7 +69,7 @@ case class InsertIntoHiveTable(
 query: LogicalPlan,
 overwrite: Boolean,
 ifPartitionNotExists: Boolean,
-outputColumns: Seq[Attribute]) extends SaveAsHiveFile {
+outputColumnNames: Seq[String]) extends SaveAsHiveFile {
--- End diff --

thanks!


---




[GitHub] spark pull request #22316: [SPARK-25048][SQL] Pivoting by multiple columns i...

2018-09-03 Thread maropu
Github user maropu commented on a diff in the pull request:

https://github.com/apache/spark/pull/22316#discussion_r214761811
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala ---
@@ -308,4 +308,27 @@ class DataFramePivotSuite extends QueryTest with 
SharedSQLContext {
 
 assert(exception.getMessage.contains("aggregate functions are not 
allowed"))
   }
+
+  test("pivoting column list with values") {
+val expected = Row(2012, 1.0, null) :: Row(2013, 48000.0, 3.0) 
:: Nil
+val df = trainingSales
+  .groupBy($"sales.year")
+  .pivot(struct(lower($"sales.course"), $"training"), Seq(
+struct(lit("dotnet"), lit("Experts")),
+struct(lit("java"), lit("Dummies")))
+  ).agg(sum($"sales.earnings"))
+
+checkAnswer(df, expected)
+  }
+
+  test("pivoting column list") {
+val exception = intercept[RuntimeException] {
+  trainingSales
+.groupBy($"sales.year")
+.pivot(struct(lower($"sales.course"), $"training"))
+.agg(sum($"sales.earnings"))
+.collect()
--- End diff --

I tried this in your branch:
```
scala> df.show
+--------+--------------------+
|training|               sales|
+--------+--------------------+
| Experts|[dotNET, 2012, 10...|
| Experts|[JAVA, 2012, 2000...|
| Dummies|[dotNet, 2012, 50...|
| Experts|[dotNET, 2013, 48...|
| Dummies|[Java, 2013, 3000...|
+--------+--------------------+

scala> df.groupBy($"sales.year").pivot(struct(lower($"sales.course"), $"training")).agg(sum($"sales.earnings"))
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema [dotnet,Dummies]
  at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
  at scala.util.Try.getOrElse(Try.scala:79)
  at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
  at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
Am I missing something?


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2810/
Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22179
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22179
  
**[Test build #95643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95643/testReport)** for PR 22179 at commit [`f2fb28d`](https://github.com/apache/spark/commit/f2fb28da3eb272651530b77dbd4ea33511f0727d).


---




[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

2018-09-03 Thread wangyum
Github user wangyum commented on the issue:

https://github.com/apache/spark/pull/22179
  
retest this please


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95641/
Test FAILed.


---




[GitHub] spark issue #22240: [SPARK-25248] [CORE] Audit barrier Scala APIs for 2.4

2018-09-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22240
  
Merged build finished. Test FAILed.


---



