date:20181128

[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22957
  
**[Test build #99375 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99375/testReport)**
 for PR 22957 at commit 
[`0491249`](https://github.com/apache/spark/commit/049124971eb9d0f84eb98cc33c4cccd65d7a42b7).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23031
  
**[Test build #99374 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99374/testReport)**
 for PR 23031 at commit 
[`4e7b6bb`](https://github.com/apache/spark/commit/4e7b6bb704f2aff1c091b73c08a167c4a6478534).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22957
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22957
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5451/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23086: [SPARK-25528][SQL] data source v2 API refactor (b...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23086#discussion_r237078844
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/TableProvider.java ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.sources.v2;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.sources.DataSourceRegister;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * The base interface for v2 data sources which don't have a real catalog. 
Implementations must
+ * have a public, 0-arg constructor.
+ *
+ * The major responsibility of this interface is to return a {@link Table} 
for read/write.
+ */
+@Evolving
+// TODO: do not extend `DataSourceV2`, after we finish the API refactor 
completely.
+public interface TableProvider extends DataSourceV2 {
+
+  /**
+   * Return a {@link Table} instance to do read/write with user-specified 
options.
+   *
+   * @param options the user-specified options that can identify a table, 
e.g. file path, Kafka
+   *topic name, etc. It's an immutable case-insensitive 
string-to-string map.
+   */
+  Table getTable(DataSourceOptions options);
+
+  /**
+   * Return a {@link Table} instance to do read/write with user-specified 
schema and options.
+   *
+   * By default this method throws {@link UnsupportedOperationException}, 
implementations should
--- End diff --

what I learned is that, we should only declare checked exceptions. See 
http://www.javapractices.com/topic/TopicAction.do?Id=171


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23162
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99365/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23162
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23162: [MINOR][DOC] Correct some document description errors

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23162
  
**[Test build #99365 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99365/testReport)**
 for PR 23162 at commit 
[`e9aba19`](https://github.com/apache/spark/commit/e9aba19b526610f3f31fa6a5b56140f6be8dc1c1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237076213
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
+case a: Alias => sameResult(a.child)
+case _ => this.semanticEquals(other)
--- End diff --

we can do
```
CleanupAliases.trimAliases(this) semanticEquals 
CleanupAliases.trimAliases(other)
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23031#discussion_r237075228
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLSuite.scala 
---
@@ -2715,4 +2715,11 @@ abstract class DDLSuite extends QueryTest with 
SQLTestUtils {
   }
 }
   }
+
+  test("set command rejects SparkConf entries") {
+val ex = intercept[AnalysisException] {
+  sql("SET spark.task.cpus = 4")
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23031#discussion_r237075170
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/RuntimeConfigSuite.scala ---
@@ -68,4 +68,13 @@ class RuntimeConfigSuite extends SparkFunSuite {
 assert(!conf.isModifiable(""))
 assert(!conf.isModifiable("invalid config parameter"))
   }
+
+  test("reject SparkConf entries") {
+val conf = newConf()
+
+val ex = intercept[AnalysisException] {
+  conf.set("spark.task.cpus", 4)
--- End diff --

can we use `config.CPUS_PER_TASK` instead of hardcoding it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237075431
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
+case a: Alias => sameResult(a.child)
+case _ => this.semanticEquals(other)
--- End diff --

well, it needs to be overridden by `HashPartitioning` too, so since I am 
not able to make it final anyway, I don't think it is a good idea. Well, I can 
add a match on `HashPartitioning`too, but it doesn't seem a clean solution to 
me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23031#discussion_r237074727
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala 
---
@@ -154,5 +154,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new 
SQLConf) {
 if (SQLConf.staticConfKeys.contains(key)) {
   throw new AnalysisException(s"Cannot modify the value of a static 
config: $key")
 }
+if (sqlConf.setCommandRejectsSparkConfs &&
+ConfigEntry.findEntry(key) != null && 
!SQLConf.sqlConfEntries.containsKey(key)) {
--- End diff --

we should only reject configs that are registered as SparkConf. Thinking 
about configs that are either a SparkConf or SQLConf, we shouldn't reject it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23031
  
**[Test build #99373 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99373/testReport)**
 for PR 23031 at commit 
[`1f703aa`](https://github.com/apache/spark/commit/1f703aaa6597d43654d4fd7d22feb7d2e4b73f9d).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23031
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5449/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23031: [SPARK-26060][SQL] Track SparkConf entries and make SET ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23031
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237073129
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
+case a: Alias => sameResult(a.child)
+case _ => this.semanticEquals(other)
--- End diff --

I think it is doable, but I didn't want to put too many `match` where it 
was not needed. But if you prefer that way I can try and do that.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23031: [SPARK-26060][SQL] Track SparkConf entries and ma...

2018-11-28 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/23031#discussion_r237071928
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala 
---
@@ -154,5 +154,9 @@ class RuntimeConfig private[sql](sqlConf: SQLConf = new 
SQLConf) {
 if (SQLConf.staticConfKeys.contains(key)) {
   throw new AnalysisException(s"Cannot modify the value of a static 
config: $key")
 }
+if (sqlConf.setCommandRejectsSparkConfs &&
+ConfigEntry.findEntry(key) != null && 
!SQLConf.sqlConfEntries.containsKey(key)) {
--- End diff --

I'm sorry for the delay.
As per the comment 
https://github.com/apache/spark/pull/22887#issuecomment-442425557, I'd leave it 
as is for now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237070739
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
+case a: Alias => sameResult(a.child)
+case _ => this.semanticEquals(other)
--- End diff --

can we also strip the alias of this here? so that we can mark `sameResult` 
as final.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237070496
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

Sure, thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237069486
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

can you put it in the method doc(both `semanticEquals` and `sameResult`)? 
This makes sense to me.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/23153
  
a late LGTM as well, thanks @cloud-fan for the patch and thanks 
@xuanyuanking for the review.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237068718
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

remove `Alias` is not possible for the reason explained in 
https://github.com/apache/spark/pull/22957#issuecomment-436992955. In general, 
`semanticEquals` should be used when we want to replace an expression with 
another, while `sameResult` should be used in order to check that 2 expressions 
return the same output.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23153: [SPARK-26147][SQL] only pull out unevaluable pyth...

2018-11-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23153


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23052
  
**[Test build #99372 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99372/testReport)**
 for PR 23052 at commit 
[`586ab31`](https://github.com/apache/spark/commit/586ab316ed2b9bce07a879dc89766dc854807c21).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23153
  
thanks, merging to master/2.4!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23164
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5448/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23164
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...

2018-11-28 Thread MaxGekk

Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23120#discussion_r237065584
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -33,26 +33,21 @@ class FailureSafeParser[IN](
   private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
   private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
   private val resultRow = new GenericInternalRow(schema.length)
-  private val nullResult = new GenericInternalRow(schema.length)
 
   // This function takes 2 parameters: an optional partial result, and the 
bad record. If the given
   // schema doesn't contain a field for corrupted record, we just return 
the partial result or a
   // row with all fields null. If the given schema contains a field for 
corrupted record, we will
   // set the bad record to this field, and set other fields according to 
the partial result or null.
   private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
-if (corruptFieldIndex.isDefined) {
-  (row, badRecord) => {
-var i = 0
-while (i < actualSchema.length) {
-  val from = actualSchema(i)
-  resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, 
from.dataType)).orNull
-  i += 1
-}
-resultRow(corruptFieldIndex.get) = badRecord()
-resultRow
+(row, badRecord) => {
--- End diff --

For now JSON does not support this. Need additional changes in 
JacksonParser to return partial results.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237065506
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

"erase the name" can also mean remove `Alias`. If we can't clearly tell the 
difference between `semanticEquals` and `sameResult`, and give a guideline 
about using which one in which case, I think we should just update 
`semanticEquals`(i.e. `Canonicalize`).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread MaxGekk

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/23052
  
Actually it needs similar changes like in 
https://github.com/apache/spark/pull/23130


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread xuanyuanking

Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/23128
  
Thanks @cloud-fan @gatorsmile @rxin !


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23164: [SPARK-26198][SQL] Fix Metadata serialize null values th...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23164
  
**[Test build #99371 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99371/testReport)**
 for PR 23164 at commit 
[`0310186`](https://github.com/apache/spark/commit/03101868a72b5ae68bf6324e627f1874af32f040).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23124
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99362/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23124
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23130#discussion_r237062653
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
@@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with 
SharedSQLContext with BeforeAndA
   assert(e.contains(s"Partition column `$unknown` not found in schema 
$schemaCatalog"))
 }
   }
+
+  test("skip empty files in non bucketed read") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+  Files.write(Paths.get(path, "empty"), Array.empty[Byte])
+  Files.write(Paths.get(path, "notEmpty"), 
"a".getBytes(StandardCharsets.UTF_8))
+  val readback = spark.read.option("wholetext", true).text(path)
+
+  assert(readback.rdd.getNumPartitions === 1)
--- End diff --

do you mean `wholetext` mode will force to create one partition per file?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23164: [SPARK-26198][SQL] Fix Metadata serialize null va...

2018-11-28 Thread wangyum

GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/23164

[SPARK-26198][SQL] Fix Metadata serialize null values throw NPE

## What changes were proposed in this pull request?
How to reproduce this issue:
```scala
scala> val meta = new 
org.apache.spark.sql.types.MetadataBuilder().putNull("key").build()
java.lang.NullPointerException
  at 
org.apache.spark.sql.types.Metadata$.org$apache$spark$sql$types$Metadata$$toJsonValue(Metadata.scala:196)
  at 
org.apache.spark.sql.types.Metadata$$anonfun$1.apply(Metadata.scala:180)
```

This pr fix `NullPointerException` when `Metadata` serialize `null` values.

## How was this patch tested?

unit tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-26198

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23164.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23164


commit 03101868a72b5ae68bf6324e627f1874af32f040
Author: Yuming Wang 
Date:   2018-11-28T12:22:09Z

Fix Metadata serialize null values throw NPE




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23083: [SPARK-26114][CORE] ExternalSorter's readingItera...

2018-11-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23083


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237062190
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

that is reasonable but it doesn't solve the problem stated in the JIRA. So 
the goal here is to avoid that something like `a as b` is considered different 
from `a` in terms of ordering/distribution. If we just erase the name of alias, 
the 2 expression would still be different because of the presence of `Alias` 
itself would make the 2 expressions different.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23124
  
**[Test build #99362 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99362/testReport)**
 for PR 23124 at commit 
[`6dff654`](https://github.com/apache/spark/commit/6dff6545f272e0d5117ac17fdc27b686573c5626).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23083
  
thanks, merging to master/2.4!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23128: [SPARK-26142][SQL] Implement shuffle read metrics...

2018-11-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23128


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23128
  
thanks , merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237059126
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will 
be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
+   */
+  protected def withCreateTempDir(f: File => Unit): Unit = {
--- End diff --

if we have this function in `SparkFunSuite`, why do we need to define it 
again in `SQLTestUtils`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23083
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99360/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23083
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23151: [SPARK-26180][CORE][TEST] Add a withCreateTempDir...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23151#discussion_r237058872
  
--- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala ---
@@ -105,5 +105,16 @@ abstract class SparkFunSuite
   logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n")
 }
   }
-
+  /**
+   * Creates a temporary directory, which is then passed to `f` and will 
be deleted after `f`
+   * returns.
+   *
+   * @todo Probably this method should be moved to a more general place
--- End diff --

I think this is the most general place we can find...


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23132: [SPARK-26163][SQL] Parsing decimals from JSON usi...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23132#discussion_r237058331
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -9,6 +9,8 @@ displayTitle: Spark SQL Upgrading Guide
 
 ## Upgrading From Spark SQL 2.4 to 3.0
 
+  - In Spark version 2.4 and earlier, accepted format of decimals parsed 
from JSON is an optional sign ('+' or '-'), followed by a sequence of zero or 
more decimal digits, optionally followed by a fraction, optionally followed by 
an exponent. Any commas were removed from the input before parsing. Since Spark 
3.0, format varies and depends on locale which can be set via JSON option 
`locale`. The default locale is `en-US`. To switch back to previous behavior, 
set `spark.sql.legacy.decimalParsing.enabled` to `true`.
--- End diff --

I have the same question. Do we need the `DecimalFormat` when locale is 
`en-US`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23083: [SPARK-26114][CORE] ExternalSorter's readingIterator fie...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23083
  
**[Test build #99360 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99360/testReport)**
 for PR 23083 at commit 
[`1723819`](https://github.com/apache/spark/commit/17238196719de1e68cbcb1eb930cb3176308e437).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22887: [SPARK-25880][CORE] user set's hadoop conf should not ov...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/22887
  
Spark SQL SET command can't update any static config or Spark core configs, 
but I think hadoop configs are different. It's not static as users can update 
it via `SparkContext.hadoopConfiguration`.  
`SparkSession.SessionState.newHadoopConf()` is a mechanism to allow users to 
set hadoop config per-session in Spark SQL.

So it's reasonble for users to expect that, if they set hadoop config via 
the SQL SET command, it should override the one in `spark-defaults.conf`.

Looking back at `appendS3AndSparkHadoopConfigurations`, it has 2 
parameters: spark conf and hadoop conf. The spark conf comes from 
`spark-defaults.conf` and any user provided configs when building the 
`SparkContext`. The user provided configs override `spark-defaults.conf`. The 
hadoop conf is either an empty config(if `appendS3AndSparkHadoopConfigurations` 
is called from `SparkHadoopUtil.newHadoopConfiguration`), or from 
`SparkSession.SessionState.newHadoopConf()`(if 
`appendS3AndSparkHadoopConfigurations` is called from `HadoopTableReader`).

For the first case, nothing we need to worry about. For the second case, I 
think the hadoop config should take priority, as it contains the configs 
specified by users at rutime.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22514
  
**[Test build #99370 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99370/testReport)**
 for PR 22514 at commit 
[`6e2a31b`](https://github.com/apache/spark/commit/6e2a31b0db71ab3dcc8a81cb5ca585f784793736).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5447/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22514: [SPARK-25271][SQL] Hive ctas commands should use ...

2018-11-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/22514#discussion_r237052582
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
 ---
@@ -45,6 +46,11 @@ case class CreateHiveTableAsSelectCommand(
 
   override def run(sparkSession: SparkSession, child: SparkPlan): Seq[Row] 
= {
--- End diff --

I created a Hive CTAS with data source command.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22514
  
**[Test build #99369 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99369/testReport)**
 for PR 22514 at commit 
[`796e964`](https://github.com/apache/spark/commit/796e9640726c6f5ed4be7dd356245ff121f4dea6).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5446/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22514: [SPARK-25271][SQL] Hive ctas commands should use data so...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22514
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99363/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23128
  
**[Test build #99363 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99363/testReport)**
 for PR 23128 at commit 
[`8e84c5b`](https://github.com/apache/spark/commit/8e84c5bbfc4b9151310bce84c1506c6aad449011).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23153
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23153
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99358/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

2018-11-28 Thread MaxGekk

Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/23130#discussion_r237050065
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
@@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with 
SharedSQLContext with BeforeAndA
   assert(e.contains(s"Partition column `$unknown` not found in schema 
$schemaCatalog"))
 }
   }
+
+  test("skip empty files in non bucketed read") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+  Files.write(Paths.get(path, "empty"), Array.empty[Byte])
+  Files.write(Paths.get(path, "notEmpty"), 
"a".getBytes(StandardCharsets.UTF_8))
+  val readback = spark.read.option("wholetext", true).text(path)
+
+  assert(readback.rdd.getNumPartitions === 1)
--- End diff --

> does this test fail without your change?

Yes, it does due to the `wholetext`. 

>  Is JSON the only data source that may return a row for empty file?

We depend on underlying parser here. I will check CSV and Text.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23153: [SPARK-26147][SQL] only pull out unevaluable python udf ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23153
  
**[Test build #99358 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99358/testReport)**
 for PR 23153 at commit 
[`7b985d8`](https://github.com/apache/spark/commit/7b985d84cb0fd853d40610b2380313389791298e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #22957: [SPARK-25951][SQL] Ignore aliases for distributio...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/22957#discussion_r237049166
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala
 ---
@@ -195,14 +195,26 @@ abstract class Expression extends 
TreeNode[Expression] {
   }
 
   /**
-   * Returns true when two expressions will always compute the same 
result, even if they differ
+   * Returns true when two expressions will always compute the same 
output, even if they differ
* cosmetically (i.e. capitalization of names in attributes may be 
different).
*
* See [[Canonicalize]] for more details.
*/
   def semanticEquals(other: Expression): Boolean =
 deterministic && other.deterministic && canonicalized == 
other.canonicalized
 
+  /**
+   * Returns true when two expressions will always compute the same 
result, even if the output may
+   * be different, because of different names or similar differences.
+   * Usually this means they their canonicalized form equals, but it may 
also not be the case, as
+   * different output expressions can evaluate to the same result as well 
(eg. when an expression
+   * is aliased).
+   */
+  def sameResult(other: Expression): Boolean = other match {
--- End diff --

I know it's always safer to introduce a new API, does is it really 
necessary? In `Canonicalize`, we erase the name for attributes, I think it's 
reasonable to erase the name of `Alias`, as it doesn't affect the output.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23128
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99359/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23128: [SPARK-26142][SQL] Implement shuffle read metrics in SQL

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23128
  
**[Test build #99359 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99359/testReport)**
 for PR 23128 at commit 
[`d12ea31`](https://github.com/apache/spark/commit/d12ea311e58e7925f21d343e5de13bfec6737549).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23120#discussion_r237046616
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/FailureSafeParser.scala
 ---
@@ -33,26 +33,21 @@ class FailureSafeParser[IN](
   private val corruptFieldIndex = 
schema.getFieldIndex(columnNameOfCorruptRecord)
   private val actualSchema = StructType(schema.filterNot(_.name == 
columnNameOfCorruptRecord))
   private val resultRow = new GenericInternalRow(schema.length)
-  private val nullResult = new GenericInternalRow(schema.length)
 
   // This function takes 2 parameters: an optional partial result, and the 
bad record. If the given
   // schema doesn't contain a field for corrupted record, we just return 
the partial result or a
   // row with all fields null. If the given schema contains a field for 
corrupted record, we will
   // set the bad record to this field, and set other fields according to 
the partial result or null.
   private val toResultRow: (Option[InternalRow], () => UTF8String) => 
InternalRow = {
-if (corruptFieldIndex.isDefined) {
-  (row, badRecord) => {
-var i = 0
-while (i < actualSchema.length) {
-  val from = actualSchema(i)
-  resultRow(schema.fieldIndex(from.name)) = row.map(_.get(i, 
from.dataType)).orNull
-  i += 1
-}
-resultRow(corruptFieldIndex.get) = badRecord()
-resultRow
+(row, badRecord) => {
--- End diff --

without this change in `FailureSafeParser`, does JSON support returning 
partial result?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23120: [SPARK-26151][SQL] Return partial results for bad...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23120#discussion_r237046251
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala
 ---
@@ -243,21 +243,27 @@ class UnivocityParser(
 () => getPartialResult(),
 new RuntimeException("Malformed CSV record"))
 } else {
-  try {
-// When the length of the returned tokens is identical to the 
length of the parsed schema,
-// we just need to convert the tokens that correspond to the 
required columns.
-var i = 0
-while (i < requiredSchema.length) {
+  // When the length of the returned tokens is identical to the length 
of the parsed schema,
+  // we just need to convert the tokens that correspond to the 
required columns.
+  var badRecordException: Option[Throwable] = None
+  var i = 0
+  while (i < requiredSchema.length) {
--- End diff --

shall we stop parsing when we hit the first exception?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23130: [SPARK-26161][SQL] Ignore empty files in load

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/23130#discussion_r237045706
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/sources/SaveLoadSuite.scala ---
@@ -142,4 +144,15 @@ class SaveLoadSuite extends DataSourceTest with 
SharedSQLContext with BeforeAndA
   assert(e.contains(s"Partition column `$unknown` not found in schema 
$schemaCatalog"))
 }
   }
+
+  test("skip empty files in non bucketed read") {
+withTempDir { dir =>
+  val path = dir.getCanonicalPath
+  Files.write(Paths.get(path, "empty"), Array.empty[Byte])
+  Files.write(Paths.get(path, "notEmpty"), 
"a".getBytes(StandardCharsets.UTF_8))
+  val readback = spark.read.option("wholetext", true).text(path)
+
+  assert(readback.rdd.getNumPartitions === 1)
--- End diff --

does this test fail without your change? IIUC one partition can read 
multiple files. Is JSON the only data source that may return a row for empty 
file?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23124
  
**[Test build #99368 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99368/testReport)**
 for PR 23124 at commit 
[`72c771e`](https://github.com/apache/spark/commit/72c771e691ae1071742bde5a612593d38f147ff5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23124
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23124: [SPARK-25829][SQL] remove duplicated map keys with last ...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23124
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5445/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #21004: [SPARK-23896][SQL]Improve PartitioningAwareFileIn...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21004#discussion_r237040743
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala
 ---
@@ -126,35 +126,32 @@ abstract class PartitioningAwareFileIndex(
 val caseInsensitiveOptions = CaseInsensitiveMap(parameters)
 val timeZoneId = 
caseInsensitiveOptions.get(DateTimeUtils.TIMEZONE_OPTION)
   .getOrElse(sparkSession.sessionState.conf.sessionLocalTimeZone)
-
-userPartitionSchema match {
+val inferredPartitionSpec = PartitioningUtils.parsePartitions(
+  leafDirs,
+  typeInference = 
sparkSession.sessionState.conf.partitionColumnTypeInferenceEnabled,
--- End diff --

Before this patch, there was a subtle difference between with and without a 
user-provided partition schema:
1. with user-provided partition schema, we should not infer data types. We 
should infer as string and cast to user-provided type
2. without user-provided partition schema, we should infer the data 
type(with a config)

So it was wrong to unify these 2 code paths. @gengliangwang can you change 
it back?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread MaxGekk

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/23052
  
> seems like a real failure

I am looking at it. It seems the test is not deterministic. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread cloud-fan

Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/23052
  
seems like a real failure


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/99361/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23052
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23052: [SPARK-26081][SQL] Prevent empty files for empty partiti...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23052
  
**[Test build #99361 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99361/testReport)**
 for PR 23052 at commit 
[`76e1466`](https://github.com/apache/spark/commit/76e1466a39aa2a40d999791bb9d3b09628921e85).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23055: [SPARK-26080][PYTHON] Disable 'spark.executor.pys...

2018-11-28 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/23055#discussion_r237021410
  
--- Diff: python/pyspark/worker.py ---
@@ -22,7 +22,12 @@
 import os
 import sys
 import time
-import resource
+# 'resource' is a Unix specific module.
+has_resource_module = True
+try:
+import resource
+except ImportError:
+has_resource_module = False
--- End diff --

@vanzin, I'm going to remove this flag. Does it then look okay to you?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23163
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread c21

Github user c21 commented on the issue:

https://github.com/apache/spark/pull/23163
  
cc people who have most context for review - @cloud-fan, @tejasapatil and 
@sameeragarwal. Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23163
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23100
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23100
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5444/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23163: [SPARK-26164][SQL] Allow FileFormatWriter to write multi...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23163
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23163: [SPARK-26164][SQL] Allow FileFormatWriter to writ...

2018-11-28 Thread c21

GitHub user c21 opened a pull request:

https://github.com/apache/spark/pull/23163

[SPARK-26164][SQL] Allow FileFormatWriter to write multiple 
partitions/buckets without sort

## What changes were proposed in this pull request?

Currently spark always requires a local sort before writing to output table 
on partition/bucket columns (see `write.requiredOrdering` in 
`FileFormatWriter.scala`), which is unnecessary, and can be avoided by keeping 
multiple output writers concurrently in `FileFormatDataWriter.scala`.

This pr is first doing hash-based write, then falling back to sort-based 
write (current implementation) when number of opened writer exceeding a 
threshold (controlled by a config). Specifically:

1. (hash-based write) Maintain mapping between file path and output writer, 
and re-use writer for writing input row. In case of the number of opened output 
writers exceeding a threshold (can be changed by a config), we go to 2.

2. (sort-based write) Sort the rest of input rows (use the same sorter in 
SortExec). Then writing the rest of sorted rows, and we can close the writer on 
the fly, in case no more rows for current file path.

## How was this patch tested?

Added unit test in `DataFrameReaderWriterSuite.scala`. Existing test like 
`SQLMetricsSuite.scala` would already exercise the code path of executor write 
metrics.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/c21/spark more-writers

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23163.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23163


commit c2e81eb2f9cdc1b290c098d228d477f325a24101
Author: Cheng Su 
Date:   2018-11-28T09:46:35Z

Allow FileFormatWriter to write multiple partitions/buckets without sort




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread viirya

Github user viirya commented on the issue:

https://github.com/apache/spark/pull/23100
  
Except for changing `Since` tag, is there any other comments?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23120: [SPARK-26151][SQL] Return partial results for bad CSV re...

2018-11-28 Thread MaxGekk

Github user MaxGekk commented on the issue:

https://github.com/apache/spark/pull/23120
  
@cloud-fan May I ask you to take a look at the PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23100
  
**[Test build #99367 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99367/testReport)**
 for PR 23100 at commit 
[`a451756`](https://github.com/apache/spark/commit/a451756f393c107963b893ba4a74c1d6ade33dd0).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder...

2018-11-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/23100#discussion_r237010801
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala ---
@@ -17,126 +17,512 @@
 
 package org.apache.spark.ml.feature
 
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
 import org.apache.spark.annotation.Since
-import org.apache.spark.ml.Transformer
+import org.apache.spark.ml.{Estimator, Model}
 import org.apache.spark.ml.attribute._
 import org.apache.spark.ml.linalg.Vectors
 import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasHandleInvalid, HasInputCols, 
HasOutputCols}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
-import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{DoubleType, NumericType, StructType}
+import org.apache.spark.sql.expressions.UserDefinedFunction
+import org.apache.spark.sql.functions.{col, lit, udf}
+import org.apache.spark.sql.types.{DoubleType, StructField, StructType}
+
+/** Private trait for params and common methods for OneHotEncoder and 
OneHotEncoderModel */
+private[ml] trait OneHotEncoderBase extends Params with HasHandleInvalid
+with HasInputCols with HasOutputCols {
+
+  /**
+   * Param for how to handle invalid data during transform().
+   * Options are 'keep' (invalid data presented as an extra categorical 
feature) or
+   * 'error' (throw an error).
+   * Note that this Param is only used during transform; during fitting, 
invalid data
+   * will result in an error.
+   * Default: "error"
+   * @group param
+   */
+  @Since("2.3.0")
--- End diff --

I changed since tag of renamed `OneHotEncoder` to `3.0.0`.

Because this `OneHotEncoderBase` is not renamed, I didn't change its since 
tag.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23100
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23100
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5443/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23100: [SPARK-26133][ML] Remove deprecated OneHotEncoder and re...

2018-11-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/23100
  
**[Test build #99366 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/99366/testReport)**
 for PR 23100 at commit 
[`9e3dab7`](https://github.com/apache/spark/commit/9e3dab7388a6de18b7cf8ddcc8cc2c73a6efea67).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22947: [SPARK-24913][SQL] Make AssertNotNull and AssertTrue non...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22947
  
any more comments on this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #22957: [SPARK-25951][SQL] Ignore aliases for distributions and ...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/22957
  
cc @cloud-fan @gatorsmile 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23057: [SPARK-26078][SQL] Dedup self-join attributes on IN subq...

2018-11-28 Thread mgaido91

Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/23057
  
cc @gatorsmile too


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23155: [MINOR][K8S] add missing docs for podTemplateContainerNa...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23155
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #23155: [MINOR][K8S] add missing docs for podTemplateContainerNa...

2018-11-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/23155
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5441/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

< 2 3 4 5 6 7 8 >

601 - 700 of 769 matches

Mail list logo