[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154577760

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Instead of using `var`, you can use a pattern match.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
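A minimal sketch of what the reviewer's suggestion could look like: instead of reassigning a `var` after the backward-compatibility lookup, the resolved name falls out of a single pattern match. The function name, the `useNewOrc` flag, and the hard-coded class names below are illustrative assumptions, not the PR's actual code.

```scala
// Hypothetical sketch of resolving a provider name with a pattern match
// instead of a reassigned `var`. Names are illustrative, not Spark's code.
object ProviderResolution {
  def resolveProvider(
      backwardCompatibilityMap: Map[String, String],
      useNewOrc: Boolean,
      provider: String): String = {
    backwardCompatibilityMap.getOrElse(provider, provider) match {
      // "orc" dispatches to one of two implementations based on the conf flag.
      case "orc" if useNewOrc =>
        "org.apache.spark.sql.execution.datasources.orc.OrcFileFormat"
      case "orc" =>
        "org.apache.spark.sql.hive.orc.OrcFileFormat"
      // Everything else passes through unchanged.
      case other => other
    }
  }
}
```

The match keeps every binding a `val`, which is what the reviewer was asking for.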
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869

**[Test build #84419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84419/testReport)** for PR 19869 at commit [`0589e7d`](https://github.com/apache/spark/commit/0589e7d8b57aa71c72dd052f687a4706fb0c5567).
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84413 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84413/testReport)** for PR 19871 at commit [`2e498f9`](https://github.com/apache/spark/commit/2e498f9c1a748bdb648705c16305f9f34e638ff8).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84411 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84411/testReport)** for PR 19871 at commit [`e7beb02`](https://github.com/apache/spark/commit/e7beb02ef8b1f9761c77952fc69d087c7cc92d3f).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874

**[Test build #84418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84418/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait TypePropagation extends Logging`
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84411/ Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84412/ Test FAILed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84417/ Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84418/ Test FAILed.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84413/ Test FAILed.
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Merged build finished. Test FAILed.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84412 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84412/testReport)** for PR 19871 at commit [`8b7e88a`](https://github.com/apache/spark/commit/8b7e88a833adf1d3138ce426142b6bb6abe057df).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84417 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84417/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7).

* This patch **fails due to an unknown error code, -9**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode`
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19873 retest this please.
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873

**[Test build #84420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84420/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7).
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840

@yaooqinn OK, I see the situation.

In client mode, I think we can't use `spark.yarn.appMasterEnv.XXX`, which is for cluster mode, so we should use the environment variables `PYSPARK_PYTHON` or `PYSPARK_DRIVER_PYTHON`, or the corresponding Spark confs, `spark.pyspark.python` and `spark.pyspark.driver.python`. In cluster mode, we can use `spark.yarn.appMasterEnv.XXX`, and if `spark.yarn.appMasterEnv.PYSPARK_PYTHON` or `spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON` exist, they overwrite the original environment variables.

Btw, `PYSPARK_DRIVER_PYTHON` is only for the driver, not the executors, so we should handle only `PYSPARK_PYTHON` on the executor, and on the driver `PYSPARK_DRIVER_PYTHON` takes priority over `PYSPARK_PYTHON`. Currently we handle only the environment variable but not `spark.executorEnv.PYSPARK_PYTHON` for the executor, so we should handle it in `api/python/PythonRunner` as you do now, or at [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191).
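The precedence ueshin describes can be sketched as a small resolution function. This is an illustrative model of the discussion, not Spark's actual implementation: the function name `resolvePythonExec`, the `Map`-based conf/env inputs, and the final `"python"` fallback are assumptions made for the sketch.

```scala
// Hypothetical sketch of the python-executable precedence described above.
// Driver: PYSPARK_DRIVER_PYTHON (and its conf) outranks PYSPARK_PYTHON.
// Executor: only PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON apply.
object PythonExecResolution {
  def resolvePythonExec(
      sparkConf: Map[String, String],
      env: Map[String, String],
      isDriver: Boolean): String = {
    val candidates =
      if (isDriver) {
        Seq(
          sparkConf.get("spark.pyspark.driver.python"),
          env.get("PYSPARK_DRIVER_PYTHON"),
          sparkConf.get("spark.pyspark.python"),
          env.get("PYSPARK_PYTHON"))
      } else {
        Seq(
          sparkConf.get("spark.executorEnv.PYSPARK_PYTHON"),
          env.get("PYSPARK_PYTHON"))
      }
    // First defined candidate wins; fall back to plain "python".
    candidates.flatten.headOption.getOrElse("python")
  }
}
```

The exact fallback and the treatment of `spark.yarn.appMasterEnv.XXX` overwriting the environment are left out; the point is only the relative priority of the driver- and executor-side settings.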
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/19869 LGTM
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580007

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Also add the maps for the new ORC format to `backwardCompatibilityMap`.
[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19869#discussion_r154580311

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala ---
@@ -596,7 +596,7 @@ case class HashAggregateExec(
     ctx.addMutableState(fastHashMapClassName, fastHashMapTerm,
       s"$fastHashMapTerm = new $fastHashMapClassName();")
     ctx.addMutableState(
-      classOf[java.util.Iterator[ColumnarRow]].getName,
+      s"java.util.Iterator<${classOf[ColumnarRow]}>",
--- End diff --

```scala
scala> s"java.util.Iterator<${classOf[ColumnarRow]}>"
res2: String = java.util.Iterator<class org.apache.spark.sql.execution.vectorized.ColumnarRow>
```
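The pitfall viirya's REPL snippet points at is general: interpolating a `Class` object calls its `toString`, which prepends `"class "` to the name, whereas `getName` yields just the fully qualified name, which is what generated Java source needs. A self-contained sketch using `java.lang.String` in place of Spark's `ColumnarRow`:

```scala
// Interpolating a Class object calls toString, which prefixes "class ",
// producing a string that is not valid Java source:
val interpolated = s"java.util.Iterator<${classOf[String]}>"
// "java.util.Iterator<class java.lang.String>"

// getName returns only the fully qualified name:
val withGetName = s"java.util.Iterator<${classOf[String].getName}>"
// "java.util.Iterator<java.lang.String>"
```

So when building a parameterized type name for codegen, the element class must be rendered with `getName` (or a helper that does so), never interpolated directly.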
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580323

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Move it to `OrcQuerySuite`.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154580498

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
           .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
       } else {
         val options = relation.tableMeta.storage.properties
-        sessionCatalog.metastoreCatalog
-          .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+        if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+            "orc")
+        } else {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

indents.
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840

I can see `spark.executorEnv.PYSPARK_PYTHON` in `sparkConf` at the executor side, because it is set at [context.py#L156](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L156) by [conf.py#L153](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/conf.py#L153).
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154581731

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala ---
@@ -115,9 +120,35 @@ abstract class Expression extends TreeNode[Expression] {
     }
   }

+  /**
+   * Returns the input variables to this expression.
+   */
+  private def findInputVars(ctx: CodegenContext, eval: ExprCode): Seq[ExprInputVar] = {
+    if (ctx.currentVars != null) {
+      val boundRefs = this.collect {
+        case b @ BoundReference(ordinal, _, _) if ctx.currentVars(ordinal) != null => (ordinal, b)
--- End diff --

It is not necessarily empty. We will also record it if its `code` is not empty.
[GitHub] spark pull request #19862: [WIP][SPARK-22671][SQL] Make SortMergeJoin shuffl...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19862#discussion_r154581844

--- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
@@ -159,6 +154,12 @@ public boolean hasNext() {

   @Override
   public UnsafeRow next() {
     try {
+      if (!alreadyCalculated) {
--- End diff --

Yes, there are mistakes here; I will change it.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154581864

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
+
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * A small helper function to return `ExprCode`s that represent subexpressions.
+   */
+  def getSubExprCodes(ctx: CodegenContext, subExprs: Seq[Expression]): Seq[ExprCode] = {
+    subExprs.map { subExpr =>
+      val stat = ctx.subExprEliminationExprs(subExpr)
--- End diff --

It is for state. Yes, `state` is better and less confusing.
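The `paramsLength > 255` guard in the quoted `getExpressionInputParams` exists because a JVM method descriptor allows at most 255 parameter slots, where `long` and `double` each occupy two slots. A small standalone sketch of that slot accounting (the helper names here are illustrative, not Spark's):

```scala
// Hypothetical sketch of JVM parameter-slot counting behind the 255-param
// guard: long and double take two slots each, everything else one.
object ParamSlots {
  def paramSlots(paramTypes: Seq[String]): Int =
    paramTypes.map {
      case "long" | "double" => 2
      case _                 => 1
    }.sum

  // Splitting codegen into a separate function is only viable while the
  // parameter list fits within the method-descriptor limit.
  def canSplit(paramTypes: Seq[String]): Boolean = paramSlots(paramTypes) <= 255
}
```

When the limit would be exceeded, the PR's approach is to return `None` and skip splitting rather than emit an uncompilable method.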
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840 @yaooqinn I meant it is not used for `pythonExec`.
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154582383

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
+
+    val inputRows = ctx.INPUT_ROW +: getInputRowsForChildren(ctx, expr)
+    val paramsFromRows = inputRows.distinct.filter(_ != null).map { row =>
+      (row, s"InternalRow $row")
+    }
+
+    val paramsLength = getParamLength(ctx, inputAttrs, subExprs) + paramsFromRows.length
+    // Maximum allowed parameter number for Java's method descriptor.
+    if (paramsLength > 255) {
+      None
+    } else {
+      val allParams = (paramsFromRows ++ paramsFromColumns ++ paramsFromSubExprs).unzip
+      val callParams = allParams._1.distinct
+      val declParams = allParams._2.distinct
+      Some((callParams, declParams))
+    }
+  }
+
+  /**
+   * Returns the eliminated subexpressions in the children expressions.
+   */
+  def getSubExprInChildren(ctx: CodegenContext, expr: Expression): Seq[Expression] = {
+    expr.children.flatMap { child =>
+      child.collect {
+        case e if ctx.subExprEliminationExprs.contains(e) => e
+      }
+    }.distinct
+  }
+
+  /**
+   * A small helper function to return `ExprCode`s that represent subexpressions.
+   */
+  def getSubExprCodes(ctx: CodegenContext, subExprs: Seq[Expression]): Seq[ExprCode] = {
+    subExprs.map { subExpr =>
+      val stat = ctx.subExprEliminationExprs(subExpr)
+      ExprCode(code = "", value = stat.value, isNull = stat.isNull)
+    }
+  }
+
+  /**
+   * Retrieves previous input rows referred by children and deferred expressions.
+   */
+  def getInputRowsForChildren(ctx: CodegenContext, expr: Expression): Seq[String] = {
+    expr.children.flatMap(getInputRows(ctx, _)).distinct
+  }
+
+  /**
+   * Given a child expression, retrieves previous input rows referred by it or deferred expressions
+   * which are needed to evaluate it.
+   */
+  def getInputRows(ctx: CodegenContext, child: Expression): Seq[String] = {
+    child.flatMap {
+      // An expression directly evaluates on current input row.
+      case BoundReference(ordinal, _, _) if ctx.currentVars == null ||
+          ctx.currentVars(ordinal) == null =>
+        Seq(ctx.INPUT_ROW)
+
+      // An expression which is not
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154582497

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to call the function.
+   * The 2nd string in returned tuple is the parameter strings used to declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred bu children expressions.
+   */
+  def getExpressionInputParams(
+      ctx: CodegenContext,
+      expr: Expression): Option[(Seq[String], Seq[String])] = {
+    val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+    val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, inputVars)
+
+    val subExprs = getSubExprInChildren(ctx, expr)
+    val subExprCodes = getSubExprCodes(ctx, subExprs)
+    val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, subExprCodes)
--- End diff --

I think so. The eliminated subexpressions will be extracted as parameters too.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582670

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
@@ -568,8 +571,11 @@ object DataSource extends Logging {
     "org.apache.spark.Logging")

   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
+    var provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
--- End diff --

Thanks. Sure.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154582712

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala ---
@@ -2153,4 +2153,21 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
     }
   }
 }
+
+  test("SPARK-20728 Make ORCFileFormat configurable between sql/hive and sql/core") {
--- End diff --

Yep.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/19871#discussion_r154583136

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala ---
@@ -195,8 +195,18 @@ case class RelationConversions(
           .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
       } else {
         val options = relation.tableMeta.storage.properties
-        sessionCatalog.metastoreCatalog
-          .convertToLogicalRelation(relation, options, classOf[OrcFileFormat], "orc")
+        if (conf.getConf(SQLConf.ORC_USE_NEW_VERSION)) {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
+            "orc")
+        } else {
+          sessionCatalog.metastoreCatalog.convertToLogicalRelation(
+            relation,
+            options,
+            classOf[org.apache.spark.sql.hive.orc.OrcFileFormat], "orc")
--- End diff --

Done.
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871

**[Test build #84421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84421/testReport)** for PR 19871 at commit [`5474a07`](https://github.com/apache/spark/commit/5474a070ada64be7467a64fcd849064b063e2e42).
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19869 **[Test build #84419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84419/testReport)** for PR 19869 at commit [`0589e7d`](https://github.com/apache/spark/commit/0589e7d8b57aa71c72dd052f687a4706fb0c5567). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84419/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19869 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin I see. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user zuotingbing commented on the issue: https://github.com/apache/spark/pull/19841 Please review it again. Thanks, all. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/19813#discussion_r154588045 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala --- @@ -108,7 +108,10 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport { |}""".stripMargin) ctx.currentVars = null +// `rowIdx` isn't in `ctx.currentVars`. If the expressions are split later, we can't track it. +// So making it as global variable. --- End diff -- I think it works, although it feels a bit hacky. Like:

```scala
val rowidx = ctx.freshName("rowIdx")
val rowidxExpr = AttributeReference("rowIdx", IntegerType, nullable = false)()
val columnsBatchInput = (output zip colVars).map { case (attr, colVar) =>
  val exprCode = genCodeColumnVector(ctx, colVar, rowidx, attr.dataType, attr.nullable)
  exprCode.inputVars = Seq(ExprInputVar(rowidxExpr, ExprCode("", isNull = "false", value = rowidx)))
  exprCode
}
```

This just adds one global variable. I think it is not a big problem? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
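The underlying concern, that generated code split into separate methods can no longer see a local loop variable, can be illustrated outside Spark. This is a toy plain-Python sketch with made-up names, not Spark's codegen:

```python
# Minimal sketch (not Spark code): once the consumers of `row_idx` are
# split into separate methods, they cannot reference a local variable of
# the loop, so the value is promoted to an instance-level ("global") field.

class SplitCodegen:
    def __init__(self):
        self.row_idx = None  # promoted out of the loop's local scope

    def consume_a(self):
        # A split-out helper; it can only see `row_idx` via `self`.
        return self.row_idx * 2

    def consume_b(self):
        return self.row_idx + 1

    def process(self, rows):
        out = []
        for i, _ in enumerate(rows):
            self.row_idx = i  # set once per row, visible to all helpers
            out.append((self.consume_a(), self.consume_b()))
        return out

print(SplitCodegen().process(["a", "b"]))  # [(0, 1), (2, 2)]
```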
[GitHub] spark issue #19860: [SPARK-22669][SQL] Avoid unnecessary function calls in c...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19860 Thanks for your work! A late LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84422/testReport)** for PR 19813 at commit [`429afba`](https://github.com/apache/spark/commit/429afbabef6f718870ca3c6caf0712a1e459681f). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/19841 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19841 **[Test build #84423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84423/testReport)** for PR 19841 at commit [`515b00a`](https://github.com/apache/spark/commit/515b00abf2d214d2a253ef3ccb213df670d64fd4). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user yaooqinn commented on the issue: https://github.com/apache/spark/pull/19840 @ueshin [context.py#L191](https://github.com/yaooqinn/spark/blob/8ff5663fe9a32eae79c8ee6bc310409170a8da64/python/pyspark/context.py#L191): is this set for both driver and executor? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/19874 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874 **[Test build #84424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84424/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154598540 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 + --- End diff -- So here you only test the first line ? Why not use df.collect() to test every line ? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154594381 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) --- End diff -- Why not import this ? `java.nio.file.Files` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154594735 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 --- End diff -- Use `===` instead of `==` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/19840 @yaooqinn It is used for executors. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154605874 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 --- End diff -- I'd like to ask why actually. I had some discussion about this and ended up without a conclusion. The doc says `===` is preferred but the actual error messages are even clearer with `==` sometimes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154606263 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) --- End diff -- To differenciate with google's `Files` explicitly above. Not a big deal. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18581: [SPARK-21289][SQL][ML] Supports custom line separ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/18581#discussion_r154606499 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/libsvm/LibSVMRelationSuite.scala --- @@ -184,4 +184,54 @@ class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { spark.sql("DROP TABLE IF EXISTS libsvmTable") } } + + def testLineSeparator(lineSep: String): Unit = { +test(s"SPARK-21289: Support line separator - lineSep: '$lineSep'") { + val data = Seq( +"1.0 1:1.0 3:2.0 5:3.0", "0.0", "0.0", "0.0 2:4.0 4:5.0 6:6.0").mkString(lineSep) + val dataWithTrailingLineSep = s"$data$lineSep" + + Seq(data, dataWithTrailingLineSep).foreach { lines => +val path0 = new File(tempDir.getCanonicalPath, "write0") +val path1 = new File(tempDir.getCanonicalPath, "write1") +try { + // Read + java.nio.file.Files.write(path0.toPath, lines.getBytes(StandardCharsets.UTF_8)) + val df = spark.read +.option("lineSep", lineSep) +.format("libsvm") +.load(path0.getAbsolutePath) + + assert(df.columns(0) == "label") + assert(df.columns(1) == "features") + val row1 = df.first() + assert(row1.getDouble(0) == 1.0) + val v = row1.getAs[SparseVector](1) + assert(v == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 + --- End diff -- Why not just test the first line? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19873 **[Test build #84420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84420/testReport)** for PR 19873 at commit [`136fd30`](https://github.com/apache/spark/commit/136fd30cb98609e648a8b689a28853c1bae67bf7). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class AnalysisBarrier(child: LogicalPlan) extends LeafNode ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84420/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19873 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user gberger commented on the issue: https://github.com/apache/spark/pull/19792 Friendly ping -- I've fixed that @ueshin. Is there anything else I should look at to get this to be merged? /cc @HyukjinKwon --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154615728 --- Diff: python/pyspark/sql/udf.py --- @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType): return udf_obj._wrapped() +class UDFColumn(Column): --- End diff -- Why did we add this new sub-class? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/19872#discussion_r154616454 --- Diff: python/pyspark/sql/functions.py --- @@ -2070,6 +2070,8 @@ class PandasUDFType(object): GROUP_MAP = PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF +GROUP_AGG = PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF --- End diff -- So I'm worried that it isn't clear to the user that this will result in a full-shuffle with no-partial aggregation. Is there maybe a place we can document this warning? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
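The concern about a full shuffle with no partial aggregation can be made concrete: without map-side combining, every input row crosses the shuffle boundary, whereas a combiner ships only one partial result per key per partition. This is a rough plain-Python sketch of the two strategies, not the PySpark API:

```python
from collections import defaultdict

def sum_by_key_full_shuffle(partitions):
    """No partial aggregation: every row is shuffled to the reducer."""
    shuffled = [row for part in partitions for row in part]
    agg = defaultdict(int)
    for key, value in shuffled:
        agg[key] += value
    return dict(agg), len(shuffled)  # result plus rows shuffled

def sum_by_key_partial(partitions):
    """Map-side combine: one record per key per partition is shuffled."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    agg = defaultdict(int)
    for key, value in shuffled:
        agg[key] += value
    return dict(agg), len(shuffled)

parts = [[("a", 1), ("a", 2), ("b", 3)], [("a", 4), ("b", 5), ("b", 6)]]
# Same answer either way, but 6 rows vs 4 rows cross the shuffle boundary.
print(sum_by_key_full_shuffle(parts))  # ({'a': 7, 'b': 14}, 6)
print(sum_by_key_partial(parts))       # ({'a': 7, 'b': 14}, 4)
```

The gap grows with the number of rows per key, which is why an aggregation that cannot combine partially deserves a documented warning.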
[GitHub] spark issue #19792: [SPARK-22566][PYTHON] Better error message for `_merge_t...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19792 Oh, will double check too for sure shortly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19841 **[Test build #84423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84423/testReport)** for PR 19841 at commit [`515b00a`](https://github.com/apache/spark/commit/515b00abf2d214d2a253ef3ccb213df670d64fd4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84423/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19841: [SPARK-22642][SQL] the createdTempDir will not be delete...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19841 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated D...
GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19875 [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions ## What changes were proposed in this pull request? #19696 replaced the deprecated usages for `Date` and `Waiter`, but a few methods were missed. This PR fixes the remaining deprecated usages. ## How was this patch tested? Existing UTs. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22473_FOLLOWUP Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19875.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19875 commit 424e47175387e063a60fe06287f77703cf400045 Author: Marco Gaido Date: 2017-12-04T11:09:24Z [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date functions --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19876: [WIP][ML][SPARK-11171] spark 11237 Add PMML expor...
GitHub user holdenk opened a pull request: https://github.com/apache/spark/pull/19876 [WIP][ML][SPARK-11171] spark 11237 Add PMML export to Spark ML pipelines ## What changes were proposed in this pull request? Adds PMML export support to Spark ML pipelines in the style of Spark's DataSource API to allow library authors to add their own model export formats. This is a WIP to see if this is the design we want to go with. ## How was this patch tested? Basic unit test. You can merge this pull request into a Git repository by running: $ git pull https://github.com/holdenk/spark SPARK-11171-SPARK-11237-Add-PMML-export-for-ML-KMeans-r2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19876.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19876 commit 43ae30f08aed921178da07a5e982297b272c7c8f Author: Holden Karau Date: 2017-11-24T14:00:16Z Initial attempt at allowing Spark ML writers to be slightly more pluggable commit 9fec08fbd2dd1c980d5862f0b4521213e1e9349c Author: Holden Karau Date: 2017-11-25T12:55:19Z The LinearRegression suite passes commit 0075bf4776ecffa7fcb24a6f74c0e96161d6221c Author: Holden Karau Date: 2017-11-25T13:00:18Z Add missing META-INFO for MLFormatRegister commit c68880d6d982c56934f4b583263ed5cd4e8329d6 Author: Holden Karau Date: 2017-11-25T16:19:35Z Add a (untested) PMMLLinearRegressionModelWriter commit c2108df2b499bd45dff0e8add789f01d8c3c2c48 Author: Holden Karau Date: 2017-12-04T10:00:56Z Basic PMML export test commit de8619098eeb01ff86b54753f27c29729935bb94 Author: Holden Karau Date: 2017-12-04T11:27:03Z Add PMML testing utils for Spark ML that were accidently left out --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19871 **[Test build #84421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84421/testReport)** for PR 19871 at commit [`5474a07`](https://github.com/apache/spark/commit/5474a070ada64be7467a64fcd849064b063e2e42). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19871 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84421/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19813 **[Test build #84422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84422/testReport)** for PR 19813 at commit [`429afba`](https://github.com/apache/spark/commit/429afbabef6f718870ca3c6caf0712a1e459681f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84422/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nested expr...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19813 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19792: [SPARK-22566][PYTHON] Better error message for `_...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/19792#discussion_r154630463 --- Diff: python/pyspark/sql/session.py --- @@ -337,7 +338,7 @@ def _inferSchemaFromList(self, data): if type(first) is dict: warnings.warn("inferring schema from dict is deprecated," "please use pyspark.sql.Row instead") -schema = reduce(_merge_type, map(_infer_schema, data)) +schema = reduce(_merge_type, [_infer_schema(row, names) for row in data]) --- End diff -- Not a big deal but let's use generator expression -> `(_infer_schema(row, names) for row in data)` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
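The suggestion works because `functools.reduce` only needs an iterable, so a generator expression avoids materializing the full intermediate list of per-row schemas. A minimal stdlib-only sketch, where `infer` and `merge` are hypothetical stand-ins for `_infer_schema` and `_merge_type`:

```python
from functools import reduce

def infer(row):
    # Hypothetical stand-in for _infer_schema: the set of field names in a row.
    return set(row)

def merge(a, b):
    # Hypothetical stand-in for _merge_type: union the inferred fields.
    return a | b

data = [{"x": 1}, {"y": 2}, {"x": 3, "z": 4}]

# List comprehension: builds every per-row result up front.
schema_list = reduce(merge, [infer(row) for row in data])

# Generator expression: feeds reduce lazily, one row at a time.
schema_gen = reduce(merge, (infer(row) for row in data))

print(schema_list == schema_gen == {"x", "y", "z"})  # True
```

For schema inference over a small sample the saving is modest, which is consistent with the reviewer calling it "not a big deal".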
[GitHub] spark pull request #19877: [SPARK-22681] Accumulator should only be updated once...
GitHub user carsonwang opened a pull request: https://github.com/apache/spark/pull/19877 [SPARK-22681] Accumulator should only be updated once for each task in a result stage ## What changes were proposed in this pull request? As the doc says "For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value." But currently the code doesn't guarantee this. ## How was this patch tested? Newly added tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/carsonwang/spark fixAccum Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19877.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19877 commit 882126c2671e1733d572350af9749e9f8bdca1c2 Author: Carson Wang Date: 2017-12-04T12:23:14Z Do not update accumulator for resubmitted task in result stage --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
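The exactly-once guarantee described in the PR amounts to deduplicating updates by partition, so a resubmitted task for an already-finished result partition does not apply its update a second time. A plain-Python sketch of that idea, not Spark's actual DAGScheduler/accumulator code:

```python
class OnceAccumulator:
    """Apply at most one update per result-stage partition, even when a
    task is resubmitted (e.g. after a fetch failure or via speculation)."""

    def __init__(self):
        self.value = 0
        self._seen_partitions = set()

    def add(self, partition_id, delta):
        if partition_id in self._seen_partitions:
            return  # restarted task: drop the duplicate update
        self._seen_partitions.add(partition_id)
        self.value += delta

acc = OnceAccumulator()
acc.add(0, 10)
acc.add(1, 5)
acc.add(0, 10)  # partition 0 resubmitted; its update is ignored
print(acc.value)  # 15
```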
[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...
Github user gczsjdy commented on a diff in the pull request: https://github.com/apache/spark/pull/19862#discussion_r154635850 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala --- @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner( matchJoinKey = null bufferedMatches.clear() false -} else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) { - // The new streamed row has the same join key as the previous row, so return the same matches. - true -} else if (bufferedRow == null) { - // The streamed row's join key does not match the current batch of buffered rows and there are - // no more rows to read from the buffered iterator, so there can be no more matches. - matchJoinKey = null - bufferedMatches.clear() - false } else { - // Advance both the streamed and buffered iterators to find the next pair of matching rows. - var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey) - do { -if (streamedRowKey.anyNull) { - advancedStreamed() -} else { - assert(!bufferedRowKey.anyNull) - comp = keyOrdering.compare(streamedRowKey, bufferedRowKey) - if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey() - else if (comp < 0) advancedStreamed() -} - } while (streamedRow != null && bufferedRow != null && comp != 0) - if (streamedRow == null || bufferedRow == null) { -// We have either hit the end of one of the iterators, so there can be no more matches. + // To make sure vars like bufferedRow is set + advancedBufferedIterRes --- End diff -- Good advice. Thx. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19874 **[Test build #84424 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84424/testReport)** for PR 19874 at commit [`657cf88`](https://github.com/apache/spark/commit/657cf88bbf4259d1a823f93f16eaccc2fbe78667). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait TypePropagation extends Logging ` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #84426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84426/testReport)** for PR 19876 at commit [`de86190`](https://github.com/apache/spark/commit/de8619098eeb01ff86b54753f27c29729935bb94). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Merged build finished. Test PASSed.
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19877 **[Test build #84427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84427/testReport)** for PR 19877 at commit [`882126c`](https://github.com/apache/spark/commit/882126c2671e1733d572350af9749e9f8bdca1c2).
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19875 **[Test build #84425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84425/testReport)** for PR 19875 at commit [`424e471`](https://github.com/apache/spark/commit/424e47175387e063a60fe06287f77703cf400045).
[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19874 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84424/ Test PASSed.
[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...
Github user carsonwang commented on the issue: https://github.com/apache/spark/pull/19877 cc @vanzin
[GitHub] spark pull request #19870: [SPARK-22665][SQL] Avoid repartitioning with empt...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19870#discussion_r154542257

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala ---
    @@ -838,6 +838,8 @@ case class RepartitionByExpression(
         numPartitions: Int) extends RepartitionOperation {
       require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    +  require(partitionExpressions.nonEmpty, s"${getClass.getSimpleName} requires a non empty set of " +
    +    s"partitioning expressions.")
    --- End diff --

    Let's remove this leading 's'.
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84426/ Test FAILed.
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19876 **[Test build #84426 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84426/testReport)** for PR 19876 at commit [`de86190`](https://github.com/apache/spark/commit/de8619098eeb01ff86b54753f27c29729935bb94). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `trait PMMLReadWriteTest extends TempDirectory `
[GitHub] spark issue #19876: [WIP][ML][SPARK-11171][SPARK-11239] Add PMML export to S...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19876 Merged build finished. Test FAILed.
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154638156

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -85,7 +87,8 @@ case class DataSource(
       case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])
    -  lazy val providingClass: Class[_] = DataSource.lookupDataSource(className)
    +  lazy val providingClass: Class[_] =
    +    DataSource.lookupDataSource(sparkSession.sessionState.conf, className)
    --- End diff --

    I'd put this conf as the last argument actually if you wouldn't mind ..
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154638371

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
    @@ -363,6 +363,14 @@ object SQLConf {
         .checkValues(Set("none", "uncompressed", "snappy", "zlib", "lzo"))
         .createWithDefault("snappy")
    +  val ORC_USE_NEW_VERSION = buildConf("spark.sql.orc.useNewVersion")
    +    .doc("When true, use new OrcFileFormat in sql/core module instead of the one in sql/hive. " +
    +      "Since new OrcFileFormat uses Apache ORC library instead of ORC library Hive 1.2.1, it is " +
    +      "more stable and faster.")
    --- End diff --

    tiny nit: let's take out `more stable` ..
[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19871#discussion_r154640805

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala ---
    @@ -568,8 +574,12 @@ object DataSource extends Logging {
         "org.apache.spark.Logging")

       /** Given a provider name, look up the data source class definition. */
    -  def lookupDataSource(provider: String): Class[_] = {
    -    val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
    +  def lookupDataSource(conf: SQLConf, provider: String): Class[_] = {
    --- End diff --

    After more thinking, I don't think it's worth passing the whole SQLConf into this function; we just need to know whether `SQLConf.ORC_USE_NEW_VERSION` is enabled.
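The suggestion above is a general API-design point: when a lookup depends on a single setting, passing that one flag keeps the function's contract narrow and easy to test. A hypothetical Python sketch of the dispatch under discussion (names and alias map are illustrative, not Spark's actual API):

```python
# Illustrative backward-compatibility alias map, loosely modeled on
# DataSource.backwardCompatibilityMap; the entries here are made up.
BACKWARD_COMPATIBILITY_MAP = {"orc": "native-orc"}

def lookup_data_source(provider, use_new_orc=False):
    """Resolve a provider name to an implementation name, taking only the
    single boolean flag the lookup needs instead of a whole config object."""
    resolved = BACKWARD_COMPATIBILITY_MAP.get(provider, provider)
    if resolved == "native-orc" and not use_new_orc:
        resolved = "hive-orc"  # fall back to the legacy implementation
    return resolved
```

A caller would pass `conf.get("spark.sql.orc.useNewVersion")` (or its analog) at the call site, so the function itself never sees the configuration object.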
[GitHub] spark issue #895: [SPARK-1940] Enabling rolling of executor logs, and automa...
Github user wbowditch commented on the issue: https://github.com/apache/spark/pull/895 Can these configuration additions be added to Spark Documentation (https://spark.apache.org/docs/latest/configuration.html) ?
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644235

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
           df.logicalPlan))
       }
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +        case ne: NamedExpression => ne
    +        case other => Alias(other, other.toString)()
    --- End diff --

    indentation nit
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644340

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala ---
    @@ -437,6 +437,37 @@ class RelationalGroupedDataset protected[sql](
           df.logicalPlan))
       }
    +
    +  private[sql] def aggInPandas(columns: Seq[Column]): DataFrame = {
    +    val exprs = columns.map(column => column.expr.asInstanceOf[PythonUDF])
    +
    +    val groupingNamedExpressions = groupingExprs.map {
    +      case ne: NamedExpression => ne
    +      case other => Alias(other, other.toString)()
    +    }
    +
    +    val groupingAttributes = groupingNamedExpressions.map(_.toAttribute)
    +
    +    val child = df.logicalPlan
    +
    +    val childrenExpressions = exprs.flatMap(expr =>
    +      expr.children.map {
    +        case ne: NamedExpression => ne
    +        case other => Alias(other, other.toString)()
    +      })
    +
    +    val project = Project(groupingNamedExpressions ++ childrenExpressions, child)
    +
    +    val udfOutputs = exprs.flatMap(expr =>
    +      Seq(AttributeReference(expr.name, expr.dataType)())
    +    )
    --- End diff --

    I think this could be inlined.
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644620

    --- Diff: python/pyspark/sql/tests.py ---
    @@ -4016,6 +4016,89 @@ def test_unsupported_types(self):
             with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                 df.groupby('id').apply(f).collect()
    +
    +@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
    +class GroupbyAggTests(ReusedSQLTestCase):
    +    def assertFramesEqual(self, expected, result):
    +        msg = ("DataFrames are not equal: " +
    +               ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +
    --- End diff --

    indentation nit
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154642902

    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala ---
    @@ -38,3 +38,13 @@ case class FlatMapGroupsInPandas(
        */
       override val producedAttributes = AttributeSet(output)
     }
    +
    +case class AggregateInPandas(
    +    groupingAttributes: Seq[Attribute],
    +    functionExprs: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: LogicalPlan
    +) extends UnaryNode {
    --- End diff --

    nit:
    ```
        child: LogicalPlan) extends UnaryNode {
    ```
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154642230

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala ---
    @@ -0,0 +1,135 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.sql.execution.python
    +
    +import java.io.File
    +
    +import scala.collection.mutable.ArrayBuffer
    +
    +import org.apache.spark.{SparkEnv, TaskContext}
    +import org.apache.spark.api.python.{ChainedPythonFunctions, PythonEvalType}
    +import org.apache.spark.rdd.RDD
    +import org.apache.spark.sql.catalyst.InternalRow
    +import org.apache.spark.sql.catalyst.expressions.{Ascending, Attribute, AttributeSet, Expression, JoinedRow, SortOrder, UnsafeProjection, UnsafeRow}
    +import org.apache.spark.sql.catalyst.plans.physical.{AllTuples, ClusteredDistribution, Distribution, Partitioning}
    +import org.apache.spark.sql.execution.{GroupedIterator, SparkPlan, UnaryExecNode}
    +import org.apache.spark.sql.types.StructType
    +import org.apache.spark.util.Utils
    +
    +case class AggregateInPandasExec(
    +    groupingAttributes: Seq[Attribute],
    +    func: Seq[Expression],
    +    output: Seq[Attribute],
    +    child: SparkPlan)
    +  extends UnaryExecNode {
    +
    +  private val udfs = func.map(expr => expr.asInstanceOf[PythonUDF])
    +
    +  override def outputPartitioning: Partitioning = child.outputPartitioning
    +
    +  override def producedAttributes: AttributeSet = AttributeSet(output)
    +
    +  override def requiredChildDistribution: Seq[Distribution] = {
    +    if (groupingAttributes.isEmpty) {
    +      AllTuples :: Nil
    +    } else {
    +      ClusteredDistribution(groupingAttributes) :: Nil
    +    }
    +  }
    +
    +  private def collectFunctions(udf: PythonUDF): (ChainedPythonFunctions, Seq[Expression]) = {
    +    udf.children match {
    +      case Seq(u: PythonUDF) =>
    +        val (chained, children) = collectFunctions(u)
    +        (ChainedPythonFunctions(chained.funcs ++ Seq(udf.func)), children)
    +      case children =>
    +        // There should not be any other UDFs, or the children can't be evaluated directly.
    +        assert(children.forall(_.find(_.isInstanceOf[PythonUDF]).isEmpty))
    +        (ChainedPythonFunctions(Seq(udf.func)), udf.children)
    +    }
    +  }
    +
    +  override def requiredChildOrdering: Seq[Seq[SortOrder]] =
    +    Seq(groupingAttributes.map(SortOrder(_, Ascending)))
    +
    +  override protected def doExecute(): RDD[InternalRow] = {
    +    val inputRDD = child.execute()
    +
    +    val bufferSize = inputRDD.conf.getInt("spark.buffer.size", 65536)
    +    val reuseWorker = inputRDD.conf.getBoolean("spark.python.worker.reuse", defaultValue = true)
    +    // val argOffsets = Array((0 until (child.output.length - groupingAttributes.length)).toArray)
    +    val schema = StructType(child.schema.drop(groupingAttributes.length))
    +    val sessionLocalTimeZone = conf.sessionLocalTimeZone
    +    val pandasRespectSessionTimeZone = conf.pandasRespectSessionTimeZone
    +
    +    val (pyFuncs, inputs) = udfs.map(collectFunctions).unzip
    +
    +    val allInputs = new ArrayBuffer[Expression]
    +
    +    val argOffsets = inputs.map { input =>
    +      input.map { e =>
    +        allInputs += e
    --- End diff --

    indentation nit
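The `allInputs`/`argOffsets` bookkeeping at the end of the diff above builds one shared list of all UDF inputs and, for each UDF, records the offsets of its arguments into that list. A rough Python analog of that pattern (hypothetical helper, not the PySpark internals; the real code de-duplicates with `semanticEquals` rather than equality):

```python
def compute_arg_offsets(udf_inputs):
    """For each UDF's input list, map every expression to an offset into one
    shared, de-duplicated list of all inputs across the UDFs."""
    all_inputs = []
    arg_offsets = []
    for inputs in udf_inputs:
        offsets = []
        for expr in inputs:
            if expr in all_inputs:
                offsets.append(all_inputs.index(expr))  # reuse existing slot
            else:
                all_inputs.append(expr)
                offsets.append(len(all_inputs) - 1)     # new slot at the end
        arg_offsets.append(offsets)
    return all_inputs, arg_offsets
```

Sharing one flat input list means each expression is projected and serialized to the Python worker once, even when several UDFs consume it.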
[GitHub] spark pull request #19872: WIP: [SPARK-22274][PySpark] User-defined aggregat...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19872#discussion_r154644084

    --- Diff: python/pyspark/sql/udf.py ---
    @@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
         return udf_obj._wrapped()
    +
    +class UDFColumn(Column):
    --- End diff --

    BTW, what do you think about adding an attribute instead?
[GitHub] spark issue #19875: [SPARK-22473][FOLLOWUP][TEST] Remove deprecated Date fun...
Github user mgaido91 commented on the issue: https://github.com/apache/spark/pull/19875 yes @HyukjinKwon , you are 100% right, sorry for this error.
[GitHub] spark issue #18424: [SPARK-17091] Add rule to convert IN predicate to equiva...
Github user ptkool commented on the issue: https://github.com/apache/spark/pull/18424 @a10y Yes, I'm still tracking this.