[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84453/
Test PASSed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84453 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84453/testReport)**
 for PR 19717 at commit 
[`347ed69`](https://github.com/apache/spark/commit/347ed697e3f17546af83a8cd8a9578527cf65384).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19871
  
**[Test build #84461 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84461/testReport)**
 for PR 19871 at commit 
[`7fac88f`](https://github.com/apache/spark/commit/7fac88f3a25122d4445f2491dcfc3ebc34749186).


---




[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r154866923
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate 
it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to 
call the function.
+   * The 2nd string in returned tuple is the parameter strings used to 
declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred 
expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+  ctx: CodegenContext,
+  expr: Expression): Option[(Seq[String], Seq[String])] = {
+val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, 
inputVars)
+
+val subExprs = getSubExprInChildren(ctx, expr)
+val subExprCodes = getSubExprCodes(ctx, subExprs)
+val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, 
subExprCodes)
--- End diff --

If I get rid of the part that extracts subexpressions as parameters, some tests will fail.

Because hash aggregation uses subexpression elimination under whole-stage codegen, and this PR enables splitting expression code under whole-stage codegen, it inevitably needs to include subexpression parameters.

This PR doesn't yet support `splitExpressions` in whole-stage codegen; for now it is only applied to splitting the code of deeply nested expressions. But the added API can be applied to `splitExpressions` later. That is a TODO.
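
To illustrate how the two returned parameter lists would fit together when splitting an expression's generated code into a separate function, here is a rough sketch (the function name and surrounding wiring are hypothetical, not code from this PR):
```scala
// Hedged sketch: split the expression's generated code into a private function
// and call it with the parameters returned by getExpressionInputParams.
ExpressionCodegen.getExpressionInputParams(ctx, expr).map { case (callParams, declParams) =>
  val funcName = ctx.freshName("evalExpr")
  ctx.addNewFunction(funcName,
    s"""
       |private void $funcName(${declParams.mkString(", ")}) {
       |  // generated evaluation code for `expr` goes here
       |}
     """.stripMargin)
  s"$funcName(${callParams.mkString(", ")});"  // call site emitted in the outer code
}
```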




---




[GitHub] spark issue #19871: [SPARK-20728][SQL] Make OrcFileFormat configurable betwe...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19871
  
**[Test build #84460 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84460/testReport)**
 for PR 19871 at commit 
[`e3f6f75`](https://github.com/apache/spark/commit/e3f6f75084006a0ceb1c2e50c47819bc9ae8e463).


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154865943
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

And here, I added the following to prevent a `Multiple sources found` error. Last time, I missed this approach. My bad.
```
+  case name if name.equalsIgnoreCase("orc") &&
+conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
+"org.apache.spark.sql.hive.orc.OrcFileFormat"
```
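
Putting that together with the `native` case from the diff above, the resolution would look roughly like this (a sketch assembled from the two snippets, not necessarily the final patch):
```scala
// Sketch: map "orc" to the new or the Hive-based OrcFileFormat depending on
// SQLConf.ORC_IMPLEMENTATION, and keep every other provider name unchanged.
val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) match {
  case name if name.equalsIgnoreCase("orc") &&
      conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
    classOf[OrcFileFormat].getCanonicalName
  case name if name.equalsIgnoreCase("orc") &&
      conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "hive" =>
    "org.apache.spark.sql.hive.orc.OrcFileFormat"
  case name => name
}
```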


---




[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19882
  
It has more lines, doesn't it? Either way, we need helper functions.


---




[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r154865049
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate 
it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to 
call the function.
+   * The 2nd string in returned tuple is the parameter strings used to 
declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred 
expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+  ctx: CodegenContext,
+  expr: Expression): Option[(Seq[String], Seq[String])] = {
+val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, 
inputVars)
+
+val subExprs = getSubExprInChildren(ctx, expr)
+val subExprCodes = getSubExprCodes(ctx, subExprs)
+val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, 
subExprCodes)
--- End diff --

I think supporting subexpression elimination in `splitExpressions` is really worth an individual JIRA; can we do it in a new PR? This PR should focus on supporting `splitExpressions` in whole-stage codegen.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154864703
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

sounds good


---




[GitHub] spark issue #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-04 Thread zhengruifeng
Github user zhengruifeng commented on the issue:

https://github.com/apache/spark/pull/19889
  
No other algorithms output multiple columns for now.


---




[GitHub] spark issue #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19882
  
> OrcTest will provide common helper functions and `def format: String`.

Instead of having `def format: String`, can we just add `beforeAll` and `afterAll` in the test suites to set `ORC_IMPLEMENTATION`?
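
A minimal sketch of that suggestion (the suite and parent class names are hypothetical; only `SQLConf.ORC_IMPLEMENTATION` comes from this PR):
```scala
// Hypothetical suite: pin the ORC implementation for the whole suite in
// beforeAll/afterAll instead of threading a `def format: String` through helpers.
class NativeOrcQuerySuite extends OrcQueryTest {
  override protected def beforeAll(): Unit = {
    super.beforeAll()
    spark.conf.set(SQLConf.ORC_IMPLEMENTATION.key, "native")
  }

  override protected def afterAll(): Unit = {
    spark.conf.unset(SQLConf.ORC_IMPLEMENTATION.key)
    super.afterAll()
  }
}
```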


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154863367
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -280,7 +280,7 @@ object TypeCoercion {
*/
   object WidenSetOperationTypes extends Rule[LogicalPlan] {
 
-def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case p if p.analyzed => p
--- End diff --

In which cases should we still use the `analyzed` flag?


---




[GitHub] spark issue #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19889
  
**[Test build #84459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84459/testReport)**
 for PR 19889 at commit 
[`5881b88`](https://github.com/apache/spark/commit/5881b88cfcee0989ef5e715ce4e243b501b39ec7).


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154863115
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -280,7 +280,7 @@ object TypeCoercion {
*/
   object WidenSetOperationTypes extends Rule[LogicalPlan] {
 
-def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case p if p.analyzed => p
--- End diff --

Sorry?


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154863020
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -470,7 +470,7 @@ case class DataSource(
   }.head
 }
 // For partitioned relation r, r.schema's column ordering can be 
different from the column
-// ordering of data.logicalPlan (partition columns are all moved after 
data column).  This
+// ordering of data.logicalPlan (partition columns are all moved after 
data column). This
--- End diff --

Sure.


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154863033
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1073,11 +1076,13 @@ class Analyzer(
* The HAVING clause could also used a grouping columns that is not 
presented in the SELECT.
*/
   object ResolveMissingReferences extends Rule[LogicalPlan] {
-def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
+def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
   // Skip sort with aggregate. This will be handled in 
ResolveAggregateFunctions
+  case sa @ Sort(_, _, AnalysisBarrier(child: Aggregate)) => sa
   case sa @ Sort(_, _, child: Aggregate) => sa
 
-  case s @ Sort(order, _, child) if !s.resolved && child.resolved =>
+  case s @ Sort(order, _, oriChild) if !s.resolved && 
oriChild.resolved =>
--- End diff --

Ok.


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154862947
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -241,7 +241,7 @@ class PlannerSuite extends SharedSQLContext {
   test("collapse adjacent repartitions") {
 val doubleRepartitioned = 
testData.repartition(10).repartition(20).coalesce(5)
 def countRepartitions(plan: LogicalPlan): Int = plan.collect { case r: 
Repartition => r }.length
-assert(countRepartitions(doubleRepartitioned.queryExecution.logical) 
=== 3)
+assert(countRepartitions(doubleRepartitioned.queryExecution.analyzed) 
=== 3)
--- End diff --

Please see previous discussion: 
https://github.com/apache/spark/pull/17770/files#r118480364



---




[GitHub] spark pull request #19889: [SPARK-22690][ML] Imputer inherit HasOutputCols

2017-12-04 Thread zhengruifeng
GitHub user zhengruifeng opened a pull request:

https://github.com/apache/spark/pull/19889

[SPARK-22690][ML] Imputer inherit HasOutputCols

## What changes were proposed in this pull request?
make `Imputer` inherit `HasOutputCols`

## How was this patch tested?
existing tests
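
For context, a simplified sketch of what inheriting `HasOutputCols` means here (illustrative only, not the actual Spark ML source):
```scala
import org.apache.spark.ml.param.Params
import org.apache.spark.ml.param.shared.{HasInputCols, HasOutputCols}

// Illustrative: the Imputer params trait mixes in the shared HasOutputCols,
// so `outputCols` and `getOutputCols` are inherited instead of redefined locally.
trait ImputerParams extends Params with HasInputCols with HasOutputCols {
  // Imputer-specific params (e.g. strategy, missingValue) stay here.
}
```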


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhengruifeng/spark using_HasOutputCols

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19889.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19889


commit 5881b88cfcee0989ef5e715ce4e243b501b39ec7
Author: Zheng RuiFeng 
Date:   2017-12-05T07:16:32Z

create pr




---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154863003
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -666,7 +667,9 @@ class Analyzer(
  * Generate a new logical plan for the right child with different 
expression IDs
  * for all conflicting attributes.
  */
-private def dedupRight (left: LogicalPlan, right: LogicalPlan): 
LogicalPlan = {
+private def dedupRight (left: LogicalPlan, oriRight: LogicalPlan): 
LogicalPlan = {
--- End diff --

Ok.


---




[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-12-04 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r154862787
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate 
it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to 
call the function.
+   * The 2nd string in returned tuple is the parameter strings used to 
declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred 
expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+  ctx: CodegenContext,
+  expr: Expression): Option[(Seq[String], Seq[String])] = {
+val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, 
inputVars)
+
+val subExprs = getSubExprInChildren(ctx, expr)
+val subExprCodes = getSubExprCodes(ctx, subExprs)
+val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, 
subExprCodes)
--- End diff --

Oh, then the answer is no. First, we don't try to eliminate subexpressions before splitting code now. Second, `splitExpressions` doesn't support subexpression elimination either.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154862662
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

So, there is no more `The ORC data source must be used with Hive support enabled` error.


---




[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19873
  
From the PR description, I am unable to tell what changes are made in this PR. We need a better description to explain what solution is proposed in this PR.


---




[GitHub] spark issue #19840: [SPARK-22640][PYSPARK][YARN]switch python exec on execut...

2017-12-04 Thread yaooqinn
Github user yaooqinn commented on the issue:

https://github.com/apache/spark/pull/19840
  
@vanzin 
According to @ueshin's explanation, `PYSPARK_DRIVER_PYTHON` is only for the driver. If the executor follows the order of [SparkSubmitCommandBuilder.java#L304](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L304), we may not need so many configs, right?
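
As a rough illustration of the precedence idea (the authoritative order is whatever `SparkSubmitCommandBuilder` does; the exact ordering below is an assumption):
```scala
// Sketch only: pick the executor's Python from the first non-empty source,
// falling back to plain "python", mirroring a single fixed lookup order.
def resolvePythonExec(conf: Map[String, String], env: Map[String, String]): String =
  Seq(conf.get("spark.pyspark.python"), env.get("PYSPARK_PYTHON"))
    .flatten
    .find(_.nonEmpty)
    .getOrElse("python")
```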


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154860853
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

I agree with both of you.

Just for explanation: the original design completely preserves the previous behavior.
Without the `SQLConf.ORC_IMPLEMENTATION` option, Spark doesn't know about the new OrcFileFormat. So, without Hive support, creating a data source with "orc" will fail with an unknown-data-source error.

Anyway, I'm happy to update according to your advice. :)
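
To spell out the behavior change being described (illustrative; the error text is the one quoted elsewhere in this thread):
```scala
// Illustration: reading ORC on a build without Hive support.
// - before this PR: "orc" is unknown, so this fails with
//   "The ORC data source must be used with Hive support enabled"
// - after this PR, with SQLConf.ORC_IMPLEMENTATION = "native": it resolves to
//   the new OrcFileFormat and works without Hive support.
val df = spark.read.format("orc").load(path)
```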


---




[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19874
  
LGTM


---




[GitHub] spark issue #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should not throw...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19888
  
**[Test build #84458 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84458/testReport)**
 for PR 19888 at commit 
[`b2b4d54`](https://github.com/apache/spark/commit/b2b4d5460ce1bf156f76b346a7d670bd843f0799).


---




[GitHub] spark pull request #19793: [SPARK-22574] [Mesos] [Submit] Check submission r...

2017-12-04 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/19793#discussion_r154859418
  
--- Diff: 
resource-managers/mesos/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala
 ---
@@ -77,10 +77,16 @@ private[mesos] class MesosSubmitRequestServlet(
   private def buildDriverDescription(request: CreateSubmissionRequest): 
MesosDriverDescription = {
 // Required fields, including the main class because python is not yet 
supported
 val appResource = Option(request.appResource).getOrElse {
-  throw new SubmitRestMissingFieldException("Application jar is 
missing.")
+  throw new SubmitRestMissingFieldException("Application jar 
'appResource' is missing.")
 }
 val mainClass = Option(request.mainClass).getOrElse {
-  throw new SubmitRestMissingFieldException("Main class is missing.")
+  throw new SubmitRestMissingFieldException("Main class 'mainClass' is 
missing.")
+}
+val appArgs = Option(request.appArgs).getOrElse {
+  throw new SubmitRestMissingFieldException("Application arguments 
'appArgs' are missing.")
+}
+val environmentVariables = 
Option(request.environmentVariables).getOrElse {
+  throw new SubmitRestMissingFieldException("Environment variables 
'environmentVariables' are missing.")
--- End diff --

So this is an API change - new required arguments for the request.
Are folks ok with this?


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154859083
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1098,7 +1103,8 @@ class Analyzer(
   case ae: AnalysisException => s
 }
 
-  case f @ Filter(cond, child) if !f.resolved && child.resolved =>
+  case f @ Filter(cond, oriChild) if !f.resolved && oriChild.resolved 
=>
--- End diff --

Use `originalChild`


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154858997
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -666,7 +667,9 @@ class Analyzer(
  * Generate a new logical plan for the right child with different 
expression IDs
  * for all conflicting attributes.
  */
-private def dedupRight (left: LogicalPlan, right: LogicalPlan): 
LogicalPlan = {
+private def dedupRight (left: LogicalPlan, oriRight: LogicalPlan): 
LogicalPlan = {
--- End diff --

Use `originalRight`


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154859044
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -1073,11 +1076,13 @@ class Analyzer(
* The HAVING clause could also used a grouping columns that is not 
presented in the SELECT.
*/
   object ResolveMissingReferences extends Rule[LogicalPlan] {
-def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
+def apply(plan: LogicalPlan): LogicalPlan = plan.transformUp {
   // Skip sort with aggregate. This will be handled in 
ResolveAggregateFunctions
+  case sa @ Sort(_, _, AnalysisBarrier(child: Aggregate)) => sa
   case sa @ Sort(_, _, child: Aggregate) => sa
 
-  case s @ Sort(order, _, child) if !s.resolved && child.resolved =>
+  case s @ Sort(order, _, oriChild) if !s.resolved && 
oriChild.resolved =>
--- End diff --

Use `originalChild`


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154858146
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -470,7 +470,7 @@ case class DataSource(
   }.head
 }
 // For partitioned relation r, r.schema's column ordering can be 
different from the column
-// ordering of data.logicalPlan (partition columns are all moved after 
data column).  This
+// ordering of data.logicalPlan (partition columns are all moved after 
data column). This
--- End diff --

Get rid of changes in this file.


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154857858
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 ---
@@ -666,7 +667,9 @@ class Analyzer(
  * Generate a new logical plan for the right child with different 
expression IDs
  * for all conflicting attributes.
  */
-private def dedupRight (left: LogicalPlan, right: LogicalPlan): 
LogicalPlan = {
+private def dedupRight (left: LogicalPlan, oriRight: LogicalPlan): 
LogicalPlan = {
--- End diff --

What is `oriRight`?


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154857587
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -280,7 +280,7 @@ object TypeCoercion {
*/
   object WidenSetOperationTypes extends Rule[LogicalPlan] {
 
-def apply(plan: LogicalPlan): LogicalPlan = plan resolveOperators {
+def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
   case p if p.analyzed => p
--- End diff --

Why?


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154857148
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

Yep. It exists in master, too.
I'll update it following the example by @cloud-fan.


---




[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19874
  
**[Test build #84457 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84457/testReport)**
 for PR 19874 at commit 
[`bcfbd45`](https://github.com/apache/spark/commit/bcfbd45ea60a42480d165160680437e3a6757035).


---




[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/19874
  
BTW, I did not change all the rules in `object TypeCoercion` to propagate 
the types, because not all of them need to propagate the types.  


---




[GitHub] spark pull request #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19874#discussion_r154855820
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala
 ---
@@ -58,7 +58,7 @@ import org.apache.spark.sql.types._
  * - FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE
  */
 // scalastyle:on
-object DecimalPrecision extends Rule[LogicalPlan] {
+object DecimalPrecision extends Rule[LogicalPlan] with TypePropagation {
--- End diff --

One benefit of naming it `TypeCoercionRule` is that we can just write `object DecimalPrecision extends TypeCoercionRule`.


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154855703
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

I can reproduce it in master now. 


---




[GitHub] spark issue #19873: [SPARK-20392][SQL] Set barrier to prevent re-entering a ...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19873
  
LGTM, also cc @gatorsmile 


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154855592
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala ---
@@ -241,7 +241,7 @@ class PlannerSuite extends SharedSQLContext {
   test("collapse adjacent repartitions") {
 val doubleRepartitioned = 
testData.repartition(10).repartition(20).coalesce(5)
 def countRepartitions(plan: LogicalPlan): Int = plan.collect { case r: 
Repartition => r }.length
-assert(countRepartitions(doubleRepartitioned.queryExecution.logical) 
=== 3)
+assert(countRepartitions(doubleRepartitioned.queryExecution.analyzed) 
=== 3)
--- End diff --

is it a necessary change?


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r15480
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

+1 for ^ 


---




[GitHub] spark issue #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in TypeCo...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19874
  
**[Test build #84456 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84456/testReport)**
 for PR 19874 at commit 
[`3c63a6a`](https://github.com/apache/spark/commit/3c63a6a4fba638e251a7e9f7dbab5cd0f710370a).


---




[GitHub] spark pull request #19873: [SPARK-20392][SQL] Set barrier to prevent re-ente...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19873#discussion_r154855150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
 ---
@@ -50,41 +50,6 @@ abstract class LogicalPlan
   /** Returns true if this subtree has data from a streaming data source. 
*/
   def isStreaming: Boolean = children.exists(_.isStreaming == true)
 
-  /**
-   * Returns a copy of this node where `rule` has been recursively applied 
first to all of its
-   * children and then itself (post-order). When `rule` does not apply to 
a given node, it is left
-   * unchanged.  This function is similar to `transformUp`, but skips 
sub-trees that have already
-   * been marked as analyzed.
-   *
-   * @param rule the function use to transform this nodes children
-   */
-  def resolveOperators(rule: PartialFunction[LogicalPlan, LogicalPlan]): 
LogicalPlan = {
--- End diff --

can we also remove the `analyzed` flag in this class?


---




[GitHub] spark pull request #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19874#discussion_r154854338
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -39,6 +40,8 @@ class TPCDSQuerySuite extends QueryTest with 
SharedSQLContext with BeforeAndAfte
*/
   protected override def afterAll(): Unit = {
 try {
+  // For debugging dump some statistics about how much time was spent 
in various optimizer rules
+  logWarning(RuleExecutor.dumpTimeSpent())
--- End diff --

Let me change all the other places. 


---




[GitHub] spark pull request #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19874#discussion_r154854356
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -850,3 +834,45 @@ object TypeCoercion {
 }
   }
 }
+
+trait TypePropagation extends Logging {
--- End diff --

Sure. 


---




[GitHub] spark pull request #19813: [SPARK-22600][SQL] Fix 64kb limit for deeply nest...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19813#discussion_r154854328
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/ExpressionCodegen.scala
 ---
@@ -0,0 +1,237 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.expressions.codegen
+
+import scala.collection.mutable
+
+import org.apache.spark.sql.catalyst.expressions._
+
+/**
+ * Defines APIs used in expression code generation.
+ */
+object ExpressionCodegen {
+
+  /**
+   * Given an expression, returns the all necessary parameters to evaluate 
it, so the generated
+   * code of this expression can be split in a function.
+   * The 1st string in returned tuple is the parameter strings used to 
call the function.
+   * The 2nd string in returned tuple is the parameter strings used to 
declare the function.
+   *
+   * Returns `None` if it can't produce valid parameters.
+   *
+   * Params to include:
+   * 1. Evaluated columns referred by this, children or deferred 
expressions.
+   * 2. Rows referred by this, children or deferred expressions.
+   * 3. Eliminated subexpressions referred by children expressions.
+   */
+  def getExpressionInputParams(
+  ctx: CodegenContext,
+  expr: Expression): Option[(Seq[String], Seq[String])] = {
+val (inputAttrs, inputVars) = getInputVarsForChildren(ctx, expr)
+val paramsFromColumns = prepareFunctionParams(ctx, inputAttrs, 
inputVars)
+
+val subExprs = getSubExprInChildren(ctx, expr)
+val subExprCodes = getSubExprCodes(ctx, subExprs)
+val paramsFromSubExprs = prepareFunctionParams(ctx, subExprs, 
subExprCodes)
--- End diff --

I mean before your PR.


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154854071
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

@dongjoon-hyun can you verify it? thanks!


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154854042
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

hmmm, it's fixed in master?


---




[GitHub] spark pull request #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19874#discussion_r154853933
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala ---
@@ -39,6 +40,8 @@ class TPCDSQuerySuite extends QueryTest with 
SharedSQLContext with BeforeAndAfte
*/
   protected override def afterAll(): Unit = {
 try {
+  // For debugging dump some statistics about how much time was spent 
in various optimizer rules
+  logWarning(RuleExecutor.dumpTimeSpent())
--- End diff --

use `logInfo`?


---




[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19885
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19885
  
**[Test build #84455 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84455/testReport)**
 for PR 19885 at commit 
[`8bbbf04`](https://github.com/apache/spark/commit/8bbbf046a8f706add8430dc6d34c551d87c4d5a1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19885
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84455/
Test PASSed.


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154853724
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

I can see it in 2.2.1. It is a warning message.


---




[GitHub] spark pull request #19874: [SPARK-22675] [SQL] Refactoring PropagateTypes in...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19874#discussion_r154853773
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala
 ---
@@ -850,3 +834,45 @@ object TypeCoercion {
 }
   }
 }
+
+trait TypePropagation extends Logging {
--- End diff --

how about
```
trait TypeCoercionRule extends Rule[LogicalPlan] {
  protected def coerciveType(plan: LogicalPlan): LogicalPlan

  def apply(plan: LogicalPlan): LogicalPlan = {
val newPlan = coerciveType(plan)
...
  }
}
```
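
With that trait, an individual rule would then only implement the coercion step, e.g. (names follow the sketch above; not the final API):
```scala
// Illustrative only: DecimalPrecision supplies just the coercion logic and
// inherits the shared apply/propagation plumbing from TypeCoercionRule.
object DecimalPrecision extends TypeCoercionRule {
  override protected def coerciveType(plan: LogicalPlan): LogicalPlan = plan transformUp {
    // decimal-specific type promotions would go here
    case p => p
  }
}
```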


---




[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19877
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84452/
Test PASSed.


---




[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19877
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #19877: [SPARK-22681]Accumulator should only be updated once for...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19877
  
**[Test build #84452 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84452/testReport)**
 for PR 19877 at commit 
[`756f02f`](https://github.com/apache/spark/commit/756f02f1586edd14e42e32cd119e43132e9d13ee).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #19852: [SPARK-22655][PySpark] Throw exception rather tha...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19852#discussion_r154852742
  
--- Diff: 
core/src/main/scala/org/apache/spark/api/python/PythonRunner.scala ---
@@ -319,7 +319,7 @@ private[spark] abstract class BasePythonRunner[IN, OUT](
 
   case e: Exception if env.isStopped =>
 logDebug("Exception thrown after context is stopped", e)
-null.asInstanceOf[OUT]  // exit silently
+throw new SparkException("Spark session has been stopped", e)
--- End diff --

It seems like the Java task doesn't check `env.isStopped` when the task fails, and only checks it when the retry fails.

Does the Python task have a retry mechanism?


---




[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154852342
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

it's in the log, which means it's not a serious bug.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154852231
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

and also add comments here.


---




[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154852193
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

This sounds counter-intuitive; I think we should register the new orc 
instead of the old one.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154852162
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

```
scala> spark.version
res2: String = 2.3.0-SNAPSHOT

scala> sql("DROP TABLE IF EXISTS t")
res3: org.apache.spark.sql.DataFrame = []

scala> sql("DROP TABLE IF EXISTS t")
res4: org.apache.spark.sql.DataFrame = []
```

Unable to reproduce it.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19885
  
**[Test build #84455 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84455/testReport)**
 for PR 19885 at commit 
[`8bbbf04`](https://github.com/apache/spark/commit/8bbbf046a8f706add8430dc6d34c551d87c4d5a1).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19888#discussion_r154851944
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala ---
@@ -206,7 +206,8 @@ case class DropTableCommand(
 try {
   
sparkSession.sharedState.cacheManager.uncacheQuery(sparkSession.table(tableName))
 } catch {
-  case _: NoSuchTableException if ifExists =>
+  case ae: AnalysisException
+if ifExists && ae.cause.nonEmpty && 
ae.getCause.isInstanceOf[NoSuchTableException] =>
--- End diff --

can we follow 
https://github.com/apache/spark/pull/19713/files#diff-bc55b5f76add105ec32ae4107075b278R57
 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19885
  
ok to test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154850738
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
+}
+binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = 
{
+var binId = 0
+for (i <- bins.indices) {
+  if (value > bins(i).hi) {
+// increment binId to point to next bin
+binId += 1
+  }
+  if ((value == bins(i).hi) && (i < bins.length - 1) && (value == 
bins(i + 1).lo)) {
+// We assume the above 3 conditions will be evaluated from left to 
right sequentially.
+// If the above 3 conditions are evaluated out-of-order, then 
out-of-bound error may happen.
+// At that time, we should split the third condition into another 
if statement.
+// increment binId since the value appears in this bin and next bin
+binId += 1
+  }
+}
+binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the 
range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column 
value range
+   * @param lowerValue a given lower bound value of a specified column 
value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, 
higherValue].
+   */
+  private def getOccupation(
+  binId: Int,
+  higherValue: Double,
+  lowerValue: Double,
+  histogram: Histogram): Double = {
+val curBin = histogram.bins(binId)
+if (binId == 0 && curBin.hi == curBin.lo) {
+  // the Min of the histogram occupies the whole first bin
+  1.0
+} else if (binId == 0 && curBin.hi != curBin.lo) {
+  if (higherValue == lowerValue) {
+// set percentage to 1/NDV
+1.0 / curBin.ndv.toDouble
+  } else {
+// Use proration since the range falls inside this bin.
+(higherValue - lowerValue) / (curBin.hi - curBin.lo)
+  }
+} else {
+  if (curBin.hi == curBin.lo) {
+// the entire bin is covered in the range
+1.0
+  } else if (higherValue == lowerValue) {
+// set percentage to 1/NDV
+1.0 / curBin.ndv.toDouble
+  } else {
+// Use proration since the range falls inside this bin.
+math.min((higherValue - lowerValue) / (curBin.hi - curBin.lo), 1.0)
+  }
+}
+  }
+
+  /**
+   * Returns the number of bins for column values in [lowerValue, 
higherValue].
+   * The column value distribution is saved in an equi-height histogram.
+   *
+   * @param higherEnd a given upper bound value of a specified column 
value range
+   * @param lowerEnd a given lower bound value of a specified column value 
range
+   * @param histogram a numeric equi-height histogram
+   * @return the selectivity percentage for column values in [lowerValue, 
higherValue].
+   */
+  def getOccupationBins(
+  higherEnd: Double,
+  lowerEnd: Double,
+  histogram: Histogram): Double = {
+// find bins where current min and max locate
+val lowerBinId = findFirstBinForValue(lowerEnd, histogram.bins)
+val higherBinId = findLastBinForValue(higherEnd, histogram.bins)
+assert(lowerBinId <= higherBinId)
+
+// compute how much current [lowerEnd, higherEnd] range occupies the 
histogram in the
+// number of bins
+getOccupationBins(higherBinId, lowerBinId, higherEnd, lowerEnd, 
histogram)
+  }
+

[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2017-12-04 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
gentle ping @zsxwing 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/19885
  
Is this assumption based on the implementation of Hadoop `FileSystem`? I 
was thinking that wasb is an exception; for the others we still keep the original 
code.

@steveloughran would you please comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19713: [SPARK-22488] [SQL] Fix the view resolution issue in the...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/19713
  
Hi, All.
It seems we missed the `DROP TABLE IF EXISTS` cases here.
(I also forgot to check that during the 2.2 backporting of this PR.)
This PR swallows `NoSuchTableException` and emits `AnalysisException`, while 
`DROP TABLE IF EXISTS` expects `NoSuchTableException` in ddl.scala. It becomes 
a regression in 2.2.1.
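
For illustration, a hedged sketch of the cause check this implies (the helper name `isMissingTable` is hypothetical; the actual change is the #19888 diff quoted elsewhere in this thread):

```scala
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

// Hypothetical helper, sketching the idea only: treat an AnalysisException that
// wraps NoSuchTableException the same as a bare NoSuchTableException, so that
// DROP TABLE IF EXISTS can keep swallowing the "table not found" case.
def isMissingTable(e: Throwable): Boolean = e match {
  case _: NoSuchTableException => true
  case ae: AnalysisException   => ae.cause.exists(_.isInstanceOf[NoSuchTableException])
  case _                       => false
}
```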


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154848509
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
+}
+binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = 
{
+var binId = 0
+for (i <- bins.indices) {
+  if (value > bins(i).hi) {
+// increment binId to point to next bin
+binId += 1
+  }
+  if ((value == bins(i).hi) && (i < bins.length - 1) && (value == 
bins(i + 1).lo)) {
+// We assume the above 3 conditions will be evaluated from left to 
right sequentially.
+// If the above 3 conditions are evaluated out-of-order, then 
out-of-bound error may happen.
+// At that time, we should split the third condition into another 
if statement.
+// increment binId since the value appears in this bin and next bin
+binId += 1
+  }
+}
+binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the 
range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column 
value range
+   * @param lowerValue a given lower bound value of a specified column 
value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, 
higherValue].
+   */
+  private def getOccupation(
+  binId: Int,
+  higherValue: Double,
+  lowerValue: Double,
+  histogram: Histogram): Double = {
+val curBin = histogram.bins(binId)
+if (binId == 0 && curBin.hi == curBin.lo) {
+  // the Min of the histogram occupies the whole first bin
+  1.0
+} else if (binId == 0 && curBin.hi != curBin.lo) {
+  if (higherValue == lowerValue) {
+// set percentage to 1/NDV
+1.0 / curBin.ndv.toDouble
+  } else {
+// Use proration since the range falls inside this bin.
+(higherValue - lowerValue) / (curBin.hi - curBin.lo)
--- End diff --

why do we need to specialize it?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19810: [SPARK-22599][SQL] In-Memory Table Pruning withou...

2017-12-04 Thread sadikovi
Github user sadikovi commented on a diff in the pull request:

https://github.com/apache/spark/pull/19810#discussion_r154848470
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -193,38 +195,68 @@ case class InMemoryTableScanExec(
 
   private val inMemoryPartitionPruningEnabled = 
sqlContext.conf.inMemoryPartitionPruning
 
+  private def doFilterCachedBatches(
+  rdd: RDD[CachedBatch],
+  partitionStatsSchema: Seq[AttributeReference]): RDD[CachedBatch] = {
+val schemaIndex = partitionStatsSchema.zipWithIndex
+rdd.mapPartitionsWithIndex {
+  case (partitionIndex, cachedBatches) =>
+if (inMemoryPartitionPruningEnabled) {
+  cachedBatches.filter { cachedBatch =>
+val partitionFilter = newPredicate(
+  partitionFilters.reduceOption(And).getOrElse(Literal(true)),
+  partitionStatsSchema)
+partitionFilter.initialize(partitionIndex)
+if (!partitionFilter.eval(cachedBatch.stats)) {
--- End diff --

All good, no need to change. I was trying to understand the code, so my 
question refers to statistics collection overall, not to the changes in this PR. 
The link points to a condition (it exists before this PR) that could potentially 
result in exiting the iterator before exhausting all records in it, so statistics 
would be partially collected, which might affect any filtering that uses such 
statistics - though it is quite possibly handled later, or a theoretical use case.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154848428
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
+}
+binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = 
{
+var binId = 0
+for (i <- bins.indices) {
+  if (value > bins(i).hi) {
+// increment binId to point to next bin
+binId += 1
+  }
+  if ((value == bins(i).hi) && (i < bins.length - 1) && (value == 
bins(i + 1).lo)) {
+// We assume the above 3 conditions will be evaluated from left to 
right sequentially.
+// If the above 3 conditions are evaluated out-of-order, then 
out-of-bound error may happen.
+// At that time, we should split the third condition into another 
if statement.
+// increment binId since the value appears in this bin and next bin
+binId += 1
+  }
+}
+binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the 
range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column 
value range
+   * @param lowerValue a given lower bound value of a specified column 
value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, 
higherValue].
+   */
+  private def getOccupation(
+  binId: Int,
+  higherValue: Double,
+  lowerValue: Double,
+  histogram: Histogram): Double = {
+val curBin = histogram.bins(binId)
+if (binId == 0 && curBin.hi == curBin.lo) {
+  // the Min of the histogram occupies the whole first bin
+  1.0
+} else if (binId == 0 && curBin.hi != curBin.lo) {
+  if (higherValue == lowerValue) {
+// set percentage to 1/NDV
+1.0 / curBin.ndv.toDouble
+  } else {
+// Use proration since the range falls inside this bin.
+(higherValue - lowerValue) / (curBin.hi - curBin.lo)
--- End diff --

this is the only branch we need to specialize for `binId=0`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154848336
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

@cloud-fan. To avoid that issue, the new OrcFileFormat is intentionally not 
registered.
@HyukjinKwon's comment is correct.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should not throw...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19888
  
**[Test build #84454 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84454/testReport)**
 for PR 19888 at commit 
[`2b113a2`](https://github.com/apache/spark/commit/2b113a2871dc732d74b0564017bed15e53b297ab).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154848277
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
+}
+binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = 
{
+var binId = 0
+for (i <- bins.indices) {
+  if (value > bins(i).hi) {
+// increment binId to point to next bin
+binId += 1
+  }
+  if ((value == bins(i).hi) && (i < bins.length - 1) && (value == 
bins(i + 1).lo)) {
+// We assume the above 3 conditions will be evaluated from left to 
right sequentially.
+// If the above 3 conditions are evaluated out-of-order, then 
out-of-bound error may happen.
+// At that time, we should split the third condition into another 
if statement.
+// increment binId since the value appears in this bin and next bin
+binId += 1
+  }
+}
+binId
+  }
+
+  /**
+   * Returns a percentage of a bin holding values for column value in the 
range of
+   * [lowerValue, higherValue]
+   *
+   * @param binId a given bin id in a specified histogram
+   * @param higherValue a given upper bound value of a specified column 
value range
+   * @param lowerValue a given lower bound value of a specified column 
value range
+   * @param histogram a numeric equi-height histogram
+   * @return the percentage of a single bin holding values in [lowerValue, 
higherValue].
+   */
+  private def getOccupation(
+  binId: Int,
+  higherValue: Double,
+  lowerValue: Double,
+  histogram: Histogram): Double = {
--- End diff --

instead of accepting a `binId` and `histogram`, can't we just ask the 
caller side to pass a `HistogramBin`?
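
For illustration only, a sketch of what the suggested signature might look like (the local `HistogramBin` case class stands in for the PR's bin type; this is not the PR's code and it leaves out the `binId == 0` specialization discussed elsewhere in this review):

```scala
// Minimal stand-in for the PR's HistogramBin(lo, hi, ndv) so the sketch is self-contained.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

// The caller looks up the bin and passes it in, so this helper no longer needs
// a binId and the whole histogram.
def getOccupation(bin: HistogramBin, higherValue: Double, lowerValue: Double): Double = {
  if (bin.hi == bin.lo) {
    1.0                                    // a single-point bin is fully covered
  } else if (higherValue == lowerValue) {
    1.0 / bin.ndv.toDouble                 // a point value inside the bin: assume 1/NDV
  } else {
    // prorate the overlap of [lowerValue, higherValue] with the bin's range
    math.min((higherValue - lowerValue) / (bin.hi - bin.lo), 1.0)
  }
}
```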


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154848218
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

I was looking at the exact same path. It seems not, because it's not 
registered to `ServiceLoader` 
(`src/main/resources/org.apache.spark.sql.sources.DataSourceRegister`). So, 
the short name for the newer ORC source would not be found or used.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19888: [SPARK-22686][SQL] DROP TABLE IF EXISTS should no...

2017-12-04 Thread dongjoon-hyun
GitHub user dongjoon-hyun opened a pull request:

https://github.com/apache/spark/pull/19888

[SPARK-22686][SQL] DROP TABLE IF EXISTS should not throw AnalysisException

## What changes were proposed in this pull request?

During [SPARK-22488](https://github.com/apache/spark/pull/19713), which fixes 
the view resolution issue, a regression occurs on the `2.2.1` and `master` branches, 
as shown below. This PR fixes that.

```scala
scala> spark.version
res2: String = 2.2.1

scala> sql("DROP TABLE IF EXISTS t").show
17/12/04 21:01:06 WARN DropTableCommand: 
org.apache.spark.sql.AnalysisException: Table or view not found: t;
org.apache.spark.sql.AnalysisException: Table or view not found: t;
```

## How was this patch tested?

Manual.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-22686

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19888.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19888


commit 2b113a2871dc732d74b0564017bed15e53b297ab
Author: Dongjoon Hyun 
Date:   2017-12-05T05:01:53Z

[SPARK-22686][SQL] DROP TABLE IF EXISTS should not throw AnalysisException




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154848018
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
+}
+binId
+  }
+
+  /**
+   * Returns the number of the last bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the last bin into which a column values falls.
+   */
+  def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = 
{
--- End diff --

Why is this method so different from `findFirstBinForValue`? It looks like 
we just need to reverse the iteration order, i.e. from `bins.length` to `0`. 
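
A minimal sketch of that reversed-iteration idea (the local `HistogramBin` stands in for the PR's bin type; values outside the histogram range are not handled here):

```scala
// Stand-in for the PR's HistogramBin(lo, hi, ndv) so the sketch is self-contained.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

def findLastBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  // Walk from the last bin towards the first and stop at the last bin whose
  // lower bound does not exceed the value.
  var i = bins.length - 1
  while (i > 0 && value < bins(i).lo) {
    i -= 1
  }
  i
}
```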


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154847802
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
--- End diff --

this looks more like a while loop pattern, can we use while loop here?
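
For illustration, a sketch of the while-loop form (assuming bins are sorted by their upper bound, as in an equi-height histogram; the local `HistogramBin` stands in for the PR's type):

```scala
// Stand-in for the PR's HistogramBin(lo, hi, ndv) so the sketch is self-contained.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  // Stop at the first bin whose upper bound reaches the value, instead of
  // scanning every bin as the foreach version does.
  var binId = 0
  while (binId < bins.length && value > bins(binId).hi) {
    binId += 1
  }
  binId
}
```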


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19783: [SPARK-21322][SQL] support histogram in filter ca...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19783#discussion_r154847703
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala
 ---
@@ -114,4 +114,194 @@ object EstimationUtils {
 }
   }
 
+  /**
+   * Returns the number of the first bin into which a column values falls 
for a specified
+   * numeric equi-height histogram.
+   *
+   * @param value a literal value of a column
+   * @param bins an array of bins for a given numeric equi-height histogram
+   * @return the number of the first bin into which a column values falls.
+   */
+  def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int 
= {
+var binId = 0
+bins.foreach { bin =>
+  if (value > bin.hi) binId += 1
--- End diff --

shall we use binary search here?
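
A sketch of the binary-search variant, for illustration (again assuming bins sorted by their upper bound; the local `HistogramBin` stands in for the PR's type):

```scala
// Stand-in for the PR's HistogramBin(lo, hi, ndv) so the sketch is self-contained.
case class HistogramBin(lo: Double, hi: Double, ndv: Long)

def findFirstBinForValue(value: Double, bins: Array[HistogramBin]): Int = {
  // Lower-bound binary search: index of the first bin with bins(i).hi >= value,
  // or bins.length if the value is above every bin.
  var left = 0
  var right = bins.length
  while (left < right) {
    val mid = left + (right - left) / 2
    if (value > bins(mid).hi) left = mid + 1 else right = mid
  }
  left
}
```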


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19871: [SPARK-20728][SQL] Make OrcFileFormat configurabl...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19871#discussion_r154847064
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala
 ---
@@ -568,8 +574,13 @@ object DataSource extends Logging {
 "org.apache.spark.Logging")
 
   /** Given a provider name, look up the data source class definition. */
-  def lookupDataSource(provider: String): Class[_] = {
-val provider1 = backwardCompatibilityMap.getOrElse(provider, provider)
+  def lookupDataSource(provider: String, conf: SQLConf): Class[_] = {
+val provider1 = backwardCompatibilityMap.getOrElse(provider, provider) 
match {
+  case name if name.equalsIgnoreCase("orc") &&
+  conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native" =>
+classOf[OrcFileFormat].getCanonicalName
+  case name => name
--- End diff --

if `ORC_IMPLEMENTATION` is `hive`, we leave the provider as it was, which 
may be `orc`. Then we will hit the `Multiple sources found` issue, won't we? Both 
the old and the new orc have the same short name `orc`.
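
To illustrate why that would be ambiguous, a rough sketch of the ServiceLoader-based short-name lookup that `DataSource.lookupDataSource` relies on (illustration only, not the actual lookup code):

```scala
import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// If both the Hive-based and the new OrcFileFormat advertised shortName() == "orc"
// through the DataSourceRegister service file, this filter would return two classes
// and the lookup would fail with "Multiple sources found".
val loader = Thread.currentThread().getContextClassLoader
val matches = ServiceLoader.load(classOf[DataSourceRegister], loader)
  .asScala
  .filter(_.shortName().equalsIgnoreCase("orc"))
  .toList
println(matches.map(_.getClass.getName))
```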


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19885: [SPARK-22587] Spark job fails if fs.defaultFS and applic...

2017-12-04 Thread merlintang
Github user merlintang commented on the issue:

https://github.com/apache/spark/pull/19885
  
@jerryshao yes, hdfs://us...@nn1.com:8020 and hdfs://us...@nn1.com:8020 
would be considered as two filesystems, since the authority information should be 
taken into consideration. That is why we need to add the authority when checking 
whether two FS are the same.
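
As a small illustration of that point (not the PR's code), two paths resolve to the same `FileSystem` only when both the scheme and the authority match:

```scala
import java.net.URI

// Illustrative check only: compare scheme and authority (which includes host,
// port, and user info) to decide whether two URIs belong to the same filesystem.
def sameFileSystem(a: URI, b: URI): Boolean = {
  def norm(s: String) = Option(s).map(_.toLowerCase).getOrElse("")
  norm(a.getScheme) == norm(b.getScheme) && norm(a.getAuthority) == norm(b.getAuthority)
}

// Example: same namenode vs. a different one (hostnames here are made up).
sameFileSystem(new URI("hdfs://nn1.com:8020/tmp"), new URI("hdfs://nn1.com:8020/data"))  // true
sameFileSystem(new URI("hdfs://nn1.com:8020/tmp"), new URI("hdfs://nn2.com:8020/tmp"))   // false
```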


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19878


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19717: [SPARK-22646] [Submission] Spark on Kubernetes - ...

2017-12-04 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19717#discussion_r154846305
  
--- Diff: 
core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -388,6 +388,32 @@ class SparkSubmitSuite
 conf.get("spark.ui.enabled") should be ("false")
   }
 
+  test("handles k8s cluster mode") {
+val clArgs = Seq(
+  "--deploy-mode", "cluster",
+  "--master", "k8s://host:port",
+  "--executor-memory", "5g",
+  "--class", "org.SomeClass",
+  "--kubernetes-namespace", "foo",
+  "--driver-memory", "4g",
+  "--conf", "spark.kubernetes.driver.docker.image=bar",
+  "/home/thejar.jar",
+  "arg1")
+val appArgs = new SparkSubmitArguments(clArgs)
+val (childArgs, classpath, conf, mainClass) = 
prepareSubmitEnvironment(appArgs)
+
+val childArgsMap = childArgs.grouped(2).map(a => a(0) -> a(1)).toMap
+childArgsMap.get("--primary-java-resource") should be 
(Some("file:/home/thejar.jar"))
+childArgsMap.get("--main-class") should be (Some("org.SomeClass"))
+childArgsMap.get("--arg") should be (Some("arg1"))
+mainClass should be ("org.apache.spark.deploy.k8s.submit.Client")
+classpath should have length (0)
+conf.get("spark.executor.memory") should be ("5g")
--- End diff --

Done.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19878: [SPARK-22682][SQL] HashExpression does not need to creat...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19878
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19878: [SPARK-22682][SQL] HashExpression does not need t...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19878#discussion_r154845901
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala
 ---
@@ -730,23 +776,29 @@ case class HiveHash(children: Seq[Expression]) 
extends HashExpression[Int] {
   input: String,
   result: String,
   fields: Array[StructField]): String = {
-val localResult = ctx.freshName("localResult")
 val childResult = ctx.freshName("childResult")
-fields.zipWithIndex.map { case (field, index) =>
+val fieldsHash = fields.zipWithIndex.map { case (field, index) =>
+  val computeFieldHash = nullSafeElementHash(
+input, index.toString, field.nullable, field.dataType, 
childResult, ctx)
   s"""
- $childResult = 0;
- ${nullSafeElementHash(input, index.toString, field.nullable, 
field.dataType,
-   childResult, ctx)}
- $localResult = (31 * $localResult) + $childResult;
-   """
-}.mkString(
-  s"""
- int $localResult = 0;
- int $childResult = 0;
-   """,
-  "",
-  s"$result = (31 * $result) + $localResult;"
-)
+ |$childResult = 0;
+ |$computeFieldHash
+ |$result = (31 * $result) + $childResult;
+   """.stripMargin
+}
+
+s"${ctx.JAVA_INT} $childResult = 0;\n" + ctx.splitExpressions(
--- End diff --

Yea, the input here is a row that may be produced by `row.getStruct` 
instead of `ctx.INPUT_ROW`, so we don't need this check as the input won't be 
`ctx.currentVars`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19869: [SPARK-22677][SQL] cleanup whole stage codegen fo...

2017-12-04 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19869


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19869: [SPARK-22677][SQL] cleanup whole stage codegen for hash ...

2017-12-04 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19869
  
thanks, merging to master!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19810: [SPARK-22599][SQL] In-Memory Table Pruning without Extra...

2017-12-04 Thread CodingCat
Github user CodingCat commented on the issue:

https://github.com/apache/spark/pull/19810
  
@sadikovi thanks for the review, I replied in the comments.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19810: [SPARK-22599][SQL] In-Memory Table Pruning withou...

2017-12-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19810#discussion_r154845669
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala
 ---
@@ -193,38 +195,68 @@ case class InMemoryTableScanExec(
 
   private val inMemoryPartitionPruningEnabled = 
sqlContext.conf.inMemoryPartitionPruning
 
+  private def doFilterCachedBatches(
+  rdd: RDD[CachedBatch],
+  partitionStatsSchema: Seq[AttributeReference]): RDD[CachedBatch] = {
+val schemaIndex = partitionStatsSchema.zipWithIndex
+rdd.mapPartitionsWithIndex {
+  case (partitionIndex, cachedBatches) =>
+if (inMemoryPartitionPruningEnabled) {
+  cachedBatches.filter { cachedBatch =>
+val partitionFilter = newPredicate(
+  partitionFilters.reduceOption(And).getOrElse(Literal(true)),
+  partitionStatsSchema)
+partitionFilter.initialize(partitionIndex)
+if (!partitionFilter.eval(cachedBatch.stats)) {
--- End diff --

I might not understand your proposal well... are you trying to simplify the 
logic in 
https://github.com/apache/spark/pull/19810/files#diff-5fc188468d3066580ea9a766114b8f1dR74?
 It would make the code simpler but degrade the pruning effect here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84451/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19746
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19746: [SPARK-22346][ML] VectorSizeHint Transformer for using V...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19746
  
**[Test build #84451 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84451/testReport)**
 for PR 19746 at commit 
[`c3d1c5e`](https://github.com/apache/spark/commit/c3d1c5eb2a81124b05a9f7b066f9e8ce46efe217).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84453 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84453/testReport)**
 for PR 19717 at commit 
[`347ed69`](https://github.com/apache/spark/commit/347ed697e3f17546af83a8cd8a9578527cf65384).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19810: [SPARK-22599][SQL] In-Memory Table Pruning withou...

2017-12-04 Thread CodingCat
Github user CodingCat commented on a diff in the pull request:

https://github.com/apache/spark/pull/19810#discussion_r154844579
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala
 ---
@@ -52,6 +52,68 @@ object InMemoryRelation {
 private[columnar]
 case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: 
InternalRow)
 
+private[columnar] class CachedBatchIterator(
+rowIterator: Iterator[InternalRow],
+output: Seq[Attribute],
+batchSize: Int,
+useCompression: Boolean,
+batchStats: LongAccumulator,
+singleBatchPerPartition: Boolean) extends Iterator[CachedBatch] {
+
+  def next(): CachedBatch = {
+val columnBuilders = output.map { attribute =>
+  ColumnBuilder(attribute.dataType, batchSize, attribute.name, 
useCompression)
+}.toArray
+
+var rowCount = 0
+var totalSize = 0L
+
+val terminateLoop = (singleBatch: Boolean, rowIter: 
Iterator[InternalRow],
--- End diff --

To make getting partition stats easier, we construct only one CachedBatch 
for each partition when enabling the functionality proposed in this PR. 
`singleBatch` distinguishes the scenarios which enable/disable the 
functionality by introducing different while-loop termination conditions, 
making the rest of the code reusable.
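
A hypothetical sketch of that kind of termination condition (the PR's actual `terminateLoop` is truncated in the diff above, so the names and limits here are assumptions, not the PR's code):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// With singleBatch enabled, keep consuming the iterator until the partition is
// exhausted so the whole partition lands in one CachedBatch; otherwise stop at
// the usual per-batch row count and byte-size limits.
def keepBuilding(
    singleBatch: Boolean,
    rowIter: Iterator[InternalRow],
    rowCount: Int,
    totalSize: Long,
    batchSize: Int,
    maxBatchBytes: Long): Boolean = {
  if (singleBatch) rowIter.hasNext
  else rowIter.hasNext && rowCount < batchSize && totalSize < maxBatchBytes
}
```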


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19882: [SPARK-22672][SQL][TEST] Refactor ORC Tests

2017-12-04 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/19882#discussion_r154843730
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
 ---
@@ -614,11 +531,63 @@ class OrcQuerySuite extends QueryTest with 
BeforeAndAfterAll with OrcTest {
 }
   }
 
-   test("read from multiple orc input paths") {
- val path1 = Utils.createTempDir()
- val path2 = Utils.createTempDir()
- makeOrcFile((1 to 10).map(Tuple1.apply), path1)
- makeOrcFile((1 to 10).map(Tuple1.apply), path2)
- assertResult(20)(read.orc(path1.getCanonicalPath, 
path2.getCanonicalPath).count())
-   }
+  test("read from multiple orc input paths") {
+val path1 = Utils.createTempDir()
+val path2 = Utils.createTempDir()
+makeOrcFile((1 to 10).map(Tuple1.apply), path1)
+makeOrcFile((1 to 10).map(Tuple1.apply), path2)
+val df = spark.read.format(format).load(path1.getCanonicalPath, 
path2.getCanonicalPath)
+assert(df.count() == 20)
+  }
+}
+
+class OrcQuerySuite extends OrcQueryTest with SharedSQLContext {
+  import testImplicits._
+
+  test("LZO compression options for writing to an ORC file") {
+withTempPath { file =>
+  spark.range(0, 10).write
+.option("compression", "LZO")
+.format(format)
+.save(file.getCanonicalPath)
+
+  val maybeOrcFile = 
file.listFiles().find(_.getName.endsWith(".lzo.orc"))
+  assert(maybeOrcFile.isDefined)
+
+  val orcFilePath = new Path(maybeOrcFile.get.getAbsolutePath)
+  val conf = OrcFile.readerOptions(new Configuration())
+  assert("LZO" === OrcFile.createReader(orcFilePath, 
conf).getCompressionKind.name)
+}
+  }
+
+  test("Schema discovery on empty ORC files") {
+// SPARK-8501 is fixed.
+withTempPath { dir =>
+  val path = dir.getCanonicalPath
+
+  withTable("empty_orc") {
+withTempView("empty", "single") {
+  spark.sql(
+s"""CREATE TABLE empty_orc(key INT, value STRING)
+   |USING $format
+   |LOCATION '${dir.toURI}'
+ """.stripMargin)
+
+  val emptyDF = Seq.empty[(Int, String)].toDF("key", 
"value").coalesce(1)
+  emptyDF.createOrReplaceTempView("empty")
+
+  // This creates 1 empty ORC file with Hive ORC SerDe.  We are 
using this trick because
+  // Spark SQL ORC data source always avoids write empty ORC files.
--- End diff --

Thank you for the review, @viirya. I'll update tomorrow.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19717
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84450/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19717: [SPARK-22646] [Submission] Spark on Kubernetes - basic s...

2017-12-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19717
  
**[Test build #84450 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84450/testReport)**
 for PR 19717 at commit 
[`cdca57e`](https://github.com/apache/spark/commit/cdca57ece845c86ac16d6089460f28e3ff3c298f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


