[GitHub] spark issue #19492: [SPARK-22228][SQL] Add support for array...

2018-01-23 Thread mgaido91
Github user mgaido91 commented on the issue:

https://github.com/apache/spark/pull/19492
  
Kindly ping @viirya


---




[GitHub] spark issue #19978: [SPARK-22784][CORE][WIP] Configure reading buffer size i...

2018-01-23 Thread MikhailErofeev
Github user MikhailErofeev commented on the issue:

https://github.com/apache/spark/pull/19978
  
@srowen, yes,  the processing is no longer IO-bound after backporting 
SPARK-20923


---




[GitHub] spark pull request #19978: [SPARK-22784][CORE][WIP] Configure reading buffer...

2018-01-23 Thread MikhailErofeev
Github user MikhailErofeev closed the pull request at:

https://github.com/apache/spark/pull/19978


---




[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2018-01-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r163472676
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -1313,6 +1331,9 @@ object CodeGenerator extends Logging {
   // This is the value of HugeMethodLimit in the OpenJDK JVM settings
   val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000
 
+  // The max valid length of method parameters in JVM.
+  val MAX_JVM_METHOD_PARAMS_LENGTH = 255
--- End diff --

Make it `final`? I think we can add `final` to all the constants here.
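
A minimal sketch of what that suggestion would look like (illustrative only; the names and values are taken from the diff above, and the object name here is made up, not the actual Spark source):

```scala
// Illustrative sketch: declaring the codegen limits as `final val` so they are
// treated as compile-time constants. Values copied from the diff above.
object CodeGeneratorConstantsSketch {
  // This is the value of HugeMethodLimit in the OpenJDK JVM settings.
  final val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000

  // The maximum valid length of method parameters in the JVM (in descriptor units).
  final val MAX_JVM_METHOD_PARAMS_LENGTH = 255
}
```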


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20376
  
retest this please.


---




[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20377
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86562/
Test PASSed.


---




[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20377
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20376
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/179/
Test PASSed.


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20376
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20377
  
**[Test build #86562 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86562/testReport)**
 for PR 20377 at commit 
[`0574ec7`](https://github.com/apache/spark/commit/0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20376
  
**[Test build #86561 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86561/testReport)**
 for PR 20376 at commit 
[`dfd828b`](https://github.com/apache/spark/commit/dfd828b718bcda85d2b822c8a96ae593c14850e8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20376
  
**[Test build #86574 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86574/testReport)**
 for PR 20376 at commit 
[`dfd828b`](https://github.com/apache/spark/commit/dfd828b718bcda85d2b822c8a96ae593c14850e8).


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20376
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20376: [SPARK-23020][CORE][FOLLOWUP] Fix Java style check issue...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20376
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86561/
Test FAILed.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/178/
Test PASSed.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread xuanyuanking
Github user xuanyuanking commented on the issue:

https://github.com/apache/spark/pull/17702
  
retest this please


---




[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17702
  
**[Test build #86573 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86573/testReport)**
 for PR 17702 at commit 
[`dc373ae`](https://github.com/apache/spark/commit/dc373ae68040b92e7b67c99ce0c793c80b5ab507).


---




[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20380
  
**[Test build #86572 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86572/testReport)**
 for PR 20380 at commit 
[`f56b00d`](https://github.com/apache/spark/commit/f56b00d777eb27ff675bf4e127c83e004a9e34ff).


---




[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20379
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/177/
Test PASSed.


---




[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20380
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20379
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20380: [SPARK-20906][SPARKR] Add API doc example for Constraine...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20380
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/176/
Test PASSed.


---




[GitHub] spark pull request #20380: [SPARK-20906][SPARKR] Add API doc example for Con...

2018-01-23 Thread felixcheung
GitHub user felixcheung opened a pull request:

https://github.com/apache/spark/pull/20380

[SPARK-20906][SPARKR] Add API doc example for Constrained Logistic 
Regression

## What changes were proposed in this pull request?

doc only changes

## How was this patch tested?

manual

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/felixcheung/spark rclrdoc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20380.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20380


commit f56b00d777eb27ff675bf4e127c83e004a9e34ff
Author: Felix Cheung 
Date:   2018-01-24T07:14:20Z

api doc




---




[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20379
  
**[Test build #86571 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86571/testReport)**
 for PR 20379 at commit 
[`a66b5e0`](https://github.com/apache/spark/commit/a66b5e0c4b81444974f02c7154111b47a1a5137c).


---




[GitHub] spark issue #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-p...

2018-01-23 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/20379
  
cc @HyukjinKwon 


---




[GitHub] spark pull request #20379: [SPARK-23177][SQL][PySpark][Backport-2.3] Extract...

2018-01-23 Thread viirya
GitHub user viirya opened a pull request:

https://github.com/apache/spark/pull/20379

[SPARK-23177][SQL][PySpark][Backport-2.3] Extract zero-parameter UDFs from 
aggregate

## What changes were proposed in this pull request?

In the ExtractPythonUDFFromAggregate rule, we extract Python UDFs in a logical aggregate that depend on an aggregate expression or a grouping key. But Python UDFs that don't depend on those expressions should also be extracted, to avoid the issue reported in the JIRA.

A small code snippet to reproduce that issue looks like:
```python
import pyspark.sql.functions as f

df = spark.createDataFrame([(1,2), (3,4)])
f_udf = f.udf(lambda: str("const_str"))
df2 = df.distinct().withColumn("a", f_udf())
df2.show()
```

The following exception is raised:
```
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#50
  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
```

This exception is raised because `HashAggregateExec` tries to bind the aliased Python UDF expression (e.g., `pythonUDF0#50 AS a#44`) to the grouping key.

## How was this patch tested?

Added test.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/viirya/spark-1 SPARK-23177-backport-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20379.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20379


commit a66b5e0c4b81444974f02c7154111b47a1a5137c
Author: Liang-Chi Hsieh 
Date:   2018-01-24T06:50:53Z

Extract parameter-less UDFs from aggregate.




---




[GitHub] spark pull request #20338: [SPARK-23174][BUILD][PYTHON] python code style ch...

2018-01-23 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20338#discussion_r163467168
  
--- Diff: dev/tox.ini ---
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-[pep8]
-ignore=E402,E731,E241,W503,E226
+[pycodestyle]
+ignore=E402,E731,E241,W503,E226,E722,E741,E305
--- End diff --

Yea .. there actually look to be many instances of these. I manually ran it against this PR. I will maybe take this one and open a separate PR after double-checking each rule. I haven't taken a careful look yet.


---




[GitHub] spark pull request #20338: [SPARK-23174][BUILD][PYTHON] python code style ch...

2018-01-23 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/20338#discussion_r163465854
  
--- Diff: dev/tox.ini ---
@@ -13,7 +13,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-[pep8]
-ignore=E402,E731,E241,W503,E226
+[pycodestyle]
+ignore=E402,E731,E241,W503,E226,E722,E741,E305
--- End diff --

Should we fix the code that violates

E722: "do not use bare except",
E741: "ambiguous variable name 'l'",
E305: "expected 2 blank lines after class or function definition"
?

Or is that another PR/JIRA?


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/20372
  
Please fix the scala style checks --

```
Running Scala style checks

Scalastyle checks failed at following occurrences:
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:459:
 File line length exceeds 100 characters
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala:463:
 File line length exceeds 100 characters
[error] Total time: 14 s, completed Jan 23, 2018 10:44:36 PM
[error] running 
/home/jenkins/workspace/SparkPullRequestBuilder/dev/lint-scala ; received 
return code 1
```

and verify locally with `./dev/lint-scala`


---




[GitHub] spark pull request #20367: [SPARK-23166][ML] Add maxDF Parameter to CountVec...

2018-01-23 Thread ymazari
Github user ymazari commented on a diff in the pull request:

https://github.com/apache/spark/pull/20367#discussion_r163465302
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/CountVectorizerSuite.scala ---
@@ -119,6 +119,41 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
     }
   }
 
+  test("CountVectorizer maxDF") {
+    val df = Seq(
+      (0, split("a b c d"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
+      (1, split("a b c"), Vectors.sparse(3, Seq((0, 1.0), (1, 1.0)))),
+      (2, split("a b"), Vectors.sparse(3, Seq((0, 1.0)))),
+      (3, split("a"), Vectors.sparse(3, Seq()))
+    ).toDF("id", "words", "expected")
+
+    // maxDF: ignore terms with count more than 3
+    val cvModel = new CountVectorizer()
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setMaxDF(3)
+      .fit(df)
+    assert(cvModel.vocabulary === Array("b", "c", "d"))
+
+    cvModel.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
+    }
+
+    // maxDF: ignore terms with freq > 0.75
+    val cvModel2 = new CountVectorizer()
+      .setInputCol("words")
+      .setOutputCol("features")
+      .setMaxDF(0.75)
+      .fit(df)
+    assert(cvModel2.vocabulary === Array("b", "c", "d"))
+
+    cvModel2.transform(df).select("features", "expected").collect().foreach {
+      case Row(features: Vector, expected: Vector) =>
+        assert(features ~== expected absTol 1e-14)
--- End diff --

Done.
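
For readers following the thread, a minimal, self-contained sketch of how the new `maxDF` parameter would be used; the `CountVectorizer` calls mirror the test in the diff above, while the SparkSession setup, object name, and data are illustrative assumptions, not part of the PR:

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

object MaxDFUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("maxDF sketch").getOrCreate()
    import spark.implicits._

    // Four documents; the term "a" appears in all four of them.
    val docs = Seq(
      (0, Array("a", "b", "c", "d")),
      (1, Array("a", "b", "c")),
      (2, Array("a", "b")),
      (3, Array("a"))
    ).toDF("id", "words")

    // Drop terms that appear in more than 3 documents ("a" here), as in the test above.
    val model = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setMaxDF(3)
      .fit(docs)

    println(model.vocabulary.mkString(", ")) // expected per the test above: b, c, d
    spark.stop()
  }
}
```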


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86570 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86570/testReport)**
 for PR 18931 at commit 
[`79d0106`](https://github.com/apache/spark/commit/79d010614845a2663353c5e84ff46e2784e3f7b2).


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20338
  
**[Test build #86569 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86569/testReport)**
 for PR 20338 at commit 
[`34a8590`](https://github.com/apache/spark/commit/34a8590b7fa205be3d2b3f4913e252ab0e96d091).


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/175/
Test PASSed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/174/
Test PASSed.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20372
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20372
  
**[Test build #86567 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86567/testReport)**
 for PR 20372 at commit 
[`ef04de9`](https://github.com/apache/spark/commit/ef04de9766584b0a8ab13f290c9850e44570).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20372
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86567/
Test FAILed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20338
  
retest this please


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20372
  
**[Test build #86567 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86567/testReport)**
 for PR 20372 at commit 
[`ef04de9`](https://github.com/apache/spark/commit/ef04de9766584b0a8ab13f290c9850e44570).


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86568 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86568/testReport)**
 for PR 18931 at commit 
[`9f0d1da`](https://github.com/apache/spark/commit/9f0d1da8e6f13e61411755d01e74e86cc399d2e8).


---




[GitHub] spark pull request #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20372#discussion_r163464207
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ---
@@ -445,16 +445,25 @@ case class FileSourceScanExec(
       currentSize = 0
     }
 
-    // Assign files to partitions using "Next Fit Decreasing"
-    splitFiles.foreach { file =>
-      if (currentSize + file.length > maxSplitBytes) {
+    def addFile(file: PartitionedFile): Unit = {
+      currentFiles += file
+      currentSize += file.length + openCostInBytes
+    }
+
+    var frontIndex = 0
+    var backIndex = splitFiles.length - 1
+
+    while (frontIndex <= backIndex) {
+      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(frontIndex))
+        frontIndex += 1
+      }
+      while (backIndex > frontIndex && currentSize + splitFiles(backIndex).length <= maxSplitBytes) {
+        addFile(splitFiles(backIndex))
+        backIndex -= 1
+      }
       closePartition()
-      }
-      // Add the given file to the current partition.
-      currentSize += file.length + openCostInBytes
--- End diff --

I saw you added a commit to handle the large non-splittable files case -- can you please add a test for that also?

I want to make sure this while loop never becomes an infinite loop!
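
To illustrate the concern, here is a small, self-contained sketch of the two-pointer merging loop from the diff above, modelling files as plain Longs. The final progress guard that force-adds an oversized file is an illustrative assumption, not the PR's actual fix; without some handling like it, a single file larger than maxSplitBytes would keep the outer loop spinning:

```scala
import scala.collection.mutable.ArrayBuffer

// Standalone sketch only: file lengths are plain Longs, not PartitionedFile,
// and the force-add guard below is illustrative, not taken from the PR.
object BlockMergingSketch {
  def merge(fileSizes: Seq[Long], maxSplitBytes: Long, openCostInBytes: Long): Seq[Seq[Long]] = {
    val splitFiles = fileSizes.sorted(Ordering[Long].reverse) // largest first, as in "Next Fit Decreasing"
    val partitions = ArrayBuffer.empty[Seq[Long]]
    val currentFiles = ArrayBuffer.empty[Long]
    var currentSize = 0L

    def closePartition(): Unit = {
      if (currentFiles.nonEmpty) partitions += currentFiles.toList
      currentFiles.clear()
      currentSize = 0L
    }
    def addFile(size: Long): Unit = {
      currentFiles += size
      currentSize += size + openCostInBytes
    }

    var frontIndex = 0
    var backIndex = splitFiles.length - 1
    while (frontIndex <= backIndex) {
      // Fill with large files from the front while they fit...
      while (frontIndex <= backIndex && currentSize + splitFiles(frontIndex) <= maxSplitBytes) {
        addFile(splitFiles(frontIndex))
        frontIndex += 1
      }
      // ...then pad with small files from the back.
      while (backIndex > frontIndex && currentSize + splitFiles(backIndex) <= maxSplitBytes) {
        addFile(splitFiles(backIndex))
        backIndex -= 1
      }
      // Progress guard (illustrative): if nothing fit, the front file alone exceeds
      // maxSplitBytes, so place it in its own partition to guarantee termination.
      if (currentFiles.isEmpty && frontIndex <= backIndex) {
        addFile(splitFiles(frontIndex))
        frontIndex += 1
      }
      closePartition()
    }
    partitions.toList
  }

  def main(args: Array[String]): Unit = {
    // A non-splittable 500-byte file with maxSplitBytes = 100 must still terminate.
    println(merge(Seq(500L, 10L, 10L), maxSplitBytes = 100L, openCostInBytes = 1L))
  }
}
```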


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/173/
Test PASSed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86560/
Test FAILed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86559/
Test FAILed.


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20368
  
Thanks! Reverted it from master/2.3.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20338
  
**[Test build #86560 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86560/testReport)**
 for PR 20338 at commit 
[`34a8590`](https://github.com/apache/spark/commit/34a8590b7fa205be3d2b3f4913e252ab0e96d091).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20338
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20338: [SPARK-23174][BUILD][PYTHON] python code style checker u...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20338
  
**[Test build #86559 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86559/testReport)**
 for PR 20338 at commit 
[`d10fbb4`](https://github.com/apache/spark/commit/d10fbb49cc6b062bfa7d463f0bae098c77b9a890).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20372: Improved block merging logic for partitions

2018-01-23 Thread ash211
Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/20372
  
Jenkins, this is ok to test


---




[GitHub] spark pull request #20037: [SPARK-22849] ivy.retrieve pattern should also co...

2018-01-23 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/20037#discussion_r163463718
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -1271,7 +1271,7 @@ private[spark] object SparkSubmitUtils {
 // retrieve all resolved dependencies
 ivy.retrieve(rr.getModuleDescriptor.getModuleRevisionId,
   packagesDirectory.getAbsolutePath + File.separator +
-"[organization]_[artifact]-[revision].[ext]",
+"[organization]_[artifact]-[revision](-[classifier]).[ext]",
--- End diff --

I tried it today. Somehow, I only got the test jar downloaded. Have you 
guys seen this issue?
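
For context on what the added `(-[classifier])` token buys us, here is a small, self-contained sketch of how the retrieve pattern expands; the coordinates below are made-up examples, not from the PR:

```scala
// Illustrative only: expanding the Ivy retrieve pattern
// "[organization]_[artifact]-[revision](-[classifier]).[ext]" by hand.
// Without the optional classifier token, a main jar and its "tests" jar could
// map to the same file name and overwrite each other on retrieve.
object RetrievePatternSketch {
  def expand(org: String, artifact: String, rev: String, classifier: Option[String], ext: String): String = {
    val classifierPart = classifier.map("-" + _).getOrElse("") // "(-[classifier])" is optional
    s"${org}_$artifact-$rev$classifierPart.$ext"
  }

  def main(args: Array[String]): Unit = {
    println(expand("com.example", "mylib", "1.0.0", None, "jar"))          // com.example_mylib-1.0.0.jar
    println(expand("com.example", "mylib", "1.0.0", Some("tests"), "jar")) // com.example_mylib-1.0.0-tests.jar
  }
}
```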


---




[GitHub] spark issue #20352: [SPARK-21727][R] Allow multi-element atomic vector as co...

2018-01-23 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/20352
  
Merged to master/2.3. We could revisit the migration guide if necessary. Thanks!


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20368
  
Oh, thank you so much for fast recovery, @gatorsmile .


---




[GitHub] spark pull request #20352: [SPARK-21727][R] Allow multi-element atomic vecto...

2018-01-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20352


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18931
  
**[Test build #86566 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86566/testReport)**
 for PR 18931 at commit 
[`2f2d1fd`](https://github.com/apache/spark/commit/2f2d1fd6a5844465e3d75b61c0e8b19fa93ada3e).


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20368
  
I am reverting this PR.


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #18931: [SPARK-21717][SQL] Decouple consume functions of physica...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/18931
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/172/
Test PASSed.


---




[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2018-01-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r163462731
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
@@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan {
 
     ctx.freshNamePrefix = parent.variablePrefix
     val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs)
+
+    // Under certain conditions, we can put the logic to consume the rows of this operator into
+    // another function. So we can prevent a generated function too long to be optimized by JIT.
+    // The conditions:
+    // 1. The parent uses all variables in output. we can't defer variable evaluation when consume
+    //    in another function.
+    // 2. The output variables are not empty. If it's empty, we don't bother to do that.
+    // 3. We don't use row variable. The construction of row uses deferred variable evaluation. We
+    //    can't do it.
+    // 4. The number of output variables must less than maximum number of parameters in Java method
+    //    declaration.
+    val requireAllOutput = output.forall(parent.usedInputs.contains(_))
+    val consumeFunc =
+      if (row == null && outputVars.nonEmpty && requireAllOutput && isValidParamLength(ctx)) {
+        constructDoConsumeFunction(ctx, inputVars)
+      } else {
+        parent.doConsume(ctx, inputVars, rowVar)
+      }
     s"""
        |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")}
        |$evaluated
-       |${parent.doConsume(ctx, inputVars, rowVar)}
+       |$consumeFunc
      """.stripMargin
   }
 
+  /**
+   * In Java, a method descriptor is valid only if it represents method parameters with a total
+   * length of 255 or less. `this` contributes one unit and a parameter of type long or double
+   * contributes two units. Besides, for nullable parameters, we also need to pass a boolean
+   * for the null status.
+   */
+  private def isValidParamLength(ctx: CodegenContext): Boolean = {
--- End diff --

Yes. Put it in `CodegenContext`.
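
As a standalone illustration of the rule being discussed (a JVM method descriptor allows at most 255 parameter units: `this` counts 1, a long or double counts 2, and each nullable argument needs one extra boolean), here is a small self-contained sketch; the attribute/type model below is made up for the example and is not Spark's API:

```scala
// Self-contained sketch of the parameter-length check quoted in this thread.
object ParamLengthSketch {
  sealed trait JavaType
  case object JLong extends JavaType
  case object JDouble extends JavaType
  case object JOther extends JavaType
  final case class Attr(javaType: JavaType, nullable: Boolean)

  def isValidParamLength(output: Seq[Attr]): Boolean = {
    // Start value is 1 for `this`.
    output.foldLeft(1) { case (curLength, attr) =>
      attr.javaType match {
        case JLong | JDouble if !attr.nullable => curLength + 2
        case JLong | JDouble                   => curLength + 3 // value (2) + isNull boolean (1)
        case _ if !attr.nullable               => curLength + 1
        case _                                 => curLength + 2 // value (1) + isNull boolean (1)
      }
    } <= 255
  }

  def main(args: Array[String]): Unit = {
    val narrow = Seq.fill(10)(Attr(JLong, nullable = true))   // 1 + 10 * 3 = 31   -> valid
    val wide   = Seq.fill(130)(Attr(JLong, nullable = false)) // 1 + 130 * 2 = 261 -> too long
    println(isValidParamLength(narrow)) // true
    println(isValidParamLength(wide))   // false
  }
}
```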


---




[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2018-01-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r163462688
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
@@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan {
 
     ctx.freshNamePrefix = parent.variablePrefix
     val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs)
+
+    // Under certain conditions, we can put the logic to consume the rows of this operator into
+    // another function. So we can prevent a generated function too long to be optimized by JIT.
+    // The conditions:
+    // 1. The parent uses all variables in output. we can't defer variable evaluation when consume
+    //    in another function.
+    // 2. The output variables are not empty. If it's empty, we don't bother to do that.
+    // 3. We don't use row variable. The construction of row uses deferred variable evaluation. We
+    //    can't do it.
+    // 4. The number of output variables must less than maximum number of parameters in Java method
+    //    declaration.
+    val requireAllOutput = output.forall(parent.usedInputs.contains(_))
+    val consumeFunc =
+      if (row == null && outputVars.nonEmpty && requireAllOutput && isValidParamLength(ctx)) {
+        constructDoConsumeFunction(ctx, inputVars)
+      } else {
+        parent.doConsume(ctx, inputVars, rowVar)
+      }
     s"""
        |${ctx.registerComment(s"CONSUME: ${parent.simpleString}")}
        |$evaluated
-       |${parent.doConsume(ctx, inputVars, rowVar)}
+       |$consumeFunc
      """.stripMargin
   }
 
+  /**
+   * In Java, a method descriptor is valid only if it represents method parameters with a total
+   * length of 255 or less. `this` contributes one unit and a parameter of type long or double
+   * contributes two units. Besides, for nullable parameters, we also need to pass a boolean
+   * for the null status.
+   */
+  private def isValidParamLength(ctx: CodegenContext): Boolean = {
+    // Start value is 1 for `this`.
+    output.foldLeft(1) { case (curLength, attr) =>
+      ctx.javaType(attr.dataType) match {
+        case (ctx.JAVA_LONG | ctx.JAVA_DOUBLE) if !attr.nullable => curLength + 2
+        case ctx.JAVA_LONG | ctx.JAVA_DOUBLE => curLength + 3
+        case _ if !attr.nullable => curLength + 1
+        case _ => curLength + 2
+      }
+    } <= 255
+  }
+
+  /**
+   * To prevent concatenated function growing too long to be optimized by JIT. We can separate the
+   * parent's `doConsume` codes of a `CodegenSupport` operator into a function to call.
+   */
+  private def constructDoConsumeFunction(
+      ctx: CodegenContext,
+      inputVars: Seq[ExprCode]): String = {
+    val (callingParams, arguList, inputVarsInFunc) =
+      constructConsumeParameters(ctx, output, inputVars)
+    val rowVar = ExprCode("", "false", "unsafeRow")
+    val doConsume = ctx.freshName("doConsume")
+    val doConsumeFuncName = ctx.addNewFunction(doConsume,
+      s"""
+         | private void $doConsume($arguList) throws java.io.IOException {
+         |   ${parent.doConsume(ctx, inputVarsInFunc, rowVar)}
+         | }
+       """.stripMargin)
+
+    s"""
+       | $doConsumeFuncName($callingParams);
+     """.stripMargin
+  }
+
+  /**
+   * Returns source code for calling consume function and the argument list of the consume function
+   * and also the `ExprCode` for the argument list.
+   */
+  private def constructConsumeParameters(
+      ctx: CodegenContext,
+      attributes: Seq[Attribute],
+      variables: Seq[ExprCode]): (String, String, Seq[ExprCode]) = {
+    val params = variables.zipWithIndex.map { case (ev, i) =>
+      val arguName = ctx.freshName(s"expr_$i")
+      val arguType = ctx.javaType(attributes(i).dataType)
+
+      val (callingParam, funcParams, arguIsNull) = if (!attributes(i).nullable) {
+        // When the argument is not nullable, we don't need to pass in `isNull` param for it and
+        // simply give a `false`.
+        val arguIsNull = "false"
+        (ev.value, s"$arguType $arguName", arguIsNull)
+      } else {
+        val arguIsNull = ctx.freshName(s"exprIsNull_$i")
+        (ev.value + ", " + ev.isNull, s"$arguType $arguName, boolean $arguIsNull", arguIsNull)
+      }
+      (callingParam, funcParams, ExprCode("", arguIsNull, arguName))
+    }.unzip3
+    (params._1.mkString(", "),
+      params._2.mkString(", "),
+      params._3)
--- End diff --

Done.


---


[GitHub] spark pull request #18931: [SPARK-21717][SQL] Decouple consume functions of ...

2018-01-23 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/18931#discussion_r163462698
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
@@ -149,13 +149,100 @@ trait CodegenSupport extends SparkPlan {
 
     ctx.freshNamePrefix = parent.variablePrefix
     val evaluated = evaluateRequiredVariables(output, inputVars, parent.usedInputs)
+
+    // Under certain conditions, we can put the logic to consume the rows of this operator into
+    // another function. So we can prevent a generated function too long to be optimized by JIT.
+    // The conditions:
+    // 1. The parent uses all variables in output. we can't defer variable evaluation when consume
+    //    in another function.
+    // 2. The output variables are not empty. If it's empty, we don't bother to do that.
+    // 3. We don't use row variable. The construction of row uses deferred variable evaluation. We
+    //    can't do it.
+    // 4. The number of output variables must less than maximum number of parameters in Java method
+    //    declaration.
--- End diff --

Added a config for it so we can turn it off.


---




[GitHub] spark issue #18756: [SPARK-21548][SQL] "Support insert into serial columns o...

2018-01-23 Thread lvdongr
Github user lvdongr commented on the issue:

https://github.com/apache/spark/pull/18756
  
Excuse me, has the concept of a default value been introduced to the schema in the master branch? @gatorsmile Thank you.


---




[GitHub] spark issue #20224: [SPARK-23032][SQL] Add a per-query codegenStageId to Who...

2018-01-23 Thread rednaxelafx
Github user rednaxelafx commented on the issue:

https://github.com/apache/spark/pull/20224
  
BTW, inspired by @cloud-fan 's comment, here's an example of the codegen 
stage IDs when scalar subqueries are involved:

```scala
val sub = "(select sum(id) from range(5))"
val df = spark.sql(s"select $sub as a, $sub as b")
df.explain(true)
```
would give:
```
== Parsed Logical Plan ==
'Project [scalar-subquery#0 [] AS a#1, scalar-subquery#2 [] AS b#3]
:  :- 'Project [unresolvedalias('sum('id), None)]
:  :  +- 'UnresolvedTableValuedFunction range, [5]
:  +- 'Project [unresolvedalias('sum('id), None)]
: +- 'UnresolvedTableValuedFunction range, [5]
+- OneRowRelation

== Analyzed Logical Plan ==
a: bigint, b: bigint
Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L]
:  :- Aggregate [sum(id#14L) AS sum(id)#16L]
:  :  +- Range (0, 5, step=1, splits=None)
:  +- Aggregate [sum(id#17L) AS sum(id)#19L]
: +- Range (0, 5, step=1, splits=None)
+- OneRowRelation

== Optimized Logical Plan ==
Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L]
:  :- Aggregate [sum(id#14L) AS sum(id)#16L]
:  :  +- Range (0, 5, step=1, splits=None)
:  +- Aggregate [sum(id#17L) AS sum(id)#19L]
: +- Range (0, 5, step=1, splits=None)
+- OneRowRelation

== Physical Plan ==
*(1) Project [Subquery subquery0 AS a#1L, Subquery subquery2 AS b#3L]
:  :- Subquery subquery0
:  :  +- *(2) HashAggregate(keys=[], functions=[sum(id#14L)], output=[sum(id)#16L])
:  :     +- Exchange SinglePartition
:  :        +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#14L)], output=[sum#21L])
:  :           +- *(1) Range (0, 5, step=1, splits=8)
:  +- Subquery subquery2
:     +- *(2) HashAggregate(keys=[], functions=[sum(id#17L)], output=[sum(id)#19L])
:        +- Exchange SinglePartition
:           +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#17L)], output=[sum#23L])
:              +- *(1) Range (0, 5, step=1, splits=8)
+- Scan OneRowRelation[]
```

The reason why the IDs look a bit "odd" (that there are three separate 
codegen stages with ID 1) is because the main "spine" query and each individual 
subqueries are "planned" separately, thus they'd run `CollapseCodegenStages` 
separately, each counting up from 1 afresh. I would consider this behavior 
acceptable, but I wonder what others would think in this case.
If this behavior for subqueries is not acceptable, I'll have to find 
alternative places to put the initialization and reset of the thread-local ID 
counter.


---




[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...

2018-01-23 Thread ConeyLiu
Github user ConeyLiu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19285#discussion_r163462053
  
--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -233,17 +235,13 @@ private[spark] class MemoryStore(
     }
 
     if (keepUnrolling) {
-      // We successfully unrolled the entirety of this block
-      val arrayValues = vector.toArray
-      vector = null
-      val entry =
-        new DeserializedMemoryEntry[T](arrayValues, SizeEstimator.estimate(arrayValues), classTag)
-      val size = entry.size
+      // We need more precise value
+      val size = valuesHolder.esitimatedSize(false)
--- End diff --

I changed the code back to the original. For `DeserializedValuesHolder`, we could `buildEntry` and get the `size` from `MemorySize`. But for `SerializedValuesHolder` this does not work correctly, because we need to call `bbos.toChunkedByteBuffer` to get the `MemoryEntry` object, and if the reserved memory is not enough to transfer the unroll memory to storage memory, then the unroll fails and we need to call `bbos.toChunkedByteBuffer` again ([L802](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L802), which seems intentional and is related to #15043). So the problem is that we would call `bbos.toChunkedByteBuffer` twice, which can't be done.


---




[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-23 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163461361
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala
 ---
@@ -365,6 +388,26 @@ case class WholeStageCodegenExec(child: SparkPlan) 
extends UnaryExecNode with Co
 "pipelineTime" -> SQLMetrics.createTimingMetric(sparkContext,
   WholeStageCodegenExec.PIPELINE_DURATION_METRIC))
 
+  /**
+   * ID for codegen stages within a query plan.
+   * It does not affect equality, nor does it participate in destructuring 
pattern matching.
+   *
+   * Within a query, a codegen stage in a plan starts counting from 1, in 
"insertion order".
+   * WholeStageCodegenExec operators are inserted into a plan in 
depth-first post-order.
+   * See CollapseCodegenStages.insertWholeStageCodegen for the definition 
of insertion order.
+   *
+   * 0 is reserved as a special ID value to indicate a temporary 
WholeStageCodegenExec object
+   * is created, e.g. for special fallback handling when an existing 
WholeStageCodegenExec
+   * failed to generate/compile code.
+   */
+  val codegenStageId = WholeStageCodegenId.getNextStageId()
--- End diff --

Thanks for your comment, @cloud-fan ! That's a nice catch that I hadn't 
really thought about.

All the examples that I've run with are ones that wouldn't trigger changes 
to the plan after `CollapseCodegenStages`, which means for those examples 
`ReuseExchange` / `ReuseSubqueries` wouldn't have triggered.
Yes, these two rules could potentially change the physical plan, which 
means when we `transformUp` in those rules (and any future rule after 
`CollapseCodegenStages`) it'd create new `WholeStageCodegenExec` objects 
outside of `CollapseCodegenStages`, and with my current implementation that'll 
result in the WSC copies having a codegen stage ID of 0.

One way to workaround this is to move `CollapseCodegenStages` to always be 
the last rule in `org.apache.spark.sql.execution.QueryExecution#preparations`, 
so that we're sure there's no other transformation on the physical plan that 
could change the structure of the plan, except for fallback handling that could 
happen in a couple of  `doExecute()`s -- these exception cases are expected and 
to me they are acceptable.

If we go down that route, I'll probably have to tweak 
`CollapseCodegenStages` a little bit so that it can cope with the physical 
query plan potentially becoming a DAG instead of a tree, as the `ReuseExchange` 
/ `ReuseSubqueries` rules may do that kind of transformation. This tweak is 
easy to implement and low risk: simply bailing out of the transforming a 
subtree when it sees a `WholeStageCodegenExec` already inserted into the plan 
would suffice.

Let me actually update the PR with this tweak and see what happens in tests.


---




[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86565 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86565/testReport)**
 for PR 19285 at commit 
[`f392217`](https://github.com/apache/spark/commit/f392217e77f3b7f29f3f0728c29a2e041adb0c0e).


---




[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20355
  
Please fix the PR title.


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20368
  
As a result, this seems to block SparkPullRequestBuilder, too.
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/SparkPullRequestBuilder/86557/


---




[GitHub] spark issue #20368: [SPARK-23195] [SQL] Keep the Hint of Cached Data

2018-01-23 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20368
  
Hi, All.

This seems to break both `master` and `branch-2.3`. It might be correlated 
to the sister PR, 
[SPARK-23192](https://github.com/apache/spark/commit/613c290336e382664c24319f66774b1f65a3).
Could you take a look?

- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/4072/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4124/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/4402/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/4454/


---




[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20355
  
**[Test build #86564 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86564/testReport)**
 for PR 20355 at commit 
[`51b0db5`](https://github.com/apache/spark/commit/51b0db5272858995a3b3e389a851f4e4aa04c651).


---




[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20355
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20355: [SPARK-23148][SQL] Allow pathnames with special characte...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20355
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/171/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17702
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86558/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #17702: [SPARK-20408][SQL] Get the glob path in parallel to redu...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17702
  
**[Test build #86558 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86558/testReport)**
 for PR 17702 at commit 
[`dc373ae`](https://github.com/apache/spark/commit/dc373ae68040b92e7b67c99ce0c793c80b5ab507).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20169: [SPARK-17088][hive] Fix 'sharesHadoopClasses' option whe...

2018-01-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20169
  
To be clear, I personally do usually cc the related people. If I were fixing this code, 
I would have cc'ed you, @gatorsmile, because I agree it's good to do. Sure, more eyes 
are better, and I am trying to support you and the other committers so reviews go more smoothly.

I left a comment only because it sounded as if cc'ing a few specific committers was being 
made a requirement, which is not quite what I had in mind.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20378
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20378
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86563/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20378
  
**[Test build #86563 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86563/testReport)**
 for PR 20378 at commit 
[`e69827e`](https://github.com/apache/spark/commit/e69827e17349196382b24ef65b843d0fbc76cc76).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20378
  
**[Test build #86563 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86563/testReport)**
 for PR 20378 at commit 
[`e69827e`](https://github.com/apache/spark/commit/e69827e17349196382b24ef65b843d0fbc76cc76).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20378
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20378: [SPARK-11222][Build][Python] Python document style check...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20378
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/170/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163456017
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala ---
@@ -228,4 +229,21 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext {
      }
    }
  }
+
+  test("including codegen stage ID in generated class name should not regress codegen caching") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> "true") {
+      val bytecodeSizeHisto = CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+      spark.range(3).select('id + 2).collect
+      val after1 = bytecodeSizeHisto.getCount
+      spark.range(3).select('id + 2).collect
+      val after2 = bytecodeSizeHisto.getCount // same query shape as above, deliberately
+      assert(after1 == after2, "the same query run twice should hit the codegen cache")
+
+      spark.range(5).select('id * 2).collect
+      val after3 = bytecodeSizeHisto.getCount
+      assert(after3 >= after2, "a different query can result in codegen cache miss, that's okay")
--- End diff --

makes sense


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20378: [SPARK-11222][Build][Python] Python document styl...

2018-01-23 Thread rekhajoshm
GitHub user rekhajoshm opened a pull request:

https://github.com/apache/spark/pull/20378

[SPARK-11222][Build][Python] Python document style checker added

## What changes were proposed in this pull request?
Use pydocstyle as the Python document (docstring) style checker.
https://github.com/PyCQA/pycodestyle/issues/723

## How was this patch tested?
./dev/run-tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rekhajoshm/spark SPARK-11222-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20378


commit e3677c9fa9697e0d34f9df52442085a6a481c9e9
Author: Rekha Joshi 
Date:   2015-05-05T23:10:08Z

Merge pull request #1 from apache/master

Pulling functionality from apache spark

commit 106fd8eee8f6a6f7c67cfc64f57c1161f76d8f75
Author: Rekha Joshi 
Date:   2015-05-08T21:49:09Z

Merge pull request #2 from apache/master

pull latest from apache spark

commit 0be142d6becba7c09c6eba0b8ea1efe83d649e8c
Author: Rekha Joshi 
Date:   2015-06-22T00:08:08Z

Merge pull request #3 from apache/master

Pulling functionality from apache spark

commit 6c6ee12fd733e3f9902e10faf92ccb78211245e3
Author: Rekha Joshi 
Date:   2015-09-17T01:03:09Z

Merge pull request #4 from apache/master

Pulling functionality from apache spark

commit b123c601e459d1ad17511fd91dd304032154882a
Author: Rekha Joshi 
Date:   2015-11-25T18:50:32Z

Merge pull request #5 from apache/master

pull request from apache/master

commit c73c32aadd6066e631956923725a48d98a18777e
Author: Rekha Joshi 
Date:   2016-03-18T19:13:51Z

Merge pull request #6 from apache/master

pull latest from apache spark

commit 7dbf7320057978526635bed09dabc8cf8657a28a
Author: Rekha Joshi 
Date:   2016-04-05T20:26:40Z

Merge pull request #8 from apache/master

pull latest from apache spark

commit 5e9d71827f8e2e4d07027281b80e4e073e7fecd1
Author: Rekha Joshi 
Date:   2017-05-01T23:00:30Z

Merge pull request #9 from apache/master

Pull apache spark

commit 63d99b3ce5f222d7126133170a373591f0ac67dd
Author: Rekha Joshi 
Date:   2017-09-30T22:26:44Z

Merge pull request #10 from apache/master

pull latest apache spark

commit a7fc787466b71784ff86f9694f617db0f1042da8
Author: Rekha Joshi 
Date:   2018-01-21T00:17:58Z

Merge pull request #11 from apache/master

Apache spark pull latest

commit 88734413c5e33803b7d5f8c0f81899ed1d3577ff
Author: rjoshi2 
Date:   2018-01-21T03:40:59Z

[SPARK-11222][BUILD][PYTHON] python code style checker update

commit 871889113057f819337bd24379cf1f07516c3298
Author: rjoshi2 
Date:   2018-01-22T01:36:42Z

updated for review comment

commit c386a9a9b130e3974dc756b0fa89b7cff93f09ac
Author: rjoshi2 
Date:   2018-01-22T02:26:56Z

[SPARK-23174][BUILD][PYTHON] python code style checker update

commit d10fbb49cc6b062bfa7d463f0bae098c77b9a890
Author: rjoshi2 
Date:   2018-01-24T03:17:05Z

updated nit

commit 34a8590b7fa205be3d2b3f4913e252ab0e96d091
Author: rjoshi2 
Date:   2018-01-24T03:27:42Z

updated for review comment

commit 428fa093a8385eb977535cc7e4848ffe113dc9f3
Author: rjoshi2 
Date:   2018-01-24T05:16:03Z

[SPARK-11222][Build][Python] Doc style checker

commit aab86ab987c01388bcda0412162fb3cf50593ea7
Author: rjoshi2 
Date:   2018-01-24T05:18:27Z

[SPARK-11222][Build][Python] Doc style checker updates

commit e69827e17349196382b24ef65b843d0fbc76cc76
Author: rjoshi2 
Date:   2018-01-24T05:20:11Z

[SPARK-11222][Build][Python] Doc style checker updates




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20355: SPARK-23148: [SQL] Allow pathnames with special c...

2018-01-23 Thread henryr
Github user henryr commented on a diff in the pull request:

https://github.com/apache/spark/pull/20355#discussion_r163455727
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala ---
@@ -78,4 +78,20 @@ class FileBasedDataSourceSuite extends QueryTest with SharedSQLContext {
      }
    }
  }
+
+  // Separate test case for text-based formats that support multiLine as an option.
+  Seq("json", "csv", "text").foreach { format =>
--- End diff --

That sounds good - my main concern is to make sure that both 
`multiLine=true` _and_ `multiLine=false` have coverage with a space in the 
name, since they are such different paths. I'll keep the change that adds a 
space to `nameWithSpecialChars`, but otherwise have the tests as you suggest - 
let me know what you think of the next patch!
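
A rough sketch of the coverage shape being discussed, assuming the usual `SharedSQLContext` test helpers (`withTempDir`, `checkAnswer`, `Row`) and `testImplicits` are in scope; the directory name and formats here are illustrative, not the final test:

// Exercise both multiLine code paths against a directory name containing
// a space, since multiLine=true and multiLine=false use different readers.
Seq("json", "csv").foreach { format =>
  Seq(true, false).foreach { multiLine =>
    withTempDir { dir =>
      val path = new java.io.File(dir, "name with space").getCanonicalPath
      Seq("some value").toDF("col").write.format(format).save(path)
      val loaded = spark.read.option("multiLine", multiLine).format(format).load(path)
      checkAnswer(loaded, Row("some value") :: Nil)
    }
  }
}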


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20377
  
cc @vanzin @cloud-fan @felixcheung @HyukjinKwon 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19285: [SPARK-22068][CORE]Reduce the duplicate code betw...

2018-01-23 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19285#discussion_r163455326
  
--- Diff: core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala ---
@@ -702,6 +645,76 @@ private[spark] class MemoryStore(
   }
 }
 
+private trait ValuesHolder[T] {
+  def storeValue(value: T): Unit
+  def estimatedSize(): Long
+  def buildEntry(): MemoryEntry[T]
+}
+
+/**
+ * A holder for storing the deserialized values.
+ */
+private class DeserializedValuesHolder[T] (classTag: ClassTag[T]) extends ValuesHolder[T] {
+  // Underlying vector for unrolling the block
+  var vector = new SizeTrackingVector[T]()(classTag)
+  var arrayValues: Array[T] = null
+  var preciseSize: Long = -1
--- End diff --

it can be a local variable.
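
A minimal sketch of that suggestion, assuming the `ValuesHolder` shape from the diff above and Spark's existing `SizeEstimator` and `DeserializedMemoryEntry` helpers; the method body is an illustration, not the PR's final code:

// Sketch: compute the precise size where the array is materialized, so the
// holder does not need to carry a `preciseSize` field between calls.
override def buildEntry(): MemoryEntry[T] = {
  val arrayValues = vector.toArray                       // materialize the unrolled values
  val preciseSize = SizeEstimator.estimate(arrayValues)  // local, not a field
  DeserializedMemoryEntry[T](arrayValues, preciseSize, classTag)
}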


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20169: [SPARK-17088][hive] Fix 'sharesHadoopClasses' option whe...

2018-01-23 Thread gatorsmile
Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/20169
  
Let me explain my point here; I am not trying to offend anybody. Spark is becoming more 
and more complex, and peer review and test cases are the main methods we use for quality 
control. Spark is infrastructure software, not application software, so we really need to 
be careful when reviewing PRs. Nobody's code is perfect. For example, if I made a change 
in the SHS, I would ping @vanzin for sure, because @vanzin is the original author and the 
most active committer of the SHS.

Personally, I am doing my best to review changes in SQL. It would be nice if you could 
ping me or @cloud-fan. Since the Spark 2.3 release is the immediate priority, the whole 
community's current focus is on testing and reviewing changes for that release.

Regarding this PR, I think we do not need to introduce the new parameter `warehouseDir`, 
which is already contained in `hadoopConf`. Keeping the interface clean is always good, 
I believe. Please review my follow-up PR: 
https://github.com/apache/spark/pull/20377
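
As a small illustration of the point (the configuration key and helper below are assumptions for illustration, not the actual code in either PR), the warehouse location can be read from the Hadoop configuration that is already passed around:

import org.apache.hadoop.conf.Configuration

// Hypothetical helper: derive the warehouse directory from the Hadoop
// configuration instead of threading a separate `warehouseDir` parameter.
object WarehouseDirFromConf {
  def apply(hadoopConf: Configuration): Option[String] =
    Option(hadoopConf.get("hive.metastore.warehouse.dir"))
}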


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20377
  
**[Test build #86562 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86562/testReport)**
 for PR 20377 at commit 
[`0574ec7`](https://github.com/apache/spark/commit/0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20377
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' opti...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20377
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/169/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20377: [SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasse...

2018-01-23 Thread gatorsmile
GitHub user gatorsmile opened a pull request:

https://github.com/apache/spark/pull/20377

[SPARK-17088] [FOLLOW-UP] Fix 'sharesHadoopClasses' option when creating 
client

## What changes were proposed in this pull request?
This PR removes the unnecessary `warehouseDir` parameter, which is already contained in 
`hadoopConf`.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gatorsmile/spark followUp17088

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20377.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20377


commit 0574ec71d830d8c4b5acdbfc910ddc0f5622b0ed
Author: gatorsmile 
Date:   2018-01-24T05:05:47Z

remove warehousePath




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/86557/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19285
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19285: [SPARK-22068][CORE]Reduce the duplicate code between put...

2018-01-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19285
  
**[Test build #86557 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/86557/testReport)**
 for PR 19285 at commit 
[`c988762`](https://github.com/apache/spark/commit/c9887625e89fa284ddf5a1ca115fe012580d1da8).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20224: [SPARK-23032][SQL] Add a per-query codegenStageId...

2018-01-23 Thread rednaxelafx
Github user rednaxelafx commented on a diff in the pull request:

https://github.com/apache/spark/pull/20224#discussion_r163453639
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala ---
@@ -228,4 +229,21 @@ class WholeStageCodegenSuite extends QueryTest with SharedSQLContext {
      }
    }
  }
+
+  test("including codegen stage ID in generated class name should not regress codegen caching") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.WHOLESTAGE_CODEGEN_USE_ID_IN_CLASS_NAME.key -> "true") {
+      val bytecodeSizeHisto = CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE
+      spark.range(3).select('id + 2).collect
+      val after1 = bytecodeSizeHisto.getCount
+      spark.range(3).select('id + 2).collect
+      val after2 = bytecodeSizeHisto.getCount // same query shape as above, deliberately
+      assert(after1 == after2, "the same query run twice should hit the codegen cache")
+
+      spark.range(5).select('id * 2).collect
+      val after3 = bytecodeSizeHisto.getCount
+      assert(after3 >= after2, "a different query can result in codegen cache miss, that's okay")
--- End diff --

I actually wrote it this way deliberately. Note that I phrased the assertion message as 
"can result in codegen cache miss" instead of "will result in".

That's because the code shape of this third query was deliberately chosen to be similar 
to the two queries before it: all three have 
`spark.range(some_const).select(some_expr).collect`. If any future change to the codegen 
of the `Range` or `Project` operators affects how much specialized code (such as constant 
values) we embed directly into the generated code, it's actually possible for this third 
query to end up generating the same code as the first two, in which case it would hit the 
codegen cache rather than miss it.

So I'm making this check a bit loose. It's just there to indicate that it's acceptable 
for a different query to encounter a codegen cache miss. WYDT?
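
For anyone following the reasoning: the property the test leans on is that compilation is memoized on the generated source text, so the histogram only moves when new source is actually compiled. A toy sketch of that idea follows; it is not the real `CodeGenerator` cache, just an illustration of the behavior being relied on:

import scala.collection.mutable

// Toy source-text-keyed cache: compiling identical source twice reuses the
// first result, so a counter tied to real compilations (like the
// bytecode-size histogram in the test) only advances on cache misses.
object TinyCodegenCache {
  private val cache = mutable.Map.empty[String, AnyRef]
  var compilations = 0

  def compile(source: String)(doCompile: String => AnyRef): AnyRef =
    cache.getOrElseUpdate(source, { compilations += 1; doCompile(source) })
}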


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


