date:20170328

[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17463
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17463
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75342/
Test PASSed.



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17463
  
**[Test build #75342 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75342/testReport)**
 for PR 17463 at commit 
[`c68c285`](https://github.com/apache/spark/commit/c68c285d3daa2c2dc584835989f9d23cd3fe398d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17463
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75340/
Test PASSed.



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17463
  
**[Test build #75340 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75340/testReport)**
 for PR 17463 at commit 
[`89d1d35`](https://github.com/apache/spark/commit/89d1d35562bdb47c54464f31adeddadbe3a3ec1b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17463
  
Merged build finished. Test PASSed.



[GitHub] spark pull request #17430: [SPARK-20096][Spark Submit][Minor]Expose the righ...

2017-03-28 Thread srowen

Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/17430#discussion_r108594377
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -148,6 +148,17 @@ class SparkSubmitSuite
     appArgs.childArgs should be (Seq("--master", "local", "some", "--weird", "args"))
   }
 
+  test("print the right queue name") {
+    val clArgs = Seq(
+      "--name", "myApp",
+      "--class", "Foo",
+      "--conf", "spark.yarn.queue=thequeue",
+      "userjar.jar")
+    val appArgs = new SparkSubmitArguments(clArgs)
+    appArgs.queue should be ("thequeue")
+    appArgs.toString.contains("thequeue") should be (true)
--- End diff --

@yaooqinn while you're here do you want to make this last change?
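
The behaviour this test checks can be sketched outside Spark. The snippet below is a hypothetical Python model, not `SparkSubmitArguments` itself: the `parse_args` function and the shape of its return value are assumptions made purely for illustration. It shows a queue name supplied through `--conf spark.yarn.queue=...` being picked up when no explicit `--queue` option is given.

```python
def parse_args(args):
    """Hypothetical model of the argument handling exercised by the test."""
    options = {}      # values of --name, --class, --queue
    conf = {}         # key/value pairs given through --conf
    positional = []   # everything else, e.g. the application jar
    i = 0
    while i < len(args):
        arg = args[i]
        if arg == "--conf":
            # "--conf key=value": split on the first "=" only.
            key, _, value = args[i + 1].partition("=")
            conf[key] = value
            i += 2
        elif arg in ("--name", "--class", "--queue"):
            options[arg[2:]] = args[i + 1]
            i += 2
        else:
            positional.append(arg)
            i += 1
    # Fall back to the configuration entry when --queue was not passed.
    queue = options.get("queue") or conf.get("spark.yarn.queue")
    return {"queue": queue, "options": options, "conf": conf, "args": positional}


parsed = parse_args([
    "--name", "myApp",
    "--class", "Foo",
    "--conf", "spark.yarn.queue=thequeue",
    "userjar.jar",
])
print(parsed["queue"])
```

In this model an explicit `--queue` value takes precedence over the `--conf` entry; whether Spark applies the same precedence is not stated in the thread.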



[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17435#discussion_r108593342
  
--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):
 
     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           "NullType": "null",
+                           "StringType": "string",
+                           "BinaryType": "binary",
+                           "BooleanType": "boolean",
+                           "DateType": "date",
+                           "TimestampType": "timestamp",
+                           "DecimalType": "decimal",
+                           "DoubleType": "double",
+                           "FloatType": "float",
+                           "ByteType": "byte",
+                           "IntegerType": "integer",
+                           "LongType": "long",
+                           "ShortType": "short",
+                           "ArrayType": "array",
+                           "MapType": "map",
+                           "StructField": "struct",
--- End diff --

Btw, I don't think `i.typeName()` is a valid usage. We'd better let it throw 
an exception when `typeName` is called on a `StructField`.

`i.dataType.typeName()` is a more reasonable call to me.
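
To illustrate the concern, here is a minimal, self-contained sketch that does not import PySpark. The classes below are simplified stand-ins (assumptions, not the real `pyspark.sql.types` definitions); only the `cls.__name__[:-4].lower()` expression is taken from the diff above. It shows why slicing off four characters works for class names ending in `Type` but gives a meaningless name for `StructField`.

```python
class DataType:
    """Simplified stand-in for pyspark.sql.types.DataType (assumption)."""

    @classmethod
    def typeName(cls):
        # Original one-liner from the diff: drop the last four characters,
        # which is the "Type" suffix for most classes.
        return cls.__name__[:-4].lower()


class StringType(DataType):
    pass


class StructField(DataType):
    """Stand-in: a field wraps a data type; its class name does not end in "Type"."""

    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType


field = StructField("title", StringType())

# Calling typeName on the field slices "ield" off "StructField".
print(field.typeName())           # structf
# Calling it on the wrapped data type gives the intended name.
print(field.dataType.typeName())  # string
```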





[GitHub] spark issue #17388: [SPARK-20059][YARN] Use the correct classloader for HBas...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17388
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17388: [SPARK-20059][YARN] Use the correct classloader for HBas...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17388
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75334/
Test PASSed.



[GitHub] spark issue #17388: [SPARK-20059][YARN] Use the correct classloader for HBas...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17388
  
**[Test build #75334 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75334/testReport)**
 for PR 17388 at commit 
[`92d9587`](https://github.com/apache/spark/commit/92d9587f9ac8e3d8c166556ef1b12931b3fc3cfd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17435#discussion_r108593127
  
--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):
 
     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           "NullType": "null",
+                           "StringType": "string",
+                           "BinaryType": "binary",
+                           "BooleanType": "boolean",
+                           "DateType": "date",
+                           "TimestampType": "timestamp",
+                           "DecimalType": "decimal",
+                           "DoubleType": "double",
+                           "FloatType": "float",
+                           "ByteType": "byte",
+                           "IntegerType": "integer",
+                           "LongType": "long",
+                           "ShortType": "short",
+                           "ArrayType": "array",
+                           "MapType": "map",
+                           "StructField": "struct",
--- End diff --

@szalai1 I think @HyukjinKwon's code snippets should address your request. 
Don't they?



[GitHub] spark issue #17346: [SPARK-19965][SS] DataFrame batch reader may fail to inf...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17346
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17346: [SPARK-19965][SS] DataFrame batch reader may fail to inf...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17346
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75336/
Test PASSed.



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread srowen

Github user srowen commented on the issue:

https://github.com/apache/spark/pull/17463
  
I'm just curious why this solves the problem. If the problem is that the 
streaming context doesn't shut down, or doesn't shut down quickly, then I'd 
suspect that shutting down SparkContext is not the slow part, but 
I'm not sure.



[GitHub] spark issue #17346: [SPARK-19965][SS] DataFrame batch reader may fail to inf...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17346
  
**[Test build #75336 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75336/testReport)**
 for PR 17346 at commit 
[`0e35db7`](https://github.com/apache/spark/commit/0e35db701342ff426a037c519e50c17d003931fb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17415
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75335/
Test PASSed.



[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17415
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17415
  
**[Test build #75335 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75335/testReport)**
 for PR 17415 at commit 
[`7abed99`](https://github.com/apache/spark/commit/7abed99271064e27e86f7265a335b9bee0582d3a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17442: [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutput...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17442
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17442: [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutput...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17442
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75341/
Test PASSed.



[GitHub] spark issue #17442: [SPARK-20107][DOC] Add spark.hadoop.mapreduce.fileoutput...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17442
  
**[Test build #75341 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75341/testReport)**
 for PR 17442 at commit 
[`ae222b2`](https://github.com/apache/spark/commit/ae222b2a35a6d8a79c6eaca20d58bd12c7d619d1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17464: [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17464
  
**[Test build #75343 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75343/testReport)**
 for PR 17464 at commit 
[`360e9d7`](https://github.com/apache/spark/commit/360e9d71b1865443fe45920cd938fe8e8d1354e3).



[GitHub] spark pull request #17464: [SPARK-20134][SQL] SQLMetrics.postDriverMetricUpd...

2017-03-28 Thread rxin

GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/17464

[SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify driver 
side metric updates

## What changes were proposed in this pull request?
It is not super intuitive how to update SQLMetric on the driver side. This 
patch introduces a new SQLMetrics.postDriverMetricUpdates function to do that, 
and adds documentation to make it more obvious.

## How was this patch tested?
Updated a test case to use this method.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark SPARK-20134

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17464.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17464


commit 360e9d71b1865443fe45920cd938fe8e8d1354e3
Author: Reynold Xin 
Date:   2017-03-29T04:54:12Z

[SPARK-20134][SQL] SQLMetrics.postDriverMetricUpdates to simplify driver 
side metric updates





[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17463
  
**[Test build #75342 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75342/testReport)**
 for PR 17463 at commit 
[`c68c285`](https://github.com/apache/spark/commit/c68c285d3daa2c2dc584835989f9d23cd3fe398d).



[GitHub] spark issue #17442: [SPARK-20107][DOC] Speed up HadoopMapReduceCommitProtoco...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17442
  
**[Test build #75341 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75341/testReport)**
 for PR 17442 at commit 
[`ae222b2`](https://github.com/apache/spark/commit/ae222b2a35a6d8a79c6eaca20d58bd12c7d619d1).



[GitHub] spark issue #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apache.spar...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17463
  
**[Test build #75340 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75340/testReport)**
 for PR 17463 at commit 
[`89d1d35`](https://github.com/apache/spark/commit/89d1d35562bdb47c54464f31adeddadbe3a3ec1b).



[GitHub] spark pull request #17463: [SPARK-20131][DStream][Test] Flaky Test: org.apac...

2017-03-28 Thread uncleGen

GitHub user uncleGen opened a pull request:

https://github.com/apache/spark/pull/17463

[SPARK-20131][DStream][Test] Flaky Test: 
org.apache.spark.streaming.StreamingContextSuite

## What changes were proposed in this pull request?

Do not stop the `SparkContext` in a thread.

## How was this patch tested?

Jenkins.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/uncleGen/spark SPARK-20131

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17463.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17463


commit 89d1d35562bdb47c54464f31adeddadbe3a3ec1b
Author: uncleGen 
Date:   2017-03-29T04:43:49Z

Flaky Test: org.apache.spark.streaming.StreamingContextSuite





[GitHub] spark issue #17406: [SPARK-20009][SQL] Support DDL strings for defining sche...

2017-03-28 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17406
  
Oh, my bad. Fixed.



[GitHub] spark issue #17406: [SPARK-20009][SQL] Use DDL strings for defining schema i...

2017-03-28 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17406
  
Could you update the PR title?
```
[SPARK-20009][SQL] Support DDL strings for defining schema in 
functions.from_json
```




[GitHub] spark issue #16746: [SPARK-15648][SQL] Add teradataDialect for JDBC connecti...

2017-03-28 Thread klinvill

Github user klinvill commented on the issue:

https://github.com/apache/spark/pull/16746
  
Hi @dongjoon-hyun @gatorsmile, just circling back. Is it going to be 
impractical to check the PR against a VM rather than against a docker image?



[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-28 Thread dbolshak

Github user dbolshak commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108589069
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -103,7 +103,7 @@ private[ui] class StagePage(parent: StagesTab) extends WebUIPage("stage") {
       val taskSortColumn = Option(parameterTaskSortColumn).map { sortColumn =>
         UIUtils.decodeURLParameter(sortColumn)
       }.getOrElse("Index")
-      val taskSortDesc = Option(parameterTaskSortDesc).map(_.toBoolean).getOrElse(false)
+      val taskSortDesc = Option(parameterTaskSortDesc).exists(_.toBoolean)
--- End diff --

OK, I've reverted this.
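
For readers comparing the two forms in the diff, the following Python sketch models the relevant part of Scala's `Option` to show that `map(...).getOrElse(false)` and `exists(...)` agree for a boolean-valued function. The `Option` class and the `to_boolean` helper are hypothetical illustrations written for this note; they are not Spark or Scala APIs.

```python
class Option:
    """Tiny model of Scala's Option: wraps a value, or is empty when given None."""

    def __init__(self, value):
        self._value = value

    def is_empty(self):
        return self._value is None

    def map(self, f):
        # Apply f only when a value is present.
        return Option(None) if self.is_empty() else Option(f(self._value))

    def get_or_else(self, default):
        return default if self.is_empty() else self._value

    def exists(self, predicate):
        # True only when a value is present and the predicate holds for it.
        return (not self.is_empty()) and bool(predicate(self._value))


def to_boolean(text):
    """Stand-in for Scala's String.toBoolean: accepts only "true"/"false"."""
    lowered = text.lower()
    if lowered not in ("true", "false"):
        raise ValueError("not a boolean: " + text)
    return lowered == "true"


def sort_desc_with_map(param):
    # Mirrors Option(param).map(_.toBoolean).getOrElse(false)
    return Option(param).map(to_boolean).get_or_else(False)


def sort_desc_with_exists(param):
    # Mirrors Option(param).exists(_.toBoolean)
    return Option(param).exists(to_boolean)


for param in (None, "true", "false"):
    assert sort_desc_with_map(param) == sort_desc_with_exists(param)
```

Under these assumptions both forms return the same result for a missing, `"true"`, or `"false"` parameter, so the choice between them is a matter of style.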



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17297
  
**[Test build #75339 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75339/testReport)**
 for PR 17297 at commit 
[`ace8464`](https://github.com/apache/spark/commit/ace8464a1ec34864e56fbfceaac509895dcf31d4).



[GitHub] spark issue #17355: [SPARK-19955][PySpark] Jenkins Python Conda based test.

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17355
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75333/
Test PASSed.



[GitHub] spark issue #17355: [SPARK-19955][PySpark] Jenkins Python Conda based test.

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17355
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75338 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75338/testReport)**
 for PR 17379 at commit 
[`119dae9`](https://github.com/apache/spark/commit/119dae974554bc7a1755b8532c373464618ad56d).



[GitHub] spark issue #17355: [SPARK-19955][PySpark] Jenkins Python Conda based test.

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17355
  
**[Test build #75333 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75333/testReport)**
 for PR 17355 at commit 
[`a7bf53f`](https://github.com/apache/spark/commit/a7bf53f1b0f3c7104d23a0c1153b15eddceb9169).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75337 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75337/testReport)**
 for PR 17379 at commit 
[`be191c6`](https://github.com/apache/spark/commit/be191c62f7d549931debbc08f21e025edf418faa).



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread kunalkhamar

Github user kunalkhamar commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108586704
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala ---
@@ -32,15 +32,15 @@ import 
org.apache.spark.sql.internal.{BaseSessionStateBuilder, SessionResourceLo
  */
 private[hive] object HiveSessionState {
   /**
-   * Create a new Hive aware [[SessionState]]. for the given session.
+   * Create a new Hive aware [[SessionState]] for the given session.
*/
   def apply(session: SparkSession): SessionState = {
 new HiveSessionStateBuilder(session).build()
   }
 }
 
 /**
- * Builder that produces a [[HiveSessionState]].
+ * Builder that produces a Hive aware [[SessionState]].
  */
 @Experimental
 @InterfaceStability.Unstable
--- End diff --

Renamed, removed `object HiveSessionState`.



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread kunalkhamar

Github user kunalkhamar commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108586634
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/sessionStateBuilders.scala
 ---
@@ -134,6 +135,14 @@ abstract class BaseSessionStateBuilder(
   }
 
   /**
+   * Interface exposed to the user for registering user-defined functions.
+   *
+   * Note 1: The user-defined functions must be deterministic.
+   * Note 2: This depends on the `functionRegistry` field.
+   */
+  protected def udf: UDFRegistration = new 
UDFRegistration(functionRegistry)
--- End diff --

changed.



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread kunalkhamar

Github user kunalkhamar commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108586629
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
@@ -37,38 +37,42 @@ import 
org.apache.spark.sql.util.ExecutionListenerManager
 /**
  * A class that holds all session-specific state in a given 
[[SparkSession]].
  *
- * @param sparkContext The [[SparkContext]].
- * @param sharedState The shared state.
+ * @param sharedState The state shared across sessions, e.g. global view 
manager, external catalog.
  * @param conf SQL-specific key-value configurations.
- * @param experimentalMethods The experimental methods.
+ * @param experimentalMethods Interface to add custom planning strategies 
and optimizers.
  * @param functionRegistry Internal catalog for managing functions 
registered by the user.
+ * @param udf Interface exposed to the user for registering user-defined 
functions.
  * @param catalog Internal catalog for managing table and database states.
  * @param sqlParser Parser that extracts expressions, plans, table 
identifiers etc. from SQL texts.
  * @param analyzer Logical query plan analyzer for resolving unresolved 
attributes and relations.
  * @param optimizer Logical query plan optimizer.
  * @param planner Planner that converts optimized logical plans to 
physical plans
  * @param streamingQueryManager Interface to start and stop streaming 
queries.
+ * @param listenerManager Interface to register custom
+ *
[[org.apache.spark.sql.util.QueryExecutionListener]]s
+ * @param resourceLoader Session shared resource loader to load JARs, 
files, etc
  * @param createQueryExecution Function used to create QueryExecution 
objects.
  * @param createClone Function used to create clones of the session state.
  */
 private[sql] class SessionState(
-sparkContext: SparkContext,
 sharedState: SharedState,
 val conf: SQLConf,
 val experimentalMethods: ExperimentalMethods,
 val functionRegistry: FunctionRegistry,
+val udf: UDFRegistration,
--- End diff --

changed.



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108585042
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,138 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
+   *   for subsequent conditions
+   * @return an optional double value to show the percentage of rows 
meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+  op: BinaryComparison,
+  attrLeft: Attribute,
+  attrRight: Attribute,
+  update: Boolean): Option[Double] = {
+
+if (!colStatsMap.contains(attrLeft)) {
+  logDebug("[CBO] No statistics for " + attrLeft)
+  return None
+}
+if (!colStatsMap.contains(attrRight)) {
+  logDebug("[CBO] No statistics for " + attrRight)
+  return None
+}
+
+attrLeft.dataType match {
+  case StringType | BinaryType =>
+// TODO: It is difficult to support other binary comparisons for 
String/Binary
+// type without min/max and advanced statistics like histogram.
+logDebug("[CBO] No range comparison statistics for String/Binary 
type " + attrLeft)
+return None
+  case _ =>
+}
+
+val colStatLeft = colStatsMap(attrLeft)
+val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, 
attrLeft.dataType)
+  .asInstanceOf[NumericRange]
+val maxLeft = BigDecimal(statsRangeLeft.max)
+val minLeft = BigDecimal(statsRangeLeft.min)
+val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+val colStatRight = colStatsMap(attrRight)
+val statsRangeRight = Range(colStatRight.min, colStatRight.max, 
attrRight.dataType)
+  .asInstanceOf[NumericRange]
+val maxRight = BigDecimal(statsRangeRight.max)
+val minRight = BigDecimal(statsRangeRight.min)
+val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+// determine the overlapping degree between predicate range and 
column's range
+val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+  case _: EqualTo =>
+((maxLeft < minRight) || (maxRight < minLeft),
+  (minLeft == minRight) && (maxLeft == maxRight))
+  case _: LessThan =>
+(minLeft >= maxRight, maxLeft <= minRight)
+  case _: LessThanOrEqual =>
+(minLeft >= maxRight, maxLeft <= minRight)
--- End diff --

`(minLeft > maxRight, maxLeft <= minRight)`?
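
The suggestion turns on the boundary case where the two ranges touch at a single value. A minimal sketch, using plain `BigDecimal` bounds rather than Spark's `ColumnStat`, and assuming that any pair of values within the two column ranges may occur in a row (the object and function names are hypothetical): when `minLeft == maxRight`, a row with both columns at that shared value still satisfies `<=`, so only the strict test `minLeft > maxRight` rules out every row.

```scala
object LessThanOrEqualBoundary {
  // As written in the diff: treats touching ranges as having no matching rows.
  def noOverlapInDiff(minLeft: BigDecimal, maxRight: BigDecimal): Boolean =
    minLeft >= maxRight

  // As suggested in the review: only strictly separated ranges have no match.
  def noOverlapSuggested(minLeft: BigDecimal, maxRight: BigDecimal): Boolean =
    minLeft > maxRight

  def main(args: Array[String]): Unit = {
    // Left column spans [5, 10]; right column spans [1, 5].
    val minLeft = BigDecimal(5)
    val maxRight = BigDecimal(5)
    // The row (left = 5, right = 5) satisfies left <= right.
    assert(minLeft <= maxRight)
    assert(noOverlapInDiff(minLeft, maxRight))     // reports "no rows" despite that row
    assert(!noOverlapSuggested(minLeft, maxRight)) // leaves the row possible
  }
}
```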



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108584962
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,138 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
+   *   for subsequent conditions
+   * @return an optional double value to show the percentage of rows 
meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+  op: BinaryComparison,
+  attrLeft: Attribute,
+  attrRight: Attribute,
+  update: Boolean): Option[Double] = {
+
+if (!colStatsMap.contains(attrLeft)) {
+  logDebug("[CBO] No statistics for " + attrLeft)
+  return None
+}
+if (!colStatsMap.contains(attrRight)) {
+  logDebug("[CBO] No statistics for " + attrRight)
+  return None
+}
+
+attrLeft.dataType match {
+  case StringType | BinaryType =>
+// TODO: It is difficult to support other binary comparisons for 
String/Binary
+// type without min/max and advanced statistics like histogram.
+logDebug("[CBO] No range comparison statistics for String/Binary 
type " + attrLeft)
+return None
+  case _ =>
+}
+
+val colStatLeft = colStatsMap(attrLeft)
+val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, 
attrLeft.dataType)
+  .asInstanceOf[NumericRange]
+val maxLeft = BigDecimal(statsRangeLeft.max)
+val minLeft = BigDecimal(statsRangeLeft.min)
+val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+val colStatRight = colStatsMap(attrRight)
+val statsRangeRight = Range(colStatRight.min, colStatRight.max, 
attrRight.dataType)
+  .asInstanceOf[NumericRange]
+val maxRight = BigDecimal(statsRangeRight.max)
+val minRight = BigDecimal(statsRangeRight.min)
+val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+// determine the overlapping degree between predicate range and 
column's range
+val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+  case _: EqualTo =>
+((maxLeft < minRight) || (maxRight < minLeft),
+  (minLeft == minRight) && (maxLeft == maxRight))
+  case _: LessThan =>
+(minLeft >= maxRight, maxLeft <= minRight)
--- End diff --

`(minLeft >= maxRight, maxLeft < minRight)`?



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108584830
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,138 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
+   *   for subsequent conditions
+   * @return an optional double value to show the percentage of rows 
meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+  op: BinaryComparison,
+  attrLeft: Attribute,
+  attrRight: Attribute,
+  update: Boolean): Option[Double] = {
+
+if (!colStatsMap.contains(attrLeft)) {
+  logDebug("[CBO] No statistics for " + attrLeft)
+  return None
+}
+if (!colStatsMap.contains(attrRight)) {
+  logDebug("[CBO] No statistics for " + attrRight)
+  return None
+}
+
+attrLeft.dataType match {
+  case StringType | BinaryType =>
+// TODO: It is difficult to support other binary comparisons for 
String/Binary
+// type without min/max and advanced statistics like histogram.
+logDebug("[CBO] No range comparison statistics for String/Binary 
type " + attrLeft)
+return None
+  case _ =>
+}
+
+val colStatLeft = colStatsMap(attrLeft)
+val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, 
attrLeft.dataType)
+  .asInstanceOf[NumericRange]
+val maxLeft = BigDecimal(statsRangeLeft.max)
+val minLeft = BigDecimal(statsRangeLeft.min)
+val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+val colStatRight = colStatsMap(attrRight)
+val statsRangeRight = Range(colStatRight.min, colStatRight.max, 
attrRight.dataType)
+  .asInstanceOf[NumericRange]
+val maxRight = BigDecimal(statsRangeRight.max)
+val minRight = BigDecimal(statsRangeRight.min)
+val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+// determine the overlapping degree between predicate range and 
column's range
+val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+  case _: EqualTo =>
+((maxLeft < minRight) || (maxRight < minLeft),
+  (minLeft == minRight) && (maxLeft == maxRight))
+  case _: LessThan =>
+(minLeft >= maxRight, maxLeft <= minRight)
+  case _: LessThanOrEqual =>
+(minLeft >= maxRight, maxLeft <= minRight)
+  case _: GreaterThan =>
+(maxLeft <= minRight, minLeft >= maxRight)
+  case _: GreaterThanOrEqual =>
+(maxLeft < minRight, minLeft > maxRight)
--- End diff --

`(maxLeft < minRight, minLeft >= maxRight)`?



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108584825
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,138 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
+   *   for subsequent conditions
+   * @return an optional double value to show the percentage of rows 
meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+  op: BinaryComparison,
+  attrLeft: Attribute,
+  attrRight: Attribute,
+  update: Boolean): Option[Double] = {
+
+if (!colStatsMap.contains(attrLeft)) {
+  logDebug("[CBO] No statistics for " + attrLeft)
+  return None
+}
+if (!colStatsMap.contains(attrRight)) {
+  logDebug("[CBO] No statistics for " + attrRight)
+  return None
+}
+
+attrLeft.dataType match {
+  case StringType | BinaryType =>
+// TODO: It is difficult to support other binary comparisons for 
String/Binary
+// type without min/max and advanced statistics like histogram.
+logDebug("[CBO] No range comparison statistics for String/Binary 
type " + attrLeft)
+return None
+  case _ =>
+}
+
+val colStatLeft = colStatsMap(attrLeft)
+val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, 
attrLeft.dataType)
+  .asInstanceOf[NumericRange]
+val maxLeft = BigDecimal(statsRangeLeft.max)
+val minLeft = BigDecimal(statsRangeLeft.min)
+val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+val colStatRight = colStatsMap(attrRight)
+val statsRangeRight = Range(colStatRight.min, colStatRight.max, 
attrRight.dataType)
+  .asInstanceOf[NumericRange]
+val maxRight = BigDecimal(statsRangeRight.max)
+val minRight = BigDecimal(statsRangeRight.min)
+val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+// determine the overlapping degree between predicate range and 
column's range
+val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+  case _: EqualTo =>
+((maxLeft < minRight) || (maxRight < minLeft),
+  (minLeft == minRight) && (maxLeft == maxRight))
+  case _: LessThan =>
+(minLeft >= maxRight, maxLeft <= minRight)
+  case _: LessThanOrEqual =>
+(minLeft >= maxRight, maxLeft <= minRight)
+  case _: GreaterThan =>
+(maxLeft <= minRight, minLeft >= maxRight)
--- End diff --

`(maxLeft <= minRight, minLeft > maxRight)`?
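
The four bound pairs suggested in this thread for `<`, `<=`, `>`, and `>=` can be checked by brute force. The sketch below is an illustration only, not code from the PR: it assumes integer-valued columns, reads `noOverlap` as "no pair of values from the two ranges satisfies the predicate" and `completeOverlap` as "every pair satisfies it", and uses hypothetical names throughout.

```scala
object OverlapBoundsCheck {
  // (noOverlap, completeOverlap) per operator, as suggested in the review comments.
  def suggested(op: String, minL: Int, maxL: Int, minR: Int, maxR: Int): (Boolean, Boolean) =
    op match {
      case "<"  => (minL >= maxR, maxL < minR)
      case "<=" => (minL > maxR, maxL <= minR)
      case ">"  => (maxL <= minR, minL > maxR)
      case ">=" => (maxL < minR, minL >= maxR)
    }

  // Evaluates the comparison itself for one pair of values.
  def holds(op: String, l: Int, r: Int): Boolean = op match {
    case "<"  => l < r
    case "<=" => l <= r
    case ">"  => l > r
    case ">=" => l >= r
  }

  // Ground truth by enumeration: does no pair / every pair satisfy the predicate?
  def enumerated(op: String, minL: Int, maxL: Int, minR: Int, maxR: Int): (Boolean, Boolean) = {
    val results = for (l <- minL to maxL; r <- minR to maxR) yield holds(op, l, r)
    (!results.exists(identity), results.forall(identity))
  }

  // Compares the suggested bounds with enumeration for all ranges within [0, limit].
  def allAgree(limit: Int): Boolean = {
    val cases = for {
      op <- Seq("<", "<=", ">", ">=")
      minL <- 0 to limit
      maxL <- minL to limit
      minR <- 0 to limit
      maxR <- minR to limit
    } yield suggested(op, minL, maxL, minR, maxR) == enumerated(op, minL, maxL, minR, maxR)
    cases.forall(identity)
  }

  def main(args: Array[String]): Unit =
    assert(allAgree(4))
}
```

Under those assumptions, strictness flips between the two tuple positions: the inclusive operators need a strict `noOverlap` test and a non-strict `completeOverlap` test, and the strict operators need the reverse.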



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17297
  
Merged build finished. Test FAILed.



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17297
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75332/
Test FAILed.



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17297
  
**[Test build #75332 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75332/testReport)**
 for PR 17297 at commit 
[`bdaff12`](https://github.com/apache/spark/commit/bdaff123dd21feff72218d8163fa1a69e45f1a1e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108583642
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,138 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
--- End diff --

`a given column` -> `two given columns`. The `ColumnStat` of both columns is 
updated.



[GitHub] spark issue #17346: [SPARK-19965][SS] DataFrame batch reader may fail to inf...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17346
  
**[Test build #75336 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75336/testReport)**
 for PR 17346 at commit 
[`0e35db7`](https://github.com/apache/spark/commit/0e35db701342ff426a037c519e50c17d003931fb).



[GitHub] spark issue #17346: [SPARK-19965][SS] DataFrame batch reader may fail to inf...

2017-03-28 Thread lw-lin

Github user lw-lin commented on the issue:

https://github.com/apache/spark/pull/17346
  
Jenkins retest this please



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread ron8hu

Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108583109
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala
 ---
@@ -515,8 +530,135 @@ case class FilterEstimation(plan: Filter, 
catalystConf: CatalystConf) extends Lo
 Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression 
containing two columns.
+   * In SQL queries, we also see predicate expressions involving two 
columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong 
to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then 
it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update 
ColumnStat of a given column
+   *   for subsequent conditions
+   * @return an optional double value to show the percentage of rows 
meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+  op: BinaryComparison,
+  attrLeft: Attribute,
+  attrRight: Attribute,
+  update: Boolean): Option[Double] = {
+
+if (!colStatsMap.contains(attrLeft)) {
+  logDebug("[CBO] No statistics for " + attrLeft)
+  return None
+}
+if (!colStatsMap.contains(attrRight)) {
+  logDebug("[CBO] No statistics for " + attrRight)
+  return None
+}
+
+attrLeft.dataType match {
+  case StringType | BinaryType =>
--- End diff --

The current code is written to avoid overly deep indentation. Some engineers 
dislike deep indentation because they often orient their monitors vertically.
Let's handle it when the need arises. I think that, with good test coverage, 
we will be able to catch anything we miss.
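
To illustrate the indentation trade-off, the sketch below contrasts a flat, early-return style (as in the diff) with an equivalent nested style. It is a simplified stand-in, not the PR's code: the `stats` map, both method names, and the `math.min` computation are placeholders chosen only to give the two versions something to return.

```scala
object GuardClauseStyles {
  // Flat style: early returns keep the main logic at one indentation level.
  def estimateFlat(stats: Map[String, Double], left: String, right: String): Option[Double] = {
    if (!stats.contains(left)) return None
    if (!stats.contains(right)) return None
    Some(math.min(stats(left), stats(right)))
  }

  // Nested style: same result, but each check adds a level of indentation.
  def estimateNested(stats: Map[String, Double], left: String, right: String): Option[Double] = {
    if (stats.contains(left)) {
      if (stats.contains(right)) {
        Some(math.min(stats(left), stats(right)))
      } else {
        None
      }
    } else {
      None
    }
  }
}
```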



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r108582862
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2747,6 +2747,17 @@ class Dataset[T] private[sql](
     }
   }
 
+  /**
+   * Collect a Dataset as ArrowPayload byte arrays and serve to PySpark.
+   */
+  private[sql] def collectAsArrowToPython(): Int = {
+    val payloadRdd = toArrowPayloadBytes()
+    val payloadByteArrays = payloadRdd.collect()
--- End diff --

Would it be better to use `PythonRDD.toLocalIteratorAndServe` to serve a local
iterator of the RDD to the Python side?
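
A rough illustration of the trade-off behind this suggestion, in plain Python rather than Spark. The names `collect_all` and `iterate_locally` are hypothetical stand-ins for a `collect()`-style call and a local-iterator approach, and each partition is modeled as a callable that produces its payload on demand. This is only a sketch of the memory behaviour, not how `PythonRDD` is implemented.

```python
from typing import Callable, Iterator, List

# Hypothetical model: a partition is a function that computes its payload.
Partition = Callable[[], bytes]


def collect_all(partitions: List[Partition]) -> List[bytes]:
    """Materialize every payload before returning (collect-style)."""
    return [compute() for compute in partitions]


def iterate_locally(partitions: List[Partition]) -> Iterator[bytes]:
    """Yield one payload at a time, computing each only when requested."""
    for compute in partitions:
        yield compute()
```

With the iterator version, the consumer holds at most one payload at a time, at the cost of fetching the partitions sequentially.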



[GitHub] spark issue #17415: [SPARK-19408][SQL] filter estimation on two columns of s...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17415
  
**[Test build #75335 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75335/testReport)** for PR 17415 at commit [`7abed99`](https://github.com/apache/spark/commit/7abed99271064e27e86f7265a335b9bee0582d3a).



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread ron8hu

Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108582594
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -515,8 +530,135 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
     Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression containing two columns.
+   * In SQL queries, we also see predicate expressions involving two columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update ColumnStat of a given column
+   *               for subsequent conditions
+   * @return an optional double value to show the percentage of rows meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+      op: BinaryComparison,
+      attrLeft: Attribute,
+      attrRight: Attribute,
+      update: Boolean): Option[Double] = {
+
+    if (!colStatsMap.contains(attrLeft)) {
+      logDebug("[CBO] No statistics for " + attrLeft)
+      return None
+    }
+    if (!colStatsMap.contains(attrRight)) {
+      logDebug("[CBO] No statistics for " + attrRight)
+      return None
+    }
+
+    attrLeft.dataType match {
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attrLeft)
+        return None
+      case _ =>
+    }
+
+    val colStatLeft = colStatsMap(attrLeft)
+    val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, attrLeft.dataType)
+      .asInstanceOf[NumericRange]
+    val maxLeft = BigDecimal(statsRangeLeft.max)
+    val minLeft = BigDecimal(statsRangeLeft.min)
+    val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+    val colStatRight = colStatsMap(attrRight)
+    val statsRangeRight = Range(colStatRight.min, colStatRight.max, attrRight.dataType)
+      .asInstanceOf[NumericRange]
+    val maxRight = BigDecimal(statsRangeRight.max)
+    val minRight = BigDecimal(statsRangeRight.min)
+    val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+    // determine the overlapping degree between predicate range and column's range
+    val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+      case _: EqualTo =>
+        ((maxLeft < minRight) || (maxRight < minLeft),
+          (minLeft == minRight) && (maxLeft == maxRight))
+      case _: LessThan =>
+        (minLeft >= maxRight, maxLeft <= minRight)
+      case _: LessThanOrEqual =>
+        (minLeft >= maxRight, maxLeft <= minRight)
+      case _: GreaterThan =>
+        (maxLeft <= minRight, minLeft >= maxRight)
+      case _: GreaterThanOrEqual =>
+        (maxLeft < minRight, minLeft > maxRight)
+    }
--- End diff --

Good catch.  Fixed.
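
For readers following the boundary conditions in the quoted hunk, here is a small Python sketch of how strict and non-strict comparisons between two column ranges can be told apart. It is not the fix that was applied in the PR (the message does not say what changed). The function name `range_flags` and the two flags `none_match` / `all_match` are hypothetical, and they use a stricter definition than the PR's `noOverlap` / `completeOverlap`: each flag is set only when the min/max statistics alone prove the outcome.

```python
from decimal import Decimal
from typing import Tuple


def range_flags(op: str,
                min_left: Decimal, max_left: Decimal,
                min_right: Decimal, max_right: Decimal) -> Tuple[bool, bool]:
    """Return (none_match, all_match) for the predicate `left op right`.

    none_match: the ranges prove that no row can satisfy the predicate.
    all_match:  the ranges prove that every row satisfies the predicate.
    """
    if op == "<":
        return (min_left >= max_right, max_left < min_right)
    if op == "<=":
        return (min_left > max_right, max_left <= min_right)
    if op == ">":
        return (max_left <= min_right, min_left > max_right)
    if op == ">=":
        return (max_left < min_right, min_left >= max_right)
    raise ValueError(f"unsupported operator: {op}")
```

The point of the sketch is that `<` and `<=` differ only when the ranges touch at a single value, which is exactly where the strictness of each comparison matters.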



[GitHub] spark pull request #17415: [SPARK-19408][SQL] filter estimation on two colum...

2017-03-28 Thread ron8hu

Github user ron8hu commented on a diff in the pull request:

https://github.com/apache/spark/pull/17415#discussion_r108582540
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala ---
@@ -509,8 +524,131 @@ case class FilterEstimation(plan: Filter, catalystConf: CatalystConf) extends Lo
     Some(percent.toDouble)
   }
 
+  /**
+   * Returns a percentage of rows meeting a binary comparison expression containing two columns.
+   * In SQL queries, we also see predicate expressions involving two columns
+   * such as "column-1 (op) column-2" where column-1 and column-2 belong to same table.
+   * Note that, if column-1 and column-2 belong to different tables, then it is a join
+   * operator's work, NOT a filter operator's work.
+   *
+   * @param op a binary comparison operator such as =, <, <=, >, >=
+   * @param attrLeft the left Attribute (or a column)
+   * @param attrRight the right Attribute (or a column)
+   * @param update a boolean flag to specify if we need to update ColumnStat of a given column
+   *               for subsequent conditions
+   * @return an optional double value to show the percentage of rows meeting a given condition
+   */
+  def evaluateBinaryForTwoColumns(
+      op: BinaryComparison,
+      attrLeft: Attribute,
+      attrRight: Attribute,
+      update: Boolean): Option[Double] = {
+
+    if (!colStatsMap.contains(attrLeft)) {
+      logDebug("[CBO] No statistics for " + attrLeft)
+      return None
+    }
+    if (!colStatsMap.contains(attrRight)) {
+      logDebug("[CBO] No statistics for " + attrRight)
+      return None
+    }
+
+    attrLeft.dataType match {
+      case StringType | BinaryType =>
+        // TODO: It is difficult to support other binary comparisons for String/Binary
+        // type without min/max and advanced statistics like histogram.
+        logDebug("[CBO] No range comparison statistics for String/Binary type " + attrLeft)
+        return None
+      case _ =>
+    }
+
+    val colStatLeft = colStatsMap(attrLeft)
+    val statsRangeLeft = Range(colStatLeft.min, colStatLeft.max, attrLeft.dataType)
+      .asInstanceOf[NumericRange]
+    val maxLeft = BigDecimal(statsRangeLeft.max)
+    val minLeft = BigDecimal(statsRangeLeft.min)
+    val ndvLeft = BigDecimal(colStatLeft.distinctCount)
+
+    val colStatRight = colStatsMap(attrRight)
+    val statsRangeRight = Range(colStatRight.min, colStatRight.max, attrRight.dataType)
+      .asInstanceOf[NumericRange]
+    val maxRight = BigDecimal(statsRangeRight.max)
+    val minRight = BigDecimal(statsRangeRight.min)
+    val ndvRight = BigDecimal(colStatRight.distinctCount)
+
+    // determine the overlapping degree between predicate range and column's range
+    val (noOverlap: Boolean, completeOverlap: Boolean) = op match {
+      case _: EqualTo =>
--- End diff --

I just revised the code to handle EqualNullSafe separately from EqualTo.
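
As background on why the two operators may need separate handling, the sketch below models their SQL semantics in plain Python, with `None` standing in for NULL. The function names are hypothetical and this is not the code from the PR. With `=`, a row with a NULL in either column can never pass the filter; with `<=>`, a row where both columns are NULL does pass.

```python
from typing import Optional


def equal_to(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """SQL `=`: the result is NULL (None here) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def equal_null_safe(a: Optional[int], b: Optional[int]) -> bool:
    """SQL `<=>`: never returns NULL; two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b
```

An estimator would therefore presumably need to account for null counts differently in the two cases.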



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r108581072
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2828,4 +2839,16 @@ class Dataset[T] private[sql](
       Dataset(sparkSession, logicalPlan)
     }
   }
+
+  /** Convert to an RDD of ArrowPayload byte arrays */
+  private[sql] def toArrowPayloadBytes(): RDD[Array[Byte]] = {
+    val schema_captured = this.schema
+    queryExecution.toRdd.mapPartitionsInternal { iter =>
+      val converter = new ArrowConverters
+      val payload = converter.interalRowIterToPayload(iter, schema_captured)
+      val payloadBytes = ArrowConverters.payloadToByteArray(payload, schema_captured)
--- End diff --

This currently works by consuming all rows from the iterator at once and
constructing a single `ArrowPayload` for them. That could hurt memory usage
if the partition holds a large amount of data.

I think a better way might be to construct an `ArrowPayload` for each group
of rows, rather than one for all rows.
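
The grouping idea can be sketched in a few lines of Python. The helper name `grouped` and the `batch_size` parameter are hypothetical; the sketch only shows how an iterator can be consumed in bounded chunks so that each chunk can be turned into its own payload.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouped(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Lazily yield lists of at most `batch_size` rows from `rows`."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # input exhausted
        yield batch
```

Only one batch is held in memory at a time. On the Scala side, `Iterator.grouped` offers similar behaviour.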



[GitHub] spark issue #17388: [SPARK-20059][YARN] Use the correct classloader for HBas...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17388
  
**[Test build #75334 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75334/testReport)** for PR 17388 at commit [`92d9587`](https://github.com/apache/spark/commit/92d9587f9ac8e3d8c166556ef1b12931b3fc3cfd).



[GitHub] spark pull request #17388: [SPARK-20059][YARN] Use the correct classloader f...

2017-03-28 Thread jerryshao

Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/17388#discussion_r108580909
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -485,12 +485,17 @@ object SparkSubmit extends CommandLineUtils {
 
     // In client mode, launch the application main class directly
     // In addition, add the main application jar and any added jars (if any) to the classpath
-    if (deployMode == CLIENT) {
+    // Also add the main application jar and any added jars to classpath in case yarn#client
--- End diff --

Thanks @vanzin, just updated the comment.



[GitHub] spark pull request #17421: [SPARK-20040][ML][python] pyspark wrapper for Chi...

2017-03-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17421



[GitHub] spark pull request #15821: [SPARK-13534][PySpark] Using Apache Arrow to incr...

2017-03-28 Thread viirya

Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/15821#discussion_r108579124
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -56,6 +56,15 @@
 from pyspark.sql.utils import AnalysisException, ParseException, IllegalArgumentException
 
 
+_have_arrow = False
+try:
+    import pyarrow
+    _have_arrow = True
--- End diff --

We should do a similar thing to the above wherever a feature that requires Arrow is used.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17421
  
Merging with master
Thanks!



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108576926
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/sessionStateBuilders.scala ---
@@ -134,6 +135,14 @@ abstract class BaseSessionStateBuilder(
   }
 
   /**
+   * Interface exposed to the user for registering user-defined functions.
+   *
+   * Note 1: The user-defined functions must be deterministic.
+   * Note 2: This depends on the `functionRegistry` field.
+   */
+  protected def udf: UDFRegistration = new UDFRegistration(functionRegistry)
--- End diff --

This file effectively contains only one builder, so it should be named after
the class.



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108562157
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/internal/SessionState.scala ---
@@ -37,38 +37,42 @@ import org.apache.spark.sql.util.ExecutionListenerManager
 /**
  * A class that holds all session-specific state in a given [[SparkSession]].
  *
- * @param sparkContext The [[SparkContext]].
- * @param sharedState The shared state.
+ * @param sharedState The state shared across sessions, e.g. global view manager, external catalog.
  * @param conf SQL-specific key-value configurations.
- * @param experimentalMethods The experimental methods.
+ * @param experimentalMethods Interface to add custom planning strategies and optimizers.
  * @param functionRegistry Internal catalog for managing functions registered by the user.
+ * @param udf Interface exposed to the user for registering user-defined functions.
  * @param catalog Internal catalog for managing table and database states.
  * @param sqlParser Parser that extracts expressions, plans, table identifiers etc. from SQL texts.
  * @param analyzer Logical query plan analyzer for resolving unresolved attributes and relations.
  * @param optimizer Logical query plan optimizer.
  * @param planner Planner that converts optimized logical plans to physical plans
  * @param streamingQueryManager Interface to start and stop streaming queries.
+ * @param listenerManager Interface to register custom
+ *                        [[org.apache.spark.sql.util.QueryExecutionListener]]s
+ * @param resourceLoader Session shared resource loader to load JARs, files, etc
  * @param createQueryExecution Function used to create QueryExecution objects.
  * @param createClone Function used to create clones of the session state.
  */
 private[sql] class SessionState(
-    sparkContext: SparkContext,
     sharedState: SharedState,
     val conf: SQLConf,
     val experimentalMethods: ExperimentalMethods,
     val functionRegistry: FunctionRegistry,
+    val udf: UDFRegistration,
--- End diff --

udf -> udfRegistration



[GitHub] spark pull request #17379: [SPARK-20048][SQL] Cloning SessionState does not ...

2017-03-28 Thread tdas

Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/17379#discussion_r108577718
  
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala ---
@@ -32,15 +32,15 @@ import org.apache.spark.sql.internal.{BaseSessionStateBuilder, SessionResourceLo
  */
 private[hive] object HiveSessionState {
   /**
-   * Create a new Hive aware [[SessionState]]. for the given session.
+   * Create a new Hive aware [[SessionState]] for the given session.
    */
   def apply(session: SparkSession): SessionState = {
     new HiveSessionStateBuilder(session).build()
   }
 }
 
 /**
- * Builder that produces a [[HiveSessionState]].
+ * Builder that produces a Hive aware [[SessionState]].
  */
 @Experimental
 @InterfaceStability.Unstable
--- End diff --

This file should not be named HiveSessionState anymore. It doesn't even have
the class HiveSessionState. It does have an object HiveSessionState, but do we
need that object any more?
@hvanhovell



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75331/
Test PASSed.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75331 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75331/testReport)** for PR 17379 at commit [`cad1b63`](https://github.com/apache/spark/commit/cad1b6314c64fc5308d3b5ad0a86285356abbac0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17442: [SPARK-20107][SQL] Speed up HadoopMapReduceCommitProtoco...

2017-03-28 Thread vanzin

Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/17442
  
The PR title and description also have nothing to do with the change 
anymore.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread markhamstra

Github user markhamstra commented on the issue:

https://github.com/apache/spark/pull/17297
  
Agreed. Let's establish what we want to do before trying to discuss the
details of how we are going to do it.

On Tue, Mar 28, 2017 at 8:17 AM, Imran Rashid wrote:

> @sitalkedia This change is pretty contentious, there are a lot of questions
> about whether or not this is a good change. I don't think discussing this
> here in github comments on a PR is the best form. I think of PR comments as
> being more about code details -- clarity, tests, whether the implementation
> is correct, etc. But here we're discussing whether the behavior is even
> desirable, as well as trying to discuss this in relation to other changes.
> I think a better format would be for you to open a jira and submit a design
> document (maybe a shared google doc at first), where we can focus more on
> the desired behavior and consider all the changes, even if the PRs are
> smaller to make them easier to review.
>
> I'm explicitly *not* making a judgement on whether or not this is a good
> change. Also I do appreciate you having the code changes ready, as a POC,
> as that can help folks consider the complexity of the change. But it seems
> clear to me that first we need to come to a decision about the end goal.
>
> Also, assuming we do decide this is desirable behavior, there is also a
> question about how we can get changes like this in without risking breaking
> things -- I have started a thread on dev@ related to that topic in general,
> but we should figure that out for these changes in particular as well.
>
> @kayousterhout @tgravescs @markhamstra makes sense?




[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75330/
Test PASSed.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17421
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17421
  
**[Test build #75330 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75330/testReport)** for PR 17421 at commit [`e79f968`](https://github.com/apache/spark/commit/e79f96866bd333e046a758f8615a364fb99b0e24).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17355: [SPARK-19955][PySpark] Jenkins Python Conda based test.

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17355
  
**[Test build #75333 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75333/testReport)** for PR 17355 at commit [`a7bf53f`](https://github.com/apache/spark/commit/a7bf53f1b0f3c7104d23a0c1153b15eddceb9169).



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread sitalkedia

Github user sitalkedia commented on the issue:

https://github.com/apache/spark/pull/17297
  
@squito - Sounds good to me. Let me compile the list of pain points related
to fetch failures that we are seeing, and also write a design doc for better
handling of these issues.



[GitHub] spark issue #17297: [SPARK-14649][CORE] DagScheduler should not run duplicat...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17297
  
**[Test build #75332 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75332/testReport)** for PR 17297 at commit [`bdaff12`](https://github.com/apache/spark/commit/bdaff123dd21feff72218d8163fa1a69e45f1a1e).



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17421
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17421
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75329/
Test PASSed.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17421
  
**[Test build #75329 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75329/testReport)**
 for PR 17421 at commit 
[`1ce5966`](https://github.com/apache/spark/commit/1ce59662c6170e142eac5e075b5497e135741039).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17406: [SPARK-20009][SQL] Use DDL strings for defining schema i...

2017-03-28 Thread maropu

Github user maropu commented on the issue:

https://github.com/apache/spark/pull/17406
  
@gatorsmile okay, please. Thanks!



[GitHub] spark issue #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffleServic...

2017-03-28 Thread ash211

Github user ash211 commented on the issue:

https://github.com/apache/spark/pull/17401
  
@jerryshao ready for re-review



[GitHub] spark pull request #17401: [SPARK-18364][YARN] Expose metrics for YarnShuffl...

2017-03-28 Thread ash211

Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17401#discussion_r108562943
  
--- Diff: 
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleServiceMetrics.java
 ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.network.yarn;
+
+import com.codahale.metrics.*;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.metrics2.MetricsCollector;
+import org.apache.hadoop.metrics2.MetricsInfo;
+import org.apache.hadoop.metrics2.MetricsRecordBuilder;
+import org.apache.hadoop.metrics2.MetricsSource;
+
+import java.util.Map;
+
+/**
+ * Modeled off of YARN's NodeManagerMetrics.
+ */
+public class YarnShuffleServiceMetrics implements MetricsSource {
+
+  private final MetricSet metricSet;
+
+  public YarnShuffleServiceMetrics(MetricSet metricSet) {
+    this.metricSet = metricSet;
+  }
+
+  /**
+   * Get metrics from the source
+   *
+   * @param collector to contain the resulting metrics snapshot
+   * @param all   if true, return all metrics even if unchanged.
+   */
+  @Override
+  public void getMetrics(MetricsCollector collector, boolean all) {
+    MetricsRecordBuilder metricsRecordBuilder = collector.addRecord("shuffleService");
+
+    for (Map.Entry<String, Metric> entry : metricSet.getMetrics().entrySet()) {
+      collectMetric(metricsRecordBuilder, entry.getKey(), entry.getValue());
+    }
+  }
+
+  @VisibleForTesting
+  public static void collectMetric(MetricsRecordBuilder metricsRecordBuilder, String name, Metric metric) {
--- End diff --

I use `static` here to make it clear that the method does not need to be 
run in the context of an instance. This prevents it from accidentally 
accessing instance variables when I don't intend it to.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75331 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75331/testReport)**
 for PR 17379 at commit 
[`cad1b63`](https://github.com/apache/spark/commit/cad1b6314c64fc5308d3b5ad0a86285356abbac0).



[GitHub] spark pull request #17458: [SPARK-20127][CORE] few warning have been fixed w...

2017-03-28 Thread HyukjinKwon

Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17458#discussion_r108560956
  
--- Diff: core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala ---
@@ -103,7 +103,7 @@ private[ui] class StagePage(parent: StagesTab) extends WebUIPage("stage") {
   val taskSortColumn = Option(parameterTaskSortColumn).map { sortColumn =>
     UIUtils.decodeURLParameter(sortColumn)
   }.getOrElse("Index")
-  val taskSortDesc = Option(parameterTaskSortDesc).map(_.toBoolean).getOrElse(false)
+  val taskSortDesc = Option(parameterTaskSortDesc).exists(_.toBoolean)
--- End diff --

If my opinion helps: personally, I prefer the previous version because it 
shows the default explicitly. I actually use this pattern intentionally in 
several places.
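
To make the trade-off concrete, here is a minimal sketch of the two forms side by side. The object and method names are hypothetical and exist only for illustration; the two expressions are the ones from the diff. Both return `false` when the parameter is absent, but only the first one spells that default out.

```scala
// Hypothetical wrapper, for illustration only; not part of StagePage.
object SortDescDefault {
  // Previous form: the fallback value `false` is written out explicitly.
  def viaGetOrElse(param: String): Boolean =
    Option(param).map(_.toBoolean).getOrElse(false)

  // Proposed form: `exists` returns false for None, so the default is implicit.
  def viaExists(param: String): Boolean =
    Option(param).exists(_.toBoolean)
}
```

The two are behaviorally equivalent for a `Boolean` result, so the choice is about readability rather than correctness.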



[GitHub] spark pull request #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalc...

2017-03-28 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17407



[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator ...

2017-03-28 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17407
  
Merging with master and branch-2.1



[GitHub] spark pull request #17446: [SPARK-17075][SQL][followup] Add Estimation of Co...

2017-03-28 Thread gatorsmile

Github user gatorsmile closed the pull request at:

https://github.com/apache/spark/pull/17446



[GitHub] spark issue #17446: [SPARK-17075][SQL][followup] Add Estimation of Constant ...

2017-03-28 Thread gatorsmile

Github user gatorsmile commented on the issue:

https://github.com/apache/spark/pull/17446
  
Since this PR does not correctly handle cases like `Not(Not(null))`, I am 
closing it for now.
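
The comment does not spell out the failure. As background only, and assuming the concern is SQL three-valued logic, the sketch below models `NOT` over a nullable boolean: negating NULL yields NULL, so a doubly negated null literal is still null. The object name and the `Option[Boolean]` encoding are illustrative assumptions, not Catalyst code.

```scala
// Hypothetical model of SQL three-valued NOT; not Catalyst's implementation.
object ThreeValuedNot {
  // None stands for SQL NULL, Some(b) for a known boolean value.
  def not(v: Option[Boolean]): Option[Boolean] = v.map(!_)
}
```

Under this model `not(not(None))` stays `None`, so any estimation logic that treats the nested expression as a plain true or false would be wrong for that input.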



[GitHub] spark pull request #15332: [SPARK-10364][SQL] Support Parquet logical type T...

2017-03-28 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15332#discussion_r108547399
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala
 ---
@@ -965,6 +972,18 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
 int96AsTimestamp = true,
 writeLegacyParquetFormat = true)
 
+  testSchema(
+"Timestmp written and read as INT64 with TIMESTAMP_MILLIS",
--- End diff --

nit: Timestmp -> Timestamp



[GitHub] spark pull request #15332: [SPARK-10364][SQL] Support Parquet logical type T...

2017-03-28 Thread ueshin

Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/15332#discussion_r108548142
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
 ---
@@ -237,6 +238,30 @@ object DateTimeUtils {
 (day.toInt, micros * 1000L)
   }
 
+  /*
+   * Converts the timestamp to milliseconds since epoc. In spark timestamp values have microseconds
+   * precision, so this conversion is lossy.
+   */
+  def toMillis(us: SQLTimestamp): Long = {
+    var millis = us / 1000L
+
+    // When the timestamp is negative i.e before 1970, we need to adjust the millseconds portion.
+    // Example - 1965-01-01 10:11:12.123456 is represented as (-157700927876544) in micro precision.
+    // In millis precision the above needs to be represented as (-157700927877)
+
+    if (us < 0 && (us % MILLIS_PER_SECOND < 0)) {
+      millis = millis - 1
+    }
--- End diff --

Can't we use `Math.floor()` here, the same way `millisToDays` does?
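
A minimal sketch of what a floor-based version could look like, compared against the manual adjustment from the diff. Note that it swaps in `java.lang.Math.floorDiv` on longs rather than `Math.floor` on a double, which is an assumption made here to avoid a floating-point round trip. The object and method names are hypothetical, and `Long` stands in for `SQLTimestamp`.

```scala
// Hypothetical comparison; not the code in DateTimeUtils.
object MicrosToMillis {
  // Manual form from the diff: truncating division, then step down by one
  // when a negative value has a non-zero sub-millisecond remainder.
  def toMillisManual(us: Long): Long = {
    var millis = us / 1000L
    if (us < 0 && (us % 1000L < 0)) {
      millis -= 1
    }
    millis
  }

  // Floor form: floorDiv rounds toward negative infinity in a single step.
  def toMillisFloor(us: Long): Long = java.lang.Math.floorDiv(us, 1000L)
}
```

For the example in the diff's comment, both forms map `-157700927876544` microseconds to `-157700927877` milliseconds.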



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Merged build finished. Test PASSed.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17379
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75328/
Test PASSed.



[GitHub] spark issue #17379: [SPARK-20048][SQL] Cloning SessionState does not clone q...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17379
  
**[Test build #75328 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75328/testReport)**
 for PR 17379 at commit 
[`16f5bea`](https://github.com/apache/spark/commit/16f5beae3d08627606c13ccb301d624836cb1233).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread jkbradley

Github user jkbradley commented on the issue:

https://github.com/apache/spark/pull/17421
  
LGTM pending tests



[GitHub] spark issue #17462: [SPARK-20050][DStream] Kafka 0.10 DirectStream doesn't c...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17462
  
Can one of the admins verify this patch?



[GitHub] spark pull request #17462: [SPARK-20050][DStream] Kafka 0.10 DirectStream do...

2017-03-28 Thread sasakitoa

GitHub user sasakitoa opened a pull request:

https://github.com/apache/spark/pull/17462

[SPARK-20050][DStream] Kafka 0.10 DirectStream doesn't commit last 
processed batch's offset when graceful shutdown

## What changes were proposed in this pull request?

When we use the Kafka 0.10 direct stream with `enable.auto.commit` set to 
`false`, the offsets of some records are not committed to Kafka on graceful 
shutdown, as shown below.

Sample code

```
val kafkaParams = Map[String, Object]("enable.auto.commit" -> "false")
val kafkaStream = KafkaUtils.createDirectStream[String, String](ssc, ... , kafkaParams.asScala)

kafkaStream.map { input =>
  "key: " + input.key.toString + " value: " + input.value.toString + " offset: " + input.offset.toString
}.foreachRDD { rdd =>
  rdd.foreach { input =>
    println(input)
  }
}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Output of the first run

```
key: null value: 1 offset: 101452472
key: null value: 2 offset: 101452473
key: null value: 3 offset: 101452474
key: null value: 4 offset: 101452475
key: null value: 5 offset: 101452476
key: null value: 6 offset: 101452477
key: null value: 7 offset: 101452478
key: null value: 8 offset: 101452479
key: null value: 9 offset: 101452480  // the last record before Spark Streaming was shut down gracefully
```

Output of the second run

```
key: null value: 7 offset: 101452478   // duplicate
key: null value: 8 offset: 101452479   // duplicate
key: null value: 9 offset: 101452480   // duplicate
key: null value: 10 offset: 101452481
```

This happens because offsets are committed at the beginning of each batch, 
and are not committed after the final batch has been processed.
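
The timing can be illustrated with a small stand-alone model. This is only a sketch under the assumption stated above (queued offsets are flushed at the start of the next batch); `FakeStream`, `runBatch` and `simulate` are hypothetical names, not Spark classes, and `commitAll` merely borrows the method name mentioned in this PR's commit message.

```scala
import scala.collection.mutable

// Hypothetical model of the commit timing; not DirectKafkaInputDStream.
object CommitTimingSketch {
  final class FakeStream {
    // Offsets queued by commitAsync, waiting to be sent to "Kafka".
    private val pending = mutable.Queue.empty[Long]
    // Last offset actually committed; -1 means nothing committed yet.
    var committed: Long = -1L

    def commitAsync(untilOffset: Long): Unit = pending.enqueue(untilOffset)

    // Drains the queue and records the newest queued offset as committed.
    def commitAll(): Unit =
      while (pending.nonEmpty) committed = pending.dequeue()

    def runBatch(untilOffset: Long): Unit = {
      commitAll()              // flush what earlier batches queued
      commitAsync(untilOffset) // this batch's offset waits for the next batch
    }
  }

  // Runs the batches, optionally flushing once more when stopping.
  def simulate(batchEndOffsets: Seq[Long], flushOnStop: Boolean): Long = {
    val stream = new FakeStream
    batchEndOffsets.foreach(stream.runBatch)
    if (flushOnStop) stream.commitAll()
    stream.committed
  }
}
```

In this model, stopping without a final flush leaves the last batch's offset uncommitted, so the next run re-reads that batch; flushing on stop closes the gap.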



## How was this patch tested?

Added tests `offset commit when graceful shtudown` in 
`DirectKafkaStreamSuite.scala`

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sasakitoa/spark Kafka010commitAsync2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17462.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17462


commit e7b218b877b15075a370b16909fb32e1b9865bb0
Author: Sasaki Toru 
Date:   2017-03-28T22:14:04Z

Invoke DirectkafkaInputDStream#commitAll when graceful shutdown





[GitHub] spark issue #17407: [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator ...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17407
  
**[Test build #3618 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3618/testReport)**
 for PR 17407 at commit 
[`22ee03d`](https://github.com/apache/spark/commit/22ee03d27c528b2a07d8d7e2a1467de8a09257dc).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17421
  
**[Test build #75330 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75330/testReport)**
 for PR 17421 at commit 
[`e79f968`](https://github.com/apache/spark/commit/e79f96866bd333e046a758f8615a364fb99b0e24).



[GitHub] spark issue #17421: [SPARK-20040][ML][python] pyspark wrapper for ChiSquareT...

2017-03-28 Thread SparkQA

Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/17421
  
**[Test build #75329 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75329/testReport)**
 for PR 17421 at commit 
[`1ce5966`](https://github.com/apache/spark/commit/1ce59662c6170e142eac5e075b5497e135741039).



[GitHub] spark issue #17436: [SPARK-20101][SQL] Use OffHeapColumnVector when "spark.m...

2017-03-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/17436
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75327/
Test FAILed.


