[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20830
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20830
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88295/
Test PASSed.


---




[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20830
  
**[Test build #88295 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88295/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20841
  
**[Test build #88294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88294/testReport)** for PR 20841 at commit [`1a6cfce`](https://github.com/apache/spark/commit/1a6cfcea2fd23a2a8b7cd0604507a8eb502962a6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88294/
Test PASSed.


---




[GitHub] spark pull request #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20838#discussion_r175000904
  
--- Diff: dev/create-release/releaseutils.py ---
@@ -49,6 +49,11 @@
 print("Install using 'sudo pip install unidecode'")
 sys.exit(-1)
 
+try:
+raw_input  # Python 2
--- End diff --

Shall we keep this consistent for now? There are a hell of a lot of things to fix in PySpark in that way.


---




[GitHub] spark pull request #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-03-15 Thread cclauss
Github user cclauss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20838#discussion_r175000559
  
--- Diff: dev/create-release/releaseutils.py ---
@@ -49,6 +49,11 @@
 print("Install using 'sudo pip install unidecode'")
 sys.exit(-1)
 
+try:
+raw_input  # Python 2
--- End diff --

“Where practical, apply the Python porting best practice: [Use feature detection instead of version detection](https://docs.python.org/3/howto/pyporting.html#use-feature-detection-instead-of-version-detection).”
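
For reference, a minimal self-contained sketch of that feature-detection idiom, runnable on both Python 2 and 3 (this mirrors the pattern in the diff above; it is not the PR's final code):

```python
# Feature detection: probe for the Python 2 builtin rather than
# branching on sys.version_info.
try:
    input = raw_input  # Python 2: raw_input exists; alias it to input
except NameError:
    pass  # Python 3: raw_input is undefined; input already returns a str

name = input("Enter a name: ")
print(name)
```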


---




[GitHub] spark pull request #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20838#discussion_r175000330
  
--- Diff: dev/create-release/releaseutils.py ---
@@ -49,6 +49,11 @@
 print("Install using 'sudo pip install unidecode'")
 sys.exit(-1)
 
+try:
+raw_input  # Python 2
--- End diff --

And shall we do this with an if-else on the Python version, to be consistent with other places?


---




[GitHub] spark pull request #20838: [SPARK-23698] Resolve undefined names in Python 3

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20838#discussion_r17582
  
--- Diff: dev/create-release/releaseutils.py ---
@@ -49,6 +49,11 @@
 print("Install using 'sudo pip install unidecode'")
 sys.exit(-1)
 
+try:
+raw_input  # Python 2
--- End diff --

Shall we put two spaces before the inlined comment?


---




[GitHub] spark pull request #20795: [SPARK-23486]cache the function name from the cat...

2018-03-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20795#discussion_r17440
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1192,11 +1195,23 @@ class Analyzer(
* @see https://issues.apache.org/jira/browse/SPARK-19737
*/
   object LookupFunctions extends Rule[LogicalPlan] {
-override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
-  case f: UnresolvedFunction if !catalog.functionExists(f.name) =>
-withPosition(f) {
-  throw new NoSuchFunctionException(f.name.database.getOrElse("default"), f.name.funcName)
-}
+override def apply(plan: LogicalPlan): LogicalPlan = {
+  val catalogFunctionNameSet = new mutable.HashSet[FunctionIdentifier]()
+  plan.transformAllExpressions {
+case f: UnresolvedFunction if catalogFunctionNameSet.contains(f.name) => f
+case f: UnresolvedFunction if catalog.functionExists(f.name) =>
+  catalogFunctionNameSet.add(normalizeFuncName(f.name))
+  f
+case f: UnresolvedFunction =>
+  withPosition(f) {
+throw new NoSuchFunctionException(f.name.database.getOrElse("default"),
+  f.name.funcName)
+  }
+  }
+}
+
+private def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
+  FunctionIdentifier(name.funcName.toLowerCase(Locale.ROOT), name.database)
--- End diff --

Ah, sorry for replying late. What I thought is: if there are two `FunctionIdentifier`s, one with the default database name `Some("default")` and the other with `None`, they should be equal to each other here.

I actually mean `name.database.orElse(Some("default"))`.



---




[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20830
  
**[Test build #88295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88295/testReport)** for PR 20830 at commit [`b7a4a91`](https://github.com/apache/spark/commit/b7a4a914fbdaddb4c56ee24257f477ff984e170e).


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20841
  
**[Test build #88294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88294/testReport)** for PR 20841 at commit [`1a6cfce`](https://github.com/apache/spark/commit/1a6cfcea2fd23a2a8b7cd0604507a8eb502962a6).


---




[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20830
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpark tests...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20830
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1552/
Test PASSed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1551/
Test PASSed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20841
  
retest this please


---




[GitHub] spark pull request #20727: [SPARK-23577][SQL] Supports custom line separator...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20727#discussion_r174999412
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala ---
@@ -39,9 +39,12 @@ private[text] class TextOptions(@transient private val parameters: CaseInsensiti
*/
   val wholeText = parameters.getOrElse(WHOLETEXT, "false").toBoolean
 
+  val lineSeparator: String = parameters.getOrElse(LINE_SEPARATOR, "\n")
+  require(lineSeparator.nonEmpty, s"'$LINE_SEPARATOR' cannot be an empty string.")
 }
 
 private[text] object TextOptions {
   val COMPRESSION = "compression"
   val WHOLETEXT = "wholetext"
+  val LINE_SEPARATOR = "lineSep"
--- End diff --

We already used the term "line" everywhere in the doc. We could just say lines are separated by a character and minimise the doc fixes, etc.


---




[GitHub] spark issue #20795: [SPARK-23486]cache the function name from the catalog fo...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20795
  
**[Test build #88292 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88292/testReport)** for PR 20795 at commit [`1e5ba02`](https://github.com/apache/spark/commit/1e5ba02f942f4aa7d6a24d76c2123700663f401f).


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19222
  
**[Test build #88293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88293/testReport)** for PR 19222 at commit [`c6d45ea`](https://github.com/apache/spark/commit/c6d45ea3eed791dbd67246068d77b9b239c209e6).


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1550/
Test PASSed.


---




[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20727: [SPARK-23577][SQL] Supports custom line separator...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20727#discussion_r174998682
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala ---
@@ -39,9 +39,12 @@ private[text] class TextOptions(@transient private val parameters: CaseInsensiti
*/
   val wholeText = parameters.getOrElse(WHOLETEXT, "false").toBoolean
 
+  val lineSeparator: String = parameters.getOrElse(LINE_SEPARATOR, "\n")
+  require(lineSeparator.nonEmpty, s"'$LINE_SEPARATOR' cannot be an empty string.")
 }
 
 private[text] object TextOptions {
   val COMPRESSION = "compression"
   val WHOLETEXT = "wholetext"
+  val LINE_SEPARATOR = "lineSep"
--- End diff --

My reason is to follow the other places so that, practically, other users feel comfortable, which is what I usually put more importance on. I really don't want to spend time researching why the other references used the term "line".

If we think about plain text, CSV, or JSON, the term "line" can be correct in a way. We documented http://jsonlines.org/ (even this reference uses the term "line"). I think, for example, a line can be defined by its separator.

https://github.com/apache/spark/blob/c36fecc3b416c38002779c3cf40b6a665ac4bf13/sql/core/src/test/scala/org/apache/spark/sql/execution/command/DDLParserSuite.scala#L1645
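
For illustration, a sketch of how the option under discussion would be used from PySpark (the `lineSep` option name comes from the diff above; the paths and session setup are assumptions for this sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a text file whose records end with CRLF rather than "\n".
# ("/tmp/input.txt" and "/tmp/output" are hypothetical paths.)
df = spark.read.option("lineSep", "\r\n").text("/tmp/input.txt")

# Write the records back out with the same custom separator.
df.write.option("lineSep", "\r\n").text("/tmp/output")
```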


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88291/
Test FAILed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20841
  
**[Test build #88291 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88291/testReport)** for PR 20841 at commit [`1a6cfce`](https://github.com/apache/spark/commit/1a6cfcea2fd23a2a8b7cd0604507a8eb502962a6).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark pull request #20795: [SPARK-23486]cache the function name from the cat...

2018-03-15 Thread kevinyu98
Github user kevinyu98 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20795#discussion_r174998505
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ---
@@ -1192,11 +1195,23 @@ class Analyzer(
* @see https://issues.apache.org/jira/browse/SPARK-19737
*/
   object LookupFunctions extends Rule[LogicalPlan] {
-override def apply(plan: LogicalPlan): LogicalPlan = plan.transformAllExpressions {
-  case f: UnresolvedFunction if !catalog.functionExists(f.name) =>
-withPosition(f) {
-  throw new NoSuchFunctionException(f.name.database.getOrElse("default"), f.name.funcName)
-}
+override def apply(plan: LogicalPlan): LogicalPlan = {
+  val catalogFunctionNameSet = new mutable.HashSet[FunctionIdentifier]()
+  plan.transformAllExpressions {
+case f: UnresolvedFunction if catalogFunctionNameSet.contains(f.name) => f
+case f: UnresolvedFunction if catalog.functionExists(f.name) =>
+  catalogFunctionNameSet.add(normalizeFuncName(f.name))
+  f
+case f: UnresolvedFunction =>
+  withPosition(f) {
+throw new NoSuchFunctionException(f.name.database.getOrElse("default"),
+  f.name.funcName)
+  }
+  }
+}
+
+private def normalizeFuncName(name: FunctionIdentifier): FunctionIdentifier = {
+  FunctionIdentifier(name.funcName.toLowerCase(Locale.ROOT), name.database)
--- End diff --

@viirya Hello Simon, I will leave this as it is for now, but if you think 
it is better to have the Option(name.database.getOrElse("default")) in the 
catalogFunctionNameSet, let me know. Thanks. 


---




[GitHub] spark issue #20842: [SPARK-23162][PySpark][ML] Add r2adj into Python API in ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20842
  
Can one of the admins verify this patch?


---




[GitHub] spark issue #20842: [SPARK-23162][PySpark][ML] Add r2adj into Python API in ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20842
  
Can one of the admins verify this patch?


---




[GitHub] spark pull request #20842: [SPARK-23162][PySpark][ML] Add r2adj into Python ...

2018-03-15 Thread kevinyu98
GitHub user kevinyu98 opened a pull request:

https://github.com/apache/spark/pull/20842

[SPARK-23162][PySpark][ML] Add r2adj into Python API in 
LinearRegressionSummary

## What changes were proposed in this pull request?

Adding r2adj in LinearRegressionSummary for Python API.



## How was this patch tested?

Added unit tests to exercise the api calls for the summary classes in 
tests.py.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/kevinyu98/spark spark-23162

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20842.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20842


commit 12d18b5511263817eb6bceb68155e4797e52b870
Author: Kevin Yu 
Date:   2018-03-16T04:26:31Z

add r2adj into Python API in LinearRegressionSummary




---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20841
  
**[Test build #88291 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88291/testReport)** for PR 20841 at commit [`1a6cfce`](https://github.com/apache/spark/commit/1a6cfcea2fd23a2a8b7cd0604507a8eb502962a6).


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1549/
Test PASSed.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20841
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20837: [SPARK-23686][ML][WIP] Better instrumentation

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20837#discussion_r174995768
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -55,6 +55,11 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
   s" storageLevel=${dataset.getStorageLevel}")
   }
 
+  override def logWarning(msg: => String): Unit = {
+logNamedValue(Instrumentation.loggerTags.warning, msg)
+super.logWarning(msg)
--- End diff --

Why not add a `level` param to `Instrumentation`'s `def log(msg: String)` instead of adding the `logWarning` method?


---




[GitHub] spark pull request #20837: [SPARK-23686][ML][WIP] Better instrumentation

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20837#discussion_r174995625
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/Instrumentation.scala ---
@@ -55,6 +55,11 @@ private[spark] class Instrumentation[E <: Estimator[_]] private (
   s" storageLevel=${dataset.getStorageLevel}")
   }
 
+  override def logWarning(msg: => String): Unit = {
+logNamedValue(Instrumentation.loggerTags.warning, msg)
+super.logWarning(msg)
--- End diff --

I suspect it will log the message repeatedly here.


---




[GitHub] spark pull request #20837: [SPARK-23686][ML][WIP] Better instrumentation

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20837#discussion_r174996170
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -517,6 +517,9 @@ class LogisticRegression @Since("1.2.0") (
  (new MultivariateOnlineSummarizer, new MultiClassSummarizer)
    )(seqOp, combOp, $(aggregationDepth))
  }
+instr.logNamedValue(Instrumentation.loggerTags.numExamples, summarizer.count)
+instr.logNamedValue("lowestLabelWeight", labelSummarizer.histogram.min.toString)
+instr.logNamedValue("highestLabelWeight", labelSummarizer.histogram.min.toString)
--- End diff --

Why not log the whole histogram (each label -> its weightSum)? Logging only the min/max weightSum seems useless, and the user does not even know which label they relate to.


---




[GitHub] spark issue #20841: [SPARK-23706][PYTHON] spark.conf.get(value, default=None...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20841
  
cc @cloud-fan, @viirya, @ueshin, @BryanCutler who I can directly think of 
for now.


---




[GitHub] spark pull request #20841: [SPARK-23706][PYTHON] spark.conf.get(value, defau...

2018-03-15 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/20841

[SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce 
None in PySpark

## What changes were proposed in this pull request?

Scala:

```
scala> spark.conf.get("hey", null)
res1: String = null
```

```
scala> spark.conf.get("spark.sql.sources.partitionOverwriteMode", null)
res2: String = null
```

Python:

**Before**

```
>>> spark.conf.get("hey", None)
...
py4j.protocol.Py4JJavaError: An error occurred while calling o30.get.
: java.util.NoSuchElementException: hey
...
```

```
>>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None)
u'STATIC'
```

**After**

```
>>> spark.conf.get("hey", None) is None
True
```

```
>>> spark.conf.get("spark.sql.sources.partitionOverwriteMode", None) is None
True
```

*Note that this PR preserves the case below:

```
>>> spark.conf.get("spark.sql.sources.partitionOverwriteMode")
u'STATIC'
```


## How was this patch tested?

Manually tested and unit tests were added.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark spark-conf-get

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20841.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20841


commit 1a6cfcea2fd23a2a8b7cd0604507a8eb502962a6
Author: hyukjinkwon 
Date:   2018-03-16T04:09:58Z

spark.conf.get(value, default=None) should produce None in PySpark




---




[GitHub] spark issue #20817: [SPARK-23599][SQL] Add a UUID generator from Pseudo-Rand...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20817
  
**[Test build #88290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88290/testReport)** for PR 20817 at commit [`75b80a2`](https://github.com/apache/spark/commit/75b80a249c9e5492e6fe159ac527feaea4f46c5a).


---




[GitHub] spark issue #20817: [SPARK-23599][SQL] Add a UUID generator from Pseudo-Rand...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20817
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1548/
Test PASSed.


---




[GitHub] spark issue #20817: [SPARK-23599][SQL] Add a UUID generator from Pseudo-Rand...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20817
  
Merged build finished. Test PASSed.


---




[GitHub] spark pull request #20817: [SPARK-23599][SQL] Add a UUID generator from Pseu...

2018-03-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20817#discussion_r174995443
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/RandomUUIDGenerator.scala ---
@@ -0,0 +1,39 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.util
+
+import java.util.UUID
+
+import org.apache.commons.math3.random.MersenneTwister
+
+/**
+ * This class is used to generate a UUID from Pseudo-Random Numbers.
+ *
+ * For the algorithm, see RFC 4122: A Universally Unique IDentifier (UUID) URN Namespace,
+ * section 4.4 "Algorithms for Creating a UUID from Truly Random or Pseudo-Random Numbers".
+ */
+case class RandomUUIDGenerator(randomSeed: Long) {
+  private val random = new MersenneTwister(randomSeed)
+
+  def getNextUUID(): UUID = {
--- End diff --

Sounds good. I've added it.


---




[GitHub] spark pull request #20831: [SPARK-23614][SQL] Fix incorrect reuse exchange w...

2018-03-15 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20831#discussion_r174994822
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala ---
@@ -68,6 +69,15 @@ case class InMemoryRelation(
 
   override protected def innerChildren: Seq[SparkPlan] = Seq(child)
 
+  override def doCanonicalize(): logical.LogicalPlan =
+copy(output = output.map(QueryPlan.normalizeExprId(_, child.output)),
+  storageLevel = StorageLevel.NONE,
+  child = child.canonicalized,
+  tableName = None)(
+  _cachedColumnBuffers,
+  sizeInBytesStats,
+  statsOfPlanToCache)
--- End diff --

`cachedColumnBuffers`, `sizeInBytesStats`, and `statsOfPlanToCache` won't be considered when comparing two `InMemoryRelation`s. So instead of creating empty instances of statistics, I just use the original values.


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174990264
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -49,32 +51,65 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("1.6.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+
+  /**
+   * Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
+   * invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the
+   * output).
+   * Default: "error"
+   * @group param
+   */
+  @Since("1.6.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"Hhow to handle invalid data (NULL values). Options are 'skip' (filter out rows with " +
+  "invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN " +
+  "in the * output).", ParamValidators.inArray(VectorAssembler.supportedHandleInvalids))
--- End diff --

"in the * output" -> "in the output"


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174993897
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -85,18 +120,34 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   } else {
 // Otherwise, treat all attributes as numeric. If we cannot get the number of attributes
 // from metadata, check the first row.
-val numAttrs = group.numAttributes.getOrElse(first.getAs[Vector](index).size)
-Array.tabulate(numAttrs)(i => NumericAttribute.defaultAttr.withName(c + "_" + i))
+(0 until length).map { i => NumericAttribute.defaultAttr.withName(c + "_" + i) }
+  }
+case DoubleType =>
+  val attribute = Attribute.fromStructField(field)
+  attribute match {
+case UnresolvedAttribute =>
+  Seq(NumericAttribute.defaultAttr.withName(c))
+case _ =>
+  Seq(attribute.withName(c))
   }
+case _ : NumericType | BooleanType =>
+  // If the input column type is a compatible scalar type, assume numeric.
+  Seq(NumericAttribute.defaultAttr.withName(c))
 case otherType =>
   throw new SparkException(s"VectorAssembler does not support the $otherType type")
   }
 }
-val metadata = new AttributeGroup($(outputCol), attrs).toMetadata()
-
+val featureAttributes = featureAttributesMap.flatten[Attribute]
+val lengths = featureAttributesMap.map(a => a.length)
+val metadata = new AttributeGroup($(outputCol), featureAttributes.toArray).toMetadata()
+val (filteredDataset, keepInvalid) = $(handleInvalid) match {
+  case StringIndexer.SKIP_INVALID => (dataset.na.drop("any", $(inputCols)), false)
--- End diff --

you can directly use `dataset.na.drop($(inputCols))`


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174991898
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -49,32 +51,65 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("1.6.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+
+  /**
+   * Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
+   * invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the
+   * output).
+   * Default: "error"
+   * @group param
+   */
+  @Since("1.6.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"Hhow to handle invalid data (NULL values). Options are 'skip' (filter out rows with " +
+  "invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN " +
+  "in the * output).", ParamValidators.inArray(VectorAssembler.supportedHandleInvalids))
+
+  setDefault(handleInvalid, VectorAssembler.ERROR_INVALID)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
 // Schema transformation.
 val schema = dataset.schema
-lazy val first = dataset.toDF.first()
-val attrs = $(inputCols).flatMap { c =>
+
+val featureAttributesMap: Seq[Seq[Attribute]] = $(inputCols).toSeq.map { c =>
   val field = schema(c)
-  val index = schema.fieldIndex(c)
   field.dataType match {
-case DoubleType =>
-  val attr = Attribute.fromStructField(field)
-  // If the input column doesn't have ML attribute, assume numeric.
-  if (attr == UnresolvedAttribute) {
-Some(NumericAttribute.defaultAttr.withName(c))
-  } else {
-Some(attr.withName(c))
-  }
-case _: NumericType | BooleanType =>
-  // If the input column type is a compatible scalar type, assume numeric.
-  Some(NumericAttribute.defaultAttr.withName(c))
 case _: VectorUDT =>
-  val group = AttributeGroup.fromStructField(field)
-  if (group.attributes.isDefined) {
-// If attributes are defined, copy them with updated names.
-group.attributes.get.zipWithIndex.map { case (attr, i) =>
+  val attributeGroup = AttributeGroup.fromStructField(field)
+  var length = attributeGroup.size
+  val isMissingNumAttrs = -1 == length
+  if (isMissingNumAttrs && dataset.isStreaming) {
+// this condition is checked for every column, but should be cheap
+throw new RuntimeException(
+  s"""
+ |VectorAssembler cannot dynamically determine the size of vectors for streaming
+ |data. Consider applying VectorSizeHint to ${c} so that this transformer can be
+ |used to transform streaming inputs.
+   """.stripMargin.replaceAll("\n", " "))
+  }
+  if (isMissingNumAttrs) {
+val column = dataset.select(c).na.drop()
+// column count is a spark job for every column missing num attrs
+length = (column.count() > 0, $(handleInvalid)) match {
+  // column first is the second spark job for every column missing num attrs
+  case (true, _) => column.first.getAs[Vector](0).size
+  case (false, VectorAssembler.SKIP_INVALID | VectorAssembler.ERROR_INVALID) => 0
+  case (false, _) =>
+throw new RuntimeException(
+  s"""
+ |VectorAssembler cannot determine the size of empty vectors. Consider applying
+ |VectorSizeHint to ${c} so that this transformer can be used to transform empty
+ |columns.
+   """.stripMargin.replaceAll("\n", " "))
--- End diff --

I think in this case `VectorSizeHint` also cannot help with providing the vector size.


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174990323
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -49,32 +51,65 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("1.6.0")
--- End diff --

`@Since("2.4.0")`


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174994214
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -49,32 +51,65 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("1.6.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+
+  /**
+   * Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
+   * invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the
+   * output).
+   * Default: "error"
+   * @group param
+   */
+  @Since("1.6.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"Hhow to handle invalid data (NULL values). Options are 'skip' (filter out rows with " +
+  "invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN " +
+  "in the * output).", ParamValidators.inArray(VectorAssembler.supportedHandleInvalids))
+
+  setDefault(handleInvalid, VectorAssembler.ERROR_INVALID)
+
   @Since("2.0.0")
   override def transform(dataset: Dataset[_]): DataFrame = {
 transformSchema(dataset.schema, logging = true)
 // Schema transformation.
 val schema = dataset.schema
-lazy val first = dataset.toDF.first()
-val attrs = $(inputCols).flatMap { c =>
+
+val featureAttributesMap: Seq[Seq[Attribute]] = $(inputCols).toSeq.map { c =>
   val field = schema(c)
-  val index = schema.fieldIndex(c)
   field.dataType match {
-case DoubleType =>
-  val attr = Attribute.fromStructField(field)
-  // If the input column doesn't have ML attribute, assume numeric.
-  if (attr == UnresolvedAttribute) {
-Some(NumericAttribute.defaultAttr.withName(c))
-  } else {
-Some(attr.withName(c))
-  }
-case _: NumericType | BooleanType =>
-  // If the input column type is a compatible scalar type, assume numeric.
-  Some(NumericAttribute.defaultAttr.withName(c))
 case _: VectorUDT =>
-  val group = AttributeGroup.fromStructField(field)
-  if (group.attributes.isDefined) {
-// If attributes are defined, copy them with updated names.
-group.attributes.get.zipWithIndex.map { case (attr, i) =>
+  val attributeGroup = AttributeGroup.fromStructField(field)
+  var length = attributeGroup.size
+  val isMissingNumAttrs = -1 == length
+  if (isMissingNumAttrs && dataset.isStreaming) {
+// this condition is checked for every column, but should be cheap
+throw new RuntimeException(
+  s"""
+ |VectorAssembler cannot dynamically determine the size of vectors for streaming
+ |data. Consider applying VectorSizeHint to ${c} so that this transformer can be
+ |used to transform streaming inputs.
+   """.stripMargin.replaceAll("\n", " "))
+  }
+  if (isMissingNumAttrs) {
+val column = dataset.select(c).na.drop()
--- End diff --

* The var name `column` isn't good. `colDataset` is better.

* An optional optimization is to scan the dataset once and count the non-null rows for each "missing num attrs" column, as in the sketch below.
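
A sketch of that one-pass idea, written in PySpark for brevity (the DataFrame and column names here are hypothetical; the Scala code in the PR would use the equivalent `count` aggregates):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, None), (2.0, 3.0), (None, 4.0)], ["a", "b"])

# count(col) skips NULLs, so a single select computes the non-null
# row count for every column in one scan of the dataset.
counts = df.select([F.count(F.col(c)).alias(c) for c in df.columns]).first()
print(counts)  # Row(a=2, b=2)
```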


---




[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174990221
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala ---
@@ -49,32 +51,65 @@ class VectorAssembler @Since("1.4.0") (@Since("1.4.0") override val uid: String)
   @Since("1.4.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)
 
+  /** @group setParam */
+  @Since("1.6.0")
+  def setHandleInvalid(value: String): this.type = set(handleInvalid, value)
+
+  /**
+   * Param for how to handle invalid data (NULL values). Options are 'skip' (filter out rows with
+   * invalid data), 'error' (throw an error), or 'keep' (return relevant number of NaN in the
+   * output).
+   * Default: "error"
+   * @group param
+   */
+  @Since("1.6.0")
+  override val handleInvalid: Param[String] = new Param[String](this, "handleInvalid",
+"Hhow to handle invalid data (NULL values). Options are 'skip' (filter out rows with " +
--- End diff --

Hhow -> How


---




[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20827
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88284/
Test PASSed.


---




[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20827
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...

2018-03-15 Thread maropu
Github user maropu commented on the issue:

https://github.com/apache/spark/pull/20827
  
@gatorsmile @viirya 


---




[GitHub] spark issue #20827: [SPARK-23666][SQL] Do not display exprIds of Alias in us...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20827
  
**[Test build #88284 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88284/testReport)** for PR 20827 at commit [`68650ff`](https://github.com/apache/spark/commit/68650ff8c2f3a90c55b5bf4345c16a92fda3782a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class PrettyNamedExpression(`


---




[GitHub] spark pull request #20839: [SPARK-23699][PYTHON][SQL] Raise same type of err...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20839#discussion_r174991326
  
--- Diff: python/pyspark/sql/session.py ---
@@ -679,13 +679,13 @@ def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=Tr
 "true." % _exception_message(e))
 warnings.warn(msg)
 else:
-msg = (
+e.message = (
--- End diff --

@BryanCutler, I think the `message` attribute exists only in Python 2. Also, are you doubly sure that this wraps the exception message in the console too?
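
For reference, a small sketch of the Python 2 vs. 3 difference being pointed out; portable code reads `str(e)` instead of `e.message`:

```python
try:
    raise ValueError("boom")
except ValueError as e:
    # Python 2 exposes e.message; Python 3 removed it.
    print(getattr(e, "message", None))  # "boom" on Python 2, None on Python 3
    print(str(e))                       # "boom" on both
```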


---




[GitHub] spark pull request #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpar...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20830#discussion_r174991833
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -3551,12 +3544,12 @@ def test_null_conversion(self):
 self.assertTrue(all([c == 1 for c in null_counts]))
 
 def _toPandas_arrow_toggle(self, df):
-self.spark.conf.set("spark.sql.execution.arrow.enabled", "false")
-try:
+with self.sql_conf({"spark.sql.execution.arrow.enabled": False}):
 pdf = df.toPandas()
-finally:
-self.spark.conf.set("spark.sql.execution.arrow.enabled", "true")
-pdf_arrow = df.toPandas()
+
+with self.sql_conf({"spark.sql.execution.arrow.enabled": True}):
--- End diff --

Ah, OK, I am fine with that. Will omit this.


---




[GitHub] spark pull request #20830: [SPARK-23691][PYTHON] Use sql_conf util in PySpar...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20830#discussion_r174991686
  
--- Diff: python/pyspark/sql/tests.py ---
@@ -2461,17 +2461,13 @@ def test_join_without_on(self):
 df1 = self.spark.range(1).toDF("a")
 df2 = self.spark.range(1).toDF("b")
 
-try:
-self.spark.conf.set("spark.sql.crossJoin.enabled", "false")
+with self.sql_conf({"spark.sql.crossJoin.enabled": False}):
 self.assertRaises(AnalysisException, lambda: df1.join(df2, how="inner").collect())
 
-self.spark.conf.set("spark.sql.crossJoin.enabled", "true")
+with self.sql_conf({"spark.sql.crossJoin.enabled": True}):
--- End diff --

Yup, it originally unset `spark.sql.crossJoin.enabled`, but now it sets it back to the original value. If `spark.sql.crossJoin.enabled` was unset before this test, it will be changed back to unset afterwards.
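
For context, a minimal sketch of what such a `sql_conf` helper typically looks like (an illustrative assumption, not the exact helper in tests.py; note that it relies on `conf.get(key, None)` returning `None` for unset keys, which is what PR #20841 in this digest fixes for PySpark):

```python
from contextlib import contextmanager

@contextmanager
def sql_conf(spark, pairs):
    keys = list(pairs.keys())
    # Remember the current values; None marks a key that was unset.
    old_values = [spark.conf.get(k, None) for k in keys]
    try:
        for k, v in pairs.items():
            spark.conf.set(k, v)
        yield
    finally:
        # Restore each key to its previous value, or unset it again.
        for k, old in zip(keys, old_values):
            if old is None:
                spark.conf.unset(k)
            else:
                spark.conf.set(k, old)
```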


---




[GitHub] spark issue #20840: [SPARK-23702][SS] Forbid watermarks on both sides of a s...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20840
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88282/
Test PASSed.


---




[GitHub] spark issue #20840: [SPARK-23702][SS] Forbid watermarks on both sides of a s...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20840
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20840: [SPARK-23702][SS] Forbid watermarks on both sides of a s...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20840
  
**[Test build #88282 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88282/testReport)** for PR 20840 at commit [`9d742ab`](https://github.com/apache/spark/commit/9d742ab4d22021654aa2778d601d806b7a7d106a).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88288/
Test PASSed.


---




[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19381
  
**[Test build #88288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88288/testReport)** for PR 19381 at commit [`20b245a`](https://github.com/apache/spark/commit/20b245ad49124d8d8b42c6835859759cd6af7964).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20824: [SPARK-23683][SQL][FOLLOW-UP] FileCommitProtocol.instant...

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/20824
  
Not a big deal at all, but I actually meant `[SPARK-20236][SQL][FOLLOW-UP] ...` or `[SPARK-23683][SQL] ...` for the title. I am sorry that I wasn't clear.


---




[GitHub] spark pull request #20824: [SPARK-23683][SQL][FOLLOW-UP] FileCommitProtocol....

2018-03-15 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/20824#discussion_r174988745
  
--- Diff: core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala ---
@@ -145,15 +146,23 @@ object FileCommitProtocol {
   jobId: String,
   outputPath: String,
   dynamicPartitionOverwrite: Boolean = false): FileCommitProtocol = {
+
+logDebug(s"Creating committer $className; job $jobId; output=$outputPath;" +
+  s" dynamic=$dynamicPartitionOverwrite")
 val clazz = Utils.classForName(className).asInstanceOf[Class[FileCommitProtocol]]
 // First try the constructor with arguments (jobId: String, outputPath: String,
 // dynamicPartitionOverwrite: Boolean).
 // If that doesn't exist, try the one with (jobId: string, outputPath: String).
 try {
   val ctor = clazz.getDeclaredConstructor(classOf[String], classOf[String], classOf[Boolean])
+  logDebug("Using (String, String, Boolean) constructor")
   ctor.newInstance(jobId, outputPath, dynamicPartitionOverwrite.asInstanceOf[java.lang.Boolean])
 } catch {
   case _: NoSuchMethodException =>
+logDebug("Falling back to (String, String) constructor")
+require(!dynamicPartitionOverwrite,
+  "Dynamic Partition Overwrite is enabled but" +
+s" the committer ${className} does not have the appropriate constructor")
--- End diff --

Hm, actually we could do, for example:

```scala
abstract class FileCommitProtocol {
  ...
  def dynamicPartitionOverwrite: Boolean = false
}
```

```scala
class HadoopMapReduceCommitProtocol(
    jobId: String,
    path: String,
    override val dynamicPartitionOverwrite: Boolean = false)
```

(^ this hasn't been double-checked closely, for example whether the signature is safe or not. It was just an idea.)

and use `committer.dynamicPartitionOverwrite` in `InsertIntoHadoopFsRelationCommand` to respect whether the commit protocol supports it, if I understood everything correctly, and then produce a warning saying dynamic partition overwrite will be ignored.

_However_, sure. I think this is kind of a made-up case and should be a corner case, I guess. I don't want to suggest an overkill (maybe), and I think we don't have to make this too complicated for now.

I am okay as is. Just wanted to make sure that we considered and checked other possible stories.


---




[GitHub] spark issue #20327: [SPARK-12963][CORE] NM host for driver end points

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20327
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20327: [SPARK-12963][CORE] NM host for driver end points

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20327
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88283/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20327: [SPARK-12963][CORE] NM host for driver end points

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20327
  
**[Test build #88283 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88283/testReport)**
 for PR 20327 at commit 
[`ae4ad4a`](https://github.com/apache/spark/commit/ae4ad4a7568cf5845861237d848468c4dc8cf840).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20809: [SPARK-23667][CORE] Better scala version check

2018-03-15 Thread gczsjdy
Github user gczsjdy commented on the issue:

https://github.com/apache/spark/pull/20809
  
cc @cloud-fan @viirya 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20829
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88287/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20829
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20829
  
**[Test build #88287 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88287/testReport)**
 for PR 20829 at commit 
[`482225f`](https://github.com/apache/spark/commit/482225fb1ffeccc475972e42246b5f8681d7fee5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20767: [SPARK-23623] [SS] Avoid concurrent use of cached consum...

2018-03-15 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20767
  
The idea is good. But how do you propose exposing that information?
A periodic printout in the log?

From a different angle, I would rather not do feature creep in this PR that
is intended to be backported to 2.3.

On Mar 15, 2018 7:31 PM, "tedyu"  wrote:

> *@tedyu* commented on this pull request.
> --
>
> In external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala:
>
> >       CachedKafkaDataConsumer(newInternalConsumer)
> >
> > -    } else if (existingInternalConsumer.inuse) {
> > +    } else if (existingInternalConsumer.inUse) {
> >       // If consumer is already cached but is currently in use, then return a new consumer
> >       NonCachedKafkaDataConsumer(newInternalConsumer)
>
> Maybe keep an internal counter for how many times the non-cached consumer
> is created.
> This would give us information on how effective the cache is.
>



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20807: SPARK-23660: Fix exception in yarn cluster mode when app...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20807
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20807: SPARK-23660: Fix exception in yarn cluster mode when app...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20807
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88289/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20807: SPARK-23660: Fix exception in yarn cluster mode when app...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20807
  
**[Test build #88289 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88289/testReport)**
 for PR 20807 at commit 
[`5b304d1`](https://github.com/apache/spark/commit/5b304d158f6a53483cef2b6c00b2546004f976c4).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19222: [SPARK-10399][CORE][SQL] Introduce multiple Memor...

2018-03-15 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19222#discussion_r174985078
  
--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/memory/OffHeapMemoryBlock.java ---
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.unsafe.memory;
+
+import org.apache.spark.unsafe.Platform;
+
+public class OffHeapMemoryBlock extends MemoryBlock {
+  static public final OffHeapMemoryBlock NULL = new OffHeapMemoryBlock(0, 0);
+
+  public OffHeapMemoryBlock(long address, long size) {
+    super(null, address, size);
+  }
+
+  @Override
+  public MemoryBlock subBlock(long offset, long size) {
+    if (offset + size > this.offset + length) {
--- End diff --

Sure, I will move this check into the parent class.
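
For illustration, hoisting the check could look roughly like this. A 
Scala-flavored sketch only: the actual `MemoryBlock` hierarchy is Java, and 
`buildSubBlock` is a hypothetical hook, not an existing method:

```scala
// Template-method sketch: the range check lives once in the parent,
// and subclasses only build the concrete sub-block.
abstract class MemoryBlock(val obj: AnyRef, val offset: Long, val length: Long) {

  // Subclasses construct the concrete sub-block (on-heap, off-heap, ...).
  protected def buildSubBlock(offset: Long, size: Long): MemoryBlock

  final def subBlock(offset: Long, size: Long): MemoryBlock = {
    // Reject sub-blocks that run past the end of this block, mirroring
    // the check quoted in the diff above.
    require(offset + size <= this.offset + length,
      s"sub-block ending at ${offset + size} exceeds block end ${this.offset + length}")
    buildSubBlock(offset, size)
  }
}
```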


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20829
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20829
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88286/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20829
  
**[Test build #88286 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88286/testReport)**
 for PR 20829 at commit 
[`482225f`](https://github.com/apache/spark/commit/482225fb1ffeccc475972e42246b5f8681d7fee5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20767: [SPARK-23623] [SS] Avoid concurrent use of cached...

2018-03-15 Thread tedyu
Github user tedyu commented on a diff in the pull request:

https://github.com/apache/spark/pull/20767#discussion_r174984237
  
--- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala ---
@@ -467,44 +435,58 @@ private[kafka010] object KafkaDataConsumer extends Logging {
       // If this is reattempt at running the task, then invalidate cached consumer if any and
       // start with a new one.
       if (existingInternalConsumer != null) {
-        if (existingInternalConsumer.inuse) {
-          // Consumer exists in cache and is somehow in use. Don't close it immediately, but
-          // mark it for being closed when it is released.
+        // Consumer exists in cache. If its in use, mark it for closing later, or close it now.
+        if (existingInternalConsumer.inUse) {
           existingInternalConsumer.markedForClose = true
-          NonCachedKafkaDataConsumer(newInternalConsumer)
-
         } else {
-          // Consumer exists in cache and is not in use, so close it immediately and replace
-          // it with a new one.
           existingInternalConsumer.close()
-          cache.put(key, newInternalConsumer)
-          CachedKafkaDataConsumer(newInternalConsumer)
-
         }
-      } else {
-        // Consumer is not cached, put the new one in the cache
-        cache.put(key, newInternalConsumer)
-        CachedKafkaDataConsumer(newInternalConsumer)
-
       }
+      cache.remove(key)  // Invalidate the cache in any case
+      NonCachedKafkaDataConsumer(newInternalConsumer)
+
     } else if (!useCache) {
       // If planner asks to not reuse consumers, then do not use it, return a new consumer
       NonCachedKafkaDataConsumer(newInternalConsumer)
 
     } else if (existingInternalConsumer == null) {
       // If consumer is not already cached, then put a new in the cache and return it
-      newInternalConsumer.inuse = true
       cache.put(key, newInternalConsumer)
+      newInternalConsumer.inUse = true
       CachedKafkaDataConsumer(newInternalConsumer)
 
-    } else if (existingInternalConsumer.inuse) {
+    } else if (existingInternalConsumer.inUse) {
       // If consumer is already cached but is currently in use, then return a new consumer
       NonCachedKafkaDataConsumer(newInternalConsumer)
--- End diff --

Maybe keep an internal counter for how many times the non-cached consumer 
is created.
This would give us information on how effective the cache is.
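
For illustration, such a counter might look like the sketch below; 
`ConsumerCacheStats` and its methods are hypothetical names, not part of 
`KafkaDataConsumer`:

```scala
import java.util.concurrent.atomic.AtomicLong

// Hypothetical counters for judging cache effectiveness.
object ConsumerCacheStats {
  private val cacheHits = new AtomicLong(0)
  private val nonCachedCreations = new AtomicLong(0)

  // Call when an acquisition is served from the cache.
  def recordHit(): Unit = cacheHits.incrementAndGet()

  // Call whenever a NonCachedKafkaDataConsumer has to be created.
  def recordMiss(): Unit = nonCachedCreations.incrementAndGet()

  // Fraction of acquisitions served from the cache; 1.0 before any traffic.
  def hitRatio: Double = {
    val hits = cacheHits.get()
    val misses = nonCachedCreations.get()
    if (hits + misses == 0) 1.0 else hits.toDouble / (hits + misses)
  }
}
```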


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20807: SPARK-23660: Fix exception in yarn cluster mode when app...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20807
  
**[Test build #88289 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88289/testReport)**
 for PR 20807 at commit 
[`5b304d1`](https://github.com/apache/spark/commit/5b304d158f6a53483cef2b6c00b2546004f976c4).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...

2018-03-15 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/20807#discussion_r174983468
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -497,6 +500,8 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends
         // if the user app did not create a SparkContext.
         throw new IllegalStateException("User did not initialize spark context!")
       }
+      // After initialisation notify user class thread to continue
+      synchronized { notify() }
--- End diff --

Moved it into the `resumeDriver` function, right below `sparkContextInitialized`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...

2018-03-15 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/20807#discussion_r174983449
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -417,8 +417,11 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends
     }
   }
 
-  private def sparkContextInitialized(sc: SparkContext) = {
+  private def sparkContextInitialized(sc: SparkContext) = synchronized {
--- End diff --

Used `sparkContextPromise` as the lock.
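
For context, the wait/notify pattern on that lock could look roughly like the 
sketch below, with `String` standing in for `SparkContext`; this mirrors the 
PR's idea, not the actual `ApplicationMaster` code:

```scala
import scala.concurrent.Promise

// Sketch: the Promise both hands the SparkContext to the AM thread and
// serves as the monitor object for wait/notify.
class AppMasterSketch {
  private val sparkContextPromise = Promise[String]()

  // Runs on the user-class thread once the SparkContext exists.
  def sparkContextInitialized(sc: String): Unit = sparkContextPromise.synchronized {
    sparkContextPromise.trySuccess(sc)
    sparkContextPromise.wait() // block until resumeDriver() is called
  }

  // Runs on the AM thread after registration has finished.
  def resumeDriver(): Unit = sparkContextPromise.synchronized {
    sparkContextPromise.notify()
  }
}
```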


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20807: SPARK-23660: Fix exception in yarn cluster mode w...

2018-03-15 Thread gaborgsomogyi
Github user gaborgsomogyi commented on a diff in the pull request:

https://github.com/apache/spark/pull/20807#discussion_r174983476
  
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -497,6 +500,8 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments) extends
         // if the user app did not create a SparkContext.
         throw new IllegalStateException("User did not initialize spark context!")
       }
+      // After initialisation notify user class thread to continue
--- End diff --

Fixed and switched to US spell checker.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/19381
  
**[Test build #88288 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88288/testReport)**
 for PR 19381 at commit 
[`20b245a`](https://github.com/apache/spark/commit/20b245ad49124d8d8b42c6835859759cd6af7964).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1547/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19381
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20785: [SPARK-23640][CORE] Fix hadoop config may overrid...

2018-03-15 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/20785#discussion_r174980995
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2434,7 +2434,8 @@ private[spark] object Utils extends Logging {
    */
   def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
     val sparkValue = conf.get(key, default)
-    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
+    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn"
--- End diff --

`YarnConfiguration` can only configure one `spark.shuffle.service.port` value.
We can gradually upgrade the shuffle service if we get the 
`spark.shuffle.service.port` value from `SparkConf`, because we can set 
different values for different applications.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/1546/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20829: [SPARK-23690][ML] Add handleinvalid to VectorAsse...

2018-03-15 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20829#discussion_r174980520
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala ---
@@ -234,7 +234,7 @@ class StringIndexerModel (
     val metadata = NominalAttribute.defaultAttr
       .withName($(outputCol)).withValues(filteredLabels).toMetadata()
     // If we are skipping invalid records, filter them out.
-    val (filteredDataset, keepInvalid) = getHandleInvalid match {
--- End diff --

OK, it doesn't matter; no need for a separate PR, I think. It's just a minor change.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][CORE][SQL] Introduce multiple MemoryBlocks...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/19222
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20829
  
**[Test build #88287 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88287/testReport)**
 for PR 20829 at commit 
[`482225f`](https://github.com/apache/spark/commit/482225fb1ffeccc475972e42246b5f8681d7fee5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20579
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...

2018-03-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20579
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88279/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20802: [SPARK-23651][core]Add a check for host name

2018-03-15 Thread 10110346
Github user 10110346 closed the pull request at:

https://github.com/apache/spark/pull/20802


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20579
  
**[Test build #88279 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88279/testReport)**
 for PR 20579 at commit 
[`3588af3`](https://github.com/apache/spark/commit/3588af39eb889ab72f6800b546ba9f2107f15dc0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread yogeshg
Github user yogeshg commented on the issue:

https://github.com/apache/spark/pull/20829
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20785: [SPARK-23640][CORE] Fix hadoop config may overrid...

2018-03-15 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/20785#discussion_r174979402
  
--- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
@@ -2434,7 +2434,8 @@ private[spark] object Utils extends Logging {
    */
   def getSparkOrYarnConfig(conf: SparkConf, key: String, default: String): String = {
     val sparkValue = conf.get(key, default)
-    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn") {
+    if (conf.get(SparkLauncher.SPARK_MASTER, null) == "yarn"
--- End diff --

Assuming that `--conf spark.shuffle.service.port=7338` is configured, 
7338 is displayed on the Environment tab, but 7337 is actually used.
So my idea is to get the value from `SparkConf` if the key starts with 
`spark.`, except for `spark.hadoop.`.
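
A minimal sketch of that precedence rule, with plain `Map`s standing in for 
`SparkConf` and `YarnConfiguration`:

```scala
// Prefer the SparkConf value for keys under "spark." (except
// "spark.hadoop."); otherwise fall back to the YARN config.
def getSparkOrYarnConfig(
    sparkConf: Map[String, String],
    yarnConf: Map[String, String],
    key: String,
    default: String): String = {
  val sparkValue = sparkConf.getOrElse(key, default)
  val preferSpark = key.startsWith("spark.") && !key.startsWith("spark.hadoop.")
  if (preferSpark) sparkValue else yarnConf.getOrElse(key, sparkValue)
}

// With this rule a per-application port would win, e.g.:
// getSparkOrYarnConfig(
//   Map("spark.shuffle.service.port" -> "7338"),
//   Map("spark.shuffle.service.port" -> "7337"),
//   "spark.shuffle.service.port", "7337")  // returns "7338"
```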




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20829: [SPARK-23690][ML] Add handleinvalid to VectorAssembler

2018-03-15 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20829
  
**[Test build #88286 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88286/testReport)**
 for PR 20829 at commit 
[`482225f`](https://github.com/apache/spark/commit/482225fb1ffeccc475972e42246b5f8681d7fee5).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


