[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85866/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20209
  
**[Test build #85866 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85866/testReport)**
 for PR 20209 at commit 
[`f6215fc`](https://github.com/apache/spark/commit/f6215fc45901456dea8a4fb32f7c87907bb2fbfb).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class OneHotEncoderEstimator(JavaEstimator, HasInputCols, 
HasOutputCols, HasHandleInvalid,`
  * `class OneHotEncoderModel(JavaModel, JavaMLReadable, JavaMLWritable):`
  * `class HasOutputCols(Params):`


---




[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...

2018-01-09 Thread tomasatdatabricks
Github user tomasatdatabricks commented on a diff in the pull request:

https://github.com/apache/spark/pull/20168#discussion_r160496086
  
--- Diff: python/pyspark/ml/image.py ---
@@ -71,9 +88,30 @@ def ocvTypes(self):
         """
 
         if self._ocvTypes is None:
-            ctx = SparkContext._active_spark_context
-            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
-        return self._ocvTypes
+            ctx = SparkContext.getOrCreate()
+            ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()
+            self._ocvTypes = [self._OcvType(name=x.name(),
+                                            mode=x.mode(),
+                                            nChannels=x.nChannels(),
+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ocvTypeList]
+        return self._ocvTypes[:]
+
+    def ocvTypeByName(self, name):
+        if self._ocvTypesByName is None:
+            self._ocvTypesByName = {x.name: x for x in self.ocvTypes}
+        if name not in self._ocvTypesByName:
+            raise ValueError(
+                "Can not find matching OpenCvFormat for type = '%s'; supported formats are = %s" %
+                (name, str(self._ocvTypesByName.keys())))
+        return self._ocvTypesByName[name]
+
+    def ocvTypeByMode(self, mode):
--- End diff --

It is not consistent because Python cannot overload methods based on argument 
type, but I can rename the Scala side to match Python. It does not make a big 
difference in this case.
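
To illustrate the overloading point, here is a minimal sketch (not the code in this PR) of why the Python side ends up with two differently named lookups. `OcvType`, `OcvTypeRegistry`, and the sample entries are hypothetical stand-ins; only the method names `ocvTypeByName` and `ocvTypeByMode` come from the diff.

```python
from collections import namedtuple

# Hypothetical stand-in for the _OcvType record built in the diff.
OcvType = namedtuple("OcvType", ["name", "mode", "nChannels"])


class OcvTypeRegistry(object):
    """Looks up OpenCV type records by name or by integer mode."""

    def __init__(self, types):
        self._types = list(types)
        # Lazily built indexes, mirroring the caching pattern in the diff.
        self._byName = None
        self._byMode = None

    def ocvTypeByName(self, name):
        if self._byName is None:
            self._byName = {t.name: t for t in self._types}
        if name not in self._byName:
            raise ValueError("Unknown OpenCV type name %r; supported: %s"
                             % (name, sorted(self._byName)))
        return self._byName[name]

    def ocvTypeByMode(self, mode):
        if self._byMode is None:
            self._byMode = {t.mode: t for t in self._types}
        if mode not in self._byMode:
            raise ValueError("Unknown OpenCV mode %r; supported: %s"
                             % (mode, sorted(self._byMode)))
        return self._byMode[mode]


# Illustrative entries only; the real list comes from the JVM side.
registry = OcvTypeRegistry([
    OcvType("CV_8UC1", 0, 1),
    OcvType("CV_8UC3", 16, 3),
    OcvType("CV_8UC4", 24, 4),
])
```

Defining a second `def ocvType(self, mode)` after `def ocvType(self, name)` in the same class would simply replace the first definition, which is why the two lookups need distinct names in Python.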


---




[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...

2018-01-09 Thread shaneknapp
Github user shaneknapp commented on the issue:

https://github.com/apache/spark/pull/19290
  
ok sounds good -- we'll keep things 'old' for now.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20204
  
**[Test build #85856 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85856/testReport)**
 for PR 20204 at commit 
[`9f2c400`](https://github.com/apache/spark/commit/9f2c400eceb771e88f6f4c4909e4a5e67414e3c3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85856/
Test FAILed.


---




[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...

2018-01-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/18991
  
I'm reopening this to re-test the master branch with this option before Apache 
Spark 2.3.


---




[GitHub] spark pull request #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-d...

2018-01-09 Thread dongjoon-hyun
GitHub user dongjoon-hyun reopened a pull request:

https://github.com/apache/spark/pull/18991

[SPARK-21783][SQL][WIP] Turn on ORC filter push-down by default

## What changes were proposed in this pull request?

ORC filter push-down has been disabled by default since it was introduced in 
[SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149).

Apache Spark now depends on Apache ORC 1.4.0. For Apache Spark 2.3, this PR 
turns on ORC filter push-down by default, as is already done for Parquet 
([SPARK-9207](https://issues.apache.org/jira/browse/SPARK-9207)), as a part of 
[SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901), "Feature 
parity for ORC with Parquet".


## How was this patch tested?

Pass the Jenkins with the existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjoon-hyun/spark SPARK-21783

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18991.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18991


commit 2bc2b17aba5231c6ac3e0ab7c830acc56790df9f
Author: Dongjoon Hyun 
Date:   2017-08-18T07:26:18Z

[SPARK-21783][SQL] Turn on ORC filter push-down by default




---




[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/18991
  
**[Test build #85868 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85868/testReport)**
 for PR 18991 at commit 
[`2bc2b17`](https://github.com/apache/spark/commit/2bc2b17aba5231c6ac3e0ab7c830acc56790df9f).


---




[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20013
  
**[Test build #85867 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85867/testReport)**
 for PR 20013 at commit 
[`86275b0`](https://github.com/apache/spark/commit/86275b068e08b36e3285d5ab8e77484884f39c1c).


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20203
  
ok to test


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20203
  
**[Test build #85869 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85869/testReport)**
 for PR 20203 at commit 
[`8d736c1`](https://github.com/apache/spark/commit/8d736c1cd56e341d4d7da88bae01ac3a47649f80).


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20203
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/19893#discussion_r160501177
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,4 +17,22 @@
 
 package org.apache.spark.sql.test
 
-trait SharedSQLContext extends SQLTestUtils with SharedSparkSession
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
+
+  /**
+   * Auto thread audit is turned off here intentionally and done manually.
--- End diff --

I'm not sure I understand your explanation, and I definitely don't 
understand what's going on from the comment in the code. What I'm asking is for 
the comment here to explain not what the code is doing, but *why* it's doing it.

Basically, if instead of the code you have here, you just called 
`super.beforeAll` and `super.afterAll`, without disabling 
`enableAutoThreadAudit`, what will break and why? That's what the comment 
should explain.
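
For readers unfamiliar with the term, the following is a minimal, generic sketch of what a manual thread audit does: snapshot the live threads before some work, snapshot again afterwards, and report the difference. It is written in Python purely for illustration and makes no claim about how the Spark test utilities implement it; `thread_names`, `audit`, and `leaky-worker` are hypothetical names.

```python
import threading


def thread_names():
    """Snapshot the names of all currently live threads."""
    return {t.name for t in threading.enumerate()}


def audit(before, after):
    """Return names of threads that exist now but did not exist before."""
    return sorted(after - before)


# Manual audit around a piece of work that leaks a thread.
stop = threading.Event()
before = thread_names()

leaky = threading.Thread(target=stop.wait, name="leaky-worker")
leaky.daemon = True
leaky.start()

leaked = audit(before, thread_names())

# Clean up so the sketch itself does not leave the thread running.
stop.set()
leaky.join()
```

The result depends entirely on where the two snapshots are taken, which is the kind of detail a "why" comment can usefully spell out.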


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20203
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85869/
Test FAILed.


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20203
  
**[Test build #85869 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85869/testReport)**
 for PR 20203 at commit 
[`8d736c1`](https://github.com/apache/spark/commit/8d736c1cd56e341d4d7da88bae01ac3a47649f80).
 * This patch **fails RAT tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerExecutorBlacklistedForStage(`


---




[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2018-01-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20097
  
Hi, @tdas .
Could you merge this to `branch-2.3` , too?


---




[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...

2018-01-09 Thread MrBago
Github user MrBago commented on a diff in the pull request:

https://github.com/apache/spark/pull/20168#discussion_r160502167
  
--- Diff: python/pyspark/ml/image.py ---
@@ -71,9 +88,30 @@ def ocvTypes(self):
         """
 
         if self._ocvTypes is None:
-            ctx = SparkContext._active_spark_context
-            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
-        return self._ocvTypes
+            ctx = SparkContext.getOrCreate()
+            ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()
+            self._ocvTypes = [self._OcvType(name=x.name(),
+                                            mode=x.mode(),
+                                            nChannels=x.nChannels(),
+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ocvTypeList]
+        return self._ocvTypes[:]
+
+    def ocvTypeByName(self, name):
+        if self._ocvTypesByName is None:
+            self._ocvTypesByName = {x.name: x for x in self.ocvTypes}
+        if name not in self._ocvTypesByName:
+            raise ValueError(
+                "Can not find matching OpenCvFormat for type = '%s'; supported formats are = %s" %
+                (name, str(self._ocvTypesByName.keys())))
+        return self._ocvTypesByName[name]
+
+    def ocvTypeByMode(self, mode):
--- End diff --

I think we could make either API work for both languages, but it's a bit 
unnatural. There's a tradeoff between doing the most natural and appropriate 
thing in each language and having matching APIs. Spark has chosen to prefer 
making the APIs match, so let's do our best to do that.
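
As a rough sketch of the "one API for both languages" option, a single Python entry point can mimic Scala-style overloading by dispatching on the runtime type of its argument. The function name `ocvType`, the `OcvType` record, and the sample entries are hypothetical and only meant to show why this reads as less natural in Python.

```python
from collections import namedtuple

# Hypothetical record and sample data, for illustration only.
OcvType = namedtuple("OcvType", ["name", "mode"])
_TYPES = [OcvType("CV_8UC1", 0), OcvType("CV_8UC3", 16)]


def ocvType(key):
    """Single entry point that dispatches on the runtime type of `key`."""
    if isinstance(key, str):
        matches = [t for t in _TYPES if t.name == key]
    elif isinstance(key, int) and not isinstance(key, bool):
        matches = [t for t in _TYPES if t.mode == key]
    else:
        raise TypeError("key must be a type name (str) or a mode (int), got %s"
                        % type(key).__name__)
    if not matches:
        raise ValueError("No OpenCV type matches %r" % (key,))
    return matches[0]
```

The type checks that a Scala compiler would resolve at compile time become explicit `isinstance` branches here, which is the cost of keeping one name across both languages.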


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160502356
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/executor/Dockerfile ---
@@ -1,35 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-FROM spark-base
-
-# Before building the docker image, first build and make a Spark 
distribution following
-# the instructions in 
http://spark.apache.org/docs/latest/building-spark.html.
-# If this docker file is being used in the context of building your images 
from a Spark
-# distribution, the docker build command should be invoked from the top 
level directory
-# of the Spark distribution. E.g.:
-# docker build -t spark-executor:latest -f 
kubernetes/dockerfiles/executor/Dockerfile .
-
-COPY examples /opt/spark/examples
-
-CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
-env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt && \
-readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt && \
-if ! [ -z ${SPARK_MOUNTED_CLASSPATH}+x} ]; then 
SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
-if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then 
SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
--- End diff --

The difference is handled in the submission code; `SPARK_CLASSPATH` is set 
to the appropriate value.


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160502410
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile ---
@@ -41,7 +41,8 @@ COPY ${spark_jars} /opt/spark/jars
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY conf /opt/spark/conf
-COPY ${img_path}/spark-base/entrypoint.sh /opt/
+COPY ${img_path}/spark/entrypoint.sh /opt/
+COPY examples /opt/spark/examples
--- End diff --

Didn't know about that directory, but sounds like it should be added.


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160502618
  
--- Diff: 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh ---
@@ -0,0 +1,97 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# echo commands to the terminal output
+set -ex
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+uidentry=$(getent passwd $myuid)
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+if [ -w /etc/passwd ] ; then
+echo "$myuid:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" 
>> /etc/passwd
+else
+echo "Container ENTRYPOINT failed to add passwd entry for 
anonymous UID"
+fi
+fi
+
+SPARK_K8S_CMD="$1"
+if [ -z "$SPARK_K8S_CMD" ]; then
+  echo "No command to execute has been provided." 1>&2
--- End diff --

You can do that with `docker container create --entrypoint blah`, right? 
Otherwise you have to add code here to specify what command to run when no 
arguments are provided. I'd rather have a proper error, since the entry point 
is tightly coupled with the submission code.
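
The fail-fast behaviour argued for here can be sketched as a small dispatch function. This is a Python illustration of the idea, not a translation of `entrypoint.sh`; the command names in `COMMANDS` are hypothetical, and only the error message is taken from the diff.

```python
# Hypothetical command table; the real entry point's commands are not shown
# in the quoted diff.
COMMANDS = {
    "driver": lambda args: "run driver with %s" % (args,),
    "executor": lambda args: "run executor with %s" % (args,),
}


def dispatch(argv):
    """Return (exit_code, message) for the given argument list."""
    if not argv:
        # No default command: report a clear error instead of guessing.
        return 1, "No command to execute has been provided."
    command, rest = argv[0], argv[1:]
    if command not in COMMANDS:
        return 1, "Unknown command: %s" % command
    return 0, COMMANDS[command](rest)
```

With no default branch, a caller who wants a different process has to override the entry point explicitly rather than rely on an implicit fallback.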


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160503103
  
--- Diff: 
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
 ---
@@ -29,17 +29,23 @@ private[spark] object Config extends Logging {
   .stringConf
   .createWithDefault("default")
 
+  val CONTAINER_IMAGE =
+ConfigBuilder("spark.kubernetes.container.image")
+  .doc("Container image to use for Spark containers. Individual 
container types " +
+"(e.g. driver or executor) can also be configured to use different 
images if desired, " +
+"by setting the container-specific image name.")
--- End diff --

Why would I mention just one specific way of overriding this?

I also have half a mind to just remove this since this documentation is not 
visible anywhere...


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160504887
  
--- Diff: 
resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/DriverConfigOrchestratorSuite.scala
 ---
@@ -75,8 +73,8 @@ class DriverConfigOrchestratorSuite extends SparkFunSuite 
{
 
   test("Submission steps with an init-container.") {
 val sparkConf = new SparkConf(false)
-  .set(DRIVER_CONTAINER_IMAGE, DRIVER_IMAGE)
-  .set(INIT_CONTAINER_IMAGE, IC_IMAGE)
+  .set(CONTAINER_IMAGE, DRIVER_IMAGE)
+  .set(INIT_CONTAINER_IMAGE.key, IC_IMAGE)
--- End diff --

Yes, the test is checking different values for the default and init 
container images.
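
The behaviour under test can be sketched as a simple fallback lookup: a container-specific image wins if set, otherwise the generic image is used. This is an illustrative Python sketch, not the Scala submission code; `spark.kubernetes.container.image` appears in the diff, while the two container-specific key names and the image values are assumptions.

```python
# The generic key comes from the diff; the container-specific key names below
# are assumptions made for illustration.
DEFAULT_IMAGE_KEY = "spark.kubernetes.container.image"
DRIVER_IMAGE_KEY = "spark.kubernetes.driver.container.image"
INIT_IMAGE_KEY = "spark.kubernetes.initContainer.image"


def resolve_image(conf, specific_key):
    """Prefer the container-specific image, fall back to the default one."""
    image = conf.get(specific_key) or conf.get(DEFAULT_IMAGE_KEY)
    if image is None:
        raise ValueError("Set %s or %s" % (specific_key, DEFAULT_IMAGE_KEY))
    return image


# Mirrors the shape of the test: a default image plus an init-container
# override, with no driver-specific image set.
conf = {
    DEFAULT_IMAGE_KEY: "driver-image",
    INIT_IMAGE_KEY: "init-image",
}
```

Here `resolve_image(conf, DRIVER_IMAGE_KEY)` falls back to the default, while `resolve_image(conf, INIT_IMAGE_KEY)` picks up the override, so the two containers end up with different images.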


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160504833
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -56,14 +56,13 @@ be run in a container runtime environment that 
Kubernetes supports. Docker is a
 frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles 
provided in the runnable distribution that can be customized
 and built for your usage.
--- End diff --

Separate change. I don't even know what you'd write there. The whole 
"custom image" thing needs to be properly specified first - what exactly is the 
contract between the submission code and the images, for example.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20204
  
**[Test build #85855 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85855/testReport)**
 for PR 20204 at commit 
[`a3179d7`](https://github.com/apache/spark/commit/a3179d71da64b90b9dd1a2ac8feb9cc2c18572f5).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...

2018-01-09 Thread shivaram
Github user shivaram commented on the issue:

https://github.com/apache/spark/pull/19290
  
The minimum supported R version is something that we can revisit, though. I 
think we do this for the Python and Java versions in the project as well.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160463657
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with 
Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = 
super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = 
super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
--- End diff --

change to `e.g. "pmml", "internal", or the fully qualified class name for 
export)."`


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160483562
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -1044,6 +1056,50 @@ class LinearRegressionSuite extends MLTest with 
DefaultReadWriteTest {
   LinearRegressionSuite.allParamSettings, checkModelData)
   }
 
+  test("pmml export") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+def checkModel(pmml: PMML): Unit = {
+  val dd = pmml.getDataDictionary
+  assert(dd.getNumberOfFields === 3)
+  val fields = dd.getDataFields.asScala
+  assert(fields(0).getName().toString === "field_0")
+  assert(fields(0).getOpType() == OpType.CONTINUOUS)
+  val pmmlRegressionModel = 
pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
+  val pmmlPredictors = 
pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
+  val pmmlWeights = 
pmmlPredictors.asScala.map(_.getCoefficient()).toList
+  assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3)
+  assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3)
+}
+testPMMLWrite(sc, model, checkModel)
+  }
+
+  test("unsupported export format") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+intercept[SparkException] {
+  model.write.format("boop").save("boop")
+}
+intercept[SparkException] {
+  model.write.format("com.holdenkarau.boop").save("boop")
+}
+withClue("ML source org.apache.spark.SparkContext is not a valid 
MLWriterFormat") {
+  intercept[SparkException] {
+model.write.format("org.apache.spark.SparkContext").save("boop2")
+  }
+}
+  }
+
+  test("dummy export format is called") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+withClue("Dummy writer doesn't write") {
+  intercept[Exception] {
--- End diff --

This just catches any exception. Can we do something like:

```scala
val thrown = intercept[Exception] {
  model.write.format("org.apache.spark.ml.regression.DummyLinearRegressionWriter").save("")
}
assert(thrown.getMessage.contains("Dummy writer doesn't write."))
```


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160461560
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }
 
+/**
+ * ML export formats for should implement this trait so that users can 
specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call 
is made.
--- End diff --

Was this supposed to be retained from the `DataSourceRegister`?


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160506592
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala 
---
@@ -1044,6 +1056,50 @@ class LinearRegressionSuite extends MLTest with 
DefaultReadWriteTest {
   LinearRegressionSuite.allParamSettings, checkModelData)
   }
 
+  test("pmml export") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+def checkModel(pmml: PMML): Unit = {
+  val dd = pmml.getDataDictionary
+  assert(dd.getNumberOfFields === 3)
+  val fields = dd.getDataFields.asScala
+  assert(fields(0).getName().toString === "field_0")
+  assert(fields(0).getOpType() == OpType.CONTINUOUS)
+  val pmmlRegressionModel = 
pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
+  val pmmlPredictors = 
pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
+  val pmmlWeights = 
pmmlPredictors.asScala.map(_.getCoefficient()).toList
+  assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3)
+  assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3)
+}
+testPMMLWrite(sc, model, checkModel)
+  }
+
+  test("unsupported export format") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+intercept[SparkException] {
+  model.write.format("boop").save("boop")
+}
+intercept[SparkException] {
+  model.write.format("com.holdenkarau.boop").save("boop")
+}
+withClue("ML source org.apache.spark.SparkContext is not a valid 
MLWriterFormat") {
+  intercept[SparkException] {
+model.write.format("org.apache.spark.SparkContext").save("boop2")
+  }
+}
+  }
+
+  test("dummy export format is called") {
--- End diff --

We can also add tests for the `MLFormatRegister` similar to 
`DDLSourceLoadSuite`. Just add a `META-INF/services/` directory to 
`src/test/resources/`
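
The lookup such a test would exercise can be sketched as follows: try the registered short names first, reject ambiguous registrations, then fall back to loading a fully qualified class name and checking its type. This is a Python illustration of the general pattern only; `MLWriterFormat` here is a hypothetical base class, and `resolve_writer`, `REGISTERED`, and the two writer classes are made-up names rather than Spark APIs.

```python
import importlib


class MLWriterFormat(object):
    """Hypothetical base class standing in for a writer-format interface."""

    def write(self, path, model):
        raise NotImplementedError


class PMMLWriter(MLWriterFormat):
    @staticmethod
    def short_name():
        return "pmml"

    def write(self, path, model):
        return "pmml:%s" % path


class InternalWriter(MLWriterFormat):
    @staticmethod
    def short_name():
        return "internal"

    def write(self, path, model):
        return "internal:%s" % path


# Stand-in for what a service-loader style registry would discover.
REGISTERED = [PMMLWriter, InternalWriter]


def resolve_writer(source, registered=REGISTERED):
    """Resolve a short name or dotted class name to a writer instance."""
    matches = [c for c in registered
               if c.short_name().lower() == source.lower()]
    if len(matches) > 1:
        raise ValueError("Multiple writers registered for '%s'" % source)
    if len(matches) == 1:
        return matches[0]()
    # Not a short name: treat it as a fully qualified class name.
    module_name, _, class_name = source.rpartition(".")
    if not module_name:
        raise ValueError("Could not find a writer for format '%s'" % source)
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
    except (ImportError, AttributeError):
        raise ValueError("Could not load class '%s'" % source)
    if not (isinstance(cls, type) and issubclass(cls, MLWriterFormat)):
        raise ValueError("'%s' is not a valid MLWriterFormat" % source)
    return cls()
```

The three failure branches correspond to the three cases in the "unsupported export format" test above: an unknown short name, an unloadable class name, and a loadable class of the wrong type.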


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85855/
Test PASSed.


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160496808
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
+   * the fully qualified class name for export).
*/
-  @Since("1.6.0")
-  def overwrite(): this.type = {
-shouldOverwrite = true
+  @Since("2.3.0")
+  def format(source: String): this.type = {
+this.source = source
 this
   }
 
+  /**
+   * Dispatches the save to the correct MLFormat.
+   */
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
+  @throws[SparkException]("If multiple sources for a given short name format are found.")
+  override protected def saveImpl(path: String) = {
--- End diff --

Please add an explicit return type here (`: Unit`).


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160462794
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }
 
+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLFormatRegister {
+  /**
+   * The string that represents the format that this data source provider uses. This is
+   * overridden by children to provide a nice alias for the data source. For example:
+   *
+   * {{{
+   *   override def shortName(): String =
+   *   "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
--- End diff --

What about adding a second abstract method, `def stageName(): String`, 
instead of packing the format and the stage name into one string?
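
Roughly what I have in mind, as a standalone sketch. The trait and class names below are made up for illustration and are not the code in this PR; the combined key is derived in one place instead of being hand-built by every implementation:

```scala
// Sketch only: a stand-in for the trait under review, not the actual Spark code.
trait FormatRegisterSketch {
  /** Short alias of the export format, e.g. "pmml" or "internal". */
  def format(): String

  /** Fully qualified class name of the stage this writer can save. */
  def stageName(): String

  /** Combined lookup key, built from the two parts above. */
  final def shortName(): String = s"${format()}+${stageName()}"
}

object FormatRegisterSketchDemo {
  // Hypothetical implementation used only to exercise the sketch.
  class PmmlLinearRegressionRegister extends FormatRegisterSketch {
    override def format(): String = "pmml"
    override def stageName(): String = "org.apache.spark.ml.regression.LinearRegressionModel"
  }

  def main(args: Array[String]): Unit = {
    val register = new PmmlLinearRegressionRegister
    // Prints the same combined key that is packed into one string today.
    println(register.shortName())
  }
}
```

The lookup in `saveImpl` could keep matching on the combined key, so callers would not need to change.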


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160502536
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }
 
+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLFormatRegister {
+  /**
+   * The string that represents the format that this data source provider uses. This is
+   * overridden by children to provide a nice alias for the data source. For example:
+   *
+   * {{{
+   *   override def shortName(): String =
+   *   "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
+   * }}}
+   * Indicates that this format is capable of saving Spark's own LinearRegressionModel in pmml.
+   *
+   * Format discovery is done using a ServiceLoader so make sure to list your format in
+   * META-INF/services.
+   * @since 2.3.0
+   */
+  def shortName(): String
+}
+
+/**
+ * Implemented by objects that provide ML exportability.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLWriterFormat {
+  /**
+   * Function write the provided pipeline stage out.
--- End diff --

Should add a full doc here with param annotations. Also should it be 
"Function to write ..."?
  


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160501723
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
+   * the fully qualified class name for export).
*/
-  @Since("1.6.0")
-  def overwrite(): this.type = {
-shouldOverwrite = true
+  @Since("2.3.0")
+  def format(source: String): this.type = {
+this.source = source
 this
   }
 
+  /**
+   * Dispatches the save to the correct MLFormat.
+   */
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
+  @throws[SparkException]("If multiple sources for a given short name format are found.")
+  override protected def saveImpl(path: String) = {
+val loader = Utils.getContextOrSparkClassLoader
+val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
+val stageName = stage.getClass.getName
+val targetName = s"${source}+${stageName}"
+val formats = serviceLoader.asScala.toList
+val shortNames = formats.map(_.shortName())
+val writerCls = formats.filter(_.shortName().equalsIgnoreCase(targetName)) match {
+  // requested name did not match any given registered alias
+  case Nil =>
+Try(loader.loadClass(source)) match {
+  case Success(writer) =>
+// Found the ML writer using the fully qualified path
+writer
+  case Failure(error) =>
+throw new SparkException(
+  s"Could not load requested format $source for $stageName ($targetName) had $formats" +
+  s"supporting $shortNames", error)
+}
+  case head :: Nil =>
+head.getClass
+  case _ =>
+// Multiple sources
+throw new SparkException(
+  s"Multiple writers found for $source+$stageName, try using the class name of the writer")
+}
+if (classOf[MLWriterFormat].isAssignableFrom(writerCls)) {
+  val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat]
--- End diff --

This will fail, non-intuitively, if anyone ever extends `MLWriterFormat` 
with a constructor that has more than zero arguments. Meaning:

```scala
class DummyLinearRegressionWriter(someParam: Int) extends MLWriterFormat
```

will raise 
`java.lang.NoSuchMethodException: org.apache.spark.ml.regression.DummyLinearRegressionWriter.<init>()`
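
One way to make that failure explicit, as a rough standalone sketch. The trait and classes are stand-ins with made-up names, and it uses `getDeclaredConstructor().newInstance()` rather than the `Class.newInstance()` call in the diff, so treat it as an illustration rather than a drop-in patch:

```scala
import scala.util.{Failure, Success, Try}

object WriterInstantiationSketch {
  // Stand-ins for the real trait and writers; names are illustrative only.
  trait WriterFormat
  class NoArgWriter extends WriterFormat
  class ParamWriter(someParam: Int) extends WriterFormat

  /** Instantiates `cls` reflectively and reports a missing zero-arg constructor clearly. */
  def instantiate(cls: Class[_]): Either[String, WriterFormat] =
    Try(cls.getDeclaredConstructor().newInstance()) match {
      case Success(writer: WriterFormat) => Right(writer)
      case Success(_) => Left(s"${cls.getName} is not a WriterFormat")
      case Failure(_: NoSuchMethodException) =>
        Left(s"${cls.getName} must have a zero-argument constructor")
      case Failure(e) => Left(s"Could not instantiate ${cls.getName}: $e")
    }

  def main(args: Array[String]): Unit = {
    // The zero-arg writer is created; the one-arg writer yields a readable error.
    println(instantiate(classOf[NoArgWriter]).isRight)
    println(instantiate(classOf[ParamWriter]))
  }
}
```

Either documenting the zero-argument constructor requirement on `MLWriterFormat` or raising an error along these lines would make the failure less surprising.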
  


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160463225
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }
 
+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLFormatRegister {
+  /**
+   * The string that represents the format that this data source provider uses. This is
+   * overridden by children to provide a nice alias for the data source. For example:
--- End diff --

"data source" -> "model format"?


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160503322
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
--- End diff --

need `@Since("2.3.0")` here?


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160471845
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -710,15 +711,57 @@ class LinearRegressionModel private[ml] (
   }
 
   /**
-   * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance.
+   * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance.
*
* For [[LinearRegressionModel]], this does NOT currently save the training [[summary]].
* An option to save [[summary]] may be added in the future.
*
* This also does not save the [[parent]] currently.
*/
   @Since("1.6.0")
-  override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this)
+  override def write: GeneralMLWriter = new GeneralMLWriter(this)
+}
+
+/** A writer for LinearRegression that handles the "internal" (or default) format */
+private class InternalLinearRegressionModelWriter()
+  extends MLWriterFormat with MLFormatRegister {
+
+  override def shortName(): String =
+"internal+org.apache.spark.ml.regression.LinearRegressionModel"
+
+  private case class Data(intercept: Double, coefficients: Vector, scale: Double)
+
+  override def write(path: String, sparkSession: SparkSession,
+optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = {
+val instance = stage.asInstanceOf[LinearRegressionModel]
+val sc = sparkSession.sparkContext
+// Save metadata and Params
+DefaultParamsWriter.saveMetadata(instance, path, sc)
+// Save model data: intercept, coefficients, scale
+val data = Data(instance.intercept, instance.coefficients, instance.scale)
+val dataPath = new Path(path, "data").toString
+sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
+  }
+}
+
+/** A writer for LinearRegression that handles the "pmml" format */
+private class PMMLLinearRegressionModelWriter()
--- End diff --

I could be wrong, but I think we prefer just omitting the `()`?


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160503640
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }
 
+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLFormatRegister {
+  /**
+   * The string that represents the format that this data source provider uses. This is
+   * overridden by children to provide a nice alias for the data source. For example:
+   *
+   * {{{
+   *   override def shortName(): String =
+   *   "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
+   * }}}
+   * Indicates that this format is capable of saving Spark's own LinearRegressionModel in pmml.
+   *
+   * Format discovery is done using a ServiceLoader so make sure to list your format in
+   * META-INF/services.
+   * @since 2.3.0
+   */
+  def shortName(): String
+}
+
+/**
+ * Implemented by objects that provide ML exportability.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLWriterFormat {
--- End diff --

do we need the actual since annotations here, though?


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160503466
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
--- End diff --

These overrides need `@Since` tags.


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160484001
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -1044,6 +1056,50 @@ class LinearRegressionSuite extends MLTest with DefaultReadWriteTest {
   LinearRegressionSuite.allParamSettings, checkModelData)
   }
 
+  test("pmml export") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+def checkModel(pmml: PMML): Unit = {
+  val dd = pmml.getDataDictionary
+  assert(dd.getNumberOfFields === 3)
+  val fields = dd.getDataFields.asScala
+  assert(fields(0).getName().toString === "field_0")
+  assert(fields(0).getOpType() == OpType.CONTINUOUS)
+  val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
+  val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
+  val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList
+  assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3)
+  assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3)
+}
+testPMMLWrite(sc, model, checkModel)
+  }
+
+  test("unsupported export format") {
+val lr = new LinearRegression()
+val model = lr.fit(datasetWithWeight)
+intercept[SparkException] {
--- End diff --

Doesn't this and the one below it test the same thing? I think we could 
remove the first one.


---




[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...

2018-01-09 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/19876#discussion_r160461644
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
 this
   }
 
+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
+   * the fully qualified class name for export).
*/
-  @Since("1.6.0")
-  def overwrite(): this.type = {
-shouldOverwrite = true
+  @Since("2.3.0")
+  def format(source: String): this.type = {
+this.source = source
 this
   }
 
+  /**
+   * Dispatches the save to the correct MLFormat.
+   */
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
+  @throws[SparkException]("If multiple sources for a given short name format are found.")
+  override protected def saveImpl(path: String) = {
+val loader = Utils.getContextOrSparkClassLoader
+val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader)
+val stageName = stage.getClass.getName
+val targetName = s"${source}+${stageName}"
--- End diff --

The braces aren't needed here.
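
A quick standalone check of the point about `s"..."` interpolation, with placeholder values standing in for `source` and `stageName`:

```scala
object InterpolationSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder values; in the PR these come from the writer and the stage.
    val source = "pmml"
    val stageName = "org.apache.spark.ml.regression.LinearRegressionModel"

    // Braces are only required for expressions, not for plain identifiers.
    val withBraces = s"${source}+${stageName}"
    val withoutBraces = s"$source+$stageName"

    assert(withBraces == withoutBraces)
    println(withoutBraces)
  }
}
```

The `+` ends the identifier, so `$source+$stageName` parses as intended.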


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20192
  
> users with custom docker images can override the classpath by

I wrote this in a comment above, but there needs to be a proper definition 
of how to customize these docker images. There needs to be a contract between 
the submission code, the entry point, and how stuff is laid out inside the 
image, and I don't see that specified anywhere.

However that's done, I also would suggest that env variables be avoided.


---




[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20189
  
**[Test build #85861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85861/testReport)** for PR 20189 at commit [`b7dc922`](https://github.com/apache/spark/commit/b7dc92235f434ee0630bdb6af918ee8e58d2fa2b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20189
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20189
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85861/
Test PASSed.


---




[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...

2018-01-09 Thread RussellSpitzer
Github user RussellSpitzer commented on the issue:

https://github.com/apache/spark/pull/20201
  
This looks very exciting to me


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20192
  
**[Test build #85870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85870/testReport)** for PR 20192 at commit [`e771ed9`](https://github.com/apache/spark/commit/e771ed9271f158e3234a3dfa889b49310c690d10).


---




[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2018-01-09 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20097
  
Yes. My bad. I didn't realize the branch had already been cut.


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/20192
  
@vanzin, do you have some time to modify the integration tests as well? The 
change LGTM, but a clean run on minikube would give us a lot more confidence. 
Until the integration tests get checked in to this repo and running in PRB 
(@ssuchter is working on this), we think that the best way to keep them in sync 
is to ensure that PRs get a manual clean run against the suite.


---




[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20203
  
**[Test build #85872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85872/testReport)** for PR 20203 at commit [`d8c214b`](https://github.com/apache/spark/commit/d8c214b33f4b014f5a2c0644074f9b7668364799).


---




[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20209
  
**[Test build #85871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85871/testReport)** for PR 20209 at commit [`2c10416`](https://github.com/apache/spark/commit/2c10416a8a06f4c574dc662a1d7bb7dbcdd36a37).


---




[GitHub] spark pull request #20210: [SPARK-23009][PYTHON] Fix for non-str col names t...

2018-01-09 Thread BryanCutler
GitHub user BryanCutler opened a pull request:

https://github.com/apache/spark/pull/20210

[SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas

## What changes were proposed in this pull request?

This fixes the case of calling `SparkSession.createDataFrame` with a Pandas 
DataFrame that has non-str column labels.

## How was this patch tested?

Added a new test with a Pandas DataFrame that has int column labels


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/BryanCutler/spark python-createDataFrame-int-col-error-SPARK-23009

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20210.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20210


commit e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e
Author: Bryan Cutler 
Date:   2018-01-09T19:57:29Z

fixed col name encoding to allow for unicode and non-str, added test




---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread BryanCutler
Github user BryanCutler commented on the issue:

https://github.com/apache/spark/pull/20210
  
Just came across this issue, ping @HyukjinKwon @ueshin 


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160511180
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a
 frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized
 and built for your usage.
--- End diff --

I agree that we don't have a solid story around customizing images here. 
But I do think that we need something clearly telling people that we do support 
using custom images if they want to, and which properties they should use to 
configure them. It just doesn't need to be opinionated on things like 
the contract you mentioned.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20210
  
**[Test build #85873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85873/testReport)** for PR 20210 at commit [`e2b1a41`](https://github.com/apache/spark/commit/e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e).


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160512231
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a
 frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized
 and built for your usage.
--- End diff --

Still, that sounds like something that should be added in a separate 
change. I'm not changing the customizability of images in this change.

And not having a contract means people will have no idea of how to 
customize images, so you can't even write proper documentation for that.


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/20192
  
LGTM.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20210
  
**[Test build #85874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85874/testReport)** for PR 20210 at commit [`e2e6025`](https://github.com/apache/spark/commit/e2e60251e9be8fd0894e030d4a8b28a549ce777f).


---




[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...

2018-01-09 Thread liyinan926
Github user liyinan926 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20192#discussion_r160512758
  
--- Diff: docs/running-on-kubernetes.md ---
@@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a
 frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized
 and built for your usage.
--- End diff --

OK, I am fine with adding this in a future change.


---




[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2018-01-09 Thread tdas
Github user tdas commented on the issue:

https://github.com/apache/spark/pull/20097
  
Done. 

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=44763d93c0d923977c114d63586abfc1b68ad7fc


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20204
  
**[Test build #85858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85858/testReport)** for PR 20204 at commit [`3c3c3cb`](https://github.com/apache/spark/commit/3c3c3cba721def78117561d865291931c2d5acd3).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85858/
Test PASSed.


---




[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20204
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...

2018-01-09 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/20097
  
Thank you, @tdas !


---




[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...

2018-01-09 Thread tdas
Github user tdas commented on a diff in the pull request:

https://github.com/apache/spark/pull/20096#discussion_r160516177
  
--- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala ---
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.kafka010
+
+import java.util.Properties
+import java.util.concurrent.atomic.AtomicInteger
+
+import org.scalatest.time.SpanSugar._
+import scala.collection.mutable
+import scala.util.Random
+
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row}
+import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation
+import org.apache.spark.sql.execution.streaming.StreamExecution
+import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution
+import org.apache.spark.sql.streaming.{StreamTest, Trigger}
+import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession}
+
+trait KafkaContinuousTest extends KafkaSourceTest {
+  override val defaultTrigger = Trigger.Continuous(1000)
+  override val defaultUseV2Sink = true
+
+  // We need more than the default local[2] to be able to schedule all partitions simultaneously.
+  override protected def createSparkSession = new TestSparkSession(
+    new SparkContext(
+      "local[10]",
+      "continuous-stream-test-sql-context",
+      sparkConf.set("spark.sql.testkey", "true")))
+
+  override protected def setTopicPartitions(
--- End diff --

Add a comment on what this method does. It asserts something, so it does 
not look like it only "sets" something.


---




[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20209
  
**[Test build #85871 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85871/testReport)**
 for PR 20209 at commit 
[`2c10416`](https://github.com/apache/spark/commit/2c10416a8a06f4c574dc662a1d7bb7dbcdd36a37).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85871/
Test PASSed.


---




[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20209
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20210
  
**[Test build #85873 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85873/testReport)**
 for PR 20210 at commit 
[`e2b1a41`](https://github.com/apache/spark/commit/e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20210
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85873/
Test FAILed.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20210
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20192
  
> do you have some time to modify the integration tests as well

I can try to look, but really you guys should be putting that code into the 
Spark repo. I don't see a task under SPARK-18278 for adding the integration 
tests.


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20023
  
**[Test build #85864 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85864/testReport)**
 for PR 20023 at commit 
[`20616fd`](https://github.com/apache/spark/commit/20616fdfc1a75ea9ae0ec531ce72d8c722facb31).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20023
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20023
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85864/
Test FAILed.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20210
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20210
  
**[Test build #85874 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85874/testReport)**
 for PR 20210 at commit 
[`e2e6025`](https://github.com/apache/spark/commit/e2e60251e9be8fd0894e030d4a8b28a549ce777f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20210
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85874/
Test FAILed.


---




[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...

2018-01-09 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/20192
  
Thanks @vanzin. I was waiting on spark-dev [thread on integration 
testing](http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html)
 to conclude. It does look like checking the tests in is something we should do 
- adding a task tracking it. We're also stabilizing the testing atm - so, I'm 
thinking we'll target that for post-2.3. Would be great to get an architecture 
review from the Spark community on it, as it exists today, to get some feedback 
going.


---




[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...

2018-01-09 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/20151
  
So I think this could be the basis for solving a lot of related problems, 
and I like the minimally invasive approach. I think the error message for 
setting it to a bad module, as opposed to a nonexistent module, is probably going 
to be very confusing. It would be good to make it clear that this is 
an advanced setting we don't expect most users to modify directly.
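
To picture the distinction drawn above, here is a hypothetical sketch (not code from this PR; `load_daemon_module` and the `main` entry point are made-up names used only for illustration). A nonexistent module fails at import time, whereas a "bad" module imports cleanly and would only fail later unless it is validated up front:

```python
import importlib


def load_daemon_module(name, entry_point="main"):
    """Import `name` and check that it exposes a callable `entry_point`.

    Hypothetical helper: neither this function nor the `main` default
    comes from the pull request.
    """
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        # Case 1: the configured module does not exist or cannot be imported.
        raise ValueError("cannot import configured module %r: %s" % (name, exc))
    if not callable(getattr(module, entry_point, None)):
        # Case 2: the module imports fine but is not a compatible module.
        raise ValueError(
            "module %r was imported but has no callable %r; this is an "
            "advanced setting and must name a compatible module"
            % (name, entry_point))
    return module
```

With a check along these lines, both failure modes would surface at configuration time with distinct messages, rather than as an unrelated error once the module is actually used.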


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85875 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85875/testReport)**
 for PR 20096 at commit 
[`f825155`](https://github.com/apache/spark/commit/f8251552398f980768b23059c1bbbd028cfee859).


---




[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark

2018-01-09 Thread holdenk
Github user holdenk commented on the issue:

https://github.com/apache/spark/pull/13599
  
So for what it's worth, I made a quick POC of supporting similar 
functionality without requiring any changes to Spark itself 
(https://github.com/nteract/coffee_boat), which should also work in standalone 
mode. I'm going to poke at it a bit more and explore whether pex might be better 
than conda (although given the packages most folks want, conda seems better).


---




[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...

2018-01-09 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/13599#discussion_r160521328
  
--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,35 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf
 
+    def install_packages(self, packages, install_driver=True):
+        """
+        install python packages on all executors and driver through pip. pip will be installed
+        by default no matter using native virtualenv or conda. So it is guaranteed that pip is
+        available if virtualenv is enabled.
+        :param packages: string for single package or a list of string for multiple packages
+        :param install_driver: whether to install packages in client
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only use called when "
+                               "spark.pyspark.virtualenv.enabled set as true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # seems statusTracker.getExecutorInfos() will return driver + exeuctors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            pip.main(["install"] + packages)
+
+        # run it in the main thread. Will do it in a separated thread after
+        # https://github.com/pypa/pip/issues/2553 is fixed
+        if install_driver:
+            _run_pip(packages, None)
+
+        import functools
+        dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

This approach is not robust to executor failure/restart, dynamic 
allocation, and other possible changes. I'm not comfortable merging something 
which depends on this.
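
To illustrate the concern, here is a minimal Spark-free simulation (a sketch under stated assumptions, not code from the PR; every name in it is hypothetical). It assumes the best case where the one-off job reaches every executor alive at that moment, which the scheduler does not actually promise, and shows that an executor arriving later still misses the install:

```python
# Hypothetical simulation: each executor is modelled as a set of installed packages.
def install_via_one_off_job(executors, packages):
    """Mimic running pip once on every executor alive when the job runs."""
    for installed in executors.values():
        installed.update(packages)


# Two executors are alive when the install job is submitted.
executors = {"exec-1": set(), "exec-2": set()}
install_via_one_off_job(executors, ["numpy"])

# Dynamic allocation, or a restart after a failure, later brings up a fresh executor.
executors["exec-3"] = set()

# Any task scheduled on the new executor would not find the package.
missing = sorted(name for name, pkgs in executors.items() if "numpy" not in pkgs)
print(missing)  # ['exec-3']
```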


---




[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20096
  
**[Test build #85876 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85876/testReport)**
 for PR 20096 at commit 
[`9101ea6`](https://github.com/apache/spark/commit/9101ea6ef5dfd77eb0dcf3aee622b2d7a145323f).


---




[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...

2018-01-09 Thread sethah
Github user sethah commented on the issue:

https://github.com/apache/spark/pull/20188
  
Good call @felixcheung! Will update shortly.


---




[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...

2018-01-09 Thread icexelloss
GitHub user icexelloss opened a pull request:

https://github.com/apache/spark/pull/20211

[SPARK-23011][PYTHON][SQL] Prepend missing grouping key in groupby apply

## What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-23011

## How was this patch tested?

Add more tests in `test_complex_groupby`

## TODO:
- [ ] Document the usage in groupby apply


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/icexelloss/spark SPARK-23011-groupby-apply-group-key

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20211.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20211


commit 51ce6e85953bd39e901fec24dfca45b86f55f939
Author: Li Jin 
Date:   2018-01-02T18:45:34Z

wip

commit 07f921139e250bd62e79da8475d8d615045d636a
Author: Li Jin 
Date:   2018-01-09T20:08:15Z

Test working; Need to add docs

commit f2822b529293e37f63a4a190b25dbdd018e36ba6
Author: Li Jin 
Date:   2018-01-09T20:55:03Z

Add simple doc




---




[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping key ...

2018-01-09 Thread icexelloss
Github user icexelloss commented on the issue:

https://github.com/apache/spark/pull/20211
  
cc @HyukjinKwon @ueshin @cloud-fan @viirya 


---




[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20211
  
**[Test build #85877 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85877/testReport)**
 for PR 20211 at commit 
[`f2822b5`](https://github.com/apache/spark/commit/f2822b529293e37f63a4a190b25dbdd018e36ba6).


---




[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20168
  
**[Test build #85878 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85878/testReport)**
 for PR 20168 at commit 
[`eee25ce`](https://github.com/apache/spark/commit/eee25ceffde2c1d6ca248eceb17a559e2f921cc6).


---




[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20211
  
**[Test build #85877 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85877/testReport)**
 for PR 20211 at commit 
[`f2822b5`](https://github.com/apache/spark/commit/f2822b529293e37f63a4a190b25dbdd018e36ba6).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20211
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85877/
Test FAILed.


---




[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...

2018-01-09 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20168
  
**[Test build #85878 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85878/testReport)**
 for PR 20168 at commit 
[`eee25ce`](https://github.com/apache/spark/commit/eee25ceffde2c1d6ca248eceb17a559e2f921cc6).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20168
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85878/
Test FAILed.


---




[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20211
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...

2018-01-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20168
  
Merged build finished. Test FAILed.


---




[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...

2018-01-09 Thread icexelloss
Github user icexelloss commented on a diff in the pull request:

https://github.com/apache/spark/pull/20211#discussion_r160524679
  
--- Diff: python/pyspark/sql/group.py ---
@@ -233,6 +233,27 @@ def apply(self, udf):
 |  2| 1.1094003924504583|
 +---+---+
 
+Notes on grouping column:
--- End diff --

This explains the general idea. I plan to improve the doc if people think 
this change is good.


---



