[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85866/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20209 **[Test build #85866 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85866/testReport)** for PR 20209 at commit [`f6215fc`](https://github.com/apache/spark/commit/f6215fc45901456dea8a4fb32f7c87907bb2fbfb).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class OneHotEncoderEstimator(JavaEstimator, HasInputCols, HasOutputCols, HasHandleInvalid,`
  * `class OneHotEncoderModel(JavaModel, JavaMLReadable, JavaMLWritable):`
  * `class HasOutputCols(Params):`
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Merged build finished. Test FAILed.
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user tomasatdatabricks commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r160496086

--- Diff: python/pyspark/ml/image.py ---

```diff
@@ -71,9 +88,30 @@ def ocvTypes(self):
         """
         if self._ocvTypes is None:
-            ctx = SparkContext._active_spark_context
-            self._ocvTypes = dict(ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes())
-        return self._ocvTypes
+            ctx = SparkContext.getOrCreate()
+            ocvTypeList = ctx._jvm.org.apache.spark.ml.image.ImageSchema.javaOcvTypes()
+            self._ocvTypes = [self._OcvType(name=x.name(),
+                                            mode=x.mode(),
+                                            nChannels=x.nChannels(),
+                                            dataType=x.dataType(),
+                                            nptype=self._ocvToNumpyMap[x.dataType()])
+                              for x in ocvTypeList]
+        return self._ocvTypes[:]
+
+    def ocvTypeByName(self, name):
+        if self._ocvTypesByName is None:
+            self._ocvTypesByName = {x.name: x for x in self.ocvTypes}
+        if name not in self._ocvTypesByName:
+            raise ValueError(
+                "Can not find matching OpenCvFormat for type = '%s'; supported formats are = %s" %
+                (name, str(self._ocvTypesByName.keys())))
+        return self._ocvTypesByName[name]
+
+    def ocvTypeByMode(self, mode):
```

--- End diff --

It is not consistent because Python cannot overload methods based on type, but I can rename the Scala side to match Python. It does not make a big difference in this case.
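Since Python cannot overload a method on the argument's type the way Scala can, the usual workaround is a single method that dispatches on the runtime type. A minimal standalone sketch of that pattern (the class and method names here are illustrative, not the pyspark API):

```python
class OcvTypeRegistry:
    """Hypothetical registry illustrating type-based dispatch in Python,
    where Scala would expose two overloads (byName / byMode)."""

    def __init__(self, types):
        # types: iterable of (name, mode) pairs; materialize so we can
        # build both lookup tables from the same data.
        types = list(types)
        self._by_name = {name: (name, mode) for name, mode in types}
        self._by_mode = {mode: (name, mode) for name, mode in types}

    def ocv_type(self, key):
        # Dispatch on the runtime type of `key`: str -> name, int -> mode.
        if isinstance(key, str):
            table = self._by_name
        elif isinstance(key, int):
            table = self._by_mode
        else:
            raise TypeError("expected str name or int mode, got %r" % type(key))
        if key not in table:
            raise ValueError("no OpenCV type for %r" % (key,))
        return table[key]
```

With this shape, `reg.ocv_type("CV_8UC1")` and `reg.ocv_type(0)` can resolve to the same entry, which is roughly what merging `ocvTypeByName`/`ocvTypeByMode` into one Python method would look like.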
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user shaneknapp commented on the issue: https://github.com/apache/spark/pull/19290 ok sounds good -- we'll keep things 'old' for now.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #85856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85856/testReport)** for PR 20204 at commit [`9f2c400`](https://github.com/apache/spark/commit/9f2c400eceb771e88f6f4c4909e4a5e67414e3c3).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test FAILed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85856/ Test FAILed.
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/18991 I reopen it to re-test the master branch with this option before Apache Spark 2.3.
[GitHub] spark pull request #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-d...
GitHub user dongjoon-hyun reopened a pull request: https://github.com/apache/spark/pull/18991 [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by default

## What changes were proposed in this pull request?

ORC filter push-down has been disabled by default from the beginning ([SPARK-2883](https://github.com/apache/spark/commit/aa31e431fc09f0477f1c2351c6275769a31aca90#diff-41ef65b9ef5b518f77e2a03559893f4dR149)). Now, Apache Spark depends on Apache ORC 1.4.0. For Apache Spark 2.3, this PR turns on ORC filter push-down by default, like Parquet ([SPARK-9207](https://issues.apache.org/jira/browse/SPARK-9207)), as a part of [SPARK-20901](https://issues.apache.org/jira/browse/SPARK-20901), "Feature parity for ORC with Parquet".

## How was this patch tested?

Pass the Jenkins with the existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-21783

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18991.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #18991

commit 2bc2b17aba5231c6ac3e0ab7c830acc56790df9f
Author: Dongjoon Hyun
Date: 2017-08-18T07:26:18Z

    [SPARK-21783][SQL] Turn on ORC filter push-down by default
[GitHub] spark issue #18991: [SPARK-21783][SQL][WIP] Turn on ORC filter push-down by ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/18991 **[Test build #85868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85868/testReport)** for PR 18991 at commit [`2bc2b17`](https://github.com/apache/spark/commit/2bc2b17aba5231c6ac3e0ab7c830acc56790df9f).
[GitHub] spark issue #20013: [SPARK-20657][core] Speed up rendering of the stages pag...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20013 **[Test build #85867 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85867/testReport)** for PR 20013 at commit [`86275b0`](https://github.com/apache/spark/commit/86275b068e08b36e3285d5ab8e77484884f39c1c).
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20203 ok to test
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20203 **[Test build #85869 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85869/testReport)** for PR 20203 at commit [`8d736c1`](https://github.com/apache/spark/commit/8d736c1cd56e341d4d7da88bae01ac3a47649f80).
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20203 Merged build finished. Test FAILed.
[GitHub] spark pull request #19893: [SPARK-16139][TEST] Add logging functionality for...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/19893#discussion_r160501177

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---

```diff
@@ -17,4 +17,22 @@
 package org.apache.spark.sql.test

-trait SharedSQLContext extends SQLTestUtils with SharedSparkSession
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
+
+  /**
+   * Auto thread audit is turned off here intentionally and done manually.
```

--- End diff --

I'm not sure I understand your explanation, and I definitely don't understand what's going on from the comment in the code. What I'm asking is for the comment here to explain not what the code is doing, but *why* it's doing it. Basically, if instead of the code you have here, you just called `super.beforeAll` and `super.afterAll`, without disabling `enableAutoThreadAudit`, what will break and why? That's what the comment should explain.
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20203 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85869/ Test FAILed.
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20203 **[Test build #85869 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85869/testReport)** for PR 20203 at commit [`8d736c1`](https://github.com/apache/spark/commit/8d736c1cd56e341d4d7da88bae01ac3a47649f80).
* This patch **fails RAT tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `case class SparkListenerExecutorBlacklistedForStage(`
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20097 Hi, @tdas . Could you merge this to `branch-2.3`, too?
[GitHub] spark pull request #20168: [SPARK-22730][ML] Add ImageSchema support for non...
Github user MrBago commented on a diff in the pull request: https://github.com/apache/spark/pull/20168#discussion_r160502167

--- Diff: python/pyspark/ml/image.py --- (same `ocvTypes`/`ocvTypeByName`/`ocvTypeByMode` hunk as quoted in the earlier comment on this file) --- End diff --

I think we could make either API work for both languages, but it's a bit unnatural. There's a tradeoff between doing the most natural and appropriate thing in each language and having matching APIs; Spark has chosen to prefer making the APIs match, so let's do our best to do that.
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160502356

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/executor/Dockerfile ---

```diff
@@ -1,35 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-#    http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-
-FROM spark-base
-
-# Before building the docker image, first build and make a Spark distribution following
-# the instructions in http://spark.apache.org/docs/latest/building-spark.html.
-# If this docker file is being used in the context of building your images from a Spark
-# distribution, the docker build command should be invoked from the top level directory
-# of the Spark distribution. E.g.:
-# docker build -t spark-executor:latest -f kubernetes/dockerfiles/executor/Dockerfile .
-
-COPY examples /opt/spark/examples
-
-CMD SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && \
-    env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt && \
-    readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt && \
-    if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && \
-    if ! [ -z ${SPARK_EXECUTOR_EXTRA_CLASSPATH+x} ]; then SPARK_CLASSPATH="$SPARK_EXECUTOR_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && \
```

--- End diff --

The difference is handled in the submission code; `SPARK_CLASSPATH` is set to the appropriate value.
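The `${VAR+x}` expansions in that CMD are easy to misread, so here is a standalone sketch of the idiom (variable names here are illustrative, not the Dockerfile's): `${EXTRA_CP+x}` expands to `x` when `EXTRA_CP` is set, even to the empty string, and to nothing when it is unset, so `! [ -z "${EXTRA_CP+x}" ]` reads as "if the variable is set".

```shell
#!/bin/sh
# Demonstrates the ${VAR+x} "is-set" test used in the Dockerfile CMD above.
CP="base.jar"

unset EXTRA_CP
# EXTRA_CP is unset: ${EXTRA_CP+x} expands to nothing, the branch is skipped.
if ! [ -z "${EXTRA_CP+x}" ]; then CP="$EXTRA_CP:$CP"; fi

EXTRA_CP="extra.jar"
# EXTRA_CP is set: ${EXTRA_CP+x} expands to "x", the branch prepends it.
if ! [ -z "${EXTRA_CP+x}" ]; then CP="$EXTRA_CP:$CP"; fi

echo "$CP"   # prints extra.jar:base.jar
```

The same effect is more commonly written `[ -n "${VAR+x}" ]`, but the Dockerfile's double-negative form behaves identically.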
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160502410

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile ---

```diff
@@ -41,7 +41,8 @@
 COPY ${spark_jars} /opt/spark/jars
 COPY bin /opt/spark/bin
 COPY sbin /opt/spark/sbin
 COPY conf /opt/spark/conf
-COPY ${img_path}/spark-base/entrypoint.sh /opt/
+COPY ${img_path}/spark/entrypoint.sh /opt/
+COPY examples /opt/spark/examples
```

--- End diff --

Didn't know about that directory, but sounds like it should be added.
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160502618

--- Diff: resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh ---

```diff
@@ -0,0 +1,97 @@
+#!/bin/bash
+#
+# (standard Apache License header, as above)
+#
+
+# echo commands to the terminal output
+set -ex
+
+# Check whether there is a passwd entry for the container UID
+myuid=$(id -u)
+mygid=$(id -g)
+uidentry=$(getent passwd $myuid)
+
+# If there is no passwd entry for the container UID, attempt to create one
+if [ -z "$uidentry" ] ; then
+    if [ -w /etc/passwd ] ; then
+        echo "$myuid:x:$myuid:$mygid:anonymous uid:$SPARK_HOME:/bin/false" >> /etc/passwd
+    else
+        echo "Container ENTRYPOINT failed to add passwd entry for anonymous UID"
+    fi
+fi
+
+SPARK_K8S_CMD="$1"
+if [ -z "$SPARK_K8S_CMD" ]; then
+  echo "No command to execute has been provided." 1>&2
```

--- End diff --

You can do that with `docker container create --entrypoint blah`, right? Otherwise you have to add code here to specify what command to run when no arguments are provided. I'd rather have a proper error, since the entry point is tightly coupled with the submission code.
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160503103

--- Diff: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala ---

```diff
@@ -29,17 +29,23 @@ private[spark] object Config extends Logging {
     .stringConf
     .createWithDefault("default")

+  val CONTAINER_IMAGE =
+    ConfigBuilder("spark.kubernetes.container.image")
+      .doc("Container image to use for Spark containers. Individual container types " +
+        "(e.g. driver or executor) can also be configured to use different images if desired, " +
+        "by setting the container-specific image name.")
```

--- End diff --

Why would I mention just one specific way of overriding this? I also have half a mind to just remove this since this documentation is not visible anywhere...
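For context, the layering this config enables looks roughly like the following spark-submit invocation: one default image, optionally overridden per container type. The container-specific key names shown here are assumptions based on this discussion; check the released Spark on Kubernetes documentation for the exact keys.

```
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --conf spark.kubernetes.container.image=myrepo/spark:v2.3.0 \
  --conf spark.kubernetes.driver.container.image=myrepo/spark-debug:v2.3.0 \
  ...
```

With only the first `--conf`, driver, executor, and init containers would all use the same image, which is the point of consolidating to a single `CONTAINER_IMAGE` setting.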
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160504887

--- Diff: resource-managers/kubernetes/core/src/test/scala/org/apache/spark/deploy/k8s/submit/DriverConfigOrchestratorSuite.scala ---

```diff
@@ -75,8 +73,8 @@ class DriverConfigOrchestratorSuite extends SparkFunSuite {
   test("Submission steps with an init-container.") {
     val sparkConf = new SparkConf(false)
-      .set(DRIVER_CONTAINER_IMAGE, DRIVER_IMAGE)
-      .set(INIT_CONTAINER_IMAGE, IC_IMAGE)
+      .set(CONTAINER_IMAGE, DRIVER_IMAGE)
+      .set(INIT_CONTAINER_IMAGE.key, IC_IMAGE)
```

--- End diff --

Yes, the test is checking different values for the default and init container images.
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160504833

--- Diff: docs/running-on-kubernetes.md ---

```diff
@@ -56,14 +56,13 @@
 be run in a container runtime environment that Kubernetes supports. Docker is a
 frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable
 distribution that can be customized and built for your usage.
```

--- End diff --

Separate change. I don't even know what you'd write there. The whole "custom image" thing needs to be properly specified first - what exactly is the contract between the submission code and the images, for example.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #85855 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85855/testReport)** for PR 20204 at commit [`a3179d7`](https://github.com/apache/spark/commit/a3179d71da64b90b9dd1a2ac8feb9cc2c18572f5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #19290: [SPARK-22063][R] Fixes lint check failures in R by lates...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/19290 The minimum R version supported is something that we can revisit, though. I think we do this for Python and Java versions as well in the project.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test PASSed.
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160463657

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---

```diff
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
     this
   }

+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
```

--- End diff --

change to `e.g. "pmml", "internal", or the fully qualified class name for export)."`
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160483562

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---

```diff
@@ -1044,6 +1056,50 @@ class LinearRegressionSuite extends MLTest with DefaultReadWriteTest {
     LinearRegressionSuite.allParamSettings, checkModelData)
   }

+  test("pmml export") {
+    val lr = new LinearRegression()
+    val model = lr.fit(datasetWithWeight)
+    def checkModel(pmml: PMML): Unit = {
+      val dd = pmml.getDataDictionary
+      assert(dd.getNumberOfFields === 3)
+      val fields = dd.getDataFields.asScala
+      assert(fields(0).getName().toString === "field_0")
+      assert(fields(0).getOpType() == OpType.CONTINUOUS)
+      val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel]
+      val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors
+      val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList
+      assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3)
+      assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3)
+    }
+    testPMMLWrite(sc, model, checkModel)
+  }
+
+  test("unsupported export format") {
+    val lr = new LinearRegression()
+    val model = lr.fit(datasetWithWeight)
+    intercept[SparkException] {
+      model.write.format("boop").save("boop")
+    }
+    intercept[SparkException] {
+      model.write.format("com.holdenkarau.boop").save("boop")
+    }
+    withClue("ML source org.apache.spark.SparkContext is not a valid MLWriterFormat") {
+      intercept[SparkException] {
+        model.write.format("org.apache.spark.SparkContext").save("boop2")
+      }
+    }
+  }
+
+  test("dummy export format is called") {
+    val lr = new LinearRegression()
+    val model = lr.fit(datasetWithWeight)
+    withClue("Dummy writer doesn't write") {
+      intercept[Exception] {
```

--- End diff --

this just catches any exception. Can we do something like

```scala
val thrown = intercept[Exception] {
  model.write.format("org.apache.spark.ml.regression.DummyLinearRegressionWriter").save("")
}
assert(thrown.getMessage.contains("Dummy writer doesn't write."))
```
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160461560

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---

```diff
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }

+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
```

--- End diff --

Was this supposed to be retained from the `DataSourceRegister`?
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160506592

--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- (same test hunk as quoted above, ending at `test("dummy export format is called") {`) --- End diff --

We can also add tests for the `MLFormatRegister` similar to `DDLSourceLoadSuite`. Just add a `META-INF/services/` directory to `src/test/resources/`.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85855/ Test PASSed.
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160496808

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---

```diff
@@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging {
     this
   }

+  // override for Java compatibility
+  override def session(sparkSession: SparkSession): this.type = super.session(sparkSession)
+
+  // override for Java compatibility
+  override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession)
+}
+
+/**
+ * A ML Writer which delegates based on the requested format.
+ */
+class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging {
+  private var source: String = "internal"
+
   /**
-   * Overwrites if the output path already exists.
+   * Specifies the format of ML export (e.g. PMML, internal, or
+   * the fully qualified class name for export).
    */
-  @Since("1.6.0")
-  def overwrite(): this.type = {
-    shouldOverwrite = true
+  @Since("2.3.0")
+  def format(source: String): this.type = {
+    this.source = source
     this
   }

+  /**
+   * Dispatches the save to the correct MLFormat.
+   */
+  @Since("2.3.0")
+  @throws[IOException]("If the input path already exists but overwrite is not enabled.")
+  @throws[SparkException]("If multiple sources for a given short name format are found.")
+  override protected def saveImpl(path: String) = {
```

--- End diff --

Add an explicit return type.
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160462794

--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---

```diff
@@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite {
   protected final def sc: SparkContext = sparkSession.sparkContext
 }

+/**
+ * ML export formats for should implement this trait so that users can specify a shortname rather
+ * than the fully qualified class name of the exporter.
+ *
+ * A new instance of this class will be instantiated each time a DDL call is made.
+ *
+ * @since 2.3.0
+ */
+@InterfaceStability.Evolving
+trait MLFormatRegister {
+  /**
+   * The string that represents the format that this data source provider uses. This is
+   * overridden by children to provide a nice alias for the data source. For example:
+   *
+   * {{{
+   *   override def shortName(): String =
+   *     "pmml+org.apache.spark.ml.regression.LinearRegressionModel"
```

--- End diff --

what about making a second abstract field `def stageName(): String`, instead of having it packed into one string?
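The cost of packing two values into one `shortName()` string is that every consumer has to split it back apart. A minimal sketch of what that parsing entails (the helper is hypothetical, not part of the PR):

```python
def parse_short_name(short_name):
    """Split a packed shortName like
    'pmml+org.apache.spark.ml.regression.LinearRegressionModel'
    into a (format, stage_class_name) pair.

    Hypothetical helper illustrating the parsing burden that two separate
    abstract fields (e.g. shortName() and stageName()) would avoid."""
    fmt, sep, stage = short_name.partition("+")
    if not sep or not fmt or not stage:
        raise ValueError(
            "expected '<format>+<stageClassName>', got %r" % short_name)
    return fmt, stage
```

Splitting the registration key into two fields moves this validation into the type system instead of into every lookup site.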
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160502536 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite { protected final def sc: SparkContext = sparkSession.sparkContext } +/** + * ML export formats for should implement this trait so that users can specify a shortname rather + * than the fully qualified class name of the exporter. + * + * A new instance of this class will be instantiated each time a DDL call is made. + * + * @since 2.3.0 + */ +@InterfaceStability.Evolving +trait MLFormatRegister { + /** + * The string that represents the format that this data source provider uses. This is + * overridden by children to provide a nice alias for the data source. For example: + * + * {{{ + * override def shortName(): String = + * "pmml+org.apache.spark.ml.regression.LinearRegressionModel" + * }}} + * Indicates that this format is capable of saving Spark's own LinearRegressionModel in pmml. + * + * Format discovery is done using a ServiceLoader so make sure to list your format in + * META-INF/services. + * @since 2.3.0 + */ + def shortName(): String +} + +/** + * Implemented by objects that provide ML exportability. + * + * A new instance of this class will be instantiated each time a DDL call is made. + * + * @since 2.3.0 + */ +@InterfaceStability.Evolving +trait MLWriterFormat { + /** + * Function write the provided pipeline stage out. --- End diff -- Should add a full doc here with param annotations. Also should it be "Function to write ..."?
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160501723 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) + + // override for Java compatibility + override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession) +} + +/** + * A ML Writer which delegates based on the requested format. + */ +class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging { + private var source: String = "internal" + /** - * Overwrites if the output path already exists. + * Specifies the format of ML export (e.g. PMML, internal, or + * the fully qualified class name for export). */ - @Since("1.6.0") - def overwrite(): this.type = { -shouldOverwrite = true + @Since("2.3.0") + def format(source: String): this.type = { +this.source = source this } + /** + * Dispatches the save to the correct MLFormat. 
+ */ + @Since("2.3.0") + @throws[IOException]("If the input path already exists but overwrite is not enabled.") + @throws[SparkException]("If multiple sources for a given short name format are found.") + override protected def saveImpl(path: String) = { +val loader = Utils.getContextOrSparkClassLoader +val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader) +val stageName = stage.getClass.getName +val targetName = s"${source}+${stageName}" +val formats = serviceLoader.asScala.toList +val shortNames = formats.map(_.shortName()) +val writerCls = formats.filter(_.shortName().equalsIgnoreCase(targetName)) match { + // requested name did not match any given registered alias + case Nil => +Try(loader.loadClass(source)) match { + case Success(writer) => +// Found the ML writer using the fully qualified path +writer + case Failure(error) => +throw new SparkException( + s"Could not load requested format $source for $stageName ($targetName) had $formats" + + s"supporting $shortNames", error) +} + case head :: Nil => +head.getClass + case _ => +// Multiple sources +throw new SparkException( + s"Multiple writers found for $source+$stageName, try using the class name of the writer") +} +if (classOf[MLWriterFormat].isAssignableFrom(writerCls)) { + val writer = writerCls.newInstance().asInstanceOf[MLWriterFormat] --- End diff -- This will fail, non-intuitively, if anyone ever extends `MLWriterFormat` with a constructor that has more than zero arguments. Meaning: ```scala class DummyLinearRegressionWriter(someParam: Int) extends MLWriterFormat ``` will raise `java.lang.NoSuchMethodException: org.apache.spark.ml.regression.DummyLinearRegressionWriter.()`
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160463225 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite { protected final def sc: SparkContext = sparkSession.sparkContext } +/** + * ML export formats for should implement this trait so that users can specify a shortname rather + * than the fully qualified class name of the exporter. + * + * A new instance of this class will be instantiated each time a DDL call is made. + * + * @since 2.3.0 + */ +@InterfaceStability.Evolving +trait MLFormatRegister { + /** + * The string that represents the format that this data source provider uses. This is + * overridden by children to provide a nice alias for the data source. For example: --- End diff -- "data source" -> "model format"?
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160503322 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) + + // override for Java compatibility + override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession) +} + +/** + * A ML Writer which delegates based on the requested format. + */ +class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging { --- End diff -- need `@Since("2.3.0")` here?
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160471845 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -710,15 +711,57 @@ class LinearRegressionModel private[ml] ( } /** - * Returns a [[org.apache.spark.ml.util.MLWriter]] instance for this ML instance. + * Returns a [[org.apache.spark.ml.util.GeneralMLWriter]] instance for this ML instance. * * For [[LinearRegressionModel]], this does NOT currently save the training [[summary]]. * An option to save [[summary]] may be added in the future. * * This also does not save the [[parent]] currently. */ @Since("1.6.0") - override def write: MLWriter = new LinearRegressionModel.LinearRegressionModelWriter(this) + override def write: GeneralMLWriter = new GeneralMLWriter(this) +} + +/** A writer for LinearRegression that handles the "internal" (or default) format */ +private class InternalLinearRegressionModelWriter() + extends MLWriterFormat with MLFormatRegister { + + override def shortName(): String = +"internal+org.apache.spark.ml.regression.LinearRegressionModel" + + private case class Data(intercept: Double, coefficients: Vector, scale: Double) + + override def write(path: String, sparkSession: SparkSession, +optionMap: mutable.Map[String, String], stage: PipelineStage): Unit = { +val instance = stage.asInstanceOf[LinearRegressionModel] +val sc = sparkSession.sparkContext +// Save metadata and Params +DefaultParamsWriter.saveMetadata(instance, path, sc) +// Save model data: intercept, coefficients, scale +val data = Data(instance.intercept, instance.coefficients, instance.scale) +val dataPath = new Path(path, "data").toString + sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) + } +} + +/** A writer for LinearRegression that handles the "pmml" format */ +private class PMMLLinearRegressionModelWriter() --- End diff -- I could be wrong, but I think we prefer just omitting the `()`? 
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160503640 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -85,12 +87,55 @@ private[util] sealed trait BaseReadWrite { protected final def sc: SparkContext = sparkSession.sparkContext } +/** + * ML export formats for should implement this trait so that users can specify a shortname rather + * than the fully qualified class name of the exporter. + * + * A new instance of this class will be instantiated each time a DDL call is made. + * + * @since 2.3.0 + */ +@InterfaceStability.Evolving +trait MLFormatRegister { + /** + * The string that represents the format that this data source provider uses. This is + * overridden by children to provide a nice alias for the data source. For example: + * + * {{{ + * override def shortName(): String = + * "pmml+org.apache.spark.ml.regression.LinearRegressionModel" + * }}} + * Indicates that this format is capable of saving Spark's own LinearRegressionModel in pmml. + * + * Format discovery is done using a ServiceLoader so make sure to list your format in + * META-INF/services. + * @since 2.3.0 + */ + def shortName(): String +} + +/** + * Implemented by objects that provide ML exportability. + * + * A new instance of this class will be instantiated each time a DDL call is made. + * + * @since 2.3.0 + */ +@InterfaceStability.Evolving +trait MLWriterFormat { --- End diff -- do we need the actual since annotations here, though?
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160503466 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) --- End diff -- since tags here
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160484001 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -1044,6 +1056,50 @@ class LinearRegressionSuite extends MLTest with DefaultReadWriteTest { LinearRegressionSuite.allParamSettings, checkModelData) } + test("pmml export") { +val lr = new LinearRegression() +val model = lr.fit(datasetWithWeight) +def checkModel(pmml: PMML): Unit = { + val dd = pmml.getDataDictionary + assert(dd.getNumberOfFields === 3) + val fields = dd.getDataFields.asScala + assert(fields(0).getName().toString === "field_0") + assert(fields(0).getOpType() == OpType.CONTINUOUS) + val pmmlRegressionModel = pmml.getModels().get(0).asInstanceOf[PMMLRegressionModel] + val pmmlPredictors = pmmlRegressionModel.getRegressionTables.get(0).getNumericPredictors + val pmmlWeights = pmmlPredictors.asScala.map(_.getCoefficient()).toList + assert(pmmlWeights(0) ~== model.coefficients(0) relTol 1E-3) + assert(pmmlWeights(1) ~== model.coefficients(1) relTol 1E-3) +} +testPMMLWrite(sc, model, checkModel) + } + + test("unsupported export format") { +val lr = new LinearRegression() +val model = lr.fit(datasetWithWeight) +intercept[SparkException] { --- End diff -- Doesn't this and the one below it test the same thing? I think we could remove the first one.
[GitHub] spark pull request #19876: [ML][SPARK-11171][SPARK-11239] Add PMML export to...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/19876#discussion_r160461644 --- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala --- @@ -126,15 +180,69 @@ abstract class MLWriter extends BaseReadWrite with Logging { this } + // override for Java compatibility + override def session(sparkSession: SparkSession): this.type = super.session(sparkSession) + + // override for Java compatibility + override def context(sqlContext: SQLContext): this.type = super.session(sqlContext.sparkSession) +} + +/** + * A ML Writer which delegates based on the requested format. + */ +class GeneralMLWriter(stage: PipelineStage) extends MLWriter with Logging { + private var source: String = "internal" + /** - * Overwrites if the output path already exists. + * Specifies the format of ML export (e.g. PMML, internal, or + * the fully qualified class name for export). */ - @Since("1.6.0") - def overwrite(): this.type = { -shouldOverwrite = true + @Since("2.3.0") + def format(source: String): this.type = { +this.source = source this } + /** + * Dispatches the save to the correct MLFormat. + */ + @Since("2.3.0") + @throws[IOException]("If the input path already exists but overwrite is not enabled.") + @throws[SparkException]("If multiple sources for a given short name format are found.") + override protected def saveImpl(path: String) = { +val loader = Utils.getContextOrSparkClassLoader +val serviceLoader = ServiceLoader.load(classOf[MLFormatRegister], loader) +val stageName = stage.getClass.getName +val targetName = s"${source}+${stageName}" --- End diff -- don't need brackets
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20192 > users with custom docker images can override the classpath by I wrote this in a comment above, but there needs to be a proper definition of how to customize these docker images. There needs to be a contract between the submission code, the entry point, and how stuff is laid out inside the image, and I don't see that specified anywhere. However that's done, I also would suggest that env variables be avoided.
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20189 **[Test build #85861 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85861/testReport)** for PR 20189 at commit [`b7dc922`](https://github.com/apache/spark/commit/b7dc92235f434ee0630bdb6af918ee8e58d2fa2b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Merged build finished. Test PASSed.
[GitHub] spark issue #20189: [SPARK-22975] MetricsReporter should not throw exception...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20189 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85861/ Test PASSed.
[GitHub] spark issue #20201: [SPARK-22389][SQL] data source v2 partitioning reporting...
Github user RussellSpitzer commented on the issue: https://github.com/apache/spark/pull/20201 This looks very exciting to me
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20192 **[Test build #85870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85870/testReport)** for PR 20192 at commit [`e771ed9`](https://github.com/apache/spark/commit/e771ed9271f158e3234a3dfa889b49310c690d10).
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20097 Yes. My bad. I didn't realize the branch had already been cut.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20192 @vanzin, do you have some time to modify the integration tests as well? The change LGTM, but a clean run on minikube would give us a lot more confidence. Until the integration tests get checked in to this repo and running in PRB (@ssuchter is working on this), we think that the best way to keep them in sync is to ensure that PRs get a manual clean run against the suite.
[GitHub] spark issue #20203: [SPARK-22577] [core] executor page blacklist status shou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20203 **[Test build #85872 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85872/testReport)** for PR 20203 at commit [`d8c214b`](https://github.com/apache/spark/commit/d8c214b33f4b014f5a2c0644074f9b7668364799).
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20209 **[Test build #85871 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85871/testReport)** for PR 20209 at commit [`2c10416`](https://github.com/apache/spark/commit/2c10416a8a06f4c574dc662a1d7bb7dbcdd36a37).
[GitHub] spark pull request #20210: [SPARK-23009][PYTHON] Fix for non-str col names t...
GitHub user BryanCutler opened a pull request: https://github.com/apache/spark/pull/20210 [SPARK-23009][PYTHON] Fix for non-str col names to createDataFrame from Pandas ## What changes were proposed in this pull request? This fixes the case when calling `SparkSession.createDataFrame` using a Pandas DataFrame that has non-str column labels. ## How was this patch tested? Added a new test with a Pandas DataFrame that has int column labels You can merge this pull request into a Git repository by running: $ git pull https://github.com/BryanCutler/spark python-createDataFrame-int-col-error-SPARK-23009 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20210.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20210 commit e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e Author: Bryan Cutler Date: 2018-01-09T19:57:29Z fixed col name encoding to allow for unicode and non-str, added test
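The heart of the fix is that a Spark schema needs string column names while pandas permits arbitrary labels (ints, unicode, etc.). A standalone sketch of that normalization, without pyspark or pandas — the function name is hypothetical and not the code in this PR:

```python
def normalize_column_names(columns):
    # A pandas DataFrame created without explicit columns gets an int
    # RangeIndex (0, 1, ...); Spark StructField names must be str, so
    # coerce every non-str label to its string form.
    return [col if isinstance(col, str) else str(col) for col in columns]

print(normalize_column_names([0, 1, "label"]))  # → ['0', '1', 'label']
```

With labels coerced up front, schema inference never sees a non-str field name.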
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/20210 Just came across this issue, ping @HyukjinKwon @ueshin
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160511180 --- Diff: docs/running-on-kubernetes.md --- @@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized and built for your usage. --- End diff -- I agree that we don't have a solid story around customizing images here. But I do think that we need something clearly telling people that we do support using custom images if they want to and the properties they should use to configure custom images. It just doesn't need to be opinionated on things like the contract you mentioned.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20210 **[Test build #85873 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85873/testReport)** for PR 20210 at commit [`e2b1a41`](https://github.com/apache/spark/commit/e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e).
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160512231 --- Diff: docs/running-on-kubernetes.md --- @@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized and built for your usage. --- End diff -- Still, that sounds like something that should be added in a separate change. I'm not changing the customizability of images in this change. And not having a contract means people will have no idea of how to customize images, so you can't even write proper documentation for that.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user liyinan926 commented on the issue: https://github.com/apache/spark/pull/20192 LGTM.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20210 **[Test build #85874 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85874/testReport)** for PR 20210 at commit [`e2e6025`](https://github.com/apache/spark/commit/e2e60251e9be8fd0894e030d4a8b28a549ce777f).
[GitHub] spark pull request #20192: [SPARK-22994][k8s] Use a single image for all Spa...
Github user liyinan926 commented on a diff in the pull request: https://github.com/apache/spark/pull/20192#discussion_r160512758 --- Diff: docs/running-on-kubernetes.md --- @@ -56,14 +56,13 @@ be run in a container runtime environment that Kubernetes supports. Docker is a frequently used with Kubernetes. With Spark 2.3, there are Dockerfiles provided in the runnable distribution that can be customized and built for your usage. --- End diff -- OK, I am fine with adding this in a future change.
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user tdas commented on the issue: https://github.com/apache/spark/pull/20097 Done. https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=44763d93c0d923977c114d63586abfc1b68ad7fc
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20204 **[Test build #85858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85858/testReport)** for PR 20204 at commit [`3c3c3cb`](https://github.com/apache/spark/commit/3c3c3cba721def78117561d865291931c2d5acd3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85858/ Test PASSed.
[GitHub] spark issue #20204: [SPARK-7721][PYTHON][TESTS] Adds PySpark coverage genera...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20204 Merged build finished. Test PASSed.
[GitHub] spark issue #20097: [SPARK-22912] v2 data source support in MicroBatchExecut...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/20097 Thank you, @tdas !
[GitHub] spark pull request #20096: [SPARK-22908] Add kafka source and sink for conti...
Github user tdas commented on a diff in the pull request: https://github.com/apache/spark/pull/20096#discussion_r160516177 --- Diff: external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaContinuousSuite.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.kafka010 + +import java.util.Properties +import java.util.concurrent.atomic.AtomicInteger + +import org.scalatest.time.SpanSugar._ +import scala.collection.mutable +import scala.util.Random + +import org.apache.spark.SparkContext +import org.apache.spark.sql.{DataFrame, Dataset, ForeachWriter, Row} +import org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation +import org.apache.spark.sql.execution.streaming.StreamExecution +import org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution +import org.apache.spark.sql.streaming.{StreamTest, Trigger} +import org.apache.spark.sql.test.{SharedSQLContext, TestSparkSession} + +trait KafkaContinuousTest extends KafkaSourceTest { + override val defaultTrigger = Trigger.Continuous(1000) + override val defaultUseV2Sink = true + + // We need more than the default local[2] to be able to schedule all partitions simultaneously. 
+ override protected def createSparkSession = new TestSparkSession( +new SparkContext( + "local[10]", + "continuous-stream-test-sql-context", + sparkConf.set("spark.sql.testkey", "true"))) + + override protected def setTopicPartitions( --- End diff -- Add comment on what this method does. It is asserting something, so does not look like it only "sets" something.
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20209 **[Test build #85871 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85871/testReport)** for PR 20209 at commit [`2c10416`](https://github.com/apache/spark/commit/2c10416a8a06f4c574dc662a1d7bb7dbcdd36a37). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85871/ Test PASSed.
[GitHub] spark issue #20209: [SPARK-23008][ML] OnehotEncoderEstimator python API
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20209 Merged build finished. Test PASSed.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20210 **[Test build #85873 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85873/testReport)** for PR 20210 at commit [`e2b1a41`](https://github.com/apache/spark/commit/e2b1a4160be46e4cdf248e4d219c1a0e2dbec00e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20210 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85873/ Test FAILed.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20210 Merged build finished. Test FAILed.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20192

> do you have some time to modify the integration tests as well

I can try to look, but really you guys should be putting that code into the Spark repo. I don't see a task under SPARK-18278 for adding the integration tests.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20023 **[Test build #85864 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85864/testReport)** for PR 20023 at commit [`20616fd`](https://github.com/apache/spark/commit/20616fdfc1a75ea9ae0ec531ce72d8c722facb31). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20023 Merged build finished. Test FAILed.
[GitHub] spark issue #20023: [SPARK-22036][SQL] Decimal multiplication with high prec...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20023 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85864/ Test FAILed.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20210 Merged build finished. Test FAILed.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20210 **[Test build #85874 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85874/testReport)** for PR 20210 at commit [`e2e6025`](https://github.com/apache/spark/commit/e2e60251e9be8fd0894e030d4a8b28a549ce777f). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20210: [SPARK-23009][PYTHON] Fix for non-str col names to creat...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20210 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85874/ Test FAILed.
[GitHub] spark issue #20192: [SPARK-22994][k8s] Use a single image for all Spark cont...
Github user foxish commented on the issue: https://github.com/apache/spark/pull/20192 Thanks @vanzin. I was waiting on the spark-dev [thread on integration testing](http://apache-spark-developers-list.1001551.n3.nabble.com/Integration-testing-and-Scheduler-Backends-td23105.html) to conclude. It does look like checking the tests in is something we should do - adding a task tracking it. We're also stabilizing the testing atm - so, I'm thinking we'll target that for post-2.3. Would be great to get an architecture review from the Spark community on it, as it exists today, to get some feedback going.
[GitHub] spark issue #20151: [SPARK-22959][PYTHON] Configuration to select the module...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/20151 So I think this could be the basis for solving a lot of related problems, and I like the minimally invasive approach to it. I think the error message for setting it to a bad module rather than a nonexistent module is probably going to be very confusing. I think it would be good to make it clear that this is an advanced setting we don't expect most users to modify directly.
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85875 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85875/testReport)** for PR 20096 at commit [`f825155`](https://github.com/apache/spark/commit/f8251552398f980768b23059c1bbbd028cfee859).
[GitHub] spark issue #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pyspark
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/13599 So, for what it's worth, I made a quick POC of supporting similar functionality without requiring any changes to Spark itself ( https://github.com/nteract/coffee_boat ), which should also work in standalone mode. I'm going to poke at it a bit more and explore whether pex might be better than conda (although given the packages most folks want, conda seems better).
[GitHub] spark pull request #13599: [SPARK-13587] [PYSPARK] Support virtualenv in pys...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/13599#discussion_r160521328

--- Diff: python/pyspark/context.py ---
@@ -1023,6 +1032,35 @@ def getConf(self):
         conf.setAll(self._conf.getAll())
         return conf

+    def install_packages(self, packages, install_driver=True):
+        """
+        Install python packages on all executors and the driver through pip. pip will be installed
+        by default whether using native virtualenv or conda, so it is guaranteed that pip is
+        available if virtualenv is enabled.
+        :param packages: string for a single package, or a list of strings for multiple packages
+        :param install_driver: whether to also install the packages in the client
+        """
+        if self._conf.get("spark.pyspark.virtualenv.enabled") != "true":
+            raise RuntimeError("install_packages can only be called when "
+                               "spark.pyspark.virtualenv.enabled is set to true")
+        if isinstance(packages, basestring):
+            packages = [packages]
+        # statusTracker.getExecutorInfos() returns the driver + executors, so -1 here.
+        num_executors = len(self._jsc.sc().statusTracker().getExecutorInfos()) - 1
+        dummyRDD = self.parallelize(range(num_executors), num_executors)
+
+        def _run_pip(packages, iterator):
+            import pip
+            pip.main(["install"] + packages)
+
+        # Run it in the main thread. Will do it in a separate thread after
+        # https://github.com/pypa/pip/issues/2553 is fixed.
+        if install_driver:
+            _run_pip(packages, None)
+
+        import functools
+        dummyRDD.foreachPartition(functools.partial(_run_pip, packages))
--- End diff --

This approach is not resilient to executor failure/restart, dynamic allocation, and other possible changes. I'm not comfortable merging something that depends on this.
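The failure mode raised in this review — a one-shot `foreachPartition` install only reaches executors that happen to be alive at that moment — can be illustrated with a toy simulation. The `Executor` class and cluster list below are hypothetical stand-ins for illustration only, not Spark code:

```python
class Executor:
    """Toy stand-in for a worker: tracks which packages it has installed."""
    def __init__(self):
        self.packages = set()

def install_on_current(executors, package):
    # Analogous to the foreachPartition-based pip install: it only
    # touches the executors that exist right now.
    for e in executors:
        e.packages.add(package)

cluster = [Executor(), Executor()]
install_on_current(cluster, "numpy")

# Dynamic allocation (or an executor restart) later adds a fresh
# executor; it never saw the install pass and lacks the package.
cluster.append(Executor())

print([sorted(e.packages) for e in cluster])
# → [['numpy'], ['numpy'], []]
```

This is why an install hook that runs at executor startup (rather than once over a dummy RDD) is more robust in the presence of dynamic allocation.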
[GitHub] spark issue #20096: [SPARK-22908] Add kafka source and sink for continuous p...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20096 **[Test build #85876 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85876/testReport)** for PR 20096 at commit [`9101ea6`](https://github.com/apache/spark/commit/9101ea6ef5dfd77eb0dcf3aee622b2d7a145323f).
[GitHub] spark issue #20188: [SPARK-22993][ML] Clarify HasCheckpointInterval param do...
Github user sethah commented on the issue: https://github.com/apache/spark/pull/20188 Good call @felixcheung! Will update shortly.
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
GitHub user icexelloss opened a pull request: https://github.com/apache/spark/pull/20211

[SPARK-23011][PYTHON][SQL] Prepend missing grouping key in groupby apply

## What changes were proposed in this pull request?

See https://issues.apache.org/jira/browse/SPARK-23011

## How was this patch tested?

Add more tests in `test_complex_groupby`

## TODO:

- [ ] Document the usage in groupby apply

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/icexelloss/spark SPARK-23011-groupby-apply-group-key

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20211.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20211

commit 51ce6e85953bd39e901fec24dfca45b86f55f939
Author: Li Jin
Date: 2018-01-02T18:45:34Z
    wip

commit 07f921139e250bd62e79da8475d8d615045d636a
Author: Li Jin
Date: 2018-01-09T20:08:15Z
    Test working; Need to add docs

commit f2822b529293e37f63a4a190b25dbdd018e36ba6
Author: Li Jin
Date: 2018-01-09T20:55:03Z
    Add simple doc
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping key ...
Github user icexelloss commented on the issue: https://github.com/apache/spark/pull/20211 cc @HyukjinKwon @ueshin @cloud-fan @viirya
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20211 **[Test build #85877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85877/testReport)** for PR 20211 at commit [`f2822b5`](https://github.com/apache/spark/commit/f2822b529293e37f63a4a190b25dbdd018e36ba6).
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85878 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85878/testReport)** for PR 20168 at commit [`eee25ce`](https://github.com/apache/spark/commit/eee25ceffde2c1d6ca248eceb17a559e2f921cc6).
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20211 **[Test build #85877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85877/testReport)** for PR 20211 at commit [`f2822b5`](https://github.com/apache/spark/commit/f2822b529293e37f63a4a190b25dbdd018e36ba6). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85877/ Test FAILed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20168 **[Test build #85878 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85878/testReport)** for PR 20168 at commit [`eee25ce`](https://github.com/apache/spark/commit/eee25ceffde2c1d6ca248eceb17a559e2f921cc6). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85878/ Test FAILed.
[GitHub] spark issue #20211: [SPARK-23011][PYTHON][SQL] Prepend missing grouping colu...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20211 Merged build finished. Test FAILed.
[GitHub] spark issue #20168: [SPARK-22730][ML] Add ImageSchema support for non-intege...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20168 Merged build finished. Test FAILed.
[GitHub] spark pull request #20211: [SPARK-23011][PYTHON][SQL] Prepend missing groupi...
Github user icexelloss commented on a diff in the pull request: https://github.com/apache/spark/pull/20211#discussion_r160524679

--- Diff: python/pyspark/sql/group.py ---
@@ -233,6 +233,27 @@ def apply(self, udf):
 | 2| 1.1094003924504583|
 +---+---+

+Notes on grouping column:
--- End diff --

This explains the general idea. I plan to improve the doc if people think this change is good.
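What "prepend the missing grouping column" means can be sketched in plain Python, independent of Spark. The `apply_with_key` helper and dict-based rows below are illustrative stand-ins only, not the PySpark implementation under review:

```python
from collections import defaultdict

def apply_with_key(rows, key, func):
    """Group rows (dicts) by rows[key], apply func to each group, and
    prepend the grouping key to any result row that omits it."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r)
    out = []
    for k, grp in groups.items():
        for res in func(grp):
            if key not in res:
                # the user's function dropped the grouping column:
                # prepend it so the result still identifies the group
                res = {key: k, **res}
            out.append(res)
    return out

rows = [{"id": 1, "v": 1.0}, {"id": 1, "v": 2.0}, {"id": 2, "v": 3.0}]

def demean(group):
    # returns only "v", without the grouping column "id"
    mean = sum(r["v"] for r in group) / len(group)
    return [{"v": r["v"] - mean} for r in group]

print(apply_with_key(rows, "id", demean))
# → [{'id': 1, 'v': -0.5}, {'id': 1, 'v': 0.5}, {'id': 2, 'v': 0.0}]
```

The design question being discussed is exactly this: whether the user's function must return the grouping columns itself, or whether the engine should supply them when missing.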