[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4222#issuecomment-71639706 [Test build #26160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26160/consoleFull) for PR 4222 at commit [`51987d2`](https://github.com/apache/spark/commit/51987d24ea6b29c9607679daa2b482d5855be361). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605804 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26151/ Test FAILed.
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605655 [Test build #26151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26151/consoleFull) for PR 4109 at commit [`caf4438`](https://github.com/apache/spark/commit/caf44387c2d3af5df771b9ce74aa8a9bac3f0827). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605800 [Test build #26151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26151/consoleFull) for PR 4109 at commit [`caf4438`](https://github.com/apache/spark/commit/caf44387c2d3af5df771b9ce74aa8a9bac3f0827). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DenseMatrix(`
[GitHub] spark pull request: [SPARK-5423][Core] Cleanup resources in DiskMa...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/4219 [SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file This PR adds a `finalize` method in DiskMapIterator to clean up the resources even if some exception happens during processing data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-5423 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4219.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4219 commit d4b2ca69b3bc2d729f5d44750ab6b81de6e77644 Author: zsxwing zsxw...@gmail.com Date: 2015-01-27T08:16:13Z Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file
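The finalize-as-safety-net pattern this PR describes can be sketched as follows. This is a minimal, hypothetical Python analogue (`TempFileIterator` and its members are invented names, not the actual DiskMapIterator code), using `__del__` as the last-resort hook in place of a JVM `finalize` method:

```python
import os
import tempfile

class TempFileIterator:
    """Hypothetical sketch of the PR's cleanup pattern: an iterator that
    owns a temp file and deletes it in a last-resort finalizer if normal
    cleanup never ran (e.g. an exception interrupted processing)."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        self._cleaned = False

    def cleanup(self):
        # Normal cleanup path: delete the temp file exactly once.
        if not self._cleaned:
            os.remove(self.path)
            self._cleaned = True

    def __del__(self):
        # Last-resort hook (the analogue of Java/Scala finalize()):
        # ensure the temp file is deleted even on error paths.
        self.cleanup()

it = TempFileIterator()
path = it.path
it.cleanup()
print(os.path.exists(path))  # the temp file is gone after cleanup
```

The idempotence guard (`_cleaned`) matters because the finalizer may run after an explicit cleanup has already deleted the file.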
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-71651118 What is going on with these tests??? I've created three PRs - for 1.1, 1.2 and 1.3 - and all of them failed in a very strange way.
Re: [GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
The test failures look unrelated, and are a Jenkins error. You should just make one PR for master; it will be back-ported as needed. On Tue, Jan 27, 2015 at 1:58 PM, jacek-lewandowski g...@git.apache.org wrote: Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-71651118 What is going on with these tests??? I've created three PRs - for 1.1, 1.2 and 1.3 and all of them failed in a very strange way.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-71649841 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26161/ Test FAILed.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-71649835 [Test build #26161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26161/consoleFull) for PR 4221 at commit [`94aeacf`](https://github.com/apache/spark/commit/94aeacf6fcc7fae6d045d35b9d8f1fe4c2594780). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-71652555 [Test build #26164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26164/consoleFull) for PR 731 at commit [`9f0c3a4`](https://github.com/apache/spark/commit/9f0c3a4393933e77c5e97a322d7bd9038afc7f78). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-71653299 [Test build #26165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26165/consoleFull) for PR 731 at commit [`97918d2`](https://github.com/apache/spark/commit/97918d2753359881dfd7f512bedc4495e47d3599). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4161#issuecomment-71699359 Thanks Sean - pulling this in.
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71704847 [Test build #26171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26171/consoleFull) for PR 4226 at commit [`1433d76`](https://github.com/apache/spark/commit/1433d76d0b42e3c5fa873258fc659ee3e7d162cc). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708214 [Test build #26173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26173/consoleFull) for PR 4155 at commit [`1df2a91`](https://github.com/apache/spark/commit/1df2a91eb39300a32ad095b37a04846d135e2cc5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71708263 [Test build #26172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26172/consoleFull) for PR 4228 at commit [`d600b6c`](https://github.com/apache/spark/commit/d600b6cd7d80bfc31878cf1dec2a706b7256474a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71710163 I don't have permission to do it. Can you click the close button?
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635244 --- Diff: examples/src/main/python/ml/simple_text_classification_pipeline.py --- @@ -0,0 +1,70 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from pyspark import SparkContext +from pyspark.sql import SQLContext, Row +from pyspark.ml import Pipeline +from pyspark.ml.feature import HashingTF, Tokenizer +from pyspark.ml.classification import LogisticRegression + +""" A simple text classification pipeline that recognizes spark from input text. This is to show how to create and configure a Spark ML pipeline in Python. Run with: bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py """ + +if __name__ == "__main__": +sc = SparkContext(appName="SimpleTextClassificationPipeline") +sqlCtx = SQLContext(sc) +training = sqlCtx.inferSchema( +sc.parallelize([(0L, "a b c d e spark", 1.0), +(1L, "b d", 0.0), +(2L, "spark f g h", 1.0), +(3L, "hadoop mapreduce", 0.0)]) + .map(lambda x: Row(id=x[0], text=x[1], label=x[2]))) --- End diff -- done
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user prabeesh commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23636136 --- Diff: examples/src/main/python/streaming/kafka_wordcount.py --- @@ -0,0 +1,57 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" Counts words in UTF8 encoded, '\n' delimited text received from the network every second. + Usage: network_wordcount.py zk topic + + To run this on your local machine, you need to setup Kafka and create a producer first + $ bin/zookeeper-server-start.sh config/zookeeper.properties + $ bin/kafka-server-start.sh config/server.properties + $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --topic test + $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test + --- End diff -- Are all the above commands meant to be run from Kafka's bin/? It still creates some confusion about which directory to use: in all the Spark examples, bin/ refers to Spark's bin/.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-71713330 @petro-rudenko It is possible to get the state, but not in a single object. It's a good question whether a model and its state should be different concepts. In the current MLlib code, they are the same concept, so the functionality you're mentioning is supported in slightly different ways: * Saving will happen through save/load methods (which I'm working on: [https://issues.apache.org/jira/browse/SPARK-4587]) * Passing to prediction front-ends can happen through save, or by manually taking the needed elements of the state. * Copying the model can be done by copy(). * Printing/viewing the state can be done by casting the bestModel to the correct type: ``` cvModel.bestModel.asInstanceOf[LogisticRegressionModel].weights ... ```
[GitHub] spark pull request: [SPARK-5432] DriverSuite and SparkSubmitSuite ...
GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/4230 [SPARK-5432] DriverSuite and SparkSubmitSuite should sc.stop() In the past we've disabled the UIs and messed with the ports to keep the tests passing. However, these are only temporary fixes since ultimately we're still leaking a JVM after each individual test has finished. If we stop the `SparkContext` that should ensure the resources get cleaned up properly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewor14/spark fix-driver-suite Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4230.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4230 commit 8092c36831bb2b348f566bb4bd9a8d234cc5fc3d Author: Andrew Or and...@databricks.com Date: 2015-01-27T19:51:25Z Stop SparkContext after every individual test
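The fix pattern (stop the context in per-test teardown instead of masking the leak with disabled UIs and port juggling) can be sketched with plain unittest and a stand-in context. All names here are hypothetical; no real SparkContext is involved:

```python
import unittest

class FakeSparkContext:
    """Stand-in for a SparkContext; tracks whether stop() was called."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        # In the real suite this releases the UI port, event loops, and
        # ultimately lets the test JVM exit cleanly.
        self.stopped = True

class DriverSuiteSketch(unittest.TestCase):
    def setUp(self):
        self.sc = FakeSparkContext()

    def tearDown(self):
        # Stop the context after every individual test so nothing leaks.
        self.sc.stop()

    def test_context_starts_unstopped(self):
        self.assertFalse(self.sc.stopped)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DriverSuiteSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Putting the stop in `tearDown` (rather than at the end of each test body) ensures it runs even when a test fails mid-way.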
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719260 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26182/ Test FAILed.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719257 [Test build #26182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26182/consoleFull) for PR 4229 at commit [`3b45aca`](https://github.com/apache/spark/commit/3b45aca6cbe5ad5312eb50425b750c5b8fe9de5f). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):` * `class MQTTUtils(object):`
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/3732#discussion_r23628599 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala --- @@ -252,7 +252,7 @@ trait Row extends Serializable { * * @throws ClassCastException when data type does not match. */ - def getDate(i: Int): java.sql.Date = apply(i).asInstanceOf[java.sql.Date] + def getDate(i: Int): java.sql.Date = DateUtils.toJavaDate(getInt(i)) --- End diff -- one thing - you probably want to do the conversion when we create the row, like what we do for other types, instead of doing the conversion when it is accessed. otherwise apply(i: Int) will return date.
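The representation under discussion stores a date internally as a plain int (days since the Unix epoch) and converts it back to a rich date object only where needed. A minimal Python sketch of the idea (the helper names are hypothetical illustrations, not Spark's actual DateUtils API):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_days(d):
    # Internal (native) representation: days since the Unix epoch.
    return (d - EPOCH).days

def to_external_date(days):
    # Conversion back to a date object. The review comment is about
    # *when* this runs: eagerly at row creation, or lazily on access
    # (as getDate does in the diff above).
    return EPOCH + timedelta(days=days)

row = [to_days(date(2015, 1, 27))]   # the row holds only an int
print(row[0], to_external_date(row[0]))
```

With the lazy approach, any accessor that skips the conversion (such as a raw `apply(i)`) sees the int rather than a date, which is exactly the inconsistency the comment points out.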
[GitHub] spark pull request: SPARK-1934 [CORE] this reference escape to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4225#issuecomment-71702592 [Test build #26170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26170/consoleFull) for PR 4225 at commit [`c4dec3b`](https://github.com/apache/spark/commit/c4dec3b00426a1a427a4e8f88c6f733c583ebc97). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71707934 Hi, @rxin I created a new PR #4227 to rewrite `this part` and bring everything up-to-date, would you please review it?
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71708856 Thanks. Do you mind closing this one?
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633790 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St | --name NAME A name of your application. | --jars JARS Comma-separated list of local jars to include on the driver | and executor classpaths. +| --maven Comma-separated list of maven coordinates of jars to include +| on the driver and executor classpaths. Will search the local +| maven repo, then maven central and any additional remote +| repositories given by --maven_repos. --- End diff -- Instead of taking one parameter with a list of all Maven packages, we might want to allow separate packages to be passed with separate `--maven` args. Dunno, @pwendell / @mengxr what do you think? It's just going to be annoying for people to write a giant comma separated string. Same with repos actually.
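The design question raised here, one comma-separated flag versus a repeatable flag, can be compared with a small argparse sketch (the flag names are illustrative, not SparkSubmit's actual options):

```python
import argparse

parser = argparse.ArgumentParser()
# Alternative A: a single flag taking one comma-separated coordinate list.
parser.add_argument("--packages", default="")
# Alternative B: a repeatable flag; argparse appends each occurrence.
parser.add_argument("--pkg", action="append", default=[])

args = parser.parse_args([
    "--packages", "com.example:a:1.0,com.example:b:2.0",
    "--pkg", "com.example:a:1.0",
    "--pkg", "com.example:b:2.0",
])
comma_style = [c for c in args.packages.split(",") if c]
print(comma_style == args.pkg)  # both styles yield the same list
```

Both parse to the same list; the trade-off is purely ergonomic: a repeatable flag avoids the "giant comma separated string" at the cost of a longer command line.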
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
GitHub user prabeesh opened a pull request: https://github.com/apache/spark/pull/4229 [SPARK-5155] [PySpark] [Streaming] Mqtt streaming support in Python

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/prabeesh/spark mqtt_python

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4229.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4229

commit 07923c42fcb4d210333b3882490e23f33dc4822f
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T21:48:42Z
    support kafka in Python

commit 75d485e65b75a7a5da91a37ff42a9bb7cd82dcf6
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T22:18:59Z
    add mqtt

commit 048dbe6c9ec4bff452c70e4e18d48d3075e0
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T22:27:43Z
    fix python style

commit 5697a012def1b8508d21d96f2afb7d6705cf
Author: Davies Liu dav...@databricks.com
Date: 2014-12-18T23:44:33Z
    bypass decoder in scala

commit 98c8d179d3ff264d03eabc3ddd72936d95e6e305
Author: Davies Liu dav...@databricks.com
Date: 2014-12-18T23:58:29Z
    fix python style

commit f6ce899abd435f36f7c5907523c643cc8b0e61ed
Author: Davies Liu dav...@databricks.com
Date: 2015-01-08T21:28:39Z
    add example and fix bugs

commit eea16a79e741255548ef2e006db3948771a47e0d
Author: Davies Liu dav...@databricks.com
Date: 2015-01-08T21:29:35Z
    refactor

commit aea89538dcb9b80111f98df881d345d4e87e91aa
Author: Tathagata Das t...@databricks.com
Date: 2015-01-22T01:31:30Z
    Kafka-assembly for Python API

commit adeeb3863353f9a0ca3070a9cc914a2914d95fa9
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T07:08:56Z
    Merge pull request #3 from tdas/kafka-python-api
    Kafka-assembly for Python API

commit 33730d14c069042dfccd4af857021afe7ff0cbb0
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T07:50:28Z
    Merge branch 'master' of github.com:apache/spark into kafka
    Conflicts:
        make-distribution.sh
        project/SparkBuild.scala

commit 2c567a5d55c465d706026c2395e9025fad9dbd68
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T08:01:02Z
    update logging and comment

commit 97386b3debd5f352b61dfed194ab9495fecbe834
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T08:08:06Z
    address comment

commit 370ba61571b98e9bdfb6636852d4404687143853
Author: Davies Liu dav...@databricks.com
Date: 2015-01-26T20:25:12Z
    Update kafka.py
    fix spark-submit

commit 26a9960937368202dfdba6c0bbf5bf7c0168e72d
Author: prabs prabsma...@gmail.com
Date: 2015-01-27T18:31:54Z
    Merge branch 'kafka' of github.com:davies/spark into mqtt_python

commit 58aa907b8ad1913229468f3c3776dba2d8b45580
Author: prabs prabsma...@gmail.com
Date: 2015-01-23T19:58:22Z
    Mqtt streaming support in Python
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710454 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26176/ Test FAILed.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71710396 @mengxr if we are going to add this as a first class API, can we have it in Java and Python too? Also, /cc to @rxin to also vet whether we want this in the core API. My feeling is that it's hard for users to figure out how to do this on their own, and for any expensive reduction function, users will need something like this in a large cluster.
[GitHub] spark pull request: Set UserKnownHostsFile to ensure deploying and...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/4201#issuecomment-71714505 It looks like we have another PR opened a few hours before this that does the same thing: #4196
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715606 [Test build #26180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26180/consoleFull) for PR 3715 at commit [`dc1eed0`](https://github.com/apache/spark/commit/dc1eed0a6af190d5cf07dedcb0607a0a76e45d64). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5429][SQL] Use javaXML plan serializati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4223#issuecomment-71698918 [Test build #26167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26167/consoleFull) for PR 4223 at commit [`97a8760`](https://github.com/apache/spark/commit/97a8760f6c8713af581b95f05d11e8d11f331246). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5429][SQL] Use javaXML plan serializati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4223#issuecomment-71698923 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26167/ Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633701

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St
 | --name NAME           A name of your application.
 | --jars JARS           Comma-separated list of local jars to include on the driver
 |                       and executor classpaths.
+| --maven               Comma-separated list of maven coordinates of jars to include
+|                       on the driver and executor classpaths. Will search the local
+|                       maven repo, then maven central and any additional remote
+|                       repositories given by --maven_repos.
+| --maven_repos         Supply additional remote repositories to search for the
+|                       maven coordinates given with --maven.
--- End diff --

You should say this is a comma-separated list
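The `--maven` flag discussed above takes Maven coordinates, which conventionally follow a `group:artifact:version` layout. As a hedged illustration only (this is not SparkSubmit's actual parsing code; the function name is hypothetical), splitting such a comma-separated list might look like:

```python
def parse_coordinates(coords: str):
    """Split a comma-separated list of Maven coordinates into
    (group, artifact, version) triples.

    Assumes the common group:artifact:version form; real resolvers
    also accept classifiers and packaging types, which are omitted here.
    """
    out = []
    for coord in coords.split(","):
        parts = coord.strip().split(":")
        if len(parts) != 3:
            raise ValueError(f"bad coordinate: {coord!r}")
        out.append(tuple(parts))
    return out

# Example: one coordinate as it might be passed to --maven
print(parse_coordinates("org.apache.kafka:kafka_2.10:0.8.0"))
```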
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71710113 [Test build #26177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26177/consoleFull) for PR 3715 at commit [`31e2317`](https://github.com/apache/spark/commit/31e2317a31c90b23a7b085c7fd5a1de8998194a6). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71709975 ok, please close this one : )
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710060 [Test build #26176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26176/consoleFull) for PR 4155 at commit [`9fe6495`](https://github.com/apache/spark/commit/9fe64953aa437ed1ed88a294e04129afc8f2bbb5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71712072 I don't think we should do it separately (it sets a bad precedent), but if you are too busy, we can try to find someone in the community to do all three. It's pretty straightforward.
[GitHub] spark pull request: [SPARK-5432] DriverSuite and SparkSubmitSuite ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4230#issuecomment-71716554 [Test build #26181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26181/consoleFull) for PR 4230 at commit [`8092c36`](https://github.com/apache/spark/commit/8092c36831bb2b348f566bb4bd9a8d234cc5fc3d). * This patch merges cleanly.
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71716462 [Test build #26171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26171/consoleFull) for PR 4226 at commit [`1433d76`](https://github.com/apache/spark/commit/1433d76d0b42e3c5fa873258fc659ee3e7d162cc). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5393. Flood of util.RackResolver log mes...
Github user ksakellis commented on a diff in the pull request: https://github.com/apache/spark/pull/4192#discussion_r23629409

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -60,6 +62,9 @@ private[yarn] class YarnAllocator(
   import YarnAllocator._
+  // RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
+  Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
--- End diff --

Well, I disagree. A user will get very frustrated if they are debugging an issue and they can't turn on the logging. Can you add a check: `Logger.getLogger(classOf[RackResolver]).getLevel() != null`? That way you at least won't be overriding the logging level if it is set.
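The guard suggested in this review (only lower a noisy logger's verbosity when the user has not configured it) relies on log4j's `getLevel()` returning null for unconfigured loggers. A minimal sketch of the same pattern in Python's standard `logging` module, purely illustrative, where `NOTSET` plays the role of log4j's null:

```python
import logging

def quiet_noisy_logger(name: str, level: int = logging.WARNING) -> None:
    """Raise a logger's threshold only if it was never explicitly set."""
    logger = logging.getLogger(name)
    # level == NOTSET means no level was explicitly configured for this
    # logger, so overriding it won't clobber a user's debugging setup.
    if logger.level == logging.NOTSET:
        logger.setLevel(level)

# Hypothetical noisy component: silenced because nobody configured it.
quiet_noisy_logger("rack.resolver")
```

A user who had already done `logging.getLogger("rack.resolver").setLevel(logging.DEBUG)` before this call would keep their DEBUG output, which is exactly the behavior the reviewer is asking for.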
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633633

--- Diff: bin/utils.sh ---
@@ -26,14 +26,14 @@ function gatherSparkSubmitOpts() {
     exit 1
   fi
-  # NOTE: If you add or remove spark-sumbmit options,
+  # NOTE: If you add or remove spark-submit options,
   # modify NOT ONLY this script but also SparkSubmitArgument.scala
   SUBMISSION_OPTS=()
   APPLICATION_OPTS=()
   while (($#)); do
     case $1 in
-      --master | --deploy-mode | --class | --name | --jars | --py-files | --files | \
-      --conf | --properties-file | --driver-memory | --driver-java-options | \
+      --master | --deploy-mode | --class | --name | --jars | --maven | --py-files | --files | \
+      --conf | --maven_repos | --properties-file | --driver-memory | --driver-java-options | \
--- End diff --

Rename this to --maven-repos with a dash instead of an underscore; everything else has a dash
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708399 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26173/ Test FAILed.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708396 [Test build #26173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26173/consoleFull) for PR 4155 at commit [`1df2a91`](https://github.com/apache/spark/commit/1df2a91eb39300a32ad095b37a04846d135e2cc5).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class CommitDeniedException(msg: String, jobID: Int, splitID: Int, attemptID: Int)`
  * `case class TaskCommitDenied(`
  * `class AskCommitRunnable(`
  * `class OutputCommitCoordinatorActor(outputCommitCoordinator: OutputCommitCoordinator)`
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71711069 I don't have permission either; I only have `comment` permission here. What's going wrong?
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23636560

--- Diff: examples/src/main/python/streaming/kafka_wordcount.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+ Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
+ Usage: network_wordcount.py zk topic
+
+ To run this on your local machine, you need to setup Kafka and create a producer first
+    $ bin/zookeeper-server-start.sh config/zookeeper.properties
+    $ bin/kafka-server-start.sh config/server.properties
+    $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --topic test
+    $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
--- End diff --

Good point, I will remove these, and put a link here.
[GitHub] spark pull request: Set UserKnownHostsFile to ensure deploying and...
Github user wasauce commented on the pull request: https://github.com/apache/spark/pull/4201#issuecomment-71715483 Shall I close this @nchammas in light of https://github.com/apache/spark/pull/4196
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71716478 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26171/ Test PASSed.
[GitHub] spark pull request: SPARK-4136. Under dynamic allocation, cancel o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4168#issuecomment-71697071 [Test build #26168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26168/consoleFull) for PR 4168 at commit [`9ba0e01`](https://github.com/apache/spark/commit/9ba0e0161e4554839c6dbf3a097b69af3de263b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4136. Under dynamic allocation, cancel o...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4168#issuecomment-71697091 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26168/ Test FAILed.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71701863 [Test build #26169 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26169/consoleFull) for PR 4224 at commit [`960711a`](https://github.com/apache/spark/commit/960711a54738c3e81d9a080566c5670d37fa9300). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71701876 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26169/ Test PASSed.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/4228 [SPARK-5430] move treeReduce and treeAggregate from mllib to core

We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. @pwendell

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-5430

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4228.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4228

commit d600b6cd7d80bfc31878cf1dec2a706b7256474a
Author: Xiangrui Meng m...@databricks.com
Date: 2015-01-27T19:06:07Z
    move treeReduce and treeAggregate to core
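For readers unfamiliar with the API being moved: `treeReduce`/`treeAggregate` combine partition results in multiple rounds instead of pulling every partial result to the driver at once, which keeps the final merge cheap even with many partitions. A toy single-process sketch of the idea follows; this is not Spark's implementation, and the function name and depth heuristic are illustrative only:

```python
import math
from functools import reduce

def tree_reduce(partitions, f, depth=2):
    """Reduce partition-level results in rounds, merging small groups of
    results per round, so no single step combines everything at once."""
    # Local reduce within each partition (what each executor would do).
    vals = [reduce(f, p) for p in partitions]
    # Fan-in per round, roughly the depth-th root of the partition count.
    scale = max(2, math.ceil(len(vals) ** (1.0 / depth)))
    while len(vals) > scale:
        # One "round": merge neighbouring groups of at most `scale` results.
        vals = [reduce(f, vals[i:i + scale]) for i in range(0, len(vals), scale)]
    return reduce(f, vals)

parts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(tree_reduce(parts, lambda a, b: a + b))  # sums 1..10
```

In the distributed setting each round is a shuffle to fewer partitions, which is why pwendell's comment above notes that any expensive reduction function benefits from this on a large cluster.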
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe tab...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4227#issuecomment-71708164 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71709119 [Test build #26175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26175/consoleFull) for PR 4173 at commit [`828f70d`](https://github.com/apache/spark/commit/828f70de0bab44501cc3c1e91e320c86c3dde97b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71709154 [Test build #26174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26174/consoleFull) for PR 4229 at commit [`58aa907`](https://github.com/apache/spark/commit/58aa907b8ad1913229468f3c3776dba2d8b45580). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4229#discussion_r23637912

--- Diff: python/pyspark/streaming/mqtt.py ---
@@ -0,0 +1,59 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from py4j.java_collections import MapConverter
+from py4j.java_gateway import java_import, Py4JError
+
+from pyspark.storagelevel import StorageLevel
+from pyspark.serializers import PairDeserializer, NoOpSerializer
+from pyspark.streaming import DStream
+
+__all__ = ['MQTTUtils']
+
+class MQTTUtils(object):
+
+    @staticmethod
+    def createStream(ssc, topic, brokerUrl,
--- End diff --

please keep the order of arguments as in Scala or docs
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719145 [Test build #26182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26182/consoleFull) for PR 4229 at commit [`3b45aca`](https://github.com/apache/spark/commit/3b45aca6cbe5ad5312eb50425b750c5b8fe9de5f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23628375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,606 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag +import scala.collection.JavaConversions._ + +import java.util.{ArrayList, List => JList} + +import com.fasterxml.jackson.core.JsonFactory +import net.razorvine.pickle.Pickler + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.python.SerDeUtil +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython} +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} +import org.apache.spark.util.Utils + + +/** + * A collection of rows that have the same columns. + * + * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using + * various functions in [[SQLContext]]. + * {{{ + * val people = sqlContext.parquetFile(...) + * }}} + * + * Once created, it can be manipulated using the various domain-specific-language (DSL) functions + * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL. + * + * To select a column from the data frame, use the apply method: + * {{{ + * val ageCol = people("age") // in Scala + * Column ageCol = people.apply("age") // in Java + * }}} + * + * Note that the [[Column]] type can also be manipulated through its various functions. + * {{{ + * // The following creates a new column that increases everybody's age by 10.
+ * people("age") + 10 // in Scala + * }}} + * + * A more concrete example: + * {{{ + * // To create DataFrame using SQLContext + * val people = sqlContext.parquetFile(...) + * val department = sqlContext.parquetFile(...) + * + * people.filter("age > 30") + * .join(department, people("deptId") === department("id")) + * .groupBy(department("name"), "gender") + * .agg(avg(people("salary")), max(people("age"))) + * }}} + */ +// TODO: Improve documentation. +class DataFrame protected[sql]( +val sqlContext: SQLContext, +private val baseLogicalPlan: LogicalPlan, +operatorsEnabled: Boolean) + extends DataFrameSpecificApi with RDDApi[Row] { + + protected[sql] def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) = +this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && plan.isDefined) + + protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true) + + @transient protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan) + + @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match { +// For various commands (like DDL) and queries with side effects, we force query optimization to +// happen right away to let these side effects take place eagerly. +case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] | _: WriteToFile => + LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext) +case _ => + baseLogicalPlan + } + + /** + * An implicit conversion function internal to this class for us to avoid doing + * new DataFrame(...) everywhere. + */
[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2471#issuecomment-71702876 @viper-kun could you close this one in that case? Thanks!
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/4226 [MLlib] fix python example of ALS in guide Fix the Python example of ALS in the guide: use Rating instead of np.array. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark fix_als_guide Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4226.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4226 commit 1433d76d0b42e3c5fa873258fc659ee3e7d162cc Author: Davies Liu dav...@databricks.com Date: 2015-01-27T18:49:09Z fix python example of als in guide
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71707130 I tested it successfully. LGTM
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23634059 --- Diff: core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala --- @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.io.{ObjectInputStream, ObjectOutputStream, IOException} + +import scala.collection.mutable + +import org.mockito.Mockito._ +import org.scalatest.concurrent.Timeouts +import org.scalatest.{BeforeAndAfter, FunSuite} + +import org.apache.hadoop.mapred.{TaskAttemptID, JobConf, TaskAttemptContext, OutputCommitter} + +import org.apache.spark._ +import org.apache.spark.executor.{TaskMetrics} +import org.apache.spark.rdd.FakeOutputCommitter + +/** + * Unit tests for the output commit coordination functionality. Overrides the + * SchedulerImpl to just run the tasks directly and send completion or error + * messages back to the DAG scheduler. + */ --- End diff -- So this is no longer testing the right thing. But I haven't been able to find an example of a unit test that overrides some of the SchedulerImpl's methods but keeps everything else the same as the default setup. Any suggestions? 
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71711692 Should we do that in follow-up PRs? This PR touches MLlib, which could be separated from adding Java/Python APIs.
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635234 --- Diff: python/pyspark/ml/util.py --- @@ -0,0 +1,35 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import uuid + + +class Identifiable(object): +"""Object with a unique ID.""" + +def __init__(self): +#: A unique id for the object. The default implementation +#: concatenates the class name, "-", and 8 random hex chars. +self.uid = type(self).__name__ + "-" + uuid.uuid4().hex[:8] --- End diff -- The memory address could be reused, which may not be unique.
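The uid scheme under review can be sketched in plain Python. `Identifiable` below mirrors the snippet from the diff; as the comment notes, `uuid4` draws from a random source, unlike `id(obj)`, whose memory address can be reused after garbage collection:

```python
import uuid


class Identifiable(object):
    """Object with a unique ID (sketch of the reviewed snippet)."""

    def __init__(self):
        # A unique id: class name, "-", and 8 random hex chars.
        # uuid4 makes collisions vanishingly unlikely, whereas id(self)
        # can repeat once an earlier object has been garbage collected.
        self.uid = type(self).__name__ + "-" + uuid.uuid4().hex[:8]
```

Two instances created back to back get distinct uids, which is the property the review is after.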
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635248 --- Diff: python/docs/pyspark.ml.rst --- @@ -0,0 +1,38 @@ +pyspark.ml package += + +Submodules +-- + +pyspark.ml module +- + +.. automodule:: pyspark.ml +:members: +:undoc-members: +:show-inheritance: + +pyspark.ml.param module --- End diff -- This is to be consistent with Scala/Java API.
[GitHub] spark pull request: [SPARK-5366][EC2] Check the mode of private ke...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/4162#discussion_r23629383 --- Diff: ec2/spark_ec2.py --- @@ -349,6 +351,16 @@ def launch_cluster(conn, opts, cluster_name): if opts.identity_file is None: print >> stderr, "ERROR: Must provide an identity file (-i) for ssh connections." sys.exit(1) + +if not os.path.exists(opts.identity_file): +print >> stderr, "ERROR: The identity file '{f}' doesn't exist.".format(f=opts.identity_file) +sys.exit(1) + +file_mode = os.stat(opts.identity_file).st_mode +if not (file_mode & os.stat.S_IRUSR): --- End diff -- I think this kind of check gives us what we're looking for: ``` oct(os.stat(opts.identity_file).st_mode)[-2:] == '00' ``` This makes sure that the group and others have no permissions on the file. This seems to be what Amazon checks for.
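The suggested check can be sketched as a small helper. `owner_only` is a hypothetical name, and the bitmask test below is equivalent to comparing the last two octal digits of `st_mode` against `'00'`:

```python
import os
import stat


def owner_only(path):
    """True when group and others have no permissions on `path`,
    i.e. the condition an SSH identity file is expected to satisfy."""
    mode = os.stat(path).st_mode
    # stat.S_IRWXG | stat.S_IRWXO covers every group/other permission bit;
    # a mode like 0600 passes, while 0644 or 0640 fails.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

This is stricter than only testing `S_IRUSR`: it rejects files that are owner-readable but also group- or world-accessible.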
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe tab...
GitHub user OopsOutOfMemory opened a pull request: https://github.com/apache/spark/pull/4227 [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext Hi, @rxin @marmbrus I considered your suggestion and now re-write it. This is now up-to-date. Could u please review it ? You can merge this pull request into a Git repository by running: $ git pull https://github.com/OopsOutOfMemory/spark describe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4227 commit 5b7ae19dfdc4410f1018193e0b1701abf799c439 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T10:45:13Z patch commit d1689e2fdfed67decabe6c696a3d7f051138a7ad Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T15:49:50Z refine imports commit 5b56286c7df0e721496478ab72d56a41a72d9fd6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T10:45:13Z patch commit d70b699bf391f7011540510a22cdcf3f6317945f Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T15:49:50Z refine imports commit 5abfbc0fbfd62c7ab0ab33f99619f7c2b6fb6ee6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:17:19Z refine commit 6537b16011e18a51648e98cf3674b7d334a467b2 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:23:40Z refine commit 1b85c73bcb35cc162cd2ad678927d664015b9ce6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:27:36Z refine commit 88ee78f6072843be0cfe3c559017ab381ba78d5a Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:34:08Z style refine commit a083cc5bc31e4e762095953d16f21790d763d495 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:36:55Z refine commit 8e1be4935d286681cb71ccbe64c0b0e3c9a48352 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:01:27Z refine commit 
5d1b54fa47a04bacc0651d4617a9f38d2c4db983 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:04:36Z style fix commit b2e30a01555c40dfec6d7d72f926424b8f66fd81 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:05:36Z refine import
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633840 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St | --name NAME A name of your application. | --jars JARS Comma-separated list of local jars to include on the driver | and executor classpaths. +| --maven Comma-separated list of maven coordinates of jars to include +| on the driver and executor classpaths. Will search the local +| maven repo, then maven central and any additional remote +| repositories given by --maven_repos. --- End diff -- Also this should say the format, i.e. groupId:artifactId:version
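For illustration, validating a coordinate in the `groupId:artifactId:version` form requested above could look like this (`parse_coordinate` is a hypothetical helper, not Spark's actual implementation):

```python
def parse_coordinate(coord):
    """Split a Maven coordinate of the form groupId:artifactId:version.

    Raises ValueError when the string does not have exactly three
    colon-separated fields."""
    parts = coord.split(":")
    if len(parts) != 3:
        raise ValueError(
            "Maven coordinate must be in the form "
            "'groupId:artifactId:version': %s" % coord)
    group_id, artifact_id, version = parts
    return group_id, artifact_id, version
```

Stating the format in the help text (and echoing it in the error message, as here) saves users a round-trip through the docs.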
[GitHub] spark pull request: [SPARK-4809] Rework Guava library shading.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3658#issuecomment-71710576 Hey @pwendell, I'll try to get to this soon. But I wanted to get your feedback on my idea for fixing the `network/` dependencies thing before I try to implement it. The way I see it, the cleanest way is to do the Guava shading in the earliest artifact possible; that would be `network/common`. So that artifact would have the honor of providing all the relocated Guava classes to everyone. Since `spark-core` depends on it, everything should work out. The only downside I see to that is that `network/common` would now expose `Optional` and friends when it's not really its fault (`spark-core` demands it). What do you think?
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710447 [Test build #26176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26176/consoleFull) for PR 4155 at commit [`9fe6495`](https://github.com/apache/spark/commit/9fe64953aa437ed1ed88a294e04129afc8f2bbb5). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class CommitDeniedException(msg: String, jobID: Int, splitID: Int, attemptID: Int)` * `case class TaskCommitDenied(` * ` class AskCommitRunnable(` * ` class OutputCommitCoordinatorActor(outputCommitCoordinator: OutputCommitCoordinator)`
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715803 [Test build #26180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26180/consoleFull) for PR 3715 at commit [`dc1eed0`](https://github.com/apache/spark/commit/dc1eed0a6af190d5cf07dedcb0607a0a76e45d64). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):`
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71715603 [Test build #26179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26179/consoleFull) for PR 4155 at commit [`d63f63f`](https://github.com/apache/spark/commit/d63f63f5769a29d9377b15f3025726477226ca88). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715805 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26180/ Test FAILed.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71719779 LGTM. Would be good to create a JIRA, especially if we want to backport. @JoshRosen might have more ideas on backporting
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71606493 [Test build #26152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26152/consoleFull) for PR 4109 at commit [`87ab83c`](https://github.com/apache/spark/commit/87ab83cb07a3b3451a4e3ddd158527400b4284ea). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23592779 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala --- @@ -65,4 +65,6 @@ private[spark] class ResultTask[T, U]( override def preferredLocations: Seq[TaskLocation] = preferredLocs override def toString = "ResultTask(" + stageId + ", " + partitionId + ")" + + override def canEqual(other: Any): Boolean = other.isInstanceOf[ResultTask[T, U]] --- End diff -- Yes, I know that. In that class there is no need to add the parameter at the class level; it is only used at the function level, in `run` or `runContext`.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23592724 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala --- @@ -106,7 +106,21 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex if (interruptThread && taskThread != null) { taskThread.interrupt() } - } + } + + override def hashCode(): Int = { +31 * stageId.hashCode() + partitionId.hashCode() + } + + def canEqual(other: Any): Boolean = other.isInstanceOf[Task[_]] + + override def equals(other: Any): Boolean = other match { +case that: Task[_] => + (that canEqual this) --- End diff -- Yes, in the current Spark code it is the same type, but for a parameterized class I think it is more reasonable to add that.
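The `hashCode`/`equals`/`canEqual` contract being reviewed can be sketched in Python; `Task` below is an illustrative analogue of the Scala diff, not Spark's actual class:

```python
class Task(object):
    """Sketch of the equality contract in the diff: two tasks are equal
    when their stageId and partitionId match and the other side agrees
    it can be compared to a Task (the canEqual part of the contract)."""

    def __init__(self, stage_id, partition_id):
        self.stage_id = stage_id
        self.partition_id = partition_id

    def __hash__(self):
        # Mirrors 31 * stageId.hashCode() + partitionId.hashCode()
        return 31 * hash(self.stage_id) + hash(self.partition_id)

    def can_equal(self, other):
        return isinstance(other, Task)

    def __eq__(self, other):
        return (isinstance(other, Task)
                and other.can_equal(self)  # keeps equality symmetric for subclasses
                and self.stage_id == other.stage_id
                and self.partition_id == other.partition_id)

    def __ne__(self, other):
        return not self == other
```

The `can_equal` hook is what lets a subclass such as `ResultTask` opt out of comparing equal to a plain `Task` without breaking symmetry.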
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71614911 cc @davies org.apache.spark.sql.test.ExamplePoint is not serializable, causing Python to fail. Is this new?
[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4217#issuecomment-71615458 Merged into master. Thanks!
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594868 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -0,0 +1,238 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.Serializable +import java.util.Arrays.binarySearch + +import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD} +import org.apache.spark.rdd.RDD + +/** + * Regression model for Isotonic regression + * + * @param features Array of features. + * @param labels Array of labels associated to the features at the same index. 
+ */ +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: RDD[Double]): RDD[Double] = +testData.map(predict) + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = +JavaDoubleRDD.fromRDD(predict(testData.rdd.asInstanceOf[RDD[Double]])) + + /** + * Predict a single label + * Using a piecewise constant function + * + * @param testData feature to be labeled + * @return predicted label + */ + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { +0 + } else if (result < 0) { +-result - 2 + } else { +result + } + +labels(index) + } +} + +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm + */ +class IsotonicRegression + extends Serializable { + + /** + * Run algorithm to obtain isotonic regression model + * + * @param input (label, feature, weight) + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + def run( + input: RDD[(Double, Double, Double)], + isotonic: Boolean = true): IsotonicRegressionModel = { --- End diff -- The default argument value is not Java compatible and we don't use this kind of API in `spark.mllib`. The class `IsotonicRegression` should have a parameter called `isotonic`, similar to `k` in `KMeans`. The user code should look like: ~~~ val ir = new IsotonicRegression() .setIsotonic(false) val irModel = ir.run(input) ~~~
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
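The builder-style usage mengxr suggests follows the standard `spark.mllib` setter pattern. A minimal, Spark-free sketch of that pattern (an assumption-laden illustration, not the merged implementation; the class name and `getIsotonic` are made up here for demonstration):

```scala
// Minimal sketch of the Java-compatible setter pattern used across
// spark.mllib (e.g. `k` in KMeans), in place of a Scala default argument.
// Assumption: simplified and Spark-free; `getIsotonic` is illustrative only.
class IsotonicRegressionParams extends Serializable {
  private var isotonic: Boolean = true

  /** Sets whether the fitted sequence is isotonic (increasing, the default)
    * or antitonic (decreasing). Returns `this` to allow call chaining,
    * which also works from Java, unlike a Scala default argument. */
  def setIsotonic(isotonic: Boolean): this.type = {
    this.isotonic = isotonic
    this
  }

  def getIsotonic: Boolean = isotonic
}
```

A caller then chains `new IsotonicRegressionParams().setIsotonic(false)` before invoking the fit method, as in the quoted `ir.run(input)` snippet.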
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594866 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +class IsotonicRegression + extends Serializable { --- End diff -- merge this line with the one above
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594874 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Creates isotonic regression model with given parameters + * + * @param predictions labels estimated using isotonic regression algorithm. --- End diff -- Not clear about what `(Double, Double, Double)` means.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594879 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Performs a pool adjacent violators algorithm (PAVA) --- End diff -- Add `.` at the end. Cite the paper.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594862 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { --- End diff -- There are 4 cases: 1. hit a boundary - return the corresponding prediction directly 2. fall between boundaries - linear interpolation (Note that a special case is singularity, where two boundaries are the same but their predictions are different. We can set manual rules for this case and document the behavior.) 3. smaller than the smallest boundary - return predictions(0) 4. larger than the largest boundary - return predictions.last
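The four cases above can be sketched as a small self-contained piecewise-linear predictor (an illustration, not the PR's code; `boundaries` and `predictions` follow the renaming suggested elsewhere in this review):

```scala
// Sketch of mengxr's four prediction cases over sorted boundaries.
// Assumption: standalone illustration, not spark.mllib code.
import java.util.Arrays.binarySearch

object PiecewiseLinear {
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    val pos = binarySearch(boundaries, x)
    if (pos >= 0) {
      predictions(pos)                          // case 1: hit a boundary exactly
    } else {
      val insertIndex = -pos - 1                // decode binarySearch's -(insertionPoint) - 1
      if (insertIndex == 0) {
        predictions.head                        // case 3: below the smallest boundary
      } else if (insertIndex == boundaries.length) {
        predictions.last                        // case 4: above the largest boundary
      } else {
        // case 2: linear interpolation between the two surrounding boundaries.
        // The singularity case (two equal boundaries) cannot reach this branch:
        // a query strictly between x0 and x1 implies x0 < x1.
        val (x0, x1) = (boundaries(insertIndex - 1), boundaries(insertIndex))
        val (y0, y1) = (predictions(insertIndex - 1), predictions(insertIndex))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

With the example used elsewhere in this review, `boundaries = [1, 2, 4, 5]` and `predictions = [1.0, 3.0, 3.0, 4.0]`, this yields `predict(1.5) == 2.0`, `predict(3.5) == 3.0`, and `predict(-10) == 1.0`.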
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594894 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array + * + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594856 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** --- End diff -- It may be worth validating that `features` is ordered.
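The ordering validation suggested above could look like the following sketch (an assumption of mine, not code from the PR):

```scala
// Sketch of a fail-fast check that the model's boundary array is sorted
// (monotonically non-decreasing), as mengxr suggests validating in the
// constructor. Assumption: illustrative helper, not spark.mllib code.
object OrderingCheck {
  def requireSorted(features: Array[Double]): Unit = {
    var i = 1
    while (i < features.length) {
      require(features(i - 1) <= features(i),
        s"features must be sorted, but features(${i - 1}) > features($i)")
      i += 1
    }
  }
}
```

Calling it from the `IsotonicRegressionModel` constructor would reject unsorted input with an `IllegalArgumentException` instead of silently returning wrong predictions from `binarySearch`.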
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594858 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = --- End diff -- `JavaRDD[java.lang.Double]` -> `JavaDoubleRDD`
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594843 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +/** + * Regression model for Isotonic regression --- End diff -- Isotonic -> isotonic
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594846 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +/** + * Regression model for Isotonic regression + * + * @param features Array of features. --- End diff -- Need to be more specific about `features` and `labels`. I would rename `features` to `boundaries` and mention that this is monotonic, and rename `labels` to `predictions` because this is not the original labels. The solution to an isotonic regression problem is piecewise linear. The model only needs to store the boundaries and the computed predictions. We can use linear interpolation for values that fall between boundaries. For example, if ~~~ boundaries = [1, 2, 4, 5] predictions = [1.0, 3.0, 3.0, 4.0] ~~~ then ~~~ predict(1.5) == 2.0 predict(3.5) == 3.0 ~~~ We should also document the behavior on the semi-open segments, e.g., `predict(-10) == ?`. I suggest using the smallest prediction here, i.e., `predict(-10) == 1.0`.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594859 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) --- End diff -- `result` -> `insertIndex`
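The rename makes sense because of the encoding `java.util.Arrays.binarySearch` uses for misses, which the reviewed `predict` decodes. A standalone demonstration of that convention (an illustration I am adding, not PR code):

```scala
// Demo of the java.util.Arrays.binarySearch return convention:
// a non-negative result is an exact-match index; a negative result
// encodes -(insertionPoint) - 1, so `-result - 1` recovers the insertion
// point and `-result - 2` the index of the preceding element.
// Assumption: standalone illustration, not spark.mllib code.
import java.util.Arrays.binarySearch

object BinarySearchDemo {
  val features: Array[Double] = Array(1.0, 2.0, 4.0, 5.0)

  val hit: Int = binarySearch(features, 4.0)     // exact match at index 2
  val between: Int = binarySearch(features, 3.0) // insertion point 2, encoded as -3
  val below: Int = binarySearch(features, 0.5)   // insertion point 0, encoded as -1

  // Recover the preceding boundary's index, as the reviewed predict() does:
  val precedingIndex: Int = -between - 2         // index 1, i.e. the boundary 2.0
}
```

The `result == -1` branch in the quoted code is exactly the "insertion point 0" case: the query is smaller than every boundary.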
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -0,0 +1,238 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.Serializable +import java.util.Arrays.binarySearch + +import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD} +import org.apache.spark.rdd.RDD + +/** + * Regression model for Isotonic regression + * + * @param features Array of features. + * @param labels Array of labels associated to the features at the same index. 
+ */ +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: RDD[Double]): RDD[Double] = +testData.map(predict) + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = +JavaDoubleRDD.fromRDD(predict(testData.rdd.asInstanceOf[RDD[Double]])) + + /** + * Predict a single label + * Using a piecewise constant function + * + * @param testData feature to be labeled + * @return predicted label + */ + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { +0 + } else if (result < 0) { +-result - 2 + } else { +result + } + +labels(index) + } +} + +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm --- End diff -- Cite the paper. Use `.` at the end of each sentence.
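The index arithmetic in `predict` above decodes the contract of `java.util.Arrays.binarySearch`: an exact hit returns the key's index, while a miss returns `-(insertionPoint) - 1`, so `-result - 2` recovers the predecessor index and `result == -1` means the query lies below the smallest feature. A rough Python equivalent of the same piecewise-constant lookup, using `bisect` in place of the negative encoding (an illustrative sketch, not code from the patch):

```python
from bisect import bisect_right

def predict(features, labels, x):
    # Piecewise-constant lookup, mirroring the Scala predict(Double).
    # bisect_right returns the insertion point: the number of sorted
    # features <= x, so subtracting 1 gives the predecessor index.
    i = bisect_right(features, x) - 1
    if i < 0:
        i = 0  # x is below the smallest feature: use the first label
    return labels[i]

features = [1.0, 2.0, 4.0]
labels = [1.0, 2.0, 3.0]
print(predict(features, labels, 0.5))  # below range -> 1.0
print(predict(features, labels, 2.0))  # exact hit   -> 2.0
print(predict(features, labels, 3.0))  # between     -> 2.0
```

For queries between two known features this takes the label of the feature on the left, which matches the `-result - 2` branch in the Scala code.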
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594877 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm + */ +class IsotonicRegression + extends Serializable { + + /** + * Run algorithm to obtain isotonic regression model + * + * @param input (label, feature, weight) + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + def run( + input: RDD[(Double, Double, Double)], + isotonic: Boolean = true): IsotonicRegressionModel = { +createModel( + parallelPoolAdjacentViolators(input, isotonic), + isotonic) + } + + /** + * Creates isotonic regression model with given parameters + * + * @param predictions labels estimated using isotonic regression algorithm. + *Used for predictions on new data points. + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + protected def createModel( + predictions: Array[(Double, Double, Double)], --- End diff -- If the third parameter is not used, maybe we should remove it from the API.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594889 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...]
+ * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + protected def createModel( + predictions: Array[(Double, Double, Double)], + isotonic: Boolean): IsotonicRegressionModel = { + +val labels = predictions.map(_._1) +val features = predictions.map(_._2) + +new IsotonicRegressionModel(features, labels) + } + + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array + * + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
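The docstring quoted above describes the sequential PAVA step: scan the data once, and whenever a new point violates monotonicity against the previous block, pool the blocks into their weighted mean and re-check backwards. A minimal Python sketch of that pooling loop (illustrative only; the PR's Scala version mutates the input array in place, and the names here are invented):

```python
def pava(points):
    """Weighted pool adjacent violators for an increasing (isotonic) fit.

    points: list of (label, feature, weight), assumed sorted by feature.
    Returns a new list with non-decreasing labels; pooled points share
    the weighted mean of their original labels.
    """
    # Each block: [mean_label, total_weight, count_of_pooled_points]
    blocks = []
    for label, _, weight in points:
        blocks.append([label, weight, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            y2, w2, n2 = blocks.pop()
            y1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(y1 * w1 + y2 * w2) / w, w, n1 + n2])
    # Expand blocks back to one (label, feature, weight) per input point.
    out = []
    i = 0
    for y, _, n in blocks:
        for _ in range(n):
            out.append((y, points[i][1], points[i][2]))
            i += 1
    return out

data = [(1.0, 1.0, 1.0), (3.0, 2.0, 1.0), (2.0, 3.0, 1.0)]
print(pava(data))  # pools the violating pair (3.0, 2.0) into 2.5, 2.5
```

The backward re-check after each merge is what fixes violators "in previously processed data created by pooling", as the quoted comment puts it.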
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594882 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array --- End diff -- typo?
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594884 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + * Method in situ mutates input array + * + * @param in input data --- End diff -- `in` → `input`?
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594887 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
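The `parallelPoolAdjacentViolators` name in the quoted `run` method suggests the standard two-phase distributed scheme: run PAVA independently within each partition, then concatenate the locally isotonic pieces in feature order and run one final sequential pass. Since each pooled mean appears once per pooled point, block weights are implicitly preserved across the phases. A self-contained sketch under that assumption (unweighted, labels only; all names invented for illustration):

```python
def pava_labels(ys):
    # Minimal unweighted PAVA on a list of labels: non-decreasing fit.
    blocks = []  # each block: [mean, count]
    for y in ys:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return [m for m, n in blocks for _ in range(n)]

def parallel_pava(ys, chunks=4):
    # Two-phase scheme: local PAVA per chunk (in Spark, a mapPartitions
    # step), then one sequential PAVA over the concatenated results.
    k = max(1, len(ys) // chunks)
    parts = [pava_labels(ys[i:i + k]) for i in range(0, len(ys), k)]
    merged = [y for part in parts for y in part]
    return pava_labels(merged)

ys = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0]
print(parallel_pava(ys, chunks=3))
# -> [1.0, 2.5, 2.5, 4.0, 5.5, 5.5, 7.0, 8.5, 8.5]
```

In this sketch the two-phase result matches a single sequential pass over the whole list; whether the PR's implementation uses exactly this merge strategy is not visible in the quoted diff.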
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71614473 [Test build #26150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26150/consoleFull) for PR 4173 at commit [`16934ee`](https://github.com/apache/spark/commit/16934ee0c9719afeb047e4eacf6e35b5e4aca86d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.