[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4222#issuecomment-71639706 [Test build #26160 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26160/consoleFull) for PR 4222 at commit [`51987d2`](https://github.com/apache/spark/commit/51987d24ea6b29c9607679daa2b482d5855be361). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605804 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26151/ Test FAILed.
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605655 [Test build #26151 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26151/consoleFull) for PR 4109 at commit [`caf4438`](https://github.com/apache/spark/commit/caf44387c2d3af5df771b9ce74aa8a9bac3f0827). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71605800 [Test build #26151 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26151/consoleFull) for PR 4109 at commit [`caf4438`](https://github.com/apache/spark/commit/caf44387c2d3af5df771b9ce74aa8a9bac3f0827). * This patch **fails Scala style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DenseMatrix(`
[GitHub] spark pull request: [SPARK-5423][Core] Cleanup resources in DiskMa...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/4219 [SPARK-5423][Core] Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file This PR adds a `finalize` method in DiskMapIterator to clean up the resources even if some exception happens during processing data. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-5423 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4219.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4219 commit d4b2ca69b3bc2d729f5d44750ab6b81de6e77644 Author: zsxwing zsxw...@gmail.com Date: 2015-01-27T08:16:13Z Cleanup resources in DiskMapIterator.finalize to ensure deleting the temp file
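The finalize-as-safety-net pattern this PR describes can be sketched as follows. This is a minimal, hypothetical Python analogue (`TempFileIterator` and its members are invented names, not the actual DiskMapIterator code), using `__del__` as the last-resort hook in place of a JVM `finalize` method:

```python
import os
import tempfile

class TempFileIterator:
    """Hypothetical sketch of the PR's cleanup pattern: an iterator that
    owns a temp file and deletes it in a last-resort finalizer if normal
    cleanup never ran (e.g. an exception interrupted processing)."""

    def __init__(self):
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        self._cleaned = False

    def cleanup(self):
        # Normal cleanup path: delete the temp file exactly once.
        if not self._cleaned:
            os.remove(self.path)
            self._cleaned = True

    def __del__(self):
        # Last-resort hook (the analogue of Java/Scala finalize()):
        # ensure the temp file is deleted even on error paths.
        self.cleanup()

it = TempFileIterator()
path = it.path
it.cleanup()
print(os.path.exists(path))  # the temp file is gone after cleanup
```

The idempotence guard (`_cleaned`) matters because the finalizer may run after an explicit cleanup has already deleted the file.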
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-71651118 What is going on with these tests??? I've created three PRs - for 1.1, 1.2 and 1.3 - and all of them failed in a very strange way.
Re: [GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
The test failures look unrelated, and are a Jenkins error. You should just make one PR for master; it will be back-ported as needed. On Tue, Jan 27, 2015 at 1:58 PM, jacek-lewandowski g...@git.apache.org wrote: Github user jacek-lewandowski commented on the pull request: https://github.com/apache/spark/pull/4220#issuecomment-71651118 What is going on with these tests??? I've created three PRs - for 1.1, 1.2 and 1.3 and all of them failed in a very strange way.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-71649841 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26161/ Test FAILed.
[GitHub] spark pull request: SPARK-5425: Use synchronised methods in system...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4221#issuecomment-71649835 [Test build #26161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26161/consoleFull) for PR 4221 at commit [`94aeacf`](https://github.com/apache/spark/commit/94aeacf6fcc7fae6d045d35b9d8f1fe4c2594780). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-71652555 [Test build #26164 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26164/consoleFull) for PR 731 at commit [`9f0c3a4`](https://github.com/apache/spark/commit/9f0c3a4393933e77c5e97a322d7bd9038afc7f78). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-1706: Allow multiple executors per worke...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/731#issuecomment-71653299 [Test build #26165 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26165/consoleFull) for PR 731 at commit [`97918d2`](https://github.com/apache/spark/commit/97918d2753359881dfd7f512bedc4495e47d3599). * This patch merges cleanly.
[GitHub] spark pull request: SPARK-5308 [BUILD] MD5 / SHA1 hash format does...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4161#issuecomment-71699359 Thanks Sean - pulling this in.
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71704847 [Test build #26171 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26171/consoleFull) for PR 4226 at commit [`1433d76`](https://github.com/apache/spark/commit/1433d76d0b42e3c5fa873258fc659ee3e7d162cc). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708214 [Test build #26173 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26173/consoleFull) for PR 4155 at commit [`1df2a91`](https://github.com/apache/spark/commit/1df2a91eb39300a32ad095b37a04846d135e2cc5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71708263 [Test build #26172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26172/consoleFull) for PR 4228 at commit [`d600b6c`](https://github.com/apache/spark/commit/d600b6cd7d80bfc31878cf1dec2a706b7256474a). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71710163 I don't have permission to do it. Can you click the close button?
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635244 --- Diff: examples/src/main/python/ml/simple_text_classification_pipeline.py --- @@ -0,0 +1,70 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from pyspark import SparkContext +from pyspark.sql import SQLContext, Row +from pyspark.ml import Pipeline +from pyspark.ml.feature import HashingTF, Tokenizer +from pyspark.ml.classification import LogisticRegression + +""" A simple text classification pipeline that recognizes spark from input text. This is to show how to create and configure a Spark ML pipeline in Python. Run with: bin/spark-submit examples/src/main/python/ml/simple_text_classification_pipeline.py """ + +if __name__ == "__main__": +sc = SparkContext(appName="SimpleTextClassificationPipeline") +sqlCtx = SQLContext(sc) +training = sqlCtx.inferSchema( +sc.parallelize([(0L, "a b c d e spark", 1.0), +(1L, "b d", 0.0), +(2L, "spark f g h", 1.0), +(3L, "hadoop mapreduce", 0.0)]) + .map(lambda x: Row(id=x[0], text=x[1], label=x[2]))) --- End diff -- done
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user prabeesh commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23636136 --- Diff: examples/src/main/python/streaming/kafka_wordcount.py --- @@ -0,0 +1,57 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +""" Counts words in UTF8 encoded, '\n' delimited text received from the network every second. + Usage: network_wordcount.py zk topic + + To run this on your local machine, you need to setup Kafka and create a producer first + $ bin/zookeeper-server-start.sh config/zookeeper.properties + $ bin/kafka-server-start.sh config/server.properties + $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --topic test + $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test + --- End diff -- Are all the above commands meant to be run from Kafka's bin/? It still creates some confusion about which directory to use: in all the Spark examples, bin/ refers to Spark's bin/.
[GitHub] spark pull request: [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib]...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3637#issuecomment-71713330 @petro-rudenko It is possible to get the state, but not in a single object. It's a good question whether a model and its state should be different concepts. In the current MLlib code, they are the same concept, so the functionality you're mentioning is supported in slightly different ways: * Saving will happen through save/load methods (which I'm working on: [https://issues.apache.org/jira/browse/SPARK-4587]) * Passing to prediction front-ends can happen through save, or by manually taking the needed elements of the state. * Copying the model can be done by copy(). * Printing/viewing the state can be done by casting the bestModel to the correct type: ``` cvModel.bestModel.asInstanceOf[LogisticRegressionModel].weights ... ```
[GitHub] spark pull request: [SPARK-5432] DriverSuite and SparkSubmitSuite ...
GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/4230 [SPARK-5432] DriverSuite and SparkSubmitSuite should sc.stop() In the past we've disabled the UIs and messed with the ports to keep the tests passing. However, these are only temporary fixes since ultimately we're still leaking a JVM after each individual test has finished. If we stop the `SparkContext` that should ensure the resources get cleaned up properly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/andrewor14/spark fix-driver-suite Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4230.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4230 commit 8092c36831bb2b348f566bb4bd9a8d234cc5fc3d Author: Andrew Or and...@databricks.com Date: 2015-01-27T19:51:25Z Stop SparkContext after every individual test
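The fix pattern (stop the context in per-test teardown instead of masking the leak with disabled UIs and port juggling) can be sketched with plain unittest and a stand-in context. All names here are hypothetical; no real SparkContext is involved:

```python
import unittest

class FakeSparkContext:
    """Stand-in for a SparkContext; tracks whether stop() was called."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        # In the real suite this releases the UI port, event loops, and
        # ultimately lets the test JVM exit cleanly.
        self.stopped = True

class DriverSuiteSketch(unittest.TestCase):
    def setUp(self):
        self.sc = FakeSparkContext()

    def tearDown(self):
        # Stop the context after every individual test so nothing leaks.
        self.sc.stop()

    def test_context_starts_unstopped(self):
        self.assertFalse(self.sc.stopped)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(DriverSuiteSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Putting the stop in `tearDown` (rather than at the end of each test body) ensures it runs even when a test fails mid-way.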
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719260 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26182/ Test FAILed.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719257 [Test build #26182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26182/consoleFull) for PR 4229 at commit [`3b45aca`](https://github.com/apache/spark/commit/3b45aca6cbe5ad5312eb50425b750c5b8fe9de5f). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):` * `class MQTTUtils(object):`
[GitHub] spark pull request: [SPARK-4508] [SQL] build native date type to c...
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/3732#discussion_r23628599 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala --- @@ -252,7 +252,7 @@ trait Row extends Serializable { * * @throws ClassCastException when data type does not match. */ - def getDate(i: Int): java.sql.Date = apply(i).asInstanceOf[java.sql.Date] + def getDate(i: Int): java.sql.Date = DateUtils.toJavaDate(getInt(i)) --- End diff -- one thing - you probably want to do the conversion when we create the row, like what we do for other types, instead of doing the conversion when it is accessed. otherwise apply(i: Int) will return date.
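The representation under discussion stores a date internally as a plain int (days since the Unix epoch) and converts it back to a rich date object only where needed. A minimal Python sketch of the idea (the helper names are hypothetical illustrations, not Spark's actual DateUtils API):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)

def to_days(d):
    # Internal (native) representation: days since the Unix epoch.
    return (d - EPOCH).days

def to_external_date(days):
    # Conversion back to a date object. The review comment is about
    # *when* this runs: eagerly at row creation, or lazily on access
    # (as getDate does in the diff above).
    return EPOCH + timedelta(days=days)

row = [to_days(date(2015, 1, 27))]   # the row holds only an int
print(row[0], to_external_date(row[0]))
```

With the lazy approach, any accessor that skips the conversion (such as a raw `apply(i)`) sees the int rather than a date, which is exactly the inconsistency the comment points out.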
[GitHub] spark pull request: SPARK-1934 [CORE] this reference escape to ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4225#issuecomment-71702592 [Test build #26170 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26170/consoleFull) for PR 4225 at commit [`c4dec3b`](https://github.com/apache/spark/commit/c4dec3b00426a1a427a4e8f88c6f733c583ebc97). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71707934 Hi, @rxin I created a new PR #4227 to rewrite `this part` and bring everything up-to-date, would you please review it?
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71708856 Thanks. Do you mind closing this one?
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633790 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St | --name NAME A name of your application. | --jars JARS Comma-separated list of local jars to include on the driver | and executor classpaths. +| --maven Comma-separated list of maven coordinates of jars to include +| on the driver and executor classpaths. Will search the local +| maven repo, then maven central and any additional remote +| repositories given by --maven_repos. --- End diff -- Instead of taking one parameter with a list of all Maven packages, we might want to allow separate packages to be passed with separate `--maven` args. Dunno, @pwendell / @mengxr what do you think? It's just going to be annoying for people to write a giant comma separated string. Same with repos actually.
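The design question raised here, one comma-separated flag versus a repeatable flag, can be compared with a small argparse sketch (the flag names are illustrative, not SparkSubmit's actual options):

```python
import argparse

parser = argparse.ArgumentParser()
# Alternative A: a single flag taking one comma-separated coordinate list.
parser.add_argument("--packages", default="")
# Alternative B: a repeatable flag; argparse appends each occurrence.
parser.add_argument("--pkg", action="append", default=[])

args = parser.parse_args([
    "--packages", "com.example:a:1.0,com.example:b:2.0",
    "--pkg", "com.example:a:1.0",
    "--pkg", "com.example:b:2.0",
])
comma_style = [c for c in args.packages.split(",") if c]
print(comma_style == args.pkg)  # both styles yield the same list
```

Both parse to the same list; the trade-off is purely ergonomic: a repeatable flag avoids the "giant comma separated string" at the cost of a longer command line.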
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
GitHub user prabeesh opened a pull request: https://github.com/apache/spark/pull/4229 [SPARK-5155] [PySpark] [Streaming] Mqtt streaming support in Python

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/prabeesh/spark mqtt_python

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4229.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4229

commit 07923c42fcb4d210333b3882490e23f33dc4822f
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T21:48:42Z
    support kafka in Python

commit 75d485e65b75a7a5da91a37ff42a9bb7cd82dcf6
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T22:18:59Z
    add mqtt

commit 048dbe6c9ec4bff452c70e4e18d48d3075e0
Author: Davies Liu dav...@databricks.com
Date: 2014-12-16T22:27:43Z
    fix python style

commit 5697a012def1b8508d21d96f2afb7d6705cf
Author: Davies Liu dav...@databricks.com
Date: 2014-12-18T23:44:33Z
    bypass decoder in scala

commit 98c8d179d3ff264d03eabc3ddd72936d95e6e305
Author: Davies Liu dav...@databricks.com
Date: 2014-12-18T23:58:29Z
    fix python style

commit f6ce899abd435f36f7c5907523c643cc8b0e61ed
Author: Davies Liu dav...@databricks.com
Date: 2015-01-08T21:28:39Z
    add example and fix bugs

commit eea16a79e741255548ef2e006db3948771a47e0d
Author: Davies Liu dav...@databricks.com
Date: 2015-01-08T21:29:35Z
    refactor

commit aea89538dcb9b80111f98df881d345d4e87e91aa
Author: Tathagata Das t...@databricks.com
Date: 2015-01-22T01:31:30Z
    Kafka-assembly for Python API

commit adeeb3863353f9a0ca3070a9cc914a2914d95fa9
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T07:08:56Z
    Merge pull request #3 from tdas/kafka-python-api
    Kafka-assembly for Python API

commit 33730d14c069042dfccd4af857021afe7ff0cbb0
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T07:50:28Z
    Merge branch 'master' of github.com:apache/spark into kafka
    Conflicts:
        make-distribution.sh
        project/SparkBuild.scala

commit 2c567a5d55c465d706026c2395e9025fad9dbd68
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T08:01:02Z
    update logging and comment

commit 97386b3debd5f352b61dfed194ab9495fecbe834
Author: Davies Liu dav...@databricks.com
Date: 2015-01-22T08:08:06Z
    address comment

commit 370ba61571b98e9bdfb6636852d4404687143853
Author: Davies Liu dav...@databricks.com
Date: 2015-01-26T20:25:12Z
    Update kafka.py
    fix spark-submit

commit 26a9960937368202dfdba6c0bbf5bf7c0168e72d
Author: prabs prabsma...@gmail.com
Date: 2015-01-27T18:31:54Z
    Merge branch 'kafka' of github.com:davies/spark into mqtt_python

commit 58aa907b8ad1913229468f3c3776dba2d8b45580
Author: prabs prabsma...@gmail.com
Date: 2015-01-23T19:58:22Z
    Mqtt streaming support in Python
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710454 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26176/ Test FAILed.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71710396 @mengxr if we are going to add this as a first class API, can we have it in Java and Python too? Also, /cc to @rxin to also vet whether we want this in the core API. My feeling is that it's hard for users to figure out how to do this on their own, and for any expensive reduction function, users will need something like this in a large cluster.
[GitHub] spark pull request: Set UserKnownHostsFile to ensure deploying and...
Github user nchammas commented on the pull request: https://github.com/apache/spark/pull/4201#issuecomment-71714505 It looks like we have another PR opened a few hours before this that does the same thing: #4196
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715606 [Test build #26180 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26180/consoleFull) for PR 3715 at commit [`dc1eed0`](https://github.com/apache/spark/commit/dc1eed0a6af190d5cf07dedcb0607a0a76e45d64). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5429][SQL] Use javaXML plan serializati...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4223#issuecomment-71698918 [Test build #26167 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26167/consoleFull) for PR 4223 at commit [`97a8760`](https://github.com/apache/spark/commit/97a8760f6c8713af581b95f05d11e8d11f331246). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-5429][SQL] Use javaXML plan serializati...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4223#issuecomment-71698923 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26167/ Test PASSed.
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633701

--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St
 | --name NAME           A name of your application.
 | --jars JARS           Comma-separated list of local jars to include on the driver
 |                       and executor classpaths.
+| --maven               Comma-separated list of maven coordinates of jars to include
+|                       on the driver and executor classpaths. Will search the local
+|                       maven repo, then maven central and any additional remote
+|                       repositories given by --maven_repos.
+| --maven_repos         Supply additional remote repositories to search for the
+|                       maven coordinates given with --maven.
--- End diff --

You should say this is a comma-separated list
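The `--maven` flag discussed above takes Maven coordinates, which conventionally follow a `group:artifact:version` layout. As a hedged illustration only (this is not SparkSubmit's actual parsing code; the function name is hypothetical), splitting such a comma-separated list might look like:

```python
def parse_coordinates(coords: str):
    """Split a comma-separated list of Maven coordinates into
    (group, artifact, version) triples.

    Assumes the common group:artifact:version form; real resolvers
    also accept classifiers and packaging types, which are omitted here.
    """
    out = []
    for coord in coords.split(","):
        parts = coord.strip().split(":")
        if len(parts) != 3:
            raise ValueError(f"bad coordinate: {coord!r}")
        out.append(tuple(parts))
    return out

# Example: one coordinate as it might be passed to --maven
print(parse_coordinates("org.apache.kafka:kafka_2.10:0.8.0"))
```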
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71710113 [Test build #26177 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26177/consoleFull) for PR 3715 at commit [`31e2317`](https://github.com/apache/spark/commit/31e2317a31c90b23a7b085c7fd5a1de8998194a6). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71709975 ok, please close this one : )
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710060 [Test build #26176 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26176/consoleFull) for PR 4155 at commit [`9fe6495`](https://github.com/apache/spark/commit/9fe64953aa437ed1ed88a294e04129afc8f2bbb5). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user pwendell commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71712072 I don't think we should do it separately (it sets a bad precedent), but if you are too busy, we can try to find someone in the community to do all three. It's pretty straightforward.
[GitHub] spark pull request: [SPARK-5432] DriverSuite and SparkSubmitSuite ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4230#issuecomment-71716554 [Test build #26181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26181/consoleFull) for PR 4230 at commit [`8092c36`](https://github.com/apache/spark/commit/8092c36831bb2b348f566bb4bd9a8d234cc5fc3d). * This patch merges cleanly.
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71716462 [Test build #26171 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26171/consoleFull) for PR 4226 at commit [`1433d76`](https://github.com/apache/spark/commit/1433d76d0b42e3c5fa873258fc659ee3e7d162cc). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-5393. Flood of util.RackResolver log mes...
Github user ksakellis commented on a diff in the pull request: https://github.com/apache/spark/pull/4192#discussion_r23629409

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -60,6 +62,9 @@ private[yarn] class YarnAllocator(
   import YarnAllocator._
+  // RackResolver logs an INFO message whenever it resolves a rack, which is way too often.
+  Logger.getLogger(classOf[RackResolver]).setLevel(Level.WARN)
--- End diff --

Well, I disagree. A user will get very frustrated if they are debugging an issue and they can't turn on the logging. Can you add a check: `Logger.getLogger(classOf[RackResolver]).getLevel() != null`? That way you at least won't be overriding the logging level if it is set.
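The guard suggested in this review (only lower a noisy logger's verbosity when the user has not configured it) relies on log4j's `getLevel()` returning null for unconfigured loggers. A minimal sketch of the same pattern in Python's standard `logging` module, purely illustrative, where `NOTSET` plays the role of log4j's null:

```python
import logging

def quiet_noisy_logger(name: str, level: int = logging.WARNING) -> None:
    """Raise a logger's threshold only if it was never explicitly set."""
    logger = logging.getLogger(name)
    # level == NOTSET means no level was explicitly configured for this
    # logger, so overriding it won't clobber a user's debugging setup.
    if logger.level == logging.NOTSET:
        logger.setLevel(level)

# Hypothetical noisy component: silenced because nobody configured it.
quiet_noisy_logger("rack.resolver")
```

A user who had already done `logging.getLogger("rack.resolver").setLevel(logging.DEBUG)` before this call would keep their DEBUG output, which is exactly the behavior the reviewer is asking for.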
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633633

--- Diff: bin/utils.sh ---
@@ -26,14 +26,14 @@ function gatherSparkSubmitOpts() {
     exit 1
   fi
-  # NOTE: If you add or remove spark-sumbmit options,
+  # NOTE: If you add or remove spark-submit options,
   # modify NOT ONLY this script but also SparkSubmitArgument.scala
   SUBMISSION_OPTS=()
   APPLICATION_OPTS=()
   while (($#)); do
     case $1 in
-      --master | --deploy-mode | --class | --name | --jars | --py-files | --files | \
-      --conf | --properties-file | --driver-memory | --driver-java-options | \
+      --master | --deploy-mode | --class | --name | --jars | --maven | --py-files | --files | \
+      --conf | --maven_repos | --properties-file | --driver-memory | --driver-java-options | \
--- End diff --

Rename this to --maven-repos with a dash instead of an underscore; everything else has a dash
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708399 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26173/ Test FAILed.
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71708396 [Test build #26173 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26173/consoleFull) for PR 4155 at commit [`1df2a91`](https://github.com/apache/spark/commit/1df2a91eb39300a32ad095b37a04846d135e2cc5).
* This patch **fails Scala style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class CommitDeniedException(msg: String, jobID: Int, splitID: Int, attemptID: Int)`
  * `case class TaskCommitDenied(`
  * `class AskCommitRunnable(`
  * `class OutputCommitCoordinatorActor(outputCommitCoordinator: OutputCommitCoordinator)`
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe [ex...
Github user OopsOutOfMemory commented on the pull request: https://github.com/apache/spark/pull/4127#issuecomment-71711069 I don't have permission either; I only have `comment` permission here. What's going wrong?
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3715#discussion_r23636560

--- Diff: examples/src/main/python/streaming/kafka_wordcount.py ---
@@ -0,0 +1,57 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+ Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
+ Usage: network_wordcount.py zk topic
+
+ To run this on your local machine, you need to setup Kafka and create a producer first
+    $ bin/zookeeper-server-start.sh config/zookeeper.properties
+    $ bin/kafka-server-start.sh config/server.properties
+    $ bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 --topic test
+    $ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
--- End diff --

Good point, I will remove these, and put a link here.
[GitHub] spark pull request: Set UserKnownHostsFile to ensure deploying and...
Github user wasauce commented on the pull request: https://github.com/apache/spark/pull/4201#issuecomment-71715483 Shall I close this @nchammas in light of https://github.com/apache/spark/pull/4196
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71716478 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26171/ Test PASSed.
[GitHub] spark pull request: SPARK-4136. Under dynamic allocation, cancel o...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4168#issuecomment-71697071 [Test build #26168 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26168/consoleFull) for PR 4168 at commit [`9ba0e01`](https://github.com/apache/spark/commit/9ba0e0161e4554839c6dbf3a097b69af3de263b8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: SPARK-4136. Under dynamic allocation, cancel o...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4168#issuecomment-71697091 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26168/ Test FAILed.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71701863 [Test build #26169 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26169/consoleFull) for PR 4224 at commit [`960711a`](https://github.com/apache/spark/commit/960711a54738c3e81d9a080566c5670d37fa9300). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71701876 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26169/ Test PASSed.
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/4228 [SPARK-5430] move treeReduce and treeAggregate from mllib to core

We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. @pwendell

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-5430

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4228.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4228

commit d600b6cd7d80bfc31878cf1dec2a706b7256474a
Author: Xiangrui Meng m...@databricks.com
Date: 2015-01-27T19:06:07Z
    move treeReduce and treeAggregate to core
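For readers unfamiliar with the API being moved: `treeReduce`/`treeAggregate` combine partition results in multiple rounds instead of pulling every partial result to the driver at once, which keeps the final merge cheap even with many partitions. A toy single-process sketch of the idea follows; this is not Spark's implementation, and the function name and depth heuristic are illustrative only:

```python
import math
from functools import reduce

def tree_reduce(partitions, f, depth=2):
    """Reduce partition-level results in rounds, merging small groups of
    results per round, so no single step combines everything at once."""
    # Local reduce within each partition (what each executor would do).
    vals = [reduce(f, p) for p in partitions]
    # Fan-in per round, roughly the depth-th root of the partition count.
    scale = max(2, math.ceil(len(vals) ** (1.0 / depth)))
    while len(vals) > scale:
        # One "round": merge neighbouring groups of at most `scale` results.
        vals = [reduce(f, vals[i:i + scale]) for i in range(0, len(vals), scale)]
    return reduce(f, vals)

parts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(tree_reduce(parts, lambda a, b: a + b))  # sums 1..10
```

In the distributed setting each round is a shuffle to fewer partitions, which is why pwendell's comment above notes that any expensive reduction function benefits from this on a large cluster.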
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe tab...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4227#issuecomment-71708164 Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71709119 [Test build #26175 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26175/consoleFull) for PR 4173 at commit [`828f70d`](https://github.com/apache/spark/commit/828f70de0bab44501cc3c1e91e320c86c3dde97b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71709154 [Test build #26174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26174/consoleFull) for PR 4229 at commit [`58aa907`](https://github.com/apache/spark/commit/58aa907b8ad1913229468f3c3776dba2d8b45580). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/4229#discussion_r23637912

--- Diff: python/pyspark/streaming/mqtt.py ---
@@ -0,0 +1,59 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from py4j.java_collections import MapConverter
+from py4j.java_gateway import java_import, Py4JError
+
+from pyspark.storagelevel import StorageLevel
+from pyspark.serializers import PairDeserializer, NoOpSerializer
+from pyspark.streaming import DStream
+
+__all__ = ['MQTTUtils']
+
+class MQTTUtils(object):
+
+    @staticmethod
+    def createStream(ssc, topic, brokerUrl,
--- End diff --

please keep the order of arguments as in Scala or docs
[GitHub] spark pull request: [SPARK-5155] [PySpark] [Streaming] Mqtt stream...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4229#issuecomment-71719145 [Test build #26182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26182/consoleFull) for PR 4229 at commit [`3b45aca`](https://github.com/apache/spark/commit/3b45aca6cbe5ad5312eb50425b750c5b8fe9de5f). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/4173#discussion_r23628375 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -0,0 +1,606 @@ +/* +* Licensed to the Apache Software Foundation (ASF) under one or more +* contributor license agreements. See the NOTICE file distributed with +* this work for additional information regarding copyright ownership. +* The ASF licenses this file to You under the Apache License, Version 2.0 +* (the License); you may not use this file except in compliance with +* the License. You may obtain a copy of the License at +* +*http://www.apache.org/licenses/LICENSE-2.0 +* +* Unless required by applicable law or agreed to in writing, software +* distributed under the License is distributed on an AS IS BASIS, +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +* See the License for the specific language governing permissions and +* limitations under the License. 
+*/ + +package org.apache.spark.sql + +import scala.language.implicitConversions +import scala.reflect.ClassTag +import scala.collection.JavaConversions._ + +import java.util.{ArrayList, List => JList} + +import com.fasterxml.jackson.core.JsonFactory +import net.razorvine.pickle.Pickler + +import org.apache.spark.annotation.Experimental +import org.apache.spark.rdd.RDD +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.api.python.SerDeUtil +import org.apache.spark.storage.StorageLevel +import org.apache.spark.sql.catalyst.ScalaReflection +import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.{Literal => LiteralExpr} +import org.apache.spark.sql.catalyst.plans.{JoinType, Inner} +import org.apache.spark.sql.catalyst.plans.logical._ +import org.apache.spark.sql.execution.{LogicalRDD, EvaluatePython} +import org.apache.spark.sql.json.JsonRDD +import org.apache.spark.sql.types.{NumericType, StructType} +import org.apache.spark.util.Utils + + +/** + * A collection of rows that have the same columns. + * + * A [[DataFrame]] is equivalent to a relational table in Spark SQL, and can be created using + * various functions in [[SQLContext]]. + * {{{ + * val people = sqlContext.parquetFile(...) + * }}} + * + * Once created, it can be manipulated using the various domain-specific-language (DSL) functions + * defined in: [[DataFrame]] (this class), [[Column]], and [[dsl]] for Scala DSL. + * + * To select a column from the data frame, use the apply method: + * {{{ + * val ageCol = people("age") // in Scala + * Column ageCol = people.apply("age") // in Java + * }}} + * + * Note that the [[Column]] type can also be manipulated through its various functions. + * {{{ + * // The following creates a new column that increases everybody's age by 10.
+ * people("age") + 10 // in Scala + * }}} + * + * A more concrete example: + * {{{ + * // To create DataFrame using SQLContext + * val people = sqlContext.parquetFile(...) + * val department = sqlContext.parquetFile(...) + * + * people.filter("age > 30") + * .join(department, people("deptId") === department("id")) + * .groupBy(department("name"), "gender") + * .agg(avg(people("salary")), max(people("age"))) + * }}} + */ +// TODO: Improve documentation. +class DataFrame protected[sql]( +val sqlContext: SQLContext, +private val baseLogicalPlan: LogicalPlan, +operatorsEnabled: Boolean) + extends DataFrameSpecificApi with RDDApi[Row] { + + protected[sql] def this(sqlContext: Option[SQLContext], plan: Option[LogicalPlan]) = +this(sqlContext.orNull, plan.orNull, sqlContext.isDefined && plan.isDefined) + + protected[sql] def this(sqlContext: SQLContext, plan: LogicalPlan) = this(sqlContext, plan, true) + + @transient protected[sql] lazy val queryExecution = sqlContext.executePlan(baseLogicalPlan) + + @transient protected[sql] val logicalPlan: LogicalPlan = baseLogicalPlan match { +// For various commands (like DDL) and queries with side effects, we force query optimization to +// happen right away to let these side effects take place eagerly. +case _: Command | _: InsertIntoTable | _: CreateTableAsSelect[_] | _: WriteToFile => + LogicalRDD(queryExecution.analyzed.output, queryExecution.toRdd)(sqlContext) +case _ => + baseLogicalPlan + } + + /** + * An implicit conversion function internal to this class for us to avoid doing + * new DataFrame(...) everywhere. + */
[GitHub] spark pull request: [SPARK-3562]Periodic cleanup event logs
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2471#issuecomment-71702876 @viper-kun could you close this one in that case? Thanks!
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/4226 [MLlib] fix python example of ALS in guide Fix the Python example of ALS in the guide: use Rating instead of np.array. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark fix_als_guide Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4226.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4226 commit 1433d76d0b42e3c5fa873258fc659ee3e7d162cc Author: Davies Liu dav...@databricks.com Date: 2015-01-27T18:49:09Z fix python example of als in guide
[GitHub] spark pull request: [MLlib] fix python example of ALS in guide
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4226#issuecomment-71707130 I tested it successfully. LGTM
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user mccheah commented on a diff in the pull request: https://github.com/apache/spark/pull/4155#discussion_r23634059 --- Diff: core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala --- @@ -0,0 +1,177 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.scheduler + +import java.io.{ObjectInputStream, ObjectOutputStream, IOException} + +import scala.collection.mutable + +import org.mockito.Mockito._ +import org.scalatest.concurrent.Timeouts +import org.scalatest.{BeforeAndAfter, FunSuite} + +import org.apache.hadoop.mapred.{TaskAttemptID, JobConf, TaskAttemptContext, OutputCommitter} + +import org.apache.spark._ +import org.apache.spark.executor.{TaskMetrics} +import org.apache.spark.rdd.FakeOutputCommitter + +/** + * Unit tests for the output commit coordination functionality. Overrides the + * SchedulerImpl to just run the tasks directly and send completion or error + * messages back to the DAG scheduler. + */ --- End diff -- So this is no longer testing the right thing. But I haven't been able to find an example of a unit test that overrides some of the SchedulerImpl's methods but keeps everything else the same as the default setup. Any suggestions? 
[GitHub] spark pull request: [SPARK-5430] move treeReduce and treeAggregate...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4228#issuecomment-71711692 Should we do that in follow-up PRs? This PR touches MLlib, which could be separated from adding Java/Python APIs.
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635234 --- Diff: python/pyspark/ml/util.py --- @@ -0,0 +1,35 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +import uuid + + +class Identifiable(object): +"""Object with a unique ID.""" + +def __init__(self): +#: A unique id for the object. The default implementation +#: concatenates the class name, "-", and 8 random hex chars. +self.uid = type(self).__name__ + "-" + uuid.uuid4().hex[:8] --- End diff -- The memory address could be reused, which may not be unique.
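The uid scheme under review can be sketched in plain Python. `Identifiable` below mirrors the snippet from the diff; as the comment notes, `uuid4` draws from a random source, unlike `id(obj)`, whose memory address can be reused after garbage collection:

```python
import uuid


class Identifiable(object):
    """Object with a unique ID (sketch of the reviewed snippet)."""

    def __init__(self):
        # A unique id: class name, "-", and 8 random hex chars.
        # uuid4 makes collisions vanishingly unlikely, whereas id(self)
        # can repeat once an earlier object has been garbage collected.
        self.uid = type(self).__name__ + "-" + uuid.uuid4().hex[:8]
```

Two instances created back to back get distinct uids, which is the property the review is after.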
[GitHub] spark pull request: [WIP][SPARK-4586][MLLIB] Python API for ML pip...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4151#discussion_r23635248 --- Diff: python/docs/pyspark.ml.rst --- @@ -0,0 +1,38 @@ +pyspark.ml package += + +Submodules +-- + +pyspark.ml module +- + +.. automodule:: pyspark.ml +:members: +:undoc-members: +:show-inheritance: + +pyspark.ml.param module --- End diff -- This is to be consistent with Scala/Java API.
[GitHub] spark pull request: [SPARK-5366][EC2] Check the mode of private ke...
Github user nchammas commented on a diff in the pull request: https://github.com/apache/spark/pull/4162#discussion_r23629383 --- Diff: ec2/spark_ec2.py --- @@ -349,6 +351,16 @@ def launch_cluster(conn, opts, cluster_name): if opts.identity_file is None: print >> stderr, "ERROR: Must provide an identity file (-i) for ssh connections." sys.exit(1) + +if not os.path.exists(opts.identity_file): +print >> stderr, "ERROR: The identity file '{f}' doesn't exist.".format(f=opts.identity_file) +sys.exit(1) + +file_mode = os.stat(opts.identity_file).st_mode +if not (file_mode & os.stat.S_IRUSR): --- End diff -- I think this kind of check gives us what we're looking for: ``` oct(os.stat(opts.identity_file).st_mode)[-2:] == '00' ``` This makes sure that the group and others have no permissions on the file. This seems to be what Amazon checks for.
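The suggested check can be sketched as a small helper. `owner_only` is a hypothetical name, and the bitmask test below is equivalent to comparing the last two octal digits of `st_mode` against `'00'`:

```python
import os
import stat


def owner_only(path):
    """True when group and others have no permissions on `path`,
    i.e. the condition an SSH identity file is expected to satisfy."""
    mode = os.stat(path).st_mode
    # stat.S_IRWXG | stat.S_IRWXO covers every group/other permission bit;
    # a mode like 0600 passes, while 0644 or 0640 fails.
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

This is stricter than only testing `S_IRUSR`: it rejects files that are owner-readable but also group- or world-accessible.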
[GitHub] spark pull request: [SPARK-5135][SQL] Add support for describe tab...
GitHub user OopsOutOfMemory opened a pull request: https://github.com/apache/spark/pull/4227 [SPARK-5135][SQL] Add support for describe table to DDL in SQLContext Hi, @rxin @marmbrus I considered your suggestion and now re-write it. This is now up-to-date. Could u please review it ? You can merge this pull request into a Git repository by running: $ git pull https://github.com/OopsOutOfMemory/spark describe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/4227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #4227 commit 5b7ae19dfdc4410f1018193e0b1701abf799c439 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T10:45:13Z patch commit d1689e2fdfed67decabe6c696a3d7f051138a7ad Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T15:49:50Z refine imports commit 5b56286c7df0e721496478ab72d56a41a72d9fd6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T10:45:13Z patch commit d70b699bf391f7011540510a22cdcf3f6317945f Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-16T15:49:50Z refine imports commit 5abfbc0fbfd62c7ab0ab33f99619f7c2b6fb6ee6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:17:19Z refine commit 6537b16011e18a51648e98cf3674b7d334a467b2 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:23:40Z refine commit 1b85c73bcb35cc162cd2ad678927d664015b9ce6 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:27:36Z refine commit 88ee78f6072843be0cfe3c559017ab381ba78d5a Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:34:08Z style refine commit a083cc5bc31e4e762095953d16f21790d763d495 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T16:36:55Z refine commit 8e1be4935d286681cb71ccbe64c0b0e3c9a48352 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:01:27Z refine commit 
5d1b54fa47a04bacc0651d4617a9f38d2c4db983 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:04:36Z style fix commit b2e30a01555c40dfec6d7d72f926424b8f66fd81 Author: OopsOutOfMemory victorshen...@126.com Date: 2015-01-27T19:05:36Z refine import
[GitHub] spark pull request: [WIP][SPARK-5341] Use maven coordinates as dep...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/4215#discussion_r23633840 --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala --- @@ -380,6 +392,12 @@ private[spark] class SparkSubmitArguments(args: Seq[String], env: Map[String, St | --name NAME A name of your application. | --jars JARS Comma-separated list of local jars to include on the driver | and executor classpaths. +| --maven Comma-separated list of maven coordinates of jars to include +| on the driver and executor classpaths. Will search the local +| maven repo, then maven central and any additional remote +| repositories given by --maven_repos. --- End diff -- Also this should say the format, i.e. groupId:artifactId:version
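For illustration, validating a coordinate in the `groupId:artifactId:version` form requested above could look like this (`parse_coordinate` is a hypothetical helper, not Spark's actual implementation):

```python
def parse_coordinate(coord):
    """Split a Maven coordinate of the form groupId:artifactId:version.

    Raises ValueError when the string does not have exactly three
    colon-separated fields."""
    parts = coord.split(":")
    if len(parts) != 3:
        raise ValueError(
            "Maven coordinate must be in the form "
            "'groupId:artifactId:version': %s" % coord)
    group_id, artifact_id, version = parts
    return group_id, artifact_id, version
```

Stating the format in the help text (and echoing it in the error message, as here) saves users a round-trip through the docs.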
[GitHub] spark pull request: [SPARK-4809] Rework Guava library shading.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/3658#issuecomment-71710576 Hey @pwendell, I'll try to get to this soon. But I wanted to get your feedback on my idea for fixing the `network/` dependencies thing before I try to implement it. The way I see it, the cleanest way is to do the Guava shading in the earliest artifact possible; that would be `network/common`. So that artifact would have the honor of providing all the relocated Guava classes to everyone. Since `spark-core` depends on it, everything should work out. The only downside I see to that is that `network/common` would now expose `Optional` and friends when it's not really its fault (`spark-core` demands it). What do you think?
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71710447 [Test build #26176 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26176/consoleFull) for PR 4155 at commit [`9fe6495`](https://github.com/apache/spark/commit/9fe64953aa437ed1ed88a294e04129afc8f2bbb5). * This patch **fails to build**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class CommitDeniedException(msg: String, jobID: Int, splitID: Int, attemptID: Int)` * `case class TaskCommitDenied(` * ` class AskCommitRunnable(` * ` class OutputCommitCoordinatorActor(outputCommitCoordinator: OutputCommitCoordinator)`
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715803 [Test build #26180 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26180/consoleFull) for PR 3715 at commit [`dc1eed0`](https://github.com/apache/spark/commit/dc1eed0a6af190d5cf07dedcb0607a0a76e45d64). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class KafkaUtils(object):`
[GitHub] spark pull request: [SPARK-4879] Use the Spark driver to authorize...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4155#issuecomment-71715603 [Test build #26179 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26179/consoleFull) for PR 4155 at commit [`d63f63f`](https://github.com/apache/spark/commit/d63f63f5769a29d9377b15f3025726477226ca88). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5154] [PySpark] [Streaming] Kafka strea...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3715#issuecomment-71715805 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26180/ Test FAILed.
[GitHub] spark pull request: [EC2] Preserve spaces in EC2 path
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/4224#issuecomment-71719779 LGTM. Would be good to create a JIRA, especially if we want to backport. @JoshRosen might have more ideas on backporting
[GitHub] spark pull request: [SPARK-5321] Support for transposing local mat...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4109#issuecomment-71606493 [Test build #26152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26152/consoleFull) for PR 4109 at commit [`87ab83c`](https://github.com/apache/spark/commit/87ab83cb07a3b3451a4e3ddd158527400b4284ea). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23592779 --- Diff: core/src/main/scala/org/apache/spark/scheduler/ResultTask.scala --- @@ -65,4 +65,6 @@ private[spark] class ResultTask[T, U]( override def preferredLocations: Seq[TaskLocation] = preferredLocs override def toString = "ResultTask(" + stageId + ", " + partitionId + ")" + + override def canEqual(other: Any): Boolean = other.isInstanceOf[ResultTask[T, U]] --- End diff -- Yes, I know that. In that class there is no need to add the parameter at the class level; it is only used at the function level, in `run` or `runContext`.
[GitHub] spark pull request: [SPARK-5259][CORE]Make sure mapStage.pendingta...
Github user suyanNone commented on a diff in the pull request: https://github.com/apache/spark/pull/4055#discussion_r23592724 --- Diff: core/src/main/scala/org/apache/spark/scheduler/Task.scala --- @@ -106,7 +106,21 @@ private[spark] abstract class Task[T](val stageId: Int, var partitionId: Int) ex if (interruptThread && taskThread != null) { taskThread.interrupt() } - } + } + + override def hashCode(): Int = { +31 * stageId.hashCode() + partitionId.hashCode() + } + + def canEqual(other: Any): Boolean = other.isInstanceOf[Task[_]] + + override def equals(other: Any): Boolean = other match { +case that: Task[_] => + (that canEqual this) --- End diff -- Yes, in the current Spark code it is the same type, but for a parameterized class I think it is more reasonable to add that.
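The `hashCode`/`equals`/`canEqual` contract being reviewed can be sketched in Python; `Task` below is an illustrative analogue of the Scala diff, not Spark's actual class:

```python
class Task(object):
    """Sketch of the equality contract in the diff: two tasks are equal
    when their stageId and partitionId match and the other side agrees
    it can be compared to a Task (the canEqual part of the contract)."""

    def __init__(self, stage_id, partition_id):
        self.stage_id = stage_id
        self.partition_id = partition_id

    def __hash__(self):
        # Mirrors 31 * stageId.hashCode() + partitionId.hashCode()
        return 31 * hash(self.stage_id) + hash(self.partition_id)

    def can_equal(self, other):
        return isinstance(other, Task)

    def __eq__(self, other):
        return (isinstance(other, Task)
                and other.can_equal(self)  # keeps equality symmetric for subclasses
                and self.stage_id == other.stage_id
                and self.partition_id == other.partition_id)

    def __ne__(self, other):
        return not self == other
```

The `can_equal` hook is what lets a subclass such as `ResultTask` opt out of comparing equal to a plain `Task` without breaking symmetry.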
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71614911 cc @davies org.apache.spark.sql.test.ExamplePoint is not serializable, causing Python to fail. Is this new?
[GitHub] spark pull request: [SPARK-5419][Mllib] Fix the logic in Vectors.s...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4217#issuecomment-71615458 Merged into master. Thanks!
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594868 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -0,0 +1,238 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.Serializable +import java.util.Arrays.binarySearch + +import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD} +import org.apache.spark.rdd.RDD + +/** + * Regression model for Isotonic regression + * + * @param features Array of features. + * @param labels Array of labels associated to the features at the same index. 
+ */ +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: RDD[Double]): RDD[Double] = +testData.map(predict) + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = +JavaDoubleRDD.fromRDD(predict(testData.rdd.asInstanceOf[RDD[Double]])) + + /** + * Predict a single label + * Using a piecewise constant function + * + * @param testData feature to be labeled + * @return predicted label + */ + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { +0 + } else if (result < 0) { +-result - 2 + } else { +result + } + +labels(index) + } +} + +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm + */ +class IsotonicRegression + extends Serializable { + + /** + * Run algorithm to obtain isotonic regression model + * + * @param input (label, feature, weight) + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + def run( + input: RDD[(Double, Double, Double)], + isotonic: Boolean = true): IsotonicRegressionModel = { --- End diff -- The default argument value is not Java compatible and we don't use this kind of API in `spark.mllib`. The class `IsotonicRegression` should have a parameter called `isotonic`, similar to `k` in `KMeans`. The user code should look like: ~~~ val ir = new IsotonicRegression() .setIsotonic(false) val irModel = ir.run(input) ~~~
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
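The builder-style usage mengxr suggests follows the standard `spark.mllib` setter pattern. A minimal, Spark-free sketch of that pattern (an assumption-laden illustration, not the merged implementation; the class name and `getIsotonic` are made up here for demonstration):

```scala
// Minimal sketch of the Java-compatible setter pattern used across
// spark.mllib (e.g. `k` in KMeans), in place of a Scala default argument.
// Assumption: simplified and Spark-free; `getIsotonic` is illustrative only.
class IsotonicRegressionParams extends Serializable {
  private var isotonic: Boolean = true

  /** Sets whether the fitted sequence is isotonic (increasing, the default)
    * or antitonic (decreasing). Returns `this` to allow call chaining,
    * which also works from Java, unlike a Scala default argument. */
  def setIsotonic(isotonic: Boolean): this.type = {
    this.isotonic = isotonic
    this
  }

  def getIsotonic: Boolean = isotonic
}
```

A caller then chains `new IsotonicRegressionParams().setIsotonic(false)` before invoking the fit method, as in the quoted `ir.run(input)` snippet.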
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594866 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +class IsotonicRegression + extends Serializable { --- End diff -- merge this line with the one above
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594874 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Creates isotonic regression model with given parameters + * + * @param predictions labels estimated using isotonic regression algorithm. --- End diff -- Not clear about what `(Double, Double, Double)` means.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594879 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Performs a pool adjacent violators algorithm (PAVA) --- End diff -- Add `.` at the end. Cite the paper.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594862 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { --- End diff -- There are 4 cases: 1. hit a boundary - return the corresponding prediction directly 2. fall between boundaries - linear interpolation (Note that a special case is singularity, where two boundaries are the same but their predictions are different. We can set manual rules for this case and document the behavior.) 3. smaller than the smallest boundary - return predictions(0) 4. larger than the largest boundary - return predictions.last
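The four cases above can be sketched as a small self-contained piecewise-linear predictor (an illustration, not the PR's code; `boundaries` and `predictions` follow the renaming suggested elsewhere in this review):

```scala
// Sketch of mengxr's four prediction cases over sorted boundaries.
// Assumption: standalone illustration, not spark.mllib code.
import java.util.Arrays.binarySearch

object PiecewiseLinear {
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    val pos = binarySearch(boundaries, x)
    if (pos >= 0) {
      predictions(pos)                          // case 1: hit a boundary exactly
    } else {
      val insertIndex = -pos - 1                // decode binarySearch's -(insertionPoint) - 1
      if (insertIndex == 0) {
        predictions.head                        // case 3: below the smallest boundary
      } else if (insertIndex == boundaries.length) {
        predictions.last                        // case 4: above the largest boundary
      } else {
        // case 2: linear interpolation between the two surrounding boundaries.
        // The singularity case (two equal boundaries) cannot reach this branch:
        // a query strictly between x0 and x1 implies x0 < x1.
        val (x0, x1) = (boundaries(insertIndex - 1), boundaries(insertIndex))
        val (y0, y1) = (predictions(insertIndex - 1), predictions(insertIndex))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

With the example used elsewhere in this review, `boundaries = [1, 2, 4, 5]` and `predictions = [1.0, 3.0, 3.0, 4.0]`, this yields `predict(1.5) == 2.0`, `predict(3.5) == 3.0`, and `predict(-10) == 1.0`.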
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594894 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array + * + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594856 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** --- End diff -- It may be worth validating that `features` is ordered.
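The ordering validation suggested above could look like the following sketch (an assumption of mine, not code from the PR):

```scala
// Sketch of a fail-fast check that the model's boundary array is sorted
// (monotonically non-decreasing), as mengxr suggests validating in the
// constructor. Assumption: illustrative helper, not spark.mllib code.
object OrderingCheck {
  def requireSorted(features: Array[Double]): Unit = {
    var i = 1
    while (i < features.length) {
      require(features(i - 1) <= features(i),
        s"features must be sorted, but features(${i - 1}) > features($i)")
      i += 1
    }
  }
}
```

Calling it from the `IsotonicRegressionModel` constructor would reject unsorted input with an `IllegalArgumentException` instead of silently returning wrong predictions from `binarySearch`.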
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594858 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = --- End diff -- `JavaRDD[java.lang.Double]` -> `JavaDoubleRDD`
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594843 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +/** + * Regression model for Isotonic regression --- End diff -- Isotonic -> isotonic
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594846 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- +/** + * Regression model for Isotonic regression + * + * @param features Array of features. --- End diff -- Need to be more specific about `features` and `labels`. I would rename `features` to `boundaries` and mention that this is monotonic, and rename `labels` to `predictions` because this is not the original labels. The solution to an isotonic regression problem is piecewise linear. The model only needs to store the boundaries and the computed predictions. We can use linear interpolation for values that fall between boundaries. For example, if ~~~ boundaries = [1, 2, 4, 5] predictions = [1.0, 3.0, 3.0, 4.0] ~~~ then ~~~ predict(1.5) == 2.0 predict(3.5) == 3.0 ~~~ We should also document the behavior on the semi-open segments, e.g., `predict(-10) == ?`. I suggest using the smallest prediction here, i.e., `predict(-10) == 1.0`.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594859 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) --- End diff -- `result` -> `insertIndex`
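The rename makes sense because of the encoding `java.util.Arrays.binarySearch` uses for misses, which the reviewed `predict` decodes. A standalone demonstration of that convention (an illustration I am adding, not PR code):

```scala
// Demo of the java.util.Arrays.binarySearch return convention:
// a non-negative result is an exact-match index; a negative result
// encodes -(insertionPoint) - 1, so `-result - 1` recovers the insertion
// point and `-result - 2` the index of the preceding element.
// Assumption: standalone illustration, not spark.mllib code.
import java.util.Arrays.binarySearch

object BinarySearchDemo {
  val features: Array[Double] = Array(1.0, 2.0, 4.0, 5.0)

  val hit: Int = binarySearch(features, 4.0)     // exact match at index 2
  val between: Int = binarySearch(features, 3.0) // insertion point 2, encoded as -3
  val below: Int = binarySearch(features, 0.5)   // insertion point 0, encoded as -1

  // Recover the preceding boundary's index, as the reviewed predict() does:
  val precedingIndex: Int = -between - 2         // index 1, i.e. the boundary 2.0
}
```

The `result == -1` branch in the quoted code is exactly the "insertion point 0" case: the query is smaller than every boundary.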
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- @@ -0,0 +1,238 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.regression + +import java.io.Serializable +import java.util.Arrays.binarySearch + +import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD} +import org.apache.spark.rdd.RDD + +/** + * Regression model for Isotonic regression + * + * @param features Array of features. + * @param labels Array of labels associated to the features at the same index. 
+ */ +class IsotonicRegressionModel ( +features: Array[Double], +val labels: Array[Double]) + extends Serializable { + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: RDD[Double]): RDD[Double] = +testData.map(predict) + + /** + * Predict labels for provided features + * Using a piecewise constant function + * + * @param testData features to be labeled + * @return predicted labels + */ + def predict(testData: JavaRDD[java.lang.Double]): JavaDoubleRDD = +JavaDoubleRDD.fromRDD(predict(testData.rdd.asInstanceOf[RDD[Double]])) + + /** + * Predict a single label + * Using a piecewise constant function + * + * @param testData feature to be labeled + * @return predicted label + */ + def predict(testData: Double): Double = { +val result = binarySearch(features, testData) + +val index = + if (result == -1) { +0 + } else if (result < 0) { +-result - 2 + } else { +result + } + +labels(index) + } +} + +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm --- End diff -- Cite the paper. Use `.` at the end of each sentence.
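The index arithmetic in `predict` above decodes the contract of `java.util.Arrays.binarySearch`: an exact hit returns the key's index, while a miss returns `-(insertionPoint) - 1`, so `-result - 2` recovers the predecessor index and `result == -1` means the query lies below the smallest feature. A rough Python equivalent of the same piecewise-constant lookup, using `bisect` in place of the negative encoding (an illustrative sketch, not code from the patch):

```python
from bisect import bisect_right

def predict(features, labels, x):
    # Piecewise-constant lookup, mirroring the Scala predict(Double).
    # bisect_right returns the insertion point: the number of sorted
    # features <= x, so subtracting 1 gives the predecessor index.
    i = bisect_right(features, x) - 1
    if i < 0:
        i = 0  # x is below the smallest feature: use the first label
    return labels[i]

features = [1.0, 2.0, 4.0]
labels = [1.0, 2.0, 3.0]
print(predict(features, labels, 0.5))  # below range -> 1.0
print(predict(features, labels, 2.0))  # exact hit   -> 2.0
print(predict(features, labels, 3.0))  # between     -> 2.0
```

For queries between two known features this takes the label of the feature on the left, which matches the `-result - 2` branch in the Scala code.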
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594877 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] +/** + * Isotonic regression + * Currently implemented using parallel pool adjacent violators algorithm + */ +class IsotonicRegression + extends Serializable { + + /** + * Run algorithm to obtain isotonic regression model + * + * @param input (label, feature, weight) + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + def run( + input: RDD[(Double, Double, Double)], + isotonic: Boolean = true): IsotonicRegressionModel = { +createModel( + parallelPoolAdjacentViolators(input, isotonic), + isotonic) + } + + /** + * Creates isotonic regression model with given parameters + * + * @param predictions labels estimated using isotonic regression algorithm. + *Used for predictions on new data points. + * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + protected def createModel( + predictions: Array[(Double, Double, Double)], --- End diff -- If the third parameter is not used, maybe we should remove it from the API.
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594889 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...]
+ * @param isotonic isotonic (increasing) or antitonic (decreasing) sequence + * @return isotonic regression model + */ + protected def createModel( + predictions: Array[(Double, Double, Double)], + isotonic: Boolean): IsotonicRegressionModel = { + +val labels = predictions.map(_._1) +val features = predictions.map(_._2) + +new IsotonicRegressionModel(features, labels) + } + + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array + * + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
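The docstring quoted above describes the sequential PAVA step: scan the data once, and whenever a new point violates monotonicity against the previous block, pool the blocks into their weighted mean and re-check backwards. A minimal Python sketch of that pooling loop (illustrative only; the PR's Scala version mutates the input array in place, and the names here are invented):

```python
def pava(points):
    """Weighted pool adjacent violators for an increasing (isotonic) fit.

    points: list of (label, feature, weight), assumed sorted by feature.
    Returns a new list with non-decreasing labels; pooled points share
    the weighted mean of their original labels.
    """
    # Each block: [mean_label, total_weight, count_of_pooled_points]
    blocks = []
    for label, _, weight in points:
        blocks.append([label, weight, 1])
        # Merge backwards while the monotonicity constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            y2, w2, n2 = blocks.pop()
            y1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(y1 * w1 + y2 * w2) / w, w, n1 + n2])
    # Expand blocks back to one (label, feature, weight) per input point.
    out = []
    i = 0
    for y, _, n in blocks:
        for _ in range(n):
            out.append((y, points[i][1], points[i][2]))
            i += 1
    return out

data = [(1.0, 1.0, 1.0), (3.0, 2.0, 1.0), (2.0, 3.0, 1.0)]
print(pava(data))  # pools the violating pair (3.0, 2.0) into 2.5, 2.5
```

The backward re-check after each merge is what fixes violators "in previously processed data created by pooling", as the quoted comment puts it.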
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594882 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + /** + * Performs a pool adjacent violators algorithm (PAVA) + * Uses approach with single processing of data where violators + * in previously processed data created by pooling are fixed immediatelly. + * Uses optimization of discovering monotonicity violating sequences (blocks) + * Method in situ mutates input array --- End diff -- typo?
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594884 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + * Method in situ mutates input array + * + * @param in input data --- End diff -- `in` → `input`?
[GitHub] spark pull request: [MLLIB][SPARK-3278] Monotone (Isotonic) regres...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3519#discussion_r23594887 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala --- [...] + * @param in input data + * @param isotonic asc or desc + * @return result + */ + private def poolAdjacentViolators( + in: Array[(Double, Double, Double)], + isotonic: Boolean): Array[(Double, Double, Double)] = { + +// Pools sub
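The `parallelPoolAdjacentViolators` name in the quoted `run` method suggests the standard two-phase distributed scheme: run PAVA independently within each partition, then concatenate the locally isotonic pieces in feature order and run one final sequential pass. Since each pooled mean appears once per pooled point, block weights are implicitly preserved across the phases. A self-contained sketch under that assumption (unweighted, labels only; all names invented for illustration):

```python
def pava_labels(ys):
    # Minimal unweighted PAVA on a list of labels: non-decreasing fit.
    blocks = []  # each block: [mean, count]
    for y in ys:
        blocks.append([y, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return [m for m, n in blocks for _ in range(n)]

def parallel_pava(ys, chunks=4):
    # Two-phase scheme: local PAVA per chunk (in Spark, a mapPartitions
    # step), then one sequential PAVA over the concatenated results.
    k = max(1, len(ys) // chunks)
    parts = [pava_labels(ys[i:i + k]) for i in range(0, len(ys), k)]
    merged = [y for part in parts for y in part]
    return pava_labels(merged)

ys = [1.0, 3.0, 2.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0]
print(parallel_pava(ys, chunks=3))
# -> [1.0, 2.5, 2.5, 4.0, 5.5, 5.5, 7.0, 8.5, 8.5]
```

In this sketch the two-phase result matches a single sequential pass over the whole list; whether the PR's implementation uses exactly this merge strategy is not visible in the quoted diff.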
[GitHub] spark pull request: [SPARK-5097][SQL] DataFrame
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4173#issuecomment-71614473 [Test build #26150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26150/consoleFull) for PR 4173 at commit [`16934ee`](https://github.com/apache/spark/commit/16934ee0c9719afeb047e4eacf6e35b5e4aca86d). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes.