[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648552#comment-14648552
 ] 

Jeremy Freeman commented on SPARK-9461:
---

Ah, great to hear!

 Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
 --

 Key: SPARK-9461
 URL: https://issues.apache.org/jira/browse/SPARK-9461
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.5.0
Reporter: Joseph K. Bradley
Assignee: Jeremy Freeman

 [~freeman-lab]
 Check out this failure: 
 [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
 It should be deterministic, but do you think it's just slight variations 
 caused by the Python version?  Or do you think it's something odd going on 
 with streaming?  This is the only time I've seen this happen, but I'll post 
 again if I see it more.
 Test failure message:
 {code}
 ==
 FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
 Test that coefs are predicted accurately by fitting on toy data.
 --
 Traceback (most recent call last):
   File 
 /home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py,
  line 1282, in test_parameter_accuracy
 slr.latestModel().weights.array, [10., 10.], 1)
   File 
 /home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py,
  line 1257, in assertArrayAlmostEqual
 self.assertAlmostEqual(i, j, dec)
 AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647884#comment-14647884
 ] 

Jeremy Freeman commented on SPARK-9461:
---

Interesting, the lack of failures on KMeans is consistent with the completion 
idea, because those tests use extremely toy data (< 10 data points), whereas the 
regression tests use hundreds of data points.

In this line (and similar ones):
https://github.com/apache/spark/blob/master/python/pyspark/mllib/tests.py#L1161

there's a parameter `end_time` that sets how many seconds to wait for the job to 
complete. Looking across these tests, that value fluctuates (5, 10, 15, 20), 
suggesting it was hand-tuned, possibly tailored to a local test environment. 
Bumping the number up for any of the tests showing occasional errors might fix 
them, though that's a little ad hoc.

I think things are more robust on the Scala side because there's a full-blown 
streaming test class that lets test jobs either run to completion or stop at a 
maximum timeout 
(https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala). 
So there's just one test-wide parameter, the maximum timeout, and we could 
safely set it fairly high without wasting time.
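That pattern can be sketched in plain Python (a hypothetical helper for illustration, not the actual code in tests.py or TestSuiteBase): poll for a completion condition and give up only at a single, generous maximum timeout, so fast runs finish early and slow runs still pass.

```python
import time

def wait_for(condition, max_timeout=60.0, poll_interval=0.1):
    """Poll `condition` until it returns True or `max_timeout` seconds elapse.

    Returns True if the condition was met in time, False otherwise. A single
    high max_timeout replaces per-test hand-tuned waits: tests that converge
    quickly return early, so a generous ceiling costs nothing on success.
    """
    deadline = time.monotonic() + max_timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False
```

A test would then assert `wait_for(lambda: model_converged())` (with `model_converged` standing in for whatever convergence check the test needs) instead of sleeping for a fixed `end_time`.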




[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-29 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647186#comment-14647186
 ] 

Jeremy Freeman commented on SPARK-9461:
---

A couple of initial observations:
- The Scala tests of the same algorithms are all passing, so if it's something 
deeper in streaming, it's probably specific to PySpark Streaming.
- The estimate and target are close, just not quite within the tolerance. As 
these are approximations, there will be some error, but as you say, this should 
be deterministic and was passing before.
- While developing the original Scala tests, I noticed this could occur if, for 
some reason, the sequence of streaming updates did not complete in the 
available time (so the model didn't quite converge). An error elsewhere could 
conceivably cause that to happen.
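That third point can be illustrated with a toy sketch (hypothetical, not the actual test): if a streaming fit amounts to repeated gradient updates toward known weights, cutting the stream short leaves the estimate measurably shy of the target, the same kind of near-miss seen in the failure message.

```python
import numpy as np

true_w = np.array([10.0, 10.0])  # target weights, as in the test's [10., 10.]

def run_updates(n_batches, step=0.1, batch_size=100, seed=0):
    """Least-squares gradient updates on synthetic batches; returns the
    estimated weights after n_batches streaming updates."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    for _ in range(n_batches):
        X = rng.standard_normal((batch_size, 2))
        y = X @ true_w                          # noiseless toy data
        grad = X.T @ (X @ w - y) / batch_size   # mean-squared-error gradient
        w -= step * grad
    return w
```

With only a few batches the estimate falls well short of [10, 10]; given enough batches it converges, which is why a stream that stops early fails a tight tolerance even though the algorithm is deterministic.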

cc'ing [~MechCoder], who did the PySpark StreamingML bindings: did you notice 
these being at all flaky during development?




[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-29 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647220#comment-14647220
 ] 

Jeremy Freeman commented on SPARK-9461:
---

I'd definitely be curious to see whether these two tests pass again with a 
relaxed tolerance (say, double the tolerance, which would have fixed the 
failure in your initial comment) while we get to the core of the issue. Maybe 
try that on your PR?

The only other algorithm to consider making the change in is StreamingKMeans, 
but the tests there are closer to toy examples, and I think they will be less 
susceptible to small differences in convergence (assuming that's what's going 
on here). Have you noticed any of those failing as well?
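To make the tolerances concrete, here is how unittest's own semantics play out on the values from the failure message (this illustrates `assertAlmostEqual` generally, not the suite's custom `assertArrayAlmostEqual` helper): `places=1` requires `round(a - b, 1) == 0`, which the observed pair just misses, while a roughly doubled absolute tolerance, expressed via `delta`, accepts it.

```python
import unittest

# values taken from the Jenkins failure message
a, b = 9.4243238731093655, 9.3216175551722014

tc = unittest.TestCase()

# places=1 means round(a - b, 1) must equal 0; here the difference is
# about 0.103, which rounds to 0.1, so the assertion fails
try:
    tc.assertAlmostEqual(a, b, places=1)
    places_check_passed = True
except AssertionError:
    places_check_passed = False

# an explicit, roughly doubled absolute tolerance accepts the same pair
tc.assertAlmostEqual(a, b, delta=0.2)
```

So "double the tolerance" turns the observed 0.103 gap from a failure into a pass with a comfortable margin.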




[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-07-26 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642207#comment-14642207
 ] 

Jeremy Freeman commented on SPARK-4144:
---

Sorry Chris! Why don't you have a go, I'd be happy to review anything you put 
together.

 Support incremental model training of Naive Bayes classifier
 

 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly
Assignee: Jeremy Freeman

 Per Xiangrui Meng from the following user list discussion:  
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E

 For Naive Bayes, we need to update the priors and conditional
 probabilities, which means we should also remember the number of
 observations for the updates.
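The quoted point can be sketched as follows (an illustrative toy, not the MLlib implementation): remembering raw class counts and per-class feature sums between batches is sufficient to update both the priors and the conditional probabilities incrementally.

```python
import numpy as np

class IncrementalNaiveBayes:
    """Toy multinomial Naive Bayes supporting incremental (batch-by-batch)
    training by remembering observation counts between updates."""

    def __init__(self, n_classes, n_features, smoothing=1.0):
        self.class_counts = np.zeros(n_classes)                # observations per class
        self.feature_sums = np.zeros((n_classes, n_features))  # feature totals per class
        self.smoothing = smoothing

    def partial_fit(self, X, y):
        """Fold a new batch into the sufficient statistics."""
        for c in range(len(self.class_counts)):
            mask = (y == c)
            self.class_counts[c] += mask.sum()
            self.feature_sums[c] += X[mask].sum(axis=0)

    def priors(self):
        return self.class_counts / self.class_counts.sum()

    def cond_probs(self):
        smoothed = self.feature_sums + self.smoothing
        return smoothed / smoothed.sum(axis=1, keepdims=True)
```

Because only counts are stored, training on two half-batches yields exactly the same model as training once on the concatenated data, which is the property an incremental (streaming) fit needs.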






[jira] [Commented] (SPARK-6722) Model import/export for StreamingKMeansModel

2015-04-07 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484772#comment-14484772
 ] 

Jeremy Freeman commented on SPARK-6722:
---

[~josephkb] Good question! Agreed it would be nice to add this. The only 
concern might be that we've been discussing with [~tdas] the possibility of 
representing local models for these algorithms a little differently, as local 
state streams, to better support driver fault tolerance. But even if we do 
that, the components of the model will be identical to what they are now, so in 
terms of what gets stored for import/export it should stay the same.

I guess my inclination is to add this support, but curious what [~tdas] thinks.

 Model import/export for StreamingKMeansModel
 

 Key: SPARK-6722
 URL: https://issues.apache.org/jira/browse/SPARK-6722
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 CC: [~freeman-lab] Is this API stable enough to merit adding import/export 
 (which will require supporting the model format version from now on)?






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390176#comment-14390176
 ] 

Jeremy Freeman commented on SPARK-6646:
---

Very promising, [~tdas]! We should evaluate the performance of streaming 
machine learning algorithms. In general I think running Spark in JavaScript via 
Scala.js and Node.js is extremely appealing; it will make integration with 
visualization very straightforward.

 Spark 2.0: Rearchitecting Spark for Mobile Platforms
 

 Key: SPARK-6646
 URL: https://issues.apache.org/jira/browse/SPARK-6646
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Attachments: Spark on Mobile - Design Doc - v1.pdf


 Mobile computing is quickly rising to dominance, and by the end of 2017, it 
 is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
 Spark’s project goal can be accomplished only when Spark runs efficiently for 
 the growing population of mobile users.
 Designed and optimized for modern data centers and Big Data applications, 
 Spark is unfortunately not a good fit for mobile computing today. In the past 
 few months, we have been prototyping the feasibility of a mobile-first Spark 
 architecture, and today we would like to share with you our findings. This 
 ticket outlines the technical design of Spark’s mobile support, and shares 
 results from several early prototypes.
 Mobile friendly version of the design doc: 
 https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.





[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
  // ...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
  // ...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.





[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
  // ...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
  // ...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.





[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.predictOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.





[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.





[jira] [Created] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-6345:
-

 Summary: Model update propagation during prediction in Streaming 
Regression
 Key: SPARK-6345
 URL: https://issues.apache.org/jira/browse/SPARK-6345
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Streaming
Reporter: Jeremy Freeman


During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.






[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM:


Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing experimental pieces of functionality. But I'd probably prioritize just 
reviewing the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.


was (Author: freeman-lab):
Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman commented on SPARK-2429:
---

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

Also agree with the others that we should start a new PR with the new 
algorithm, 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR if you haven't already, if relevant in the new version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

 Hierarchical Implementation of KMeans
 -

 Key: SPARK-2429
 URL: https://issues.apache.org/jira/browse/SPARK-2429
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
  Labels: clustering
 Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
 Result of Benchmarking a Hierarchical Clustering.pdf, 
 benchmark-result.2014-10-29.html, benchmark2.html


 Hierarchical clustering algorithms are widely used and would make a nice 
 addition to MLlib.  Clustering algorithms are useful for determining 
 relationships between clusters as well as offering faster assignment. 
 Discussion on the dev list suggested the following possible approaches:
 * Top down, recursive application of KMeans
 * Reuse DecisionTree implementation with different objective function
 * Hierarchical SVD
 It was also suggested that support for distance metrics other than Euclidean 
 such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-20 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14328726#comment-14328726
 ] 

Jeremy Freeman commented on SPARK-4144:
---

Hi all, I'd be happy to pick this up and get a prototype in the next couple 
weeks, it's related to the streaming logistic regression we just added for 1.3, 
though the update rule is more complex and might not be able to reuse as much 
code (in that sense it may end up looking more like streaming k-means).

But don't want to step on anyone's toes! [~liquanpei] are you working on it? Or 
are you still interested [~cfregly]? Also happy to work together / review code.

 Support incremental model training of Naive Bayes classifier
 

 Key: SPARK-4144
 URL: https://issues.apache.org/jira/browse/SPARK-4144
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Reporter: Chris Fregly
Assignee: Liquan Pei

 Per Xiangrui Meng from the following user list discussion:  
 http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E

 For Naive Bayes, we need to update the priors and conditional
 probabilities, which means we should also remember the number of
 observations for the updates.






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:python}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
{code}

But this works fine:

{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.
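The bug and its fix can be shown in isolation: {{ndarray.astype}} returns a new array rather than converting in place, so the result has to be bound back to the name. A minimal sketch, using a local array {{ar}} as a stand-in:

```python
import numpy as np

ar = np.arange(3, dtype=np.float32)

# Buggy form: astype returns a new array, and the result is discarded,
# so ar silently stays float32.
if ar.dtype != np.float64:
    ar.astype(np.float64)

# Fixed form: rebind the name to the upcast copy.
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)
```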

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

``
if ar.dtype != np.float64:
ar.astype(np.float64)
``
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.


 Vector conversion broken for non-float64 arrays
 ---

 Key: SPARK-5089
 URL: https://issues.apache.org/jira/browse/SPARK-5089
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Jeremy Freeman

 Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
 automatically converted to {{DenseVectors}}. If the data are numpy arrays 
 with dtype {{float64}} this works. If data are numpy arrays with lower 
 precision (e.g. {{float16}} or {{float32}}), they should be upcast to 
 {{float64}}, but due to a small bug in this line this currently doesn't 
 happen (casting is not inplace). 
 {code:python}
 if ar.dtype != np.float64:
 ar.astype(np.float64)
 {code}
  
 Non-float64 values are in turn mangled during SerDe. This can have 
 significant consequences. For example, the following yields confusing and 
 erroneous results:
 {code:python}
 from numpy import random
 from pyspark.mllib.clustering import KMeans
 data = sc.parallelize(random.randn(100,10).astype('float32'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  5 # should be 10!
 {code}
 But this works fine:
 {code:python}
 data = sc.parallelize(random.randn(100,10).astype('float64'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  10 # this is correct
 {code}
 The fix is trivial, I'll submit a PR shortly.






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:none}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:none}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
{code}

But this works fine:

{code:none}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:python}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
{code}

But this works fine:

{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.


 Vector conversion broken for non-float64 arrays
 ---

 Key: SPARK-5089
 URL: https://issues.apache.org/jira/browse/SPARK-5089
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Jeremy Freeman

 Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
 automatically converted to {{DenseVectors}}. If the data are numpy arrays 
 with dtype {{float64}} this works. If data are numpy arrays with lower 
 precision (e.g. {{float16}} or {{float32}}), they should be upcast to 
 {{float64}}, but due to a small bug in this line this currently doesn't 
 happen (casting is not inplace). 
 {code:none}
 if ar.dtype != np.float64:
 ar.astype(np.float64)
 {code}
  
 Non-float64 values are in turn mangled during SerDe. This can have 
 significant consequences. For example, the following yields confusing and 
 erroneous results:
 {code:none}
 from numpy import random
 from pyspark.mllib.clustering import KMeans
 data = sc.parallelize(random.randn(100,10).astype('float32'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  5 # should be 10!
 {code}
 But this works fine:
 {code:none}
 data = sc.parallelize(random.randn(100,10).astype('float64'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  10 # this is correct
 {code}
 The fix is trivial, I'll submit a PR shortly.






[jira] [Created] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-5089:
-

 Summary: Vector conversion broken for non-float64 arrays
 Key: SPARK-5089
 URL: https://issues.apache.org/jira/browse/SPARK-5089
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Jeremy Freeman


Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

```
if ar.dtype != np.float64:
ar.astype(np.float64)
```
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

``
if ar.dtype != np.float64:
ar.astype(np.float64)
``
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

```
if ar.dtype != np.float64:
ar.astype(np.float64)
```
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.


 Vector conversion broken for non-float64 arrays
 ---

 Key: SPARK-5089
 URL: https://issues.apache.org/jira/browse/SPARK-5089
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Jeremy Freeman

 Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
 automatically converted to `DenseVectors`. If the data are numpy arrays with 
 dtype `float64` this works. If data are numpy arrays with lower precision 
 (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to 
 a small bug in this line this currently doesn't happen (casting is not 
 inplace). 
 ``
 if ar.dtype != np.float64:
 ar.astype(np.float64)
 ``
  
 Non-float64 values are in turn mangled during SerDe. This can have 
 significant consequences. For example, the following yields confusing and 
 erroneous results:
 ```
 from numpy import random
 from pyspark.mllib.clustering import KMeans
 data = sc.parallelize(random.randn(100,10).astype('float32'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  5 # should be 10!
 ```
 But this works fine:
 ```
 data = sc.parallelize(random.randn(100,10).astype('float64'))
 model = KMeans.train(data, k=3)
 len(model.centers[0])
  10 # this is correct
 ```
 The fix is trivial, I'll submit a PR shortly.






[jira] [Created] (SPARK-4979) Add streaming logistic regression

2014-12-28 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4979:
-

 Summary: Add streaming logistic regression
 Key: SPARK-4979
 URL: https://issues.apache.org/jira/browse/SPARK-4979
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Priority: Minor


We currently support streaming linear regression and k-means clustering. We can 
add support for streaming logistic regression using a strategy similar to that 
used in streaming linear regression, applying gradient updates to batches of 
data from a DStream, and extending the existing mllib methods with minor 
modifications.
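The strategy described above — one gradient step of the logistic loss per batch of the stream — can be sketched in plain NumPy. This is an illustration of the update rule, not MLlib's StreamingLogisticRegressionWithSGD; the function names and the simulated "batches" are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_on_batch(weights, X, y, step):
    """One SGD step of the logistic loss on a single batch: the same kind
    of per-RDD update streaming linear regression applies to a DStream."""
    preds = sigmoid(X @ weights)
    grad = X.T @ (preds - y) / len(y)
    return weights - step * grad

# Simulate a stream of batches; each one nudges the running model.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
w = np.zeros(2)
for _ in range(200):  # 200 "batches"
    X = rng.normal(size=(50, 2))
    y = (sigmoid(X @ w_true) > rng.random(50)).astype(float)
    w = update_on_batch(w, X, y, step=0.5)
```

After enough batches the weights settle near the generating parameters, which is the behavior the streaming linear regression tests check for.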






[jira] [Created] (SPARK-4980) Add decay factors to streaming linear methods

2014-12-28 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4980:
-

 Summary: Add decay factors to streaming linear methods
 Key: SPARK-4980
 URL: https://issues.apache.org/jira/browse/SPARK-4980
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Priority: Minor


Our implementation of streaming k-means uses a decay factor that allows users 
to control how quickly the model adjusts to new data: whether it treats all 
data equally, or only bases its estimate on the most recent batch. It is 
intuitively parameterized, and can be specified in units of either batches or 
points. We should add a similar decay factor to the streaming linear methods 
using SGD, including streaming linear regression (currently implemented) and 
streaming logistic regression (in development).
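The decay factor in streaming k-means amounts to discounting the accumulated weight of old data before folding in each new batch. A minimal sketch for a single discounted running mean, under the convention that decay = 1.0 weighs all history equally and decay = 0.0 uses only the newest batch (function name hypothetical):

```python
def decayed_mean_update(mean, weight, batch_sum, batch_count, decay):
    """Fold one batch into a discounted running mean.

    decay = 1.0 -> plain mean over all data seen so far;
    decay = 0.0 -> estimate based only on the most recent batch.
    """
    new_weight = weight * decay + batch_count
    new_mean = (mean * weight * decay + batch_sum) / new_weight
    return new_mean, new_weight
```

The same discounting applied to the gradient-based streaming linear methods would down-weight the contribution of older batches in the running solution.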






[jira] [Created] (SPARK-4969) Add binaryRecords support to streaming

2014-12-25 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4969:
-

 Summary: Add binaryRecords support to streaming
 Key: SPARK-4969
 URL: https://issues.apache.org/jira/browse/SPARK-4969
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.2.0
Reporter: Jeremy Freeman
Priority: Minor


As of Spark 1.2 there is support for loading fixed-length records from flat 
binary files. This is a useful way to load dense numerical array data into 
Spark, especially in scientific computing applications.

We should add support for loading this same file type in Spark Streaming, both 
in Scala/Java and in Python. 
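The core of the fixed-length-record idea, independent of the streaming API, is splitting a byte stream at a fixed stride and decoding each slice. A minimal sketch of what a consumer would do with the byte strings such a source yields (the record length and dtype here are illustrative assumptions):

```python
import numpy as np

RECORD_LEN = 16  # bytes per record: e.g. two float64 values

# Stand-in for the byte strings a binaryRecords-style source would yield.
raw = np.arange(4, dtype=np.float64).tobytes()
records = [raw[i:i + RECORD_LEN] for i in range(0, len(raw), RECORD_LEN)]

# Each fixed-length record decodes to a small dense array.
arrays = [np.frombuffer(rec, dtype=np.float64) for rec in records]
```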






[jira] [Commented] (SPARK-4727) Add dimensional RDDs (time series, spatial)

2014-12-04 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234269#comment-14234269
 ] 

Jeremy Freeman commented on SPARK-4727:
---

Great to brainstorm about this RJ! 

To some extent, we've been doing this over on the 
[Thunder|http://thefreemanlab.com/thunder/docs/] project. In particular, check 
out the {{TimeSeries}} and {{Images}} classes 
[here|https://github.com/freeman-lab/thunder/tree/master/python/thunder/rdds], 
which are essentially wrappers for specialized RDDs. Our basic abstraction is 
RDDs of ndarrays (1D for time series, 2D or 3D for images/volumes), with 
metadata (lazily propagated) for things like dimensionality and time base, 
coordinates embedded in keys, and useful methods on these objects like the ones 
you mention (e.g. filtering, Fourier, cross-correlation). We've also worked on 
transformations between representations, for the common case of sequences of 
images corresponding to different time points.

We haven't worked on custom partition strategies yet, I think that will be most 
important for image tiles drawn from a much larger image. There's cool work 
ongoing for that in GeoTrellis, see the 
[repo|https://github.com/geotrellis/geotrellis/tree/master/spark/src/main] and 
a 
[talk|http://spark-summit.org/2014/talk/geotrellis-adding-geospatial-capabilities-to-spark]
 from Rob.

FWIW, when we started it seemed more appropriate to build this into a 
specialized library, rather than Spark core. It's also something that benefits 
from using Python, due to a bevy of existing libraries for temporal and image 
data (though there are certainly analogs in Java/Scala). But it would be great 
to probe the community for general interest in these kinds of abstractions and 
methods.

 Add dimensional RDDs (time series, spatial)
 -

 Key: SPARK-4727
 URL: https://issues.apache.org/jira/browse/SPARK-4727
 Project: Spark
  Issue Type: Brainstorming
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: RJ Nowling

 Certain types of data (times series, spatial) can benefit from specialized 
 RDDs.  I'd like to open a discussion about this.
 For example, time series data should be ordered by time and would benefit 
 from operations like:
 * Subsampling (taking every n data points)
 * Signal processing (correlations, FFTs, filtering)
 * Windowing functions
 Spatial data benefits from ordering and partitioning along a 2D or 3D grid.  
 For example, path finding algorithms can be optimized by only comparing points 
 within a set distance, which can be computed more efficiently by partitioning 
 data into a grid.
 Although the operations on time series and spatial data may be different, 
 there is some commonality in the sense of the data having ordered dimensions 
 and the implementations may overlap.






[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-11-26 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227266#comment-14227266
 ] 

Jeremy Freeman commented on SPARK-3995:
---

Agree with [~mengxr], that's a better strategy.

 [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
 -

 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Assignee: Jeremy Freeman
Priority: Critical
 Fix For: 1.2.0


 There is a breaking bug in PySpark's sampling methods when run with NumPy 
 v1.9. This is the version of NumPy included with the current Anaconda 
 distribution (v2.1); this is a popular distribution, and is likely to affect 
 many users.
 Steps to reproduce are:
 {code:python}
 foo = sc.parallelize(range(1000),5)
 foo.takeSample(False, 10)
 {code}
 Returns:
 {code}
 PythonException: Traceback (most recent call last):
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, 
 line 79, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 196, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 127, in dump_stream
 for obj in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 185, in _batched
 for item in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 116, in func
  if self.getUniformSample(split) <= self._fraction:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 58, in getUniformSample
 self.initRandomGenerator(split)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 44, in initRandomGenerator
 self._random = numpy.random.RandomState(self._seed)
   File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
 (numpy/random/mtrand/mtrand.c:7397)
   File mtrand.pyx, line 646, in mtrand.RandomState.seed 
 (numpy/random/mtrand/mtrand.c:7697)
 ValueError: Seed must be between 0 and 4294967295
 {code}
 In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:
 {code:python}
 self._seed = seed if seed is not None else random.randint(0, sys.maxint)
 {code}
 In previous versions of NumPy a random seed larger than 2 ** 32 would 
 silently get truncated to 2 ** 32. This was fixed in a recent patch 
 (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
  But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, 
 which effectively breaks sampling operations in PySpark (unless the seed is 
 set manually).
 I am putting a PR together now (the fix is very simple!).
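 The one-line fix described above is simply to draw the default seed from the
 range NumPy accepts. A minimal sketch (the helper name is hypothetical; the
 original code drew from {{(0, sys.maxint)}} under Python 2):

```python
import random
import numpy as np

def make_seed(seed=None):
    """Default the seed to NumPy's accepted range [0, 2**32 - 1] rather
    than [0, sys.maxint], which raises ValueError on NumPy >= 1.9."""
    return seed if seed is not None else random.randint(0, 2**32 - 1)

# RandomState now accepts the seed regardless of platform word size.
rng = np.random.RandomState(make_seed())
```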






[jira] [Commented] (SPARK-3254) Streaming K-Means

2014-10-23 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14181080#comment-14181080
 ] 

Jeremy Freeman commented on SPARK-3254:
---

I'd like it to! I've got a PR ready to submit ASAP, we can take it from there.

 Streaming K-Means
 -

 Key: SPARK-3254
 URL: https://issues.apache.org/jira/browse/SPARK-3254
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Xiangrui Meng
Assignee: Jeremy Freeman

 Streaming K-Means with proper decay settings.






[jira] [Created] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-3995:
-

 Summary: pyspark's sample methods do not work with NumPy 1.9
 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Critical









[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very simple!).
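As a back-of-the-envelope check (assuming 64-bit CPython 2, where sys.maxint is 2 ** 63 - 1), the fraction of draws from random.randint(0, sys.maxint) that NumPy would still accept is vanishingly small:

```python
# Model sys.maxint on 64-bit CPython 2; Python 3 has no sys.maxint.
MAXINT = 2 ** 63 - 1
VALID_SEEDS = 2 ** 32  # NumPy accepts seeds 0 .. 2**32 - 1

# Probability that a uniform draw from [0, MAXINT] is a valid seed.
p_ok = VALID_SEEDS / float(MAXINT + 1)
print(p_ok)  # about 4.7e-10
```

So unless the user passes a seed explicitly, essentially every sampling call hits the new ValueError.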




 pyspark's sample methods do not work with NumPy 1.9
 ---

 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Critical

 There is a breaking bug in PySpark's sampling methods when run with NumPy 
 v1.9. This is the version of NumPy included with the current Anaconda 
 distribution (v2.1); this is a popular distribution, and is likely to affect 
 many users.
 Steps to reproduce are:
 {code}
 foo = sc.parallelize(range(1000),5)
 foo.takeSample(False, 10)
 {code}
 Returns:
 {code}
 PythonException: Traceback (most recent call last):
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, 
 line 79, in main
 serializer.dump_stream(func(split_index, iterator), outfile)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 196, in dump_stream
 self.serializer.dump_stream(self._batched(iterator), stream)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 127, in dump_stream
 for obj in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py,
  line 185, in _batched
 for item in iterator:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 116, in func
 if self.getUniformSample(split) <= self._fraction:
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 58, in getUniformSample
 self.initRandomGenerator(split)
   File 
 /Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py,
  line 44, in initRandomGenerator
 self._random = numpy.random.RandomState(self._seed)
   File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
 (numpy/random/mtrand/mtrand.c:7397)
   File mtrand.pyx, line 646, in mtrand.RandomState.seed 
 (numpy/random/mtrand/mtrand.c:7697)
 ValueError: Seed must be between 0 and 4294967295
 {code}
 In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
 truncated to 2 ** 32. This was fixed in a recent patch 
 

[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).






[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





 [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
 

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy, a random seed larger than 2 ** 32 - 1 was silently 
truncated (reduced modulo 2 ** 32). This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). 
But sampling from {{(0, sys.maxint)}} often yields ints larger than 2 ** 32 - 1, 
which effectively breaks sampling operations in PySpark (unless the seed is set 
manually).

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); since this distribution is popular, the bug is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py, line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 127, in dump_stream
for obj in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py, 
line 185, in _batched
for item in iterator:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py, 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File mtrand.pyx, line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File mtrand.pyx, line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy, a random seed larger than 2 ** 32 - 1 was silently 
truncated (reduced modulo 2 ** 32). This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). 
But sampling from {{(0, sys.maxint)}} often yields ints larger than 2 ** 32 - 1, 
which effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).






[jira] [Created] (SPARK-3128) Use streaming test suite for StreamingLR

2014-08-19 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-3128:
-

 Summary: Use streaming test suite for StreamingLR
 Key: SPARK-3128
 URL: https://issues.apache.org/jira/browse/SPARK-3128
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Minor


Unit tests for Streaming Linear Regression currently use file writing to 
generate input data and a TextFileStream to read the data. It would be better 
to use existing utilities from the streaming test suite to simulate DStreams 
and collect and evaluate results of DStream operations. This will make tests 
faster, simpler, and easier to maintain / extend.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays

2014-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079984#comment-14079984
 ] 

Jeremy Freeman commented on SPARK-2012:
---

[~davies] cool, that definitely makes sense to me. Shall I put a PR together 
that does it that way?

 PySpark StatCounter with numpy arrays
 -

 Key: SPARK-2012
 URL: https://issues.apache.org/jira/browse/SPARK-2012
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Jeremy Freeman
Priority: Minor

 In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy 
 arrays just as with an RDD of scalars, which was very useful (e.g. for 
 computing stats on a set of vectors in ML analyses). In 1.0.0 this broke 
 because the added functionality for computing the minimum and maximum, as 
 implemented, doesn't work on arrays.
 I have a PR ready that re-enables this functionality by having StatCounter 
 use the numpy element-wise functions maximum and minimum, which work on 
 both numpy arrays and scalars (and I've added new tests for this capability). 
 However, I realize this adds a dependency on NumPy outside of MLLib. If 
 that's not ok, maybe it'd be worth adding this functionality as a util within 
 PySpark MLLib?
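 As an aside, the element-wise reducers described here can be illustrated with a 
 standalone NumPy snippet (this is not Spark's actual StatCounter code, just a 
 sketch of the behavior that makes the fix work):

{code:python}
import numpy as np

# np.maximum / np.minimum reduce element-wise and accept arrays and
# scalars alike, unlike the builtin max/min, which fail on arrays.
a = np.array([1.0, 5.0, 3.0])
b = np.array([4.0, 2.0, 6.0])
print(np.maximum(a, b))      # element-wise: [4. 5. 6.]
print(np.minimum(a, b))      # element-wise: [1. 2. 3.]
print(np.maximum(2.0, 7.0))  # scalars still work: 7.0
{code}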





[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-21 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069758#comment-14069758
 ] 

Jeremy Freeman commented on SPARK-2282:
---

Hi all, I'm the scientist. A couple of updates from more real-world testing, 
and it's looking very promising!

- Set-up: 60 node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), data cached and counted 
before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations, with the 
patch 20+ complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed (!), with the patch 20+ 
complete.

So it's looking really good. I can also try the other extreme (very small 
cluster) to see if that issue manifests. Aaron, big thanks for helping with 
this, it's a big deal for our workflows, so really terrific to get to the 
bottom of it!

-- Jeremy

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.
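For illustration, the socket option mentioned above looks like this at the Python 
level (Spark's actual change would be on the JVM side; this is only a sketch of 
the option itself):

{code:python}
import socket

# Enable SO_REUSEADDR so a bind() can reuse an address whose previous
# socket is still in the TIME_WAIT state, mitigating the buildup of
# used ephemeral ports described in this issue.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0)  # True
sock.close()
{code}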





[jira] [Comment Edited] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-21 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14069758#comment-14069758
 ] 

Jeremy Freeman edited comment on SPARK-2282 at 7/22/14 3:18 AM:


Hi all, I'm the scientist, a couple updates from more real world testing, 
looking very promising!

- Set-up: 60 node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), data cached and counted 
before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations, with the 
patch 20+ complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed, with the patch 20+ 
complete.

So it's looking really good. I can also try the other extreme (very small 
cluster) to see if that issue manifests. Aaron, big thanks for helping with 
this, it's a big deal for our workflows, so really terrific to get to the 
bottom of it!

-- Jeremy


was (Author: freeman-lab):
Hi all, I'm the scientist, a couple updates from more real world testing, 
looking very promising!

- Set-up: 60 node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), data cached and counted 
before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations, with the 
patch 20+ complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed (!), with the patch 20+ 
complete.

So it's looking really good. I can also try the other extreme (very small 
cluster) to see if that issue manifests. Aaron, big thanks for helping with 
this, it's a big deal for our workflows, so really terrific to get to the 
bottom of it!

-- Jeremy

 PySpark crashes if too many tasks complete quickly
 --

 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 0.9.1, 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson
 Fix For: 0.9.2, 1.0.0, 1.0.1


 Upon every task completion, PythonAccumulatorParam constructs a new socket to 
 the Accumulator server running inside the pyspark daemon. This can cause a 
 buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
 stage, which will cause the SparkContext to crash if too many tasks complete 
 too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
 This bug can be fixed outside of Spark by ensuring these properties are set 
 (on a Linux server):
 echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse
 echo 1 > /proc/sys/net/ipv4/tcp_tw_recycle
 or by adding the SO_REUSEADDR option to the Socket creation within Spark.





[jira] [Created] (SPARK-2438) Streaming + MLLib

2014-07-10 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-2438:
-

 Summary: Streaming + MLLib
 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jeremy Freeman


This is a ticket to track progress on developing streaming analyses in MLLib.

Many streaming applications benefit from or require fitting models online, 
where the parameters of a model (e.g. regression, clustering) are updated 
continually as new data arrive. This can be accomplished by incorporating MLLib 
algorithms into model-updating operations over DStreams. In some cases this can 
be achieved using existing updaters (e.g. those based on SGD), but in other 
cases will require custom update rules (e.g. for KMeans). The goal is to have 
streaming versions of many common algorithms, in particular regression, 
classification, clustering, and possibly dimensionality reduction.





[jira] [Created] (SPARK-2012) PySpark StatCounter with numpy arrays

2014-06-03 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-2012:
-

 Summary: PySpark StatCounter with numpy arrays
 Key: SPARK-2012
 URL: https://issues.apache.org/jira/browse/SPARK-2012
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Jeremy Freeman
Priority: Minor


In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy 
arrays just as with an RDD of scalars, which was very useful (e.g. for 
computing stats on a set of vectors in ML analyses). In 1.0.0 this broke 
because the added functionality for computing the minimum and maximum, as 
implemented, doesn't work on arrays.

I have a PR ready that re-enables this functionality by having StatCounter use 
the numpy element-wise functions maximum and minimum, which work on both 
numpy arrays and scalars (and I've added new tests for this capability). 

However, I realize this adds a dependency on NumPy outside of MLLib. If that's 
not ok, maybe it'd be worth adding this functionality as a util within PySpark 
MLLib?


