[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648552#comment-14648552
 ] 

Jeremy Freeman commented on SPARK-9461:
---

Ah, great to hear!

> Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
> --
>
> Key: SPARK-9461
> URL: https://issues.apache.org/jira/browse/SPARK-9461
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Jeremy Freeman
>
> [~freeman-lab]
> Check out this failure: 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
> It should be deterministic, but do you think it's just slight variations 
> caused by the Python version?  Or do you think it's something odd going on 
> with streaming?  This is the only time I've seen this happen, but I'll post 
> again if I see it more.
> Test failure message:
> {code}
> ==
> FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
> Test that coefs are predicted accurately by fitting on toy data.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1282, in test_parameter_accuracy
> slr.latestModel().weights.array, [10., 10.], 1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1257, in assertArrayAlmostEqual
> self.assertAlmostEqual(i, j, dec)
> AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
> {code}






[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647884#comment-14647884
 ] 

Jeremy Freeman commented on SPARK-9461:
---

Interesting. The lack of failures on KMeans is consistent with the completion 
idea, because those tests use extremely small toy data (< 10 data points), 
whereas the regression ones use hundreds of test data points.

In this line (and similar ones):
https://github.com/apache/spark/blob/master/python/pyspark/mllib/tests.py#L1161

there's a parameter `end_time` that sets how long, in seconds, to wait for the 
streaming job to complete. Looking across these tests, the value varies (5, 10, 
15, 20), suggesting it was hand-tuned, possibly tailored to a local test 
environment. Bumping that number up for any of the tests showing occasional 
errors might fix it, though that's a little ad hoc.
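To make the pattern concrete, here is a minimal sketch of what that fixed wait 
amounts to (a paraphrase with hypothetical names, not the actual tests.py code):

{code:python}
import time

def run_with_fixed_wait(ssc, end_time, check_assertions):
    # Paraphrase of the hand-tuned fixed-wait pattern: start the streaming
    # context, sleep for `end_time` seconds (5, 10, 15 or 20 depending on
    # the test), then run the assertions whether or not every batch has
    # actually been processed. `check_assertions` is a hypothetical stand-in
    # for the test's asserts.
    ssc.start()
    time.sleep(end_time)
    check_assertions()
{code}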

I think things are more robust on the Scala side because there's a full-blown 
streaming test class that lets test jobs either run to completion or stop at a 
maximum timeout 
(https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala).
 So there's just one test-wide parameter, the maximum timeout, and we could 
safely set that pretty high without wasting time.
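A rough Python equivalent of that run-to-completion-or-timeout idea (just a 
sketch, not the TestSuiteBase API) would poll a completion condition and return 
as soon as it holds, so a generous maximum timeout costs nothing when the job 
finishes early:

{code:python}
import time

def wait_until(condition, max_timeout=60.0, poll_interval=0.1):
    # Return True as soon as `condition()` holds; only give up (and let the
    # assertions fail) after the single, generous `max_timeout`.
    deadline = time.time() + max_timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(poll_interval)
    return False
{code}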

> Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
> --
>
> Key: SPARK-9461
> URL: https://issues.apache.org/jira/browse/SPARK-9461
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Jeremy Freeman
>
> [~freeman-lab]
> Check out this failure: 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
> It should be deterministic, but do you think it's just slight variations 
> caused by the Python version?  Or do you think it's something odd going on 
> with streaming?  This is the only time I've seen this happen, but I'll post 
> again if I see it more.
> Test failure message:
> {code}
> ==
> FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
> Test that coefs are predicted accurately by fitting on toy data.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1282, in test_parameter_accuracy
> slr.latestModel().weights.array, [10., 10.], 1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1257, in assertArrayAlmostEqual
> self.assertAlmostEqual(i, j, dec)
> AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
> {code}






[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-29 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647220#comment-14647220
 ] 

Jeremy Freeman commented on SPARK-9461:
---

I'd definitely be curious to see whether these two tests pass again with a 
relaxed tolerance (say, double the current one, which would have fixed the 
failure in your initial comment) while we get to the core of the issue. Maybe 
try that on your PR?

The only other algorithm to consider making the change in is StreamingKMeans, 
but the tests there are closer to toy examples and I think will be less 
susceptible to small differences in convergence (assuming that's what's going 
on here). Have you noticed any of those failing as well?
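For reference, here is how the failing comparison behaves under the two 
tolerances (standard unittest semantics; the doubled tolerance is expressed as 
an absolute delta for illustration):

{code:python}
import unittest

class ToleranceExample(unittest.TestCase):
    def test_places_vs_delta(self):
        estimate, target = 9.4243238731093655, 9.3216175551722014
        # places=1 requires round(estimate - target, 1) == 0, i.e. a gap
        # under ~0.05; the observed gap of ~0.103 therefore fails:
        with self.assertRaises(AssertionError):
            self.assertAlmostEqual(estimate, target, places=1)
        # A roughly doubled allowance would have let this run pass:
        self.assertAlmostEqual(estimate, target, delta=0.2)

if __name__ == "__main__":
    unittest.main()
{code}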

> Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
> --
>
> Key: SPARK-9461
> URL: https://issues.apache.org/jira/browse/SPARK-9461
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Jeremy Freeman
>
> [~freeman-lab]
> Check out this failure: 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
> It should be deterministic, but do you think it's just slight variations 
> caused by the Python version?  Or do you think it's something odd going on 
> with streaming?  This is the only time I've seen this happen, but I'll post 
> again if I see it more.
> Test failure message:
> {code}
> ==
> FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
> Test that coefs are predicted accurately by fitting on toy data.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1282, in test_parameter_accuracy
> slr.latestModel().weights.array, [10., 10.], 1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1257, in assertArrayAlmostEqual
> self.assertAlmostEqual(i, j, dec)
> AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
> {code}






[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests

2015-07-29 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647186#comment-14647186
 ] 

Jeremy Freeman commented on SPARK-9461:
---

A couple of initial observations:
- The Scala tests of the same algorithms are all passing, so if it's something 
deeper in streaming, it's probably specific to PySpark Streaming.
- The estimate and target are close, just not quite within the tolerance. As 
these are approximations, there will be some error, but as you say, this should 
be deterministic and was passing before.
- In developing the original Scala tests, I noticed this could occur if, for 
some reason, the sequence of streaming updates did not complete in the 
available time (so the model didn't quite converge). An error elsewhere could 
conceivably cause that to happen.

cc'ing [~MechCoder], who did the PySpark streaming ML bindings: did you notice 
these being at all flaky during development?

> Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
> --
>
> Key: SPARK-9461
> URL: https://issues.apache.org/jira/browse/SPARK-9461
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Jeremy Freeman
>
> [~freeman-lab]
> Check out this failure: 
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
> It should be deterministic, but do you think it's just slight variations 
> caused by the Python version?  Or do you think it's something odd going on 
> with streaming?  This is the only time I've seen this happen, but I'll post 
> again if I see it more.
> Test failure message:
> {code}
> ==
> FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
> Test that coefs are predicted accurately by fitting on toy data.
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1282, in test_parameter_accuracy
> slr.latestModel().weights.array, [10., 10.], 1)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1257, in assertArrayAlmostEqual
> self.assertAlmostEqual(i, j, dec)
> AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
> {code}






[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-07-26 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642207#comment-14642207
 ] 

Jeremy Freeman commented on SPARK-4144:
---

Sorry, Chris! Why don't you have a go? I'd be happy to review anything you put 
together.

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Jeremy Freeman
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."
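For concreteness, a rough sketch of the bookkeeping this implies (plain numpy 
with hypothetical names, not the MLlib API): keep raw counts so the priors and 
conditional probabilities can be recomputed after every batch of observations.

{code:python}
import numpy as np

class IncrementalNaiveBayesCounts(object):
    # Hypothetical sketch: remember per-class observation counts and
    # per-class feature counts so incremental updates are just additions.
    def __init__(self, num_classes, num_features):
        self.class_counts = np.zeros(num_classes)
        self.feature_counts = np.zeros((num_classes, num_features))

    def update(self, labels, features):
        # labels: iterable of class indices; features: rows of non-negative counts
        for y, x in zip(labels, features):
            self.class_counts[y] += 1
            self.feature_counts[y] += x

    def priors(self):
        return self.class_counts / self.class_counts.sum()

    def conditionals(self, smoothing=1.0):
        smoothed = self.feature_counts + smoothing
        return smoothed / smoothed.sum(axis=1, keepdims=True)
{code}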






[jira] [Commented] (SPARK-4980) Add decay factors to streaming linear methods

2015-07-24 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640714#comment-14640714
 ] 

Jeremy Freeman commented on SPARK-4980:
---

That's terrific! Happy to have you work on it. Thanks for sharing the doc; I'll 
review it ASAP and respond with comments and thoughts.

> Add decay factors to streaming linear methods
> -
>
> Key: SPARK-4980
> URL: https://issues.apache.org/jira/browse/SPARK-4980
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>Priority: Minor
>
> Our implementation of streaming k-means uses a decay factor that allows 
> users to control how quickly the model adjusts to new data: whether it treats 
> all data equally, or only bases its estimate on the most recent batch. It is 
> intuitively parameterized, and can be specified in units of either batches or 
> points. We should add a similar decay factor to the streaming linear methods 
> using SGD, including streaming linear regression (currently implemented) and 
> streaming logistic regression (in development).
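As a rough illustration of the kind of rule a decay factor implies (a numpy 
sketch with hypothetical names, mirroring the streaming k-means update rather 
than any proposed API): decay = 1.0 weights all history equally, decay = 0.0 
keeps only the newest batch.

{code:python}
import numpy as np

def decayed_update(state, state_weight, batch_estimate, batch_weight, decay):
    # Blend the running estimate with the newest batch estimate, after
    # discounting the accumulated history weight by `decay`. Depending on
    # the parameterization, `state_weight` would count either batches or
    # points seen so far.
    discounted = state_weight * decay
    total = discounted + batch_weight
    new_state = (state * discounted + batch_estimate * batch_weight) / total
    return new_state, total
{code}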






[jira] [Commented] (SPARK-6722) Model import/export for StreamingKMeansModel

2015-04-07 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484772#comment-14484772
 ] 

Jeremy Freeman commented on SPARK-6722:
---

[~josephkb] Good question! Agreed it would be nice to add this. The only 
concern might be that we've been discussing with [~tdas] the possibility of 
representing local models for these algorithms a little differently, as local 
state streams, to better support driver fault tolerance. But even if we do 
that, the components of the model will be identical to what they are now, so in 
terms of what gets stored for import/export it should stay the same.

I guess my inclination is to add this support, but curious what [~tdas] thinks.

> Model import/export for StreamingKMeansModel
> 
>
> Key: SPARK-6722
> URL: https://issues.apache.org/jira/browse/SPARK-6722
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> CC: [~freeman-lab] Is this API stable enough to merit adding import/export 
> (which will require supporting the model format version from now on)?






[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms

2015-04-01 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390176#comment-14390176
 ] 

Jeremy Freeman commented on SPARK-6646:
---

Very promising, [~tdas]! We should evaluate the performance of streaming machine 
learning algorithms. In general I think running Spark in JavaScript via 
Scala.js and Node.js is extremely appealing, and it will make integration with 
visualization very straightforward.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> 
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it 
> is estimated that 90% of CPU cycles will be devoted to mobile hardware. 
> Spark’s project goal can be accomplished only when Spark runs efficiently for 
> the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, 
> Spark is unfortunately not a good fit for mobile computing today. In the past 
> few months, we have been prototyping the feasibility of a mobile-first Spark 
> architecture, and today we would like to share with you our findings. This 
> ticket outlines the technical design of Spark’s mobile support, and shares 
> results from several early prototypes.
> Mobile friendly version of the design doc: 
> https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is to retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is to retrieve and use the updated model within a 
> foreachRDD, as in:
> {code}
> model.trainOn(trainingData)
> testingData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features))
> ...print or other side effects...
> }
> {code}
> Or within a transform, as in:
> {code}
> model.trainOn(trainingData)
> val predictions = testingData.transform { rdd =>
>   val latest = model.latestModel()
>   rdd.map(lp => (lp.label, latest.predict(lp.features)))
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is retrieve and use the updated model within a 
> foreachRDD, as in:
> {code}
> model.trainOn(trainingData)
> testingData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features))
> ...print or other side effects...
> }
> {code}
> Or within a transform, as in:
> {code}
> model.trainOn(trainingData)
> val predictions = testingData.transform { rdd =>
>   val latest = model.latestModel()
>   rdd.map(lp => (lp.label, latest.predict(lp.features)))
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))

}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is retrieve the and use the updated model within a 
> foreachRDD, as in:
> {code}
> model.trainOn(trainingData)
> testingData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features))
> ...print or other side effects...
> }
> {code}
> Or within a transform, as in:
> {code}
> model.trainOn(trainingData)
> val predictions = testingData.transform { rdd =>
>   val latest = model.latestModel()
>   rdd.map(lp => (lp.label, latest.predict(lp.features)))
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the and use the updated model within a 
foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))

}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is retrieve the and use the updated model within a 
> foreachRDD, as in:
> {code}
> model.trainOn(trainingData)
> testingData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features))
> }
> {code}
> Or within a transform, as in:
> {code}
> model.trainOn(trainingData)
> val predictions = testingData.transform { rdd =>
>   val latest = model.latestModel()
>   rdd.map(lp => (lp.label, latest.predict(lp.features)))
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is retrieve the updated model within a foreachRDD, as 
> in:
> {code}
> model.trainOn(trainingData)
> testData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features))
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-6345:
--
Description: 
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.

  was:
During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.predictOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.


> Model update propagation during prediction in Streaming Regression
> --
>
> Key: SPARK-6345
> URL: https://issues.apache.org/jira/browse/SPARK-6345
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Streaming
>Reporter: Jeremy Freeman
>
> During streaming regression analyses (Streaming Linear Regression and 
> Streaming Logistic Regression), model updates based on training data are not 
> being reflected in subsequent calls to predictOn or predictOnValues, despite 
> updates themselves occurring successfully. It may be due to recent changes to 
> model declaration, and I have a working fix prepared to be submitted ASAP 
> (alongside expanded test coverage).
> A temporary workaround is retrieve the updated model within a foreachRDD, as 
> in:
> {code}
> model.trainOn(trainingData)
> testData.foreachRDD{ rdd =>
> val latest = model.latestModel()
> val predictions = rdd.map(lp => latest.predict(lp.features)
> }
> {code}
> Note that this does not affect Streaming KMeans, which works as expected for 
> combinations of training and prediction.






[jira] [Created] (SPARK-6345) Model update propagation during prediction in Streaming Regression

2015-03-15 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-6345:
-

 Summary: Model update propagation during prediction in Streaming 
Regression
 Key: SPARK-6345
 URL: https://issues.apache.org/jira/browse/SPARK-6345
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Streaming
Reporter: Jeremy Freeman


During streaming regression analyses (Streaming Linear Regression and Streaming 
Logistic Regression), model updates based on training data are not being 
reflected in subsequent calls to predictOn or predictOnValues, despite updates 
themselves occurring successfully. It may be due to recent changes to model 
declaration, and I have a working fix prepared to be submitted ASAP (alongside 
expanded test coverage).

A temporary workaround is retrieve the updated model within a foreachRDD, as in:

{code}
model.predictOn(trainingData)
testData.foreachRDD{ rdd =>
val latest = model.latestModel()
val predictions = rdd.map(lp => latest.predict(lp.features)
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for 
combinations of training and prediction.






[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM:


Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing experimental pieces of functionality. But I'd probably prioritize just 
reviewing the patch.

I also agree with the others that we should start a new PR for the new 
algorithm; 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR, if you haven't already and they're still relevant in the new 
version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.


was (Author: freeman-lab):
Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

I also agree with the others that we should start a new PR for the new 
algorithm; 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR, if you haven't already and they're still relevant in the new 
version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.
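To make the first suggested approach concrete, here is a small self-contained 
sketch of top-down, recursive application of 2-means (plain numpy, hypothetical 
names; the actual PR under review is of course far more involved):

{code:python}
import numpy as np

def two_means(points, iters=20, seed=0):
    # Tiny Lloyd's-algorithm 2-means used as the splitting step.
    rng = np.random.RandomState(seed)
    centers = points[rng.choice(len(points), 2, replace=False)].astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels

def bisecting_kmeans(points, max_leaves=4, min_size=2):
    # Top-down divisive clustering: repeatedly split the largest leaf in two
    # until the requested number of leaves is reached.
    leaves = [np.asarray(points, dtype=float)]
    while len(leaves) < max_leaves:
        idx = max(range(len(leaves)), key=lambda i: len(leaves[i]))
        if len(leaves[idx]) < 2 * min_size:
            break
        cluster = leaves.pop(idx)
        labels = two_means(cluster)
        left, right = cluster[labels == 0], cluster[labels == 1]
        if len(left) == 0 or len(right) == 0:
            leaves.append(cluster)  # degenerate split; keep the leaf and stop
            break
        leaves.extend([left, right])
    return leaves
{code}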






[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans

2015-03-11 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497
 ] 

Jeremy Freeman commented on SPARK-2429:
---

Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I 
agree with [~josephkb] that it is worth bringing this into MLlib, as the 
algorithm itself will translate to future uses, and many groups (including 
ours!) will find it useful now.

It might be worth adding to spark-packages, especially if we expect the review 
to take a while. Those seem especially useful as a way to provide easy access to 
testing new pieces of functionality. But I'd probably prioritize just reviewing 
the patch.

I also agree with the others that we should start a new PR for the new 
algorithm; 1000x faster is a lot! It is worth incorporating some of the comments 
from the old PR, if you haven't already and they're still relevant in the new 
version.

I'd be happy to go through the new PR as I'm quite familiar with the problem / 
algorithm, but it would help if you could say a little more about what you did 
so differently here, to help guide me as I look at the code.

> Hierarchical Implementation of KMeans
> -
>
> Key: SPARK-2429
> URL: https://issues.apache.org/jira/browse/SPARK-2429
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: RJ Nowling
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: clustering
> Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The 
> Result of Benchmarking a Hierarchical Clustering.pdf, 
> benchmark-result.2014-10-29.html, benchmark2.html
>
>
> Hierarchical clustering algorithms are widely used and would make a nice 
> addition to MLlib.  Clustering algorithms are useful for determining 
> relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches:
> * Top down, recursive application of KMeans
> * Reuse DecisionTree implementation with different objective function
> * Hierarchical SVD
> It was also suggested that support for distance metrics other than Euclidean 
> such as negative dot or cosine are necessary.






[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier

2015-02-20 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328726#comment-14328726
 ] 

Jeremy Freeman commented on SPARK-4144:
---

Hi all, I'd be happy to pick this up and get a prototype together in the next 
couple of weeks. It's related to the streaming logistic regression we just added 
for 1.3, though the update rule is more complex and we might not be able to 
reuse as much code (in that sense it may end up looking more like streaming 
k-means).

But I don't want to step on anyone's toes! [~liquanpei], are you working on it? 
Or are you still interested, [~cfregly]? I'm also happy to work together / 
review code.

> Support incremental model training of Naive Bayes classifier
> 
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Streaming
>Reporter: Chris Fregly
>Assignee: Liquan Pei
>
> Per Xiangrui Meng from the following user list discussion:  
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional
> probabilities, which means we should also remember the number of
> observations for the updates."






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:none}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:none}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
{code}

But this works fine:

{code:none}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:python}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
{code}

But this works fine:

{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.


> Vector conversion broken for non-float64 arrays
> ---
>
> Key: SPARK-5089
> URL: https://issues.apache.org/jira/browse/SPARK-5089
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Jeremy Freeman
>
> Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
> automatically converted to {{DenseVectors}}. If the data are numpy arrays 
> with dtype {{float64}} this works. If data are numpy arrays with lower 
> precision (e.g. {{float16}} or {{float32}}), they should be upcast to 
> {{float64}}, but due to a small bug in this line this currently doesn't 
> happen (casting is not inplace). 
> {code:none}
> if ar.dtype != np.float64:
> ar.astype(np.float64)
> {code}
>  
> Non-float64 values are in turn mangled during SerDe. This can have 
> significant consequences. For example, the following yields confusing and 
> erroneous results:
> {code:none}
> from numpy import random
> from pyspark.mllib.clustering import KMeans
> data = sc.parallelize(random.randn(100,10).astype('float32'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 5 # should be 10!
> {code}
> But this works fine:
> {code:none}
> data = sc.parallelize(random.randn(100,10).astype('float64'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 10 # this is correct
> {code}
> The fix is trivial, I'll submit a PR shortly.
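Since {{astype}} returns a new array rather than casting in place, the fix 
presumably amounts to assigning the result back, roughly as follows (hypothetical 
helper name):

{code:python}
import numpy as np

def _ensure_float64(ar):
    # astype returns a new array; the original code dropped the result, so
    # the upcast never took effect. Assigning it back fixes that.
    if ar.dtype != np.float64:
        ar = ar.astype(np.float64)
    return ar
{code}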






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to {{DenseVectors}}. If the data are numpy arrays with 
dtype {{float64}} this works. If data are numpy arrays with lower precision 
(e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but 
due to a small bug in this line this currently doesn't happen (casting is not 
inplace). 

{code:python}
if ar.dtype != np.float64:
ar.astype(np.float64)
{code}
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

{code:python}
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
{code}

But this works fine:

{code:python}
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
{code}

The fix is trivial, I'll submit a PR shortly.

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

``
if ar.dtype != np.float64:
ar.astype(np.float64)
``
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.


> Vector conversion broken for non-float64 arrays
> ---
>
> Key: SPARK-5089
> URL: https://issues.apache.org/jira/browse/SPARK-5089
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Jeremy Freeman
>
> Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
> automatically converted to {{DenseVectors}}. If the data are numpy arrays 
> with dtype {{float64}} this works. If data are numpy arrays with lower 
> precision (e.g. {{float16}} or {{float32}}), they should be upcast to 
> {{float64}}, but due to a small bug in this line this currently doesn't 
> happen (casting is not inplace). 
> {code:python}
> if ar.dtype != np.float64:
> ar.astype(np.float64)
> {code}
>  
> Non-float64 values are in turn mangled during SerDe. This can have 
> significant consequences. For example, the following yields confusing and 
> erroneous results:
> {code:python}
> from numpy import random
> from pyspark.mllib.clustering import KMeans
> data = sc.parallelize(random.randn(100,10).astype('float32'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 5 # should be 10!
> {code}
> But this works fine:
> {code:python}
> data = sc.parallelize(random.randn(100,10).astype('float64'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 10 # this is correct
> {code}
> The fix is trivial, I'll submit a PR shortly.






[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-5089:
--
Description: 
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

``
if ar.dtype != np.float64:
ar.astype(np.float64)
``
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.

  was:
Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

```
if ar.dtype != np.float64:
ar.astype(np.float64)
```
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.


> Vector conversion broken for non-float64 arrays
> ---
>
> Key: SPARK-5089
> URL: https://issues.apache.org/jira/browse/SPARK-5089
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.2.0
>Reporter: Jeremy Freeman
>
> Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
> automatically converted to `DenseVectors`. If the data are numpy arrays with 
> dtype `float64` this works. If data are numpy arrays with lower precision 
> (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to 
> a small bug in this line this currently doesn't happen (casting is not 
> inplace). 
> ``
> if ar.dtype != np.float64:
>     ar.astype(np.float64)
> ``
>  
> Non-float64 values are in turn mangled during SerDe. This can have 
> significant consequences. For example, the following yields confusing and 
> erroneous results:
> ```
> from numpy import random
> from pyspark.mllib.clustering import KMeans
> data = sc.parallelize(random.randn(100,10).astype('float32'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 5 # should be 10!
> ```
> But this works fine:
> ```
> data = sc.parallelize(random.randn(100,10).astype('float64'))
> model = KMeans.train(data, k=3)
> len(model.centers[0])
> >> 10 # this is correct
> ```
> The fix is trivial, I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5089) Vector conversion broken for non-float64 arrays

2015-01-05 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-5089:
-

 Summary: Vector conversion broken for non-float64 arrays
 Key: SPARK-5089
 URL: https://issues.apache.org/jira/browse/SPARK-5089
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Jeremy Freeman


Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are 
automatically converted to `DenseVectors`. If the data are numpy arrays with 
dtype `float64` this works. If data are numpy arrays with lower precision (e.g. 
`float16` or `float32`), they should be upcast to `float64`, but due to a small 
bug in this line this currently doesn't happen (casting is not inplace). 

```
if ar.dtype != np.float64:
    ar.astype(np.float64)
```
 
Non-float64 values are in turn mangled during SerDe. This can have significant 
consequences. For example, the following yields confusing and erroneous results:

```
from numpy import random
from pyspark.mllib.clustering import KMeans
data = sc.parallelize(random.randn(100,10).astype('float32'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 5 # should be 10!
```

But this works fine:

```
data = sc.parallelize(random.randn(100,10).astype('float64'))
model = KMeans.train(data, k=3)
len(model.centers[0])
>> 10 # this is correct
```

The fix is trivial, I'll submit a PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4981) Add a streaming singular value decomposition

2014-12-28 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4981:
-

 Summary: Add a streaming singular value decomposition
 Key: SPARK-4981
 URL: https://issues.apache.org/jira/browse/SPARK-4981
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman


This is for tracking WIP on a streaming singular value decomposition 
implementation. This will likely be more complex than the existing streaming 
algorithms (k-means, regression), but should be possible using the family of 
sequential update rules outlined in this paper:

"Fast low-rank modifications of the thin singular value decomposition"
by Matthew Brand
http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
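
A rough numpy sketch of one step in this family of updates: fold a new batch of columns into an existing truncated SVD, keeping only the left factor and singular values. This is only an illustration of the idea; the function name and structure here are assumptions, not the eventual Spark implementation:

{code:python}
import numpy as np

def update_svd(U, s, B, rank):
    """Fold a new batch of columns B (d x b) into a truncated SVD (U: d x k, s: k)."""
    proj = U.T.dot(B)                     # coordinates of B in the current subspace
    Q, R = np.linalg.qr(B - U.dot(proj))  # component of B orthogonal to that subspace
    k, b = s.size, B.shape[1]
    K = np.zeros((k + b, k + b))          # small core matrix to re-diagonalize
    K[:k, :k] = np.diag(s)
    K[:k, k:] = proj
    K[k:, k:] = R
    Up, sp, _ = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]).dot(Up)
    return U_new[:, :rank], sp[:rank]     # truncate back to the tracked rank

# toy usage: stream batches of 10-dimensional columns, track a rank-3 basis
rng = np.random.RandomState(0)
U, s, _ = np.linalg.svd(rng.randn(10, 3), full_matrices=False)
for _ in range(5):
    U, s = update_svd(U, s, rng.randn(10, 4), rank=3)
print(U.shape, s.shape)  # (10, 3) (3,)
{code}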



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4980) Add decay factors to streaming linear methods

2014-12-28 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4980:
-

 Summary: Add decay factors to streaming linear methods
 Key: SPARK-4980
 URL: https://issues.apache.org/jira/browse/SPARK-4980
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Priority: Minor


Our implementation of streaming k-means uses a decay factor that allows users 
to control how quickly the model adjusts to new data: whether it treats all 
data equally, or only bases its estimate on the most recent batch. It is 
intuitively parameterized, and can be specified in units of either batches or 
points. We should add a similar decay factor to the streaming linear methods 
using SGD, including streaming linear regression (currently implemented) and 
streaming logistic regression (in development).
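
As a sketch of what such a decay factor could look like for a linear model (purely illustrative; the names and the exact rule here are assumptions, not the proposed API), the model from earlier batches can be discounted before blending in the estimate from the latest batch:

{code:python}
import numpy as np

def decayed_update(weights, n_eff, batch_weights, batch_count, decay):
    """Blend the current model with the latest batch's estimate.
    decay=1.0 weights all history equally; decay=0.0 trusts only the new batch."""
    discounted = n_eff * decay
    total = discounted + batch_count
    return (weights * discounted + batch_weights * batch_count) / total, total

w, n = np.zeros(2), 0.0
for batch_w, m in [(np.array([9.5, 10.2]), 100), (np.array([10.1, 9.9]), 100)]:
    w, n = decayed_update(w, n, batch_w, m, decay=0.5)
print(w, n)
{code}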



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4979) Add streaming logistic regression

2014-12-28 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4979:
-

 Summary: Add streaming logistic regression
 Key: SPARK-4979
 URL: https://issues.apache.org/jira/browse/SPARK-4979
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Priority: Minor


We currently support streaming linear regression and k-means clustering. We can 
add support for streaming logistic regression using a strategy similar to that 
used in streaming linear regression, applying gradient updates to batches of 
data from a DStream, and extending the existing mllib methods with minor 
modifications.
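
A minimal numpy sketch of the per-batch update such a model would apply (one gradient step of logistic regression per incoming batch); this illustrates the strategy only, not the MLlib implementation:

{code:python}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_on_batch(w, X, y, step=0.1):
    """One logistic-regression gradient step on a single batch of data."""
    grad = X.T.dot(sigmoid(X.dot(w)) - y) / len(y)
    return w - step * grad

rng = np.random.RandomState(0)
w = np.zeros(3)
for _ in range(20):  # stand-in for successive batches arriving from a DStream
    X = rng.randn(50, 3)
    y = (X.dot(np.array([1.0, -2.0, 0.5])) > 0).astype(float)
    w = train_on_batch(w, X, y)
print(w)
{code}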



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4969) Add binaryRecords support to streaming

2014-12-25 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-4969:
-

 Summary: Add binaryRecords support to streaming
 Key: SPARK-4969
 URL: https://issues.apache.org/jira/browse/SPARK-4969
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Streaming
Affects Versions: 1.2.0
Reporter: Jeremy Freeman
Priority: Minor


As of Spark 1.2 there is support for loading fixed length records from flat 
binary files. This is a useful way to load dense numerical array data into 
Spark, especially in scientific computing applications.

We should add support for loading this same file type in Spark Streaming, both 
in Scala/Java and in Python. 
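
For the batch API that already exists in 1.2, loading fixed-length records and turning them into vectors looks roughly like this (the path is hypothetical, and an existing SparkContext {{sc}} is assumed); the streaming version proposed here would presumably expose something analogous on the StreamingContext:

{code:python}
import numpy as np

# each record is 100 float64 values = 800 bytes
record_length = 100 * 8
raw = sc.binaryRecords("hdfs:///data/arrays.bin", record_length)
vectors = raw.map(lambda rec: np.frombuffer(rec, dtype=np.float64))
print(vectors.first().shape)  # (100,)
{code}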



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)

2014-12-04 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234269#comment-14234269
 ] 

Jeremy Freeman commented on SPARK-4727:
---

Great to brainstorm about this, RJ! 

To some extent, we've been doing this over on the 
[Thunder|http://thefreemanlab.com/thunder/docs/] project. In particular, check 
out the {{TimeSeries}} and {{Images}} classes 
[here|https://github.com/freeman-lab/thunder/tree/master/python/thunder/rdds], 
which are essentially wrappers for specialized RDDs. Our basic abstraction is 
RDDs of ndarrays (1D for time series, 2D or 3D for images/volumes), with 
metadata (lazily propagated) for things like dimensionality and time base, 
coordinates embedded in keys, and useful methods on these objects like the ones 
you mention (e.g. filtering, fourier, cross-correlation). We've also worked on 
transformations between representations, for the common case of sequences of 
images corresponding to different time points.

We haven't worked on custom partition strategies yet; I think that will be most 
important for image tiles drawn from a much larger image. There's cool work 
ongoing for that in GeoTrellis, see the 
[repo|https://github.com/geotrellis/geotrellis/tree/master/spark/src/main] and 
a 
[talk|http://spark-summit.org/2014/talk/geotrellis-adding-geospatial-capabilities-to-spark]
 from Rob.

FWIW, when we started it seemed more appropriate to build this into a 
specialized library, rather than Spark core. It's also something that benefits 
from using Python, due to a bevy of existing libraries for temporal and image 
data (though there are certainly analogs in Java/Scala). But it would be great 
to probe the community for general interest in these kinds of abstractions and 
methods.
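
To give a flavor of the abstraction (hypothetical names, not Thunder's actual API): a thin wrapper over an RDD of keyed ndarrays, with array-oriented methods implemented via {{mapValues}}:

{code:python}
import numpy as np

class TimeSeriesRDD(object):
    """Wrapper over an RDD of (key, 1D ndarray) pairs (illustrative sketch)."""
    def __init__(self, rdd):
        self.rdd = rdd

    def subsample(self, n):
        return TimeSeriesRDD(self.rdd.mapValues(lambda ts: ts[::n]))

    def fourier(self):
        return TimeSeriesRDD(self.rdd.mapValues(lambda ts: np.abs(np.fft.rfft(ts))))

# usage, assuming an existing SparkContext `sc`:
# data = sc.parallelize([((i,), np.random.randn(240)) for i in range(100)])
# TimeSeriesRDD(data).subsample(2).fourier().rdd.first()
{code}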

> Add "dimensional" RDDs (time series, spatial)
> -
>
> Key: SPARK-4727
> URL: https://issues.apache.org/jira/browse/SPARK-4727
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: RJ Nowling
>
> Certain types of data (times series, spatial) can benefit from specialized 
> RDDs.  I'd like to open a discussion about this.
> For example, time series data should be ordered by time and would benefit 
> from operations like:
> * Subsampling (taking every n data points)
> * Signal processing (correlations, FFTs, filtering)
> * Windowing functions
> Spatial data benefits from ordering and partitioning along a 2D or 3D grid.  
> For example, path finding algorithms can be optimized by only comparing points 
> within a set distance, which can be computed more efficiently by partitioning 
> data into a grid.
> Although the operations on time series and spatial data may be different, 
> there is some commonality in the sense of the data having ordered dimensions 
> and the implementations may overlap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-11-26 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227266#comment-14227266
 ] 

Jeremy Freeman commented on SPARK-3995:
---

Agree with [~mengxr], that's a better strategy.

> [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
> -
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Jeremy Freeman
>Assignee: Jeremy Freeman
>Priority: Critical
> Fix For: 1.2.0
>
>
> There is a breaking bug in PySpark's sampling methods when run with NumPy 
> v1.9. This is the version of NumPy included with the current Anaconda 
> distribution (v2.1); this is a popular distribution, and is likely to affect 
> many users.
> Steps to reproduce are:
> {code:python}
> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)
> {code}
> Returns:
> {code}
> PythonException: Traceback (most recent call last):
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", 
> line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 127, in dump_stream
> for obj in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 185, in _batched
> for item in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 116, in func
> if self.getUniformSample(split) <= self._fraction:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 58, in getUniformSample
> self.initRandomGenerator(split)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 44, in initRandomGenerator
> self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
> (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
> (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
> {code}
> In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:
> {code:python}
> self._seed = seed if seed is not None else random.randint(0, sys.maxint)
> {code}
> In previous versions of NumPy a random seed larger than 2 ** 32 would 
> silently get truncated to 2 ** 32. This was fixed in a recent patch 
> (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
>  But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, 
> which effectively breaks sampling operations in PySpark (unless the seed is 
> set manually).
> I am putting a PR together now (the fix is very simple!).
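
For illustration, a standalone sketch of the kind of fix described above, keeping the default seed within the range {{numpy.random.RandomState}} accepts (the actual patch may differ in detail):

{code:python}
import random
import numpy as np

def init_seed(seed=None):
    # stay within [0, 2 ** 32 - 1] instead of [0, sys.maxint]
    return seed if seed is not None else random.randint(0, 2 ** 32 - 1)

rng = np.random.RandomState(init_seed())  # no longer raises on 64-bit platforms
{code}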



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3254) Streaming K-Means

2014-10-23 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181080#comment-14181080
 ] 

Jeremy Freeman commented on SPARK-3254:
---

I'd like it to! I've got a PR ready to submit ASAP, we can take it from there.

> Streaming K-Means
> -
>
> Key: SPARK-3254
> URL: https://issues.apache.org/jira/browse/SPARK-3254
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Jeremy Freeman
>
> Streaming K-Means with proper decay settings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark (unless the seed is set 
manually).

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK] Py

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code:python}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code:python}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which 
effectively breaks sampling operations in PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK] PySpark's sampl

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSP

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPARK]

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).





> [PYSPAR

[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Summary: [PYSPARK] PySpark's sample methods do not work with NumPy 1.9  
(was: pyspark's sample methods do not work with NumPy 1.9)

> [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
> -
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Jeremy Freeman
>Priority: Critical
>
> There is a breaking bug in PySpark's sampling methods when run with NumPy 
> v1.9. This is the version of NumPy included with the current Anaconda 
> distribution (v2.1); this is a popular distribution, and is likely to affect 
> many users.
> Steps to reproduce are:
> {code}
> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)
> {code}
> Returns:
> {code}
> PythonException: Traceback (most recent call last):
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", 
> line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 127, in dump_stream
> for obj in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 185, in _batched
> for item in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 116, in func
> if self.getUniformSample(split) <= self._fraction:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 58, in getUniformSample
> self.initRandomGenerator(split)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 44, in initRandomGenerator
> self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
> (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
> (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
> {code}
> In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:
> {code}
> self._seed = seed if seed is not None else random.randint(0, sys.maxint)
> {code}
> In previous versions of NumPy a random seed larger than 2 ** 32 would 
> silently get truncated to 2 ** 32. This was fixed in a recent patch 
> (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
>  But it means that PySpark’s code now causes an error. That line often yields 
> ints larger than 2 ** 32, which will reliably break any sampling operation in 
> PySpark.
> I am putting a PR together now (the fix is very simple!).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

In previous versions of NumPy a random seed larger than 2 ** 32 would silently 
get truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error. That line often yields 
ints larger than 2 ** 32, which will reliably break any sampling operation in 
PySpark.

I am putting a PR together now (the fix is very simple!).




  was:
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very sim

[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy Freeman updated SPARK-3995:
--
Description: 
There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. 
This is the version of NumPy included with the current Anaconda distribution 
(v2.1); this is a popular distribution, and is likely to affect many users.

Steps to reproduce are:

{code}
foo = sc.parallelize(range(1000),5)
foo.takeSample(False, 10)
{code}

Returns:

{code}
PythonException: Traceback (most recent call last):
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 
79, in main
serializer.dump_stream(func(split_index, iterator), outfile)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 196, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 127, in dump_stream
for obj in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", 
line 185, in _batched
for item in iterator:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 116, in func
if self.getUniformSample(split) <= self._fraction:
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 58, in getUniformSample
self.initRandomGenerator(split)
  File 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", 
line 44, in initRandomGenerator
self._random = numpy.random.RandomState(self._seed)
  File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397)
  File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
(numpy/random/mtrand/mtrand.c:7697)
ValueError: Seed must be between 0 and 4294967295
{code}

In previous versions of NumPy a random seed larger than 32 would silently get 
truncated to 2 ** 32. This was fixed in a recent patch 
(https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c).
 But it means that PySpark’s code now causes an error, because in the 
RDDSamplerBase class from pyspark.rddsampler, we use:

{code}
self._seed = seed if seed is not None else random.randint(0, sys.maxint)
{code}

And this often yields ints larger than 2 ** 32. Effectively, this reliably 
breaks any sampling operation in PySpark with this NumPy version.

I am putting a PR together now (the fix is very simple!).




> pyspark's sample methods do not work with NumPy 1.9
> ---
>
> Key: SPARK-3995
> URL: https://issues.apache.org/jira/browse/SPARK-3995
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Jeremy Freeman
>Priority: Critical
>
> There is a breaking bug in PySpark's sampling methods when run with NumPy 
> v1.9. This is the version of NumPy included with the current Anaconda 
> distribution (v2.1); this is a popular distribution, and is likely to affect 
> many users.
> Steps to reproduce are:
> {code}
> foo = sc.parallelize(range(1000),5)
> foo.takeSample(False, 10)
> {code}
> Returns:
> {code}
> PythonException: Traceback (most recent call last):
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", 
> line 79, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 196, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 127, in dump_stream
> for obj in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py",
>  line 185, in _batched
> for item in iterator:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 116, in func
> if self.getUniformSample(split) <= self._fraction:
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 58, in getUniformSample
> self.initRandomGenerator(split)
>   File 
> "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py",
>  line 44, in initRandomGenerator
> self._random = numpy.random.RandomState(self._seed)
>   File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
> (numpy/random/mtrand/mtrand.c:7397)
>   File "mtrand.pyx", line 646, in mtrand.RandomState.seed 
> (numpy/random/mtrand/mtrand.c:7697)
> ValueError: Seed must be between 0 and 4294967295
> {code}
> In previous versions of NumPy a random seed larger than 32 would silently get 
> truncated to 2 ** 32

[jira] [Created] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9

2014-10-17 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-3995:
-

 Summary: pyspark's sample methods do not work with NumPy 1.9
 Key: SPARK-3995
 URL: https://issues.apache.org/jira/browse/SPARK-3995
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3254) Streaming K-Means

2014-08-28 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299
 ] 

Jeremy Freeman edited comment on SPARK-3254 at 8/28/14 8:38 PM:


Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/blob/streaming-kmeans/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala


was (Author: freeman-lab):
Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/tree/streaming-kmeans

> Streaming K-Means
> -
>
> Key: SPARK-3254
> URL: https://issues.apache.org/jira/browse/SPARK-3254
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Jeremy Freeman
>
> Streaming K-Means with proper decay settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3254) Streaming K-Means

2014-08-28 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299
 ] 

Jeremy Freeman commented on SPARK-3254:
---

Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/tree/streaming-kmeans
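
A rough numpy sketch of the decay-weighted center update discussed in the doc, for a single center and a single batch (parameter names here are illustrative; see the doc and code above for the real parameterization):

{code:python}
import numpy as np

def update_center(center, n, batch_points, decay):
    """Blend an existing center with the points assigned to it in the latest batch."""
    m = len(batch_points)
    if m == 0:
        return center, n * decay
    batch_mean = batch_points.mean(axis=0)
    n_new = n * decay + m
    return (center * n * decay + batch_mean * m) / n_new, n_new

c, n = np.array([0.0, 0.0]), 0.0
c, n = update_center(c, n, np.random.randn(50, 2) + 5.0, decay=0.9)
print(c, n)
{code}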

> Streaming K-Means
> -
>
> Key: SPARK-3254
> URL: https://issues.apache.org/jira/browse/SPARK-3254
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Jeremy Freeman
>
> Streaming K-Means with proper decay settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3128) Use streaming test suite for StreamingLR

2014-08-19 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-3128:
-

 Summary: Use streaming test suite for StreamingLR
 Key: SPARK-3128
 URL: https://issues.apache.org/jira/browse/SPARK-3128
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, Streaming
Affects Versions: 1.1.0
Reporter: Jeremy Freeman
Priority: Minor


Unit tests for Streaming Linear Regression currently use file writing to 
generate input data and a TextFileStream to read the data. It would be better 
to use existing utilities from the streaming test suite to simulate DStreams 
and collect and evaluate results of DStream operations. This will make tests 
faster, simpler, and easier to maintain / extend.
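
To make the idea concrete, here is a rough PySpark sketch of simulating batches 
in memory (the actual change targets the Scala streaming test utilities, so the 
code below is only illustrative):

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingTestSketch")
ssc = StreamingContext(sc, batchDuration=1)

# Each RDD in the list is served as one batch of the DStream, so no
# temporary files or TextFileStream are needed.
batches = [sc.parallelize(range(i, i + 5)) for i in range(0, 15, 5)]
collected = []

stream = ssc.queueStream(batches)
stream.foreachRDD(lambda rdd: collected.append(rdd.collect()))

ssc.start()
ssc.awaitTerminationOrTimeout(10)   # bounded wait instead of a fixed sleep
ssc.stop(stopSparkContext=True)
print(collected)                    # e.g. [[0, 1, 2, 3, 4], [5, ...], [10, ...]]
{code}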



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays

2014-07-30 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079984#comment-14079984
 ] 

Jeremy Freeman commented on SPARK-2012:
---

[~davies] cool, that definitely makes sense to me. Shall I put a PR together 
doing it that way?

> PySpark StatCounter with numpy arrays
> -
>
> Key: SPARK-2012
> URL: https://issues.apache.org/jira/browse/SPARK-2012
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.0.0
>Reporter: Jeremy Freeman
>Priority: Minor
>
> In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy 
> arrays just as with an RDD of scalars, which was very useful (e.g. for 
> computing stats on a set of vectors in ML analyses). In 1.0.0 this broke 
> because the added functionality for computing the minimum and maximum, as 
> implemented, doesn't work on arrays.
> I have a PR ready that re-enables this functionality by having StatCounter 
> use the numpy element-wise functions "maximum" and "minimum", which work on 
> both numpy arrays and scalars (and I've added new tests for this capability). 
> However, I realize this adds a dependency on NumPy outside of MLLib. If 
> that's not ok, maybe it'd be worth adding this functionality as a util within 
> PySpark MLLib?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-21 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069758#comment-14069758
 ] 

Jeremy Freeman edited comment on SPARK-2282 at 7/22/14 3:18 AM:


Hi all, I'm "the scientist". A couple of updates from more real-world testing, 
and it's looking very promising!

- Set-up: a 60-node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), with data cached and 
counted before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations; with the 
patch, 20+ iterations complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed; with the patch, 20+ 
iterations complete.

So it's looking really good. I can also try the other extreme (a very small 
cluster) to see whether that issue manifests. Aaron, big thanks for helping with 
this; it's a big deal for our workflows, so it's really terrific to get to the 
bottom of it!

-- Jeremy


was (Author: freeman-lab):
Hi all, I'm "the scientist". A couple of updates from more real-world testing, 
and it's looking very promising!

- Set-up: a 60-node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), with data cached and 
counted before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations; with the 
patch, 20+ iterations complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed (!); with the patch, 20+ 
iterations complete.

So it's looking really good. I can also try the other extreme (a very small 
cluster) to see whether that issue manifests. Aaron, big thanks for helping with 
this; it's a big deal for our workflows, so it's really terrific to get to the 
bottom of it!

-- Jeremy

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.
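
For illustration only, this is how SO_REUSEADDR looks in Python terms (the 
socket in question is created on the JVM side in PythonAccumulatorParam, so 
treat this purely as a sketch of the option, not the eventual fix):

{code}
import socket

# Sketch: enable SO_REUSEADDR on a socket before it is used.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR))  # 1 = enabled
s.close()
{code}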



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-21 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069758#comment-14069758
 ] 

Jeremy Freeman commented on SPARK-2282:
---

Hi all, I'm "the scientist". A couple of updates from more real-world testing, 
and it's looking very promising!

- Set-up: a 60-node cluster, an analysis with iterative updates (essentially a 
sequence of two map-reduce steps on each iteration), with data cached and 
counted before starting iterations
- 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. 
Before the patch I reliably hit the error after about 5 iterations; with the 
patch, 20+ iterations complete.
- 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. 
Before the patch more than one iteration always failed (!); with the patch, 20+ 
iterations complete.

So it's looking really good. I can also try the other extreme (a very small 
cluster) to see whether that issue manifests. Aaron, big thanks for helping with 
this; it's a big deal for our workflows, so it's really terrific to get to the 
bottom of it!

-- Jeremy

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2438) Streaming + MLLib

2014-07-10 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-2438:
-

 Summary: Streaming + MLLib
 Key: SPARK-2438
 URL: https://issues.apache.org/jira/browse/SPARK-2438
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Jeremy Freeman


This is a ticket to track progress on developing streaming analyses in MLLib.

Many streaming applications benefit from or require fitting models online, 
where the parameters of a model (e.g. regression, clustering) are updated 
continually as new data arrive. This can be accomplished by incorporating MLLib 
algorithms into model-updating operations over DStreams. In some cases this can 
be achieved using existing updaters (e.g. those based on SGD), but in other 
cases it will require custom update rules (e.g. for KMeans). The goal is to have 
streaming versions of many common algorithms, in particular regression, 
classification, clustering, and possibly dimensionality reduction.
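
The general pattern, sketched below in PySpark-style pseudocode (names are 
illustrative, not an actual API), is to refit or update the model on each batch, 
warm-started from the current parameters:

{code}
# Hypothetical sketch: update a model's weights on every batch of a DStream.
def train_on(dstream, model, update_fn):
    def _update(time, rdd):
        if not rdd.isEmpty():
            # e.g. a few passes of SGD starting from the current weights,
            # or a custom rule such as a decayed-mean update for k-means
            model.weights = update_fn(rdd, model.weights)
    dstream.foreachRDD(_update)
{code}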



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2012) PySpark StatCounter with numpy arrays

2014-06-03 Thread Jeremy Freeman (JIRA)
Jeremy Freeman created SPARK-2012:
-

 Summary: PySpark StatCounter with numpy arrays
 Key: SPARK-2012
 URL: https://issues.apache.org/jira/browse/SPARK-2012
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Jeremy Freeman
Priority: Minor


In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy 
arrays just as with an RDD of scalars, which was very useful (e.g. for 
computing stats on a set of vectors in ML analyses). In 1.0.0 this broke 
because the added functionality for computing the minimum and maximum, as 
implemented, doesn't work on arrays.

I have a PR ready that re-enables this functionality by having StatCounter use 
the numpy element-wise functions "maximum" and "minimum", which work on both 
numpy arrays and scalars (and I've added new tests for this capability). 

However, I realize this adds a dependency on NumPy outside of MLLib. If that's 
not ok, maybe it'd be worth adding this functionality as a util within PySpark 
MLLib?
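
For example (a small sketch of the element-wise behavior, not the PR itself):

{code}
import numpy as np

# np.minimum / np.maximum work element-wise on arrays and also on scalars,
# unlike comparisons with the builtin min / max, which are ambiguous for arrays.
running_min = np.array([1.0, 5.0, 3.0])
running_max = np.array([1.0, 5.0, 3.0])
new_value = np.array([2.0, 4.0, 6.0])

running_min = np.minimum(running_min, new_value)   # [1., 4., 3.]
running_max = np.maximum(running_max, new_value)   # [2., 5., 6.]
print(running_min, running_max)
{code}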



--
This message was sent by Atlassian JIRA
(v6.2#6252)