[jira] [Commented] (SPARK-9461) Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
[ https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14648552#comment-14648552 ] Jeremy Freeman commented on SPARK-9461:
---------------------------------------
Ah, great to hear!

> Possibly slightly flaky PySpark StreamingLinearRegressionWithTests
> ------------------------------------------------------------------
>
> Key: SPARK-9461
> URL: https://issues.apache.org/jira/browse/SPARK-9461
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 1.5.0
> Reporter: Joseph K. Bradley
> Assignee: Jeremy Freeman
>
> [~freeman-lab]
> Check out this failure:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38913/consoleFull]
> It should be deterministic, but do you think it's just slight variations caused by the Python version? Or do you think it's something odd going on with streaming? This is the only time I've seen this happen, but I'll post again if I see it more.
> Test failure message:
> {code}
> ======================================================================
> FAIL: test_parameter_accuracy (__main__.StreamingLinearRegressionWithTests)
> Test that coefs are predicted accurately by fitting on toy data.
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 1282, in test_parameter_accuracy
>     slr.latestModel().weights.array, [10., 10.], 1)
>   File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/mllib/tests.py", line 1257, in assertArrayAlmostEqual
>     self.assertAlmostEqual(i, j, dec)
> AssertionError: 9.4243238731093655 != 9.3216175551722014 within 1 places
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647884#comment-14647884 ] Jeremy Freeman commented on SPARK-9461:
---------------------------------------
Interesting. The lack of failures on KMeans is consistent with the completion idea, because those tests use extremely small toy data (< 10 data points), whereas the regression tests use hundreds of test data points.

On this (and similar) lines: https://github.com/apache/spark/blob/master/python/pyspark/mllib/tests.py#L1161 there's a parameter `end_time`, the time in seconds to wait for the stream to complete. Looking across these tests, the value fluctuates (5, 10, 15, 20), suggesting it was hand-tuned, possibly tailored to a local test environment. Bumping that number up for any of the tests showing occasional errors might fix it, though that's a little ad hoc.

I think things are more robust on the Scala side because there's a full-blown streaming test class that lets test jobs either run to completion or stop at a maximum timeout (https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala). So there's just one test-wide parameter, the maximum timeout, and we could safely set that pretty high without wasting time.
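The single max-timeout idea from TestSuiteBase could be ported to the Python tests with a small polling helper along these lines. This is a plain-Python sketch; the names `eventually` and `fake_batch` are illustrative, not existing PySpark test utilities:

```python
import time

def eventually(condition, timeout=30.0, interval=0.1):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True as soon as the condition holds, False on timeout. A test
    using this finishes early when the stream completes, and only the one
    test-wide `timeout` needs tuning (and can be set generously high).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Hypothetical usage: wait for a streaming job to finish all of its batch
# updates instead of sleeping a hand-tuned `end_time`.
updates = []

def fake_batch():
    updates.append(1)          # stand-in for one streaming batch update
    return len(updates) >= 5   # done once all 5 batches are processed

assert eventually(fake_batch, timeout=5.0)
```

The advantage over a fixed `end_time` is that a slow CI machine only delays the test up to the timeout, while a fast one is never forced to wait.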
[ https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647220#comment-14647220 ] Jeremy Freeman commented on SPARK-9461:
---------------------------------------
I'd definitely be curious to see whether these two tests pass again with a relaxed tolerance (say double the tolerance, which would have fixed the one in your initial comment) while we get to the core of the issue. Maybe try that on your PR?

The only other algorithm to consider making the change in is StreamingKMeans, but the tests there are closer to toy examples, and I think they will be less susceptible to small differences in convergence (assuming that's what's going on here). Have you noticed any of those failing as well?
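To make "relax the tolerance" concrete, here is a delta-based sketch of an array comparison. The name mirrors the `assertArrayAlmostEqual` helper in pyspark/mllib/tests.py, but this is an illustration, not the actual implementation (which delegates to unittest's decimal-places check):

```python
def assert_array_almost_equal(actual, expected, tol):
    """Check element-wise |actual - expected| <= tol.

    An absolute-delta check like this is easy to relax: doubling `tol`
    doubles the allowed error, whereas unittest's `places` argument can
    only be loosened a whole decimal place at a time.
    """
    for a, e in zip(actual, expected):
        assert abs(a - e) <= tol, "%r != %r within %r" % (a, e, tol)

# The reported failure was off by ~0.103: outside one decimal place
# (roughly a 0.05 delta), but accepted by a relaxed delta of 0.15.
assert_array_almost_equal([9.4243238731093655], [9.3216175551722014], 0.15)
```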
[ https://issues.apache.org/jira/browse/SPARK-9461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647186#comment-14647186 ] Jeremy Freeman commented on SPARK-9461:
---------------------------------------
A couple of initial observations:
- The Scala tests of the same algorithms are all passing, so if it's something deeper in streaming, it's probably specific to PySpark Streaming.
- The estimate and target are close, just not quite within the tolerance. As these are approximations, there will be some error, but as you say, this should be deterministic and was passing before.
- In developing the original Scala tests, I noticed this could occur if, for some reason, the sequence of streaming updates did not complete in the available time (so the model didn't quite converge). An error elsewhere could conceivably cause that to happen.

ccing [~MechCoder], who did the PySpark StreamingML bindings: did you notice these to be at all flaky during development?
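The third observation can be illustrated with a deterministic toy (plain Python, not Spark code): a streaming-style SGD fit of y = 10 * x only reaches the target weight if enough batch updates run before the test's deadline. The helper name and constants here are made up for the illustration:

```python
def run_batches(n_batches, step=0.1, points_per_batch=10):
    """Fit w for y = 10 * x by SGD over a fixed stream of batches."""
    w = 0.0
    for _ in range(n_batches):
        for i in range(points_per_batch):
            x = 0.5 if i % 2 == 0 else 1.0   # deterministic toy inputs
            err = w * x - 10.0 * x           # prediction error
            w -= step * err * x              # SGD step on squared error
    return w

cut_short = run_batches(2)    # stream interrupted early: still well below 10
completed = run_batches(25)   # stream ran to completion: essentially 10
```

With the stream cut short the weight lands near 7.3, which is exactly the shape of the reported failure: close to the target, but outside a tight tolerance.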
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14642207#comment-14642207 ] Jeremy Freeman commented on SPARK-4144:
---------------------------------------
Sorry Chris! Why don't you have a go? I'd be happy to review anything you put together.

> Support incremental model training of Naive Bayes classifier
> ------------------------------------------------------------
>
> Key: SPARK-4144
> URL: https://issues.apache.org/jira/browse/SPARK-4144
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, Streaming
> Reporter: Chris Fregly
> Assignee: Jeremy Freeman
>
> Per Xiangrui Meng from the following user list discussion:
> http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E
>
> "For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the updates."
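The bookkeeping the quote describes can be sketched in a few lines: keep raw counts rather than normalized probabilities, so priors and conditionals can be re-derived after every batch. This is an illustrative outline, not MLlib code, and the class and method names are hypothetical:

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Toy multinomial Naive Bayes that supports incremental updates."""

    def __init__(self):
        self.class_counts = defaultdict(int)                    # observations per class
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.n = 0                                              # total observations

    def update(self, batch):
        """batch: iterable of (label, {feature: count}) pairs."""
        for label, feats in batch:
            self.class_counts[label] += 1
            self.n += 1
            for f, c in feats.items():
                self.feature_counts[label][f] += c

    def prior(self, label):
        # Re-derived from counts, so it stays correct as batches arrive.
        return self.class_counts[label] / float(self.n)

nb = IncrementalNaiveBayes()
nb.update([("spam", {"buy": 2}), ("ham", {"hi": 1})])
nb.update([("spam", {"buy": 1, "now": 1})])   # second streaming batch
# After both batches, P(spam) = 2/3, reflecting all observations seen so far.
```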
[jira] [Commented] (SPARK-4980) Add decay factors to streaming linear methods
[ https://issues.apache.org/jira/browse/SPARK-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640714#comment-14640714 ] Jeremy Freeman commented on SPARK-4980:
---------------------------------------
That's terrific! Happy to have you work on it. Thanks for sharing the doc; I'll review ASAP and respond with comments and thoughts.

> Add decay factors to streaming linear methods
> ---------------------------------------------
>
> Key: SPARK-4980
> URL: https://issues.apache.org/jira/browse/SPARK-4980
> Project: Spark
> Issue Type: New Feature
> Components: MLlib, Streaming
> Reporter: Jeremy Freeman
> Priority: Minor
>
> Our implementation of streaming k-means uses a decay factor that allows users to control how quickly the model adjusts to new data: whether it treats all data equally, or bases its estimate only on the most recent batch. It is intuitively parameterized, and can be specified in units of either batches or points. We should add a similar decay factor to the streaming linear methods using SGD, including streaming linear regression (currently implemented) and streaming logistic regression (in development).
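The decay behavior being proposed can be sketched with the forgetful weighted-mean update behind streaming k-means. This is an illustration of the idea, not MLlib's implementation; the function name is made up:

```python
def update_mean(old_mean, old_weight, batch_mean, batch_weight, decay):
    """One forgetful update of a running mean.

    With decay in [0, 1], the weight of everything seen so far is
    discounted by `decay` before the new batch is folded in:
      decay = 1.0 -> all data counts equally,
      decay = 0.0 -> only the newest batch matters.
    """
    discounted = old_weight * decay
    total = discounted + batch_weight
    new_mean = (old_mean * discounted + batch_mean * batch_weight) / total
    return new_mean, total

# Old estimate 0.0 built from 10 points; new batch of 10 points at 4.0.
m_all, w_all = update_mean(0.0, 10.0, 4.0, 10.0, decay=1.0)  # balanced: 2.0
m_new, w_new = update_mean(0.0, 10.0, 4.0, 10.0, decay=0.0)  # newest only: 4.0
```

The appeal for the SGD-based linear methods is the same intuitive parameterization: one number controls how quickly old batches stop influencing the estimate.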
[jira] [Commented] (SPARK-6722) Model import/export for StreamingKMeansModel
[ https://issues.apache.org/jira/browse/SPARK-6722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484772#comment-14484772 ] Jeremy Freeman commented on SPARK-6722:
---------------------------------------
[~josephkb] Good question! Agreed it would be nice to add this. The only concern might be that we've been discussing with [~tdas] the possibility of representing local models for these algorithms a little differently, as local state streams, to better support driver fault tolerance. But even if we do that, the components of the model will be identical to what they are now, so what gets stored for import/export should stay the same. My inclination is to add this support, but I'm curious what [~tdas] thinks.

> Model import/export for StreamingKMeansModel
> --------------------------------------------
>
> Key: SPARK-6722
> URL: https://issues.apache.org/jira/browse/SPARK-6722
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> CC: [~freeman-lab] Is this API stable enough to merit adding import/export (which will require supporting the model format version from now on)?
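The point about the stored format staying stable can be sketched as follows: however the driver-side representation evolves (e.g. a local state stream), a streaming k-means model is described by its cluster centers and per-cluster weights, so a versioned export of just those components is unaffected. Hypothetical sketch, not Spark's actual export format:

```python
import json

def export_model(centers, weights, version="1.0"):
    """Serialize the model components along with a format version."""
    return json.dumps({"version": version,
                       "centers": centers,
                       "weights": weights})

def import_model(blob):
    """Recover the components; `version` allows future format changes."""
    data = json.loads(blob)
    return data["centers"], data["weights"]

blob = export_model([[0.0, 0.0], [5.0, 5.0]], [10.0, 12.0])
centers, weights = import_model(blob)   # round-trips the model components
```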
[jira] [Commented] (SPARK-6646) Spark 2.0: Rearchitecting Spark for Mobile Platforms
[ https://issues.apache.org/jira/browse/SPARK-6646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390176#comment-14390176 ] Jeremy Freeman commented on SPARK-6646:
---------------------------------------
Very promising [~tdas]! We should evaluate the performance of streaming machine learning algorithms. In general I think running Spark in javascript via scala.js and node.js is extremely appealing, and it will make integration with visualization very straightforward.

> Spark 2.0: Rearchitecting Spark for Mobile Platforms
> ----------------------------------------------------
>
> Key: SPARK-6646
> URL: https://issues.apache.org/jira/browse/SPARK-6646
> Project: Spark
> Issue Type: Improvement
> Components: Project Infra
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Priority: Blocker
> Attachments: Spark on Mobile - Design Doc - v1.pdf
>
> Mobile computing is quickly rising to dominance, and by the end of 2017, it is estimated that 90% of CPU cycles will be devoted to mobile hardware. Spark's project goal can be accomplished only when Spark runs efficiently for the growing population of mobile users.
> Designed and optimized for modern data centers and Big Data applications, Spark is unfortunately not a good fit for mobile computing today. In the past few months, we have been prototyping the feasibility of a mobile-first Spark architecture, and today we would like to share with you our findings. This ticket outlines the technical design of Spark's mobile support, and shares results from several early prototypes.
> Mobile friendly version of the design doc: https://databricks.com/blog/2015/04/01/spark-2-rearchitecting-spark-for-mobile.html
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345:
----------------------------------
Description:
During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite the updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage).

A temporary workaround is to retrieve and use the updated model within a foreachRDD, as in:

{code}
model.trainOn(trainingData)
testingData.foreachRDD { rdd =>
  val latest = model.latestModel()
  val predictions = rdd.map(lp => latest.predict(lp.features))
  // ...print or other side effects...
}
{code}

Or within a transform, as in:

{code}
model.trainOn(trainingData)
val predictions = testingData.transform { rdd =>
  val latest = model.latestModel()
  rdd.map(lp => (lp.label, latest.predict(lp.features)))
}
{code}

Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction.
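The essence of the workaround is to fetch the latest model inside each batch closure rather than binding a model reference once up front. A minimal illustration of that distinction in plain Python (not Spark; the class and names are made up for the example):

```python
class ModelWrapper:
    """Stand-in for a streaming algorithm whose model is replaced on update."""

    def __init__(self):
        self.model = {"w": 0.0}

    def latest_model(self):
        return self.model

    def train(self):
        # The update replaces the model object rather than mutating it,
        # so previously captured references never see new weights.
        self.model = {"w": self.model["w"] + 1.0}

wrapper = ModelWrapper()
stale = wrapper.latest_model()   # bound before training: never sees updates
wrapper.train()                  # one "batch" of training arrives
fresh = wrapper.latest_model()   # fetched after the batch: up to date
# stale["w"] is still 0.0, while fresh["w"] is 1.0
```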
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345: -- Description: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve and use the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testingData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) ...print or other side effects... } {code} Or within a transform, as in: {code} model.trainOn(trainingData) val predictions = testingData.transform { rdd => val latest = model.latestModel() rdd.map(lp => (lp.label, latest.predict(lp.features))) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. was: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the and use the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testingData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) ...print or other side effects... 
} {code} Or within a transform, as in: {code} model.trainOn(trainingData) val predictions = testingData.transform { rdd => val latest = model.latestModel() rdd.map(lp => (lp.label, latest.predict(lp.features))) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. > Model update propagation during prediction in Streaming Regression > -- > > Key: SPARK-6345 > URL: https://issues.apache.org/jira/browse/SPARK-6345 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Jeremy Freeman > > During streaming regression analyses (Streaming Linear Regression and > Streaming Logistic Regression), model updates based on training data are not > being reflected in subsequent calls to predictOn or predictOnValues, despite > updates themselves occurring successfully. It may be due to recent changes to > model declaration, and I have a working fix prepared to be submitted ASAP > (alongside expanded test coverage). > A temporary workaround is retrieve and use the updated model within a > foreachRDD, as in: > {code} > model.trainOn(trainingData) > testingData.foreachRDD{ rdd => > val latest = model.latestModel() > val predictions = rdd.map(lp => latest.predict(lp.features)) > ...print or other side effects... > } > {code} > Or within a transform, as in: > {code} > model.trainOn(trainingData) > val predictions = testingData.transform { rdd => > val latest = model.latestModel() > rdd.map(lp => (lp.label, latest.predict(lp.features))) > } > {code} > Note that this does not affect Streaming KMeans, which works as expected for > combinations of training and prediction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345: -- Description: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the and use the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testingData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) ...print or other side effects... } {code} Or within a transform, as in: {code} model.trainOn(trainingData) val predictions = testingData.transform { rdd => val latest = model.latestModel() rdd.map(lp => (lp.label, latest.predict(lp.features))) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. was: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). 
A temporary workaround is retrieve the and use the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testingData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) } {code} Or within a transform, as in: {code} model.trainOn(trainingData) val predictions = testingData.transform { rdd => val latest = model.latestModel() rdd.map(lp => (lp.label, latest.predict(lp.features))) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. > Model update propagation during prediction in Streaming Regression > -- > > Key: SPARK-6345 > URL: https://issues.apache.org/jira/browse/SPARK-6345 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Jeremy Freeman > > During streaming regression analyses (Streaming Linear Regression and > Streaming Logistic Regression), model updates based on training data are not > being reflected in subsequent calls to predictOn or predictOnValues, despite > updates themselves occurring successfully. It may be due to recent changes to > model declaration, and I have a working fix prepared to be submitted ASAP > (alongside expanded test coverage). > A temporary workaround is retrieve the and use the updated model within a > foreachRDD, as in: > {code} > model.trainOn(trainingData) > testingData.foreachRDD{ rdd => > val latest = model.latestModel() > val predictions = rdd.map(lp => latest.predict(lp.features)) > ...print or other side effects... > } > {code} > Or within a transform, as in: > {code} > model.trainOn(trainingData) > val predictions = testingData.transform { rdd => > val latest = model.latestModel() > rdd.map(lp => (lp.label, latest.predict(lp.features))) > } > {code} > Note that this does not affect Streaming KMeans, which works as expected for > combinations of training and prediction. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345: -- Description: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the and use the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testingData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) } {code} Or within a transform, as in: {code} model.trainOn(trainingData) val predictions = testingData.transform { rdd => val latest = model.latestModel() rdd.map(lp => (lp.label, latest.predict(lp.features))) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. was: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. 
> Model update propagation during prediction in Streaming Regression > -- > > Key: SPARK-6345 > URL: https://issues.apache.org/jira/browse/SPARK-6345 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Jeremy Freeman > > During streaming regression analyses (Streaming Linear Regression and > Streaming Logistic Regression), model updates based on training data are not > being reflected in subsequent calls to predictOn or predictOnValues, despite > updates themselves occurring successfully. It may be due to recent changes to > model declaration, and I have a working fix prepared to be submitted ASAP > (alongside expanded test coverage). > A temporary workaround is retrieve the and use the updated model within a > foreachRDD, as in: > {code} > model.trainOn(trainingData) > testingData.foreachRDD{ rdd => > val latest = model.latestModel() > val predictions = rdd.map(lp => latest.predict(lp.features)) > } > {code} > Or within a transform, as in: > {code} > model.trainOn(trainingData) > val predictions = testingData.transform { rdd => > val latest = model.latestModel() > rdd.map(lp => (lp.label, latest.predict(lp.features))) > } > {code} > Note that this does not affect Streaming KMeans, which works as expected for > combinations of training and prediction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345: -- Description: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features)) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. was: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is retrieve the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. 
> Model update propagation during prediction in Streaming Regression > -- > > Key: SPARK-6345 > URL: https://issues.apache.org/jira/browse/SPARK-6345 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Jeremy Freeman > > During streaming regression analyses (Streaming Linear Regression and > Streaming Logistic Regression), model updates based on training data are not > being reflected in subsequent calls to predictOn or predictOnValues, despite > updates themselves occurring successfully. It may be due to recent changes to > model declaration, and I have a working fix prepared to be submitted ASAP > (alongside expanded test coverage). > A temporary workaround is to retrieve the updated model within a foreachRDD, as > in: > {code} > model.trainOn(trainingData) > testData.foreachRDD{ rdd => > val latest = model.latestModel() > val predictions = rdd.map(lp => latest.predict(lp.features)) > } > {code} > Note that this does not affect Streaming KMeans, which works as expected for > combinations of training and prediction.
[jira] [Updated] (SPARK-6345) Model update propagation during prediction in Streaming Regression
[ https://issues.apache.org/jira/browse/SPARK-6345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-6345: -- Description: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is to retrieve the updated model within a foreachRDD, as in: {code} model.trainOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction. was: During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is to retrieve the updated model within a foreachRDD, as in: {code} model.predictOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction.
> Model update propagation during prediction in Streaming Regression > -- > > Key: SPARK-6345 > URL: https://issues.apache.org/jira/browse/SPARK-6345 > Project: Spark > Issue Type: Bug > Components: MLlib, Streaming >Reporter: Jeremy Freeman > > During streaming regression analyses (Streaming Linear Regression and > Streaming Logistic Regression), model updates based on training data are not > being reflected in subsequent calls to predictOn or predictOnValues, despite > updates themselves occurring successfully. It may be due to recent changes to > model declaration, and I have a working fix prepared to be submitted ASAP > (alongside expanded test coverage). > A temporary workaround is to retrieve the updated model within a foreachRDD, as > in: > {code} > model.trainOn(trainingData) > testData.foreachRDD{ rdd => > val latest = model.latestModel() > val predictions = rdd.map(lp => latest.predict(lp.features) > } > {code} > Note that this does not affect Streaming KMeans, which works as expected for > combinations of training and prediction.
[jira] [Created] (SPARK-6345) Model update propagation during prediction in Streaming Regression
Jeremy Freeman created SPARK-6345: - Summary: Model update propagation during prediction in Streaming Regression Key: SPARK-6345 URL: https://issues.apache.org/jira/browse/SPARK-6345 Project: Spark Issue Type: Bug Components: MLlib, Streaming Reporter: Jeremy Freeman During streaming regression analyses (Streaming Linear Regression and Streaming Logistic Regression), model updates based on training data are not being reflected in subsequent calls to predictOn or predictOnValues, despite updates themselves occurring successfully. It may be due to recent changes to model declaration, and I have a working fix prepared to be submitted ASAP (alongside expanded test coverage). A temporary workaround is to retrieve the updated model within a foreachRDD, as in: {code} model.predictOn(trainingData) testData.foreachRDD{ rdd => val latest = model.latestModel() val predictions = rdd.map(lp => latest.predict(lp.features) } {code} Note that this does not affect Streaming KMeans, which works as expected for combinations of training and prediction.
[jira] [Comment Edited] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497 ] Jeremy Freeman edited comment on SPARK-2429 at 3/11/15 8:10 PM: Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take a while. Those seem especially useful as a way to provide easy access to testing experimental pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm, 1000x faster is a lot! It is worth incorporating some of the comments from the old PR if you haven't already, if relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. was (Author: freeman-lab): Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take a while. Those seem especially useful as a way to provide easy access to testing new pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm, 1000x faster is a lot! It is worth incorporating some of the comments from the old PR if you haven't already, if relevant in the new version.
I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. > Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary.
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357497#comment-14357497 ] Jeremy Freeman commented on SPARK-2429: --- Thanks for the update and contribution [~yuu.ishik...@gmail.com]! I think I agree with [~josephkb] that it is worth bringing this into MLlib, as the algorithm itself will translate to future uses, and many groups (including ours!) will find it useful now. It might be worth adding to spark-packages, especially if we expect the review to take a while. Those seem especially useful as a way to provide easy access to testing new pieces of functionality. But I'd probably prioritize just reviewing the patch. Also agree with the others that we should start a new PR with the new algorithm, 1000x faster is a lot! It is worth incorporating some of the comments from the old PR if you haven't already, if relevant in the new version. I'd be happy to go through the new PR as I'm quite familiar with the problem / algorithm, but it would help if you could say a little more about what you did so differently here, to help guide me as I look at the code. > Hierarchical Implementation of KMeans > - > > Key: SPARK-2429 > URL: https://issues.apache.org/jira/browse/SPARK-2429 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: RJ Nowling >Assignee: Yu Ishikawa >Priority: Minor > Labels: clustering > Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The > Result of Benchmarking a Hierarchical Clustering.pdf, > benchmark-result.2014-10-29.html, benchmark2.html > > > Hierarchical clustering algorithms are widely used and would make a nice > addition to MLlib. Clustering algorithms are useful for determining > relationships between clusters as well as offering faster assignment. 
> Discussion on the dev list suggested the following possible approaches: > * Top down, recursive application of KMeans > * Reuse DecisionTree implementation with different objective function > * Hierarchical SVD > It was also suggested that support for distance metrics other than Euclidean > such as negative dot or cosine are necessary.
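The first approach listed above (top-down, recursive application of KMeans, i.e. divisive or bisecting clustering) can be sketched in a few lines. This is a deliberately simplified illustration for 1-D points that splits each group at its mean rather than running a full 2-means at every level; the function name is ours, not from the proposed patch:

```python
def bisect(points, depth):
    """Top-down (divisive) clustering sketch: recursively split a list of
    1-D points into two groups around their mean, up to `depth` levels."""
    if depth == 0 or len(points) < 2:
        return [points]
    mean = sum(points) / len(points)
    left = [p for p in points if p < mean]
    right = [p for p in points if p >= mean]
    if not left or not right:  # all points identical; cannot split further
        return [points]
    return bisect(left, depth - 1) + bisect(right, depth - 1)
```

A real implementation would replace the mean-split with a 2-means run per node and track the tree structure, but the recursion pattern is the same.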
[jira] [Commented] (SPARK-4144) Support incremental model training of Naive Bayes classifier
[ https://issues.apache.org/jira/browse/SPARK-4144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328726#comment-14328726 ] Jeremy Freeman commented on SPARK-4144: --- Hi all, I'd be happy to pick this up and get a prototype in the next couple of weeks; it's related to the streaming logistic regression we just added for 1.3, though the update rule is more complex and might not be able to reuse as much code (in that sense it may end up looking more like streaming k-means). But I don't want to step on anyone's toes! [~liquanpei] are you working on it? Or are you still interested [~cfregly]? Also happy to work together / review code. > Support incremental model training of Naive Bayes classifier > > > Key: SPARK-4144 > URL: https://issues.apache.org/jira/browse/SPARK-4144 > Project: Spark > Issue Type: Improvement > Components: MLlib, Streaming >Reporter: Chris Fregly >Assignee: Liquan Pei > > Per Xiangrui Meng from the following user list discussion: > http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3CCAJgQjQ_QjMGO=jmm8weq1v8yqfov8du03abzy7eeavgjrou...@mail.gmail.com%3E > > "For Naive Bayes, we need to update the priors and conditional > probabilities, which means we should also remember the number of > observations for the updates."
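The sufficient statistics Xiangrui describes (class priors, conditional probabilities, and the number of observations) can be maintained incrementally with simple counters. A minimal Python sketch of the bookkeeping, illustrative only and not MLlib's actual API:

```python
from collections import defaultdict

class IncrementalNB:
    """Sketch of incremental multinomial Naive Bayes sufficient statistics:
    per-class observation counts and per-class feature counts, updated one
    mini-batch at a time so priors/conditionals can be recomputed on demand."""

    def __init__(self):
        self.class_counts = defaultdict(int)                   # observations per class
        self.feature_counts = defaultdict(lambda: defaultdict(float))
        self.total = 0                                         # total observations seen

    def update(self, batch):
        """batch: iterable of (label, {feature: count}) pairs,
        e.g. one RDD's worth of data from a DStream."""
        for label, feats in batch:
            self.class_counts[label] += 1
            self.total += 1
            for f, c in feats.items():
                self.feature_counts[label][f] += c

    def prior(self, label):
        """Current estimate of P(label), reflecting all batches so far."""
        return self.class_counts[label] / self.total
```

Because only counts are stored, each new batch is a pure accumulation step, which is what makes Naive Bayes a natural fit for streaming updates.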
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:none} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:none} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! {code} But this works fine: {code:none} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:python} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. 
For example, the following yields confusing and erroneous results: {code:python} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! {code} But this works fine: {code:python} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. > Vector conversion broken for non-float64 arrays > --- > > Key: SPARK-5089 > URL: https://issues.apache.org/jira/browse/SPARK-5089 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Jeremy Freeman > > Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are > automatically converted to {{DenseVectors}}. If the data are numpy arrays > with dtype {{float64}} this works. If data are numpy arrays with lower > precision (e.g. {{float16}} or {{float32}}), they should be upcast to > {{float64}}, but due to a small bug in this line this currently doesn't > happen (casting is not inplace). > {code:none} > if ar.dtype != np.float64: > ar.astype(np.float64) > {code} > > Non-float64 values are in turn mangled during SerDe. This can have > significant consequences. For example, the following yields confusing and > erroneous results: > {code:none} > from numpy import random > from pyspark.mllib.clustering import KMeans > data = sc.parallelize(random.randn(100,10).astype('float32')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 5 # should be 10! > {code} > But this works fine: > {code:none} > data = sc.parallelize(random.randn(100,10).astype('float64')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 10 # this is correct > {code} > The fix is trivial, I'll submit a PR shortly. 
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to {{DenseVectors}}. If the data are numpy arrays with dtype {{float64}} this works. If data are numpy arrays with lower precision (e.g. {{float16}} or {{float32}}), they should be upcast to {{float64}}, but due to a small bug in this line this currently doesn't happen (casting is not inplace). {code:python} if ar.dtype != np.float64: ar.astype(np.float64) {code} Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: {code:python} from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! {code} But this works fine: {code:python} data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct {code} The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. 
For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. > Vector conversion broken for non-float64 arrays > --- > > Key: SPARK-5089 > URL: https://issues.apache.org/jira/browse/SPARK-5089 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Jeremy Freeman > > Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are > automatically converted to {{DenseVectors}}. If the data are numpy arrays > with dtype {{float64}} this works. If data are numpy arrays with lower > precision (e.g. {{float16}} or {{float32}}), they should be upcast to > {{float64}}, but due to a small bug in this line this currently doesn't > happen (casting is not inplace). > {code:python} > if ar.dtype != np.float64: > ar.astype(np.float64) > {code} > > Non-float64 values are in turn mangled during SerDe. This can have > significant consequences. For example, the following yields confusing and > erroneous results: > {code:python} > from numpy import random > from pyspark.mllib.clustering import KMeans > data = sc.parallelize(random.randn(100,10).astype('float32')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 5 # should be 10! > {code} > But this works fine: > {code:python} > data = sc.parallelize(random.randn(100,10).astype('float64')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 10 # this is correct > {code} > The fix is trivial, I'll submit a PR shortly. 
[jira] [Updated] (SPARK-5089) Vector conversion broken for non-float64 arrays
[ https://issues.apache.org/jira/browse/SPARK-5089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-5089: -- Description: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). `` if ar.dtype != np.float64: ar.astype(np.float64) `` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. was: Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. 
For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly. > Vector conversion broken for non-float64 arrays > --- > > Key: SPARK-5089 > URL: https://issues.apache.org/jira/browse/SPARK-5089 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.2.0 >Reporter: Jeremy Freeman > > Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are > automatically converted to `DenseVectors`. If the data are numpy arrays with > dtype `float64` this works. If data are numpy arrays with lower precision > (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to > a small bug in this line this currently doesn't happen (casting is not > inplace). > `` > if ar.dtype != np.float64: > ar.astype(np.float64) > `` > > Non-float64 values are in turn mangled during SerDe. This can have > significant consequences. For example, the following yields confusing and > erroneous results: > ``` > from numpy import random > from pyspark.mllib.clustering import KMeans > data = sc.parallelize(random.randn(100,10).astype('float32')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 5 # should be 10! > ``` > But this works fine: > ``` > data = sc.parallelize(random.randn(100,10).astype('float64')) > model = KMeans.train(data, k=3) > len(model.centers[0]) > >> 10 # this is correct > ``` > The fix is trivial, I'll submit a PR shortly. 
[jira] [Created] (SPARK-5089) Vector conversion broken for non-float64 arrays
Jeremy Freeman created SPARK-5089: - Summary: Vector conversion broken for non-float64 arrays Key: SPARK-5089 URL: https://issues.apache.org/jira/browse/SPARK-5089 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.2.0 Reporter: Jeremy Freeman Prior to performing many MLlib operations in PySpark (e.g. KMeans), data are automatically converted to `DenseVectors`. If the data are numpy arrays with dtype `float64` this works. If data are numpy arrays with lower precision (e.g. `float16` or `float32`), they should be upcast to `float64`, but due to a small bug in this line this currently doesn't happen (casting is not inplace). ``` if ar.dtype != np.float64: ar.astype(np.float64) ``` Non-float64 values are in turn mangled during SerDe. This can have significant consequences. For example, the following yields confusing and erroneous results: ``` from numpy import random from pyspark.mllib.clustering import KMeans data = sc.parallelize(random.randn(100,10).astype('float32')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 5 # should be 10! ``` But this works fine: ``` data = sc.parallelize(random.randn(100,10).astype('float64')) model = KMeans.train(data, k=3) len(model.centers[0]) >> 10 # this is correct ``` The fix is trivial, I'll submit a PR shortly.
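The non-in-place cast at the heart of this bug is easy to demonstrate with NumPy alone, no Spark required: `astype` always returns a new array, so the result must be reassigned.

```python
import numpy as np

ar = np.ones(3, dtype=np.float32)

# Buggy pattern from the report: the upcast copy is discarded,
# so `ar` silently stays float32.
ar.astype(np.float64)
assert ar.dtype == np.float32

# The fix: rebind the name to the upcast copy.
if ar.dtype != np.float64:
    ar = ar.astype(np.float64)
assert ar.dtype == np.float64
```

This is the one-character-class of bug the issue describes: the conditional was right, but the cast's return value was never assigned.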
[jira] [Created] (SPARK-4981) Add a streaming singular value decomposition
Jeremy Freeman created SPARK-4981: - Summary: Add a streaming singular value decomposition Key: SPARK-4981 URL: https://issues.apache.org/jira/browse/SPARK-4981 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman This is for tracking WIP on a streaming singular value decomposition implementation. This will likely be more complex than the existing streaming algorithms (k-means, regression), but should be possible using the family of sequential update rules outlined in this paper: "Fast low-rank modifications of the thin singular value decomposition" by Matthew Brand http://www.stat.osu.edu/~dmsl/thinSVDtracking.pdf
[jira] [Created] (SPARK-4980) Add decay factors to streaming linear methods
Jeremy Freeman created SPARK-4980: - Summary: Add decay factors to streaming linear methods Key: SPARK-4980 URL: https://issues.apache.org/jira/browse/SPARK-4980 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Priority: Minor Our implementation of streaming k-means uses a decay factor that allows users to control how quickly the model adjusts to new data: whether it treats all data equally, or only bases its estimate on the most recent batch. It is intuitively parameterized, and can be specified in units of either batches or points. We should add a similar decay factor to the streaming linear methods using SGD, including streaming linear regression (currently implemented) and streaming logistic regression (in development).
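The decay behavior referred to here acts like an exponentially weighted average of batch statistics. A toy sketch of the per-center update for scalar centers (names are illustrative, not Spark's API): a decay of 1 weights all history equally, while 0 bases the estimate only on the newest batch.

```python
def update_center(center, weight, batch_mean, batch_count, decay):
    """Discounted update of one cluster center.

    center / weight: running center and its accumulated point weight;
    batch_mean / batch_count: statistics of the points assigned to this
    center in the newest batch; decay: forgetfulness factor in [0, 1].
    """
    new_weight = weight * decay + batch_count
    new_center = (center * weight * decay + batch_mean * batch_count) / new_weight
    return new_center, new_weight
```

With decay=1 this reduces to an ordinary running mean over all data; with decay=0 the center jumps to the mean of the latest batch. The issue proposes carrying the same knob over to the SGD-based streaming linear methods.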
[jira] [Created] (SPARK-4979) Add streaming logistic regression
Jeremy Freeman created SPARK-4979: - Summary: Add streaming logistic regression Key: SPARK-4979 URL: https://issues.apache.org/jira/browse/SPARK-4979 Project: Spark Issue Type: New Feature Components: MLlib, Streaming Reporter: Jeremy Freeman Priority: Minor We currently support streaming linear regression and k-means clustering. We can add support for streaming logistic regression using a strategy similar to that used in streaming linear regression, applying gradient updates to batches of data from a DStream, and extending the existing mllib methods with minor modifications.
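The proposed strategy, one gradient update per mini-batch arriving from the DStream, can be sketched in pure Python. This is an illustrative toy, not the MLlib implementation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, batch, step_size):
    """One logistic-regression gradient update on a mini-batch.

    batch: list of (label, features) pairs with label in {0, 1}.
    A streaming model would call this once per RDD in the DStream,
    carrying `weights` forward between batches.
    """
    grad = [0.0] * len(weights)
    for label, features in batch:
        margin = sum(w * x for w, x in zip(weights, features))
        err = sigmoid(margin) - label        # prediction error on this point
        for i, x in enumerate(features):
            grad[i] += err * x
    # average the gradient over the batch and step downhill
    return [w - step_size * g / len(batch) for w, g in zip(weights, grad)]
```

The same skeleton serves streaming linear regression with a squared-loss gradient in place of the logistic one, which is why the issue expects the two to share structure.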
[jira] [Created] (SPARK-4969) Add binaryRecords support to streaming
Jeremy Freeman created SPARK-4969: - Summary: Add binaryRecords support to streaming Key: SPARK-4969 URL: https://issues.apache.org/jira/browse/SPARK-4969 Project: Spark Issue Type: Improvement Components: PySpark, Streaming Affects Versions: 1.2.0 Reporter: Jeremy Freeman Priority: Minor As of Spark 1.2 there is support for loading fixed length records from flat binary files. This is a useful way to load dense numerical array data into Spark, especially in scientific computing applications. We should add support for loading this same file type in Spark Streaming, both in Scala/Java and in Python.
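The fixed-length framing behind binaryRecords is simple to state: a file is just a flat buffer split into equal-sized records. A minimal sketch of that framing (the helper name is ours, not Spark's):

```python
import struct

def split_records(data: bytes, record_length: int):
    """Split a flat binary buffer into fixed-length records, mirroring the
    framing binaryRecords applies to files of dense numerical arrays."""
    if len(data) % record_length != 0:
        raise ValueError("buffer length must be a multiple of record_length")
    return [data[i:i + record_length]
            for i in range(0, len(data), record_length)]

# e.g. two records, each holding three little-endian float64 values
buf = struct.pack("<6d", 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
records = split_records(buf, 3 * 8)   # 3 doubles x 8 bytes each
```

Each record can then be unpacked (e.g. with struct or numpy.frombuffer) into a dense array, which is the scientific-computing use case the issue mentions.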
[jira] [Commented] (SPARK-4727) Add "dimensional" RDDs (time series, spatial)
[ https://issues.apache.org/jira/browse/SPARK-4727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14234269#comment-14234269 ] Jeremy Freeman commented on SPARK-4727: --- Great to brainstorm about this RJ! To some extent, we've been doing this over on the [Thunder|http://thefreemanlab.com/thunder/docs/] project. In particular, check out the {{TimeSeries}} and {{Images}} classes [here|https://github.com/freeman-lab/thunder/tree/master/python/thunder/rdds], which are essentially wrappers for specialized RDDs. Our basic abstraction is RDDs of ndarrays (1D for time series, 2D or 3D for images/volumes), with metadata (lazily propagated) for things like dimensionality and time base, coordinates embedded in keys, and useful methods on these objects like the ones you mention (e.g. filtering, fourier, cross-correlation). We've also worked on transformations between representations, for the common case of sequences of images corresponding to different time points. We haven't worked on custom partition strategies yet; I think that will be most important for image tiles drawn from a much larger image. There's cool work ongoing for that in GeoTrellis, see the [repo|https://github.com/geotrellis/geotrellis/tree/master/spark/src/main] and a [talk|http://spark-summit.org/2014/talk/geotrellis-adding-geospatial-capabilities-to-spark] from Rob. FWIW, when we started it seemed more appropriate to build this into a specialized library, rather than Spark core. It's also something that benefits from using Python, due to a bevy of existing libraries for temporal and image data (though there are certainly analogs in Java/Scala). But it would be great to probe the community for general interest in these kinds of abstractions and methods. 
> Add "dimensional" RDDs (time series, spatial) > - > > Key: SPARK-4727 > URL: https://issues.apache.org/jira/browse/SPARK-4727 > Project: Spark > Issue Type: Brainstorming > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: RJ Nowling > > Certain types of data (time series, spatial) can benefit from specialized > RDDs. I'd like to open a discussion about this. > For example, time series data should be ordered by time and would benefit > from operations like: > * Subsampling (taking every n data points) > * Signal processing (correlations, FFTs, filtering) > * Windowing functions > Spatial data benefits from ordering and partitioning along a 2D or 3D grid. > For example, path finding algorithms can be optimized by only comparing points > within a set distance, which can be computed more efficiently by partitioning > data into a grid. > Although the operations on time series and spatial data may be different, > there is some commonality in the sense of the data having ordered dimensions > and the implementations may overlap.
[jira] [Commented] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14227266#comment-14227266 ] Jeremy Freeman commented on SPARK-3995: --- Agree with [~mengxr], that's a better strategy. > [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 > - > > Key: SPARK-3995 > URL: https://issues.apache.org/jira/browse/SPARK-3995 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Jeremy Freeman >Assignee: Jeremy Freeman >Priority: Critical > Fix For: 1.2.0 > > > There is a breaking bug in PySpark's sampling methods when run with NumPy > v1.9. This is the version of NumPy included with the current Anaconda > distribution (v2.1); this is a popular distribution, and is likely to affect > many users. > Steps to reproduce are: > {code:python} > foo = sc.parallelize(range(1000),5) > foo.takeSample(False, 10) > {code} > Returns: > {code} > PythonException: Traceback (most recent call last): > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", > line 79, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 196, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 127, in dump_stream > for obj in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 185, in _batched > for item in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 116, in func > if self.getUniformSample(split) <= self._fraction: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 58, in getUniformSample > self.initRandomGenerator(split) > File > 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 44, in initRandomGenerator > self._random = numpy.random.RandomState(self._seed) > File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ > (numpy/random/mtrand/mtrand.c:7397) > File "mtrand.pyx", line 646, in mtrand.RandomState.seed > (numpy/random/mtrand/mtrand.c:7697) > ValueError: Seed must be between 0 and 4294967295 > {code} > In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: > {code:python} > self._seed = seed if seed is not None else random.randint(0, sys.maxint) > {code} > In previous versions of NumPy a random seed larger than 2 ** 32 would > silently get truncated to 2 ** 32. This was fixed in a recent patch > (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). > But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, > which effectively breaks sampling operations in PySpark (unless the seed is > set manually). > I am putting a PR together now (the fix is very simple!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
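The fix alluded to at the end of the report can be sketched as bounding the default seed to the range NumPy actually accepts, [0, 2**32 - 1], instead of drawing from (0, sys.maxint). The helper name `make_seed` below is an assumption for illustration; the actual patch may be structured differently.

```python
import random

# NumPy's RandomState rejects seeds outside [0, 2**32 - 1] as of v1.9
# ("ValueError: Seed must be between 0 and 4294967295"), so draw the
# default seed from that range rather than (0, sys.maxint).
MAX_NUMPY_SEED = 2 ** 32 - 1

def make_seed(seed=None):
    """Return the given seed, or a random one NumPy will accept."""
    return seed if seed is not None else random.randint(0, MAX_NUMPY_SEED)
```

With this bound in place, a default seed can never trigger the `ValueError` above, while an explicitly supplied seed passes through unchanged.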
[jira] [Commented] (SPARK-3254) Streaming K-Means
[ https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181080#comment-14181080 ] Jeremy Freeman commented on SPARK-3254: --- I'd like it to! I've got a PR ready to submit ASAP, we can take it from there. > Streaming K-Means > - > > Key: SPARK-3254 > URL: https://issues.apache.org/jira/browse/SPARK-3254 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Xiangrui Meng >Assignee: Jeremy Freeman > > Streaming K-Means with proper decay settings. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
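The "proper decay settings" mentioned in the issue can be illustrated with a single-center update rule: each batch, the weight of the old center is discounted by a decay factor before the new points are folded in. This is a minimal sketch of the general idea, not the MLlib implementation; the function name and 1-D arithmetic are assumptions for illustration.

```python
# Minimal sketch of a decayed streaming k-means update for one cluster
# center (1-D for simplicity). `count` is the effective number of points
# the center currently represents; `decay` in (0, 1] controls how fast
# old batches are forgotten (decay == 1 means no forgetting).

def update_center(center, count, batch_points, decay=0.9):
    """Fold a batch of assigned points into one decayed cluster center."""
    discounted = count * decay          # down-weight history
    n = len(batch_points)
    if n == 0:
        return center, discounted
    batch_sum = sum(batch_points)
    new_count = discounted + n
    new_center = (center * discounted + batch_sum) / new_count
    return new_center, new_count

# A center at 0.0 with weight 10, half-forgotten, pulled toward three 1.0s:
c, cnt = update_center(0.0, 10.0, [1.0, 1.0, 1.0], decay=0.5)
# → c == 0.375, cnt == 8.0
```

The full algorithm applies this rule to every center per batch, after assigning each point to its nearest center.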
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not 
None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark (unless the seed is set manually). I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in 
mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSPARK] Py
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code:python} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code:python} self._seed = seed if seed is not 
None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But sampling {{(0, sys.maxint)}} often yields ints larger than 2 ** 32, which effectively breaks sampling operations in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSPARK] PySpark's sampl
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's {{RDDSamplerBase}} class from {{pyspark.rddsampler}} we use: {code} self._seed = seed if seed is not None else 
random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSP
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use: {code} self._seed = seed if seed is not None else 
random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSPARK]
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from {{pyspark.rddsampler}}) we use: {code} self._seed = seed if seed is not None else 
random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from ``pyspark.rddsampler``) we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). > [PYSPAR
[jira] [Updated] (SPARK-3995) [PYSPARK] PySpark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Summary: [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 (was: pyspark's sample methods do not work with NumPy 1.9) > [PYSPARK] PySpark's sample methods do not work with NumPy 1.9 > - > > Key: SPARK-3995 > URL: https://issues.apache.org/jira/browse/SPARK-3995 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Jeremy Freeman >Priority: Critical > > There is a breaking bug in PySpark's sampling methods when run with NumPy > v1.9. This is the version of NumPy included with the current Anaconda > distribution (v2.1); this is a popular distribution, and is likely to affect > many users. > Steps to reproduce are: > {code} > foo = sc.parallelize(range(1000),5) > foo.takeSample(False, 10) > {code} > Returns: > {code} > PythonException: Traceback (most recent call last): > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", > line 79, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 196, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 127, in dump_stream > for obj in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 185, in _batched > for item in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 116, in func > if self.getUniformSample(split) <= self._fraction: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 58, in getUniformSample > self.initRandomGenerator(split) > File > 
"/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 44, in initRandomGenerator > self._random = numpy.random.RandomState(self._seed) > File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ > (numpy/random/mtrand/mtrand.c:7397) > File "mtrand.pyx", line 646, in mtrand.RandomState.seed > (numpy/random/mtrand/mtrand.c:7697) > ValueError: Seed must be between 0 and 4294967295 > {code} > In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use: > {code} > self._seed = seed if seed is not None else random.randint(0, sys.maxint) > {code} > In previous versions of NumPy a random seed larger than 2 ** 32 would > silently get truncated to 2 ** 32. This was fixed in a recent patch > (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). > But it means that PySpark’s code now causes an error. That line often yields > ints larger than 2 ** 32, which will reliably break any sampling operation in > PySpark. > I am putting a PR together now (the fix is very simple!). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In PySpark's RDDSamplerBase class (from `pyspark.rddsampler`) we use: {code} self._seed = seed if seed is not None else 
random.randint(0, sys.maxint) {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error. That line often yields ints larger than 2 ** 32, which will reliably break any sampling operation in PySpark. I am putting a PR together now (the fix is very simple!). was: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ 
(numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32. This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error, because in the RDDSamplerBase class from pyspark.rddsampler, we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} And this often yields ints larger than 2 ** 32. Effectively, this reliably breaks any sampling operation in PySpark with this NumPy version. I am putting a PR together now (the fix is very simple!).
[jira] [Updated] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
[ https://issues.apache.org/jira/browse/SPARK-3995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy Freeman updated SPARK-3995: -- Description: There is a breaking bug in PySpark's sampling methods when run with NumPy v1.9. This is the version of NumPy included with the current Anaconda distribution (v2.1); this is a popular distribution, and is likely to affect many users. Steps to reproduce are: {code} foo = sc.parallelize(range(1000),5) foo.takeSample(False, 10) {code} Returns: {code} PythonException: Traceback (most recent call last): File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 79, in main serializer.dump_stream(func(split_index, iterator), outfile) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 196, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 127, in dump_stream for obj in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 185, in _batched for item in iterator: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 116, in func if self.getUniformSample(split) <= self._fraction: File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 58, in getUniformSample self.initRandomGenerator(split) File "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", line 44, in initRandomGenerator self._random = numpy.random.RandomState(self._seed) File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ (numpy/random/mtrand/mtrand.c:7397) File "mtrand.pyx", line 646, in mtrand.RandomState.seed (numpy/random/mtrand/mtrand.c:7697) ValueError: Seed must be between 0 and 4294967295 {code} In previous versions of NumPy a random seed larger than 2 ** 32 would silently get truncated to 2 ** 32.
This was fixed in a recent patch (https://github.com/numpy/numpy/commit/6b1a1205eac6fe5d162f16155d500765e8bca53c). But it means that PySpark’s code now causes an error, because in the RDDSamplerBase class from pyspark.rddsampler, we use: {code} self._seed = seed if seed is not None else random.randint(0, sys.maxint) {code} And this often yields ints larger than 2 ** 32. Effectively, this reliably breaks any sampling operation in PySpark with this NumPy version. I am putting a PR together now (the fix is very simple!). > pyspark's sample methods do not work with NumPy 1.9 > --- > > Key: SPARK-3995 > URL: https://issues.apache.org/jira/browse/SPARK-3995 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 1.1.0 >Reporter: Jeremy Freeman >Priority: Critical > > There is a breaking bug in PySpark's sampling methods when run with NumPy > v1.9. This is the version of NumPy included with the current Anaconda > distribution (v2.1); this is a popular distribution, and is likely to affect > many users. 
> Steps to reproduce are: > {code} > foo = sc.parallelize(range(1000),5) > foo.takeSample(False, 10) > {code} > Returns: > {code} > PythonException: Traceback (most recent call last): > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", > line 79, in main > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 196, in dump_stream > self.serializer.dump_stream(self._batched(iterator), stream) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 127, in dump_stream > for obj in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", > line 185, in _batched > for item in iterator: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 116, in func > if self.getUniformSample(split) <= self._fraction: > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 58, in getUniformSample > self.initRandomGenerator(split) > File > "/Users/freemanj11/code/spark-1.1.0-bin-hadoop1/python/pyspark/rddsampler.py", > line 44, in initRandomGenerator > self._random = numpy.random.RandomState(self._seed) > File "mtrand.pyx", line 610, in mtrand.RandomState.__init__ > (numpy/random/mtrand/mtrand.c:7397) > File "mtrand.pyx", line 646, in mtrand.RandomState.seed > (numpy/random/mtrand/mtrand.c:7697) > ValueError: Seed must be between 0 and 4294967295 > {code} > In previous versions of NumPy a random seed larger than 2 ** 32 would silently get > truncated to 2 ** 32
[jira] [Created] (SPARK-3995) pyspark's sample methods do not work with NumPy 1.9
Jeremy Freeman created SPARK-3995: - Summary: pyspark's sample methods do not work with NumPy 1.9 Key: SPARK-3995 URL: https://issues.apache.org/jira/browse/SPARK-3995 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
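The fix described above can be sketched as follows. This is only an illustration of the idea (the actual PR may differ): instead of seeding NumPy with a value drawn from [0, sys.maxint], fold the seed back into the 32-bit range that numpy.random.RandomState accepts. The helper name `pick_seed` is hypothetical.

```python
import random
import sys

# NumPy's RandomState only accepts seeds in [0, 2**32 - 1]; on 64-bit
# platforms sys.maxsize (sys.maxint in Python 2) is far larger, so a raw
# random.randint(0, sys.maxint)-style seed can exceed that bound.
NUMPY_MAX_SEED = 2 ** 32 - 1

def pick_seed(seed=None):
    """Return a seed that is always valid for numpy.random.RandomState."""
    if seed is None:
        seed = random.randint(0, sys.maxsize)
    # Fold any oversized seed back into the accepted 32-bit range.
    return seed & NUMPY_MAX_SEED

s = pick_seed(2 ** 40 + 7)
assert 0 <= s <= NUMPY_MAX_SEED
```

Masking with `2 ** 32 - 1` mirrors the silent truncation that older NumPy versions performed, so sampling behavior stays consistent across NumPy releases.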
[jira] [Comment Edited] (SPARK-3254) Streaming K-Means
[ https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299 ] Jeremy Freeman edited comment on SPARK-3254 at 8/28/14 8:38 PM: Here is a (public) google doc explaining a current implementation, including discussion of the update rule, and choices for parameterizing the decay factor. https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing In-progress code can be viewed here: https://github.com/freeman-lab/spark/blob/streaming-kmeans/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala was (Author: freeman-lab): Here is a (public) google doc explaining a current implementation, including discussion of the update rule, and choices for parameterizing the decay factor. https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing In-progress code can be viewed here: https://github.com/freeman-lab/spark/tree/streaming-kmeans > Streaming K-Means > - > > Key: SPARK-3254 > URL: https://issues.apache.org/jira/browse/SPARK-3254 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Xiangrui Meng >Assignee: Jeremy Freeman > > Streaming K-Means with proper decay settings. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3254) Streaming K-Means
[ https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299 ] Jeremy Freeman commented on SPARK-3254: --- Here is a (public) google doc explaining a current implementation, including discussion of the update rule, and choices for parameterizing the decay factor. https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing In-progress code can be viewed here: https://github.com/freeman-lab/spark/tree/streaming-kmeans > Streaming K-Means > - > > Key: SPARK-3254 > URL: https://issues.apache.org/jira/browse/SPARK-3254 > Project: Spark > Issue Type: New Feature > Components: MLlib, Streaming >Reporter: Xiangrui Meng >Assignee: Jeremy Freeman > > Streaming K-Means with proper decay settings. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
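The decay-weighted update discussed in the linked doc can be sketched roughly as follows; this is a hypothetical rendering (see the google doc for the actual parameterization). Each cluster center becomes a weighted combination of the old center and the new batch mean, with history discounted by a decay factor.

```python
def update_center(center, count, batch_mean, batch_count, decay):
    """One streaming k-means update for a single cluster.

    center, count: current centroid and its (decayed) point count
    batch_mean, batch_count: mean and size of the new points assigned
        to this cluster in the latest batch
    decay: in [0, 1]; 1.0 keeps all history, 0.0 uses only the new batch
    """
    discounted = count * decay
    total = discounted + batch_count
    new_center = [
        (c * discounted + m * batch_count) / total
        for c, m in zip(center, batch_mean)
    ]
    return new_center, total

# With decay=0.0 the center jumps straight to the new batch mean:
c, n = update_center([0.0, 0.0], 10, [4.0, 2.0], 5, 0.0)
# c == [4.0, 2.0], n == 5
```

With decay=1.0 this reduces to the ordinary running mean over all points ever seen, so the single parameter interpolates between a fully-online and a fully-batch estimate.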
[jira] [Created] (SPARK-3128) Use streaming test suite for StreamingLR
Jeremy Freeman created SPARK-3128: - Summary: Use streaming test suite for StreamingLR Key: SPARK-3128 URL: https://issues.apache.org/jira/browse/SPARK-3128 Project: Spark Issue Type: Improvement Components: MLlib, Streaming Affects Versions: 1.1.0 Reporter: Jeremy Freeman Priority: Minor Unit tests for Streaming Linear Regression currently use file writing to generate input data and a TextFileStream to read the data. It would be better to use existing utilities from the streaming test suite to simulate DStreams and collect and evaluate results of DStream operations. This will make tests faster, simpler, and easier to maintain / extend. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
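The proposal above, as a hypothetical pure-Python sketch: instead of writing files and reading them back through a TextFileStream, a test can feed pre-built batches directly to the model-update step and collect results in memory, which is what the simulated-DStream utilities enable. The helpers `run_batches` and `running_mean` are illustrative, not Spark APIs.

```python
def run_batches(model_update, batches):
    """Drive a streaming-style test: apply `model_update` to each batch
    in order and record the model state after every batch, mimicking
    what a simulated DStream harness collects for assertions."""
    states = []
    state = None
    for batch in batches:
        state = model_update(state, batch)
        states.append(state)
    return states

# Toy "model": running (sum, count) over all points seen so far.
def running_mean(state, batch):
    total, count = state or (0.0, 0)
    return (total + sum(batch), count + len(batch))

states = run_batches(running_mean, [[1.0, 2.0], [3.0]])
# states == [(3.0, 2), (6.0, 3)]
```

Because everything stays in memory, there is no file-system polling or timing dependence, which is exactly what makes such tests faster and less flaky than the TextFileStream approach.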
[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14079984#comment-14079984 ] Jeremy Freeman commented on SPARK-2012: --- [~davies] cool, that definitely makes sense to me, shall I put together a PR doing it that way? > PySpark StatCounter with numpy arrays > - > > Key: SPARK-2012 > URL: https://issues.apache.org/jira/browse/SPARK-2012 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.0 >Reporter: Jeremy Freeman >Priority: Minor > > In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy > arrays just as with an RDD of scalars, which was very useful (e.g. for > computing stats on a set of vectors in ML analyses). In 1.0.0 this broke > because the added functionality for computing the minimum and maximum, as > implemented, doesn't work on arrays. > I have a PR ready that re-enables this functionality by having StatCounter > use the numpy element-wise functions "maximum" and "minimum", which work on > both numpy arrays and scalars (and I've added new tests for this capability). > However, I realize this adds a dependency on NumPy outside of MLLib. If > that's not ok, maybe it'd be worth adding this functionality as a util within > PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)
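The gist of the fix described: use element-wise maximum/minimum (as numpy.maximum and numpy.minimum do) rather than Python's built-in max/min, which are ambiguous on arrays. A pure-Python sketch of the behavior, with lists standing in for numpy arrays; `MiniStatCounter` is a hypothetical stand-in for the real StatCounter.

```python
def elementwise_maximum(a, b):
    """Like numpy.maximum: works on scalars and on sequences,
    comparing element by element rather than whole-object."""
    if isinstance(a, (list, tuple)):
        return [max(x, y) for x, y in zip(a, b)]
    return max(a, b)

class MiniStatCounter:
    """Tracks a running element-wise maximum over scalars or vectors."""
    def __init__(self):
        self.maxValue = None

    def merge(self, value):
        if self.maxValue is None:
            self.maxValue = value
        else:
            self.maxValue = elementwise_maximum(self.maxValue, value)
        return self

counter = MiniStatCounter()
for v in ([1.0, 5.0], [3.0, 2.0]):
    counter.merge(v)
# counter.maxValue == [3.0, 5.0]
```

Because the same function handles both scalars and vectors, one StatCounter code path covers RDDs of numbers and RDDs of arrays, which is what broke in 1.0.0.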
[jira] [Comment Edited] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069758#comment-14069758 ] Jeremy Freeman edited comment on SPARK-2282 at 7/22/14 3:18 AM: Hi all, I'm "the scientist", a couple updates from more real world testing, looking very promising! - Set-up: 60 node cluster, an analysis with iterative updates (essentially a sequence of two map-reduce steps on each iteration), data cached and counted before starting iterations - 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. Before the patch I reliably hit the error after about 5 iterations, with the patch 20+ complete. - 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. Before the patch more than one iteration always failed, with the patch 20+ complete. So it's looking really good. I can also try the other extreme (very small cluster) to see if that issue manifests. Aaron, big thanks for helping with this, it's a big deal for our workflows, so really terrific to get to the bottom of it! -- Jeremy was (Author: freeman-lab): Hi all, I'm "the scientist", a couple updates from more real world testing, looking very promising! - Set-up: 60 node cluster, an analysis with iterative updates (essentially a sequence of two map-reduce steps on each iteration), data cached and counted before starting iterations - 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. Before the patch I reliably hit the error after about 5 iterations, with the patch 20+ complete. - 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. Before the patch more than one iteration always failed (!), with the patch 20+ complete. So it's looking really good. I can also try the other extreme (very small cluster) to see if that issue manifests. Aaron, big thanks for helping with this, it's a big deal for our workflows, so really terrific to get to the bottom of it! 
-- Jeremy > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069758#comment-14069758 ] Jeremy Freeman commented on SPARK-2282: --- Hi all, I'm "the scientist", a couple updates from more real world testing, looking very promising! - Set-up: 60 node cluster, an analysis with iterative updates (essentially a sequence of two map-reduce steps on each iteration), data cached and counted before starting iterations - 250 GB data set, 4000 tasks / stage, ~6 seconds for each stage to complete. Before the patch I reliably hit the error after about 5 iterations, with the patch 20+ complete. - 2.3 TB data set, 26000 tasks / stage, ~27 seconds for each stage to complete. Before the patch more than one iteration always failed (!), with the patch 20+ complete. So it's looking really good. I can also try the other extreme (very small cluster) to see if that issue manifests. Aaron, big thanks for helping with this, it's a big deal for our workflows, so really terrific to get to the bottom of it! -- Jeremy > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. 
> This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
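The in-Spark fix mentioned above (adding SO_REUSEADDR to socket creation) looks roughly like this in Python's stdlib socket API; the actual Spark change is on the JVM side, so this is only an illustration of the option being set.

```python
import socket

def make_reusable_socket():
    """Create a TCP socket with SO_REUSEADDR set, so an address can be
    rebound while old connections still linger in TIME_WAIT."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    return s

s = make_reusable_socket()
assert s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR) != 0
s.close()
```

Note that SO_REUSEADDR addresses rebinding the accumulator server's address; the exhaustion of client-side ephemeral ports in TIME_WAIT is what the tcp_tw_reuse/tcp_tw_recycle sysctls in the issue description work around.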
[jira] [Created] (SPARK-2438) Streaming + MLLib
Jeremy Freeman created SPARK-2438: - Summary: Streaming + MLLib Key: SPARK-2438 URL: https://issues.apache.org/jira/browse/SPARK-2438 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jeremy Freeman This is a ticket to track progress on developing streaming analyses in MLLib. Many streaming applications benefit from or require fitting models online, where the parameters of a model (e.g. regression, clustering) are updated continually as new data arrive. This can be accomplished by incorporating MLLib algorithms into model-updating operations over DStreams. In some cases this can be achieved using existing updaters (e.g. those based on SGD), but in other cases will require custom update rules (e.g. for KMeans). The goal is to have streaming versions of many common algorithms, in particular regression, classification, clustering, and possibly dimensionality reduction. -- This message was sent by Atlassian JIRA (v6.2#6252)
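The model-updating pattern described above, as a hypothetical sketch: an SGD-style update applied to each incoming batch, so the regression weights stay current as new data arrive. This is an illustration of the concept, not MLlib's implementation.

```python
def sgd_batch_update(weights, batch, step=0.1):
    """One streaming update for linear regression: take a gradient
    step on each (features, label) pair in the incoming batch."""
    for features, label in batch:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        weights = [w - step * error * x for w, x in zip(weights, features)]
    return weights

# Fit y = 2*x from a stream of tiny batches; each batch nudges
# the weight toward the true value.
w = [0.0]
for batch in [[([1.0], 2.0)], [([1.0], 2.0)], [([1.0], 2.0)]]:
    w = sgd_batch_update(w, batch)
# w[0] moves toward 2.0
```

Applying this per-batch function inside a DStream operation (e.g. a foreach over batches) is the shape of the streaming regression and clustering algorithms this ticket tracks; custom rules like the k-means decay update replace the gradient step but keep the same batch-by-batch structure.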
[jira] [Created] (SPARK-2012) PySpark StatCounter with numpy arrays
Jeremy Freeman created SPARK-2012: - Summary: PySpark StatCounter with numpy arrays Key: SPARK-2012 URL: https://issues.apache.org/jira/browse/SPARK-2012 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Jeremy Freeman Priority: Minor In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays. I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions "maximum" and "minimum", which work on both numpy arrays and scalars (and I've added new tests for this capability). However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)