[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56147622 @davies Does `PickleSerializer` compress data? If not, maybe we should cache the deserialized RDD instead of the one from `_.reserialize`. They have the same storage. I

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56207099 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56210084 @mengxr `PickleSerializer` does not compress data; there is a CompressSerializer that can do it using gzip (level 1). Compression can help for a small range of doubles or repeated
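The trade-off raised in the comment above can be illustrated with a small sketch. It uses only the standard-library `pickle` and `zlib` modules, not PySpark's own serializers, and the helper name `pickled_sizes` and the sample data are hypothetical. It compares the size of a pickled list of doubles before and after level-1 compression, for repeated values and for random values.

```python
import pickle
import random
import zlib


def pickled_sizes(values, level=1):
    """Return (raw, compressed) byte counts for a pickled list.

    Hypothetical helper for illustration only; it does not call PySpark.
    """
    raw = pickle.dumps(values, protocol=2)
    compressed = zlib.compress(raw, level)  # level 1 favors speed over ratio
    return len(raw), len(compressed)


# Repeated doubles: highly redundant after pickling.
repeated = [0.5] * 10000

# Random doubles: little redundancy for the compressor to exploit.
rng = random.Random(0)
spread = [rng.random() for _ in range(10000)]

for name, data in (("repeated", repeated), ("random", spread)):
    raw_size, packed_size = pickled_sizes(data)
    print(name, raw_size, packed_size)
```

In this sketch the repeated values shrink far more than the random ones, which is consistent with the remark that compression mainly helps for repeated or narrow-range data.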

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56211052 @mengxr In this PR, I tried to avoid any changes other than serialization; we could change the cache behavior or compression later. It will be good to have

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56216817 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20576/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56241679 @davies LGTM except for a few linear algebra operators and caching. But those are orthogonal to this PR. I'm merging this and we will update the linear algebra ops later.

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2378 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56242298 Merged. Thanks a lot!

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17751963 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752055 --- Diff: python/pyspark/mllib/linalg.py --- @@ -257,10 +410,34 @@ def stringify(vector): Vectors.stringify(Vectors.dense([0.0, 1.0]))

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752465 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from numpy

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752588 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -64,6 +64,12 @@ class DenseMatrix(val numRows: Int, val numCols: Int, val

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752597 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from numpy

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56104238 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20551/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17756431 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17757207 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17757949 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable {

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56109944 @jkbradley I should have addressed all your comments, or left a comment where I have not yet figured out how to do so. Thanks for reviewing this huge PR.

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56110091 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20551/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56110037 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20554/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56112010 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/132/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17760498 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56114946 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20554/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56116566 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/132/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56117852 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20560/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56122608 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20560/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56136476 test this please

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55855348 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/122/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55855377 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20453/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55860743 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/122/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55860805 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20453/consoleFull) for PR 2378 at commit

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55916761 @mengxr it's ready to review now, thanks.

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread staple
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686208 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686849 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686887 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread staple
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17687682 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17693196 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17694320 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -40,11 +43,11 @@ import org.apache.spark.mllib.util.MLUtils

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17697102 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17697397 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17698519 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17700232 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17700424 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17700471 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17700987 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17701086 --- Diff: python/pyspark/mllib/linalg.py --- @@ -61,16 +195,19 @@ def __init__(self, size, *args): if type(pairs) == dict:

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17701227 --- Diff: python/pyspark/mllib/linalg.py --- @@ -61,16 +195,19 @@ def __init__(self, size, *args): if type(pairs) == dict:

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17701626 --- Diff: python/pyspark/mllib/linalg.py --- @@ -61,16 +195,19 @@ def __init__(self, size, *args): if type(pairs) == dict:

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17702050 --- Diff: python/pyspark/mllib/linalg.py --- @@ -257,10 +410,34 @@ def stringify(vector): Vectors.stringify(Vectors.dense([0.0, 1.0]))

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17702101 --- Diff: python/pyspark/mllib/linalg.py --- @@ -257,10 +410,34 @@ def stringify(vector): Vectors.stringify(Vectors.dense([0.0, 1.0]))

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17703466 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17703595 --- Diff: python/pyspark/mllib/tree.py --- @@ -90,53 +89,24 @@ class DecisionTree(object): EXPERIMENTAL: This is an experimental API.

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55987147 @davies This looks like a great PR! I don’t see major issues, though +1 to the remarks about checking for performance regressions. Pending performance testing and