[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 Hi, can you give a ref to the new style guides? I'm not sure if anything major needs changing. In the meantime I resolved the one conflict at the bottom of ml/tests.py and extended the docstring to include your comment. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 ping @holdenk , revisiting this PR today and wondering whether it ended up shipworthy. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 ping @holdenk , please
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 @holdenk Thanks for the review and your availability; I really appreciate the work you are doing by mentoring me! I didn't realize np.append was making copies, though it made a lot of sense once I thought about what is actually going on there in terms of memory. The appends were also ultimately unnecessary and have been removed in favor of a much simpler, more efficient call to build the CSR matrix!
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 ping @holdenk, can you take a look at this please? Sorry for leaving this for so long.
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 @holdenk , are we there yet :P?
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 @holdenk I did a quick benchmark on my Macbook to get a single CSR's approximate construction time (in minutes) on a large case, as well as a comparison against the LIL and DOK types.

2mil x 2mil elements:
- CSR: 1.157418
- DOK: 5.531311
- LIL: 0.839315
- LIL -> CSR: 1.650390

Snippet used for benchmarking:
```
import timeit

setup_small1 = 'from scipy.sparse import csr_matrix'
method1 = 'csr_matrix((range(0, 200), range(0, 200), [0, 200]))'
setup_small2 = 'from scipy.sparse import dok_matrix'
method2 = 'dok_matrix((200, 200))'
setup_small3 = 'from scipy.sparse import lil_matrix'
method3 = 'lil_matrix((200, 200))'
setup_small4 = 'from scipy.sparse import lil_matrix'
method4 = 'lil_matrix((200, 200)).tocsr()'

t = timeit.Timer(method1, setup_small1)
print('CSR construction: ' + str(t.timeit(100) / float(100)))
t = timeit.Timer(method2, setup_small2)
print('DOK construction: ' + str(t.timeit(100) / float(100)))
t = timeit.Timer(method3, setup_small3)
print('LIL construction: ' + str(t.timeit(100) / float(100)))
t = timeit.Timer(method4, setup_small4)
print('LIL construction + Convert to CSR: ' + str(t.timeit(100) / float(100)))
```

I expect any change to the SparseVector structure to go through the PySpark object class, so CSR will definitely outperform the LIL and DOK matrix types for function execution as well. This follows from the SciPy documentation:

> Advantages of the CSR format: efficient arithmetic operations (CSR + CSR, CSR * CSR, etc.), efficient row slicing, fast matrix-vector products. Disadvantages of the CSR format: slow column slicing operations (consider CSC); changes to the sparsity structure are expensive (consider LIL or DOK).

Other items resolved:
- The `\s` escapes have been removed.
- The error message has been lengthened to mention SciPy not being available as a possible cause for a missing object attribute (the other cause being a call to something that does not exist in SciPy either).
- I have only added my single test function to ML, since I do not know why the SciPy tests in MLlib are not mirrored in ML.
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 Sorry, I forgot to ping anyone after the last update. @holdenk how does this look? What next steps should we take to move this PR forward?
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 I have applied the code change to both ML and MLlib now, and I added some simple tests to check that the SciPy sparse functions behave correctly. (Only MLlib has tests for SciPy functions, so I only added test cases there.) Additionally, I updated the implementation with a wrapper which 1) allows functions that take inputs to work correctly and 2) seamlessly converts SciPy functions' sparse-matrix outputs back into SparseVector objects. Example use case:
```
c = SparseVector(40, {1: 1, 3: 2, 23: 7, 25: 9, 39: 3})
c.power(4)
SparseVector(40, {1: 1.0, 3: 16.0, 23: 2401.0, 25: 6561.0, 39: 81.0})
```
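The wrapper idea above can be sketched roughly as follows. This is a hypothetical, self-contained illustration, not the actual PySpark patch; the class name `SparseVec` and all internals are assumptions. Missing attributes are forwarded to a one-row CSR view of the vector, and any SciPy sparse result is converted back into the sparse wrapper.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


class SparseVec:
    """Toy stand-in for pyspark's SparseVector (illustration only)."""

    def __init__(self, size, pairs):
        self.size = size
        self.indices = np.array(sorted(pairs), dtype=np.int32)
        self.values = np.array([pairs[i] for i in sorted(pairs)], dtype=np.float64)

    def __getattr__(self, item):
        # View this vector as a 1 x size CSR matrix and forward the lookup.
        csr = csr_matrix((self.values, self.indices, [0, len(self.values)]),
                         shape=(1, self.size))
        attr = getattr(csr, item)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            out = attr(*args, **kwargs)
            if issparse(out):
                # Convert a SciPy sparse result back into a SparseVec.
                out = out.tocsr()
                return SparseVec(self.size,
                                 dict(zip(out.indices.tolist(), out.data.tolist())))
            return out

        return wrapper


c = SparseVec(40, {1: 1, 3: 2, 23: 7, 25: 9, 39: 3})
p = c.power(4)  # elementwise power via SciPy, returned as a SparseVec
```

With this shape, `c.sum()` comes back as a plain scalar, while `c.power(4)` round-trips through SciPy and back into the sparse wrapper, matching the example in the comment.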
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user itg-abby commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r84350100 --- Diff: python/pyspark/mllib/linalg/__init__.py ---
```
@@ -794,6 +794,13 @@ def __eq__(self, other):
                 return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array)
             return False
+    def __getattr__(self, item):
+        if _have_scipy:
+            csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2]))
+            return getattr(csr, item)
+        else:
+            return self
```
--- End diff --

Systems without SciPy return the following:
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()
Traceback (most recent call last):
  File "", line 1, in
TypeError: 'SparseVector' object is not callable
```
This has been corrected in the latest commit. `__getattr__` is essentially the catch for any attribute lookup that fails normally, so the standard behavior should be raising an AttributeError, not just returning the object (thank you for catching this!). The new code gives the following error when SciPy is not available:
```
a.sum()
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 802, in __getattr__
    raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item))
AttributeError: '' object has no attribute 'sum'.
```
Systems with SciPy will give the result as expected:
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()
6.5
```
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 @sethah Thanks for your input! SciPy is definitely a much better solution, I was naively operating under the assumption that I must match DenseVector types 1:1. Actually, there is already a flag "_have_scipy" so I have added the check for that in order to create the type when it is available. Otherwise, users just get the standard SparseVector back. The PR has been updated.
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 It is possible that I am missing something, or that I have unintentionally obfuscated this pull request, so I will try summarizing my understanding and purpose to see if it sheds any light.

DenseVector allows calls to numpy directly (i.e. DenseVector.mean()) and always stores the array values in the object attribute DenseVector.array; this allows a lot of neat numpy functions to be run on the array values without any trouble. SparseVector works differently: it never stores the full set of values as a full array. Instead, it uses a 'trick' which only searches the non-zero index/value pairs when a specific entry is asked for (this can be found in the `__getitem__` attribute for SparseVector). This prevents numpy functions from being usable on the SparseVector, since there is no actual array to operate on directly. However, a conversion function is provided: toArray().

The solution proposed can, in effect, be thought of as a purely syntactic shortening from SparseVector.toArray().mean() to simply SparseVector.mean(). Thus, it should not introduce any increased complexity compared to how things are now. The current state of this object is confusing in that the intuitive call SparseVector.mean() just throws "AttributeError: 'SparseVector' object has no attribute 'mean'".

As mildly hinted at on JIRA, there are even better implementations which could follow this one, for example replacing the direct numpy calls with manually provided equivalents of reduced complexity, much along the lines of how `__getitem__` was written for SparseVector rather than the typical array slicing that DenseVector has.
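The lookup 'trick' described above can be sketched in a few lines (a minimal standalone illustration, not pyspark's actual implementation; the names here are assumed): only the nonzero pairs are stored, and a single element is found by binary search over the sorted indices.

```python
import numpy as np

# Only the nonzero (index, value) pairs are stored, kept sorted by index.
indices = np.array([1, 3, 23])
values = np.array([1.0, 2.0, 7.0])
size = 40


def getitem(i):
    """Return element i by binary-searching the stored indices."""
    if not 0 <= i < size:
        raise IndexError(i)
    pos = int(np.searchsorted(indices, i))
    if pos < len(indices) and indices[pos] == i:
        return float(values[pos])
    return 0.0  # any index not stored is an implicit zero
```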
[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...
Github user itg-abby commented on the issue: https://github.com/apache/spark/pull/15496 Whoops, looks like this code is fine and I just had a bug in my local build. Reopening.
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
GitHub user itg-abby reopened a pull request: https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call SparseVector.mean() to average the values in the entire vector.

## How was this patch tested?

Manual testing on local machine. Temporary close: failed ./python/run-tests. No UI changes.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/itg-abby/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15496

commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby
Date: 2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors.

## How was this patch tested?

Manual testing on local machine.
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user itg-abby closed the pull request at: https://github.com/apache/spark/pull/15496
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
GitHub user itg-abby reopened a pull request: https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call SparseVector.mean() to average the values in the entire vector.

## How was this patch tested?

Manual testing on local machine. Passed ./dev/run-tests. No UI changes.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/itg-abby/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15496

commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby
Date: 2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors.

## How was this patch tested?

Manual testing on local machine.
[GitHub] spark pull request #15496: Match SparseVector behavior with DenseVector
Github user itg-abby closed the pull request at: https://github.com/apache/spark/pull/15496
[GitHub] spark pull request #15496: Match SparseVector behavior with DenseVector
GitHub user itg-abby opened a pull request: https://github.com/apache/spark/pull/15496

Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors, i.e. you can simply call SparseVector.mean() to average the values in the entire vector.

## How was this patch tested?

Manual testing on local machine. No UI changes.

You can merge this pull request into a Git repository by running: $ git pull https://github.com/itg-abby/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15496

commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby
Date: 2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?

Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls self.toArray() instead of keeping a stored vector in self.array. This allows numpy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors.

## How was this patch tested?

Manual testing on local machine.
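The original proposal boils down to a few lines. A rough standalone sketch (hypothetical `SV` class, not the actual patch) of delegating unknown attributes to the dense array produced by toArray():

```python
import numpy as np


class SV:
    """Toy sparse vector: unknown attributes delegate to the dense array."""

    def __init__(self, size, pairs):
        self.size = size
        self.pairs = dict(pairs)

    def toArray(self):
        # Materialize the full dense vector from the stored nonzero pairs.
        arr = np.zeros(self.size)
        for i, val in self.pairs.items():
            arr[i] = val
        return arr

    def __getattr__(self, item):
        # e.g. .mean(), .sum(), .std() resolve against the ndarray.
        return getattr(self.toArray(), item)


v = SV(4, {1: 1.0, 3: 5.5})
```

The tradeoff, as the thread above discusses, is that every delegated call densifies the vector; later revisions of the PR moved to a SciPy CSR view to avoid that.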