[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r93112496 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,23 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scipy: +csr = scipy.sparse.csr_matrix((np.append(self.values, 0), --- End diff -- Super minor nit, but using named parameters might make this a bit easier when skimming the code (same for mllin) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r93116390 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,23 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scipy: +csr = scipy.sparse.csr_matrix((np.append(self.values, 0), --- End diff -- More concretely thouhh, why are we padding the data/values with a 0? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r93116698 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,23 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scipy: +csr = scipy.sparse.csr_matrix((np.append(self.values, 0), + np.append(self.indices, self.size-1), --- End diff -- So this "works", in that it's skipped by the indexptrs range we supply bellow - but it makes the code confusing to read so why do we need it and maybe we could add a comment explaining why? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r93113956 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,23 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): --- End diff -- Would be good to have a comment here explaining its purpose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r90139924 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,22 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scipy: +csr = scipy.sparse.csr_matrix((np.append(self.values, 0), + np.append(self.indices, self.size-1), + [0, len(self.values)])) +func = getattr(csr, item) +result = func(*args, **kwargs) +if isinstance(result, scipy.sparse.csr.csr_matrix): +return SparseVector(result.shape[1],result.indices,result.data) +return result +else: +raise AttributeError( +"'{0}' object has no attribute '{1}' or SciPy not installed.".format(self.__class__, item)) --- End diff -- Ok so maybe we can improve the error message to something like "'{0}' object has no attribute '{1}' and SciPy is not installed to proxy request to SparseVector" (or similar). Because saying its X or Y is confusing since this error message only happens in the event SciPy is not installed. What do you think? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r89464558 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,22 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scipy: +csr = scipy.sparse.csr_matrix((np.append(self.values, 0), + np.append(self.indices, self.size-1), + [0, len(self.values)])) +func = getattr(csr, item) +result = func(*args, **kwargs) +if isinstance(result, scipy.sparse.csr.csr_matrix): +return SparseVector(result.shape[1],result.indices,result.data) +return result +else: +raise AttributeError( +"'{0}' object has no attribute '{1}' or SciPy not installed.".format(self.__class__, item)) --- End diff -- This error message would probably be better of just saying SciPy not installed since its on the else branch of `if _have_scipy` unless I'm missing something? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r86468088 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,22 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scip: +csr = scipy.sparse.csr_matrix((\ + np.append(self.values, 0),\ + np.append(self.indices, self.size-1),\ + [0, len(self.values)])) +func = getattr(csr, item) +result = func(*args, **kwargs) +if isinstance(result, scipy.sparse.csr.csr_matrix): +return SparseVector(result.shape[1],result.indices,result.data) +return result +else: +raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item)) --- End diff -- This seems like it might be kind of a confusing way to communicate that the user doesn't have scipy installed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r86468202 --- Diff: python/pyspark/mllib/tests.py --- @@ -861,6 +861,19 @@ def test_dot(self): dv = DenseVector(array([1., 2., 3., 4.])) self.assertEqual(10.0, dv.dot(lil)) +def test_assorted_functs(self): --- End diff -- It would be good to have the same tests for ml as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r86468128 --- Diff: python/pyspark/ml/linalg/__init__.py --- @@ -705,6 +705,22 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +def wrapper(*args, **kwargs): +if _have_scip: +csr = scipy.sparse.csr_matrix((\ --- End diff -- Do we need the `\`s? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user itg-abby commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r84350100 --- Diff: python/pyspark/mllib/linalg/__init__.py --- @@ -794,6 +794,13 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +if _have_scipy: +csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2])) +return getattr(csr, item) +else: +return self --- End diff -- Systems without Scipy return the following: ``` from pyspark.mllib.linalg import SparseVector a = SparseVector(4, {1: 1.0, 3: 5.5}) a.sum() Traceback (most recent call last): File "", line 1, in TypeError: 'SparseVector' object is not callable ``` This has been corrected in the latest commit. `__getattr__` essentially is the catch for "any attribute that does not have a name" so standard behavior should be an AttributeError not just returning the object (thank you for catching this!). The new code gives the following error when SciPy is not available: ``` a.sum() Traceback (most recent call last): File "", line 1, in File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 802, in __getattr__ raise AttributeError("'{0}' object has no attribute '{1}'.".format(self.__class__, item)) AttributeError: '' object has no attribute 'sum'. ``` --- Systems with Scipy will give the result as expected: ``` from pyspark.mllib.linalg import SparseVector a = SparseVector(4, {1: 1.0, 3: 5.5}) a.sum() 6.5 ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15496#discussion_r84179904 --- Diff: python/pyspark/mllib/linalg/__init__.py --- @@ -794,6 +794,13 @@ def __eq__(self, other): return Vectors._equals(self.indices, self.values, list(xrange(len(other))), other.array) return False +def __getattr__(self, item): +if _have_scipy: +csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 2])) +return getattr(csr, item) +else: +return self --- End diff -- This doesn't look like the correct behaviour. Can you test this on a system without scipy installed? (Probably easist with a simple virtualenv). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
GitHub user itg-abby reopened a pull request: https://github.com/apache/spark/pull/15496 [SPARK-17950] [Python] Match SparseVector behavior with DenseVector ## What changes were proposed in this pull request? Simply added the `__getattr__` to SparseVector that DenseVector has, but calls self.toArray() instead of storing a vector all the time in self.array This allows for use of numpy functions on the values of a SparseVector in the same direct way that users interact with DenseVectors. i.e. you can simply call SparseVector.mean() to average the values in the entire vector. ## How was this patch tested? Manual testing on local machine. Temporary close: Failed ./python/run-tests No UI changes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/itg-abby/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15496 commit d9b6ecc9c7aaebf834556797404ff783fef44dcf Author: itg-abby Date: 2016-10-15T00:03:08Z Match SparseVector behavior with DenseVector << What changes were proposed in this pull request? >> Simply added the __getattr__ to SparseVector that DenseVector has, but calls to self.toArray() instead of storing a vector all the time in self.array This allows for use of numpy functions on the values of a SparseVector in the same direct way that users interact with DenseVectors. <> Manual testing on local machine. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
Github user itg-abby closed the pull request at: https://github.com/apache/spark/pull/15496 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...
GitHub user itg-abby reopened a pull request: https://github.com/apache/spark/pull/15496 [SPARK-17950] [Python] Match SparseVector behavior with DenseVector ## What changes were proposed in this pull request? Simply added the `__getattr__` to SparseVector that DenseVector has, but calls self.toArray() instead of storing a vector all the time in self.array This allows for use of numpy functions on the values of a SparseVector in the same direct way that users interact with DenseVectors. i.e. you can simply call SparseVector.mean() to average the values in the entire vector. ## How was this patch tested? Manual testing on local machine. Passed ./dev/run-tests No UI changes. You can merge this pull request into a Git repository by running: $ git pull https://github.com/itg-abby/spark patch-2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15496.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15496 commit d9b6ecc9c7aaebf834556797404ff783fef44dcf Author: itg-abby Date: 2016-10-15T00:03:08Z Match SparseVector behavior with DenseVector << What changes were proposed in this pull request? >> Simply added the __getattr__ to SparseVector that DenseVector has, but calls to self.toArray() instead of storing a vector all the time in self.array This allows for use of numpy functions on the values of a SparseVector in the same direct way that users interact with DenseVectors. <> Manual testing on local machine. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org