[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-12-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r93112496
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,23 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
--- End diff --

Super minor nit, but using named parameters might make this a bit easier 
when skimming the code (same for mllin)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-12-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r93116390
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,23 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
--- End diff --

More concretely thouhh, why are we padding the data/values with a 0?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-12-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r93116698
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,23 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
+   np.append(self.indices, 
self.size-1),
--- End diff --

So this "works", in that it's skipped by the indexptrs range we supply 
bellow - but it makes the code confusing to read so why do we need it and maybe 
we could add a comment explaining why?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-12-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r93113956
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,23 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
--- End diff --

Would be good to have a comment here explaining its purpose.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-11-29 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r90139924
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,22 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
+   np.append(self.indices, 
self.size-1),
+   [0, len(self.values)]))
+func = getattr(csr, item)
+result = func(*args, **kwargs)
+if isinstance(result, scipy.sparse.csr.csr_matrix):
+return 
SparseVector(result.shape[1],result.indices,result.data)
+return result
+else:
+raise AttributeError(
+"'{0}' object has no attribute '{1}' or SciPy not 
installed.".format(self.__class__, item))
--- End diff --

Ok so maybe we can improve the error message to something like "'{0}' 
object has no attribute '{1}' and SciPy is not installed to proxy request to 
SparseVector" (or similar).

Because saying its X or Y is confusing since this error message only 
happens in the event SciPy is not installed.

What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-11-24 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r89464558
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,22 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((np.append(self.values, 0),
+   np.append(self.indices, 
self.size-1),
+   [0, len(self.values)]))
+func = getattr(csr, item)
+result = func(*args, **kwargs)
+if isinstance(result, scipy.sparse.csr.csr_matrix):
+return 
SparseVector(result.shape[1],result.indices,result.data)
+return result
+else:
+raise AttributeError(
+"'{0}' object has no attribute '{1}' or SciPy not 
installed.".format(self.__class__, item))
--- End diff --

This error message would probably be better of just saying SciPy not 
installed since its on the else branch of `if _have_scipy` unless I'm missing 
something?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-11-03 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r86468088
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,22 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scip:
+csr = scipy.sparse.csr_matrix((\
+   np.append(self.values, 0),\
+   np.append(self.indices, 
self.size-1),\
+   [0, len(self.values)]))
+func = getattr(csr, item)
+result = func(*args, **kwargs)
+if isinstance(result, scipy.sparse.csr.csr_matrix):
+return 
SparseVector(result.shape[1],result.indices,result.data)
+return result
+else:
+raise AttributeError("'{0}' object has no attribute 
'{1}'.".format(self.__class__, item))
--- End diff --

This seems like it might be kind of a confusing way to communicate that the 
user doesn't have scipy installed


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-11-03 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r86468202
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -861,6 +861,19 @@ def test_dot(self):
 dv = DenseVector(array([1., 2., 3., 4.]))
 self.assertEqual(10.0, dv.dot(lil))
 
+def test_assorted_functs(self):
--- End diff --

It would be good to have the same tests for ml as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-11-03 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r86468128
  
--- Diff: python/pyspark/ml/linalg/__init__.py ---
@@ -705,6 +705,22 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+def wrapper(*args, **kwargs):
+if _have_scip:
+csr = scipy.sparse.csr_matrix((\
--- End diff --

Do we need the `\`s?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-20 Thread itg-abby
Github user itg-abby commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r84350100
  
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -794,6 +794,13 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 
2]))
+return getattr(csr, item)
+else:
+return self
--- End diff --

Systems without Scipy return the following: 
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()

Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'SparseVector' object is not callable
```
This has been corrected in the latest commit.
 `__getattr__` essentially is the catch for "any attribute that does not 
have a name" so standard behavior should be an AttributeError not just 
returning the object (thank you for catching this!). The new code gives the 
following error when SciPy is not available:

```
a.sum()
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 
802, in __getattr__
raise AttributeError("'{0}' object has no attribute 
'{1}'.".format(self.__class__, item))
AttributeError: '' object has no 
attribute 'sum'.


```
---
Systems with Scipy will give the result as expected:
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()

6.5
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r84179904
  
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -794,6 +794,13 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 
2]))
+return getattr(csr, item)
+else:
+return self
--- End diff --

This doesn't look like the correct behaviour. Can you test this on a system 
without scipy installed? (Probably easist with a simple virtualenv).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-17 Thread itg-abby
GitHub user itg-abby reopened a pull request:

https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?
Simply added the `__getattr__` to SparseVector that DenseVector has, but 
calls self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

## How was this patch tested?
Manual testing on local machine.
Temporary close: Failed ./python/run-tests
No UI changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/itg-abby/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15496


commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby 
Date:   2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

<< What changes were proposed in this pull request? >>
Simply added the __getattr__ to SparseVector that DenseVector has, but 
calls to self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.

<>
Manual testing on local machine.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-14 Thread itg-abby
Github user itg-abby closed the pull request at:

https://github.com/apache/spark/pull/15496


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-14 Thread itg-abby
GitHub user itg-abby reopened a pull request:

https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?
Simply added the `__getattr__` to SparseVector that DenseVector has, but 
calls self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

## How was this patch tested?
Manual testing on local machine.
Passed ./dev/run-tests
No UI changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/itg-abby/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15496


commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby 
Date:   2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

<< What changes were proposed in this pull request? >>
Simply added the __getattr__ to SparseVector that DenseVector has, but 
calls to self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.

<>
Manual testing on local machine.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org