[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2018-11-02 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
Hi, can you give a ref to the new style guides? I'm not sure if anything 
major needs changing.

In the meantime I resolved the one conflict at the bottom of ml/tests.py 
and extended the docstring to include your comment.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2017-04-18 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
ping @holdenk , revisiting this PR for the memories today. Wondering 
whether it was shipworthy in the end, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2017-01-07 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
ping @holdenk , please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-12-24 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
@holdenk 
Thanks for the review and your availability, I really appreciate the work 
you are doing too by mentoring me! I didn't realize np.append was making 
copies, though it made a lot of sense once I thought about what is actually 
going on there in terms of memory. 

The appends were also ultimately unnecessary and have been removed in favor 
of a much better, simpler, efficient call to make the csr matrix!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-12-18 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
ping @holdenk, can you take a look at this please? sorry for leaving this 
off for so long.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-11-17 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
@holdenk , are we there yet :P?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-11-04 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
@holdenk 
I did a quick benchmark on my Macbook to get a single CSR's approximate 
construction time (in minutes) on a large case as well as compared to LIL and 
DOK types.

2mil x 2mil Elements:
CSR,1.157418
DOK,5.531311
LIL,   0.839315
LIL2CSR, 1.650390

Snippet used for benchmarking:
```
setup_small1 = 'from scipy.sparse import csr_matrix;'
method1 = 'csr_matrix((range(0, 200),range(0, 200),[0, 200]))'
setup_small2 = 'from scipy.sparse import dok_matrix;'
method2 = 'dok_matrix((range(0, 200),range(0, 200)))'
setup_small3 = 'from scipy.sparse import lil_matrix;'
method3 = 'lil_matrix((200,200))'
setup_small4 = 'from scipy.sparse import lil_matrix;'
method4 = 'lil_matrix((200,200)).tocsr()'

t = timeit.Timer(method1, setup_small1)
print('CSR construction: ' + str(t.timeit(100)/float(100)))
t = timeit.Timer(method2, setup_small2)
print('DOK construction: ' + str(t.timeit(100)/float(100)))
t = timeit.Timer(method3, setup_small3)
print('LIL construction: ' + str(t.timeit(100)/float(100)))
t = timeit.Timer(method4, setup_small4)
print('LIL construction + Convert to CSR: ' + str(t.timeit(100)/float(100)))
```

I expect any changing of the Sparse Vector structure will take place using 
the PySpark object class, so CSR will definitely outperform LIL and DOK matrix 
types for function execution as well. This comes from the SciPy documentation 
[Advantages of the CSR format - efficient arithmetic operations CSR + CSR, CSR 
* CSR, etc., efficient row slicing, fast matrix vector products / Disadvantages 
of the CSR format - slow column slicing operations (consider CSC), changes to 
the sparsity structure are expensive (consider LIL or DOK)].


Other items resolved:
- The \s have been removed
- The error message has been lengthened to mention SciPy not being 
available as a possible cause for a lack of object attribute (the other cause 
being a call to something that does not exist in SciPy either).
- I have only added my single test function to ML since I do not know the 
reason why all of the SciPy tests in MLlib are not in ML.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-11-01 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
Sorry, I forgot to ping anyone after the last update.
@holdenk how does this look? What next steps should we take to push this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-10-24 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
I have applied the code change to both ML and MLLIB now. And, I added some 
simple tests to check if the SciPy sparse functions are behaving correctly. 
(Only MLLIB has tests for SciPy functions so I only added test cases there).

Additionally, I updated the implementation with a wrapper script which 1) 
allows for functions with inputs to work correctly and 2) seamlessly allows for 
SciPy's functions which generate a SciPy matrix output to be automatically 
returned as a SparseVector object.

Example use case:
```
c = SparseVector(40, {1: 1, 3: 2, 23: 7, 25:9, 39:3})
c.power(4)

SparseVector(40, {1: 1.0, 3: 16.0, 23: 2401.0, 25: 6561.0, 39: 81.0})
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-20 Thread itg-abby
Github user itg-abby commented on a diff in the pull request:

https://github.com/apache/spark/pull/15496#discussion_r84350100
  
--- Diff: python/pyspark/mllib/linalg/__init__.py ---
@@ -794,6 +794,13 @@ def __eq__(self, other):
 return Vectors._equals(self.indices, self.values, 
list(xrange(len(other))), other.array)
 return False
 
+def __getattr__(self, item):
+if _have_scipy:
+csr = scipy.sparse.csr_matrix((self.values, self.indices, [0, 
2]))
+return getattr(csr, item)
+else:
+return self
--- End diff --

Systems without Scipy return the following: 
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()

Traceback (most recent call last):
  File "", line 1, in 
TypeError: 'SparseVector' object is not callable
```
This has been corrected in the latest commit.
 `__getattr__` essentially is the catch for "any attribute that does not 
have a name" so standard behavior should be an AttributeError not just 
returning the object (thank you for catching this!). The new code gives the 
following error when SciPy is not available:

```
a.sum()
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/sobh/spark/python/pyspark/mllib/linalg/__init__.py", line 
802, in __getattr__
raise AttributeError("'{0}' object has no attribute 
'{1}'.".format(self.__class__, item))
AttributeError: '' object has no 
attribute 'sum'.


```
---
Systems with Scipy will give the result as expected:
```
from pyspark.mllib.linalg import SparseVector
a = SparseVector(4, {1: 1.0, 3: 5.5})
a.sum()

6.5
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-10-19 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
@sethah 
Thanks for your input! SciPy is definitely a much better solution, I was 
naively operating under the assumption that I must match DenseVector types 1:1.

Actually, there is already a flag "_have_scipy" so I have added the check 
for that in order to create the type when it is available. Otherwise, users 
just get the standard SparseVector back.

The PR has been updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-10-18 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
It is possible that I am missing something or that I have unintentionally 
obfuscated this pull request, I will try summarizing my understanding/purpose 
and see if it sheds any light:

DenseVector allows calls to numpy directly (i.e. DenseVector.mean() ) and 
always stores the array values in the object attribute DenseVector.array , this 
allows for a lot of neat numpy functions to be run on the array values without 
any trouble.

SparseVector works differently, it never stores the full set of values as a 
full array. Instead, it uses a 'trick' which only searches non-zero index/value 
pairs if a specific entry is asked for (this can be found in the __geitem__ 
attribute for SparseVector). This prevents numpy functions from being usable on 
the SparseVector since there is no actual array to operate on directly. 
However, a conversion function is provided, toArray().

The solution proposed can, in effect, be thought of as a purely syntactical 
shortening from SparseVector.toArray().mean() to simply SparseVector.mean() . 
Thus, this should not introduce any increased complexity compared to how things 
are now. The current status of this object is confusing in  that the intuitive 
function-call SparseVector.mean() just throws out an "AttributeError: 
'SparseVector' object has no attribute 'mean'". 

As mildly hinted at on JIRA, there are even better implementations which 
could follow this one. For example simply replacing directly calling numpy by 
manually providing the same functions with reduced complexity. Much along the 
lines of how __getitem__ was made for SparseVectors, rather than the typical 
array slicing that DenseVector has.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15496: [SPARK-17950] [Python] Match SparseVector behavior with ...

2016-10-17 Thread itg-abby
Github user itg-abby commented on the issue:

https://github.com/apache/spark/pull/15496
  
Whoops, looks like this code is fine and I just had a bug in my local 
build. Reopening.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-17 Thread itg-abby
GitHub user itg-abby reopened a pull request:

https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?
Simply added the `__getattr__` to SparseVector that DenseVector has, but 
calls self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

## How was this patch tested?
Manual testing on local machine.
Temporary close: Failed ./python/run-tests
No UI changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/itg-abby/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15496


commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby 
Date:   2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

<< What changes were proposed in this pull request? >>
Simply added the __getattr__ to SparseVector that DenseVector has, but 
calls to self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.

<>
Manual testing on local machine.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-14 Thread itg-abby
Github user itg-abby closed the pull request at:

https://github.com/apache/spark/pull/15496


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: [SPARK-17950] [Python] Match SparseVector behavio...

2016-10-14 Thread itg-abby
GitHub user itg-abby reopened a pull request:

https://github.com/apache/spark/pull/15496

[SPARK-17950] [Python] Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?
Simply added the `__getattr__` to SparseVector that DenseVector has, but 
calls self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

## How was this patch tested?
Manual testing on local machine.
Passed ./dev/run-tests
No UI changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/itg-abby/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15496


commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby 
Date:   2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

<< What changes were proposed in this pull request? >>
Simply added the __getattr__ to SparseVector that DenseVector has, but 
calls to self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.

<>
Manual testing on local machine.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: Match SparseVector behavior with DenseVector

2016-10-14 Thread itg-abby
Github user itg-abby closed the pull request at:

https://github.com/apache/spark/pull/15496


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15496: Match SparseVector behavior with DenseVector

2016-10-14 Thread itg-abby
GitHub user itg-abby opened a pull request:

https://github.com/apache/spark/pull/15496

Match SparseVector behavior with DenseVector

## What changes were proposed in this pull request?
Simply added the `__getattr__` to SparseVector that DenseVector has, but 
calls self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.
 i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

## How was this patch tested?
Manual testing on local machine.
No UI changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/itg-abby/spark patch-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15496


commit d9b6ecc9c7aaebf834556797404ff783fef44dcf
Author: itg-abby 
Date:   2016-10-15T00:03:08Z

Match SparseVector behavior with DenseVector

<< What changes were proposed in this pull request? >>
Simply added the __getattr__ to SparseVector that DenseVector has, but 
calls to self.toArray() instead of storing a vector all the time in self.array

This allows for use of numpy functions on the values of a SparseVector in 
the same direct way that users interact with DenseVectors.

<>
Manual testing on local machine.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org