[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422663#comment-15422663
 ] 

Sean Owen commented on SPARK-17001:
---

Yes, that came up on the thread that prompted this: VectorAssembler could be told to produce only dense vectors. That's reasonable. The downside pointed out there is that the vectors would then be dense even when they could benefit from a sparse representation, just for compatibility with another component.

But maybe there's an argument that being able to turn off sparse output is generally useful, because the same issue may come up elsewhere. If something doesn't work on sparse vectors, and VectorAssembler's output is a mix of both depending only on the values in the input, some pipelines could succeed or fail based on the input data, even when the input is conceptually entirely valid.

Of course, you can manually make the vectors dense. That's not bad at all, and it's what we had done in the past, though it involves an extra copy.
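
A minimal PySpark sketch of that manual workaround, densifying the assembler's output with a UDF before scaling. The column names and the {{assembled}} DataFrame are illustrative, not from this issue:

{code:python}
from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

# Copy each (possibly sparse) vector into a dense vector -- the extra copy noted above.
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

# 'assembled' is assumed to be the DataFrame produced by VectorAssembler.
densified = assembled.withColumn("denseFeatures", to_dense("features"))
{code}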

I had thought it simplest in this case to just let it work rather than fail, but I don't mind going in other directions with the solution.

> Enable standardScaler to standardize sparse vectors when withMean=True
> --
>
> Key: SPARK-17001
> URL: https://issues.apache.org/jira/browse/SPARK-17001
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Tobi Bosede
>Priority: Minor
>
> When withMean = true, StandardScaler will not handle sparse vectors, and 
> instead throw an exception. This is presumably because subtracting the mean 
> makes a sparse vector dense, and this can be undesirable. 
> However, VectorAssembler generates vectors that may be a mix of sparse and 
> dense, even when vectors are smallish, depending on their values. It's common 
> to feed this into StandardScaler, but it would fail sometimes depending on 
> the input if withMean = true. This is kind of surprising.
> StandardScaler should go ahead and operate on sparse vectors and subtract the 
> mean, if explicitly asked to do so with withMean, on the theory that the user 
> knows what he/she is doing, and there is otherwise no way to make this work.






[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422633#comment-15422633
 ] 

Nick Pentreath commented on SPARK-17001:


Yet another potential option is a transformer that converts all vectors to dense (or sparse)?
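
A rough sketch of what such a transformer could look like in PySpark. This {{DenseVectorizer}} class is made up here for illustration (no such class exists in Spark), and it skips the Params and persistence plumbing a real implementation would need:

{code:python}
from pyspark.ml import Transformer
from pyspark.ml.linalg import DenseVector, VectorUDT
from pyspark.sql.functions import udf

class DenseVectorizer(Transformer):
    """Copies one vector column into an always-dense output column."""
    def __init__(self, inputCol="features", outputCol="denseFeatures"):
        super(DenseVectorizer, self).__init__()
        self.inputCol = inputCol
        self.outputCol = outputCol

    def _transform(self, dataset):
        to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())
        return dataset.withColumn(self.outputCol, to_dense(self.inputCol))
{code}

Dropped into a Pipeline between {{VectorAssembler}} and {{StandardScaler}}, this would make the scaler's input uniformly dense regardless of how sparse the assembled rows happen to be.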







[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422630#comment-15422630
 ] 

Nick Pentreath commented on SPARK-17001:


This approach seems fine - I tend to agree with allowing users to configure 
certain options even if they are potentially dangerous, under the assumption 
that they should know the implications (with appropriate documentation and 
warnings).

However, an alternative (or perhaps additional) solution is for {{VectorAssembler}} to allow an option that forces dense (or sparse) output, along the lines of the sketch below. This would cover the case where a user knows they want to scale the data even if it breaks sparsity, because the vectors are not that big.
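
A purely hypothetical sketch of how such an option might be used; the {{setOutputFormat}} setter below does not exist in the current VectorAssembler API and is made up only to illustrate the proposal:

{code:python}
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=["a", "b", "c"], outputCol="features")
# Hypothetical, not a real parameter today: always emit DenseVector values,
# regardless of how many zeros the assembled row contains.
# assembler.setOutputFormat("dense")
{code}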

Thoughts?







[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422624#comment-15422624
 ] 

Nick Pentreath commented on SPARK-17001:


Note Spark now has a {{MaxAbsScaler}} transformer that could be an alternative 
for cases where one wants to preserve sparsity.
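
A brief PySpark sketch of that alternative. MaxAbsScaler rescales each feature to [-1, 1] by its maximum absolute value, so no centering is done and sparsity is preserved (column names and the {{assembled}} DataFrame are illustrative):

{code:python}
from pyspark.ml.feature import MaxAbsScaler

scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")
model = scaler.fit(assembled)        # learns the per-feature max absolute values
scaled = model.transform(assembled)  # sparse input stays sparse
{code}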







[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422572#comment-15422572
 ] 

Apache Spark commented on SPARK-17001:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/14663







[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15422511#comment-15422511
 ] 

Sean Owen commented on SPARK-17001:
---

Yeah, that sounds like how it works now. StandardScaler will actually raise an error if asked to center sparse data. The problem is that sometimes data is represented sparsely simply because that's smaller than the dense representation, not necessarily because the dense representation is too large to work with. In particular, VectorAssembler will output small sparse vectors if there are enough 0s, and that means its output can't be used with StandardScaler with centering, even when it would be perfectly fine.
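
A minimal PySpark sketch of that failure mode, under the behavior described in this issue; column names and the {{assembled}} DataFrame are illustrative:

{code:python}
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
model = scaler.fit(assembled)
# Raises for rows holding SparseVector values (the exception described above),
# even though the same pipeline works on all-dense input.
scaled = model.transform(assembled)
{code}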

My attitude is that the user should be able to opt in to this behavior if desired. Yes, it would potentially cause a job to fail if you centered massive sparse vectors, but at least that will be a fairly clear error. It seems better to allow that possibility than to make StandardScaler unable to do centering in the common case.







[jira] [Commented] (SPARK-17001) Enable standardScaler to standardize sparse vectors when withMean=True

2016-08-12 Thread Tobi Bosede (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419592#comment-15419592
 ] 

Tobi Bosede commented on SPARK-17001:
-

This can be implemented in a similar fashion to scikit-learn's maxabs_scale. See 
http://scikit-learn.org/stable/modules/preprocessing.html#scaling-sparse-data 
for more info.
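
For reference, a short scikit-learn sketch of that approach on sparse input (toy data, for illustration only):

{code:python}
import numpy as np
from scipy import sparse
from sklearn.preprocessing import maxabs_scale

# Each column is divided by its maximum absolute value; zeros stay zero,
# so the result remains sparse.
X = sparse.csr_matrix(np.array([[0.0, 2.0], [4.0, 0.0], [0.0, -1.0]]))
X_scaled = maxabs_scale(X)
{code}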




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org