Dense vs. sparse is just a question of representation, so it doesn't
make an operation on a vector more or less important. You've
identified the reason that subtracting the mean can be undesirable:
centering turns almost every zero into a nonzero value, so a
notionally billion-element sparse vector becomes dense and too big to
fit in memory at once.
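
To make the blow-up concrete, here is a rough sketch in plain Python
(not Spark code). The toy rows and the helper names column_means and
center are made up for illustration; each row is a sparse vector
stored as an {index: value} dict.

```python
def column_means(rows, size):
    """Per-dimension mean over all rows; absent entries count as 0."""
    totals = [0.0] * size
    for row in rows:
        for i, v in row.items():
            totals[i] += v
    return [t / len(rows) for t in totals]


def center(row, means):
    """Subtract the per-dimension mean, keeping only nonzero results."""
    centered = {}
    for i, m in enumerate(means):
        v = row.get(i, 0.0) - m
        if v != 0.0:
            centered[i] = v
    return centered


size = 6
rows = [{0: 3.0}, {1: 6.0, 4: 3.0}, {2: 9.0, 3: 3.0, 5: 6.0}]

means = column_means(rows, size)
centered_rows = [center(r, means) for r in rows]

# Count how many entries have to be stored before and after centering.
stored_before = sum(len(r) for r in rows)
stored_after = sum(len(r) for r in centered_rows)
print(stored_before, stored_after)
```

In this toy data every dimension has a nonzero mean, so every
implicit zero turns into a stored value after centering.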

I know this came up as a problem recently (I think there's a JIRA?)
because VectorAssembler will *sometimes* output a small dense vector
and sometimes output a small sparse vector based on how many zeroes
there are. But that's bad because then the StandardScaler can't
process the output at all. You can work on this if you're interested;
I think the proposal was to add a way to force VectorAssembler to
output only dense vectors. I don't know if that's the nature of the
problem you're hitting.

It can be meaningful to only scale the dimension without centering it,
but it's not the same thing, no. The math is the math.
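
A small numeric sketch of the difference, in plain Python with made-up
values for a single dimension. It uses the sample standard deviation
from the statistics module, which is an assumption on my part about
what you'd want to match:

```python
import statistics

# One feature (dimension) observed across four rows; mostly zeros.
column = [0.0, 0.0, 0.0, 4.0]

mean = statistics.mean(column)
std = statistics.stdev(column)  # sample standard deviation

# Scaling only: zeros stay zero, so sparsity is preserved.
scaled_only = [x / std for x in column]

# Centering and scaling: the zeros become nonzero.
standardized = [(x - mean) / std for x in column]

print(scaled_only)
print(standardized)
print(statistics.mean(scaled_only), statistics.mean(standardized))
```

Both results have unit standard deviation, but only the second has
zero mean, and only the first keeps its zeros.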

This has come up a few times: there are cases where you need to
center a sparse vector but it's prohibitively expensive to do so. One
idea I'd toyed with in the past was to let a sparse vector have an
'offset' value applied to all elements. That would let you shift all
values while preserving a sparse representation. I'm not sure if it's
worth implementing, but it would help in this case.
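
Purely as an illustration of that idea, here is a minimal sketch in
plain Python. OffsetSparseVector is a hypothetical class, not anything
that exists in Spark, and it handles only a single scalar offset
shared by all elements:

```python
class OffsetSparseVector:
    """Hypothetical sparse vector where element i is stored[i] + offset.

    Only explicitly stored entries take memory; every other element
    is implicitly equal to `offset`.
    """

    def __init__(self, size, stored, offset=0.0):
        self.size = size
        self.stored = dict(stored)  # {index: value before offset}
        self.offset = offset

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.stored.get(i, 0.0) + self.offset

    def shift(self, delta):
        """Add delta to every element without touching the storage."""
        return OffsetSparseVector(self.size, self.stored, self.offset + delta)

    def total(self):
        """Sum of all elements, computed from the stored entries only."""
        return sum(self.stored.values()) + self.offset * self.size


# A notionally billion-element vector with two stored entries.
v = OffsetSparseVector(1_000_000_000, {3: 2.0, 500: 4.0})
shifted = v.shift(-0.5)

print(len(shifted.stored), shifted[3], shifted[7])
```

The shift changes one number, so the storage stays at two entries no
matter how large the vector notionally is.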




On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> Hi everyone,
>
> I am doing some standardization using standardScaler on data from
> VectorAssembler which is represented as sparse vectors. I plan to fit a
> regularized model.  However, standardScaler does not allow the mean to be
> subtracted from sparse vectors. It will only divide by the standard
> deviation, which I understand is to keep the vector sparse. Thus I am trying
> to convert my sparse vectors into dense vectors, but this may not be
> worthwhile.
>
> So my questions are:
> Is subtracting the mean during standardization only important when working
> with dense vectors? Does it not matter for sparse vectors? Is just dividing
> by the standard deviation with sparse vectors equivalent to also dividing by
> the standard deviation and subtracting the mean with dense vectors?
>
> Thank you,
> Tobi
