Re: Standardization with Sparse Vectors

Sean Owen Wed, 10 Aug 2016 10:02:18 -0700

Yeah I had thought the same, that perhaps it's fine to let the
StandardScaler proceed, if it's explicitly asked to center, rather
than refuse to. It's not really much more rope to let a user hang
herself with, and, blocks legitimate usages (we ran into this last
week and couldn't use StandardScaler as a result).


I'm personally supportive of the change and don't see a JIRA. I think
you could at least make one.

On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> Thanks Sean, I agree with 100% that the math is math and dense vs sparse is
> just a matter of representation. I was trying to convince a co-worker of
> this to no avail. Sending this email was mainly a sanity check.
>
> I think having an offset would be a great idea, although I am not sure how
> to implement this. However, if anything should be done to rectify this
> issue, it should be done in the standardScaler, not vectorAssembler. There
> should not be any forcing of vectorAssembler to produce only dense vectors
> so as to avoid performance problems with data that does not fit in memory.
> Furthermore, not every machine learning algo requires standardization.
> Instead, standardScaler should have withmean=True as default and should
> apply an offset if the vector is sparse, whereas there would be normal
> subtraction if the vector is dense. This way the default behavior of
> standardScaler will always be what is generally understood to be
> standardization, as opposed to people thinking they are standardizing when
> they actually are not.
>
> Can anyone confirm whether there is a jira already?
>
> On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Dense vs sparse is just a question of representation, so doesn't make
>> an operation on a vector more or less important as a result. You've
>> identified the reason that subtracting the mean can be undesirable: a
>> notionally billion-element sparse vector becomes too big to fit in
>> memory at once.
>>
>> I know this came up as a problem recently (I think there's a JIRA?)
>> because VectorAssembler will *sometimes* output a small dense vector
>> and sometimes output a small sparse vector based on how many zeroes
>> there are. But that's bad because then the StandardScaler can't
>> process the output at all. You can work on this if you're interested;
>> I think the proposal was to be able to force a dense representation
>> only in VectorAssembler. I don't know if that's the nature of the
>> problem you're hitting.
>>
>> It can be meaningful to only scale the dimension without centering it,
>> but it's not the same thing, no. The math is the math.
>>
>> This has come up a few times -- it's necessary to center a sparse
>> vector but prohibitive to do so. One idea I'd toyed with in the past
>> was to let a sparse vector have an 'offset' value applied to all
>> elements. That would let you shift all values while preserving a
>> sparse representation. I'm not sure if it's worth implementing but
>> would help this case.
>>
>>
>>
>>
>> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > I am doing some standardization using standardScaler on data from
>> > VectorAssembler which is represented as sparse vectors. I plan to fit a
>> > regularized model.  However, standardScaler does not allow the mean to
>> > be
>> > subtracted from sparse vectors. It will only divide by the standard
>> > deviation, which I understand is to keep the vector sparse. Thus I am
>> > trying
>> > to convert my sparse vectors into dense vectors, but this may not be
>> > worthwhile.
>> >
>> > So my questions are:
>> > Is subtracting the mean during standardization only important when
>> > working
>> > with dense vectors? Does it not matter for sparse vectors? Is just
>> > dividing
>> > by the standard deviation with sparse vectors equivalent to also
>> > dividing by
>> > standard deviation w and subtracting mean with dense vectors?
>> >
>> > Thank you,
>> > Tobi
>
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Standardization with Sparse Vectors

Reply via email to