Re: Standardization with Sparse Vectors

Nick Pentreath Wed, 10 Aug 2016 11:02:31 -0700

Ah right, got it. As you say for storage it helps significantly, but for
operations I suspect it puts one back in a "dense-like" position. Still,
for online / mini-batch algorithms it may still be feasible I guess.
On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:


> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
> represents 0 3 0 7. Imagine it also has an offset stored which applies to
> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> requires just one extra value to store. It only helps with storage of a
> shifted sparse vector; iterating still typically requires iterating all
> elements.
>
> Probably, where this would help, the caller can track this offset and even
> more efficiently apply this knowledge. I remember digging into this in how
> sparse covariance matrices are computed. It almost but not quite enabled an
> optimization.
>
>
> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Sean by 'offset' do you mean basically subtracting the mean but only from
>> the non-zero elements in each row?
>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Yeah I had thought the same, that perhaps it's fine to let the
>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>> than refuse to. It's not really much more rope to let a user hang
>>> herself with, and, blocks legitimate usages (we ran into this last
>>> week and couldn't use StandardScaler as a result).
>>>
>>> I'm personally supportive of the change and don't see a JIRA. I think
>>> you could at least make one.
>>>
>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
>>> wrote:
>>> > Thanks Sean, I agree with 100% that the math is math and dense vs
>>> sparse is
>>> > just a matter of representation. I was trying to convince a co-worker
>>> of
>>> > this to no avail. Sending this email was mainly a sanity check.
>>> >
>>> > I think having an offset would be a great idea, although I am not sure
>>> how
>>> > to implement this. However, if anything should be done to rectify this
>>> > issue, it should be done in the standardScaler, not vectorAssembler.
>>> There
>>> > should not be any forcing of vectorAssembler to produce only dense
>>> vectors
>>> > so as to avoid performance problems with data that does not fit in
>>> memory.
>>> > Furthermore, not every machine learning algo requires standardization.
>>> > Instead, standardScaler should have withmean=True as default and should
>>> > apply an offset if the vector is sparse, whereas there would be normal
>>> > subtraction if the vector is dense. This way the default behavior of
>>> > standardScaler will always be what is generally understood to be
>>> > standardization, as opposed to people thinking they are standardizing
>>> when
>>> > they actually are not.
>>> >
>>> > Can anyone confirm whether there is a jira already?
>>> >
>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com>
>>> wrote:
>>> >>
>>> >> Dense vs sparse is just a question of representation, so doesn't make
>>> >> an operation on a vector more or less important as a result. You've
>>> >> identified the reason that subtracting the mean can be undesirable: a
>>> >> notionally billion-element sparse vector becomes too big to fit in
>>> >> memory at once.
>>> >>
>>> >> I know this came up as a problem recently (I think there's a JIRA?)
>>> >> because VectorAssembler will *sometimes* output a small dense vector
>>> >> and sometimes output a small sparse vector based on how many zeroes
>>> >> there are. But that's bad because then the StandardScaler can't
>>> >> process the output at all. You can work on this if you're interested;
>>> >> I think the proposal was to be able to force a dense representation
>>> >> only in VectorAssembler. I don't know if that's the nature of the
>>> >> problem you're hitting.
>>> >>
>>> >> It can be meaningful to only scale the dimension without centering it,
>>> >> but it's not the same thing, no. The math is the math.
>>> >>
>>> >> This has come up a few times -- it's necessary to center a sparse
>>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
>>> >> was to let a sparse vector have an 'offset' value applied to all
>>> >> elements. That would let you shift all values while preserving a
>>> >> sparse representation. I'm not sure if it's worth implementing but
>>> >> would help this case.
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
>>> wrote:
>>> >> > Hi everyone,
>>> >> >
>>> >> > I am doing some standardization using standardScaler on data from
>>> >> > VectorAssembler which is represented as sparse vectors. I plan to
>>> fit a
>>> >> > regularized model.  However, standardScaler does not allow the mean
>>> to
>>> >> > be
>>> >> > subtracted from sparse vectors. It will only divide by the standard
>>> >> > deviation, which I understand is to keep the vector sparse. Thus I
>>> am
>>> >> > trying
>>> >> > to convert my sparse vectors into dense vectors, but this may not be
>>> >> > worthwhile.
>>> >> >
>>> >> > So my questions are:
>>> >> > Is subtracting the mean during standardization only important when
>>> >> > working
>>> >> > with dense vectors? Does it not matter for sparse vectors? Is just
>>> >> > dividing
>>> >> > by the standard deviation with sparse vectors equivalent to also
>>> >> > dividing by
>>> >> > standard deviation w and subtracting mean with dense vectors?
>>> >> >
>>> >> > Thank you,
>>> >> > Tobi
>>> >
>>> >
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>

Re: Standardization with Sparse Vectors

Reply via email to