Ah right, got it. As you say, for storage it helps significantly, but for
operations I suspect it puts one back in a "dense-like" position. Still, for
online / mini-batch algorithms it may be feasible, I guess.

On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
> represents 0 3 0 7. Imagine it also has an offset stored which applies to
> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> requires just one extra value to store. It only helps with storage of a
> shifted sparse vector; iterating still typically requires iterating all
> elements.
>
> Probably, where this would help, the caller can track this offset and even
> more efficiently apply this knowledge. I remember digging into this in how
> sparse covariance matrices are computed. It almost but not quite enabled an
> optimization.
>
>
> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Sean, by 'offset' do you mean basically subtracting the mean but only from
>> the non-zero elements in each row?
>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Yeah I had thought the same, that perhaps it's fine to let the
>>> StandardScaler proceed, if it's explicitly asked to center, rather
>>> than refuse to. It's not really much more rope to let a user hang
>>> herself with, and refusing blocks legitimate usages (we ran into this
>>> last week and couldn't use StandardScaler as a result).
>>>
>>> I'm personally supportive of the change and don't see a JIRA. I think
>>> you could at least make one.
>>>
>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
>>> wrote:
>>> > Thanks Sean, I agree 100% that the math is math and dense vs sparse is
>>> > just a matter of representation. I was trying to convince a co-worker
>>> > of this to no avail. Sending this email was mainly a sanity check.
>>> >
>>> > I think having an offset would be a great idea, although I am not sure
>>> > how to implement this. However, if anything should be done to rectify
>>> > this issue, it should be done in the standardScaler, not vectorAssembler.
>>> > There should not be any forcing of vectorAssembler to produce only
>>> > dense vectors so as to avoid performance problems with data that does
>>> > not fit in memory.
>>> > Furthermore, not every machine learning algo requires standardization.
>>> > Instead, standardScaler should have withmean=True as default and should
>>> > apply an offset if the vector is sparse, whereas there would be normal
>>> > subtraction if the vector is dense. This way the default behavior of
>>> > standardScaler will always be what is generally understood to be
>>> > standardization, as opposed to people thinking they are standardizing
>>> > when they actually are not.
>>> >
>>> > Can anyone confirm whether there is a jira already?
>>> >
>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com>
>>> > wrote:
>>> >>
>>> >> Dense vs sparse is just a question of representation, so doesn't make
>>> >> an operation on a vector more or less important as a result. You've
>>> >> identified the reason that subtracting the mean can be undesirable: a
>>> >> notionally billion-element sparse vector becomes too big to fit in
>>> >> memory at once.
>>> >>
>>> >> I know this came up as a problem recently (I think there's a JIRA?)
>>> >> because VectorAssembler will *sometimes* output a small dense vector
>>> >> and sometimes output a small sparse vector based on how many zeroes
>>> >> there are. But that's bad because then the StandardScaler can't
>>> >> process the output at all. You can work on this if you're interested;
>>> >> I think the proposal was to be able to force a dense representation
>>> >> only in VectorAssembler. I don't know if that's the nature of the
>>> >> problem you're hitting.
>>> >>
>>> >> It can be meaningful to only scale the dimension without centering it,
>>> >> but it's not the same thing, no. The math is the math.
>>> >>
>>> >> This has come up a few times -- it's necessary to center a sparse
>>> >> vector but prohibitive to do so.
>>> >> One idea I'd toyed with in the past
>>> >> was to let a sparse vector have an 'offset' value applied to all
>>> >> elements. That would let you shift all values while preserving a
>>> >> sparse representation. I'm not sure if it's worth implementing but
>>> >> would help this case.
>>> >>
>>> >>
>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
>>> >> wrote:
>>> >> > Hi everyone,
>>> >> >
>>> >> > I am doing some standardization using standardScaler on data from
>>> >> > VectorAssembler which is represented as sparse vectors. I plan to
>>> >> > fit a regularized model. However, standardScaler does not allow the
>>> >> > mean to be subtracted from sparse vectors. It will only divide by
>>> >> > the standard deviation, which I understand is to keep the vector
>>> >> > sparse. Thus I am trying to convert my sparse vectors into dense
>>> >> > vectors, but this may not be worthwhile.
>>> >> >
>>> >> > So my questions are:
>>> >> > Is subtracting the mean during standardization only important when
>>> >> > working with dense vectors? Does it not matter for sparse vectors?
>>> >> > Is just dividing by the standard deviation with sparse vectors
>>> >> > equivalent to also dividing by standard deviation and subtracting
>>> >> > mean with dense vectors?
>>> >> >
>>> >> > Thank you,
>>> >> > Tobi
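
For anyone reading this thread later, here is a rough sketch of the 'offset' idea Sean describes, using his 1:3 3:7 example. It is plain Python and only an illustration: the class OffsetSparseVector and its methods (get, to_dense, shift, dot) are hypothetical names made up for this sketch, not part of Spark's API, and indices are assumed to be zero-based.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OffsetSparseVector:
    """Hypothetical sparse vector with a single offset added to every element.

    Not a Spark class; a sketch of the idea discussed in the thread.
    """
    size: int
    values: Dict[int, float]  # index -> stored (unshifted) non-zero value
    offset: float = 0.0       # conceptually added to all `size` elements

    def get(self, i: int) -> float:
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.values.get(i, 0.0) + self.offset

    def to_dense(self) -> List[float]:
        # Materializing touches every element, as noted in the thread.
        return [self.get(i) for i in range(self.size)]

    def shift(self, delta: float) -> "OffsetSparseVector":
        # Shifting only changes the offset; the stored entries are untouched.
        return OffsetSparseVector(self.size, self.values, self.offset + delta)

    def dot(self, weights: List[float]) -> float:
        # sum_i (v_i + c) * w_i = sum over stored v_i * w_i + c * sum_i w_i
        sparse_part = sum(v * weights[i] for i, v in self.values.items())
        return sparse_part + self.offset * sum(weights)


v = OffsetSparseVector(size=4, values={1: 3.0, 3: 7.0})
shifted = v.shift(-2.0)
print(v.to_dense())        # [0.0, 3.0, 0.0, 7.0]
print(shifted.to_dense())  # [-2.0, 1.0, -2.0, 5.0]
```

The shift costs one extra stored value, which matches the storage point above. The dot method still sums all of the weights, so it is only cheaper than a dense pass if the caller already tracks that sum, which is roughly the "caller can track this offset" remark.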