Sean, I have created a jira; I hope you don't mind that I borrowed your explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
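For anyone reading this later, my mental model of that offset idea, as a pure sketch (nothing like this exists in Spark; the class and all names below are made up for illustration), is:

    from pyspark.mllib.linalg import SparseVector

    # Hypothetical: a sparse vector plus one stored 'offset' added to every
    # element, per Sean's description below. SparseVector(4, {1: 3.0, 3: 7.0})
    # normally means [0, 3, 0, 7]; with offset -2.0 it would mean
    # [-2, 1, -2, 5], at the cost of a single extra double. Centering is then
    # just offset = -mean, so the sparse representation survives.
    class OffsetSparseVector(object):
        def __init__(self, sparse, offset=0.0):
            self.sparse = sparse
            self.offset = offset

        def toArray(self):
            # Densifying applies the offset everywhere, including the
            # implicit zeros.
            return self.sparse.toArray() + self.offset

    v = OffsetSparseVector(SparseVector(4, {1: 3.0, 3: 7.0}), offset=-2.0)
    print(v.toArray())  # [-2.  1. -2.  5.]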
So what did you do to standardize your data, if you didn't use StandardScaler? Did you write a UDF to subtract the mean and divide by the standard deviation?

Although I know this is not the best approach for something I plan to put in production, I have been trying to write a UDF that turns the sparse vector into a dense one and to apply it in withColumn(). withColumn() complains that the data is a tuple. I think the issue might be the datatype parameter: the function returns a vector of doubles, but there is no primitive type that would be adequate for this.

    sparseToDense = udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))

The function does work outside the UDF, but then I am unable to add the resulting column to the data frame I started out working with:

    denseFeatures = TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)

Thoughts?
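For completeness, here is the full shape of what I am attempting as a self-contained sketch. Using VectorUDT() as the return type instead of DoubleType() is my best guess at the fix rather than something I have verified, and on Spark 2.0 the vector classes may need to come from pyspark.ml.linalg rather than pyspark.mllib.linalg:

    from pyspark.sql.functions import udf
    from pyspark.mllib.linalg import DenseVector, VectorUDT

    # Guess: return the vector itself (no float(...) wrapper) and declare
    # the column type as VectorUDT(), since no primitive DataType describes
    # a vector of doubles.
    sparseToDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures",
                                                    sparseToDense("features"))

If that is the problem, it would also explain the second attempt failing: I suspect withColumn() wants a Column expression, not an RDD built separately with map().

Thanks,
Tobi

On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Ah right, got it. As you say, for storage it helps significantly, but for operations I suspect it puts one back in a "dense-like" position. Still, for online / mini-batch algorithms it may still be feasible, I guess.
>
> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
>
>> All elements, I think. Imagine a sparse vector 1:3 3:7, which conceptually represents 0 3 0 7. Imagine it also has an offset stored which applies to all elements. If it is -2, then it now represents -2 1 -2 5, but this requires just one extra value to store. It only helps with storage of a shifted sparse vector; iterating still typically requires iterating all elements.
>>
>> Probably, where this would help, the caller can track this offset and even more efficiently apply this knowledge. I remember digging into this in how sparse covariance matrices are computed. It almost, but not quite, enabled an optimization.
>>
>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>
>>> Sean, by 'offset' do you mean basically subtracting the mean, but only from the non-zero elements in each row?
>>>
>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Yeah, I had thought the same: that perhaps it's fine to let the StandardScaler proceed if it's explicitly asked to center, rather than refuse to. It's not really much more rope for a user to hang herself with, and it blocks legitimate usages (we ran into this last week and couldn't use StandardScaler as a result).
>>>>
>>>> I'm personally supportive of the change and don't see a JIRA. I think you could at least make one.
>>>>
>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>>>> > Thanks Sean, I agree 100% that the math is the math and dense vs sparse is just a matter of representation. I was trying to convince a co-worker of this to no avail. Sending this email was mainly a sanity check.
>>>> >
>>>> > I think having an offset would be a great idea, although I am not sure how to implement it. However, if anything should be done to rectify this issue, it should be done in StandardScaler, not VectorAssembler.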
>>>> > There should not be any forcing of VectorAssembler to produce only dense vectors, so as to avoid performance problems with data that does not fit in memory. Furthermore, not every machine learning algo requires standardization. Instead, StandardScaler should have withMean=True as the default and should apply an offset if the vector is sparse, with normal subtraction if the vector is dense. That way the default behavior of StandardScaler will always be what is generally understood to be standardization, as opposed to people thinking they are standardizing when they actually are not.
>>>> >
>>>> > Can anyone confirm whether there is a jira already?
>>>> >
>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> >>
>>>> >> Dense vs sparse is just a question of representation, so it doesn't make an operation on a vector more or less important as a result. You've identified the reason that subtracting the mean can be undesirable: a notionally billion-element sparse vector becomes too big to fit in memory at once.
>>>> >>
>>>> >> I know this came up as a problem recently (I think there's a JIRA?) because VectorAssembler will *sometimes* output a small dense vector and sometimes a small sparse vector, based on how many zeroes there are. But that's bad because then the StandardScaler can't process the output at all. You can work on this if you're interested; I think the proposal was to be able to force a dense representation only in VectorAssembler. I don't know if that's the nature of the problem you're hitting.
>>>> >>
>>>> >> It can be meaningful to only scale the dimension without centering it, but it's not the same thing, no. The math is the math.
>>>> >>
>>>> >> This has come up a few times -- it's necessary to center a sparse vector but prohibitive to do so. One idea I'd toyed with in the past was to let a sparse vector have an 'offset' value applied to all elements. That would let you shift all values while preserving a sparse representation. I'm not sure it's worth implementing, but it would help this case.
>>>> >>
>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
>>>> >> > Hi everyone,
>>>> >> >
>>>> >> > I am doing some standardization using StandardScaler on data from VectorAssembler, which is represented as sparse vectors. I plan to fit a regularized model. However, StandardScaler does not allow the mean to be subtracted from sparse vectors; it will only divide by the standard deviation, which I understand is to keep the vector sparse. Thus I am trying to convert my sparse vectors into dense vectors, but this may not be worthwhile.
>>>> >> >
>>>> >> > So my questions are: Is subtracting the mean during standardization only important when working with dense vectors? Does it not matter for sparse vectors? Is just dividing by the standard deviation with sparse vectors equivalent to both dividing by the standard deviation and subtracting the mean with dense vectors?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Tobi

>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org