Opening this follow-up question to the entire mailing list. Does anyone have thoughts on how I can add a column of dense vectors (created by converting a column of sparse features) to a DataFrame? My attempts are below.

Although I know this is not the best approach for something I plan to put in production, I have been trying to write a udf that turns the sparse vector into a dense one, and to apply that udf in withColumn(). withColumn() complains that the data is a tuple. I think the issue might be the datatype parameter: the function returns a vector of doubles, but there is no primitive type that fits.

sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])), DoubleType())
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", sparseToDense("features"))

The function does work outside the udf, but then I am unable to add an arbitrary column to the data frame I started out working with:

denseFeatures=TrainingRdf.select("features").map(lambda data: DenseVector([data.features.toArray()]))
denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)
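What I suspect might work instead is returning Spark's vector type from the udf rather than DoubleType, since a whole DenseVector cannot be coerced to a single float. A minimal sketch, assuming the Spark 2.0 pyspark.ml.linalg API and untested on my real data:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

# Declare the return type as VectorUDT, not DoubleType, so that
# withColumn() receives a vector column rather than a value it
# cannot interpret as a double.
sparseToDense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures",
                                                sparseToDense("features"))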
Thanks,
Tobi

On Thu, Aug 11, 2016 at 5:02 AM, Sean Owen <so...@cloudera.com> wrote:
> No, that doesn't describe the change being discussed, since you've
> copied the discussion about adding an 'offset'. That's orthogonal.
> You're also suggesting making withMean=True the default, which we
> don't want. The point is that if this is *explicitly* requested, the
> scaler shouldn't refuse to subtract the mean from a sparse vector, and
> fail.
>
> On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> > Sean,
> >
> > I have created a jira; I hope you don't mind that I borrowed your
> > explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
> >
> > So what did you do to standardize your data, if you didn't use
> > StandardScaler? Did you write a udf to subtract the mean and divide
> > by the standard deviation?
> >
> > Although I know this is not the best approach for something I plan
> > to put in production, I have been trying to write a udf to turn the
> > sparse vector into a dense one and apply the udf in withColumn().
> > withColumn() complains that the data is a tuple. I think the issue
> > might be the datatype parameter: the function returns a vector of
> > doubles, but there is no primitive type that fits.
> >
> > sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> > DoubleType())
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> > sparseToDense("features"))
> >
> > The function does work outside the udf, but I am unable to add an
> > arbitrary column to the data frame I started out working with.
> > Thoughts?
> >
> > denseFeatures=TrainingRdf.select("features").map(lambda data:
> > DenseVector([data.features.toArray()]))
> > denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> > denseFeatures)
> >
> > Thanks,
> > Tobi
> >
> > On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >>
> >> Ah right, got it. As you say, for storage it helps significantly,
> >> but for operations I suspect it puts one back in a "dense-like"
> >> position. Still, for online / mini-batch algorithms it may still be
> >> feasible, I guess.
> >>
> >> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> All elements, I think. Imagine a sparse vector 1:3 3:7, which
> >>> conceptually represents 0 3 0 7. Imagine it also has an offset
> >>> stored which applies to all elements. If it is -2, then it now
> >>> represents -2 1 -2 5, but this requires just one extra value to
> >>> store. It only helps with storage of a shifted sparse vector;
> >>> iterating still typically requires iterating all elements.
> >>>
> >>> Probably, where this would help, the caller can track this offset
> >>> and apply this knowledge even more efficiently. I remember digging
> >>> into this in how sparse covariance matrices are computed. It almost
> >>> but not quite enabled an optimization.
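To make the offset idea above concrete, here is a minimal sketch of a hypothetical offset-carrying sparse vector; nothing like this exists in Spark, and pyspark.ml.linalg.SparseVector is assumed only for the underlying storage:

from pyspark.ml.linalg import SparseVector

class OffsetSparseVector(object):
    # Hypothetical: a sparse vector plus one scalar offset that
    # conceptually applies to every element, implicit zeros included.
    def __init__(self, vector, offset=0.0):
        self.vector = vector
        self.offset = offset

    def shift(self, delta):
        # Shifting (e.g. subtracting a mean) costs one addition,
        # not a densified copy of the vector.
        return OffsetSparseVector(self.vector, self.offset + delta)

    def __getitem__(self, i):
        # Element i is the stored value (0.0 if absent) plus the offset.
        return float(self.vector[i]) + self.offset

# Sean's example: 1:3 3:7 with offset -2 reads back as -2 1 -2 5.
v = OffsetSparseVector(SparseVector(4, {1: 3.0, 3: 7.0})).shift(-2.0)
assert [v[i] for i in range(4)] == [-2.0, 1.0, -2.0, 5.0]

As the reply above notes, this only helps storage: once the offset is non-zero, any elementwise traversal still has to visit all coordinates.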
> >>>
> >>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com> wrote:
> >>>>
> >>>> Sean, by 'offset' do you mean basically subtracting the mean, but
> >>>> only from the non-zero elements in each row?
> >>>>
> >>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
> >>>>>
> >>>>> Yeah, I had thought the same: perhaps it's fine to let the
> >>>>> StandardScaler proceed if it's explicitly asked to center, rather
> >>>>> than refuse to. It's not really much more rope to let a user hang
> >>>>> herself with, and it blocks legitimate usages (we ran into this
> >>>>> last week and couldn't use StandardScaler as a result).
> >>>>>
> >>>>> I'm personally supportive of the change and don't see a JIRA. I
> >>>>> think you could at least make one.
> >>>>>
> >>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> >>>>> > Thanks Sean, I agree 100% that the math is the math and dense
> >>>>> > vs sparse is just a matter of representation. I was trying to
> >>>>> > convince a co-worker of this to no avail. Sending this email
> >>>>> > was mainly a sanity check.
> >>>>> >
> >>>>> > I think having an offset would be a great idea, although I am
> >>>>> > not sure how to implement it. However, if anything should be
> >>>>> > done to rectify this issue, it should be done in
> >>>>> > StandardScaler, not VectorAssembler. VectorAssembler should not
> >>>>> > be forced to produce only dense vectors, so as to avoid
> >>>>> > performance problems with data that does not fit in memory.
> >>>>> > Furthermore, not every machine learning algorithm requires
> >>>>> > standardization. Instead, StandardScaler should have
> >>>>> > withMean=True as the default, and should apply an offset if the
> >>>>> > vector is sparse, with normal subtraction if the vector is
> >>>>> > dense. This way the default behavior of StandardScaler will
> >>>>> > always be what is generally understood to be standardization,
> >>>>> > as opposed to people thinking they are standardizing when they
> >>>>> > actually are not.
> >>>>> >
> >>>>> > Can anyone confirm whether there is a jira already?
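In the meantime, the densify-then-center workaround can be strung together end to end. A minimal sketch, reusing the hypothetical sparseToDense udf from the top of this mail and assuming Spark 2.0's pyspark.ml.feature.StandardScaler:

from pyspark.ml.feature import StandardScaler

# Densify first, since StandardScaler refuses to subtract the mean
# from sparse vectors.
denseDf = trainingRdfAssemb.withColumn("denseFeatures",
                                       sparseToDense("features"))

# Centering is allowed now that the input column holds dense vectors.
scaler = StandardScaler(inputCol="denseFeatures", outputCol="scaledFeatures",
                        withMean=True, withStd=True)
scaledDf = scaler.fit(denseDf).transform(denseDf)

This is only viable when the densified vectors fit in memory, which is exactly the limitation the offset idea is meant to avoid.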
> >>>>> >
> >>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>>> >>
> >>>>> >> Dense vs sparse is just a question of representation, so it
> >>>>> >> doesn't make an operation on a vector more or less important
> >>>>> >> as a result. You've identified the reason that subtracting the
> >>>>> >> mean can be undesirable: a notionally billion-element sparse
> >>>>> >> vector becomes too big to fit in memory at once.
> >>>>> >>
> >>>>> >> I know this came up as a problem recently (I think there's a
> >>>>> >> JIRA?) because VectorAssembler will *sometimes* output a small
> >>>>> >> dense vector and sometimes output a small sparse vector, based
> >>>>> >> on how many zeroes there are. But that's bad, because then the
> >>>>> >> StandardScaler can't process the output at all. You can work
> >>>>> >> on this if you're interested; I think the proposal was to be
> >>>>> >> able to force a dense representation only in VectorAssembler.
> >>>>> >> I don't know if that's the nature of the problem you're
> >>>>> >> hitting.
> >>>>> >>
> >>>>> >> It can be meaningful to only scale the dimension without
> >>>>> >> centering it, but it's not the same thing, no. The math is the
> >>>>> >> math.
> >>>>> >>
> >>>>> >> This has come up a few times -- it's necessary to center a
> >>>>> >> sparse vector but prohibitive to do so. One idea I'd toyed
> >>>>> >> with in the past was to let a sparse vector have an 'offset'
> >>>>> >> value applied to all elements. That would let you shift all
> >>>>> >> values while preserving a sparse representation. I'm not sure
> >>>>> >> if it's worth implementing, but it would help this case.
> >>>>> >>
> >>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> >>>>> >> > Hi everyone,
> >>>>> >> >
> >>>>> >> > I am doing some standardization using StandardScaler on data
> >>>>> >> > from VectorAssembler, which is represented as sparse
> >>>>> >> > vectors. I plan to fit a regularized model. However,
> >>>>> >> > StandardScaler does not allow the mean to be subtracted from
> >>>>> >> > sparse vectors. It will only divide by the standard
> >>>>> >> > deviation, which I understand is to keep the vector sparse.
> >>>>> >> > Thus I am trying to convert my sparse vectors into dense
> >>>>> >> > vectors, but this may not be worthwhile.
> >>>>> >> >
> >>>>> >> > So my questions are: Is subtracting the mean during
> >>>>> >> > standardization only important when working with dense
> >>>>> >> > vectors? Does it not matter for sparse vectors? Is just
> >>>>> >> > dividing by the standard deviation with sparse vectors
> >>>>> >> > equivalent to also dividing by the standard deviation and
> >>>>> >> > subtracting the mean with dense vectors?
> >>>>> >> >
> >>>>> >> > Thank you,
> >>>>> >> > Tobi
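On that last question, a small worked example (plain numpy, treating Sean's 1:3 3:7 values above as one feature column) shows the two transformations are not equivalent; they differ by a constant shift of mean/std in every coordinate:

import numpy as np

x = np.array([0.0, 3.0, 0.0, 7.0])  # dense view of the sparse vector 1:3 3:7
mu, sigma = x.mean(), x.std()

scaledOnly = x / sigma              # what the scaler does for sparse input
standardized = (x - mu) / sigma     # full standardization

# scaledOnly equals standardized + mu/sigma coordinate-wise, so dividing
# by the standard deviation alone leaves every value shifted by mu/sigma.
assert np.allclose(scaledOnly, standardized + mu / sigma)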