Can someone also provide input on why my code may not be working? Below, I have pasted the part of my previous reply that describes the issue. I am mostly perplexed by the first set of code (in bold). I know why the second set of code doesn't work; it is just something I tried initially.
>> Although I know this is not the best approach for something I plan to put in
>> production, I have been trying to write a udf to turn the sparse vector into
>> a dense one and apply the udf in withcolumn(). withColumn() complains that
>> the data is a tuple. I think the issue might be the datatype parameter. The
>> function returns a vector of doubles but there is no type that would be
>> adequate for this.
>>
>> *sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
>> DoubleType())
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> sparseToDense("features"))*
>>
>> However the function works outside the udf, but I am unable to add an
>> arbitrary column to the data frame I started out working with. *Thoughts?*
>>
>> denseFeatures=TrainingRdf.select("features").map(lambda data:
>> DenseVector([data.features.toArray()]))
>> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
>> denseFeatures)

On Thu, Aug 11, 2016 at 12:55 PM, Sean Owen <so...@cloudera.com> wrote:

> I should be more clear, since the outcome of the discussion above was
> not that obvious actually.
>
> - I agree a change should be made to StandardScaler, and not
> VectorAssembler
> - However I do think withMean should still be false by default and be
> explicitly enabled
> - The 'offset' idea is orthogonal, and as Nick says may be problematic
> anyway a step or two down the line. I'm proposing just converting to
> dense vectors if asked to center (which is why it shouldn't be the
> default)
>
> Indeed to answer your question, that's how I had resolved this in user
> code earlier. It's the same thing you're suggesting here, to make a
> UDF that converts the vectors to dense vectors manually.
>
> I updated the JIRA accordingly, to suggest converting to DenseVector
> in StandardScaler if withMean is set explicitly to true. I think we
> should consider something like the 'offset' idea separately if at all.
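For reference, the UDF approach Sean mentions could be sketched roughly as below. This is a sketch under assumptions, not a confirmed fix: it assumes Spark 2.x with the vector classes in `pyspark.ml.linalg` (on 1.x the equivalents live in `pyspark.mllib.linalg`), and `add_dense_column` is a hypothetical helper name. Compared with the bolded snippet, the UDF returns the `DenseVector` itself instead of wrapping it in `float()`, passes the array from `toArray()` directly instead of inside a list, and declares `VectorUDT()` as the return type instead of `DoubleType()`, since the result is a vector and not a single double.

```python
def add_dense_column(df, input_col="features", output_col="denseFeatures"):
    """Return df with an extra column holding a dense copy of input_col.

    Hypothetical helper; assumes Spark 2.x (pyspark.ml.linalg).
    """
    # Imported here so the module can be loaded where Spark is not installed.
    from pyspark.ml.linalg import DenseVector, VectorUDT
    from pyspark.sql.functions import udf

    # toArray() already yields a flat array, so it is passed directly
    # (no extra list wrapper). The declared return type is the vector UDT.
    sparse_to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

    # withColumn needs a Column expression built from this same DataFrame.
    return df.withColumn(output_col, sparse_to_dense(df[input_col]))
```

With the names used in the thread, this would be called as `denseTrainingRdf = add_dense_column(trainingRdfAssemb)`.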
>
> On Thu, Aug 11, 2016 at 11:02 AM, Sean Owen <so...@cloudera.com> wrote:
> > No, that doesn't describe the change being discussed, since you've
> > copied the discussion about adding an 'offset'. That's orthogonal.
> > You're also suggesting making withMean=True the default, which we
> > don't want. The point is that if this is *explicitly* requested, the
> > scaler shouldn't refuse to subtract the mean from a sparse vector, and
> > fail.
> >
> > On Wed, Aug 10, 2016 at 8:47 PM, Tobi Bosede <ani.to...@gmail.com> wrote:
> >> Sean,
> >>
> >> I have created a jira; I hope you don't mind that I borrowed your
> >> explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001
> >>
> >> So what did you do to standardize your data, if you didn't use
> >> standardScaler? Did you write a udf to subtract mean and divide by standard
> >> deviation?
> >>
> >> Although I know this is not the best approach for something I plan to put in
> >> production, I have been trying to write a udf to turn the sparse vector into
> >> a dense one and apply the udf in withcolumn(). withColumn() complains that
> >> the data is a tuple. I think the issue might be the datatype parameter. The
> >> function returns a vector of doubles but there is no type that would be
> >> adequate for this.
> >>
> >> sparseToDense=udf(lambda data: float(DenseVector([data.toArray()])),
> >> DoubleType())
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> sparseToDense("features"))
> >>
> >> However the function works outside the udf, but I am unable to add an
> >> arbitrary column to the data frame I started out working with. Thoughts?
> >>
> >> denseFeatures=TrainingRdf.select("features").map(lambda data:
> >> DenseVector([data.features.toArray()]))
> >> denseTrainingRdf=trainingRdfAssemb.withColumn("denseFeatures",
> >> denseFeatures)
> >>
> >> Thanks,
> >> Tobi
> >>
> >>
> >> On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
> >> wrote:
> >>>
> >>> Ah right, got it. As you say for storage it helps significantly, but for
> >>> operations I suspect it puts one back in a "dense-like" position. Still, for
> >>> online / mini-batch algorithms it may still be feasible I guess.
> >>> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
> >>>>
> >>>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
> >>>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
> >>>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
> >>>> requires just one extra value to store. It only helps with storage of a
> >>>> shifted sparse vector; iterating still typically requires iterating all
> >>>> elements.
> >>>>
> >>>> Probably, where this would help, the caller can track this offset and
> >>>> even more efficiently apply this knowledge. I remember digging into this in
> >>>> how sparse covariance matrices are computed. It almost but not quite enabled
> >>>> an optimization.
> >>>>
> >>>>
> >>>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Sean by 'offset' do you mean basically subtracting the mean but only
> >>>>> from the non-zero elements in each row?
> >>>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
> >>>>>>
> >>>>>> Yeah I had thought the same, that perhaps it's fine to let the
> >>>>>> StandardScaler proceed, if it's explicitly asked to center, rather
> >>>>>> than refuse to.
> >>>>>> It's not really much more rope to let a user hang
> >>>>>> herself with, and, blocks legitimate usages (we ran into this last
> >>>>>> week and couldn't use StandardScaler as a result).
> >>>>>>
> >>>>>> I'm personally supportive of the change and don't see a JIRA. I think
> >>>>>> you could at least make one.
> >>>>>>
> >>>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
> >>>>>> wrote:
> >>>>>> > Thanks Sean, I agree with 100% that the math is math and dense vs sparse is
> >>>>>> > just a matter of representation. I was trying to convince a co-worker of
> >>>>>> > this to no avail. Sending this email was mainly a sanity check.
> >>>>>> >
> >>>>>> > I think having an offset would be a great idea, although I am not sure how
> >>>>>> > to implement this. However, if anything should be done to rectify this
> >>>>>> > issue, it should be done in the standardScaler, not vectorAssembler. There
> >>>>>> > should not be any forcing of vectorAssembler to produce only dense vectors
> >>>>>> > so as to avoid performance problems with data that does not fit in memory.
> >>>>>> > Furthermore, not every machine learning algo requires standardization.
> >>>>>> > Instead, standardScaler should have withmean=True as default and should
> >>>>>> > apply an offset if the vector is sparse, whereas there would be normal
> >>>>>> > subtraction if the vector is dense. This way the default behavior of
> >>>>>> > standardScaler will always be what is generally understood to be
> >>>>>> > standardization, as opposed to people thinking they are standardizing when
> >>>>>> > they actually are not.
> >>>>>> >
> >>>>>> > Can anyone confirm whether there is a jira already?
> >>>>>> >
> >>>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com>
> >>>>>> > wrote:
> >>>>>> >>
> >>>>>> >> Dense vs sparse is just a question of representation, so doesn't make
> >>>>>> >> an operation on a vector more or less important as a result. You've
> >>>>>> >> identified the reason that subtracting the mean can be undesirable: a
> >>>>>> >> notionally billion-element sparse vector becomes too big to fit in
> >>>>>> >> memory at once.
> >>>>>> >>
> >>>>>> >> I know this came up as a problem recently (I think there's a JIRA?)
> >>>>>> >> because VectorAssembler will *sometimes* output a small dense vector
> >>>>>> >> and sometimes output a small sparse vector based on how many zeroes
> >>>>>> >> there are. But that's bad because then the StandardScaler can't
> >>>>>> >> process the output at all. You can work on this if you're interested;
> >>>>>> >> I think the proposal was to be able to force a dense representation
> >>>>>> >> only in VectorAssembler. I don't know if that's the nature of the
> >>>>>> >> problem you're hitting.
> >>>>>> >>
> >>>>>> >> It can be meaningful to only scale the dimension without centering it,
> >>>>>> >> but it's not the same thing, no. The math is the math.
> >>>>>> >>
> >>>>>> >> This has come up a few times -- it's necessary to center a sparse
> >>>>>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
> >>>>>> >> was to let a sparse vector have an 'offset' value applied to all
> >>>>>> >> elements. That would let you shift all values while preserving a
> >>>>>> >> sparse representation. I'm not sure if it's worth implementing but
> >>>>>> >> would help this case.
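The 'offset' idea can be illustrated in plain Python. This is only a sketch of the concept and not anything that exists in Spark: `OffsetSparseVector` and its methods are hypothetical names. It stores the non-zero entries plus one shared offset, using the example given elsewhere in the thread (1:3 3:7 with an offset of -2).

```python
class OffsetSparseVector:
    """Sparse vector whose every element is shifted by one shared offset.

    Hypothetical illustration; not a Spark class.
    """

    def __init__(self, size, entries, offset=0.0):
        self.size = size              # logical length of the vector
        self.entries = dict(entries)  # index -> stored (unshifted) value
        self.offset = offset          # added to every element, stored or not

    def __getitem__(self, i):
        if not 0 <= i < self.size:
            raise IndexError(i)
        return self.entries.get(i, 0.0) + self.offset

    def to_dense(self):
        # Materialising touches all `size` elements, which is why the offset
        # saves storage but not iteration cost.
        return [self[i] for i in range(self.size)]


v = OffsetSparseVector(4, {1: 3.0, 3: 7.0})
shifted = OffsetSparseVector(4, {1: 3.0, 3: 7.0}, offset=-2.0)
print(v.to_dense())        # [0.0, 3.0, 0.0, 7.0]
print(shifted.to_dense())  # [-2.0, 1.0, -2.0, 5.0]
```

The shifted vector still stores only two entries and one extra number, which matches the storage argument in the thread; any operation that visits every element pays the dense cost.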
> >>>>>> >>
> >>>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
> >>>>>> >> wrote:
> >>>>>> >> > Hi everyone,
> >>>>>> >> >
> >>>>>> >> > I am doing some standardization using standardScaler on data from
> >>>>>> >> > VectorAssembler which is represented as sparse vectors. I plan to fit a
> >>>>>> >> > regularized model. However, standardScaler does not allow the mean to be
> >>>>>> >> > subtracted from sparse vectors. It will only divide by the standard
> >>>>>> >> > deviation, which I understand is to keep the vector sparse. Thus I am trying
> >>>>>> >> > to convert my sparse vectors into dense vectors, but this may not be
> >>>>>> >> > worthwhile.
> >>>>>> >> >
> >>>>>> >> > So my questions are:
> >>>>>> >> > Is subtracting the mean during standardization only important when working
> >>>>>> >> > with dense vectors? Does it not matter for sparse vectors? Is just dividing
> >>>>>> >> > by the standard deviation with sparse vectors equivalent to also dividing by
> >>>>>> >> > standard deviation w and subtracting mean with dense vectors?
> >>>>>> >> >
> >>>>>> >> > Thank you,
> >>>>>> >> > Tobi
> >>>>>> >
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
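
To make the original question concrete: dividing by the standard deviation alone is not equivalent to full standardization, whatever the vector representation. A small plain-Python sketch on one made-up feature column shows the difference (the values and variable names are illustrative, not taken from the thread's data). Scaling alone keeps zeros at zero, so sparsity survives, but the result is not centered.

```python
from statistics import mean, stdev

column = [0.0, 3.0, 0.0, 7.0]  # illustrative values for a single feature
mu = mean(column)
sigma = stdev(column)  # sample standard deviation

scaled_only = [x / sigma for x in column]          # zeros stay zero
standardized = [(x - mu) / sigma for x in column]  # zeros become non-zero

print(mean(scaled_only))   # equals mu / sigma, not 0
print(mean(standardized))  # approximately 0
```

Only the second list has mean zero, and it no longer contains any zeros, which is the memory trade-off discussed above.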