Sean,

I have created a JIRA; I hope you don't mind that I borrowed your
explanation of "offset": https://issues.apache.org/jira/browse/SPARK-17001

So what did you do to standardize your data if you didn't use
StandardScaler? Did you write a udf to subtract the mean and divide by the
standard deviation?
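
Something like this is roughly what I have in mind -- just a sketch with a
hypothetical data frame df and raw column "x", done before the columns are
assembled into a feature vector (so no udf is strictly needed for this part):

    from pyspark.sql import functions as F

    # compute the column's mean and standard deviation once, then shift and scale
    stats = df.agg(F.mean("x").alias("mu"), F.stddev("x").alias("sigma")).first()
    df_std = df.withColumn("x_std", (F.col("x") - stats.mu) / stats.sigma)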

Although I know this is not the best approach for something I plan to put
into production, I have been trying to write a udf that turns the sparse
vector into a dense one and to apply it with withColumn(). withColumn()
complains that the data is a tuple. I think the issue might be the return
type parameter: the function returns a vector of doubles, and there is no
primitive DataType that seems adequate for that.


    sparseToDense = udf(lambda data: float(DenseVector([data.toArray()])),
                        DoubleType())
    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures",
                                                    sparseToDense("features"))

The conversion itself works outside of a udf, but I am then unable to add
the resulting column back to the data frame I started out with; it looks
like withColumn() only accepts a Column expression, not an RDD. Thoughts?

    denseFeatures = TrainingRdf.select("features").map(
        lambda data: DenseVector([data.features.toArray()]))
    denseTrainingRdf = trainingRdfAssemb.withColumn("denseFeatures", denseFeatures)

Thanks,
Tobi

On Wed, Aug 10, 2016 at 1:01 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Ah right, got it. As you say, it helps significantly for storage, but for
> operations I suspect it puts one back in a "dense-like" position. Still,
> for online / mini-batch algorithms it may be feasible, I guess.
> On Wed, 10 Aug 2016 at 19:50, Sean Owen <so...@cloudera.com> wrote:
>
>> All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually
>> represents 0 3 0 7. Imagine it also has an offset stored which applies to
>> all elements. If it is -2 then it now represents -2 1 -2 5, but this
>> requires just one extra value to store. It only helps with storage of a
>> shifted sparse vector; iterating still typically requires iterating all
>> elements.
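>>
>> A hypothetical sketch of the idea in Python (nothing like this exists in
>> Spark today; the class and field names are made up):
>>
>>     class OffsetSparseVector(object):
>>         """A sparse vector with a constant shift applied to all elements."""
>>         def __init__(self, size, indices, values, offset=0.0):
>>             self.size = size
>>             self.indices = list(indices)
>>             self.values = list(values)
>>             self.offset = offset
>>
>>         def __getitem__(self, i):
>>             # stored entries and implicit zeros are both shifted by the offset
>>             if i in self.indices:
>>                 return self.values[self.indices.index(i)] + self.offset
>>             return self.offset
>>
>>     # 1:3 3:7 represents [0, 3, 0, 7]; with offset=-2 it is [-2, 1, -2, 5],
>>     # at the cost of storing a single extra double
>>     v = OffsetSparseVector(4, [1, 3], [3.0, 7.0], offset=-2.0)
>>     print([v[i] for i in range(v.size)])   # [-2.0, 1.0, -2.0, 5.0]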
>>
>> Probably, where this would help, the caller can track the offset itself
>> and apply that knowledge even more efficiently. I remember digging into
>> this when looking at how sparse covariance matrices are computed; it
>> almost, but not quite, enabled an optimization.
>>
>>
>> On Wed, Aug 10, 2016, 18:10 Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> Sean, by 'offset' do you mean basically subtracting the mean, but only
>>> from the non-zero elements in each row?
>>> On Wed, 10 Aug 2016 at 19:02, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Yeah, I had thought the same: perhaps it's fine to let StandardScaler
>>>> proceed if it's explicitly asked to center, rather than refuse to. It's
>>>> not really much more rope to let a user hang herself with, and refusing
>>>> blocks legitimate usages (we ran into this last week and couldn't use
>>>> StandardScaler as a result).
>>>>
>>>> I'm personally supportive of the change and don't see a JIRA. I think
>>>> you could at least make one.
>>>>
>>>> On Wed, Aug 10, 2016 at 5:57 PM, Tobi Bosede <ani.to...@gmail.com>
>>>> wrote:
>>>> > Thanks Sean, I agree 100% that the math is the math and dense vs
>>>> > sparse is just a matter of representation. I was trying to convince a
>>>> > co-worker of this to no avail. Sending this email was mainly a sanity
>>>> > check.
>>>> >
>>>> > I think having an offset would be a great idea, although I am not sure
>>>> > how to implement it. However, if anything should be done to rectify
>>>> > this issue, it should be done in StandardScaler, not VectorAssembler.
>>>> > There should not be any forcing of VectorAssembler to produce only
>>>> > dense vectors, so as to avoid performance problems with data that does
>>>> > not fit in memory. Furthermore, not every machine learning algorithm
>>>> > requires standardization. Instead, StandardScaler should have
>>>> > withMean=True as the default and should apply an offset if the vector
>>>> > is sparse, whereas there would be normal subtraction if the vector is
>>>> > dense. This way the default behavior of StandardScaler will always be
>>>> > what is generally understood to be standardization, as opposed to
>>>> > people thinking they are standardizing when they actually are not.
>>>> >
>>>> > Can anyone confirm whether there is a JIRA already?
>>>> >
>>>> > On Wed, Aug 10, 2016 at 10:58 AM, Sean Owen <so...@cloudera.com>
>>>> > wrote:
>>>> >>
>>>> >> Dense vs sparse is just a question of representation, so doesn't make
>>>> >> an operation on a vector more or less important as a result. You've
>>>> >> identified the reason that subtracting the mean can be undesirable: a
>>>> >> notionally billion-element sparse vector becomes too big to fit in
>>>> >> memory at once.
>>>> >>
>>>> >> I know this came up as a problem recently (I think there's a JIRA?)
>>>> >> because VectorAssembler will *sometimes* output a small dense vector
>>>> >> and sometimes output a small sparse vector based on how many zeroes
>>>> >> there are. But that's bad because then the StandardScaler can't
>>>> >> process the output at all. You can work on this if you're interested;
>>>> >> I think the proposal was to be able to force a dense representation
>>>> >> only in VectorAssembler. I don't know if that's the nature of the
>>>> >> problem you're hitting.
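>>>> >>
>>>> >> A quick illustration of that behavior, with made-up toy data (assumes
>>>> >> a Spark 2.0 session named spark):
>>>> >>
>>>> >>     from pyspark.ml.feature import VectorAssembler
>>>> >>
>>>> >>     df = spark.createDataFrame([(1.0, 0.0, 0.0, 0.0), (1.0, 2.0, 3.0, 4.0)],
>>>> >>                                ["a", "b", "c", "d"])
>>>> >>     assembler = VectorAssembler(inputCols=["a", "b", "c", "d"],
>>>> >>                                 outputCol="features")
>>>> >>     assembler.transform(df).select("features").show(truncate=False)
>>>> >>     # first row comes out sparse, (4,[0],[1.0]);
>>>> >>     # second row comes out dense, [1.0,2.0,3.0,4.0]
>>>> >>
>>>> >> so a downstream stage can see a mix of representations in one column.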
>>>> >>
>>>> >> It can be meaningful to only scale the dimension without centering
>>>> >> it, but it's not the same thing, no. The math is the math.
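>>>> >>
>>>> >> (Concretely, writing mu and sigma for a column's mean and standard
>>>> >> deviation: scaling alone maps x to x / sigma, while standardizing maps
>>>> >> x to (x - mu) / sigma, so the scaled-only column still has mean
>>>> >> mu / sigma rather than 0.)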
>>>> >>
>>>> >> This has come up a few times -- it's necessary to center a sparse
>>>> >> vector but prohibitive to do so. One idea I'd toyed with in the past
>>>> >> was to let a sparse vector have an 'offset' value applied to all
>>>> >> elements. That would let you shift all values while preserving a
>>>> >> sparse representation. I'm not sure if it's worth implementing but
>>>> >> would help this case.
>>>> >>
>>>> >> On Wed, Aug 10, 2016 at 4:41 PM, Tobi Bosede <ani.to...@gmail.com>
>>>> >> wrote:
>>>> >> > Hi everyone,
>>>> >> >
>>>> >> > I am doing some standardization using StandardScaler on data from
>>>> >> > VectorAssembler, which is represented as sparse vectors. I plan to
>>>> >> > fit a regularized model. However, StandardScaler does not allow the
>>>> >> > mean to be subtracted from sparse vectors; it will only divide by
>>>> >> > the standard deviation, which I understand is to keep the vector
>>>> >> > sparse. Thus I am trying to convert my sparse vectors into dense
>>>> >> > vectors, but this may not be worthwhile.
>>>> >> >
>>>> >> > So my questions are:
>>>> >> > Is subtracting the mean during standardization only important when
>>>> >> > working with dense vectors? Does it not matter for sparse vectors?
>>>> >> > Is just dividing by the standard deviation with sparse vectors
>>>> >> > equivalent to dividing by the standard deviation and subtracting the
>>>> >> > mean with dense vectors?
>>>> >> >
>>>> >> > Thank you,
>>>> >> > Tobi
>>>> >
>>>> >
>>>>
