Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
Can someone also provide input on why my code may not be working? Below, I have pasted part of my previous reply which describes the issue I am having here. I am really more perplexed about the first set of code (in bold). I know why the second set of code doesn't work, it is just something I

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
I should be more clear, since the outcome of the discussion above was not that obvious actually.
- I agree a change should be made to StandardScaler, and not VectorAssembler
- However I do think withMean should still be false by default and be explicitly enabled
- The 'offset' idea is orthogonal,

Re: Standardization with Sparse Vectors

2016-08-11 Thread Tobi Bosede
Opening this follow-up question to the entire mailing list. Anyone have thoughts on how I can add a column of dense vectors (created by converting a column of sparse features) to a data frame? My efforts are below. Although I know this is not the best approach for something I plan to put in

Re: Standardization with Sparse Vectors

2016-08-11 Thread Sean Owen
No, that doesn't describe the change being discussed, since you've copied the discussion about adding an 'offset'. That's orthogonal. You're also suggesting making withMean=True the default, which we don't want. The point is that if this is *explicitly* requested, the scaler shouldn't refuse to

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Sean, I have created a JIRA; I hope you don't mind that I borrowed your explanation of "offset". https://issues.apache.org/jira/browse/SPARK-17001 So what did you do to standardize your data, if you didn't use StandardScaler? Did you write a UDF to subtract the mean and divide by the standard deviation?

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Ah right, got it. As you say for storage it helps significantly, but for operations I suspect it puts one back in a "dense-like" position. Still, for online / mini-batch algorithms it may still be feasible I guess.

On Wed, 10 Aug 2016 at 19:50, Sean Owen wrote:
> All

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
All elements, I think. Imagine a sparse vector 1:3 3:7 which conceptually represents 0 3 0 7. Imagine it also has an offset stored which applies to all elements. If it is -2 then it now represents -2 1 -2 5, but this requires just one extra value to store. It only helps with storage of a shifted
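
A minimal sketch of the offset idea described above, in plain Python. The `OffsetSparseVector` class and its field names are hypothetical and used only for illustration; nothing like it exists in Spark. It stores the same two non-zero entries as the example, plus one extra number that is added to every element when the vector is read.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OffsetSparseVector:
    """Hypothetical illustration only; this is not a Spark class."""
    size: int
    values: Dict[int, float]  # index -> stored value, before the offset is applied
    offset: float = 0.0       # single extra value applied to all elements

    def to_dense(self) -> List[float]:
        # Implicit entries are 0 before the shift, so they become `offset`.
        return [self.values.get(i, 0.0) + self.offset for i in range(self.size)]


# The vector "1:3 3:7" from the message, with and without an offset of -2.
vec = OffsetSparseVector(size=4, values={1: 3.0, 3: 7.0})
shifted = OffsetSparseVector(size=4, values={1: 3.0, 3: 7.0}, offset=-2.0)

print(vec.to_dense())      # [0.0, 3.0, 0.0, 7.0]
print(shifted.to_dense())  # [-2.0, 1.0, -2.0, 5.0]
```

The shifted vector still stores only two entries and one offset, which is the storage saving being discussed; any operation that has to visit every element would still do dense-sized work.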

Re: Standardization with Sparse Vectors

2016-08-10 Thread Nick Pentreath
Sean, by 'offset' do you mean basically subtracting the mean but only from the non-zero elements in each row?

On Wed, 10 Aug 2016 at 19:02, Sean Owen wrote:
> Yeah I had thought the same, that perhaps it's fine to let the
> StandardScaler proceed, if it's explicitly asked to

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Yeah I had thought the same, that perhaps it's fine to let the StandardScaler proceed, if it's explicitly asked to center, rather than refuse to. It's not really much more rope to let a user hang herself with, and refusing blocks legitimate usages (we ran into this last week and couldn't use

Re: Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Thanks Sean, I agree 100% that the math is math and dense vs sparse is just a matter of representation. I was trying to convince a co-worker of this to no avail. Sending this email was mainly a sanity check. I think having an offset would be a great idea, although I am not sure how to

Re: Standardization with Sparse Vectors

2016-08-10 Thread Sean Owen
Dense vs sparse is just a question of representation, so doesn't make an operation on a vector more or less important as a result. You've identified the reason that subtracting the mean can be undesirable: a notionally billion-element sparse vector becomes too big to fit in memory at once. I know
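
To make the memory concern concrete, here is a small plain-Python sketch, not Spark code. The helper `center_row` and the per-feature means are made up for illustration. Subtracting a non-zero mean from every feature turns each implicit zero into a stored non-zero value.

```python
from typing import Dict, List


def center_row(row: Dict[int, float], means: List[float]) -> List[float]:
    """Subtract each feature's mean from a sparse row (index -> value)."""
    # Every position is touched, including the implicit zeros.
    return [row.get(j, 0.0) - mean for j, mean in enumerate(means)]


row = {1: 3.0, 3: 7.0}            # 2 stored values out of 4 features
means = [0.5, 2.0, 1.5, 4.0]      # illustrative per-feature means (assumed)

centered = center_row(row, means)
stored_before = len(row)
nonzeros_after = sum(1 for x in centered if x != 0.0)

print(centered)                       # [-0.5, 1.0, -1.5, 3.0]
print(stored_before, nonzeros_after)  # 2 4
```

With four features the difference is trivial; with a very wide feature space, each centered row needs storage for every feature rather than only the originally non-zero ones.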

Standardization with Sparse Vectors

2016-08-10 Thread Tobi Bosede
Hi everyone, I am doing some standardization using StandardScaler on data from VectorAssembler which is represented as sparse vectors. I plan to fit a regularized model. However, StandardScaler does not allow the mean to be subtracted from sparse vectors. It will only divide by the standard
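
For contrast, scaling without centering keeps a sparse vector sparse, because zero divided by anything non-zero is still zero. The sketch below is plain Python rather than the Spark API; `scale_row` and the standard deviations are hypothetical, and all deviations are assumed to be non-zero.

```python
from typing import Dict, List


def scale_row(row: Dict[int, float], stds: List[float]) -> Dict[int, float]:
    """Divide each stored value by its feature's standard deviation."""
    # Only stored entries are visited; implicit zeros stay implicit.
    return {j: value / stds[j] for j, value in row.items()}


row = {1: 3.0, 3: 7.0}
stds = [1.0, 1.5, 2.0, 3.5]       # illustrative, assumed non-zero

scaled = scale_row(row, stds)
print(scaled)                     # {1: 2.0, 3: 2.0}
```

The result has exactly the same stored indices as the input, so scaling alone costs nothing extra in storage.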