Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-20 Thread Stanley Xu
Dear all, Per my understand, what Feature Hashing did in SGD do compress the Feature Dimensions to a fixed length Vector. Won't that make the training result incorrect if Feature Hashing Collision happened? Won't the two features hashed to the same slot would be thought as the same feature? Even i

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-20 Thread Sebastian Schelter
Have a look at Ted's talk about Mahout's SGD classifier: http://vimeo.com/21273655 As far as I remember he also covers the hashing issues you describe. --sebastian On 20.04.2011 15:06, Stanley Xu wrote: Dear all, Per my understand, what Feature Hashing did in SGD do compress the Feature Dime

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-20 Thread Stanley Xu
Thanks Sebastian, downloading. Need 20 hours... Best wishes, Stanley Xu On Wed, Apr 20, 2011 at 9:09 PM, Sebastian Schelter wrote: > Have a look at Ted's talk about Mahout's SGD classifier: > http://vimeo.com/21273655 > > As far as I remember he also covers the hashing issues you describe. > >

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-20 Thread Ted Dunning
Stanley, Yes. What you say is correct. Feature hashing can cause degradation. With multiple hashing, however, you do have a fairly strong guarantee that the feature hashing is very close to information preserving. This is related to the fact that the feature hashing operation is a random linea

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-20 Thread Stanley Xu
Thanks Ted. Since the SGD is a sequential method, so the Vector be created for each line could be very large and won't consume too much memory. Could I assume if we have limited number of features, or could use the map-reduce to pre-process the data to know how many different values in a category c

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-21 Thread Ted Dunning
It is definitely a reasonable idea to convert data to hashed feature vectors using map-reduce. And yes, you can pick a vector length that is long enough so that you don't have to worry about collisions. You need to examine your data to decide how large that needs to be, but it isn't hard to do.

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-22 Thread Stanley Xu
Got it. Thanks so much, Ted. One more question, I am also trying to test the MixedGradient, it looks like the RankingGradient will take much more time than the DefaultGradient. If I set the alpha to 0.5, it will take 50 times of time comparing to the DefaultGradient, I thought it should be like t

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-22 Thread Dmitriy Lyubimov
Maybe there's a space for Mr based input conversion job indeed as a command line routine? I was kind of thinking about the same. Maybe even along with standartisation of the values. Some formal definition of inputs being fed to it. apologies for brevity. Sent from my android. -Dmitriy On Apr 21,

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-22 Thread Ted Dunning
On Fri, Apr 22, 2011 at 6:39 AM, Stanley Xu wrote: > One more question, I am also trying to test the MixedGradient, it looks > like the RankingGradient will take much more time than the DefaultGradient. > This is probably due to memory use. You need to review which way you group users. > > If

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-22 Thread Ted Dunning
Yes. But how do we specify the input? And how do we specify the encodings? This is what has always held me back in the past. Should we just allow classes to be specified on the command line? On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov wrote: > Maybe there's a space for Mr based input c

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Dmitriy Lyubimov
I am not sure i see the difficulty but it is possible we are talking about slightly different things. Hadoop solves this stuff thru some pluggable strategies, such as InputFormat . Those strategies are paramerized (and also perhaps persisted) thru some form of declarative definitions (if we keep a

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Dmitriy Lyubimov
(typo corrected ) I am not sure i see the difficulty but it is possible we are talking about slightly different things. Hadoop solves this stuff thru some pluggable strategies, such as InputFormat . Those strategies are paramerized (and also perhaps persisted) thru some form of declarative defini

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Ted Dunning
The difficulty is that vectorization often incorporates various kinds of interpretation of the original data. This can involve the nesting of field access, parsing, textual analysis as well as the basic vector encoding. It may involve running a classifier (possibly derived by clustering) on some

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Dmitriy Lyubimov
I don't think stuff like pre-clustering, dimensionality reduction should be included. Just the summarization, hashing trick and common strategies for parsing non-quantitative inputs included in the book. In addition we might leave a space to writing custom strategies (pretty much like hadoop leaves

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Ted Dunning
On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov wrote: > I don't think stuff like pre-clustering, dimensionality reduction > should be included. Just the summarization, hashing trick and common > strategies for parsing non-quantitative inputs included in the book. > So you prefer the limited f

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Dmitriy Lyubimov
I see. I guess you mean nested preprocessors vs. pipelined jobs. There are some efforts, e.g. Rapid Miner, that allows to do more than just input normalization in a formal model -- although i did not play enough with that. But they do *something*. perhaps it could be a source for inspiration. Is

Re: Does the Feature Hashing and Collision in the SGD will harm the performance of the algorithm?

2011-04-25 Thread Ted Dunning
The transformation aspects of Rapid Miner are likely similar to what I am suggesting. I haven't used Rapid Miner, though, so can't comment in detail. On Mon, Apr 25, 2011 at 3:31 PM, Dmitriy Lyubimov wrote: > Is Rapid Miner modelling closer to what you > mean? >