Dear all,
Per my understanding, what feature hashing does in SGD is compress the
feature dimensions into a fixed-length vector. Won't that make the training
result incorrect if a feature-hashing collision happens? Wouldn't two
features hashed to the same slot be treated as the same feature? Even i…
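To make the concern concrete, here is a minimal sketch of the hashing trick (illustrative only, not Mahout's actual encoder — the function names and the FNV-1a hash are choices made for the demo): each feature name is hashed into a slot of a fixed-length vector, so two distinct features that land in the same slot have their values summed and become indistinguishable to the learner.

```python
def fnv1a(name):
    # Simple deterministic 32-bit string hash (FNV-1a) for the demo.
    h = 2166136261
    for ch in name.encode("utf-8"):
        h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
    return h

def hash_features(features, dim):
    """Map {name: value} pairs into a dense vector of length `dim`."""
    vec = [0.0] * dim
    for name, value in features.items():
        slot = fnv1a(name) % dim
        vec[slot] += value  # colliding features are summed together
    return vec

# With a tiny vector, collisions are almost guaranteed: three features
# must share two slots, so at least one slot mixes two features.
v = hash_features({"age": 1.0, "income": 2.0, "clicks": 4.0}, dim=2)
```

The total mass is preserved (the values just pile up), which is exactly why a collision silently merges two features rather than failing loudly.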
Have a look at Ted's talk about Mahout's SGD classifier:
http://vimeo.com/21273655
As far as I remember he also covers the hashing issues you describe.
--sebastian
On 20.04.2011 15:06, Stanley Xu wrote:
> Per my understanding, what feature hashing does in SGD is compress the
> feature dimensions to a fixed-length vector.
Thanks Sebastian, downloading. Need 20 hours...
Best wishes,
Stanley Xu
On Wed, Apr 20, 2011 at 9:09 PM, Sebastian Schelter wrote:
> Have a look at Ted's talk about Mahout's SGD classifier:
> http://vimeo.com/21273655
>
> As far as I remember he also covers the hashing issues you describe.
>
>
Stanley,
Yes. What you say is correct. Feature hashing can cause degradation.
With multiple hashing, however, you do have a fairly strong guarantee that
the feature hashing is very close to information-preserving. This is
related to the fact that the feature hashing operation is a random linear…
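The "multiple hashing" idea can be sketched as follows (a hedged illustration of the technique Ted describes, not Mahout's actual encoder code — the probe count and hash construction here are assumptions for the demo): each feature is written into k slots chosen by k different hash functions, so two features only become fully indistinguishable if they collide in *all* k slots, which is far less likely than a single-slot collision.

```python
def probe_slots(name, dim, k):
    """Return the k slots that feature `name` touches (k seeded hashes)."""
    slots = []
    for seed in range(k):
        h = 2166136261 ^ seed  # vary the FNV-1a basis per probe
        for ch in name.encode("utf-8"):
            h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
        slots.append(h % dim)
    return slots

def encode(features, dim, k=3):
    """Hash each {name: value} pair into k slots, spreading the weight."""
    vec = [0.0] * dim
    for name, value in features.items():
        for slot in probe_slots(name, dim, k):
            vec[slot] += value / k  # each probe carries 1/k of the weight
    return vec

slots = probe_slots("age", dim=100, k=3)
vec = encode({"age": 1.0}, dim=100, k=3)
```

Even when one probe of one feature collides with another feature, the other probes still carry enough signal for the learner to separate them, which is the sense in which the encoding stays close to information-preserving.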
Thanks Ted. Since SGD is a sequential method, the vector created for each
line could be very large without consuming too much memory overall. Could I
assume that if we have a limited number of features, we could use map-reduce
to pre-process the data to find out how many different values are in a
category c…
It is definitely a reasonable idea to convert data to hashed feature vectors
using map-reduce.
And yes, you can pick a vector length that is long enough that you don't
have to worry about collisions. You need to examine your data to decide how
large that needs to be, but it isn't hard to do.
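A standard back-of-envelope for that sizing step is the birthday approximation: with n distinct features hashed uniformly into m slots, the expected number of colliding pairs is roughly n(n-1)/(2m). The helper names below are made up for illustration; the arithmetic is the standard bound.

```python
def expected_collisions(n_features, dim):
    # Birthday approximation: expected number of colliding feature pairs
    # when n_features are hashed uniformly into dim slots.
    return n_features * (n_features - 1) / (2.0 * dim)

def dim_for_budget(n_features, max_collisions=1.0):
    """Smallest power of two keeping expected collisions within budget."""
    dim = 2
    while expected_collisions(n_features, dim) > max_collisions:
        dim *= 2
    return dim

# For 1,000 distinct features and a budget of ~1 expected colliding
# pair, a vector of 2**19 = 524,288 slots suffices.
needed = dim_for_budget(1000)
```

So "examine your data" mostly means counting distinct feature values; the vector length then follows from how many expected collisions you are willing to tolerate.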
Got it. Thanks so much, Ted.
One more question. I am also trying to test the MixedGradient, and it looks
like the RankingGradient takes much more time than the DefaultGradient.
If I set alpha to 0.5, it takes 50 times as long as the DefaultGradient;
I thought it should be like t…
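One plausible reading of that 50x, sketched below purely as a cost model (this is an assumption about how a mixed point-wise/pair-wise gradient behaves, not Mahout's MixedGradient implementation): a default, point-wise gradient does one evaluation per example, while a ranking, pair-wise gradient compares each example against a pool of previously seen examples of the other class, so its per-example work grows with the pool size, and alpha only weights the two contributions rather than reducing the work done.

```python
def default_gradient_cost(n_examples):
    # Point-wise gradient: one evaluation per training example.
    return n_examples

def ranking_gradient_cost(n_examples, pool_size):
    # Pair-wise gradient: one comparison per pooled opposite-class example.
    return n_examples * pool_size

def mixed_gradient_cost(n_examples, pool_size, alpha):
    # alpha only weights the two contributions; this sketch assumes both
    # gradients are still evaluated, so the cost does not shrink with alpha.
    return default_gradient_cost(n_examples) + ranking_gradient_cost(n_examples, pool_size)

# With a comparison pool of 50, the mixed gradient is ~51x the default,
# regardless of alpha = 0.5.
ratio = mixed_gradient_cost(1000, 50, 0.5) / default_gradient_cost(1000)
```

If this model matches the implementation, the slowdown would be governed by the size of the ranking comparison pool, not by alpha.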
Maybe there's space for an MR-based input conversion job indeed, as a
command-line routine? I was kind of thinking about the same. Maybe even
along with standardisation of the values, and some formal definition of the
inputs being fed to it.
Apologies for brevity.
Sent from my android.
-Dmitriy
On Fri, Apr 22, 2011 at 6:39 AM, Stanley Xu wrote:
> One more question, I am also trying to test the MixedGradient, it looks
> like the RankingGradient will take much more time than the DefaultGradient.
>
This is probably due to memory use. You need to review which way you group
users.
Yes.
But how do we specify the input? And how do we specify the encodings?
This is what has always held me back in the past. Should we just allow
classes to be specified on the command line?
On Fri, Apr 22, 2011 at 8:47 AM, Dmitriy Lyubimov wrote:
> Maybe there's a space for an MR-based input conversion job indeed as a
> command-line routine?
I am not sure I see the difficulty, but it is possible we are talking
about slightly different things.
Hadoop solves this stuff through pluggable strategies, such as InputFormat.
Those strategies are parameterized (and also perhaps persisted) through
some form of declarative definitions (if we keep a…
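The InputFormat analogy can be sketched like this (a hypothetical design, not anything in Hadoop or Mahout — the registry, `TextEncoder`, and the spec format are all inventions for illustration): encoders are registered under names and parameterized through a small declarative definition, so a conversion job could be driven entirely by configuration.

```python
ENCODERS = {}

def register(name):
    # Decorator registering an encoder class under a declarative name.
    def wrap(cls):
        ENCODERS[name] = cls
        return cls
    return wrap

@register("text")
class TextEncoder:
    def __init__(self, lowercase=True):
        self.lowercase = lowercase
    def encode(self, value):
        return value.lower().split() if self.lowercase else value.split()

def build_pipeline(spec):
    """spec: list of {"field": ..., "encoder": ..., "params": {...}}."""
    return [(s["field"], ENCODERS[s["encoder"]](**s.get("params", {})))
            for s in spec]

# The whole conversion is described declaratively, like a job config:
pipeline = build_pipeline([{"field": "title", "encoder": "text"}])
```

The point is that the command-line tool never needs to know the concrete encoder classes; it just resolves names from the declarative definition, the same way a Hadoop job resolves its InputFormat.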
The difficulty is that vectorization often incorporates various kinds of
interpretation of the original data. This can involve nested field access,
parsing, and textual analysis, as well as the basic vector encoding. It
may involve running a classifier (possibly derived by clustering) on some…
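The layering Ted describes can be sketched end to end (a toy composition under assumed helper names, not a real Mahout pipeline): nested field access, then parsing/tokenization, and only then the hashed vector encoding.

```python
def get_nested(record, path):
    """Follow a dotted path like 'user.profile.bio' into nested dicts."""
    for key in path.split("."):
        record = record[key]
    return record

def tokenize(text):
    # Stand-in for real textual analysis (lowercasing + whitespace split).
    return text.lower().split()

def hash_token(token):
    # Deterministic 32-bit FNV-1a hash for the demo.
    h = 2166136261
    for ch in token.encode("utf-8"):
        h = ((h ^ ch) * 16777619) & 0xFFFFFFFF
    return h

def vectorize(record, path, dim):
    # Field access -> parsing -> hashed encoding, composed in one place.
    vec = [0.0] * dim
    for token in tokenize(get_nested(record, path)):
        vec[hash_token(token) % dim] += 1.0
    return vec

rec = {"user": {"profile": {"bio": "big data and machine learning"}}}
v = vectorize(rec, "user.profile.bio", dim=64)
```

Each stage here could reasonably be a pluggable strategy of its own, which is exactly why describing the whole thing as a single input format is harder than it first looks.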
I don't think stuff like pre-clustering or dimensionality reduction
should be included. Just the summarization, the hashing trick, and common
strategies for parsing non-quantitative inputs included in the book.
In addition, we might leave space for writing custom strategies
(pretty much like Hadoop leaves…
On Mon, Apr 25, 2011 at 12:04 PM, Dmitriy Lyubimov wrote:
> I don't think stuff like pre-clustering, dimensionality reduction
> should be included. Just the summarization, hashing trick and common
> strategies for parsing non-quantitative inputs included in the book.
>
So you prefer the limited f…
I see.
I guess you mean nested preprocessors vs. pipelined jobs.
There are some efforts, e.g. Rapid Miner, that allow doing more than
just input normalization in a formal model -- although I did not play
enough with that. But they do *something*; perhaps it could be a
source of inspiration. Is Rapid Miner modelling closer to what you mean?
The transformation aspects of Rapid Miner are likely similar to what I am
suggesting.
I haven't used Rapid Miner, though, so I can't comment in detail.
On Mon, Apr 25, 2011 at 3:31 PM, Dmitriy Lyubimov wrote:
> Is Rapid Miner modelling closer to what you
> mean?
>