On 1/26/11 12:51 AM, Olivier Grisel wrote:
> Perceptron models (or more generally linear models trained with SGD,
> such as the regularized logistic regression implemented in Mahout) are
> really fast to train, can scale to huge datasets (no need to load all
> the data into memory, especially if you hash the features as is done in
> Vowpal Wabbit and in Mahout with MurmurHash), and are simple to
> implement and debug.
I think the feature map we have is something that also needs
to scale. I once experimented with 64-bit long values as features:
we simply take each "string feature" and hash it into a 64-bit long.
The resulting values need not be unique, but in practice they are.
I tested this on the Leipzig Corpora for all languages, generating
all possible features, and in the end I did not see a single hash collision.
Even an occasional collision would likely not harm detection
performance much.
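A minimal sketch of such a feature hash. The class and method names here are made up for illustration, and FNV-1a is used as a simple stand-in so the example is self-contained; MurmurHash, as mentioned above, would be the production choice.

```java
// Sketch: map string features to 64-bit longs instead of keeping
// interned String objects. Uses the 64-bit FNV-1a hash as a
// self-contained stand-in for MurmurHash.
public class FeatureHasher {

    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME  = 0x100000001b3L;

    // Hash a string feature into a 64-bit long. Collisions are
    // possible in principle but rare in practice.
    public static long hashFeature(String feature) {
        long h = FNV_OFFSET;
        for (int i = 0; i < feature.length(); i++) {
            char c = feature.charAt(i);
            h ^= (c & 0xff);   // low byte of the char
            h *= FNV_PRIME;
            h ^= (c >>> 8);    // high byte of the char
            h *= FNV_PRIME;
        }
        return h;
    }

    public static void main(String[] args) {
        // Two distinct string features map to two distinct longs.
        System.out.println(hashFeature("prefix=un"));
        System.out.println(hashFeature("suffix=ing"));
    }
}
```

Downstream, the model then only ever sees 8-byte longs, so the training loop never has to hold the original feature strings in memory.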
I also believe that these 64-bit features could be generated faster
than our current string features.
At the very least they need less memory: for a String feature we pay
for the String object itself plus a reference to it, and the reference
alone is already 32 bits, or maybe even 64 bits, i.e. on its own up to
the size of the entire hashed feature.
Jörn