Re: Hash-coded Vectorization and bogus information

2012-02-13 Thread Ted Dunning
On Tue, Feb 14, 2012 at 2:25 AM, Lance Norskog wrote:
> ...
> OnlineLogisticRegression allocates DenseVector/DenseMatrix objects- if
> it used RandomSparse Vector/Matrix could it operate on million-term
> sparse arrays?

Not likely. The feature vectors that come in are sparse and the updates t…
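Ted's reply is truncated above, but the sparse-input point it opens with can be illustrated: in logistic-regression SGD, the per-example work is proportional to the number of *nonzero* features, even when the weight storage itself is dense. A minimal sketch in plain Python (not Mahout's actual OnlineLogisticRegression code; the function name and rate are made up for illustration):

```python
import math

def sgd_update(weights, sparse_x, target, learning_rate=0.01):
    """One binary logistic-regression SGD step.

    The input vector is sparse ({index: value}), so only the weights at
    its nonzero indices are read and written -- the weight array can
    stay dense without making the per-example cost depend on the full
    dimensionality.
    """
    # Dot product over the nonzero entries only.
    score = sum(weights[i] * v for i, v in sparse_x.items())
    p = 1.0 / (1.0 + math.exp(-score))
    grad = p - target
    for i, v in sparse_x.items():
        weights[i] -= learning_rate * grad * v
    return weights
```

This is why allocating a sparse weight matrix would not help much: the weights fill in over training anyway, while the per-example cost is already governed by input sparsity.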

Re: Hash-coded Vectorization and bogus information

2012-02-13 Thread Lance Norskog
This is in the context of playing with the example classification scripts in mahout/examples/bin. OnlineLogisticRegression allocates DenseVector/DenseMatrix objects; if it used RandomSparse Vector/Matrix could it operate on million-term sparse arrays? The problem is that seq2sparse has several te…

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
If you don't use hashed encoding you lose the single-pass nature of the example. Also, many real applications require huge vocabularies, which make non-hashed representations infeasible due to memory use in the logistic regression models.

Sent from my iPhone

On Feb 12, 2012, at 20:53, Lance No…
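Ted's memory argument can be made concrete with back-of-the-envelope arithmetic (the vocabulary and category counts below are illustrative, not from the thread): a dense logistic-regression model stores one weight per (feature, category) pair, so an unhashed vocabulary drives model size linearly, while hashing caps it at the chosen table size.

```python
def model_bytes(num_features, num_categories, bytes_per_weight=8):
    """Size of a dense weight matrix for multinomial logistic
    regression: one double per (feature, category) pair."""
    return num_features * num_categories * bytes_per_weight

# Unhashed: a 10M-term vocabulary with 20 categories needs ~1.6 GB of
# weights.  Hashed to 2**20 dimensions it needs ~160 MB, and the size
# stays fixed no matter how large the vocabulary grows.
unhashed = model_bytes(10_000_000, 20)
hashed = model_bytes(2**20, 20)
```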

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Lance Norskog
Ah! Ok. The SGD examples in examples/bin/asf-examples.sh and examples/bin/classify-twentynewsgroups.sh both use hash vectorization. Should they use the sparse term vectors instead? The "new" Bayes examples (nbtrain and nbtest) in asf-examples.sh use sparse.

On Sun, Feb 12, 2012 at 7:00 AM, Ted Dun…

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
Hash-coded vectorization *is* a random projection. It is just one that preserves some degree of sparsity. It definitely loses information when you use it to decrease the dimension of the input. It does not "add bogus information". SGD doesn't like dense vectors, actually. In fact, one of the nice…
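Ted's point (collisions can *lose* information, but nothing spurious is *added*) can be sketched in a few lines of plain Python. This is an illustration of the hashing trick, not Mahout's actual encoder classes; real systems use a stable hash such as MurmurHash, whereas Python's built-in `hash` is salted per process for strings, so only structural properties are stable here.

```python
from collections import defaultdict

def hashed_encode(tokens, num_features=2**20):
    """Hash each token into a fixed-size feature vector (the hashing trick).

    A collision merges the counts of two distinct tokens, so information
    can be LOST when num_features is small -- but nothing spurious is
    ADDED: every nonzero entry is a sum of real token counts.  The
    result is still sparse: only indices that were actually hit are
    stored.
    """
    vec = defaultdict(float)
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return dict(vec)
```

Note that the output is a single pass over the tokens with no dictionary to build first, which is the single-pass property Ted mentions earlier in the thread.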

Hash-coded Vectorization and bogus information

2012-02-11 Thread Lance Norskog
Does hash-coded vectorization add bogus information compared to sparse term vectors? A more concrete question: would a random projection on the sparse vector give a "better quality" dense vector? (This is in the context of SGD classification, which "likes" dense vectors.) -- Lance Norskog goks...
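For contrast with hashed encoding, the dense Gaussian random projection Lance asks about can be sketched as follows (illustrative Python, not a Mahout API; the function name and seeding scheme are made up). The key trade-off is visible in the shape of the output: it is fully dense, so downstream SGD pays `out_dim` work per example regardless of how sparse the input was.

```python
import random

def dense_random_projection(sparse_vec, out_dim, seed=42):
    """Project a sparse vector {index: value} to a dense vector of
    out_dim entries using a Gaussian random matrix.  Each input index
    deterministically seeds its own RNG, so the full projection matrix
    is never materialized in memory."""
    out = [0.0] * out_dim
    for idx, val in sparse_vec.items():
        rng = random.Random(seed * 1_000_003 + idx)
        for j in range(out_dim):
            out[j] += val * rng.gauss(0.0, 1.0)
    return out
```

Because the per-index RNG is deterministic, the map is linear: scaling the input scales the output, just as with any matrix projection.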