Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
If you don't use hashed encoding, you lose the single-pass nature of the example. Also, many real applications require huge vocabularies, which make non-hashed representations infeasible due to memory use in the logistic regression models. Sent from my iPhone On Feb 12, 2012, at 20:53, Lance No
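To illustrate the point above, here is a minimal sketch of hashed ("feature hashing") encoding: tokens are mapped into a fixed-size vector without ever building a vocabulary dictionary, so a single pass over the data suffices and memory stays bounded no matter how large the vocabulary grows. This is only an illustration of the idea, not Mahout's actual encoder.

```python
import hashlib

def hash_encode(tokens, dim=16):
    """Map a token list into a fixed-length vector by hashing each token
    to a bucket index. No vocabulary dictionary is ever materialized."""
    vec = [0.0] * dim
    for tok in tokens:
        # Deterministic hash of the token, reduced modulo the vector size.
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

# Three tokens produce total mass 3.0 in a 16-dimensional vector,
# regardless of how many distinct tokens the corpus contains overall.
v = hash_encode(["mahout", "sgd", "mahout"], dim=16)
```

Because the output dimension is fixed up front, an online learner such as SGD logistic regression can allocate its weight vector once and train in one streaming pass.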

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Lance Norskog
Ah! Ok. The SGD examples in examples/bin/asf-examples.sh and examples/bin/classify-twentynewsgroups.sh both use hash vectorization. Should they use the sparse term vectors instead? The "new" Bayes examples (nbtrain and nbtest) in asf-examples.sh use sparse. On Sun, Feb 12, 2012 at 7:00 AM, Ted Dun

Decision Forest and text classification

2012-02-12 Thread Daniele Volpi
Hi everyone, I'd like to run the Decision Forest classifier on the 20 newsgroups dataset. According to the documentation, the Mahout implementation accepts only numerical or categorical attributes, so the only way to do it is to transform the documents into fixed-length vectors (maybe using tf-idf a
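The transformation the question describes can be sketched as follows: build a shared vocabulary over the corpus, then represent every document as a fixed-length tf-idf vector whose components are plain numerical attributes. This is a toy sketch for illustration; in Mahout the equivalent work is done at scale by the vectorization tooling, not by code like this.

```python
import math

def tfidf_vectors(docs):
    """Turn variable-length token lists into fixed-length tf-idf vectors
    over a shared, sorted vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    vecs = []
    for d in docs:
        vec = []
        for t in vocab:
            tf = d.count(t)                 # raw term frequency
            idf = math.log(n / df[t])       # down-weight common terms
            vec.append(tf * idf)
        vecs.append(vec)
    return vocab, vecs

docs = [["atheism", "god"], ["hockey", "game", "game"]]
vocab, vecs = tfidf_vectors(docs)
```

Every output vector has the same length (the vocabulary size), which is exactly the fixed-length, numerical representation a decision-forest implementation expects.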

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
Hash-coded vectorization *is* a random projection. It is just one that preserves some degree of sparsity. It definitely loses information when you use it to decrease the dimension of the input. It does not "add bogus information". SGD doesn't like dense vectors, actually. In fact, one of the nice
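The "loses information but adds nothing bogus" point can be made concrete: when the hashed dimension is smaller than the number of distinct tokens, the pigeonhole principle guarantees that some tokens collide, so their counts merge and the original values cannot always be recovered; but no spurious features appear, since every nonzero component comes from real input tokens. A small illustrative sketch (not Mahout code):

```python
import hashlib

def bucket(token, dim):
    """Deterministically hash a token to a bucket index in [0, dim)."""
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim

# Five distinct tokens hashed into only four buckets: at least two
# tokens must share a bucket (pigeonhole), which is the information loss.
tokens = ["w0", "w1", "w2", "w3", "w4"]
buckets = [bucket(t, dim=4) for t in tokens]
collision = len(set(buckets)) < len(tokens)
```

Note that the result is still sparse relative to a dense Gaussian random projection: each token touches exactly one output coordinate, which is the sparsity-preserving property mentioned above.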

Re: Apache Mahout 0.6 Released

2012-02-12 Thread Dan Brickley
On 7 February 2012 14:04, Jeff Eastman wrote: > +1 Congratulations to Shannon for a job well done. We now have a 0.6 release > and can begin to concentrate on the plan and issues for a 0.7 release. Yes, congrats to all concerned, really great seeing this moving along :) Meanwhile, the homepage s

Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
We have a couple of JIRAs that relate here: we want to factor all the (-cl) classification steps out of all of the driver classes (MAHOUT-930) and into a separate job to remove duplicated code; MAHOUT-931 is to add a pluggable outlier removal capability to this job; and MAHOUT-933 is aimed at fact

Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
+ users@ These are great ideas, and are just the kinds of high-level conversations I was hoping to engender. From my agile background, I'd hope to define 0.7 by a small number of "epic stories", in a subset of our overall capabilities, which could focus our attention on a set of derivative JI

Fwd: Re: Goals for Mahout 0.7

2012-02-12 Thread Jeff Eastman
+user@ I'd like our users involved in this discussion too. Original Message Subject:Re: Goals for Mahout 0.7 Date: Sat, 11 Feb 2012 22:29:02 +0100 From: Frank Scholten Reply-To: d...@mahout.apache.org To: d...@mahout.apache.org I'd like to add solving