Re: classifier architecture needed

2010-07-11 Thread Isabel Drost
On 21.06.2010 Ted Dunning wrote:
> I would like to start a discussion about a framework that we can fit all of
> these approaches together in much the same way that the recommendations
> stuff has such nice pluggable properties.
+1 Like the ideas that have been tossed around in this discussion. Do…

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
I agree that models should be highly generic. I just don't think that we should legislate the content of either their internal model or their serialized representation. The contract is pretty clear, however. There are just a few methods and it isn't hard for all models to support them, espe…
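The "few methods" contract Ted alludes to can be illustrated with a minimal sketch. All names here (ClassifierModel, UniformModel, numCategories, classify) are hypothetical, invented for illustration, not Mahout's actual API:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** Hypothetical sketch of a minimal classifier-model contract: a handful of
 *  methods, with the internal representation and wire format left to each model. */
interface ClassifierModel {
    /** Number of target categories this model distinguishes. */
    int numCategories();
    /** One probability per category for a single feature vector. */
    double[] classify(double[] features);
    /** Serialization is part of the contract, but the format is the model's choice. */
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

/** Trivial model that always predicts uniform probabilities, just to show
 *  how little a model must expose to satisfy the contract. */
public class UniformModel implements ClassifierModel {
    private int k;
    public UniformModel(int k) { this.k = k; }
    public int numCategories() { return k; }
    public double[] classify(double[] features) {
        double[] p = new double[k];
        java.util.Arrays.fill(p, 1.0 / k);
        return p;
    }
    public void write(DataOutput out) throws IOException { out.writeInt(k); }
    public void readFields(DataInput in) throws IOException { k = in.readInt(); }
}
```

Any SGD, naive Bayes, or random forest model could implement the same handful of methods while keeping its parameters and serialization entirely private.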

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
On Tue, Jun 22, 2010 at 9:47 AM, Robin Anil wrote:
> > Again, I would recommend a blob as the on-disk format.
> Why a blob? Why not a flexible multi list of matrices and vectors?
> Is there any model storing byte level information?
The SGD has a parameter vector as well as a trace dic…

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
On Tue, Jun 22, 2010 at 9:44 AM, Robin Anil wrote:
> > On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil wrote:
> > > A Classifier Training Job will take a Trainer, and a Vector location and
> > > produce a Model
> How about a transform layer which converts on-disk data into v…

Re: classifier architecture needed

2010-06-22 Thread Robin Anil
The Wikipedia unigram dictionary is 381 MB on disk; bigram and trigram sizes will explode like anything. So the Vectorizer could be a pass-through if each job reads vectors (generated in parallel), or convert on the fly if using the randomizer. The reason I said models should be generic is because they…
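The randomizer Robin mentions (hashed feature encoding) sidesteps the 381 MB dictionary by hashing tokens straight into a fixed-size vector. A toy sketch, with all names (HashedVectorizer, vectorize) invented for illustration and not Mahout's actual encoder API:

```java
/** Toy sketch of dictionary-free "randomizer" vectorization via feature hashing.
 *  Instead of looking each token up in a huge dictionary, hash it directly to
 *  an index in a fixed-dimension vector. Collisions are accepted as noise. */
public class HashedVectorizer {
    public static double[] vectorize(String text, int dim) {
        double[] v = new double[dim];
        for (String token : text.toLowerCase().split("\\s+")) {
            int idx = Math.floorMod(token.hashCode(), dim);  // no dictionary needed
            v[idx] += 1.0;
        }
        return v;
    }
}
```

The vector dimension is fixed up front, so memory cost does not grow with vocabulary size even for bigram or trigram features.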

Re: classifier architecture needed

2010-06-22 Thread Robin Anil
> Again, I would recommend a blob as the on-disk format.
Why a blob? Why not a flexible multi list of matrices and vectors? Is there any model storing byte level information?

Re: classifier architecture needed

2010-06-22 Thread Robin Anil
> On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil wrote:
> > A Classifier Training Job will take a Trainer, and a Vector location and
> > produce a Model
How about a transform layer which converts on-disk data into vectors seamlessly? That should solve the issue.
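The transform layer Robin proposes might look something like this hypothetical adapter, which lets trainers iterate over vectors without ever touching the raw on-disk format. All names (VectorTransform, TransformingIterator, toVector) are invented for illustration:

```java
import java.util.Iterator;

/** Hypothetical transform layer: adapts any record source into a stream of
 *  vectors, so trainers never see raw on-disk data formats. */
interface VectorTransform<R> {
    double[] toVector(R record);
}

public class TransformingIterator<R> implements Iterator<double[]> {
    private final Iterator<R> records;
    private final VectorTransform<R> transform;

    public TransformingIterator(Iterator<R> records, VectorTransform<R> transform) {
        this.records = records;
        this.transform = transform;
    }
    public boolean hasNext() { return records.hasNext(); }
    public double[] next() { return transform.toVector(records.next()); }
}
```

A training job would then depend only on `Iterator<double[]>`, and swapping the input format means swapping the transform, not the trainer.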

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil wrote:
> See how this sounds (listing down requirements):
> A model can be a class with a list of matrices and a list of vectors. Each
> algorithm takes care of naming these matrices/vectors and reading and
> writing values to it (similar to Datastore)
I t…

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
On Tue, Jun 22, 2010 at 8:33 AM, Grant Ingersoll wrote:
> On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:
> > We really need to have a simple way to integrate all of the input processing
> > options easily into new and old code
> More or less, what we need is a pipeline that can ingest man…
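The pipeline idea, composing input-processing stages that end in vectors, might be sketched like this. The Pipeline name and its methods are hypothetical, not an existing Mahout class:

```java
import java.util.function.Function;

/** Hypothetical composable ingest pipeline: each stage is a function, and
 *  stages chain until raw input becomes whatever the trainer consumes. */
public class Pipeline<I, O> {
    private final Function<I, O> stage;

    private Pipeline(Function<I, O> stage) { this.stage = stage; }

    /** Start with an identity pipeline over the input type. */
    public static <T> Pipeline<T, T> start() {
        return new Pipeline<>(Function.identity());
    }
    /** Append a processing stage, yielding a pipeline with a new output type. */
    public <N> Pipeline<I, N> then(Function<O, N> next) {
        return new Pipeline<>(stage.andThen(next));
    }
    public O run(I input) { return stage.apply(input); }
}
```

New input formats then plug in as early stages while the downstream stages (tokenizing, vectorizing) stay shared between new and old code.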

Re: classifier architecture needed

2010-06-22 Thread Ted Dunning
On Tue, Jun 22, 2010 at 9:25 AM, Ted Dunning wrote:
> On Mon, Jun 21, 2010 at 8:35 PM, Robin Anil wrote:
> > A Classifier Training Job will take a Trainer, and a Vector location and
> > produce a Model
> No. Well, not exclusively, anyway. We can't be limited to reading vectors due…

Re: classifier architecture needed

2010-06-22 Thread Grant Ingersoll
On Jun 21, 2010, at 1:12 PM, Ted Dunning wrote:
> We are now beginning to have lots of classifiers in Mahout. The naive
> Bayes, complementary naive Bayes and random forest grandfathers have been
> joined by my recent SGD and Zhao Zhendong's prolific set of approaches for
> logistic regression a…

Re: classifier architecture needed

2010-06-21 Thread Robin Anil
See how this sounds (listing down requirements):
A model can be a class with a list of matrices and a list of vectors. Each algorithm takes care of naming these matrices/vectors and reading and writing values to it (similar to Datastore). All Classifiers will work with vectors. All Trainers will work with v…
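Robin's generic model, named matrices and vectors that each algorithm reads and writes by key, could be sketched as follows. The GenericModel name and its accessors are invented for illustration (the thread only compares it to the Datastore idea):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical generic model container: algorithms store and fetch their
 *  parameters by name, so the container itself knows nothing about any
 *  particular classifier's structure. */
public class GenericModel {
    private final Map<String, double[]> vectors = new HashMap<>();
    private final Map<String, double[][]> matrices = new HashMap<>();

    public void putVector(String name, double[] v) { vectors.put(name, v); }
    public double[] getVector(String name) { return vectors.get(name); }
    public void putMatrix(String name, double[][] m) { matrices.put(name, m); }
    public double[][] getMatrix(String name) { return matrices.get(name); }
}
```

An SGD trainer might store a "beta" vector while naive Bayes stores per-label weight matrices, both through the same container, which is the pluggability the thread is after.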