Small change in your world domination plan

On Sun, Sep 11, 2011 at 9:09 PM, Grant Ingersoll <[email protected]> wrote:

>
> On Sep 11, 2011, at 10:19 AM, Grant Ingersoll wrote:
>
> > A few classifier questions:
> >
> > What's the difference between the two naive bayes packages?  AFAICT, the
> naivebayes works off of vectors already, but are there any differences in
> the algorithms themselves?  In other words, if I do seq2sparse to get
> vectors in, all should be good to go w/ the new vector based naive bayes,
> right?  Do we have docs on the new naivebayes package anywhere?   For
> instance, how do the labels get associated with the training examples?  I
> see the --labels option, but it isn't clear how it relates to the training
> data.
> >
> > As for SplitBayesInput, I don't see that being used anywhere, but I think
> I have a case for it.  The only thing is, I want it to work off of
> SequenceFiles and split them, I think (b/c I want to run the new naivebayes
> package).  Does that make sense?
> >
> > Here's what I'm ultimately trying to do:
> > I've got all this ASF email data.  It's currently bucketed like the news
> groups stuff, so I thought I would build a similar example (but one that
> actually makes sense to run in a cluster due to size).  I want to take and
> split the data into test and training sets across all the mailing lists such
> that one could attempt to classify new mail as to which project it belongs
> to (it will be curious to see how it compares dev lists vs. user lists.)
>  WIP is at github.com/lucidimagination/mahout.
> >
>
> Just to follow up, my current plan would be to do:
> 1. Raw mail -> sequence files (SequenceFilesFromMailArchives)
> 2. seq2sparse
> 3. SplitBayesInput (which really should be renamed to just SplitInput, as
> there is nothing "Bayes" about it) -- also, make it work with Sequence files
> 4. Run training
> 5. Run test (need to load up the class vectors and compute dot products)
> 6. Conquer the world
>
> > Given time, I'd also like to hook in some of the various other
> classifiers, as I think it would be useful to be able to have a single
> example, with real data, that runs all the various algorithms (clustering,
> classification, CF, etc.)
> >
> > -Grant
> >
>

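For step 5 of the quoted plan ("load up the class vectors and compute dot products"), here is a minimal sketch of what the test pass could look like. It is plain Python, not Mahout code: the sparse-dict vector representation, the `dot` and `classify` helpers, the list labels, and all the weights are made up for illustration, and it assumes the highest-scoring class wins (a complementary naive Bayes variant might pick the lowest score instead).

```python
from typing import Dict

# A sparse vector maps a term index to a weight. This representation is an
# assumption for illustration, not the format Mahout stores on disk.
SparseVector = Dict[int, float]


def dot(a: SparseVector, b: SparseVector) -> float:
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b.get(index, 0.0) for index, weight in a.items())


def classify(doc: SparseVector, class_vectors: Dict[str, SparseVector]) -> str:
    """Return the label whose class vector scores highest against the document."""
    return max(class_vectors, key=lambda label: dot(doc, class_vectors[label]))


# Hypothetical per-class weight vectors, as a training run might produce.
class_vectors = {
    "mahout-user": {0: 1.2, 3: 0.4},
    "lucene-dev": {1: 0.9, 2: 1.1},
}

# Hypothetical document vector, as seq2sparse might produce for one message.
doc = {0: 2.0, 2: 0.5}

print(classify(doc, class_vectors))
```

Running the same loop over every vector in the held-out test split and comparing the predicted label with the mailing list the message came from would give the accuracy numbers for the dev vs. user list comparison.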