Can I safely assume the input to the naive bayes is a sequence file with
Text as the label (key) and a VectorWritable as the instance (value)? Or
should it be a dummy key and a MultilabelledVector? Have we closed the
discussion about it?

I will now modify the DictionaryVectorizer to output the subdirectory chain
as a label.

SequenceFileFromDirectory will create text sequence files with names like
"./Subdir1/Subdir2/file"
DictionaryVectorizer will run an extra job which takes the named vectors it
generates and makes labelled vectors from them.

The question is how to handle the LabelDictionary. This is a messy way of
handling it. The other way is to let naivebayes read the data as
NamedVectors and take care of tokenizing and extracting the label from the
name (two choices: keep it as a String, or use a dictionary lookup to
convert it to integers).
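To make the second option concrete, here is a minimal sketch of what that
label handling could look like. This is not existing Mahout code; the class
and method names (LabelExtraction, extractLabel, labelId) are hypothetical,
and it assumes the NamedVector's name carries the path-style label shown
above ("./Subdir1/Subdir2/file"):

```java
import java.util.HashMap;
import java.util.Map;

public class LabelExtraction {

    // Pull the top-level subdirectory out of a name like
    // "./Subdir1/Subdir2/file" and treat it as the label.
    static String extractLabel(String name) {
        String trimmed = name.startsWith("./") ? name.substring(2) : name;
        int slash = trimmed.indexOf('/');
        return slash >= 0 ? trimmed.substring(0, slash) : trimmed;
    }

    // Dictionary lookup: assign each new string label the next integer id.
    // dict.size() is read before the new entry is inserted, so ids are 0, 1, ...
    static int labelId(Map<String, Integer> dict, String label) {
        Integer id = dict.get(label);
        if (id == null) {
            id = dict.size();
            dict.put(label, id);
        }
        return id;
    }

    public static void main(String[] args) {
        Map<String, Integer> labelDict = new HashMap<>();
        String label = extractLabel("./Subdir1/Subdir2/file");
        System.out.println(label);                       // Subdir1
        System.out.println(labelId(labelDict, label));   // 0
        System.out.println(labelId(labelDict, "Other")); // 1
        System.out.println(labelId(labelDict, label));   // 0 (already known)
    }
}
```

Either way the label dictionary would still need to be written out somewhere
so the classifier can map ids back to label strings at test time.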

Thoughts ?
Robin


On Sun, Sep 26, 2010 at 2:57 AM, Ted Dunning <[email protected]> wrote:

> Log normalization is already in TextValueEncoder
>
> On Sat, Sep 25, 2010 at 2:20 PM, Robin Anil <[email protected]> wrote:
>
> > Ok. I am going ahead with this. I would ask you to add the
> logNormalization
> > per document as an option in SGD. Jrennie's paper mentions how it
> improves
> > accuracy for text. I don't know how it affects sgd type of learning.
> >
>
