Can I safely assume the input to the naive Bayes is a sequence file with Text as the label (key) and VectorWritable as the instance (value)? Or should it be a dummy key and a MultilabelledVector? Have we closed the discussion on this?
I will now modify the DictionaryVectorizer to output the subdirectory chain as the label. SequenceFileFromDirectory creates text sequence files with names like "./Subdir1/Subdir2/file". DictionaryVectorizer will run an extra job that takes the named vectors it generates and makes labelled vectors from them. The open question is the handling of the label dictionary; this is a messy way of dealing with it.

The other way is to let naive Bayes read the data as NamedVectors and take care of tokenizing and extracting the label from the name (two choices: keep it as a String, or use a dictionary lookup to convert it to an integer).

Thoughts?

Robin

On Sun, Sep 26, 2010 at 2:57 AM, Ted Dunning <[email protected]> wrote:
> Log normalization is already in TextValueEncoder
>
> On Sat, Sep 25, 2010 at 2:20 PM, Robin Anil <[email protected]> wrote:
>
> > Ok. I am going ahead with this. I would ask you to add the log
> > normalization per document as an option in SGD. Jrennie's paper
> > mentions how it improves accuracy for text. I don't know how it
> > affects sgd type of learning.
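To make the second option concrete, here is a minimal sketch of what extracting the label from a NamedVector-style name could look like. This is purely illustrative and not Mahout API: the class name `LabelExtractor`, the assumption that the label is the full subdirectory chain (everything before the file name), and the id-assignment scheme are all hypothetical; it shows both choices mentioned above, keeping the label as a String or mapping it to an integer via a dictionary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not Mahout API): derive a label from a vector name
// such as "./Subdir1/Subdir2/file", as produced by SequenceFileFromDirectory.
// Assumes the label is the subdirectory chain, i.e. "Subdir1/Subdir2".
public class LabelExtractor {

  private final Map<String, Integer> labelDictionary = new HashMap<>();

  // Choice 1: keep the label as a String (the directory chain).
  public static String extractLabel(String vectorName) {
    String trimmed = vectorName.startsWith("./")
        ? vectorName.substring(2)
        : vectorName.startsWith("/") ? vectorName.substring(1) : vectorName;
    int lastSlash = trimmed.lastIndexOf('/');
    // No directory component: treat the whole name as the label.
    return lastSlash < 0 ? trimmed : trimmed.substring(0, lastSlash);
  }

  // Choice 2: convert the label to an integer via a dictionary lookup,
  // assigning a new id the first time a label is seen.
  public int labelId(String vectorName) {
    String label = extractLabel(vectorName);
    return labelDictionary.computeIfAbsent(label, k -> labelDictionary.size());
  }
}
```

Either way, the dictionary (or the raw String) still has to be persisted somewhere the classifier can read it back, which is the messy part noted above.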
