I don't think I understand the questions entirely.  What you say starts out
easy, but then gets strange (to me).

On Sat, Sep 25, 2010 at 2:53 PM, Robin Anil <[email protected]> wrote:

> Can I safely assume the input to the naive bayes is a sequence file with
> Text as label (key) and VectorWritable as an instance (value)? Or should it
> be a dummy key and MultilabelledVector? Have we closed the discussions
> about it?
>

Hmm.... key=label, value=Vector sounds really good to me.

I don't understand the value of MultilabelledVector.

> I will now modify the DictionaryVectorizer to output the sub directory chain
> as a label.
>

If DictionaryVectorizer is specific to 20 newsgroups, then that is OK.  In
general, there will be too many documents to store one per file, and it may
be difficult to segregate data into one category per directory.

> SequenceFileFromDirectory will create text sequence files with name as
> "./Subdir1/Subdir2/file"
> DictionaryVectorizer will run an extra job which takes the named vectors it
> generates, and makes labelled vectors from them.
>

I can't have an opinion here.


>
> The question is the handling of the LabelDictionary. This is a messy way of
> handling this. The other way is to let naivebayes read data as NamedVectors
> and take care of tokenizing and extracting the label from the name (two choices
>

My big questions center on how this might be used in a production setting.
In that case, the assumption of input in files breaks down because the user
will probably have their own intricate input setup.  If we assume that the
input will be in the form of hashed feature vectors, then the following
outline seems reasonable to me:

    algorithm = new NaiveBayes(...)

    for all training examples {
       int actual = target variable value
       Vector features = vectorize example
       algorithm.train(actual, features)   // secretly save vector as appropriate
    }
    algorithm.close()                      // map-reduce actually happens here

My question to you is this: how does this outline mesh with what you are
saying?  Where do you think that the IDF would happen?
What role does the vector dictionary have here?
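
For concreteness, here is a minimal, self-contained sketch of that outline in
plain Java.  Everything in it is an assumption made for illustration: the
class name NaiveBayesSketch, the use of double[] in place of a Mahout Vector,
the token hashing in vectorize(), and the Laplace smoothing are all
hypothetical, and close() does its batch work in memory rather than launching
a map-reduce job.  It deliberately says nothing about where IDF or the vector
dictionary would fit.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical trainer: train() only buffers, close() does the batch work. */
class NaiveBayesSketch {
    private final int numLabels;
    private final int numFeatures;
    private final List<Integer> targets = new ArrayList<>();
    private final List<double[]> examples = new ArrayList<>();
    private double[] logPrior;          // filled in by close()
    private double[][] logLikelihood;   // filled in by close()

    NaiveBayesSketch(int numLabels, int numFeatures) {
        this.numLabels = numLabels;
        this.numFeatures = numFeatures;
    }

    /** Hashed bag-of-words: each token increments the bucket its hash maps to. */
    static double[] vectorize(String text, int numFeatures) {
        double[] v = new double[numFeatures];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), numFeatures)] += 1.0;
            }
        }
        return v;
    }

    /** Saves the example; no model parameters are computed here. */
    void train(int actual, double[] features) {
        targets.add(actual);
        examples.add(features.clone());
    }

    /** Stand-in for the batch job: multinomial estimates with add-one smoothing. */
    void close() {
        double[] labelCounts = new double[numLabels];
        double[][] featureSums = new double[numLabels][numFeatures];
        for (int i = 0; i < examples.size(); i++) {
            int label = targets.get(i);
            labelCounts[label] += 1.0;
            double[] x = examples.get(i);
            for (int j = 0; j < numFeatures; j++) {
                featureSums[label][j] += x[j];
            }
        }
        logPrior = new double[numLabels];
        logLikelihood = new double[numLabels][numFeatures];
        for (int c = 0; c < numLabels; c++) {
            logPrior[c] = Math.log((labelCounts[c] + 1.0) / (examples.size() + numLabels));
            double total = 0.0;
            for (int j = 0; j < numFeatures; j++) {
                total += featureSums[c][j];
            }
            for (int j = 0; j < numFeatures; j++) {
                logLikelihood[c][j] = Math.log((featureSums[c][j] + 1.0) / (total + numFeatures));
            }
        }
    }

    /** Returns the label with the highest log-probability score. */
    int classify(double[] features) {
        if (logPrior == null) {
            throw new IllegalStateException("call close() before classify()");
        }
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numLabels; c++) {
            double score = logPrior[c];
            for (int j = 0; j < numFeatures; j++) {
                score += features[j] * logLikelihood[c][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

The caller's loop then looks like the outline: vectorize each example however
it likes, call train(actual, features) for each one, and call close() once at
the end.  The point of the shape is that the caller never sees files or
directories, only labels and vectors.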
