I don't think I understand the questions entirely. What you say starts out easy, but then gets strange (to me).

On Sat, Sep 25, 2010 at 2:53 PM, Robin Anil <[email protected]> wrote:

> Can I safely assume the input to the naive bayes is a sequence file with
> Text as label (key) an Vector writable as an instance(value)? Or should it
> be a dummy key and MultilabelledVector ? Have we closed the discussions
> about it?
>

Hmm.... key=label, value=Vector sounds really good to me. I don't
understand the value of MultilabelledVector.

> I will now modify the DictionaryVectorizer to output the sub directory chain
> as a label.
>

If DictionaryVectorizer is specific to 20 newsgroups, then that is OK. In
general, there will be too many documents to store one per file, and it
may be difficult to segregate data into one category per directory.

> SequenceFileFromDirectory will create text sequence files with name as
> "./Subdir1/Subdir2/file"
> DictionaryVectorizer will run an extra job which takes the named vectors it
> generates, and makes labelled vectors from them.
>

I can't have an opinion here.

>
> The questions is the handling of the LabelDictionary. This is a messy way of
> handling this. Other way is to let naivebayes read data as NamedVectors and
> take care of tokenizing and extracting the label from the name (two choices
>

My big questions center on how this might be used in a production
setting. In that case, the assumption of input in files breaks down
because the user will probably have their own intricate input setup.

If we assume that the input will be in the form of hashed feature
vectors, then the following outline seems reasonable to me:

    algorithm = new NaiveBayes(...)
    for all training examples {
        int actual = target variable value
        Vector features = vectorize example
        algorithm.train(actual, features)   // secretly save vector as appropriate
    }
    algorithm.close()                       // map-reduce actually happens here

My question to you is this: how does this outline mesh with what you are
saying? Where do you think that the IDF would happen? What role does the
vector dictionary have here?
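
To make the train/close outline above a bit more concrete, here is a minimal Java sketch. It is not Mahout's API. The class name StreamingNaiveBayes, its constructor arguments, the classify method, and the use of double[] in place of Vector are all assumptions made for illustration. The close() method only aggregates in-memory counts for a multinomial naive Bayes with add-one smoothing, standing in for the step where the map-reduce job would actually run. IDF and the vector dictionary are left out entirely, since where they belong is exactly the open question.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the train/close outline; not Mahout's actual API. */
class StreamingNaiveBayes implements AutoCloseable {
    private final int numLabels;
    private final int numFeatures;

    // Buffered examples. A real implementation might instead append these
    // to a sequence file ("secretly save vector as appropriate").
    private final List<Integer> labels = new ArrayList<>();
    private final List<double[]> vectors = new ArrayList<>();

    // Filled in by close(): log P(label) and log P(feature | label).
    private double[] logPrior;
    private double[][] logLikelihood;

    StreamingNaiveBayes(int numLabels, int numFeatures) {
        this.numLabels = numLabels;
        this.numFeatures = numFeatures;
    }

    /** Records one example; no model is built yet. */
    void train(int actual, double[] features) {
        if (logPrior != null) {
            throw new IllegalStateException("already closed");
        }
        if (actual < 0 || actual >= numLabels || features.length != numFeatures) {
            throw new IllegalArgumentException("bad label or feature vector size");
        }
        labels.add(actual);
        vectors.add(features.clone());
    }

    /** Stand-in for the batch step: aggregate counts, then smooth. */
    @Override
    public void close() {
        double[] labelCounts = new double[numLabels];
        double[][] featureSums = new double[numLabels][numFeatures];
        for (int i = 0; i < labels.size(); i++) {
            int y = labels.get(i);
            labelCounts[y]++;
            double[] v = vectors.get(i);
            for (int j = 0; j < numFeatures; j++) {
                featureSums[y][j] += v[j];
            }
        }
        logPrior = new double[numLabels];
        logLikelihood = new double[numLabels][numFeatures];
        for (int y = 0; y < numLabels; y++) {
            // Add-one smoothing on both the prior and the feature weights.
            logPrior[y] = Math.log((labelCounts[y] + 1.0) / (labels.size() + numLabels));
            double total = 0.0;
            for (int j = 0; j < numFeatures; j++) {
                total += featureSums[y][j];
            }
            for (int j = 0; j < numFeatures; j++) {
                logLikelihood[y][j] = Math.log((featureSums[y][j] + 1.0) / (total + numFeatures));
            }
        }
    }

    /** Returns the label with the highest log score; only valid after close(). */
    int classify(double[] features) {
        if (logPrior == null) {
            throw new IllegalStateException("call close() before classify()");
        }
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int y = 0; y < numLabels; y++) {
            double score = logPrior[y];
            for (int j = 0; j < numFeatures; j++) {
                score += features[j] * logLikelihood[y][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = y;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        StreamingNaiveBayes nb = new StreamingNaiveBayes(2, 3);
        nb.train(0, new double[] {3, 0, 1});
        nb.train(1, new double[] {0, 4, 1});
        nb.close();
        System.out.println(nb.classify(new double[] {2, 0, 0}));
    }
}
```

The point of the shape is that the caller never sees files or directories: it hands over (label, vector) pairs one at a time, and everything that needs a global view of the data is deferred to close().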
