On Mon, Mar 1, 2010 at 1:33 PM, Robin Anil <robin.a...@gmail.com> wrote:
> I am kicking this discussion on how we are going to integrate RF, NB,
> CNB, WINNOW, SGD, SVM. Phew!

Great. Architecture is good, especially after we have several examples.

> From what I think right now:
> - Apart from SGD, everything else is a batch trainer. SGD is the only
>   pure online trainer cum classifier.

And even SGD is currently written so as to be used in a batch setting
(actually, Pegasos is online as well, but written in batch style like
SGD).

> Questions:
> - Interfaces (what are they going to look like?)
>   - Trainer
>   - Classifier (binary, multi-label classification)
>   - Test

We need command-line and, some day, some kind of workflow interfaces.
The batch orientation that we have right now is probably just fine for
99% of all applications.

> - Ensemble: bagging, boosting?

See random forests. But really, let's see if people come up with a need.
Bagging and boosting can be good ways to address over-fitting problems,
but let's see if Pegasos and SGD solve those problems for us.

> - What is the basic storage interface everyone should use? Matrix?
>   Then we can have an HDFS-backed matrix, an HBase-backed matrix, an
>   in-memory matrix.

Matrix, yes. But I also think that allowing Drew's Avro document format
with a randomizer (or field list for NB) specification would be good. A
Lucene index plus randomizer would also be useful.

> - If basic storage could be different (I mean, a decision tree is not
>   a matrix), what is the fixed input/output format?

Input formats are much more easily standardized. The only common
characteristic I can think of for all output formats is that there
should be a way to use the persistent output of classifier training to
generate a model that can classify more inputs. Maybe there should also
be some vague requirement that it be possible to produce a more or less
human-readable representation. Other than those very generic and vague
requirements, I can't see what we can say about the output of a
classifier.
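To make the interface question concrete, here is a minimal sketch of a pure online trainer-cum-classifier where batch training is just repeated online passes. The `OnlineLearner` name and the perceptron update rule are illustrative assumptions for this sketch, not an existing Mahout API:

```java
// Hypothetical interface: train on one example at a time; a batch
// trainer is then just a loop over the data.
interface OnlineLearner {
    void train(double label, double[] features);
    double classify(double[] features);
}

// Toy implementation: a perceptron, standing in for SGD-style learners.
class Perceptron implements OnlineLearner {
    private final double[] w;
    private double b;

    Perceptron(int dim) { w = new double[dim]; }

    public void train(double label, double[] x) {
        if (label * score(x) <= 0) {          // misclassified: update
            for (int i = 0; i < w.length; i++) w[i] += label * x[i];
            b += label;
        }
    }

    public double classify(double[] x) { return score(x) >= 0 ? 1 : -1; }

    private double score(double[] x) {
        double s = b;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return s;
    }
}

public class Demo {
    public static void main(String[] args) {
        double[][] xs = {{2, 1}, {3, 2}, {-2, -1}, {-3, -2}};
        double[] ys = {1, 1, -1, -1};
        OnlineLearner p = new Perceptron(2);
        // "Batch" training as repeated online passes over the data.
        for (int epoch = 0; epoch < 10; epoch++)
            for (int i = 0; i < xs.length; i++) p.train(ys[i], xs[i]);
        for (int i = 0; i < xs.length; i++)
            System.out.println(p.classify(xs[i]));  // prints 1.0 1.0 -1.0 -1.0
    }
}
```

The point of the shape is that the online interface subsumes the batch one, so a single trainer abstraction could cover both SGD/Pegasos and the batch learners.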
If anybody starts pushing for PMML output, that might be a nice way to
meet the "more or less human readable" aspect.

> - How can we extend the test setup, like the confusion matrix, to
>   capture info from all classifiers?

If you can read a model from disk and apply new inputs, then it should
be possible to generalize the evaluation process.

> - If we make some assumptions now, what will we do when classifiers
>   like HMM and CRF come into the picture? They need more than just
>   vectors but also the order of features.

Drew's document format becomes very important there.
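A minimal sketch of how the evaluation step could be generalized once any trained model, however it was persisted, can map new inputs to labels. The `ConfusionMatrix` and `evaluate` names here are illustrative toys, not Mahout's actual classes:

```java
import java.util.function.Function;

// Toy confusion matrix keyed by label index.
class ConfusionMatrix {
    private final int[][] counts;

    ConfusionMatrix(int numLabels) { counts = new int[numLabels][numLabels]; }

    void add(int actual, int predicted) { counts[actual][predicted]++; }

    double accuracy() {
        int correct = 0, total = 0;
        for (int i = 0; i < counts.length; i++)
            for (int j = 0; j < counts.length; j++) {
                total += counts[i][j];
                if (i == j) correct += counts[i][j];
            }
        return total == 0 ? 0 : (double) correct / total;
    }
}

public class Evaluate {
    // Works for any model that maps features to a label index,
    // regardless of how it was trained or stored.
    static ConfusionMatrix evaluate(Function<double[], Integer> model,
                                    double[][] xs, int[] ys, int numLabels) {
        ConfusionMatrix cm = new ConfusionMatrix(numLabels);
        for (int i = 0; i < xs.length; i++)
            cm.add(ys[i], model.apply(xs[i]));
        return cm;
    }

    public static void main(String[] args) {
        // Stand-in "model" loaded from anywhere: thresholds feature 0.
        Function<double[], Integer> model = x -> x[0] > 0 ? 1 : 0;
        double[][] xs = {{1.0}, {-1.0}, {2.0}, {-0.5}};
        int[] ys = {1, 0, 1, 1};  // last example is misclassified
        System.out.println(evaluate(model, xs, ys, 2).accuracy());
    }
}
```

Because the evaluator only needs the classify function, the same confusion-matrix machinery would apply to NB, SGD, random forests, or anything else that can be rehydrated from disk.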