On Mon, Mar 1, 2010 at 1:33 PM, Robin Anil <robin.a...@gmail.com> wrote:

> I am kicking off this discussion on how we are going to integrate RF, NB,
> CNB, WINNOW, SGD, SVM. Phew!
>

Great.  Discussing architecture is good, especially after we have several
examples to generalize from.


>
> From what I think right now:
> - Apart from SGD, everything else is a batch trainer. SGD is the only pure
> online trainer cum classifier.
>

And even SGD is currently written so as to be used in a batch setting.
(Actually, Pegasos is on-line as well, but it is written in batch style like
SGD.)
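To make the batch-vs-online distinction concrete, here is a minimal sketch
(all names here are hypothetical illustrations, not Mahout's actual API): an
online learner consumes one example at a time, so a batch driver is just a
loop over the training set.

```java
// Hypothetical sketch of an online learner contract.  An SGD- or
// Pegasos-style trainer fits this shape; batch use is just repeated calls.
interface OnlineLearner {
    void train(int target, double[] features); // absorb one example
    double classify(double[] features);        // probability of class 1
}

// Trivial logistic-regression-style implementation, only to illustrate
// the contract; not a tuned learning algorithm.
class SimpleOnlineLearner implements OnlineLearner {
    private final double[] w;
    private final double rate;

    SimpleOnlineLearner(int numFeatures, double rate) {
        this.w = new double[numFeatures];
        this.rate = rate;
    }

    public void train(int target, double[] features) {
        double error = target - classify(features);
        for (int i = 0; i < w.length; i++) {
            w[i] += rate * error * features[i]; // SGD-style update
        }
    }

    public double classify(double[] features) {
        double sum = 0;
        for (int i = 0; i < w.length; i++) {
            sum += w[i] * features[i];
        }
        return 1.0 / (1.0 + Math.exp(-sum));    // logistic link
    }
}
```

A batch trainer would simply iterate this over a stored training set, which
is why the current batch orientation and a pure online trainer can share one
interface.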


> Questions
> -  Interfaces (what are they going to look like?)
>    -  Trainer
>    -  Classifier (binary) (multi label classification)
>    -  Test
>

We need command-line interfaces and, some day, some kind of workflow
interface.  The batch orientation that we have right now is probably just
fine for 99% of all applications.


> -  Ensembles - bagging, boosting?
>

See random forests.

But really, let's see if people come up with a need.  Bagging and boosting
can be good ways to mitigate over-fitting problems, but let's see if Pegasos
and SGD solve those problems for us.



> -  What is the basic storage interface everyone should use?  Matrix?  Then
> we can have an HDFS-backed matrix, an HBase-backed matrix, an in-memory
> matrix.
>

Matrix, yes.

I also think that allowing Drew's Avro document format with a randomizer
specification (or, for NB, a field list) would be good.

A Lucene index + randomizer would also be useful.
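The point of standardizing on Matrix is that learning code never needs to
know where the data lives.  A minimal sketch (interface and class names here
are hypothetical, not Mahout's actual Matrix API):

```java
// Hypothetical minimal matrix abstraction: HDFS-backed, HBase-backed, and
// in-memory implementations can all sit behind it without the classifier
// code changing.
interface MatrixView {
    int numRows();
    int numCols();
    double get(int row, int col);
}

// The simplest backend: a dense in-memory matrix.
class InMemoryMatrix implements MatrixView {
    private final double[][] values;

    InMemoryMatrix(double[][] values) {
        this.values = values;
    }

    public int numRows() { return values.length; }
    public int numCols() { return values.length == 0 ? 0 : values[0].length; }
    public double get(int row, int col) { return values[row][col]; }
}
```

An HDFS- or HBase-backed implementation would satisfy the same three
methods, fetching rows lazily instead of holding them in memory.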

> -  If basic storage could be different (I mean, a decision tree is not a
> matrix), what is the fixed input/output format?
>

Input formats are much more easily standardized.  The only common
characteristic I can think of for all output formats is that there should be
a way to use the persistent output of classifier training to generate a
model that can classify more inputs.  Maybe there should also be some vague
requirement that it be possible to produce a more or less human-readable
representation.

Other than those very generic and vague requirements, I can't see what else
we can say about the output of a classifier.  If anybody starts pushing for
PMML output, that might be a nice way to meet the "more or less
human-readable" aspect.


> -  How can we extend the test setup (e.g., the confusion matrix) to
> capture info from all classifiers?
>

If you can read a model from disk and apply new inputs, then it should be
possible to generalize the evaluation process.
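That generalization is straightforward once every model can map a feature
vector to a label: the confusion matrix only needs (actual, predicted)
pairs.  A minimal sketch (hypothetical names, not Mahout's actual evaluation
classes):

```java
// Model-agnostic confusion matrix: any classifier that emits a label per
// example can feed it, regardless of how the model was trained or stored.
class ConfusionMatrix {
    private final int[][] counts;

    ConfusionMatrix(int numLabels) {
        counts = new int[numLabels][numLabels];
    }

    void add(int actual, int predicted) {
        counts[actual][predicted]++;
    }

    int count(int actual, int predicted) {
        return counts[actual][predicted];
    }

    double accuracy() {
        int correct = 0, total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts.length; j++) {
                total += counts[i][j];
                if (i == j) correct += counts[i][j];
            }
        }
        return total == 0 ? 0 : (double) correct / total;
    }
}
```

The evaluation driver would read a persisted model, run it over held-out
data, and call `add` once per example; nothing in the matrix itself depends
on which classifier produced the predictions.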


> -  If we make some assumptions now, what will we do when classifiers like
> HMM and CRF come into the picture?  They need more than just vectors; they
> also need the order of the features.
>

Drew's document format becomes very important there.
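To illustrate why order matters, here is a toy sequence representation
(purely hypothetical, not Drew's actual Avro format): per-position
observations are kept in order, which is exactly what a bag-of-features
vector throws away.

```java
// Toy sequence example for HMM/CRF-style learners.  Features can depend on
// position and on the previous label, neither of which an unordered vector
// can express.
class SequenceExample {
    final java.util.List<String> observations; // tokens, in order
    final java.util.List<String> labels;       // one label per position

    SequenceExample(java.util.List<String> observations,
                    java.util.List<String> labels) {
        if (observations.size() != labels.size()) {
            throw new IllegalArgumentException("one label per observation");
        }
        this.observations = observations;
        this.labels = labels;
    }

    // Position-dependent emission feature.
    String emissionFeature(int position) {
        return "word=" + observations.get(position);
    }

    // Transition feature over adjacent labels - meaningless without order.
    String transitionFeature(int position) {
        return position == 0
            ? "start->" + labels.get(0)
            : labels.get(position - 1) + "->" + labels.get(position);
    }
}
```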
