>
> > We could try to create a function that takes an arbitrary matrix of
> > feature vectors, and automatically converts the fields that appear to be
> > categorical into boolean fields. Of course, we won't be able to write a
> > function that always knows which fields are categorical and which are
> > numeric, but we could have default values that get it right most of the
> > time.
>
> How do you propose to do that?
>
This is tricky, and I'm not very familiar with this problem, but here are
some ideas.
The Google Prediction API seems to do some of this automatic detection of
whether a feature is categorical or numerical. For example, if at least
one value of a feature is a string, then it treats that feature as
categorical. I'd say that's pretty reasonable.
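To make the string rule concrete, here is a minimal sketch in plain Python.
The function names are made up for illustration, and this is only my reading
of the heuristic described above, not how the Google Prediction API or
scikit-learn actually implements anything.

```python
def has_string_value(column):
    """Return True if at least one value in the column is a string."""
    return any(isinstance(value, str) for value in column)


def guess_categorical_columns(rows):
    """Return the indices of columns that contain at least one string.

    `rows` is a list of equal-length feature vectors (hypothetical input
    format chosen for this sketch).
    """
    columns = list(zip(*rows))  # transpose rows into columns
    return [i for i, column in enumerate(columns) if has_string_value(column)]


rows = [
    [1.5, "red", 3],
    [2.0, "blue", 7],
    [0.3, "red", 1],
]
categorical = guess_categorical_columns(rows)  # only column 1 holds strings
```

Columns that are purely numeric are left alone, so integer-coded categories
would not be detected by this rule.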
We could go further and count the number of unique values for each
attribute and compare that with the total number of examples. If the
number of examples >> the number of unique values, then we could infer that
the attribute is categorical. However, this is not correct in all
situations, so it's probably going too far, and I don't really recommend it.
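For illustration, the unique-value comparison could be sketched as below.
The function names and the 0.05 threshold are arbitrary assumptions of mine,
and the example also shows the kind of case where the rule fires on a feature
that may well be numeric.

```python
def unique_ratio(column):
    """Fraction of values in the column that are distinct."""
    return len(set(column)) / len(column)


def looks_categorical_by_cardinality(column, max_ratio=0.05):
    """Guess 'categorical' when distinct values are rare relative to examples.

    `max_ratio` is a hypothetical default, not a recommended value.
    """
    if not column:
        return False  # nothing to infer from an empty column
    return unique_ratio(column) <= max_ratio


# 100 examples that take only 5 distinct integer values, e.g. a 1-5 rating.
ratings = [1, 2, 3, 4, 5] * 20
# 100 examples that are all distinct, e.g. a continuous measurement.
measurements = [i * 0.1 for i in range(100)]

ratings_flagged = looks_categorical_by_cardinality(ratings)
measurements_flagged = looks_categorical_by_cardinality(measurements)
```

The ratings column is flagged as categorical even though treating it as an
ordered numeric value may be what the user wants, which is the reason I would
not make this a default.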
Have other people dealt with this problem of automatically inferring
whether a feature is numeric or categorical? If users want this kind of
thing done automatically, a safer way to do it would be to make them use
ARFF files <http://www.cs.waikato.ac.nz/ml/weka/arff.html> or something of
that nature. Does scikit-learn support ARFF files? In this file format,
each feature is explicitly labeled as numeric, categorical, string, or date.
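As a rough sketch of why explicit labels help, the snippet below reads the
declared types from a simple ARFF header. The parser is a hypothetical helper
written for this email, handles only unquoted attribute names, and is not a
full ARFF reader. Note that ARFF calls categorical attributes "nominal" and
declares them as a brace-enclosed list of values.

```python
def parse_arff_attribute_types(text):
    """Map attribute name -> declared type for a simple ARFF header."""
    types = {}
    for line in text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        if not line.lower().startswith("@attribute"):
            continue
        # Split into the keyword, the attribute name, and the type declaration.
        _, name, declared = line.split(None, 2)
        if declared.startswith("{"):
            types[name] = "nominal"  # brace list of allowed values
        else:
            types[name] = declared.split()[0].lower()
    return types


sample = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute note string
@attribute recorded date "yyyy-MM-dd"
@data
sunny,85,'hot day',"2012-01-01"
"""
declared_types = parse_arff_attribute_types(sample)
```

With the types declared up front, no guessing from the values is needed.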
Conrad