Thanks for the clarification.

I have to create clusters with respect to a dependent variable.  I can't
use forests because I lose the tree structure.  The rules I create from R
score 10K segments a second; about 1 billion a day.

The ideal algorithm would have the properties of a dtree: variable
selection, robustness against noise, etc.  The data is also heavily
imbalanced (extremely rare events).  Is there another algorithm I might be
able to use in scikit-learn?  I am hoping to get rid of R.  It does work,
and it does print rules, but cleaning them up was a chore as well.

Otherwise, I will dig into the dtree algorithm in scikit-learn.


See Ya

On Thu, Feb 28, 2013 at 2:06 AM, Peter Prettenhofer <
[email protected]> wrote:

> 2013/2/27 David Montgomery <[email protected]>:
> > Ok....now I am really confused on how to interpret the tree.
> >
> > So...I am trying to build a probability estimation tree.  All of the
> > independent variables are categorical and I created dummies.  What is
> > throwing me off are the <=.
> >
> > I should have a rule that says e.g. if city=LA,NY and TIME=Noon then .20.
> >
> > In the chart I see city=Dubai<=.500  What does that mean?
>
> city=Dubai <= 0.5 means that if the indicator variable city=Dubai is
> smaller than 0.5 (i.e. if city=Dubai is 0), examples get routed down
> the left child; otherwise they get routed down the right child.
>
>
> > What I am trying to see is a chart like I would usually see in SPSS
> > Answer Tree or SAS etc.
>
> since both SPSS and SAS are proprietary I've no clue what they look like
>
> >
> > So..how do I interpret the city=Dubai<=.500?
>
> The split node basically asks: is the city feature not Dubai? - if so
> go down left else right
>
> In order to generate rules from decision trees you have to look at a
> whole path (from root to leaf). Currently, there is no way to extract
> rules from decision trees - you have to write your own code that
> analyzes the tree structure.
>
> >
> > My aim is to get a node id and to create sql rules to extract data.
> >
> > Unless I am wrong, it appears that the dtree algorithm is not designed
> > to extract rules or even assign a rule to a node id.  Dtrees in
> > scikit-learn are solely for prediction.  Is this a fair statement?
>
> correct, scikit-learn is mostly a machine learning library; in fact,
> AFAIK you were the first user to request such a feature.
>
> >
> > I will be taking the *.dot file not to graph but to somehow parse the
> > file so I can create my rules.
>
> better operate on the DecisionTreeRegressor/Classifier.tree_ object.
> It represents the binary decision tree as a number of parallel arrays;
> you can find the documentation/code here:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38
>
> best,
>  Peter
>
>
> >
> > On Wed, Feb 27, 2013 at 11:57 PM, Peter Prettenhofer
> > <[email protected]> wrote:
> >>
> >> Looks good to me - save the output to a file (e.g. foobar.dot) and run
> >> the following command:
> >>
> >>     $ dot -Tpdf foobar.dot -o foobar.pdf
> >>
> >> When I open the pdf all labels are correctly displayed - remember
> >> that they are now indicator features - so the thresholds look like
> >> "country=AU <= 0.5".
> >>
> >> You can find more information here:
> >> http://scikit-learn.org/dev/modules/tree.html#classification
> >>
> >> 2013/2/27 David Montgomery <[email protected]>:
> >> > Thanks I used DictVectorizer()
> >> >
> >> > I am now trying to add labels to the tree graph.  Below are the
> >> > labels and the digraph Tree.  However, I don't see labels on the
> >> > tree nodes.  Did I not use the feature names correctly?
> >> >
> >> >
> >> >
> >> >
> >> > measurements = [
> >> > {'country':'US','city': 'Dubai'},
> >> > {'country':'US','city': 'London'},
> >> > {'country':'US','city': 'San Fransisco'},
> >> > {'country':'US','city': 'Dubai'},
> >> > {'country':'AU','city': 'Mel'},
> >> > {'country':'AU','city': 'Sydney'},
> >> > {'country':'AU','city': 'Mel'},
> >> > {'country':'AU','city': 'Sydney'},
> >> > {'country':'AU','city': 'Mel'},
> >> > {'country':'AU','city': 'Sydney'},
> >> > ]
> >> > y = [0,0,0,1,1,1,1,1,1,1]
> >> >
> >> >
> >> > vec = DictVectorizer()
> >> > X = vec.fit_transform(measurements)
> >> > feature_name = vec.get_feature_names()
> >> > clf = tree.DecisionTreeRegressor()
> >> > clf = clf.fit(X.todense(), y)
> >> > with open("au.dot", 'w') as f:
> >> >     tree.export_graphviz(clf, out_file=f, feature_names=feature_name)
> >> >
> >> >
> >> > feature_name = ['city=Dubai', 'city=London', 'city=Mel', 'city=San
> >> > Fransisco', 'city=Sydney', 'country=AU', 'country=US']
> >> >
> >> > digraph Tree {
> >> > 0 [label="country=AU <= 0.5000\nerror = 2.1\nsamples = 10\nvalue = [
> >> > 0.7]",
> >> > shape="box"] ;
> >> > 1 [label="city=Dubai <= 0.5000\nerror = 0.75\nsamples = 4\nvalue = [
> >> > 0.25]",
> >> > shape="box"] ;
> >> > 0 -> 1 ;
> >> > 2 [label="error = 0.0000\nsamples = 2\nvalue = [ 0.]", shape="box"] ;
> >> > 1 -> 2 ;
> >> > 3 [label="error = 0.5000\nsamples = 2\nvalue = [ 0.5]", shape="box"] ;
> >> > 1 -> 3 ;
> >> > 4 [label="error = 0.0000\nsamples = 6\nvalue = [ 1.]", shape="box"] ;
> >> > 0 -> 4 ;
> >> > }
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Feb 27, 2013 at 9:50 PM, Peter Prettenhofer
> >> > <[email protected]> wrote:
> >> >>
> >> >> Hi David,
> >> >>
> >> >> I recommend that you load the data using Pandas
> >> >> (``pandas.read_csv``). Scikit-learn does not support categorical
> >> >> features out-of-the-box; you need to encode them as dummy variables
> >> >> (aka one-hot encoding) - you can do this either using
> >> >> ``sklearn.feature_extraction.DictVectorizer`` or via
> >> >> ``pandas.get_dummies``.
> >> >>
> >> >> HTH,
> >> >>  Peter
> >> >>
> >> >> 2013/2/27 David Montgomery <[email protected]>:
> >> >> > Hi,
> >> >> >
> >> >> > I have a data structure that looks like this:
> >> >> >
> >> >> > 1 NewYork 1 6 high
> >> >> > 0 LA 3 4 low
> >> >> > .......
> >> >> >
> >> >> > I am trying to predict a probability, where Y is column one.
> >> >> > All of the attributes of X are categorical, and I will use a
> >> >> > dtree regression.  How do I load this data into y and X?
> >> >> >
> >> >> > Thanks
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> ------------------------------------------------------------------------------
> >> >> > Everyone hates slow websites. So do we.
> >> >> > Make your web apps faster with AppDynamics
> >> >> > Download AppDynamics Lite for free today:
> >> >> > http://p.sf.net/sfu/appdyn_d2d_feb
> >> >> > _______________________________________________
> >> >> > Scikit-learn-general mailing list
> >> >> > [email protected]
> >> >> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Peter Prettenhofer
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Peter Prettenhofer
> >>
> >>
> >>
> >
> >
> >
> >
> >
>
>
>
> --
> Peter Prettenhofer
>
>
>