2013/2/27 David Montgomery <[email protected]>:
> OK... now I am really confused about how to interpret the tree.
>
> So... I am trying to build a probability-estimation tree. All of the
> independent variables are categorical, and I created dummies. What is
> throwing me off are the <=.
>
> I should have a rule that says e.g. if city=LA,NY and TIME=Noon then .20.
>
> In the chart I see city=Dubai<=.500. What does that mean?
city=Dubai <= 0.5 means that if the indicator variable city=Dubai is
smaller than 0.5 (i.e. if city=Dubai is 0), then examples get routed down
the left child; otherwise they get routed down the right child.

> What I am trying to see is a chart that I would usually see in SPSS
> Answer Tree or SAS etc.

Since both SPSS and SAS are proprietary, I have no clue what those look like.

> So... how do I interpret the city=Dubai<=.500?

The split node basically asks: is the city feature not Dubai? If so, go
down the left branch, else the right. In order to generate rules from
decision trees you have to look at a whole path (from root to leaf).
Currently, there is no way of extracting rules from decision trees - you
have to write your own code that analyzes the tree structure.

> My aim is to get a node id and to create SQL rules to extract data.
>
> Unless I am wrong, it appears that the dtree algo is not designed to
> extract rules or even assign a rule to a node id. Dtrees in scikits are
> solely for prediction. Is this a fair statement?

Correct, scikit-learn is mostly a machine learning library; in fact,
AFAIK you were the first user to request such a feature.

> I will be taking the *.dot file not to graph but to somehow parse the
> file so I can create my rules.

You would be better off operating on the
DecisionTreeRegressor/Classifier.tree_ object. It represents the binary
decision tree as a number of parallel arrays; you can find the
documentation/code here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L38

best,
 Peter

> Thanks
>
> On Wed, Feb 27, 2013 at 11:57 PM, Peter Prettenhofer
> <[email protected]> wrote:
>>
>> Looks good to me - save the output to a file (e.g. foobar.dot) and run
>> the following command:
>>
>>     $ dot -Tpdf foobar.dot -o foobar.pdf
>>
>> When I open the PDF all labels are correctly displayed - remember that
>> they are now indicator features - so the thresholds are usually
>> "country=AU <= 0.5".
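[Editor's sketch] The "write your own code that analyzes the tree structure" step Peter describes could look like the following. `extract_rules` is a hypothetical helper name (not a scikit-learn API); it assumes the `children_left`, `children_right`, `feature`, `threshold`, and `value` parallel arrays of the fitted `tree_` object, with `children_left[node] == -1` marking a leaf:

```python
# Sketch: turn a fitted decision tree into one human-readable rule per
# leaf by walking the parallel arrays of clf.tree_.
from sklearn.tree import DecisionTreeRegressor


def extract_rules(clf, feature_names):
    """Return a list of (conditions, leaf_value) pairs, one per leaf."""
    t = clf.tree_
    rules = []

    def recurse(node, conditions):
        if t.children_left[node] == -1:
            # Leaf: record the full root-to-leaf path and the leaf's value.
            rules.append((conditions, float(t.value[node].ravel()[0])))
            return
        name = feature_names[t.feature[node]]
        thr = t.threshold[node]
        # Left child holds samples with feature <= threshold, right child the rest.
        recurse(t.children_left[node], conditions + ["%s <= %.3f" % (name, thr)])
        recurse(t.children_right[node], conditions + ["%s > %.3f" % (name, thr)])

    recurse(0, [])
    return rules


# Toy usage with a single indicator feature, mirroring the thread's example:
clf = DecisionTreeRegressor().fit([[0], [0], [1], [1]], [0, 0, 1, 1])
for conditions, value in extract_rules(clf, ["country=AU"]):
    print(" AND ".join(conditions), "->", value)
```

Each printed line is one rule; rewriting "country=AU <= 0.500" into SQL such as "country <> 'AU'" is then a string-rewriting exercise over the conditions.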
>>
>> You can find more information here:
>> http://scikit-learn.org/dev/modules/tree.html#classification
>>
>> 2013/2/27 David Montgomery <[email protected]>:
>> > Thanks, I used DictVectorizer().
>> >
>> > I am now trying to add labels to the tree graph. Below are the labels
>> > and the digraph Tree. However, I don't see labels on the tree nodes.
>> > Did I not use feature names correctly?
>> >
>> >     measurements = [
>> >         {'country': 'US', 'city': 'Dubai'},
>> >         {'country': 'US', 'city': 'London'},
>> >         {'country': 'US', 'city': 'San Fransisco'},
>> >         {'country': 'US', 'city': 'Dubai'},
>> >         {'country': 'AU', 'city': 'Mel'},
>> >         {'country': 'AU', 'city': 'Sydney'},
>> >         {'country': 'AU', 'city': 'Mel'},
>> >         {'country': 'AU', 'city': 'Sydney'},
>> >         {'country': 'AU', 'city': 'Mel'},
>> >         {'country': 'AU', 'city': 'Sydney'},
>> >     ]
>> >     y = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
>> >
>> >     vec = DictVectorizer()
>> >     X = vec.fit_transform(measurements)
>> >     feature_name = vec.get_feature_names()
>> >     clf = tree.DecisionTreeRegressor()
>> >     clf = clf.fit(X.todense(), y)
>> >     with open("au.dot", 'w') as f:
>> >         f = tree.export_graphviz(clf, out_file=f, feature_names=feature_name)
>> >
>> >     feature_name = ['city=Dubai', 'city=London', 'city=Mel',
>> >                     'city=San Fransisco', 'city=Sydney',
>> >                     'country=AU', 'country=US']
>> >
>> >     digraph Tree {
>> >     0 [label="country=AU <= 0.5000\nerror = 2.1\nsamples = 10\nvalue = [ 0.7]", shape="box"] ;
>> >     1 [label="city=Dubai <= 0.5000\nerror = 0.75\nsamples = 4\nvalue = [ 0.25]", shape="box"] ;
>> >     0 -> 1 ;
>> >     2 [label="error = 0.0000\nsamples = 2\nvalue = [ 0.]", shape="box"] ;
>> >     1 -> 2 ;
>> >     3 [label="error = 0.5000\nsamples = 2\nvalue = [ 0.5]", shape="box"] ;
>> >     1 -> 3 ;
>> >     4 [label="error = 0.0000\nsamples = 6\nvalue = [ 1.]", shape="box"] ;
>> >     0 -> 4 ;
>> >     }
>> >
>> > On Wed, Feb 27, 2013 at 9:50 PM, Peter Prettenhofer
>> > <[email protected]> wrote:
>> >>
>> >> Hi David,
>> >>
>> >> I recommend that you load the data using Pandas (``pandas.read_csv``).
>> >> Scikit-learn does not support categorical features out of the box; you
>> >> need to encode them as dummy variables (aka one-hot encoding) - you
>> >> can do this either using ``sklearn.preprocessing.DictVectorizer`` or
>> >> via ``pandas.get_dummies``.
>> >>
>> >> HTH,
>> >> Peter
>> >>
>> >> 2013/2/27 David Montgomery <[email protected]>:
>> >> > Hi,
>> >> >
>> >> > I have a data structure that looks like this:
>> >> >
>> >> >     1 NewYork 1 6 high
>> >> >     0 LA      3 4 low
>> >> >     .......
>> >> >
>> >> > I am trying to predict probability where Y is column one. All of the
>> >> > attributes of the X are categorical, and I will use a dtree
>> >> > regression. How do I load this data into the y and X?
>> >> >
>> >> > Thanks
>> >> >
>> >> > ------------------------------------------------------------------------------
>> >> > Everyone hates slow websites. So do we.
>> >> > Make your web apps faster with AppDynamics
>> >> > Download AppDynamics Lite for free today:
>> >> > http://p.sf.net/sfu/appdyn_d2d_feb
>> >> > _______________________________________________
>> >> > Scikit-learn-general mailing list
>> >> > [email protected]
>> >> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> >>
>> >> --
>> >> Peter Prettenhofer
>> >
>>
>> --
>> Peter Prettenhofer
>

--
Peter Prettenhofer
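[Editor's sketch] The loading pipeline Peter recommends earlier in the thread (``pandas.read_csv`` plus one-hot encoding via ``pandas.get_dummies``) could look like the following. The column names (``y``, ``city``, ``a``, ``b``, ``level``) are assumptions invented to match David's sample rows:

```python
# Sketch of the suggested loading pipeline: read the CSV with pandas,
# then one-hot encode the categorical columns with get_dummies.
# Column names here are hypothetical, matching David's sample rows.
from io import StringIO

import pandas as pd

csv = StringIO(
    "y,city,a,b,level\n"
    "1,NewYork,1,6,high\n"
    "0,LA,3,4,low\n"
)
df = pd.read_csv(csv)

y = df["y"]
# get_dummies expands each categorical column into 0/1 indicator columns
# named like "city_NewYork" - the same idea as DictVectorizer's "city=NewYork".
X = pd.get_dummies(df[["city", "a", "b", "level"]])
print(sorted(X.columns))
```

``y`` and ``X`` can then be passed straight to ``DecisionTreeRegressor.fit``, and the indicator column names double as the ``feature_names`` for ``export_graphviz``.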
