One possible way to think of it is to use "variable reduction" before going for 
J48. You may want to try one of the several methods available for that. Again, 
prediction for brands is more of a business question to me. 

Two solutions which I can think of:
1. Variable reduction before the decision tree.
2. Let intuition decide how many of the variables are "really" important.
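For example, a minimal base-R sketch of suggestion 1: drop one variable from each highly correlated pair before fitting the tree. The data frame, the 0.9 cutoff, and the greedy drop rule below are illustrative stand-ins of mine, not anything from Vik's data:

```r
## Drop one member of each pair of predictors whose absolute
## correlation exceeds `cutoff`; returns the names to keep.
drop_correlated <- function(x, cutoff = 0.9) {
  cm <- abs(cor(x))
  diag(cm) <- 0
  keep <- colnames(x)
  while (length(keep) > 1 && max(cm[keep, keep]) > cutoff) {
    # locate the worst pair and drop its member with the larger
    # average correlation to everything else
    idx  <- which(cm[keep, keep] == max(cm[keep, keep]), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    keep <- setdiff(keep, pair[which.max(rowMeans(cm[pair, keep, drop = FALSE]))])
  }
  keep
}

## Stand-in data: B is a near-duplicate of A, C is independent.
set.seed(42)
A <- rnorm(50)
demo <- data.frame(A = A, B = A + rnorm(50, sd = 0.01), C = rnorm(50))
drop_correlated(demo)   # one of A/B is dropped, C survives
```

The surviving columns can then go into the J48 formula in place of the full predictor list.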

Please let us know your findings. All the best.
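On the oversized plot specifically: J48's own control options can shrink the tree before you plot it. A hypothetical sketch, assuming RWeka (and a working Java) are installed; the data frame here is simulated, and only the column names echo Vik's:

```r
library(RWeka)

## Simulated stand-in for the respondent-level data.
set.seed(1)
respLevel <- data.frame(
  BRAND_NAME = factor(sample(paste("Brand", 1:5), 400, replace = TRUE)),
  PRI  = runif(400),
  PROM = runif(400),
  SPED = runif(400)
)

## -M (minimum instances per leaf) is the main lever for tree size:
## raising it from the default of 2 forces earlier stopping. Lowering
## the pruning confidence -C (default 0.25) prunes more aggressively.
smallTree <- J48(BRAND_NAME ~ ., data = respLevel,
                 control = Weka_control(M = 50, C = 0.1))
smallTree
```

Expect some loss of accuracy as the tree shrinks, so the M and C values need tuning against the classification rate.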

Best Regards,

Bhupendrasinh Thakre
Sent from my iPhone

On Sep 21, 2012, at 12:16 AM, Vik Rubenfeld <v...@mindspring.com> wrote:

> Bhupendrasinh, thanks very much!  I ran J48 on a respondent-level data set 
> and got a 61.75% correct classification rate!
> 
> Correctly Classified Instances         988               61.75   %
> Incorrectly Classified Instances       612               38.25   %
> Kappa statistic                          0.5651
> Mean absolute error                      0.0432
> Root mean squared error                  0.1469
> Relative absolute error                 52.7086 %
> Root relative squared error             72.6299 %
> Coverage of cases (0.95 level)          99.6875 %
> Mean rel. region size (0.95 level)      15.4915 %
> Total Number of Instances             1600     
> 
> When I plot it I get an enormous chart.  Running:
> 
> >respLevelTree = J48(BRAND_NAME ~ PRI + PROM + FORM + FAMI + DRRE + FREC + 
> >MODE + SPED + REVW, data = respLevel)
> >respLevelTree
> 
> ...reports:
> 
> J48 pruned tree
> ------------------
> 
> Is there a way to further prune the tree so that I can present a chart that 
> would fit on a single page or two?
> 
> Thanks very much in advance for any thoughts.
> 
> 
> -Vik
> 
> 
> 
> 
> On Sep 20, 2012, at 8:37 PM, Bhupendrasinh Thakre wrote:
> 
>> Not very sure what the problem is, as I was not able to run your data 
>> myself. You might want to use the dput() command to share the data. 
>> 
>> Now on the programming side: as we can see, we have more than 2 levels 
>> for the brands, and hence method = "class" is not able to understand 
>> what you actually want from it.
>> 
>> Suggestion: for predictions with more than 2 levels I would go for Weka, 
>> specifically the C4.5 algorithm. The RWeka package provides it in R.
>> 
>> Best Regards,
>> 
>> Bhupendrasinh Thakre
>> Sent from my iPhone
>> 
>> On Sep 20, 2012, at 9:47 PM, Vik Rubenfeld <v...@mindspring.com> wrote:
>> 
>>> I'm working with some data from which a client would like to make a 
>>> decision tree predicting brand preference based on inputs such as price, 
>>> speed, etc.  After running the decision tree analysis using rpart, it 
>>> appears that this data is not capable of predicting brand preference.  
>>> 
>>> Here's the data set:
>>> 
>>> BRND       PRI    PROM   FORM   FAMI   DRRE   FREC   MODE   SPED   REVW
>>> Brand 1   0.6989 0.4731 0.7849 0.6989 0.7419 0.6022 0.8817 0.9032 0.6452
>>> Brand 2   0.8621 0.3793 0.8621 0.9310 0.7586 0.6897 0.8966 0.9655 0.8276
>>> Brand 3   0.6000 0.1000 0.6000 0.7000 0.9000 0.7000 0.7000 0.8000 0.6000
>>> Brand 4   0.6429 0.2500 0.5714 0.5000 0.6071 0.5000 0.7500 0.8214 0.5000
>>> Brand 5   0.7586 0.4224 0.7328 0.6638 0.7328 0.6379 0.8621 0.8621 0.6897
>>> Brand 6   0.7500 0.0833 0.5833 0.4167 0.5000 0.4167 0.7500 0.6667 0.5000
>>> Brand 7   0.7742 0.4839 0.6129 0.5161 0.8065 0.6452 0.7742 0.9032 0.6129
>>> Brand 8   0.6429 0.2679 0.6964 0.7143 0.8750 0.5536 0.8036 0.9464 0.6607
>>> Brand 9   0.5750 0.1750 0.6500 0.5500 0.6250 0.3750 0.8250 0.8500 0.4750
>>> Brand 10  0.8095 0.5238 0.6667 0.6429 0.6667 0.5952 0.8571 0.8095 0.5714
>>> Brand 11  0.6308 0.3000 0.6077 0.5846 0.6769 0.5231 0.7462 0.8846 0.6000
>>> Brand 12  0.7212 0.3152 0.7152 0.6545 0.6606 0.5030 0.8061 0.8909 0.6000
>>> Brand 13  0.7419 0.2258 0.6129 0.5806 0.7097 0.6129 0.8710 0.9677 0.3226
>>> Brand 14  0.7176 0.2706 0.6353 0.5647 0.6941 0.4471 0.7176 0.9412 0.5176
>>> Brand 15  0.7287 0.3437 0.5995 0.5788 0.8527 0.5478 0.8217 0.8941 0.6227
>>> Brand 16  0.7000 0.4000 0.6000 0.4000 1.0000 0.4000 0.9000 0.9000 0.5000
>>> Brand 17  0.7193 0.3333 0.6667 0.6667 0.7018 0.5263 0.7719 0.8596 0.7018
>>> Brand 18  0.7778 0.4127 0.6508 0.6349 0.7937 0.6032 0.8571 0.9206 0.6190
>>> Brand 19  0.8028 0.2817 0.6197 0.4366 0.7042 0.4366 0.7183 0.9155 0.5634
>>> Brand 20  0.7736 0.2453 0.6226 0.3774 0.5849 0.3019 0.7170 0.8679 0.4717
>>> Brand 21  0.8481 0.2152 0.6329 0.4051 0.6329 0.4557 0.6962 0.8481 0.3418
>>> Brand 22  0.7500 0.3333 0.6667 0.5000 0.6667 0.5833 0.9167 0.9167 0.4167
>>> 
>>> Here are my R commands:
>>> 
>>>> test.df = read.csv("test.csv")
>>>> head(test.df)
>>>    BRND    PRI   PROM   FORM   FAMI   DRRE   FREC   MODE   SPED   REVW
>>> 1 Brand 1 0.6989 0.4731 0.7849 0.6989 0.7419 0.6022 0.8817 0.9032 0.6452
>>> 2 Brand 2 0.8621 0.3793 0.8621 0.9310 0.7586 0.6897 0.8966 0.9655 0.8276
>>> 3 Brand 3 0.6000 0.1000 0.6000 0.7000 0.9000 0.7000 0.7000 0.8000 0.6000
>>> 4 Brand 4 0.6429 0.2500 0.5714 0.5000 0.6071 0.5000 0.7500 0.8214 0.5000
>>> 5 Brand 5 0.7586 0.4224 0.7328 0.6638 0.7328 0.6379 0.8621 0.8621 0.6897
>>> 6 Brand 6 0.7500 0.0833 0.5833 0.4167 0.5000 0.4167 0.7500 0.6667 0.5000
>>> 
>>>> testTree = rpart(BRND ~ PRI + PROM + FORM + FAMI + DRRE + FREC + MODE 
>>>> + SPED + REVW, method = "class", data = test.df)
>>> 
>>>> printcp(testTree)
>>> 
>>> Classification tree:
>>> rpart(formula = BRND ~ PRI + PROM + FORM + FAMI + DRRE + FREC + 
>>>   MODE + SPED + REVW, data = test.df, method = "class")
>>> 
>>> Variables actually used in tree construction:
>>> [1] FORM
>>> 
>>> Root node error: 21/22 = 0.95455
>>> 
>>> n= 22 
>>> 
>>>       CP nsplit rel error xerror xstd
>>> 1 0.047619      0   1.00000 1.0476    0
>>> 2 0.010000      1   0.95238 1.0476    0
>>> 
>>> I note that only one variable (FORM) was actually used in tree 
>>> construction. When I run a plot using:
>>> 
>>>> plot(testTree)
>>>> text(testTree)
>>> 
>>> ...I get a tree with one branch.  
>>> 
>>> It looks to me like I'm doing everything right, and this data is just not 
>>> capable of predicting brand preference. 
>>> 
>>> Am I missing anything?
>>> 
>>> Thanks very much in advance for any thoughts!
>>> 
>>> -Vik
>>> 
>>> 
>>> 
>>> 
>>> 
>>>   [[alternative HTML version deleted]]
>>> 
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
> 
