[ 
https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544
 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 1/7/10 7:30 AM:
-----------------------------------------------------------------

Here are some results on a 5-slave EC2 cluster, using the full (100%) KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.2E-4 |

The results look good. Next I'll have to try the generated classifier on the 
KDD test data and see...

Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle datasets split across multiple files
* because a lot of work is done when the mappers are closing, I need to refresh 
some Hadoop counter; otherwise the job is canceled when trying to build a lot 
of trees (400)


      was (Author: adeneche):
    Here are some results on a 5-slave EC2 cluster, using the full (100%) KDD dataset:

 || Num Map Tasks || Num Trees || Build Time || oob error ||
 | 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
 | 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good. Next I'll have to try the generated classifier on the 
KDD test data and see...

Some known issues (that I'll try to fix) are:
* the mapreduce implementations cannot handle datasets split across multiple files
* because a lot of work is done when the mappers are closing, I need to refresh 
some Hadoop counter; otherwise the job is canceled when trying to build a lot 
of trees (400)

  
> Improve the results of MAHOUT-145 by uniformly distributing the classes in 
> the partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>
> The poor results of the partial decision forest implementation may be 
> explained by the particular distribution of the partitioned data. For 
> example, if a partition does not contain any instance of a given class, the 
> decision trees built using this partition won't be able to classify this 
> class. 
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of 
> classes is perhaps the most sensible solution. Here we may attempt to 
> maintain the same frequency distribution over the "class attribute" so that 
> each partition represents a good but a smaller model of the entire training 
> set
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable 
> Data Mining" 
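
To make the idea in the issue description concrete, here is a minimal sketch of class-stratified partitioning. It is not the Mahout implementation, only an illustration under assumptions: the function name `stratified_partitions`, its parameters, and the toy labels "normal" and "attack" are all hypothetical. Instances of each class are shuffled and dealt round-robin, so every partition keeps roughly the same class frequencies as the full training set.

```python
import random
from collections import Counter, defaultdict


def stratified_partitions(labels, num_partitions, seed=0):
    """Return a list of partitions (lists of instance indices) in which
    each class is spread as evenly as possible across the partitions.

    Hypothetical helper for illustration; not part of Mahout.
    """
    rng = random.Random(seed)

    # Group instance indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    partitions = [[] for _ in range(num_partitions)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random selection within the class
        # Deal this class round-robin. Carrying the offset over from the
        # previous class keeps the total partition sizes balanced too.
        for i, index in enumerate(indices):
            partitions[(offset + i) % num_partitions].append(index)
        offset += len(indices)
    return partitions


if __name__ == "__main__":
    # Toy data: 90 instances of one class, 10 of a rare class.
    labels = ["normal"] * 90 + ["attack"] * 10
    for part in stratified_partitions(labels, num_partitions=5):
        print(Counter(labels[i] for i in part))
```

With the toy data above, each of the 5 partitions receives 18 "normal" and 2 "attack" instances, so no partition is left without the rare class. A plain sequential or purely random split gives no such guarantee, which is the failure mode the description points at.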

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
