[ https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797544#action_12797544 ]
Deneche A. Hakim commented on MAHOUT-216:
-----------------------------------------

Here are some results on a 5-slave EC2 cluster, using the full (100%) KDD dataset:

|| Num Map Tasks || Num Trees || Build Time || OOB error ||
| 10 | 10 | 0h 2m 32s 643 | 1.7E-4 |
| 10 | 100 | 0h 10m 5s 231 | 1.7E-4 |

The results look good; now I'll have to try the generated classifier on the KDD test data and see...

Some known issues (that I'll try to fix) are:
* The MapReduce implementations cannot handle datasets split across multiple files.
* Because a lot of work is done when the mappers are closing, I need to refresh some Hadoop counter; otherwise the job is canceled when trying to build a lot of trees (400).

> Improve the results of MAHOUT-145 by uniformly distributing the classes in the partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>    Affects Versions: 0.2
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 0.3
>
>
> The poor results of the partial decision forest implementation may be
> explained by the particular distribution of the partitioned data. For
> example, if a partition does not contain any instance of a given class, the
> decision trees built using this partition won't be able to classify this
> class.
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of
> classes is perhaps the most sensible solution. Here we may attempt to
> maintain the same frequency distribution over the "class attribute" so that
> each partition represents a good but a smaller model of the entire training
> set
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable
> Data Mining"
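
The partitioning idea quoted from [CHAN, 95] can be illustrated with a small sketch. This is not the Mahout implementation: it is a minimal in-memory Python example, and the function name `stratified_partitions`, its parameters, and the "normal"/"attack" labels are all hypothetical. It deals the instances of each class round-robin across the partitions, so every partition keeps roughly the class frequencies of the full training set.

```python
import random
from collections import Counter, defaultdict


def stratified_partitions(labels, num_partitions, seed=0):
    """Return a list of partitions, each a list of instance indices.

    Hypothetical helper: every class is shuffled and then dealt
    round-robin, so per-class counts differ by at most one between
    any two partitions.
    """
    rng = random.Random(seed)

    # Group instance indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    partitions = [[] for _ in range(num_partitions)]
    offset = 0  # carried across classes so partition sizes stay balanced
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            partitions[(offset + position) % num_partitions].append(index)
        offset += len(indices)
    return partitions


# Illustrative data only: 90 "normal" and 10 "attack" instances.
labels = ["normal"] * 90 + ["attack"] * 10
for part in stratified_partitions(labels, num_partitions=5):
    print(Counter(labels[i] for i in part))
```

With a purely random or contiguous split, the rare class could be missing from some partitions entirely, which is the failure mode described in the issue. Dealing each class separately guarantees it is present in every partition whenever it has at least as many instances as there are partitions.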