[ 
https://issues.apache.org/jira/browse/MAHOUT-216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788902#action_12788902
 ] 

Deneche A. Hakim edited comment on MAHOUT-216 at 12/10/09 8:32 PM:
-------------------------------------------------------------------

the next step is to implement a tool that redistributes the instances over the 
partitions, in order to get a uniform distribution of classes.
The tool I implemented uses the following simple algorithm:
{code}
input: 
* data
* number of partitions p

create p output files, one for each partition
for each class c, currents[c] = the partition where the next tuple of class c 
will be put. currents[] is randomly initialized with integers in the range 
[0, p)

for each tuple do the following:
* put the tuple in the file corresponding to the partition currents[tuple.class]
* currents[tuple.class]++; if (currents[tuple.class] == p) currents[tuple.class] = 0;
end for
{code}
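As a sketch, the per-class round-robin assignment above could look like this in Java (the class and method names are mine for illustration, not part of the actual patch; it returns partition indices instead of writing to the p output files):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class UniformPartitioner {
    // Per-class round-robin assignment: the tuples of each class are dealt
    // to partitions 0..p-1 in turn, starting at a random partition.
    // Returns, for each input tuple, the index of its output partition.
    static int[] assignPartitions(String[] tupleClasses, int p, long seed) {
        Random rng = new Random(seed);
        Map<String, Integer> currents = new HashMap<>();
        int[] partition = new int[tupleClasses.length];
        for (int i = 0; i < tupleClasses.length; i++) {
            String c = tupleClasses[i];
            // currents[c] is randomly initialized in the range [0, p)
            int cur = currents.computeIfAbsent(c, k -> rng.nextInt(p));
            partition[i] = cur;
            currents.put(c, (cur + 1) % p); // advance to the next partition
        }
        return partition;
    }
}
```

Because each class simply cycles through the partitions, per-class counts across partitions can differ by at most one, which is what produces the near-uniform table below.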

computing the frequency distribution of the class attribute on the "uniform" 
data gives the following results:
||||label1||label2||label3||...||
|partition 0|98218|3|1|0|107202|280789|5|27|98|1041|1248|2|0|220|1|1590|0|231|1|2|102|0|1|
|partition 1|98220|3|1|0|107202|280789|5|27|98|1041|1248|2|0|220|1|1589|0|231|1|2|102|1|1|
|partition 2|98214|3|1|0|107202|280789|5|26|98|1041|1248|2|1|220|1|1589|1|231|1|2|102|1|1|
|partition 3|98221|3|1|0|107201|280789|5|26|98|1041|1248|2|1|220|1|1589|1|231|1|2|102|0|1|
|partition 4|98221|3|1|0|107201|280788|5|26|98|1042|1249|2|1|221|1|1589|1|232|1|2|102|0|1|
|partition 5|98218|4|1|1|107201|280788|5|26|97|1042|1248|3|1|221|1|1589|1|232|1|2|102|0|1|
|partition 6|98220|3|1|0|107202|280788|6|26|98|1042|1248|2|1|221|1|1589|0|232|1|2|102|0|1|
|partition 7|98219|2|1|1|107202|280788|6|26|98|1041|1248|2|1|220|1|1589|0|232|0|2|102|0|1|
|partition 8|97541|3|0|1|107204|281453|6|27|98|1041|1248|2|1|220|2|1589|0|232|0|2|102|0|1|
|partition 9|89489|3|1|0|107200|280125|5|27|98|1041|1248|2|1|220|2|1590|0|232|0|2|102|0|1|
 
This time the classes are well distributed. 

Here is a quick comparison between the original dataset and the "uniform" 
dataset, building 10 trees over 10 partitions with the partial implementation: 
on the original data the out-of-bag error is 0.2, while on the "uniform" 
dataset it is 1.7E-4.

Although more tests are needed, these results are very encouraging.

PS:
This program generates p files that should be used as training data instead of 
the original data. Although my mapreduce partial implementation should handle 
datasets split over many files, it did not handle them! For now I have to 
join the files into one single file before running the partial implementation =P

  
> Improve the results of MAHOUT-145 by uniformly distributing the classes in 
> the partitioned data
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-216
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-216
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Classification
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>
> the poor results of the partial decision forest implementation may be 
> explained by the particular distribution of the partitioned data. For 
> example, if a partition does not contain any instance of a given class, the 
> decision trees built using this partition won't be able to classify this 
> class. 
> According to [CHAN, 95]:
> {quote}
> Random Selection of the partitioned data sets with a uniform distribution of 
> classes is perhaps the most sensible solution. Here we may attempt to 
> maintain the same frequency distribution over the ''class attribute" so that 
> each partition represents a good but a smaller model of the entire training 
> set
> {quote}
> [CHAN, 95]: Philip K. Chan, "On the Accuracy of Meta-learning for Scalable 
> Data Mining" 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.