[ 
https://issues.apache.org/jira/browse/MAHOUT-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pallavi Palleti updated MAHOUT-79:
----------------------------------

    Attachment: FUZZY.patch

This implementation makes three major changes.

One improves speed:
1. The existing implementation passed the centroid information as the key to 
the next tasks (combiner and reducer). 
When the dimensionality is high, passing this much data as a key causes 
out-of-memory errors, because it is difficult to hold all of it in memory.
So, in this implementation, the mapper emits only the cluster-id as the key.
In the reducer phase, we read the cluster information in the configure method 
and look up each cluster through a map from id to softcluster object. 
This is safe because the cluster values do not change until an iteration 
ends, so the code can be optimized this way to improve speed.
I have personally seen run time drop from hours to minutes.
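
The lookup described above can be sketched in plain Java, without any Hadoop 
classes. This is only an illustration, not the code in FUZZY.patch: the names 
SoftClusterInfo, loadClusters and centerFor are hypothetical, and loadClusters 
stands in for whatever the reducer's configure method does to read the 
clusters.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterIdKeySketch {

    /** Hypothetical stand-in for a soft cluster: just an id and a centroid. */
    static final class SoftClusterInfo {
        final int id;
        final double[] center;

        SoftClusterInfo(int id, double[] center) {
            this.id = id;
            this.center = center;
        }
    }

    /** Built once per task (the role configure plays), read-only afterwards. */
    static Map<Integer, SoftClusterInfo> loadClusters(List<SoftClusterInfo> clusters) {
        Map<Integer, SoftClusterInfo> byId = new HashMap<>();
        for (SoftClusterInfo c : clusters) {
            byId.put(c.id, c);
        }
        return byId;
    }

    /** The key carries only the id; the large centroid comes from the map. */
    static double[] centerFor(int clusterIdKey, Map<Integer, SoftClusterInfo> byId) {
        SoftClusterInfo c = byId.get(clusterIdKey);
        if (c == null) {
            throw new IllegalArgumentException("unknown cluster id " + clusterIdKey);
        }
        return c.center;
    }

    public static void main(String[] args) {
        Map<Integer, SoftClusterInfo> byId = loadClusters(Arrays.asList(
                new SoftClusterInfo(0, new double[] {0.0, 0.0}),
                new SoftClusterInfo(1, new double[] {5.0, 5.0})));
        // A mapper would emit the small key 1 instead of the whole centroid.
        System.out.println(Arrays.toString(centerFor(1, byId)));
    }
}
```

The key stays a single integer however many dimensions the centroid has, 
which is where the saving in data transferred between map and reduce comes 
from.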


Two are bug fixes:
1. The combiner is removed, because there is no guarantee of how many times a 
combiner runs on a dataset. It may run zero or more times. If it runs more 
than once, the result is logically wrong. So, the combiner is removed in the 
new implementation.
2. The previous implementation had a logical bug: I used multiplication where 
a power was required. I fixed it in this implementation.
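
To show how much the second bug can matter, here is a small sketch. The 
message does not say which expression was affected, so this is an assumption: 
it uses the common fuzzy k-means step of raising a membership value u to the 
fuzziness exponent m. The method names are hypothetical.

```java
public class PowerVsMultiplySketch {

    /** Intended form: membership raised to the fuzziness exponent, u^m. */
    static double weightWithPower(double u, double m) {
        return Math.pow(u, m);
    }

    /** The kind of mistake described: multiplication in place of power, u*m. */
    static double weightWithMultiply(double u, double m) {
        return u * m;
    }

    public static void main(String[] args) {
        double u = 0.5; // assumed membership value, for illustration only
        double m = 2.0; // assumed fuzziness exponent, for illustration only
        System.out.println("power:    " + weightWithPower(u, m));    // 0.25
        System.out.println("multiply: " + weightWithMultiply(u, m)); // 1.0
    }
}
```

With these values the two forms differ by a factor of four, so a point that 
should contribute weakly to a cluster would contribute at full weight.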


NOTE: Both points above (removing the combiner and improving speed) also 
apply to K-Means, because:
1. K-Means modifies the data points in the combiner, and per the Hadoop 
specification there is no guarantee that the combiner runs only once over a 
data point. So, it may produce incorrect results.
2. Passing only the cluster-id improves speed, as it reduces the amount of 
data transferred between map and reduce tasks.

The idea of passing the cluster-id rather than the whole cluster can be 
applied wherever it fits in other Mahout implementations.
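
The combiner concern in the note can be illustrated with a minimal sketch. 
The combine step here is hypothetical and is not the K-Means or Fuzzy K-Means 
combiner from Mahout; it only assumes a step that modifies the value it 
receives, so that its effect depends on how many times it runs.

```java
public class CombinerRunCountSketch {

    /** Hypothetical non-idempotent combine step: it rescales its input. */
    static double combine(double value, double weight) {
        return value * weight;
    }

    /** Applies the combine step the given number of times (zero or more). */
    static double afterCombinerRuns(double value, double weight, int runs) {
        double v = value;
        for (int i = 0; i < runs; i++) {
            v = combine(v, weight);
        }
        return v;
    }

    public static void main(String[] args) {
        double value = 8.0;  // assumed input, for illustration only
        double weight = 0.5; // assumed weight, for illustration only
        for (int runs = 0; runs <= 2; runs++) {
            System.out.println(runs + " run(s): "
                    + afterCombinerRuns(value, weight, runs));
        }
        // Prints 8.0, 4.0 and 2.0: the reducer input depends on the run count.
    }
}
```

A combiner is only safe when the reducer receives an equivalent input 
whether the combiner ran zero, one or several times; a step like the one 
above does not satisfy that.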







> Improving the speed of Fuzzy K-Means by optimizing data transfer between map 
> and reduce tasks
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-79
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-79
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>         Attachments: FUZZY.patch
>
>
> Improve the speed of fuzzy k-Means by passing only the cluster-id info as key 
> output of mapper task and reading the cluster information in reducer task 
> where this info is needed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
