[ 
https://issues.apache.org/jira/browse/HAMA-834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Illecker updated HAMA-834:
---------------------------------

    Attachment: HAMA-834.patch

{quote}
The number of BSP tasks is determined by the InputFormat. Basically, the number 
of tasks equals the number of blocks of a single input file, or the number of 
input files when there are several. So, you can’t force the number of tasks 
without input partitioning.

Meanwhile, in the GraphJob case, PartitioningRunner creates as many partitions 
as the user requests, and it runs before GraphJobRunner. So, you can set the 
number of tasks for a graph job.
{quote}

Please see the updated patch.
I now split the input vectors (0,0)..(100,100) across multiple input files 
(manual partitioning) to force the number of tasks.

The results depend on the number of tasks:

*numBspTask = 1*
{code}
bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
bsp.BSPJobClient:     SUPERSTEPS=1
bsp.BSPJobClient:     LAUNCHED_TASKS=1
bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
bsp.BSPJobClient:     SUPERSTEP_SUM=2
bsp.BSPJobClient:     IO_BYTES_READ=8847
bsp.BSPJobClient:     TIME_IN_SYNC_MS=0
bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=2
bsp.BSPJobClient:     TASK_INPUT_RECORDS=303
bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=2
bsp.BSPJobClient:     TASK_OUTPUT_RECORDS=101

{0=[50.0, 50.0]}
{code}

*numBspTask = 2*
{code}
bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
bsp.BSPJobClient:     SUPERSTEPS=1
bsp.BSPJobClient:     LAUNCHED_TASKS=2
bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
bsp.BSPJobClient:     SUPERSTEP_SUM=4
bsp.BSPJobClient:     IO_BYTES_READ=13110
bsp.BSPJobClient:     TIME_IN_SYNC_MS=3
bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=8
bsp.BSPJobClient:     TASK_INPUT_RECORDS=450
bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=8
bsp.BSPJobClient:     TASK_OUTPUT_RECORDS=150

{0=[58.166666666666664, 58.166666666666664]}
{code}
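
For comparison, here is a minimal standalone sketch of why the result should not change with the number of tasks. It assumes a single cluster center (as the output above suggests), so the expected center is simply the mean of all input vectors. The class and method names (PartitionedMean, summarize, meanWithTasks) are hypothetical and are not part of Hama; the round-robin split only stands in for the manual partitioning into input files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Hama code: each "task" reports a partial sum and
// a count, and the merged mean is independent of how the input was split.
final class PartitionedMean {

  /** Per-task partial result: component-wise sum plus number of vectors added. */
  static final class Partial {
    final double[] sum;
    final long count;

    Partial(double[] sum, long count) {
      this.sum = sum;
      this.count = count;
    }
  }

  /** Sums all vectors of one partition and counts every one of them. */
  static Partial summarize(List<double[]> partition, int dim) {
    double[] sum = new double[dim];
    long count = 0;
    for (double[] v : partition) {
      for (int d = 0; d < dim; d++) {
        sum[d] += v[d];
      }
      count++;
    }
    return new Partial(sum, count);
  }

  /** Merges the partial results and divides by the total count. */
  static double[] mean(List<Partial> partials, int dim) {
    double[] total = new double[dim];
    long n = 0;
    for (Partial p : partials) {
      for (int d = 0; d < dim; d++) {
        total[d] += p.sum[d];
      }
      n += p.count;
    }
    if (n == 0) {
      throw new IllegalArgumentException("no input vectors");
    }
    for (int d = 0; d < dim; d++) {
      total[d] /= n;
    }
    return total;
  }

  /** Splits (0,0)..(100,100) round-robin over numTasks partitions and returns the mean. */
  static double[] meanWithTasks(int numTasks) {
    List<List<double[]>> parts = new ArrayList<>();
    for (int t = 0; t < numTasks; t++) {
      parts.add(new ArrayList<>());
    }
    for (int i = 0; i <= 100; i++) {
      parts.get(i % numTasks).add(new double[] {i, i});
    }
    List<Partial> partials = new ArrayList<>();
    for (List<double[]> p : parts) {
      partials.add(summarize(p, 2));
    }
    return mean(partials, 2);
  }

  public static void main(String[] args) {
    for (int tasks : new int[] {1, 2, 3}) {
      // Every line prints [50.0, 50.0]
      System.out.println(tasks + " task(s): " + Arrays.toString(meanWithTasks(tasks)));
    }
  }
}
```

As long as every task contributes both its sum and its exact count, the merged center is (50, 50) for any split of the 101 vectors.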


> Fix KMeans example
> ------------------
>
>                 Key: HAMA-834
>                 URL: https://issues.apache.org/jira/browse/HAMA-834
>             Project: Hama
>          Issue Type: Bug
>          Components: examples, machine learning
>    Affects Versions: 0.6.3
>            Reporter: Martin Illecker
>              Labels: example
>             Fix For: 0.7.0
>
>         Attachments: HAMA-834.patch
>
>
> Fix problems in KMeans example and revise test case.
> 1) Typo \[1] and input path issue
> 2) Wrong *summationCount* in assignCentersInternal
> *summationCount* should also be incremented in the following branch \[2]:
> {code}
> if (clusterCenter == null) {
>   newCenterArray[lowestDistantCenter] = key;
> }
> {code}
> Otherwise *summationCount* may stay zero when only one value is assigned. 
> This zero is then propagated to *incrementSum* \[3] and might cause a 
> division by zero in \[4]. 
> Moreover, if we add three vectors but *summationCount* is only two, the 
> result is wrong, because the summed vector is later divided by the number 
> of increments.
> 3) Results depend on the value of *numBspTask*
> (results vary if *numBspTask* is changed)
> \[1]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L518-519
> \[2] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L249
> \[3]
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L161
> \[4] 
> https://github.com/apache/hama/blob/trunk/ml/src/main/java/org/apache/hama/ml/kmeans/KMeansBSP.java#L172
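
The counting problem from point 2 can be reproduced outside Hama with a small sketch. This is a simplified illustration under stated assumptions, not the KMeansBSP code: the class SummationCountSketch, the method average and the countFirst flag are all hypothetical, and the explicit ArithmeticException stands in for whatever the real division in \[4] does with a zero count.

```java
// Hypothetical sketch, not Hama code: the first vector assigned to a center
// initializes the running sum; countFirst controls whether it is also counted.
final class SummationCountSketch {

  /**
   * Averages the vectors assigned to one center.
   * countFirst = false mimics the reported behaviour (first vector not counted),
   * countFirst = true mimics the proposed fix.
   */
  static double[] average(double[][] assigned, boolean countFirst) {
    double[] sum = null;
    int summationCount = 0;
    for (double[] v : assigned) {
      if (sum == null) {
        sum = v.clone(); // first vector becomes the running sum
        if (countFirst) {
          summationCount++; // the increment the issue says is missing
        }
      } else {
        for (int d = 0; d < sum.length; d++) {
          sum[d] += v[d];
        }
        summationCount++;
      }
    }
    if (sum == null || summationCount == 0) {
      // Made explicit here; stands in for the possible divide by zero
      throw new ArithmeticException("summationCount is zero");
    }
    for (int d = 0; d < sum.length; d++) {
      sum[d] /= summationCount;
    }
    return sum;
  }

  public static void main(String[] args) {
    double[][] three = {{1, 1}, {2, 2}, {3, 3}};
    System.out.println(average(three, true)[0]);  // 2.0: sum 6 divided by 3
    System.out.println(average(three, false)[0]); // 3.0: sum 6 divided by 2
  }
}
```

With a single assigned vector and countFirst = false, the count stays at zero and the sketch throws, which corresponds to the divide-by-zero case described above.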



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
