[jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos

Grant Ingersoll (JIRA) Sat, 08 Jun 2013 04:13:23 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678726#comment-13678726
 ]


Grant Ingersoll commented on MAHOUT-1233:
-----------------------------------------

Yannis, any chance you have a small self-contained test?  Or can you reproduce 
this using any of the examples?  Just trying to make it easier to test this.
                
> Problem in processing datasets as a single chunk vs many chunks in HADOOP 
> mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1233
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
>
>
> I am trying to process a dataset, and I do it in two ways:
> first as a single chunk (the whole dataset), and second as many 
> smaller chunks in order to increase the throughput of my machine.
> When I run the single-chunk computation, the results are fine. By fine I 
> mean that if the input has 1000 vectors, the output has 1000 vector IDs 
> with their cluster IDs (I have tried this with canopy, k-means, and 
> fuzzy k-means).
> However, when I split the dataset in order to speed up the computation, 
> strange things happen.
> For instance, if the same 1000-vector dataset is split into, for 
> example, 10 files, the output contains more vector IDs (e.g. 1100 
> vector IDs with their corresponding cluster IDs).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many 
> files?
> While Mahout is performing the computations, I have observed that the 
> screen output reports the correct number of processed vectors.
> Am I missing something?
> As input I use Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
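
One way to narrow down where the extra vector IDs come from is to check whether the surplus entries are repeated IDs or genuinely new ones. The sketch below is only an illustration and is not part of Mahout: the function name `summarize_assignments` is hypothetical, and it assumes the (vector ID, cluster ID) pairs have already been extracted from the clustering output by whatever means is convenient.

```python
from collections import Counter


def summarize_assignments(pairs):
    """Summarize (vector_id, cluster_id) pairs taken from a clustering run.

    `pairs` is assumed to be an iterable of 2-tuples that the caller has
    already extracted from the output; this helper does no parsing itself.
    """
    pairs = list(pairs)

    # How many times each vector ID appears in the output.
    counts = Counter(vector_id for vector_id, _ in pairs)
    duplicates = {vid: n for vid, n in counts.items() if n > 1}

    # Which cluster IDs each vector ID was assigned to.
    clusters_per_id = {}
    for vector_id, cluster_id in pairs:
        clusters_per_id.setdefault(vector_id, set()).add(cluster_id)
    conflicting = {
        vid: sorted(cids)
        for vid, cids in clusters_per_id.items()
        if len(cids) > 1
    }

    return {
        "total": len(pairs),          # rows in the output
        "unique": len(counts),        # distinct vector IDs
        "duplicates": duplicates,     # IDs that appear more than once
        "conflicting": conflicting,   # IDs assigned to more than one cluster
    }
```

If `total` exceeds the number of input vectors while `unique` matches it, the surplus consists of repeated IDs; comparing the single-chunk and multi-chunk runs this way would show whether the duplicates also carry different cluster IDs.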

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
