[
https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672494#comment-13672494
]
Grant Ingersoll commented on MAHOUT-1233:
-----------------------------------------
Can you provide the exact commands you ran?
> Problem in processing datasets as a single chunk vs many chunks in HADOOP
> mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1233
> URL: https://issues.apache.org/jira/browse/MAHOUT-1233
> Project: Mahout
> Issue Type: Question
> Components: Clustering
> Affects Versions: 0.7, 0.8
> Reporter: yannis ats
> Priority: Minor
>
> I am trying to process a dataset in two ways: first as a single chunk (the
> whole dataset), and second as many smaller chunks in order to increase the
> throughput of my machine.
> When I run the single-chunk computation, the results are fine. By fine I
> mean that if the input has 1000 vectors, the output has 1000 vector IDs with
> their cluster_ids (I have tried this with canopy, kmeans and fuzzy kmeans).
> However, when I split the dataset in order to speed up the computation,
> strange results occur.
> For instance, when the same 1000-vector dataset is split into, for example,
> 10 files, the output contains more vector IDs (e.g. 1100 vector IDs with
> their corresponding cluster IDs).
> The question is: am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many
> files?
> While Mahout is performing the computations, I have observed that the
> on-screen output reports the correct number of processed vectors.
> Am i missing something?
> As input I use Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
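
One way to narrow down the mismatch described above (1000 input vectors but roughly 1100 vector IDs in the output) is to check whether the extra rows are repeated vector IDs. The following is only a minimal sketch, not part of Mahout: it assumes the dumped output has already been parsed into (vector_id, cluster_id) pairs, and the function name summarize_assignments is hypothetical. How to extract those pairs from clusterdump or seqdumper output is not shown, because it depends on the exact commands and output format used.

```python
from collections import Counter


def summarize_assignments(pairs):
    """Summarize (vector_id, cluster_id) pairs parsed from a clustering dump.

    The pair format is an assumption made for this sketch; it is not a
    Mahout output format.
    """
    pairs = list(pairs)
    # Count how many times each vector ID appears in the output.
    counts = Counter(vector_id for vector_id, _ in pairs)
    # Keep only the IDs that appear more than once.
    duplicates = {vid: n for vid, n in counts.items() if n > 1}
    return {
        "total_rows": len(pairs),
        "unique_vector_ids": len(counts),
        "duplicates": duplicates,
    }


if __name__ == "__main__":
    # Made-up illustrative rows, not real results.
    rows = [("v1", "c0"), ("v2", "c1"), ("v1", "c0")]
    print(summarize_assignments(rows))
```

If unique_vector_ids equals the number of input vectors while total_rows is larger, the extra rows are repeated IDs rather than new vectors, which would point at how the output files are being read or dumped rather than at the clustering step itself.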