[ https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678726#comment-13678726 ]
Grant Ingersoll commented on MAHOUT-1233:
-----------------------------------------

Yannis, any chance you have a small self-contained test? Or, can you reproduce this using any of the examples? Just trying to make it easier to test this.

> Problem in processing datasets as a single chunk vs many chunks in HADOOP
> mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1233
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
>
>
> I am trying to process a dataset, and I do it in two ways.
> First I give it as a single chunk (the whole dataset), and second as many
> smaller chunks in order to increase the throughput of my machine.
> The problem is that when I perform the single-chunk computation the results
> are fine,
> and by fine I mean that if I have 1000 vectors in the input I get 1000
> vector ids with their cluster_ids in the output (I have tried canopy, kmeans
> and fuzzy kmeans).
> However, when I split the dataset in order to speed up the computations,
> strange phenomena occur.
> For instance, when the same dataset of 1000 vectors is split into, for
> example, 10 files, I obtain more vector ids in the output (e.g. 1100
> vector ids with their corresponding cluster ids).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many
> files?
> I have observed that while Mahout is performing the computations, the screen
> says that it processed the correct number of vectors.
> Am I missing something?
> I use as input the transformed to mvc weka vectors.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
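
As a starting point for a small self-contained check, the reported symptom (1100 vector ids coming out for 1000 vectors going in) could be narrowed down by counting how many distinct vector ids appear in the output and which ones repeat. The sketch below is only an illustration and is not part of the original report. It assumes the cluster assignments have already been exported to plain text with one tab-separated `vector_id` and `cluster_id` pair per line; that layout is hypothetical and is not the native output format of clusterdump or seqdumper, and the function name `summarize_assignments` is made up for this example.

```python
from collections import Counter


def summarize_assignments(lines):
    """Summarize hypothetical 'vector_id<TAB>cluster_id' lines.

    Returns (total rows, distinct vector ids, ids seen more than once).
    The input layout is an assumption, not Mahout's own dump format.
    """
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vector_id, _cluster_id = line.split("\t", 1)
        ids.append(vector_id)
    counts = Counter(ids)
    duplicates = {vid: n for vid, n in counts.items() if n > 1}
    return len(ids), len(counts), duplicates


# Tiny made-up sample: vector "v1" is emitted twice.
sample = ["v1\tc0", "v2\tc1", "v1\tc0", "v3\tc1"]
total, distinct, duplicates = summarize_assignments(sample)
print(total, distinct, duplicates)
```

If the distinct count matches the input size while the total is larger, the extra rows are repeated vector ids rather than new vectors, which would point at how the split output is being read or dumped rather than at the clustering itself.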