Hi,

Thanks for the replies, everyone. I'm just getting the hang of things, and I appreciate the tolerance for all the dumb questions.
Gaurav, a small question: You run the clustering and then you run the cluster post processor. I ran the cluster dumper on the initial clusteredPoints and I get output all with n=1, r=[] and a very large centroid. So from what I understand, the clustering algorithm is run again. For the output you've shown in the JIRA, on which part did you run the clustering again? (I have 1000 clusters shown.)

I'm asking this so I can verify that I've run things correctly and that I'm generating the same output.

On Fri, Feb 17, 2012 at 6:45 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> For human-readable output, yes.
>
> On 2/17/12 6:09 AM, Tharindu Mathew wrote:
>
>> Or I can just use the cluster dump tool right...?
>>
>> On Fri, Feb 17, 2012 at 5:55 PM, Paritosh Ranjan <pran...@xebia.com> wrote:
>>
>>> Try logging in and updating.
>>>
>>> Thanks...
>>>
>>> On 17-02-2012 17:54, Tharindu Mathew wrote:
>>>
>>>> OffTopic: How would I contribute a documentation patch?
>>>>
>>>> On Fri, Feb 17, 2012 at 3:11 PM, gaurav redkar <gauravred...@gmail.com> wrote:
>>>>
>>>>> If that is the only thing that is contained in the part-r-* file, then
>>>>> the reducer responsible for writing that part-r-* file did not receive
>>>>> any input records to write to it. This happens because the program uses
>>>>> the default hash partitioner, which sometimes maps records belonging to
>>>>> different clusters to the same reducer, thus leaving some reducers
>>>>> without any input records.
>>>>>
>>>>> The simplest and quickest way to view the contents of the part-r-*
>>>>> files will be to change the output format of the job from
>>>>> SequenceFileOutputFormat to TextOutputFormat and comment out the line
>>>>> where the program calls the "movePartFilesToRespectiveDirectories()"
>>>>> function, since this function expects the part-r-* files to be in
>>>>> sequence file format. This way you will get all the part files in
>>>>> human-readable format.
>>>>>
>>>>> You can later even modify the "movePartFilesToRespectiveDirectories()"
>>>>> function to move the part-r-* files to their respective directories.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> On Fri, Feb 17, 2012 at 2:36 PM, Paritosh Ranjan <pran...@xebia.com> wrote:
>>>>>
>>>>>> Check this out: https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>>>>> It tells how to use clusterpp.
>>>>>>
>>>>>> You will not get a human-readable version.
>>>>>> The output will be in SequenceFileFormat, which is not human readable.
>>>>>> SequenceFileFormat is a key-value format. You will have to iterate over
>>>>>> it, read each key and value, and print them to a text file or the console.
>>>>>>
>>>>>> Look into the package org.apache.mahout.common.iterator.sequencefile.
>>>>>> It contains some utility classes which can help you iterate
>>>>>> through SequenceFileFormat files.
>>>>>>
>>>>>> On 17-02-2012 14:18, Tharindu Mathew wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to reproduce
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-966
>>>>>>>
>>>>>>> When executing clusterpp, I get output such as this:
>>>>>>>
>>>>>>> $bin/hadoop fs -cat /user/mackie/output/ppclusters/part-r-00999
>>>>>>> SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable_䪖?g???8?-??
>>>>>>>
>>>>>>> Is this normal? I thought I would get some human-readable output when
>>>>>>> this was used... I tried searching around but couldn't find any
>>>>>>> documentation regarding clusterpp.

--
Regards,

Tharindu

blog: http://mackiemathew.com/
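
For anyone following the thread, the empty part-r-* files that Gaurav describes in the quoted reply can be illustrated with a small standalone sketch. It uses the same arithmetic as Hadoop's default HashPartitioner (sign bit cleared, then modulo the number of reducers), but everything else is an assumption made for illustration: the class name PartitionSketch, the helper names, the Integer keys (the real job appears to use Text keys, whose hash codes differ), and the cluster ids themselves. It is not the clusterpp code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Mahout or Hadoop.
class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit of hashCode(), then take it modulo the reducer count.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts the reducers that are never chosen for any key.
    // Per the quoted reply, such a reducer receives no input records.
    static int countEmptyReducers(List<?> keys, int numReducers) {
        Set<Integer> used = new HashSet<>();
        for (Object key : keys) {
            used.add(partition(key, numReducers));
        }
        return numReducers - used.size();
    }

    public static void main(String[] args) {
        // Made-up cluster ids; Integer keys stand in for the job's real keys.
        List<Integer> clusterIds = Arrays.asList(0, 4, 8);
        int reducers = 4;
        for (Integer id : clusterIds) {
            System.out.println("cluster " + id + " -> reducer " + partition(id, reducers));
        }
        System.out.println("reducers with no input: "
                + countEmptyReducers(clusterIds, reducers));
    }
}
```

With these made-up ids, several clusters collide on one reducer while the others get nothing, which matches the explanation that a part-r-* file containing only the SequenceFile header comes from a reducer that received no records.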