Re: How to use clusterpp?

Tharindu Mathew Fri, 17 Feb 2012 09:07:52 -0800

Hi,

Thanks for the replies everyone... just getting the hang of things...
appreciate the tolerance for all the dumb questions...


Gaurav, a small question:

You run the clustering and then you run the cluster post processor. I ran
the cluster dumper on the initial clusteredPoints and I get output all with
n=1, r=[] and a very large centroid. So from what I understand, the cluster
algo is run again.

For the output you've shown in the JIRA, for which part did you run the
clustering again? (I have 1000 clusters shown.) I'm asking so that I can
verify that I've run things correctly and that I'm generating the same
output.

On Fri, Feb 17, 2012 at 6:45 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

> For human-readable output, yes.
>
>
> On 2/17/12 6:09 AM, Tharindu Mathew wrote:
>
>> Or I can just use the cluster dump tool right...?
>>
>> On Fri, Feb 17, 2012 at 5:55 PM, Paritosh Ranjan <pran...@xebia.com>
>> wrote:
>>
>>> Try logging in and updating.
>>>
>>> Thanks...
>>> On 17-02-2012 17:54, Tharindu Mathew wrote:
>>>
>>>> OffTopic: How would I contribute a documentation patch?
>>>>
>>>> On Fri, Feb 17, 2012 at 3:11 PM, gaurav redkar <gauravred...@gmail.com>
>>>> wrote:
>>>>
>>>>> If that is the only thing contained in the part-r-* file, then the
>>>>> reducer responsible for writing that part-r-* file did not receive any
>>>>> input records. This happens because the program uses the default hash
>>>>> partitioner, which sometimes maps records belonging to different
>>>>> clusters to the same reducer, leaving some reducers without any input
>>>>> records.
>>>>>
>>>>> The simplest and quickest way to view the contents of the part-r-*
>>>>> files is to change the output format of the job from
>>>>> SequenceFileOutputFormat to TextOutputFormat and comment out the line
>>>>> where the program calls the "movePartFilesToRespectiveDirectories()"
>>>>> function, since this function expects the part-r-* files to be in
>>>>> SequenceFile format. This way you will get all the part files in
>>>>> human-readable format.
>>>>>
>>>>> You can later even modify the "movePartFilesToRespectiveDirectories()"
>>>>> function to move the part-r-* files to their respective directories.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 17, 2012 at 2:36 PM, Paritosh Ranjan <pran...@xebia.com>
>>>>> wrote:
>>>>>
>>>>>> Check this out:
>>>>>> https://cwiki.apache.org/MAHOUT/top-down-clustering.html
>>>>>> It tells how to use clusterpp.
>>>>>>
>>>>>> You will not get a human-readable version.
>>>>>> The output will be in SequenceFile format, which is not human-readable.
>>>>>> SequenceFile is a key-value format. You will have to iterate over it,
>>>>>> read each key and value, and print them to a text file or the console.
>>>>>>
>>>>>> Look into the package org.apache.mahout.common.iterator.sequencefile.
>>>>>> This package contains some utility classes which can help you iterate
>>>>>> through SequenceFile files.
>>>>>>
>>>>>>
>>>>>> On 17-02-2012 14:18, Tharindu Mathew wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to reproduce
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-966
>>>>>>>
>>>>>>> When executing clusterpp, I get output such as this:
>>>>>>>
>>>>>>> $bin/hadoop fs -cat /user/mackie/output/ppclusters/part-r-00999
>>>>>>> SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable_䪖?g???8?-??
>>>>>>>
>>>>>>> Is this normal? I thought I would get some human-readable output when
>>>>>>> this was used... I tried searching around but couldn't find any
>>>>>>> documentation regarding clusterpp.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>
>
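
For reference, here is a minimal sketch of the hash-partitioner behaviour
Gaurav describes in the quoted text above. It is not Mahout or clusterpp code.
The class name, method names and cluster ids are made up for illustration, and
String.hashCode() stands in for the key's hash (Hadoop's Text hashes its bytes
differently). Only the arithmetic matches what Hadoop's default
HashPartitioner does.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; PartitionSketch and its methods are hypothetical names.
final class PartitionSketch {

    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many of the given keys are routed to each reducer.
    static int[] keysPerReducer(List<String> keys, int numReducers) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            counts[partitionFor(key, numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four made-up cluster ids spread over four reducers.
        List<String> clusterIds = Arrays.asList("1", "5", "2", "6");
        int[] counts = keysPerReducer(clusterIds, 4);
        // Reducers 0 and 3 receive no keys, so they have no records to write.
        System.out.println(Arrays.toString(counts)); // prints [0, 2, 2, 0]
    }
}
```

With as many reducers as clusters, collisions like "1" and "5" above are
enough to leave some part-r-* files without any records.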


-- 
Regards,

Tharindu

blog: http://mackiemathew.com/
