If you want to keep many small files in one file on HDFS, you need to
pack them somehow.

You should run one of the clustering examples and examine each file
along the path. They all have a custom class that parses the input
(email, Reuters article, email archive) into pieces. Usually the first
pass reads raw files in some format (email, Reuters article, Wikipedia)
and writes them as key/value pairs in a SequenceFile, with, say, the
file name as the key and the text as the value. This is usually fast.
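
To make the packing idea concrete, here is a rough sketch in plain
Python. It is NOT the Hadoop SequenceFile format or API; the record
layout (4-byte length prefix, then UTF-8 bytes), the function names,
and the sample reviews are all made up for illustration. It only shows
that many tiny documents can live in one stream as key/value records,
so each review does not need its own file.

```python
import io
import struct

def write_records(stream, records):
    """Append (key, value) string pairs as length-prefixed UTF-8 fields."""
    for key, value in records:
        for field in (key, value):
            data = field.encode("utf-8")
            stream.write(struct.pack(">I", len(data)))  # 4-byte big-endian length
            stream.write(data)

def read_records(stream):
    """Yield (key, value) pairs back in the order they were written."""
    while True:
        header = stream.read(4)
        if not header:  # end of stream
            return
        (key_len,) = struct.unpack(">I", header)
        key = stream.read(key_len).decode("utf-8")
        (value_len,) = struct.unpack(">I", stream.read(4))
        value = stream.read(value_len).decode("utf-8")
        yield key, value

# Hypothetical sample data: one record per review, keyed by a review id.
reviews = [
    ("review-001", "Great battery life, a bit pricey"),
    ("review-002", "Buggy firmware, stopped working"),
]

packed = io.BytesIO()  # stands in for a single file on HDFS
write_records(packed, reviews)
packed.seek(0)
unpacked = list(read_records(packed))
```

Each review stays a separate document (one key, one value), so packing
them this way should not change what the later passes see.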

The second pass turns these into term vectors. This creates a global
list of all of the words in all documents; this is the slow one.
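
As a rough illustration of that second pass, the sketch below builds a
global word list over all documents and then turns each document into a
sparse vector of raw term counts. This is plain Python and not Mahout's
implementation; the tokenizer, the function names, and the sample
documents are assumptions, and it leaves out any weighting (such as
TF-IDF) that a real pipeline might apply.

```python
import re
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

def build_dictionary(docs):
    """Global pass: give every distinct word across all documents an index."""
    words = sorted({w for text in docs.values() for w in tokenize(text)})
    return {word: index for index, word in enumerate(words)}

def term_vector(text, dictionary):
    """Sparse term-count vector for one document: {word index: count}."""
    counts = Counter(tokenize(text))
    return {dictionary[word]: n for word, n in counts.items()}

# Hypothetical key/value pairs, as the first pass might have produced them.
docs = {
    "review-001": "Great price, great quality",
    "review-002": "Buggy and the price is high",
}

dictionary = build_dictionary(docs)
vectors = {key: term_vector(text, dictionary) for key, text in docs.items()}
```

The dictionary has to see every document before any vector can be
written, which is one reason this pass is slower than the first.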

On Mon, Apr 9, 2012 at 7:54 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> Thanks! One thing I am not clear is if each customer review which might be
> just few bytes need to be in separate files? I am planning to utilize
> hadoop so I was thinking of using SequenceFiles to dump all the raw
> comments in a sequenceFile but I am not sure if it would mess up any TFDF
> or anything like that. Could someone help me clarify?
>
> On Sun, Apr 8, 2012 at 11:00 PM, Sean Owen <sro...@gmail.com> wrote:
>
>> I think you would cluster these like any other text document. The
>> centroid of each cluster tells you where the cluster is in
>> feature-space, but the features are just words. If you find the
>> features (words) with largest absolute value, those ought to be the
>> words that appear frequently in the cluster and are what they are
>> "about".
>>
>> As to ratings, not sure how you might want to involve them?
>>
>> On Sun, Apr 8, 2012 at 11:44 PM, Mohit Anchlia <mohitanch...@gmail.com>
>> wrote:
>> > I am new to Mahout and just going through some tutorials. One of the
>> > requirements I am working on involves extracting customer reviews from
>> > Amazon for a given item and then clustering those into similar topics to
>> > see what in general users have been talking about. So for eg: Rating of >
>> > 3 could say user experience is good, quality or rating of <=3 could say
>> > price, buggy etc.
>> >
>> > Could anyone suggest what would be the best way to approach this?
>>



-- 
Lance Norskog
goks...@gmail.com
