Re: Mahout KMeans clustering results

Arshad Khan Thu, 25 Feb 2010 00:49:47 -0800

This may not be related to Mahout, but I am seeing an NPE when creating a
SequenceFile.Writer. The following code worked fine with Mahout 0.2 but
breaks with 0.3. I suspected a dependency problem, so I updated the
classpath with all the dependencies for 0.3, but the exception still
happens. The only difference from before is the use of
RandomAccessSparseVector instead of SparseVector.


        VectorWriter sfWriter;
        Path path = new Path(outFile);
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // NPE happens on the following line
        SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, path,
                LongWritable.class, RandomAccessSparseVector.class);

The stack trace is as follows:
java.lang.NullPointerException
        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
        at com.lilly.tma.lm.cluster.Analyzer.getSeqFileWriter(Analyzer.java:119)

Any help is greatly appreciated.

Thanks
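
A possible reading of the trace above, offered as a guess rather than a confirmed diagnosis: the NPE is thrown inside SerializationFactory.getSerializer, which would be consistent with no registered serialization accepting the value class, for instance because RandomAccessSparseVector does not implement Writable in 0.3. The toy model below uses only the JDK and hypothetical names (ToyWritable, WrappedVector, BareVector, ToySerialization); it is not Hadoop's source, it only sketches how a class-keyed lookup that returns null for an unsupported type ends in a NullPointerException.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: every name here is hypothetical and stands in for the
// Hadoop classes named in the stack trace; it is not Hadoop's source code.
class SerializerLookupSketch {

    /** Stand-in for a marker interface such as Hadoop's Writable. */
    interface ToyWritable {}

    /** A value type that implements the marker interface. */
    static class WrappedVector implements ToyWritable {}

    /** A value type that does not (assumed analogue of a bare vector class). */
    static class BareVector {}

    /** Stand-in for a serialization that only accepts ToyWritable types. */
    static class ToySerialization {
        boolean accept(Class<?> c) {
            return ToyWritable.class.isAssignableFrom(c);
        }

        String getSerializer(Class<?> c) {
            return "serializer for " + c.getSimpleName();
        }
    }

    private final List<ToySerialization> serializations = new ArrayList<>();

    SerializerLookupSketch() {
        serializations.add(new ToySerialization());
    }

    /** Returns the first serialization that accepts c, or null if none does. */
    ToySerialization getSerialization(Class<?> c) {
        for (ToySerialization s : serializations) {
            if (s.accept(c)) {
                return s;
            }
        }
        return null;
    }

    /** Dereferences the lookup result without a null check. */
    String getSerializer(Class<?> c) {
        return getSerialization(c).getSerializer(c);
    }

    public static void main(String[] args) {
        SerializerLookupSketch factory = new SerializerLookupSketch();
        System.out.println(factory.getSerializer(WrappedVector.class));
        try {
            factory.getSerializer(BareVector.class);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException for BareVector");
        }
    }
}
```

If that is what is happening here, the thing to try would be passing a Writable wrapper as the value class instead of RandomAccessSparseVector.class. Mahout 0.3 appears to provide org.apache.mahout.math.VectorWritable for this purpose, but treat that as an assumption to check against the 0.3 Javadoc.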

On Thu, Feb 25, 2010 at 2:27 PM, Arshad Khan <[email protected]> wrote:

> Thanks for the help and explanation. :)
>
>
> On Thu, Feb 25, 2010 at 1:20 PM, Jake Mannix <[email protected]> wrote:
>
>> And to clarify: you can use either one, but you should think of them like
>> this:
>> RandomAccessSparseVector is useful for vectors whose contents change
>> a great deal (the moving centroids of a clustering algorithm, for
>> example), and SequentialAccessSparseVector is useful (i.e. faster) in the
>> case where vectors are built up and then used in an essentially immutable
>> fashion (you repeatedly compute a lot of dot products and add multiples of
>> them onto other vectors [either DenseVectors or
>> RandomAccessSparseVectors]).
>>
>>  -jake
>>
>> On Wed, Feb 24, 2010 at 7:45 PM, Robin Anil <[email protected]> wrote:
>>
>> > It is replaced by the two impls RandomAccessSparseVector and
>> > SequentialAccessSparseVector
>> >
>> >
>> > On Thu, Feb 25, 2010 at 9:10 AM, Arshad Khan <[email protected]> wrote:
>> >
>> > > Thanks for the quick reply.
>> > >
>> > > I have downloaded the latest 0.3 code. There seem to be significant
>> > > changes in this version. For example, I am currently using the
>> > > org.apache.mahout.matrix.SparseVector class, but in 0.3 I cannot find
>> > > this class.
>> > >
>> > > What class is it replaced with?
>> > >
>> > > Thanks
>> > >
>> > > On Thu, Feb 25, 2010 at 10:12 AM, Ted Dunning <[email protected]> wrote:
>> > >
>> > > > There are known problems with that version of k-means.
>> > > >
>> > > > Try using the trunk version.  0.3 is very close and we are entering
>> > > > code freeze for that so you should be fine with the latest version.
>> > > >
>> > > > On Wed, Feb 24, 2010 at 5:46 PM, Arshad Khan <[email protected]> wrote:
>> > > >
>> > > > > Hello
>> > > > >
>> > > > > I am using the Mahout 0.2 implementation of KMeans in one of my
>> > > > > text mining projects. I apply KMeans with a default K value of 4.
>> > > > > Every time I repeat the clustering process on the same data set,
>> > > > > the results are different, and the difference (in terms of cluster
>> > > > > size and membership) is large from run to run. The initial set of
>> > > > > centroid points is chosen randomly through RandomSeedGenerator. Is
>> > > > > there a way to obtain more consistent results that do not differ
>> > > > > so greatly? Or maybe I am doing something wrong?
>> > > > >
>> > > > > Any help or idea is very much appreciated.
>> > > > >
>> > > > > Thanks and Regards
>> > > > > Arshad
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Ted Dunning, CTO
>> > > > DeepDyve
>> > > >
>> > >
>> >
>>
>
>
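
On the original question in the quoted thread about results changing from run to run: independent of any Mahout-specific problem, one general way to make a random choice of initial centroids repeatable is to drive it from a fixed seed. The sketch below is plain JDK code with hypothetical names (SeededCentroidSketch, pickInitialCentroids); it is not Mahout's RandomSeedGenerator and makes no claim about what that class accepts. It only shows that the same seed yields the same K = 4 starting indices.

```java
import java.util.Arrays;
import java.util.Random;

// Generic illustration only: these names are hypothetical and this is not
// Mahout's RandomSeedGenerator.
class SeededCentroidSketch {

    /**
     * Picks k distinct point indices in [0, numPoints) with a partial
     * Fisher-Yates shuffle driven by a Random created from the given seed.
     */
    static int[] pickInitialCentroids(int numPoints, int k, long seed) {
        if (k < 0 || k > numPoints) {
            throw new IllegalArgumentException("k must be between 0 and numPoints");
        }
        int[] indices = new int[numPoints];
        for (int i = 0; i < numPoints; i++) {
            indices[i] = i;
        }
        Random rng = new Random(seed);
        for (int i = 0; i < k; i++) {
            int j = i + rng.nextInt(numPoints - i); // j is uniform in [i, numPoints)
            int tmp = indices[i];
            indices[i] = indices[j];
            indices[j] = tmp;
        }
        return Arrays.copyOf(indices, k);
    }

    public static void main(String[] args) {
        int[] first = pickInitialCentroids(1000, 4, 42L);
        int[] second = pickInitialCentroids(1000, 4, 42L);
        System.out.println(Arrays.toString(first));
        System.out.println(Arrays.equals(first, second));
    }
}
```

A fixed seed makes runs repeatable but does not make the clustering better; if different seeds still give very different clusters, that points at the data, the choice of K, or the implementation rather than at the seeding.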
