Re: Mahout KMeans clustering results

Arshad Khan Thu, 25 Feb 2010 20:49:52 -0800

Thanks for the reply.

I found the following in utils/vectors/lucene/Driver.java


SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, path,
                                          LongWritable.class,
                                          VectorWritable.class);

This seems to work.

However, I have now hit another issue when using KMeansDriver. Previously,
the clusters were named 0, 1, 2, etc. in the final points file. Now they
are named after the centroid id (e.g. 122, 156, etc.). Is there a way to
get the older behaviour?
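
If the older 0, 1, 2... labels are only needed downstream, one possible workaround is to relabel the centroid ids after clustering. The following is a minimal sketch under that assumption; `ClusterIdRemapper` is a hypothetical helper, not part of Mahout, and it numbers clusters in order of first appearance, which may not match the numbering that 0.2 produced.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: relabels arbitrary cluster ids
// (e.g. "122", "156") as 0, 1, 2, ... in order of first appearance.
final class ClusterIdRemapper {
    private final Map<String, Integer> indexById = new LinkedHashMap<>();

    // Returns the sequential index for a cluster id, assigning the next
    // unused index the first time the id is seen.
    public int indexOf(String clusterId) {
        Integer index = indexById.get(clusterId);
        if (index == null) {
            index = indexById.size();
            indexById.put(clusterId, index);
        }
        return index;
    }

    // Relabels a whole list of per-point cluster ids.
    public List<Integer> relabel(List<String> clusterIds) {
        List<Integer> labels = new ArrayList<>(clusterIds.size());
        for (String id : clusterIds) {
            labels.add(indexOf(id));
        }
        return labels;
    }
}
```

The ids would be read from the points file and passed through `relabel`; reusing the same remapper instance keeps the mapping stable within one run.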

Thanks again



On Thu, Feb 25, 2010 at 4:54 PM, Robin Anil <[email protected]> wrote:

> Yeah... now you can't write RandomAccessSparseVectors (or any other impls)
> directly.
>
> There is a class, VectorWritable, which does the serialization.
>
> Use VectorWritable.class as the value class when creating the SequenceFile
> writer.
>
> For writing:
>
> VectorWritable writable = new VectorWritable();
>
> for (each data) {
>   Vector v = new RandomAccessSparseVector(your data);
>
>   writable.set(v);
>   // SequenceFile.Writer exposes append(key, value) rather than write()
>   writer.append(key, writable);
> }
>
>
>
>
>
>
> On Thu, Feb 25, 2010 at 2:19 PM, Arshad Khan <[email protected]
> >wrote:
>
> > This may not be related to Mahout, but I am seeing an NPE when creating a
> > SequenceFile.Writer. The following code used to work fine with Mahout 0.2
> > but now breaks with 0.3. I suspected that some dependencies might be
> > causing that, so I updated the classpath with all deps for 0.3, but the
> > exception still happens. The only diff from before is the use of
> > RandomAccessSparseVector instead of SparseVector.
> >
> >        VectorWriter sfWriter;
> >        Path path = new Path(outFile);
> >        Configuration conf = new Configuration();
> >        FileSystem fs = FileSystem.get(conf);
> >
> >        // NPE happens on following line
> >        SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, path,
> >                                          LongWritable.class,
> >                                          RandomAccessSparseVector.class);
> >
> > Stack trace is as follows:
> > java.lang.NullPointerException
> >        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
> >        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
> >        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
> >        at com.lilly.tma.lm.cluster.Analyzer.getSeqFileWriter(Analyzer.java:119)
> >
> > Any help is greatly appreciated.
> >
> > Thanks
> >
> > On Thu, Feb 25, 2010 at 2:27 PM, Arshad Khan <[email protected]
> > >wrote:
> >
> > > Thanks for the help and explanation. :)
> > >
> > >
> > > On Thu, Feb 25, 2010 at 1:20 PM, Jake Mannix <[email protected]
> > >wrote:
> > >
> > >> And to clarify: you can use either one, but you should think of them
> > >> like this:
> > >> RandomAccessSparseVector is useful for vectors whose contents change
> > >> a great deal (the moving centroids of a clustering algorithm, for
> > >> example), and SequentialAccessSparseVector is useful (i.e. faster) in
> > >> the case where vectors are built up and are then used in an essentially
> > >> immutable fashion (you repeatedly compute a lot of dot products and add
> > >> multiples of them onto other vectors [either DenseVectors or
> > >> RandomAccessSparseVectors]).
> > >>
> > >>  -jake
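
The trade-off described above can be illustrated with a small, self-contained sketch. The classes below are hypothetical stand-ins written for this illustration and are not Mahout's implementation: `HashSparse` assumes hash-backed storage for vectors that are updated often, and `SortedSparse` assumes index-sorted parallel arrays for vectors that are built once and then scanned repeatedly.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: these classes are hypothetical and are not Mahout's code.
final class SparseVectorSketch {

    // Hash-backed storage: cheap to update individual entries in any order.
    static final class HashSparse {
        final Map<Integer, Double> values = new HashMap<>();

        // Adds delta to the entry at index, creating the entry if needed.
        void add(int index, double delta) {
            values.merge(index, delta, Double::sum);
        }
    }

    // Sorted parallel arrays: awkward to mutate, but fast to scan in order.
    static final class SortedSparse {
        final int[] indices;   // strictly increasing
        final double[] values; // values[i] belongs to indices[i]

        SortedSparse(int[] indices, double[] values) {
            this.indices = indices;
            this.values = values;
        }

        // Dot product of two sorted vectors via a single merge-style pass.
        double dot(SortedSparse other) {
            double sum = 0.0;
            int i = 0;
            int j = 0;
            while (i < indices.length && j < other.indices.length) {
                if (indices[i] == other.indices[j]) {
                    sum += values[i++] * other.values[j++];
                } else if (indices[i] < other.indices[j]) {
                    i++;
                } else {
                    j++;
                }
            }
            return sum;
        }

        // Adds scale * this onto a hash-backed accumulator, such as a
        // moving centroid.
        void addTo(HashSparse target, double scale) {
            for (int i = 0; i < indices.length; i++) {
                target.add(indices[i], scale * values[i]);
            }
        }
    }
}
```

The merge-style `dot` never performs a hash lookup, which is the kind of access pattern where sequential storage tends to win; the accumulator is hash-backed because its entries are touched in no particular order.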
> > >>
> > >> On Wed, Feb 24, 2010 at 7:45 PM, Robin Anil <[email protected]>
> > wrote:
> > >>
> > >> > They are replaced by two impls: RandomAccessSparseVector and
> > >> > SequentialAccessSparseVector.
> > >> >
> > >> >
> > >> > On Thu, Feb 25, 2010 at 9:10 AM, Arshad Khan <
> [email protected]
> > >> > >wrote:
> > >> >
> > >> > > Thanks for the quick reply.
> > >> > >
> > >> > > I have downloaded the latest 0.3 code. There seem to be significant
> > >> > > changes in this version. For example, I am currently using the
> > >> > > org.apache.mahout.matrix.SparseVector class, but I cannot find this
> > >> > > class in 0.3.
> > >> > >
> > >> > > What class has it been replaced with?
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Thu, Feb 25, 2010 at 10:12 AM, Ted Dunning <
> > [email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > There are known problems with that version of k-means.
> > >> > > >
> > >> > > > Try using the trunk version.  0.3 is very close and we are entering
> > >> > > > code freeze for that, so you should be fine with the latest version.
> > >> > > >
> > >> > > > On Wed, Feb 24, 2010 at 5:46 PM, Arshad Khan <
> > >> [email protected]
> > >> > > > >wrote:
> > >> > > >
> > >> > > > > Hello
> > >> > > > >
> > >> > > > > I am using the Mahout 0.2 implementation of KMeans in one of my
> > >> > > > > text mining projects. I apply KMeans with a default K value of 4.
> > >> > > > > It seems that every time I repeat the clustering process on the
> > >> > > > > same data set, the results are different, and the difference (in
> > >> > > > > terms of cluster size and membership) is great from run to run.
> > >> > > > > The initial set of centroid points is chosen randomly through
> > >> > > > > RandomSeedGenerator. Is there a way to obtain more consistent
> > >> > > > > results that do not differ so greatly? Or maybe I am doing
> > >> > > > > something wrong?
> > >> > > > >
> > >> > > > > Any help or idea is very much appreciated.
> > >> > > > >
> > >> > > > > Thanks and Regards
> > >> > > > > Arshad
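
Because k-means is sensitive to its starting centroids, one general way to make runs repeatable is to fix the seed used for the initial selection. The sketch below is an assumption-laden illustration in plain Java: `SeededCentroidPicker` is a hypothetical helper and does not reflect how Mahout's RandomSeedGenerator works or which options it exposes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not Mahout's RandomSeedGenerator.
final class SeededCentroidPicker {

    // Picks k distinct point indices out of n. The same (n, k, seed)
    // always yields the same indices, so the initial centroids repeat.
    static List<Integer> pick(int n, int k, long seed) {
        if (k < 0 || k > n) {
            throw new IllegalArgumentException("need 0 <= k <= n");
        }
        List<Integer> indices = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Shuffling with an explicitly seeded Random is deterministic.
        Collections.shuffle(indices, new Random(seed));
        return new ArrayList<>(indices.subList(0, k));
    }
}
```

A fixed seed only makes the starting point repeatable; it does not by itself show that the resulting clusters are good, so comparing a few different seeds is still worthwhile.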
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Ted Dunning, CTO
> > >> > > > DeepDyve
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
