Yeah... you can't write RandomAccessSparseVector (or any of the Vector
implementations) directly any more. There is a class, VectorWritable, which
does the serialization: pass VectorWritable.class to the SequenceFile
constructor, then wrap each vector for writing:
VectorWritable writable = new VectorWritable();
for (/* each data point */) {
    Vector v = new RandomAccessSparseVector(/* your data */);
    writable.set(v);
    writer.append(key, writable);  // SequenceFile.Writer uses append(), not write()
}
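Putting that together, here is a minimal end-to-end sketch. It assumes the Mahout 0.3 package layout (org.apache.mahout.math) and uses a hypothetical local output path, vectors.seq, plus toy data in place of your real vectors:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("vectors.seq");  // hypothetical output path

    // The key class can be any Writable; the VALUE class must be
    // VectorWritable, not the Vector implementation itself.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, LongWritable.class, VectorWritable.class);
    try {
      LongWritable key = new LongWritable();
      VectorWritable value = new VectorWritable();
      double[][] data = {{1.0, 0.0, 2.0}, {0.0, 3.0, 0.0}};  // toy data
      for (int i = 0; i < data.length; i++) {
        Vector v = new RandomAccessSparseVector(data[i].length);
        for (int j = 0; j < data[i].length; j++) {
          if (data[i][j] != 0.0) {
            v.setQuick(j, data[i][j]);  // only store the non-zeros
          }
        }
        key.set(i);
        value.set(v);       // wrap the vector for serialization
        writer.append(key, value);
      }
    } finally {
      writer.close();
    }
  }
}
```

Reading the file back works the same way in reverse: iterate with a SequenceFile.Reader and call VectorWritable.get() to recover each Vector.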
On Thu, Feb 25, 2010 at 2:19 PM, Arshad Khan <[email protected]> wrote:
> This may not be related to Mahout, but I am seeing an NPE when creating a
> SequenceFile.Writer. The following code used to work fine with Mahout 0.2
> but now breaks with 0.3. I suspected that some dependencies might be causing
> that, so I updated the classpath with all deps for 0.3, but the exception
> still happens. The only diff from before is the use of
> RandomAccessSparseVector instead of SparseVector.
>
> VectorWriter sfWriter;
> Path path = new Path(outFile);
> Configuration conf = new Configuration();
> FileSystem fs = FileSystem.get(conf);
>
> // NPE happens on following line
> SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf,
> path,
> LongWritable.class,
> RandomAccessSparseVector.class);
>
> Stack trace is as follows:
> java.lang.NullPointerException
>   at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
>   at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
>   at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
>   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
>   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
>   at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
>   at com.lilly.tma.lm.cluster.Analyzer.getSeqFileWriter(Analyzer.java:119)
>
> Any help is greatly appreciated.
>
> Thanks
>
> On Thu, Feb 25, 2010 at 2:27 PM, Arshad Khan <[email protected]> wrote:
>
> > Thanks for the help and explanation. :)
> >
> >
> > On Thu, Feb 25, 2010 at 1:20 PM, Jake Mannix <[email protected]> wrote:
> >
> >> And to clarify: you can use either one, but you should think of them
> >> like this:
> >> RandomAccessSparseVector is useful for vectors whose contents change
> >> a great deal (the moving centroids of a clustering algorithm, for
> >> example), and SequentialAccessSparseVector is useful (i.e. faster) in
> >> the case where vectors are built up and then used in an essentially
> >> immutable fashion (you repeatedly compute a lot of dot-products with
> >> them and add multiples of them onto other vectors [either DenseVectors
> >> or RandomAccessSparseVectors]).
> >>
> >> -jake
> >>
> >> On Wed, Feb 24, 2010 at 7:45 PM, Robin Anil <[email protected]> wrote:
> >>
> >> > It is replaced by the two implementations, RandomAccessSparseVector
> >> > and SequentialAccessSparseVector.
> >> >
> >> >
> >> > On Thu, Feb 25, 2010 at 9:10 AM, Arshad Khan <[email protected]> wrote:
> >> >
> >> > > Thanks for the quick reply.
> >> > >
> >> > > I have downloaded the latest 0.3 code. There seem to be significant
> >> > > changes in this version. For example, currently I am using the
> >> > > org.apache.mahout.matrix.SparseVector class, but in 0.3 I cannot
> >> > > find this class.
> >> > >
> >> > > What class is it replaced with?
> >> > >
> >> > > Thanks
> >> > >
> >> > > On Thu, Feb 25, 2010 at 10:12 AM, Ted Dunning <[email protected]> wrote:
> >> > >
> >> > > > There are known problems with that version of k-means.
> >> > > >
> >> > > > Try using the trunk version. 0.3 is very close and we are entering
> >> > > > code freeze for that, so you should be fine with the latest version.
> >> > > >
> >> > > > On Wed, Feb 24, 2010 at 5:46 PM, Arshad Khan <[email protected]> wrote:
> >> > > >
> >> > > > > Hello
> >> > > > >
> >> > > > > I am using the Mahout 0.2 implementation of KMeans in one of my
> >> > > > > Text Mining projects. I apply KMeans with a default K value of 4.
> >> > > > > It seems that every time I repeat the clustering process on the
> >> > > > > same data set, the results are different, and the difference (in
> >> > > > > terms of cluster size and membership) is great from run to run.
> >> > > > > The initial set of centroid points is chosen randomly through
> >> > > > > RandomSeedGenerator. Is there a way to obtain more consistent
> >> > > > > results that do not differ so greatly? Or maybe I am doing
> >> > > > > something wrong?
> >> > > > >
> >> > > > > Any help or idea is very much appreciated.
> >> > > > >
> >> > > > > Thanks and Regards
> >> > > > > Arshad
> >> > > > >
> >> > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > Ted Dunning, CTO
> >> > > > DeepDyve
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>
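Jake's distinction between the two sparse vector types can be sketched roughly like this. A hypothetical example, assuming the Mahout 0.3 org.apache.mahout.math API (including the SequentialAccessSparseVector(Vector) copy constructor):

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorChoice {
  public static void main(String[] args) {
    // Mutable case: a centroid updated in place, as in k-means.
    // RandomAccessSparseVector gives cheap hashed access for the writes.
    Vector centroid = new RandomAccessSparseVector(5);
    centroid.setQuick(1, 2.0);
    centroid.setQuick(4, 1.0);
    centroid.set(1, centroid.get(1) + 0.5);  // cheap random-access update

    // Build-once, read-many case: copy into a SequentialAccessSparseVector
    // and from then on only iterate over it (dot products, scaled adds).
    Vector frozen = new SequentialAccessSparseVector(centroid);
    Vector dense = new DenseVector(new double[] {1, 1, 1, 1, 1});
    double dot = frozen.dot(dense);                  // 2.5 + 1.0 = 3.5
    Vector shifted = dense.plus(frozen.times(2.0));  // add a multiple onto a DenseVector
    System.out.println(dot + " " + shifted.get(1));
  }
}
```

The payoff of the copy is that sequential iteration over the frozen vector's non-zeros is faster than walking a hash map, which matters when the same vector is dotted against many others.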