Re: Mahout KMeans clustering results

Cui tony Thu, 25 Feb 2010 05:23:27 -0800

From the algorithm's point of view, to get more stable clustering results,
I think it's better to use the -t1 and -t2 parameters instead of -k.


What do you think, guys?
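
The -t1 and -t2 options mentioned above appear to refer to the T1 and T2 distance thresholds of canopy clustering, which picks cluster centers deterministically from the data order instead of drawing k random seeds. The following is a minimal plain-Java sketch of that idea, not Mahout's implementation; the class name CanopySketch, the method canopyCenters, and the Euclidean distance are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; not part of Mahout.
class CanopySketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns one center per canopy. Points closer than t1 to a seed point
     * contribute to that canopy's center; points closer than t2 are removed
     * so they cannot seed a canopy of their own. Requires t1 > t2.
     */
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The first remaining point always seeds the next canopy,
            // so the result depends only on the data and its order.
            double[] seed = remaining.remove(0);
            double[] sum = seed.clone();
            int count = 1;
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(seed, p);
                if (d < t1) {
                    for (int i = 0; i < sum.length; i++) {
                        sum[i] += p[i];
                    }
                    count++;
                }
                if (d < t2) {
                    it.remove();
                }
            }
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= count;
            }
            centers.add(sum);
        }
        return centers;
    }
}
```

Because no random numbers are involved, repeated runs over the same input in the same order yield the same centers, and the number of clusters follows from the thresholds rather than from a fixed k.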



2010/2/25 Robin Anil <[email protected]>

> Yeah... now you can't write RandomAccessSparseVectors or any other impls
> directly.
>
> There is a class, VectorWritable, which does the serialization.
>
> Use VectorWritable.class in the sequence file constructor.
>
> For writing:
>
> VectorWritable writable = new VectorWritable();
>
> for (each data) {
>   Vector v = new RandomAccessSparseVector(your data);
>
>   writable.set(v);
>   // SequenceFile.Writer exposes append(key, value), not write()
>   writer.append(key, writable);
> }
>
>
>
>
>
>
> On Thu, Feb 25, 2010 at 2:19 PM, Arshad Khan <[email protected]> wrote:
>
> > This may not be related to Mahout, but I am seeing an NPE when creating a
> > SequenceFile.Writer. The following code used to work fine with Mahout 0.2
> > but now breaks with 0.3. I suspected that some dependencies might be
> > causing that, so I updated the classpath with all deps for 0.3, but the
> > exception still happens. The only difference from before is the use of
> > RandomAccessSparseVector instead of SparseVector.
> >
> >        VectorWriter sfWriter;
> >        Path path = new Path(outFile);
> >        Configuration conf = new Configuration();
> >        FileSystem fs = FileSystem.get(conf);
> >
> >        // NPE happens on following line
> >        SequenceFile.Writer seqWriter = SequenceFile.createWriter(fs, conf, path,
> >                                          LongWritable.class,
> >                                          RandomAccessSparseVector.class);
> >
> > The stack trace is as follows:
> > java.lang.NullPointerException
> >        at org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
> >        at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
> >        at org.apache.hadoop.io.SequenceFile$RecordCompressWriter.<init>(SequenceFile.java:1074)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:397)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:284)
> >        at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:265)
> >        at com.lilly.tma.lm.cluster.Analyzer.getSeqFileWriter(Analyzer.java:119)
> >
> > Any help is greatly appreciated.
> >
> > Thanks
> >
> > On Thu, Feb 25, 2010 at 2:27 PM, Arshad Khan <[email protected]> wrote:
> >
> > > Thanks for the help and explanation. :)
> > >
> > >
> > > On Thu, Feb 25, 2010 at 1:20 PM, Jake Mannix <[email protected]> wrote:
> > >
> > >> And to clarify: you can use either one, but you should think of them
> > >> like this:
> > >> RandomAccessSparseVector is useful for vectors whose contents change
> > >> a great deal (the moving centroids of a clustering algorithm, for
> > >> example), and SequentialAccessSparseVector is useful (i.e. faster) in
> > >> the case where the vectors are built up and then used in an essentially
> > >> immutable fashion (you repeatedly compute a lot of dot-products and add
> > >> multiples of them onto other vectors [either DenseVectors or
> > >> RandomAccessSparseVectors]).
> > >>
> > >>  -jake
> > >>
> > >> On Wed, Feb 24, 2010 at 7:45 PM, Robin Anil <[email protected]> wrote:
> > >>
> > >> > They are replaced by the two impls RandomAccessSparseVector or
> > >> > SequentialAccessSparseVector
> > >> >
> > >> >
> > >> > On Thu, Feb 25, 2010 at 9:10 AM, Arshad Khan <[email protected]> wrote:
> > >> >
> > >> > > Thanks for the quick reply.
> > >> > >
> > >> > > I have downloaded the latest 0.3 code. There seem to be significant
> > >> > > changes in this version. For example, I am currently using the
> > >> > > org.apache.mahout.matrix.SparseVector class, but I cannot find this
> > >> > > class in 0.3.
> > >> > >
> > >> > > What class is it replaced with?
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > On Thu, Feb 25, 2010 at 10:12 AM, Ted Dunning <[email protected]> wrote:
> > >> > >
> > >> > > > There are known problems with that version of k-means.
> > >> > > >
> > >> > > > Try using the trunk version.  0.3 is very close and we are
> > >> > > > entering code freeze for that, so you should be fine with the
> > >> > > > latest version.
> > >> > > >
> > >> > > > On Wed, Feb 24, 2010 at 5:46 PM, Arshad Khan <[email protected]> wrote:
> > >> > > >
> > >> > > > > Hello
> > >> > > > >
> > >> > > > > I am using the Mahout 0.2 implementation of KMeans in one of my
> > >> > > > > text mining projects. I apply KMeans with a default K value of 4.
> > >> > > > > It seems that every time I repeat the clustering process on the
> > >> > > > > same data set, the results are different, and the difference (in
> > >> > > > > terms of cluster size and membership) is great from run to run.
> > >> > > > > The initial set of centroid points is chosen randomly through
> > >> > > > > RandomSeedGenerator. Is there a way to obtain more consistent
> > >> > > > > results that do not differ so greatly? Or maybe I am doing
> > >> > > > > something wrong?
> > >> > > > >
> > >> > > > > Any help or idea is very much appreciated.
> > >> > > > >
> > >> > > > > Thanks and Regards
> > >> > > > > Arshad
> > >> > > > >
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Ted Dunning, CTO
> > >> > > > DeepDyve
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
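
Jake's distinction between the two sparse vector types quoted above can be illustrated with a small plain-Java sketch. This is not Mahout's implementation: the class names RandomAccessSketch and SequentialAccessSketch are hypothetical, and the choice of a hash map versus sorted parallel arrays is an assumption about how such a trade-off is typically realized.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not the Mahout classes.
class SparseVectorSketch {

    /** Hash-backed storage: cheap updates at arbitrary indices. */
    static class RandomAccessSketch {
        private final Map<Integer, Double> values = new HashMap<>();

        void set(int index, double value) {
            if (value == 0.0) {
                values.remove(index); // keep the vector sparse
            } else {
                values.put(index, value);
            }
        }

        double get(int index) {
            return values.getOrDefault(index, 0.0);
        }
    }

    /** Sorted parallel arrays: built once, then scanned in index order. */
    static class SequentialAccessSketch {
        private final int[] indices;   // must be strictly increasing
        private final double[] values;

        SequentialAccessSketch(int[] indices, double[] values) {
            this.indices = indices.clone();
            this.values = values.clone();
        }

        /** Dot product via a linear merge of the two sorted index lists. */
        double dot(SequentialAccessSketch other) {
            double sum = 0.0;
            int i = 0;
            int j = 0;
            while (i < indices.length && j < other.indices.length) {
                if (indices[i] == other.indices[j]) {
                    sum += values[i] * other.values[j];
                    i++;
                    j++;
                } else if (indices[i] < other.indices[j]) {
                    i++;
                } else {
                    j++;
                }
            }
            return sum;
        }

        /** Adds scale * this onto a mutable accumulator vector. */
        void addTo(RandomAccessSketch accumulator, double scale) {
            for (int i = 0; i < indices.length; i++) {
                int index = indices[i];
                accumulator.set(index, accumulator.get(index) + scale * values[i]);
            }
        }
    }
}
```

In this sketch the read-only input vectors use the sequential layout, while the accumulator that changes on every step (a moving centroid, for example) uses the hash-backed layout.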

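On the original question of run-to-run variation: when initial centroids are drawn at random, one common way to make results repeatable is to fix the seed of the random number generator. The sketch below shows this with a tiny plain-Java k-means. It does not use Mahout or RandomSeedGenerator, and whether Mahout exposes a seed setting is not stated in this thread; the class name SeededKMeansSketch and its parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not part of Mahout.
class SeededKMeansSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the cluster index assigned to each point. */
    static int[] cluster(double[][] points, int k, long seed, int maxIterations) {
        int n = points.length;
        int dim = points[0].length;
        Random random = new Random(seed); // same seed, same initial centroids

        // Choose k distinct points as initial centroids (partial Fisher-Yates shuffle).
        int[] order = new int[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++) {
            int j = i + random.nextInt(n - i);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
            centroids[i] = points[order[i]].clone();
        }

        int[] assignment = new int[n];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach each point to its nearest centroid.
            boolean changed = false;
            for (int p = 0; p < n; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centroids[c])
                            < squaredDistance(points[p], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < n; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }
}
```

Calling cluster twice with the same data, k, and seed returns identical assignments; different seeds can still converge to different local optima, which is the instability described in the thread.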