Re: Text clustering with SVD

Donni Khan Tue, 31 Mar 2015 04:21:55 -0700

Hello again,

I have run SSVD on the text data as follows.
1. Run ssvd:
bin/mahout ssvd -i outputTV/tfidf/tfidf-vectors/part-r-00000 -o svdOutput \
  -k 100 -us true -U false -V false -t 1 -ow -pca true
2. Run kmeans:
bin/mahout kmeans -i svdOutput/USigma/ -c work/kmeans/kmeans-centroids \
  -cl -o work/kmeans/cluster -k 10 -ow -x 1000 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure
3. Dumping:
bin/mahout clusterdump -d outputTV/dictionary.file-0 -dt sequencefile \
  -i work/kmeans/cluster/clusters-1-final -n 20 -b 100 -o work/kmeans/cDump.txt \
  -p work/kmeans/cluster/clusteredPoints/
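
For reference, a minimal sketch of what the USigma input used in step 2 represents, assuming the usual SVD notation A ~ U * Sigma * V': each row of U (one per document) has its j-th component scaled by the j-th singular value. The numbers and the helper name u_sigma are made up for illustration; this is not Mahout code.

```python
def u_sigma(u_rows, sigma):
    """Scale column j of U by sigma[j]; returns the rows of U * diag(sigma)."""
    return [[u_ij * s_j for u_ij, s_j in zip(row, sigma)] for row in u_rows]


# Hypothetical 2-document, k=2 example (values are illustrative only).
U = [[0.5, 0.5], [0.5, -0.5]]
sigma = [4.0, 2.0]

rows = u_sigma(U, sigma)
print(rows)
```

Each resulting row is a k-dimensional document vector, which is the kind of input a clustering step would consume.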


Am I right in the above steps?

I got bad results. In the clustering output, all the words start with the
letter "a" ("a*"). Does anyone have an idea why?

Thanks in advance,
Donni
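
One possible explanation for the "a" terms, offered as an assumption rather than a confirmed diagnosis: after SSVD, each clustered vector has k = 100 latent components instead of one component per dictionary term. Labelling component i with dictionary entry i would then only ever show the first 100 dictionary entries, and if the dictionary ids follow sorted term order, those entries all sit at the start of the alphabet. The sketch below mimics that mismatch in plain Python. The terms, the centroid values, and the helper label_top_components are hypothetical and do not reproduce clusterdump's actual code.

```python
def label_top_components(centroid, dictionary, n):
    """Return the n largest-magnitude components, labelled by dictionary index."""
    ranked = sorted(range(len(centroid)), key=lambda i: abs(centroid[i]), reverse=True)
    return [(dictionary[i], centroid[i]) for i in ranked[:n]]


# Hypothetical dictionary: term ids assigned in sorted term order (an assumption).
terms = sorted(["zebra", "abandon", "ability", "able", "market", "about", "stock"])
dictionary = dict(enumerate(terms))

# A centroid in the reduced space has k components (k=3 here), not len(terms),
# so its indexes can only ever hit the first k dictionary entries.
reduced_centroid = [0.9, -0.4, 0.7]

labels = label_top_components(reduced_centroid, dictionary, n=3)
print(labels)
```

If this is what is happening, the reduced centroids cannot be read against the term dictionary directly, because their components are latent dimensions rather than words.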

On Mon, Mar 30, 2015 at 11:07 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Lanczos may be more accurate than SSVD, but if you use a power step or
> three, this difference goes away as well.
>
> The best way to select k is actually to pick a value k_max larger than you
> expect to need and then pick random vectors instead of singular vectors.
> To evaluate how many singular vectors you really need, substitute more and
> more of the components of the random vectors with values from the singular
> vectors.  It is common that the best k_max will be 100-300 for text
> applications, but it is also common that the best k < k_max is much, much
> smaller.
>
> This is a better selection method because a) random word
> vectors actually work pretty well because they maintain approximate
> independence of words and b) after k gets to a certain (pretty darned
> small) size, all the SVD is doing is acting as a very fancy and slow random
> number generator.
>
>
>
> On Mon, Mar 30, 2015 at 12:00 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > I am not aware of _any_ scenario under which Lanczos would be faster (see
> > N. Halko's dissertation for comparisons), although admittedly I did not
> > study all possible cases.
> >
> > Having -k=100 is probably enough for anything.  I would not recommend
> > running -q>0 for k>100, as it would become quite slow in the power
> > iterations step.
> >
> > As for your other questions, e.g. the U*sigma result output, see the
> > "overview and usage" link given here:
> > http://mahout.apache.org/users/dim-reduction/ssvd.html
> >
> > On Mon, Mar 30, 2015 at 2:19 AM, Donni Khan <prince.don...@googlemail.com>
> > wrote:
> >
> > > Hello Suneel,
> > > Thanks for the fast reply.
> > > Is SSVD like SVD? Which one is better?
> > > I ran SSVD from Java code on my data, but how do I compute U*Sigma?
> > > Can I do that with Mahout?
> > > Is there an optimal method to determine k?
> > >
> > > Another question: how do I relate the SSVD output to the
> > > word dictionary (the real words)?
> > >
> > > Thank you
> > > Donni
> > >
> > > On Mon, Mar 30, 2015 at 10:04 AM, Suneel Marthi <suneel.mar...@gmail.com>
> > > wrote:
> > >
> > > > Here are the steps if you are using Mahout-mrlegacy in the present Mahout
> > > > trunk:
> > > >
> > > > 1. Generate tfidf vectors from the input corpus using seq2sparse (I am
> > > > assuming you have done this before, hence skipping the details)
> > > >
> > > > 2. Run SSVD on the generated tfidf vectors from (1)
> > > >
> > > >       ./bin/mahout ssvd -i <tfidf vectors> -o <svd output> -k 80 -pca true -us true -U false -V false
> > > >
> > > >      k = no. of reduced basis vectors
> > > >
> > > >     You would need the U*Sigma output of the PCA flow for the next
> > > > clustering step
> > > >
> > > > 3. Run KMeans (or any other clustering algo) with the U*Sigma from (2)
> > > > as input.
> > > >
> > > >
> > > > On Mon, Mar 30, 2015 at 3:39 AM, Donni Khan <prince.don...@googlemail.com>
> > > > wrote:
> > > >
> > > > > Hello Mahout users,
> > > > >
> > > > > I'm working on text clustering, and I would like to reduce the
> > > > > features to improve the clustering process.
> > > > > I would like to use Singular Value Decomposition before the
> > > > > clustering process. I would be thankful to hear from anyone who has
> > > > > used this before. Is it a good idea for clustering?
> > > > > Is there any other method in Mahout to reduce the text features
> > > > > before clustering?
> > > > > Does anyone have an idea how I can apply SVD using Java code?
> > > > >
> > > > > Thanks in advance,
> > > > > Donni
> > > > >
> > > >
> > >
> >
>
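
A rough sketch of the k-selection idea Ted describes above, under one reading of it: keep k_max-dimensional vectors, take the first j components from the SVD-based representation and the remaining k_max - j from a random vector, and watch how a downstream quality score changes as j grows. The input values, the seed, and the helper mix_components are hypothetical, and no scoring function is shown because the thread does not specify one.

```python
import random


def mix_components(svd_row, random_row, j):
    """First j components from the SVD vector, the rest from the random vector."""
    if len(svd_row) != len(random_row):
        raise ValueError("both vectors must have k_max components")
    return list(svd_row[:j]) + list(random_row[j:])


k_max = 6
rng = random.Random(42)

# Hypothetical inputs: one k_max-dimensional SVD-based vector and a random
# vector of the same length (values are illustrative only).
svd_row = [3.2, -1.5, 0.8, 0.1, 0.05, 0.01]
random_row = [rng.gauss(0.0, 1.0) for _ in range(k_max)]

# Sweep j from 0 (all random) to k_max (all SVD). In practice each mixed
# representation would be evaluated with some clustering quality measure.
for j in range(k_max + 1):
    mixed = mix_components(svd_row, random_row, j)
    print(j, mixed)
```

The smallest j beyond which the chosen score stops improving would be the k to keep.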
