Hello again,

I have run SSVD on the textual data as follows.

1. Run ssvd:
   bin/mahout ssvd -i outputTV/tfidf/tfidf-vectors/part-r-00000 -o svdOutput -k 100 -us true -U false -V false -t 1 -ow -pca true

2. Run kmeans:
   bin/mahout kmeans -i svdOutput/USigma/ -c work/kmeans/kmeans-centroids -cl -o work/kmeans/cluster -k 10 -ow -x 1000 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

3. Dump the clusters:
   bin/mahout clusterdump -d outputTV/dictionary.file-0 -dt sequencefile -i work/kmeans/cluster/clusters-1-final -n 20 -b 100 -o work/kmeans/cDump.txt -p work/kmeans/cluster/clusteredPoints/

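A note on step 3, offered as a hedged illustration rather than a diagnosis: clusterdump is given the original term dictionary via -d, while the vectors clustered in step 2 live in the reduced 100-dimensional space, where each dimension is a latent component and not a term. The toy model below is not Mahout code; `top_terms`, the vocabulary, and the centroid values are hypothetical, and it assumes that term ids follow roughly alphabetical order. Under those assumptions it shows how looking up reduced-space dimension indices in a term dictionary can only ever return the first k dictionary entries.

```python
# Toy model only: this is NOT Mahout code, and every name here is hypothetical.
# It shows what can happen when the dimension indices of a reduced vector are
# looked up in the ORIGINAL term dictionary.

def top_terms(centroid, dictionary, n):
    """Return the dictionary entries for the n highest-weighted dimensions."""
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [dictionary[i] for i in ranked[:n]]

# Assumption: term ids follow alphabetical order, so low ids are "a..." words.
vocabulary = sorted(["abandon", "ability", "able", "about", "above",
                     "market", "price", "stock", "team", "zebra"])
dictionary = dict(enumerate(vocabulary))

# A centroid in the reduced space has only k dimensions (k = 4 here).
# Those dimensions are latent components, not term ids.
k = 4
centroid = [0.9, -0.2, 0.5, 0.1]

# Indices can only range over 0..k-1, so only the first k dictionary
# entries can ever be printed as "top terms".
labels = top_terms(centroid, dictionary, n=3)
print(labels)
```

If this is what is happening, the printed words would be labels for component indices rather than meaningful terms. Whether it applies to the run above is an assumption that would need to be checked against the actual dictionary file.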
Am I right in the above steps? I got bad results. In the clustering output, all words start with the letter "a". Does anyone have an idea why?

Thanks in advance,
Donni

On Mon, Mar 30, 2015 at 11:07 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Lanczos may be more accurate than SSVD, but if you use a power step or
> three, this difference goes away as well.
>
> The best way to select k is actually to pick a value k_max larger than you
> expect to need and then pick random vectors instead of singular vectors.
> To evaluate how many singular vectors you really need, substitute more and
> more of the components of the random vectors with values from the singular
> vectors. It is common that the best k_max will be 100-300 for text
> applications, but it is also common that the best k < k_max is much, much
> smaller.
>
> The reason that this is a better selection method is that a) random word
> vectors actually work pretty well because they maintain approximate
> independence of words and b) after k gets to a certain (pretty darned
> small) size, all the SVD is doing is acting as a very fancy and slow random
> number generator.
>
> On Mon, Mar 30, 2015 at 12:00 PM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > I am not aware of _any_ scenario under which Lanczos would be faster (see
> > N. Halko's dissertation for comparisons), although admittedly I did not
> > study all possible cases.
> >
> > Having -k=100 is probably enough for anything. I would not recommend
> > running -q>0 for k>100 as it would become quite slow in the power
> > iteration step.
> >
> > To your other questions, e.g. U*sigma result output, see the "overview and
> > usage" link given here:
> > http://mahout.apache.org/users/dim-reduction/ssvd.html
> >
> > On Mon, Mar 30, 2015 at 2:19 AM, Donni Khan <prince.don...@googlemail.com>
> > wrote:
> >
> > > Hello Suneel,
> > > Thanks for the fast reply.
> > > Is SSVD like SVD? Which one is better?
> > > I ran SSVD from Java code on my data, but how do I compute U*Sigma? Can
> > > I do that with Mahout?
> > > Is there an optimal method to determine k?
> > >
> > > Another question: how do I relate the SSVD output to the word
> > > dictionary (the real words)?
> > >
> > > Thank you
> > > Donni
> > >
> > > On Mon, Mar 30, 2015 at 10:04 AM, Suneel Marthi <suneel.mar...@gmail.com>
> > > wrote:
> > >
> > > > Here are the steps if u r using Mahout-mrlegacy in the present Mahout
> > > > trunk:
> > > >
> > > > 1. Generate tfidf vectors from the input corpus using seq2sparse (I am
> > > > assuming you had done this before and hence avoiding the details)
> > > >
> > > > 2. Run SSVD on the generated tfidf vectors from (1)
> > > >
> > > > ./bin/mahout ssvd -i <tfidf vectors> -o <svd output> -k 80 -pca true -us true -U false -V false
> > > >
> > > > k = no. of reduced basis vectors
> > > >
> > > > You would need the U*Sigma output of the PCA flow for the next
> > > > clustering step
> > > >
> > > > 3. Run KMeans (or any other clustering algo) with the U*Sigma from (2) as
> > > > input.
> > > >
> > > > On Mon, Mar 30, 2015 at 3:39 AM, Donni Khan <prince.don...@googlemail.com>
> > > > wrote:
> > > >
> > > > > Hello Mahout users,
> > > > >
> > > > > I'm working on text clustering, and I would like to reduce the features
> > > > > to enhance the clustering process.
> > > > > I would like to use Singular Value Decomposition before the clustering
> > > > > process. I would be thankful to hear from anyone who has used this
> > > > > before. Is it a good idea for clustering?
> > > > > Is there any other method in Mahout to reduce the text features before
> > > > > clustering?
> > > > > Does anyone have an idea how I can apply SVD using Java code?
> > > > >
> > > > > Thanks in advance,
> > > > > Donni
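
On the U*Sigma question raised in this thread: the replies point to the -us true option and the SSVD documentation for producing it with Mahout. As a minimal sketch of the arithmetic only, and assuming U is already available as rows of numbers and Sigma as a list of singular values, multiplying by the diagonal matrix Sigma just scales column j of U by the j-th singular value. The function name `u_times_sigma` and the sample values below are hypothetical and are not part of Mahout.

```python
# Sketch of the arithmetic behind U*Sigma; not Mahout code.
# Sigma is diagonal, so U*Sigma scales column j of U by sigma[j].

def u_times_sigma(U, sigma):
    """Scale each column of U (a list of rows) by the matching singular value."""
    return [[u_ij * s_j for u_ij, s_j in zip(row, sigma)] for row in U]

# Illustrative values: two documents (rows), two retained components (columns).
U = [[0.6, 0.8],
     [0.8, -0.6]]
sigma = [5.0, 2.0]

# Each row of the result is one document in the reduced space; rows like
# these are what a clustering step would take as input.
reduced = u_times_sigma(U, sigma)
print(reduced)
```

Each row of the result is one document expressed in k coordinates, which is why the dimensions of these vectors no longer correspond to individual dictionary words.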