Hello again,
I have run SSVD on the textual data as follows.
1. Run ssvd:
bin/mahout ssvd -i outputTV/tfidf/tfidf-vectors/part-r-0 -o svdOutput
-k 100 -us true -U false -V false -t 1 -ow -pca true
2. Run kmeans:
bin/mahout kmeans -i svdOutput/USigma/ -c work/kmeans/kmean
Lanczos may be more accurate than SSVD, but if you use a power step or
three, this difference goes away as well.
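To illustrate the point about power steps, here is a minimal NumPy sketch of a randomized SVD with optional power iterations. It is not Mahout's SSVD code; the function names `randomized_svd` and `demo`, the oversampling parameter `p`, and the synthetic test matrix are all assumptions made for illustration. The parameter `q` plays the role of the number of power steps.

```python
import numpy as np

def randomized_svd(A, k, p=10, q=0, seed=0):
    """Rank-k randomized SVD sketch: oversampling p, q power iterations.

    Illustrative only; this is not Mahout's implementation.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Random test matrix with k + p columns.
    omega = rng.standard_normal((n, k + p))
    # Orthonormal basis for the sampled range of A.
    Q, _ = np.linalg.qr(A @ omega)
    # Each power step sharpens the basis toward the leading singular vectors.
    for _ in range(q):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Project A onto the small basis and take an exact SVD there.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

def demo():
    """Compare singular-value error for q = 0, 1, 3 on a synthetic matrix."""
    rng = np.random.default_rng(42)
    m, n = 200, 100
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = 1.0 / np.arange(1, n + 1)      # slowly decaying spectrum (assumed)
    A = (U * s) @ V.T                  # singular values of A are exactly s
    errs = {}
    for q in (0, 1, 3):
        _, s_hat, _ = randomized_svd(A, k=10, p=5, q=q)
        errs[q] = float(np.max(np.abs(s_hat - s[:10])))
    return errs

if __name__ == "__main__":
    for q, err in demo().items():
        print(f"q={q}: max singular value error = {err:.2e}")
```

On a matrix with a slowly decaying spectrum like this one, the error for the leading singular values should shrink as `q` grows, which is the effect described above.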
The best way to select k is actually to pick a value k_max larger than you
expect to need and then pick random vectors instead of singular vectors.
To evaluate how many singular vectors
Note that these instructions actually mean running PCA, not SVD, but that's
probably the intention here. I don't think just running SVD helps.
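As a rough illustration of the PCA-versus-SVD distinction, the NumPy sketch below treats PCA as an SVD of the column-centered matrix and returns U*Sigma as the reduced representation of each row. This is not Mahout code; the function name `pca_scores` and the toy matrix are assumptions for illustration only.

```python
import numpy as np

def pca_scores(X, k):
    """Return the first k PCA scores (U * Sigma) and the k components.

    Illustrative sketch: PCA here is an SVD of the column-centered matrix.
    """
    Xc = X - X.mean(axis=0)                      # subtract per-column means
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]              # scale each column of U by sigma

# Toy "documents x terms" matrix with a large mean offset (made-up data).
rng = np.random.default_rng(0)
X = rng.random((6, 4)) + 5.0
scores, components = pca_scores(X, k=2)

# U * Sigma is the same as projecting the centered rows onto the components.
assert np.allclose(scores, (X - X.mean(axis=0)) @ components.T)
```

The rows of `scores` are what one would feed to a clustering step in this sketch. Without the centering step, the first singular vector mostly captures the mean of the data rather than its variation.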
On Mon, Mar 30, 2015 at 1:04 AM, Suneel Marthi
wrote:
> Here are the steps if you are using Mahout-mrlegacy in the present Mahout
> trunk:
>
> 1. Generate tfi
Lanczos has since been deprecated and will be removed in the upcoming
release, so please desist from using/suggesting Lanczos.
On Mon, Mar 30, 2015 at 3:00 PM, Dmitriy Lyubimov wrote:
> I am not aware of _any_ scenario under which lanczos would be faster (see
> N. Halko's dissertation for compa
I am not aware of _any_ scenario under which lanczos would be faster (see
N. Halko's dissertation for comparisons), although admittedly I did not
study all possible cases.
Having -k=100 is probably enough for anything. I would not recommend
running -q>0 for k>100 as it would become quite slow in
SSVD is just one of many ways to compute a partial SVD. In Mahout you also
have the Lanczos method, which I have found faster and more reliable in some
applications, but most people here seem to prefer SSVD; in fact, I think
Lanczos is (or has been) planned to be deprecated. This may also have
changed
Hello Suneel,
Thanks for the fast reply.
Is SSVD like SVD? Which one is better?
I ran SSVD from Java code on my data, but how do I compute U*Sigma? Can
I do that with Mahout?
Is there an optimal method to determine k?
Another question is how do I relate the SSVD output to the
word dictionary
Here are the steps if you are using Mahout-mrlegacy in the present Mahout trunk:
1. Generate tfidf vectors from the input corpus using seq2sparse (I am
assuming you have done this before and am therefore skipping the details)
2. Run SSVD on the generated tfidf vectors from (1)
./bin/mahout ssvd -i -o
Hello Mahout users,
I'm working on text clustering, and I would like to reduce the number of
features to improve the clustering process.
I would like to use Singular Value Decomposition before the clustering
process. I would be thankful to hear from anyone who has used this before.
Is it a good idea for clustering?
Is there