Rajesh, I am working off of trunk and this works fine.
As Dmitriy says, you do need USigma. It would help to paste the entire stack trace you are seeing with MatrixColumnMeansJob. If you are still seeing an issue, I would suggest that you work off of trunk.

________________________________
From: Dmitriy Lyubimov <dlie...@gmail.com>
To: user@mahout.apache.org
Sent: Friday, May 24, 2013 9:52 AM
Subject: Re: Fwd: Re: convert input for SVD

I think the last time I verified this flow was as of https://issues.apache.org/jira/browse/MAHOUT-1097. It was working then. I have not looked at it since.

On May 24, 2013 6:42 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:

> Rajesh, you will get more help if you stay on the list.
>
> You do need the U*Sigma output. There is no substitute.
>
> If this option is indeed no longer there, I have no knowledge of it. Maybe some work was committed that broke it, but at the moment I have no time to look at it. Obviously it was there at the time the documentation was written. I guess you may obtain an earlier snapshot as an interim solution if that is indeed the case.
>
> ---------- Forwarded message ----------
> From: "Rajesh Nikam" <rajeshni...@gmail.com>
> Date: May 24, 2013 3:20 AM
> Subject: Re: convert input for SVD
> To: <user@mahout.apache.org>
> Cc:
>
> > Hello Dmitriy,
> >
> > Thanks for the reply.
> >
> > I see a similar discussion at the following link, where I see your reply.
> >
> > http://www.searchworkings.org/forum/-/message_boards/view_message/517870#_19_message_519704
> >
> > I also have the same problem: I need to apply dimensionality reduction and use a clustering algorithm on the reduced features.
> >
> > It seems the parameters for ssvd have changed from those mentioned in SSVD-CLI.pdf.
> > It no longer shows *-us* as a parameter.
> >
> > I am using mahout-examples-0.7-job.jar
> >
> > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -pca true -U true -V false *-us true* -ow -q 1
> >
> > Giving the option "*-pca true*" produces this error:
> >
> > at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> > at org.apache.mahout.math.hadoop.MatrixColumnMeansJob.run(MatrixColumnMeansJob.java:55)
> >
> > So I removed it.
> >
> > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -U true -V false *-us true* -ow -q 1
> >
> > With the above command I get: "Unexpected -us while processing Job-Specific Options."
> >
> > I tried with "-U false -V false -uhs true"; it just generated the sigma file as expected, but no "Usigma".
> >
> > hadoop fs -lsr /user/hadoop/t/PE_EXE/input-set-svd/
> >
> > -rw-r--r-- 2 hadoop supergroup 1712 2013-05-24 15:34 /user/hadoop/t/PE_EXE/input-set-svd/sigma
> >
> > Then with *"-U true -V false -uhs true"* the output dir U is created.
> >
> > drwxr-xr-x - hadoop supergroup 0 2013-05-24 15:39 /user/hadoop/t/PE_EXE/input-set-svd/U
> > -rw-r--r-- 2 hadoop supergroup 1712 2013-05-24 15:39 /user/hadoop/t/PE_EXE/input-set-svd/sigma
> >
> > My problem is how to use these U/V/sigma files as input to canopy/kmeans?
> >
> > How do I identify which important features from U/Sigma are retained in the dimensionality reduction?
> >
> > Thanks in advance!
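On the question of how U and sigma relate to the U*Sigma output recommended elsewhere in this thread: mathematically, U*Sigma is just U with column j multiplied by the j-th singular value. The sketch below is only an illustration of that arithmetic in plain Python. The matrix U, the vector sigma, and the helper names are invented toy values, not Mahout output or Mahout API.

```python
import math

def scale_columns(u_rows, sigma):
    """Return U * diag(sigma): column j of U multiplied by sigma[j]."""
    return [[u * s for u, s in zip(row, sigma)] for row in u_rows]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy values: 4 rows, k = 2 retained components.
# Both columns of U have norm 1 and are orthogonal to each other.
U = [
    [0.5, 0.5],
    [0.5, -0.5],
    [-0.5, 0.5],
    [-0.5, -0.5],
]
sigma = [10.0, 0.1]  # one strong component, one weak component

US = scale_columns(U, sigma)

# In U alone, row 0 is equally far from row 1 and row 2.
print(euclidean(U[0], U[1]), euclidean(U[0], U[2]))
# In U*Sigma, the difference along the weak component almost vanishes,
# while the difference along the strong component dominates.
print(euclidean(US[0], US[1]), euclidean(US[0], US[2]))
```

In this toy case the rows of U look equally far apart, while the rows of U*Sigma differ by roughly 0.1 along the weak component and roughly 10 along the strong one. That weighting by singular value is why a distance-based clusterer sees different geometry on U than on U*Sigma.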
> > Rajesh
> >
> > On Fri, May 24, 2013 at 7:01 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > > https://cwiki.apache.org/confluence/download/attachments/27832158/SSVD-CLI.pdf?version=17&modificationDate=1349999085000 :
> > >
> > > "In most cases where you might be looking to reduce dimensionality while retaining variance, you probably need combination of options -pca true -U false -V false -us true.
> > >
> > > See §3 for details"
> > >
> > > On Thu, May 23, 2013 at 6:24 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > >
> > > > Also, for the dimensionality reduction it is important, among other things, to re-center your input first, which is why you also want "-pca true".
> > > >
> > > > On Thu, May 23, 2013 at 6:23 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >
> > > >> Did you specify the -us option? SSVD by default produces only U, V and Sigma, but it can produce more, e.g. U*Sigma, U*sqrt(Sigma) etc., if you ask for it. And, alternatively, you can suppress any of U, V (you can't suppress sigma, but that doesn't cost anything in space anyway).
> > > >>
> > > >> On Thu, May 23, 2013 at 6:20 PM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > > >>
> > > >>> I got all three U, V & sigma from ssvd; however, which should I use as input to canopy?
> > > >>> On May 24, 2013 6:47 AM, "Dmitriy Lyubimov" <dlie...@gmail.com> wrote:
> > > >>>
> > > >>> > I think you want U*Sigma
> > > >>> >
> > > >>> > What you want is ssvd ... -pca true ... -us true ... see the manual
> > > >>> >
> > > >>> > On Thu, May 23, 2013 at 6:07 PM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > > >>> >
> > > >>> > > Sorry for the confusion. Here the number of clusters is decided by canopy. The data as it is has 60 to 70 clusters.
> > > >>> > >
> > > >>> > > My question is: which part of the ssvd output (U, V, Sigma) should be used as input to canopy?
> > > >>> > > On May 24, 2013 3:56 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
> > > >>> > >
> > > >>> > > > Rajesh,
> > > >>> > > >
> > > >>> > > > This is very confusing.
> > > >>> > > >
> > > >>> > > > You have 1500 things that you are clustering into more than 1400 clusters.
> > > >>> > > >
> > > >>> > > > There is no way for most of these clusters to have >1 member, just because there aren't enough items compared to the clusters.
> > > >>> > > >
> > > >>> > > > Is there a typo here?
> > > >>> > > >
> > > >>> > > > On Thu, May 23, 2013 at 5:34 AM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > > >>> > > >
> > > >>> > > > > Hi,
> > > >>> > > > >
> > > >>> > > > > I have an input test set of 1500 instances with 1000+ features. I want to do SVD to reduce features.
> > > >>> > > > > I followed the steps below, which generate 1400+ clusters; 99% of the clusters contain 1 instance :(
> > > >>> > > > >
> > > >>> > > > > Please let me know what is wrong in the steps below -
> > > >>> > > > >
> > > >>> > > > > mahout arff.vector --input /mnt/cluster/t/input-set.arff --output /user/hadoop/t/input-set-vector/ --dictOut /mnt/cluster/t/input-set-dict
> > > >>> > > > >
> > > >>> > > > > mahout ssvd --input /user/hadoop/t/input-set-vector/ --output /user/hadoop/t/input-set-svd/ -k 200 --reduceTasks 2 -ow
> > > >>> > > > >
> > > >>> > > > > mahout canopy -i */user/hadoop/t/input-set-svd/U* -o /user/hadoop/t/input-set-canopy-centroids -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure *-t1 0.001 -t2 0.002*
> > > >>> > > > >
> > > >>> > > > > mahout kmeans -i */user/hadoop/t/input-set-svd/U* -c /user/hadoop/t/input-set-canopy-centroids/clusters-0-final -cl -o /user/hadoop/t/input-set-kmeans-clusters -ow -x 10 -dm org.apache.mahout.common.distance.TanimotoDistanceMeasure
> > > >>> > > > >
> > > >>> > > > > mahout clusterdump -dt sequencefile -i /user/hadoop/t/input-set-kmeans-clusters/clusters-1-final/ -n 20 -b 100 -o /mnt/cluster/t/cdump-input-set.txt -p /user/hadoop/t/input-set-kmeans-clusters/clusteredPoints/ --evaluate
> > > >>> > > > >
> > > >>> > > > > Thanks in advance!
> > > >>> > > > >
> > > >>> > > > > Rajesh
> > > >>> > > > >
> > > >>> > > > > On Wed, May 22, 2013 at 2:18 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >>> > > > >
> > > >>> > > > > > PPS As far as the tool for arff, I am frankly not sure, but it sounds like you've already solved this.
> > > >>> > > > > >
> > > >>> > > > > > On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >>> > > > > >
> > > >>> > > > > > > PS As far as U, V data being "close to zero", yes, that's what you'd expect.
> > > >>> > > > > > >
> > > >>> > > > > > > Here, "close to zero" still means much bigger than a rounding error, of course. E.g. 1E-12 is indeed a small number, and 1E-16 to 1E-18 would indeed be "close to zero" for the purposes of singularity. 1E-2..1E-5 are actually quite "sizeable" numbers by the scale of IEEE 754 arithmetic.
> > > >>> > > > > > >
> > > >>> > > > > > > U and V are orthonormal (which means their column vectors have a Euclidean norm of 1). Note that for large m and n (large inputs) they are also extremely skinny. The larger the input is, the smaller the elements of U and/or V are going to be.
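To put a rough number on the point above: a column with Euclidean norm 1 and m entries has a root-mean-square entry of exactly 1/sqrt(m), because the squared entries sum to 1. The sketch below only does that arithmetic. It assumes m = 1500, the number of instances mentioned in this thread, and the function name is made up for illustration.

```python
import math

def rms_entry_of_unit_column(m):
    """RMS size of the entries of any length-m vector with Euclidean norm 1."""
    return 1.0 / math.sqrt(m)

m = 1500  # assumption: the 1500-instance input mentioned in this thread
rms = rms_entry_of_unit_column(m)
print(f"m = {m}: RMS entry of a unit-norm column is about {rms:.4f}")

# Sanity check: a column whose entries all equal rms has norm 1.
column = [rms] * m
norm = math.sqrt(sum(x * x for x in column))
print(f"norm of that column: {norm:.6f}")
```

For m = 1500 this gives an RMS entry of about 0.026, so entries of U in the 1E-2 to 1E-4 range are ordinary values for an orthonormal column of that length, not numerical noise.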
> > > >>> > > > > > >
> > > >>> > > > > > > On Tue, May 21, 2013 at 8:48 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> > > >>> > > > > > >
> > > >>> > > > > > >> Sounds like dimensionality reduction to me. You may want to use ssvd -pca
> > > >>> > > > > > >>
> > > >>> > > > > > >> Apologies for brevity. Sent from my Android phone.
> > > >>> > > > > > >> -Dmitriy
> > > >>> > > > > > >> On May 21, 2013 6:27 AM, "Rajesh Nikam" <rajeshni...@gmail.com> wrote:
> > > >>> > > > > > >>
> > > >>> > > > > > >>> Hello Ted,
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> Thanks for the reply.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> I started exploring SVD because it was mentioned that it could help to drop features which are not relevant for clustering.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> My objective is to reduce the number of features before passing them to clustering and just keep the important features.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> arff/csv ==> ssvd (for dimensionality reduction) ==> clustering
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> Could you please illustrate the mahout props needed to join the above pipeline?
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> I think Lanczos SVD needs to be used for an mxm matrix.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> I have tried ssvd: I used arff.vector to convert arff/csv to a vector file, which is then given as input to ssvd, and then dumped U, V and sigma using vectordump.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> I see most of the values dumped are near 0. I don't understand whether this is correct or not.
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> {0:0.01066724825049657,1:0.016715498597386844,2:2.0187750952311708E-4,3:3.401020567221039E-4,4:-1.2388403347280688E-4,5:6.41502463540719E-5,6:-1.359187582538833E-4,7:6.329813140445419E-5,8:1.670015585746444E-4,9:3.5415113034592744E-4,10:7.108868213280763E-4,11:0.020553517552052456,12:-0.015118680942548916,13:0.007981746711271956,14:-0.003251236468768259,15:0.0038075014396303053,16:-0.0010925318534013683,17:-0.0026943024876179833,18:-0.001744794617721648,19:-0.0024528466548735714}
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> {0:0.029978614322360833,1:-0.01431521245087889,2:1.3318592088199427E-4,3:1.495356283071516E-4,4:8.762709213918985E-5,5:1.2765191352425177E-
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> Thanks,
> > > >>> > > > > > >>> Rajesh
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> On Tue, May 21, 2013 at 11:35 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> > > >>> > > > > > >>>
> > > >>> > > > > > >>> > Are you using Lanczos instead of SSVD for a reason?
> > > >>> > > > > > >>> >
> > > >>> > > > > > >>> > On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam <rajeshni...@gmail.com> wrote:
> > > >>> > > > > > >>> >
> > > >>> > > > > > >>> > > Hello,
> > > >>> > > > > > >>> > >
> > > >>> > > > > > >>> > > I have an arff / csv file containing input data that I want to pass to svd : Lanczos Singular Value Decomposition.
> > > >>> > > > > > >>> > >
> > > >>> > > > > > >>> > > Which tool should I use to convert it to the required format?
> > > >>> > > > > > >>> > >
> > > >>> > > > > > >>> > > Thanks in advance!
> > > >>> > > > > > >>> > >
> > > >>> > > > > > >>> > > Thanks,
> > > >>> > > > > > >>> > > Rajesh
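One more note on the canopy step quoted in this thread (-t1 0.001 -t2 0.002 with TanimotoDistanceMeasure), since it may explain the 1400+ clusters. Canopy clustering is usually described with T1 > T2, where T2 is the radius inside which points are absorbed by an existing canopy. The sketch below is a simplified greedy canopy pass in plain Python, not Mahout's implementation. The data is random toy data, the function names are made up, and T1 is left out because it does not affect how many canopies are created in this simplified form.

```python
import random

def tanimoto_distance(a, b):
    """1 - a.b / (|a|^2 + |b|^2 - a.b), the usual Tanimoto distance."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom if denom else 0.0

def count_canopies(points, t2, distance):
    """Greedy pass: take the next point as a centre, then drop every
    remaining point closer than t2 to it. Returns the number of centres."""
    remaining = list(points)
    centres = 0
    while remaining:
        centre = remaining.pop(0)
        centres += 1
        remaining = [p for p in remaining if distance(centre, p) >= t2]
    return centres

random.seed(0)
# Hypothetical toy data: 200 random points in 5 dimensions.
points = [[random.uniform(0.1, 1.0) for _ in range(5)] for _ in range(200)]

tight = count_canopies(points, 0.002, tanimoto_distance)
loose = count_canopies(points, 0.2, tanimoto_distance)
print(f"t2 = 0.002 -> {tight} canopies, t2 = 0.2 -> {loose} canopies")
```

With a T2 this small almost no point falls inside an existing canopy, so nearly every point becomes its own centre. On real data it may be worth checking the typical pairwise distance first and choosing T1 and T2 relative to that.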