Sorry Scott, I should have looked at this more closely. I apologize.

1. You are running seqdumper on the matrix, which is generated by the rowid job and is not the output of the rowsimilarity job.
The rowid job generates an M x N matrix, where M is the number of documents and N is the number of terms in the dictionary. The value of a cell in the matrix is the tf-idf weight of that term in that document. So the following output:

{Code}
Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
{Code}

means that for document 2, what follows are term:tf-idf weight pairs. The numbers before each colon are term indices, not document keys. To see the term corresponding to 41625, look up that index in dictionary.file-0. Hope that clears up the confusion here.

2. In order to see the most similar documents for a given document, you should be looking at a seqdumper of the output from rowsimilarity, which in your case would be the output in reuters-similarity. That should give the 10 most similar documents and their cosine similarities to the referenced document.

There's an error in the wiki instructions: the seqdumper should have been run on rowsimilarity/part-r-* and not on matrix/matrix for determining similar documents.

Hope this helps. Sorry again for the confusion.

On Friday, December 20, 2013 4:51 PM, Scott C. Cote <scottcc...@gmail.com> wrote:

Suneel and others,

I am still getting the strange results when I do the tour.

Suneel: I manually wiped out the temp folder and also deleted the reuters-XXX folders. Also, per your advice, I added the -ow option to all of the commands.
NOTE: The step to create a matrix would NOT take a -ow option.

I have tried again, and am still seeing references to documents that do not exist.

The tail end of reuters-matrix/docindex looks like (mahout seqdumper -i reuters-matrix/matrix | tail):

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

And the following snippet exists inside reuters-matrix/matrix and references key 41625 (which is larger than any key in docindex).

Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

--->>>>> So in this email, I have listed the following pieces of information:
1. Commands,
2. Env vars,
3. Sw version info

Again, thank you in advance for your help.

Scott

INFO Below:

1. sequence of commands with relevant logged output points (omitted the sequence dump commands):

mv reuters xreuters
rm -r temp
rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
#
# added the -cd option per instructions in the Mahout In Action (MiA) so the convergence threshold is .1 (originally this was default value but no effect on the unexpected results)
# instead of default value of .5 because cosines lie within 0 and 1.
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p reuters-kmeans-clusters/clusteredPoints/
mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o reuters-matrix
#
# the prior step had 21578 rows and 41807 columns
# 41807 came from the prior step columns output
# 10 most similar docs to each doc in the collection
#
mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess

2. env vars are as follows:

MAHOUT_LOCAL=yes
TERM_PROGRAM=Apple_Terminal
MAHOUT_HOME=/Users/scottccote/mahout
TERM=xterm-256color
SHELL=/bin/bash
TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
HADOOP_PREFIX=/Users/scottccote/hadoop
TERM_PROGRAM_VERSION=309
TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
USER=scottccote
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0:0
Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
PWD=/Users/scottccote/Documents/toy-workspace/MiA
HADOOP_VERSION=1.1.2
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
EDITOR=/usr/bin/vi
HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
LANG=en_US.UTF-8
HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk

3. Software/OS Version Info:

version of mahout is (property of pom.xml in mahout home): 0.8
version of java (java -version): java version "1.6.0_65", Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609), Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)
Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64

On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>I don't see a need for uploading your commands. Clean up HDFS (both output
>and temp folders) and try running the 5 steps again - extract reuters,
>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>
>Please use the '-ow' option while running each of the jobs.
>
>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
>
>I manually deleted the temp folder too (after 2 failed starts).
>
>Would it be helpful for me to upload my shells that encapsulate all of the
>commands posted on the tour? They reflect the current state of reuters
>and .8 mahout.
>And if I did - how would I do it?
>
>Thanks,
>
>SCott
>
>
>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>Yep, that's what has happened in your case. The wiki doesn't have it, but
>>please specify the -ow (overwrite) option while running the
>>RowSimilarityJob. That should clear up both the output and temp folders
>>before running the job.
>>
>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>><suneel_mar...@yahoo.com> wrote:
>>
>>Haha... that could explain it. RowSimilarityJob creates temp files during
>>execution. If your laptop 'sleeped' then the temp files still persist and
>>running the job again wouldn't overwrite the old temp files (I need to
>>verify that).
>>
>>It should be good enough to run the RowSimilarity job again.
>>
>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>>
>>Suneel,
>>
>>I'm going to do the similarity part of the tour over - my laptop was
>>"sleeped" in the middle of the run of the rowsimilarity job.
>>Maybe the job is sensitive to that ... :( Normally - a server would not
>>go to sleep nor would it run in local mode.
>>
>>Sorry that I didn't think of that sooner.
>>Will let you know my outcome.
>>
>>Am planning on redoing by deleting the contents and the folder titled
>>"reuters-similarity"
>>
>>Please let me know if that is not good enough.
>>
>>Thanks again.
>>
>>SCott
>>
>>
>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>What you are seeing is the output matrix of the RowSimilarity job. You
>>>are right, there should be 21578 documents only in the reuters corpus.
>>>
>>>a) How many documents do you have in your docIndex? DocIndex is one of
>>>the artifacts of the RowIDJob, which should have been executed prior to
>>>the RowSimilarity job. You can run seqdumper on docIndex to see the output.
>>>
>>>b) Also, what was the message at the end of the RowId job? It should read
>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>reuters-matrix/matrix'.
>>>
>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>>
>>>All,
>>>
>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>analysis using the Mahout command line". Thank you to whoever contributed
>>>to that page.
>>>
>>>>https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>
>>>Went all the way from beginning to end of the page with "seemingly" no
>>>hiccups.
>>>At the very end of the "tour", I became confused because the command:
>>>
>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>
>>>Allowed me to see output (snippet)
>>>
>>>> Key: 1: Value:
>>>> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>
>>>Reading through that snippet of data made me think that there exists a
>>>document with rowid 41154 with cosine value of ~0.0658 (the last element
>>>in the snippet).
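
As the reply at the top of the thread explains, the number before the first colon is the row (document) key, while the numbers inside the braces are column (term) indices. A minimal shell sketch of that reading follows, assuming each dumped row has the form `Key: <row>: Value: <name>:{<index>:<weight>,...}` as in the snippet above. The sample line is shortened to three entries of that row, and the variable names are illustrative only.

```shell
#!/bin/sh
# Shortened sample of the dumped row for key 1 (first two entries and the last one).
line='Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,41154:0.06582064373931332,}'

# Row key: the number between "Key: " and the next colon (identifies the document).
row_key=$(printf '%s\n' "$line" | sed 's/^Key: \([0-9]*\):.*/\1/')

# Column indices: the integer before each colon inside the braces (term indices).
indices=$(printf '%s\n' "$line" | sed 's/.*{\(.*\)}.*/\1/' | tr ',' '\n' | cut -d: -f1 | grep -v '^$')

# Largest column index that appears in this row.
max_index=$(printf '%s\n' "$indices" | sort -n | tail -n 1)

echo "row key: $row_key"
echo "largest column index: $max_index"
```

The largest index can exceed the document count (21578) because it is bounded by the number of columns reported by the rowid job (41807 in this run), not by the number of rows.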
>>>
>>>The problem is that the folder
>>>
>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>
>>>Only has 21578 files in it. Indeed, my dictionary file (output command
>>>used shown below)
>>>
>>>> mahout seqdumper -i reuters-matrix/docIndex | tail
>>>
>>>Has a max key of
>>>
>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>> Count: 21578
>>>
>>>So I cannot find the document with key value 41154. What does the 41154
>>>relate to?
>>>
>>>Obviously I have misunderstood something that I did or need to do in the
>>>tour. Can someone please shine a light on where I strayed? I have scripted
>>>every step that I took and can share them here if desired (I noticed that
>>>some of the output file names changed since the page was written so I made
>>>adjustments).
>>>
>>>Regards,
>>>
>>>SCott
>>>
>>>PS Thanks TD for helping me earlier
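
Per the reply at the top of the thread, 41154 is a term index, so it is resolved through dictionary.file-0 rather than docIndex. The sketch below shows one way such a lookup might be done. It assumes that a seqdumper of the dictionary prints lines of the form `Key: <term>: Value: <index>` and that terms are single tokens; both are assumptions, not confirmed in this thread. The sample entries are placeholders, not real dictionary contents, and `lookup_term` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical stand-in for the output of
#   mahout seqdumper -i reuters-vectors/dictionary.file-0
# The terms below are placeholders, not real entries from the Reuters dictionary.
dict_dump='Key: term_a: Value: 312
Key: term_b: Value: 2962
Key: term_c: Value: 41154'

# Print the term whose index equals $1, reading dump lines on stdin.
# Assumes single-token terms, so the term is always the second field.
lookup_term() {
  awk -v idx="$1" '$1 == "Key:" && $(NF-1) == "Value:" && $NF == idx { sub(/:$/, "", $2); print $2 }'
}

term=$(printf '%s\n' "$dict_dump" | lookup_term 41154)
echo "index 41154 -> $term"
```

Against a real dump, the here-string would be replaced by piping the seqdumper output into `lookup_term`; lines that are not dictionary entries (for example log lines) are skipped by the field checks.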