What you are seeing is the output matrix of the RowSimilarity job. You are right that there should be only 21578 documents in the Reuters corpus.
a) How many documents do you have in your docIndex? docIndex is one of the artifacts of the RowId job, which should have been run before the RowSimilarity job. You can run seqdumper on docIndex to see the output.

b) Also, what was the message at the end of the RowId job? It should read something like 'Wrote out matrix with 21578 rows and 19515 columns to reuters-matrix/matrix'.

On Thursday, December 19, 2013 12:14 PM, Scott C. Cote <scottcc...@gmail.com> wrote:

All,

I am a newbie Mahout user and am trying to use the "Quick tour of text analysis using the Mahout command line". Thank you to whoever contributed to that page.

> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with "seemingly" no hiccups.

At the very end of the "tour", I became confused because the command:

> mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see output (snippet):

> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,
> 2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,
> 5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,
> 13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,
> 15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,
> 16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,
> 18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,
> 22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,
> 26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,
> 28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,
> 30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,
> 31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,
> 32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,
> 34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,
> 37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,
> 39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,
> 41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a document with row ID 41154 and a cosine value of ~0.0658 (the last element in the snippet). The problem is that the folder

> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

only has 21578 files in it. Indeed, my dictionary file (the command used is shown below)

> mahout seqdumper -i reuters-matrix/docIndex | tail

has a max key of

> Key: 21576: Value: /reut2-021.sgm-98.txt
> Key: 21577: Value: /reut2-021.sgm-99.txt
> Count: 21578

So I cannot find the document with key value 41154. What does the 41154 relate to?

Obviously I have misunderstood something that I did or need to do in the tour. Can someone please shine a light on where I strayed? I have scripted every step that I took and can share them here if desired (I noticed that some of the output file names changed since the page was written, so I made adjustments).

Regards,
Scott

PS Thanks TD for helping me earlier
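
If it helps to compare numbers, here is a rough sketch for pulling the largest index out of a dumped row, so it can be checked against the row and column counts that the RowId job reports. It assumes the seqdumper text keeps the `Key: ...: Value: name:{index:weight,...}` shape shown in the quoted snippet. `parse_row` is a hypothetical helper written for this message, not part of Mahout, and the sample line is a shortened copy of the quoted row.

```python
import re

# Matches one "index:weight" pair inside the braces of a dumped vector.
PAIR = re.compile(r"(\d+):([0-9.eE+-]+)")


def parse_row(line):
    """Split one seqdumper text line into (name, {index: weight}).

    Assumes the line looks like 'Key: 1: Value: name:{i:w,i:w,}'.
    """
    head, _, body = line.partition("{")
    # Everything after "Value:" and before the brace is the vector name.
    name = head.split("Value:", 1)[1].strip().rstrip(":")
    entries = {int(i): float(w) for i, w in PAIR.findall(body)}
    return name, entries


# Shortened copy of the row quoted in this thread (first and last entries).
sample = ("Key: 1: Value: /reut2-000.sgm-1.txt:"
          "{312:0.1250488193181003,41154:0.06582064373931332,}")

name, entries = parse_row(sample)
print(name, "entries:", len(entries), "largest index:", max(entries))
```

To run it over a real dump, save the seqdumper output to a text file and call `parse_row` on each line that contains a `{`.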