What you are seeing is the output matrix of the RowSimilarity job.  You are 
right that there should be only 21578 documents in the Reuters corpus. 

a) How many documents do you have in your docIndex?  docIndex is one of the 
artifacts of the RowId job, which should have been executed prior to the 
RowSimilarity job. You can run seqdumper on docIndex to see the output. 

b) Also, what was the message at the end of the RowId job? It should read 
something like 'Wrote out matrix with 21578 rows and 19515 columns to 
reuters-matrix/matrix'.
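
If it helps with the comparison, here is a rough Python sketch for checking 
whether the indices inside the braces of a dumped row fit within the docIndex 
count. It assumes the seqdumper text output has the shape quoted in this 
thread ("Key: N: Value: /name:{index:value,...}", one row per line once the 
mail wrapping is undone). parse_row, largest_index, SAMPLE and DOC_COUNT are 
names made up for this example, not part of Mahout.

```python
import re

# Matches "index:value" pairs inside the braces of a dumped vector.
PAIR = re.compile(r"(\d+):([0-9.eE+-]+)")

def parse_row(line):
    """Split one dumped line into (key, name, {index: value})."""
    head, _, body = line.partition(":{")
    m = re.match(r"Key:\s*(\d+):\s*Value:\s*(.*)", head)
    if m is None:
        raise ValueError("unrecognized line: %r" % line[:40])
    entries = {int(i): float(v) for i, v in PAIR.findall(body)}
    return int(m.group(1)), m.group(2).strip(), entries

def largest_index(line):
    """Largest index inside the braces, or None for an empty vector."""
    _, _, entries = parse_row(line)
    return max(entries) if entries else None

# Shortened copy of the row quoted in this thread (first and last entries only).
SAMPLE = "Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,41154:0.06582064373931332,}"
DOC_COUNT = 21578  # "Count:" reported by seqdumper on docIndex

top = largest_index(SAMPLE)
print("largest index:", top, "docIndex count:", DOC_COUNT)
print("fits within docIndex keys:", top < DOC_COUNT)
```

Running largest_index over every line of the dump (saved to a text file) and 
taking the maximum gives a number to set against the row and column counts in 
the RowId message.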




On Thursday, December 19, 2013 12:14 PM, Scott C. Cote <scottcc...@gmail.com> 
wrote:
 
All,

I am a newbie Mahout user and am trying to use the "Quick tour of text
analysis using the Mahout command line".  Thank you to whoever contributed
to that page.

> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

Went all the way from beginning to end of the page with "seemingly" no
hiccups.
At the very end of the "tour", I became confused because the command:

> mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see this output (snippet):

> Key: 1: Value: 
> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2
> 2792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.126
> 28552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.188
> 8955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601
> 581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869
> 648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.123091
> 46422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.0616936
> 10076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.123271
> 84002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.0801873
> 7549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470
> 224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.14788025
> 30196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.1097379357
> 6935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767
> 691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.108198203
> 50102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.095282500
> 26282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row ID 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

has only 21578 files in it.  Indeed, my docIndex file (dumped with the
command shown below)

> mahout seqdumper -i reuters-matrix/docIndex  | tail

has a max key of

> Key: 21576: Value: /reut2-021.sgm-98.txt
> Key: 21577: Value: /reut2-021.sgm-99.txt
> Count: 21578

So I cannot find the document with key value 41154.  What does the 41154
relate to?

Obviously I have misunderstood something that I did, or need to do, in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share them here if desired (I noticed that
some of the output file names changed since the page was written, so I made
adjustments).

Regards,

Scott

PS  Thanks TD for helping me earlier
