All,

I am a newbie Mahout user working through the "Quick tour of text
analysis using the Mahout command line". Thank you to whoever contributed
to that page.

> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line

I went all the way from the beginning to the end of the page with seemingly
no hiccups.
At the very end of the tour, I became confused because the command:

> mahout seqdumper -i reuters-matrix/matrix | more

allowed me to see this output (snippet):

> Key: 1: Value: 
> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.2
> 2792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.126
> 28552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.188
> 8955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601
> 581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869
> 648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.123091
> 46422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.0616936
> 10076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.123271
> 84002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.0801873
> 7549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470
> 224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.14788025
> 30196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.1097379357
> 6935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767
> 691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.108198203
> 50102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.095282500
> 26282217,40427:0.18975048184863322,41154:0.06582064373931332,}

Reading through that snippet of data made me think that there exists a
document with row ID 41154 and a cosine value of ~0.0658 (the last element in
the snippet).

The problem is that the folder

> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted

only has 21578 files in it. Indeed, my dictionary file (dumped with the
command shown below)

> mahout seqdumper -i reuters-matrix/docIndex  | tail

has a max key of

> Key: 21576: Value: /reut2-021.sgm-98.txt
> Key: 21577: Value: /reut2-021.sgm-99.txt
> Count: 21578

So I cannot find the document with key value 41154. What does the 41154
relate to?
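
To show the comparison I am making, here is a minimal Python sketch. It
assumes the seqdumper text format shown in the snippet above; the helper
name parse_entries and the shortened sample line (only four of the entries
are kept) are my own and not part of Mahout.

```python
import re

# Shortened copy of the seqdumper line quoted above (only four entries kept).
line = ("Key: 1: Value: /reut2-000.sgm-1.txt:"
        "{312:0.1250488193181003,2962:0.07532412503846121,"
        "40427:0.18975048184863322,41154:0.06582064373931332,}")

# "Count" reported by seqdumper on reuters-matrix/docIndex.
DOC_COUNT = 21578


def parse_entries(text):
    """Return (index, weight) pairs from the {...} part of a dumped vector."""
    body = text[text.index("{") + 1:text.rindex("}")]
    pairs = re.findall(r"(\d+):([0-9.eE+-]+)", body)
    return [(int(index), float(weight)) for index, weight in pairs]


entries = parse_entries(line)
max_index = max(index for index, _ in entries)
print(f"{len(entries)} entries, largest index {max_index}, documents {DOC_COUNT}")
if max_index >= DOC_COUNT:
    # docIndex keys run from 0 to DOC_COUNT - 1, so this index matches none of them.
    print("Largest index does not match any docIndex key.")
```

The check only restates the mismatch: the largest index in the vector is
greater than the largest key in docIndex.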

Obviously I have misunderstood something that I did - or need to do - in the
tour.  Can someone please shine a light on where I strayed?  I have scripted
every step that I took and can share the scripts here if desired (I noticed
that some of the output file names changed since the page was written - so I
made adjustments).

Regards,

SCott  

PS  Thanks TD for helping me earlier

