Haha... that could explain it. RowSimilarityJob creates temp files during
execution. If your laptop went to sleep mid-run, those temp files persist, and
running the job again wouldn't overwrite them (I need to verify that).

It should be good enough to run the RowSimilarity job again.
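
If stale files do turn out to matter, one way to rule them out before the re-run is to delete both the old output folder and the job's temp directory. Below is a minimal Python sketch, not a verified procedure: `reuters-similarity` is the output folder named in this thread, while `temp` is only an assumption about where the job keeps its intermediate files (check the temp-dir setting on your install). The helper name `remove_stale_dirs` is hypothetical.

```python
import shutil
from pathlib import Path


def remove_stale_dirs(base, names=("reuters-similarity", "temp")):
    """Delete leftover directories under `base` and return the names removed.

    "temp" is an assumed location for intermediate files, not a confirmed one.
    """
    removed = []
    for name in names:
        path = Path(base) / name
        if path.is_dir():
            shutil.rmtree(path)  # removes the folder and everything in it
            removed.append(name)
    return removed
```

Calling `remove_stale_dirs(".")` from the tour's working directory and then re-running the job would start it from a clean slate.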







On Thursday, December 19, 2013 1:46 PM, Scott C. Cote <scottcc...@gmail.com> 
wrote:
 
Suneel,

I'm going to redo the similarity part of the tour - my laptop went to
sleep in the middle of the rowsimilarity job run.
Maybe the job is sensitive to that ….  :(  Normally a server would not
go to sleep, nor would it run in local mode.

Sorry that I didn't think of that sooner.
Will let you know my outcome.

I am planning to redo it by deleting the folder named
"reuters-similarity" and its contents.

Please let me know if that is not good enough.

Thanks again.

SCott


On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>What you are seeing is the output matrix of the RowSimilarity job.  You
>are right, there should be only 21578 documents in the Reuters corpus.
>
>a) How many documents do you have in your docIndex?  DocIndex is one of
>the artifacts of the RowIDJob and should have been executed prior to the
>RowSimilarity Job. You can run seqdumper on docIndex to see the output.
>
>b) Also, what was the message at the end of the RowId job? It should read
>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>reuters-matrix/matrix'.
>
>
>
>
>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
> 
>All,
>
>I am a newbie Mahout user and am trying to use the "Quick tour of text
>analysis using the Mahout command line".  Thank you to whoever contributed
>to that page.
>
>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>
>Went all the way from beginning to end of the page with "seemingly" no
>hiccups.
>At the very end of the "tour", I became confused because the command:
>
>> mahout seqdumper -i reuters-matrix/matrix | more
>
>Allowed me to see output (snippet)
>
>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,
>> 2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,
>> 5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,
>> 13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,
>> 15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,
>> 16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,
>> 18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,
>> 22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,
>> 26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,
>> 28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,
>> 30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,
>> 31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,
>> 32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,
>> 34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,
>> 37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,
>> 39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,
>> 41154:0.06582064373931332,}
>
>Reading through that snippet of data made me think that there exists a
>document with row ID 41154 and a cosine value of ~0.0658 (the last element
>in the snippet).
>
>The problem is that the folder
>
>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>
>Only has 21578 files in it.  Indeed, my dictionary file  (output command
>used shown below)
>
>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>
>Has a max key of
>
>> Key: 21576: Value: /reut2-021.sgm-98.txt
>> Key: 21577: Value: /reut2-021.sgm-99.txt
>> Count: 21578
>
>So I cannot find the document with key value 41154.  What does the
>41154 relate to?
>
>Obviously I have misunderstood something that I did - or need to do - in
>the tour.  Can someone please shine a light on where I strayed?  I have
>scripted every step that I took and can share them here if desired (I
>noticed that some of the output file names changed since the page was
>written - so I made adjustments).
>
>Regards,
>
>SCott  
>
>PS  Thanks TD for helping me earlier
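
The mismatch described in this thread (vector indices larger than the docIndex count) can be checked mechanically. The following is a minimal Python sketch, assuming the dumped value has the `name:{index:weight,...}` shape shown in the snippet above; the function name `parse_vector` and the shortened `sample` string are illustrative, not part of Mahout. It only flags the mismatch and does not explain what the indices refer to.

```python
import re


def parse_vector(value):
    """Split a dumped value of the form 'name:{index:weight,...}'."""
    name, _, body = value.partition(":{")
    pairs = re.findall(r"(\d+):([0-9.eE+-]+)", body)
    return name, {int(index): float(weight) for index, weight in pairs}


# Shortened copy of the value quoted in the thread (first and last entries only).
sample = (
    "/reut2-000.sgm-1.txt:{312:0.1250488193181003,"
    "40427:0.18975048184863322,41154:0.06582064373931332,}"
)
doc_count = 21578  # "Count: 21578" from the docIndex dump in the thread

name, entries = parse_vector(sample)
largest = max(entries)
# An index at or above doc_count cannot be a key in a docIndex with that count.
print(name, largest, largest >= doc_count)
```

Running the same check over every line of the dump would show whether any index ever falls inside the 0..21577 key range used by docIndex.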
