Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

Suneel Marthi Fri, 20 Dec 2013 14:31:27 -0800

Sorry Scott, I should have looked at this more closely. I apologize.

1. You are running seqdumper on the matrix, which is generated by the rowid 
job and is not the output of the rowsimilarity job.

     The RowId job generates an M x N matrix, where M is the number of 
documents and N is the number of terms (the size of the dictionary).

     The value of a cell in the matrix is the tf-idf weight of the term.

     So in the following output:

     {Code}


      
Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,296
2:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,540
5:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890
:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0
.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714
:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,197
38:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22
224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
23063:0.06357107330586896,23218:0.13920493300455258,25480:0.072277361433487
77,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.1479399632156
9253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.1051528113
8396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.139762177
1750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.0987718
8897003744,}

{Code}

means that for document 2, what follows are term-index:tf-idf-weight pairs.
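
A minimal sketch for pulling the term indices out of one such row with 
standard shell tools. It assumes the row has been joined back onto a single 
line, and term_indices is a hypothetical helper name, not part of Mahout:

```shell
# Print the term (column) indices of one seqdumper row, one per line.
# Assumes the row arrives on stdin as a single line of the form
#   Key: <row>: Value: <doc>:{<index>:<weight>,<index>:<weight>,...,}
term_indices() {
  sed -e 's/^.*{//' -e 's/,*}.*$//' |  # keep only the text between { and }
    tr ',' '\n' |                      # one index:weight pair per line
    cut -d: -f1                        # keep the index, drop the weight
}
```

For example, echo "$row" | term_indices | sort -n | tail -1 prints the 
largest term index in the row, which should stay below the column count 
reported by the rowid job (41807 in this run).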

To see the term corresponding to 41625, look for the entry with that index 
in dictionary.file-0.

Hope that clears up the confusion here.
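
One way to do that lookup, as a rough sketch. It assumes the dictionary was 
first dumped to a text file and that each record is printed as 
"Key: <term>: Value: <index>"; lookup_term and dictionary.txt are 
hypothetical names:

```shell
# Print the dictionary record(s) whose value equals the given term index.
# Assumes a text dump created beforehand, for example:
#   mahout seqdumper -i reuters-vectors/dictionary.file-0 > dictionary.txt
# and assumes records are printed as "Key: <term>: Value: <index>".
lookup_term() {
  local index="$1" dump="$2"
  # Anchor at end of line so that 41625 does not also match 416250.
  grep -E ": Value: ${index}\$" "$dump"
}
```

Usage would be something like: lookup_term 41625 dictionary.txt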

2.  To see the most similar documents for a given document, you should be 
looking at a seqdumper of the output from rowsimilarity, which in your case 
would be the output in reuters-similarity.  That should give the 10 most 
similar documents and their cosine similarities to the referenced document.

There's an error in the wiki instructions: the seqdumper should have been run 
on rowsimilarity/part-r-* and not on matrix/matrix for determining similar 
documents.
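
Since the rowsimilarity output identifies documents by row id, a small helper 
can map an id back to its file name through a text dump of docIndex. This is 
only a sketch: doc_name and docindex.txt are hypothetical names, and it 
assumes the "Key: <id>: Value: <file>" layout seen in the docIndex dump 
quoted below.

```shell
# Print the file name stored for a row id in a docIndex text dump.
# Assumes the dump was saved beforehand, for example:
#   mahout seqdumper -i reuters-matrix/docIndex > docindex.txt
# with records printed as "Key: <row id>: Value: <file name>".
doc_name() {
  local id="$1" dump="$2"
  # Strip the "Key: <id>: Value: " prefix and print only matching records.
  sed -n "s/^Key: ${id}: Value: //p" "$dump"
}
```

An id that is not a row id (such as the term index 41625) prints nothing, 
which is a quick way to tell document ids and term indices apart.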

Hope this helps. Sorry again for the confusion.

    






On Friday, December 20, 2013 4:51 PM, Scott C. Cote <scottcc...@gmail.com> 
wrote:
 
Suneel and others,

I am still getting the strange results when I do the tour. Suneel: I
manually wiped out the temp folder and also deleted the reuters-XXX
folders.  
Also, per your advice I added the -ow option to all of the commands.
NOTE: The step to create a matrix would NOT take a -ow option

I have tried again, and am still seeing references to documents that do
not exist.

The tail end of reuters-matrix/docindex looks like (mahout seqdumper -i
reuters-matrix/matrix | tail) :

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578



And the following snippet exists inside reuters-matrix/matrix and
references key 41625 (which is larger than any key in docindex).

Key: 2: Value: 
/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,296
2:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,540
5:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890
:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0
.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714
:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,197
38:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22
224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
23063:0.06357107330586896,23218:0.13920493300455258,25480:0.072277361433487
77,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.1479399632156
9253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.1051528113
8396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.139762177
1750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.0987718
8897003744,}

--->>>>> So in this email, I have listed the following pieces of
information: 1. Commands, 2. Env vars, 3. Sw version info

Again, thank you in advance for your help.

Scott

INFO Below:

1. sequence of commands with relevant logged output points (omitted the
sequence dump commands):

mv reuters xreuters
rm -r temp

rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 \
  -x 90 -seq -ml 50 -n 2 -nv
#
# added the -cd option per instructions in Mahout in Action (MiA) so the
# convergence threshold is 0.1 instead of the default value of 0.5, because
# cosines lie within 0 and 1 (originally this was left at the default value,
# but that had no effect on the unexpected results)
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile \
  -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt \
  -p reuters-kmeans-clusters/clusteredPoints/

mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o reuters-matrix
#
# the prior step had 21578 rows and 41807 columns
# 41807 came from the prior step columns output
# 10 most similar docs to each doc in the collection
#
mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity \
  -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess




2. env vars are as follows:

MAHOUT_LOCAL=yes
TERM_PROGRAM=Apple_Terminal
MAHOUT_HOME=/Users/scottccote/mahout
TERM=xterm-256color
SHELL=/bin/bash
TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
HADOOP_PREFIX=/Users/scottccote/hadoop
TERM_PROGRAM_VERSION=309
TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
USER=scottccote
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0:0
Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/op
t/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/
usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/U
sers/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
PWD=/Users/scottccote/Documents/toy-workspace/MiA
HADOOP_VERSION=1.1.2
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
EDITOR=/usr/bin/vi
HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
LANG=en_US.UTF-8
HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK
-Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk





3. Software/OS Version Info:
version of mahout is (property of pom.xml in mahout home): 0.8

version of java (java -version): java version "1.6.0_65", Java(TM) SE
Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
64-Bit Server VM (build 20.65-b04-462, mixed mode)

Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64





On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>I don't see a need for uploading your commands.  Clean up HDFS (both output
>and temp folders) and try running the 5 steps again - extract reuters,
>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>
>Please use '-ow' option while running each of the jobs.
>
>
>
>
>
>
>
>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
> 
>I manually deleted the temp folder too (After 2 failed starts).
>
>Would it be helpful for me to upload my shells that encapsulate all of the
>commands posted on the tour?  They reflect the current state of reuters
>and .8 mahout.
>And if I did - how would I do it?
>
>Thanks,
>
>SCott
>
>
>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>Yep, that's what has happened in your case. The wiki doesn't mention it, but
>>please specify the -ow (overwrite) option while running the
>>RowSimilarityJob. That should clear up both the output and temp folders
>>before running the job.
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>><suneel_mar...@yahoo.com> wrote:
>> 
>>Haha... that could explain it. RowSimilarityJob creates temp files during
>>execution. If your laptop 'sleeped', the temp files would still persist, and
>>running the job again wouldn't overwrite the old temp files (I need to
>>verify that).
>>
>>It should be good enough to run the RowSimilarity job again.
>>
>>
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>> 
>>Suneel,
>>
>>I'm going to do the similarity part of the tour over - my laptop was
>>"sleeped" in the middle of the run of the rowsimilarity job.
>>Maybe the job is sensitive to that ….  :(  Normally - a server would not
>>go to sleep nor would it run
>>in local mode.
>>
>>Sorry that I didn't think of that sooner.
>>Will let you know my outcome.
>>
>>Am planning on redoing by deleting the contents and the folder titled
>>"reuters-similarity"
>>
>>Please let me know if that is not good enough.
>>
>>Thanks again.
>>
>>SCott
>>
>>
>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>>are right there should be 21578 documents only in the reuters corpus.
>>>
>>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>>the artifacts of the RowIDJob and should have been executed prior to the
>>>RowSimilarity Job. You can run seqdumper on docIndex to see the output.
>>>
>>>b) Also what was the message at the end of the RowId job. It should read
>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>reuters-matrix/matrix'.
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>> 
>>>All,
>>>
>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>analysis using the Mahout command line".  Thank you to whoever contributed
>>>to that page.
>>>
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>
>>>Went all the way from beginning to end of the page with "seemingly" no
>>>hiccups.
>>>At the very end of the "tour", I became confused because the command:
>>>
>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>
>>>Allowed me to see output (snippet)
>>>
>>>> Key: 1: Value:
>>>> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,
>>>> 4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,
>>>> 10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,
>>>> 13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,
>>>> 15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,
>>>> 17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,
>>>> 22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,
>>>> 25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,
>>>> 26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,
>>>> 28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,
>>>> 30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,
>>>> 31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,
>>>> 32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,
>>>> 35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,
>>>> 37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,
>>>> 39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>
>>>Reading through that snippet of data made me think that there exists a
>>>document with row id 41154 and a cosine value of ~0.0658 (the last element
>>>in the snippet).
>>>
>>>The problem is that the folder
>>>
>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>
>>>Only has 21578 files in it.  Indeed, my dictionary file  (output command
>>>used shown below)
>>>
>>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>>
>>>Has a max key of
>>>
>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>> Count: 21578
>>>
>>>So I cannot find the document with key value 41154.  What does the 41154
>>>relate to?
>>>
>>>Obviously I have misunderstood something that I did or need to do in the
>>>tour.  Can someone please shine a light on where I strayed?  I have
>>>scripted every step that I took and can share them here if desired (I
>>>noticed that some of the output file names changed since the page was
>>>written, so I made adjustments).
>>>
>>>Regards,
>>>
>>>SCott  
>>>
>>>PS  Thanks TD for helping me earlier
