Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

Scott C. Cote Fri, 20 Dec 2013 16:28:39 -0800

What does the data in cdump.txt represent?  Can you point me in the right
direction?


SCott

On 12/20/13 4:30 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>Sorry Scott I should have looked at this more closely. I apologize.
>
>1. You are running seqdumper on the matrix, which is generated by the
>rowid job and is not the output of the rowsimilarity job.
>
>     The rowid job generates an M x N matrix, where M is the number of
>documents and N is the number of distinct terms.
>
>     The value of a cell in the matrix is the tf-idf weight of that term
>in that document.
>
>     So in the following output:
>
>     {Code}
>
>Key: 2: Value:
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
>
>{Code}
>
>means that for document 2, what follows is a list of
>term-index:tf-idf-weight pairs.
>
>To see the term corresponding to 41625, look up that index in
>dictionary.file-0; the corresponding key is the term.
>
>Hope that clarifies and clears the confusion here.
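
A minimal sketch of that dictionary lookup, for anyone following along. It assumes that `mahout seqdumper -i reuters-vectors/dictionary.file-0` prints one `Key: <term>: Value: <index>` line per entry; `term_for_index` is a hypothetical helper name, not part of Mahout.

```shell
# Print the term whose dictionary index equals $1.
# Reads seqdumper-style lines ("Key: <term>: Value: <index>") on stdin.
# Assumption: the dictionary dump maps term -> integer index in that layout.
term_for_index() {
  awk -v idx="$1" '
    /^Key: / && $NF == idx {
      sub(/^Key: /, "")            # drop the leading label
      sub(/: Value: [0-9]+$/, "")  # drop the trailing index
      print
    }'
}
```

Piping the seqdumper output of dictionary.file-0 into `term_for_index 41625` should then print the term behind column 41625, provided the dump really has that layout.
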
>
>2.  To see the most similar documents for a given document, you should
>be looking at a seqdumper of the output from rowsimilarity, which in
>your case would be the output in reuters-similarity.  That should give
>the 10 most similar documents and their cosine similarities to the
>referenced document.
>
>There's an error in the wiki instructions: the seqdumper should have
>been run on rowsimilarity/part-r-* and not on matrix/matrix for
>determining similar documents.
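
For reading the rowsimilarity dump, the sketch below sorts the entries of one dumped row so that the closest documents come first. It assumes each line looks like `Key: <doc>: Value: {<doc>:<similarity>,...}`, with the same brace layout as the matrix dump above; `neighbors_by_similarity` is a hypothetical helper, not a Mahout command.

```shell
# List the neighbors in one rowsimilarity line, most similar first.
# Input on stdin: a single line such as "Key: 2: Value: {17:0.41,5:0.83,}".
# Output: "<docId> <similarity>" pairs sorted by similarity, descending.
neighbors_by_similarity() {
  sed -e 's/^.*{//' -e 's/,*}.*$//' |  # keep only the text inside the braces
    tr ',' '\n' |                      # one "id:similarity" entry per line
    tr ':' ' ' |                       # split id and similarity into columns
    LC_ALL=C sort -k2,2nr              # numeric sort on the similarity column
}
```

The document ids printed this way can be mapped back to file names through reuters-matrix/docIndex.
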
>
>Hope this helps. Sorry again for the confusion.
>
>    
>
>
>
>
>
>
>On Friday, December 20, 2013 4:51 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
> 
>Suneel and others,
>
>I am still getting the strange results when I do the tour. Suneel: I
>manually wiped out the temp folder and also deleted the reuters-XXX
>folders.  
>Also, per your advice I added the -ow option to all of the commands.
>NOTE: The step to create a matrix would NOT take a -ow option
>
>I have tried again, and am still seeing references to documents that do
>not exist.
>
>The tail end of reuters-matrix/docIndex looks like this (mahout seqdumper
>-i reuters-matrix/docIndex | tail):
>
>INFO: Program took 1077 ms (Minutes: 0.01795)
>Key: 21569: Value: /reut2-021.sgm-91.txt
>Key: 21570: Value: /reut2-021.sgm-92.txt
>Key: 21571: Value: /reut2-021.sgm-93.txt
>Key: 21572: Value: /reut2-021.sgm-94.txt
>Key: 21573: Value: /reut2-021.sgm-95.txt
>Key: 21574: Value: /reut2-021.sgm-96.txt
>Key: 21575: Value: /reut2-021.sgm-97.txt
>Key: 21576: Value: /reut2-021.sgm-98.txt
>Key: 21577: Value: /reut2-021.sgm-99.txt
>Count: 21578
>
>
>
>And the following snippet exists inside reuters-matrix/matrix and
>references key 41625 (which is larger than any key in docindex).
>
>Key: 2: Value:
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
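
A quick sanity check along these lines can show which axis an index belongs to. The sketch below extracts the largest index from one dumped row; `max_index` is a hypothetical helper, not a Mahout command. For the row above it should yield 41625, which exceeds the 21578 entries in docIndex but fits within the 41807 columns noted with the commands further down, which suggests the inner numbers are column (term) indices rather than document keys.

```shell
# Print the largest index that appears inside the braces of one dumped row.
# Input on stdin: a line such as "/doc.txt:{1534:0.22,41625:0.09,}".
max_index() {
  sed -e 's/^.*{//' -e 's/,*}.*$//' |  # keep only the text inside the braces
    tr ',' '\n' |                      # one "index:weight" entry per line
    cut -d ':' -f 1 |                  # keep the index, drop the weight
    sort -n |
    tail -n 1
}
```
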
>
>--->>>>> So in this email, I have listed the following pieces of
>information: 1. Commands, 2. Env vars, 3. Sw version info
>
>Again, thank you in advance for your help.
>
>Scott
>
>INFO Below:
>
>1. sequence of commands with relevant logged output points (omitted the
>sequence dump commands):
>
>mv reuters xreuters
>rm -r temp
>
>rm -r reuters-*
>mv xreuters reuters
>mvn -e -q exec:java \
>  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
>  -Dexec.args="reuters/ reuters-extracted/"
>mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
>mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 \
>  -x 90 -seq -ml 50 -n 2 -nv
>#
># Added the -cd option per the instructions in Mahout in Action (MiA), so
># the convergence threshold is 0.1 instead of the default value of 0.5,
># because cosines lie between 0 and 1. (Originally this was left at the
># default value, but that had no effect on the unexpected results.)
>#
>mahout kmeans -i reuters-vectors/tfidf-vectors/ \
>  -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 \
>  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile \
>  -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt \
>  -p reuters-kmeans-clusters/clusteredPoints/
>
>mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 \
>  -o reuters-matrix
>#
># The prior step (rowid) reported 21578 rows and 41807 columns.
># The -r 41807 below comes from that column count.
># -m 10 requests the 10 most similar docs for each doc in the collection.
>#
>mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity \
>  -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
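
Rather than copying the column count by hand, it could be pulled out of the rowid job's closing log line. This is only a sketch: it assumes the line contains "... rows and <N> columns ...", as in the message quoted elsewhere in this thread, and `matrix_columns` is a hypothetical helper name.

```shell
# Extract the column count from the rowid job's closing log line.
# Assumption: the line contains "... rows and <N> columns ...".
# Prints nothing if no such line is found on stdin.
matrix_columns() {
  sed -n 's/^.*rows and \([0-9][0-9]*\) columns.*$/\1/p'
}
```

If the rowid output were captured in a file (say a hypothetical rowid.log), `cols=$(matrix_columns < rowid.log)` would give the value to pass as `-r "$cols"` to rowsimilarity.
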
>
>
>
>
>2. env vars are as follows:
>
>MAHOUT_LOCAL=yes
>TERM_PROGRAM=Apple_Terminal
>MAHOUT_HOME=/Users/scottccote/mahout
>TERM=xterm-256color
>SHELL=/bin/bash
>TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
>Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
>HADOOP_PREFIX=/Users/scottccote/hadoop
>TERM_PROGRAM_VERSION=309
>TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
>USER=scottccote
>COMMAND_MODE=unix2003
>SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
>__CF_USER_TEXT_ENCODING=0x1F5:0:0
>Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
>PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
>PWD=/Users/scottccote/Documents/toy-workspace/MiA
>HADOOP_VERSION=1.1.2
>JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
>EDITOR=/usr/bin/vi
>HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
>LANG=en_US.UTF-8
>HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk
>
>
>
>
>
>3. Software/OS Version Info:
>version of mahout is (property of pom.xml in mahout home): 0.8
>
>version of java (java -version): java version "1.6.0_65", Java(TM) SE
>Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
>64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
>Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
>Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
>root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
>
>
>
>
>
>On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>I don't see a need for uploading your commands.  Clean up HDFS (both output
>>and temp folders) and try running the 5 steps again: extract reuters,
>>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>>
>>Please use '-ow' option while running each of the jobs.
>>
>>
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>> 
>>I manually deleted the temp folder too (After 2 failed starts).
>>
>>Would it be helpful for me to upload my shell scripts that encapsulate all
>>of the commands posted on the tour?  They reflect the current state of
>>reuters and Mahout 0.8.
>>And if I did, how would I do it?
>>
>>Thanks,
>>
>>SCott
>>
>>
>>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>Yep, that's what has happened in your case. The wiki doesn't mention it,
>>>but please specify the -ow (overwrite) option while running the
>>>RowSimilarityJob. That should clear up both the output and temp folders
>>>before running the job.
>>>
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>>><suneel_mar...@yahoo.com> wrote:
>>> 
>>>Haha... that could explain it. RowSimilarityJob creates temp files during
>>>execution. If your laptop went to sleep, then the temp files still persist,
>>>and running the job again wouldn't overwrite the old temp files (I need to
>>>verify that).
>>>
>>>It should be good enough to run the RowSimilarity job again.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>> 
>>>Suneel,
>>>
>>>I'm going to do the similarity part of the tour over - my laptop went to
>>>sleep in the middle of the run of the rowsimilarity job.
>>>Maybe the job is sensitive to that.  :(  Normally, a server would not
>>>go to sleep, nor would it run in local mode.
>>>
>>>Sorry that I didn't think of that sooner.
>>>Will let you know my outcome.
>>>
>>>I am planning to redo it after deleting the folder named
>>>"reuters-similarity" and its contents.
>>>
>>>Please let me know if that is not good enough.
>>>
>>>Thanks again.
>>>
>>>SCott
>>>
>>>
>>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>>
>>>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>>>are right, there should be only 21578 documents in the Reuters corpus.
>>>>
>>>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>>>the artifacts of the RowIdJob, which should have been run prior to the
>>>>RowSimilarity job. You can run seqdumper on docIndex to see the output.
>>>>
>>>>b) Also, what was the message at the end of the RowId job? It should read
>>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>>reuters-matrix/matrix'.
>>>>
>>>>
>>>>
>>>>
>>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>>><scottcc...@gmail.com> wrote:
>>>> 
>>>>All,
>>>>
>>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>>analysis using the Mahout command line".  Thank you to whoever
>>>>contributed to that page.
>>>>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>
>>>>Went all the way from beginning to end of the page with "seemingly" no
>>>>hiccups.
>>>>At the very end of the "tour", I became confused because the command:
>>>>
>>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>>
>>>>Allowed me to see output (snippet)
>>>>
>>>>> Key: 1: Value:
>>>>> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,
>>>>> 4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,
>>>>> 10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,
>>>>> 13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,
>>>>> 15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,
>>>>> 17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,
>>>>> 22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,
>>>>> 25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,
>>>>> 26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,
>>>>> 28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,
>>>>> 30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,
>>>>> 31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,
>>>>> 32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,
>>>>> 35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,
>>>>> 37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,
>>>>> 39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>>
>>>>Reading through that snippet of data made me think that there exists a
>>>>document with row id 41154 with a cosine value of ~0.0658 (the last
>>>>element in the snippet).
>>>>
>>>>The problem is that the folder
>>>>
>>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>>
>>>>Only has 21578 files in it.  Indeed, my docIndex file (dumped with the
>>>>command shown below)
>>>>
>>>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>>>
>>>>Has a max key of
>>>>
>>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>>> Count: 21578
>>>>
>>>>So I cannot find the document with key value 41154.  What does the
>>>>41154 relate to?
>>>>
>>>>Obviously I have misunderstood something that I did or need to do in
>>>>the tour.  Can someone please shine a light on where I strayed?  I have
>>>>scripted every step that I took and can share the scripts here if desired
>>>>(I noticed that some of the output file names have changed since the page
>>>>was written, so I made adjustments).
>>>>
>>>>Regards,
>>>>
>>>>SCott  
>>>>
>>>>PS  Thanks TD for helping me earlier
