What does the data in cdump.txt represent? Can you point me in the right direction?
SCott

On 12/20/13 4:30 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>Sorry Scott, I should have looked at this more closely. I apologize.
>
>1. You are running seqdumper on the matrix (which is generated by the
>rowid job and is not the output of the rowsimilarity job).
>
>   The rowid job generates an MxN matrix, where M is the number of
>documents and N is the number of terms (one column per term index).
>
>   The value of a cell in the matrix is the tf-idf weight of the term.
>
>   So the following output:
>
>{Code}
>
>Key: 2: Value:
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
>
>{Code}
>
>means that for document 2, what follows are term index:tf-idf weight
>pairs.
>
>To see the term corresponding to 41625, look up that index in
>dictionary.file-0.
>
>Hope that clarifies and clears up the confusion here.
>
>2. In order to see the most similar documents for a given document, you
>should be looking at a seqdumper of the output from rowsimilarity, which
>in your case would be the output in reuters-similarity. That should give
>the 10 most similar documents and their cosine similarities to the
>referenced document.
>
>There's an error in the wiki instructions: seqdumper should have been
>run on rowsimilarity/part-r-* and not on matrix/matrix for determining
>similar documents.
>
>Hope this helps. Sorry again for the confusion.
>
>
>On Friday, December 20, 2013 4:51 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
>
>Suneel and others,
>
>I am still getting the strange results when I do the tour. Suneel: I
>manually wiped out the temp folder and also deleted the reuters-XXX
>folders.
>Also, per your advice I added the -ow option to all of the commands.
>NOTE: The step to create a matrix would NOT take a -ow option.
>
>I have tried again, and am still seeing references to documents that do
>not exist.
>
>The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i
>reuters-matrix/docIndex | tail):
>
>INFO: Program took 1077 ms (Minutes: 0.01795)
>Key: 21569: Value: /reut2-021.sgm-91.txt
>Key: 21570: Value: /reut2-021.sgm-92.txt
>Key: 21571: Value: /reut2-021.sgm-93.txt
>Key: 21572: Value: /reut2-021.sgm-94.txt
>Key: 21573: Value: /reut2-021.sgm-95.txt
>Key: 21574: Value: /reut2-021.sgm-96.txt
>Key: 21575: Value: /reut2-021.sgm-97.txt
>Key: 21576: Value: /reut2-021.sgm-98.txt
>Key: 21577: Value: /reut2-021.sgm-99.txt
>Count: 21578
>
>And the following snippet exists inside reuters-matrix/matrix and
>references key 41625 (which is larger than any key in docIndex).
>
>Key: 2: Value:
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
>
>--->>>>> So in this email, I have listed the following pieces of
>information: 1. Commands, 2. Env vars, 3. Sw version info
>
>Again, thank you in advance for your help.
>
>Scott
>
>INFO Below:
>
>1. sequence of commands with relevant logged output points (omitted the
>sequence dump commands):
>
>mv reuters xreuters
>rm -r temp
>
>rm -r reuters-*
>mv xreuters reuters
>mvn -e -q exec:java \
>  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
>  -Dexec.args="reuters/ reuters-extracted/"
>mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
>mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 \
>  -x 90 -seq -ml 50 -n 2 -nv
>#
># added the -cd option per the instructions in Mahout in Action (MiA) so
># the convergence threshold is 0.1 instead of the default value of 0.5,
># because cosines lie between 0 and 1 (originally I used the default
># value; changing it had no effect on the unexpected results)
>#
>mahout kmeans -i reuters-vectors/tfidf-vectors/ -c \
>  reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 \
>  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile \
>  -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p \
>  reuters-kmeans-clusters/clusteredPoints/
>
>mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o \
>  reuters-matrix
>#
># the prior step had 21578 rows and 41807 columns
># 41807 came from the prior step columns output
># 10 most similar docs to each doc in the collection
>#
>mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity -r \
>  41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
>
>
>2. env vars are as follows:
>
>MAHOUT_LOCAL=yes
>TERM_PROGRAM=Apple_Terminal
>MAHOUT_HOME=/Users/scottccote/mahout
>TERM=xterm-256color
>SHELL=/bin/bash
>TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
>Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
>HADOOP_PREFIX=/Users/scottccote/hadoop
>TERM_PROGRAM_VERSION=309
>TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
>USER=scottccote
>COMMAND_MODE=unix2003
>SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
>__CF_USER_TEXT_ENCODING=0x1F5:0:0
>Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
>PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
>PWD=/Users/scottccote/Documents/toy-workspace/MiA
>HADOOP_VERSION=1.1.2
>JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
>EDITOR=/usr/bin/vi
>HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
>LANG=en_US.UTF-8
>HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk
>
>
>3. Software/OS Version Info:
>
>version of mahout is (property of pom.xml in mahout home): 0.8
>
>version of java (java -version): java version "1.6.0_65", Java(TM) SE
>Runtime Environment (build 1.6.0_65-b14-462-11M4609), Java HotSpot(TM)
>64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
>version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
>Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
>root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
>
>
>On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>I don't see a need for uploading your commands. Clean up HDFS (both
>>output and temp folders) and try running the 5 steps again - extract
>>reuters, seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>>
>>Please use the '-ow' option while running each of the jobs.
>>
>>
>>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>>
>>I manually deleted the temp folder too (after 2 failed starts).
>>
>>Would it be helpful for me to upload my shell scripts that encapsulate
>>all of the commands posted on the tour? They reflect the current state
>>of reuters and Mahout 0.8.
>>And if I did - how would I do it?
>>
>>Thanks,
>>
>>SCott
>>
>>
>>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>Yep, that's what has happened in your case. The wiki doesn't mention
>>>it, but please specify the -ow (overwrite) option while running the
>>>RowSimilarityJob. That should clear up both the output and temp folders
>>>before running the job.
>>>
>>>
>>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>>><suneel_mar...@yahoo.com> wrote:
>>>
>>>Haha... that could explain it. RowSimilarityJob creates temp files
>>>during execution. If your laptop went to sleep, the temp files still
>>>persist, and running the job again wouldn't overwrite the old temp
>>>files (I need to verify that).
>>>
>>>It should be good enough to run the RowSimilarity job again.
>>>
>>>
>>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>>
>>>Suneel,
>>>
>>>I'm going to do the similarity part of the tour over - my laptop went
>>>to sleep in the middle of the run of the rowsimilarity job.
>>>Maybe the job is sensitive to that ... :( Normally a server would not
>>>go to sleep, nor would it run in local mode.
>>>
>>>Sorry that I didn't think of that sooner.
>>>Will let you know my outcome.
>>>
>>>Am planning on redoing it by deleting the folder titled
>>>"reuters-similarity" and its contents.
>>>
>>>Please let me know if that is not good enough.
>>>
>>>Thanks again.
>>>
>>>SCott
>>>
>>>
>>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>>
>>>>What you are seeing is the output matrix of the RowSimilarity job. You
>>>>are right, there should be only 21578 documents in the reuters corpus.
>>>>
>>>>a) How many documents do you have in your docIndex? DocIndex is one of
>>>>the artifacts of the RowIdJob, which should have been executed prior
>>>>to the RowSimilarity job. You can run seqdumper on docIndex to see the
>>>>output.
>>>>
>>>>b) Also, what was the message at the end of the RowId job? It should
>>>>read something like 'Wrote out matrix with 21578 rows and 19515
>>>>columns to reuters-matrix/matrix'.
>>>>
>>>>
>>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>>><scottcc...@gmail.com> wrote:
>>>>
>>>>All,
>>>>
>>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>>analysis using the Mahout command line". Thank you to whoever
>>>>contributed to that page.
>>>>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>
>>>>Went all the way from beginning to end of the page with "seemingly" no
>>>>hiccups.
>>>>At the very end of the "tour", I became confused because the command:
>>>>
>>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>>
>>>>allowed me to see this output (snippet):
>>>>
>>>>> Key: 1: Value:
>>>>> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,
>>>>> 4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,
>>>>> 10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,
>>>>> 13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,
>>>>> 15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,
>>>>> 17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,
>>>>> 22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,
>>>>> 25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,
>>>>> 26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,
>>>>> 28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,
>>>>> 30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,
>>>>> 31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,
>>>>> 32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,
>>>>> 35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,
>>>>> 37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,
>>>>> 39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>>
>>>>Reading through that snippet of data made me think that there exists a
>>>>document with row ID 41154 and a cosine value of ~0.0658 (the last
>>>>element in the snippet).
>>>>
>>>>The problem is that the folder
>>>>
>>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>>
>>>>only has 21578 files in it. Indeed, my document index (the command
>>>>used is shown below)
>>>>
>>>>> mahout seqdumper -i reuters-matrix/docIndex | tail
>>>>
>>>>has a max key of
>>>>
>>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>>> Count: 21578
>>>>
>>>>So I cannot find the document with key value 41154. What does the
>>>>41154 relate to?
>>>>
>>>>Obviously I have misunderstood something that I did or need to do in
>>>>the tour. Can someone please shine a light on where I strayed? I have
>>>>scripted every step that I took and can share them here if desired (I
>>>>noticed that some of the output file names changed since the page was
>>>>written, so I made adjustments).
>>>>
>>>>Regards,
>>>>
>>>>SCott
>>>>
>>>>PS Thanks TD for helping me earlier
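
Suneel's reading of the matrix dump (the row key is a document id, and each cell key is a term index with its tf-idf weight) can be sketched in a few lines of Python. This is only an illustration, assuming the seqdumper line format shown in this thread; the name `parse_matrix_row` is made up for the example, the sample row is abbreviated to three of its cells, and the sizes are the ones Scott reported for the rowid step.

```python
import re

# Shape of one row as printed by `mahout seqdumper -i reuters-matrix/matrix`
# in this thread: "Key: <row>: Value: <doc name>:{<index>:<weight>,...,}"
ROW_PATTERN = re.compile(r"Key:\s*(\d+):\s*Value:\s*(.*?):\{(.*)\}\s*$")

# Sizes reported in the thread (rowid step: 21578 rows and 41807 columns).
N_DOCS = 21578
N_TERMS = 41807

# Abbreviated copy of the "Key: 2" row quoted above (only 3 of its cells).
SAMPLE = (
    "Key: 2: Value: /reut2-000.sgm-10.txt:"
    "{1534:0.22690468189202942,2594:0.2600104057711044,"
    "41625:0.09877188897003744,}"
)


def parse_matrix_row(line):
    """Return (row_id, doc_name, {term_index: tfidf_weight}) for one dumped row."""
    match = ROW_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognized seqdumper row: %r" % line[:60])
    row_id = int(match.group(1))
    doc_name = match.group(2)
    cells = {}
    for pair in match.group(3).split(","):
        if pair:  # each dumped row ends with a trailing comma
            index, weight = pair.split(":")
            cells[int(index)] = float(weight)
    return row_id, doc_name, cells


row_id, doc_name, cells = parse_matrix_row(SAMPLE)

# Row keys are document ids, so they are bounded by the docIndex count.
assert 0 <= row_id < N_DOCS
# Cell keys are term indices, so they are bounded by the column count,
# which is why 41625 can exceed the largest docIndex key (21577).
assert all(0 <= index < N_TERMS for index in cells)

print(doc_name, "has", len(cells), "terms; largest term index:", max(cells))
```

Under that reading, 41625 and 41154 are column (term) indices to look up in dictionary.file-0, not document ids; the similar-document lists live in the rowsimilarity output instead.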