Suneel and others,

I am still getting the strange results when I do the tour.

Suneel: I manually wiped out the temp folder and also deleted the reuters-XXX folders. Also, per your advice, I added the -ow option to all of the commands. NOTE: the step that creates the matrix (rowid) would NOT take a -ow option.
I have tried again, and am still seeing references to documents that do not exist.

The tail end of reuters-matrix/docIndex looks like this (mahout seqdumper -i reuters-matrix/matrix | tail):

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578

And the following snippet exists inside reuters-matrix/matrix and references key 41625 (which is larger than any key in docIndex):

Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}

--->>>>>

So in this email, I have listed the following pieces of information:

1. Commands
2. Env vars
3. Sw version info

Again, thank you in advance for your help.

Scott

INFO below:

1.
sequence of commands with relevant logged output points (omitted the sequence dump commands):

mv reuters xreuters
rm -r temp
rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -nv
#
# added the -cd option per the instructions in Mahout in Action (MiA), so the convergence threshold is .1
# instead of the default value of .5, because cosines lie between 0 and 1
# (originally I used the default value, but changing it had no effect on the unexpected results).
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p reuters-kmeans-clusters/clusteredPoints/
mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o reuters-matrix
#
# the prior step had 21578 rows and 41807 columns
# 41807 came from the prior step columns output
# 10 most similar docs to each doc in the collection
#
mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess

2.
env vars are as follows:

MAHOUT_LOCAL=yes
TERM_PROGRAM=Apple_Terminal
MAHOUT_HOME=/Users/scottccote/mahout
TERM=xterm-256color
SHELL=/bin/bash
TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
HADOOP_PREFIX=/Users/scottccote/hadoop
TERM_PROGRAM_VERSION=309
TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
USER=scottccote
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0:0
Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
PWD=/Users/scottccote/Documents/toy-workspace/MiA
HADOOP_VERSION=1.1.2
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
EDITOR=/usr/bin/vi
HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
LANG=en_US.UTF-8
HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk

3. Software/OS Version Info:

version of mahout is (property of pom.xml in mahout home): 0.8

version of java (java -version):
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

Version of os (uname -a):
Darwin Scotts-MacBook-Air.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64

On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>I don't see a need for uploading ur commands. Clean up HDFS (both output
>and temp folders) and try running the 5 steps again - extract reuters,
>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>
>Please use '-ow' option while running each of the jobs.
>
>
>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
>
>I manually deleted the temp folder too (After 2 failed starts).
>
>Would it be helpful for me to upload my shells that encapsulate all of the
>commands posted on the tour? They reflect the current state of reuters
>and .8 mahout.
>And if I did - how would I do it?
>
>Thanks,
>
>SCott
>
>
>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>Yep, that's what has happened in ur case. the wiki doesn't have but
>>please specify the -ow (overwrite) option while running the
>>RowsimilarityJob. That should clear up both the output and temp folders
>>before running the job.
>>
>>
>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>><suneel_mar...@yahoo.com> wrote:
>>
>>Haha... that could explain it, Rowsimilarityjob creates temp files during
>>execution. If ur laptop 'sleeped' then the temp files still persist and
>>running the job again wouldn't overwrite the old temp files (i need to
>>verify that).
>>
>>It should be good enough to run the Rowsimilarity job again.
>>
>>
>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>>
>>Suneel,
>>
>>I'm going to do the similarity part of the tour over - my laptop was
>>"sleeped" in the middle of the run of the rowsimilarity job.
>>Maybe the job is sensitive to that …. :( Normally - a server would not
>>go to sleep nor would it run
>>in local mode.
>>
>>Sorry that I didn't think of that sooner.
>>Will let you know my outcome.
>>
>>Am planning on redoing by deleting the contents and the folder titled
>>"reuters-similarity"
>>
>>Please let me know if that is not good enough.
>>
>>Thanks again.
>>
>>SCott
>>
>>
>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>What you are seeing is the output matrix of the RowSimilarity job. You
>>>are right there should be 21578 documents only in the reuters corpus.
>>>
>>>a) How many documents do you have in your docIndex? DocIndex is one of
>>>the artifacts of the RowIDJob and should have been executed prior to the
>>>RowSimilarity Job. You can run seqdumper on docIndex to see the output.
>>>
>>>b) Also what was the message at the end of the RowId job. It should read
>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>reuters-matrix/matrix'.
>>>
>>>
>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>>
>>>All,
>>>
>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>analysis using the Mahout command line". Thank you to whoever
>>>contributed to that page.
>>>
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>
>>>Went all the way from beginning to end of the page with "seemingly" no
>>>hiccups.
>>>At the very end of the "tour", I became confused because the command:
>>>
>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>
>>>Allowed me to see output (snippet)
>>>
>>>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>
>>>Reading through that snippet of data made me think that there exists a
>>>document with row id 41154 with cosine value of ~0.0658 (the last
>>>element in the snippet).
>>>
>>>The problem is that the folder
>>>
>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>
>>>Only has 21578 files in it. Indeed, my dictionary file (output command
>>>used shown below)
>>>
>>>> mahout seqdumper -i reuters-matrix/docIndex | tail
>>>
>>>Has a max key of
>>>
>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>> Count: 21578
>>>
>>>So I cannot find the document with key value 41154. What does the
>>>41154 relate to?
>>>
>>>Obviously I have misunderstood something that I did or need to do in
>>>the tour. Can someone please shine a light on where I strayed? I have
>>>scripted every step that I took and can share them here if desired (I
>>>noticed that some of the output file names changed since the page was
>>>written so I made adjustments).
>>>
>>>Regards,
>>>
>>>SCott
>>>
>>>PS Thanks TD for helping me earlier
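
One way to sanity-check the numbers quoted in this thread is to compare the largest index that appears inside a reuters-matrix/matrix row against both the docIndex count (21578) and the column count reported for the rowid step (41807). The Python sketch below does that for seqdumper-style text lines of the shape shown above. The helper names (parse_row, max_index, describe) are hypothetical and not part of Mahout. The underlying reading, that the numbers inside the braces are column positions rather than document keys, is an assumption that happens to fit the reported row and column counts; nothing in the thread confirms it.

```python
import re

# One seqdumper text line for a matrix row, in the shape quoted in the thread:
#   Key: <row>: Value: <doc name>:{<index>:<weight>,<index>:<weight>,...,}
ROW_RE = re.compile(r"^Key:\s*(\d+):\s*Value:\s*(\S+?):\{(.*)\}\s*$")


def parse_row(line):
    """Return (row_key, doc_name, [indices]), or None if the line is not a row."""
    match = ROW_RE.match(line.strip())
    if match is None:
        return None
    pairs = [p for p in match.group(3).split(",") if p.strip()]
    indices = [int(p.split(":", 1)[0]) for p in pairs]
    return int(match.group(1)), match.group(2), indices


def max_index(lines):
    """Largest index found in any parsed row, or -1 if no row has entries."""
    best = -1
    for line in lines:
        row = parse_row(line)
        if row is not None and row[2]:
            best = max(best, max(row[2]))
    return best


def describe(index, n_rows, n_cols):
    """Say which of the two reported ranges the index fits into."""
    if index < n_rows:
        return "fits both the row range and the column range"
    if index < n_cols:
        return "exceeds the row count but fits the column range"
    return "exceeds both the row count and the column count"


# Abbreviated copy of the "Key: 2" row from the thread (most entries omitted).
SAMPLE = ("Key: 2: Value: /reut2-000.sgm-10.txt:"
          "{1534:0.22690468189202942,40723:0.20174866606511677,"
          "41625:0.09877188897003744,}")

if __name__ == "__main__":
    largest = max_index([SAMPLE])
    # 21578 and 41807 are the row and column counts reported in the thread.
    print(largest, describe(largest, n_rows=21578, n_cols=41807))
```

To run it against real output, the seqdumper text would first be saved to a file and its lines passed to max_index; lines that are not matrix rows (log lines, the Count line, docIndex entries) are skipped by parse_row.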