Suneel and others,

I am still getting strange results when I run the tour. Suneel: I
manually wiped out the temp folder and also deleted the reuters-XXX
folders.
Also, per your advice, I added the -ow option to all of the commands.
NOTE: The step that creates the matrix (rowid) would NOT take a -ow option.

I have tried again, and am still seeing references to documents that do
not exist.

The tail end of reuters-matrix/docIndex looks like this (mahout seqdumper -i
reuters-matrix/docIndex | tail):

INFO: Program took 1077 ms (Minutes: 0.01795)
Key: 21569: Value: /reut2-021.sgm-91.txt
Key: 21570: Value: /reut2-021.sgm-92.txt
Key: 21571: Value: /reut2-021.sgm-93.txt
Key: 21572: Value: /reut2-021.sgm-94.txt
Key: 21573: Value: /reut2-021.sgm-95.txt
Key: 21574: Value: /reut2-021.sgm-96.txt
Key: 21575: Value: /reut2-021.sgm-97.txt
Key: 21576: Value: /reut2-021.sgm-98.txt
Key: 21577: Value: /reut2-021.sgm-99.txt
Count: 21578
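
One way to double-check the docIndex key range is to scan a saved copy of the dump for the largest key. The sketch below is only an illustration: it assumes the seqdumper output has lines shaped like the ones above ("Key: N: Value: /path"), and the helper name max_doc_key is made up for this example; it is not part of Mahout.

```python
import re

# Matches docIndex dump lines such as "Key: 21577: Value: /reut2-021.sgm-99.txt".
# The line shape is copied from the output above; other Mahout versions may differ.
KEY_LINE = re.compile(r"^Key:\s*(\d+):\s*Value:\s*(\S+)\s*$")


def max_doc_key(lines):
    """Return (largest key, number of key lines) found in a docIndex dump."""
    keys = [int(m.group(1)) for m in map(KEY_LINE.match, lines) if m]
    if not keys:
        raise ValueError("no 'Key: N: Value: ...' lines found")
    return max(keys), len(keys)


# A few lines taken from the tail shown above (not the full dump).
sample = [
    "INFO: Program took 1077 ms (Minutes: 0.01795)",
    "Key: 21576: Value: /reut2-021.sgm-98.txt",
    "Key: 21577: Value: /reut2-021.sgm-99.txt",
    "Count: 21578",
]
largest, seen = max_doc_key(sample)
print(largest, seen)
```

With the full dump, keys starting at 0 and a largest key of 21577 would be consistent with the reported count of 21578.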



The following snippet comes from reuters-matrix/matrix and references
key 41625 (which is larger than any key in docIndex).

Key: 2: Value: /reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,38507:0.12025741706914195,40723:0.20174866606511677,41625:0.09877188897003744,}
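
A small check that may help narrow this down: pull the indices out of one dumped vector and compare the largest one against both the row count (21578) and the column count (41807) reported by the rowid step. This is a sketch under assumptions, not a diagnosis confirmed in this thread. It assumes the "index:value" pairs sit between the braces exactly as in the dump above, and the function names are hypothetical. If the largest index exceeds the row count but fits within the column count, one possible reading is that the indices are column positions rather than document keys.

```python
import re

# Pairs like "41625:0.0987..." inside the braces of a dumped vector.
PAIR = re.compile(r"(\d+):([0-9.eE+-]+)")


def vector_indices(dump_line):
    """Return the integer indices found between '{' and '}' in one dump line."""
    body = dump_line[dump_line.index("{") + 1 : dump_line.rindex("}")]
    return [int(index) for index, _ in PAIR.findall(body)]


def check_max_index(dump_line, num_rows, num_cols):
    """Report whether the largest index fits within the row or column count."""
    largest = max(vector_indices(dump_line))
    return {
        "max_index": largest,
        "fits_rows": largest < num_rows,
        "fits_cols": largest < num_cols,
    }


# Shortened copy of the "Key: 2" vector above (first and last entries only).
line = (
    "Key: 2: Value: /reut2-000.sgm-10.txt:"
    "{1534:0.22690468189202942,40723:0.20174866606511677,"
    "41625:0.09877188897003744,}"
)
result = check_max_index(line, num_rows=21578, num_cols=41807)
print(result)
```

For the shortened line, the largest index (41625) does not fit within 21578 rows but does fit within 41807 columns.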

In this email I have listed the following pieces of information:
1. Commands, 2. Env vars, 3. Software version info

Again, thank you in advance for your help.

Scott

INFO Below:

1. sequence of commands with relevant logged output points (omitted the
sequence dump commands):

mv reuters xreuters
rm -r temp

rm -r reuters-*
mv xreuters reuters
mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"
mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 \
  -x 90 -seq -ml 50 -n 2 -nv
#
# Added the -cd option per the instructions in Mahout in Action (MiA) so that
# the convergence threshold is 0.1 instead of the default value of 0.5,
# because cosines lie between 0 and 1. (Originally I used the default value;
# changing it had no effect on the unexpected results.)
#
mahout kmeans -i reuters-vectors/tfidf-vectors/ \
  -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 \
  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile \
  -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt \
  -p reuters-kmeans-clusters/clusteredPoints/

mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 \
  -o reuters-matrix
#
# The prior step (rowid) reported 21578 rows and 41807 columns.
# The 41807 passed to -r below is the column count from that output.
# -m 10 asks for the 10 most similar docs to each doc in the collection.
#
mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity \
  -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
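
Since the -r value is copied by hand from the rowid output, it could instead be extracted from the captured log. The sketch below assumes the summary line is worded as quoted further down in this thread ("Wrote out matrix with ... rows and ... columns to ..."). The sample log line is reconstructed from the row and column counts noted above rather than copied from a real log, and parse_rowid_summary is a hypothetical helper name.

```python
import re

# Wording assumed from the example message quoted in this thread.
ROWID_MSG = re.compile(
    r"Wrote out matrix with (\d+) rows and (\d+) columns to (\S+)"
)


def parse_rowid_summary(log_text):
    """Return (rows, columns, output path) from the rowid summary line."""
    match = ROWID_MSG.search(log_text)
    if match is None:
        raise ValueError("rowid summary line not found")
    return int(match.group(1)), int(match.group(2)), match.group(3)


# Illustrative line built from the counts noted above, not a captured log.
log = "INFO: Wrote out matrix with 21578 rows and 41807 columns to reuters-matrix/matrix"
rows, cols, path = parse_rowid_summary(log)
print(f"rows={rows} -r {cols} input={path}")
```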




2. env vars are as follows:

MAHOUT_LOCAL=yes
TERM_PROGRAM=Apple_Terminal
MAHOUT_HOME=/Users/scottccote/mahout
TERM=xterm-256color
SHELL=/bin/bash
TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
HADOOP_PREFIX=/Users/scottccote/hadoop
TERM_PROGRAM_VERSION=309
TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
USER=scottccote
COMMAND_MODE=unix2003
SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
__CF_USER_TEXT_ENCODING=0x1F5:0:0
Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
PWD=/Users/scottccote/Documents/toy-workspace/MiA
HADOOP_VERSION=1.1.2
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
EDITOR=/usr/bin/vi
HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
LANG=en_US.UTF-8
HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk





3. Software/OS Version Info:
version of mahout is (property of pom.xml in mahout home): 0.8

version of java (java -version): java version "1.6.0_65", Java(TM) SE
Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
64-Bit Server VM (build 20.65-b04-462, mixed mode)

Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64


 

On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>I don't see a need for uploading ur commands.  Clean up HDFS (both output
>and temp folders) and try running the 5 steps again - extract reuters,
>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>
>Please use '-ow' option while running each of the jobs.
>
>
>
>
>
>
>
>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
> 
>I manually deleted the temp folder too (After 2 failed starts).
>
>Would it be helpful for me to upload the shell scripts that encapsulate all
>of the commands posted on the tour?  They reflect the current state of
>reuters and Mahout 0.8.
>And if I did - how would I do it?
>
>Thanks,
>
>Scott
>
>
>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>Yep, that's what has happened in ur case. The wiki doesn't mention it, but
>>please specify the -ow (overwrite) option while running the
>>RowSimilarityJob. That should clear up both the output and temp folders
>>before running the job.
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>><suneel_mar...@yahoo.com> wrote:
>> 
>>Haha... that could explain it, Rowsimilarityjob creates temp files during
>>execution. If ur laptop 'sleeped' then the temp files still persist and
>>running the job again wouldn't overwrite the old temp files (i need to
>>verify that).
>>
>>It should be good enough to run the Rowsimilarity job again.
>>
>>
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>> 
>>Suneel,
>>
>>I'm going to do the similarity part of the tour over - my laptop was
>>"sleeped" in the middle of the run of the rowsimilarity job.
>>Maybe the job is sensitive to that ... :(  Normally, a server would not
>>go to sleep, nor would it run in local mode.
>>
>>Sorry that I didn't think of that sooner.
>>Will let you know my outcome.
>>
>>Am planning on redoing by deleting the contents and the folder titled
>>"reuters-similarity"
>>
>>Please let me know if that is not good enough.
>>
>>Thanks again.
>>
>>Scott
>>
>>
>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>>are right there should be 21578 documents only in the reuters corpus.
>>>
>>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>>the artifacts of the RowIDJob and should have been executed prior to the
>>>RowSimilarity Job. You can run seqdumper on docIndex to see the output.
>>>
>>>b) Also what was the message at the end of the RowId job. It should read
>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>reuters-matrix/matrix'.
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>> 
>>>All,
>>>
>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>analysis using the Mahout command line".  Thank you to whoever
>>>contributed to that page.
>>>
>>>> 
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>
>>>Went all the way from beginning to end of the page with "seemingly" no
>>>hiccups.
>>>At the very end of the "tour", I became confused because the command:
>>>
>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>
>>>Allowed me to see output (snippet)
>>>
>>>> Key: 1: Value: /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>
>>>Reading through that snippet of data made me think that there exists a
>>>document with row id 41154 and a cosine value of ~0.0658 (the last
>>>element in the snippet).
>>>
>>>The problem is that the folder
>>>
>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>
>>>Only has 21578 files in it.  Indeed, my docIndex file (dumped with the
>>>command shown below)
>>>
>>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>>
>>>Has a max key of
>>>
>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>> Count: 21578
>>>
>>>So I cannot find the document with key value 41154.  What does the
>>>41154 relate to?
>>>
>>>Obviously I have misunderstood something that I did - or need to do - in
>>>the tour.  Can someone please shine a light on where I strayed?  I have
>>>scripted every step that I took and can share the scripts here if desired
>>>(I noticed that some of the output file names changed since the page was
>>>written, so I made adjustments).
>>>
>>>Regards,
>>>
>>>Scott
>>>
>>>PS  Thanks TD for helping me earlier

