What does the data in cdump.txt represent? Can you point me in the right


>Sorry Scott I should have looked at this more closely. I apologize.
>1. You are doing a seqdumper of the matrix (which is generated from the
>rowid job and is not the output of the rowsimilarity job).
>     The RowId job generates an M x N matrix, where M is the number of
>documents and N is the number of terms.
>    The value of a cell in the Matrix is the tf-idf weight of the term.
>     So in the following output:
>Key: 2: Value: 
>means that for document 2, what follows are the term:tf-idf weight pairs.
>To see the term corresponding to 41625 look at dictionary.file-0 for the
>corresponding key.
>Hope that clarifies and clears the confusion here.
>2.  To see the most similar documents for a given document, you
>should be looking at a seqdumper of the output from rowsimilarity, which
>in your case would be the output in reuters-similarity.  That should give
>the 10 most similar documents and their cosine similarities to the
>referenced document.
>There's an error in the instructions on the wiki page: the seqdumper should
>have been on rowsimilarity/part-r-* and not on matrix/matrix for determining
>similar documents.
>Hope this helps. Sorry again for the confusion.
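To make both points concrete, here is a minimal shell sketch that splits one dumped row into index/weight pairs and sorts them by weight. It assumes each row is printed as `Key: <row>: Value: {<index>:<weight>,...}`; the helper name `top_entries` and the sample line are made up for illustration and are not Mahout output. For a row of matrix/matrix the index would be a term id (look it up in dictionary.file-0); for a row of the rowsimilarity output it would be a document id (look it up in docIndex).

```shell
# top_entries is a hypothetical helper, not part of Mahout.
# It reads one seqdumper-style vector line on stdin, assumed to look like
#   Key: <row>: Value: {<index>:<weight>,<index>:<weight>,...}
# and prints "<index> <weight>" pairs, largest weight first.
# Steps: keep the text between the braces, put one pair per line,
# drop empty fields left by a trailing comma, sort by weight descending.
# sort -g is used so weights in exponent notation also sort correctly.
top_entries() {
  sed -e 's/^.*{//' -e 's/}.*$//' |
    tr ',' '\n' |
    awk -F: 'NF == 2 { print $1, $2 }' |
    sort -k2,2gr
}

# Illustrative input only; these indices and weights are invented.
echo 'Key: 2: Value: {7:0.25,3:0.5,11:0.125,}' | top_entries
```

In practice the input would come from something like `mahout seqdumper -i reuters-matrix/matrix`, filtered down to the row of interest.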
>Suneel and others,
>I am still getting the strange results when I do the tour. Suneel: I
>manually wiped out the temp folder and also deleted the reuters-XXX
>Also, per your advice I added the -ow option to all of the commands.
>NOTE: The step to create a matrix would NOT take a -ow option
>I have tried again, and am still seeing references to documents that do
>not exist.
>The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i
>reuters-matrix/docIndex | tail) :
>INFO: Program took 1077 ms (Minutes: 0.01795)
>Key: 21569: Value: /reut2-021.sgm-91.txt
>Key: 21570: Value: /reut2-021.sgm-92.txt
>Key: 21571: Value: /reut2-021.sgm-93.txt
>Key: 21572: Value: /reut2-021.sgm-94.txt
>Key: 21573: Value: /reut2-021.sgm-95.txt
>Key: 21574: Value: /reut2-021.sgm-96.txt
>Key: 21575: Value: /reut2-021.sgm-97.txt
>Key: 21576: Value: /reut2-021.sgm-98.txt
>Key: 21577: Value: /reut2-021.sgm-99.txt
>Count: 21578
>And the following snippet exists inside reuters-matrix/matrix and
>references key 41625 (which is larger than any key in docIndex).
>Key: 2: Value: 
>--->>>>> So in this email, I have listed the following pieces of
>information: 1. Commands, 2. Env vars, 3. Sw version info
>Again, thank you in advance for your help.
>INFO Below:
>1. sequence of commands with relevant logged output points (omitted the
>sequence dump commands):
>mv reuters xreuters
>rm -r temp
>rm -r reuters-*
>mv xreuters reuters
>mvn -e -q exec:java
>-Dexec.args="reuters/ reuters-extracted/"
>mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
>mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100
>-x 90 -seq -ml 50 -n 2 -nv
># added the -cd option per instructions in Mahout in Action (MiA) so
>the convergence threshold is .1 (originally this was the default value, but it had no
>effect on the unexpected results)
>#       instead of default value of .5  because cosines lie within 0 and
>mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
>reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10
>-dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile
>-i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt -p
>mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 -o
># the prior step had 21578 rows and 41807 columns
># 41807 came from the prior step columns output
># 10 most similar docs to each doc in the collection
>mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity -r
>41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
>2. env vars are as follows:
>3. Software/OS Version Info:
>version of mahout is (property of pom.xml in mahout home): 0.8
>version of java (java -version): java version "1.6.0_65", Java(TM) SE
>Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
>64-Bit Server VM (build 20.65-b04-462, mixed mode)
>Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
>Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
>root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
>>I don't see a need for uploading your commands.  Clean up HDFS (both output
>>and temp folders) and try running the 5 steps again - extract reuters,
>>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>>Please use '-ow' option while running each of the jobs.
>>I manually deleted the temp folder too (After 2 failed starts).
>>Would it be helpful for me to upload my shells that encapsulate all of
>>commands posted on the tour?  They reflect the current state of reuters
>>and .8 mahout.
>>And if I did - how would I do it?
>>>Yep, that's what has happened in your case. The wiki doesn't have it, but
>>>please specify the -ow (overwrite) option while running the
>>>RowSimilarityJob. That should clear up both the output and temp folders
>>>before running the job.
>>>Haha... that could explain it. RowSimilarityJob creates temp files during
>>>execution. If your laptop 'sleeped' then the temp files still persist and
>>>running the job again wouldn't overwrite the old temp files (I need to
>>>verify that).
>>>It should be good enough to run the Rowsimilarity job again.
>>>I'm going to do the similarity part of the tour over - my laptop was
>>>"sleeped" in the middle of the run of the rowsimilarity job.
>>>Maybe the job is sensitive to that ….  :(  Normally - a server would not
>>>go to sleep nor would it run
>>>in local mode.
>>>Sorry that I didn't think of that sooner.
>>>Will let you know my outcome.
>>>Am planning on redoing by deleting the contents and the folder titled
>>>Please let me know if that is not good enough.
>>>Thanks again.
>>>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>>>are right there should be 21578 documents only in the reuters corpus.
>>>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>>>the artifacts of the RowIDJob and should have been executed prior to
>>>>RowSimilarity Job. You can run seqdumper on docIndex to see the output.
>>>>b) Also, what was the message at the end of the RowId job? It should be
>>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
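For question (a), a small shell sketch that pulls the largest key out of a seqdumper listing. It assumes each record is printed as `Key: <n>: Value: ...`, as in the docIndex output quoted in this thread; `max_key` is a hypothetical helper name, not a Mahout command. An index that exceeds this maximum would then have to refer to something other than a docIndex row.

```shell
# max_key is a hypothetical helper, not part of Mahout.
# It reads seqdumper-style lines on stdin and prints the largest integer key.
# Lines that do not start with "Key: <digits>: " (log lines, "Count: ...")
# are ignored. Prints 0 if no key line is seen.
max_key() {
  awk -F': ' '/^Key: [0-9]+: / { if ($2 + 0 > max) max = $2 + 0 }
              END { print max + 0 }'
}

# Sample input copied from the docIndex tail quoted in this thread.
printf '%s\n' \
  'Key: 21576: Value: /reut2-021.sgm-98.txt' \
  'Key: 21577: Value: /reut2-021.sgm-99.txt' \
  'Count: 21578' | max_key
```

Against the real data this would be fed from `mahout seqdumper -i reuters-matrix/docIndex`.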
>>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>>analysis using the Mahout command line" .  Thank you to whomever
>>>>to that page.
>>>>> +using+the+Mahout+command+line
>>>>Went all the way from beginning to end of the page with "seemingly" no
>>>>At the very end of the "tour", I became confused because the command:
>>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>>Allowed me to see output (snippet)
>>>>> Key: 1: Value:
>>> 26282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>>Reading through that snippet of data made me think that there exists a
>>>>document with row id 41154 with cosine value of  ~0.0658 (the last
>>>>the snippet).
>>>>The problem is that the folder
>>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>>Only has 21578 files in it.  Indeed, my dictionary file  (output
>>>>used shown below)
>>>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>>>Has a max key of
>>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>>> Count: 21578
>>>>does the
>>>>related to????
>>>>Obviously I have misunderstood something that I did - or need to do -
>>>>tour.  Can someone please shine a light on where I strayed?  I have
>>>>every step that I took and can share them here if desired (I noticed
>>>>some of the output file names changed since the page was written - so I
>>>>PS  Thanks TD for helping me earlier

