Re: unexpected results in seqdump of reuters-matrix in quick tour of text analysis

Suneel Marthi Fri, 20 Dec 2013 16:30:15 -0800

You could use clusterdump to see the output of your clusters.

Eg:


  $MAHOUT clusterdump \
    -i ${WORK_DIR}/reuters-kmeans/clusters-*-final \
    -o ${WORK_DIR}/reuters-kmeans/clusterdump \
    -d ${WORK_DIR}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 \
    -dt sequencefile -b 100 -n 20 --evaluate \
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure -sp 0 \
    --pointsDir ${WORK_DIR}/reuters-kmeans/clusteredPoints

I am assuming you ran k-means clustering; if so, the clusters won't 
overlap. You would see cluster overlap if you were to run fuzzy k-means clustering.
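
To illustrate why k-means clusters do not overlap while fuzzy k-means clusters can, here is a minimal Python sketch. It is not Mahout code: the function names, the toy 2-D vectors, and the fuzziness value m=2.0 are assumptions chosen for illustration, and the membership formula is the textbook fuzzy c-means one, which may differ in detail from Mahout's implementation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def hard_assignment(point, centroids):
    """k-means style: index of the single nearest centroid (no overlap)."""
    distances = [cosine_distance(point, c) for c in centroids]
    return distances.index(min(distances))

def fuzzy_memberships(point, centroids, m=2.0):
    """Fuzzy style: one membership degree per centroid, summing to 1."""
    # Clamp distances away from zero to avoid division by zero.
    distances = [max(cosine_distance(point, c), 1e-12) for c in centroids]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
            for d_i in distances]

# Hypothetical toy data: two centroids and one point between them.
centroids = [[1.0, 0.0], [0.0, 1.0]]
point = [0.8, 0.6]

print(hard_assignment(point, centroids))    # one cluster index only
print(fuzzy_memberships(point, centroids))  # a degree for every cluster
```

With hard assignment each document lands in exactly one cluster; with fuzzy memberships the same document carries a nonzero weight in several clusters, which is where overlap comes from.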





On Friday, December 20, 2013 7:06 PM, Scott C. Cote <scottcc...@gmail.com> 
wrote:
 
Suneel,

Thank you for your help.  :)   Thought I was completely in the ditch.

If you are interested: inline with your comments are demonstrations that I
finally have it  (and the commands that I used)….

YAQ (Yet another question):
How do I see with the dumper the documents that belong in a given cluster?

I issued the command:  mahout seqdumper -i
reuters-kmeans-clusters/clusters-3-final/part-r-00000

Which yields data like:

Input Path: part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.iterator.ClusterWritable
Key: 0: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 1: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
Key: 2: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@28b301f2
…
Key: 19: Value: 
org.apache.mahout.clustering.iterator.ClusterWritable@193936e1
Count: 20


Was hoping to see something that associated a centroid/cluster with its
members.  
Given that there are 20 centroids, how do I break out the files into say:
20 folders - one folder per centroid so that I know their associations
(I'm assuming that the clusters don't overlap).  Or - is there a sequence
file that is generated somewhere that definitively associates the vectors
with each cluster? 

Here is what I do know:
I know that the clusters are not given names and it is suggested that we
use the top terms of the cluster to define a name.

According to the tour, I should be able to see a likelihood that a given
vector is in a cluster.  But

mahout seqdumper -i reuters-kmeans-clusters/clusteredPoints/part-m-00000 |
more

Yields:

Input Path: part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043,
291:0.084, 1411:0.083, 1421:0.087, 1451:0.085, 1456:0.092, 1457:0.092,
1462:0.135, 1512:0.070, 1543:0.104, 2962:0.037
….


which does NOT look like the output in the tour (did I miss something
again?).   But I'll try to interpret the output as saying vector with key
62 has a cosine distance of .085 from key 10266 - is that right?

What do I need to look at? - MiA sheds no light on this part that I have
found.  NOTE:  I wrote a very simple - non scalable k-means java routine
that found the clusters in a set of points (2 dimensional) and tracked
which point belongs to which cluster (no overlap).  Want to do the same
with Mahout.
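
One possible way to get a per-cluster document list is to post-process the text that seqdumper prints for clusteredPoints. The Python sketch below rests on several assumptions: that the key on each line (10266 above) is the id of the cluster the vector was assigned to, that the number after "Value:" is the assignment weight, that the bracketed pairs are term-index:weight entries of the document vector, and that each record sits on one unwrapped line. The function name and the second and third sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Matches e.g. "Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, ..."
POINT_LINE = re.compile(r"^Key: (\d+): Value: ([0-9.eE+-]+): (\S+) = \[")

def group_by_cluster(lines):
    """Return {cluster_id: [document names]} from seqdumper text lines."""
    clusters = defaultdict(list)
    for line in lines:
        match = POINT_LINE.match(line)
        if match:
            cluster_id, _weight, doc_name = match.groups()
            clusters[int(cluster_id)].append(doc_name)
    return dict(clusters)

# First line is taken from the dump above (truncated); the others are hypothetical.
sample = [
    "Key: 10266: Value: 1.0: /reut2-000.sgm-0.txt = [62:0.085, 222:0.043]",
    "Key: 10266: Value: 1.0: /reut2-000.sgm-7.txt = [291:0.084]",
    "Key: 3044: Value: 1.0: /reut2-000.sgm-1.txt = [1411:0.083]",
]

for cluster_id, docs in group_by_cluster(sample).items():
    print(cluster_id, docs)
```

The resulting dictionary could then drive copying each file into a folder named after its cluster id.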

Looking forward to your response to get me over this next hump ….

SCott

On 12/20/13 4:30 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

>Sorry Scott I should have looked at this more closely. I apologize.
>
>1. You are doing a seqdumper of the matrix (which is generated by the
>rowid job and is not the output of the rowsimilarity job).
>
>     The rowid job generates an MxN matrix, where M is the number of
>documents and N is the number of terms in the dictionary.
>
>    The value of a cell in the matrix is the tf-idf weight of the term.
>
>     So in the following output:
>
>     {Code}
>
>
>      
>Key: 2: Value: 
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,
>41625:0.09877188897003744,}
>
>{Code}
>
>means that, for document 2, what follows are term-index:tf-idf weight pairs.
>
>To see the term corresponding to 41625 look at dictionary.file-0 for the
>corresponding key.
>
>Hope that clarifies and clears the confusion here.

To your point, a dump of the dictionary sequence file coupled with a tail
shows:

Dec 20, 2013 4:38:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Program took 1078 ms (Minutes: 0.017966666666666666)
Key: zuccherifici: Value: 41798
Key: zuckerman: Value: 41799
Key: zuercher: Value: 41800
Key: zulia: Value: 41801
Key: zurich: Value: 41802
Key: zurn: Value: 41803
Key: zverev: Value: 41804
Key: zweig: Value: 41805
Key: zy: Value: 41806
Count: 41807


This is what I get for only looking at the beginning of the file and not
really taking the time to understand the nature of the file.
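
As a small illustration of the lookup described above (finding the term behind an index such as 41625), the dictionary dump can be inverted into an index-to-term map. This is a Python sketch under the assumption that every dictionary record is printed as "Key: <term>: Value: <index>" on its own line, as in the tail shown above; the function name is made up.

```python
import re

# Matches e.g. "Key: zurich: Value: 41802"
DICT_LINE = re.compile(r"^Key: (.+): Value: (\d+)$")

def index_to_term(lines):
    """Invert seqdumper output of dictionary.file-0 into {index: term}."""
    mapping = {}
    for line in lines:
        match = DICT_LINE.match(line.strip())
        if match:
            mapping[int(match.group(2))] = match.group(1)
    return mapping

# Lines copied from the dictionary tail above.
dump_tail = [
    "Key: zulia: Value: 41801",
    "Key: zurich: Value: 41802",
    "Key: zurn: Value: 41803",
    "Count: 41807",
]

terms = index_to_term(dump_tail)
print(terms[41802])
```

Since the dictionary holds 41807 entries while docIndex holds 21578, an index like 41625 can only be a term (column) index, not a document (row) id.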



>
>2.  In order to see the most similar documents for a given document you
>should be looking at a seqdumper of the output from rowsimilarity, which
>in your case would be the output in reuters-similarity.  That should give
>the 10 most similar documents and their cosine similarities to the
>referenced document.


mahout seqdumper -i reuters-similarity/part-r-* | more

Yields

Input Path: reuters-similarity/part-r-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.math.VectorWritable
Key: 0: Value: 
{0:0.9999999999999999,13611:0.17446750688012366,13430:0.15853208358190823,1
7520:0.19351644052283437,18330:0.15898358188286904,4411:0.20851636244169733
,13403:0.1663674094837415,14458:0.17265033919444714,14613:0.153651769452232
38,11399:0.19745333923929734}
Key: 1: Value: 
{9858:0.32081902404236906,9704:0.2485999435029943,9833:0.30851564542610826,
19789:0.37458607189215337,10056:0.2885413911200995,10601:0.2598640283997712
4,11858:0.3057183602839999,17412:0.30330496505095894,1:0.9999999999999998,9
702:0.26198579353949075}
Key: 2: Value: 
{2:1.0000000000000004,1087:0.28125327148896956,10390:0.2690057046963114,100
22:0.27668518648436297,6746:0.26969982074464605,12886:0.27032675431539793,1
3168:0.25889934686395943,997:0.26225673856545156,1392:0.2673559453473729,20
614:0.3009916279814217}
…..
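
To read a row of this output, one option is to parse it and rank the neighbours by score. The Python sketch below uses the "Key: 2" row from above, re-joined onto one line on the assumption that the breaks in the email are only mail wrapping. It also assumes the values are cosine similarities (the row's own entry is close to 1.0), so higher means more similar. The function name is made up.

```python
import re

# Matches e.g. "Key: 2: Value: {2:1.0,1087:0.28,...}"
ROW = re.compile(r"^Key: (\d+): Value: \{(.*)\}$")

def most_similar(line, top_n=3):
    """Return (row_id, [(doc_id, score), ...]) sorted by descending score."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError("not a seqdumper vector line")
    row_id = int(match.group(1))
    pairs = []
    for entry in match.group(2).split(","):
        if not entry:
            continue  # tolerate a trailing comma
        doc_id, score = entry.split(":")
        if int(doc_id) != row_id:  # skip the document's entry for itself
            pairs.append((int(doc_id), float(score)))
    pairs.sort(key=lambda p: p[1], reverse=True)
    return row_id, pairs[:top_n]

row_2 = ("Key: 2: Value: {2:1.0000000000000004,1087:0.28125327148896956,"
         "10390:0.2690057046963114,10022:0.27668518648436297,"
         "6746:0.26969982074464605,12886:0.27032675431539793,"
         "13168:0.25889934686395943,997:0.26225673856545156,"
         "1392:0.2673559453473729,20614:0.3009916279814217}")

row_id, neighbours = most_similar(row_2)
print(row_id, neighbours)
```

The ids returned here are row ids, so they would be looked up in reuters-matrix/docIndex to recover the file names.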




:)




>
>There's an error on the wiki link instructions, the seqdumper should have
>been on rowsimilarity/part-r-* and not on matrix/matrix for determining
>similar documents.
>
>Hope this helps. Sorry again for the confusion.
>
>    
>
>
>
>
>
>
>On Friday, December 20, 2013 4:51 PM, Scott C. Cote
><scottcc...@gmail.com> wrote:
> 
>Suneel and others,
>
>I am still getting the strange results when I do the tour. Suneel: I
>manually wiped out the temp folder and also deleted the reuters-XXX
>folders.  
>Also, per your advice I added the -ow option to all of the commands.
>NOTE: The step to create a matrix would NOT take a -ow option
>
>I have tried again, and am still seeing references to documents that do
>not exist.
>
>The tail end of reuters-matrix/docIndex looks like (mahout seqdumper -i
>reuters-matrix/docIndex | tail) :
>
>INFO: Program took 1077 ms (Minutes: 0.01795)
>Key: 21569: Value: /reut2-021.sgm-91.txt
>Key: 21570: Value: /reut2-021.sgm-92.txt
>Key: 21571: Value: /reut2-021.sgm-93.txt
>Key: 21572: Value: /reut2-021.sgm-94.txt
>Key: 21573: Value: /reut2-021.sgm-95.txt
>Key: 21574: Value: /reut2-021.sgm-96.txt
>Key: 21575: Value: /reut2-021.sgm-97.txt
>Key: 21576: Value: /reut2-021.sgm-98.txt
>Key: 21577: Value: /reut2-021.sgm-99.txt
>Count: 21578
>
>
>
>And the following snippet exists inside reuters-matrix/matrix and
>references key 41625 (which is larger than any key in docindex).
>
>Key: 2: Value: 
>/reut2-000.sgm-10.txt:{1534:0.22690468189202942,2594:0.2600104057711044,
>2962:0.08824623819754489,3555:0.09541425900872381,3947:0.11560540405210848,
>5405:0.11298345900879188,5997:0.03517426202612014,6734:0.3106030081260242,
>6890:0.14736266329145098,8991:0.3106030081260242,9010:0.3015597218796236,
>9260:0.13645477653417049,10631:0.2797706893700179,14440:0.13388804477098434,
>14714:0.13299090204195838,16816:0.1779629918379883,19031:0.12390416915718179,
>19738:0.1653201025120046,20362:0.08836103415407508,21961:0.1633235657199497,
>22224:0.1442082289933512,22556:0.10371188492300307,22892:0.18501603682040638,
>23063:0.06357107330586896,23218:0.13920493300455258,25480:0.07227736143348777,
>25502:0.1346557244620066,27862:0.13938509003289187,29413:0.14793996321569253,
>30234:0.12729058617007422,30567:0.11144231373666175,31946:0.10515281138396118,
>33426:0.102140161371664,34782:0.03562376273049143,36387:0.1397621771750777,
>38507:0.12025741706914195,40723:0.20174866606511677,
>41625:0.09877188897003744,}
>
>--->>>>> So in this email, I have listed the following pieces of
>information: 1. Commands, 2. Env vars, 3. Sw version info
>
>Again, thank you in advance for your help.
>
>Scott
>
>INFO Below:
>
>1. sequence of commands with relevant logged output points (omitted the
>sequence dump commands):
>
>mv reuters xreuters
>rm -r temp
>
>rm -r reuters-*
>mv xreuters reuters
>mvn -e -q exec:java \
>  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
>  -Dexec.args="reuters/ reuters-extracted/"
>mahout seqdirectory -c UTF-8 -i reuters-extracted/ -o reuters-seqfiles -ow
>mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors/ -ow -chunk 100 \
>  -x 90 -seq -ml 50 -n 2 -nv
>#
># Added the -cd option per the instructions in Mahout in Action (MiA), so the
># convergence threshold is 0.1 instead of the default value of 0.5, because
># cosines lie between 0 and 1. (Originally this was the default value, but
># changing it had no effect on the unexpected results.)
>#
>mahout kmeans -i reuters-vectors/tfidf-vectors/ \
>  -c reuters-kmeans-centroids -cl -ow -o reuters-kmeans-clusters -k 20 -x 10 \
>  -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>mahout clusterdump -d reuters-vectors/dictionary.file-0 -dt sequencefile \
>  -i reuters-kmeans-clusters/clusters-3-final -n 20 -b 100 -o cdump.txt \
>  -p reuters-kmeans-clusters/clusteredPoints/
>
>mahout rowid -i reuters-vectors/tfidf-vectors/part-r-00000 \
>  -o reuters-matrix
>#
># the prior step had 21578 rows and 41807 columns
># 41807 came from the prior step columns output
># 10 most similar docs to each doc in the collection
>#
>mahout rowsimilarity -i reuters-matrix/matrix -ow -o reuters-similarity \
>  -r 41807 --similarityClassname SIMILARITY_COSINE -m 10 -ess
>
>
>
>
>2. env vars are as follows:
>
>MAHOUT_LOCAL=yes
>TERM_PROGRAM=Apple_Terminal
>MAHOUT_HOME=/Users/scottccote/mahout
>TERM=xterm-256color
>SHELL=/bin/bash
>TMPDIR=/var/folders/ym/9dhjygdj3mz8ys73_2r2rc500000gn/T/
>Apple_PubSub_Socket_Render=/tmp/launch-82C1fm/Render
>HADOOP_PREFIX=/Users/scottccote/hadoop
>TERM_PROGRAM_VERSION=309
>TERM_SESSION_ID=A5B10188-433E-419A-A263-65BDDEABB9CF
>USER=scottccote
>COMMAND_MODE=unix2003
>SSH_AUTH_SOCK=/tmp/launch-XEgaqv/Listeners
>__CF_USER_TEXT_ENCODING=0x1F5:0:0
>Apple_Ubiquity_Message=/tmp/launch-N1BDIz/Apple_Ubiquity_Message
>PATH=/opt/local/bin:/opt/local/sbin:/usr/local/mysql/bin:/opt/local/bin:/opt/local/sbin:/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Users/scottccote/hadoop/bin:/Users/scottccote/hadoop/sbin:/Users/scottccote/mahout/bin:/Users/scottccote/mongodb/bin
>PWD=/Users/scottccote/Documents/toy-workspace/MiA
>HADOOP_VERSION=1.1.2
>JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
>EDITOR=/usr/bin/vi
>HADOOP_CONF_DIR=/Users/scottccote/hadoop/conf
>LANG=en_US.UTF-8
>HADOOP_OPTS=-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk
>
>
>
>
>
>3. Software/OS Version Info:
>version of mahout is (property of pom.xml in mahout home): 0.8
>
>version of java (java -version): java version "1.6.0_65", Java(TM) SE
>Runtime Environment (build 1.6.0_65-b14-462-11M4609),Java HotSpot(TM)
>64-Bit Server VM (build 20.65-b04-462, mixed mode)
>
>Version of os (uname -a): Darwin Scotts-MacBook-Air.local 12.5.0 Darwin
>Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013;
>root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
>
>
>
>
>
>On 12/19/13 1:08 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>
>>I don't see a need for uploading your commands.  Clean up HDFS (both output
>>and temp folders) and try running the 5 steps again - extract reuters,
>>seqdirectory, seq2sparse, rowid job, rowsimilarity job.
>>
>>Please use the '-ow' option while running each of the jobs.
>>
>>
>>
>>
>>
>>
>>
>>On Thursday, December 19, 2013 2:04 PM, Scott C. Cote
>><scottcc...@gmail.com> wrote:
>> 
>>I manually deleted the temp folder too (After 2 failed starts).
>>
>>Would it be helpful for me to upload my shell scripts that encapsulate all
>>of the commands posted on the tour?  They reflect the current state of
>>Reuters and Mahout 0.8.
>>And if I did - how would I do it?
>>
>>Thanks,
>>
>>SCott
>>
>>
>>On 12/19/13 1:00 PM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>
>>>Yep, that's what has happened in your case. The wiki doesn't mention it, but
>>>please specify the -ow (overwrite) option while running the
>>>RowSimilarityJob. That should clear up both the output and temp folders
>>>before running the job.
>>>
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 1:50 PM, Suneel Marthi
>>><suneel_mar...@yahoo.com> wrote:
>>> 
>>>Haha... that could explain it. RowSimilarityJob creates temp files during
>>>execution. If your laptop went to sleep, the temp files would still persist,
>>>and running the job again wouldn't overwrite the old temp files (I need to
>>>verify that).
>>>
>>>It should be good enough to run the RowSimilarity job again.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>On Thursday, December 19, 2013 1:46 PM, Scott C. Cote
>>><scottcc...@gmail.com> wrote:
>>> 
>>>Suneel,
>>>
>>>I'm going to do the similarity part of the tour over - my laptop went to
>>>sleep in the middle of the run of the rowsimilarity job.
>>>Maybe the job is sensitive to that ….  :(  Normally, a server would not
>>>go to sleep, nor would it run in local mode.
>>>
>>>Sorry that I didn't think of that sooner.
>>>Will let you know my outcome.
>>>
>>>Am planning on redoing by deleting the contents and the folder titled
>>>"reuters-similarity"
>>>
>>>Please let me know if that is not good enough.
>>>
>>>Thanks again.
>>>
>>>SCott
>>>
>>>
>>>On 12/19/13 11:53 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:
>>>
>>>>What you are seeing is the output matrix of the RowSimilarity job.  You
>>>>are right, there should be only 21578 documents in the Reuters corpus.
>>>>
>>>>a) How many documents do you have in your docIndex?  DocIndex is one of
>>>>the artifacts of the RowIdJob, which should have been executed prior to
>>>>the RowSimilarity job. You can run seqdumper on docIndex to see the output.
>>>>
>>>>b) Also, what was the message at the end of the RowId job? It should read
>>>>something like 'Wrote out matrix with 21578 rows and 19515 columns to
>>>>reuters-matrix/matrix'.
>>>>
>>>>
>>>>
>>>>
>>>>On Thursday, December 19, 2013 12:14 PM, Scott C. Cote
>>>><scottcc...@gmail.com> wrote:
>>>> 
>>>>All,
>>>>
>>>>I am a newbie Mahout user and am trying to use the "Quick tour of text
>>>>analysis using the Mahout command line".  Thank you to whoever contributed
>>>>to that page.
>>>>
>>>>> https://cwiki.apache.org/confluence/display/MAHOUT/Quick+tour+of+text+analysis+using+the+Mahout+command+line
>>>>
>>>>Went all the way from beginning to end of the page with "seemingly" no
>>>>hiccups.
>>>>At the very end of the "tour", I became confused because the command:
>>>>
>>>>> mahout seqdumper -i reuters-matrix/matrix | more
>>>>
>>>>Allowed me to see output (snippet)
>>>>
>>>>> Key: 1: Value:
>>>>> /reut2-000.sgm-1.txt:{312:0.1250488193181003,2962:0.07532412503846121,
>>>>> 4403:0.22792379999043863,5405:0.0964390139170019,5997:0.030023608542497426,
>>>>> 10108:0.12628552842745744,13043:0.14709923014699935,13653:0.07372109235301716,
>>>>> 13750:0.1888955967611108,15886:0.1543819831189062,15901:0.10756083643096839,
>>>>> 15969:0.36601581899071867,16138:0.12548750176412274,16553:0.11490460601515046,
>>>>> 17734:0.10869648237816114,17978:0.11932381316475806,18019:0.1051527785317777,
>>>>> 22224:0.12309146422711122,22456:0.1371221887995933,22837:0.19295627853659875,
>>>>> 25480:0.061693610076373216,25958:0.09251293588851367,26105:0.10304941346400417,
>>>>> 26507:0.12327184002913602,28332:0.1794774670703689,28335:0.10843140748339948,
>>>>> 28480:0.08018737549811794,29541:0.11169278315306423,30534:0.18480378614987836,
>>>>> 30921:0.1987470224449987,31071:0.17024007142554856,31386:0.22792379999043863,
>>>>> 31433:0.1478802530196623,31815:0.06001469365693789,32099:0.1284458798636675,
>>>>> 32334:0.10973793576935256,32385:0.12143572490835457,34782:0.030407287755940444,
>>>>> 35425:0.035819767691229826,37264:0.20518922008525398,37355:0.2879544482952078,
>>>>> 37818:0.10819820350102567,39273:0.10347873039101099,39831:0.08810699655751153,
>>>>> 39979:0.09528250026282217,40427:0.18975048184863322,41154:0.06582064373931332,}
>>>>
>>>>Reading through that snippet of data made me think that there exists a
>>>>document with row id 41154 with cosine value of ~0.0658 (the last element
>>>>in the snippet).
>>>>
>>>>The problem is that the folder
>>>>
>>>>> /Users/scottccote/Documents/toy-workspace/MiA/reuters-extracted
>>>>
>>>>Only has 21578 files in it.  Indeed, my dictionary file (the command used
>>>>to produce the output is shown below)
>>>>
>>>>> mahout seqdumper -i reuters-matrix/docIndex  | tail
>>>>
>>>>Has a max key of
>>>>
>>>>> Key: 21576: Value: /reut2-021.sgm-98.txt
>>>>> Key: 21577: Value: /reut2-021.sgm-99.txt
>>>>> Count: 21578
>>>>
>>>>So I cannot find the document with key value 41154.  What does the 41154
>>>>relate to?
>>>>
>>>>Obviously I have misunderstood something that I did or need to do in the
>>>>tour.  Can someone please shine a light on where I strayed?  I have scripted
>>>>every step that I took and can share them here if desired (I noticed that
>>>>some of the output file names changed since the page was written, so I made
>>>>adjustments).
>>>>
>>>>Regards,
>>>>
>>>>SCott  
>>>>
>>>>PS  Thanks TD for helping me earlier
