Clustering Suggestions

Adam Estrada Thu, 16 Jun 2011 13:26:16 -0700

All,

I am very new to Mahout, so please bear with me. I want to be able to get
usable topics from my data, so I pull from my Lucene index using a field
that was created from Solr. See below.


    <fieldType name="text_ws" class="solr.TextField"
               positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^a-zA-Z]" replacement=" " replace="all"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords_en.txt"
                enablePositionIncrements="true"/>
        <filter class="solr.LengthFilterFactory" min="2" max="999"/>
        <filter class="solr.PositionFilterFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.EnglishMinimalStemFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>

As you can see, it's pretty strict and creates single-word tokens at the
whitespace. My question is: how can I pull out "topics" the way the LDA
clustering algorithm suggests?
https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
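
To illustrate the single-word tokens this chain produces, here is a rough
Python approximation of the main steps. It is only a sketch, not the Solr
implementation: the stopword set is a made-up stand-in for stopwords_en.txt
(which is not shown here), the HTML stripping is crude, and the possessive,
protected-word, stemming, duplicate-removal and trim filters are left out.

```python
import re

# Hypothetical stand-in for stopwords_en.txt; the real file is not shown in this post.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}


def approximate_text_ws(text):
    """Roughly mimic the text_ws chain; stemming and protected words are omitted."""
    text = re.sub(r"<[^>]+>", " ", text)    # HTMLStripCharFilterFactory (crude)
    text = re.sub(r"[^a-zA-Z]", " ", text)  # PatternReplaceCharFilterFactory
    tokens = text.split()                   # WhitespaceTokenizerFactory
    tokens = [t.lower() for t in tokens]    # LowerCaseFilterFactory
    tokens = [t for t in tokens if t not in STOPWORDS]  # StopFilterFactory
    tokens = [t for t in tokens if 2 <= len(t) <= 999]  # LengthFilterFactory
    return tokens


print(approximate_text_ws("<p>The Bank of England raised interest rates.</p>"))
# prints ['bank', 'england', 'raised', 'interest', 'rates']
```

Every entry in the result is a single word, so any dictionary built from this
field will contain only single-word terms.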

I wrote the following script that is supposed to walk through the process
from soup to nuts but it is really only generating clusters of single words.
Is that the intended usage for this algorithm?

##
# create term vectors from lucene
##
#./mahout lucene.vector \
#    --dir /home/ubuntu/Documents/data/index \
#    --output /home/ubuntu/Documents/part-out.vec \
#    --field translated --idField id \
#    --dictOut /home/ubuntu/Documents/dict.out \
#    --max 5000 --norm 2 -err 1

##
# Latent Dirichlet Allocation Clustering
##
#./mahout lda \
#    -i /home/ubuntu/Documents/part-out.vec \
#    -o /home/ubuntu/Documents/output/lda \
#    -k 25 -v 100000 -x 10 -ow

#./mahout ldatopics \
#    -i /home/ubuntu/Documents/output/lda/state-10 \
#    -o /home/ubuntu/Documents/output/ldatopics \
#    -d /home/ubuntu/Documents/dict.out

#./mahout clusterdump \
#    -s /home/ubuntu/Documents/output/lda/clusters-10 \
#    -o /home/ubuntu/Documents/output/ldatopics \
#    -d /home/ubuntu/Documents/dict.out \
#    -dt text -b 100 -n 25 \
#    -p /home/ubuntu/Documents/output/lda/clusteredPoints
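
For context on the single-word output: LDA models each topic as a weighting
over the terms in the dictionary, so listing a topic amounts to ranking
dictionary entries by weight. The toy sketch below uses a made-up dictionary
and made-up weights (nothing here comes from a real Mahout run, and
`top_terms` is a hypothetical helper, not a Mahout function) to show the
shape of such a listing.

```python
# Hypothetical dictionary and topic-term weights; not taken from any real Mahout run.
dictionary = ["bank", "england", "interest", "rate", "match", "goal"]
topic_weights = [
    [0.30, 0.10, 0.35, 0.20, 0.03, 0.02],  # topic 0
    [0.02, 0.18, 0.05, 0.05, 0.30, 0.40],  # topic 1
]


def top_terms(weights, terms, n):
    """Return the n highest-weighted terms for one topic."""
    ranked = sorted(zip(weights, terms), reverse=True)
    return [term for _, term in ranked[:n]]


for topic_id, weights in enumerate(topic_weights):
    print(topic_id, top_terms(weights, dictionary, 3))
# prints:
# 0 ['interest', 'bank', 'rate']
# 1 ['goal', 'match', 'england']
```

With a dictionary of single-word tokens, each topic can only ever be a ranked
list of single words.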

Any tips on what I am doing wrong would be greatly appreciated. I am using
trunk Mahout that is modified to work with Lucene 3.2. I just changed the
Lucene version number in the build script.

Thanks,

Adam
