LDA should work with both tf and tfidf vectors.
If you generate tfidf vectors in Step 3 and feed them into Step 4,
LDA should work.
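
As a rough sketch of keeping Steps 3 and 4 in sync: the helper name vector_dir_for_weight below is made up for illustration, and it assumes seq2sparse writes its vectors under tf-vectors or tfidf-vectors depending on the weight, as the paths in this thread suggest. It only prints the two commands so the paths can be checked before running anything:

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): map the seq2sparse --weight
# value to the output subdirectory that rowid should read in Step 4.
vector_dir_for_weight() {
  case "$1" in
    tf)    echo "tf-vectors" ;;
    tfidf) echo "tfidf-vectors" ;;
    *)     echo "unknown weight: $1" >&2; return 1 ;;
  esac
}

# Pick the weight once and derive the Step 4 input path from it.
WEIGHT=tf
VECTOR_DIR=$(vector_dir_for_weight "$WEIGHT")

# Print the commands instead of running them; \$HDFS_PATH is left
# unexpanded on purpose so the output can be pasted into a shell later.
echo "mahout seq2sparse -i \$HDFS_PATH/sequenced -o \$HDFS_PATH/sparseVectors -ow --maxDFPercent 85 --namedVector --weight $WEIGHT"
echo "mahout rowid -i \$HDFS_PATH/sparseVectors/$VECTOR_DIR -o \$HDFS_PATH/matrix"
```

Setting WEIGHT=tfidf instead should give the other consistent pairing, with rowid reading from tfidf-vectors.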




On Monday, January 27, 2014 7:33 PM, Peyman Faratin <[email protected]> wrote:
 
Hi Suneel

I changed step 4 to tf-vectors (LDA needs tf vectors, not tfidf). It seems to
be working.

Thank you for your help, Suneel.

Peyman

 
On Jan 27, 2014, at 1:51 PM, Suneel Marthi <[email protected]> wrote:


>In Step 3, you are generating tf vectors but are expecting tf-idf vectors in Step 4.
>
>Change the weight in Step 3 to tfidf (which is the default, BTW, if none is
>specified).
>
>
>
>On Monday, January 27, 2014 1:44 PM, Ted Dunning <[email protected]> wrote:
> 
>I am forwarding this to the list for Peyman.
>
>
>-----------------------------------------------------------------
>
>I am trying to run CVB (Mahout 0.8) on a directory of plain text files,
>following the procedure outlined below. However, I am not able to see the
>vectordump output (step 6). When run without the "-c csv" flag, the generated
>file is empty. However, if I use the "-c csv" flag, the generated file starts
>with a series of numbers followed by an alphabetically organized series of
>unigrams (see below).
>
>
>#1,10,1163,12,121,13,14,141,1462,15,16,17,185,1901,197,2,201,2227,23,283,298,3,331,35,4,402,4351,445,5,57,58,6,68,7,9,987,a.m,ab,abc,abercrombie,abercrombies,ability
>
>Can someone point out what I am doing wrong?
>
>thank you
>
>
>
>0: Set Paths
>
>    > export HDFS_PATH=/path/to/hdfs/
>    > export LOCAL_PATH=/path/to/localfs
>
>
>1: Put docs in HDFS using hadoop fs -put [-put <localsrc> ... <dst>]
>
>    > hadoop fs -put $LOCAL_PATH/test $HDFS_PATH/rawdata
>
>2: Generate sequence files (of Text) from a directory
>
>    > mahout seqdirectory \
>    -i $HDFS_PATH/rawdata \
>    -o $HDFS_PATH/sequenced \
>    -c UTF-8 -chunk 5
>
>3- Generate sparse Vector from Text sequence files
>
>    > mahout seq2sparse \
>    -i $HDFS_PATH/sequenced \
>    -o $HDFS_PATH/sparseVectors \
>    -ow --maxDFPercent 85 --namedVector --weight tf
>
>
>4- rowid: Map SequenceFile<Text,VectorWritable> to
>{SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
>
>    > mahout rowid \
>    -i $HDFS_PATH/sparseVectors/tfidf-vectors \
>    -o $HDFS_PATH/matrix
>
>5- run cvb
>
>    > mahout cvb \
>    -i $HDFS_PATH/matrix/matrix \
>    -o $HDFS_PATH/test-lda \
>    -k 100 -ow -x 40 \
>    -dict $HDFS_PATH/sparseVectors/dictionary.file-0 \
>    -dt $HDFS_PATH/test-lda-topics \
>    -mt $HDFS_PATH/test-lda-model
>
>6- Dump vectors from a sequence file to text
>
>    > mahout vectordump \
>    -i $HDFS_PATH/test-lda-topics/part-m-00000 \
>    -o $LOCAL_PATH/vectordump \
>    -vs 10 -p true \
>    -d $HDFS_PATH/sparseVectors/dictionary.file-0 \
>    -dt sequencefile \
>    -sort $HDFS_PATH/test-lda-topics/part-m-00000 \
>    -c csv
>    ; cat $LOCAL_PATH/vectordump
>
>
>
>
>
