LDA should work with both tf and tfidf vectors. If you generate tfidf vectors in Step 3 and feed those into Step 4, LDA should work.
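
A minimal sketch of keeping Steps 3 and 4 consistent, assuming seq2sparse writes its vectors to a subdirectory named after the weighting (tf-vectors or tfidf-vectors). WEIGHT and VECTORS_DIR are illustrative names that do not appear in the original commands, and the commands are only echoed here, not run:

```shell
#!/bin/sh
# Sketch only: derive the rowid input directory from the seq2sparse weight.
# HDFS_PATH is the placeholder path from the original post.
# WEIGHT and VECTORS_DIR are hypothetical helper variables for illustration.
HDFS_PATH=/path/to/hdfs
WEIGHT=tfidf   # set to "tf" to use plain term-frequency vectors instead

# Assumption: the vectors land in "<output>/<weight>-vectors",
# so Step 4 reads from the directory that Step 3 actually produced.
VECTORS_DIR="$HDFS_PATH/sparseVectors/${WEIGHT}-vectors"

# Echo the two commands instead of executing them (no cluster needed).
echo "mahout seq2sparse -i $HDFS_PATH/sequenced -o $HDFS_PATH/sparseVectors -ow --maxDFPercent 85 --namedVector --weight $WEIGHT"
echo "mahout rowid -i $VECTORS_DIR -o $HDFS_PATH/matrix"
```

With a single variable driving both steps, switching between tf and tfidf cannot leave Step 4 pointing at a directory that Step 3 never wrote.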

On Monday, January 27, 2014 7:33 PM, Peyman Faratin wrote:

Hi Suneel,

I changed step 4 to tf-vectors (LDA needs tf vectors, not tfidf). It seems to be working.

Thank you for your help, Suneel.

Peyman

On Jan 27, 2014, at 1:51 PM, Suneel Marthi wrote:

>In Step 3, you are generating tf vectors but are expecting tf-idf vectors in Step 4.
>
>Change the weight in Step 3 to tfidf (which is the default, BTW, if none is
>specified).
>
>
>On Monday, January 27, 2014 1:44 PM, Ted Dunning wrote:
>
>I am forwarding this to the list for Peyman.
>
>-----------------------------------------------------------------
>
>I am trying to run the CVB (Mahout 0.8) on a directory of plain text files,
>following the procedure outlined below. However, I am not able to see the
>vectordump (step 6). Run without the "-c csv" flag, the generated file is
>empty. However, if I use the flag "-c csv", the generated file starts with a
>series of numbers followed by an alphabetically organized series of
>unigrams (see below):
>
>#1,10,1163,12,121,13,14,141,1462,15,16,17,185,1901,197,2,201,2227,23,283,298,3,331,35,4,402,4351,445,5,57,58,6,68,7,9,987,a.m,ab,abc,abercrombie,abercrombies,ability
>
>Can someone point out what I am doing wrong?
>
>Thank you
>
>0: Set paths
>
>    export HDFS_PATH=/path/to/hdfs/
>    export LOCAL_PATH=/path/to/localfs
>
>1: Put docs in HDFS using hadoop fs -put [-put <localsrc> ... <dst>]
>
>    hadoop fs -put $LOCAL_PATH/test $HDFS_PATH/rawdata
>
>2: Generate sequence files (of Text) from a directory
>
>    mahout seqdirectory \
>    -i $HDFS_PATH/rawdata \
>    -o $HDFS_PATH/sequenced \
>    -c UTF-8 -chunk 5
>
>3: Generate sparse vectors from Text sequence files
>
>    mahout seq2sparse \
>    -i $HDFS_PATH/sequenced \
>    -o $HDFS_PATH/sparseVectors \
>    -ow --maxDFPercent 85 --namedVector --weight tf
>
>4: rowid: Map SequenceFile<Text,VectorWritable> to
>{SequenceFile<IntWritable,VectorWritable>, SequenceFile<IntWritable,Text>}
>
>    mahout rowid \
>    -i $HDFS_PATH/sparseVectors/tfidf-vectors \
>    -o $HDFS_PATH/matrix
>
>5: Run cvb
>
>    mahout cvb \
>    -i $HDFS_PATH/matrix/matrix \
>    -o $HDFS_PATH/test-lda \
>    -k 100 -ow -x 40 \
>    -dict $HDFS_PATH/sparseVectors/dictionary.file-0 \
>    -dt $HDFS_PATH/test-lda-topics \
>    -mt $HDFS_PATH/test-lda-model
>
>6: Dump vectors from a sequence file to text
>
>    mahout vectordump \
>    -i $HDFS_PATH/test-lda-topics/part-m-00000 \
>    -o $LOCAL_PATH/vectordump \
>    -vs 10 -p true \
>    -d $HDFS_PATH/sparseVectors/dictionary.file-0 \
>    -dt sequencefile \
>    -sort $HDFS_PATH/test-lda-topics/part-m-00000 \
>    -c csv ; cat $LOCAL_PATH/vectordump
