All, I must have missed a param earlier, but it seems that the command below results in an export that includes the keys. Derp. See below:

mahout vectordump -i results/cvb_results/to_out \
--dictionary results/seq2sparse_results/dictionary.file-0 \
--vectorSize $NUM_KEYWORDS -sort true \
-o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile -n true -u true -p true

On Mon, Jul 14, 2014 at 4:42 PM, Mohammed Omer <beancinemat...@gmail.com> wrote:

> Quick, brief update to all who are looking into this:
>
> It's become apparent that due to the inability to include a given Topic's
> ID when using `vectordump` with a dictionary file, that I'll likely have to
> resort to using `seqdumper` to dump out the term|topics and then use
> `seqdumper` again to dump out the dictionary file, and finally write my own
> map job to join the two items together.
>
> Issue resolved, I'll write a post on this in detail for others to learn
> from and reference. If anyone comes up with a more streamlined solution,
> I'll still donate the full $200 to Apache; otherwise, I'll throw in $100
> next week.
>
> Thank you all for your work on Mahout.
>
> Mo
>
>
> On Mon, Jul 14, 2014 at 3:37 PM, Mohammed Omer <beancinemat...@gmail.com>
> wrote:
>
>> All - to help illustrate the issue, I've put together my mahout cvb
>> script and some truncated output files here for your review with real data:
>>
>> https://gist.github.com/momer/3ddaaa0c291a91d25709
>>
>> Not sure if this is frowned upon, but to expedite some eyes on this
>> issue, I'll donate $200 to the Apache foundation if we can figure this out
>> by the end of the week; and, $100 if we can figure it out by the end of
>> next week!
>>
>> Thank you,
>>
>> Mo
>>
>>
>> On Sun, Jul 13, 2014 at 1:06 PM, Mohammed Omer <beancinemat...@gmail.com>
>> wrote:
>>
>>> All - I'm having the same issue as mentioned at
>>> http://comments.gmane.org/gmane.comp.apache.mahout.user/18889 on Mahout
>>> 0.9. My CVB clusters describe my corpus well; however, the mapping file
>>> generated by mahout's `rowid` seems to be wayyyyyy off.
>>>
>>> For example, there's a very obvious cluster which has keywords like
>>> "beer, stout, pale" - the only cluster to contain these keywords. In my
>>> vectordump for the p(term | topic) this cluster is at line 217. Vector dump
>>> generated by:
>>>
>>> echo `date` ": Dumping the p(term | topic) vectors to local filesystem..."
>>> $mahout_bin/mahout vectordump -i results/cvb_results/to_out \
>>> --dictionary results/seq2sparse_results/dictionary.file-0 \
>>> --vectorSize $NUM_KEYWORDS -sort results/cvb_results/to_out \
>>> -o $OUTPUT_DIR/$PTOPIC_TERM_FILE -dt sequencefile
>>>
>>> And, while the results of dumping out the p(doc | topic) group all of
>>> the documents which contain the words "beer, stout, pale" together - it
>>> dumps them into cluster number 8. The dump is created via:
>>>
>>> echo `date` ": Dumping the p(doc | topic) vectors to local filesystem..."
>>> $mahout_bin/mahout vectordump -i results/cvb_results/do_out \
>>> -sort results/cvb_results/do_out \
>>> -o $OUTPUT_DIR/$PDOC_TOPIC_FILE -p true -c csv -n true -u true
>>>
>>> IE: the result from the p(doc | topic) dump will result in:
>>>
>>> 123 0.001,...,0.60,...
>>>
>>> Where 123 maps to a document about "beer, stout, pale" and where 0.60 is
>>> the 9th comma separated value -- thus belonging to cluster id#8 (at zero
>>> index).
>>>
>>> However, if we look at the p(term | topic) file dumped earlier, cluster
>>> id#8 has nothing to do with this document.
>>>
>>> Additionally, I wrote a script to review all of the documents belonging
>>> to any given cluster; and, all of the documents in cluster #8 actually map
>>> to the p(term|topic) entry described by cluster #217. That is to say, these
>>> are the only documents containing the ngrams / keywords that cluster #217
>>> shows as describing it.
>>>
>>> I can't figure out where the gap is: Is it in the rowid docIndex/matrix
>>> I have? I've tried dumping the above two files without sorting as I figured
>>> that might be rearranging the ordering of cluster probabilities in the
>>> p(doc | topic) dump, but that turned up inconclusive I believe.
>>>
>>> I would love any ideas - I've been stumped on this for a little while
>>> now.
>>>
>>> Thank you,
>>>
>>> Mo
>>>
>>
>>
>
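
For anyone following along, the column lookup described in the quoted message (find the largest comma-separated value in a p(doc | topic) row and use its zero-based position as the cluster id) can be sketched roughly as below. This is only an illustration, not the script mentioned in the thread: the function name `dominant_topic` is hypothetical, the sample row is made up, and the sketch assumes each dumped line is a document key, then whitespace, then the comma-separated probabilities.

```python
def dominant_topic(line):
    """Return (doc_key, topic_index, probability) for one dumped row.

    Assumes the row looks like "<key><whitespace><p0>,<p1>,...", which is
    an assumption about the dump layout, not something taken from Mahout docs.
    """
    # Split the key from the probability list on the first run of whitespace.
    key, values = line.strip().split(None, 1)
    probs = [float(v) for v in values.split(",")]
    # Zero-based index of the largest probability (first one wins on ties).
    best = max(range(len(probs)), key=probs.__getitem__)
    return key, best, probs[best]


if __name__ == "__main__":
    # Made-up row: nine topics, with the 9th value being the largest.
    sample = "123\t0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.05,0.60"
    print(dominant_topic(sample))
```

With a row like the made-up sample, the 9th value wins, so the returned index is 8, matching the zero-index counting used in the quoted message.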