Re: vectors from pre-tokenized terms

2011-09-14 Thread Grant Ingersoll
I think createDictionaryChunks is the first thing that runs inside of createTermFrequencyVectors. It takes its input from DocumentProcessor.tokenizeDocuments, which outputs <Text, StringTuple> pairs. So I would suspect you need <Text, StringTuple> as input. See
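[Editor's note: for concreteness, here is a minimal sketch, not from the original mail, of writing a SequenceFile of <Text, StringTuple> in the shape that DictionaryVectorizer.createTermFrequencyVectors consumes. The output path, document key, and feature-id strings are invented for illustration.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WritePreTokenized {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output location; DictionaryVectorizer is pointed at the directory.
    Path out = new Path("tokenized-documents/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, StringTuple.class);
    try {
      // Each "term" is just an integer feature id rendered as a string,
      // standing in for the tokens DocumentProcessor.tokenizeDocuments would emit.
      StringTuple tokens = new StringTuple();
      tokens.add("17");
      tokens.add("42");
      tokens.add("1003");
      writer.append(new Text("/doc-1"), tokens); // key = document id
    } finally {
      writer.close();
    }
  }
}

[From a directory of such files, createTermFrequencyVectors should proceed exactly as it does on the output of DocumentProcessor.tokenizeDocuments.]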

vectors from pre-tokenized terms

2011-09-09 Thread Jack Tanner
Hi all. I've got some documents described by binary features with integer ids, and I want to read them into sparse Mahout vectors to do TF-IDF weighting and clustering. I do not want to paste them back together and run a Lucene tokenizer. What's the clean way to do this? I'm thinking that I
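[Editor's note: the message is truncated in the archive. Since the features already carry integer ids, one direct route, sketched here with an assumed cardinality and invented paths, is to skip tokenization and the dictionary step entirely and write <Text, VectorWritable> sparse vectors yourself; Mahout's TF-IDF and clustering drivers consume vectors in this SequenceFile form.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteSparseVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("raw-vectors/part-00000"); // hypothetical path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // Cardinality must be at least max feature id + 1; 100000 is assumed here.
      Vector v = new RandomAccessSparseVector(100000);
      v.set(17, 1.0);   // binary feature present -> weight 1.0
      v.set(42, 1.0);
      v.set(1003, 1.0);
      writer.append(new Text("/doc-1"), new VectorWritable(v));
    } finally {
      writer.close();
    }
  }
}

[These are term-frequency vectors keyed by document id, the same shape the TF-IDF and clustering jobs read, so weighting and clustering can run from this directory without a Lucene tokenizer.]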