Re: vectors from pre-tokenized terms

2011-09-14 Thread Grant Ingersoll
I think createDictionaryChunks is the first thing that runs inside of createTermFrequencyVectors. It takes its input from DocumentProcessor.tokenizeDocuments, which outputs <Text, StringTuple> pairs. So I would suspect you need <Text, StringTuple> as input. See
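[Editor's note: for concreteness, here is a minimal sketch, not from the original mail, of writing a SequenceFile of <Text, StringTuple> in the shape that DictionaryVectorizer.createTermFrequencyVectors consumes. The output path, document key, and feature-id strings are invented for illustration.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.common.StringTuple;

public class WritePreTokenized {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical output location; DictionaryVectorizer is pointed at the directory.
    Path out = new Path("tokenized-documents/part-00000");
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, StringTuple.class);
    try {
      // Each "term" is just an integer feature id rendered as a string,
      // standing in for the tokens DocumentProcessor.tokenizeDocuments would emit.
      StringTuple tokens = new StringTuple();
      tokens.add("17");
      tokens.add("42");
      tokens.add("1003");
      writer.append(new Text("/doc-1"), tokens); // key = document id
    } finally {
      writer.close();
    }
  }
}

[From a directory of such files, createTermFrequencyVectors should proceed exactly as it does on the output of DocumentProcessor.tokenizeDocuments.]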

vectors from pre-tokenized terms

2011-09-09 Thread Jack Tanner
Hi all. I've got some documents described by binary features with integer ids, and I want to read them into sparse Mahout vectors to do TF-IDF weighting and clustering. I do not want to paste them back together and run a Lucene tokenizer. What's the clean way to do this? I'm thinking that I
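[Editor's note: the message is truncated in the archive. Since the features already carry integer ids, one direct route, sketched here with an assumed cardinality and invented paths, is to skip tokenization and the dictionary step entirely and write <Text, VectorWritable> sparse vectors yourself; Mahout's TF-IDF and clustering drivers consume vectors in this SequenceFile form.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class WriteSparseVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("raw-vectors/part-00000"); // hypothetical path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class);
    try {
      // Cardinality must be at least max feature id + 1; 100000 is assumed here.
      Vector v = new RandomAccessSparseVector(100000);
      v.set(17, 1.0);   // binary feature present -> weight 1.0
      v.set(42, 1.0);
      v.set(1003, 1.0);
      writer.append(new Text("/doc-1"), new VectorWritable(v));
    } finally {
      writer.close();
    }
  }
}

[These are term-frequency vectors keyed by document id, the same shape the TF-IDF and clustering jobs read, so weighting and clustering can run from this directory without a Lucene tokenizer.]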