vectors from pre-tokenized terms

2011-09-09 Thread Jack Tanner
Hi all. I've got some documents described by binary features with 
integer ids, and I want to read them into sparse Mahout vectors to do 
TF-IDF weighting and clustering. I do not want to paste them back 
together and run a Lucene tokenizer. What's the clean way to do this?


I'm thinking that I need to write out SequenceFile objects, with a 
document id key and a value that's an IntTuple. Is that right? 
Or should I use an IntegerTuple instead? It feels wrong to use either, 
actually, because these tuples claim to be ordered, but my features are 
not ordered.


I would then use DictionaryVectorizer.createTermFrequencyVectors and 
TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.


Am I on the right track?


Re: vectors from pre-tokenized terms

2011-09-13 Thread Jack Tanner
Ping? Please help if you can. Maybe I was unclear the first time; let me 
try again.


I have input like this:

term_id,doc_id
55,1
61,1
29,2
98,3

I want to do clustering, so (I think) I need to transform that into a 
bunch of SequenceFile objects.


key:1, value:<55,61>
key:2, value:<29>
key:3, value:<98>

What's the format of the SequenceFile value? IntTuple? IntegerTuple? 
Something else?


The next step would be to use 
DictionaryVectorizer.createTermFrequencyVectors and 
TFIDFConverter.processTfIdf, just like in SparseVectorsFromSequenceFiles.
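The transformation described above (collecting the term ids of each doc id) can be sketched independently of Mahout in plain Java. `TermGrouper` and `groupTermsByDoc` are names invented here for illustration, not part of any Mahout API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Groups pre-tokenized (term_id, doc_id) rows by document, producing the
// layout from the example: 1 -> <55,61>, 2 -> <29>, 3 -> <98>.
public class TermGrouper {

    // Each input line is "term_id,doc_id", matching the CSV in the thread.
    public static Map<String, List<String>> groupTermsByDoc(List<String> csvLines) {
        Map<String, List<String>> termsByDoc = new LinkedHashMap<>();
        for (String line : csvLines) {
            String[] parts = line.split(",");
            String termId = parts[0].trim();
            String docId = parts[1].trim();
            termsByDoc.computeIfAbsent(docId, k -> new ArrayList<>()).add(termId);
        }
        return termsByDoc;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("55,1", "61,1", "29,2", "98,3");
        System.out.println(groupTermsByDoc(rows));
        // {1=[55, 61], 2=[29], 3=[98]}
    }
}
```

Each resulting map entry would then correspond to one SequenceFile record, with the doc id as the key.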



Re: vectors from pre-tokenized terms

2011-09-14 Thread Grant Ingersoll
I think createDictionaryChunks is the first thing that runs inside 
createTermFrequencyVectors. It takes the input from 
DocumentProcessor.tokenizeDocuments, which outputs Text, StringTuple. 
So I would suspect you would need Text, StringTuple as inputs. See 
SequenceFileTokenizerMapper.java.
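A minimal sketch of the packing Grant describes, with `MiniStringTuple` as a local stand-in for Mahout's `org.apache.mahout.common.StringTuple` (assumed here to collect entries in insertion order). A real job would write `Text`/`StringTuple` pairs through a Hadoop `SequenceFile.Writer` rather than collecting them in memory:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of packing one document's grouped term ids into the value type
// Grant suggests. MiniStringTuple is a stand-in defined here, not the
// real Mahout class.
public class TuplePacker {

    static final class MiniStringTuple {
        private final List<String> entries = new ArrayList<>();
        void add(String entry) { entries.add(entry); }
        @Override public String toString() { return entries.toString(); }
    }

    // Turns one document's term ids into a tuple value, e.g. doc 1 -> [55, 61].
    public static MiniStringTuple pack(List<String> termIds) {
        MiniStringTuple tuple = new MiniStringTuple();
        for (String termId : termIds) {
            tuple.add(termId);
        }
        return tuple;
    }

    public static void main(String[] args) {
        System.out.println(pack(List.of("55", "61"))); // [55, 61]
    }
}
```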


Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com