Re: producing vectors from composite documents

Olivier Grisel Tue, 08 Jun 2010 16:38:10 -0700

2010/6/9 Jake Mannix <[email protected]>:
> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <[email protected]> wrote:
>
>> 2010/6/8 Ted Dunning <[email protected]>:
>> > Got it.
>> >
>> > This really needs to be done before vectorization, but you can segregate
>> > the output vector for different handling by passing in a view to different
>> > parts of the vector.
>> >
>> > My recommendation is that you apply IDF using the weight dictionary in the
>> > vectorizer.  That will let you have multiple text fields with different
>> > weighting schemes but still put all the results into a single result
>> > vector.  As a side effect, if you put everything into a vector of
>> > dimension 1, then you get multi-field weighted inputs for free.
>>
>> Instead of storing the exact IDF values in an explicit dictionary,
>> one could use a counting Bloom filter data structure to reduce the
>> memory footprint and speed up the lookups (though Lucene is able to
>> handle millions of terms without any perf issues).
>>
>
> Using counting Bloom filters is a really good idea here.  Do you know
> any good Java implementations of these?


Nope, but AFAIK Ted's probe-combination logic plus the MurmurHash
implementation already does 90% of the work.
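
To make the idea concrete, here is a rough sketch of a counting Bloom
filter used as an approximate document-frequency table. Everything in it
is an assumption for illustration, not existing Mahout code: the class
name CountingBloomIdf, the FNV-1a hash (a self-contained stand-in for
MurmurHash), the double-hashing probe scheme, and the smoothed IDF
formula log((N + 1) / (df + 1)). Reading the minimum counter over all
probes can overestimate a term's document frequency when slots collide,
but it never underestimates it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: approximate document frequencies in a fixed-size
// counter array instead of an exact term -> count dictionary.
final class CountingBloomIdf {
    private final int[] counters;
    private final int numProbes;
    private int numDocs;

    CountingBloomIdf(int numCounters, int numProbes) {
        this.counters = new int[numCounters];
        this.numProbes = numProbes;
    }

    // FNV-1a over the UTF-8 bytes, seeded; stands in for MurmurHash here.
    private static int hash(String term, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : term.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: probe i lands on (h1 + i * h2) mod table size.
    private int slot(String term, int probe) {
        int h1 = hash(term, 0);
        int h2 = hash(term, 0x9747b28c) | 1;
        return Math.floorMod(h1 + probe * h2, counters.length);
    }

    // Call once per document with that document's distinct terms.
    void addDocument(Iterable<String> distinctTerms) {
        numDocs++;
        for (String term : distinctTerms) {
            for (int i = 0; i < numProbes; i++) {
                counters[slot(term, i)]++;
            }
        }
    }

    // Minimum over the probed counters: an upper bound on the true count.
    int documentFrequency(String term) {
        int min = Integer.MAX_VALUE;
        for (int i = 0; i < numProbes; i++) {
            min = Math.min(min, counters[slot(term, i)]);
        }
        return min;
    }

    // One common smoothed IDF variant; other weightings plug in the same way.
    double idf(String term) {
        return Math.log((numDocs + 1.0) / (documentFrequency(term) + 1.0));
    }
}
```

Memory is fixed by numCounters regardless of vocabulary size, and a
lookup costs numProbes array reads. The price is that collisions push
the estimated document frequency up, so the IDF of rare terms can come
out too low when the table is too small for the vocabulary.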

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
