2010/6/9 Jake Mannix <[email protected]>:
> On Tue, Jun 8, 2010 at 4:10 PM, Olivier Grisel <[email protected]> wrote:
>
>> 2010/6/8 Ted Dunning <[email protected]>:
>> > Got it.
>> >
>> > This really needs to be done before vectorization, but you can segregate the
>> > output vector for different handling by passing in a view to different parts
>> > of the vector.
>> >
>> > My recommendation is that you apply IDF using the weight dictionary in the
>> > vectorizer. That will let you have multiple text fields with different
>> > weighting schemes but still put all the results into a single result vector.
>> > As a side effect, if you put everything into a vector of dimension 1, then
>> > you get multi-field weighted inputs for free.
>>
>> Instead of storing the exact IDF values in an explicit dictionary,
>> one could use a counting Bloom filter data structure to reduce the
>> memory footprint and speed up the lookups (though Lucene is able to
>> handle millions of terms without any perf issues).
>>
>
> Using counting Bloom filters is a really good idea here. Do you know
> any good Java implementations of these?

Nope, but AFAIK Ted's combination-of-probes logic + MurmurHash
implementation does 90% of the work.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
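
PS: to make the idea concrete, here is a rough sketch of what a counting Bloom filter for approximate document frequencies might look like in Java. Everything in it is an assumption made for illustration: the class and method names are hypothetical, the seeded FNV-1a hash is only a stand-in for MurmurHash, and the double-hashing probe scheme is a common textbook choice rather than Ted's actual probe logic or any Mahout API. The estimate is the minimum counter over all probes, so collisions can only inflate a count, never shrink it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch: approximate per-term counts in a fixed-size counter array. */
final class CountingBloomFilter {
    private final int[] counters;
    private final int numProbes;

    CountingBloomFilter(int numCounters, int numProbes) {
        if (numCounters <= 0 || numProbes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.counters = new int[numCounters];
        this.numProbes = numProbes;
    }

    // Seeded 32-bit FNV-1a; a stand-in for MurmurHash, chosen only to keep
    // the sketch self-contained.
    private static int hash(byte[] data, int seed) {
        int h = 0x811c9dc5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: probe i lands on (h1 + i * h2) mod size. Forcing h2 to
    // be odd keeps the probes of one term distinct when size is a power of two.
    private int[] probes(String term) {
        byte[] key = term.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(key, 0);
        int h2 = hash(key, h1) | 1;
        int[] slots = new int[numProbes];
        for (int i = 0; i < numProbes; i++) {
            slots[i] = Math.floorMod(h1 + i * h2, counters.length);
        }
        return slots;
    }

    /** Record one more document containing the term. */
    void add(String term) {
        for (int slot : probes(term)) {
            counters[slot]++;
        }
    }

    /** Upper bound on how many times the term was added. */
    int estimate(String term) {
        int min = Integer.MAX_VALUE;
        for (int slot : probes(term)) {
            min = Math.min(min, counters[slot]);
        }
        return min;
    }

    /** Plain log(N / df) with the estimated df; unseen terms are treated as df = 1. */
    double idf(String term, int numDocs) {
        return Math.log((double) numDocs / Math.max(1, estimate(term)));
    }
}
```

Since the counts are only ever overestimated, the resulting IDF values are biased slightly downwards for terms that collide; whether that matters in practice depends on how large the counter array is relative to the vocabulary.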
