Very cool stuff Karl. Would love to see some TREC-style evaluations
for the ShingleMatrixQuery stuff just to see some comparisons. Also,
you might have a look at the new TokenStream stuff that is in 2.9 and
is a start on its way towards Flexible Indexing. I think this may
actually allow you to have more strongly typed payloads which means
you won't have to decode (well, kind of, Lucene will do the decoding
for you). Only problem is they aren't yet supported on the search
side. In other words, your wish for a reusable API is being worked on.
Have a look at Michael Busch's ApacheCon NO presentation (I think
it's up on the AC website).
-Grant
On Dec 26, 2008, at 8:22 PM, Karl Wettin wrote:
I would very much like to hear how people use payloads.
Personally I use them for weight only. And I use them a lot, in
almost all applications. I factor in the weight of synonyms, stems,
dediacritization and what not. I create huge indices that contain
lots of tokens at the same position but with different weights. I might
for instance create the stream "(1)motörhead^1", "(0)motorhead^0.7"
and I'll do this at both index and query time, i.e. I use the
payload weight to calculate both payload weight used by the
BoostingTermQuery scorer AND to set the boost in the query at the
same time.
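A quick sketch of the arithmetic (plain Java; the class and method names
are mine, not from Karl's code): when the same factor is stored in the
payload at index time and set as the boost at query time, the two
multiply in the final score contribution.

```java
// Hypothetical helper illustrating the double-weighting scheme: the
// same factor is stored in the payload at index time and set as the
// query boost at search time, so a matched synonym contributes
// roughly indexWeight * queryBoost relative to the original token.
public class DoubleWeighting {
    static float combined(float payloadWeight, float queryBoost) {
        return payloadWeight * queryBoost;
    }

    public static void main(String[] args) {
        // "motörhead" carries 1.0 on both sides, "motorhead" 0.7:
        System.out.println(combined(1.0f, 1.0f)); // exact form: 1.0
        System.out.println(combined(0.7f, 0.7f)); // folded form: ~0.49
    }
}
```

So a folded-form match is demoted twice, which keeps the exact spelling
firmly on top without dropping the variant entirely.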
In order to handle this I use an interface that looks something like
this:
public interface PayloadWeightHandler {
    public void setWeight(Token token, float weight);
    public float getWeight(Token token);
}
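A minimal sketch of what a handler implementation might do under the
hood (class and method names are hypothetical; Karl's actual code isn't
shown): pack the float weight into four payload bytes and read it back,
which is exactly the decode step that typed attributes could eventually
do for you.

```java
// Hypothetical codec sketch: how a PayloadWeightHandler might store a
// float weight in a 4-byte token payload (big-endian IEEE 754 bits).
public class PayloadFloatCodec {
    // Encode the weight into the byte[] that would back a Payload.
    static byte[] encode(float weight) {
        int bits = Float.floatToIntBits(weight);
        return new byte[] {
            (byte) (bits >>> 24), (byte) (bits >>> 16),
            (byte) (bits >>> 8),  (byte) bits
        };
    }

    // Decode the weight again, e.g. when scoring a payload match.
    static float decode(byte[] payload) {
        int bits = ((payload[0] & 0xFF) << 24) | ((payload[1] & 0xFF) << 16)
                 | ((payload[2] & 0xFF) << 8)  |  (payload[3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }
}
```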
In order to use this I had to patch pretty much every filter I use and
pass down a weight factor, something like:

TokenStream ts = analyzer.tokenStream(f, new StringReader("motörhead ace of spades"));
ts = new SynonymTokenFilter(ts, synonyms, 0.7f);
ts = new StemmerFilter(ts, 0.7f);
ts = new ASCIIFoldingFilter(ts, 0.5f);
All these filters would, if applicable, create new synonym tokens
with slightly less weight than the input rather than replace token
content:
"(1)motörhead^1", "(0)motorhead^0.5", "(1)ace^1", "(1)of^1",
"(1)spades^1", "(1)spad^0.7"
I usually use 4-byte floats while creating the stream and then
convert them to 8-bit floats in a final filter before adding the
stream to the document.
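For that final 4-byte-to-1-byte step, one simple possibility (my sketch,
not necessarily Karl's scheme; Lucene also ships a SmallFloat utility
with a tiny exponent/mantissa format) is linear quantization of a [0,1]
weight into a single byte:

```java
// Hypothetical quantizer: squeeze a weight in [0,1] into one payload
// byte, trading precision (steps of ~1/255) for a 4x smaller payload.
public class WeightQuantizer {
    static byte toByte(float weight) {
        float clamped = Math.max(0f, Math.min(1f, weight));
        return (byte) Math.round(clamped * 255f);
    }

    static float toFloat(byte b) {
        return (b & 0xFF) / 255f;
    }
}
```

Round-tripping 0.7f this way gives roughly 0.702, which is plenty of
resolution for ranking weights.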
Is anyone else doing something similar? It would be nice to
normalize this and perhaps come up with a reusable API for this. It
would also be cool if all the existing filters could be rewritten to
handle this stuff.
I find it to be extremely useful when creating indices with rather
niched content such as song titles, names of people, street
addresses, etc. For the last year or so I've done several (3)
commercial implementations where I try to extend the index with
incorrectly typed queries that are unique enough that they do not
interfere with the quality of the results. It has been very
successful: people get great responses in great time even though they
enter an "incorrect" query.
On a side note, in these implementations I've completely replaced
phrase queries with shingles. ShingleMatrixQuery has some built-in
goodies for calculating weight. Combined with SSDs I see awesome
results with very short response times even in fairly large indices
(10M-100M documents). I'm talking about 100ms-500ms for rather
complex queries under heavy load.
karl
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ