SynonymFilter makes sense, and the payloads I had planned are indeed not needed. I guess a better solution would be to turn the boost into an attribute at query time, which the query parser would then consume in order to boost these n-gram terms.
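A rough sketch of the query-time idea, in plain Java with no Lucene dependency. Everything here is an assumption made for illustration: the class name, the `rewrite` method, and the `BoostedTerm` type are hypothetical, and the phrase table simply reuses the two example entries from the mapping file quoted below (boosts 1 and 2). The sketch scans the query tokens, greedily takes the longest known phrase at each position, and emits the joined term with its boost; unmatched tokens pass through with a neutral boost of 1.0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class QueryTimePhraseBoost {

    /** Hypothetical result type: an index term plus the boost to apply at query time. */
    static final class BoostedTerm {
        final String term;
        final float boost;

        BoostedTerm(String term, float boost) {
            this.term = term;
            this.boost = boost;
        }

        @Override
        public String toString() {
            // Rendered in the term^boost style used by the classic query syntax.
            return term + "^" + boost;
        }
    }

    /**
     * Replaces known phrases (keys joined with '_') by a single boosted term.
     * At each position the longest matching phrase of 2..maxPhraseLen tokens wins.
     */
    static List<BoostedTerm> rewrite(List<String> tokens,
                                     Map<String, Float> phraseBoosts,
                                     int maxPhraseLen) {
        List<BoostedTerm> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int consumed = 0;
            int longest = Math.min(maxPhraseLen, tokens.size() - i);
            for (int len = longest; len >= 2; len--) {
                String joined = String.join("_", tokens.subList(i, i + len));
                Float boost = phraseBoosts.get(joined);
                if (boost != null) {
                    out.add(new BoostedTerm(joined, boost));
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                // No phrase starts here: keep the unigram with a neutral boost.
                out.add(new BoostedTerm(tokens.get(i), 1.0f));
                consumed = 1;
            }
            i += consumed;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed phrase table, mirroring the mapping.txt example in the quoted mail.
        Map<String, Float> phraseBoosts = new HashMap<>();
        phraseBoosts.put("term1_term2_term3", 1.0f);
        phraseBoosts.put("firstName_lastName", 2.0f);

        List<String> tokens = Arrays.asList("about", "firstName", "lastName");
        System.out.println(rewrite(tokens, phraseBoosts, 3));
    }
}
```

A `HashMap` lookup stands in for the FST here only to keep the sketch short; with 100k-1M phrases a more compact structure would probably be preferable. In a real setup the resulting term/boost pairs would be handed to whatever builds the per-term queries.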
Thanks for the hints.
Manuel

On Wed, Mar 12, 2014 at 12:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:

> You could also use SynonymFilter?
>
> Why does the boost need to be encoded in the index (in a payload) vs
> at query time when you create the TermQuery for that term? Does the
> boost vary depending on the surrounding context / document?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Mar 12, 2014 at 5:27 AM, Manuel Le Normand
> <manuel.lenorm...@gmail.com> wrote:
> > Hi,
> > I posted this question on the Solr mailing list but it has more to do
> > with Lucene.
> >
> > I have a performance and scoring problem for phrase queries:
> >
> > 1. Performance - phrase queries involving frequent terms are very slow
> > due to the reading of large position posting lists.
> > 2. Scoring - I want to control the boost of phrase and entity (in
> > gazetteers) matches.
> >
> > Indexing all terms as bi-grams and unigrams is not possible in my use
> > case, so I plan on indexing only the useful bi-grams. Part of it will be
> > achieved by the CommonGram filter, in which I put the frequent words.
> >
> > I am thinking of going a step further and indexing phrase queries
> > (extracted from my query log) and entities (from gazetteers). In order to
> > control the boost on these N-gram matches I plan on adding payloads to
> > these terms.
> >
> > I'm thinking of two different implementations:
> >
> > 1. Using MappingCharFilter - the mapping.txt would be
> >
> > #phrase-query
> >
> > term1 term2 term3 => term1_term2_term3|1
> >
> > #entity
> >
> > firstName lastName => firstName_lastName|2
> >
> >
> > Very simple to implement, but an issue might be that I have 100k-1M
> > (depending on frequency) phrases/entities as above. I saw that
> > MappingCharFilter is implemented as an FST, so I'm not concerned with
> > memory footprint, but I'm concerned that iterating on the charBuffer for
> > long documents might cause problems.
> >
> > 2. Using the shingleTokenFilter - customizing it to compare the output
> > against my gazetteers. It would demand an FST implementation in this
> > TokenFilter.
> >
> >
> > Will I get a quick win with opt. 1? How hard would it be to implement
> > opt. 2?
> >
> > General question: Is the above N-gram + payload resolution a common
> > practice?
> >
> > Thanks in advance,
> > Manuel