In the documentation for FieldMaskingSpanQuery, it says:

"Note: as getField() returns the masked field, scoring will be done using the 
Similarity and collection statistics of the field name supplied, but with the 
term statistics of the real field. This may lead to exceptions, poor 
performance, and unexpected scoring behavior."

I assume it was implemented this way because the intended use case involved 
very short fields, where collection statistics/idf matter little because 
you're basically doing boolean queries.

However, we've given a lot of thought to how we could include linguistic 
annotations alongside the original text, and we're looking at separate fields + 
FieldMaskingSpanQuery to do the trick. (The idea is to create "annotation" 
fields whose token positions are set from the tokenized text. 
FieldMaskingSpanQuery then lets us search both text and annotations as if they 
were at the same token position in the same field. We've considered payloads, 
synonyms, and a few other things, but haven't really been satisfied with them.)
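
To make the setup concrete, here is a minimal sketch of the alignment we have 
in mind. It is a toy model in plain Java, not Lucene code: each "field" is just 
an array indexed by token position, and the class name, method name, and sample 
data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this does not use Lucene. Each array stands in for one
// field, and the array index stands in for the token position.
class AnnotationAlignment {

    // Returns every position where the text token equals `word` AND the
    // annotation token at that same position equals `tag`.
    static List<Integer> matchAtSamePosition(String[] text, String word,
                                             String[] annotations, String tag) {
        List<Integer> hits = new ArrayList<>();
        int n = Math.min(text.length, annotations.length);
        for (int pos = 0; pos < n; pos++) {
            if (text[pos].equals(word) && annotations[pos].equals(tag)) {
                hits.add(pos);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sentence with part-of-speech annotations.
        String[] text = {"they", "book", "a", "book"};
        String[] pos  = {"PRON", "VERB", "DET", "NOUN"};
        // Find "book" used as a verb.
        System.out.println(matchAtSamePosition(text, "book", pos, "VERB")); // prints [1]
    }
}
```

In the real index, the two arrays would be two separate fields, and the 
same-position match is what we hope to get from combining span queries across 
fields.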

For this to be scientifically interesting, though, we need the collection 
statistics to stay consistent with those of the original "annotation" field; 
we would also like to make sure that all of these stats and the SpanQuery 
descendants work with LMDirichletSimilarity.
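
To show why the mismatch worries us, here is a rough sketch of a 
Dirichlet-smoothed language-model score in plain Java. It is not Lucene's 
implementation: the formula is the usual textbook shape, the add-one smoothing 
of the collection probability is an assumption of this sketch, and every number 
is invented. The point is only that pairing one field's term frequency with 
another field's token count changes the score.

```java
// Toy calculation only: the general shape of Dirichlet-smoothed scoring,
// written from scratch. It does not call or reproduce Lucene's classes.
class DirichletStatsDemo {

    // score = log(1 + tf / (mu * pColl)) + log(mu / (docLen + mu)), floored at 0.
    // pColl is the collection probability of the term; the add-one smoothing
    // used here is an assumption of this sketch.
    static double score(double tf, double docLen, double termCollFreq,
                        double fieldTokenCount, double mu) {
        double pColl = (termCollFreq + 1.0) / (fieldTokenCount + 1.0);
        double s = Math.log(1.0 + tf / (mu * pColl)) + Math.log(mu / (docLen + mu));
        return Math.max(0.0, s);
    }

    public static void main(String[] args) {
        double mu = 2000.0; // smoothing parameter, chosen for illustration

        // Hypothetical numbers: the term occurs 50 times in an annotation
        // field holding 1,000,000 tokens in total.
        double consistent = score(2, 100, 50, 1_000_000, mu);

        // Same term frequency, but the token count comes from a different,
        // much larger field (20,000,000 tokens), as with a masked field name.
        double mixed = score(2, 100, 50, 20_000_000, mu);

        System.out.println("consistent=" + consistent + " mixed=" + mixed);
    }
}
```

With these made-up numbers the mixed-statistics score comes out higher than 
the consistent one, because the term looks rarer than it really is.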

Any idea how to implement a FieldMaskingSpanQuery that gets collection 
statistics right?

Many thanks for any help on the issue.

stephen
P.S. Has anyone made progress on allowing indexes to store word lattices, 
preserving the graphs that are produced with TokenFilters?
