On Thu, Dec 13, 2012 at 8:32 AM, Carsten Schnober <schno...@ids-mannheim.de> wrote:
> On 13.12.2012 12:27, Michael McCandless wrote:
>
>>> For example:
>>> - part of speech of a token.
>>> - syntactic parse subtree (over a span).
>>> - semantically normalized phrase (to canonical text or ontological code).
>>> - semantic group (of a span).
>>> - coreference link.
>>
>> So for example part-of-speech is a per-Token-position attribute.
>>
>> Today the easiest way to handle this is to encode these attributes
>> into a Payload, which is straightforward (make a custom TokenFilter
>> that creates the payload).
>>
>> At search time you would then use e.g. PayloadTermQuery to decode the
>> Payload and do something with it to alter how the query is being
>> scored.
>
> This is a relatively easy example, but how would you deal with e.g.
> annotations that include multiple tokens (as in spans), such as chunks,
> or relations between tokens (and token spans), as in the coreference
> links example given by Steven above?

I think you'd do something like what SynonymFilter does for
multi-token synonyms.

E.g. a synonym rule "wireless network" -> wifi would insert a new
token ("wifi"), overlapped on "wireless".

Lucene doesn't store the end of the span, but if this is really
important for your use case, you could add a payload to that wifi
token that encodes the number of positions the inserted token spans
(2 in this case), and then the information would be present in the
index.

You'd still need to do something custom at read/search time to decode
this end position and do something interesting with it ...

Mike McCandless

http://blog.mikemccandless.com
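
To make the payload idea above a bit more concrete, here is a rough sketch in plain Java. It deliberately does not use any Lucene classes: SpanToken, encodeLength, decodeLength and spanOf are hypothetical names made up for this illustration, and the 7-bits-per-byte encoding is just one possible choice, not something Lucene prescribes. It only shows how a span length could be packed into payload bytes for the stacked "wifi" token and unpacked again at read time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: models stacked tokens and a span-length payload without Lucene. */
final class SpanPayloadSketch {

    /** Plain-Java stand-in for an analyzed token. Hypothetical; not a Lucene class. */
    static final class SpanToken {
        final String term;
        final int positionIncrement; // 0 means "stacked on the previous token's position"
        final byte[] payload;        // null when the token carries no payload

        SpanToken(String term, int positionIncrement, byte[] payload) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.payload = payload;
        }
    }

    /** Encodes a position length (>= 1) using 7 bits per byte, low bits first. */
    static byte[] encodeLength(int positionLength) {
        if (positionLength < 1) {
            throw new IllegalArgumentException("position length must be >= 1");
        }
        byte[] buf = new byte[5]; // an int needs at most 5 such bytes
        int n = 0;
        int v = positionLength;
        while ((v & ~0x7F) != 0) {
            buf[n++] = (byte) ((v & 0x7F) | 0x80); // high bit set: more bytes follow
            v >>>= 7;
        }
        buf[n++] = (byte) v;
        return Arrays.copyOf(buf, n);
    }

    /** Decodes a payload written by encodeLength. */
    static int decodeLength(byte[] payload) {
        int value = 0;
        int shift = 0;
        for (byte b : payload) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
        throw new IllegalArgumentException("truncated payload");
    }

    /** Token stream for "wireless network" with the synonym "wifi" stacked on "wireless". */
    static List<SpanToken> analyzeWirelessNetwork() {
        List<SpanToken> tokens = new ArrayList<>();
        tokens.add(new SpanToken("wireless", 1, null));
        tokens.add(new SpanToken("wifi", 0, encodeLength(2))); // spans 2 positions
        tokens.add(new SpanToken("network", 1, null));
        return tokens;
    }

    /** Returns {start, endExclusive} positions for the token at the given index. */
    static int[] spanOf(List<SpanToken> tokens, int index) {
        int position = -1; // positions are 0-based; the first increment of 1 gives 0
        for (int i = 0; i <= index; i++) {
            position += tokens.get(i).positionIncrement;
        }
        byte[] payload = tokens.get(index).payload;
        int length = (payload == null) ? 1 : decodeLength(payload);
        return new int[] {position, position + length};
    }

    public static void main(String[] args) {
        List<SpanToken> tokens = analyzeWirelessNetwork();
        for (int i = 0; i < tokens.size(); i++) {
            int[] span = spanOf(tokens, i);
            System.out.println(tokens.get(i).term + " [" + span[0] + ", " + span[1] + ")");
        }
    }
}
```

In a real analysis chain the encoded bytes would presumably be attached to the inserted token by a custom TokenFilter, and the decoding step would live in whatever custom query or scoring code reads the payload back; both parts are left out here because they depend on the Lucene version in use.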