On Wed, Dec 12, 2012 at 3:02 PM, Wu, Stephen T., Ph.D. <wu.step...@mayo.edu> wrote:
>>> Is there any (preliminary) code checked in somewhere that I can look at,
>>> that would help me understand the practical issues that would need to be
>>> addressed?
>>
>> Maybe we can make this more concrete: what new attribute are you
>> needing to record in the postings and access at search time?
>
> For example:
> - part of speech of a token.
> - syntactic parse subtree (over a span).
> - semantically normalized phrase (to canonical text or ontological code).
> - semantic group (of a span).
> - coreference link.

So, for example, part of speech is a per-token-position attribute.
Today the easiest way to handle this is to encode such attributes into
a Payload, which is straightforward: write a custom TokenFilter that
creates the payload. At search time you would then use e.g.
PayloadTermQuery to decode the Payload and use it to alter how the
query is scored.

For the span-like attributes (e.g. a syntactic parse or a semantically
normalized phrase) I think you'd need to do something like
SynonymFilter in your analysis, i.e. insert new tokens at the position
where the span starts. Unfortunately, Lucene doesn't properly index
spans (it records the start position but not the end position), so
that limits what kind of matching you can do at search time.

Mike McCandless

http://blog.mikemccandless.com
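
To make the payload idea above a bit more concrete, here is a minimal
sketch of just the byte-level encoding, in plain Java with no Lucene
dependency. The class name PosPayload, its methods, and the tag set are
all hypothetical and chosen for illustration only. In a real setup a
custom TokenFilter would attach the encoded bytes to each token as its
payload, and the search side would decode them; that wiring is omitted
here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: maps a part-of-speech tag to a
// one-byte payload and back.
final class PosPayload {
    // Illustrative tag set (an assumption); a real tagger defines its own.
    private static final List<String> TAGS =
            Arrays.asList("NOUN", "VERB", "ADJ", "ADV", "OTHER");

    private PosPayload() {}

    // Index time: the bytes a custom TokenFilter would attach to the token.
    // Unknown tags fall back to "OTHER".
    static byte[] encode(String tag) {
        int id = TAGS.indexOf(tag);
        if (id < 0) {
            id = TAGS.indexOf("OTHER");
        }
        return new byte[] { (byte) id };
    }

    // Search time: recover the tag from the payload bytes.
    static String decode(byte[] payload) {
        if (payload == null || payload.length != 1) {
            throw new IllegalArgumentException("expected a one-byte payload");
        }
        int id = payload[0] & 0xFF;
        if (id >= TAGS.size()) {
            throw new IllegalArgumentException("unknown tag id: " + id);
        }
        return TAGS.get(id);
    }

    public static void main(String[] args) {
        byte[] payload = encode("VERB");
        System.out.println(decode(payload)); // prints VERB
    }
}
```

One byte per position keeps the postings small. Attributes with a
larger value space (e.g. an ontological code) would need a wider
encoding, which this sketch does not cover.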