On Wednesday 04 January 2006 07:34, Dave Kor wrote:
> Hi,
>
> I would like to associate information (or labels) with each word or a
> range of words in a document. Information such as this word is a noun, that
> word is a verb, this period marks the end of a sentence, "kick the bucket"
> is a contiguous phrase, "white house" is a location and so on. I am seeking
> a good representation for such information so that they can be easily stored
> as additional fields in a lucene document, and easily recovered after a
> search. For the more technically inclined, this would allow me to store
> part-of-speech tags, chunk tags, sentence boundary markers and parse trees
> for every indexed document.
>
> These additional information will enable Lucene to perform additional
> post-processing on retrieved documents for various purposes such as
> information extraction, summarization, question answering, etc... Is there
> any available api? If not, I would appreciate any suggestions and tips on
> how such information can best be stored in a Lucene document.
The basic unit of index information in Lucene is the Term, which is a
combination of a field name and a token. For each term, Lucene indexes which
documents contain it and all of its positions within each document. Lucene
also indexes the field length as a norm.

By using one or more extra fields, the tags and sentence boundary markers can
easily be indexed at their positions. To search these, have a look at the span
package. If you want to search for tokens combined with some (part-of-speech)
tag, and the tokens and their tags are in different fields, the span package
is not sufficient, because it does not allow position searches across
different fields.

One way to use positions as sentence boundary markers is to leave gaps in the
positions at the sentence boundaries. This can safely be done when the slop
(allowed distance) in the queries is always smaller than this gap.

Parse trees carry (much) more information, and these will not be easy to map
to Lucene, but it all depends on the searches you want to support.

Regards,
Paul Elschot
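
P.S. To make the position idea above concrete, here is a minimal sketch in plain Python. It does not use the Lucene API; it only models a positional index for one document. The field names "text" and "pos", the gap of 100, and the helpers build_index and near are all assumptions made for illustration, and the distance test is a simplification of how Lucene computes slop.

```python
from collections import defaultdict

# Hypothetical gap; it must exceed the largest slop used in any query.
SENTENCE_GAP = 100


def build_index(sentences, gap=SENTENCE_GAP):
    """Map (field, term) -> list of positions for one document.

    `sentences` is a list of sentences, each a list of (word, tag) pairs.
    Words go into the assumed field "text" and tags into the assumed field
    "pos", both at the same position, so a tag can be found by position.
    """
    index = defaultdict(list)
    position = 0
    for sentence in sentences:
        for word, tag in sentence:
            index[("text", word.lower())].append(position)
            index[("pos", tag)].append(position)
            position += 1
        # Leave a gap so proximity matches cannot cross the sentence boundary.
        position += gap
    return index


def near(index, term_a, term_b, slop):
    """True if the two terms occur within `slop` positions of each other."""
    return any(
        abs(a - b) <= slop
        for a in index.get(term_a, [])
        for b in index.get(term_b, [])
    )


doc = [
    [("The", "DT"), ("dog", "NN"), ("barked", "VBD")],
    [("A", "DT"), ("cat", "NN"), ("slept", "VBD")],
]
index = build_index(doc)

# Same sentence: "dog" and "barked" are adjacent, so a small slop matches.
same_sentence = near(index, ("text", "dog"), ("text", "barked"), 5)
# Different sentences: the gap keeps "dog" and "cat" apart for a small slop.
across_sentences = near(index, ("text", "dog"), ("text", "cat"), 5)
```

In this sketch the match within a sentence succeeds and the match across the boundary fails, as long as the slop stays below the gap. A slop larger than the gap (for example 200 here) would match across sentences again, which is why the gap has to be chosen with the largest expected slop in mind.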