Hi All:

I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.

The basic idea is to run queries to find out which booklets and page numbers match, so that people know where to look for information in the (rather
large and dry) booklets. Therefore each Document in my index represents a
particular page in one of the booklets.

So far I have been able to successfully scrape the raw text from the booklets, insert it into an index, and query it just fine using StandardAnalyzer on both
ends.

So here's my general question:
Many queries on the index will involve searching for place names mentioned in the booklets. Some place names use notational variants. For instance, in the body text it will be called "Ship Creek" but in a diagram it might be listed as "Ship Cr." or
elsewhere as "Ship Ck.".

If I search for (Ship AND (Cr Ck Creek)), this does not give me what I want: other words may appear between [ship] and [cr]/[ck]/[creek], which leads to false positives.

What I need to know is how to treat the two consecutive words as a single term and how to add the notational variants as synonyms. In a nutshell, I need the basic processing provided by StandardAnalyzer, plus term grouping that emits place names
as complete terms and inserts synonymous terms to cover the variants.

For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
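To make the target output concrete, here is a minimal plain-Java sketch of the grouping and expansion I have in mind. It deliberately uses no Lucene classes, so it only illustrates the logic and not how a TokenFilter would be written. The class name PlaceNameGrouper, its method names, and the choice of "creek" as the canonical suffix are all hypothetical, and the input is assumed to be the lowercased, stop-word-free tokens that StandardAnalyzer would produce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java, no Lucene classes. All names are hypothetical.
class PlaceNameGrouper {

    // Suffix variants taken from the examples in this post; "creek" is treated as canonical.
    static final List<String> CREEK_VARIANTS = Arrays.asList("creek", "ck", "cr");

    /** Merges a known place name followed by creek/ck/cr into one "<name> creek" term. */
    static List<String> group(List<String> tokens, Set<String> knownNames) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            String current = tokens.get(i);
            boolean hasNext = i + 1 < tokens.size();
            if (hasNext && knownNames.contains(current)
                    && CREEK_VARIANTS.contains(tokens.get(i + 1))) {
                out.add(current + " creek"); // normalize the suffix to the canonical form
                i += 2;                      // consume both tokens
            } else {
                out.add(current);
                i++;
            }
        }
        return out;
    }

    /** Expands a canonical "<name> creek" term into all of its notational variants. */
    static List<String> expand(String term) {
        List<String> out = new ArrayList<String>();
        String suffix = " creek";
        if (term.endsWith(suffix)) {
            String name = term.substring(0, term.length() - suffix.length());
            for (String variant : CREEK_VARIANTS) {
                out.add(name + " " + variant);
            }
        } else {
            out.add(term); // not a place-name term, pass through unchanged
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("allowed", "mouth", "ship", "creek", "upstream");
        Set<String> names = new HashSet<String>(Arrays.asList("ship"));
        System.out.println(group(tokens, names));
        System.out.println(expand("ship creek"));
    }
}
```

In a real analyzer chain the variants would presumably be emitted at the same position as the original term, which this sketch does not model.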

As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
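For the list case, one possible approach is sketched below in plain Java (again without any Lucene classes, so it shows the logic only). It assumes the commas and stop words are already gone, so the only signal left is a run of known place names immediately before the plural "creeks". The class name PluralListExpander and the knownNames set are hypothetical, and a name that is also a common word (such as "bird") could trigger false merges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: plain Java, no Lucene classes. All names are hypothetical.
class PluralListExpander {

    /**
     * When the token "creeks" follows a run of known place names, rewrites each
     * name in that run as "<name> creek" and drops the plural token itself.
     */
    static List<String> expand(List<String> tokens, Set<String> knownNames) {
        List<String> out = new ArrayList<String>();
        for (String token : tokens) {
            if (token.equals("creeks")) {
                // Walk backwards over the known names that directly precede "creeks".
                int start = out.size();
                while (start > 0 && knownNames.contains(out.get(start - 1))) {
                    start--;
                }
                if (start < out.size()) {
                    for (int j = start; j < out.size(); j++) {
                        out.set(j, out.get(j) + " creek");
                    }
                    continue; // the plural has been distributed over the names
                }
            }
            out.add(token); // ordinary token, or "creeks" with no names before it
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "except", "ship", "bird", "campbell", "creeks", "where", "limit");
        Set<String> names = new HashSet<String>(Arrays.asList("ship", "bird", "campbell"));
        System.out.println(expand(tokens, names));
    }
}
```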

Should the detection and merging be done in a TokenFilter?
Some of the term grouping can probably be done heuristically (for example, [*],[creek] becomes [* creek]), but I also have an exhaustive list of the places mentioned in the text, if that helps.

Thanks for any help you can provide.
Jason
