The reason I only want 2 hits is because [2] is more "specific" in my domain -- I could also have Toronto, Ontario; Kingston, Ontario etc. which would take the hits up to 5 now.
What I'm really after is finding a way to index and search that would make [2] an invalid retrieval. My latest attempt is like this (field name: value): Type: city Name: london ontario canada Name: london on canada Name: london ontario ca Name: london on ca Primary-name: london So the new list of documents is something like this (<type>: <name entries> {<primary-name>}): [1] city: London, United Kingdom {london} [2] city: London, Ontario, Canada {london} [3] city: Ontario, California, United States {ontario} [4] state: Ontario, Canada {ontario} [5] city: Vancouver, Washington, United States {vancouver} [6] city: Vancouver, British Columbia, Canada {vancouver} [7] city: Washington, DC, United States {washington} [8] state: Washington, United States {washington} I realize that I'm adding a lot of duplicate info -- I haven't got to the refactoring stage yet, so I'm trying to keep my unit test setup very explicit. The final analysis process will be pulling the geographic entities from a database so I'll have all the synonyms, types (city, state, country), etc. at that point and can write custom routines for documents of each type (city, state, country). The idea here is to filter the results so that only documents where the primary-name appears in the user's query string come back. i.e. if the user typed "Ontario, CA", so only [3, 4] are valid results now since [2] has a primary-name of "london" which does not appear in the user's query, while [3, 4] both have a primary-name of "ontario". Now I'm just having some trouble creating a filter (I've managed so far to filter out _everything_). I can't quite sort out how to do a (displaying my SQL background here) "where <term> in <query string>". I'm including my current search code at the end of this response. Unfortunately I can't just assume the first term in the user's query is the primary-name since it could be more than one word (e.g. for "St. Louis du Ha Ha Quebec", "St. Louis du Ha Ha" is the primary-name). Thanks Colin // sample call: // Hits hits = GeographySearch.Search(searcher, "any", new String[] {"Ontario", "CA"}); public static Hits Search(Searcher searcher, String typeToFind, String[] queryString) throws IOException, ParseException { TermQuery entityType = new TermQuery(new Term("class", typeToFind)); BooleanQuery filterQuery = new BooleanQuery(); PhraseQuery query = new PhraseQuery(); query.setSlop(1); for (int i = 0; i < queryString.length; i++) { query.add(new Term("name", queryString[i].toLowerCase())); filterQuery.add( new TermQuery(new Term("primary-name", queryString[i])), BooleanClause.Occur.SHOULD); } BooleanQuery geographyQuery = new BooleanQuery(); if (typeToFind != "any") geographyQuery.add(entityType, BooleanClause.Occur.MUST); geographyQuery.add(query, BooleanClause.Occur.MUST); QueryFilter filter = new QueryFilter(filterQuery); Hits hits = searcher.search(geographyQuery, filter); return hits; } -----Original Message----- From: Rajesh Munavalli [mailto:[EMAIL PROTECTED] Sent: 27 January, 2006 14:28 To: java-user@lucene.apache.org Subject: Re: Help with indexing and query strategy Hi Colin, Even assuming you came up with a good way of indexing, the example query "Ontario, CA" should yield 3 hits. All 2, 3 and 4 are valid retrievals. Could you please justify which 2 hits you want and why? Thanks, Rajesh Munavalli Colin Young wrote: > I'm having some trouble coming up with a good search strategy for geographical data. e.g., given: > > [1] city: London, United Kingdom > [2] city: London, Ontario, Canada > [3] city: Ontario, California, United States [4] state: Ontario, > Canada [5] city: Vancouver, Washington, United States [6] city: > Vancouver, British Columbia, Canada [7] city: Washington, DC, United > States [8] state: Washington, United States > > and also given the following synonyms: > > Ontario = ON > California = CA > Washington = WA > Canada = CA > United States = US = America = United States of America United Kingdom > = UK = Great Britain = England > > for the following queries, I want the listed number of hits '()' from matching '[]': > > i. Ontario (2) [3, 4] > ii. London (2) [1, 2] > iii. Ontario, Canada (1) [4] > iv. Ontario, California (1) [3] > v. Ontario, CA (2) [3, 4] > vi. Ontario, US (1) [3] > vii. Vancouver (2) [5, 6] > viii. Washington (2) [7, 8] > ix. Washington, DC (1) [7] > x. Vancouver, CA (1) [6] > xi. Vancouver, WA (1) [5] > > How do I index and store the input (assume that I know the mechanics so I'm not looking for specific java syntax or how to generate synonyms during analysis) so that I get the desired results. My current attempt indexes strings like "London Ontario Canada", "London ON Canada", "London Ontario CA", "London ON CA" -- i.e. every combination of entity name and corresponding code -- in a content field and creates a type field containing "city" (or "state" or "country" as appropriate to identify the type of entity being indexed) and uses a phrase query with a slop of 1 which works really well except e.g. "Ontario CA" for which I'd like 2 hits, but given the above data gives 3 hits (from 2, 3 and 4, and the problem will only get worse as I add more cities in Ontario since each results in a hit). The slop of 1 is required since not all countries customarily use states, and I need to support the user optionally dropping the state as in the above example of "Ontario, CA" where we don't know if the user intended the "CA" to represent the state of California or the country of Canada, while "London, UK" would be unambiguous. > > The major problem as I see it is that at parse time I don't know if the user is searching for a city, state or country, and I don't want to force them to specify that. > > Does anyone have any good ideas to help me solve this problem? > > Thanks. > > Colin Young > > > Notice: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] Notice: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]