The reason I only want 2 hits is that [2] is more "specific" in my
domain -- I could also have Toronto, Ontario; Kingston, Ontario; etc.,
which would take the hits up to 5.

What I'm really after is finding a way to index and search that would
make [2] an invalid retrieval.

My latest attempt is like this (field name: value):

Type: city
Name: london ontario canada
Name: london on canada
Name: london ontario ca
Name: london on ca
Primary-name: london
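
For illustration, the Name entries above are every combination of one
alternative per part (city, state, country). A minimal sketch of generating
them, assuming the synonyms for each part are already available as lists;
the class and method names here are hypothetical and not part of Lucene:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class NameVariants {
    /**
     * Returns every combination that picks one alternative per part,
     * joined by single spaces and lowercased.
     */
    static List<String> expand(List<List<String>> parts) {
        List<String> results = new ArrayList<String>();
        if (parts.isEmpty()) {
            return results; // nothing to combine
        }
        results.add("");
        for (List<String> alternatives : parts) {
            List<String> next = new ArrayList<String>();
            for (String prefix : results) {
                for (String alt : alternatives) {
                    String joined = prefix.isEmpty() ? alt : prefix + " " + alt;
                    next.add(joined.toLowerCase(Locale.ROOT));
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // Alternatives for each part of "London, Ontario, Canada".
        List<List<String>> london = Arrays.asList(
                Arrays.asList("London"),
                Arrays.asList("Ontario", "ON"),
                Arrays.asList("Canada", "CA"));
        for (String name : expand(london)) {
            System.out.println("Name: " + name);
        }
    }
}
```

Each returned string would become one Name value on the document; the
Primary-name stays a single separate value.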

So the new list of documents is something like this (<type>: <name
entries> {<primary-name>}):

[1] city: London, United Kingdom {london}
[2] city: London, Ontario, Canada {london}
[3] city: Ontario, California, United States {ontario}
[4] state: Ontario, Canada {ontario}
[5] city: Vancouver, Washington, United States {vancouver}
[6] city: Vancouver, British Columbia, Canada {vancouver}
[7] city: Washington, DC, United States {washington}
[8] state: Washington, United States {washington}

I realize that I'm adding a lot of duplicate info -- I haven't got to
the refactoring stage yet, so I'm trying to keep my unit test setup very
explicit. The final analysis process will be pulling the geographic
entities from a database so I'll have all the synonyms, types (city,
state, country), etc. at that point and can write custom routines for
documents of each type (city, state, country).

The idea here is to filter the results so that only documents whose
primary-name appears in the user's query string come back. E.g. if the
user typed "Ontario, CA", only [3, 4] are valid results, since [2] has
a primary-name of "london", which does not appear in the user's query,
while [3, 4] both have a primary-name of "ontario". Now I'm just having
some trouble creating a filter (so far I've only managed to filter out
_everything_). I can't quite sort out how to do (displaying my SQL
background here) a "where <term> in <query string>". I'm including my
current search code at the end of this message.
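
To make the intended "where <term> in <query string>" semantics concrete,
here is a minimal sketch in plain Java, with no Lucene calls. It assumes the
candidate hits and their stored primary-name values are already in hand,
e.g. read back from the hits of the phrase query. The class and method
names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class PrimaryNameFilter {
    /** Lowercases, collapses punctuation to spaces, and pads with spaces. */
    static String normalize(String text) {
        String cleaned = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        return " " + cleaned + " ";
    }

    /** True if primaryName occurs in userQuery as a run of whole words. */
    static boolean appearsIn(String primaryName, String userQuery) {
        String needle = normalize(primaryName);
        if (needle.trim().isEmpty()) {
            return false; // an empty primary-name never matches
        }
        return normalize(userQuery).contains(needle);
    }

    /** Each candidate is {label, primaryName}; returns the labels to keep. */
    static List<String> keepMatching(String[][] candidates, String userQuery) {
        List<String> kept = new ArrayList<String>();
        for (String[] candidate : candidates) {
            if (appearsIn(candidate[1], userQuery)) {
                kept.add(candidate[0]);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins for the hits the phrase query returns for "Ontario, CA".
        String[][] hits = {
            {"[2]", "london"}, {"[3]", "ontario"}, {"[4]", "ontario"}
        };
        System.out.println(keepMatching(hits, "Ontario, CA"));
    }
}
```

Padding both strings with spaces keeps the match on whole words, so a
primary-name of "london" would not match a query for "Londonderry".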

Unfortunately I can't just assume the first term in the user's query is
the primary-name since it could be more than one word (e.g. for "St.
Louis du Ha Ha Quebec", "St. Louis du Ha Ha" is the primary-name).
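
One possible way around the multi-word problem, sketched below under
assumptions: generate every contiguous run of words from the user's query
as a candidate primary-name. If the primary-name field were indexed as a
single untokenized, lowercased term (an assumption about the index, not
something the code below checks), each candidate could then be added as a
SHOULD TermQuery on that field. The class and method names are
hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class PrimaryNameCandidates {
    /**
     * Returns every contiguous run of words in the query, lowercased,
     * with punctuation removed and duplicates dropped.
     */
    static List<String> candidates(String userQuery) {
        String cleaned = userQuery.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        Set<String> runs = new LinkedHashSet<String>();
        if (cleaned.isEmpty()) {
            return new ArrayList<String>(runs);
        }
        String[] tokens = cleaned.split(" ");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder run = new StringBuilder();
            for (int end = start; end < tokens.length; end++) {
                if (end > start) {
                    run.append(' ');
                }
                run.append(tokens[end]);
                runs.add(run.toString());
            }
        }
        return new ArrayList<String>(runs);
    }

    public static void main(String[] args) {
        for (String candidate : candidates("St. Louis du Ha Ha Quebec")) {
            System.out.println(candidate);
        }
    }
}
```

The number of candidates grows quadratically with the number of words,
which should be acceptable for short place-name queries. The stored
primary-name would need the same normalization (e.g. "st louis du ha ha")
for the terms to line up.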

Thanks

Colin

// sample call:
// Hits hits = GeographySearch.Search(searcher, "any", new String[] {"Ontario", "CA"});

public static Hits Search(Searcher searcher, String typeToFind, String[] queryString)
        throws IOException, ParseException
{
        TermQuery entityType = new TermQuery(new Term("class", typeToFind));
        BooleanQuery filterQuery = new BooleanQuery();
        PhraseQuery query = new PhraseQuery();
        query.setSlop(1);

        for (int i = 0; i < queryString.length; i++)
        {
                // phrase query over the lowercased name variants
                query.add(new Term("name", queryString[i].toLowerCase()));
                // filter: any query term may match the primary-name
                filterQuery.add(
                        new TermQuery(new Term("primary-name", queryString[i])),
                        BooleanClause.Occur.SHOULD);
        }

        BooleanQuery geographyQuery = new BooleanQuery();
        // equals() compares string contents; != only compares references
        if (!"any".equals(typeToFind)) geographyQuery.add(entityType, BooleanClause.Occur.MUST);
        geographyQuery.add(query, BooleanClause.Occur.MUST);

        QueryFilter filter = new QueryFilter(filterQuery);

        Hits hits = searcher.search(geographyQuery, filter);
        return hits;
}

-----Original Message-----
From: Rajesh Munavalli [mailto:[EMAIL PROTECTED] 
Sent: 27 January, 2006 14:28
To: java-user@lucene.apache.org
Subject: Re: Help with indexing and query strategy

Hi Colin,
         Even assuming you came up with a good way of indexing, the
example query "Ontario, CA" should yield 3 hits. All 2, 3 and 4 are
valid retrievals. Could you please justify which 2 hits you want and
why?

Thanks,

Rajesh Munavalli

Colin Young wrote:
> I'm having some trouble coming up with a good search strategy for
> geographical data. e.g., given:
>  
> [1] city: London, United Kingdom
> [2] city: London, Ontario, Canada
> [3] city: Ontario, California, United States
> [4] state: Ontario, Canada
> [5] city: Vancouver, Washington, United States
> [6] city: Vancouver, British Columbia, Canada
> [7] city: Washington, DC, United States
> [8] state: Washington, United States
>  
> and also given the following synonyms:
>  
> Ontario = ON
> California = CA
> Washington = WA
> Canada = CA
> United States = US = America = United States of America
> United Kingdom = UK = Great Britain = England
>  
> for the following queries, I want the listed number of hits '()' from
> matching '[]':
>  
> i. Ontario (2) [3, 4]
> ii. London (2) [1, 2]
> iii. Ontario, Canada (1) [4]
> iv. Ontario, California (1) [3]
> v. Ontario, CA (2) [3, 4]
> vi. Ontario, US (1) [3]
> vii. Vancouver (2) [5, 6]
> viii. Washington (2) [7, 8]
> ix. Washington, DC (1) [7]
> x. Vancouver, CA (1) [6]
> xi. Vancouver, WA (1) [5]
>  
> How do I index and store the input (assume that I know the mechanics,
> so I'm not looking for specific Java syntax or how to generate synonyms
> during analysis) so that I get the desired results? My current attempt
> indexes strings like "London Ontario Canada", "London ON Canada",
> "London Ontario CA", "London ON CA" -- i.e. every combination of entity
> name and corresponding code -- in a content field and creates a type
> field containing "city" (or "state" or "country", as appropriate to
> identify the type of entity being indexed). It uses a phrase query with
> a slop of 1, which works really well except for e.g. "Ontario CA", for
> which I'd like 2 hits but, given the above data, get 3 (from 2, 3 and
> 4; the problem will only get worse as I add more cities in Ontario,
> since each results in a hit).
>
> The slop of 1 is required since not all countries customarily use
> states, and I need to support the user optionally dropping the state,
> as in the above example of "Ontario, CA", where we don't know if the
> user intended "CA" to represent the state of California or the country
> of Canada, while "London, UK" would be unambiguous.
>  
> The major problem as I see it is that at parse time I don't know if
> the user is searching for a city, state or country, and I don't want to
> force them to specify that.
>  
> Does anyone have any good ideas to help me solve this problem?
>  
> Thanks.
>  
> Colin Young
>  
>
> Notice: This email message is for the sole use of the intended
recipient(s) and may contain confidential and privileged information.
Any unauthorized review, use, disclosure or distribution is prohibited.
If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>   

