I'm currently building a geocoder, i.e. a service that finds the
coordinates belonging to a given input address. I have a fairly simple
version working on top of Lucene, but I have a feeling it could be a
lot better. New functionality is also planned that will be difficult
to implement in the current version. I was hoping to get some input on
the best way to implement address search with Lucene. Below are the
relevant requirements and some preconditions:

- the address is not structured, i.e. it's just a simple string which
contains the address
- a (Dutch) address can, but does not have to, consist of the
following parts: province, municipality, city, street name, house
number, house number addition, zip code (4/6/7 chars)
- the order of the parts can be random, so it can be "street name,
city, zip code" or "zip code, street name, city"
- the words in a part are always in the right order, a street name "A
B C" will always be supplied as "A B C" and never "A C B"
- there is no guarantee that the values of each type are unique: e.g.
"Utrecht" is a province, a municipality and a city
- none of the parts except zip code (6/7 chars) is recognizable as a
specific part
- it should be possible to find matches if the address contains one or
two typos (fuzzy search)
- the combination zip code (6/7 chars) and house number (+ addition)
uniquely identifies an address
- the combination city, street name, house number (+addition) uniquely
identifies an address
- the results should be as specific as possible, e.g. if a specific
address is found, but also the city, just return the coordinates of
the specific address
- there can be multiple results if the given address matches
multiple parts: a search on "Utrecht" will give you the coordinates of
the city, municipality and province
- the data file consists of all addresses in the Netherlands where
each line represents an address with all the address parts specified
(except house number addition, because most addresses don't have one).
This data can be arranged in any way necessary.
- searching on this data file with address "Amersfoort" (a city) will
result in all addresses in the city Amersfoort, however the Geocoder
should only return the coordinates of the city Amersfoort
- if the given address contains unknown values (e.g. a house number
which doesn't exist in the street/zip code), the address should be
processed without those values
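
Since the 6/7-character zip code is the only part that can be
recognized on its own, it could be pulled out of the input before any
index lookup. A minimal sketch, assuming that form is four digits
followed by two letters, optionally separated by one space;
ZipCodeExtractor and its methods are hypothetical names used only for
illustration:

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZipCodeExtractor {
    // Assumption: four digits (first one non-zero), an optional single
    // space, then two letters, not glued to other letters or digits.
    private static final Pattern ZIP = Pattern.compile(
        "(?<![0-9A-Za-z])([1-9][0-9]{3}) ?([A-Za-z]{2})(?![0-9A-Za-z])");

    /** Returns the first zip code found, normalized to 6 characters. */
    static Optional<String> findZip(String address) {
        Matcher m = ZIP.matcher(address);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + m.group(2).toUpperCase(Locale.ROOT));
    }

    /** Returns the address without the first zip code, whitespace collapsed. */
    static String withoutZip(String address) {
        return ZIP.matcher(address).replaceFirst(" ").trim().replaceAll("\\s+", " ");
    }
}
```

A pattern like this would also match a house number followed by a
two-letter word, so a hit would still have to be verified against the
data before it is treated as a zip code.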

The current implementation consists of multiple database tables, each
with its own Lucene index. There is a "full address" table which
contains all addresses, a "street name" table with all streets, a
"city" table with all cities, and so on. Each table is one level less
detailed than the previous one. I first check whether the input
contains a zip code and house number (direct match). If not, I create
queries to check whether I can find a street, city, municipality or
province. Some logic is then applied to determine which search results
should be returned, e.g. if a street and a city are found, only the
street is returned. This setup does not meet all of the above
requirements, the same data is stored in multiple tables, and around
5-6 Lucene queries are executed for each search, which seems
inefficient.
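
The selection rule mentioned above (if a street and a city are both
found, only the street is returned) can be sketched roughly as
follows. This is a simplified illustration rather than the actual
code; the Level enum, the Hit class and mostDetailed are hypothetical
names:

```java
import java.util.ArrayList;
import java.util.List;

final class ResultSelector {
    // Hypothetical detail levels, ordered from least to most detailed.
    enum Level { PROVINCE, MUNICIPALITY, CITY, STREET, FULL_ADDRESS }

    /** One search result: the level it was found at and a display label. */
    static final class Hit {
        final Level level;
        final String label;

        Hit(Level level, String label) {
            this.level = level;
            this.label = label;
        }
    }

    /** Keeps only the hits at the most detailed level present in the input. */
    static List<Hit> mostDetailed(List<Hit> hits) {
        Level best = null;
        for (Hit h : hits) {
            if (best == null || h.level.compareTo(best) > 0) {
                best = h.level;
            }
        }
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (h.level == best) {
                kept.add(h);
            }
        }
        return kept;
    }
}
```

Hits that share the most detailed level all survive, so two matching
streets would both be returned.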

I think the best approach would be to parse the given address into
its different parts, but I am not sure how to do this. I could check
each word of the address, and a range of combinations of those words,
against a Lucene index per address part type, but that would only
increase the number of queries. Finding a match for a word (or
combination of words) does not mean I can stop matching those word(s),
because a value is not unique to a single address part type.
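
To make the growth in lookups concrete: because the words within a
part always appear in the right order, the candidates to check are the
contiguous word runs of the input. A minimal sketch in plain Java,
without Lucene; CandidatePhrases and contiguousRuns are hypothetical
names used only for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class CandidatePhrases {
    /** Returns all contiguous word runs of the input, longest first. */
    static List<String> contiguousRuns(String address) {
        // Split on whitespace and commas, dropping empty tokens.
        List<String> words = new ArrayList<>();
        for (String w : address.split("[\\s,]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> runs = new ArrayList<>();
        for (int len = words.size(); len >= 1; len--) {
            for (int start = 0; start + len <= words.size(); start++) {
                runs.add(String.join(" ", words.subList(start, start + len)));
            }
        }
        return runs;
    }
}
```

An input of n words yields n(n+1)/2 candidates, so "A B C" already
gives 6, and each candidate would have to be checked against every
address part type. Lucene's ShingleFilter produces comparable word
n-grams at analysis time, which might be a way to avoid issuing one
query per candidate.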

I understand that this is not directly a technical question, but I
think this mailing list is the best shot I have for discussing this
problem :-)

There are undoubtedly open questions, so feel free to ask for
clarification!

Kind regards,
Jasper

P.S. I am aware that Google and others have created Geocoders with
this functionality and more, but those companies either have
restrictions which are a problem, or charge per request, which also
isn't an option.

