I'm currently working on building a Geocoder. The purpose of a Geocoder is to find the coordinates belonging to any given input address. I have a rather simple version based on Lucene working, however I have a feeling it can be a lot better. Also new functionality will be added, which is difficult to implement in the current version. I was hoping to get some input on what would be the best way to implement searching addresses with Lucene. Below the relevant requirements and some preconditions:
- the address is not structured, i.e. it's just a simple string which contains the address - a (dutch) address can, but does not have to, consist of the following parts: province, municipality, city, street name, house number, house number addition, zip code (4/6/7 chars) - the order of the parts can be random, so it can be "street name, city, zip code" or "zip code, street name, city" - the words in a part are always in the right order, a street name "A B C" will always be supplied as "A B C" and never "A C B" - there is no guaranty that the values of each type are unique: e.g. "Utrecht" is a province, municipality and a city - none of the parts except zip code (6/7 chars) is recognizable as a specific part - it should be possible to find matches if the address contains one or two typos (fuzzy search) - the combination zip code (6/7 chars) and house number (+ addition) uniquely identifies an address - the combination city, street name, house number (+addition) uniquely identifies an address - the results should be a specific as possible, eg. if an specific address is found, but also the city, just return the coordinates of the specific address - there can be multiple results if the given address matches to multiple parts: search on "Utrecht" will give you the coordinaties of the city, municipality and province - the data file consists of all addresses in the Netherlands where each line represents an address with all the address parts specified (except house number addition, because most addresses don't have one). This data can be arranged in any way necessary. - searching on this data file with address "Amersfoort" (a city) will result in all addresses in the city Amersfoort, however the Geocoder should only return the coordinates of the city Amersfoort - if the given address contains unknown values (eg. a house number which doesn't exist in the street/zip code), the address should be processed without those values Current implementation consist of multiple database tables each having a Lucene index. There is a "full address" table which contains all addresses, a "street name" table with all streets, a "city" table with all cities and so on. Each table is one level less detailed. I first check if there is a zip code and house number (direct match), and if not, I create queries to check if I can find an street, city, municipality or province. Then some logic is applied to determine which search results should be returned, eg. if a street and a city are found, only return the street. This setup does not meet all above requirements, the same data is stored in multiple tables, and around 5-6 Lucene queries are executed for each search, which seems inefficient. I think the best approach would be to parse the given address into the different parts, however I am not sure how to do this. I can verify each word of the address and a range of combinations of those words against Lucene indexes (for each address part-type), however the number of queries will only increase by this approach. Because finding a match for a word (of combination of words) does not mean I can stop matching for those word(s), because the word(s) are not unique for each address part-type. I understand that this is not directly a technical question, however I think this mailinglist is the best shot I have for discussing this problem :-) Undoubtedly there are questions, but feel free to ask for clarification! Kind regards, Jasper P.S. I am aware that Google and others have created Geocoders with this functionality and more, but those companies either have restrictions which are a problem, or charge per request, which also isn't an option. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org