Hi Jason,
Just a suggestion to start with, was almost an information overload for me
but here is what I could think of straight off. Let me know in case you try
it or have already tried it.

A few point I'd want to know.
1. I understand that the addresses would/could be
unstructured/ill-strcutured bu
t generally wouldn't parts be seperated by a comma? Also, wouldn't parts be
in a
n decreasing order of detail i.e. A,B,C would imply that A is a location in
B, w
hich happens to be in 'C'?
If what I just said is correct, how about indexing data with decreasing
boost?
i.e. A,B,C gets indexed in a single field with A having the max weight/boost
fol
lowed by B followed by C
e.g.
Record 1: 123, A Street, B City
Record 2: B City

For a search for B City, you'd get record 2 ranked higher up as compared to
1. F
or  search '123, A Street' you'd get record 1.
Also while searching you may tokenize on a comma or whatever set of chars
you fi
nd appropriate.



--
Anshum Gupta
http://ai-cafe.blogspot.com


On Tue, Oct 19, 2010 at 8:59 PM, Jasper de Barbanson <
lucene-mailingl...@de-barbanson.com> wrote:

> I'm currently working on building a Geocoder. The purpose of a
> Geocoder is to find the coordinates belonging to any given input
> address. I have a rather simple version based on Lucene working,
> however I have a feeling it can be a lot better. Also new
> functionality will be added, which is difficult to implement in the
> current version. I was hoping to get some input on what would be the
> best way to implement searching addresses with Lucene. Below the
> relevant requirements and some preconditions:
>
> - the address is not structured, i.e. it's just a simple string which
> contains the address
> - a (dutch) address can, but does not have to, consist of the
> following parts: province, municipality, city, street name, house
> number, house number addition, zip code (4/6/7 chars)
> - the order of the parts can be random, so it can be "street name,
> city, zip code" or "zip code, street name, city"
> - the words in a part are always in the right order, a street name "A
> B C" will always be supplied as "A B C" and never "A C B"
> - there is no guaranty that the values of each type are unique: e.g.
> "Utrecht" is a province, municipality and a city
> - none of the parts except zip code (6/7 chars) is recognizable as a
> specific part
> - it should be possible to find matches if the address contains one or
> two typos (fuzzy search)
> - the combination zip code (6/7 chars) and house number (+ addition)
> uniquely identifies an address
> - the combination city, street name, house number (+addition) uniquely
> identifies an address
> - the results should be a specific as possible, eg. if an specific
> address is found, but also the city, just return the coordinates of
> the specific address
> - there can be multiple results if the given address matches to
> multiple parts: search on "Utrecht" will give you the coordinaties of
> the city, municipality and province
> - the data file consists of all addresses in the Netherlands where
> each line represents an address with all the address parts specified
> (except house number addition, because most addresses don't have one).
> This data can be arranged in any way necessary.
> - searching on this data file with address "Amersfoort" (a city) will
> result in all addresses in the city Amersfoort, however the Geocoder
> should only return the coordinates of the city Amersfoort
> - if the given address contains unknown values (eg. a house number
> which doesn't exist in the street/zip code), the address should be
> processed without those values
>
> Current implementation consist of multiple database tables each having
> a Lucene index. There is a "full address" table which contains all
> addresses, a "street name" table with all streets, a "city" table with
> all cities and so on. Each table is one level less detailed. I first
> check if there is a zip code and house number (direct match), and if
> not, I create queries to check if I can find an street, city,
> municipality or province. Then some logic is applied to determine
> which search results should be returned, eg. if a street and a city
> are found, only return the street. This setup does not meet all above
> requirements, the same data is stored in multiple tables, and around
> 5-6 Lucene queries are executed for each search, which seems
> inefficient.
>
> I think the best approach would be to parse the given address into the
> different parts, however I am not sure how to do this. I can verify
> each word of the address and a range of combinations of those words
> against Lucene indexes (for each address part-type), however the
> number of queries will only increase by this approach. Because finding
> a match for a word (of combination of words) does not mean I can stop
> matching for those word(s), because the word(s) are not unique for
> each address part-type.
>
> I understand that this is not directly a technical question, however I
> think this mailinglist is the best shot I have for discussing this
> problem :-)
>
> Undoubtedly there are questions, but feel free to ask for clarification!
>
> Kind regards,
> Jasper
>
> P.S. I am aware that Google and others have created Geocoders with
> this functionality and more, but those companies either have
> restrictions which are a problem, or charge per request, which also
> isn't an option.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

Reply via email to