Have a look at libpostal for parsing addresses: https://github.com/openvenues/libpostal
There's also a PostgreSQL extension for it: https://github.com/pramsey/pgsql-postal

On Tue, 29 Nov 2016 at 15:24, Tom <[email protected]> wrote:
> Hi Sarah and Dmitry,
>
> thanks for your responses! I will definitely look into the libpostal
> project later on, as well as some of the geocoders Dmitry suggested.
>
> But right now I'm doing some tests with pg_trgm. And Sarah, so far I
> cannot confirm your comment
>
> > "Trigrams only work with misspellings of a letter or two, they fail
> > completely when trying to match up abbreviations."
>
> To me the opposite seems true, as you can see in the following examples.
> Let's take this address, both as I want to look it up and as OSM has it
> stored and spelled.
>
>            (asked address)    (OSM address)
>   street:  Верещагина ул      улица Верещагина
>   town:    Ханская ст-ца      Ханская
>   city:    Майкоп г           городской округ Майкоп
>   region:  Адыгея Респ        Адыгея
>
> The Nominatim standard query is basically this (for the street):
>
>   select word_id, word_token, word
>   from word
>   where word_token = make_standard_name('Ханская ст-ца')
>
> …and does not return anything.
>
> Now I enabled the extension (CREATE EXTENSION pg_trgm;), created an
> index (CREATE INDEX word_token_trgm_idx ON word USING GIST (word_token
> gist_trgm_ops);) and modified the select slightly:
>
>   select word_id, word_token, word,
>     gettokenstring(transliteration('Верещагина ул')) as asked,
>     similarity(word_token, gettokenstring(transliteration('Верещагина ул'))) as sml
>   from word
>   where word_token % make_standard_name('Верещагина ул')
>   order by sml desc
>   limit 20
>
> …and this is the result (I hope the formatting gets through…):
>
>   word_id  word_token                word                   asked                  sml
>   19098    " ul virishchaghina"      "улица Верещагина"     " virishchaghina ul "  1.0
>   19099    "ul virishchaghina"       ""                     " virishchaghina ul "  1.0
>   19100    "virishchaghina"          ""                     " virishchaghina ul "  0.833333
>   1525904  " virishchaghina"         "Верещагина"           " virishchaghina ul "  0.833333
>   115343   "ul virishchaghino"       ""                     " virishchaghina ul "  0.8
>   115342   " ul virishchaghino"      "улица Верещагино"     " virishchaghina ul "  0.8
>   568775   " n virishchaghina"       "На Верещагина"        " virishchaghina ul "  0.75
>   568776   "n virishchaghina"        ""                     " virishchaghina ul "  0.75
>   1256480  " pl virishchaghina"      "площадь Верещагина"   " virishchaghina ul "  0.714286
>   1256481  "pl virishchaghina"       ""                     " virishchaghina ul "  0.714286
>   351652   " virishchaghin"          "Верещагин"            " virishchaghina ul "  0.684211
>   351653   "virishchaghin"           ""                     " virishchaghina ul "  0.684211
>   217731   " virishchaghinskaia ul"  "Верещагинская улица"  " virishchaghina ul "  0.666667
>   217732   "virishchaghinskaia ul"   ""                     " virishchaghina ul "  0.666667
>   115344   "virishchaghino"          ""                     " virishchaghina ul "  0.65
>   824366   " v v virishchaghin"      "В.В.Верещагин"        " virishchaghina ul "  0.65
>   824367   "v v virishchaghin"       ""                     " virishchaghina ul "  0.65
>   855756   " virishchaghino"         "Верещагино"           " virishchaghina ul "  0.65
>   721916   "ur virishchaghino"       ""                     " virishchaghina ul "  0.636364
>   721915   " ur virishchaghino"      "ур. Верещагино"       " virishchaghina ul "  0.636364
>
> So the first two answers, with a similarity of 1 (= 100%), are exactly
> the street I asked for!
>
> The same happens with the town ("Ханская ст-ца" <-> "Ханская") and with
> the region ("Адыгея Респ" <-> "Адыгея"). Of course the similarity is not
> always 1, but this doesn't matter as long as the best match is still my
> address. Furthermore, it tells me how certain the answer is, so I can
> deal with that information.
>
> What Sarah mentions might apply to the city ("Майкоп г" <-> "городской
> округ Майкоп"), where the real answer only appears as the 23rd result
> with a similarity of 40%, after the "best" (but wrong) match at 70%.
>
> Maybe libpostal could help here, or maybe the OSM data or the name I
> asked for is wrong. Anyway, this would be acceptable given the huge
> difference in spelling. It could even be fixed with a clever combination
> of region, city, town and street.
>
> So, in conclusion, to me pg_trgm looks really promising! And the query
> doesn't change a lot. Sure, Nominatim would have to deal with the
> similarity in the response, but that doesn't seem like a huge thing,
> does it?
>
> Kind regards,
>
> Tom
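
The scores in the quoted result can be reproduced outside the database. Here is a minimal Python sketch of how pg_trgm arrives at a similarity, assuming its documented behaviour: lowercase the text, split it into words on non-alphanumeric characters, pad each word with two leading spaces and one trailing space, then divide the number of shared trigrams by the number of distinct trigrams across both strings. The function names trigrams and similarity are my own, and this is only an approximation for illustration, not the extension's actual code.

```python
import re


def trigrams(text):
    """Collect the set of trigrams, roughly the way pg_trgm is documented to."""
    grams = set()
    # Assumption: every run of letters/digits is one word; everything else
    # (spaces, punctuation, underscores) only separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


asked = " virishchaghina ul "
print(round(similarity(asked, " ul virishchaghina"), 6))  # 1.0
print(round(similarity(asked, " virishchaghina"), 6))     # 0.833333
print(round(similarity(asked, "ul virishchaghino"), 6))   # 0.8
```

Because the trigrams are collected per word into one set, word order does not matter, which would explain why "virishchaghina ul" and "ul virishchaghina" score 1.0. A short abbreviation such as "ul" adds only three trigrams to the union, so dropping it costs little, whereas long extra words (as in the city example) add many and pull the score down.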
_______________________________________________
dev mailing list
[email protected]
https://lists.openstreetmap.org/listinfo/dev

