It all depends on what you want to do with the various tokens. Is "Nest 2 my nesting place" important from the point of view of geocoding, or not? You can define a grammar that recognizes "Building Names" preceding an address and separates them from the rest of the address.

@building @house @street @city @prov @country @postal
@house @street @city @prov @country @postal
@house @street @city @prov @postal

The three lines above are an example of setting up three variations of an address definition. The tokens are compared against all of them, and the best match sorts to the top of the list.

"@name" is a meta definition that can point to an explicit rule set or to another meta definition.

If @building is not important, you can assign the related tokens to a standard field that you ignore in the queries. If it is important, assign it to an appropriate field in the standardized address table for this query.
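To make the "compare against every variation, best match first" idea concrete, here is a minimal Python sketch. It is not the address-standardizer's algorithm or file format: the tiny lexicon, the one-letter token classes, the three regex "grammars" and the ranking rule are all assumptions invented for this illustration.

```python
import re

# Hypothetical mini-lexicon: token -> class letter. Unknown tokens become W (WORD).
LEXICON = {"AVENUE": "T", "STREET": "T", "ROAD": "T", "LONDON": "C"}

# Hypothetical grammar variations, written as regexes over the class string.
# N = number, W = word, T = street type, C = city, P = postcode part.
GRAMMARS = [
    ("building+address",
     r"(?P<building>[WN]+)(?P<house>N)(?P<street>W+T)(?P<area>W*)(?P<city>C)(?P<postal>P+)"),
    ("address",
     r"(?P<house>N)(?P<street>W+T)(?P<area>W*)(?P<city>C)(?P<postal>P+)"),
    ("address-no-postcode",
     r"(?P<house>N)(?P<street>W+T)(?P<area>W*)(?P<city>C)"),
]


def classify(token):
    """Return a one-letter class for a single token."""
    t = token.upper()
    if re.fullmatch(r"[A-Z]{1,2}\d[A-Z\d]?|\d[A-Z]{2}", t):
        return "P"  # looks like half of a UK postcode
    if re.fullmatch(r"\d+[A-Z]?", t):
        return "N"  # house or unit number such as 2 or 1B
    return LEXICON.get(t, "W")


def parse(address):
    """Try every grammar variation; return the matches, most fields filled first."""
    tokens = address.replace(",", " ").split()
    classes = "".join(classify(t) for t in tokens)  # one letter per token
    results = []
    for name, pattern in GRAMMARS:
        m = re.fullmatch(pattern, classes)
        if m:
            # A position in the class string is also an index into tokens.
            fields = {f: " ".join(tokens[m.start(f):m.end(f)]) for f in m.groupdict()}
            results.append((name, fields))
    results.sort(key=lambda r: sum(1 for v in r[1].values() if v), reverse=True)
    return results


best = parse("Nest 2 my nesting place 1B Great Avenue Forest Park London WS22 5TT")[0]
print(best)
```

With this toy setup the building name ends up in its own field, so a query can either use it or ignore it, while a shorter input such as "1B Great Avenue London WS22 5TT" falls through to the variation without a building.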

Here is the sample grammar for Great Britain (ignore the fact that it says Germany; that is a copy-and-paste error):
https://github.com/woodbri/address-standardizer/blob/develop/data/sample/greatbritain.gmr

And here is the sample lexicon used to classify and standardize tokens for Great Britain:
https://github.com/woodbri/address-standardizer/blob/develop/data/sample/greatbritain.lex

These can be edited as required.

Be aware that by default most tokens are classified as WORD, which is OK, but the more specific the classification, the more accurately tokens are assigned to the correct grammar terms.
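As a rough illustration of why specific lexicon entries matter, the Python sketch below standardizes and classifies tokens with a small lookup table and reports how many fell back to WORD. The entries, class names and the `word_fraction` helper are hypothetical and do not reflect the real .lex file format.

```python
# Hypothetical lexicon entries: raw token -> (standardized form, class).
LEXICON = {
    "AVE": ("AVENUE", "TYPE"),
    "AVENUE": ("AVENUE", "TYPE"),
    "RD": ("ROAD", "TYPE"),
    "FLAT": ("FLAT", "UNIT"),
    "LONDON": ("LONDON", "CITY"),
}


def classify(tokens, lexicon):
    """Return (standardized token, class) pairs; unknown tokens default to WORD."""
    out = []
    for tok in tokens:
        key = tok.upper().strip(".,")
        if key.isdigit():
            out.append((key, "NUMBER"))
        else:
            out.append(lexicon.get(key, (key, "WORD")))
    return out


def word_fraction(classified):
    """Share of tokens that fell back to the generic WORD class."""
    if not classified:
        return 0.0
    return sum(1 for _, cls in classified if cls == "WORD") / len(classified)


tokens = "Flat 1 122 Great Ave. Central London".split()
result = classify(tokens, LEXICON)
print(result)
print(word_fraction(result))
```

A high WORD fraction on your own data is a hint that the lexicon needs more entries before the grammar can do its job well.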

Also, this thread is a little off topic for PostGIS, so unless others are interested in following it, we should not continue on the list. If you want to use it, start with the list of steps I posted previously and read my docs. Geocoding is not easy, as I think you have already seen. I'm willing to help, but you need to get this built on your system so we can talk about the concrete issues you hit while implementing it, rather than potential geocoding problems, which are many, and many (most?) of which have already been dealt with in the code.

Also be aware that geocoding will never be 100% accurate because it is a language processing problem, but I have been able to get 95+% of a reference data set recognized and matched correctly with multiple different data sets.

-Steve


On 1/10/2021 9:57 AM, Shaozhong SHI wrote:
Hi, Steve,

Another solution that appeals to me most is as follows:

Given a space-delimited full address line, we can parse it into the correct BS7666 format.

Something like house number, street, area, city, postcode.

E.g., Nest 2 my nesting place 1B Great Avenue Forest Park London WS22 5TT

Can you enlighten me about that?

Regards,

David

On Saturday, 9 January 2021, Stephen Woodbridge <[email protected]> wrote:

    David,

    This is the link to the address standardizer:
    https://github.com/woodbri/address-standardizer

    This is a link to all the code that I developed while consulting. It
    includes a few SQL geocoders based on the code above, and has some
    README files discussing how to build a geocoder, which is the basis
    for how the geocoders work.

    https://github.com/woodbri/imaptools.com

    This is the geocoder for Tiger data, but the code is essentially
    the same for every country. When you load country-specific data
    into the database it goes into its own table, and you then
    standardize that data into the stdstreets table. All queries are
    done against the stdstreets table, so you only have to tweak the
    address range interpolation function, which needs to access the
    source streets table for the geometry and house number ranges.

    https://github.com/woodbri/imaptools.com/blob/master/sql-scripts/geocoder/prep-tiger-geo-new.sql

    I would approach this by:

    1. Get the address standardizer compiled and installed. I can help
    if you run into problems or have questions.
    2. Load your UK street data into the rawdata schema. Ideally we
    would create a table/view that presents this data as a single
    table where each record represents one side of the street and one
    jurisdiction. This may mean that a single record in your source
    data generates multiple records in this table/view (this greatly
    simplifies the coding and helps performance later).
    3. Look at the prep-tiger-geo-new.sql file.
    4. Create a stdstreets table and standardize your table/view data
    into it.
    5. Look at standardization failures and adjust the lexicon and
    grammar as needed.
    6. Loop back to 4 until good enough.
    7. Load the functions from the prep-tiger-geo-new.sql file and
    adjust any for your data.
    8. Try it out!
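Step 2 above (one record per street side and jurisdiction) can be sketched in Python as follows. This is only an illustration of the reshaping idea: the `RawSegment` and `SideRecord` field names are hypothetical and are not taken from the Tiger schema or from prep-tiger-geo-new.sql.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class RawSegment:
    """Hypothetical source row: one street segment carrying both sides."""
    gid: int
    name: str
    left_from: Optional[int]
    left_to: Optional[int]
    left_city: str
    right_from: Optional[int]
    right_to: Optional[int]
    right_city: str


@dataclass
class SideRecord:
    """Hypothetical output row: one side of the street, one jurisdiction."""
    gid: int
    side: str
    name: str
    num_from: int
    num_to: int
    city: str


def split_sides(seg: RawSegment) -> Iterator[SideRecord]:
    """Yield one record per side that actually has a house number range."""
    sides = (
        ("L", seg.left_from, seg.left_to, seg.left_city),
        ("R", seg.right_from, seg.right_to, seg.right_city),
    )
    for side, lo, hi, city in sides:
        if lo is None or hi is None:
            continue  # no addresses on this side
        yield SideRecord(seg.gid, side, seg.name, lo, hi, city)


seg = RawSegment(7, "GREAT AVENUE", 1, 99, "LONDON", 2, 100, "WESTMINSTER")
rows = list(split_sides(seg))
print(rows)
```

Keeping `gid` on every output row is what lets a later interpolation step go back to the source table for the geometry.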

    -Steve


    On 1/9/2021 10:22 AM, Shaozhong SHI wrote:

        Hi, Stephen,

        Many thanks.  We are interested in it working with UK
        addresses.

        Please send me the link to this.

        Regards,

        David

        On Sat, 9 Jan 2021 at 15:00, Stephen Woodbridge
        <[email protected]> wrote:

            David,

            Yup, and this is just one of dozens of cases that you have
            to deal with. You are dealing with a natural language
            processing problem, and you have to deal with human input
            that has typos and abbreviations.

            These issues are what the address standardizer fixes. It
            tokenizes the address, uses the gazetteer to standardize
            the terms, then classifies each term and assigns it to a
            part of the address based on a grammar.

            So there is a simple solution: use my address standardizer.
            It is free (MIT license), it has a sample lexicon/gazetteer
            and grammar for the UK, it is easy to modify these to fit
            your needs, and it just works. Oh, and if you want to do
            another country, it also has sample files for 25 countries.

            Sent from my iPhone

                On Jan 9, 2021, at 4:42 AM, Darafei Komяpa Praliaskouski
                <[email protected]> wrote:

                
                Hello,

                People make neural networks for this kind of task:

                https://github.com/openvenues/libpostal

                On Sat, 9 Jan 2021 at 12:40, Shaozhong SHI
                <[email protected]> wrote:

                    Hi, Steve W,

                    It is easy to parse addresses into tokens, but it is
                    difficult to put the tokens in the right columns,
                    because the same address can be expressed as a
                    partial address or a full address.

                    The same address can be written as "Flat 1 122 Great
                    Avenue London UK" or as "Flat 1 122 Great Avenue
                    Central London London United Kingdom".

                    When this happens, each version has a different
                    number of tokens. Is there a way to deal with this
                    so that each token gets into the right column?

                    Please enlighten me.

                    Regards,

                    David

                    On Sat, 25 Apr 2020 at 05:09, Stephen Woodbridge
                    <[email protected]> wrote:

                        And I have created an address-standardizer
                        project here,
                        https://github.com/woodbri/address-standardizer,
                        which is user configurable. It might be overkill
                        if you just want to strip off the number, in
                        which case you might just use a SQL regexp
                        replace to remove it.
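For the simple case of only splitting off a leading number such as "21-1", a regular expression is enough. The sketch below uses Python's `re` module rather than a SQL regexp function, and the pattern is an assumption about what counts as a house number (digits, an optional letter suffix, and optional hyphen- or slash-separated parts); adjust it to your data.

```python
import re

# Assumed house-number shape: 21, 1B, 21-1, 12/3A, followed by whitespace.
LEADING_NUMBER = re.compile(r"^\s*(\d+[A-Za-z]?(?:[-/]\d+[A-Za-z]?)*)\s+(.*)$")


def split_number(address):
    """Return (number, rest); number is '' when no leading number is found."""
    m = LEADING_NUMBER.match(address)
    if m is None:
        return "", address.strip()
    return m.group(1), m.group(2)


print(split_number("21-1 Great Avenue, a city, a country, this planet"))
```

This only handles a number at the very start of the line; inputs like "Flat 1 122 Great Avenue" need the tokenize-and-classify approach instead.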

                        -Steve W

                        On 4/25/2020 12:04 AM, Stephen Woodbridge wrote:
                        > PostGIS has an address_standardizer extension
                        > that includes parse_address() and
                        > standardize_address() functions.
                        >
                        > -Steve W
                        >
                        > On 4/24/2020 9:54 PM, Imre Samu wrote:
                        >> > handle addresses in postgresql
                        >>
                        >> maybe you can use the
                        >> https://github.com/openvenues/libpostal library
                        >> with your favorite language bindings (Python /
                        >> Ruby / Go / PHP / Node / R / Java ...)
                        >>
                        >> or as a Postgres database extension:
                        >>
                        >> https://info.crunchydata.com/blog/quick-and-dirty-address-matching-with-libpostal
                        >>
                        >> https://github.com/pramsey/pgsql-postal
                        >>
                        >> Regards,
                        >>  Imre
                        >>
                        >>
                        >>
                        >>
                        >> Shaozhong SHI <[email protected]> wrote (on Sat, 25 Apr 2020, 2:49):
                        >>
                        >>     I find this is a simple but important question.
                        >>
                        >>     How best to split numbers from the rest of an address?
                        >>
                        >>     For instance, one tricky one is as follows:
                        >>
                        >>     21-1 Great Avenue, a city, a country, this planet
                        >>
                        >>     How to turn this into the following:
                        >>
                        >>     column 1,       column 2
                        >>
                        >>       21-1              Great Avenue, a city, a country, this planet
                        >>
                        >>     Note:  there is a hyphen in 21-1
                        >>
                        >>     Any clue?
                        >>
                        >>     Regards,
                        >>
                        >>     Shao
                        >> _______________________________________________
                        >>     postgis-users mailing list
                        >>     [email protected]
                        >>     https://lists.osgeo.org/mailman/listinfo/postgis-users