On 4/29/2020 9:15 AM, Shaozhong SHI wrote:
Hi, Steve,
I have address_standardizer extension created.
It is not like other functions. How to review its code? Does it need
to adapted to handle other countries addresses?
As I tried to indicate, the postGIS address_standardizer extension is
extremely hard to understand and nearly impossible to customize the
files. I migrated the PAGC code into this extension but I barely
understood the code. Conceptually, the ideas are straight forward, but
the code is very hard to follow and understand so I portted it as a
black box.
If the exisiting address_standardizer_data_us does not work for you or
you need to support other countries, you will need to build and install
the one here:
https://github.com/woodbri/address-standardizer
This was written as a replacement
This one has sample files for 25 countries here:
https://github.com/woodbri/address-standardizer/tree/develop/data/sample
that you can customize and has (hopefully) good documentation here:
https://github.com/woodbri/address-standardizer/blob/develop/README.md
https://github.com/woodbri/address-standardizer/blob/develop/DOCUMENTATION.md
https://github.com/woodbri/address-standardizer/tree/develop/data
Also, this code is written in C++ and I hope it is written to be easier
to understand and review.
Not sure I can help much more than this unless you have questions on
https://github.com/woodbri/address-standardizer which we should probably
take off the PostGIS list as this is not part of PostGIS.
-Steve
Regards,
Shao
On Sun, 26 Apr 2020 at 17:23, Stephen Woodbridge
<[email protected] <mailto:[email protected]>>
wrote:
Shao,
I just remembered the lex, gaz, and rules data is in a separate
extension. The correct way to install it is with:
create extension address_standardizer_data_us;
-Steve
On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Thanks.
>
> 2 questions.
>
> 1. How can we remove things like Room 2a, Buildings 2-6b and
etc with
> regexp replace?
> 2. Once extensions created, can these functions be adapted? Are
> codes available ? I will see whether to put it into a
project, so
> that our programmers can have work to do.
>
> Regards,
>
> Shao
>
> On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
> <[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>>
> wrote:
>
> On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
> > Hi, Steve,
> >
> > Many thanks. Please send me the link to parse_address() and
> > standardize_address().
>
> If you already have postGIS installed then
>
> create extension address_standardizer;
>
> # \df parse_address
> List of functions
> Schema | Name | Result data type | Argument data
types
> | Type
>
--------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
> public | parse_address | record | text, OUT num
text, OUT
> street text, OUT street2 text, OUT address1 text, OUT city
text, OUT
> state text, OUT zip text, OUT zipplus text, OUT country text
| normal
> (1 row)
>
> # select * from parse_address('123-2 main street city ny');
> num | street | street2 | address1 | city |
state
> | zip
> | zipplus | country
>
-------+-------------+---------+-------------------+------+-------+-----+---------+---------
> 123-2 | main street | | 123-2 main street | city |
NY |
> | | US
> (1 row)
>
> # \df standardize*
> List of functions
> Schema | Name | Result data type |
> Argument data types | Type
>
--------+---------------------+------------------+---------------------------------------------------------------+--------
> public | standardize_address | stdaddr | lextab text,
> gaztab
> text, rultab text, address text | normal
> public | standardize_address | stdaddr | lextab text,
> gaztab
> text, rultab text, micro text, macro text | normal
>
> You need tables for the lexicon, gazetteer, and rules, which
> should be
> included in the extension but I'm not seeing them. So you
can grab
> these
> from:
>
>
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
>
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
>
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
>
> and load them like:
>
> psql mydb -f us-gaz.sql
> psql mydb -f us-lex.sql
> psql mydb -f us-rules.sql
>
> # select * from standardize_address('lex', 'gaz', 'rules',
'123-2
> main
> street city ny');
> building | house_num | predir | qual | pretype | name |
suftype |
> sufdir | ruralroute | extra | city | state | country |
postcode
> | box
> | unit
>
----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
> | 123 | | | | 2 MAIN |
STREET
> | | | | CITY | NEW YORK | USA |
| |
> (1 row)
>
>
> This is a good example of why parsing addresses is so
difficult. The
> rules for standardize_address do not account for a house
number like
> "123-2", but the regexp in parse_address do handle it. It is
easy
> to get
> the 80% right and very hard to get it much above that.
>
> -Steve
>
>
> >
> > I need to find these first before test-running.
> >
> > Regards,
> >
> > Shao
> >
> > On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> > <[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>
> <mailto:[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>>>
> > wrote:
> >
> > Shao,
> >
> > '^( *Building *[0-9]+)?[- 0-9]*'
> >
> > or something like that should do it. But I think you will
> find that a
> > more robust solution is to use parse_address() and/or
> > standardize_address() as they will recognize a lot of
other
> address
> > constructs, like "apt 3a" for for example.
> >
> > parse_address() that a text field and breaks it into
"house
> number
> > street name" and "city state zip", but only works well
in North
> > America.
> >
> > standardize_address() that comes with postGIS, breaks the
> address
> > down
> > into its components and can separate out things like
> buildings, and
> > apartment/unit specifiers so you can then take the fields
> you are
> > interested in and recombine just them in a new string.
> Again, this
> > works
> > best in North America.
> >
> > My github address-standardizer is built to recognize
address
> for most
> > counties, but it can also be configured to recognize
address
> > standards
> > for any county without too much effort. It compiles and
> installs as
> > postgresql extension.
> >
> > Addresses are generally very messy and unless your
addresses
> are vary
> > simple you will be constantly fighting with this or that
> exception.
> >
> > -Steve
> >
> > On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
> > > Is there a way to left trim including the building
and number?
> > >
> > > Building 3 21-1 Great Avenue, a city, a
> country, this
> > > planet.
> > >
> > > How to take way those things which are too local to an
> address?
> > >
> > > Regards,
> > >
> > > Shao
> > >
> > > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
> > <[email protected]
<mailto:[email protected]> <mailto:[email protected]
<mailto:[email protected]>>
> <mailto:[email protected]
<mailto:[email protected]> <mailto:[email protected]
<mailto:[email protected]>>>
> > > <mailto:[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>> <mailto:[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>>>>
> > wrote:
> > >
> > > I find this is a simple, but important question.
> > >
> > > How best to split numbers and the rest of address?
> > >
> > > For instance, one tricky one is as follows:
> > >
> > > 21-1 Great Avenue, a city, a country, this planet
> > >
> > > How to turn this into the following:
> > >
> > > column 1, column 2
> > >
> > > 21-1 Great Avenue, a city, a country,
> this planet
> > >
> > > Note: there is a hyphen in 21-1
> > >
> > > Any clue?
> > >
> > > Regards,
> > >
> > > Shao
> > >
> > >
> > > _______________________________________________
> > > postgis-users mailing list
> > > [email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>
> <mailto:[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>>
> > > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> > _______________________________________________
> > postgis-users mailing list
> > [email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>
> <mailto:[email protected]
<mailto:[email protected]>
> <mailto:[email protected]
<mailto:[email protected]>>>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > [email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> [email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> [email protected] <mailto:[email protected]>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected] <mailto:[email protected]>
https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users