On 4/26/2020 8:09 AM, Shaozhong SHI wrote:
Hi, Steve,
Thanks.
2 questions.
1. How can we remove things like Room 2a, Buildings 2-6b and etc with
regexp replace?
You need to read up on how to write regexp, becuase you will keep
finding cases like this you want to add to your regexp and you need to
understand what they are doing. Compare the first one I gave you to this
and make sure you understand why I made these changes to it. Also this
is a little outside the scope of the PostGIS list.
'^( *(Buildings?|Bldg|Room) *[0-9]+)?[- 0-9]*[a-zA-Z]? '
2. Once extensions created, can these functions be adapted? Are
codes available ? I will see whether to put it into a project, so
that our programmers can have work to do.
The code for these in in the PostGIS repository, but if you are serious
about making changes to them, then I recommend working with
https://github.com/woodbri/address-standardizer because I rewrote these
function to be easier to modify. The existing code is very hard to
understand and very difficult to make changes to without breaking
things. My new code is hopefully well documented in the repository and
should be straight forward to work with. FYI, I have never built/tested
with PostgreSQL 12+ so it might need some changes to support that, but
it should be good for 9, 10 and 11 versions.
-Steve
Regards,
Shao
On Sun, 26 Apr 2020 at 03:09, Stephen Woodbridge
<[email protected] <mailto:[email protected]>>
wrote:
On 4/25/2020 7:19 PM, Shaozhong SHI wrote:
> Hi, Steve,
>
> Many thanks. Please send me the link to parse_address() and
> standardize_address().
If you already have postGIS installed then
create extension address_standardizer;
# \df parse_address
List of functions
Schema | Name | Result data type | Argument data types
| Type
--------+---------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+--------
public | parse_address | record | text, OUT num text, OUT
street text, OUT street2 text, OUT address1 text, OUT city text, OUT
state text, OUT zip text, OUT zipplus text, OUT country text | normal
(1 row)
# select * from parse_address('123-2 main street city ny');
num | street | street2 | address1 | city | state
| zip
| zipplus | country
-------+-------------+---------+-------------------+------+-------+-----+---------+---------
123-2 | main street | | 123-2 main street | city | NY |
| | US
(1 row)
# \df standardize*
List of functions
Schema | Name | Result data type |
Argument data types | Type
--------+---------------------+------------------+---------------------------------------------------------------+--------
public | standardize_address | stdaddr | lextab text,
gaztab
text, rultab text, address text | normal
public | standardize_address | stdaddr | lextab text,
gaztab
text, rultab text, micro text, macro text | normal
You need tables for the lexicon, gazetteer, and rules, which
should be
included in the extension but I'm not seeing them. So you can grab
these
from:
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-gaz.sql
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-lex.sql
https://raw.githubusercontent.com/woodbri/imaptools.com/master/sql-scripts/geocoder/us-rules.sql
and load them like:
psql mydb -f us-gaz.sql
psql mydb -f us-lex.sql
psql mydb -f us-rules.sql
# select * from standardize_address('lex', 'gaz', 'rules', '123-2
main
street city ny');
building | house_num | predir | qual | pretype | name | suftype |
sufdir | ruralroute | extra | city | state | country | postcode
| box
| unit
----------+-----------+--------+------+---------+--------+---------+--------+------------+-------+------+----------+---------+----------+-----+------
| 123 | | | | 2 MAIN | STREET
| | | | CITY | NEW YORK | USA | | |
(1 row)
This is a good example of why parsing addresses is so difficult. The
rules for standardize_address do not account for a house number like
"123-2", but the regexp in parse_address do handle it. It is easy
to get
the 80% right and very hard to get it much above that.
-Steve
>
> I need to find these first before test-running.
>
> Regards,
>
> Shao
>
> On Sat, 25 Apr 2020 at 21:20, Stephen Woodbridge
> <[email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>>
> wrote:
>
> Shao,
>
> '^( *Building *[0-9]+)?[- 0-9]*'
>
> or something like that should do it. But I think you will
find that a
> more robust solution is to use parse_address() and/or
> standardize_address() as they will recognize a lot of other
address
> constructs, like "apt 3a" for for example.
>
> parse_address() that a text field and breaks it into "house
number
> street name" and "city state zip", but only works well in North
> America.
>
> standardize_address() that comes with postGIS, breaks the
address
> down
> into its components and can separate out things like
buildings, and
> apartment/unit specifiers so you can then take the fields
you are
> interested in and recombine just them in a new string.
Again, this
> works
> best in North America.
>
> My github address-standardizer is built to recognize address
for most
> counties, but it can also be configured to recognize address
> standards
> for any county without too much effort. It compiles and
installs as
> postgresql extension.
>
> Addresses are generally very messy and unless your addresses
are vary
> simple you will be constantly fighting with this or that
exception.
>
> -Steve
>
> On 4/25/2020 2:55 PM, Shaozhong SHI wrote:
> > Is there a way to left trim including the building and number?
> >
> > Building 3 21-1 Great Avenue, a city, a
country, this
> > planet.
> >
> > How to take way those things which are too local to an
address?
> >
> > Regards,
> >
> > Shao
> >
> > On Sat, 25 Apr 2020 at 01:48, Shaozhong SHI
> <[email protected] <mailto:[email protected]>
<mailto:[email protected] <mailto:[email protected]>>
> > <mailto:[email protected]
<mailto:[email protected]> <mailto:[email protected]
<mailto:[email protected]>>>>
> wrote:
> >
> > I find this is a simple, but important question.
> >
> > How best to split numbers and the rest of address?
> >
> > For instance, one tricky one is as follows:
> >
> > 21-1 Great Avenue, a city, a country, this planet
> >
> > How to turn this into the following:
> >
> > column 1, column 2
> >
> > 21-1 Great Avenue, a city, a country,
this planet
> >
> > Note: there is a hyphen in 21-1
> >
> > Any clue?
> >
> > Regards,
> >
> > Shao
> >
> >
> > _______________________________________________
> > postgis-users mailing list
> > [email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
> > https://lists.osgeo.org/mailman/listinfo/postgis-users
>
> _______________________________________________
> postgis-users mailing list
> [email protected]
<mailto:[email protected]>
<mailto:[email protected]
<mailto:[email protected]>>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
>
>
> _______________________________________________
> postgis-users mailing list
> [email protected] <mailto:[email protected]>
> https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected] <mailto:[email protected]>
https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users
_______________________________________________
postgis-users mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/postgis-users