Re: [SQL] extracting location info from string

2011-05-25 Thread Lew

Tarlika Elisabeth Schmitz wrote:
> Lew wrote:
>> That isn't a table structure, that's a freeform text structure.  You
>> didn't state your question, Tarlika, but your database structure is
>> terrible.  For example, "region" and "country" should be different
>> columns.
>
> I presume you are referring to my original post:
> CREATE TABLE person
> (
>    id integer NOT NULL,
>    "name" character varying(256) NOT NULL,
>    "location" character varying(256),
>    CONSTRAINT person_pkey PRIMARY KEY (id)
> );
>
> Sorry, this was just a TEMPORARY table I created for quick analysis of
> my CSV data (now renamed to temp_person).
>
> The target table is:
> CREATE TABLE person
> (
>    id integer NOT NULL,
>    "name" character varying(100) NOT NULL,
>    country character varying(3),
>    county character varying(3),
>    town character varying(50),
>    CONSTRAINT trainer_pkey PRIMARY KEY (id),
>    CONSTRAINT country_person_fk FOREIGN KEY (country)
>    REFERENCES country (id) MATCH SIMPLE
>    ON UPDATE CASCADE ON DELETE RESTRICT,
>    CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>    REFERENCES county (country, code) MATCH SIMPLE
>    ON UPDATE NO ACTION ON DELETE NO ACTION
> );


Ah, yes, that makes much more sense.  Temporary tables such as you describe 
can be very convenient and effective.  Thanks for the clarification.


I think this problem is very widespread, namely how to get structured 
information out of freeform data.  I've encountered it many times over the 
years, as have so many I know.  I believe that human intervention will always 
be needed for this type of work, e.g., distinguishing place names that seem 
the same or correlating ones that seem distinct.  I also don't know of any 
perfect approach.  Perhaps the best one can find is a probabilistic promise 
that error will be less than some epsilon.


That said, if you have a robust process to correct errors as the user 
population discovers them, then you can approach perfection asymptotically. 
Sometimes the best solution to a technical problem is a good human process. 
From an engineering standpoint, user feedback is a vital element of 
homeostatic control.


Edward W. Rouse's suggestion of a reference table to resolve different forms 
of address or region identification would fit well with such a process.


--
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Performance of NOT IN and <> with PG 9.0.4

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 7:45 AM, Jasmin Dizdarevic
 wrote:
> Hi,
> found the problem.
> 238 sec. with set enable_material = 'on'
> 4(!) sec. with set enable_material = 'off'
>
> @Robert Haas: I thought it would be interesting to you, because
> you've committed a patch regarding materialization for 9.0. If you'd like
> to investigate this further, I can provide more details.

Well, it makes me glad I insisted we add enable_material.

But I can't really tell from this output what is happening.  Can we
see the EXPLAIN ANALYZE output on 9.0, with and without
enable_material?
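
That is, the same statement run both ways, along these lines (the
generate_series query here is only a stand-in for the real one):

-- illustration only: substitute the actual query for the placeholder
SET enable_material = on;
EXPLAIN ANALYZE SELECT count(*) FROM generate_series(1, 1000) AS g;

SET enable_material = off;
EXPLAIN ANALYZE SELECT count(*) FROM generate_series(1, 1000) AS g;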

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Performance of NOT IN and <> with PG 9.0.4

2011-05-25 Thread Robert Haas
On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic
 wrote:
> enable_material = off

Is there any chance you can reproduce this with a simpler test case
that doesn't involve quite so many joins?

It looks to me like shutting off enable_material is saving you mostly
by accident here.  There's only one materialize node in the whole
plan.

And just incidentally, do you have any of the other enable_* settings
turned off?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Performance of NOT IN and <> with PG 9.0.4

2011-05-25 Thread Tom Lane
Robert Haas  writes:
> On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic
>  wrote:
>> enable_material = off

> Is there any chance you can reproduce this with a simpler test case
> that doesn't involve quite so many joins?

I didn't stop to count, but are there enough that join_collapse_limit
or from_collapse_limit could be in play?
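
(For reference: both limits default to 8, so with more than eight tables
the planner stops collapsing the join list / FROM list and its search is
constrained. Both can be checked and raised per session:)

SHOW join_collapse_limit;      -- default 8
SHOW from_collapse_limit;      -- default 8
SET join_collapse_limit = 16;  -- let the planner reorder larger join lists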

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-25 Thread Rob Sargent



On 05/24/2011 10:57 AM, Lew wrote:
> Tarlika Elisabeth Schmitz wrote:
>> Lew wrote:
>>> That isn't a table structure, that's a freeform text structure. You
>>> didn't state your question, Tarlika, but your database structure is
>>> terrible. For example, "region" and "country" should be different
>>> columns.
>>
>> I presume you are referring to my original post:
>> CREATE TABLE person
>> (
>> id integer NOT NULL,
>> "name" character varying(256) NOT NULL,
>> "location" character varying(256),
>> CONSTRAINT person_pkey PRIMARY KEY (id)
>> );
>>
>> Sorry, this was just a TEMPORARY table I created for quick analysis of
>> my CSV data (now renamed to temp_person).
>>
>> The target table is:
>> CREATE TABLE person
>> (
>> id integer NOT NULL,
>> "name" character varying(100) NOT NULL,
>> country character varying(3),
>> county character varying(3),
>> town character varying(50),
>> CONSTRAINT trainer_pkey PRIMARY KEY (id),
>> CONSTRAINT country_person_fk FOREIGN KEY (country)
>> REFERENCES country (id) MATCH SIMPLE
>> ON UPDATE CASCADE ON DELETE RESTRICT,
>> CONSTRAINT county_person_fk FOREIGN KEY (country, county)
>> REFERENCES county (country, code) MATCH SIMPLE
>> ON UPDATE NO ACTION ON DELETE NO ACTION
>> );
>
> Ah, yes, that makes much more sense. Temporary tables such as you
> describe can be very convenient and effective. Thanks for the
> clarification.
>
> I think this problem is very widespread, namely how to get structured
> information out of freeform data. I've encountered it many times over
> the years, as have so many I know. I believe that human intervention
> will always be needed for this type of work, e.g., distinguishing place
> names that seem the same or correlating ones that seem distinct. I also
> don't know of any perfect approach. Perhaps the best one can find is a
> probabilistic promise that error will be less than some epsilon.
>
> That said, if you have a robust process to correct errors as the user
> population discovers them, then you can approach perfection
> asymptotically. Sometimes the best solution to a technical problem is a
> good human process. From an engineering standpoint, user feedback is a
> vital element of homeostatic control.
>
> Edward W. Rouse's suggestion of a reference table to resolve different
> forms of address or region identification would fit well with such a
> process.

To minimize the ultimately quite necessary human adjudication, one might 
make good use of what is often termed "crowd sourcing":  Keep all the 
distinct "hand entered" values and a map to the final human assessment. 
 At least repeated oddities won't repeatedly involve hand editing.


--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Performance of NOT IN and <> with PG 9.0.4

2011-05-25 Thread Jasmin Dizdarevic
As I understand the docs, those two limits should not come into play:
performance only degrades when I add two aggregated columns, and only
when enable_material is on.

2011/5/25 Tom Lane 

> Robert Haas  writes:
> > On Tue, May 24, 2011 at 3:32 PM, Jasmin Dizdarevic
> >  wrote:
> >> enable_material = off
>
> > Is there any chance you can reproduce this with a simpler test case
> > that doesn't involve quite so many joins?
>
> I didn't stop to count, but are there enough that join_collapse_limit
> or from_collapse_limit could be in play?
>
>regards, tom lane
>


Re: [SQL] extracting location info from string

2011-05-25 Thread Tarlika Elisabeth Schmitz
On Tue, 24 May 2011 12:57:57 -0400
Lew  wrote:

>Tarlika Elisabeth Schmitz wrote:
>>this was just a TEMPORARY table I created for quick analysis
>> of my CSV data (now renamed to temp_person).
>Ah, yes, that makes much more sense.  Temporary tables such as you
>describe can be very convenient and effective. 

The ER model is pretty simple. It has only 30 tables, of which 12 are
"active" tables; the rest are pretty static (e.g. COUNTRY).
There are 7-8 CSV formats and my idea is to import, clean up, normalize
and distribute the data via temp tables and associated triggers.

The whole process has to be pretty much automated as the volume is too big
to babysit the import. As I said in my other post, with the daily import, I
can't reject a record just because part of its data cannot [yet] be tidied,
because other data will relate to it.

Therefore my intention is to partially import (using the above example:
import a PERSON without resolving the location) and flag such cases up for
later manual intervention in a log table.
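
For illustration, a minimal sketch of that flagging step (the import_log
table, its columns, and the trigger are hypothetical names, not my actual
schema; the real clean-up rules would go where the comment says):

CREATE TABLE import_log
(
   id serial PRIMARY KEY,
   source_table character varying(50) NOT NULL,
   source_id integer NOT NULL,
   problem text NOT NULL,
   logged_at timestamp NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION temp_person_import() RETURNS trigger AS $$
BEGIN
   -- real code would first try to resolve NEW."location" here
   INSERT INTO person (id, "name")
   VALUES (NEW.id, substr(NEW."name", 1, 100));  -- country/county/town stay NULL
   IF NEW."location" IS NOT NULL THEN
      INSERT INTO import_log (source_table, source_id, problem)
      VALUES ('temp_person', NEW.id,
              'unresolved location: ' || NEW."location");
   END IF;
   RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER temp_person_import_trg
   AFTER INSERT ON temp_person
   FOR EACH ROW EXECUTE PROCEDURE temp_person_import();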

Some data just can't be cleaned up and I have to take a pragmatic
approach - for instance, non-existing countries: I see there is an ISO
standard for split countries such as Czechoslovakia
(http://en.wikipedia.org/wiki/ISO_3166-3). That gets 12 PERSONs off my
list. But what on earth do I do with "North America"? Create a new
country because some info is better than none? I can hardly make them
US citizens because I know how upset people get when mistaken for
their neighbours.

>I think this problem is very widespread, namely how to get structured 
>information out of freeform data. 

PERSON.location is peanuts compared to other data with 20x the volume,
where the really important information is encoded in a freeform string
with all sorts of abbreviations.

>That said, if you have a robust process to correct errors as the user 
>population discovers them, then you can approach perfection
>asymptotically.[...]

1 user - moi

> From an engineering standpoint, user feedback is a vital element of 
>homeostatic control.

I'll be having a lot of conversations with myself. ;-)


-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-25 Thread Tarlika Elisabeth Schmitz
On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent  wrote:

>
>
>On 05/24/2011 10:57 AM, Lew wrote:
>> Tarlika Elisabeth Schmitz wrote:
>>
>>> CREATE TABLE person
>>> (
>>> id integer NOT NULL,
>>> "name" character varying(256) NOT NULL,
>>> "location" character varying(256),
>>> CONSTRAINT person_pkey PRIMARY KEY (id)
>>> );
>>>
>>> this was just a TEMPORARY table I created for quick analysis
>>> of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);


>To minimize the ultimately quite necessary human adjudication, one
>might make good use of what is often termed "crowd sourcing":  Keep
>all the distinct "hand entered" values and a map to the final human
>assessment. 

I was wondering how to do just that. I don't think it would be a good
idea to hard code this into the clean-up script. Take, for instance,
variations of COUNTRY.NAME spelling. Where would I store these? 

I could do with a concept for this problem, which applies to a lot of
string-type info.

-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] extracting location info from string

2011-05-25 Thread Rob Sargent



On 05/25/2011 03:13 PM, Tarlika Elisabeth Schmitz wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent  wrote:
>
>> On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>> id integer NOT NULL,
>>>> "name" character varying(256) NOT NULL,
>>>> "location" character varying(256),
>>>> CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>    id character varying(3) NOT NULL, -- alpha-3 code
>    "name" character varying(50) NOT NULL,
>    CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>> To minimize the ultimately quite necessary human adjudication, one
>> might make good use of what is often termed "crowd sourcing":  Keep
>> all the distinct "hand entered" values and a map to the final human
>> assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.
I think you keep your current structures used for deducing the canonical
forms, but with each unique input you encounter, you add it to your
seen-thus-far list, which becomes just one more check (possibly the first
such check).


create table address_input
(
   id serial primary key,
   human_input character varying(256) not null unique,
   resolution character varying(256)
);

You may have to add a column for the type of input (if you know, for
instance, that the input is a street address vs. a country), or you may
want the resolution portioned into county, city and so on.
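
Usage would then be roughly as follows (a sketch assuming the
address_input table above; the lookup would run before the normal
clean-up rules):

-- check the seen-thus-far list first
SELECT resolution
FROM address_input
WHERE human_input = 'N. America';

-- not seen before: record it with no resolution yet,
-- so a human can fill in the mapping later
INSERT INTO address_input (human_input)
VALUES ('N. America');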



--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


[SQL] Re: [SQL] extracting location info from string

2011-05-25 Thread Charlie
Have you looked at 

http://en.m.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
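
PostgreSQL's contrib module fuzzystrmatch offers a levenshtein() function
(plain Levenshtein rather than Damerau-Levenshtein, but often good enough
for a first pass). A sketch against the country table quoted below:

-- levenshtein() comes from the contrib module fuzzystrmatch
SELECT c.id, c."name",
       levenshtein(lower(c."name"), lower('Grate Britun')) AS dist
FROM country c
ORDER BY dist
LIMIT 3;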




- Reply message -
From: "Tarlika Elisabeth Schmitz" 
Date: Wed, May 25, 2011 6:13 pm
Subject: [SQL] extracting location info from string
To: 

On Wed, 25 May 2011 09:25:48 -0600
Rob Sargent  wrote:

>
>
>On 05/24/2011 10:57 AM, Lew wrote:
>> Tarlika Elisabeth Schmitz wrote:
>>
>>> CREATE TABLE person
>>> (
>>> id integer NOT NULL,
>>> "name" character varying(256) NOT NULL,
>>> "location" character varying(256),
>>> CONSTRAINT person_pkey PRIMARY KEY (id)
>>> );
>>>
>>> this was just a TEMPORARY table I created for quick analysis
>>> of my CSV data (now renamed to temp_person).

CREATE TABLE country
(
  id character varying(3) NOT NULL, -- alpha-3 code
  "name" character varying(50) NOT NULL,
  CONSTRAINT country_pkey PRIMARY KEY (id)
);


>To minimize the ultimately quite necessary human adjudication, one
>might make good use of what is often termed "crowd sourcing":  Keep
>all the distinct "hand entered" values and a map to the final human
>assessment. 

I was wondering how to do just that. I don't think it would be a good
idea to hard code this into the clean-up script. Take, for instance,
variations of COUNTRY.NAME spelling. Where would I store these? 

I could do with a concept for this problem, which applies to a lot of
string-type info.

-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql



Re: [SQL] extracting location info from string

2011-05-25 Thread Andrej
On 26 May 2011 09:13, Tarlika Elisabeth Schmitz
 wrote:
> On Wed, 25 May 2011 09:25:48 -0600
> Rob Sargent  wrote:
>
>>
>>
>>On 05/24/2011 10:57 AM, Lew wrote:
>>> Tarlika Elisabeth Schmitz wrote:
>>>
>>>> CREATE TABLE person
>>>> (
>>>> id integer NOT NULL,
>>>> "name" character varying(256) NOT NULL,
>>>> "location" character varying(256),
>>>> CONSTRAINT person_pkey PRIMARY KEY (id)
>>>> );
>>>>
>>>> this was just a TEMPORARY table I created for quick analysis
>>>> of my CSV data (now renamed to temp_person).
>
> CREATE TABLE country
> (
>  id character varying(3) NOT NULL, -- alpha-3 code
>  "name" character varying(50) NOT NULL,
>  CONSTRAINT country_pkey PRIMARY KEY (id)
> );
>
>
>>To minimize the ultimately quite necessary human adjudication, one
>>might make good use of what is often termed "crowd sourcing":  Keep
>>all the distinct "hand entered" values and a map to the final human
>>assessment.
>
> I was wondering how to do just that. I don't think it would be a good
> idea to hard code this into the clean-up script. Take, for instance,
> variations of COUNTRY.NAME spelling. Where would I store these?
>
> I could do with a concept for this problem, which applies to a lot of
> string-type info.

I'd start w/ downloading a list as mentioned here:
http://answers.google.com/answers/threadview?id=596822

And run it through a wee perl script using
http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm
to make phonetic matches ...

Then I'd run your own data through DoubleMetaphone, and clean up
matches if not too many false positives show up.
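
The same idea can also be sketched inside Postgres itself with the contrib
module fuzzystrmatch, which exposes a dmetaphone() function (assuming the
country table quoted above):

-- dmetaphone() comes from the contrib module fuzzystrmatch
SELECT c.id, c."name"
FROM country c
WHERE dmetaphone(c."name") = dmetaphone('Swizzerland');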



Cheers,
Andrej

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] enum data type vs table

2011-05-25 Thread Peter Koczan
On Tue, May 17, 2011 at 11:23 PM, Seb  wrote:
> Are there any guidelines for deciding whether to 1) create an enum data
> type or 2) create a table with the set of values and then have foreign
> keys referencing this table?  Some fields in a database take a small
> number of values, and I'm not sure which of these routes to take.  The
> enum data type seems like a clean way to handle this without creating a
> constellation of tables for all these values, but if one wants to add a
> new label to the enum or make changes to it at some point, then the
> tables using it have to be recreated, so it's quite rigid.  Have I got
> this right?  Thanks.

I think your choice depends on a few things:

1 - How do you want to interact with the tables? What I mean is, are
you planning on querying, inserting, or updating data to those tables
via text or will you need to join to your reference table? If you
don't want to join, you'll either need to use enum types, use views
(which can be a pain if you want to update a view), or
duplicate/reference the text directly (which is slow and a bad idea
for several reasons).

2 - How much can you tolerate downtime or a busy database? Changing
types is a single transaction and requires an exclusive lock. On small
tables this is negligible, but on big tables it can require downtime.

3 - How often do you really expect changes to the enum type? If adding
a new value to an enum type is truly a rare event, an enum is probably
fine. If it's frequent or regular, you should probably have a table.
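
For instance, a minimal sketch of the two approaches (toy names
throughout):

-- enum approach: values baked into the type
CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy');
CREATE TABLE diary_enum (entry text, mood mood);
-- 9.1 will allow: ALTER TYPE mood ADD VALUE 'ecstatic';

-- reference-table approach: values are plain rows
CREATE TABLE mood_ref (mood text PRIMARY KEY);
INSERT INTO mood_ref VALUES ('sad'), ('ok'), ('happy');
CREATE TABLE diary_fk (entry text,
                       mood text REFERENCES mood_ref);
INSERT INTO mood_ref VALUES ('ecstatic');  -- adding a value is just an INSERT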

I've used both of these approaches and I've found enum types to be
well worth any trouble to drop/recreate types. The changes I've made
have been rare, and I've been able to schedule downtime pretty easily,
so it made the most sense for me.

Also, Postgres 9.1 allows adding values to enum types, so you could
always use that when it is finally released.

Hope this helps,

Cheers,
Peter

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Sorting Issue

2011-05-25 Thread Ozer, Pam
Since no one has responded, does that mean there is no such list anywhere?
Or does anyone know of a collation that allows case-insensitive sorting
without ignoring spaces?
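
One possible workaround (an illustrative sketch, not something confirmed
in this thread): in a database initdb'ed with the C locale, ordering by
lower(column) sorts case-insensitively while still treating spaces as
significant:

SELECT v.name
FROM (VALUES ('De Soto'), ('deluxe'), ('DeSoto')) AS v(name)
ORDER BY lower(v.name);
-- in a C-locale database: De Soto, deluxe, DeSoto
-- (the space sorts before letters; case is neutralized by lower())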

-Original Message-
From: pgsql-sql-ow...@postgresql.org
[mailto:pgsql-sql-ow...@postgresql.org] On Behalf Of Ozer, Pam
Sent: Wednesday, May 18, 2011 3:22 PM
To: Tom Lane
Cc: Samuel Gendler; em...@encs.concordia.ca; pgsql-sql@postgresql.org
Subject: Re: [SQL] Sorting Issue 

Is there anywhere that gives you all the available collations and their
definitions?  I found that with the C collation it now sorts the spaces
correctly, but it is also case-sensitive, which messes with some of our
other sorts.

-Original Message-
From: Tom Lane [mailto:t...@sss.pgh.pa.us] 
Sent: Tuesday, May 10, 2011 9:47 AM
To: Ozer, Pam
Cc: Samuel Gendler; em...@encs.concordia.ca; pgsql-sql@postgresql.org
Subject: Re: [SQL] Sorting Issue 

"Ozer, Pam"  writes:
> Isn't this the English standard for collation?  Or is this a non-c
> locale as mentioned below?  Is there anyway around this?  

>LC_COLLATE = 'en_US.utf8'

en_US is probably using somebody's idea of "dictionary order", which
I believe includes ignoring spaces in the first pass.  You might be
happier using "C" collation.  Unfortunately that requires re-initdb'ing
your database (as of existing PG releases).

regards, tom lane

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] enum data type vs table

2011-05-25 Thread Seb
On Wed, 25 May 2011 17:23:26 -0500,
Peter Koczan  wrote:

> On Tue, May 17, 2011 at 11:23 PM, Seb  wrote:
>> Are there any guidelines for deciding whether to 1) create an enum
>> data type or 2) create a table with the set of values and then have
>> foreign keys referencing this table?  Some fields in a database take
>> a small number of values, and I'm not sure which of these routes to
>> take.  The enum data type seems like a clean way to handle this
>> without creating a constellation of tables for all these values, but
>> if one wants to add a new label to the enum or make changes to it at
>> some point, then the tables using it have to be recreated, so it's
>> quite rigid.  Have I got this right?  Thanks.

> I think your choice depends on a few things:

> 1 - How do you want to interact with the tables? What I mean is, are
> you planning on querying, inserting, or updating data to those tables
> via text or will you need to join to your reference table? If you
> don't want to join, you'll either need to use enum types, use views
> (which can be a pain if you want to update a view), or
> duplicate/reference the text directly (which is slow and a bad idea
> for several reasons).

> 2 - How much can you tolerate downtime or a busy database? Changing
> types is a single transaction and requires an exclusive lock. On small
> tables this is negligible, but on big tables it can require downtime.

> 3 - How often do you really expect changes to the enum type? If adding
> a new value to an enum type is truly a rare event, an enum is probably
> fine. If it's frequent or regular, you should probably have a table.

> I've used both of these approaches and I've found enum types to be
> well worth any trouble to drop/recreate types. The changes I've made
> have been rare, and I've been able to schedule downtime pretty easily,
> so it made the most sense for me.

> Also, Postgres 9.1 allows adding values to enum types, so you could
> always use that when it is finally released.

These are great guidelines, thanks.


-- 
Seb


-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql