[SQL] Regexp matching

2010-09-29 Thread Eduardas Kazakas
Hello, I have some problems using character class matching (e.g. [:alpha:]).

For example I have a table:

CREATE TABLE re_test (text_column character varying (50) NOT NULL);

Notice, that there are some specific characters.

INSERT INTO re_test VALUES ('AŠDF');
INSERT INTO re_test VALUES ('AŠDF45');
INSERT INTO re_test VALUES ('AŠDF FDŠA');
INSERT INTO re_test VALUES ('ASDF FDŠA');
INSERT INTO re_test VALUES ('58ASDF FDŠA');
INSERT INTO re_test VALUES ('ašDf');
INSERT INTO re_test VALUES ('aŠdf');

SELECT * FROM re_test WHERE text_column ~ '[^[:alpha:]]' and text_column ~
[:upper:];

Goal:
I want to write such statement which returns me only those records which
have only one word and those words must be uppercase.
So I expect this statement to return only one record where text_column =
AŠDF.

Maybe someone could give me more detail explanation how to use those regexp
classes, because the documentation tells very little about this.

Some more information:

PostgreSQL9

OS - Windows x86-32
DB encoding - UTF-8
lc_collate - English_United States.1252
lc_ctype - English_United States.1252
lc_messages - English_United States.1252
lc_monetary - English_United States.1252
lc_numeric - English_United States.1252
lc_time - English_United States.1252


Re: [SQL] identifying duplicates in table with redundancies

2010-09-29 Thread Andreas Schmitz

 On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:

On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina"  wrote:


- Original Message -
From: "Tarlika Elisabeth Schmitz"
To:
Sent: Monday, September 27, 2010 5:54 PM
Subject: Re: [SQL] identifying duplicates in table with redundancies



On Fri, 24 Sep 2010 18:12:18 +0100
Oliver d'Azevedo Christina  wrote:


SELECT DISTINCT trainer_id,trainer_name
FROM (
SELECT trainer_name   -- The field you want to test for duplicates
FROM (
SELECT DISTINCT "trainer_id","trainer_name"
FROM student
) x
GROUP BY "trainer_name"-- the field you want to test for
duplicates
HAVING (COUNT(*)>  1)
) z
NATURAL JOIN student y



What indices would you recommend for this operation?

But, on this query in particular I would recomend an indice on
trainer_name, as this field will be used on the join and on the group
by. For the other query, the one you get by substituting trainer_name
by trainer_id, place an index on trainer_id.
Also, these indexes may help speed up the order by clause, if you use
one.

If you have a table with lots of data you can try them around and see
how performance varies (and don't forget there's also EXPLAIN ANALYZE)


Strangely, these indices did not do anything.
Without, the query took about 8500ms. Same with index.

The table has 25 records. 11000 have trainer_name = null. Only
13000 unique trainer_names.

It is not hugely important as these queries are not time-critical.
This is only a helper table, which I use to analyze the date prior to
populating the destination tables with the data.

Regards,
Tarlika



I guess explain analyze shows up a seq scan. try avoiding to use 
distinct. use group by instead.


regards

Andreas

--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] identifying duplicates in table with redundancies

2010-09-29 Thread Tarlika Elisabeth Schmitz
On Wed, 29 Sep 2010 10:40:03 +0200
Andreas Schmitz  wrote:

>  On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
>> On Tue, 28 Sep 2010 11:34:31 +0100
>> "Oliveiros d'Azevedo Cristina"
>> wrote:
>>
>>> - Original Message -
>>> From: "Tarlika Elisabeth Schmitz"
>>> To:
>>> Sent: Monday, September 27, 2010 5:54 PM
>>> Subject: Re: [SQL] identifying duplicates in table with redundancies
>>>
>>>
 On Fri, 24 Sep 2010 18:12:18 +0100
 Oliver d'Azevedo Christina  wrote:

>>> SELECT DISTINCT trainer_id,trainer_name
>>> FROM (
>>> SELECT trainer_name   -- The field you want to test for
>>> duplicates FROM (
>>> SELECT DISTINCT "trainer_id","trainer_name"
>>> FROM student
>>> ) x
>>> GROUP BY "trainer_name"-- the field you want to test for
>>> duplicates
>>> HAVING (COUNT(*)>  1)
>>> ) z
>>> NATURAL JOIN student y


 What indices would you recommend for this operation?
>>> But, on this query in particular I would recomend an indice on
>>> trainer_name,[...]
>> Strangely, these indices did not do anything.
>> Without, the query took about 8500ms. Same with index.
>>
>> The table has 25 records. 11000 have trainer_name = null. Only
>> 13000 unique trainer_names.
>
>I guess explain analyze shows up a seq scan. try avoiding to use 
>distinct. use group by instead.
>
>regards, Andreas

Hallo Andreas,
I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan. Execution time 7500ms.

2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan @ 6750ms

4) I filtered out NULL trainer_ids
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which only took
1300ms!!

Please, explain (pun not intended)! How can this be. Only 11000/25
rows have a null trainer_id.


Thanks for the GROUP BY tip!


-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] identifying duplicates in table with redundancies

2010-09-29 Thread Oliveiros d'Azevedo Cristina




Hallo Andreas,
I reduced the problem to the innermost query:

1) SELECT DISTINCT trainer_id, trainer_name FROM student
This results in a sequential table scan. Execution time 7500ms.

2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index
scan instead, which still cost 7000ms.

3) Next, I changed from DISTINCT to GROUP BY:
SELECT trainer_id, trainer_name FROM student
GROUP BY trainer_id, trainer_name
This resulted in an index scan @ 6750ms

4) I filtered out NULL trainer_ids
WHERE trainer_id IS NOT NULL
Amazingly, this resulted in a sequential table scan, which only took
1300ms!!

Please, explain (pun not intended)! How can this be. Only 11000/25
rows have a null trainer_id.



That's an impressive improvement...
Personally I have no idea what caused it, specially when you say it was 
sequential :-|

Warmed caches ?

Best,
Oliver 



--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql


Re: [SQL] Regexp matching

2010-09-29 Thread Osvaldo Kussama
2010/9/28 Eduardas Kazakas :
> Hello, I have some problems using character class matching (e.g. [:alpha:]).
>
> For example I have a table:
>
> CREATE TABLE re_test (text_column character varying (50) NOT NULL);
>
> Notice, that there are some specific characters.
>
> INSERT INTO re_test VALUES ('AŠDF');
> INSERT INTO re_test VALUES ('AŠDF45');
> INSERT INTO re_test VALUES ('AŠDF FDŠA');
> INSERT INTO re_test VALUES ('ASDF FDŠA');
> INSERT INTO re_test VALUES ('58ASDF FDŠA');
> INSERT INTO re_test VALUES ('ašDf');
> INSERT INTO re_test VALUES ('aŠdf');
>
> SELECT * FROM re_test WHERE text_column ~ '[^[:alpha:]]' and text_column ~
> [:upper:];
>
> Goal:
> I want to write such statement which returns me only those records which
> have only one word and those words must be uppercase.
> So I expect this statement to return only one record where text_column =
> AŠDF.
>
> Maybe someone could give me more detail explanation how to use those regexp
> classes, because the documentation tells very little about this.
>
> Some more information:
>
> PostgreSQL9
>
> OS - Windows x86-32
> DB encoding - UTF-8
> lc_collate - English_United States.1252
> lc_ctype - English_United States.1252
> lc_messages - English_United States.1252
> lc_monetary - English_United States.1252
> lc_numeric - English_United States.1252
> lc_time - English_United States.1252


I believe that "Š" isn't an alphabetical character in English_United
States (LC_CTYPE).
http://www.postgresql.org/docs/current/interactive/locale.html

Osvaldo

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql