[SQL] Regexp matching
Hello, I have some problems using character class matching (e.g. [:alpha:]). For example I have a table: CREATE TABLE re_test (text_column character varying (50) NOT NULL); Notice, that there are some specific characters. INSERT INTO re_test VALUES ('AŠDF'); INSERT INTO re_test VALUES ('AŠDF45'); INSERT INTO re_test VALUES ('AŠDF FDŠA'); INSERT INTO re_test VALUES ('ASDF FDŠA'); INSERT INTO re_test VALUES ('58ASDF FDŠA'); INSERT INTO re_test VALUES ('ašDf'); INSERT INTO re_test VALUES ('aŠdf'); SELECT * FROM re_test WHERE text_column ~ '[^[:alpha:]]' and text_column ~ [:upper:]; Goal: I want to write such statement which returns me only those records which have only one word and those words must be uppercase. So I expect this statement to return only one record where text_column = AŠDF. Maybe someone could give me more detail explanation how to use those regexp classes, because the documentation tells very little about this. Some more information: PostgreSQL9 OS - Windows x86-32 DB encoding - UTF-8 lc_collate - English_United States.1252 lc_ctype - English_United States.1252 lc_messages - English_United States.1252 lc_monetary - English_United States.1252 lc_numeric - English_United States.1252 lc_time - English_United States.1252
Re: [SQL] identifying duplicates in table with redundancies
On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote: On Tue, 28 Sep 2010 11:34:31 +0100 "Oliveiros d'Azevedo Cristina" wrote: - Original Message - From: "Tarlika Elisabeth Schmitz" To: Sent: Monday, September 27, 2010 5:54 PM Subject: Re: [SQL] identifying duplicates in table with redundancies On Fri, 24 Sep 2010 18:12:18 +0100 Oliver d'Azevedo Christina wrote: SELECT DISTINCT trainer_id,trainer_name FROM ( SELECT trainer_name -- The field you want to test for duplicates FROM ( SELECT DISTINCT "trainer_id","trainer_name" FROM student ) x GROUP BY "trainer_name"-- the field you want to test for duplicates HAVING (COUNT(*)> 1) ) z NATURAL JOIN student y What indices would you recommend for this operation? But, on this query in particular I would recomend an indice on trainer_name, as this field will be used on the join and on the group by. For the other query, the one you get by substituting trainer_name by trainer_id, place an index on trainer_id. Also, these indexes may help speed up the order by clause, if you use one. If you have a table with lots of data you can try them around and see how performance varies (and don't forget there's also EXPLAIN ANALYZE) Strangely, these indices did not do anything. Without, the query took about 8500ms. Same with index. The table has 25 records. 11000 have trainer_name = null. Only 13000 unique trainer_names. It is not hugely important as these queries are not time-critical. This is only a helper table, which I use to analyze the date prior to populating the destination tables with the data. Regards, Tarlika I guess explain analyze shows up a seq scan. try avoiding to use distinct. use group by instead. regards Andreas -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] identifying duplicates in table with redundancies
On Wed, 29 Sep 2010 10:40:03 +0200 Andreas Schmitz wrote: > On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote: >> On Tue, 28 Sep 2010 11:34:31 +0100 >> "Oliveiros d'Azevedo Cristina" >> wrote: >> >>> - Original Message - >>> From: "Tarlika Elisabeth Schmitz" >>> To: >>> Sent: Monday, September 27, 2010 5:54 PM >>> Subject: Re: [SQL] identifying duplicates in table with redundancies >>> >>> On Fri, 24 Sep 2010 18:12:18 +0100 Oliver d'Azevedo Christina wrote: >>> SELECT DISTINCT trainer_id,trainer_name >>> FROM ( >>> SELECT trainer_name -- The field you want to test for >>> duplicates FROM ( >>> SELECT DISTINCT "trainer_id","trainer_name" >>> FROM student >>> ) x >>> GROUP BY "trainer_name"-- the field you want to test for >>> duplicates >>> HAVING (COUNT(*)> 1) >>> ) z >>> NATURAL JOIN student y What indices would you recommend for this operation? >>> But, on this query in particular I would recomend an indice on >>> trainer_name,[...] >> Strangely, these indices did not do anything. >> Without, the query took about 8500ms. Same with index. >> >> The table has 25 records. 11000 have trainer_name = null. Only >> 13000 unique trainer_names. > >I guess explain analyze shows up a seq scan. try avoiding to use >distinct. use group by instead. > >regards, Andreas Hallo Andreas, I reduced the problem to the innermost query: 1) SELECT DISTINCT trainer_id, trainer_name FROM student This results in a sequential table scan. Execution time 7500ms. 2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index scan instead, which still cost 7000ms. 3) Next, I changed from DISTINCT to GROUP BY: SELECT trainer_id, trainer_name FROM student GROUP BY trainer_id, trainer_name This resulted in an index scan @ 6750ms 4) I filtered out NULL trainer_ids WHERE trainer_id IS NOT NULL Amazingly, this resulted in a sequential table scan, which only took 1300ms!! Please, explain (pun not intended)! How can this be. Only 11000/25 rows have a null trainer_id. Thanks for the GROUP BY tip! -- Best Regards, Tarlika Elisabeth Schmitz -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] identifying duplicates in table with redundancies
Hallo Andreas, I reduced the problem to the innermost query: 1) SELECT DISTINCT trainer_id, trainer_name FROM student This results in a sequential table scan. Execution time 7500ms. 2) I created an INDEX ON (trainer_id, trainer_name). Then I had an index scan instead, which still cost 7000ms. 3) Next, I changed from DISTINCT to GROUP BY: SELECT trainer_id, trainer_name FROM student GROUP BY trainer_id, trainer_name This resulted in an index scan @ 6750ms 4) I filtered out NULL trainer_ids WHERE trainer_id IS NOT NULL Amazingly, this resulted in a sequential table scan, which only took 1300ms!! Please, explain (pun not intended)! How can this be. Only 11000/25 rows have a null trainer_id. That's an impressive improvement... Personally I have no idea what caused it, specially when you say it was sequential :-| Warmed caches ? Best, Oliver -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Regexp matching
2010/9/28 Eduardas Kazakas : > Hello, I have some problems using character class matching (e.g. [:alpha:]). > > For example I have a table: > > CREATE TABLE re_test (text_column character varying (50) NOT NULL); > > Notice, that there are some specific characters. > > INSERT INTO re_test VALUES ('AŠDF'); > INSERT INTO re_test VALUES ('AŠDF45'); > INSERT INTO re_test VALUES ('AŠDF FDŠA'); > INSERT INTO re_test VALUES ('ASDF FDŠA'); > INSERT INTO re_test VALUES ('58ASDF FDŠA'); > INSERT INTO re_test VALUES ('ašDf'); > INSERT INTO re_test VALUES ('aŠdf'); > > SELECT * FROM re_test WHERE text_column ~ '[^[:alpha:]]' and text_column ~ > [:upper:]; > > Goal: > I want to write such statement which returns me only those records which > have only one word and those words must be uppercase. > So I expect this statement to return only one record where text_column = > AŠDF. > > Maybe someone could give me more detail explanation how to use those regexp > classes, because the documentation tells very little about this. > > Some more information: > > PostgreSQL9 > > OS - Windows x86-32 > DB encoding - UTF-8 > lc_collate - English_United States.1252 > lc_ctype - English_United States.1252 > lc_messages - English_United States.1252 > lc_monetary - English_United States.1252 > lc_numeric - English_United States.1252 > lc_time - English_United States.1252 I believe that "Š" isn't an alphabetical character in English_United States (LC_CTYPE). http://www.postgresql.org/docs/current/interactive/locale.html Osvaldo -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql