Re: [GENERAL] Optimizing query?

2013-02-03 Thread Jasen Betts
On 2013-01-31, haman...@t-online.de haman...@t-online.de wrote:

 Pavel Stehlule wrote:

  Hi,
 
  I am trying to match items from 2 tables based on a common string.
  One is a big table which has one column with entries like XY123, ABC44, etc
  The table has an index on that column.
  The second table is, typically, much smaller
 
  select  from tab1, tab2 where tab1.code = tab2.code;
 
  This works fine and fast.
  Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
  big table and want them to match XY423, GF55 in the second table
 
  Variants I have tried
 
  select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
  select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');
 
 
  Have you tried the substring function?
 
  select   from tab1, tab2 where substring(tab1.code from 1 for 5) =
  tab2.code
 

 Hi Pavel, it was just by chance that a fixed size substring would match the
 data at hand. It is more common to have a digit/letter (or vice versa) boundary
 or a hyphen there

 
  both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
  share initial AX) I get back to manageable times by adding
    and tab1.code ~ '^AX'
  into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
  is about 40 seconds.
 
 any join whose result depends on a function result can be very
 slow, because the estimates will be off and any repeated function
 evaluation is expensive.

 I see the problem: the ~ operator with a non-constant pattern
 obviously has to recompile the pattern for every row.

 I wonder whether it would be possible to invent a prefix-match operator that
 approaches the performance of string equality. I noted in the past (not sure
 whether anything has changed in regex matching) that a constant leading part
 of a regex would improve performance, i.e. allow an index scan to select
 possible candidates.
   

you could write a set returning function that opens cursors on both tables using
ORDER BY code and merges the results 
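
A minimal sketch of that idea - assuming the tab1/tab2 names from the
thread, plain text code columns sorted in byte order, and that no tab2
code is a prefix of another tab2 code:

create or replace function codes_prefix_merge()
returns table (long_code text, short_code text)
language plpgsql as $$
declare
    big   cursor for select code from tab1 order by code;
    small cursor for select code from tab2 order by code;
    bcode text;
    scode text;
begin
    open big;
    open small;
    fetch big into bcode;
    fetch small into scode;
    while bcode is not null and scode is not null loop
        if bcode like scode || '%' then
            long_code := bcode;
            short_code := scode;
            return next;
            fetch big into bcode;   -- one short code can match many long codes
        elsif bcode < scode then
            fetch big into bcode;   -- big side is behind: advance it
        else
            fetch small into scode; -- small side is behind: advance it
        end if;
    end loop;
    close big;
    close small;
end;
$$;

Each table is then scanned only once, in code order, instead of running
a regex per row pair.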

-- 
⚂⚃ 100% natural





Re: [GENERAL] Optimizing query?

2013-02-02 Thread hamann . w
Pavel Stehule wrote:

 
 but maybe  https://github.com/dimitri/prefix can help
 

Hi Pavel,

thanks - this works perfectly. However, it does not seem to play well
with the optimizer, so I ended up with

select all candidates into a temp table using prefix operator
apply all other conditions by joining that temp table to original ones
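
In other words, something like this sketch (table and column names
assumed from the thread; the cast and the @> prefix-matching operator
are as documented by the prefix extension):

create temp table candidates as
select t1.code as long_code, t2.code as short_code
from tab1 t1
join tab2 t2 on t2.code::prefix_range @> t1.code;

select c.long_code, c.short_code
from candidates c
join tab1 t1 on t1.code = c.long_code;
-- ...plus whatever other conditions the real query applies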

Regards
Wolfgang





Re: [GENERAL] Optimizing query?

2013-02-02 Thread Pavel Stehule
2013/2/2  haman...@t-online.de:
 Pavel Stehule wrote:


 but maybe  https://github.com/dimitri/prefix can help


 Hi Pavel,

 thanks - this works perfectly. However, it does not seem to play well
 with the optimizer, so I ended up with

 select all candidates into a temp table using prefix operator
 apply all other conditions by joining that temp table to original ones


you can send enhancement proposals to Dimitri - he will be happy :)

Regards

Pavel

 Regards
 Wolfgang







Re: [GENERAL] Optimizing query?

2013-02-02 Thread Pavel Stehule
2013/2/2 Pavel Stehule pavel.steh...@gmail.com:
 2013/2/2  haman...@t-online.de:
 Pavel Stehule wrote:


 but maybe  https://github.com/dimitri/prefix can help


 Hi Pavel,

 thanks - this works perfectly. However, it does not seem to play well
 with the optimizer, so I ended up with

 select all candidates into a temp table using prefix operator
 apply all other conditions by joining that temp table to original ones


seriously - it is a typical solution - and it is great that PostgreSQL
helps you :)

Regards

Pavel


 you can send enhancement proposals to Dimitri - he will be happy :)

 Regards

 Pavel

 Regards
 Wolfgang







Re: [GENERAL] Optimizing query?

2013-01-31 Thread Pavel Stehule
2013/1/31 Vincent Veyron vv.li...@wanadoo.fr:
 On Wednesday, 30 January 2013, at 11:08, wolfg...@noten5.maas-noten.de
 wrote:
 Hi,

 I am trying to match items from 2 tables based on a common string.
 One is a big table which has one column with entries like XY123, ABC44, etc
 The table has an index on that column.
 The second table is, typically, much smaller

 select  from tab1, tab2 where tab1.code = tab2.code;

 This works fine and fast.
 Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in 
 the
 big table and want them to match XY423, GF55 in the second table

 Variants I have tried

 select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
 select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');


 Have you tried the substring function?

 select   from tab1, tab2 where substring(tab1.code from 1 for 5) =
 tab2.code


 both take an enormous time. In the better case that I can subset (e.g. all 
 candidates in table 2
 share initial AX) I get back to manageable times by adding
   and tab1.code ~ '^AX'
 into the recipe. Actual runtime with about a million entries in tab1 and 800 
 entries in tab2
 is about 40 seconds.

any join whose result depends on a function result can be very
slow, because the estimates will be off and any repeated function
evaluation is expensive.

You can try to use a functional index.

create index on tab2 ((substring(tab1.code from 1 for 5))
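
(Presumably what was meant is an expression index on tab1's own column
- an index cannot reference a second table - so roughly:

create index on tab1 ((substring(code from 1 for 5)));

which the join condition substring(tab1.code from 1 for 5) = tab2.code
could then use.)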

Regards

Pavel Stehule


 Regards
 Wolfgang Hamann







 --
 Salutations, Vincent Veyron
 http://marica.fr/site/demonstration
 Logiciel de gestion des contentieux juridiques et des sinistres d'assurance







Re: [GENERAL] Optimizing query?

2013-01-31 Thread Vincent Veyron
On Thursday, 31 January 2013, at 09:49 +0100, Pavel Stehule wrote:

 any join whose result depends on a function result can be very
 slow, because the estimates will be off and any repeated function
 evaluation is expensive.
 

Hi Pavel,

Thank you for the correction. While we're at it, I have a question
regarding functions in a query:

Suppose I have a query of the form 

SELECT my_function(column_1), column_2 
FROM my_table 
GROUP BY my_function(column_1)
ORDER BY my_function(column_1);

where my_function is a user defined function.
 
How many times is the function computed?




-- 
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance





Re: [GENERAL] Optimizing query?

2013-01-31 Thread Pavel Stehule
2013/1/31 Vincent Veyron vv.li...@wanadoo.fr:
 On Thursday, 31 January 2013, at 09:49 +0100, Pavel Stehule wrote:

 any join whose result depends on a function result can be very
 slow, because the estimates will be off and any repeated function
 evaluation is expensive.


 Hi Pavel,

 Thank you for the correction. While we're at it, I have a question
 regarding functions in a query:

 Suppose I have a query of the form

 SELECT my_function(column_1), column_2
 FROM my_table
 GROUP BY my_function(column_1)
 ORDER BY my_function(column_1);

 where my_function is a user defined function.

 How many times is the function computed?

if the function is stable or immutable, then once per row
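
For illustration, a minimal sketch with a hypothetical function and an
explicit volatility label:

create function my_function(integer) returns integer
language sql stable
as $$ select $1 * 2 $$;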

Pavel





 --
 Salutations, Vincent Veyron
 http://marica.fr/site/demonstration
 Logiciel de gestion des contentieux juridiques et des sinistres d'assurance





Re: [GENERAL] Optimizing query?

2013-01-31 Thread Vincent Veyron
On Thursday, 31 January 2013, at 11:06 +0100, Pavel Stehule wrote:
 2013/1/31 Vincent Veyron vv.li...@wanadoo.fr:
 
  Suppose I have a query of the form
 
  SELECT my_function(column_1), column_2
  FROM my_table
  GROUP BY my_function(column_1)
  ORDER BY my_function(column_1);
 
  where my_function is a user defined function.
 
  How many times is the function computed?
 
 if the function is stable or immutable, then once per row
 

In this post (watch for line-wrap) :

http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH
+qszqkht9ynoaoigkfx4cnc9mzutimc...@mail.gmail.com

you wrote that it is usually better not to mark SQL functions (as
opposed to plpgsql functions).

So should I mark SQL functions stable/immutable if I use them in a query
like the one above, or is it unnecessary?


-- 
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance





Re: [GENERAL] Optimizing query?

2013-01-31 Thread Pavel Stehule
2013/1/31 Vincent Veyron vv.li...@wanadoo.fr:
 On Thursday, 31 January 2013, at 11:06 +0100, Pavel Stehule wrote:
 2013/1/31 Vincent Veyron vv.li...@wanadoo.fr:
 
  Suppose I have a query of the form
 
  SELECT my_function(column_1), column_2
  FROM my_table
  GROUP BY my_function(column_1)
  ORDER BY my_function(column_1);
 
  where my_function is a user defined function.
 
  How many times is the function computed?

 if the function is stable or immutable, then once per row


 In this post (watch for line-wrap) :

 http://www.postgresql.org/message-id/CAFj8pRAdYL1-hCxH
 +qszqkht9ynoaoigkfx4cnc9mzutimc...@mail.gmail.com

 you wrote that it is usually better not to mark SQL functions (as
 opposed to plpgsql functions).

 So should I mark SQL functions stable/immutable if I use them in a query
 like the one above, or is it unnecessary?


It should not be marked

Regards

Pavel


 --
 Salutations, Vincent Veyron
 http://marica.fr/site/demonstration
 Logiciel de gestion des contentieux juridiques et des sinistres d'assurance





Re: [GENERAL] Optimizing query?

2013-01-31 Thread hamann . w

Pavel Stehlule wrote:

  Hi,
 
  I am trying to match items from 2 tables based on a common string.
  One is a big table which has one column with entries like XY123, ABC44, etc
  The table has an index on that column.
  The second table is, typically, much smaller
 
  select  from tab1, tab2 where tab1.code = tab2.code;
 
  This works fine and fast.
  Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
  big table and want them to match XY423, GF55 in the second table
 
  Variants I have tried
 
  select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
  select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');
 
 
  Have you tried the substring function?
 
  select   from tab1, tab2 where substring(tab1.code from 1 for 5) =
  tab2.code
 

Hi Pavel, it was just by chance that a fixed size substring would match the
data at hand. It is more common to have a digit/letter (or vice versa) boundary
or a hyphen there

 
  both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
  share initial AX) I get back to manageable times by adding
    and tab1.code ~ '^AX'
  into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
  is about 40 seconds.
 
 any join whose result depends on a function result can be very
 slow, because the estimates will be off and any repeated function
 evaluation is expensive.

I see the problem: the ~ operator with a non-constant pattern
obviously has to recompile the pattern for every row.

I wonder whether it would be possible to invent a prefix-match operator that
approaches the performance of string equality. I noted in the past (not sure
whether anything has changed in regex matching) that a constant leading part
of a regex would improve performance, i.e. allow an index scan to select
possible candidates.
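
One manual way to get close to that today is to rewrite the prefix
match as a btree range scan - a sketch that assumes the codes are plain
ASCII letters and digits (so 'zzzz' is a safe upper fence), ideally
under a C collation:

select t1.code, t2.code
from tab2 t2
join tab1 t1 on t1.code >= t2.code
            and t1.code <  t2.code || 'zzzz'
where t1.code ~ ('^' || t2.code || '($|[A-Z])');  -- recheck the exact rule

The range condition lets the planner run an ordinary index scan on
tab1.code for each tab2 row; the regex only rechecks the few candidates.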
  
 You can try to use a functional index.
 
 create index on tab2 ((substring(tab1.code from 1 for 5))
 

What kind of trick is that - mixing two tables into a functional index?
What would the exact syntax be for that?

Regards
Wolfgang Hamann








Re: [GENERAL] Optimizing query?

2013-01-31 Thread Pavel Stehule
Hello

2013/1/31  haman...@t-online.de:

 Pavel Stehlule wrote:

  Hi,
 
  I am trying to match items from 2 tables based on a common string.
  One is a big table which has one column with entries like XY123, ABC44, etc
  The table has an index on that column.
  The second table is, typically, much smaller
 
  select  from tab1, tab2 where tab1.code = tab2.code;
 
  This works fine and fast.
  Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
  big table and want them to match XY423, GF55 in the second table
 
  Variants I have tried
 
  select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
  select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');
 
 
  Have you tried the substring function?
 
  select   from tab1, tab2 where substring(tab1.code from 1 for 5) =
  tab2.code
 

 Hi Pavel, it was just by chance that a fixed size substring would match the
 data at hand. It is more common to have a digit/letter (or vice versa) boundary
 or a hyphen there

 
  both take an enormous time. In the better case that I can subset (e.g. all candidates in table 2
  share initial AX) I get back to manageable times by adding
    and tab1.code ~ '^AX'
  into the recipe. Actual runtime with about a million entries in tab1 and 800 entries in tab2
  is about 40 seconds.

 any join whose result depends on a function result can be very
 slow, because the estimates will be off and any repeated function
 evaluation is expensive.

 I see the problem: the ~ operator with a non-constant pattern
 obviously has to recompile the pattern for every row.

 I wonder whether it would be possible to invent a prefix-match operator that
 approaches the performance of string equality. I noted in the past (not sure
 whether anything has changed in regex matching) that a constant leading part
 of a regex would improve performance, i.e. allow an index scan to select
 possible candidates.

 You can try to use a functional index.

 create index on tab2 ((substring(tab1.code from 1 for 5))


 What kind of trick is that - mixing two tables into a functional index?

it is not possible - but you can create an auxiliary table and create
indexes over that table

but maybe  https://github.com/dimitri/prefix can help
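
The basic shape of using it, going by that extension's README (table
names from the thread; the prefix_range type, the cast and the @>
operator all come from the extension):

create extension prefix;

alter table tab2 alter column code type prefix_range
    using code::prefix_range;
create index tab2_code_prefix_idx on tab2 using gist (code);

-- @> tests "prefix matches string" and can use the GiST index
select t1.code, t2.code
from tab1 t1
join tab2 t2 on t2.code @> t1.code;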

Regards

Pavel

 What would the exact syntax be for that?

 Regards
 Wolfgang Hamann










[GENERAL] Optimizing query?

2013-01-30 Thread wolfgang

Hi,

I am trying to match items from 2 tables based on a common string.
One is a big table which has one column with entries like XY123, ABC44, etc
The table has an index on that column.
The second table is, typically, much smaller

select  from tab1, tab2 where tab1.code = tab2.code;

This works fine and fast.
Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in the
big table and want them to match XY423, GF55 in the second table

Variants I have tried

select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');

both take an enormous time. In the better case that I can subset (e.g. all 
candidates in table 2
share initial AX) I get back to manageable times by adding
  and tab1.code ~ '^AX'
into the recipe. Actual runtime with about a million entries in tab1 and 800 
entries in tab2
is about 40 seconds.

Regards
Wolfgang Hamann








Re: [GENERAL] Optimizing query?

2013-01-30 Thread Vincent Veyron
On Wednesday, 30 January 2013, at 11:08, wolfg...@noten5.maas-noten.de
wrote:
 Hi,
 
 I am trying to match items from 2 tables based on a common string.
 One is a big table which has one column with entries like XY123, ABC44, etc
 The table has an index on that column.
 The second table is, typically, much smaller
 
 select  from tab1, tab2 where tab1.code = tab2.code;
 
 This works fine and fast.
 Now, as a variant, I have some entries like XY423A, XY423B, GF55A, GF55D in 
 the
 big table and want them to match XY423, GF55 in the second table
 
 Variants I have tried
 
 select   from tab1, tab2 where tab1.code ~ (tab2.code||'($|[A-Z])');
 select   from tab1, tab2 where tab1.code ~ ('^'||tab2.code||'($|[A-Z])');
 

Have you tried the substring function?

select   from tab1, tab2 where substring(tab1.code from 1 for 5) =
tab2.code


 both take an enormous time. In the better case that I can subset (e.g. all 
 candidates in table 2
 share initial AX) I get back to manageable times by adding
   and tab1.code ~ '^AX'
 into the recipe. Actual runtime with about a million entries in tab1 and 800 
 entries in tab2
 is about 40 seconds.
 
 Regards
 Wolfgang Hamann
 
 
 
 
 
 

-- 
Salutations, Vincent Veyron
http://marica.fr/site/demonstration
Logiciel de gestion des contentieux juridiques et des sinistres d'assurance





[GENERAL] Optimizing query

2010-11-24 Thread pasman pasmański
Hello.

I have a query that runs a bit slowly.

It runs on a desktop computer: AMD Athlon X2 2 GHz, Windows XP SP2, 1 GB RAM.
Postgres 8.4.5 with some changes in config:

shared_buffers = 200MB  # min 128kB
# (change requires restart)
temp_buffers = 8MB  # min 800kB
work_mem = 12MB # min 64kB
maintenance_work_mem = 32MB # min 1MB

Indexes in table NumeryA:
NTA, NKA, KodBłędu, Plik - primary key
DataPliku, KodBłędu - index dp_kb
NKA, NTA - index nka_nta

Indexes in table Rejestr stacji do naprawy:
LP - primary key
Numer kierunkowy, substr(Numer stacji::text, 1, 5) - index 3
Data weryfikacji - index Data weryfikacji_1
Numer kierunkowy, Numer stacji, Data odrzucania bilingu z Serat - index Powtórzenia

-
Query is:
--
SELECT
  A."NKA",
  A."NTA",
  Min("PołączeniaMin") || ',' || Max("PołączeniaMax") AS "Biling",
  Sum("Ile")::text AS "Ilość CDR",
  R."LP"::text AS "Sprawa",
  (R."Osoba weryfikująca") AS "Osoba",
  to_char(min("Wartość"),'FM990D00') AS "Wartość po kontroli",
  max(R."Kontrola po naprawie w Serat - CDR")::text AS "CDR po kontroli",
  min(A."KodBłędu")::text AS "KodBłędu",
  Max(to_char(R."Data kontroli",'-MM-DD')) AS "Ostatnia Kontrola"
, max("Skutek wprowadzenia błednej ewidencji w Serat") as "Skutek"
, sum(www.a_biling_070("NRB"))::text
, sum(www.a_biling_darmowy("NRB"))::text
FROM
  (SELECT "NumeryA".*
   FROM ONLY "NumeryA"
   WHERE "DataPliku" >= current_date-4*30 and "KodBłędu"='74'::text
  ) AS A
LEFT JOIN
  (SELECT * FROM "Rejestr stacji do naprawy"
   WHERE "Data weryfikacji" >= current_date-4*30
  ) AS R
ON
  A."NKA" = R."Numer kierunkowy"
  and substr(A."NTA",1,5) = substr(R."Numer stacji",1,5)
  and A."NTA" like R."Numer stacji"
GROUP BY R."Osoba weryfikująca",R."LP",A."NKA", A."NTA"
ORDER BY Sum("Ile") DESC
LIMIT 5000
--
Explain analyze:
--

 Limit  (cost=30999.84..31012.34 rows=5000 width=149) (actual time=7448.483..7480.094 rows=5000 loops=1)
   ->  Sort  (cost=30999.84..31073.19 rows=29341 width=149) (actual time=7448.475..7459.663 rows=5000 loops=1)
         Sort Key: (sum("NumeryA"."Ile"))
         Sort Method:  top-N heapsort  Memory: 1488kB
         ->  GroupAggregate  (cost=11093.77..29050.46 rows=29341 width=149) (actual time=4700.654..7377.762 rows=14225 loops=1)
               ->  Sort  (cost=11093.77..11167.12 rows=29341 width=149) (actual time=4699.587..4812.776 rows=46732 loops=1)
                     Sort Key: "Rejestr stacji do naprawy"."Osoba weryfikująca", "Rejestr stacji do naprawy"."LP", "NumeryA"."NKA", "NumeryA"."NTA"
                     Sort Method:  quicksort  Memory: 9856kB
                     ->  Merge Left Join  (cost=8297.99..8916.58 rows=29341 width=149) (actual time=2931.449..3735.876 rows=46732 loops=1)
                           Merge Cond: ((("NumeryA"."NKA")::text = ("Rejestr stacji do naprawy"."Numer kierunkowy")::text) AND ((substr(("NumeryA"."NTA")::text, 1, 5)) = (substr(("Rejestr stacji do naprawy"."Numer stacji")::text, 1, 5))))
                           Join Filter: (("NumeryA"."NTA")::text ~~ ("Rejestr stacji do naprawy"."Numer stacji")::text)
                           ->  Sort  (cost=6062.18..6135.53 rows=29341 width=95) (actual time=2131.297..2241.303 rows=46694 loops=1)
                                 Sort Key: "NumeryA"."NKA", (substr(("NumeryA"."NTA")::text, 1, 5))
                                 Sort Method:  quicksort  Memory: 7327kB
                                 ->  Bitmap Heap Scan on "NumeryA"  (cost=1502.09..3884.98 rows=29341 width=95) (actual time=282.570..1215.355 rows=46694 loops=1)
                                       Recheck Cond: (("DataPliku" >= (('now'::text)::date - 120)) AND (("KodBłędu")::text = '74'::text))
                                       ->  Bitmap Index Scan on dp_kb  (cost=0.00..1494.75 rows=29341 width=0) (actual time=281.991..281.991 rows=46694 loops=1)
                                             Index Cond: (("DataPliku" >= (('now'::text)::date - 120)) AND (("KodBłędu")::text = '74'::text))
                           ->  Sort  (cost=2235.82..2285.03 rows=19684 width=64) (actual time=800.101..922.463 rows=54902 loops=1)
                                 Sort Key: "Rejestr stacji do naprawy"."Numer kierunkowy", (substr(("Rejestr stacji do naprawy"."Numer stacji")::text, 1, 5))
                                 Sort Method:  quicksort  Memory: 3105kB
                                 ->  Seq Scan on "Rejestr stacji do naprawy"  (cost=0.00..831.88 rows=19684 width=64) (actual time=2.118..361.463 rows=19529 loops=1)
                                       Filter: ("Data weryfikacji" >= (('now'::text)::date - 120))
 Total runtime: 7495.697 ms
-

How can I make it faster?



pasman



[GENERAL] Optimizing query: select ... where id = 4 and md5(...) = '...'

2006-07-03 Thread Alexander Farber

Hello,

in my application I'm trying to authenticate users
against a table called users. The integer column
id should match, but also an md5 hash of the
password column (salted with a string) should match.
My authentication function (written in C, using libpq)
should return the username (a varchar(200) field).

I wonder what is faster: fetching 2 columns - the
username and the md5 result - and then comparing the
md5 string against the argument in my app, like here:

punbb=> select username, md5('deadbeef' || password) from users where id = 4;
 username |               md5
----------+----------------------------------
 Vasja    | dcde745cc304742e26d62e683a9ecb0a
(1 row)

punbb=> explain select username, md5('deadbeef' || password) from
users where id = 4;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.00..5.95 rows=1 width=156)
   Index Cond: (id = 4)
(2 rows)

Or letting the database doing this comparison for me:

punbb=> select username from users where id = 4 and md5('deadbeef' ||
password) = 'dcde745cc304742e26d62e683a9ecb0a';
 username
----------
 Vasja
(1 row)

punbb=> explain select username from users where id = 4 and
md5('deadbeef' || password) = 'dcde745cc304742e26d62e683a9ecb0a';
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.00..5.95 rows=1 width=118)
   Index Cond: (id = 4)
   Filter: (md5(('deadbeef'::text || (password)::text)) =
 'dcde745cc304742e26d62e683a9ecb0a'::text)
(3 rows)

I've prepared a test case with the code listed at the
botom and have run it 1000 times, but am still unsure:

$ time perl -e 'for (1..1000) {system("./fetch-user", "APP_QUERY") and die $!}'

username: Vasja

5.038u 5.734s 0:26.29 40.9% 0+0k 0+4io 0pf+0w

$ time perl -e 'for (1..1000) {system("./fetch-user", "DB_QUERY") and die $!}'

username: Vasja

4.757u 5.890s 0:26.52 40.1% 0+0k 0+8io 0pf+0w

How does one profile PostgreSQL queries in general?

Thank you
Alex

PS: Using Postgresql 8.1.0 (from packages) on OpenBSD/386 -current

PPS: My test program, call with APP_QUERY or DB_QUERY:

#include <err.h>
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

#define DB_CONN_STR "host=/var/www/tmp user=punbb dbname=punbb"
#define APP_QUERY   "select username, md5('deadbeef' || password) " \
                    "from users where id = $1"
#define DB_QUERY    "select username from users where id = $1 and " \
                    "md5('deadbeef' || password) = $2"
int
main(int argc, char *argv[])
{
    PGconn      *conn;
    PGresult    *res;
    const char  *query;
    const char  *args[2];
    unsigned    nargs;

    if (argc < 2)
        errx(1, "wrong usage: supply APP_QUERY or DB_QUERY");

    if (! strcmp(argv[1], "APP_QUERY")) {
        query = APP_QUERY;
        nargs = 1;
    } else if (! strcmp(argv[1], "DB_QUERY")) {
        query = DB_QUERY;
        nargs = 2;
    } else
        errx(1, "wrong usage: supply APP_QUERY or DB_QUERY");

    if ((conn = PQconnectdb(DB_CONN_STR)) == NULL)
        err(1, "Connect failed: out of memory");

    if (PQstatus(conn) != CONNECTION_OK)
        errx(1, "Connect failed: %s", PQerrorMessage(conn));

    if ((res = PQprepare(conn, "sql_fetch_username",
        query, nargs, NULL)) == NULL)
        err(1, "Preparing '%s' failed: out of memory", query);

    if (PQresultStatus(res) != PGRES_COMMAND_OK)
        errx(1, "Preparing statement failed: %s", PQerrorMessage(conn));

    PQclear(res);

    args[0] = "4";
    args[1] = "dcde745cc304742e26d62e683a9ecb0a";

    if ((res = PQexecPrepared(conn, "sql_fetch_username",
        nargs, args, NULL, NULL, 0)) == NULL)
        err(1, "Executing statement '%s' failed: out of memory",
            query);

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        errx(1, "Executing statement '%s' failed: %s",
            query, PQerrorMessage(conn));

    /* for APP_QUERY, emulate the client-side md5 comparison */
    if (nargs == 1)
        (void) strcmp(args[1], PQgetvalue(res, 0, 1));

    fprintf(stderr, "username: %s\n", PQgetvalue(res, 0, 0));

    PQclear(res);   /* clear only after the last use of res */
    PQfinish(conn);
    return 0;
}


--
http://preferans.de



Re: [GENERAL] Optimizing query: select ... where id = 4 and md5(...) = '...'

2006-07-03 Thread Martijn van Oosterhout
On Mon, Jul 03, 2006 at 03:13:15PM +0200, Alexander Farber wrote:
 Hello,
 
 in my application I'm trying to authenticate users
 against a table called users. The integer column
 id should match, but also an md5 hash of the
 password column (salted with a string) should match.
 My authentication function (written in C, using libpq)
 should return a username (is a varchar(200) field).
 
 I wonder what is faster: fetching 2 columns - the
 username and the md5-result and then comparing the
 md5 string against the argument in my app, like here:

I don't know about speed, but I think the choice should really be based
on whether you want to be able to tell the difference between unknown
user and bad password. You can still do the comparison in the database
by doing something like:

select username, md5('deadbeef' || password) = 'blah' 
from users where id = 4;

So the second field will be true or false.

In any case, the testing you're doing is bogus, since you're probably
testing backend startup time as well, which is probably longer than the
query you're running anyway. Even then, 5ms for the whole process is
not to be sneezed at.

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [GENERAL] Optimizing query: select ... where id = 4 and md5(...) = '...'

2006-07-03 Thread Alexander Farber

Yes, you're probably right.

I'm just trying to ensure that the slow md5() function isn't
called for every row in the table. If that's not the case, then
the other tiny speed differences are not that important for me.

Your query works too, thanks for the hint.

punbb=> select username, md5('deadbeef' || password) = 'blah' from
users where id = 4;
 username | ?column?
----------+----------
 Vasja    | f
(1 row)

punbb=> explain select username, md5('deadbeef' || password) = 'blah'
from users where id = 4;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Scan using users_pkey on users  (cost=0.00..5.95 rows=1 width=156)
   Index Cond: (id = 4)
(2 rows)

Regards
Alex

On 7/3/06, Martijn van Oosterhout kleptog@svana.org wrote:

On Mon, Jul 03, 2006 at 03:13:15PM +0200, Alexander Farber wrote:
 in my application I'm trying to authenticate users
 against a table called users. The integer column
 id should match, but also an md5 hash of the
 password column (salted with a string) should match.
 My authentication function (written in C, using libpq)
 should return a username (is a varchar(200) field).

 I wonder what is faster: fetching 2 columns - the
 username and the md5-result and then comparing the
 md5 string against the argument in my app, like here:

I don't know about speed, but I think the choice should really be based
on whether you want to be able to tell the difference between unknown
user and bad password. You can still do the comparison in the database
by doing something like:

select username, md5('deadbeef' || password) = 'blah'
from users where id = 4;

So the second field will be true or false.

In any case, the testing you're doing is bogus, since you're probably
testing backend startup time as well, which is probably longer than the
query you're running anyway. Even then, 5ms for the whole process is
not to be sneezed at.
--
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
litigate.





--
http://preferans.de



Re: [GENERAL] Optimizing query: select ... where id = 4 and md5(...)

2006-07-03 Thread Alban Hertroys

Alexander Farber wrote:

I wonder what is faster: fetching 2 columns - the
username and the md5-result and then comparing the
md5 string against the argument in my app, like here:

punbb=> select username, md5('deadbeef' || password) from users where id
= 4;

 username |               md5
----------+----------------------------------
 Vasja    | dcde745cc304742e26d62e683a9ecb0a
(1 row)


Why don't you store the hashed value instead?

--
Alban Hertroys
[EMAIL PROTECTED]

magproductions b.v.

T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
   7500 AK Enschede

// Integrate Your World //



Re: [GENERAL] Optimizing query: select ... where id = 4 and md5(...) = '...'

2006-07-03 Thread Alexander Farber

Hi Alban,

On 7/3/06, Alban Hertroys [EMAIL PROTECTED] wrote:

Alexander Farber wrote:
 punbb=> select username, md5('deadbeef' || password) from users where id
 = 4;
  username |               md5
 ----------+----------------------------------
  Vasja    | dcde745cc304742e26d62e683a9ecb0a
 (1 row)

Why don't you store the hashed value instead?


actually it's not me - it's the punBB forum SW
( http://docs.punbb.org/dev.html#dbtables )
against which I want my libpq-program to authenticate.

An additional reason is that I could change my salt string
if I wanted, and the users could still authenticate without
changing their passwords

Regards
Alex

--
http://preferans.de



[GENERAL] Optimizing query

2005-08-15 Thread Poul Møller Hansen

I have a problem creating a usable index for the following simple query:
SELECT * FROM my.table WHERE node = '10' ORDER BY id DESC LIMIT 1

id is a serial, so the query is to find the latest entry to a given node 
and id is the primary key.
The table contains around 1 million records and the query takes around 2 
seconds.


I have tried to make an index on node, and also on both id and node, but it
doesn't lower the query time.

What am I doing wrong ?

Thanks,
Poul




Re: [GENERAL] Optimizing query

2005-08-15 Thread Richard Huxton

Poul Møller Hansen wrote:

I have a problem creating a usable index for the following simple query:
SELECT * FROM my.table WHERE node = '10' ORDER BY id DESC LIMIT 1

id is a serial, so the query is to find the latest entry to a given node 
and id is the primary key.


You're not necessarily getting the latest entry, just the one with the 
highest id. Sequences guarantee uniqueness, but with concurrent 
inserts they don't necessarily guarantee ordering.


The table contains around 1 million records and the query takes around 2 
seconds.



Well, you don't say how many different values for node there are, nor 
how many rows you would expect where node='10'.


I have tried to make an index on node, and also on both id and node, but it
doesn't lower the query time.


Difficult to say what's happening since you don't supply any EXPLAIN 
ANALYSE output.


However, if you have an index on (node,id) you might want to try:
  SELECT ... ORDER BY node DESC, id DESC LIMIT 1;
That way the ORDER BY part clearly tells the planner that a 
reverse-order on your index will be useful.
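
Concretely, something like this (the index name is invented; my.table
and the columns are the thread's placeholders):

CREATE INDEX table_node_id_idx ON my.table (node, id);

SELECT * FROM my.table
WHERE node = '10'
ORDER BY node DESC, id DESC
LIMIT 1;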


--
  Richard Huxton
  Archonet Ltd




Re: [GENERAL] Optimizing query

2005-08-15 Thread Poul Møller Hansen



I have a problem creating a usable index for the following simple query:
SELECT * FROM my.table WHERE node = '10' ORDER BY id DESC LIMIT 1

id is a serial, so the query is to find the latest entry to a given 
node and id is the primary key.



You're not necessarily getting the latest entry, just the one with the 
highest id. Sequences guarantee uniqueness but if you have 
concurrent inserts not necessarily ordering.



Right you are, but I have no concurrent inserts from the same node.



Difficult to say what's happening since you don't supply any EXPLAIN 
ANALYSE output.


However, if you have an index on (node,id) you might want to try:
  SELECT ... ORDER BY node DESC, id DESC LIMIT 1;
That way the ORDER BY part clearly tells the planner that a 
reverse-order on your index will be useful.



Thanks a lot, that did the trick !

explain analyze SELECT * FROM my.table WHERE node = '10' ORDER BY id 
DESC LIMIT 1

                              QUERY PLAN
------------------------------------------------------------------------
 Limit  (cost=0.00..764.00 rows=1 width=246) (actual 
time=1874.890..1874.896 rows=1 loops=1)
   ->  Index Scan Backward using table_pkey on table  
(cost=0.00..4347913.94 rows=5691 width=246) (actual 
time=1874.867..1874.867 rows=1 loops=1)
         Filter: ((node)::text = '10'::text)
 Total runtime: 1875.111 ms

explain analyze SELECT * FROM my.table WHERE node = '10' ORDER BY node, 
id DESC LIMIT 1

                              QUERY PLAN
------------------------------------------------------------------------
 Limit  (cost=22638.36..22638.36 rows=1 width=246) (actual 
time=3.001..3.007 rows=1 loops=1)
   ->  Sort  (cost=22638.36..22652.59 rows=5691 width=246) (actual 
time=2.984..2.984 rows=1 loops=1)
         Sort Key: node, id
         ->  Index Scan using node_date on table  (cost=0.00..21898.65 
rows=5691 width=246) (actual time=0.077..1.852 rows=62 loops=1)
               Index Cond: ((node)::text = '10'::text)
 Total runtime: 3.127 ms


Poul




Re: [GENERAL] Optimizing query

2005-08-15 Thread Dennis Bjorklund
On Mon, 15 Aug 2005, Poul Møller Hansen wrote:

 I have a problem creating a usable index for the following simple query:
 SELECT * FROM my.table WHERE node = '10' ORDER BY id DESC LIMIT 1
 
 id is a serial, so the query is to find the latest entry to a given node 
 and id is the primary key.
 The table contains around 1 million records and the query takes around 2 
 seconds.
 
 I have tried to make an index on node and also on both id  node, but is 
 doesn't lower the
 query time.

Try to make an index on (node,id) and write the query as:

SELECT * FROM my.table WHERE node = '10' ORDER BY node desc, id desc LIMIT 1;

Then I'm pretty sure it will use that index.

-- 
/Dennis Björklund




Re: [GENERAL] Optimizing query

2005-08-15 Thread Tom Lane
Poul Møller Hansen [EMAIL PROTECTED] writes:
 explain analyze SELECT * FROM my.table WHERE node = '10' ORDER BY node, 
 id DESC LIMIT 1

                               QUERY PLAN

  Limit  (cost=22638.36..22638.36 rows=1 width=246) (actual 
 time=3.001..3.007 rows=1 loops=1)
    ->  Sort  (cost=22638.36..22652.59 rows=5691 width=246) (actual 
 time=2.984..2.984 rows=1 loops=1)
          Sort Key: node, id
          ->  Index Scan using node_date on table  (cost=0.00..21898.65 
 rows=5691 width=246) (actual time=0.077..1.852 rows=62 loops=1)
                Index Cond: ((node)::text = '10'::text)
  Total runtime: 3.127 ms

You're not there yet: you want what Richard said, namely

explain analyze SELECT * FROM my.table WHERE node = '10' ORDER BY node DESC, 
id DESC LIMIT 1

There shouldn't be any Sort in the plan, just the indexscan and Limit.
The plan above is going to suck if there are a lot of rows with node = '10'.

regards, tom lane



Re: [GENERAL] Optimizing query

2005-08-15 Thread Poul Møller Hansen


You're not there yet: you want what Richard said, namely



I realized that it wasn't optimal for all nodes, namely those with a lot 
of rows.


So you are absolutely right; I followed the suggestion of Richard and it 
works perfectly.

Thank you all, I learned a lesson about indexes today...


Poul



[GENERAL] Optimizing query

2003-11-19 Thread Uros
Hello!

I have some trouble getting good results from my query.

here is structure

stat_views
id| integer
id_zone   | integer
created   | timestamp


I have a btree index on created and also on id, and there are 1633832 records in
that table

First of all I have to manually set enable_seqscan to OFF because I always get
a seq scan. When I set it to off my explain shows:

explain SELECT count(*) as views FROM stat_views WHERE id = 12;
                              QUERY PLAN

 Aggregate  (cost=122734.86..122734.86 rows=1 width=0)
   ->  Index Scan using stat_views_id_idx on stat_views  (cost=0.00..122632.60 
rows=40904 width=0)
         Index Cond: (id = 12)

But what I need is to count views for some day, so I use

explain SELECT count(*) as views FROM stat_views WHERE date_part('day', created) = 18;

                              QUERY PLAN

 Aggregate  (cost=100101618.08..100101618.08 rows=1 width=0)
   ->  Seq Scan on stat_views  (cost=1.00..100101565.62 rows=20984 width=0)
         Filter: (date_part('day'::text, created) = 18::double precision)


How can I make this use an index and speed up the query? Now it takes about 12
seconds.
 
-- 
Best regards,
 Uros  mailto:[EMAIL PROTECTED]




Re: [GENERAL] Optimizing query

2003-11-19 Thread Matthew Lunnon



Do something like:

CREATE OR REPLACE FUNCTION my_date_part(timestamp)
RETURNS double precision AS '
DECLARE
    mydate ALIAS FOR $1;
BEGIN
    RETURN date_part(''day'', mydate);
END;
' LANGUAGE 'plpgsql' IMMUTABLE;

create index idx_tmp on stat_views( my_date_part( created) );

or add an extra date_part column to your table which pre-calculates
date_part('day', created) and put an index on this.

Cheers
Matthew


Re: [GENERAL] Optimizing query

2003-11-19 Thread Peter Eisentraut
Uros writes:

 explain SELECT count(*) as views FROM stat_views WHERE date_part('day', created) = 
 18;

  QUERY PLAN
 
  Aggregate  (cost=100101618.08..100101618.08 rows=1 width=0)
    ->  Seq Scan on stat_views  (cost=1.00..100101565.62 rows=20984 width=0)
          Filter: (date_part('day'::text, created) = 18::double precision)

Create an index on date_part('day', created).  In 7.3 and earlier you need
to create a wrapper function and index that; in 7.4 you can index
arbitrary expressions directly.  The documentation contains more
information about that.
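
In 7.4 that would look something like (index name invented):

CREATE INDEX stat_views_day_idx
    ON stat_views ((date_part('day', created)));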

-- 
Peter Eisentraut   [EMAIL PROTECTED]




Re: [GENERAL] Optimizing query

2003-11-19 Thread Shridhar Daithankar
Uros wrote:

Hello!

I have some trouble getting good results from my query.

here is structure

stat_views
id| integer
id_zone   | integer
created   | timestamp
I have a btree index on created and also on id, and there are 1633832 records in
that table
First of all I have to manually set enable_seqscan to OFF because I always get
a seq scan. When I set it to off my explain shows:
explain SELECT count(*) as views FROM stat_views WHERE id = 12;
                              QUERY PLAN

 Aggregate  (cost=122734.86..122734.86 rows=1 width=0)
   ->  Index Scan using stat_views_id_idx on stat_views  (cost=0.00..122632.60 
rows=40904 width=0)
         Index Cond: (id = 12)
But what I need is to count views for some day, so I use

explain SELECT count(*) as views FROM stat_views WHERE date_part('day', created) = 18;

                              QUERY PLAN

 Aggregate  (cost=100101618.08..100101618.08 rows=1 width=0)
   ->  Seq Scan on stat_views  (cost=1.00..100101565.62 rows=20984 width=0)
         Filter: (date_part('day'::text, created) = 18::double precision)
How can I make this to use index and speed the query. Now it takes about 12
seconds.
Can you post explain analyze for the same?

 Shridhar





Re: [GENERAL] Optimizing query

2003-11-19 Thread Uros
Hello Shridhar,

I used Matthew's solution and it works. The query takes only half a second. I
didn't know that I could index a function too.

Thanks

Uros

Wednesday, November 19, 2003, 1:23:26 PM, you wrote:

SD Uros wrote:

 Hello!
 
 I have some trouble getting good results from my query.
 
 here is structure
 
 stat_views
 id| integer
 id_zone   | integer
 created   | timestamp
 
 
 I have a btree index on created and also on id, and there are 1633832 records in
 that table

 First of all I have to manually set enable_seqscan to OFF because I always get
 a seq scan. When I set it to off my explain shows:
 
 explain SELECT count(*) as views FROM stat_views WHERE id = 12;
                               QUERY PLAN

  Aggregate  (cost=122734.86..122734.86 rows=1 width=0)
    ->  Index Scan using stat_views_id_idx on stat_views 
 (cost=0.00..122632.60 rows=40904 width=0)
          Index Cond: (id = 12)
 
 But what I need is to count views for some day, so I use
 
 explain SELECT count(*) as views FROM stat_views WHERE date_part('day', created) = 18;

                               QUERY PLAN

  Aggregate  (cost=100101618.08..100101618.08 rows=1 width=0)
    ->  Seq Scan on stat_views  (cost=1.00..100101565.62 rows=20984 width=0)
          Filter: (date_part('day'::text, created) = 18::double precision)
 
 
 How can I make this to use index and speed the query. Now it takes about 12
 seconds.

SD Can you post explain analyze for the same?

SD   Shridhar






