Re: [PERFORM] query not using GIN index

2015-08-21 Thread Tomas Vondra

Hi,

On 08/22/2015 03:55 AM, Guo, Yun wrote:

Hi,

We have a query on a column with a GIN index, but the query plan chooses not
to use the index and does a seq scan instead, which is much slower

CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING
gin (access_tokens);

explain analyze SELECT access_grants.* FROM access_grants  WHERE
(access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
 QUERY PLAN
--
  Limit  (cost=0.00..7.46 rows=1 width=157) (actual
time=260.376..260.377 rows=1 loops=1)
->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985
width=157) (actual time=260.373..260.373 rows=1 loops=1)
  Filter: (access_tokens @>
'{124e5a1f9de325fc176a7c89152ac734}'::text[])
  Rows Removed by Filter: 796818
  Total runtime: 260.408 ms



I find it very likely that the explain output actually comes from a 
slightly different query, including a LIMIT 1 clause.


That might easily be the problem here, because the optimizer expects the 
3985 matches to be uniformly distributed in the table, so it thinks 
it'll scan just a tiny fraction of the table (1/3985) until the first 
match. But it's quite possible all the matching rows are at the end of 
the table, and the executor actually has to scan the whole table.


It's difficult to say without further details of the table and how the 
data are generated.
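One way to test the theory (a diagnostic sketch, assuming the table and
index names from the query above) is to discourage sequential scans for the
session and compare plans:

```sql
-- Diagnostic only, not a production setting: discourage seq scans for
-- this session so the planner shows us the index-based plan instead.
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT access_grants.*
FROM access_grants
WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']
LIMIT 1;

RESET enable_seqscan;
```

If the index-based plan is much faster, the problem is the cost estimate
for the LIMIT, not the index itself.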



We tested on a smaller table in the development region and it chooses to use
the index there. However, on the production-size table it decides to ignore
the index for unknown reasons.


Please provide explain output from that table. It's difficult to say 
what's different without seeing the details.


Also please provide important details about the system (e.g. which 
PostgreSQL version, how much RAM, what work_mem/shared_buffers and such 
stuff).
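For instance (a sketch, run against the same database):

```sql
-- Server version, memory settings, and the relative sizes of the
-- table and its GIN index:
SELECT version();
SHOW shared_buffers;
SHOW work_mem;
SELECT pg_size_pretty(pg_relation_size('access_grants')) AS table_size,
       pg_size_pretty(pg_relation_size('idx_access_grants_on_access_tokens'))
           AS index_size;
```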



Is the large number of tuples skewing the query planner’s decision, or is
the index itself larger than the table, so it would decide to do a table
scan?


What large number of tuples? The indexes are supposed to be more 
efficient the larger the table is.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Performance bottleneck due to array manipulation

2015-08-21 Thread Genc , Ömer
Hey,

I have a very long-running stored procedure, due to array manipulation. The 
following procedure takes 13 seconds to finish.

BEGIN
point_ids_older_than_one_hour := '{}';
object_ids_to_be_invalidated := '{}';

select ARRAY(SELECT
point_id
from ONLY
public.ims_point as p
where
p.timestamp < m_before_one_hour
)
into point_ids_older_than_one_hour ; -- this array has a size of 20k

select ARRAY(SELECT
object_id
  from
public.ims_object_header h
  WHERE
h.last_point_id= ANY(point_ids_older_than_one_hour)
 )
into object_ids_to_be_invalidated; -- this array has a size of 100

--current_last_point_ids will have a size of 100k
current_last_point_ids := ARRAY( SELECT
last_point_id
  from
public.ims_object_header h
 );
-- START OF PERFORMANCE BOTTLENECK
IF (array_length(current_last_point_ids, 1) > 0)
THEN
FOR i IN 0 .. array_upper(current_last_point_ids, 1)
LOOP
point_ids_older_than_one_hour = 
array_remove(point_ids_older_than_one_hour, current_last_point_ids[i]::bigint);
END LOOP;
END IF;
-- END OF PERFORMANCE BOTTLENECK
END;

The array manipulation part is the performance bottleneck. I am pretty sure 
that there is a better way of doing this, however I couldn't find one.
What I have is two tables, let's call them ims_point and ims_object_header. 
ims_object_header references some entries of ims_point in the column 
last_point_id.
Now I want to delete all entries from ims_point where the timestamp is older 
than one hour. The ids currently referenced by the table ims_object_header 
should be excluded from this deletion. Therefore I stored the ids in arrays 
and iterate over those arrays to exclude the referenced values from being 
deleted.

However, I am not sure if using an array for an operation like this is the 
best approach.

Can anyone give me some advice on how this could be improved.

Thanks in advance.



Re: [PERFORM] Most efficient way of querying M 'related' tables where N out of M may contain the key

2015-08-21 Thread David G. Johnston
On Fri, Aug 21, 2015 at 8:07 AM, Stephane Bailliez sbaill...@gmail.com
wrote:


 On Thu, Aug 20, 2015 at 8:19 PM, David G. Johnston 
 david.g.johns...@gmail.com wrote:


 SELECT [...]
 FROM (SELECT reference_id, [...] FROM table_where_referenced_id_is_a_pk
 WHERE reference_id EXISTS/IN/JOIN)

 src
 LEFT JOIN type1 USING (reference_id)
 LEFT JOIN type2 USING (reference_id)
 [...]



Place ^ in a CTE named (find_all)


 there are no tables where reference_id is a pk, I could create one or do:
 select reference_id from ( values (..), (...), (...)  )

 the tricky part with the join (and where I was not clear about it in my
 original description) is that a reference_id can match in multiple tables
 (e.g. it can be a fk in type1 and type2), so it then becomes a bit harder to
 collect all the common attributes and 'types' when doing joins like this.

 For example, let's assume there is a group_id to be retrieved among all
 tables as a common attribute:

 if reference_id existed in only one table, I could do
 coalesce(type1.group_id, ... type5.group_id) as group_id in the main select,
 however that would not work in this case.


WITH find_all (reference_id, type_identifier, type_id) AS ( ... )
SELECT type_identifier, array_agg(reference_id), array_agg(type_id)
FROM find_all
WHERE type_identifier IS NOT NULL
GROUP BY type_identifier

find_all will return at least one row per matching type_identifier, none if
no matches are present, and multiple rows if more than one type matches.
You can use array_agg as shown, or play around with custom composite types,
or even build a JSON document.

David J.


Re: [PERFORM] Performance bottleneck due to array manipulation

2015-08-21 Thread Tom Lane
Genc, Ömer oemer.g...@iais.fraunhofer.de writes:
 I have a very long-running stored procedure, due to array manipulation. The 
 following procedure takes 13 seconds to finish.

 BEGIN
 point_ids_older_than_one_hour := '{}';
 object_ids_to_be_invalidated := '{}';

 select ARRAY(SELECT
 point_id
 from ONLY
 public.ims_point as p
 where
 p.timestamp < m_before_one_hour
 )
 into point_ids_older_than_one_hour ; -- this array has a size of 20k

 select ARRAY(SELECT
 object_id
   from
 public.ims_object_header h
   WHERE
 h.last_point_id= ANY(point_ids_older_than_one_hour)
  )
 into object_ids_to_be_invalidated; -- this array has a size of 100

 --current_last_point_ids will have a size of 100k
 current_last_point_ids := ARRAY( SELECT
 last_point_id
   from
 public.ims_object_header h
  );
 -- START OF PERFORMANCE BOTTLENECK
 IF (array_length(current_last_point_ids, 1) > 0)
 THEN
 FOR i IN 0 .. array_upper(current_last_point_ids, 1)
 LOOP
 point_ids_older_than_one_hour = 
 array_remove(point_ids_older_than_one_hour, 
 current_last_point_ids[i]::bigint);
 END LOOP;
 END IF;
 -- END OF PERFORMANCE BOTTLENECK
 END;

Well, in the first place, this is the cardinal sin of working with SQL
databases: doing procedurally that which should be done relationally.
Forget the arrays entirely and use EXCEPT, ie

  SELECT point_id FROM ...
  EXCEPT
  SELECT last_point_id FROM ...

Or maybe you want EXCEPT ALL, depending on whether duplicates are possible
and what you want to do with them if so.
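A sketch of what that might look like, using the variable and column names
from the original procedure (m_before_one_hour is assumed to be the
timestamp parameter):

```sql
-- Build the array in one relational step: ids older than one hour,
-- minus the ids still referenced as last_point_id.
point_ids_older_than_one_hour := ARRAY(
    SELECT point_id
      FROM ONLY public.ims_point
     WHERE timestamp < m_before_one_hour
    EXCEPT
    SELECT last_point_id
      FROM public.ims_object_header
);
```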

Having said that, the way you have this is necessarily O(N^2) because
array_remove has to search the entire array on each call, and then
reconstruct the entire array less any removed element(s).  The new
expanded array infrastructure in 9.5 could perhaps reduce some of the
constant factor, but the array search loop remains so it would still be
O(N^2); and anyway array_remove has not been taught about expanded arrays
(which means this example is probably even slower in 9.5 :-().

If you were using integers, you could possibly get decent performance from
contrib/intarray's int[] - int[] operator (which I think does a sort and
merge internally); but I gather that these are bigints, so that won't
work.
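For reference, with int[] the intarray difference operator would look like
this (not applicable here, since the ids are bigints):

```sql
CREATE EXTENSION IF NOT EXISTS intarray;

-- intarray's int[] - int[] operator removes all occurrences of the
-- right-hand elements from the left-hand array:
SELECT ARRAY[1,2,3,4] - ARRAY[2,4];  -- {1,3}
```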

regards, tom lane




Re: [PERFORM] Performance bottleneck due to array manipulation

2015-08-21 Thread Félix GERZAGUET
Hello,

On Fri, Aug 21, 2015 at 2:48 PM, Genc, Ömer oemer.g...@iais.fraunhofer.de
wrote:

 Now I want to delete all entries from ims_point, where the timestamp is
 older than one hour. The ids currently referenced by the table
 ims_object_header should be excluded from this deletion.




delete from public.ims_point ip
 where ip.timestamp < current_timestamp - interval '1 hour'
   and not exists ( select 'reference exists'
                      from public.ims_object_header ioh
                     where ioh.last_point_id = ip.point_id
                  )
;

Does this work for you?


Re: [PERFORM] Performance bottleneck due to array manipulation

2015-08-21 Thread Igor Neyman


From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Genc, Ömer
Sent: Friday, August 21, 2015 8:49 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance bottleneck due to array manipulation

Hey,

I have a very long-running stored procedure, due to array manipulation. The 
following procedure takes 13 seconds to finish.

BEGIN
point_ids_older_than_one_hour := '{}';
object_ids_to_be_invalidated := '{}';

select ARRAY(SELECT
point_id
from ONLY
public.ims_point as p
where
p.timestamp < m_before_one_hour
)
into point_ids_older_than_one_hour ; -- this array has a size of 20k

select ARRAY(SELECT
object_id
  from
public.ims_object_header h
  WHERE
h.last_point_id= ANY(point_ids_older_than_one_hour)
 )
into object_ids_to_be_invalidated; -- this array has a size of 100

--current_last_point_ids will have a size of 100k
current_last_point_ids := ARRAY( SELECT
last_point_id
  from
public.ims_object_header h
 );
-- START OF PERFORMANCE BOTTLENECK
IF (array_length(current_last_point_ids, 1) > 0)
THEN
FOR i IN 0 .. array_upper(current_last_point_ids, 1)
LOOP
point_ids_older_than_one_hour = 
array_remove(point_ids_older_than_one_hour, current_last_point_ids[i]::bigint);
END LOOP;
END IF;
-- END OF PERFORMANCE BOTTLENECK
END;

The array manipulation part is the performance bottleneck. I am pretty sure 
that there is a better way of doing this, however I couldn't find one.
What I have is two tables, let's call them ims_point and ims_object_header. 
ims_object_header references some entries of ims_point in the column 
last_point_id.
Now I want to delete all entries from ims_point where the timestamp is older 
than one hour. The ids currently referenced by the table ims_object_header 
should be excluded from this deletion. Therefore I stored the ids in arrays 
and iterate over those arrays to exclude the referenced values from being 
deleted.

However, I am not sure if using an array for an operation like this is the 
best approach.

Can anyone give me some advice on how this could be improved.

Thanks in advance.


I think in this case (as in many other cases) pure SQL does the job much 
better than a procedural language:

DELETE FROM public.ims_point AS P
WHERE P.timestamp < m_before_one_hour
  AND NOT EXISTS (SELECT 1 FROM public.ims_object_header OH
                  WHERE OH.last_point_id = P.point_id);

Is that what you are trying to accomplish?

Regards,
Igor Neyman






Re: [PERFORM] Most efficient way of querying M 'related' tables where N out of M may contain the key

2015-08-21 Thread Stephane Bailliez
On Thu, Aug 20, 2015 at 8:19 PM, David G. Johnston 
david.g.johns...@gmail.com wrote:


 SELECT [...]
 FROM (SELECT reference_id, [...] FROM table_where_referenced_id_is_a_pk
 WHERE reference_id EXISTS/IN/JOIN)

 src
 LEFT JOIN type1 USING (reference_id)
 LEFT JOIN type2 USING (reference_id)
 [...]



there are no tables where reference_id is a pk, I could create one or do:
select reference_id from ( values (..), (...), (...)  )

the tricky part with the join (and where I was not clear about it in my
original description) is that a reference_id can match in multiple tables
(e.g. it can be a fk in type1 and type2), so it then becomes a bit harder to
collect all the common attributes and 'types' when doing joins like this.

For example, let's assume there is a group_id to be retrieved among all
tables as a common attribute:

if reference_id existed in only one table, I could do
coalesce(type1.group_id, ... type5.group_id) as group_id in the main select,
however that would not work in this case.

I could work around the common attributes however.

But for retrieving the types, what I really need to have as a return of
this query is data that allows me to partition the reference_id for each
type like:

type1 -> ref1, ref2, ref5
type2 -> ref1, ref3
type3 -> ref4, ref3

I guess I could try to return an array and fill it with case/when for each
table, eg. something like
ARRAY( CASE WHEN type1.id IS NOT NULL THEN 'type1' END, ... CASE WHEN
type5.id IS NOT NULL THEN 'type5' END)

and then collect all the non-null values in the code.
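A sketch of that idea which instead strips the NULLs from non-matching
branches on the database side, using array_remove (available in 9.3+; the
table and column names are from the example above):

```sql
SELECT reference_id,
       array_remove(ARRAY[
           CASE WHEN type1.id IS NOT NULL THEN 'type1' END,
           CASE WHEN type2.id IS NOT NULL THEN 'type2' END
           -- ... one CASE per type table ...
       ], NULL) AS matched_types
FROM src
LEFT JOIN type1 USING (reference_id)
LEFT JOIN type2 USING (reference_id);
```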




 Or consider whether PostgreSQL Inheritance would work - though basically
 it's a friendly API over the UNION ALL query you proposed.



The problem with PostgreSQL inheritance is that it would not play well with
the ORM and substantially complicates the implementation.


Thanks for all the ideas, that helps me a lot to brainstorm more.


[PERFORM] query not using GIN index

2015-08-21 Thread Guo, Yun
Hi,

We have a query on a column with a GIN index, but the query plan chooses not 
to use the index and does a seq scan instead, which is much slower

CREATE INDEX idx_access_grants_on_access_tokens ON access_grants USING gin 
(access_tokens);

explain analyze SELECT access_grants.* FROM access_grants  WHERE 
(access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']) ;
QUERY PLAN
--
 Limit  (cost=0.00..7.46 rows=1 width=157) (actual time=260.376..260.377 rows=1 
loops=1)
   ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985 width=157) 
(actual time=260.373..260.373 rows=1 loops=1)
 Filter: (access_tokens @> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
 Rows Removed by Filter: 796818
 Total runtime: 260.408 ms


We tested on a smaller table in the development region and it chooses to use 
the index there. However, on the production-size table it decides to ignore 
the index for unknown reasons.

Is the large number of tuples skewing the query planner’s decision, or is the 
index itself larger than the table, so it would decide to do a table scan?

Any suggestions are greatly appreciated!

Yun