Re: [PERFORM] query not using GIN index
Hi,

On 08/22/2015 03:55 AM, Guo, Yun wrote:
> Hi,
>
> We have a query on a column with a GIN index, but the query plan chooses
> not to use the index and does a seq scan instead, which is much slower.
>
> CREATE INDEX idx_access_grants_on_access_tokens
>     ON access_grants USING gin (access_tokens);
>
> explain analyze
> SELECT access_grants.* FROM access_grants
> WHERE (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']);
>
>                                   QUERY PLAN
> --------------------------------------------------------------------------
>  Limit  (cost=0.00..7.46 rows=1 width=157) (actual time=260.376..260.377 rows=1 loops=1)
>    ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985 width=157) (actual time=260.373..260.373 rows=1 loops=1)
>          Filter: (access_tokens @> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
>          Rows Removed by Filter: 796818
>  Total runtime: 260.408 ms

I find it very likely that the explain output actually comes from a slightly
different query, one including a LIMIT 1 clause.

That might easily be the problem here, because the optimizer expects the 3985
matches to be uniformly distributed in the table, so it thinks it'll scan just
a tiny fraction of the table (1/3985) until the first match. But it's quite
possible all the matching rows are at the end of the table, and the executor
actually has to scan the whole table. It's difficult to say without further
details of the table and how the data are generated.

> We tested on a smaller table in the development region and it chooses to
> use the index there. However, on the production-size table it decides to
> ignore the index for unknown reasons.

Please provide explain output from that table. It's difficult to say what's
different without seeing the details.

Also please provide the important details about the system (e.g. which
PostgreSQL version, how much RAM, the work_mem/shared_buffers settings and
so on).

> Is the large number of tuples skewing the query planner's decision, or is
> the index itself larger than the table, so that it decides to do a table
> scan?

What large number of tuples? Indexes are supposed to be more efficient the
larger the table is.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training Services
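A quick way to test the LIMIT theory above (this diagnostic is not part of the
original thread) is to compare the plan with and without the LIMIT, and with
sequential scans temporarily discouraged; the system details Tomas asks for
can also be gathered from SQL:

    -- same query without the LIMIT: the GIN index should look attractive again
    EXPLAIN ANALYZE
    SELECT access_grants.*
    FROM access_grants
    WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734'];

    -- diagnostic only: discourage seq scans for this session,
    -- then re-run the LIMIT 1 query
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT access_grants.*
    FROM access_grants
    WHERE access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']
    LIMIT 1;
    RESET enable_seqscan;

    -- system details requested above
    SELECT version();
    SHOW shared_buffers;
    SHOW work_mem;

If the forced-index plan is much faster, the cost model (and the
uniform-distribution assumption behind the LIMIT estimate) is the likely
culprit rather than the index itself.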
[PERFORM] Performance bottleneck due to array manipulation
Hey,

I have a very long running stored procedure, due to array manipulation inside
the procedure. The following block takes 13 seconds to finish.

BEGIN
    point_ids_older_than_one_hour := '{}';
    object_ids_to_be_invalidated := '{}';

    select ARRAY(SELECT point_id
                 from ONLY public.ims_point as p
                 where p.timestamp < m_before_one_hour)
    into point_ids_older_than_one_hour;    -- this array has a size of 20k

    select ARRAY(SELECT object_id
                 from public.ims_object_header h
                 WHERE h.last_point_id = ANY(point_ids_older_than_one_hour))
    into object_ids_to_be_invalidated;     -- this array has a size of 100

    -- current_last_point_ids will have a size of 100k
    current_last_point_ids := ARRAY(SELECT last_point_id
                                    from public.ims_object_header h);

    -- START OF PERFORMANCE BOTTLENECK
    IF (array_length(current_last_point_ids, 1) > 0) THEN
        FOR i IN 0 .. array_upper(current_last_point_ids, 1)
        LOOP
            point_ids_older_than_one_hour =
                array_remove(point_ids_older_than_one_hour,
                             current_last_point_ids[i]::bigint);
        END LOOP;
    END IF;
    -- END OF PERFORMANCE BOTTLENECK
END;

The array manipulation part is the performance bottleneck. I am pretty sure
there is a better way of doing this, but I couldn't find one.

What I have is two tables, let's call them ims_point and ims_object_header.
ims_object_header references some entries of ims_point in the column
last_point_id. Now I want to delete all entries from ims_point whose
timestamp is older than one hour, but the ids currently referenced by
ims_object_header should be excluded from this deletion. Therefore I stored
the ids in arrays and iterate over those arrays to exclude the referenced
values from being deleted. However, I'm not sure if using an array for an
operation like this is the best approach.

Can anyone give me some advice on how this could be improved? Thanks in
advance.
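For readers following the replies below, a minimal sketch of the two tables as
they are described above; the column types and constraints are assumptions,
since the original DDL is not shown in the thread:

    -- assumed shape of the tables discussed in this thread
    CREATE TABLE public.ims_point (
        point_id    bigint PRIMARY KEY,
        "timestamp" timestamptz NOT NULL
        -- other columns omitted
    );

    CREATE TABLE public.ims_object_header (
        object_id     bigint PRIMARY KEY,
        last_point_id bigint REFERENCES public.ims_point (point_id)
        -- other columns omitted
    );

The task, restated against this shape: delete old ims_point rows except those
still referenced through ims_object_header.last_point_id.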
Re: [PERFORM] Most efficient way of querying M 'related' tables where N out of M may contain the key
On Fri, Aug 21, 2015 at 8:07 AM, Stephane Bailliez sbaill...@gmail.com wrote:

> On Thu, Aug 20, 2015 at 8:19 PM, David G. Johnston
> david.g.johns...@gmail.com wrote:
>
>> SELECT [...]
>> FROM (SELECT reference_id, [...]
>>       FROM table_where_referenced_id_is_a_pk
>>       WHERE reference_id EXISTS/IN/JOIN) src
>> LEFT JOIN type1 USING (reference_id)
>> LEFT JOIN type2 USING (reference_id)
>> [...]
>>
>> Place ^ in a CTE named (find_all)
>
> There are no tables where reference_id is a pk. I could create one, or do:
>
>     select reference_id from ( values (..), (...), (...) )
>
> The tricky part with the join (and where I was not clear about it in my
> original description) is that a reference_id can match in multiple tables
> (e.g. it can be a fk in type1 and type2), so it then becomes a bit harder
> to collect all the common attributes and 'types' when doing joins like
> this. For example, let's assume there is a group_id to be retrieved from
> all tables as a common attribute: if reference_id existed in only one
> table, I could do coalesce(type1.group_id, ..., type5.group_id) as group_id
> in the main select; however, that would not work in this case.

WITH find_all (reference_id, type_identifier, type_id) AS ( ... )
SELECT type_identifier,
       array_agg(reference_id),
       array_agg(type_id)
FROM find_all
WHERE type_identifier IS NOT NULL
GROUP BY type_identifier

find_all will return at least one row per reference_id (possibly with NULL
type columns if no matches are present), and will return multiple rows if
more than one type matches. You can use array_agg as shown, or play around
with custom composite types, or even build a JSON document.

David J.
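One way the pieces above could fit together (a sketch only: the type1/type2
tables, their id columns, and the literal reference values are placeholders,
and the CTE body is written as a UNION ALL - the shape mentioned later in the
thread - so that a reference matching several type tables yields one row per
match):

    WITH src (reference_id) AS (
        VALUES ('ref1'), ('ref2'), ('ref3')
    ),
    find_all (reference_id, type_identifier, type_id) AS (
        SELECT reference_id, 'type1', t1.id FROM src JOIN type1 t1 USING (reference_id)
        UNION ALL
        SELECT reference_id, 'type2', t2.id FROM src JOIN type2 t2 USING (reference_id)
        -- ... one branch per type table ...
    )
    SELECT type_identifier,
           array_agg(reference_id) AS reference_ids,
           array_agg(type_id)      AS type_ids
    FROM find_all
    GROUP BY type_identifier;

Each output row then gives one type together with the array of reference_ids
that matched it, which is the partitioning described below.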
Re: [PERFORM] Performance bottleneck due to array manipulation
"Genc, Ömer" oemer.g...@iais.fraunhofer.de writes:
> i have a very long running stored procedure, due to array manipulation in a
> stored procedure. The following procedure takes 13 seconds to finish.
> [...]
>     -- START OF PERFORMANCE BOTTLENECK
>     IF (array_length(current_last_point_ids, 1) > 0) THEN
>         FOR i IN 0 .. array_upper(current_last_point_ids, 1)
>         LOOP
>             point_ids_older_than_one_hour =
>                 array_remove(point_ids_older_than_one_hour,
>                              current_last_point_ids[i]::bigint);
>         END LOOP;
>     END IF;
>     -- END OF PERFORMANCE BOTTLENECK
> [...]

Well, in the first place, this is the cardinal sin of working with SQL
databases: doing procedurally that which should be done relationally. Forget
the arrays entirely and use EXCEPT, ie

    SELECT point_id FROM ... EXCEPT SELECT last_point_id FROM ...

Or maybe you want EXCEPT ALL, depending on whether duplicates are possible
and what you want to do with them if so.

Having said that, the way you have this is necessarily O(N^2), because
array_remove has to search the entire array on each call, and then
reconstruct the entire array less any removed element(s). The new expanded
array infrastructure in 9.5 could perhaps reduce some of the constant factor,
but the array search loop remains, so it would still be O(N^2); and anyway
array_remove has not been taught about expanded arrays (which means this
example is probably even slower in 9.5 :-().

If you were using integers, you could possibly get decent performance from
contrib/intarray's int[] - int[] operator (which I think does a sort and
merge internally); but I gather that these are bigints, so that won't work.

			regards, tom lane
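Applied to the tables from the original post, the EXCEPT rewrite might look
roughly like this (a sketch; m_before_one_hour is the cutoff variable from
the procedure, and whether plain EXCEPT or EXCEPT ALL is wanted depends on
duplicates, as noted above):

    SELECT p.point_id
    FROM ONLY public.ims_point AS p
    WHERE p.timestamp < m_before_one_hour
    EXCEPT
    SELECT h.last_point_id
    FROM public.ims_object_header AS h;

or, folded into the delete the poster is ultimately after:

    DELETE FROM ONLY public.ims_point
    WHERE point_id IN (
        SELECT p.point_id
        FROM ONLY public.ims_point AS p
        WHERE p.timestamp < m_before_one_hour
        EXCEPT
        SELECT h.last_point_id
        FROM public.ims_object_header AS h
    );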
Re: [PERFORM] Performance bottleneck due to array manipulation
Hello,

On Fri, Aug 21, 2015 at 2:48 PM, Genc, Ömer oemer.g...@iais.fraunhofer.de wrote:
> Now I want to delete all entries from ims_point where the timestamp is
> older than one hour. The ids currently referenced by the table
> ims_object_header should be excluded from this deletion.

delete from public.ims_point ip
where ip.timestamp < current_timestamp - interval '1 hour'
  and not exists (
      select 'reference exists'
      from public.ims_object_header ioh
      where ioh.last_point_id = ip.point_id
  );

Does this work for you?
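A small follow-up note, not stated in the thread: with roughly 100k rows in
ims_object_header, an index on the referencing column may help both this NOT
EXISTS probe and any foreign-key checks triggered by the delete. The index
name below is made up for illustration:

    CREATE INDEX ims_object_header_last_point_id_idx
        ON public.ims_object_header (last_point_id);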
Re: [PERFORM] Performance bottleneck due to array manipulation
From: pgsql-performance-ow...@postgresql.org
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Genc, Ömer
Sent: Friday, August 21, 2015 8:49 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Performance bottleneck due to array manipulation

> Hey,
>
> i have a very long running stored procedure, due to array manipulation in a
> stored procedure. The following procedure takes 13 seconds to finish.
> [...]
> The array manipulation part is the performance bottleneck. I am pretty
> sure there is a better way of doing this, but I couldn't find one.
> [...]
> Can anyone give me some advice how this could be enhanced. Thanks in
> advance.

I think in this case (as in many other cases) pure SQL does the job much
better than a procedural language:

DELETE FROM public.ims_point AS P
WHERE P.timestamp < m_before_one_hour
  AND NOT EXISTS (SELECT 1
                  FROM public.ims_object_header OH
                  WHERE OH.last_point_id = P.point_id);

Is that what you are trying to accomplish?

Regards,
Igor Neyman
Re: [PERFORM] Most efficient way of querying M 'related' tables where N out of M may contain the key
On Thu, Aug 20, 2015 at 8:19 PM, David G. Johnston
david.g.johns...@gmail.com wrote:

> SELECT [...]
> FROM (SELECT reference_id, [...]
>       FROM table_where_referenced_id_is_a_pk
>       WHERE reference_id EXISTS/IN/JOIN) src
> LEFT JOIN type1 USING (reference_id)
> LEFT JOIN type2 USING (reference_id)
> [...]

There are no tables where reference_id is a pk. I could create one, or do:

    select reference_id from ( values (..), (...), (...) )

The tricky part with the join (and where I was not clear about it in my
original description) is that a reference_id can match in multiple tables
(e.g. it can be a fk in type1 and type2), so it then becomes a bit harder to
collect all the common attributes and 'types' when doing joins like this.

For example, let's assume there is a group_id to be retrieved from all tables
as a common attribute: if reference_id existed in only one table, I could do
coalesce(type1.group_id, ..., type5.group_id) as group_id in the main select;
however, that would not work in this case. I could work around the common
attributes, though. But for retrieving the types, what I really need this
query to return is data that allows me to partition the reference_ids for
each type, like:

    type1 -> ref1, ref2, ref5
    type2 -> ref1, ref3
    type3 -> ref4, ref3

I guess I could try to return an array and fill it with case/when for each
table, e.g. something like:

    ARRAY(
        CASE WHEN type1.id IS NOT NULL THEN 'type1' END,
        ...
        CASE WHEN type5.id IS NOT NULL THEN 'type5' END)

and then collect all the non-null values in the code.

> Or consider whether PostgreSQL Inheritance would work - though basically
> it's a friendly API over the UNION ALL query you proposed.

The problem with PostgreSQL inheritance is that it would not play well with
the ORM and substantially complicates the implementation.

Thanks for all the ideas, that helps me a lot to brainstorm more.
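The case/when idea sketched above can be made concrete with an ARRAY[]
constructor plus array_remove(..., NULL) to drop the non-matching entries
(a sketch only; type1/type2/type5, their id columns, and src are placeholders
as in the earlier messages):

    SELECT src.reference_id,
           array_remove(ARRAY[
               CASE WHEN type1.id IS NOT NULL THEN 'type1' END,
               CASE WHEN type2.id IS NOT NULL THEN 'type2' END,
               CASE WHEN type5.id IS NOT NULL THEN 'type5' END
           ], NULL) AS matched_types
    FROM src
    LEFT JOIN type1 USING (reference_id)
    LEFT JOIN type2 USING (reference_id)
    LEFT JOIN type5 USING (reference_id);

Because array_remove() compares with IS NOT DISTINCT FROM semantics, passing
NULL strips the NULL entries, leaving only the labels of the tables that
actually matched; that avoids collecting the non-null values in application
code.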
[PERFORM] query not using GIN index
Hi,

We have a query on a column with a GIN index, but the query plan chooses not
to use the index and does a seq scan instead, which is much slower.

CREATE INDEX idx_access_grants_on_access_tokens
    ON access_grants USING gin (access_tokens);

explain analyze
SELECT access_grants.* FROM access_grants
WHERE (access_tokens @> ARRAY['124e5a1f9de325fc176a7c89152ac734']);

                                  QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=0.00..7.46 rows=1 width=157) (actual time=260.376..260.377 rows=1 loops=1)
   ->  Seq Scan on access_grants  (cost=0.00..29718.03 rows=3985 width=157) (actual time=260.373..260.373 rows=1 loops=1)
         Filter: (access_tokens @> '{124e5a1f9de325fc176a7c89152ac734}'::text[])
         Rows Removed by Filter: 796818
 Total runtime: 260.408 ms

We tested on a smaller table in the development region and it chooses to use
the index there. However, on the production-size table it decides to ignore
the index for unknown reasons.

Is the large number of tuples skewing the query planner's decision, or is the
index itself larger than the table, so that it decides to do a table scan?

Any suggestions are greatly appreciated!

Yun