Fw: [PERFORM] Getting rid of a seq scan in query on a large table

2011-06-27 Thread Denis de Bernardy




- Forwarded Message -
From: Denis de Bernardy ddeberna...@yahoo.com
To: Jens Hoffrichter jens.hoffrich...@gmail.com
Sent: Tuesday, June 28, 2011 12:59 AM
Subject: Re: [PERFORM] Getting rid of a seq scan in query on a large table


 Hash Cond: (posts.poster_id = posters.poster_id)

                      ->  Seq Scan on posts  (cost=0.00..11862.12 rows=112312 
 width=24) (actual time=0.019..60.092 rows=112312 loops=1)


Unless I am mistaken, you have very few poster ids in there (the estimated and 
actual row counts are equal, so the scan returns essentially every row in posts). 
The Postgres planner identifies this and just seq scans the whole thing instead 
of bothering to randomly access the rows one by one through the index. This looks 
like a case where you actually do not want an index scan -- using one would be slower.
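
If you want to double-check that on your own data, a minimal sketch (enable_seqscan 
is a standard planner setting; the query is a trimmed-down version of the one quoted 
further down) is to discourage the seq scan for a single transaction and compare the 
EXPLAIN ANALYZE timings:

BEGIN;
SET LOCAL enable_seqscan = off;  -- affects this transaction only
EXPLAIN ANALYZE
SELECT threads.internal_thread_id
FROM threads
JOIN posts ON threads.internal_thread_id = posts.internal_thread_id
JOIN posters ON posts.poster_id = posters.poster_id
ORDER BY posts.timestamp DESC;
ROLLBACK;

If the index-driven plan comes out slower, the planner's choice stands.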


D








From: Jens Hoffrichter jens.hoffrich...@gmail.com
To: pgsql-performance@postgresql.org
Sent: Monday, June 27, 2011 2:46 PM
Subject: [PERFORM] Getting rid of a seq scan in query on a large table


Hi everyone,


I'm having trouble getting rid of a sequential scan on a table with roughly 
120k entries in it. Creating an index on the particular column that triggers 
the sequential scan doesn't change anything, and VACUUM and ANALYZE have been 
run on the table.


The table in question has the following definition:


       Column        |           Type           |                        Modifiers
---------------------+--------------------------+----------------------------------------------------------
 post_id            | bigint                   | not null default 
nextval('posts_post_id_seq'::regclass)
 forum_id           | bigint                   | not null
 threadlink         | character varying(255)   | not null
 timestamp          | timestamp with time zone | not null
 poster_id          | bigint                   |
 thread_id          | bigint                   | not null
 subject            | text                     | not null
 text               | text                     | not null
 postername         | character varying(255)   |
 internal_post_id   | bigint                   | not null default 
nextval('posts_internal_post_id_seq'::regclass)
 internal_thread_id | bigint                   |
Indexes:
    posts_pkey PRIMARY KEY, btree (internal_post_id)
    posts_forum_id_key UNIQUE, btree (forum_id, post_id)
    idx_internal_thread_id btree (internal_thread_id)
    idx_posts_poster_id btree (poster_id)
Foreign-key constraints:
    posts_forum_id_fkey FOREIGN KEY (forum_id) REFERENCES forums(forum_id)
    posts_internal_thread_id_fkey FOREIGN KEY (internal_thread_id) 
REFERENCES threads(internal_thread_id)
    posts_poster_id_fkey FOREIGN KEY (poster_id) REFERENCES 
posters(poster_id)


The query is this:


SELECT threads.internal_thread_id AS threads_internal_thread_id,
       threads.forum_id AS threads_forum_id,
       threads.thread_id AS threads_thread_id,
       threads.title AS threads_title,
       threads.poster_id AS threads_poster_id,
       threads.postername AS threads_postername,
       threads.category AS threads_category,
       threads.posttype AS threads_posttype
FROM threads
JOIN posts ON threads.internal_thread_id = posts.internal_thread_id
JOIN posters ON posts.poster_id = posters.poster_id
JOIN posters_groups AS posters_groups_1 ON posters.poster_id = posters_groups_1.poster_id
JOIN groups ON groups.group_id = posters_groups_1.group_id
WHERE groups.group_id = 4
ORDER BY posts.timestamp DESC;


The query plan (with an explain analyze) gives me the following:


                                                                              
                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=13995.93..14006.63 rows=4279 width=108) (actual 
time=79.927..79.947 rows=165 loops=1)
   Sort Key: posts.timestamp
   Sort Method:  quicksort  Memory: 50kB
   ->  Nested Loop  (cost=6.97..13737.84 rows=4279 width=108) (actual 
time=0.605..79.693 rows=165 loops=1)
         ->  Seq Scan on groups  (cost=0.00..1.05 rows=1 width=8) (actual 
time=0.013..0.014 rows=1 loops=1)
               Filter: (group_id = 4)
         ->  Nested Loop  (cost=6.97..13694.00 rows=4279 width=116) (actual 
time=0.587..79.616 rows=165 loops=1)
               ->  Hash Join  (cost=6.97..12343.10 rows=4279 width=24) 
(actual time=0.568..78.230 rows=165 loops=1)
                     Hash Cond: (posts.poster_id = posters.poster_id)
                     ->  Seq Scan on posts  (cost=0.00..11862.12 rows=112312 
width=24) (actual time=0.019..60.092 rows=112312 loops=1)
                     ->  Hash  (cost=6.79..6.79 rows=14 width=24) (actual 
time=0.101..0.101 rows=14 loops=1)
                           ->  Hash Join  (cost=2.14..6.79 rows=14 width=24

Re: [PERFORM] Why query takes soo much time

2011-05-16 Thread Denis de Bernardy
[big nestloop with a huge number of rows]

You're in an edge case, and I doubt you'll get things to run much faster: you 
want the last 1k rows out of an 18M row result set... It will be slow no matter 
what you do.

What the plan currently does is step through those 18M rows one by one until it 
has skipped past the offset and returned the 1k requested rows. Without the offset, 
the plan is absolutely correct (and quite fast, I take it). With the enormous 
offset, it's a different story, as you've noted.

An alternative plan could have been to hash join the tables together, sort the 
result set, and apply the limit/offset on the resulting set. You can probably 
force the planner to do so by rewriting your statement with a WITH clause (CTE), 
too:

EXPLAIN ANALYZE
WITH rows AS (
    SELECT c.clause, s.subject, s.object, s.verb, s.subject_type, s.object_type,
           s.doc_id, s.svo_id
    FROM clause2 c
    INNER JOIN svo2 s ON (c.clause_id = s.clause_id
                      AND c.source_id = s.doc_id
                      AND c.sentence_id = s.sentence_id)
    INNER JOIN page_content p ON (s.doc_id = p.crawled_page_id)
)
SELECT *
FROM rows
ORDER BY svo_id LIMIT 1000 OFFSET 17929000;


I have my doubts that it'll make much of a difference, though: you'll still be 
extracting the last 1k rows out of 18M.

D


Re: [PERFORM] How to avoid seq scans for joins between union-all views (test case included)

2011-05-13 Thread Denis de Bernardy
I might have misread, but:

 select * from connections where locked_by <> 4711
 union all
 select * from connections_locked where locked_by = 4711;


The first part will result in a seq scan irrespective of indexes (the <> predicate 
matches nearly every row), and the second part has no index on locked_by. The best 
you can do is to eliminate the seq scan on the second part by adding the missing 
index on locked_by.
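
Something along these lines should do (the index name is just illustrative):

create index connections_locked_locked_by_idx on connections_locked (locked_by);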

That said, note that index usage depends on your data distribution: postgres 
may identify that it'll read most/all of the table anyway, and opt to do a 
(cheaper) seq scan instead.
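
A quick way to eyeball that distribution on the test data below (a sketch, not part 
of the original test case):

select locked_by, count(*)
from connections_locked
group by locked_by
order by count(*) desc
limit 10;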

D


- Original Message -
 From: Fredrik Widlert fredrik.widl...@digpro.se
 To: pgsql-performance@postgresql.org
 Cc: 
 Sent: Friday, May 13, 2011 1:55 PM
 Subject: [PERFORM] How to avoid seq scans for joins between union-all views 
 (test case included)
 
 Hi everyone
 
 We have recently started to port an application from Oracle to PostgreSQL.
 So far, we are amazed with how great most things work.
 
 However, we have run into performance problems in one type of query which
 is quite common in our application. We have created a (simplified)
 reproducible test case which (hopefully!) creates all necessary tables
 and data to show the problem.
 
 Plain-text description of the data model in the test case:
 
 We have a set of objects (like electrical cables), each having
 two nodes in the table connections (think of these two rows together
 as an edge in a graph).
 
 Another table connections_locked contains rows for some of
 the same objects, which are locked by a long transaction.
 
 The view connections_v performs a union all of the rows from
 connections which are not modified in the current long
 transaction with the rows from connections_locked which
 are modified in the current long transaction.
 
 Goal:
 Given an object id, we want to find all neighbors for this
 object (that is, objects which share a node with this object).
 
 Problem:
 We think that our query used to find neighbors would benefit
 greatly from using some of our indexes, but we fail to make it
 do so.
 
 
 Over to the actual test case:
 
 --
 
 -- Tested on (from select version ()):
 -- PostgreSQL 9.0.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
 -- PostgreSQL 9.1beta1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
 
 -- Ubuntu 11.04, uname -a output:
 -- Linux hostname 2.6.38-8-generic-pae #42-Ubuntu SMP Mon Apr 11
 05:17:09 UTC 2011 i686 i686 i386 GNU/Linux
 -- Processor: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
 -- Drive: Intel X25-M SSD
 
 
 drop table if exists connections cascade;
 drop table if exists connections_locked cascade;
 
 
 create table connections (
   con_id serial primary key,
   locked_by integer not null,
   obj_id integer not null,
   node integer not null
 );
 
 
 -- create test nodes, two per obj_id
 insert into connections (locked_by, obj_id, node)
 select 0, n/2, 1000 + (n + 1)/2 from generate_series (1,500000) as n;
 
 create index connections_node_idx on connections (node);
 create index connections_obj_idx on connections (obj_id);
 vacuum analyze connections;
 
 
 
 create table connections_locked (
   con_id integer not null,
   locked_by integer not null,
   obj_id integer not null,
   node integer not null,
   constraint locked_pk primary key (con_id, locked_by)
 );
 
 -- mark a few of the objects as locked by a long transaction
 insert into connections_locked (con_id, locked_by, obj_id, node)
 select n, 1 + n/50, n/2, 1000 + (n + 1)/2 from generate_series (1,25000) as n;
 
 create index connections_locked_node_idx on connections_locked (node);
 create index connections_locked_obj_idx on connections_locked (obj_id);
 vacuum analyze connections_locked;
 
 
 -- Create a view showing the world as seen by long transaction 4711.
 -- In real life, this uses a session variable instead of a hard-coded value.
 create or replace view connections_v as
 select * from connections where locked_by <> 4711
 union all
 select * from connections_locked where locked_by = 4711;
 
 
 -- This is the query we are trying to optimize.
 -- We expect this to be able to use our indexes, but instead we get
 -- sequential scans.
 explain analyze
 select
      con2.obj_id
 from
      connections_v con1,
      connections_v con2
 where
      con1.obj_id = 17 and
      con2.node = con1.node
 ;
 
 
 -- Output:
 -- Hash Join  (cost=16.69..16368.89 rows=7501 width=4) (actual
 time=0.096..778.830 rows=4 loops=1)
 --   Hash Cond: ("*SELECT* 1".node = "*SELECT* 1".node)
 --   ->  Append  (cost=0.00..14402.00 rows=500050 width=8) (actual
 time=0.011..640.163 rows=500000 loops=1)
 --         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..13953.00
 rows=500000 width=8) (actual time=0.011..430.645 rows=500000 loops=1)
 --               ->  Seq Scan on connections  (cost=0.00..8953.00
 rows=500000 width=16) (actual time=0.009..178.535 rows=500000 loops=1)
 --                   

[PERFORM] row estimate very wrong for array type

2011-05-04 Thread Denis de Bernardy
Hi,

I'm running into erroneous row estimates when using an array type, and I'm 
running out of ideas on how to steer postgres into the right direction... I've 
tried setting statistics to 1000, vacuuming and analyzing over and over, 
rewriting the query differently... to no avail.

The table looks like this:

create table test (
id serial primary key,
sortcol float unique,
intarr1 int[] not null default '{}',
intarr2 int[] not null default '{}'
);

It contains 40k rows of random data which, for the sake of the queries that 
follow, aren't too far from what they'd contain in production.
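
For reference, a minimal sketch of how comparable test data could be generated (not 
the exact script used; the values mirror the group-by output below):

insert into test (sortcol, intarr1, intarr2)
select g, '{0}'::int[], '{2,3}'::int[]   -- g keeps sortcol unique; arrays match production-like values
from generate_series(1, 40000) as g;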


# select intarr1, intarr2, count(*) from test group by intarr1, intarr2;


 intarr1 | intarr2 | count 
---------+---------+-------
 {0}     | {2,3}   | 40018


The stats seem right:

# select * from pg_stats where tablename = 'test' and attname = 'intarr1';

 schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation 
------------+-----------+---------+-----------+-----------+-----------+------------+------------------+-------------------+------------------+-------------
 test       | test      | intarr1 | f         |         0 |        25 |          1 | {{0}}            | {1}               |                  |           1


A query without any condition on the array results in reasonable estimates and 
the proper use of the index on the sort column:

# explain analyze select * from test order by sortcol limit 10;

                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..3.00 rows=10 width=217) (actual time=0.098..0.109 rows=10 
loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12019.08 
rows=40018 width=217) (actual time=0.096..0.105 rows=10 loops=1)
 Total runtime: 0.200 ms


After adding a condition on the array, however, the row estimate is completely 
off:


# explain analyze select * from test where intarr1 && '{0,1}' order 
by sortcol limit 10;

                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..605.96 rows=10 width=217) (actual time=0.131..0.175 rows=10 
loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..12119.13 rows=200 
width=217) (actual time=0.129..0.169 rows=10 loops=1)
         Filter: (intarr1 && '{0,1}'::integer[])


When there's a condition on both arrays, this then leads to a seq scan:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && 
'{2,4}' order by sortcol limit 10;
                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Limit  (cost=3241.28..3241.29 rows=1 width=217) (actual time=88.260..88.265 
rows=10 loops=1)
   ->  Sort  (cost=3241.28..3241.29 rows=1 width=217) (actual 
time=88.258..88.260 rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Seq Scan on test  (cost=0.00..3241.27 rows=1 width=217) (actual 
time=0.169..68.785 rows=40018 loops=1)
               Filter: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && 
'{2,4}'::integer[]))
 Total runtime: 88.339 ms


Adding a GIN index on the two arrays results in similar ugliness:

# explain analyze select * from test where intarr1 && '{0,1}' and intarr2 && 
'{2,4}' order by sortcol limit 10;
                                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=8.29..8.29 rows=1 width=217) (actual time=56.122..56.127 rows=10 
loops=1)
   ->  Sort  (cost=8.29..8.29 rows=1 width=217) (actual time=56.120..56.122 
rows=10 loops=1)
         Sort Key: sortcol
         Sort Method:  top-N heapsort  Memory: 27kB
         ->  Bitmap Heap Scan on test  (cost=4.26..8.28 rows=1 width=217) 
(actual time=19.635..39.824 rows=40018 loops=1)
               Recheck Cond: ((intarr1 && '{0,1}'::integer[]) AND (intarr2 && 
'{2,4}'::integer[]))
               ->  Bitmap Index Scan on test_intarr1_intarr2_idx  
(cost=0.00..4.26 rows=1 width=0) (actual time=19.387..19.387 rows=40018 loops=1)
                     Index Cond: ((intarr1 && '{0,1}'::integer[]) AND 
(intarr2 && '{2,4}'::integer[]))
 Total runtime: 56.210 ms


Might this be a bug in the operator's selectivity, or am I doing something wrong?

Re: [PERFORM] row estimate very wrong for array type

2011-05-04 Thread Denis de Bernardy
That kind of limits the usefulness of aggregating hierarchical dependencies 
into array columns to avoid enormous join statements. :-|


Re your todo item you mention in this thread:

http://archives.postgresql.org/pgsql-hackers/2010-05/msg01864.php

My C is rusty, but I might have enough understanding of the PG internals to 
massage pre-existing code... Feel free to message me off list with pointers if 
you think I might be able to help.



- Original Message -
 From: Tom Lane t...@sss.pgh.pa.us
 To: Denis de Bernardy ddeberna...@yahoo.com
 Cc: pgsql-performance@postgresql.org pgsql-performance@postgresql.org
 Sent: Wednesday, May 4, 2011 4:12 PM
 Subject: Re: [PERFORM] row estimate very wrong for array type 
 
 Denis de Bernardy ddeberna...@yahoo.com writes:
 [ estimates for array && suck ]
  Might this be a bug in the operator's selectivity, or am I doing 
 something wrong?
 
 Array && uses areasel() which is only a stub :-(
 
 In the particular case here it'd be possible to get decent answers
 just by trying the operator against all the MCV-list entries, but it's
 unlikely that that would fix things for enough people to be worth the
 trouble.  Really you'd need to maintain statistics about the element
 values appearing in the array column in order to get useful estimates
 for && queries.  Jan Urbanski did something similar for tsvector columns
 a year or two ago, but nobody's gotten around to doing it for array
 columns.
 
             regards, tom lane
 




Re: [PERFORM] row estimate very wrong for array type

2011-05-04 Thread Denis de Bernardy
 - Original Message -

  From: Tom Lane t...@sss.pgh.pa.us
  To: Denis de Bernardy ddeberna...@yahoo.com
  Cc: pgsql-performance@postgresql.org 
 pgsql-performance@postgresql.org
  Sent: Wednesday, May 4, 2011 4:12 PM
  Subject: Re: [PERFORM] row estimate very wrong for array type 
 
  Array && uses areasel() which is only a stub :-(


On a separate note, in case this ever gets found via google, I managed to force 
the use of the correct index in the meantime:

# explain analyze select * from test where (0 = any(intarr1) or 1 = 
any(intarr1)) and (2 = any(intarr2) or 4 = any(intarr2)) order by sortcol limit 
10;
                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..385.16 rows=10 width=217) (actual time=0.107..0.151 rows=10 
loops=1)
   ->  Index Scan using test_sortcol_key on test  (cost=0.00..14019.98 rows=364 
width=217) (actual time=0.106..0.146 rows=10 loops=1)
         Filter: (((0 = ANY (intarr1)) OR (1 = ANY (intarr1))) AND ((2 = ANY 
(intarr2)) OR (4 = ANY (intarr2))))
 Total runtime: 0.214 ms


I guess I'm in for maintaining counts and rewriting queries as needed. :-(

D



Re: [PERFORM] amazon ec2

2011-05-04 Thread Denis de Bernardy
- Original Message -

 From: Josh Berkus j...@agliodbs.com
 To: postgres performance list pgsql-performance@postgresql.org
 Cc: 
 Sent: Thursday, May 5, 2011 2:02 AM
 Subject: Re: [PERFORM] amazon ec2
 So memcached basically replaces the filesystem?
 
 That sounds cool, but I'm wondering if it's actually a performance
 speedup.  Seems like it would only be a benefit for single-row lookups;
 any large reads would be a mess.


I've never tested with pgsql, but with mysql it makes a *huge* difference when 
you're pulling data repeatedly. Multi-row lookups can be cached too:

$key  = md5($query . '--' . serialize($args));
$rows = $cache->get($key);

if (!$rows) {
  $rows = run_query($query, $args); // illustrative helper: run the query against the DB
  $cache->set($key, $rows, 3 * 3600); // cache for a few hours
}

This is true even with mysql's caching features turned on. You spare the DB 
from running identical queries that get repeated over and over. Memcache lets you 
pull those straight from memory, leaving the DB server free to handle new 
queries exclusively.

