Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Matthew Wakeling

On Wed, 6 Jan 2010, Dmitri Girski wrote:

On the other hand, if I use IP addresses this should not attract any possible
issues with DNS, right?


Not true. It is likely that the server program you are connecting to will 
perform a reverse DNS lookup to work out who the client is, for logging or 
authentication purposes.


Matthew

--
To err is human; to really louse things up requires root
privileges. -- Alexander Pope, slightly paraphrased



Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Robert Haas
On Tue, Jan 5, 2010 at 11:50 PM, Craig Ringer
cr...@postnewspapers.com.au wrote:
 Wireshark is your friend.

+1.  I think if you put a packet sniffer on the interface you are
connecting from it will become clear what the problem is in short
order.

...Robert



Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Dave Crooke
The fact that the delays are clustered at (3 + 0.2 n) seconds, rather than a
distributed range, strongly indicates a timeout and not (directly) a
resource issue.

3 seconds is too fast for a timeout on almost any DNS operation, unless it
has been modified, so I'd suspect it's the TCP layer, e.g. perhaps the SYN
packet goes awol and it has to retry.

I'd second the vote for investigation with a packet sniffing tool
(Wireshark, tcpdump, etc)

Cheers
Dave

On Mon, Jan 4, 2010 at 8:12 PM, Dmitri Girski mite...@gmail.com wrote:

 Hi everybody,

 I am running a PostgreSQL server 8.3.5 with a pretty much standard config.

 The web application server which runs Apache 1.3/PHP2.9  has an
 intermittent problem:
 pg_connect takes exactly 3.0 seconds. The usual connection time is 0.0045.
 The long request happens at approximate rate 1:100.

 I turned on logs on postgres server side, and there is
 nothing suspicious for me there. When a connection request comes, it is
 being served without any delay.

 Could anyone point me to the direction in which I should investigate this
 problem further?
 Thank you in advance!


 PS The hardware is: Dell SC1435/4Gb/2x2.0GHz/Gentoo Linux.
 The database & web servers are in the 2 local subnets.


 Dmitri.




Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Craig James

Dave Crooke wrote:
The fact that the delays are clustered at (3 + 0.2 n) seconds, rather 
than a distributed range, strongly indicates a timeout and not 
(directly) a resource issue.


3 seconds is too fast for a timeout on almost any DNS operation, unless 
it has been modified, so I'd suspect it's the TCP layer, e.g. perhaps 
the SYN packet goes awol and it has to retry.


I'd second the vote for investigation with a packet sniffing tool 
(Wireshark, tcpdump, etc)


If you have a PC (Windows), PingPlotter is a remarkable and simple tool to use 
that quickly identifies problems, and gives results that are convincing when 
you show them to your network admin.  Wireshark and tcpdump have a pretty steep 
learning curve and are overkill if your problem is simple.

Craig



[PERFORM] Digesting explain analyze

2010-01-06 Thread Jesper Krogh
Hi.

I have a table with somewhere in the magnitude of 100,000,000
rows, and all rows are tuples of this form:

(id1,id2,evalue);

Then I'd like to speed up a query like this:

explain analyze select id from table where id1 = 2067 or id2 = 2067 order
by evalue asc limit 100;
  QUERY PLAN
---
 Limit  (cost=1423.28..1423.28 rows=100 width=12) (actual time=2.565..2.567 rows=100 loops=1)
   ->  Sort  (cost=1423.28..1424.54 rows=505 width=12) (actual time=2.560..2.560 rows=100 loops=1)
         Sort Key: evalue
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on table  (cost=16.58..1420.75 rows=505 width=12) (actual time=0.709..1.752 rows=450 loops=1)
               Recheck Cond: ((id1 = 2067) OR (id2 = 2067))
               ->  BitmapOr  (cost=16.58..16.58 rows=506 width=0) (actual time=0.676..0.676 rows=0 loops=1)
                     ->  Bitmap Index Scan on id1_evalue_idx  (cost=0.00..11.44 rows=423 width=0) (actual time=0.599..0.599 rows=450 loops=1)
                           Index Cond: (id1_id = 2067)
                     ->  Bitmap Index Scan on id2_evalue_idx  (cost=0.00..4.89 rows=83 width=0) (actual time=0.070..0.070 rows=1 loops=1)
                           Index Cond: (id2_id = 2067)
 Total runtime: 2.642 ms
(12 rows)


What I had expected was to see the Bitmap Index Scan on id1_evalue_idx
chop it off at the limit. The inner sets are on average 3,000 rows for
both id1 and id2, and a typical limit would be 100, so if I could convince
postgresql to not fetch all of them then I would reduce the set retrieved
by a factor of around 60. The dataset is quite large, so a random query is
not very likely to hit the same part of the dataset again, so there is
going to be a fair amount of going to disk.

It would also mean that using it in a for loop in a plpgsql stored
procedure would not get any benefit from the CURSOR effect?

I actually tried to stuff id1,id2 into an array and do a GiST index on
(array, evalue), hoping that it would directly satisfy this query.. but it
used the GiST index to fetch the rows and then did the post-sorting and
limit on the set.

What it boils down to is more or less:

Does a bitmap index scan support ordering and limit?
Does a multicolumn GiST index support ordering and limit?

Have I missed anything that can hugely speed up fetching these (typically
100) rows from the database?

-- 
Jesper





Re: [PERFORM] Digesting explain analyze

2010-01-06 Thread Ron Mayer
Jesper Krogh wrote:
 I have a table with somewhere in the magnitude of 100,000,000
 rows, and all rows are tuples of this form:
 
 (id1,id2,evalue);
 
 Then I'd like to speed up a query like this:
 
 explain analyze select id from table where id1 = 2067 or id2 = 2067 order
 by evalue asc limit 100;
 
 ...The inner sets are on average 3,000 rows for
 both id1 and id2, and a typical limit would be 100, so if I could convince
 postgresql to not fetch all of them then I would reduce the set retrieved
 by a factor of around 60. The dataset is quite large, so a random query is
 not very likely to hit the same part of the dataset again, so there is
 going to be a fair amount of going to disk.

If disk seeks are killing you a kinda crazy idea would be to
duplicate the table - clustering one by (id1) and
the other one by an index on (id2) and unioning the
results of each.

Since each of these duplicates of the table will be clustered
by the column you're querying it on, it should just take one
seek in each table.

Then your query could be something like

  select * from (
select * from t1 where id1=2067 order by evalue limit 100
union
select * from t2 where id2=2067 order by evalue limit 100
  ) as foo order by evalue limit 100;

Hmm..  and I wonder, if you put evalue into the clustering criteria too
(i.e. cluster on (id1, evalue)), whether you could make it so the limit
finds the right 100 evalues first for each table.
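
For concreteness, a rough and untested sketch of that layout (all names
here are made up, with "big" standing in for the original table), including
the evalue-in-the-cluster-key idea:

  -- untested sketch; "big" stands in for the real table
  CREATE TABLE t1 AS SELECT * FROM big;
  CREATE INDEX t1_id1_evalue_idx ON t1 (id1, evalue);
  CLUSTER t1 USING t1_id1_evalue_idx;
  ANALYZE t1;

  CREATE TABLE t2 AS SELECT * FROM big;
  CREATE INDEX t2_id2_evalue_idx ON t2 (id2, evalue);
  CLUSTER t2 USING t2_id2_evalue_idx;
  ANALYZE t2;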








[PERFORM] noob inheritance question

2010-01-06 Thread Zintrigue
Hello,

I am a complete noob to Postgres and to this list, and I hope this will be the
appropriate list for this question.

I'm hoping the inheritance feature will be a nice alternative method for me
to implement categories in a particular database of products I need to keep
updated. I suppose in MySQL I would probably do this by creating, for
example, one table for the products, and then a table(s) for categories, and
then I'd be faced with a choice between using an adjacency list or nested
set paradigm for, say, breadcrumb links in my private web app.

On the other hand, in Postgres what I'd like to do is just create an empty
root products table, then create, for example, a spirits table that
inherits from products, and a rums table that inherits from spirits, and
then aged rum, flavored rum, et al., which inherit from rums.

In this scenario, my idea was to have all my fields in products and to not
add any additional fields in the child tables. Also, only the lowest level
of child tables in any given branch of products would actually contain data
/ rows.

Assuming this is a good design, what I'm wondering is how inheritance is
actually implemented deep down inside Postgres, and whether it's anything at
all like JOINs (say, in the case of merely doing
SELECT * FROM flavored_rum (the lowest level in a given branch)
or
SELECT * FROM spirits (the root level, or some intermediate level in a
given branch)).

I'm wondering if there's any performance penalty here, analogous to the
penalty of JOINs in a regular RDBMS (versus an ORDBMS).

If anyone can offer any insight as to how inheritance is actually
executed (compared to JOINs especially), I'd be most grateful.

Thank you,
DG


Re: [PERFORM] noob inheritance question

2010-01-06 Thread Richard Broersma
On Wed, Jan 6, 2010 at 3:53 PM, Zintrigue zintri...@gmail.com wrote:
 I'm wondering if there's any performance penalty here, analogous to the
 penalty of JOINs in a regular RDBMS (versus an ORDBMS).
 If anyone can offer in any insight as too how inheritance is actually
 executed (compared to JOINs especially), I'd be most grateful.

PostgreSQL inheritance is just a sugar-coated form of horizontal table
partitioning.  So it suffers from all of the problems associated with
selection on UNION ALL queries.
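
For illustration, a minimal sketch (made-up names) of what such an
inheritance tree amounts to:

-- made-up names, just to show the shape
CREATE TABLE products (id serial PRIMARY KEY, name text, price numeric);
CREATE TABLE spirits () INHERITS (products);
CREATE TABLE rums    () INHERITS (spirits);

-- Selecting from a parent scans the parent plus every descendant table,
-- effectively an append / UNION ALL over all of them.  Note that indexes
-- and unique/primary-key constraints on the parent are NOT inherited by
-- the children.
SELECT * FROM products;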


-- 
Regards,
Richard Broersma Jr.

Visit the Los Angeles PostgreSQL Users Group (LAPUG)
http://pugs.postgresql.org/lapug



Re: [PERFORM] Digesting explain analyze

2010-01-06 Thread Robert Haas
On Wed, Jan 6, 2010 at 2:10 PM, Jesper Krogh jes...@krogh.cc wrote:
 Hi.

 I have a table with somewhere in the magnitude of 100,000,000
 rows, and all rows are tuples of this form:

 (id1,id2,evalue);

 Then I'd like to speed up a query like this:

 explain analyze select id from table where id1 = 2067 or id2 = 2067 order
 by evalue asc limit 100;
                                                                  QUERY PLAN
 ---
  Limit  (cost=1423.28..1423.28 rows=100 width=12) (actual time=2.565..2.567 rows=100 loops=1)
    ->  Sort  (cost=1423.28..1424.54 rows=505 width=12) (actual time=2.560..2.560 rows=100 loops=1)
          Sort Key: evalue
          Sort Method:  top-N heapsort  Memory: 25kB
          ->  Bitmap Heap Scan on table  (cost=16.58..1420.75 rows=505 width=12) (actual time=0.709..1.752 rows=450 loops=1)
                Recheck Cond: ((id1 = 2067) OR (id2 = 2067))
                ->  BitmapOr  (cost=16.58..16.58 rows=506 width=0) (actual time=0.676..0.676 rows=0 loops=1)
                      ->  Bitmap Index Scan on id1_evalue_idx  (cost=0.00..11.44 rows=423 width=0) (actual time=0.599..0.599 rows=450 loops=1)
                            Index Cond: (id1_id = 2067)
                      ->  Bitmap Index Scan on id2_evalue_idx  (cost=0.00..4.89 rows=83 width=0) (actual time=0.070..0.070 rows=1 loops=1)
                            Index Cond: (id2_id = 2067)
  Total runtime: 2.642 ms
 (12 rows)


 What I had expected was to see the Bitmap Index Scan on id1_evalue_idx
 chop it off at the limit. The inner sets are on average 3,000 rows for
 both id1 and id2, and a typical limit would be 100, so if I could convince
 postgresql to not fetch all of them then I would reduce the set retrieved
 by a factor of around 60. The dataset is quite large, so a random query is
 not very likely to hit the same part of the dataset again, so there is
 going to be a fair amount of going to disk.

 It would also mean that using it in a for loop in a plpgsql stored
 procedure would not get any benefit from the CURSOR effect?

 I actually tried to stuff id1,id2 into an array and do a GiST index on
 (array, evalue), hoping that it would directly satisfy this query.. but it
 used the GiST index to fetch the rows and then did the post-sorting and
 limit on the set.

 What it boils down to is more or less:

 Does a bitmap index scan support ordering and limit?
 Does a multicolumn GiST index support ordering and limit?

 Have I missed anything that can hugely speed up fetching these (typically
 100) rows from the database?

Bitmap index scans always return all the matching rows.  It would be
nice if they could fetch them in chunks for queries like this, but
they can't.  I am not sure whether there's any way to make GIST do
what you want.

You might try something like this (untested):

SELECT * FROM (
   select id, evalue from table where id1 = 2067 order by evalue asc limit 100
   union all
   select id, evalue from table where id2 = 2067 order by evalue asc limit 100
) x ORDER BY evalue LIMIT 100

If you have an index by (id1, evalue) and by (id2, evalue) then I
would think this would be pretty quick, as it should do two index
scans (not bitmap index scans) to fetch 100 rows for each, then append
the results, sort them, and then limit again.
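
The supporting indexes, if they don't already exist in exactly this
composite form, would be something like this (the table name here is just
a placeholder):

CREATE INDEX big_id1_evalue_idx ON big (id1, evalue);
CREATE INDEX big_id2_evalue_idx ON big (id2, evalue);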

...Robert



Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Dmitri Girski
Hi everybody,

Many thanks to everyone who replied, I think we are on the right track.
I've used tcpdump to generate the logs and there are a lot of dropped
packets due to bad checksums. The network guy is currently looking at the
problem, and most likely this is a hardware issue.

Cheers,
Dmitri.

On Tue, Jan 5, 2010 at 1:12 PM, Dmitri Girski mite...@gmail.com wrote:

 Hi everybody,

 I am running a PostgreSQL server 8.3.5 with a pretty much standard config.

 The web application server which runs Apache 1.3/PHP2.9  has an
 intermittent problem:
 pg_connect takes exactly 3.0 seconds. The usual connection time is 0.0045.
 The long request happens at approximate rate 1:100.

 I turned on logs on postgres server side, and there is
 nothing suspicious for me there. When a connection request comes, it is
 being served without any delay.

 Could anyone point me to the direction in which I should investigate this
 problem further?
 Thank you in advance!


 PS The hardware is: Dell SC1435/4Gb/2x2.0GHz/Gentoo Linux.
 The database & web servers are in the 2 local subnets.


 Dmitri.




-- 
@Gmail


Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Scott Marlowe
On Wed, Jan 6, 2010 at 7:44 PM, Dmitri Girski mite...@gmail.com wrote:
 Hi everybody,
 Many thanks to everyone who replied, I think we are on the right track.
 I've used tcpdump to generate the logs and there are a lot of dropped
 packets due to bad checksums. The network guy is currently looking at the
 problem, and most likely this is a hardware issue.

95% of these problems are a bad NIC or a bad cable. Since cables are
easy to change I'd try those first, then NICs.  Since lots of servers
have dual NICs that's a pretty easy change too.

 Every now and then it's a bad switch / router.



Re: [PERFORM] noob inheritance question

2010-01-06 Thread Nikolas Everett
Inheritance would only make sense if each of your categories had more
columns.  Say if you had a wines category and only they had a year column.
 It's probably not worth it for one or two columns, but if you've got a big
crazy heterogeneous tree of stuff then it's probably appropriate.

I'm with Richard in that it sounds like the right way to solve your problem
is to have a categories table and a products table.  Let the categories
table have a reference to the parent.  I suppose just like they do in the
first section of
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html .  The
other sections on the page just seem like overkill to me.
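
For what it's worth, a minimal sketch of that layout (made-up names):

-- made-up names, just to illustrate the shape
CREATE TABLE categories (
    id        serial PRIMARY KEY,
    parent_id integer REFERENCES categories(id),  -- NULL for top-level categories
    name      text NOT NULL
);

CREATE TABLE products (
    id          serial PRIMARY KEY,
    category_id integer NOT NULL REFERENCES categories(id),
    name        text NOT NULL
);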

On Wed, Jan 6, 2010 at 7:13 PM, Richard Neill rn...@cam.ac.uk wrote:

 Zintrigue wrote:

  I'm hoping the inheritance feature will be a nice alternative method for
 me to implement categories in a particular database of products I need to keep
 updated. I suppose in MySQL I would probably do this by creating, for
 example, one table for the products, and then a table(s) for categories, and
 then I'd be faced with a choice between using an adjacency list or nested
 set paradigm for, say, breadcrumb links in my private web app.

 On the other hand, in Postgres what I'd like to do is just create an empty
 root products table, then create, for example, a spirits table that
 inherits from products, and a rums table that inherits from spirits, and
 then aged rum, flavored rum, et al., which inherit from rums.

 In this scenario, my idea was to have all my fields in products and to
 not add any additional fields in the child tables. Also, only the lowest
 level of child tables in any given branch of products would actually contain
 data / rows.

 Assuming this is a good design,


 May I venture to stop you there. This sounds like you are doing it
 The Hard Way.

 In particular, each time you add a new category, you're going to have to
 add a new database table, and your schema is going to get to be horrible.
 Inserts aren't going to be much fun either.

 Rather than adding multiple child tables, may I suggest some other way of
 tracking which item is a subset of the other.
  You could do it by having 2 columns:
id, parent_id  (each integer and indexed)
 or you could do it by having 2 columns:
id, list   (id is integer, list is eg 1,3,5,13)
 (where the list is a comma-separated list, or an array, and holds the full
 path)


 Depending on scale, you may be able to choose a simple algorithm instead of
 hunting for the most efficient one.


 Best wishes,

 Richard


 P.S. This is the performance mailing list - you may find one of the other
 lists better suited to your questions.





[PERFORM] Joining on text field VS int

2010-01-06 Thread Radhika S
Hi,
I am going to test this out, but it would be good to know anyway. A large
table is joined to a tiny table (8 rows) on a text field. Should I be
joining on an int field, e.g. recid, instead of name? Is the performance
affected in either case?
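
To make the question concrete, the two variants would be something like
this (table and column names here are made up):

-- made-up names; "lookup" is the tiny 8-row table
SELECT b.*
FROM big_table b
JOIN lookup l ON l.name = b.lookup_name;      -- joining on the text field

SELECT b.*
FROM big_table b
JOIN lookup l ON l.recid = b.lookup_recid;    -- joining on an int field
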
Thanks.


Re: [PERFORM] Digesting explain analyze

2010-01-06 Thread Jesper Krogh
Ron Mayer wrote:
 ...The inner sets are on average 3,000 rows for
 both id1 and id2, and a typical limit would be 100, so if I could convince
 postgresql to not fetch all of them then I would reduce the set retrieved
 by a factor of around 60. The dataset is quite large, so a random query is
 not very likely to hit the same part of the dataset again, so there is
 going to be a fair amount of going to disk.
 
 If disk seeks are killing you a kinda crazy idea would be to
 duplicate the table - clustering one by (id1) and
 the other one by an index on (id2) and unioning the
 results of each.

That's doubling the disk space needs for the table. What are the odds
that this would benefit when the initial table significantly exceeds
available memory by itself?

 Since each of these duplicates of the table will be clustered
 by the column you're querying it on, it should just take one
 seek in each table.
 
 Then your query could be something like
 
   select * from (
 select * from t1 where id1=2067 order by evalue limit 100
 union
 select * from t2 where id2=2067 order by evalue limit 100
   ) as foo order by evalue limit 100;

This is actually what I ended up with as the best performing query, just
still on a single table, because without duplication I can add indexes and
optimize this one by (id1,evalue) and (id2,evalue). It is still getting
killed quite a lot by disk IO. So I guess I'm up to:

1) Buy better disks (I need to get an estimate of how large it actually is
going to get).
2) Stick with one table, but make sure to have enough activity to get a
large part of the index into the OS cache anyway (and add more memory if
necessary).

The data is seeing a fair amount of growth (it doubles in a couple of
years), so it is fairly hard to maintain clustering on it, I would suspect.

Is it possible to get PG to tell me how many rows fit in a disk page?
All columns are using plain storage according to \d+ on the table.
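
One rough way to get at that from the planner statistics in pg_class,
after the table has been analyzed:

-- rough rows-per-page estimate; 'mytable' is a placeholder name
SELECT relpages,
       reltuples,
       reltuples / relpages AS rows_per_page
FROM pg_class
WHERE relname = 'mytable';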

 Hmm..  and I wonder, if you put evalue into the clustering criteria too
 (i.e. cluster on (id1, evalue)), whether you could make it so the limit
 finds the right 100 evalues first for each table.

I didn't cluster it, since clustering locks everything.

-- 
Jesper



Re: [PERFORM] pg_connect takes 3.0 seconds

2010-01-06 Thread Craig Ringer

On 7/01/2010 10:44 AM, Dmitri Girski wrote:

Hi everybody,

Many thanks to everyone who replied, I think we are on the right track.
I've used tcpdump to generate the logs and there are a lot of dropped
packets due to bad checksums. The network guy is currently looking at the
problem, and most likely this is a hardware issue.


Hang on a sec. You need to ignore bad checksums on *outbound* packets, 
because many (most?) Ethernet drivers implement some level of TCP 
offloading, and this will result in packet sniffers seeing invalid 
checksums for transmitted packets - the checksums haven't been generated 
by the NIC yet.


Unless you know for sure that your NIC doesn't do TSO, ignore bad 
checksums on outbound packets from the local interface.


--
Craig Ringer



[PERFORM] Massive table (500M rows) update nightmare

2010-01-06 Thread Carlo Stonebanks
Our DB has an audit table which is 500M rows and growing. (FYI the objects 
being audited are grouped semantically, not individual field values).


Recently we wanted to add a new feature and we altered the table to add a 
new column. We are backfilling this varchar(255) column by writing a TCL 
script to page through the rows (where every update is an UPDATE ... WHERE 
id >= x AND id < x+10, and a commit is performed after every 1000 update 
statements, i.e. every 10,000 rows.)
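
As a sketch of what one committed batch amounts to (the new column name
here is just a placeholder):

-- "new_col" stands in for the added varchar(255) column
BEGIN;
UPDATE mdx_core.audit_impt SET new_col = '...' WHERE audit_impt_id >= 1  AND audit_impt_id < 11;
UPDATE mdx_core.audit_impt SET new_col = '...' WHERE audit_impt_id >= 11 AND audit_impt_id < 21;
-- ... 998 more 10-row ranges ...
COMMIT;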


We have 10 columns, six of which are indexed. Rough calculations suggest 
that this will take two to three weeks to complete on an 8-core CPU with 
more than enough memory.


As a ballpark estimate - is this sort of performance for 500M updates 
what one would expect of PG given the table structure (detailed below) or 
should I dig deeper to look for performance issues?


As always, thanks!

Carlo

Table/index structure:

CREATE TABLE mdx_core.audit_impt
(
 audit_impt_id serial NOT NULL,
 impt_session integer,
 impt_version character varying(255),
 impt_name character varying(255),
 impt_id integer,
 target_table character varying(255),
 target_id integer,
 target_op character varying(10),
 note text,
 source_table character varying(255),
 CONSTRAINT audit_impt_pkey PRIMARY KEY (audit_impt_id)
)

CREATE INDEX audit_impt_impt_id_idx
 ON mdx_core.audit_impt
 USING btree
 (impt_id);
CREATE INDEX audit_impt_impt_name
 ON mdx_core.audit_impt
 USING btree
 (impt_name, impt_version);
CREATE INDEX audit_impt_session_idx
 ON mdx_core.audit_impt
 USING btree
 (impt_session);
CREATE INDEX audit_impt_source_table
 ON mdx_core.audit_impt
 USING btree
 (source_table);
CREATE INDEX audit_impt_target_id_idx
 ON mdx_core.audit_impt
 USING btree
 (target_id, audit_impt_id);
CREATE INDEX audit_impt_target_table_idx
 ON mdx_core.audit_impt
 USING btree
 (target_table, target_id, audit_impt_id);






Re: [PERFORM] Optimizer use of index slows down query by factor

2010-01-06 Thread Michael Ruf

Hi,

Tom Lane wrote:

 I think you need to see about getting this rowcount estimate to be more
 accurate:

  ->  Index Scan using idx_link_1 on link  (cost=0.00..680.51 rows=13477 width=26) (actual time=5.707..12.043 rows=126 loops=1)
        Index Cond: (task_id = 1556)
        Filter: (((deletable IS NULL) OR (NOT deletable)) AND ((link_type = 8) OR (link_type = 9)))

 If it realized there'd be only 126 rows out of that scan, it'd probably
 have gone for a nestloop join against the big table, which I think would
 be noticeably faster than either of the plans you show here.

 You already did crank up default_statistics_target, so I'm not sure if
 raising it further would help any.

After I've increased the statistics target for the specific column on the 
link table (alter table link alter task_id set statistics 200;), the SQL 
runs fine ( < 1 second ):


Limit  (cost=448478.40..448492.17 rows=1 width=30) (actual time=850.698..860.838 rows=12 loops=1)
  ->  GroupAggregate  (cost=448478.40..448492.17 rows=1 width=30) (actual time=850.695..860.824 rows=12 loops=1)
        ->  Sort  (cost=448478.40..448481.15 rows=1100 width=30) (actual time=850.569..853.985 rows=6445 loops=1)
              Sort Key: link.link_type, link.link_alias
              Sort Method:  quicksort  Memory: 696kB
              ->  Nested Loop Left Join  (cost=0.00..448422.84 rows=1100 width=30) (actual time=819.519..838.422 rows=6445 loops=1)
                    ->  Seq Scan on link  (cost=0.00..142722.52 rows=203 width=26) (actual time=819.486..820.016 rows=126 loops=1)
                          Filter: (((deletable IS NULL) OR (NOT deletable)) AND (task_id = 1556) AND ((link_type = 8) OR (link_type = 9)))
                    ->  Index Scan using idx_click_1 on click  (cost=0.00..1370.01 rows=10872 width=12) (actual time=0.003..0.088 rows=51 loops=126)
                          Index Cond: (link.link_id = click.link_id)
Total runtime: 860.929 ms
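
For reference, the new target only takes effect once the table's statistics
are rebuilt, so the complete sequence is just:

ALTER TABLE link ALTER COLUMN task_id SET STATISTICS 200;
ANALYZE link;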


 What I'd suggest is trying to avoid
 using non-independent AND/OR conditions.  For instance recasting the
 first OR as just "deletable is not true" would probably result in a
 better estimate.  The size of the error seems to be more than that would
 account for though, so I suspect that the deletable and link_type
 conditions are interdependent.  Is it practical to recast your data
 representation to avoid that?


I've tried that, but with no positive/negative effects.

Thanks for your help.

Michael



Re: [PERFORM] Massive table (500M rows) update nightmare

2010-01-06 Thread Scott Marlowe
On Thu, Jan 7, 2010 at 12:17 AM, Carlo Stonebanks
stonec.regis...@sympatico.ca wrote:
 Our DB has an audit table which is 500M rows and growing. (FYI the objects
 being audited are grouped semantically, not individual field values).

 Recently we wanted to add a new feature and we altered the table to add a
 new column. We are backfilling this varchar(255) column by writing a TCL
 script to page through the rows (where every update is an UPDATE ... WHERE
 id >= x AND id < x+10, and a commit is performed after every 1000 update
 statements, i.e. every 10,000 rows.)

 We have 10 columns, six of which are indexed. Rough calculations suggest
 that this will take two to three weeks to complete on an 8-core CPU with
 more than enough memory.

 As a ballpark estimate - is this sort of performance for 500M updates
 what one would expect of PG given the table structure (detailed below) or
 should I dig deeper to look for performance issues?

Got an explain analyze of the delete query?
