Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-14 Thread Marti Raudsepp
On Sun, Mar 13, 2011 at 18:36, runner run...@winning.com wrote:
 Tried removing the indexes and other constraints just for
 the import but for a noob like me, this was too much to ask.  Maybe
 when I get more experience.

pgAdmin should make it pretty easy. Choose each index and constraint,
save the code from the SQL pane for when you need to restore it, and
do a right click -> Drop.
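
For example, a minimal sketch with hypothetical object names (the real
DDL to restore is whatever the SQL pane showed):

DROP INDEX idx_big_table_foo;                         -- before the import
ALTER TABLE big_table DROP CONSTRAINT big_table_fk;   -- before the import
-- ... load the data ...
CREATE INDEX idx_big_table_foo ON big_table (foo);    -- restore afterwards
ALTER TABLE big_table ADD CONSTRAINT big_table_fk
  FOREIGN KEY (other_id) REFERENCES other_table (id);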

 Other than being very inefficient, and consuming
 more time than necessary, is there any other down side to importing
 into an indexed table?

Doing so will result in somewhat larger (more bloated) indexes, but
generally the performance impact of this is minimal.
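
If the bloat ever does become a concern, rebuilding the indexes after the
load cleans it up; a sketch with a hypothetical table name:

REINDEX TABLE my_imported_table;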

Regards,
Marti



Re: [PERFORM] Tuning massive UPDATES and GROUP BY's?

2011-03-14 Thread Greg Spiegelberg
On Mon, Mar 14, 2011 at 4:17 AM, Marti Raudsepp ma...@juffo.org wrote:

 On Sun, Mar 13, 2011 at 18:36, runner run...@winning.com wrote:
  Other than being very inefficient, and consuming
  more time than necessary, is there any other down side to importing
  into an indexed table?

 Doing so will result in somewhat larger (more bloated) indexes, but
 generally the performance impact of this is minimal.


Bulk data imports of this size I've done with minimal pain by simply
breaking the raw data into chunks (10M records becomes 10 files of 1M
records), kept on a separate spindle from the database, and performing
multiple COPY commands, but no more than 1 COPY per server core.  I tested
this a while back on a 4 core server, and when I attempted 5 COPYs at a
time the time to complete went up almost 30%.  I don't recall any benefit
from having fewer than 4 in this case, but the server was only processing
my data at the time.  Indexes were on the target table; however, I dropped
all constraints.  The UNIX split command is handy for breaking the data up
into individual files, as in the sketch below.
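
A sketch of that workflow, with hypothetical file and table names (run
each COPY in its own session, up to one per core):

-- at the shell: split -l 1000000 rawdata.csv /staging/chunk_
COPY target_table FROM '/staging/chunk_aa';   -- session 1
COPY target_table FROM '/staging/chunk_ab';   -- session 2
COPY target_table FROM '/staging/chunk_ac';   -- session 3
COPY target_table FROM '/staging/chunk_ad';   -- session 4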

Greg


Re: [PERFORM] unexpected stable function behavior

2011-03-14 Thread Merlin Moncure
On Mon, Mar 14, 2011 at 3:46 AM, Julius Tuskenis jul...@nsoft.lt wrote:
 Hello, Merlin

 Thank you for your quick response.

2011.03.10 23:14, Merlin Moncure wrote:

 This is a huge problem with non-trivial functions in the select list.
 Pushing the result into a subquery does NOT guarantee that the
 inner result is materialized first.

 From the postgresql documentation about STABLE functions: "This category
 allows the optimizer to optimize multiple calls of the function to a single
 call." I thought that this means that the optimizer executes the function
 only once for each new parameter set and stores the results in some cache,
 using them if the parameters are already known. I realize this is a very
 naive approach and most probably everything is much more complicated. I
 would appreciate it if someone would explain the mechanism (or provide
 some useful link).

Just because some optimizations can happen doesn't mean they will
happen, or that there is even the capability to make them happen.  There
was some recent discussion about this very topic here:
http://postgresql.1045698.n5.nabble.com/function-contants-evaluated-for-every-row-td3278945.html


Try a CTE.

 with foo as
 (
   select yadda
 )
 select func(foo.a), foo.* from foo;

 I'm sorry, but I'm totally new to CTEs. Would you please show me how I
 should use the stable function and where the parameters should be put to
 improve the behavior of the optimizer for my problem?

WITH results as
(
 SELECT distinct price_id
 FROM ticket_price
   JOIN ticket_has_ticket_price ON (price_id = thtp_price_id)
 WHERE price_event_id = 7820 AND (current_timestamp >= price_date AND
   current_timestamp <= price_date_till)
)
SELECT web_select_extra_price(price_id, 7820, 1) FROM results;


Another way to fight this is to play with the COST planner hint
parameter in 'create function', but I prefer the CTE -- it gives
strong guarantees about order of execution, which is what you really
want.  CTEs are great btw, I'd start learning them immediately.
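
For reference, that hint looks like this (a sketch; the argument types
and the cost value here are assumptions, not measurements):

ALTER FUNCTION web_select_extra_price(integer, integer, integer) COST 10000;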

IMNSHO, this (an uncontrolled number of function executions when a
function is run via the select list) is a common gotcha w/postgres and a
FAQ.  Also the documentation is not very helpful on this point... do you
agree the CTE is the right way to advise handling this problem... is it
worth a further documentation note?

merlin



Re: [PERFORM] unexpected stable function behavior

2011-03-14 Thread Andres Freund
On Thursday, March 10, 2011 05:26:00 PM Julius Tuskenis wrote:
 3) As a STABLE function should be executed once for every different set of
 parameters
That's not true. That's not what any of the volatility information (like STABLE,
IMMUTABLE, VOLATILE) does.

See http://www.postgresql.org/docs/current/interactive/xfunc-volatility.html

It *does* change how often a function is executed though. I.e.

SELECT g.i, some_stable_func(1) FROM generate_series(1, 1000) g(i)

will call some_stable_func only once because it can determine all the 
parameters beforehand.
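
By contrast, when the argument depends on the row there is nothing to
determine beforehand, so the function has to run once per row. A sketch
with the same hypothetical some_stable_func:

SELECT g.i, some_stable_func(g.i) FROM generate_series(1, 1000) g(i)

will call some_stable_func 1000 times, once per row.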


Andres



Re: [PERFORM] Performance regression from 8.3.7 to 9.0.3

2011-03-14 Thread Claudio Freire
Nothing?

No ideas?

Did I forget to include some useful bit?

On Fri, Mar 4, 2011 at 8:22 PM, Claudio Freire klaussfre...@gmail.com wrote:
 Hello, first post to this list.

 I have this query that ran in milliseconds in postgres 8.3.7 (usually 50 to
 100 ms), and now it takes a full 10 minutes to complete.

 I tracked the problem to the usage of hash aggregates to resolve EXISTS
 clauses in conjunction with large IN clauses, which seem to reverse the
 execution plan - in 8.3.7, it would use indices to fetch the rows from the
 IN, then compute the exists with a nested loop, never doing big sequential
 scans. In 9.0.3, it computes the set of applicable entries with a hash
 aggregate, but in order to do that it performs a huge index scan - no
 sequential scans either, but the big index scan is worse.

 9.0.3 always misses the estimate of how many rows will come out of the hash
 aggregate, always estimating 200, while in fact the real count is more like
 300,000. I've tried increasing statistics in all the columns involved, up to
 4000 for each, to the point where it accurately estimates the input to the
 hash agg, but the output is always estimated to be 200 rows.

 Rewriting the query to use 0 < (select count(*)..) instead of EXISTS (select
 * ..) does revert to the old postgres 8.3 plan, although intuitively I would
 think it to be sub-optimal.

 The tables in question receive many updates, but never in such a volume as
 to create enough bloat - plus, the tests I've been running are on a
 pre-production server without much traffic (so not many updates - probably
 none in weeks).

 The server is a Core 2 E7400 dual core with 4GB of ram running linux and a
 pg 9.0.3 / 8.3.7 (both there, doing migration testing) built from source.
 Quite smaller than our production server, but I've tested the issue on
 higher-end hardware and it produces the same results.

 Any ideas as to how to work around this issue?

 I can't plug the select count() version everywhere, since I won't be using
 this form of the query every time (it's generated programmatically with an
 ORM), and some forms of it perform incredibly worse with the select count().

 Also, any help I can provide to fix it upstream I'll be glad to - I believe
 (I would have to check) I can even create a dump of the tables (stripping
 sensitive info of course) - only, well, you'll see the size below - a tad
 big to be mailing it ;-)

 pg 9.0 is configured with:

 work_mem = 64M
 shared_buffers = 512M
 temp_buffers = 64M
 effective_cache_size = 128M

 pg 8.3.7 is configured with:

 work_mem = 64M
 shared_buffers = 100M
 temp_buffers = 64M
 effective_cache_size = 128M


 The query in question:

 SELECT member_statistics.member_id
     FROM member_statistics
     WHERE member_statistics.member_id IN ( 400 ids ) AND (EXISTS
 (SELECT mat1.tag_id
     FROM member_all_tags_v AS mat1
     WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
 IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity = 90))

 -- View: member_all_tags_v

 CREATE OR REPLACE VIEW member_all_tags_v AS
  SELECT member_tags.member_id, member_tags.last_modification_date,
 member_tags.polarity, member_tags.tag_id, 'mr' AS source
    FROM member_tags
 UNION ALL
  SELECT member_profile_tags.member_id,
 member_profile_tags.last_modification_date, member_profile_tags.polarity,
 member_profile_tags.tag_id, 'profile' AS source
    FROM member_profile_tags;

 -- Table: member_profile_tags

 -- DROP TABLE member_profile_tags;

 CREATE TABLE member_profile_tags
 (
   member_id integer NOT NULL,
   last_modification_date timestamp without time zone NOT NULL,
   polarity smallint NOT NULL,
   tag_id integer NOT NULL,
   CONSTRAINT member_profile_tags_pkey PRIMARY KEY (member_id, tag_id),
   CONSTRAINT fka52b6e7491ac9123 FOREIGN KEY (tag_id)
   REFERENCES tags (id) MATCH SIMPLE
   ON UPDATE NO ACTION ON DELETE NO ACTION
 )
 WITH (
   OIDS=FALSE
 );

 -- Index: idx_member_profile_tags_tag_id
 CREATE INDEX idx_member_profile_tags_tag_id
   ON member_profile_tags
   USING btree
   (tag_id);


 -- Table: member_tags

 -- DROP TABLE member_tags;
 CREATE TABLE member_tags
 (
   member_id integer NOT NULL,
   last_modification_date timestamp without time zone NOT NULL,
   polarity smallint NOT NULL,
   tag_id integer NOT NULL,
   CONSTRAINT member_tags_pkey PRIMARY KEY (member_id, tag_id),
   CONSTRAINT fk527ef29e91ac9123 FOREIGN KEY (tag_id)
   REFERENCES tags (id) MATCH SIMPLE
   ON UPDATE NO ACTION ON DELETE NO ACTION
 )
 WITH (
   OIDS=FALSE
 );

 -- Index: idx_member_tags_tag_id
 CREATE INDEX idx_member_tags_tag_id
   ON member_tags
   USING btree
   (tag_id);


 member_tags : 637M bytes, 12.7M rows
 member_profile_tags : 1824M bytes, 36.6M rows
 member_statistics : 581M bytes, 2.5M rows

 member_profile_tags_pkey : 785M
 member_tags_pkey : 274M
 member_statistics_pkey : 54M

 idx_member_tags_tag_id : 274M
 idx_member_profile_tags_tag_id : 

Re: [PERFORM] Table partitioning problem

2011-03-14 Thread Samba GUEYE
Yeah, but is there a workaround to force the root table to propagate the
foreign key to the partitioned table? Because right now all foreign keys
to the partitioned table throw constraint violations, and that's a big
problem for me.

On 11/03/2011 19:31, Robert Haas wrote:

On Thu, Mar 10, 2011 at 3:25 AM, Samba GUEYE samba.gu...@intesens.com wrote:

The measure insertion is successful, but problems arise when inserting into
the simpleMeasure table, because it can't find the foreign key inserted into
the measure table and does not look at the partitioned tables.

Yes, that's how it works.






Re: [PERFORM] unexpected stable function behavior

2011-03-14 Thread Julius Tuskenis

Hello, Merlin

Thank you for your quick response.

2011.03.10 23:14, Merlin Moncure wrote:

This is a huge problem with non-trivial functions in the select list.
Pushing the result into a subquery does NOT guarantee that the
inner result is materialized first.
From the postgresql documentation about STABLE functions: "This category
allows the optimizer to optimize multiple calls of the function to a
single call." I thought that this means that the optimizer executes the
function only once for each new parameter set and stores the results in
some cache, using them if the parameters are already known. I realize
this is a very naive approach and most probably everything is much more
complicated. I would appreciate it if someone would explain the mechanism
(or provide some useful link).



Try a CTE.

with foo as
(
   select yadda
)
select func(foo.a), foo.* from foo;
I'm sorry, but I'm totally new to CTEs. Would you please show me how I
should use the stable function and where the parameters should be put
to improve the behavior of the optimizer for my problem?


Thank you in advance

--
Julius Tuskenis
Head of Programming Department
UAB nSoft
mob. +37068233050



Re: [PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-14 Thread John Surcombe
  When we 'EXPLAIN' this query, PostgreSQL says it is using the index
  idx_receiveddatetime.  The way the application is designed means that
  in virtually all cases the query will have to scan a very long way
  into idx_receiveddatetime to find the first record where userid =
  311369000.
  If however we delete the idx_receiveddatetime index, the query uses
  the idx_userid_receiveddatetime index, and the query only takes a few
  milliseconds.

  That's just bizarre ... it knows the index is applicable, and the
  cost estimates clearly favor the better index, so why did it pick
  the worse one?

 No, scratch that, I misread the plans.  It *is* picking the plan it
 thinks has lower cost; it's just a mistaken cost estimate.  It's
 strange though that the less selective indexscan is getting a lower
 cost estimate.  I wonder whether your table is (almost) perfectly
 ordered by receiveddatetime, such that the one-column index has
 correlation close to 1.0.  That could possibly lower the cost
 estimate to the point where it'd appear to dominate the other index.
 It'd be useful to see the pg_stats.correlation value for both the
 userid and receiveddatetime columns.

Yes, the table is indeed nearly perfectly ordered by receiveddatetime
(correlation 0.998479).  correlation on userid is -0.065556.  n_distinct
on userid is also low: 1097.

Is the problem perhaps something like the following:  PostgreSQL is
thinking that because there are not many userids and there is low
correlation, that if it just scans the table from the top in date order,
this will be cheap (because receiveddatetime correlation is high so it
won't have to seek randomly), and it won't have to scan very far before
it finds the first row with a matching userid.

The problem is though that in our application the userids are not
scattered randomly.  There are small regions of the table where a
particular userid appears frequently, interspersed with much larger
regions (perhaps millions or tens of millions of rows) where it doesn't
appear at all.  So in fact the planner's preferred solution is often
pathologically bad.

Is there a solution?



Re: [PERFORM] unexpected stable function behavior

2011-03-14 Thread Tom Lane
Julius Tuskenis jul...@nsoft.lt writes:
  From the postgresql documentation about STABLE functions: "This
 category allows the optimizer to optimize multiple calls of the function
 to a single call." I thought that this means that the optimizer executes
 the function only once for each new parameter set and stores the results
 in some cache, using them if the parameters are already known.

No, it does not.  That function property *allows* the optimizer to
invoke the function fewer times than would happen in an un-optimized
query.  It does not *require* it to do so.  There is no such cache
mechanism in Postgres, and it's unlikely that there ever will be,
because it probably would be a net performance loss on average.

regards, tom lane



[PERFORM] Bug in the planner?

2011-03-14 Thread Claudio Freire
This is postgresql 9.0.3:


Query:

select
  sum(stat_responses) * 100.0 / sum(stat_invites) as stat_response_rate,
  sum(real_responses) * 100.0 / sum(real_invites) as real_response_rate
from (
  select
    ms.invites as stat_invites,
    (select count(*)
     from invites i
     join deliveries d on d.id = i.delivery_id
     where i.member_id = ms.member_id
       and d.recontact_number = 0
       and d.delivery_type = 1) as real_invites,
    ms.responses as stat_responses,
    (select count(*)
     from track_logs tl
     join tracks t on t.id = tl.track_id
     where t.member_id = ms.member_id
       and t.partner_id is null and t.recontact_number = 0
       and t.contact_method_id = 1
       and t.delivery_type = 1
       and tl.track_status_id = 10) as real_responses
  from member_statistics ms
  join livra_users lu on lu.id = ms.member_id
  where lu.country_id = 2 and lu.is_panelist and lu.confirmed
    and not lu.unregistered
) as rtab;



                                   QUERY PLAN
--------------------------------------------------------------------------------
 Aggregate  (cost=224382.22..225969.27 rows=1 width=12)
   ->  Hash Join  (cost=88355.09..221837.46 rows=254475 width=12)
         Hash Cond: (ms.member_id = lu.id)
         ->  Seq Scan on member_statistics ms  (cost=0.00..99539.50 rows=2511850 width=12)
         ->  Hash  (cost=85174.15..85174.15 rows=254475 width=4)
               ->  Bitmap Heap Scan on livra_users lu  (cost=14391.40..85174.15 rows=254475 width=4)
                     Recheck Cond: (country_id = 2)
                     Filter: (is_panelist AND confirmed AND (NOT unregistered))
                     ->  Bitmap Index Scan on ix_user_state  (cost=0.00..14327.78 rows=763100 width=0)
                           Index Cond: (country_id = 2)
   SubPlan 1
     ->  Aggregate  (cost=181.25..181.26 rows=1 width=0)
           ->  Nested Loop  (cost=0.00..181.19 rows=24 width=0)
                 ->  Index Scan using idx_tracks_partner_id_member_id on tracks t  (cost=0.00..49.83 rows=9 width=8)
                       Index Cond: ((partner_id IS NULL) AND (member_id = $0))
                       Filter: ((recontact_number = 0) AND (contact_method_id = 1) AND (delivery_type = 1))
                 ->  Index Scan using idx_track_logs_track_id on track_logs tl  (cost=0.00..14.56 rows=3 width=8)
                       Index Cond: (tl.track_id = t.id)
                       Filter: (tl.track_status_id = 10)
   SubPlan 2
     ->  Aggregate  (cost=1405.75..1405.76 rows=1 width=0)
           ->  Nested Loop  (cost=0.00..1405.45 rows=119 width=0)
                 ->  Index Scan using idx_invites_member_id_delivery_id on invites i  (cost=0.00..431.03 rows=119 width=4)
                       Index Cond: (member_id = $0)
                 ->  Index Scan using deliveries_pkey on deliveries d  (cost=0.00..8.18 rows=1 width=4)
                       Index Cond: (d.id = i.delivery_id)
                       Filter: ((d.recontact_number = 0) AND (d.delivery_type = 1))
(27 rows)


If you inspect the plan, it's not computing the total expected cost correctly.

The top Aggregate node acts as a nested loop over SubPlans 1 & 2, but
it's only adding the cost of the subplans without regard to how many
iterations it will perform (254475).

EXPLAIN ANALYZE didn't finish within an hour of runtime, running on a
Core2 with 4G RAM.

It's not a big issue to me, I can work around it, but it should
perhaps be looked into.



Re: [PERFORM] Planner wrongly shuns multi-column index for select .. order by col1, col2 limit 1

2011-03-14 Thread Tom Lane
John Surcombe john.surco...@digimap.gg writes:
 It'd be
 useful to see the pg_stats.correlation value for both the userid and
 receiveddatetime columns.

 Yes, the table is indeed nearly perfectly ordered by receiveddatetime
 (correlation 0.998479).  correlation on userid is -0.065556.  n_distinct
 on userid is also low: 1097.

Ah-hah.

 Is the problem perhaps something like the following:  PostgreSQL is
 thinking that because there are not many userids and there is low
 correlation, that if it just scans the table from the top in date order,
 this will be cheap (because receiveddatetime correlation is high so it
 won't have to seek randomly), and it won't have to scan very far before
 it finds the first row with a matching userid.

There's some of that, but I think the main problem is that there's a
very high discount on the cost estimate for a perfectly-correlated
index, and that makes it end up looking cheaper to use than the
uncorrelated one.  (It doesn't help any that we don't do correlation
properly for multicolumn indexes; but given what you say above, the
correlation estimate for the two-column index would be small even if
we'd computed it exactly.)

You might find that reducing random_page_cost would avoid the problem.
That should reduce the advantage conferred on the high-correlation
index, and it probably would represent your actual configuration better
anyway, given the results you're showing here.
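
A sketch of that experiment (the value is an assumption to tune, the
table name is hypothetical, and the query shape follows the thread
subject):

SET random_page_cost = 2.0;   -- default is 4.0
EXPLAIN ANALYZE
SELECT * FROM messages
WHERE userid = 311369000
ORDER BY userid, receiveddatetime LIMIT 1;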

regards, tom lane



Re: [PERFORM] Performance regression from 8.3.7 to 9.0.3

2011-03-14 Thread Merlin Moncure
On Mon, Mar 14, 2011 at 10:54 AM, Claudio Freire klaussfre...@gmail.com wrote:
 Nothing?

 No ideas?

 Did I forget to include some useful bit?

 On Fri, Mar 4, 2011 at 8:22 PM, Claudio Freire klaussfre...@gmail.com wrote:
 Hello, first post to this list.

 I have this query that ran in milliseconds in postgres 8.3.7 (usually 50 to
 100 ms), and now it takes a full 10 minutes to complete.

 I tracked the problem to the usage of hash aggregates to resolve EXISTS
 clauses in conjunction with large IN clauses, which seem to reverse the
 execution plan - in 8.3.7, it would use indices to fetch the rows from the
 IN, then compute the exists with a nested loop, never doing big sequential
 scans. In 9.0.3, it computes the set of applicable entries with a hash
 aggregate, but in order to do that it performs a huge index scan - no
 sequential scans either, but the big index scan is worse.

 9.0.3 always misses the estimate of how many rows will come out of the hash
 aggregate, always estimating 200, while in fact the real count is more like
 300,000. I've tried increasing statistics in all the columns involved, up to
 4000 for each, to the point where it accurately estimates the input to the
 hash agg, but the output is always estimated to be 200 rows.

 Rewriting the query to use 0 < (select count(*)..) instead of EXISTS (select
 * ..) does revert to the old postgres 8.3 plan, although intuitively I would
 think it to be sub-optimal.

 The tables in question receive many updates, but never in such a volume as
 to create enough bloat - plus, the tests I've been running are on a
 pre-production server without much traffic (so not many updates - probably
 none in weeks).

 The server is a Core 2 E7400 dual core with 4GB of ram running linux and a
 pg 9.0.3 / 8.3.7 (both there, doing migration testing) built from source.
 Quite smaller than our production server, but I've tested the issue on
 higher-end hardware and it produces the same results.

 Any ideas as to how to work around this issue?

 I can't plug the select count() version everywhere, since I won't be using
 this form of the query every time (it's generated programmatically with an
 ORM), and some forms of it perform incredibly worse with the select count().

 Also, any help I can provide to fix it upstream I'll be glad to - I believe
 (I would have to check) I can even create a dump of the tables (stripping
 sensitive info of course) - only, well, you'll see the size below - a tad
 big to be mailing it ;-)

 pg 9.0 is configured with:

 work_mem = 64M
 shared_buffers = 512M
 temp_buffers = 64M
 effective_cache_size = 128M

 pg 8.3.7 is configured with:

 work_mem = 64M
 shared_buffers = 100M
 temp_buffers = 64M
 effective_cache_size = 128M


 The query in question:

 SELECT member_statistics.member_id
     FROM member_statistics
     WHERE member_statistics.member_id IN ( 400 ids ) AND (EXISTS
 (SELECT mat1.tag_id
     FROM member_all_tags_v AS mat1
     WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
 IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity = 90))

hm the regression in and of itself is interesting, but I wonder if you
can get past your issue like this:

SELECT member_statistics.member_id
 FROM member_statistics
 WHERE member_statistics.member_id IN ( 400 ids ) AND (EXISTS
 (SELECT mat1.tag_id
 FROM member_all_tags_v AS mat1
 WHERE mat1.member_id = member_statistics.member_id AND mat1.tag_id
 IN (640, 641, 3637, 3638, 637, 638, 639) AND mat1.polarity = 90))

changes to:

SELECT member_statistics.member_id
 FROM member_statistics
 WHERE EXISTS
 (
   SELECT mat1.tag_id
   FROM member_all_tags_v AS mat1
   WHERE mat1.member_id = member_statistics.member_id
 AND mat1.tag_id
 IN (640, 641, 3637, 3638, 637, 638, 639) AND
mat1.polarity = 90
 AND mat1.member_id  IN ( 400 ids )
 )

also, always try to compare vs the straight join version:


SELECT member_statistics.member_id
  FROM member_statistics
  JOIN (VALUES ( 400 ids )) q(member_id) USING (member_id)
  JOIN
  (
    SELECT mat1.member_id
    FROM member_all_tags_v AS mat1
    WHERE mat1.tag_id IN (640, 641, 3637, 3638, 637, 638, 639)
      AND mat1.polarity = 90
  ) p USING (member_id);

merlin



Re: [PERFORM] Performance regression from 8.3.7 to 9.0.3

2011-03-14 Thread Merlin Moncure
On Mon, Mar 14, 2011 at 12:46 PM, Claudio Freire klaussfre...@gmail.com wrote:
 On Mon, Mar 14, 2011 at 2:34 PM, Merlin Moncure mmonc...@gmail.com wrote:
 changes to:

 SELECT member_statistics.member_id
         FROM member_statistics
         WHERE EXISTS
         (
           SELECT mat1.tag_id
           FROM member_all_tags_v AS mat1
           WHERE mat1.member_id = member_statistics.member_id
             AND mat1.tag_id
             IN (640, 641, 3637, 3638, 637, 638, 639) AND
 mat1.polarity = 90
             AND mat1.member_id  IN ( 400 ids )
         )

 It isn't easy to get the ORM to spit out queries of that kind, but I could
 try them by hand.

 also, always try to compare vs the straight join version:


 SELECT member_statistics.member_id
   FROM member_statistics
   JOIN (VALUES ( 400 ids )) q(member_id) USING (member_id)
   JOIN
   (
     SELECT mat1.member_id
     FROM member_all_tags_v AS mat1
     WHERE mat1.tag_id IN (640, 641, 3637, 3638, 637, 638, 639)
       AND mat1.polarity = 90
   ) p USING (member_id);

 merlin

 The straight join like that was used long ago, but it replicates rows
 unacceptably: for each row in the subquery, one copy of member_id is
 output, which creates unacceptable overhead on the application and
 network side. It could perhaps be fixed with distinct, but then
 there's sorting overhead.

ah -- right. my mistake. well, you could always work around with
'distinct', although the exists version should be better (semi vs full
join).  what options *do* you have in terms of coaxing the ORM to
produce particular sql? :-).  This is likely 100% work-aroundable via
tweaking the SQL.  I don't have the expertise to suggest a solution
with your exact sql, if there is one.

merlin



Re: [PERFORM] Performance regression from 8.3.7 to 9.0.3

2011-03-14 Thread Tom Lane
Claudio Freire klaussfre...@gmail.com writes:
 CREATE OR REPLACE VIEW member_all_tags_v AS
  SELECT member_tags.member_id, member_tags.last_modification_date,
 member_tags.polarity, member_tags.tag_id, 'mr' AS source
FROM member_tags
 UNION ALL
  SELECT member_profile_tags.member_id,
 member_profile_tags.last_modification_date, member_profile_tags.polarity,
 member_profile_tags.tag_id, 'profile' AS source
FROM member_profile_tags;

Try casting those constants to text (or something) explicitly, ie
'mr'::text AS source etc.  I forget at the moment why leaving them as
unknown literals interferes with optimization, but it does.
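
Applied to the view quoted above, the suggested change would look like:

CREATE OR REPLACE VIEW member_all_tags_v AS
 SELECT member_tags.member_id, member_tags.last_modification_date,
        member_tags.polarity, member_tags.tag_id, 'mr'::text AS source
   FROM member_tags
 UNION ALL
 SELECT member_profile_tags.member_id,
        member_profile_tags.last_modification_date,
        member_profile_tags.polarity, member_profile_tags.tag_id,
        'profile'::text AS source
   FROM member_profile_tags;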

regards, tom lane



Re: [PERFORM] Table partitioning problem

2011-03-14 Thread Robert Haas
On Mon, Mar 14, 2011 at 12:42 PM, Samba GUEYE samba.gu...@intesens.com wrote:
 Yeah, but is there a workaround to force the root table to propagate the
 foreign key to the partitioned table? Because right now all foreign keys to
 the partitioned table throw constraint violations, and that's a big problem
 for me

No.  Generally, table partitioning is not a good idea unless you are
dealing with really large tables, and nearly all of your queries apply
only to a single partition.  Most likely you are better off not using
table inheritance in the first place if you need this feature.

It would be nice if we had a way to do this for the rare cases where
it would be useful, but we don't.
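
A common substitute, though weaker than a real foreign key since it
offers no protection against concurrent deletes and no ON DELETE
actions, is a validation trigger on the referencing table. A sketch,
with table and column names assumed from this thread:

CREATE FUNCTION check_measure_exists() RETURNS trigger AS $$
BEGIN
    -- unlike an FK check, this SELECT sees every partition of measure
    PERFORM 1 FROM measure WHERE id = NEW.measure_id;
    IF NOT FOUND THEN
        RAISE EXCEPTION 'measure % does not exist', NEW.measure_id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER simplemeasure_measure_fk
    BEFORE INSERT OR UPDATE ON simpleMeasure
    FOR EACH ROW EXECUTE PROCEDURE check_measure_exists();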

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PERFORM] Table partitioning problem

2011-03-14 Thread Conor Walsh
On Mon, Mar 14, 2011 at 12:40 PM, Robert Haas robertmh...@gmail.com wrote:
 Generally, table partitioning is not a good idea unless you are
 dealing with really large tables, and nearly all of your queries apply
 only to a single partition.  Most likely you are better off not using
 table inheritance in the first place if you need this feature.

I don't know if my tables count as 'large' or not, but I've gotten
some good mileage in the past out of time-based partitioning and
setting higher compression levels on old tables.  Also the ability to
drop-and-reload a day is sometimes useful, but I grant that it would
be better to never need to do that.

-C.



Re: [PERFORM] Bug in the planner?

2011-03-14 Thread Tom Lane
Claudio Freire klaussfre...@gmail.com writes:
 This is postgresql 9.0.3:
 Query:

 select
   sum(stat_responses) * 100.0 / sum(stat_invites) as stat_response_rate,
   sum(real_responses) * 100.0 / sum(real_invites) as real_response_rate
 from (
   select
     ms.invites as stat_invites,
     (select count(*)
      from invites i
      join deliveries d on d.id = i.delivery_id
      where i.member_id = ms.member_id
        and d.recontact_number = 0
        and d.delivery_type = 1) as real_invites,
     ms.responses as stat_responses,
     (select count(*)
      from track_logs tl
      join tracks t on t.id = tl.track_id
      where t.member_id = ms.member_id
        and t.partner_id is null and t.recontact_number = 0
        and t.contact_method_id = 1
        and t.delivery_type = 1
        and tl.track_status_id = 10) as real_responses
   from member_statistics ms
   join livra_users lu on lu.id = ms.member_id
   where lu.country_id = 2 and lu.is_panelist and lu.confirmed
     and not lu.unregistered
 ) as rtab;

 The top Aggregate node acts as a nested loop over SubPlans 1 & 2, but
 it's only adding the cost of the subplans without regard to how many
 iterations it will perform (254475)

Hmm, interesting.  The reason is that it's computing the cost of the
output SELECT list on the basis of the number of times that select list
will be evaluated, ie, once.  But the aggregate function argument
expressions will be evaluated more times than that.  Most of the time an
aggregate is applied to something trivial like a Var reference, so
nobody's noticed that the cost of its input expression is underestimated.

 EXPLAIN ANALYZE didn't finish within an hour of runtime, running on a
 Core2 with 4G RAM.

A better estimate isn't going to make that go any faster :-(.

regards, tom lane
