[PERFORM] Merge Join vs Nested Loop

2006-09-26 Thread Tobias Brox
I have some odd cases here joining two tables - the planner insists on a
Merge Join, but a Nested Loop is really faster - and that makes sense,
since I'm selecting just a small fraction of the available data.  All
planner constants seem to be set to the default values, and the only way
to get a shift towards Nested Loops seems to be to raise the constants.
I believe our memory is big enough to hold the indices, and that
effective_cache_size is set to a sane value (but how does one verify
that, anyway?).

What causes the nested loops to be estimated so costly - or is it the
merge joins that are estimated too cheaply?  Should I raise all the
planner cost constants, or only one of them?
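
For reference, the current values can be read straight out of any psql
session - a quick sketch, with nothing below specific to this particular
installation:

SHOW effective_cache_size;
SHOW random_page_cost;
SHOW cpu_tuple_cost;
SHOW cpu_index_tuple_cost;
SHOW cpu_operator_cost;

As a rule of thumb, effective_cache_size should roughly reflect
shared_buffers plus whatever the OS is caching on Postgres' behalf.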

Here are some sample explains:


prod=> explain analyze select * from ticket join users on users_id=users.id
where ticket.created > '2006-09-25 17:00';

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..67664.15 rows=10977 width=675) (actual time=0.038..202.877 rows=10627 loops=1)
   ->  Index Scan using ticket_on_created on ticket  (cost=0.00..11665.94 rows=10977 width=80) (actual time=0.014..35.571 rows=10627 loops=1)
         Index Cond: (created > '2006-09-25 17:00:00'::timestamp without time zone)
   ->  Index Scan using users_pkey on users  (cost=0.00..5.00 rows=1 width=595) (actual time=0.007..0.008 rows=1 loops=10627)
         Index Cond: ("outer".users_id = users.id)
 Total runtime: 216.612 ms
(6 rows)

prod=> explain analyze select * from ticket join users on users_id=users.id
where ticket.created > '2006-09-25 16:00';

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=12844.93..68580.37 rows=11401 width=675) (actual time=106.631..1712.458 rows=11554 loops=1)
   Merge Cond: ("outer".id = "inner".users_id)
   ->  Index Scan using users_pkey on users  (cost=0.00..54107.38 rows=174508 width=595) (actual time=0.041..1215.221 rows=174599 loops=1)
   ->  Sort  (cost=12844.93..12873.43 rows=11401 width=80) (actual time=105.753..123.905 rows=11554 loops=1)
         Sort Key: ticket.users_id
         ->  Index Scan using ticket_on_created on ticket  (cost=0.00..12076.68 rows=11401 width=80) (actual time=0.074..65.297 rows=11554 loops=1)
               Index Cond: (created > '2006-09-25 16:00:00'::timestamp without time zone)
 Total runtime: 1732.452 ms
(8 rows)
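
One way to see how the planner costs the alternative is to disable the
merge join for the session and re-run the query - a small sketch, nothing
here changes any permanent setting:

SET enable_mergejoin = off;
EXPLAIN ANALYZE
  SELECT * FROM ticket JOIN users ON users_id = users.id
  WHERE ticket.created > '2006-09-25 16:00';
RESET enable_mergejoin;

If the nested loop's estimated cost then comes out higher than the merge
join's even though its actual time is lower, it's the cost constants
rather than the row estimates that are off.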



Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Bucky Jordan
> > The bottom line here is likely to be you need more RAM :-(
>
> Yup.  Just trying to get a handle on what I can do if I need more than
> 16G of RAM... That's as much as I can put on the installed base of
> servers, 100s of them.
>
> > I wonder whether there is a way to use table partitioning to
> > make the insert pattern more localized?  We'd need to know a
> > lot more about your insertion patterns to guess how, though.
> >
> > regards, tom lane
>
> We're doing partitioning as well.
 
I'm guessing that you basically have a data collection application that
sends in lots of records, and a reporting application that wants
summaries of the data? So, if I understand the problem correctly, you
don't have enough ram (or may not in the future) to index the data as it
comes in. 

Not sure how much you can change the design, but what about either
updating summary table(s) as the records come in (via a trigger, as part
of the transaction, or in the application), or indexing periodically? In
other words, load a partition (say a day's worth), then index that
partition all at once. If you're doing real-time analysis that might not
work so well, but the summary tables should.
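
A minimal sketch of the trigger variant, using purely made-up table and
column names (log_raw standing in for the insert-heavy table, assumed to
have at least a logtime timestamp and an integer key):

CREATE TABLE log_hourly (
    bucket   timestamp NOT NULL,
    key      integer   NOT NULL,
    hits     bigint    NOT NULL DEFAULT 0,
    PRIMARY KEY (bucket, key)
);

CREATE OR REPLACE FUNCTION log_rollup() RETURNS trigger AS $$
BEGIN
    -- bump the existing summary row if there is one
    UPDATE log_hourly
       SET hits = hits + 1
     WHERE bucket = date_trunc('hour', NEW.logtime)
       AND key = NEW.key;
    IF NOT FOUND THEN
        INSERT INTO log_hourly (bucket, key, hits)
        VALUES (date_trunc('hour', NEW.logtime), NEW.key, 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_raw_rollup
    AFTER INSERT ON log_raw
    FOR EACH ROW EXECUTE PROCEDURE log_rollup();

Caveats: under concurrent inserters the UPDATE-then-INSERT needs retry
handling, and a per-row trigger is real overhead at high insert rates, so
batching the rollup in the application may well be cheaper.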

I assume the application generates unique records on its own due to the
timestamp, so this isn't really about checking for constraint
violations? If so, you can probably do away with the index on the tables
that you're running the inserts on.

- Bucky



Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Marc Morin
Yes, that is our application.   We have implemented both scenarios...

1- Partitions loaded without indexes on them, and the index built when the
partition is full.  Slow to drill down into incomplete partitions.
2- Partitions with the index as loaded.  Slow on insert (problem mentioned),
but good to drill down.

So, I'd like my cake and eat it too... :-)

I'd like to have my indexes built as rows are inserted into the
partition, to help with the drill down...

 -Original Message-
 From: Bucky Jordan [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, September 26, 2006 5:26 PM
 To: Marc Morin; Tom Lane
 Cc: Markus Schaber; pgsql-performance@postgresql.org
 Subject: RE: [PERFORM] Decreasing BLKSZ 
 
   The bottom line here is likely to be you need more RAM :-(
  
  Yup.  Just trying to get a handle on what I can do if I 
 need more than 
  16G Of ram... That's as much as I can put on the installed based of 
  servers 100s of them.
  
  
   I wonder whether there is a way to use table partitioning to make 
   the insert pattern more localized?  We'd need to know a lot more 
   about your insertion patterns to guess how, though.
  
 regards, tom lane
  
  We're doing partitioning as well.
  
 I'm guessing that you basically have a data collection 
 application that sends in lots of records, and a reporting 
 application that wants summaries of the data? So, if I 
 understand the problem correctly, you don't have enough ram 
 (or may not in the future) to index the data as it comes in. 
 
 Not sure how much you can change the design, but what about 
 either updating a summary table(s) as the records come in 
 (trigger, part of the transaction, or do it in the 
 application) or, index periodically? In otherwords, load a 
 partition (say a day's worth) then index that partition all 
 at once. If you're doing real-time analysis that might not 
 work so well though, but the summary tables should. 
 
 I assume the application generates unique records on its own 
 due to the timestamp, so this isn't really about checking for 
 constraint violations? If so, you can probably do away with 
 the index on the tables that you're running the inserts on.
 
 - Bucky
 



Re: [PERFORM] Merge Join vs Nested Loop

2006-09-26 Thread Tom Lane
Tobias Brox [EMAIL PROTECTED] writes:
> What causes the nested loops to be estimated so costly - or is it the
> merge joins that are estimated too cheaply?  Should I raise all the
> planner cost constants, or only one of them?

If your tables are small enough to fit (mostly) in memory, then the
planner tends to overestimate the cost of a nestloop because it fails to
account for caching effects across multiple scans of the inner table.
This is addressed in 8.2, but in earlier versions about all you can do
is reduce random_page_cost, and a sane setting of that (ie not less than
1.0) may not be enough to push the cost estimates where you want them.
Still, reducing random_page_cost ought to be your first recourse.

regards, tom lane
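
For anyone experimenting with this, random_page_cost can be tried per
session before touching postgresql.conf - a small sketch, with 2.0 used
purely as an example value:

SET random_page_cost = 2.0;
EXPLAIN ANALYZE
  SELECT * FROM ticket JOIN users ON users_id = users.id
  WHERE ticket.created > '2006-09-25 16:00';
RESET random_page_cost;

If the cheaper setting flips the plan to the nested loop and the actual
runtimes agree with it, the same value can be put in postgresql.conf and
picked up with a reload.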



Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Bucky Jordan
 
> So, I'd like my cake and eat it too... :-)
>
> I'd like to have my indexes built as rows are inserted into the
> partition, to help with the drill down...
 
So you want to drill down at so fine a grain that summary tables don't do
much good? Keep in mind, even if you roll up only two records, that's
half as many rows you have to process (be it for drill down or indexing).
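
As a rough illustration of what a periodic rollup looks like - the table
and column names here are made up, not taken from earlier in the thread:

INSERT INTO log_minute (bucket, key, hits, stat1_sum)
SELECT date_trunc('minute', logtime), key, count(*), sum(stat1)
  FROM log_raw
 WHERE logtime >= '2006-09-26 00:00' AND logtime < '2006-09-27 00:00'
 GROUP BY date_trunc('minute', logtime), key;

If many rows share a minute and a key, drill-downs against the summary
touch far fewer rows, and only the audit-style lookups have to fall back
to the raw table.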

I've seen applications that have a log table with no indexes/constraints
and lots of records being inserted, then they only report on very fine
grained summary tables. Drill downs still work pretty well, but if you
get audited and want to see that specific action, well, you're in for a
bit of a wait, but hopefully that doesn't happen too often.

If that's the case (summary tables won't work), I'd be very curious how
you manage to get your cake and eat it too :) 

- Bucky



Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Markus Schaber
Hi, Marc,

Marc Morin wrote:

>> I wonder whether there is a way to use table partitioning to
>> make the insert pattern more localized?  We'd need to know a
>> lot more about your insertion patterns to guess how, though.
>
> We're doing partitioning as well.

And is constraint exclusion set up properly, and have you verified that
it works?
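
A quick way to check it, sketched here with a hypothetical parent table
stats and one child stats_20060926 carrying a CHECK constraint on
logtime:

SHOW constraint_exclusion;    -- defaults to off
SET constraint_exclusion = on;
EXPLAIN SELECT count(*) FROM stats
 WHERE logtime >= '2006-09-26' AND logtime < '2006-09-27';

If exclusion is working, the plan scans only stats_20060926 (plus the
empty parent) instead of every child partition.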

HTH,
Markus



Re: [PERFORM] Update on high concurrency OLTP application and Postgres

2006-09-26 Thread Jim Nasby
Have you ever done any testing to see if just setting  
default_statistics_target to 500 has a negative impact on the system?


On Sep 22, 2006, at 4:48 PM, Cosimo Streppone wrote:


> Christian Storm wrote:
>
>>> At the moment, my rule of thumb is to check out the ANALYZE VERBOSE
>>> messages to see if all table pages are being scanned.
>>>
>>>   INFO: mytable: scanned xxx of yyy pages, containing ...
>>>
>>> If xxx = yyy, then I keep statistics at the current level.
>>> When xxx is way less than yyy, I increase the numbers a bit
>>> and retry.
>>>
>>> It's probably primitive, but it seems to work well.
>>
>> What heuristic do you use to up the statistics for such a table?
>
> No heuristics, just try and see.
> For tables of ~ 10k pages, I set statistics to 100/200.
> For ~ 100k pages, I set them to 500 or more.
> I don't know the exact relation.


>> Once you've changed it, what metric do you use to
>> see if it helps or was effective?
>
> I rerun an analyze and see the results... :-)
> If you mean checking the usefulness, I can see it only
> under heavy load, if particular db queries run in the order
> of a few milliseconds.
>
> If I see normal queries that take longer and longer, or
> they even appear in the server's log (> 500 ms), then
> I know an analyze is needed, or statistics should be set higher.
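
For concreteness, per-column statistics are raised like this (table and
column names are placeholders, and the new target only takes effect at
the next ANALYZE):

ALTER TABLE mytable ALTER COLUMN mycolumn SET STATISTICS 200;
ANALYZE VERBOSE mytable;

The INFO line from ANALYZE VERBOSE then shows whether the larger sample
now covers most of the table's pages.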

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] IN not handled very well?

2006-09-26 Thread Jim Nasby

On Sep 24, 2006, at 2:12 PM, Ben wrote:

> Ah, so I do. Thanks, that helps an awful lot.
>
> But the plan is still twice as expensive as when I put in the
> static values. Is it just unreasonable to expect the planner to see
> that there aren't many rows in the subselect, so to use the bitmap
> scans after all?


Based on your initial post, it probably should know that it's only  
getting 15 rows (since it did in your initial plan), so it's unclear  
why it's not choosing the bitmap scan.


Can you post the results of EXPLAIN ANALYZE?
--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] Decreasing BLKSZ

2006-09-26 Thread Jim Nasby

On Sep 26, 2006, at 5:36 PM, Marc Morin wrote:

> 1- partitions loaded without indexes on them.. And build index when
> partition is full.  Slow to drill down into incomplete partitions.
> 2- partitions with index as loaded.  Slow on insert (problem mentioned)
> but good to drill down


How big are your partitions? The number of rows in your active  
partition will determine how large your indexes are (and probably  
more importantly, how many levels there are), which will definitely  
affect your timing. So, you might have better luck with a smaller  
partition size.


I'd definitely try someone else's suggestion of making the PK
(logtime, key) (assuming that you need to enforce uniqueness) and
having an extra index on just key. If you don't need to enforce
uniqueness, just have one index on key and one on logtime. Or if your
partitions are small enough, don't even create the logtime index
until the partition isn't being inserted into anymore.
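
A sketch of that layout with placeholder names and types, stats_20060926
standing in for one day's partition:

CREATE TABLE stats_20060926 (
    logtime  timestamp NOT NULL,
    key      integer   NOT NULL,
    stat1    integer,
    stat2    integer,
    stat3    integer,
    PRIMARY KEY (logtime, key)
);
CREATE INDEX stats_20060926_key ON stats_20060926 (key);

-- or, if uniqueness isn't needed, two plain indexes instead of the PK:
-- CREATE INDEX stats_20060926_logtime ON stats_20060926 (logtime);
-- CREATE INDEX stats_20060926_key     ON stats_20060926 (key);

Since new rows arrive roughly in logtime order, inserts land on the
right-hand edge of the (logtime, key) index, which is exactly the kind
of locality this thread is after.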


If the number of key values is pretty fixed, it'd be an interesting
experiment to try partitioning on that, perhaps even with one key per
partition (which would allow you to drop the key from the tables
entirely), ie:


CREATE TABLE stats_1 (logtime timestamp PRIMARY KEY, stat1 int, stat2 int, stat3 int);  -- types assumed
CREATE TABLE stats_2 ...

CREATE VIEW stats AS
SELECT 1 AS key, * FROM stats_1
UNION ALL SELECT 2, * FROM stats_2
...

I wouldn't put too much work into that as no real effort's been  
expended to optimize for that case (especially the resulting monster  
UNION ALL), but you might get lucky.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [PERFORM] Confusion and Questions about blocks read

2006-09-26 Thread Jim Nasby

On Sep 23, 2006, at 8:19 AM, Markus Schaber wrote:

> Btw, would it be feasible to enhance normal index scans by checking
> whether all rows in the current table block meet the query criteria,
> fetching them all, and blacklisting the block against further revisiting
> during the same index scan?
>
> I think that, for non-sorted cases, this could improve index scans a
> little, but I don't know whether it's worth the effort, given that
> bitmap index scans exist.


The trade-off is you'd burn a lot more CPU on those pages. What might  
be interesting would be collapsing bitmap scan data down to a page  
level when certain conditions were met, such as if you're getting a  
significant number of hits for a given page. There's probably other  
criteria that could be used as well. One issue would be considering  
the effects of other bitmap index operations; if you're ANDing a  
bunch of scans together, you're likely to have far fewer tuples per  
page coming out the backside, which means you probably wouldn't want  
to burn the extra CPU to do full page scans.


BTW, I remember discussion at some point about ordering the results  
of a bitmap scan by page/tuple ID, which would essentially do what  
you're talking about. I don't know if it actually happened or not,  
though.


If this is something that interests you, I recommend taking a look at  
the code; it's generally not too hard to read through thanks to all  
the comments.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)









Re: [PERFORM] PostgreSQL and sql-bench

2006-09-26 Thread Jim Nasby

On Sep 25, 2006, at 10:58 AM, yoav x wrote:
> I am not comparing Postgres to MyISAM (obviously it is not a very
> fair comparison) and we do need ACID, so all comparisons are made
> against InnoDB (which now supports MVCC as well). I will try again
> with the suggestions posted here.


Make sure that you're not inadvertently disabling ACIDity in
MySQL/InnoDB; some options/performance tweaks will do that, last I looked.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)


