Re: [HACKERS] choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-13 Thread Alvaro Herrera
Tom Lane wrote:

> One idea I thought about was to sort by index scan cost, using
> selectivity only as a tiebreaker for cost, rather than the other way
> around as is currently done.  This seems fairly plausible because
> indexscans that are cheaper than other indexscans likely return fewer
> rows too, and so selectivity is already accounted for to some extent ---
> at least you can't have an enormously worse selectivity at lower cost,
> whereas Steve's example proves it doesn't work the other way.  But I'm
> worried about breaking the reasoning about redundant indexes that's
> mentioned in the comments.
> 
> Another alternative that would respond to the immediate problem is to
> maintain the current sort order, but as we come to each index, consider
> using that one alone, and throw away whatever AND we might have built up
> if that one alone beats the AND-so-far.  This seems more conservative,
> as it's unlikely to break any cases that work well now, but on the other
> hand it feels like plastering another wart atop a structure that's
> already rather rickety.
> 
> Has anyone got any thoughts about the best way to do this?

How about doing both: sort the index by index scan cost; then pick the
first index on the list and start adding indexes when they lower the
cost.  When adding each index, consider it by itself against the
already stacked indexes.  If the cost is lower, put this index at the
top of the list, and restart the algorithm (after the sorting step of
course).

I think the concern about condition redundancy should be attacked
separately.  How about just comparing whether they have common prefixes
of conditions?  I admit I don't understand what would happen with
indexes defined like (lower(A), B, C) versus (A, B) for example.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



choose_bitmap_and again (was Re: [PERFORM] Strangely Variable Query Performance)

2007-04-13 Thread Tom Lane
Steve <[EMAIL PROTECTED]> writes:
> [ strange planner misbehavior in 8.2.3 ]

After some off-list investigation (thanks, Steve, for letting me poke
at your machine), the short answer is that the heuristics used by
choose_bitmap_and() suck.  The problem query is like 

select ... from ds where
ds.receipt >= '1998-12-30 0:0:0' and
ds.encounter_id in ( ... 100 distinct values ... );

and the table has a truly remarkable variety of indexes on encounter_id,
receipt, and combinations of them with other columns.  The receipt
condition is actually in effect a no-op, because all receipt dates are
later than that, but because ineq_histogram_selectivity doesn't trust
histogram data unreservedly we compute a selectivity of about 0.7
for it.  That means that the indexes that cover both receipt and
encounter_id are given a selectivity score just fractionally better than
those involving encounter_id alone, and therefore they sort first in
choose_bitmap_and's sort step, and the way that that routine is coded,
only combinations of the very first index with other ones will be
considered for a bitmap heap scan.  So the possibility of using just the
index on encounter_id alone is never considered, even though that
alternative is vastly cheaper than the alternatives that are considered.
(It happens that encounter_id is a low-order column in all the indexes
that include receipt, and so these scans end up covering the whole index
... multiple times even.  The cost estimation is fine --- the thing
knows these are expensive --- what's falling down is the heuristic for
which combinations of indexes to consider using in a bitmap scan.)
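
For illustration, a quick way to see that misestimate in isolation (just a
sketch, using nothing but the table and column names from the query above)
is to compare the planner's row estimate for the receipt clause alone
against the actual count:

-- The row estimate in the EXPLAIN output reflects the ~0.7 selectivity
-- described above, while the actual count is essentially the whole table,
-- since every receipt date is later than the constant.
explain analyze
select count(*) from ds
where ds.receipt >= '1998-12-30 0:0:0';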

The original coding of choose_bitmap_and involved a "fuzzy" comparison
of selectivities, which would have avoided this problem, but we got rid
of that later because it had its own problems.  In fact,
choose_bitmap_and has caused us enough problems that I'm thinking we
need a fundamental rethink of how it works, rather than just marginal
tweaks.  If you haven't looked at this code before, the comments explain
the idea well enough:

/*
 * choose_bitmap_and
 *Given a nonempty list of bitmap paths, AND them into one path.
 *
 * This is a nontrivial decision since we can legally use any subset of the
 * given path set.  We want to choose a good tradeoff between selectivity
 * and cost of computing the bitmap.
 *
 * The result is either a single one of the inputs, or a BitmapAndPath
 * combining multiple inputs.
 */
...
/*
 * In theory we should consider every nonempty subset of the given paths.
 * In practice that seems like overkill, given the crude nature of the
 * estimates, not to mention the possible effects of higher-level AND and
 * OR clauses.  As a compromise, we sort the paths by selectivity.  We
 * always take the first, and sequentially add on paths that result in a
 * lower estimated cost.
 *
 * We also make some effort to detect directly redundant input paths, as
 * can happen if there are multiple possibly usable indexes.  (Another way
 * it can happen is that best_inner_indexscan will find the same OR join
 * clauses that create_or_index_quals has pulled OR restriction clauses
 * out of, and then both versions show up as duplicate paths.)  We
 * consider an index redundant if any of its index conditions were already
 * used by earlier indexes.  (We could use predicate_implied_by to have a
 * more intelligent, but much more expensive, check --- but in most cases
 * simple pointer equality should suffice, since after all the index
 * conditions are all coming from the same RestrictInfo lists.)
 *
 * You might think the condition for redundancy should be "all index
 * conditions already used", not "any", but this turns out to be wrong.
 * For example, if we use an index on A, and then come to an index with
 * conditions on A and B, the only way that the second index can be later
 * in the selectivity-order sort is if the condition on B is completely
 * non-selective.  In any case, we'd surely be drastically misestimating
 * the selectivity if we count the same condition twice.
 *
 * We include index predicate conditions in the redundancy test.  Because
 * the test is just for pointer equality and not equal(), the effect is
 * that use of the same partial index in two different AND elements is
 * considered redundant.  (XXX is this too strong?)
 *
 * Note: outputting the selected sub-paths in selectivity order is a good
 * thing even if we weren't using that as part of the selection method,
 * because it makes the short-circuit case in MultiExecBitmapAnd() more
 * likely to apply.
 */


One idea I thought about was to sort by index scan cost, using
selectivity only as a tiebreaker for cost, rather than the other way
around as is currently done.  This seems fairly plausible because
indexscans that are cheaper than other indexscans likely return fewer
rows too, and so selectivity is already accounted for to some extent ---
at least you can't have an enormously worse selectivity at lower cost,
whereas Steve's example proves it doesn't work the other way.  But I'm
worried about breaking the reasoning about redundant indexes that's
mentioned in the comments.

Another alternative that would respond to the immediate problem is to
maintain the current sort order, but as we come to each index, consider
using that one alone, and throw away whatever AND we might have built up
if that one alone beats the AND-so-far.  This seems more conservative,
as it's unlikely to break any cases that work well now, but on the other
hand it feels like plastering another wart atop a structure that's
already rather rickety.

Has anyone got any thoughts about the best way to do this?

Re: [PERFORM] Finding bloated indexes?

2007-04-13 Thread Vivek Khera


On Apr 13, 2007, at 4:01 PM, Dan Harris wrote:

Is there a pg_stat_* table or the like that will show how bloated  
an index is? I am trying to squeeze some disk space and want to  
track down where the worst offenders are before performing a global  
REINDEX on all tables, as the database is roughly 400GB on disk and  
this takes a very long time to run.


I find this as a helpful guide:

select relname,relkind,relpages from pg_class where relname like 'user%';


for example (obviously change the LIKE clause to something useful to  
you).


then with your knowledge of how big your rows are and how many  
relpages the table itself takes, you can see if your index is too  
big.  It helps to watch these numbers over time.
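
If you want to take that a step further, something like this (just a sketch
--- it assumes the default 8K page size and reasonably fresh statistics)
lists the largest indexes next to their parent table's row counts, which
makes the obvious offenders stand out:

-- Largest indexes with their parent table's estimated row count.
-- Assumes 8192-byte pages; reltuples/relpages are only as current as the
-- last ANALYZE or VACUUM.
select c.relname as index_name,
       c.relpages,
       pg_size_pretty(c.relpages::bigint * 8192) as index_size,
       t.relname as table_name,
       t.reltuples as table_rows
from pg_class c
join pg_index i on i.indexrelid = c.oid
join pg_class t on t.oid = i.indrelid
where c.relkind = 'i'
order by c.relpages desc
limit 20;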


Also, running "analyze verbose" on the table gives you a hint at how  
sparse the pages are, which might imply something for table bloat.   
I'm not sure.


More expensive is "vacuum verbose" which gives lots of info on how  
many "unused pointers" there are in your indexes.  This may be of  
use.  If this is a high number compared to the number of row  
versions, then you probably have bloat there.
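
For example (hypothetical table name --- both commands work per-table, so
you can target the suspected offenders rather than the whole 400GB
database):

-- Hypothetical table name.  ANALYZE VERBOSE reports how many pages it
-- sampled and how many live/dead rows it found; VACUUM VERBOSE reports
-- removable and nonremovable row versions plus the per-index pointer
-- counts mentioned above.
analyze verbose my_big_table;
vacuum verbose my_big_table;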








Re: [PERFORM] Please humor me ...

2007-04-13 Thread Simon Riggs
On Mon, 2007-04-09 at 16:05 -0400, Carlos Moreno wrote:
> 2) What would be the real implications of doing that?

Many people ask, which is why a whole chapter of the manual is devoted to
this important topic.

http://developer.postgresql.org/pgdocs/postgres/wal.html

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [PERFORM] Question about memory allocations

2007-04-13 Thread Jan de Visser
On Friday 13 April 2007 14:53:53 Carlos Moreno wrote:
> How does PG take advantage of the available memory?  I mean, if I have a
> machine with, say, 4 or 8GB of memory, how would those GBs end
> up being used?   They just do??   (I mean, I would find that a valid
> answer;

On Linux the filesystem cache will gobble them up, which means pgsql
indirectly profits as well (assuming no other apps poison the fs cache).
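
The setting that ties the two together is effective_cache_size: it doesn't
allocate anything, it just tells the planner how much filesystem cache it
can reasonably expect to benefit from.  A minimal way to check the relevant
knobs (just a sketch --- the values are whatever your config currently has):

-- shared_buffers is memory PostgreSQL allocates for itself;
-- effective_cache_size is only a planner hint about the OS cache;
-- work_mem is per-sort/per-hash memory.
show shared_buffers;
show effective_cache_size;
show work_mem;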

jan

-- 
--
Jan de Visser                     [EMAIL PROTECTED]

                Baruk Khazad! Khazad ai-menu!
--



[PERFORM] Finding bloated indexes?

2007-04-13 Thread Dan Harris
Is there a pg_stat_* table or the like that will show how bloated an index is? 
I am trying to squeeze some disk space and want to track down where the worst 
offenders are before performing a global REINDEX on all tables, as the database 
is roughly 400GB on disk and this takes a very long time to run.


I have been able to do this with tables, using a helpful view posted to this 
list a few months back, but I'm not sure if I can get the same results on indexes.


Thanks

-Dan



Re: [PERFORM] Question about memory allocations

2007-04-13 Thread Carlos Moreno

Steve wrote:


Common wisdom in the past has been that values above a couple of hundred
MB will degrade performance.  


The annotated config file talks about setting shared_buffers to a third of
the available memory --- well, it says "it should be no more than 1/3 of the
total amount of memory" (quoting off the top of my head).  Don't recall
seeing any warning about not exceeding a few hundred megabytes.

My eternal curiosity when it comes to this memory and shared_buffers thing:

How does PG take advantage of the available memory?  I mean, if I have a
machine with, say, 4 or 8GB of memory, how would those GBs end up being
used?   They just do??   (I mean, I would find that a valid answer; but I
ask because this configuration-parameters stuff makes me think that perhaps
PG does not simply use whatever memory is in there, but has to go through
the parameters in the config file to allocate whatever it has to use.)

So, is it just like that?   We put in more memory and PG will automatically
make use of it?

Carlos
--




Re: [PERFORM] Question about memory allocations

2007-04-13 Thread Ron

At 12:38 PM 4/13/2007, Steve wrote:

Really?

Wow!

Common wisdom in the past has been that values above a couple of hundred
MB will degrade performance.  Have you done any benchmarks on 8.2.x that
show that you get an improvement from this, or did you just take the
"too much of a good thing is wonderful" approach?


Not to be rude, but there's more common wisdom on this 
particular subject than anything else in postgres I'd say ;)  I 
think I recently read someone else on this list who's 
laundry-listed the recommended memory values that are out there 
these days and pretty much it ranges from what you've just said to 
"half of system memory".


I've tried many memory layouts, and in my own experience 
with this huge DB, more -does- appear to be better but marginally 
so; more memory alone won't fix a speed problem.  It may be a 
function of how much reading/writing is done to the DB and if fsync 
is used or not if that makes any sense :)  Seems there's no "silver 
bullet" to the shared_memory question.  Or if there is, nobody can 
agree on it ;)


One of the reasons for the wide variance in suggested values for pg 
memory use is that pg 7.x and pg 8.x are =very= different beasts.


If you break the advice into pg 7.x and pg 8.x categories, you find 
that there is far less variation in the suggestions.


Bottom line: pg 7.x could not take advantage of larger sums of memory 
anywhere near as well as pg 8.x can.


Cheers,
Ron 





Re: [PERFORM] Question about memory allocations

2007-04-13 Thread Steve

Really?

Wow!

Common wisdom in the past has been that values above a couple of hundred
MB will degrade performance.  Have you done any benchmarks on 8.2.x that
show that you get an improvement from this, or did you just take the
"too much of a good thing is wonderful" approach?



	Not to be rude, but there's more common wisdom on this particular 
subject than anything else in postgres I'd say ;)  I think I recently read 
someone else on this list who's laundry-listed the recommended memory 
values that are out there these days and pretty much it ranges from 
what you've just said to "half of system memory".


	I've tried many memory layouts, and in my own experience with 
this huge DB, more -does- appear to be better but marginally so; more 
memory alone won't fix a speed problem.  It may be a function of how much 
reading/writing is done to the DB and if fsync is used or not if that 
makes any sense :)  Seems there's no "silver bullet" to the shared_memory 
question.  Or if there is, nobody can agree on it ;)



Anyway, talk to you later!


Steve



Re: [PERFORM] local selectivity estimation - computing frequency of predicates

2007-04-13 Thread Tom Lane
"Avdhoot Kishore Saple" <[EMAIL PROTECTED]> writes:
> How to compute the frequency of a predicate (e.g. Salary > $7) in an
> SQL query from a DB's pre-defined indexes?  I'm specifically looking at
> how to retrieve information about indices (like number of pages at each
> level of index, range of attribute values etc.)

I don't think what you're looking for is exposed anywhere.  Postgres
doesn't rely on indexes for statistical information anyway; the
pg_statistic system catalog (see also pg_stats view) is used for that.
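
For instance, a sketch of what's in there (table and column names here are
hypothetical):

-- pg_stats exposes, per column, the most common values and their
-- frequencies plus histogram bucket boundaries --- the raw material for
-- the planner's selectivity estimates.
select null_frac, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds
from pg_stats
where tablename = 'employees' and attname = 'salary';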

regards, tom lane



[PERFORM] local selectivity estimation - computing frequency of predicates

2007-04-13 Thread Avdhoot Kishore Saple

Dear All.

 

How to compute the frequency of a predicate (e.g. Salary > $7) in an
SQL query from a DB's pre-defined indexes?  I'm specifically looking at
how to retrieve information about indices (like number of pages at each
level of index, range of attribute values etc.)

 

Any suggestions regarding this would be great.

 

Thanks,

 

 

Avdhoot K. Saple
Junior Research Associate
High Performance & Grid Computing 
Infosys Technologies Ltd., Bangalore

 




Re: [PERFORM] Question about memory allocations

2007-04-13 Thread Andrew McMillan
On Tue, 2007-04-10 at 15:28 -0400, Steve wrote:
> 
> I'm trying to tune the memory usage of a new machine that has a -lot- of 
> memory in it (32 gigs).

...
> 
> shared_buffers = 16GB

Really?

Wow!

Common wisdom in the past has been that values above a couple of hundred
MB will degrade performance.  Have you done any benchmarks on 8.2.x that
show that you get an improvement from this, or did you just take the
"too much of a good thing is wonderful" approach?

Cheers,
Andrew.

-
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/          PHYS: Level 2, 150-154 Willis St
DDI: +64(4)803-2201  MOB: +64(272)DEBIAN  OFFICE: +64(4)499-2267
 You have an unusual equipment for success.  Be sure to use it properly.
-


