[HACKERS] Improve compression speeds in pg_lzcompress.c

2013-01-06 Thread Takeshi Yamamuro
Hi, hackers,

The attached is a patch to improve compression speed, at some cost in
compression ratio, in backend/utils/adt/pg_lzcompress.c. Recent
modern compression techniques like Google LZ4 and Snappy inspired
me to write this patch. There are two points to my patch:

1. Skip at most 255 literals that might be incompressible
   during pattern matching for LZ compression.

2. Update a hash table every PGLZ_HASH_GAP literals.

A sequence of literals is typically a mix of compressible parts
and incompressible ones, so IMHO it is reasonable to skip
PGLZ_SKIP_SIZE literals each time a match is not found. The skipped
literals are just copied to the output buffer, so pglz_out_literal() is
rewritten (and renamed pglz_out_literals) so as to copy multiple
bytes, not a single byte.

Also, the current implementation updates a hash table for every single
literal. As these updates obviously eat much processor time, skipping
some of the updates dynamically improves compression speed.
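
To make the modified literal format concrete, here is a tiny
self-contained C sketch (mine, not code from the patch; the function
names are made up) of encoding and decoding one such run of 1-255
literals, as emitted under an unset control bit:

#include <stdio.h>
#include <string.h>

/* emit one literal run: a one-byte length header, then the literals */
static int
emit_literals(unsigned char *out, const unsigned char *src, int len)
{
	out[0] = (unsigned char) len;	/* length header, 1-255 */
	memcpy(out + 1, src, len);		/* literals copied verbatim */
	return len + 1;					/* bytes written */
}

/* decode one literal run; returns bytes consumed from the input */
static int
copy_literals(const unsigned char *in, unsigned char *out)
{
	int			len = in[0];

	memcpy(out, in + 1, len);
	return len + 1;
}

int
main(void)
{
	const unsigned char src[] = "some incompressible input";
	unsigned char enc[64];
	unsigned char dec[64];
	int			n = emit_literals(enc, src, (int) sizeof(src));

	copy_literals(enc, dec);
	printf("emitted %d bytes, round-trip ok: %d\n",
		   n, memcmp(src, dec, sizeof(src)) == 0);
	return 0;
}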

I've done quick comparison tests with a Xeon 5670 processor.
Log sequences from Apache Hadoop and TREC GOV2 web data were used
as test sets. The former is highly compressible (low entropy) and the
latter is difficult to compress (high entropy).

***
 Compression Speed (Ratio)
Apache hadoop logs:
gzip                            78.22MiB/s ( 5.31%)
bzip2                            3.34MiB/s ( 3.04%)
lz4                            939.45MiB/s ( 9.17%)
pg_lzcompress (original)        37.80MiB/s (11.76%)
pg_lzcompress (patch applied)   99.42MiB/s (14.19%)

TREC GOV2 web data:
gzip                            21.22MiB/s (32.66%)
bzip2                            8.61MiB/s (27.86%)
lz4                            250.98MiB/s (49.82%)
pg_lzcompress (original)        20.44MiB/s (50.09%)
pg_lzcompress (patch applied)   48.67MiB/s (61.87%)

***

Obviously, both the compression ratio and the speed of the current
implementation are inferior to those of gzip. My patch loses to
gzip and bzip2 on compression ratio, but its compression speed
beats both of them.

Anyway, the compression speed of lz4 is very fast, so in my
opinion there is room to improve the current implementation
in pg_lzcompress.

regards,
-- 

Takeshi Yamamuro
NTT Cyber Communications Laboratory Group
Software Innovation Center
(Open Source Software Center)
Tel: +81-3-5860-5057 Fax: +81-3-5463-5490
Mail:yamamuro.take...@lab.ntt.co.jp
diff --git src/backend/utils/adt/pg_lzcompress.c 
src/backend/utils/adt/pg_lzcompress.c
index 466982e..3980334 100644
--- src/backend/utils/adt/pg_lzcompress.c
+++ src/backend/utils/adt/pg_lzcompress.c
@@ -62,8 +62,10 @@
  * Otherwise the first byte after the header tells what to do
  * the next 8 times. We call this the control byte.
  *
- * An unset bit in the control byte means, that one uncompressed
- * byte follows, which is copied from input to output.
+ * An unset bit in the control byte means, that one byte header
+ * and uncompressed 1-255 literals follow. The header includes
+ * the length of the literals, and the literals are just copied
+ * from input to output.
  *
  * A set bit in the control byte means, that a tag of 2-3 bytes
  * follows. A tag contains information to copy some bytes, that
@@ -184,6 +186,8 @@
 #define PGLZ_HISTORY_MASK  (PGLZ_HISTORY_LISTS - 1)
 #define PGLZ_HISTORY_SIZE  4096
 #define PGLZ_MAX_MATCH 273
+#define PGLZ_SKIP_SIZE 3
+#define PGLZ_HASH_GAP  8
 
 
 /* ----------
@@ -322,16 +326,20 @@ do { \
 
 
 /* ----------
- * pglz_out_literal -
+ * pglz_out_literals -
  *
- *		Outputs a literal byte to the destination buffer including the
- *		appropriate control bit.
+ *		Outputs multiple literals from the source to the destination
+ *		buffer including the appropriate control bit.
  * ----------
  */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+#define pglz_out_literals(_ctrlp,_ctrlb,_ctrl,_buf,_sp,_len) \
 do { \
	pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf);				\
-	*(_buf)++ = (unsigned char)(_byte);					\
+	*(_buf)++ = (unsigned char)(_len);					\
+	memcpy(_buf, _sp, _len);						\
+	_buf += _len;								\
	_ctrl <<= 1;								\
Re: [HACKERS] question: foreign key constraints and AccessExclusive locks

2013-01-06 Thread Tom Lane
Jon Nelson  writes:
> On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs  wrote:
>> FKs are enforced by triggers currently. Adding triggers requires
>> AccessExclusiveLock because of catalog visibility issues; you are
>> right that a lower lock is eventually possible.

> I've read and re-read this a few times, and I think I understand.
> However, could you clarify "you are right that a lower lock is
> eventually possible" for me, please?

We have some ideas about how to add/drop triggers while locking out only
operations that would actually try to fire the triggers.  Right now,
though, any DDL operation done with less than full exclusive lock would
risk having other transactions fetch an inconsistent view of the table's
catalog entries.  (This is true for any sort of ALTER TABLE, not just
trigger add/drop.)  Simon actually tried to fix this last year, but the
effort crashed and burned, and we're not sure how to get around the
problems.  Yet.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] too much pgbench init output

2013-01-06 Thread Tatsuo Ishii
>>> AFAIK the "5 second" logging is much quieter in most cases (and a bit
>>> more verbose when the initialization gets very slow), so I think the
>>> "quiet" logging is not a bad match although maybe there's a better name.
>>>
>>> This change (adding the elapsed/remaining fields to the original logging)
>>> would be consistent with this name, because considering a single line,
>>> the "-q" is more verbose right now.
>>>
>>> So I'd stick with the "-q" option and add the fields to the original
>>> logging. But I'm not opposing a different name, I just can't think of a
>>> better one.
>> 
>> Ok, I'm with you ("-q", along with adding the elapsed/remaining
>> fields to the original logging).
> 
> Great, attached is a patch that does that.

Committed. Thanks.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] question: foreign key constraints and AccessExclusive locks

2013-01-06 Thread Jon Nelson
On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs  wrote:
> On 6 January 2013 03:08, Jon Nelson  wrote:
>> When adding a foreign key constraint on tableA which references
>> tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a
>> lock that prevents writes be sufficient, or does PostgreSQL have to
>> modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on
>> Linux.
>
> FKs are enforced by triggers currently. Adding triggers requires
> AccessExclusiveLock because of catalog visibility issues; you are
> right that a lower lock is eventually possible.
>
> SQLStandard requires the check to be symmetrical, so adding FKs
> requires a trigger on each table and so an AEL is placed on tableB.

I've read and re-read this a few times, and I think I understand.
However, could you clarify "you are right that a lower lock is
eventually possible" for me, please?

-- 
Jon


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Proposal: Store "timestamptz" of database creation on "pg_database"

2013-01-06 Thread Fabrízio de Royes Mello
* Stephen Frost  wrote:
>
> Yes, and have the actual 'description' field (as it's variable) at the
> end of the catalog.
>
> Regarding the semantics of it- I was thinking about how directories and
> unix files work.  Basically, adding or removing a sub-object would
> update the alter time on the object itself, changing an already existing
> object or sub-object would update only the object/sub-object's alter
> time.  Creating an object or sub-object would set its create time and
> alter time to the same value.  I would distinguish 'create' from
> 'ctime', however, and have our 'create' time be only the actual
> *creation* time of the object.  ALTER table OWNER TO user; would update
> "table"s alter time.
>

Understood... a "COMMENT" is a database object, then if we add a creation
time column to pg_description/shdescription tables how we track his
creation time?


>
> Open to other thoughts on this and perhaps we should create a wiki page
> to start documenting the semantics.  Once we get agreement there, it's
> just a bit of code. :)
>

+1

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Simon Riggs
On 6 January 2013 23:03, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 6 January 2013 18:58, Tom Lane  wrote:
>>> IIRC, one of my very first attempts to deal with this was to charge
>>> random_page_cost per level of index descended.  This was such a horrid
>>> overestimate that it never went anywhere.  I think that reflects that in
>>> practical applications, the upper levels of the index tend to stay in
>>> cache.  We could ignore I/O on that assumption and still try to model
>>> CPU costs of the descent, which is basically what Jeff is proposing.
>>> My objection to his formula is mainly that it ignores physical index
>>> size, which I think is important to include somehow for the reasons
>>> I explained in my other message.
>
>> Having a well principled approach will help bring us towards a
>> realistic estimate.
>
> I thought about this some more and came up with what might be a
> reasonably principled compromise.  Assume that we know there are N
> leaf entries in the index (from VACUUM stats) and that we know the
> root page height is H (from looking at the btree metapage).  (Note:
> H starts at zero for a single-page index.)  If we assume that the
> number of tuples per page, P, is more or less uniform across leaf
> and upper pages (ie P is the fanout for upper pages), then we have
> N/P = number of leaf pages
> N/P/P = number of level 1 pages
> N/P^3 = number of level 2 pages
> N/P^(h+1) = number of level h pages
> Solving for the minimum P that makes N/P^(H+1) <= 1, we get
> P = ceil(exp(ln(N)/(H+1)))
> as an estimate of P given the known N and H values.
>
> Now, if we consider only CPU costs of index descent, we expect
> about log2(P) comparisons to be needed on each of the H upper pages
> to be descended through, that is we have total descent cost
> cpu_index_tuple_cost * H * log2(P)
>
> If we ignore the ceil() step as being a second-order correction, this
> can be simplified to
>
> cpu_index_tuple_cost * H * log2(N)/(H+1)
>
> I propose this, rather than Jeff's formula of cpu_index_tuple_cost *
> log2(N), as our fudge factor.  The reason I like this better is that
> the additional factor of H/(H+1) provides the correction I want for
> bloated indexes: if an index is bloated, the way that reflects into
> the cost of any particular search is that the number of pages to be
> descended through is larger than otherwise.  The correction is fairly
> small, particularly for large indexes, but that seems to be what's
> expected given the rest of our discussion.

Seems good to have something with both N and H in it. This cost model
favours smaller indexes over larger ones, whether that be because
they're partial and so have smaller N, or because the key values are
thinner and so have lower H.

> We could further extend this by adding some I/O charge when the index is
> sufficiently large, as per Simon's comments, but frankly I think that's
> unnecessary.  Unless the fan-out factor is really awful, practical-sized
> indexes probably have all their upper pages in memory.  What's more, per
> my earlier comment, when you start to think about tables so huge that
> that's not true it really doesn't matter if we charge another
> random_page_cost or two for an indexscan --- it'll still be peanuts
> compared to the seqscan alternative.

Considering that we're trying to decide between various indexes on one
table, we don't have enough information to say which index the cache
favours, and the other aspects of caching are the same for all indexes
of any given size. So we can assume those effects cancel out for
comparison purposes, even if they're non-zero. And as you say, they're
negligible in comparison with bitmapindexscans etc.

The only time I'd question that would be in the case of a nested loops
join but that's not important here.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Tom Lane
I wrote:
> [ slightly bogus graph ]

Ooops, it seems the ^ operator doesn't do what I thought in gnuplot.
Here's a corrected version.

regards, tom lane

set terminal png small color
set output 'new_fudge.png'
set xlabel "Index tuples"
set ylabel "Added cost"
set logscale x
set logscale y
h(x) = (x <= 256) ? 0.0001/0.005 : (x <= 256*256) ? (1./2)*log(x)/log(2) : (x <= 256*256*256) ? (2./3)*log(x)/log(2) : (x <= 256.0*256*256*256) ? (3./4)*log(x)/log(2) : (x <= 256.0*256*256*256*256) ? (4./5)*log(x)/log(2) : (5./6)*log(x)/log(2)
historical(x) = (4 * x/100000) < 100 ? 4 * x/100000 : 1/0
ninepoint2(x) = (4 * x/10000) < 100 ? 4 * x/10000 : 1/0
head(x) = 4*log(1 + x/10000)
plot [10:1e9] h(x)*0.005, 0.005 * log(x)/log(2), head(x), historical(x), ninepoint2(x)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Tom Lane
Simon Riggs  writes:
> On 6 January 2013 18:58, Tom Lane  wrote:
>> IIRC, one of my very first attempts to deal with this was to charge
>> random_page_cost per level of index descended.  This was such a horrid
>> overestimate that it never went anywhere.  I think that reflects that in
>> practical applications, the upper levels of the index tend to stay in
>> cache.  We could ignore I/O on that assumption and still try to model
>> CPU costs of the descent, which is basically what Jeff is proposing.
>> My objection to his formula is mainly that it ignores physical index
>> size, which I think is important to include somehow for the reasons
>> I explained in my other message.

> Having a well principled approach will help bring us towards a
> realistic estimate.

I thought about this some more and came up with what might be a
reasonably principled compromise.  Assume that we know there are N
leaf entries in the index (from VACUUM stats) and that we know the
root page height is H (from looking at the btree metapage).  (Note:
H starts at zero for a single-page index.)  If we assume that the
number of tuples per page, P, is more or less uniform across leaf
and upper pages (ie P is the fanout for upper pages), then we have
N/P = number of leaf pages
N/P/P = number of level 1 pages
N/P^3 = number of level 2 pages
N/P^(h+1) = number of level h pages
Solving for the minimum P that makes N/P^(H+1) <= 1, we get
P = ceil(exp(ln(N)/(H+1)))
as an estimate of P given the known N and H values.

Now, if we consider only CPU costs of index descent, we expect
about log2(P) comparisons to be needed on each of the H upper pages
to be descended through, that is we have total descent cost
cpu_index_tuple_cost * H * log2(P)

If we ignore the ceil() step as being a second-order correction, this
can be simplified to

cpu_index_tuple_cost * H * log2(N)/(H+1)

I propose this, rather than Jeff's formula of cpu_index_tuple_cost *
log2(N), as our fudge factor.  The reason I like this better is that
the additional factor of H/(H+1) provides the correction I want for
bloated indexes: if an index is bloated, the way that reflects into
the cost of any particular search is that the number of pages to be
descended through is larger than otherwise.  The correction is fairly
small, particularly for large indexes, but that seems to be what's
expected given the rest of our discussion.
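
For illustration only, here is a small self-contained C sketch of this
estimate (mine, not planner code; the function name and the worked
example are made up), using the default cpu_index_tuple_cost of 0.005:

#include <math.h>
#include <stdio.h>

/* N = leaf tuples, H = root page height (0 for a single-page index) */
static double
descent_cost(double N, int H)
{
	const double cpu_index_tuple_cost = 0.005;		/* default GUC value */
	double		P = ceil(exp(log(N) / (H + 1)));	/* implied fanout */

	/* H upper pages, about log2(P) comparisons on each */
	return cpu_index_tuple_cost * H * (log(P) / log(2.0));
}

int
main(void)
{
	/* e.g. ten million leaf tuples under a height-3 root */
	printf("%.4f\n", descent_cost(1e7, 3));
	return 0;
}

Dropping the ceil() step, as above, reduces this to
cpu_index_tuple_cost * H * log2(N)/(H+1).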

We could further extend this by adding some I/O charge when the index is
sufficiently large, as per Simon's comments, but frankly I think that's
unnecessary.  Unless the fan-out factor is really awful, practical-sized
indexes probably have all their upper pages in memory.  What's more, per
my earlier comment, when you start to think about tables so huge that
that's not true it really doesn't matter if we charge another
random_page_cost or two for an indexscan --- it'll still be peanuts
compared to the seqscan alternative.

To illustrate the behavior of this function, I've replotted my previous
graph, still taking the assumed fanout to be 256 tuples/page.  I limited
the range of the functions to 0.0001 to 100 to keep the log-scale graph
readable, but actually the H/(H+1) formulation would charge zero for
indexes of less than 256 tuples.  I think it's significant (and a good
thing) that this curve is nowhere significantly more than the historical
pre-9.2 fudge factor.

Thoughts?

regards, tom lane

set terminal png small color
set output 'new_fudge.png'
set logscale x
set logscale y
h(x) = (x <= 256) ? 0.0001/0.005 : (x <= 256*256) ? (1./2)*log(x)/log(2) : (x <= 256^3) ? (2./3)*log(x)/log(2) : (x <= 256^4) ? (3./4)*log(x)/log(2) : (x <= 256^5) ? (4./5)*log(x)/log(2) : (5./6)*log(x)/log(2)
historical(x) = (4 * x/100000) < 100 ? 4 * x/100000 : 1/0
ninepoint2(x) = (4 * x/10000) < 100 ? 4 * x/10000 : 1/0
head(x) = 4*log(1 + x/10000)
plot [10:1e9] h(x)*0.005, 0.005 * log(x)/log(2), head(x), historical(x), ninepoint2(x)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Skip checkpoint on promoting from streaming replication

2013-01-06 Thread Simon Riggs
On 9 August 2012 10:45, Simon Riggs  wrote:
> On 22 June 2012 05:03, Kyotaro HORIGUCHI  wrote:
>
>>I hope this is promising.
>
> I've reviewed this and thought about it over some time.

I've been torn between the need to remove the checkpoint for speed and
being worried about the implications of doing so.

We promote in multiple use cases. When we end a PITR, or are
performing a switchover, it doesn't really matter how long the
shutdown checkpoint takes, so I'm inclined to leave it there in those
cases. For failover, we need fast promotion.

So my thinking is to make
  pg_ctl promote -m fast
be the way to initiate a fast failover that skips the shutdown checkpoint.

That way all existing applications work the same as before, while new
users that explicitly choose to do so will gain from the new option.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Invent a "one-shot" variant of CachedPlans for better performanc

2013-01-06 Thread Simon Riggs
On 5 January 2013 19:15, Tom Lane  wrote:

>> We need a full "one-shot" concept linking the planner and executor for
>> all sorts of reasons, not just this one. We can discuss the
>> practicality of individual optimizations but the linkage should be
>> clear throughout the whole infrastructure.
>
> I thought then, and I think now, that such a concept was too squishy
> to be useful as an actual guide to what to change.  The particular
> arguments you advanced then have been overtaken by events anyway;
> notably that Marti Raudsepp's work on caching stable subexpressions at
> execution seems like a much more general answer to the problem of
> handling stable functions efficiently.

I knew, and accepted, that that specific optimization has been
superseded. My point is that the "one-shot" situation lends itself to
a great many optimizations, and if we can pass that concept through, we
can simply get on with implementing them.

Having the planner pretend that it doesn't know what will happen at
execution time isn't sensible. Having the executor know it's a one-off
will surely help somewhere along the line. It's hardly a modularity
violation to pass small, yet major, pieces of information across the
divide.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Simon Riggs
On 6 January 2013 18:58, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 5 January 2013 22:18, Tom Lane  wrote:
>>> No.  The argument is that if we don't have some such correction, the
>>> planner is liable to believe that different-sized indexes have *exactly
>>> the same cost*, if a given query would fetch the same number of index
>>> entries.
>
>> The only difference between a large and a small index is the initial
>> fetch, since the depth of the index may vary. After that the size of
>> the index is irrelevant to the cost of the scan, since we're just
>> scanning across the leaf blocks. (Other differences may exist but not
>> related to size).
>
> Right: except for the "fudge factor" under discussion, all the indexscan
> costs that we model come from accessing index leaf pages and leaf
> tuples.  So to the extent that the fudge factor has any principled basis
> at all, it's an estimate of index descent costs.  And in that role I
> believe that total index size needs to be taken into account.
>
>> Perhaps the cost of the initial fetch is what you mean by a
>> "correction"? In that case, why not use the index depth directly from
>> the metapage, rather than play with size?
>
> IIRC, one of my very first attempts to deal with this was to charge
> random_page_cost per level of index descended.  This was such a horrid
> overestimate that it never went anywhere.  I think that reflects that in
> practical applications, the upper levels of the index tend to stay in
> cache.  We could ignore I/O on that assumption and still try to model
> CPU costs of the descent, which is basically what Jeff is proposing.
> My objection to his formula is mainly that it ignores physical index
> size, which I think is important to include somehow for the reasons
> I explained in my other message.

Having a well principled approach will help bring us towards a
realistic estimate.

I can well believe what you say about random_page_cost * index_depth
being an over-estimate.

Making a fudge factor be random_page_cost * ln(1 + index_pages/100000)
just seems to presume an effective cache of 8GB and a fixed
depth:size ratio, which it might not be. On a busy system, or with a
very wide index, that could also be wrong.

I'd be more inclined to explicitly discount the first few levels by
using random_page_cost * (max(index_depth - 3, 0)),
or even better use a formula that includes the effective cache size
and index width to work out the likely number of tree levels cached
for an index.
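
For illustration, a self-contained C sketch of that first alternative
(my code, not a patch; the three cached levels and the default
random_page_cost of 4.0 are illustrative assumptions):

#include <stdio.h>

static double
descent_io_cost(int index_depth)
{
	const double random_page_cost = 4.0;	/* default GUC value */
	int			uncached = index_depth - 3; /* assume top 3 levels cached */

	return random_page_cost * (uncached > 0 ? uncached : 0);
}

int
main(void)
{
	for (int depth = 1; depth <= 6; depth++)
		printf("depth %d -> %.1f\n", depth, descent_io_cost(depth));
	return 0;
}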

Whatever we do, we must document that we are estimating the cache
effects on the cost of index descent, so we can pick that up in a
future study on caching effects.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Tom Lane
Simon Riggs  writes:
> On 5 January 2013 22:18, Tom Lane  wrote:
>> No.  The argument is that if we don't have some such correction, the
>> planner is liable to believe that different-sized indexes have *exactly
>> the same cost*, if a given query would fetch the same number of index
>> entries.

> The only difference between a large and a small index is the initial
> fetch, since the depth of the index may vary. After that the size of
> the index is irrelevant to the cost of the scan, since we're just
> scanning across the leaf blocks. (Other differences may exist but not
> related to size).

Right: except for the "fudge factor" under discussion, all the indexscan
costs that we model come from accessing index leaf pages and leaf
tuples.  So to the extent that the fudge factor has any principled basis
at all, it's an estimate of index descent costs.  And in that role I
believe that total index size needs to be taken into account.

> Perhaps the cost of the initial fetch is what you mean by a
> "correction"? In that case, why not use the index depth directly from
> the metapage, rather than play with size?

IIRC, one of my very first attempts to deal with this was to charge
random_page_cost per level of index descended.  This was such a horrid
overestimate that it never went anywhere.  I think that reflects that in
practical applications, the upper levels of the index tend to stay in
cache.  We could ignore I/O on that assumption and still try to model
CPU costs of the descent, which is basically what Jeff is proposing.
My objection to his formula is mainly that it ignores physical index
size, which I think is important to include somehow for the reasons
I explained in my other message.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Simon Riggs
On 6 January 2013 16:29, Jeff Janes  wrote:

> Worse, this over-punishment of bloat is more likely to penalize partial
> indexes.  Since they are  vacuumed on the table's schedule, not their own
> schedule, they likely get vacuumed less often relative to the amount of
> turn-over they experience and so have higher steady-state bloat. (I'm
> assuming the partial index is on the particularly hot rows, which I would
> expect is how partial indexes would generally be used)

That's an interesting thought. Thanks for noticing that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Simon Riggs
On 5 January 2013 22:18, Tom Lane  wrote:

>> But I am wondering if it should be present at all in 9.3.  When it was
>> introduced, the argument seemed to be that smaller indexes might be easier
>> to keep in cache.
>
> No.  The argument is that if we don't have some such correction, the
> planner is liable to believe that different-sized indexes have *exactly
> the same cost*, if a given query would fetch the same number of index
> entries.

The only difference between a large and a small index is the initial
fetch, since the depth of the index may vary. After that the size of
the index is irrelevant to the cost of the scan, since we're just
scanning across the leaf blocks. (Other differences may exist but not
related to size).

Perhaps the cost of the initial fetch is what you mean by a
"correction"? In that case, why not use the index depth directly from
the metapage, rather than play with size?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Tom Lane
Jeff Janes  writes:
> On Saturday, January 5, 2013, Tom Lane wrote:
>> Jeff Janes > writes:
>>> One thing which depends on the index size which, as far as I can tell, is
>>> not currently being counted is the cost of comparing the tuples all the way
>>> down the index.  This would be proportional to log2(indextuples) *
>>> cpu_index_tuple_cost, or maybe log2(indextuples) *
>>> (cpu_index_tuple_cost+cpu_operator_cost), or something like that.

>> Yeah, I know.  I've experimented repeatedly over the years with trying
>> to account explicitly for index descent costs.  But every time, anything
>> that looks even remotely principled turns out to produce an overly large
>> correction that results in bad plan choices.  I don't know exactly why
>> this is, but it's true.

> log2(indextuples) * cpu_index_tuple_cost  should produce pretty darn small
> corrections, at least if cost parameters are at the defaults.  Do you
> remember if that was one of the ones you tried?

Well, a picture is worth a thousand words, so see the attached plot of
the various proposed corrections for indexes of 10 to 1e9 tuples.  For
purposes of argument I've supposed that the index has loading factor
256 tuples/page, and I used the default values of random_page_cost and
cpu_index_tuple_cost.  The red line is your proposal, the green one is
mine, the blue one is current HEAD behavior.

Both the blue and green lines get to values that might be thought
excessively high for very large indexes, but I doubt that that really
matters: if the table contains a billion rows, the cost of a seqscan
will be so high that it'll hardly matter if we overshoot the cost of an
index probe a bit.  (Also, once the table gets that large it's debatable
whether the upper index levels all fit in cache, so charging an extra
random_page_cost or so isn't necessarily unrealistic.)

The real problem though is at the other end of the graph: I judge that
the red line represents an overcorrection for indexes of a few thousand
tuples.

It might also be worth noting that for indexes of a million or so
tuples, we're coming out to about the same place anyway.

>> One other point is that I think it is better for any such correction
>> to depend on the index's total page count, not total tuple count,
>> because otherwise two indexes that are identical except for bloat
>> effects will appear to have identical costs.

> This isn't so.  A bloated index will be estimated to visit more pages than
> an otherwise identical non-bloated index, and so have a higher cost.

No it won't, or at least not reliably so, if there is no form of
correction for index descent costs.  For instance, in a probe into a
unique index, we'll always estimate that we're visiting a single index
tuple on a single index page.  The example you show is tweaked to ensure
that it estimates visiting more than one index page, and in that context
the leaf-page-related costs probably do scale with bloat; but they won't
if the query is only looking for one index entry.

> For the bloated index, this correction might even be too harsh.  If the
> index is bloated by having lots of mostly-empty pages, then this seems
> fair.  If it is bloated by having lots of entirely empty pages that are not
> even linked into the tree, then those empty ones will never be visited and
> so it shouldn't be penalized.

It's true that an un-linked empty page adds no cost by itself.  But if
there are a lot of now-empty pages, that probably means a lot of vacant
space on upper index pages (which must once have held downlinks to those
pages).  Which means more upper pages traversed to get to the target
leaf page than we'd have in a non-bloated index.  Without more
experimental evidence than we've got at hand, I'm disinclined to suppose
that index bloat is free.

> This extra bloat was one of the reasons the partial index was avoided in
> "Why does the query planner use two full indexes, when a dedicated partial
> index exists?"

Interesting point, but it's far from clear that the planner was wrong in
supposing that that bloat had significant cost.  We agree that the
current 9.2 correction is too large, but it doesn't follow that zero is
a better value.

>> So from that standpoint,
>> the ln() form of the fudge factor seems quite reasonable as a crude form
>> of index descent cost estimate.  The fact that we're needing to dial
>> it down so much reinforces my feeling that descent costs are close to
>> negligible in practice.

> If they are negligible, why do we really care that it use a partial index
> vs a full index?

TBH, in situations like the ones I'm thinking about it's not clear that
a partial index is a win at all.  The cases where a partial index really
wins are where it doesn't index rows that you would otherwise have to
visit and make a non-indexed predicate test against --- and those costs
we definitely do model.  However, if the planner doesn't pick the
partial index if available, people are going to report that as a bug.
They won't be ab

Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles

2013-01-06 Thread Jeff Janes
On Saturday, January 5, 2013, Tom Lane wrote:

> Jeff Janes > writes:
> > [moved to hackers]
> > On Wednesday, December 5, 2012, Tom Lane wrote:
> >> Hm.  To tell you the truth, in October I'd completely forgotten about
>> the January patch, and was thinking that the 1/10000 cost had a lot
> >> of history behind it.  But if we never shipped it before 9.2 then of
> >> course that idea is false.  Perhaps we should backpatch the log curve
> >> into 9.2 --- that would reduce the amount of differential between what
> >> 9.2 does and what previous branches do for large indexes.
>
> > I think we should backpatch it for 9.2.3.  I've seen another email which
> is
> > probably due to the same issue (nested loop vs hash join).  And some
> > monitoring of a database I am responsible for suggests it might be
> heading
> > in that direction as well as the size grows.
>
> I received an off-list report of a case where not only did the 1/10000
> factor cause a nestloop-vs-hashjoin decision to be made wrongly, but
> even adding the ln() computation as in commit bf01e34b556 didn't fix it.
> I believe the index in question was on the order of 20000 pages, so
> it's not too hard to see why this might be the case:
>
> * historical fudge factor   4 * 20000/100000 = 0.8
> * 9.2 fudge factor  4 * 20000/10000 = 8.0
> * with ln() correction  4 * ln(1 + 20000/10000) = 4.39 or so
>
> At this point I'm about ready to not only revert the 100000-to-10000
> change, but keep the ln() adjustment, ie make the calculation be
> random_page_cost * ln(1 + index_pages/100000).  This would give
> essentially the pre-9.2 behavior for indexes up to some tens of
> thousands of pages, and keep the fudge factor from getting out of
> control even for very very large indexes.
>

Yeah, I agree that even the log function grows too rapidly, especially at
the early stages.  I didn't know if a change that changes that asymptote
would be welcome in a backpatch, though.


>
> > But I am wondering if it should be present at all in 9.3.  When it was
> > introduced, the argument seemed to be that smaller indexes might be
> easier
> > to keep in cache.
>
> No.  The argument is that if we don't have some such correction, the
> planner is liable to believe that different-sized indexes have *exactly
> the same cost*, if a given query would fetch the same number of index
> entries.


But it seems like they very likely *do* have exactly the same cost, unless
you want to take either the CPU cost of descending the index into account,
or take cachebility into account.  If they do have the same cost, why
shouldn't the estimate reflect that?  Using cpu_index_tuple_cost * lg(#
index tuples) would break the tie, but by such a small amount that it would
easily get swamped by the stochastic nature of ANALYZE for nodes expected
to return more than one row.


> This is quite easy to demonstrate when experimenting with
> partial indexes, in particular - without the fudge factor the planner
> sees no advantage of a partial index over a full index from which the
> query would fetch the same number of entries.  We do want the planner
> to pick the partial index if it's usable, and a fudge factor is about
> the least unprincipled way to make it do so.
>

I noticed a long time ago that ordinary index scans seemed to be preferred
over bitmap index scans with the same cost estimate, as best as I could
determine because they are tested first and the tie goes to the first one
(and there is something about needing to be better by 1% to be counted as
better--although that part might only apply when the start-up cost and the
full cost disagree over which one is best).  If I've reconstructed that
correctly, could something similar be done for partial indexes, where they
are just considered first?  I guess the problem there is that an index scan
on a partial index is not a separate node type from an index scan on a full
index, unlike index vs bitmap.

>
> > The argument for increasing the penalty by a factor of 10 was that the
> > smaller one could be "swamped by noise such as page-boundary-roundoff
> > behavior".
>
> Yeah, I wrote that, but in hindsight it seems like a mistaken idea.
> The noise problem is that because we round off page count and row count
> estimates to integers at various places, it's fairly easy for small
> changes in statistics to move a plan's estimated cost by significantly
> more than this fudge factor will.  However, the case that the fudge
> factor is meant to fix is indexes that are otherwise identical for
> the query's purposes --- and any roundoff effects will be the same.
> (The fudge factor itself is *not* rounded off anywhere, it flows
> directly to the bottom-line cost for the indexscan.)
>

OK, and this agrees with my experience.  It seemed like it was the
stochastic nature of ANALYZE, not roundoff problems, that caused the plans
to go back and forth.


>
> > One thing which depends on the index size which, as far as I can tell, is
> > not

Re: [HACKERS] too much pgbench init output

2013-01-06 Thread Tomas Vondra
On 6.1.2013 10:35, Tatsuo Ishii wrote:
>>> If we do so, probably '-q' is not an appropriate option name any more,
>>> since the only difference between old logging and new one is, the
>>> former is printed every 100000 lines while the latter is every 5
>>> seconds, which is not really "quiet". What do you think?
>>
>> AFAIK the "5 second" logging is much quieter in most cases (and a bit
>> more verbose when the initialization gets very slow), so I think the
>> "quiet" logging is not a bad match although maybe there's a better name.
>>
>> This change (adding the elapsed/remaining fields to the original logging)
>> would be consistent with this name, because considering a single line,
>> the "-q" is more verbose right now.
>>
>> So I'd stick with the "-q" option and add the fields to the original
>> logging. But I'm not opposing a different name, I just can't think of a
>> better one.
> 
> Ok, I'm with you ("-q", along with adding the elapsed/remaining
> fields to the original logging).

Great, attached is a patch that does that.

Tomas

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..9b9ca77 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -39,6 +39,7 @@
 #include "portability/instr_time.h"
 
 #include 
+#include 
 
 #ifndef WIN32
 #include 
@@ -102,6 +103,7 @@ extern int  optind;
 #define MAXCLIENTS 1024
 #endif
 
+#define LOG_STEP_SECONDS   5   /* seconds between log messages */
 #define DEFAULT_NXACTS 10  /* default nxacts */
 
 int			nxacts = 0;			/* number of transactions per client */
@@ -150,6 +152,7 @@ char	   *index_tablespace = NULL;
 #define naccounts	100000
 
 bool		use_log;		/* log transaction latencies to a file */
+bool		use_quiet;		/* quiet logging onto stderr */
 bool		is_connect;		/* establish connection for each transaction */
 bool		is_latencies;	/* report per-command latencies */
 int			main_pid;		/* main process id used in log filename */
@@ -359,6 +362,7 @@ usage(void)
   "  -n   do not run VACUUM after initialization\n"
   "  -F NUM   fill factor\n"
   "  -s NUM   scaling factor\n"
+  "  -q   quiet logging (one message each 5 seconds)\n"
   "  --foreign-keys\n"
   "   create foreign key constraints between 
tables\n"
   "  --index-tablespace=TABLESPACE\n"
@@ -1362,6 +1366,11 @@ init(bool is_no_vacuum)
 	char		sql[256];
 	int			i;
 
+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double  elapsed_sec, remaining_sec;
+   int log_interval = 1;
+
if ((con = doConnect()) == NULL)
exit(1);
 
@@ -1430,6 +1439,8 @@ init(bool is_no_vacuum)
}
PQclear(res);
 
+   INSTR_TIME_SET_CURRENT(start);
+
for (i = 0; i < naccounts * scale; i++)
{
int j = i + 1;
@@ -1441,10 +1452,42 @@ init(bool is_no_vacuum)
exit(1);
}
 
-		if (j % 100000 == 0)
-			fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-					j, naccounts * scale,
-					(int) (((int64) j * 100) / (naccounts * scale)));
+		/* If we want to stick with the original logging, print a message
+		 * each 100k inserted rows. */
+		if ((! use_quiet) && (j % 100000 == 0))
+		{
+			INSTR_TIME_SET_CURRENT(diff);
+			INSTR_TIME_SUBTRACT(diff, start);
+
+			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+			fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n",
+					j, naccounts * scale,
+					(int) (((int64) j * 100) / (naccounts * scale)),
+					elapsed_sec, remaining_sec);
+		}
+		/* let's not call the timing for each row, but only each 100 rows */
+		else if (use_quiet && (j % 100 == 0))
+		{
+			INSTR_TIME_SET_CURRENT(diff);
+			INSTR_TIME_SUBTRACT(diff, start);
+
+			elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+			remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+			/* have we reached the next interval (

Re: [HACKERS] question: foreign key constraints and AccessExclusive locks

2013-01-06 Thread Simon Riggs
On 6 January 2013 03:08, Jon Nelson  wrote:
> When adding a foreign key constraint on tableA which references
> tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a
> lock that prevents writes be sufficient, or does PostgreSQL have to
> modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on
> Linux.

FKs are enforced by triggers currently. Adding triggers requires
AccessExclusiveLock because of catalog visibility issues; you are
right that a lower lock is eventually possible.

SQLStandard requires the check to be symmetrical, so adding FKs
requires a trigger on each table and so an AEL is placed on tableB.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] too much pgbench init output

2013-01-06 Thread Tatsuo Ishii
>> If we do so, probably '-q' is not an appropriate option name any more,
>> since the only difference between old logging and new one is, the
>> former is printed every 100000 lines while the latter is every 5
>> seconds, which is not really "quiet". What do you think?
> 
> AFAIK the "5 second" logging is much quieter in most cases (and a bit
> more verbose when the initialization gets very slow), so I think the
> "quiet" logging is not a bad match although maybe there's a better name.
> 
> This change (adding the elapsed/remaining fields to the original logging)
> would be consistent with this name, because considering a single line,
> the "-q" is more verbose right now.
> 
> So I'd stick with the "-q" option and add the fields to the original
> logging. But I'm not opposing a different name, I just can't think of a
> better one.

Ok, I'm with you ("-q", along with adding the elapsed/remaining
fields to the original logging).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers