[HACKERS] Improve compression speeds in pg_lzcompress.c
Hi, hackers,

The attached is a patch to improve compression speed, at the cost of some compression ratio, in backend/utils/adt/pg_lzcompress.c. Recent modern compression techniques like Google's LZ4 and Snappy inspired me to write this patch. There are two points to my patch:

1. Skip at most 255 literals that might be incompressible during pattern matching for LZ compression.
2. Update the hash table only every PGLZ_HASH_GAP literals.

A sequence of literals is typically a mix of compressible parts and incompressible ones, so IMHO it is reasonable to skip PGLZ_SKIP_SIZE literals each time a match is not found. The skipped literals are just copied to the output buffer, so pglz_out_literal() is rewritten (and renamed pglz_out_literals) to copy multiple bytes rather than a single byte. Also, the current implementation updates the hash table for every single literal; as these updates obviously eat much processor time, dynamically skipping them improves compression speed.

I've done quick comparison tests with a Xeon 5670 processor. Apache Hadoop log files and TREC GOV2 web data were used as test sets. The former is highly compressible (low entropy) and the latter is difficult to compress (high entropy).

*** Compression Speed (Ratio)
Apache Hadoop logs:
  gzip                           78.22 MiB/s ( 5.31%)
  bzip2                           3.34 MiB/s ( 3.04%)
  lz4                           939.45 MiB/s ( 9.17%)
  pg_lzcompress (original)       37.80 MiB/s (11.76%)
  pg_lzcompress (patch applied)  99.42 MiB/s (14.19%)

TREC GOV2 web data:
  gzip                           21.22 MiB/s (32.66%)
  bzip2                           8.61 MiB/s (27.86%)
  lz4                           250.98 MiB/s (49.82%)
  pg_lzcompress (original)       20.44 MiB/s (50.09%)
  pg_lzcompress (patch applied)  48.67 MiB/s (61.87%)
***

Obviously, both the compression ratio and the speed of the current implementation are inferior to gzip's. My patch loses to gzip and bzip2 in terms of compression ratio, but it beats both of them on compression speed.
Anyway, the compression speed of lz4 is very fast, so in my opinion there is room to improve the current implementation in pg_lzcompress.

regards,
-- 
Takeshi Yamamuro
NTT Cyber Communications Laboratory Group
Software Innovation Center (Open Source Software Center)
Tel: +81-3-5860-5057 Fax: +81-3-5463-5490
Mail: yamamuro.take...@lab.ntt.co.jp

diff --git src/backend/utils/adt/pg_lzcompress.c src/backend/utils/adt/pg_lzcompress.c
index 466982e..3980334 100644
--- src/backend/utils/adt/pg_lzcompress.c
+++ src/backend/utils/adt/pg_lzcompress.c
@@ -62,8 +62,10 @@
  *		Otherwise the first byte after the header tells what to do
  *		the next 8 times. We call this the control byte.
  *
- *		An unset bit in the control byte means, that one uncompressed
- *		byte follows, which is copied from input to output.
+ *		An unset bit in the control byte means, that one byte header
+ *		and uncompressed 1-255 literals follow. The header includes
+ *		the length of the literals, and the literals are just copied
+ *		from input to output.
  *
  *		A set bit in the control byte means, that a tag of 2-3 bytes
  *		follows. A tag contains information to copy some bytes, that
@@ -184,6 +186,8 @@
 #define PGLZ_HISTORY_MASK		(PGLZ_HISTORY_LISTS - 1)
 #define PGLZ_HISTORY_SIZE		4096
 #define PGLZ_MAX_MATCH			273
+#define PGLZ_SKIP_SIZE			3
+#define PGLZ_HASH_GAP			8
 
 
 /* --
@@ -322,16 +326,20 @@ do { \
 
 /* --
- * pglz_out_literal -
+ * pglz_out_literals -
  *
  *		Outputs a literal byte to the destination buffer including the
  *		appropriate control bit.
+ *		Outputs multiple literals from the source to the destination
+ *		buffer including the appropriate control bit.
  * --
  */
-#define pglz_out_literal(_ctrlp,_ctrlb,_ctrl,_buf,_byte) \
+#define pglz_out_literals(_ctrlp,_ctrlb,_ctrl,_buf,_sp,_len) \
 do { \
 	pglz_out_ctrl(_ctrlp,_ctrlb,_ctrl,_buf); \
-	*(_buf)++ = (unsigned char)(_byte); \
+	*(_buf)++ = (unsigned char)(_len); \
+	memcpy(_buf, _sp, _len); \
+	_buf += _len; \
 	_ctrl <<= 1;
Re: [HACKERS] question: foreign key constraints and AccessExclusive locks
Jon Nelson writes: > On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs wrote: >> FKs are enforced by triggers currently. Adding triggers requires >> AccessExclusiveLock because of catalog visibility issues; you are >> right that a lower lock is eventually possible. > I've read and re-read this a few times, and I think I understand. > However, could you clarify "you are right that a lower lock is > eventually possible" for me, please? We have some ideas about how to add/drop triggers while locking out only operations that would actually try to fire the triggers. Right now, though, any DDL operation done with less than full exclusive lock would risk having other transactions fetch an inconsistent view of the table's catalog entries. (This is true for any sort of ALTER TABLE, not just trigger add/drop.) Simon actually tried to fix this last year, but the effort crashed and burned, and we're not sure how to get around the problems. Yet. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] too much pgbench init output
>>> AFAIK the "5 second" logging is much quieter in most cases (and a bit >>> more verbose when the initialization gets very slower), so I think the >>> "quiet" logging is not a bad match although maybe there's a better name. >>> >>> This change (adding the elapsed/remaining fields to the original loggin) >>> would be consistent with this name, because considering a single line, >>> the "-q" is more verbose right now. >>> >>> So I'd stick with the "-q" option and added the fields to the original >>> logging. But I'm not opposing a different name, I just can't think of a >>> better one. >> >> Ok, I'm with you ("-q" and along with adding the elapsed/remaining >> fields to the original logging). > > Great, attached is a patch that does that. Committed. Thanks. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] question: foreign key constraints and AccessExclusive locks
On Sun, Jan 6, 2013 at 4:14 AM, Simon Riggs wrote: > On 6 January 2013 03:08, Jon Nelson wrote: >> When adding a foreign key constraint on tableA which references >> tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a >> lock that prevents writes be sufficient, or does PostgreSQL have to >> modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on >> Linux. > > FKs are enforced by triggers currently. Adding triggers requires > AccessExclusiveLock because of catalog visibility issues; you are > right that a lower lock is eventually possible. > > SQLStandard requires the check to be symmetrical, so adding FKs > requires a trigger on each table and so an AEL is placed on tableB. I've read and re-read this a few times, and I think I understand. However, could you clarify "you are right that a lower lock is eventually possible" for me, please? -- Jon -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: Proposal: Store "timestamptz" of database creation on "pg_database"
* Stephen Frost wrote: > > Yes, and have the actual 'description' field (as it's variable) at the > end of the catalog. > > Regarding the semantics of it- I was thinking about how directories and > unix files work. Basically, adding or removing a sub-object would > update the alter time on the object itself, changing an already existing > object or sub-object would update only the object/sub-object's alter > time. Creating an object or sub/object would set its create time and > alter time to the same value. I would distinguish 'create' from > 'ctime', however, and have our 'create' time be only the actual > *creation* time of the object. ALTER table OWNER TO user; would update > "table"s alter time. > Understood... but a "COMMENT" is a database object too, so if we add a creation time column to the pg_description/shdescription tables, how do we track its creation time? > > Open to other thoughts on this and perhaps we should create a wiki page > to start documenting the semantics. Once we get agreement there, it's > just a bit of code. :) > +1 Regards, -- Fabrízio de Royes Mello Consultoria/Coaching PostgreSQL >> Blog sobre TI: http://fabriziomello.blogspot.com >> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello >> Twitter: http://twitter.com/fabriziomello
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
On 6 January 2013 23:03, Tom Lane wrote: > Simon Riggs writes: >> On 6 January 2013 18:58, Tom Lane wrote: >>> IIRC, one of my very first attempts to deal with this was to charge >>> random_page_cost per level of index descended. This was such a horrid >>> overestimate that it never went anywhere. I think that reflects that in >>> practical applications, the upper levels of the index tend to stay in >>> cache. We could ignore I/O on that assumption and still try to model >>> CPU costs of the descent, which is basically what Jeff is proposing. >>> My objection to his formula is mainly that it ignores physical index >>> size, which I think is important to include somehow for the reasons >>> I explained in my other message. > >> Having a well principled approach will help bring us towards a >> realistic estimate. > > I thought about this some more and came up with what might be a > reasonably principled compromise. Assume that we know there are N > leaf entries in the index (from VACUUM stats) and that we know the > root page height is H (from looking at the btree metapage). (Note: > H starts at zero for a single-page index.) If we assume that the > number of tuples per page, P, is more or less uniform across leaf > and upper pages (ie P is the fanout for upper pages), then we have > N/P = number of leaf pages > N/P/P = number of level 1 pages > N/P^3 = number of level 2 pages > N/P^(h+1) = number of level h pages > Solving for the minimum P that makes N/P^(H+1) <= 1, we get > P = ceil(exp(ln(N)/(H+1))) > as an estimate of P given the known N and H values. 
> > Now, if we consider only CPU costs of index descent, we expect > about log2(P) comparisons to be needed on each of the H upper pages > to be descended through, that is we have total descent cost > cpu_index_tuple_cost * H * log2(P) > > If we ignore the ceil() step as being a second-order correction, this > can be simplified to > > cpu_index_tuple_cost * H * log2(N)/(H+1) > > I propose this, rather than Jeff's formula of cpu_index_tuple_cost * > log2(N), as our fudge factor. The reason I like this better is that > the additional factor of H/(H+1) provides the correction I want for > bloated indexes: if an index is bloated, the way that reflects into > the cost of any particular search is that the number of pages to be > descended through is larger than otherwise. The correction is fairly > small, particularly for large indexes, but that seems to be what's > expected given the rest of our discussion. Seems good to have something with both N and H in it. This cost model favours smaller indexes over larger ones, whether that be because they're partial and so have smaller N, or whether the key values are thinner and so have lower H. > We could further extend this by adding some I/O charge when the index is > sufficiently large, as per Simon's comments, but frankly I think that's > unnecessary. Unless the fan-out factor is really awful, practical-sized > indexes probably have all their upper pages in memory. What's more, per > my earlier comment, when you start to think about tables so huge that > that's not true it really doesn't matter if we charge another > random_page_cost or two for an indexscan --- it'll still be peanuts > compared to the seqscan alternative. Considering that we're trying to decide between various indexes on one table, we don't have enough information to say which index the cache favours and the other aspects of cacheing are the same for all indexes of any given size. 
So we can assume those effects cancel out for comparison purposes, even if they're non-zero. And as you say, they're negligible in comparison with bitmapindexscans etc.. The only time I'd question that would be in the case of a nested loops join but that's not important here. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
I wrote:
> [ slightly bogus graph ]

Ooops, it seems the ^ operator doesn't do what I thought in gnuplot. Here's a corrected version.

			regards, tom lane

set terminal png small color
set output 'new_fudge.png'
set xlabel "Index tuples"
set ylabel "Added cost"
set logscale x
set logscale y
h(x) = (x <= 256) ? 0.0001/0.005 : (x <= 256*256) ? (1./2)*log(x)/log(2) : (x <= 256*256*256) ? (2./3)*log(x)/log(2) : (x <= 256.0*256*256*256) ? (3./4)*log(x)/log(2) : (x <= 256.0*256*256*256*256) ? (4./5)*log(x)/log(2) : (5./6)*log(x)/log(2)
historical(x) = (4 * x/10) < 100 ? 4 * x/10 : 1/0
ninepoint2(x) = (4 * x/1) < 100 ? 4 * x/1 : 1/0
head(x) = 4*log(1 + x/1)
plot [10:1e9] h(x)*0.005, 0.005 * log(x)/log(2), head(x), historical(x), ninepoint2(x)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
Simon Riggs writes: > On 6 January 2013 18:58, Tom Lane wrote: >> IIRC, one of my very first attempts to deal with this was to charge >> random_page_cost per level of index descended. This was such a horrid >> overestimate that it never went anywhere. I think that reflects that in >> practical applications, the upper levels of the index tend to stay in >> cache. We could ignore I/O on that assumption and still try to model >> CPU costs of the descent, which is basically what Jeff is proposing. >> My objection to his formula is mainly that it ignores physical index >> size, which I think is important to include somehow for the reasons >> I explained in my other message. > Having a well principled approach will help bring us towards a > realistic estimate. I thought about this some more and came up with what might be a reasonably principled compromise. Assume that we know there are N leaf entries in the index (from VACUUM stats) and that we know the root page height is H (from looking at the btree metapage). (Note: H starts at zero for a single-page index.) If we assume that the number of tuples per page, P, is more or less uniform across leaf and upper pages (ie P is the fanout for upper pages), then we have N/P = number of leaf pages N/P/P = number of level 1 pages N/P^3 = number of level 2 pages N/P^(h+1) = number of level h pages Solving for the minimum P that makes N/P^(H+1) <= 1, we get P = ceil(exp(ln(N)/(H+1))) as an estimate of P given the known N and H values. Now, if we consider only CPU costs of index descent, we expect about log2(P) comparisons to be needed on each of the H upper pages to be descended through, that is we have total descent cost cpu_index_tuple_cost * H * log2(P) If we ignore the ceil() step as being a second-order correction, this can be simplified to cpu_index_tuple_cost * H * log2(N)/(H+1) I propose this, rather than Jeff's formula of cpu_index_tuple_cost * log2(N), as our fudge factor. 
The reason I like this better is that the additional factor of H/(H+1) provides the correction I want for bloated indexes: if an index is bloated, the way that reflects into the cost of any particular search is that the number of pages to be descended through is larger than otherwise. The correction is fairly small, particularly for large indexes, but that seems to be what's expected given the rest of our discussion.

We could further extend this by adding some I/O charge when the index is sufficiently large, as per Simon's comments, but frankly I think that's unnecessary. Unless the fan-out factor is really awful, practical-sized indexes probably have all their upper pages in memory. What's more, per my earlier comment, when you start to think about tables so huge that that's not true it really doesn't matter if we charge another random_page_cost or two for an indexscan --- it'll still be peanuts compared to the seqscan alternative.

To illustrate the behavior of this function, I've replotted my previous graph, still taking the assumed fanout to be 256 tuples/page. I limited the range of the functions to 0.0001 to 100 to keep the log-scale graph readable, but actually the H/(H+1) formulation would charge zero for indexes of less than 256 tuples. I think it's significant (and a good thing) that this curve is nowhere significantly more than the historical pre-9.2 fudge factor.

Thoughts?

			regards, tom lane

set terminal png small color
set output 'new_fudge.png'
set logscale x
set logscale y
h(x) = (x <= 256) ? 0.0001/0.005 : (x <= 256*256) ? (1./2)*log(x)/log(2) : (x <= 256^3) ? (2./3)*log(x)/log(2) : (x <= 256^4) ? (3./4)*log(x)/log(2) : (x <= 256^5) ? (4./5)*log(x)/log(2) : (5./6)*log(x)/log(2)
historical(x) = (4 * x/10) < 100 ? 4 * x/10 : 1/0
ninepoint2(x) = (4 * x/1) < 100 ? 4 * x/1 : 1/0
head(x) = 4*log(1 + x/1)
plot [10:1e9] h(x)*0.005, 0.005 * log(x)/log(2), head(x), historical(x), ninepoint2(x)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Skip checkpoint on promoting from streaming replication
On 9 August 2012 10:45, Simon Riggs wrote: > On 22 June 2012 05:03, Kyotaro HORIGUCHI > wrote: > >>I hope this is promising. > > I've reviewed this and thought about it over some time. I've been torn between the need to remove the checkpoint for speed and being worried about the implications of doing so. We promote in multiple use cases. When we end a PITR, or are performing a switchover, it doesn't really matter how long the shutdown checkpoint takes, so I'm inclined to leave it there in those cases. For failover, we need fast promotion. So my thinking is to make pg_ctl promote -m fast be the way to initiate a fast failover that skips the shutdown checkpoint. That way all existing applications work the same as before, while new users that explicitly choose to do so will gain from the new option. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Invent a "one-shot" variant of CachedPlans for better performanc
On 5 January 2013 19:15, Tom Lane wrote: >> We need a full "one-shot" concept linking the planner and executor for >> all sorts of reasons, not just this one. We can discuss the >> practicality of individual optimizations but the linkage should be >> clear throughout the whole infrastructure. > > I thought then, and I think now, that such a concept was too squishy > to be useful as an actual guide to what to change. The particular > arguments you advanced then have been overtaken by events anyway; > notably that Marti Raudsepp's work on caching stable subexpressions at > execution seems like a much more general answer to the problem of > handling stable functions efficiently. I knew, and accepted, that that specific optimization has been superseded. My point is that the "one-shot" situation lends itself to a great many optimizations, and if we can pass that concept through, we can simply get on with implementing them. Having the planner pretend that it doesn't know what will happen at execution time isn't sensible. Having the executor know it's a one-off will surely help somewhere along the line. It's hardly a modularity violation to pass small yet major pieces of information across the divide. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
On 6 January 2013 18:58, Tom Lane wrote: > Simon Riggs writes: >> On 5 January 2013 22:18, Tom Lane wrote: >>> No. The argument is that if we don't have some such correction, the >>> planner is liable to believe that different-sized indexes have *exactly >>> the same cost*, if a given query would fetch the same number of index >>> entries. > >> The only difference between a large and a small index is the initial >> fetch, since the depth of the index may vary. After that the size of >> the index is irrelevant to the cost of the scan, since we're just >> scanning across the leaf blocks. (Other differences may exist but not >> related to size). > > Right: except for the "fudge factor" under discussion, all the indexscan > costs that we model come from accessing index leaf pages and leaf > tuples. So to the extent that the fudge factor has any principled basis > at all, it's an estimate of index descent costs. And in that role I > believe that total index size needs to be taken into account. > >> Perhaps the cost of the initial fetch is what you mean by a >> "correction"? In that case, why not use the index depth directly from >> the metapage, rather than play with size? > > IIRC, one of my very first attempts to deal with this was to charge > random_page_cost per level of index descended. This was such a horrid > overestimate that it never went anywhere. I think that reflects that in > practical applications, the upper levels of the index tend to stay in > cache. We could ignore I/O on that assumption and still try to model > CPU costs of the descent, which is basically what Jeff is proposing. > My objection to his formula is mainly that it ignores physical index > size, which I think is important to include somehow for the reasons > I explained in my other message. Having a well principled approach will help bring us towards a realistic estimate. I can well believe what you say about random_page_cost * index_depth being an over-estimate. 
Making the fudge factor be random_page_cost * ln(1 + index_pages/10000) just seems to presume an effective cache of 8GB and a fixed depth:size ratio, which may not hold. On a busy system, or with a very wide index, that could also be wrong. I'd be more inclined to explicitly discount the first few levels by using random_page_cost * (max(index_depth - 3, 0)), or better still, use a formula that includes the effective cache size and index width to work out the likely number of tree levels cached for an index. Whatever we do, we must document that we are estimating the cache effects on the cost of index descent, so we can pick that up in a future study on caching effects. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
Simon Riggs writes: > On 5 January 2013 22:18, Tom Lane wrote: >> No. The argument is that if we don't have some such correction, the >> planner is liable to believe that different-sized indexes have *exactly >> the same cost*, if a given query would fetch the same number of index >> entries. > The only difference between a large and a small index is the initial > fetch, since the depth of the index may vary. After that the size of > the index is irrelevant to the cost of the scan, since we're just > scanning across the leaf blocks. (Other differences may exist but not > related to size). Right: except for the "fudge factor" under discussion, all the indexscan costs that we model come from accessing index leaf pages and leaf tuples. So to the extent that the fudge factor has any principled basis at all, it's an estimate of index descent costs. And in that role I believe that total index size needs to be taken into account. > Perhaps the cost of the initial fetch is what you mean by a > "correction"? In that case, why not use the index depth directly from > the metapage, rather than play with size? IIRC, one of my very first attempts to deal with this was to charge random_page_cost per level of index descended. This was such a horrid overestimate that it never went anywhere. I think that reflects that in practical applications, the upper levels of the index tend to stay in cache. We could ignore I/O on that assumption and still try to model CPU costs of the descent, which is basically what Jeff is proposing. My objection to his formula is mainly that it ignores physical index size, which I think is important to include somehow for the reasons I explained in my other message. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
On 6 January 2013 16:29, Jeff Janes wrote: > Worse, this over-punishment of bloat is more likely to penalize partial > indexes. Since they are vacuumed on the table's schedule, not their own > schedule, they likely get vacuumed less often relative to the amount of > turn-over they experience and so have higher steady-state bloat. (I'm > assuming the partial index is on the particularly hot rows, which I would > expect is how partial indexes would generally be used) That's an interesting thought. Thanks for noticing that. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
On 5 January 2013 22:18, Tom Lane wrote: >> But I am wondering if it should be present at all in 9.3. When it was >> introduced, the argument seemed to be that smaller indexes might be easier >> to keep in cache. > > No. The argument is that if we don't have some such correction, the > planner is liable to believe that different-sized indexes have *exactly > the same cost*, if a given query would fetch the same number of index > entries. The only difference between a large and a small index is the initial fetch, since the depth of the index may vary. After that the size of the index is irrelevant to the cost of the scan, since we're just scanning across the leaf blocks. (Other differences may exist but not related to size). Perhaps the cost of the initial fetch is what you mean by a "correction"? In that case, why not use the index depth directly from the metapage, rather than play with size? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
Jeff Janes writes: > On Saturday, January 5, 2013, Tom Lane wrote: >> Jeff Janes > writes: >>> One thing which depends on the index size which, as far as I can tell, is >>> not currently being counted is the cost of comparing the tuples all the way >>> down the index. This would be proportional to log2(indextuples) * >>> cpu_index_tuple_cost, or maybe log2(indextuples) * >>> (cpu_index_tuple_cost+cpu_operator_cost), or something like that. >> Yeah, I know. I've experimented repeatedly over the years with trying >> to account explicitly for index descent costs. But every time, anything >> that looks even remotely principled turns out to produce an overly large >> correction that results in bad plan choices. I don't know exactly why >> this is, but it's true. > log2(indextuples) * cpu_index_tuple_cost should produce pretty darn small > corrections, at least if cost parameters are at the defaults. Do you > remember if that one of the ones you tried? Well, a picture is worth a thousand words, so see the attached plot of the various proposed corrections for indexes of 10 to 1e9 tuples. For purposes of argument I've supposed that the index has loading factor 256 tuples/page, and I used the default values of random_page_cost and cpu_index_tuple_cost. The red line is your proposal, the green one is mine, the blue one is current HEAD behavior. Both the blue and green lines get to values that might be thought excessively high for very large indexes, but I doubt that that really matters: if the table contains a billion rows, the cost of a seqscan will be so high that it'll hardly matter if we overshoot the cost of an index probe a bit. (Also, once the table gets that large it's debatable whether the upper index levels all fit in cache, so charging an extra random_page_cost or so isn't necessarily unrealistic.) The real problem though is at the other end of the graph: I judge that the red line represents an overcorrection for indexes of a few thousand tuples. 
It might also be worth noting that for indexes of a million or so tuples, we're coming out to about the same place anyway. >> One other point is that I think it is better for any such correction >> to depend on the index's total page count, not total tuple count, >> because otherwise two indexes that are identical except for bloat >> effects will appear to have identical costs. > This isn't so. A bloated index will be estimated to visit more pages than > an otherwise identical non-bloated index, and so have a higher cost. No it won't, or at least not reliably so, if there is no form of correction for index descent costs. For instance, in a probe into a unique index, we'll always estimate that we're visiting a single index tuple on a single index page. The example you show is tweaked to ensure that it estimates visiting more than one index page, and in that context the leaf-page-related costs probably do scale with bloat; but they won't if the query is only looking for one index entry. > For the bloated index, this correction might even be too harsh. If the > index is bloated by having lots of mostly-empty pages, then this seems > fair. If it is bloated by having lots of entirely empty pages that are not > even linked into the tree, then those empty ones will never be visited and > so it shouldn't be penalized. It's true that an un-linked empty page adds no cost by itself. But if there are a lot of now-empty pages, that probably means a lot of vacant space on upper index pages (which must once have held downlinks to those pages). Which means more upper pages traversed to get to the target leaf page than we'd have in a non-bloated index. Without more experimental evidence than we've got at hand, I'm disinclined to suppose that index bloat is free. > This extra bloat was one of the reasons the partial index was avoided in > "Why does the query planner use two full indexes, when a dedicated partial > index exists?" 
Interesting point, but it's far from clear that the planner was wrong in supposing that that bloat had significant cost. We agree that the current 9.2 correction is too large, but it doesn't follow that zero is a better value. >> So from that standpoint, >> the ln() form of the fudge factor seems quite reasonable as a crude form >> of index descent cost estimate. The fact that we're needing to dial >> it down so much reinforces my feeling that descent costs are close to >> negligible in practice. > If they are negligible, why do we really care that it use a partial index > vs a full index? TBH, in situations like the ones I'm thinking about it's not clear that a partial index is a win at all. The cases where a partial index really wins are where it doesn't index rows that you would otherwise have to visit and make a non-indexed predicate test against --- and those costs we definitely do model. However, if the planner doesn't pick the partial index if available, people are going to report that as a bug. They won't be ab
Re: [HACKERS] [PERFORM] Slow query: bitmap scan troubles
On Saturday, January 5, 2013, Tom Lane wrote:
> Jeff Janes writes:
> > [moved to hackers]
> > On Wednesday, December 5, 2012, Tom Lane wrote:
> >> Hm. To tell you the truth, in October I'd completely forgotten about
> >> the January patch, and was thinking that the 1/10000 cost had a lot
> >> of history behind it. But if we never shipped it before 9.2 then of
> >> course that idea is false. Perhaps we should backpatch the log curve
> >> into 9.2 --- that would reduce the amount of differential between what
> >> 9.2 does and what previous branches do for large indexes.
>
> > I think we should backpatch it for 9.2.3. I've seen another email which is
> > probably due to the same issue (nested loop vs hash join). And some
> > monitoring of a database I am responsible for suggests it might be heading
> > in that direction as well as the size grows.
>
> I received an off-list report of a case where not only did the 1/10000
> factor cause a nestloop-vs-hashjoin decision to be made wrongly, but
> even adding the ln() computation as in commit bf01e34b556 didn't fix it.
> I believe the index in question was on the order of 20000 pages, so
> it's not too hard to see why this might be the case:
>
> * historical fudge factor: 4 * 20000/100000 = 0.8
> * 9.2 fudge factor: 4 * 20000/10000 = 8.0
> * with ln() correction: 4 * ln(1 + 20000/10000) = 4.39 or so
>
> At this point I'm about ready to not only revert the 10-to-1
> change, but keep the ln() adjustment, ie make the calculation be
> random_page_cost * ln(1 + index_pages/10000). This would give
> essentially the pre-9.2 behavior for indexes up to some tens of
> thousands of pages, and keep the fudge factor from getting out of
> control even for very very large indexes.

Yeah, I agree that even the log function grows too rapidly, especially at the early stages. I didn't know if a change that changes that asymptote would be welcome in a backpatch, though.

> > But I am wondering if it should be present at all in 9.3.
> > When it was
> > introduced, the argument seemed to be that smaller indexes might be
> > easier to keep in cache.
>
> No. The argument is that if we don't have some such correction, the
> planner is liable to believe that different-sized indexes have *exactly
> the same cost*, if a given query would fetch the same number of index
> entries.

But it seems like they very likely *do* have exactly the same cost, unless you want to take either the CPU cost of descending the index into account, or take cacheability into account. If they do have the same cost, why shouldn't the estimate reflect that? Using cpu_index_tuple_cost * lg(# index tuples) would break the tie, but by such a small amount that it would easily get swamped by the stochastic nature of ANALYZE for nodes expected to return more than one row.

> This is quite easy to demonstrate when experimenting with
> partial indexes, in particular - without the fudge factor the planner
> sees no advantage of a partial index over a full index from which the
> query would fetch the same number of entries. We do want the planner
> to pick the partial index if it's usable, and a fudge factor is about
> the least unprincipled way to make it do so.

I noticed a long time ago that ordinary index scans seemed to be preferred over bitmap index scans with the same cost estimate, as best as I could determine because they are tested first and the tie goes to the first one (and there is something about it needing to be better by 1% to be counted as better --- although that part might only apply when the start-up cost and the full cost disagree over which one is best). If I've reconstructed that correctly, could something similar be done for partial indexes, where they are just considered first? I guess the problem there is that an index scan on a partial index is not a separate node type from an index scan on a full index, unlike index vs bitmap.
> > The argument for increasing the penalty by a factor of 10 was that the
> > smaller one could be "swamped by noise such as page-boundary-roundoff
> > behavior".
>
> Yeah, I wrote that, but in hindsight it seems like a mistaken idea.
> The noise problem is that because we round off page count and row count
> estimates to integers at various places, it's fairly easy for small
> changes in statistics to move a plan's estimated cost by significantly
> more than this fudge factor will. However, the case that the fudge
> factor is meant to fix is indexes that are otherwise identical for
> the query's purposes --- and any roundoff effects will be the same.
> (The fudge factor itself is *not* rounded off anywhere, it flows
> directly to the bottom-line cost for the indexscan.)

OK, and this agrees with my experience. It seemed like it was the stochastic nature of analyze, not round off problems, that caused the plans to go back and forth.

> > One thing which depends on the index size which, as far as I can tell, is
> > not
Re: [HACKERS] too much pgbench init output
On 6.1.2013 10:35, Tatsuo Ishii wrote:
>>> If we do so, probably '-q' is not an appropriate option name any more,
>>> since the only difference between the old logging and the new one is that the
>>> former is printed every 100000 lines while the latter is every 5
>>> seconds, which is not really "quiet". What do you think?
>>
>> AFAIK the "5 second" logging is much quieter in most cases (and a bit
>> more verbose when the initialization gets very slow), so I think the
>> "quiet" logging is not a bad match, although maybe there's a better name.
>>
>> This change (adding the elapsed/remaining fields to the original logging)
>> would be consistent with this name, because considering a single line,
>> the "-q" is more verbose right now.
>>
>> So I'd stick with the "-q" option and add the fields to the original
>> logging. But I'm not opposing a different name, I just can't think of a
>> better one.
>
> Ok, I'm with you ("-q" along with adding the elapsed/remaining
> fields to the original logging).

Great, attached is a patch that does that.
Tomas

diff --git a/contrib/pgbench/pgbench.c b/contrib/pgbench/pgbench.c
index e376452..9b9ca77 100644
--- a/contrib/pgbench/pgbench.c
+++ b/contrib/pgbench/pgbench.c
@@ -39,6 +39,7 @@
 #include "portability/instr_time.h"

 #include
+#include

 #ifndef WIN32
 #include
@@ -102,6 +103,7 @@ extern int optind;
 #define MAXCLIENTS 1024
 #endif

+#define LOG_STEP_SECONDS 5 /* seconds between log messages */
 #define DEFAULT_NXACTS 10  /* default nxacts */

 int    nxacts = 0;         /* number of transactions per client */
@@ -150,6 +152,7 @@ char *index_tablespace = NULL;
 #define naccounts 100000

 bool   use_log;       /* log transaction latencies to a file */
+bool   use_quiet;     /* quiet logging onto stderr */
 bool   is_connect;    /* establish connection for each transaction */
 bool   is_latencies;  /* report per-command latencies */
 int    main_pid;      /* main process id used in log filename */
@@ -359,6 +362,7 @@ usage(void)
        "  -n           do not run VACUUM after initialization\n"
        "  -F NUM       fill factor\n"
        "  -s NUM       scaling factor\n"
+       "  -q           quiet logging (one message each 5 seconds)\n"
        "  --foreign-keys\n"
        "               create foreign key constraints between tables\n"
        "  --index-tablespace=TABLESPACE\n"
@@ -1362,6 +1366,11 @@ init(bool is_no_vacuum)
    char        sql[256];
    int         i;

+   /* used to track elapsed time and estimate of the remaining time */
+   instr_time  start, diff;
+   double      elapsed_sec, remaining_sec;
+   int         log_interval = 1;
+
    if ((con = doConnect()) == NULL)
        exit(1);
@@ -1430,6 +1439,8 @@
    }
    PQclear(res);

+   INSTR_TIME_SET_CURRENT(start);
+
    for (i = 0; i < naccounts * scale; i++)
    {
        int         j = i + 1;
@@ -1441,10 +1452,42 @@
            exit(1);
        }

-       if (j % 100000 == 0)
-           fprintf(stderr, "%d of %d tuples (%d%%) done.\n",
-                   j, naccounts * scale,
-                   (int) (((int64) j * 100) / (naccounts * scale)));
+       /* If we want to stick with the original logging, print a message each
+        * 100k inserted rows. */
+       if ((!use_quiet) && (j % 100000 == 0))
+       {
+           INSTR_TIME_SET_CURRENT(diff);
+           INSTR_TIME_SUBTRACT(diff, start);
+
+           elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+           remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+           fprintf(stderr, "%d of %d tuples (%d%%) done (elapsed %.2f s, remaining %.2f s).\n",
+                   j, naccounts * scale,
+                   (int) (((int64) j * 100) / (naccounts * scale)),
+                   elapsed_sec, remaining_sec);
+       }
+       /* let's not call the timing for each row, but only each 100 rows */
+       else if (use_quiet && (j % 100 == 0))
+       {
+           INSTR_TIME_SET_CURRENT(diff);
+           INSTR_TIME_SUBTRACT(diff, start);
+
+           elapsed_sec = INSTR_TIME_GET_DOUBLE(diff);
+           remaining_sec = (scale * naccounts - j) * elapsed_sec / j;
+
+           /* have we reached the next interval (
Re: [HACKERS] question: foreign key constraints and AccessExclusive locks
On 6 January 2013 03:08, Jon Nelson wrote:
> When adding a foreign key constraint on tableA which references
> tableB, why is an AccessExclusive lock on tableB necessary? Wouldn't a
> lock that prevents writes be sufficient, or does PostgreSQL have to
> modify *both* tables in some fashion? I'm using PostgreSQL 8.4 on
> Linux.

FKs are enforced by triggers currently. Adding triggers requires AccessExclusiveLock because of catalog visibility issues; you are right that a lower lock is eventually possible. The SQL standard requires the check to be symmetrical, so adding an FK requires a trigger on each table, and so an AEL is placed on tableB.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] too much pgbench init output
>> If we do so, probably '-q' is not an appropriate option name any more,
>> since the only difference between the old logging and the new one is that the
>> former is printed every 100000 lines while the latter is every 5
>> seconds, which is not really "quiet". What do you think?
>
> AFAIK the "5 second" logging is much quieter in most cases (and a bit
> more verbose when the initialization gets very slow), so I think the
> "quiet" logging is not a bad match, although maybe there's a better name.
>
> This change (adding the elapsed/remaining fields to the original logging)
> would be consistent with this name, because considering a single line,
> the "-q" is more verbose right now.
>
> So I'd stick with the "-q" option and add the fields to the original
> logging. But I'm not opposing a different name, I just can't think of a
> better one.

Ok, I'm with you ("-q" along with adding the elapsed/remaining fields to the original logging).

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp