Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
Hi,

On 11/02/2017 06:45 PM, Alvaro Herrera wrote:
> Tomas Vondra wrote:
>
>> Unfortunately, I think we still have a problem ... I've been wondering
>> if we end up producing correct indexes, so I've done a simple test.
>
> Here's a proposed patch that should fix this problem (and does, in my
> testing). Would you please give it a try?
>
> This patch changes two things:
>
> 1. in VACUUM or brin_summarize_new_values, we only process fully loaded
> ranges, and ignore the partial range at end of table.
>
> 2. when summarization is requested on the partial range at the end of a
> table, we acquire extension lock on the rel, then compute relation size
> and run summarization with the lock held. This guarantees that we don't
> miss any pages. This is bad for concurrency though, so it's only done
> in that specific scenario.

FWIW this patch fixes the issue for me - I can no longer reproduce the
bitmapscan vs. seqscan result discrepancies (even with the extra UPDATE
phase).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
On 10/31/2017 11:44 PM, Tomas Vondra wrote:
> ...
> Unfortunately, I think we still have a problem ... I've been wondering
> if we end up producing correct indexes, so I've done a simple test.
>
> 1) create the table as before
>
> 2) let the insert + vacuum run for some time, to see if there are
>    crashes (result: no crashes after one hour, inserting ~92M rows)
>
> 3) do a bunch of random updates on the data (while still doing the
>    concurrent vacuum in another session)
>
> 4) run a bunch of simple queries to compare the results, essentially
>
>     -- BRIN index
>     SET enable_bitmapscan = on;
>     SELECT COUNT(*) FROM brin_test WHERE a = $1;
>
>     -- seq scan
>     SET enable_bitmapscan = off;
>     SELECT COUNT(*) FROM brin_test WHERE a = $1;
>
> and unfortunately what I get is not particularly pleasant:
>
>     test=# set enable_bitmapscan = on;
>     SET
>     test=# select count(*) from brin_test where a = 0;
>      count
>     -------
>       9062
>     (1 row)
>
>     test=# set enable_bitmapscan = off;
>     SET
>     test=# select count(*) from brin_test where a = 0;
>      count
>     -------
>       9175
>     (1 row)
>
> Attached is a SQL script with the commands I used. You'll need to copy
> the commands into multiple psql sessions, though, to simulate
> concurrent activity.

FWIW I can reproduce this on 9.5, and I don't even need to run the
UPDATE part. That is, INSERT + VACUUM running concurrently is enough to
produce broken BRIN indexes :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
Hi,

On 10/31/2017 08:46 PM, Tom Lane wrote:
> I wrote:
>> maybe we just have some run-of-the-mill bugs to find, like the
>> off-the-end bug I spotted in brin_doupdate. There's apparently at
>> least one more, but given the error message it must be something like
>> not checking for a page to have turned into a revmap page. Shouldn't
>> be too hard to find...
>
> Actually, I think it might be as simple as the attached.
> brin_getinsertbuffer checks for the old page having turned into revmap,
> but the "samepage" path in brin_doupdate does not :-(
>
> With this applied, Alvaro's version of the test case has survived
> without error for quite a bit longer than its former MTBF. There
> might still be some issues though in other code paths.

That does fix the crashes for me - I've been unable to reproduce any
even after one hour (it took a couple of minutes to crash before).

Unfortunately, I think we still have a problem ... I've been wondering
if we end up producing correct indexes, so I've done a simple test.

1) create the table as before

2) let the insert + vacuum run for some time, to see if there are
   crashes (result: no crashes after one hour, inserting ~92M rows)

3) do a bunch of random updates on the data (while still doing the
   concurrent vacuum in another session)

4) run a bunch of simple queries to compare the results, essentially

    -- BRIN index
    SET enable_bitmapscan = on;
    SELECT COUNT(*) FROM brin_test WHERE a = $1;

    -- seq scan
    SET enable_bitmapscan = off;
    SELECT COUNT(*) FROM brin_test WHERE a = $1;

and unfortunately what I get is not particularly pleasant:

    test=# set enable_bitmapscan = on;
    SET
    test=# select count(*) from brin_test where a = 0;
     count
    -------
      9062
    (1 row)

    test=# set enable_bitmapscan = off;
    SET
    test=# select count(*) from brin_test where a = 0;
     count
    -------
      9175
    (1 row)

Attached is a SQL script with the commands I used. You'll need to copy
the commands into multiple psql sessions, though, to simulate concurrent
activity.
regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: brin-test.sql (application/sql)
Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)
Hi,

On 10/20/2017 03:23 PM, Robert Haas wrote:
>
> ...
>
> The main points I want to make clearly understood is the current
> design relies on (1) functions being labeled correctly and (2) other
> dangerous code paths being unreachable because there's nothing that
> runs between EnterParallelMode and ExitParallelMode which could invoke
> them, except by calling a mislabeled function. Your patch expands the
> vulnerability surface from "executor code that can be reached without
> calling a mislabeled function" to "any code that can be reached by
> typing an SQL command". Just rejecting any queries that are
> parallel-unsafe probably closes a good chunk of the holes, but that
> still leaves a lot of code that's never been run in parallel mode
> before potentially now running in parallel mode - e.g. any DDL command
> you happen to type, transaction control commands, code that only runs
> when the server is idle like idle_in_transaction_timeout, cursor
> operations. A lot of that stuff is probably fine, but it's got to be
> thought through. Error handling might be a problem, too: what happens
> if a parallel worker is killed while the query is suspended? I
> suspect that doesn't work very nicely at all.

OK, understood, and thanks for explaining what the possible issues may
be. I do appreciate that.

I still think it'd be valuable to support this, though, so I'm going to
spend more time on investigating what needs to be handled.

But maybe there's a simpler option - what if we only allow fetches from
the PARALLEL cursor while the cursor is open? That is, this would work:

    BEGIN;
    ...
    DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...;
    FETCH 1000 FROM x;
    FETCH 1000 FROM x;
    FETCH 1000 FROM x;
    CLOSE x;
    ...
    COMMIT;

but adding any other command between the DECLARE/CLOSE commands would
fail. That should close all the holes with parallel-unsafe stuff, right?
Of course, this won't solve the issue with error handling / killing
suspended workers (which didn't occur to me before as a possible issue
at all, so thanks for pointing that out). But that's a significantly
more limited issue to fix than all the parallel-unsafe bits.

Now, I agree this is somewhat more limited than I hoped for, but OTOH it
still solves the issue I initially aimed for (processing large query
results in an efficient way).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Re: Anyone have experience benchmarking very high effective_io_concurrency on NVME's?
Hi,

On 10/31/2017 04:48 PM, Greg Stark wrote:
> On 31 October 2017 at 07:05, Chris Travers <chris.trav...@adjust.com> wrote:
>> Hi;
>>
>> After Andres's excellent talk at PGConf we tried benchmarking
>> effective_io_concurrency on some of our servers and found that those
>> which have a number of NVME storage volumes could not fill the I/O
>> queue even at the maximum setting (1000).
>
> And was the system still i/o bound? If the cpu was 100% busy then
> perhaps Postgres just can't keep up with the I/O system. It would
> depend on workload though, if you start many very large sequential
> scans you may be able to push the i/o system harder.
>
> Keep in mind effective_io_concurrency only really affects bitmap
> index scans (and to a small degree index scans). It works by issuing
> posix_fadvise() calls for upcoming buffers one by one. That gets
> multiple spindles active but it's not really going to scale to many
> thousands of prefetches (and effective_io_concurrency of 1000
> actually means 7485 prefetches). At some point those i/o are going
> to start completing before Postgres even has a chance to start
> processing the data.

Yeah, initiating the prefetches is not expensive, but it's not free
either. So there's a trade-off between time spent on prefetching and
processing the data.

I believe this may actually be illustrated using Amdahl's law - the I/O
is the parallel part, and processing the data is the serial part. And no
matter what you do, the device only has so much bandwidth, which defines
the maximum possible speedup (compared to the "no prefetch" case).

Furthermore, the device does not wait for all the I/O requests to be
submitted - it won't wait for 1000 requests and then go "OMG! There's a
lot of work to do!" It starts processing the requests as they arrive,
and some of them will complete before you're done with submitting the
rest, so you'll never see all the requests in the queue at once.
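To put some rough numbers behind the Amdahl's law analogy, here is a
trivial sketch (Python; the 80% "prefetchable" fraction is purely
illustrative, not a measurement from any benchmark):

```python
def amdahl_speedup(parallel_fraction, concurrency):
    """Amdahl's law: speedup when `parallel_fraction` of the runtime can
    be overlapped across `concurrency`-way parallelism (here: prefetched
    I/O wait), while the rest (per-tuple processing) stays serial."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / concurrency)

# Illustrative numbers: if 80% of scan time is I/O wait that prefetching
# can hide, the speedup is capped at 1 / 0.2 = 5x no matter how deep the
# queue gets - and most of that is already gained at small queue depths.
for depth in (2, 8, 64, 1000):
    print(depth, round(amdahl_speedup(0.8, depth), 2))
```

This also matches the observation below that the gains flatten out
quickly once the prefetch distance passes a fairly small value.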
And of course, iostat and other tools only give you the "average queue
length", which is mostly determined by the average throughput.

In my experience (on all types of storage, including SSDs and NVMe), the
performance quickly and significantly improves once you start increasing
the value (say, to 8 or 16, maybe 64). And then the gains become much
more modest - not because the device could not handle more, but because
the prefetch/processing ratio has reached its optimal value.

But all this is actually per-process - if you can run multiple backends
(particularly when doing bitmap index scans), I'm sure you'll see the
queues being more full.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: BRIN multi-range indexes
Hi,

attached is a patch series that includes both the BRIN multi-range
minmax indexes discussed in this thread, and the BRIN bloom indexes
initially posted in [1]. It seems easier to just deal with a single
patch series, although we may end up adding just one of the proposed
opclasses.

There are 4 parts:

0001 - Modifies bringetbitmap() to pass all scan keys to the consistent
function at once. This is actually needed by the multi-minmax indexes,
but not really required for the others.

I'm wondering if this is a safe change, considering it affects the BRIN
interface. I.e. custom BRIN opclasses (perhaps in extensions) will be
broken by this change. Maybe we should extend the BRIN API to support
two versions of the consistent function - one that processes scan keys
one by one, and one that processes all of them at once.

0002 - Adds BRIN bloom indexes, along with opclasses for all built-in
data types (or at least those that also have regular BRIN opclasses).

0003 - Adds BRIN multi-minmax indexes, but only with float8 opclasses
(which also covers timestamp etc.). That should be good enough for now,
but adding support for other data types will require some sort of
"distance" operator, which is needed for merging ranges (to pick the two
"closest" ones). For float8 it's simply a subtraction.

0004 - Moves dealing with IS [NOT] NULL scan keys from the opclasses to
bringetbitmap(). The code was exactly the same in all opclasses, so
moving it to bringetbitmap() seems right. It also allows some nice
optimizations where we can skip the consistent() function entirely,
although maybe that's useless. Also, maybe there are opclasses that
actually need to deal with the NULL values in the consistent() function?
regards

[1] https://www.postgresql.org/message-id/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:
  0001-Pass-all-keys-to-BRIN-consistent-function-at-once.patch.gz (application/gzip)
  0002-BRIN-bloom-indexes.patch.gz (application/gzip)
  0003-BRIN-multi-range-minmax-indexes.patch.gz (application/gzip)
  0004-Move-IS-NOT-NULL-checks-to-bringetbitmap.patch.gz (application/gzip)
Re: [HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
On 10/30/2017 09:03 PM, Tom Lane wrote:
> Michael Paquier <michael.paqu...@gmail.com> writes:
>> b1328d78f is close enough, but per what I see that's actually not
>> true. I have been able to reproduce the problem, which shows up within
>> a window of 2-4 minutes. Still sometimes it can take way longer, and I
>> spotted one test where it took 15 minutes to show up... So, bisecting
>> with a test that looks for core files for 20 minutes, I am seeing that
>> the following commit is actually at fault:
>
>> commit 24992c6db9fd40f342db1f22747ec9e56483796d
>> Author: Tom Lane <t...@sss.pgh.pa.us>
>> Date:   Fri Sep 9 19:00:59 2016 -0400
>
>>     Rewrite PageIndexDeleteNoCompact into a form that only deletes 1 tuple.
>
> [ scratches head ... ] Not sure how that could've introduced any
> problems? Will look.

Not sure, but I can confirm Michael's findings - I've been unable to
reproduce the issue on 1a4be103a5 even after 20 minutes, and on
24992c6db9 it failed after only 2.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Re: PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
FWIW I can reproduce this on REL_10_STABLE, but not on REL9_6_STABLE. So
it seems to be due to something that changed in the last release.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] PANIC: invalid index offnum: 186 when processing BRIN indexes in VACUUM
Hi,

while doing some weekend hacking & testing on the BRIN patches I posted
recently, I ran into PANICs in VACUUM when it summarizes data inserted
concurrently (in another session):

    PANIC: invalid index offnum: 186

Initially I thought it was a bug in my patches, but apparently it's not,
and I can reproduce it easily on current master (846fcc8516).

I'm not sure if/how this is related to the BRIN autosummarization issue
reported by Justin Pryzby about two weeks ago (thread [1]), but I don't
see any segfaults and the messages are always exactly the same.

[1] https://www.postgresql.org/message-id/flat/20171014035732.GB31726%40telsasoft.com

Reproducing the issue is simple:

    create table brin_test (a int, b bigint, c float,
                            d double precision, e uuid);

    create index on brin_test using brin (a);
    create index on brin_test using brin (b);
    create index on brin_test using brin (c);
    create index on brin_test using brin (d);
    create index on brin_test using brin (e);

and then run

    insert into brin_test
    select mod(i,10)/25, mod(i,10)/25, mod(i,10)/25, mod(i,10)/25,
           md5((mod(i,10)/25)::text)::uuid
      from generate_series(1,10) s(i) \watch 0.1

    vacuum brin_test \watch 1

And sooner or later (for me it only takes a minute or two in most cases)
the VACUUM session should fail with the PANIC message mentioned above.
It always fails with the same message, only the value (offset) changes.
The stack trace always looks exactly the same - see the attachment.

At first it seemed the idxrel was always the index on 'e' (i.e. the UUID
column), but it seems I also got failures on the other indexes.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Core was generated by `postgres: postgres test [local] VACUUM '.
Program terminated with signal SIGABRT, Aborted.
#0  0x7fcadc413167 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x7fcadc413167 in raise () from /lib64/libc.so.6
#1  0x7fcadc4145ea in abort () from /lib64/libc.so.6
#2  0x00a2409f in errfinish (dummy=0) at elog.c:557
#3  0x00a26734 in elog_finish (elevel=20, fmt=0xc2ee3d "invalid index offnum: %u") at elog.c:1378
#4  0x008a4e86 in PageIndexTupleDeleteNoCompact (page=0x7fcad4595080 "\004", offnum=292) at bufpage.c:982
#5  0x00476b0c in brin_doupdate (idxrel=0x7fcad33700d8, pagesPerRange=128, revmap=0x21657f0, heapBlk=1676160, oldbuf=1442, oldoff=292, origtup=0x2169518, origsz=8, newtup=0x2165dc0, newsz=24, samepage=0 '\000') at brin_pageops.c:246
#6  0x00475b34 in summarize_range (indexInfo=0x2165940, state=0x21673d0, heapRel=0x7fcad336f050, heapBlk=1676160, heapNumBlks=1676189) at brin.c:1199
#7  0x00475ddd in brinsummarize (index=0x7fcad33700d8, heapRel=0x7fcad336f050, pageRange=4294967295, numSummarized=0x2166e20, numExisting=0x2166e20) at brin.c:1306
#8  0x00474e7d in brinvacuumcleanup (info=0x7ffdf5c28d70, stats=0x2166e10) at brin.c:794
#9  0x004f85d2 in index_vacuum_cleanup (info=0x7ffdf5c28d70, stats=0x0) at indexam.c:772
#10 0x006c59c1 in lazy_cleanup_index (indrel=0x7fcad33700d8, stats=0x0, vacrelstats=0x21656c0) at vacuumlazy.c:1651
#11 0x006c48f7 in lazy_scan_heap (onerel=0x7fcad336f050, options=1, vacrelstats=0x21656c0, Irel=0x2165550, nindexes=5, aggressive=0 '\000') at vacuumlazy.c:1334
#12 0x006c286e in lazy_vacuum_rel (onerel=0x7fcad336f050, options=1, params=0x7ffdf5c29450, bstrategy=0x218c710) at vacuumlazy.c:258
#13 0x006c2326 in vacuum_rel (relid=16385, relation=0x209ecd0, options=1, params=0x7ffdf5c29450) at vacuum.c:1543
#14 0x006c0942 in vacuum (options=1, relations=0x218c888, params=0x7ffdf5c29450, bstrategy=0x218c710, isTopLevel=1 '\001') at vacuum.c:340
#15 0x006c0455 in ExecVacuum (vacstmt=0x209edc0, isTopLevel=1 '\001') at vacuum.c:141
#16 0x008b4cb3 in standard_ProcessUtility (pstmt=0x209f110, queryString=0x209e2b0 "vacuum brin_test ", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x209f208, completionTag=0x7ffdf5c298d0 "") at utility.c:675
#17 0x008b4413 in ProcessUtility (pstmt=0x209f110, queryString=0x209e2b0 "vacuum brin_test ", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x209f208, completionTag=0x7ffdf5c298d0 "") at utility.c:357
#18 0x008b33b4 in PortalRunUtility (portal=0x211b9f0, pstmt=0x209f110, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x209f208, completionTag=0x7ffdf5c298d0 "") at pquery.c:1178
#19 0x008b35cc in PortalRunMulti (portal=0x211b9f0, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0x209f208, altdest=0x209f208, completionTag=0x7ffdf5c298d0 "") at pquery.c:1324
#20 0x0
Re: [HACKERS] WIP: BRIN bloom indexes
Hi,

On 10/28/2017 02:41 AM, Nico Williams wrote:
> On Fri, Oct 27, 2017 at 10:06:58PM +0200, Tomas Vondra wrote:
>>> + * We use an optimisation that initially we store the uint32 values directly,
>>> + * without the extra hashing step. And only later filling the bitmap space,
>>> + * we switch to the regular bloom filter mode.
>>>
>>> I don't think that optimization is worthwhile. If I'm going to be
>>> using a Bloom filter, it's because I expect to have a lot of rows.
>>>
>>> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
>>> table just won't have lots of rows, then I might want this
>>> optimization, but I can always just drop the Bloom filter index, or
>>> not include indexes in the LIKE.)
>>
>> I think you're confusing "rows" and "distinct values". Furthermore,
>> it's rather irrelevant how many rows are in the table because BRIN
>> indexes split the table into "ranges" that are 1MB by default. And you
>> can only squash a certain number of rows into such a range.
>>
>> The idea of the optimization is to efficiently support cases where
>> each range contains only a small number of distinct values - which is
>> quite common in the cases I described in my initial message (with
>> bursts of rows with the same IP / client identifier etc.).
>
> What range? It's just bits to set.
>
> The point is that Bloom filters should ideally be about 50% full -- much
> less than that and you are wasting space, much more than that and your
> false positive rate becomes useless.

The range defined by BRIN indexes. BRIN indexes split the table into a
sequence of page ranges (128 pages by default, i.e. 1MB), and the bloom
filters are built on those table "chunks". So it's irrelevant how many
rows are in the table in total - what matters is just the page range.

>>> Filter compression is not worthwhile. You want to have a fairly
>>> uniform hash distribution, and you want to end up with a Bloom filter
>>> that's not much more than 50% full.
>>> That means that when near 50% full the filter will not be very
>>> sparse at all. Optimizing for the not common case doesn't seem
>>> worthwhile, and adds complexity.
>>
>> Properly sizing the bloom filter requires knowledge of many variables,
>> in particular the number of distinct values expected to be added to
>> the filter. But we don't really know that number, and we also don't
>> even know many values useful for estimating that (how many rows will
>> fit into a range, number of distinct values in the whole table, etc.)
>
> This is why Scalable Bloom filters exist: so you can adapt.
>
>> So the idea was to oversize the bloom filter, and then use the sparse
>> representation to reduce the size.
>
> A space-efficient representation of sparse bitmaps is not as simple as
> a Scalable Bloom filter.
>
> And a filter that requires user intervention to size correctly, or
> which requires rebuilding when it gets too full, is also not as
> convenient as a Scalable Bloom filter.

Maybe. For now I'm fine with the simple relopts-based approach and don't
plan to spend much time on these improvements. Which is not to say that
they may not be worthwhile, but it's not the only thing I'm working on.

I also suspect there are practical implications, as some things are only
possible with equally-sized bloom filters. I believe the "union"
operation (missing in the current patch) will rely on that when merging
bloom filters.

>>> + * XXX We can also watch the number of bits set in the bloom filter, and
>>> + * then stop using it (and not store the bitmap, to save space) when the
>>> + * false positive rate gets too high.
>>>
>>> Ah, right, what you want is a "Scalable Bloom Filter".
>>
>> That's not what I had in mind. My idea was to size the bloom filter on
>> a "best effort" basis, and then if one range gets a bit too inaccurate
>> then just get rid of the filter. If that happens for many ranges, we
>> should probably allow specifying parameters as relopts for the index.
> Scalable Bloom filters are way more convenient than that. They're
> always not-too-full, and only the open filter is less-than-full-enough.
>
> And since you should grow them somewhat exponentially (but not quite as
> much as a doubling in each generation), there should never be too many
> filters to search. But you can always "vacuum" (rebuild) the filter
> starting with the size of the last filter added prior to the vacuum.
>
>> I think this is really over-engineering, and I certainly don't plan to
>> extend the patch in this direction.
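To make the scheme Nico describes concrete, here is a minimal Python
sketch of a Scalable Bloom filter (all names, the ~50% fill heuristic,
and the growth factor are made up for illustration - this is not the
structure from the paper, nor code from any patch): only the newest
filter accepts inserts, each new filter is larger than the last, and
lookups must probe every filter.

```python
import hashlib

class BloomFilter:
    def __init__(self, nbits, nhashes):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray((nbits + 7) // 8)
        self.nadded = 0    # number of insertions so far

    def _positions(self, item):
        # derive the k probe positions from one strong hash (simplified)
        h = hashlib.sha256(repr(item).encode()).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(h[4 * i:4 * i + 4], "big") % self.nbits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.nadded += 1

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

class ScalableBloom:
    """Series of filters; only the newest accepts inserts, each new
    filter is larger than the previous one, lookups probe all of them."""
    def __init__(self, initial_bits=1024, nhashes=4, growth=2):
        self.nhashes, self.growth = nhashes, growth
        self.filters = [BloomFilter(initial_bits, nhashes)]

    def add(self, item):
        f = self.filters[-1]
        # crude "roughly half full" heuristic: open a bigger filter once
        # the number of set bits could approach nbits / 2
        if f.nadded * self.nhashes * 2 >= f.nbits:
            f = BloomFilter(f.nbits * self.growth, self.nhashes)
            self.filters.append(f)
        f.add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)
```

Note how the lookup cost grows with the number of closed filters, which
is why the sub-filters should grow (roughly) exponentially.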
Re: [HACKERS] WIP: BRIN bloom indexes
Hi,

On 10/27/2017 05:22 PM, Sokolov Yura wrote:
>
> Hi, Tomas
>
> BRIN bloom index is a really cool feature, that definitely should be
> in core distribution (either in contrib or builtin)!!!
>
> Small suggestion for algorithm:
>
> It is well known practice not to calculate whole hash function for
> every point, but use double hashing approach.
> Example of proving double hashing approach for bloom filters:
> https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf
>
> So, instead of:
>
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* compute the hashes with a seed, used for the bloom filter */
>         uint32 h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;
>
>         /* set or check bit h */
>     }
>
> such construction could be used:
>
>     /* compute the hashes, used for the bloom filter */
>     uint32 big_h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i)));
>     uint32 h = big_h % filter->nbits;
>     /* ensures d is never >= filter->nbits */
>     uint32 d = big_h % (filter->nbits - 1);
>
>     for (i = 0; i < filter->nhashes; i++)
>     {
>         /* set or check bit h */
>
>         /* compute next bit h */
>         h += d++;
>         if (h >= filter->nbits) h -= filter->nbits;
>         if (d == filter->nbits) d = 0;
>     }
>
> Modulo of one 64bit result by two coprime numbers (`nbits` and
> `nbits-1`) gives two good-quality functions usable for double hashing.

OK, thanks for the suggestion.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
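The probe sequence Sokolov suggests can be sketched in a few lines of
Python, independently of the C code (the function name is made up and
blake2b stands in for hash_uint32_extended - the point is only the
h, h+d, h+2d, ... iteration derived from a single hash):

```python
import hashlib

def probe_positions(value, nbits, nhashes):
    """Double-hashing probe sequence: one 64-bit hash yields a start h
    and a step d; successive probes are h, h+d, h+2d, ... mod nbits,
    with d incremented each round as in the construction quoted above."""
    big_h = int.from_bytes(
        hashlib.blake2b(value.to_bytes(8, "little"), digest_size=8).digest(),
        "little")
    h = big_h % nbits
    d = big_h % (nbits - 1)     # ensures d is never >= nbits
    out = []
    for _ in range(nhashes):
        out.append(h)           # "set or check bit h" goes here
        h += d                  # compute next bit h (h += d++ in the C)
        d += 1
        if h >= nbits:
            h -= nbits
        if d == nbits:
            d = 0
    return out
```

The payoff is that only one full-strength hash is computed per value,
instead of nhashes of them, which is exactly the saving the double
hashing paper formalizes.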
Re: [HACKERS] WIP: BRIN bloom indexes
Hi,

On 10/27/2017 07:17 PM, Nico Williams wrote:
> On Thu, Oct 19, 2017 at 10:15:32PM +0200, Tomas Vondra wrote:
>
> A bloom filter index would, indeed, be wonderful.
>
> Comments:
>
> + * We use an optimisation that initially we store the uint32 values directly,
> + * without the extra hashing step. And only later filling the bitmap space,
> + * we switch to the regular bloom filter mode.
>
> I don't think that optimization is worthwhile. If I'm going to be
> using a Bloom filter, it's because I expect to have a lot of rows.
>
> (Granted, if I CREATE TABLE .. LIKE .. INCLUDING INDEXES, and the new
> table just won't have lots of rows, then I might want this
> optimization, but I can always just drop the Bloom filter index, or
> not include indexes in the LIKE.)

I think you're confusing "rows" and "distinct values". Furthermore, it's
rather irrelevant how many rows are in the table, because BRIN indexes
split the table into "ranges" that are 1MB by default, and you can only
squash a certain number of rows into such a range.

The idea of the optimization is to efficiently support cases where each
range contains only a small number of distinct values - which is quite
common in the cases I described in my initial message (with bursts of
rows with the same IP / client identifier etc.).

> + * XXX Perhaps we could save a few bytes by using different data types, but
> + * considering the size of the bitmap, the difference is negligible.
>
> A bytea is all that's needed. See below.
>
> + * XXX We could also implement "sparse" bloom filters, keeping only the
> + * bytes that are not entirely 0. That might make the "sorted" phase
> + * mostly unnecessary.
>
> Filter compression is not worthwhile. You want to have a fairly
> uniform hash distribution, and you want to end up with a Bloom filter
> that's not much more than 50% full. That means that when near 50% full
> the filter will not be very sparse at all.
> Optimizing for the not common case doesn't seem worthwhile, and adds
> complexity.

Properly sizing the bloom filter requires knowledge of many variables,
in particular the number of distinct values expected to be added to the
filter. But we don't really know that number, and we also don't even
know many values useful for estimating that (how many rows will fit into
a range, number of distinct values in the whole table, etc.)

So the idea was to oversize the bloom filter, and then use the sparse
representation to reduce the size.

> + * XXX We can also watch the number of bits set in the bloom filter, and
> + * then stop using it (and not store the bitmap, to save space) when the
> + * false positive rate gets too high.
>
> Ah, right, what you want is a "Scalable Bloom Filter".

That's not what I had in mind. My idea was to size the bloom filter on a
"best effort" basis, and then if one range gets a bit too inaccurate
then just get rid of the filter. If that happens for many ranges, we
should probably allow specifying parameters as relopts for the index.

> A Scalable Bloom filter is actually a series of Bloom filters where
> all but the newest filter are closed to additions, and the newest
> filter is where you do all the additions. You generally want to make
> each new filter bigger than the preceding one because when searching a
> Scalable Bloom filter you have to search *all* of them, so you want to
> minimize the number of filters.
>
> Eventually, when you have enough sub-filters, you may want to re-write
> the whole thing so that you start with a single sub-filter that is
> large enough.
>
> The format of the bytea might then be something like:
>
> [[[...]]
>
> where the last bitmap is the filter that is open to additions.

I think this is really over-engineering, and I certainly don't plan to
extend the patch in this direction.
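For reference, the "properly sizing" problem has a well-known closed
form. These are the standard textbook Bloom filter formulas, not code
from the patch, and the 900-distinct-values example is invented:

```python
import math

def bloom_size(n_distinct, fp_rate):
    """Standard Bloom filter sizing: for n expected distinct values and
    a target false-positive rate p, the optimal bitmap size is
    m = -n * ln(p) / (ln 2)^2 bits, with k = (m/n) * ln 2 hash
    functions. The hard part is that n must be known up front."""
    m = math.ceil(-n_distinct * math.log(fp_rate) / (math.log(2) ** 2))
    k = max(1, round((m / n_distinct) * math.log(2)))
    return m, k

# e.g. ~900 distinct values in one page range with a 1% target rate
# needs roughly a kilobyte of bitmap and 7 hash functions
m, k = bloom_size(900, 0.01)
```

Since m grows linearly with n, guessing n too low quickly ruins the
false-positive rate - which is exactly why the patch oversizes and then
considers dropping filters that get too inaccurate.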
I do not expect these parameters (particularly the number of distinct
values in a range) to significantly change over time, so the easiest
solution is to provide a reloption to specify that number in
CREATE/ALTER INDEX.

Alternatively, we could allow the summarization process to gather some
statistics (either on the whole range, or perhaps the whole table), and
store them somewhere for later use. For example to compute the number of
distinct values per range (in the existing data), and use that later.

> ...
>
> Of course, for something like:
>
>     SELECT a.*, b.* FROM a a JOIN b b USING (some_col);
>
> a Bloom filter can only be used as an optimization to avoid using a
> slower index (or heap scan) on the inner table source.
>
> What I'm getting at is that the query planner really needs to
> understand that a Bloom filter is a probabilistic data structure.

It does, and
Re: [HACKERS] WIP: BRIN bloom indexes
Hi,

On 10/27/2017 09:34 AM, Simon Riggs wrote:
> On 27 October 2017 at 07:20, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Oct 19, 2017 at 10:15 PM, Tomas Vondra
>> <tomas.von...@2ndquadrant.com> wrote:
>>> Let's see a query like this:
>>>
>>>     select * from bloom_test
>>>      where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>>>
>>> The minmax index produces this plan
>>>
>>>     Heap Blocks: lossy=2061856
>>>     Execution time: 22707.891 ms
>>>
>>> Now, the bloom index:
>>>
>>>     Heap Blocks: lossy=25984
>>>     Execution time: 338.946 ms
>>
>> It's neat to see BRIN being extended like this. Possibly we could
>> consider making it a contrib module rather than including it in core,
>> although I don't have strong feelings about it.
>
> I see that SP-GIST includes two operator classes in core, one default.
>
> Makes sense to do the same thing with BRIN and add this new op class
> as a non-default option in core.

I'm not sure "a number of in-core opclasses" is a good reason to (not)
add new ones. Also, we already have two built-in BRIN opclasses (minmax
and inclusion).

In general, "BRIN bloom" can be packaged as a contrib module (at least I
believe so). That's not the case for "BRIN multi-range", which also
requires some changes to the code in brin.c (but the rest can be moved
to a contrib module, of course).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Burst in WAL size when UUID is used as PK while full_page_writes are enabled
On 10/27/2017 07:56 AM, sanyam jain wrote: > Hi, > > I was reading the > blog https://blog.2ndquadrant.com/on-the-impact-of-full-page-writes . > For the record, I assume you're referring to this part: With BIGSERIAL new values are sequential, and so get inserted to the same leaf pages in the btree index. As only the first modification to a page triggers the full-page write, only a tiny fraction of the WAL records are FPIs. With UUID it's a completely different case, of course – the values are not sequential at all; in fact, each insert is likely to touch a completely new leaf page (assuming the index is large enough). > My queries: > > How is the randomness of UUID likely to create new leaf pages in the btree index? > In my understanding, as the size of UUID is 128 bits, i.e. twice that of > BIGSERIAL, more pages will be required to store the same > number of rows, and hence there can be an increase in WAL size due to FPW. > When comparing the index sizes in a local setup, the UUID index is ~2x greater in > size. > Perhaps this is just a poor choice of words on my side, but I wasn't suggesting new leaf pages will be created but merely that the inserts will touch a different (possibly existing) leaf page. That's a direct consequence of the inherent UUID randomness. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
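As a postscript, the "first modification per checkpoint" effect is easy to simulate. The sketch below (Python, with made-up page and key counts — purely illustrative, not numbers from the blog post) counts how many pages are touched for the first time after each checkpoint, which is a proxy for the number of full-page images written:

```python
import random

def count_fpis(page_sequence, checkpoint_every):
    """Count first modifications of each page per checkpoint interval;
    each one corresponds to one full-page image in the WAL."""
    fpis = 0
    dirty = set()
    for i, page in enumerate(page_sequence):
        if i % checkpoint_every == 0:
            dirty.clear()  # checkpoint: the next touch of any page costs an FPI
        if page not in dirty:
            fpis += 1
            dirty.add(page)
    return fpis

random.seed(42)
n_inserts = 100_000
keys_per_page = 100
n_pages = n_inserts // keys_per_page

# BIGSERIAL-like: consecutive keys land on the same (then the next) leaf page
sequential = [i // keys_per_page for i in range(n_inserts)]
# UUID-like: every insert hits a random leaf page
uuid_like = [random.randrange(n_pages) for _ in range(n_inserts)]

seq_fpis = count_fpis(sequential, checkpoint_every=10_000)
rnd_fpis = count_fpis(uuid_like, checkpoint_every=10_000)
print(seq_fpis, rnd_fpis)
```

With these parameters the random workload generates roughly an order of magnitude more FPIs than the sequential one, matching the behavior described in the blog post.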
Re: [HACKERS] WIP: BRIN multi-range indexes
Apparently I've managed to botch the git format-patch thing :-( Attached are both patches (the first one adding BRIN bloom indexes, the other one adding the BRIN multi-range). Hopefully I got it right this time ;-) regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services >From c27f02d2839e08ebcee852448ed3838c8932d2ea Mon Sep 17 00:00:00 2001 From: Tomas Vondra <to...@2ndquadrant.com> Date: Mon, 23 Oct 2017 22:48:58 +0200 Subject: [PATCH 1/2] brin bloom v1 --- doc/src/sgml/brin.sgml | 236 ++ src/backend/access/brin/Makefile | 2 +- src/backend/access/brin/brin_bloom.c | 727 +++ src/include/catalog/pg_amop.h| 59 +++ src/include/catalog/pg_amproc.h | 153 +++ src/include/catalog/pg_opclass.h | 25 ++ src/include/catalog/pg_opfamily.h| 20 + src/include/catalog/pg_proc.h| 10 + src/test/regress/expected/opr_sanity.out | 3 +- 9 files changed, 1233 insertions(+), 2 deletions(-) create mode 100644 src/backend/access/brin/brin_bloom.c diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index b7483df..49d53e1 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -118,6 +118,13 @@ + abstime_bloom_ops + abstime + + = + + + abstime_minmax_ops abstime @@ -129,6 +136,13 @@ + int8_bloom_ops + bigint + + = + + + int8_minmax_ops bigint @@ -180,6 +194,13 @@ + bytea_bloom_ops + bytea + + = + + + bytea_minmax_ops bytea @@ -191,6 +212,13 @@ + bpchar_bloom_ops + character + + = + + + bpchar_minmax_ops character @@ -202,6 +230,13 @@ + char_bloom_ops + "char" + + = + + + char_minmax_ops "char" @@ -213,6 +248,13 @@ + date_bloom_ops + date + + = + + + date_minmax_ops date @@ -224,6 +266,13 @@ + float8_bloom_ops + double precision + + = + + + float8_minmax_ops double precision @@ -235,6 +284,13 @@ + inet_bloom_ops + inet + + = + + + inet_minmax_ops inet @@ -258,6 +314,13 @@ + int4_bloom_ops + integer + + = + + + int4_minmax_ops integer @@ -269,6 +332,13 @@ + interval_bloom_ops + interval + + = + + + 
interval_minmax_ops interval @@ -280,6 +350,13 @@ + macaddr_bloom_ops + macaddr + + = + + + macaddr_minmax_ops macaddr @@ -291,6 +368,13 @@ + macaddr8_bloom_ops + macaddr8 + + = + + + macaddr8_minmax_ops macaddr8 @@ -302,6 +386,13 @@ + name_bloom_ops + name + + = + + + name_minmax_ops name @@ -313,6 +404,13 @@ + numeric_bloom_ops + numeric + + = + + + numeric_minmax_ops numeric @@ -324,6 +422,13 @@ + pg_lsn_bloom_ops + pg_lsn + + = + + + pg_lsn_minmax_ops pg_lsn @@ -335,6 +440,13 @@ + oid_bloom_ops + oid + + = + + + oid_minmax_ops oid @@ -366,6 +478,13 @@ + float4_bloom_ops + real + + = + + + float4_minmax_ops real @@ -377,6 +496,13 @@ + reltime_bloom_ops + reltime + + = + + + reltime_minmax_ops reltime @@ -388,6 +514,13 @@ + int2_bloom_ops + smallint + + = + + + int2_minmax_ops smallint @@ -399,6 +532,13 @@ + text_bloom_ops + text + + = + + + text_minmax_ops text @@ -410,6 +550,13 @@ + tid_bloom_ops + tid + + = + + + tid_minmax_ops tid @@ -421,6 +568,13 @@ + timestamp_bloom_ops + timestamp without time zone + + = + + + timestamp_minmax_ops timestamp without time zone @@ -432,6 +586,13 @@ + timestamptz_bloom_ops + timestamp with time zone + + = + + + timestamptz_minmax_ops timestamp with t
[HACKERS] WIP: BRIN multi-range indexes
'1'::double precision)) Rows Removed by Index Recheck: 3361 Heap Blocks: lossy=64 -> Bitmap Index Scan on a_val_idx (cost=0.00..830.01 rows=10200 width=0) (actual time=9.274..9.274 rows=640 loops=1) Index Cond: ((val >= '100'::double precision) AND (val <= '1'::double precision)) Planning time: 0.138 ms Execution time: 36.100 ms (8 rows) That is, the timings drop from 785ms/871ms to 9ms/36ms. The index is a bit larger (1700kB instead of 150kB), but it's still orders of magnitude smaller than the btree index (which is ~214MB in this case). The index build is slower than for regular BRIN indexes (roughly comparable to btree), but I'm sure it can be significantly improved. Also, I'm sure it's not bug-free. Two additional notes: 1) The patch does break the current BRIN indexes, because it requires passing all SearchKeys to the "consistent" BRIN function at once (otherwise we couldn't eliminate individual intervals in the summary), while currently BRIN only deals with one SearchKey at a time. And I haven't modified the existing brin_minmax_consistent() function (yeah, I'm lazy, but I believe this should be enough for interested people to try it out). 2) It only works with the float8 and timestamp data types for now, but it should be straightforward to add support for additional data types. Most of that will be about adding catalog definitions anyway. 
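To illustrate note (1), here is a simplified Python model (with hypothetical names — not the patch's actual C API) of why the consistent function must see all scan keys at once: with multiple intervals per page range, each condition alone can match *some* interval even when no single interval satisfies the whole conjunction.

```python
def ranges_consistent(intervals, scankeys):
    """Return True if any stored interval may satisfy *all* scan keys.

    intervals: list of (minval, maxval) pairs summarizing one page range
    scankeys:  list of (op, value) with op in '<', '<=', '=', '>=', '>'
    """
    def interval_matches(lo, hi, op, val):
        # can some point in [lo, hi] satisfy "point <op> val"?
        if op == '=':  return lo <= val <= hi
        if op == '<':  return lo < val
        if op == '<=': return lo <= val
        if op == '>':  return hi > val
        if op == '>=': return hi >= val
        raise ValueError(op)

    return any(all(interval_matches(lo, hi, op, val) for op, val in scankeys)
               for lo, hi in intervals)

# one page range summarized as two intervals, [1,10] and [1000,1010]
summary = [(1, 10), (1000, 1010)]

# WHERE val >= 100 AND val <= 500: each key alone matches some interval,
# but no single interval satisfies both keys, so the range can be skipped
print(ranges_consistent(summary, [('>=', 100)]))               # True
print(ranges_consistent(summary, [('>=', 100), ('<=', 500)]))  # False
```

Evaluated one key at a time (as BRIN does today), both keys above would return "maybe matches", so the page range could never be eliminated.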
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c index 6ec3297..94d696e 100644 --- a/src/backend/access/brin/brin_minmax_multi.c +++ b/src/backend/access/brin/brin_minmax_multi.c @@ -29,9 +29,6 @@ typedef struct MinmaxMultiOpaque FmgrInfo strategy_procinfos[BTMaxStrategyNumber]; } MinmaxMultiOpaque; -#define MODE_VALUES 1 -#define MODE_RANGES 2 - #define MINMAX_MAX_VALUES 64 typedef struct MinmaxMultiRanges @@ -39,12 +36,13 @@ typedef struct MinmaxMultiRanges /* varlena header (do not touch directly!) */ int32 vl_len_; - int mode; /* values or ranges in the data array */ + /* maxvalues >= (2*nranges + nvalues) */ int maxvalues; /* maximum number of values in the buffer */ - int numvalues; /* number of values in the data buffer */ + int nranges; /* number of ranges stored in the array */ + int nvalues; /* number of values in the data array */ /* values stored for this range - either raw values, or ranges */ - Datum data[FLEXIBLE_ARRAY_MEMBER]; + Datum values[FLEXIBLE_ARRAY_MEMBER]; } MinmaxMultiRanges; @@ -68,13 +66,11 @@ minmax_multi_init(int maxvalues) /* * Allocate the range list with space for the max number of values. 
*/ - len = offsetof(MinmaxMultiRanges, data) + maxvalues * sizeof(Datum); + len = offsetof(MinmaxMultiRanges, values) + maxvalues * sizeof(Datum); ranges = (MinmaxMultiRanges *) palloc0(len); - ranges->mode = MINMAX_MAX_VALUES; ranges->maxvalues = maxvalues; - ranges->mode = MODE_VALUES; /* start by accumulating values */ SET_VARSIZE(ranges, len); @@ -87,6 +83,12 @@ typedef struct compare_context Oid colloid; } compare_context; +typedef struct DatumRange { + Datum minval; + Datum maxval; + bool collapsed; +} DatumRange; + static int compare_values(const void *a, const void *b, void *arg) { @@ -109,27 +111,77 @@ compare_values(const void *a, const void *b, void *arg) return 0; } +static int +compare_ranges(const void *a, const void *b, void *arg) +{ + DatumRange ra = *(DatumRange *)a; + DatumRange rb = *(DatumRange *)b; + Datum r; + + compare_context *cxt = (compare_context *)arg; + + r = FunctionCall2Coll(cxt->cmpFn, cxt->colloid, ra.minval, rb.minval); + + if (DatumGetBool(r)) + return -1; + + r = FunctionCall2Coll(cxt->cmpFn, cxt->colloid, rb.minval, ra.minval); + + if (DatumGetBool(r)) + return 1; + + return 0; +} + +/* static void print_range(char * label, int numvalues, Datum *values) { - int i; + int idx; StringInfoData str; initStringInfo(&str); - for (i = 0; i < (numvalues/2); i++) + idx = 0; + while (idx < 2*ranges->nranges) + { + if (idx == 0) + appendStringInfoString(&str, "RANGES: ["); + else + appendStringInfoString(&str, ", "); + + appendStringInfo(&str, "%d => [%.9f, %.9f]", idx/2, DatumGetFloat8(values[idx]), DatumGetFloat8(values[idx+1])); + + idx += 2; + } + + if (ranges->nranges > 0) + appendStringInfoString(&str, "]"); + + if ((ranges->nranges > 0) && (ranges->nvalues > 0)) + appendStringInfoString(&str, " "); + + while (idx < 2*ranges->nranges + ranges->nvalues) { - if (i > 0) + if (idx == 2*ranges->nranges) + appendStringInfoString(&str, "VALUES: ["); + else append
[HACKERS] CUBE seems a bit confused about ORDER BY
Hi, I've noticed this suspicious behavior of the "cube" data type with ORDER BY, which I believe is a bug in the extension (or the GiST index support). The following example comes directly from regression tests added by 33bd250f (so CC Teodor and Stas, who are mentioned in the commit). This query should produce results with ordering "ascending by 2nd coordinate of upper right corner". To make it clear, I've added the "c~>4" expression to the query, otherwise it's right from the test. test=# SELECT c~>4 "c~>4", * FROM test_cube ORDER BY c~>4 LIMIT 15; c~>4 | c --+--- 50 | (30333, 50),(30273, 6) 75 | (43301, 75),(43227, 43) 142 | (19650, 142),(19630, 51) 160 | (2424, 160),(2424, 81) 171 | (3449, 171),(3354, 108) 155 | (18037, 155),(17941, 109) 208 | (28511, 208),(28479, 114) 217 | (19946, 217),(19941, 118) 191 | (16906, 191),(16816, 139) 187 | (759, 187),(662, 163) 266 | (22684, 266),(22656, 181) 255 | (24423, 255),(24360, 213) 249 | (45989, 249),(45910, 222) 377 | (11399, 377),(11360, 294) 389 | (12162, 389),(12103, 309) (15 rows) As you can see, it's not actually sorted by the c~>4 coordinate (but by c~>2, which is the last number). Moreover, disabling index scans fixes the ordering: test=# set enable_indexscan = off; SET test=# SELECT c~>4, * FROM test_cube ORDER BY c~>4 LIMIT 15; -- ascending by 2nd coordinate of upper right corner ?column? | c --+--- 50 | (30333, 50),(30273, 6) 75 | (43301, 75),(43227, 43) 142 | (19650, 142),(19630, 51) 155 | (18037, 155),(17941, 109) 160 | (2424, 160),(2424, 81) 171 | (3449, 171),(3354, 108) 187 | (759, 187),(662, 163) 191 | (16906, 191),(16816, 139) 208 | (28511, 208),(28479, 114) 217 | (19946, 217),(19941, 118) 249 | (45989, 249),(45910, 222) 255 | (24423, 255),(24360, 213) 266 | (22684, 266),(22656, 181) 367 | (31018, 367),(30946, 333) 377 | (11399, 377),(11360, 294) (15 rows) Seems like a bug somewhere in gist_cube_ops, I guess? 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] WIP: BRIN bloom indexes
his case, but that's mostly thanks to the whole dataset being entirely in-memory. There are some remaining open questions. 1) sizing the bloom filter Firstly, the size of the filter is currently static, based on 1000 distinct values per range with a 5% false-positive rate. Those are mostly arbitrary values, and I don't have any clear idea how to determine optimal values. Maybe we could do the sizing based on the ndistinct value for the indexed column? It also depends on the pages_per_range value, so perhaps it should be a user-defined option too. An interesting feature is that the bloom indexes "degrade locally". If a page range has significantly more distinct values, the bloom filter may be way too small and will suffer from a high false-positive rate. But it only affects that one page range, and the rest may be perfectly fine. I was thinking about disabling the bloom filter when it gets "too full" (too many bits set, resulting in a high false-positive rate). But that seems like a bad idea - the false-positive rate automatically jumps to 100% for that range, and we only save a much smaller amount of space in the index. So even if the false-positive rate is 50%, it still seems efficient to keep the bloom filter. Another thing to consider is what happens when there are very few distinct values in a given page range. The current code tries to be a bit smart in this case - instead of building the bloom filter right away, it initially keeps the exact values and only switches to a bloom filter once the values would fill the same amount of space. 2) costing The other thing is the costing of BRIN indexes. At this point it's rather simplistic and independent of the operator class, so the only thing that matters is the size of the BRIN index. That means a "minmax" index is always considered cheaper than a "bloom" index, because the optimizer has no idea the "minmax" indexes are more vulnerable to "wide ranges". But perhaps this is a non-issue, as the bloom indexes are meant for cases when minmax indexes don't work. 
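For reference, the textbook sizing math behind the "1000 distinct values at 5% false positives" defaults discussed in (1) can be sketched as follows (this is only the standard formula, not the patch's actual code):

```python
import math

def bloom_size(ndistinct, fp_rate):
    # classic Bloom filter sizing: m bits and k hash functions needed
    # to store ndistinct values at the requested false-positive rate
    m = math.ceil(-ndistinct * math.log(fp_rate) / (math.log(2) ** 2))
    k = round((m / ndistinct) * math.log(2))
    return m, k

bits, nhashes = bloom_size(1000, 0.05)
print(bits, nhashes, bits // 8)  # filter size in bits, hash count, bytes
```

That works out to roughly 6,200 bits (under 800 bytes) and 4 hash functions per page range, which also shows how the filter would grow if it were sized from the column's ndistinct estimate instead of a fixed constant.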
And when minmax indexes work, they work better than bloom. So people are unlikely to build both of them at the same time. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/doc/src/sgml/brin.sgml b/doc/src/sgml/brin.sgml index 91c0170..26483dd 100644 --- a/doc/src/sgml/brin.sgml +++ b/doc/src/sgml/brin.sgml @@ -118,6 +118,13 @@ + abstime_bloom_ops + abstime + + = + + + abstime_minmax_ops abstime @@ -129,6 +136,13 @@ + int8_bloom_ops + bigint + + = + + + int8_minmax_ops bigint @@ -180,6 +194,13 @@ + bytea_bloom_ops + bytea + + = + + + bytea_minmax_ops bytea @@ -191,6 +212,13 @@ + bpchar_bloom_ops + character + + = + + + bpchar_minmax_ops character @@ -202,6 +230,13 @@ + char_bloom_ops + "char" + + = + + + char_minmax_ops "char" @@ -213,6 +248,13 @@ + date_bloom_ops + date + + = + + + date_minmax_ops date @@ -224,6 +266,13 @@ + float8_bloom_ops + double precision + + = + + + float8_minmax_ops double precision @@ -235,6 +284,13 @@ + inet_bloom_ops + inet + + = + + + inet_minmax_ops inet @@ -258,6 +314,13 @@ + int4_bloom_ops + integer + + = + + + int4_minmax_ops integer @@ -269,6 +332,13 @@ + interval_bloom_ops + interval + + = + + + interval_minmax_ops interval @@ -280,6 +350,13 @@ + macaddr_bloom_ops + macaddr + + = + + + macaddr_minmax_ops macaddr @@ -291,6 +368,13 @@ + macaddr8_bloom_ops + macaddr8 + + = + + + macaddr8_minmax_ops macaddr8 @@ -302,6 +386,13 @@ + name_bloom_ops + name + + = + + + name_minmax_ops name @@ -313,6 +404,13 @@ + numeric_bloom_ops + numeric + + = + + + numeric_minmax_ops numeric @@ -324,6 +422,13 @@ + pg_lsn_bloom_ops + pg_lsn + + = + + + pg_lsn_minmax_ops pg_lsn @@ -335,6 +440,13 @@ + oid_bloom_ops + oid + + = + + + oid_minmax_ops oid @@ -366,6 +478,13
Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)
On 10/17/2017 03:16 PM, Robert Haas wrote: > On Sat, Oct 14, 2017 at 2:14 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> I propose is to add a new cursor option (PARALLEL), which would allow >> parallel plans for that particular user-defined cursor. Attached is an >> experimental patch doing this (I'm sure there are some loose ends). > > I think you would need to do a huge amount of additional work in > order to actually make this robust. I believe that a fair amount of > what goes on in parallel mode right now is checked with elog() if we > think that it's unreachable without writing C code -- but this will > make a lot of those things reachable, which means they would need to > be checked some other way. Sure, additional checks may be necessary. I've tried to come up with examples causing crashes, and haven't succeeded. Of course, that's no proof of correctness, so if you have an example that will crash and burn I'd like to see it. In general, it may be acceptable to rely on the elog() checks - which is pretty much what the FETCH+INSERT+SHARE example I shared in the first message of this thread does. I agree it's not particularly convenient, and perhaps we should replace it with checks while planning the queries. > Also, I doubt this guarantees that we won't try to call > parallel-unsafe functions while parallel mode is active, so things > will just blow up in whatever way they do, maybe crashing the server > or rendering the database inconsistent or whatever. > Probably. What happens right now is that declaring the cursor switches the whole transaction into parallel mode (EnterParallelMode), so if you do FETCH + INSERT + FETCH that will fail with elog(ERROR). But yeah, this should probably be checked at planning time, and the whole query should fail if the transaction is in parallel mode and the query contains unsafe/restricted functions. > Possibly I'm overestimating the extent of the danger, but I don't > think so. 
You're trying to take a mechanism that was only ever meant to > be active during the course of one query and apply it for long > periods of time during which a user can do anything, with basically no > upgrade of the infrastructure. I think something like this could be > made to work if you put a large amount of energy into it, but I think > the patch as proposed is about the easiest 3-5% of what would actually > be required to cover all the holes. > Well, soliciting feedback like this was one of the points of sharing the experimental patch, so thank you for that. I'm not sure if the estimate of 3-5% is accurate, but I'm sure the patch is incomplete - which is why I marked it as experimental, after all. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] SIGSEGV in BRIN autosummarize
On 10/17/2017 02:29 PM, Justin Pryzby wrote: > On Tue, Oct 17, 2017 at 12:59:16PM +0200, Alvaro Herrera wrote: >> Anyway, can you give this patch a try? > > I've only compiled postgres once before and this is a production environment > (although nothing so important that the crashes are a serious concern > either). > > Is it reasonable to wget the postgres tarball, apply the patch, and run the > compiled postgres binary from the source tree, without running make install or > similar ? Otherwise, would it be good enough to copy the postgres binary to > /usr/pgsql-10/bin (and reinstall the binary package later) ? > You don't have to install the binaries to the same location, i.e. you can keep both the current and modified binaries. ./configure --prefix=/path/to/alternative/binaries --enable-debug CFLAGS="..." To get the same compilation options you can run pg_config, look for the CONFIGURE line, and then just add the --prefix option. And after `make install` you can add it to $PATH and start the server using those binaries. $ export PATH=/path/to/alternative/binaries/bin:$PATH $ which pg_ctl $ pg_ctl -D $DATADIR stop $ pg_ctl -D $DATADIR start regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)
On 10/16/2017 01:13 PM, Amit Kapila wrote: > On Sat, Oct 14, 2017 at 11:44 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> Hi, >> >> >> I propose is to add a new cursor option (PARALLEL), which would allow >> parallel plans for that particular user-defined cursor. Attached is an >> experimental patch doing this (I'm sure there are some loose ends). >> > > How will this work for backward scans? > It may not work, just like for regular cursors without SCROLL. And if you specify SCROLL, then I believe a Materialize node will be added automatically if needed, but haven't tried. > >> >> Regarding (2), if the user suspends the cursor for a long time, bummer. >> The parallel workers will remain assigned, doing nothing. I don't have >> any idea how to get around that, but I don't see how we could do better. >> > > One idea could be that whenever someone uses Parallel option, we can > fetch and store all the data locally before returning control to the > user (something like WITH HOLD option). > I don't like that very much, as it adds unnecessary overhead. As I said before, I'm particularly interested in cursors returning large amounts of data, so the overhead may be quite significant. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] SIGSEGV in BRIN autosummarize
Hi, On 10/15/2017 03:56 AM, Justin Pryzby wrote: > On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote: ... >> It's a bit difficult to guess what went wrong from this backtrace. For >> me gdb typically prints a bunch of lines immediately before the frames, >> explaining what went wrong - not sure why it's missing here. > > Do you mean this ? > > ... > Loaded symbols for /lib64/libnss_files-2.12.so > Core was generated by `postgres: autovacuum worker process gtt > '. > Program terminated with signal 11, Segmentation fault. > #0 pfree (pointer=0x298c740) at mcxt.c:954 > 954 (*context->methods->free_p) (context, pointer); > Yes. So either 'context' is bogus. Or 'context->methods' is bogus. Or 'context->methods->free_p' is bogus. >> Perhaps some of those pointers are bogus, the memory was already pfree-d >> or something like that. You'll have to poke around and try dereferencing >> the pointers to find what works and what does not. >> >> For example what do these gdb commands do in the #0 frame? >> >> (gdb) p *(MemoryContext)context > > (gdb) p *(MemoryContext)context > Cannot access memory at address 0x7474617261763a20 > OK, this means the memory context pointer (tracked in the header of a chunk) is bogus. There are multiple common ways how that could happen: * Something corrupts memory (typically out-of-bounds write). * The pointer got allocated in an incorrect memory context (which then was released, and the memory was reused for new stuff). * It's a use-after-free. * ... various other possibilities ... > > I uploaded the corefile: > http://telsasoft.com/tmp/coredump-postgres-autovacuum-brin-summarize.gz > Thanks, but I'm not sure that'll help, at this point. We already know what happened (corrupted memory), we don't know "how". 
And core files are mostly just "snapshots" so are not very useful in answering that :-( regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] SIGSEGV in BRIN autosummarize
Hi, On 10/15/2017 12:42 AM, Justin Pryzby wrote: > On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote: >> I don't have any reason to believe there's memory issue on the server, So I >> suppose this is just a "heads up" to early adopters until/in case it happens >> again and I can at least provide a stack trace. > > I'm back; find stacktrace below. > >> Today I see: >> < 2017-10-13 17:22:47.839 -04 >LOG: server process (PID 32127) was >> terminated by signal 11: Segmentation fault >> < 2017-10-13 17:22:47.839 -04 >DETAIL: Failed process was running: >> autovacuum: BRIN summarize public.gtt 747263 > > Is it a coincidence the server failed within 45m of yesterday's failure ? > Most likely just a coincidence. > postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in > postgres[40+692000] > < 2017-10-14 18:05:36.432 -04 >DETAIL: Failed process was running: > autovacuum: BRIN summarize public.gtt 41087 > >> It looks like this table was being inserted into simultaneously by a python >> program using multiprocessing. It looks like each subprocess was INSERTing >> into several tables, each of which has one BRIN index on timestamp column. > > I should add: > These are insert-only child tables in a hierarchy (not PG10 partitions), being > inserted into directly (not via trigger/rule). > > Also notice the vacuum process was interrupted, same as yesterday (thank > goodness for full logs). Our INSERT script is using python > multiprocessing.pool() with "maxtasksperchild=1", which I think means we load > one file and then exit the subprocess, and pool() creates a new subproc, which > starts a new PG session and transaction. Which explains why autovacuum starts > processing the table only to be immediately interrupted. > I don't follow. Why does it explain that autovacuum gets canceled? I mean, merely opening a new connection/session should not cancel autovacuum. 
That requires a command that takes a table-level lock conflicting with autovacuum (e.g. an explicit LOCK command, DDL, ...). > ... > Due to a .."behavioral deficiency" in the loader for those tables, the crashed > backend causes the loader to get stuck, so the tables should be untouched > since > the crash, should it be desirable to inspect them. > It's a bit difficult to guess what went wrong from this backtrace. For me gdb typically prints a bunch of lines immediately before the frames, explaining what went wrong - not sure why it's missing here. Perhaps some of those pointers are bogus, the memory was already pfree-d or something like that. You'll have to poke around and try dereferencing the pointers to find what works and what does not. For example what do these gdb commands do in the #0 frame? (gdb) p *(MemoryContext)context (gdb) p *GetMemoryChunkContext(pointer) > #0 pfree (pointer=0x298c740) at mcxt.c:954 > context = 0x7474617261763a20 > #1 0x006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at > autovacuum.c:2676 > cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid > 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype > 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull > fal"... > cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 > :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location > 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 > :constbyv"... > cur_relname = 0x298cd68 > "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx" > __func__ = "perform_work_item" > #2 0x006a6fd9 in do_autovacuum () at autovacuum.c:2533 ... cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType
On 10/14/2017 07:49 PM, Robert Haas wrote: > On Fri, Oct 13, 2017 at 4:49 PM, David Rowley > <david.row...@2ndquadrant.com> wrote: >> tps = 8282.481310 (including connections establishing) >> tps = 8282.750821 (excluding connections establishing) > > vs. > >> tps = 8520.822410 (including connections establishing) >> tps = 8521.132784 (excluding connections establishing) >> >> With the patch we are making use of the extended statistics, which >> we do expect to be more work for the planner. Although, we didn't >> add extended statistics to speed up the planner. > > Sure, I understand. That's actually a pretty substantial regression > - I guess that means that it's pretty important to avoid creating > extended statistics that are not needed, at least for short-running > queries. > Well, it's only about a 3% difference in a single run, which may easily be due to a slightly different binary layout, random noise, etc. So I wouldn't call that a "substantial regression", at least not based on this one test. I've done more thorough testing, and what I see is a 1.0-1.2% drop, but on a test that's rather extreme (statistics on an empty table). So again, that's likely well within the noise, and on larger tables it'll get even less significant. But of course - it's not free. It's a bit more work we need to do. But if you don't need multi-column statistics, don't create them. If your queries are already fast, you probably don't need them at all. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)
Hi, One of the existing limitations of parallel query is that cursors generally do not benefit from it [1]. Commit 61c2e1a95f [2] improved the situation for cursors from procedural languages, but unfortunately for user-defined cursors parallelism is still disabled. For many use cases that is perfectly fine, but for applications that need to process large amounts of data this is rather annoying. When the result sets are large, cursors are extremely efficient - in terms of memory consumption, for example. So applications have to choose between the "cursor" approach (and no parallelism), or parallelism with uncomfortably large result sets. I believe there are two main reasons why parallelism is disabled for user-defined cursors (or queries that might get suspended): (1) We can't predict what will happen while the query is suspended (and the transaction is still in "parallel mode"), e.g. the user might run arbitrary DML which is not allowed. (2) If the cursor gets suspended, the parallel workers would still be assigned to it and could not be used for anything else. Clearly, we can't solve those issues in general, so the default will probably remain "parallelism disabled". What I propose is to add a new cursor option (PARALLEL), which would allow parallel plans for that particular user-defined cursor. Attached is an experimental patch doing this (I'm sure there are some loose ends). This does not make either of those issues go away, of course. We still enforce "no DML while parallel operation in progress" as before, so this will not work: BEGIN; DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...; FETCH 1000 FROM x; INSERT INTO t2 VALUES (1); FETCH 1000 FROM x; COMMIT; but this will: BEGIN; DECLARE x PARALLEL CURSOR FOR SELECT * FROM t2 WHERE ...; FETCH 1000 FROM x; ... FETCH 1000 FROM x; CLOSE x; INSERT INTO t2 VALUES (1); COMMIT; Regarding (2), if the user suspends the cursor for a long time, bummer. The parallel workers will remain assigned, doing nothing. 
I don't have any idea how to get around that, but I don't see how we could do better. I don't see either of these limitations as fatal. Any opinions / obvious flaws that I missed? regards [1] https://www.postgresql.org/docs/9.6/static/when-can-parallel-query-be-used.html [2] https://www.postgresql.org/message-id/CAOGQiiMfJ%2B4SQwgG%3D6CVHWoisiU0%2B7jtXSuiyXBM3y%3DA%3DeJzmg%40mail.gmail.com -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/portalcmds.c b/src/backend/commands/portalcmds.c index 76d6cf1..ffaa096 100644 --- a/src/backend/commands/portalcmds.c +++ b/src/backend/commands/portalcmds.c @@ -66,6 +66,12 @@ PerformCursorOpen(DeclareCursorStmt *cstmt, ParamListInfo params, RequireTransactionChain(isTopLevel, "DECLARE CURSOR"); /* + * Enable parallel plans for cursors that explicitly requested it. + */ + if (cstmt->options & CURSOR_OPT_PARALLEL) + cstmt->options |= CURSOR_OPT_PARALLEL_OK; + + /* * Parse analysis was done already, but we still have to run the rule * rewriter. We do not do AcquireRewriteLocks: we assume the query either * came straight from the parser, or suitable locks were acquired by diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c index 9689429..64f8a32 100644 --- a/src/backend/executor/execMain.c +++ b/src/backend/executor/execMain.c @@ -423,6 +423,13 @@ standard_ExecutorFinish(QueryDesc *queryDesc) /* This should be run once and only once per Executor instance */ Assert(!estate->es_finished); + /* If this was PARALLEL cursor, do cleanup and exit parallel mode. 
*/ + if (queryDesc->parallel_cursor) + { + ExecShutdownNode(queryDesc->planstate); + ExitParallelMode(); + } + /* Switch into per-query memory context */ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt); @@ -1085,6 +1092,18 @@ InitPlan(QueryDesc *queryDesc, int eflags) queryDesc->tupDesc = tupType; queryDesc->planstate = planstate; + + /* If this was PARALLEL cursor, enter parallel mode, except in EXPLAIN-only. */ + + queryDesc->parallel_cursor + = (eflags & EXEC_FLAG_PARALLEL) && !(eflags & EXEC_FLAG_EXPLAIN_ONLY); + + /* + * In PARALLEL cursors we have to enter the parallel mode once, at the very + * beginning (and not in ExecutePlan, as we do for execute_once plans). + */ + if (queryDesc->parallel_cursor) + EnterParallelMode(); } /* @@ -1725,7 +1744,8 @@ ExecutePlan(EState *estate, if (TupIsNull(slot)) { /* Allow nodes to release or shut down resources. */ - (void) ExecShutdownNode(planstate); + if (execute_once) +(void) ExecShutdownNode(planstate); break; } @@ -1772,7 +1792,8 @@ ExecutePlan(EState *estate, if (numberTuples && numberTuples == current_tuple_count) { /* Allow nodes to release or shut down re
Re: [HACKERS] Extended statistics is not working on Vars hidden under a RelabelType
On 10/13/2017 10:04 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 11:03 PM, David Rowley
> <david.row...@2ndquadrant.com> wrote:
>> -- Unpatched
>> Planning time: 0.184 ms
>> Execution time: 105.878 ms
>>
>> -- Patched
>> Planning time: 2.175 ms
>> Execution time: 106.326 ms
>
> This might not be the best example to show the advantages of the
> patch, honestly.
>

I'm not sure what exactly your point is. If you're suggesting this example is bad because the planning time increased from 0.184 ms to 2.175 ms, then consider that the plans were likely generated on an assert-enabled build and on a laptop (both of which add quite a bit of noise to one-off timings). The patch has no impact on planning time (at least I've been unable to measure any).

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
On 10/12/2017 02:40 PM, Dilip Kumar wrote: > On Thu, Oct 12, 2017 at 4:31 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> Hi, >> >> It seems that Q19 from TPC-H is consistently failing with segfaults due >> to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL). >> >> I'm not very familiar with how the dsa is initialized and passed around, >> but I only see the failures when the bitmap is constructed by a mix of >> BitmapAnd and BitmapOr operations. >> > I think I have got the issue, bitmap_subplan_mark_shared is not > properly pushing the isshared flag to lower level bitmap index node, > and because of that tbm_create is passing NULL dsa while creating the > tidbitmap. So this problem will come in very specific combination of > BitmapOr and BitmapAnd when BitmapAnd is the first subplan for the > BitmapOr. If BitmapIndex is the first subplan under BitmapOr then > there is no problem because BitmapOr node will create the tbm by > itself and isshared is set for BitmapOr. > > Attached patch fixing the issue for me. I will thoroughly test this > patch with other scenario as well. Thanks for reporting. > Yep, this fixes the failures for me. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Parallel Bitmap Heap Scans segfaults due to (tbm->dsa==NULL) on PostgreSQL 10
Hi,

It seems that Q19 from TPC-H is consistently failing with segfaults due to calling tbm_prepare_shared_iterate() with (tbm->dsa==NULL). I'm not very familiar with how the dsa is initialized and passed around, but I only see the failures when the bitmap is constructed by a mix of BitmapAnd and BitmapOr operations.

Another interesting observation is that setting force_parallel_mode=on may not be enough - there really need to be multiple parallel workers, which is why the simple query sets cpu_tuple_cost=1.

Attached is a bunch of files:

1) details for the "full" query:
 * query.sql
 * plan.txt
 * backtrace.txt

2) details for the "minimal" query triggering the issue:
 * query-minimal.sql
 * plan-minimal.txt
 * backtrace-minimal.txt

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Program terminated with signal 6, Aborted.
#0 0x7fe21265d1f7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64
(gdb) bt
#0 0x7fe21265d1f7 in raise () from /lib64/libc.so.6
#1 0x7fe21265e8e8 in abort () from /lib64/libc.so.6
#2 0x008468e7 in ExceptionalCondition (conditionName=conditionName@entry=0x9d213a "!(tbm->dsa != ((void *)0))", errorType=errorType@entry=0x88fc69 "FailedAssertion", fileName=fileName@entry=0x9d2014 "tidbitmap.c", lineNumber=lineNumber@entry=800) at assert.c:54
#3 0x0065b04f in tbm_prepare_shared_iterate (tbm=tbm@entry=0x2b244e8) at tidbitmap.c:800
#4 0x0062294a in BitmapHeapNext (node=node@entry=0x2adf118) at nodeBitmapHeapscan.c:155
#5 0x00616d7a in ExecScanFetch (recheckMtd=0x623050 , accessMtd=0x622250 , node=0x2adf118) at execScan.c:97
#6 ExecScan (node=0x2adf118, accessMtd=0x622250 , recheckMtd=0x623050 ) at execScan.c:147
#7 0x00624c75 in ExecProcNode (node=0x2adf118) at ../../../src/include/executor/executor.h:250
#8 gather_getnext (gatherstate=0x2aded50) at nodeGather.c:281
#9 ExecGather (pstate=0x2aded50) at nodeGather.c:215
#10 0x00610d12
in ExecProcNode (node=0x2aded50) at ../../../src/include/executor/executor.h:250 #11 ExecutePlan (execute_once=, dest=0x2b09220, direction=, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=, planstate=0x2aded50, estate=0x2adeb00) at execMain.c:1721 #12 standard_ExecutorRun (queryDesc=0x2a3bdf0, direction=, count=0, execute_once=) at execMain.c:363 #13 0x0074b50b in PortalRunSelect (portal=portal@entry=0x2a34050, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x2b09220) at pquery.c:932 #14 0x0074ca18 in PortalRun (portal=portal@entry=0x2a34050, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x2b09220, altdest=altdest@entry=0x2b09220, completionTag=completionTag@entry=0x7ffc8dad21c0 "") at pquery.c:773 #15 0x0074875b in exec_simple_query ( query_string=0x2a96ff0 "select\n*\nfrom\npart\nwhere\n(\n p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')\nand p_size between 1 and 5\n)\nor\n(\np_container in ('MED BAG', 'MED B"...) at postgres.c:1099 #16 0x00749a03 in PostgresMain (argc=, argv=argv@entry=0x2a44048, dbname=0x2a43eb0 "test", username=) at postgres.c:4088 #17 0x0047665f in BackendRun (port=0x2a37cc0) at postmaster.c:4357 #18 BackendStartup (port=0x2a37cc0) at postmaster.c:4029 #19 ServerLoop () at postmaster.c:1753 #20 0x006d70d9 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x2a14b20) at postmaster.c:1361 #21 0x004774c1 in main (argc=3, argv=0x2a14b20) at main.c:228 QUERY PLAN Gather Workers Planned: 2 -> Parallel Bitmap Heap Scan on part Recheck Cond: (((p_size <= 5) AND (p_size >= 1) AND (p_container = ANY ('{"SM CASE","SM BOX","SM PACK","SM PKG"}'::bpchar[]))) OR ((p_container = ANY ('{"MED BAG","MED BOX","MED PKG","MED PACK"}'::bpchar[])) AND (p_size <= 10) AND (p_size >= 1))) -> BitmapOr -> BitmapAnd -> Bitmap Index Scan on part_p_size_idx Index C
[HACKERS] Re: Extended statistics is not working on Vars hidden under a RelabelType
On 10/10/2017 05:03 AM, David Rowley wrote: > Basically, $subject is causing us not to properly find matching > extended stats in this case. > > The attached patch fixes it. > > The following test cases is an example of the misbehaviour. Note > rows=1 vs rows=98 in the Gather node. > Thanks for noticing this. The patch seems fine to me. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)
Hi,

Attached is an updated version of the patch, tweaking the comments.

1) I've added a section at the end of src/backend/utils/mmgr/README, briefly explaining the alternative memory allocators we have. I don't think we should get into too much low-level detail here - that's more appropriate for the .c file for each context.

2) I've slightly reworded a paragraph in generation.c describing which use cases are suitable for the memory context. It used to say:

    This memory context is based on the assumption that the allocated
    chunks have similar lifespan, i.e. that chunks allocated close from
    each other (by time) will also be freed in close proximity, and
    mostly in the same order. This is typical for various queue-like
    use cases, i.e. when tuples are constructed, processed and then
    thrown away.

and now it says:

    This memory context is based on the assumption that the chunks are
    freed roughly in the same order as they were allocated (FIFO), or
    in groups with similar lifespan (generations - hence the name of
    the context). This is typical for various queue-like use cases,
    i.e. when tuples are constructed, processed and then thrown away.

3) I've also added a brief note into reorderbuffer.c mentioning that it uses SlabContext and GenerationContext. As I explained, I don't think we should include details about how we tested the patch or whatever here.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 25806c68a05287f3294f2d8508bd45599232f67b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Sun, 24 Sep 2017 22:19:17 +0200
Subject: [PATCH] Generational memory allocator

This memory context is based on the assumption that the allocated chunks have similar lifespan, i.e. that chunks allocated close from each other (by time) will also be freed in close proximity, and mostly in the same order. This is typical for various queue-like use cases, i.e.
when tuples are constructed, processed and then thrown away. The memory context uses a very simple approach to free space management. Instead of a complex global freelist, each block tracks a number of allocated and freed chunks. The space released by freed chunks is not reused, and once all chunks are freed (i.e. when nallocated == nfreed), the whole block is thrown away. When the allocated chunks have similar lifespan, this works very well and is extremely cheap. --- src/backend/replication/logical/reorderbuffer.c | 80 +-- src/backend/utils/mmgr/Makefile | 2 +- src/backend/utils/mmgr/README | 23 + src/backend/utils/mmgr/generation.c | 768 src/include/nodes/memnodes.h| 4 +- src/include/nodes/nodes.h | 1 + src/include/replication/reorderbuffer.h | 15 +- src/include/utils/memutils.h| 5 + 8 files changed, 819 insertions(+), 79 deletions(-) create mode 100644 src/backend/utils/mmgr/generation.c diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 0f607ba..dc0ad5b 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -43,6 +43,12 @@ * transaction there will be no other data carrying records between a row's * toast chunks and the row data itself. See ReorderBufferToast* for * details. + * + * ReorderBuffer uses two special memory context types - SlabContext for + * allocations of fixed-length structures (changes and transactions), and + * GenerationContext for the variable-length transaction data (allocated + * and freed in groups with similar lifespan). + * * - */ #include "postgres.h" @@ -150,15 +156,6 @@ typedef struct ReorderBufferDiskChange */ static const Size max_changes_in_memory = 4096; -/* - * We use a very simple form of a slab allocator for frequently allocated - * objects, simply keeping a fixed number in a linked list when unused, - * instead pfree()ing them. 
Without that in many workloads aset.c becomes a - * major bottleneck, especially when spilling to disk while decoding batch - * workloads. - */ -static const Size max_cached_tuplebufs = 4096 * 2; /* ~8MB */ - /* --- * primary reorderbuffer support routines * --- @@ -248,6 +245,10 @@ ReorderBufferAllocate(void) SLAB_DEFAULT_BLOCK_SIZE, sizeof(ReorderBufferTXN)); + buffer->tup_context = GenerationContextCreate(new_ctx, + "Tuples", + SLAB_LARGE_BLOCK_SIZE); + hash_ctl.keysize = sizeof(TransactionId); hash_ctl.entrysize = sizeof(ReorderBufferTXNByIdEnt); hash_ctl.hcxt = buffer->context; @@ -258,15 +259,12
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
Hi,

Apologies, I forgot to respond to the second part of your message.

On 09/06/2017 09:45 AM, Haribabu Kommi wrote:
>
> While testing this patch, I found another problem that is not related to
> this patch. When the vacuum command is executed mutiple times on
> a table with no dead rows, the number of reltuples value is slowly
> reducing.
>
> postgres=# select reltuples, n_live_tup, n_dead_tup
> from pg_stat_user_tables join pg_class using (relname)
> where relname = 't';
> reltuples | n_live_tup | n_dead_tup
> ---++
> 899674 | 899674 | 0
> (1 row)
>
> postgres=# vacuum t;
> VACUUM
> postgres=# select reltuples, n_live_tup, n_dead_tup
> from pg_stat_user_tables join pg_class using (relname)
> where relname = 't';
> reltuples | n_live_tup | n_dead_tup
> ---++
> 899622 | 899622 | 0
> (1 row)
>
> postgres=# vacuum t;
> VACUUM
> postgres=# select reltuples, n_live_tup, n_dead_tup
> from pg_stat_user_tables join pg_class using (relname)
> where relname = 't';
> reltuples | n_live_tup | n_dead_tup
> ---++
> 899570 | 899570 | 0
> (1 row)
>
>
> In lazy_scan_heap() function, we force to scan the last page of the
> relation to avoid the access exclusive lock in lazy_truncate_heap if
> there are tuples in the last page. Because of this reason, the
> scanned_pages value will never be 0, so the vac_estimate_reltuples
> function will estimate the tuples based on the number of tuples from
> the last page of the relation. This estimation is leading to reduce
> the number of retuples.
>

Hmmm, that's annoying. Perhaps we should not update the values in this case, then? I mean, if we only scan the last page, how reliable are the derived values?

For the record - AFAICS this issue is unrelated to the patch (i.e. it's not introduced by it).

> I am thinking whether this problem really happen in real world
> scenarios to produce a fix?
>

Not sure. Each vacuum run decrements the estimate only a little, so you'd have to run vacuum many times to actually be bitten by it.
For people relying on autovacuum that won't happen, as it only runs on tables with a certain number of dead tuples. So you'd have to be running plain VACUUM in a loop from a script (but not VACUUM ANALYZE, because that works fine), or something like that. That being said, fixing a bug is always a good thing, I guess.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
On 09/06/2017 09:45 AM, Haribabu Kommi wrote: > > > On Tue, Jul 25, 2017 at 9:33 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote: > > On 7/25/17 12:55 AM, Tom Lane wrote: > > Tomas Vondra <tomas.von...@2ndquadrant.com > <mailto:tomas.von...@2ndquadrant.com>> writes: > > It seems to me that VACUUM and ANALYZE somewhat disagree on what > exactly reltuples means. VACUUM seems to be thinking that > reltuples > = live + dead while ANALYZE apparently believes that reltuples = > live > > > The question is - which of the reltuples definitions is the > right > one? I've always assumed that "reltuples = live + dead" but > perhaps > not? > > > I think the planner basically assumes that reltuples is the live > tuple count, so maybe we'd better change VACUUM to get in step. > > > Attached is a patch that (I think) does just that. The disagreement > was caused by VACUUM treating recently dead tuples as live, while > ANALYZE treats both of those as dead. > > At first I was worried that this will negatively affect plans in the > long-running transaction, as it will get underestimates (due to > reltuples not including rows it can see). But that's a problem we > already have anyway, you just need to run ANALYZE in the other session. > > > Thanks for the patch. > From the mail, I understand that this patch tries to improve the > reltuples value update in the catalog table by the vacuum command > to consider the proper visible tuples similar like analyze command. > > -num_tuples); > +num_tuples - nkeep); > > With the above correction, there is a problem in reporting the number > of live tuples to the stats. 
> > postgres=# select reltuples, n_live_tup, n_dead_tup > from pg_stat_user_tables join pg_class using (relname) > where relname = 't'; > reltuples | n_live_tup | n_dead_tup > ---++ > 899818 | 799636 | 100182 > (1 row) > > > The live tuples data value is again decremented with dead tuples > value before sending them to stats in function lazy_vacuum_rel(), > > /* report results to the stats collector, too */ > new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples; > > The fix needs a correction here also. Or change the correction in > lazy_vacuum_rel() function itself before updating catalog table similar > like stats. > Ah, haven't noticed that for some reason - you're right, we estimate the reltuples based on (num_tuples - nkeep), so it doesn't make much sense to subtract nkeep again. Attached is v2 of the fix. I've removed the subtraction from lazy_vacuum_rel(), leaving just new_live_tuples = new_rel_tuples; and now it behaves as expected (no second subtraction). That means we can get rid of new_live_tuples altogether (and the protection against negative values), and use new_rel_tuples directly. Which pretty much means that in this case (pg_class.reltuples == pg_stat_user_tables.n_live_tup) but I guess that's fine, based on the initial discussion in this thread. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index 45b1859..1886f0d 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -198,7 +198,6 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, BlockNumber new_rel_pages; double new_rel_tuples; BlockNumber new_rel_allvisible; - double new_live_tuples; TransactionId new_frozen_xid; MultiXactId new_min_multi; @@ -335,13 +334,9 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params, false); /* report results to the stats collector, too */ - new_live_tuples = new_rel_tuples - vacrelstats->new_dead_tuples; - if (new_live_tuples < 0) - new_live_tuples = 0; /* just in case */ - pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, - new_live_tuples, + new_rel_tuples, vacrelstats->new_dead_tuples); pgstat_progress_end_command(); @@ -1267,7 +1262,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false, nblocks, vacrelstats->tupcount_pages, - num_tuples); + num_tuples - nkeep); /* * Release any remaining pin on visibility map page. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] user-defined numeric data types triggering ERROR: unsupported type
On 09/21/2017 04:24 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> [ scalarineqsel may fall over when used by extension operators ]
>
> I concur with your thought that we could have
> ineq_histogram_selectivity fall back to a "mid bucket" default if
> it's working with a datatype it is unable to convert_to_scalar. But I
> think if we're going to touch this at all, we ought to have higher
> ambition than that, and try to provide a mechanism whereby an
> extension that's willing to work a bit harder could get that
> additional increment of estimation accuracy. I don't care for this
> way to do that:
>

The question is - do we need a solution that is back-patchable? This means we can't really use FIXEDDECIMAL without effectively copying a lot of the selfuncs.c stuff, only to make some minor fixes. What about a two-pronged approach:

1) fall back to mid bucket in back branches (9.3 - 10)

2) do something smarter (along the lines you outlined) in PG11

I'm willing to spend some time on both, but (2) alone is not particularly attractive for us, as it only helps PG11 and we'll have to do the copy-paste evil anyway to get the data type working on back branches.

>> * Make convert_numeric_to_scalar smarter, so that it checks if
>> there is an implicit cast to numeric, and fail only if it does not
>> find one.
>
> because it's expensive, and it only works for numeric-category cases,
> and it will fail outright for numbers outside the range of "double".
> (Notice that convert_numeric_to_scalar is *not* calling the regular
> cast function for numeric-to-double.) Moreover, any operator ought to
> know what types it can accept; we shouldn't have to do more catalog
> lookups to figure out what to do.
>

Ah, I see. I hadn't realized it's not using the regular cast functions, but now that you point that out, it's clear relying on casts would fail.
> Now that scalarltsel and friends are just trivial wrappers for a > common function, we could imagine exposing scalarineqsel_wrapper as a > non-static function, with more arguments (and perhaps a better-chosen > name ;-)). The idea would be for extensions that want to go this > extra mile to provide their own selectivity estimation functions, > which again would just be trivial wrappers for the core function, but > would provide additional knowledge through additional arguments. > > The additional arguments I'm envisioning are a couple of C function > pointers, one function that knows how to convert the operator's > left-hand input type to scalar, and one function that knows how to > convert the right-hand type to scalar. (Identical APIs of course.) > Passing a NULL would imply that the core code must fall back on its > own devices for that input. > > Now the thing about convert_to_scalar is that there are several > different conversion conventions already (numeric, string, timestamp, > ...), and there probably could be more once extension types are > coming to the party. So I'm imagining that the API for these > conversion functions could be something like > > bool convert(Datum value, Oid valuetypeid, >int *conversion_convention, double *converted_value); > > The conversion_convention output would make use of some agreed-on > constants, say 1 for numeric, 2 for string, etc etc. In the core > code, if either converter fails (returns false) or if they don't > return the same conversion_convention code, we give up and use the > mid-bucket default. If they both produce results with the same > conversion_convention, then we can treat the converted_values as > commensurable. 
>

OK, so instead of re-implementing the whole function, we'd essentially do just something like this:

typedef bool (*convert_callback) (Datum value, Oid valuetypeid,
                                  int *conversion_convention,
                                  double *converted_value);

double scalarineqsel(PlannerInfo *root, Oid operator,
                     bool isgt, bool iseq,
                     VariableStatData *vardata,
                     Datum constval, Oid consttype,
                     convert_callback convert);

and then, from the extension:

double
scalarineqsel_wrapper(PlannerInfo *root, Oid operator,
                      bool isgt, bool iseq,
                      VariableStatData *vardata,
                      Datum constval, Oid consttype)
{
    return scalarineqsel(root, operator, isgt, iseq, vardata,
                         constval, consttype, my_convert_func);
}

Sounds reasonable to me, I guess - I can't really think of anything simpler giving us the same flexibility.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Boom filters for hash joins (was: A design for amcheck heapam verification)
On 09/19/2017 06:03 PM, Peter Geoghegan wrote: > On Tue, Sep 19, 2017 at 6:28 AM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> The patch is fairly simple, and did not try to push the bloom filters to >> scan nodes or anything like that. It might be a meaningful first step, >> though, particularly for selective joins (where only small number of >> rows from the outer relation has a match in the hash table). > > I believe that parallelism makes the use of Bloom filter a lot more > compelling, too. Obviously this is something that wasn't taken into > consideration in 2015. > I haven't thought about it from that point of view. Can you elaborate why that would be the case? Sorry if this was explained earlier in this thread (I don't see it in the history, though). I can't quite remember why I haven't pursued the patch in 2015, but it was probably clear it wouldn't get in in the last CF, and I never got back to it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Boom filters for hash joins (was: A design for amcheck heapam verification)
Hi, On 09/19/2017 02:55 AM, Robert Haas wrote: > On Mon, Sep 18, 2017 at 5:13 PM, Peter Geoghegan <p...@bowt.ie> wrote: >> On Mon, Sep 18, 2017 at 2:07 PM, Robert Haas <robertmh...@gmail.com> wrote: >>> On Mon, Sep 18, 2017 at 1:29 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >>>> Uh, why does the planner need to be involved at all? >>> >>> Because it loses if the Bloom filter fails to filter anything. That's >>> not at all far-fetched; consider SELECT * FROM a.x, b.x WHERE a.x = >>> b.x given a foreign key on a.x referencing b(x). >> >> Wouldn't a merge join be a lot more likely in this case anyway? Low >> selectivity hash joins with multiple batches are inherently slow; the >> wasted overhead of using a bloom filter may not matter. >> >> Obviously this is all pretty speculative. I suspect that this could be >> true, and it seems worth investigating that framing of the problem >> first. > > ISTR Tomas Vondra doing some experiments with this a few years ago and > finding that it was, in fact, a problem. > You seem to have better memory than me, but you're right - I did some experiments with this in 2015, the WIP patch and discussion is here: https://www.postgresql.org/message-id/5670946e.8070...@2ndquadrant.com The whole idea was that with a bloom filter we can reduce the amount of tuples (from the outer relation) written to batches. The patch is fairly simple, and did not try to push the bloom filters to scan nodes or anything like that. It might be a meaningful first step, though, particularly for selective joins (where only small number of rows from the outer relation has a match in the hash table). regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] user-defined numeric data types triggering ERROR: unsupported type
Hi,

While testing a custom data type FIXEDDECIMAL [1], implementing a numeric-like data type with limited range, I ran into several issues that I suspect may not be entirely intentional / expected behavior.

[1] https://github.com/2ndQuadrant/fixeddecimal

Attached is a minimal subset of the extension SQL definition, which may be more convenient when looking into the issue.

The most important issue is that when planning a simple query, the estimation of range queries on a column with the custom data type fails like this:

test=# create table t (a fixeddecimal);
CREATE TABLE
test=# insert into t select random() from generate_series(1,10);
INSERT 0 10
test=# analyze t;
ANALYZE
test=# select * from t where a > 0.9;
ERROR: unsupported type: 16385

The error message here comes from convert_numeric_to_scalar, which gets called during histogram processing (ineq_histogram_selectivity) when approximating the histogram. convert_to_scalar does this:

switch (valuetypeid)
{
    ...
    case NUMERICOID:
        ...
        *scaledvalue = convert_numeric_to_scalar(value, valuetypid);
        *scaledlobound = convert_numeric_to_scalar(lobound, boundstypid);
        *scaledhibound = convert_numeric_to_scalar(hibound, boundstypid);
        return true;
    ...
}

The first call works fine, as the constant really is numeric (valuetypeid=1700). But the histogram boundaries are using the custom data type, causing the error, as convert_numeric_to_scalar expects only a bunch of hard-coded data types. So it's pretty much guaranteed to fail with any user-defined data type. This seems a bit unfortunate :-(

One solution would be to implement a custom estimation function, replacing scalarltsel/scalargtsel. But that seems rather unnecessary, especially considering there is an implicit cast from fixeddecimal to numeric. Another thing is that when there's just an MCV, the estimation works just fine.
So I see two basic ways to fix this:

* Make convert_numeric_to_scalar smarter, so that it checks if there is an implicit cast to numeric, and fail only if it does not find one.

* Make convert_to_scalar smarter, so that it returns false for unexpected data types, so that ineq_histogram_selectivity uses the default estimate of 0.5 (for that one bucket).

Both options seem more favorable than what's happening currently. Is there anything I missed that makes those fixes unacceptable? If anything, the fact that MCV estimates work while histogram estimates do not makes this somewhat unpredictable - a change in the data distribution (or perhaps even just a different sample in ANALYZE) may result in sudden failures.

I ran into one additional strange thing while investigating this. The attached SQL script defines two operator classes - fixeddecimal_ops and fixeddecimal_numeric_ops, defining (fixeddecimal,fixeddecimal) and (fixeddecimal,numeric) operators. Dropping one of those operator classes changes the estimates in a somewhat suspicious way. When only fixeddecimal_ops is defined, we get this:

test=# explain select * from t where a > 0.1;
                     QUERY PLAN
Seq Scan on t (cost=0.00..1943.00 rows=3 width=8)
  Filter: ((a)::numeric > 0.1)
(2 rows)

That is, we get the default estimate for inequality clauses, 33%. But when only fixeddecimal_numeric_ops is defined, we get this:

test=# explain select * from t where a > 0.1;
                     QUERY PLAN
Seq Scan on t (cost=0.00..1693.00 rows=5 width=8)
  Filter: (a > 0.1)
(2 rows)

That is, we get a 50% estimate, because that's what scalarineqsel uses when ineq_histogram_selectivity can't compute selectivity from a histogram for some reason. Wouldn't it make more sense to use the default 33% estimate here?
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services fixeddecimal-minimal.sql Description: application/sql -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] valgrind vs. shared typmod registry
Hi,

I've been running some regression tests under valgrind, and it seems select_parallel triggers some uses of uninitialized values in dshash. If I'm reading the reports right, it complains about hashtable->size_log2 not being initialized in ensure_valid_bucket_pointers.

I've been running tests under valgrind not too long ago and I don't recall such failures, so perhaps something broke it in the past few days.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

==18897== Conditional jump or move depends on uninitialised value(s)
==18897==    at 0x6C0B4F: ensure_valid_bucket_pointers (dshash.c:752)
==18897==    by 0x6C0123: dshash_find (dshash.c:390)
==18897==    by 0x971224: find_or_make_matching_shared_tupledesc (typcache.c:2265)
==18897==    by 0x9704E7: assign_record_type_typmod (typcache.c:1602)
==18897==    by 0x68879E: BlessTupleDesc (execTuples.c:1036)
==18897==    by 0x6721C8: ExecInitExprRec (execExpr.c:1512)
==18897==    by 0x671E62: ExecInitExprRec (execExpr.c:1385)
==18897==    by 0x66FC87: ExecInitExpr (execExpr.c:130)
==18897==    by 0x15AB9E92: exec_eval_simple_expr (pl_exec.c:5584)
==18897==    by 0x15AB95E9: exec_eval_expr (pl_exec.c:5202)
==18897==    by 0x15AB4388: exec_stmt_return (pl_exec.c:2755)
==18897==    by 0x15AB2563: exec_stmt (pl_exec.c:1606)
==18897==    by 0x15AB2306: exec_stmts (pl_exec.c:1521)
==18897==    by 0x15AB21B1: exec_stmt_block (pl_exec.c:1459)
==18897==    by 0x15AB046B: plpgsql_exec_function (pl_exec.c:464)
==18897==    by 0x15AAACCD: plpgsql_call_handler (pl_handler.c:258)
==18897==    by 0x674C7E: ExecInterpExpr (execExprInterp.c:650)
==18897==    by 0x6AA6A0: ExecEvalExprSwitchContext (executor.h:309)
==18897==    by 0x6AA709: ExecProject (executor.h:343)
==18897==    by 0x6AA878: ExecResult (nodeResult.c:136)
==18897== Uninitialised value was created by a heap allocation
==18897==    at 0x9A837E: palloc (mcxt.c:871)
==18897==    by 0x6BFEEA: dshash_attach (dshash.c:269)
==18897==    by 0x9707F7: SharedRecordTypmodRegistryAttach (typcache.c:1778)
==18897==    by 0x484826: AttachSession (session.c:183)
==18897==    by 0x504225: ParallelWorkerMain (parallel.c:1100)
==18897==    by 0x7820B5: StartBackgroundWorker (bgworker.c:835)
==18897==    by 0x793F9E: do_start_bgworker (postmaster.c:5686)
==18897==    by 0x7942D7: maybe_start_bgworkers (postmaster.c:5890)
==18897==    by 0x793399: sigusr1_handler (postmaster.c:5079)
==18897==    by 0x4E4B5CF: ??? (in /usr/lib64/libpthread-2.24.so)
==18897==    by 0x586AC72: __select_nocancel (in /usr/lib64/libc-2.24.so)
==18897==    by 0x78F220: ServerLoop (postmaster.c:1720)
==18897==    by 0x78E9C5: PostmasterMain (postmaster.c:1364)
==18897==    by 0x6D5714: main (main.c:228)
==18897==
{
   Memcheck:Cond
   fun:ensure_valid_bucket_pointers
   fun:dshash_find
   fun:find_or_make_matching_shared_tupledesc
   fun:assign_record_type_typmod
   fun:BlessTupleDesc
   fun:ExecInitExprRec
   fun:ExecInitExprRec
   fun:ExecInitExpr
   fun:exec_eval_simple_expr
   fun:exec_eval_expr
   fun:exec_stmt_return
   fun:exec_stmt
   fun:exec_stmts
   fun:exec_stmt_block
   fun:plpgsql_exec_function
   fun:plpgsql_call_handler
   fun:ExecInterpExpr
   fun:ExecEvalExprSwitchContext
   fun:ExecProject
   fun:ExecResult
}
==18897== Conditional jump or move depends on uninitialised value(s)
==18897==    at 0x6C0B4F: ensure_valid_bucket_pointers (dshash.c:752)
==18897==    by 0x6C024E: dshash_find_or_insert (dshash.c:440)
==18897==    by 0x971310: find_or_make_matching_shared_tupledesc (typcache.c:2294)
==18897==    by 0x9704E7: assign_record_type_typmod (typcache.c:1602)
==18897==    by 0x68879E: BlessTupleDesc (execTuples.c:1036)
==18897==    by 0x6721C8: ExecInitExprRec (execExpr.c:1512)
==18897==    by 0x671E62: ExecInitExprRec (execExpr.c:1385)
==18897==    by 0x66FC87: ExecInitExpr (execExpr.c:130)
==18897==    by 0x15AB9E92: exec_eval_simple_expr (pl_exec.c:5584)
==18897==    by 0x15AB95E9: exec_eval_expr (pl_exec.c:5202)
==18897==    by 0x15AB4388: exec_stmt_return (pl_exec.c:2755)
==18897==    by 0x15AB2563: exec_stmt (pl_exec.c:1606)
==18897==    by 0x15AB2306: exec_stmts (pl_exec.c:1521)
==18897==    by 0x15AB21B1: exec_stmt_block (pl_exec.c:1459)
==18897==    by 0x15AB046B: plpgsql_exec_function (pl_exec.c:464)
==18897==    by 0x15AAACCD: plpgsql_call_handler (pl_handler.c:258)
==18897==    by 0x674C7E: ExecInterpExpr (execExprInterp.c:650)
==18897==    by 0x6AA6A0: ExecEvalExprSwitchContext (executor.h:309)
==18897==    by 0x6AA709: ExecProject (executor.h:343)
==18897==    by 0x6AA878: ExecResult (nodeResult.c:136)
==18897== Uninitialised value was created by a heap allocation
==18897==    at 0x9A837E: palloc (mcxt.c:871)
==18897==    by 0x6BFEEA: dshash_attach (dshash.c:269)
==18897==    by 0x970820: SharedRecordTypmodRegistryAttach (typcache.c:1782)
==18897==    by 0x484826: AttachSession (session.c:183)
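For readers unfamiliar with this class of Memcheck report: like a bare palloc(), malloc() hands back uninitialized memory, so branching on a struct field before anything assigns it is exactly what produces "conditional jump depends on uninitialised value". A generic self-contained illustration (this is NOT the dshash code - the struct and functions are made up to show the bug class and the usual fix):

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a hash table control struct; the real dshash
 * header lives in shared memory and has many more fields. */
typedef struct toy_hash
{
    size_t size_log2;           /* number of buckets == 1 << size_log2 */
} toy_hash;

/* Buggy pattern: malloc() (like palloc()) returns uninitialized memory,
 * so any later "if (h->size_log2 ...)" branch triggers Memcheck. */
static toy_hash *
toy_attach_buggy(void)
{
    return malloc(sizeof(toy_hash));    /* size_log2 never assigned */
}

/* Fixed pattern: zero the struct (palloc0()/memset in PostgreSQL terms)
 * or copy the authoritative value before anyone branches on it. */
static toy_hash *
toy_attach_fixed(size_t size_log2)
{
    toy_hash *h = calloc(1, sizeof(toy_hash));

    if (h)
        h->size_log2 = size_log2;       /* initialized before first use */
    return h;
}
```

The suppression block quoted above is the standard way to silence such a report once it is understood, but here the report looks like a genuine bug, so initializing the field is the real fix.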
Re: [HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)
On 09/14/2017 04:21 PM, Simon Riggs wrote:
> On 14 August 2017 at 01:35, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> Hi,
>>
>> Attached is a rebased version of the Generational context, originally
>> submitted with SlabContext (which was already committed into Pg 10).
>>
>> The main change is that I've abandoned the pattern of defining a Data
>> structure and then a pointer typedef, i.e.
>>
>>     typedef struct GenerationContextData { ... } GenerationContextData;
>>     typedef struct GenerationContextData *GenerationContext;
>>
>> Now it's just
>>
>>     typedef struct GenerationContext { ... } GenerationContext;
>>
>> mostly because SlabContext was committed like that, and because Andres was
>> complaining about this code pattern ;-)
>>
>> Otherwise the design is the same as repeatedly discussed before.
>>
>> To show that this is still a valuable change (even after SlabContext and
>> adding a doubly-linked list to AllocSet), I've repeated the test done by
>> Andres in [1] using the test case described in [2], that is
>>
>>     -- generate data
>>     SELECT COUNT(*) FROM (SELECT test1()
>>                             FROM generate_series(1, 5)) foo;
>>
>>     -- benchmark (measure time and VmPeak)
>>     SELECT COUNT(*) FROM (SELECT *
>>         FROM pg_logical_slot_get_changes('test', NULL,
>>                  NULL, 'include-xids', '0')) foo;
>>
>> with different values passed to the first step (instead of the 5). The
>> VmPeak numbers look like this:
>>
>>      N      master      patched
>>     ---------------------------
>>     10   1155220 kB   361604 kB
>>     20   2020668 kB   434060 kB
>>     30   2890236 kB   502452 kB
>>     40   3751592 kB   570816 kB
>>     50   4621124 kB   639168 kB
>>
>> and the timing (on an assert-enabled build):
>>
>>      N       master      patched
>>     -----------------------------
>>     10   1103.182 ms   412.734 ms
>>     20   2216.711 ms   820.438 ms
>>     30   3320.095 ms  1223.576 ms
>>     40   4584.919 ms  1621.261 ms
>>     50   5590.444 ms  2113.820 ms
>>
>> So it seems it's still a significant improvement, both in terms of memory
>> usage and timing. Admittedly, this is a single test, so ideas of other
>> useful test cases are welcome.
> This all looks good.
>
> What I think this needs is changes to src/backend/utils/mmgr/README
> which describe the various options that now exist (normal?, slab) and
> will exist (generational).
>
> Don't really care about the name, as long as it's clear when to use it
> and when not to use it.
>
> This description of generational seems wrong: "When the allocated
> chunks have similar lifespan, this works very well and is extremely
> cheap."
> They don't need the same lifespan, they just need to be freed in groups
> and in the order they were allocated.

Agreed, it should be described differently. What matters is the (mostly) FIFO pattern of the palloc/pfree requests, which allows us to release the memory.

> For this patch specifically, we need additional comments in
> reorderbuffer.c to describe the memory allocation pattern in that
> module, so that it is clear that the choice of allocator is useful
> and appropriate, possibly with details of how that testing was
> performed, so it can be re-tested later or tested on a variety of
> platforms.

Including details about the testing in reorderbuffer.c does not seem very attractive to me. I don't recall any other place describing the tests in detail. Why not discuss the details here, and then include a link to this thread in the commit message?

In any case, I doubt anyone will repeat the testing on a variety of platforms (and I don't have any such comprehensive test suite anyway). What will likely happen is someone bumping into a poorly performing corner case; we will analyze it and fix it as usual.

> Particularly in reorderbuffer, surely we will almost immediately
> reuse chunks again, so is it worth issuing free() and then malloc()
> again soon after? Does that cause additional overhead we could also
> avoid? Could we possibly keep the last/few free'd chunks around
> rather than re-malloc?

I haven't seen anything like that in the tests I've done.
The malloc/free overhead is negligible, as our allocators significantly reduce the number of calls to those functions. If we happen to run into such a case, it shouldn't be difficult to keep a few empty blocks around. But perhaps we can leave that as a future optimization.

> Seems like we should commit this soon.

Thanks.

regards

--
Tomas Vondra
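The (mostly) FIFO palloc/pfree pattern mentioned above is exactly what makes a generational scheme cheap. Here is a self-contained toy sketch of the principle - chunks are bump-allocated from blocks, and a block is returned to the OS only once every chunk carved from it has been freed. This is illustrative only: the names and layout are made up, it is not the patch's actual GenerationContext code, and it ignores alignment rounding, block reuse, and error handling:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One "generation": a block carved into chunks by bump allocation. */
typedef struct gen_block
{
    struct gen_block *next;     /* newer-to-older list of blocks */
    size_t used;                /* bytes handed out from data[] */
    size_t nchunks;             /* live chunks still in this block */
    size_t size;                /* capacity of data[] */
    char   data[];
} gen_block;

typedef struct gen_chunk
{
    gen_block *block;           /* back-pointer to the owning block */
} gen_chunk;

typedef struct gen_context
{
    gen_block *head;            /* block currently being filled */
    size_t     blksize;
} gen_context;

static void *
gen_alloc(gen_context *cxt, size_t size)
{
    size_t     need = sizeof(gen_chunk) + size;
    gen_chunk *ch;

    /* start a new block if the current one can't fit the request */
    if (cxt->head == NULL || cxt->head->used + need > cxt->head->size)
    {
        gen_block *blk = malloc(sizeof(gen_block) + cxt->blksize);

        blk->next = cxt->head;
        blk->used = 0;
        blk->nchunks = 0;
        blk->size = cxt->blksize;
        cxt->head = blk;
    }

    ch = (gen_chunk *) (cxt->head->data + cxt->head->used);
    ch->block = cxt->head;
    cxt->head->used += need;
    cxt->head->nchunks++;
    return ch + 1;
}

static void
gen_free(gen_context *cxt, void *ptr)
{
    gen_chunk *ch = (gen_chunk *) ptr - 1;
    gen_block *blk = ch->block;

    /* free the whole block once its last chunk is gone */
    if (--blk->nchunks == 0 && blk != cxt->head)
    {
        gen_block **p;

        for (p = &cxt->head; *p != blk; p = &(*p)->next)
            ;                   /* unlink: with FIFO traffic this is the oldest block */
        *p = blk->next;
        free(blk);
    }
}
```

With FIFO traffic, pfree only ever empties the oldest block, so memory is released back as soon as the consumer catches up - which is why the VmPeak numbers above drop so sharply.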
Re: [HACKERS] Hooks to track changed pages for backup purposes
On 09/13/2017 07:53 AM, Andrey Borodin wrote:
>> * I see there are conditions like this:
>>
>>     if (xlogreader->blocks[nblock].forknum == MAIN_FORKNUM)
>>
>> Why is it enough to restrict the block-tracking code to the main fork?
>> Aren't we interested in all relation forks?
>
> fsm, vm and others are small enough to take them

That seems like an optimization specific to your backup solution, not necessarily to others and/or to other possible use cases.

>> I guess you'll have to explain
>> what the implementation of the hooks is supposed to do, and why these
>> locations for hook calls are the right ones. It's damn impossible to
>> validate the patch without that information.
>>
>> Assuming you still plan to use the hook approach ...
>
> Yes, I still think hooking is a good idea, but you are right - I need a
> prototype first. I'll mark the patch as Returned with Feedback before
> the prototype implementation.

OK

>>>> There are no arguments fed to this hook, so modules would not be able to
>>>> analyze things in this context, except shared memory and process
>>>> state?
>>>>
>>>> Those hooks are put in hot code paths, and could impact performance of
>>>> WAL insertion itself.
>>> I do not think sending a few bytes to a cached array is comparable to a
>>> disk write of an XLog record. Checking the func ptr is even cheaper with
>>> correct branch prediction.
>>
>> That seems somewhat suspicious, for two reasons. Firstly, I believe we
>> only insert the XLOG records into the WAL buffer here, so why should there
>> be any disk write related? Or do you mean the final commit?
> Yes, I mean finally we will be waiting for the disk. A hundred empty ptr
> checks are negligible in comparison with the disk.

Aren't we doing these calls while holding XLog locks? IIRC there was quite a significant performance improvement after Heikki reduced the amount of code executed while holding the locks.

>> But more importantly, doesn't this kind of information require some
>> durability guarantees?
>> I mean, if it gets lost during server crashes or
>> restarts, doesn't that mean the incremental backups might miss some
>> buffers? I'd guess the hooks will have to do some sort of I/O to
>> achieve that, no?
> We need durability only on the level of one segment. If we do not have
> info from a segment, we can just rescan it.
> If we send a segment to S3 as one file, we can be sure of its integrity.
> But this IO can be async.
>
> PTRACK in its turn switches bits in the fork's buffers, which are written
> by the checkpointer and... well... recovered during recovery. By the
> usual WAL replay of recovery.

But how do you do that from the hooks, if they only store the data into a buffer in memory? Let's say you insert ~8MB of WAL into a segment, and then the system crashes and reboots. How do you know you have incomplete information from the WAL segment?

Although, that's probably what wal_switch_hook() might do - sync the data whenever the WAL segment is switched. Right?

>> From this POV, the idea to collect this information on the backup system
>> (WAL archive) by pre-processing the arriving WAL segments seems like the
>> most promising. It moves the work to another system, the backup system
>> can make it as durable as the WAL segments, etc.
> Well, in some not so rare cases users encrypt backups and send them to S3.
> And there is no system with CPUs that can handle that WAL parsing.
> Currently, I'm considering mocking a prototype for wal-g, which works
> exactly like this.

Why couldn't there be a system with enough CPU power? Sure, if you want to do this, you'll need a more powerful system, but regular CPUs can do >1GB/s in AES-256-GCM thanks to AES-NI. Or you could do it on the database as part of archive_command, before the encryption, of course.
regards

--
Tomas Vondra
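The durability model being debated - buffer changed-block info cheaply in memory, make it durable only at segment switch, rescan the whole segment after a crash - can be sketched in a few lines. The hook names echo the thread (wal_switch_hook() is mentioned above), but this is a purely hypothetical toy, not the proposed patch's API; a real extension would live in shared memory and fsync a file or upload to S3 instead of flipping a flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED 1024

/* Hypothetical in-memory tracker for blocks touched in the current segment. */
typedef struct block_tracker
{
    unsigned blocks[MAX_TRACKED];
    size_t   nblocks;
    bool     durable;           /* has the current batch been synced? */
} block_tracker;

/* Hypothetical per-record hook: cheap, just an append to a cached array.
 * If we crash before the next segment switch, this data is simply lost
 * and the backup tool rescans the whole segment. */
static void
collect_block_hook(block_tracker *t, unsigned blkno)
{
    if (t->nblocks < MAX_TRACKED)
        t->blocks[t->nblocks++] = blkno;
    t->durable = false;
}

/* Hypothetical segment-switch hook: the durability boundary. This is
 * where the collected block list would be fsync'd/uploaded. */
static void
wal_switch_hook(block_tracker *t)
{
    /* ... write out t->blocks here, then reset for the next segment ... */
    t->nblocks = 0;
    t->durable = true;
}
```

The design point is that durability is only needed at segment granularity: anything not yet synced maps to "rescan this segment", so the per-record path stays a plain array append.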
Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12
Hi Aleksander,

On 09/13/2017 11:49 AM, Aleksander Alekseev wrote:
> Hi Tomas,
>
> I appreciate your feedback, although it doesn't seem to be completely
> fair. Particularly:
>
>> You gave everyone about 4 hours to object
>
> This is not quite accurate since my proposal was sent 2017-09-11
> 09:41:32 and this thread started 2017-09-12 14:14:55.

Understood. I didn't really consider the first message to be a proposal with a deadline, as it starts with "here's a crazy idea" and it's not immediately clear that you intend to proceed with it right away, and that you expect people to object. The message I referenced is much clearer on this.

>> You just changed the status of 10-15% of the open patches. I'd expect
>> things like this to be consulted with the CF manager, yet I don't see
>> any comments from Daniel.
>
> Agree, that was clearly a mistake, I should have added Daniel to CC.
> Sorry I didn't do that. I've returned all affected patches back to
> "Needs Review". On the bright side, while doing this I've noticed that
> many patches were already updated by their authors.

Yeah.

regards

--
Tomas Vondra
Re: [HACKERS] Patches that don't apply or don't compile: 2017-09-12
On 09/12/2017 04:14 PM, Aleksander Alekseev wrote:
> Hello, hackers!
>
> Thanks to the work of Thomas Munro we now have a CI for the patches in
> the commitfest [1]. Naturally there is still room for improvement, but
> in any case it's much, much better than nothing.
>
> After a short discussion [2] we agreed (or at least no one objected)
> to determine the patches that don't apply / don't compile / don't
> pass regression tests and have "Needs Review" status, change the
> status of these patches to "Waiting on Author" and write a report
> (this one) with a CC to the authors. As we all know, we are short on
> reviewers and this action will save them a lot of time. Here [3] you
> can find a script I've been using to find such patches.
>
> I rechecked the list manually and did my best to exclude the patches
> that were updated recently or that depend on other patches. However
> there is still a chance that your patch got on the list by mistake.
> In this case please let me know.

With all due respect, it's hard not to see this as a disruption of the current CF. I agree automating the patch processing is a worthwhile goal, but we're not there yet and it seems somewhat premature. Let me explain why I think so:

(1) You just changed the status of 10-15% of the open patches. I'd expect things like this to be consulted with the CF manager, yet I don't see any comments from Daniel. Considering he's been at the Oslo PUG meetup today, I doubt he was watching -hackers very closely.

(2) You gave everyone about 4 hours to object, ending 3PM UTC, which excludes about one whole hemisphere where it's either too early or too late for people to respond. I'd say waiting for >24 hours would be more appropriate.

(3) The claim that "no one objected" is somewhat misleading, I guess. I myself objected to automating this yesterday, and AFAICS Thomas Munro shares the opinion that we're not ready for automating it.

(4) You say you rechecked the list manually - can you elaborate on what you checked?
Per reports from others, some patches seem to "not apply" merely because "git apply" is quite strict. Have you actually tried applying / compiling the patches yourself?

(5) I doubt "does not apply" is actually sufficient reason to move a patch to "waiting on author". For example, my statistics patch was failing to apply merely due to 821fb8cdbf lightly touching the SGML docs, changing "type" to "kind" in a few places. Does that mean the patch can't get any reviews until the author fixes it? Hardly. But after switching it to "waiting on author" that's exactly what's going to happen, as people mostly ignore patches in that state.

(6) It's generally a good idea to send a message to the patch thread whenever the status is changed, otherwise the patch authors may not notice the change for a long time. I don't see any such messages, certainly not in "my" patch thread.

I object to changing the patch status merely based on the script output. It's a nice goal, but we need to do the legwork first, otherwise it'll be just annoying and disrupting. I suggest we inspect the reported patches manually, investigate whether the failures are legitimate or not, and eliminate as many false positives as possible. Once we are happy with the accuracy, we can enable it again.

kind regards

--
Tomas Vondra
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Attached is an updated version of the patch, dealing with the fallout of 821fb8cdbf700a8aadbe12d5b46ca4e61be5a8a8, which touched the SGML documentation for CREATE STATISTICS.

regards

On 09/07/2017 10:07 PM, Tomas Vondra wrote:
> Hi,
>
> Attached is an updated version of the patch, fixing the issues reported
> by Adrien Nayrat, and also a bunch of issues pointed out by valgrind.
>
> regards

--
Tomas Vondra

0001-multivariate-MCV-lists.patch.gz
0002-multivariate-histograms.patch.gz
Re: [HACKERS] The case for removing replacement selection sort
On 09/11/2017 05:32 PM, Robert Haas wrote:
> On Sun, Sep 10, 2017 at 9:39 PM, Peter Geoghegan <p...@bowt.ie> wrote:
>> To be clear, you'll still need to set replacement_sort_tuples high
>> when testing RS, to make sure that we really use it for at least the
>> first run when we're expected to. (There is no easy way to have
>> testing mechanically verify that we really do only have one run in the
>> end with RS, but I assume that such paranoia is unneeded.)
>
> I seem to recall that raising replacement_sort_tuples makes
> replacement selection look worse in some cases -- the optimization is
> more likely to apply, sure, but the heap is also bigger, which hurts.

The question is: what is the optimal replacement_sort_tuples value? I assume it's the number of tuples that effectively uses the CPU caches, at least that's what our docs say. So I think you're right that raising it to 1B rows may break this assumption and make it perform worse.

But perhaps the fact that we're testing with multiple work_mem values, and with smaller data sets (100k or 1M rows), makes this a non-issue?

regards

--
Tomas Vondra
Re: [HACKERS] Automatic testing of patches in commit fest
On 09/11/2017 03:01 PM, Aleksander Alekseev wrote:
> Hi Tomas,
>
>>> Unless there are any objections to give this idea a try I'm willing to
>>> write and host a corresponding script.
>>
>> That won't work until (2) is reliable enough. There are patches
>> (for example my "multivariate MCV lists and histograms") which
>> fail to apply only because the tool picks the wrong patch.
>> Possibly because it does not recognize compressed patches, or
>> something.
>
> Agree. However we could simply add an "Enable autotests" checkbox to
> the commitfest application. In fact we could even leave the
> commitfest application as is and provide corresponding checkboxes in
> the web interface of the "autoreviewer" application. Naturally every
> automatically generated code review will include a link that
> disables autotests for this particular commitfest entry.

I think we should try making it as reliable as possible first, and only then consider adding some manual on/off switches. Otherwise the patches may randomly flap between "OK" and "failed" after each new submission, disrupting the CF process. Making the testing reliable may even require establishing some sort of policy (say, always send a complete patch, not just one piece).

> I hope this observation will change your mind :)

Not sure. But I'm mostly just a passenger here, not the driver.

regards

--
Tomas Vondra
Re: [HACKERS] Remove 1MB size limit in tsvector
On 09/11/2017 01:54 PM, Robert Haas wrote:
> On Mon, Sep 11, 2017 at 5:33 AM, Ildus Kurbangaliev
> <i.kurbangal...@postgrespro.ru> wrote:
>> Moreover, RUM index
>> stores positions + lexemes, so it doesn't need tsvectors for ranked
>> search. As a result, tsvector becomes a storage for
>> building indexes (indexable type), not something that should be used at
>> runtime. And the change of the format doesn't affect index creation
>> time.
>
> RUM indexes, though, are not in core.

Yeah, but I think Ildus has a point that this should not really matter for indexed tsvectors. So the question is how realistic that benchmark actually is. How likely are we to do queries on fts directly, not through a GIN/GiST index? Particularly in performance-sensitive cases?

regards

--
Tomas Vondra
Re: [HACKERS] Automatic testing of patches in commit fest
On 09/11/2017 11:41 AM, Aleksander Alekseev wrote:
> Hi Thomas,
>
> Great job!

+1

> Here is a crazy idea. What if we write a script that would automatically
> return the patches that:
>
> 1) Are not in "Waiting on Author" status
> 2) Don't apply OR don't pass `make installcheck-world`
>
> ... to the "Waiting on Author" status and describe the problem through
> the "Add review" form on commitfest.postgresql.org? This will save a lot
> of time for the reviewers. Naturally nobody wants to spam pgsql-hackers@
> with automatic messages too often. I believe that sending such messages
> once a day would be enough.
>
> Unless there are any objections to give this idea a try I'm willing to
> write and host a corresponding script.

That won't work until (2) is reliable enough. There are patches (for example my "multivariate MCV lists and histograms") which fail to apply only because the tool picks the wrong patch. Possibly because it does not recognize compressed patches, or something.

In such cases switching it to "Waiting on Author" automatically would be damaging, as (a) there's nothing wrong with the patch, and (b) it's not clear what to do to fix the problem.

So -1 to this for now, until we make this part smart enough.

regards

--
Tomas Vondra
Re: [HACKERS] Polyphase merge is obsolete
Hi,

I planned to do some benchmarking on this patch, but apparently the patch no longer applies. Rebase please?

regards

--
Tomas Vondra
Re: [HACKERS] The case for removing replacement selection sort
On 09/11/2017 02:22 AM, Peter Geoghegan wrote: > On Sun, Sep 10, 2017 at 5:07 PM, Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: >> I'm currently re-running the benchmarks we did in 2016 for 9.6, but >> those are all sorts with a single column (see the attached script). But >> it'd be good to add a few queries testing sorts with multiple keys. We >> can either tweak some of the existing data sets + queries, or come up >> with entirely new tests. > > I see that work_mem is set like this in the script: > > "for wm in '1MB' '8MB' '32MB' '128MB' '512MB' '1GB'; do" > > I suggest that we forget about values over 32MB, since the question of > how well quicksort does there was settled by your tests in 2016. I > would also add '4MB' to the list of wm values that you'll actually > test. OK, so 1MB, 4MB, 8MB, 32MB? > > Any case with input that is initially in random order or DESC sort > order is not interesting, either. I suggest you remove those, too. > OK. > I think we're only interested in benchmarks where replacement > selection really does get its putative best case (no merge needed in > the end). Any (almost) sorted cases (the only cases that you are > interesting to test now) will always manage that, once you set > replacement_sort_tuples high enough, and provided there isn't even a > single tuple that is completely out of order. The "before" cases here > should have a replacement_sort_tuples of 1 billion (so that we're sure > to not have the limit prevent the use of replacement selection in the > first place), versus the "after" cases, which should have a > replacement_sort_tuples of 0 to represent my proposal (to represent > performance in a world where replacement selection is totally > removed). > Ah, so you suggest doing all the tests on current master, by only tweaking the replacement_sort_tuples value? I've been testing master vs. your patch, but I guess setting replacement_sort_tuples=0 should have the same effect. 
I probably won't eliminate the random/DESC data sets, though. At least not from the two smaller data sets - I want to do a bit of benchmarking on Heikki's polyphase merge removal patch, and for that patch those data sets are still relevant. Also, it's useful to have a subset of results where we know we don't expect any change.

>> For the existing queries, I should have some initial results
>> tomorrow, at least for the data sets with 100k and 1M rows. The
>> tests with 10M rows will take much more time (it takes 1-2 hours for
>> a single work_mem value, and we're testing 6 of them).
>
> I myself don't see that much value in a 10M row test.

Meh, more data is probably better. And with the reduced work_mem values and skipping of random/DESC data sets it should complete much faster.

regards

--
Tomas Vondra
Re: [HACKERS] The case for removing replacement selection sort
On 09/11/2017 01:03 AM, Peter Geoghegan wrote:
> On Sat, Sep 9, 2017 at 8:28 AM, Greg Stark <st...@mit.edu> wrote:
>> This may be a bit "how long is a piece of string" but how do those two
>> compare with string sorting in an interesting encoding/locale -- say
>> /usr/share/dict/polish in pl_PL for example. It's certainly true that
>> people do sort text as well as numbers.
>
> I haven't looked at text, because I presume that it's very rare for
> tuples within a table to be more or less ordered by a text
> attribute. While the traditional big advantage of replacement
> selection has always been its ability to double initial run length on
> average, where text performance is quite interesting because
> localized clustering still happens, that doesn't really seem relevant
> here. The remaining use of replacement selection is expressly only
> about the "single run, no merge" best case, and in particular about
> avoiding regressions when upgrading from versions prior to 9.6 (9.6
> is the version where we began to generally prefer quicksort).
>
>> Also, people often sort on keys of more than one column
>
> Very true. I should test this.

I'm currently re-running the benchmarks we did in 2016 for 9.6, but those are all sorts with a single column (see the attached script). But it'd be good to add a few queries testing sorts with multiple keys. We can either tweak some of the existing data sets + queries, or come up with entirely new tests.

The current tests construct data sets with different features (unique or low/high-cardinality, random/sorted/almost-sorted). How should that work for multi-key sorts? Same features for all columns, or some mix?

For the existing queries, I should have some initial results tomorrow, at least for the data sets with 100k and 1M rows. The tests with 10M rows will take much more time (it takes 1-2 hours for a single work_mem value, and we're testing 6 of them).
But perhaps we can reduce the number of tests somehow, only testing the largest/smallest work_mem values? So that we could get some numbers now, possibly running the complete test suite later.

regards

--
Tomas Vondra

sort-bench.sh
Re: [HACKERS] WIP: Separate log file for extension
On 08/28/2017 11:23 AM, Antonin Houska wrote:
> Craig Ringer <cr...@2ndquadrant.com> wrote:
>
>> On 25 August 2017 at 15:12, Antonin Houska <a...@cybertec.at> wrote:
>>
>> How will this play with syslog? csvlog? etc?
>
> There's nothing special about csvlog: the LogStream structure has a
> "destination" field, so if a particular extension wants this kind of
> output, it simply sets the LOG_DESTINATION_CSVLOG bit here.

I assume Craig's point was more about CSVLOG requiring logging_collector=on. So what will happen if PostgreSQL is started without the collector, and an extension attempts to use LOG_DESTINATION_CSVLOG? Or will it start a separate collector for each such extension?

regards

--
Tomas Vondra
Re: [HACKERS] segment size depending *_wal_size defaults (was increasing the default WAL segment size)
On 08/30/2017 03:16 AM, Andres Freund wrote: > On 2017-08-30 10:14:22 +0900, Michael Paquier wrote: >> On Wed, Aug 30, 2017 at 10:06 AM, Andres Freund <and...@anarazel.de> wrote: >>> On 2017-08-30 09:49:14 +0900, Michael Paquier wrote: >>>> Do you think that we should worry about wal segment sizes higher than >>>> 2GB? Support for int64 GUCs is not here yet. >>> >>> 1GB will be the limit anyway. >> >> Yeah, but imagine that we'd want to raise that even more up. > > I'm doubtfull of that. But even if, it'd not be hard to GUC support. > It's not hard - it's just a lot of copy-pasting of infrastructure code. Incidentally, I already have a patch doing that, as we had to add that support to XL, and I can submit it to PostgreSQL. Might save some boring coding. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: new help related to variables are not too readable
On 09/09/2017 01:24 AM, Tom Lane wrote: > Tomas Vondra <tomas.von...@2ndquadrant.com> writes: >> The translator has exactly the same context in both cases, and the >> struggle to wrap it at 80 characters will be pretty much the same too. > > Really? With the old way, you had something under 60 characters to > work in, now it's nearly 80. I don't buy that that's not a significant > difference. It's also much less ugly if you decide you need one more > line than the English version uses. > That's not what I meant. I've never felt a strong urge to avoid wrapping at 80 chars when translating strings - simply translate in the most meaningful way, and if it gets longer than 80 chars and can't be easily shortened, just wrap. Whether there are 60 or 80 characters does not change this much - 80 chars may allow more unwrapped strings, of course, but it's a minor change for the translators. Or at least for me; others may disagree, of course. What I find way more annoying are strings where it's not clear where to wrap, because it gets prefixed by something, we insert a value while formatting the error message, etc. But that's not the case here, as both _(" " " ") and _("XXXX ") give you the whole string. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] psql: new help related to variables are not too readable
Hi, On 09/08/2017 07:25 AM, Fabien COELHO wrote: > > Hello, > >>> PSQL_HISTORY alternative location for the command history file >>> >>> I would prefer to revert to that more compact 9.6-formatting. >> >> There was a problem with line width .. its hard to respect 80 chars > > Yes. > > Scrolling in two dimensions because it does not fit either way is not > too practical, so the idea was the it should at least fit a reasonable > terminal in the horizontal dimension, the vertical one having been > unfittable anyway for a long time. > > Once you want to do strict 80 columns, a lot of/most descriptions do not > fit and need to be split somehow on two lines, one way or another. It > seemed that > > XXX > xxx xx xxx xxx > > Is a good way to do that systematically and with giving more space and > chance for a description to fit in its line. ISTM that it was already > done like for environment variables, so it is also for homogeneity. > FWIW I also find the new formatting hard to read, as it does not clearly separate the items. The old formatting was not ideal either, though. > > It also simplify work for translators that can just follow the same > formatting and know what to do if a one line English explanation does > not fit once translated. > As someone responsible for a significant part of the Czech translation, I find this argument somewhat dubious. I don't see how this _(" " " ") simplifies the work for translators, compared to this _(" ") The translator has exactly the same context in both cases, and the struggle to wrap it at 80 characters will be pretty much the same too. The thing that would make a difference is automatic wrapping, i.e. split on spaces, then re-build into shorter than 80 characters ... > Finally, as vertical scrolling is mandatory, I would be fine with > skipping lines with entries for readability, but it is just a matter of > taste and I expect there should be half a dozen different opinions on > the matter of formatting. 
> FWIW, +1 to extra lines from me - I find it way more readable, as it clearly separates the items. You're right this is also a matter of taste to some degree, so I understand Erik's point about vertical scrolling. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Pageinspect - add functions on GIN and GiST indexes from gevel
Hi, On 07/21/2017 03:40 PM, Alexander Korotkov wrote: > Hi, Alexey! > > On Fri, Jul 21, 2017 at 3:05 PM, Alexey Chernyshov > <a.chernys...@postgrespro.ru <mailto:a.chernys...@postgrespro.ru>> wrote: > > the following patch transfers functionality from gevel module > (http://www.sai.msu.su/~megera/wiki/Gevel > <http://www.sai.msu.su/~megera/wiki/Gevel>) which provides functions for > analyzing GIN and GiST indexes to pageinspect. Gevel was originally > designed by Oleg Bartunov, and Teodor Sigaev for developers of GiST and > GIN indexes. > > > It's very good that you've picked up this work! pageinspect is lacking > of this functionality. > > Functions added: > - gist_stat(text) - shows statistics on GiST Tree > - gist_tree(text) - shows GiST tree > - gist_tree(text, int4) - shows GiST tree up to MAXLEVEL > - gist_print(text) - prints objects stored in GiST tree > - spgist_stat(text) - shows statistics on SP-GiST > - spgist_print(text) - prints objects stored in index > - gin_value_count() - originally gin_stat(text) - prints estimated > counts > for index values > - gin_stats() - originally gin_statpage(text) - shows statistics > - gin_count_estimate(text, tsquery) - shows number of indexed rows > matched > query > > Tests also transferred, docs for new functions are added. I run pgindent > over the code, but the result is different from those I expected, so > I leave > pgindented one. > The patch is applicable to the commit > 866f4a7c210857aa342bf901558d170325094dde. > > > As we can see, gevel contains functions which analyze the whole index. > pageinspect is written in another manner: it gives you functionality to > analyze individual pages, tuples and so on. > Thus, we probably shouldn't try to move gevel functions to pageinspect > "as is". They should be rewritten in more granular manner as well as > other pageinspact functions are. Any other opinions? 
> I agree with Alexander that pageinspect is written in a very different way - as the extension name suggests, it's used to inspect pages. The proposed patch uses a very different approach, reading the whole index, not individual pages. Why should it be part of pageinspect? For example we have the pgstattuple extension, which seems like a better match. Or just create a new extension - if the code is valuable, surely we can live with one more extension instead of smuggling it in inside pageinspect. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] The case for removing replacement selection sort
Hi, On 08/31/2017 02:56 AM, Peter Geoghegan wrote: > On Wed, Aug 30, 2017 at 5:38 PM, Robert Haas <robertmh...@gmail.com> wrote: >> Wow. Just to be clear, I am looking for the BEST case for replacement >> selection, not the worst case. But I would have expected that case to >> be a win for replacement selection, and it clearly isn't. I can >> reproduce your results here. > > But I *was* trying to get a best case. That's why it isn't even worse. > That's what the docs say the best case is, after all. > > This is the kind of thing that replacement selection actually did do > better with on 9.6. I clearly remember Tomas Vondra doing lots of > benchmarking, showing some benefit with RS with a work_mem of 8MB or > less. As I said in my introduction on this thread, we weren't wrong to > add replacement_sort_tuples to 9.6, given where things were with > merging at the time. But, it does very much appear to create less than > zero benefit these days. The picture changed. > Do we need/want to repeat some of that benchmarking on these patches? I don't recall how much this code changed since those benchmarks were done in the 9.6 cycle. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] GnuTLS support
Hi, On 09/04/2017 04:24 PM, Bruce Momjian wrote: > On Fri, Sep 1, 2017 at 12:09:35PM -0400, Robert Haas wrote: >> I think that what this shows is that the current set of GUCs is overly >> OpenSSL-centric. We created a set of GUCs that are actually specific >> to one particular implementation but named them as if they were >> generic. My idea about this would be to actually rename the existing >> GUCs to start with "openssl" rather than "ssl", and then add new GUCs >> as needed for other SSL implementations. >> >> Depending on what we think is best, GUCs for an SSL implementation >> other than the one against which we compiled can either not exist at >> all, or can exist but be limited to a single value (e.g. "none", as we >> currently do when the compile has no SSL support at all). Also, we >> could add a read-only GUC with a name like ssl_library that exposes >> the name of the underlying SSL implementation - none, openssl, gnutls, >> or whatever. >> >> I think if we go the route of insisting that every SSL implementation >> has to use the existing GUCs, we're just trying to shove a square peg >> into a round hole, and there's no real benefit for users. If the >> string that has to be stuffed into ssl_ciphers differs based on which >> library was chosen at compile time, then you can't have a uniform >> default configuration for all libraries anyway. I think it'll be >> easier to explain and document this if there's separate documentation >> for openssl_ciphers, gnutls_ciphers, etc. rather than one giant >> documentation section that tries to explain every implementation >> separately. > > I am worried about having 3x version of TLS controls in > postgresql.conf, and only one set being active. Perhaps we need to > break out the TLS config to separate files or something. Anyway, this > needs more thought. > Well, people won't be able to set the inactive options, just like you can't set ssl=on when you build without OpenSSL support. 
But perhaps we could simply not include the inactive options into the config file, no? I don't see how breaking the TLS config into separate files improves the situation - instead of dealing with GUCs in a single file you'll be dealing with the same GUCs in multiple files. No? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
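Robert's proposal upthread might translate into a postgresql.conf along these lines. This is purely a hypothetical sketch -- apart from ssl and the ssl_ciphers default, none of these GUC names exist today:

```
# hypothetical sketch of the per-library GUC naming proposed upthread
ssl = on
ssl_library = 'openssl'          # read-only, reports the compiled-in library
openssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'  # only in OpenSSL builds
#gnutls_priorities = 'NORMAL'    # would only exist in GnuTLS builds
```

Under that scheme, a config file generated for an OpenSSL build would simply not contain the gnutls_* entries, which is one way of "not including the inactive options".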
Re: [HACKERS] Remove 1MB size limit in tsvector
Hi, On 08/17/2017 12:23 PM, Ildus Kurbangaliev wrote: > In my benchmarks when database fits into buffers (so it's measurement of > the time required for the tsvectors conversion) it gives me these > results: > > Without conversion: > > $ ./tsbench2 -database test1 -bench_time 300 > 2017/08/17 12:04:44 Number of connections: 4 > 2017/08/17 12:04:44 Database: test1 > 2017/08/17 12:09:44 Processed: 51419 > > With conversion: > > $ ./tsbench2 -database test1 -bench_time 300 > 2017/08/17 12:14:31 Number of connections: 4 > 2017/08/17 12:14:31 Database: test1 > 2017/08/17 12:19:31 Processed: 43607 > > I ran a bunch of these tests, and these results are stable on my > machine. So in these specific tests performance regression about 15%. > > Same time I think this could be the worst case, because usually data > is on disk and conversion will not affect so much to performance. > That seems like a fairly significant regression, TBH. I don't quite agree we can simply assume in-memory workloads don't matter, plenty of databases have 99% cache hit ratio (particularly when considering not just shared buffers, but also page cache). Can you share the benchmarks, so that others can retry running them? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
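The claim about in-memory workloads is easy to check on a given system. For instance, the shared-buffers hit ratio can be estimated from pg_stat_database (note this does not count the OS page cache, so the true in-memory fraction is even higher):

```sql
-- fraction of block requests served from shared buffers
SELECT datname,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
         AS shared_buffers_hit_pct
  FROM pg_stat_database
 WHERE datname = current_database();
```

On many OLTP systems this reports well above 99%, which is why a 15% regression in the cached case matters.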
Re: [HACKERS] Hooks to track changed pages for backup purposes
Hi, On 09/01/2017 08:13 AM, Andrey Borodin wrote: > Thank you for your reply, Michael! Your comments are valuable, especially in the world of backups. > >> 31 авг. 2017 г., в 19:44, Michael Paquier <michael.paqu...@gmail.com> написал(а): >> Such things are not Postgres-C like. > Will be fixed. > A few more comments: * The patch defines wal_switch_hook, but it's never called. * I see there are conditions like this: if(xlogreader->blocks[nblock].forknum == MAIN_FORKNUM) Why is it enough to restrict the block-tracking code to main fork? Aren't we interested in all relation forks? >> I don't understand what xlog_begin_insert_hook() is good for. > memset control structures and array of blocknos and relfilenodes of the size XLR_MAX_BLOCK_ID . > Why can't that happen in the other hooks? I guess you'll have to explain what the implementation of the hooks is supposed to do, and why these locations for hook calls are the right ones. It's damn impossible to validate the patch without that information. Assuming you still plan to use the hook approach ... >> There >> are no arguments fed to this hook, so modules would not be able to >> analyze things in this context, except shared memory and process >> state? > >> >> Those hooks are put in hot code paths, and could impact performance of >> WAL insertion itself. > I do not think sending few bytes to cached array is comparable to disk write of XLog record. Checking the func ptr is even cheaper with correct branch prediction. > That seems somewhat suspicious, for two reasons. Firstly, I believe we only insert the XLOG records into WAL buffer here, so why should there be any disk write related? Or do you mean the final commit? But more importantly, doesn't this kind of information require some durability guarantees? I mean, if it gets lost during server crashes or restarts, doesn't that mean the incremental backups might miss some buffers? I'd guess the hooks will have to do some sort of I/O, to achieve that, no? 
In any case, claims like this probably deserve some benchmarking. >> So you basically move the cost of scanning WAL >> segments for those blocks from any backup solution to the WAL >> insertion itself. Really, wouldn't it be more simple to let for >> example the archiver process to create this meta-data if you just want >> to take faster backups with a set of segments? Even better, you could >> do a scan after archiving N segments, and then use M jobs to do this >> work more quickly. (A set of background workers could do this job as >> well). > I like the idea of doing this during archiving. It is different trade-off between performance of OLTP and performance of backuping. Essentially, it is parsing WAL some time before doing backup. The best thing about it is usage of CPUs that are usually spinning in idle loop on backup machines. > >> In the backup/restore world, backups can be allowed to be taken at a >> slow pace, what matters is to be able to restore them quickly. > Backups are taken much more often than restored. > I agree. The ability to take backups quickly (and often) is just as important as fast recovery. >> In short, anything moving performance from an external backup code path >> to a critical backend code path looks like a bad design to begin with. >> So I am dubious that what you are proposing here is a good idea. > I will think about it more. This proposal takes vanishingly small part of backend performance, but, indeed, nonzero part. > But I think this is not necessarily just about code paths - moving it to the archiver or a bunch of background workers may easily have just as negative an effect (or even worse), as it's using the same CPUs, and has to actually re-read the WAL segments from disk (which may be a lot of I/O, particularly when done from multiple processes). From this POV, the idea to collect this information on the backup system (WAL archive) by pre-processing the arriving WAL segments seems like the most promising. 
It moves the work to another system, the backup system can make it as durable as the WAL segments, etc. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
Hi, Attached is an updated version of the patch, fixing the issues reported by Adrien Nayrat, and also a bunch of issues pointed out by valgrind. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 0001-MCV-lists.patch.gz Description: application/gzip 0002-multivariate-histograms.patch.gz Description: application/gzip -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
On 08/30/2017 02:19 AM, Andres Freund wrote: > On 2017-08-30 02:11:10 +0200, Tomas Vondra wrote: >> >> I'm not really following your reasoning. You may very well be right that >> the BeginInternalSubTransaction() example is not quite valid on postgres >> core, but I don't see how that implies there can be no other errors in >> the PG_TRY block. It was merely an explanation how I noticed this issue. >> >> To make it absolutely clear, I claim that the PG_CATCH() block is >> demonstrably broken as it calls AbortCurrentTransaction() and then >> accesses already freed memory. > > Yea, but that's not what happens normally. The txn memory isn't > allocated in the context created by the started [sub]transaction. I > think you're just seeing what you're seeing because an error thrown > before the BeginInternalSubTransaction() succeeds, will roll back the > *containing* transaction (the one that did the SQL SELECT * FROM > pg_*logical*() call), and free all the entire reorderbuffer stuff. > > >> The switch in the PG_TRY() blocks includes multiple elog(ERROR) calls in >> the switch, and AFAICS hitting any of them will have exactly the same >> effect as failure in BeginInternalSubTransaction(). > > No, absolutely not. Once the subtransaction starts, the > AbortCurrentTransaction() will abort that subtransaction, rather than > the containing one. > Ah, I see. You're right elog(ERROR) calls after the subtransaction start are handled correctly and don't trigger a segfault. Sorry for the noise and thanks for the explanation. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
On 08/30/2017 01:34 AM, Andres Freund wrote: > Hi, > > On 2017-08-30 01:27:34 +0200, Tomas Vondra wrote: >> I've been investigating some failures in test_decoding regression tests, >> and it seems to me the error-handling in ReorderBufferCommit() is >> somewhat broken, leading to segfault crashes. >> >> The problematic part is this: >> >> PG_CATCH() >> { >> /* >> * Force cache invalidation to happen outside of a valid transaction >> * to prevent catalog access as we just caught an error. >> */ >> AbortCurrentTransaction(); >> >> /* make sure there's no cache pollution */ >> ReorderBufferExecuteInvalidations(rb, txn); >> >> ... >> >> } >> >> Triggering it trivial - just add elog(ERROR,...) at the beginning of the >> PG_TRY() block. > > That's not really a valid thing to do, you should put it after the > BeginInternalSubTransaction()/StartTransactionCommand(). It's basically > assumed that those won't fail - arguably they should be outside of a > PG_TRY then, but that's a different matter. If you start decoding > outside of SQL failing before those will lead to rolling back the parent > tx... > > >> I suppose this is not quite intentional, but rather an example that >> error-handling code is an order of magnitude more complicated to write >> and test. I've only noticed as I'm investigating some regression >> failures on Postgres-XL 10, which does not support subtransactions and >> so the BeginInternalSubTransaction() call in the try branch always >> fails, triggering the issue. > > So, IIUC, there's no live problem in postgres core, besides some ugly & > undocumented assumptions? > I'm not really following your reasoning. You may very well be right that the BeginInternalSubTransaction() example is not quite valid on postgres core, but I don't see how that implies there can be no other errors in the PG_TRY block. It was merely an explanation how I noticed this issue. 
To make it absolutely clear, I claim that the PG_CATCH() block is demonstrably broken as it calls AbortCurrentTransaction() and then accesses already freed memory. The PG_TRY() block includes multiple elog(ERROR) calls in its switch statement, and AFAICS hitting any of them will have exactly the same effect as failure in BeginInternalSubTransaction(). And I suppose there are many other ways to trigger an error and get into the catch block. In other words, why would we have the block at all? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] error-handling in ReorderBufferCommit() seems somewhat broken
Hi, I've been investigating some failures in test_decoding regression tests, and it seems to me the error-handling in ReorderBufferCommit() is somewhat broken, leading to segfault crashes. The problematic part is this: PG_CATCH() { /* * Force cache invalidation to happen outside of a valid transaction * to prevent catalog access as we just caught an error. */ AbortCurrentTransaction(); /* make sure there's no cache pollution */ ReorderBufferExecuteInvalidations(rb, txn); ... } Triggering it is trivial - just add elog(ERROR,...) at the beginning of the PG_TRY() block. The problem is that AbortCurrentTransaction() apparently releases the memory where txn is allocated, making it entirely bogus. So in assert builds txn->invalidations is 0x7f7f7f7f7f7f7f7f, triggering a segfault. Similar issues apply to subsequent calls in the catch block that also use txn in some way (e.g. through snapshot_now). I suppose this is not quite intentional, but rather an example of how error-handling code is an order of magnitude more complicated to write and test. I only noticed it as I'm investigating some regression failures on Postgres-XL 10, which does not support subtransactions and so the BeginInternalSubTransaction() call in the try branch always fails, triggering the issue. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: multivariate histograms and MCV lists
On 08/17/2017 12:06 PM, Adrien Nayrat wrote:> > Hello, > > There is no check of "statistics type/kind" in > pg_stats_ext_mcvlist_items and pg_histogram_buckets. > > select stxname,stxkind from pg_statistic_ext ; stxname | stxkind > ---+- stts3 | {h} stts2 | {m} > > So you can call : > > SELECT * FROM pg_mcv_list_items((SELECT oid FROM pg_statistic_ext > WHERE stxname = 'stts3')); > > SELECT * FROM pg_histogram_buckets((SELECT oid FROM pg_statistic_ext > WHERE stxname = 'stts2'), 0); > > Both crashes. > Thanks for the report, this is clearly a bug. I don't think we need to test the stxkind; rather, what's missing is a check that the requested statistics type was actually built. > Unfotunately, I don't have the knowledge to produce a patch :/ > > Small fix in documentation, patch attached. > Thanks, will fix. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On 08/15/2017 09:55 PM, Robert Haas wrote: On Tue, Aug 15, 2017 at 3:42 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: What I think we should not do is interpret the bitmasks (omitting some of the information) assuming all the bits were set correctly. I'm still confused. HEAP_XMIN_COMMITTED|HEAP_XMIN_ABORTED == HEAP_XMIN_FROZEN. Nobody is proposing to omit anything; to the contrary, what's being proposed is not to display the same thing twice (and in a misleading fashion, to boot). I understand your point. Assume you're looking at this bit of code: if (HeapTupleHeaderXminCommitted(enumval_tup->t_data)) return; which is essentially if (enumval_tup->t_data & HEAP_XMIN_COMMITTED) return; If the function only gives you HEAP_XMIN_FROZEN, how likely is it you miss this actually evaluates as true? You might say that people investigating issues in this area of code should be aware of how HEAP_XMIN_FROZEN is defined, and perhaps you're right ... regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On 08/15/2017 07:54 PM, Robert Haas wrote: On Tue, Aug 15, 2017 at 9:59 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: I don't think so -- the "committed" and "invalid" meanings are effectively canceled when the "frozen" mask is present. I mean, "committed" and "invalid" contradict each other... FWIW I agree with Craig - the functions should output the masks raw, without any filtering. The reason is that when you're investigating data corruption or unexpected behavior, all this is very useful when reasoning about what might (not) have happened. Or at least make the filtering optional. I don't think "filtering" is the right way to think about it. It's just labeling each combination of bits with the meaning appropriate to that combination of bits. I mean, if you were displaying the contents of a CLOG entry, would you want the value 3 to be displayed as COMMITTED ABORTED SUBCOMMITTED because TRANSACTION_STATUS_COMMITTED|TRANSACTION_STATUS_ABORTED == TRANSACTION_STATUS_SUB_COMMITTED? I realize that you may be used to thinking of the HEAP_XMIN_COMMITTED and HEAP_XMAX_COMMITTED bits as two separate bits, but that's not really true any more. They're a 2-bit field that can have one of four values: committed, aborted, frozen, or none of the above. All I'm saying is that having the complete information (knowing which bits are actually set in the bitmask) is valuable when reasoning about how you might have gotten to the current state. Which I think is what Craig is after. What I think we should not do is interpret the bitmasks (omitting some of the information) assuming all the bits were set correctly. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] pageinspect function to decode infomasks
On 08/15/2017 03:24 PM, Robert Haas wrote: On Mon, Aug 14, 2017 at 9:59 PM, Craig Ringer <cr...@2ndquadrant.com> wrote: The bits are set, those macros just test to exclude the special meaning of both bits being set at once to mean "frozen". I was reluctant to filter out HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID when we detect that it's frozen, because that could well be misleading when debugging. I don't think so -- the "committed" and "invalid" meanings are effectively canceled when the "frozen" mask is present. I mean, "committed" and "invalid" contradict each other... FWIW I agree with Craig - the functions should output the masks raw, without any filtering. The reason is that when you're investigating data corruption or unexpected behavior, all this is very useful when reasoning about what might (not) have happened. Or at least make the filtering optional. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] PATCH : Generational memory allocator (was PATCH: two slab-like memory allocators)
Hi, Attached is a rebased version of the Generational context, originally submitted with SlabContext (which was already committed into Pg 10). The main change is that I've abandoned the pattern of defining a Data structure and then a pointer typedef, i.e. typedef struct GenerationContextData { ... } GenerationContextData; typedef struct GenerationContextData *GenerationContext; Now it's just typedef struct GenerationContext { ... } GenerationContext; mostly because SlabContext was committed like that, and because Andres was complaining about this code pattern ;-) Otherwise the design is the same as repeatedly discussed before. To show that this is still a valuable change (even after SlabContext and adding a doubly-linked list to AllocSet), I've repeated the test done by Andres in [1] using the test case described in [2], that is -- generate data SELECT COUNT(*) FROM (SELECT test1() FROM generate_series(1, 5)) foo; -- benchmark (measure time and VmPeak) SELECT COUNT(*) FROM (SELECT * FROM pg_logical_slot_get_changes('test', NULL, NULL, 'include-xids', '0')) foo; with different values passed to the first step (instead of the 5). The VmPeak numbers look like this:

   N       master       patched
  -----------------------------
  10   1155220 kB    361604 kB
  20   2020668 kB    434060 kB
  30   2890236 kB    502452 kB
  40   3751592 kB    570816 kB
  50   4621124 kB    639168 kB

and the timing (on assert-enabled build):

   N       master       patched
  -----------------------------
  10   1103.182 ms    412.734 ms
  20   2216.711 ms    820.438 ms
  30   3320.095 ms   1223.576 ms
  40   4584.919 ms   1621.261 ms
  50   5590.444 ms   2113.820 ms

So it seems it's still a significant improvement, both in terms of memory usage and timing. Admittedly, this is a single test, so ideas of other useful test cases are welcome. 
regards

[1] https://www.postgresql.org/message-id/20170227111732.vrx5v72ighehwpkf%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20160706185502.1426.28143%40wrigleys.postgresql.org

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

>From 1c46d25ffa9bb104c415cba7c7b3a013958b6ab5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@2ndquadrant.com>
Date: Mon, 14 Aug 2017 01:52:50 +0200
Subject: [PATCH] Generational memory allocator

This memory context is based on the assumption that the allocated chunks have similar lifespans, i.e. that chunks allocated close to each other (in time) will also be freed in close proximity, and mostly in the same order. This is typical for various queue-like use cases, i.e. when tuples are constructed, processed and then thrown away. The memory context uses a very simple approach to free space management. Instead of a complex global freelist, each block tracks the number of allocated and freed chunks. The space released by freed chunks is not reused, and once all chunks are freed (i.e. when nallocated == nfreed), the whole block is thrown away. When the allocated chunks have similar lifespans, this works very well and is extremely cheap. 
--- src/backend/replication/logical/reorderbuffer.c | 74 +-- src/backend/utils/mmgr/Makefile | 2 +- src/backend/utils/mmgr/generation.c | 768 src/include/nodes/memnodes.h| 4 +- src/include/nodes/nodes.h | 1 + src/include/replication/reorderbuffer.h | 15 +- src/include/utils/memutils.h| 5 + 7 files changed, 790 insertions(+), 79 deletions(-) create mode 100644 src/backend/utils/mmgr/generation.c diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 5567bee..5309170 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -150,15 +150,6 @@ typedef struct ReorderBufferDiskChange */ static const Size max_changes_in_memory = 4096; -/* - * We use a very simple form of a slab allocator for frequently allocated - * objects, simply keeping a fixed number in a linked list when unused, - * instead pfree()ing them. Without that in many workloads aset.c becomes a - * major bottleneck, especially when spilling to disk while decoding batch - * workloads. - */ -static const Size max_cached_tuplebufs = 4096 * 2; /* ~8MB */ - /* --- * primary reorderbuffer support routines * --- @@ -248,6 +239,10 @@ ReorderBufferAllocate(void) SLAB_DEFAULT_BLOCK_SIZE, sizeof(ReorderBufferTXN)
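The free-space management described in the commit message is simple enough to sketch in a few lines of C. This is a hypothetical, simplified illustration of the per-block accounting only (the names GenBlock, gen_block_alloc and gen_block_free are mine, not taken from the patch):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified sketch of the per-block accounting described
 * above -- not the actual generation.c code.  Each block only counts
 * allocated and freed chunks; freed space is never reused, and the whole
 * block becomes discardable once every chunk on it has been freed. */
typedef struct GenBlock
{
    size_t      nallocated;     /* chunks handed out from this block */
    size_t      nfreed;         /* chunks returned to this block */
} GenBlock;

/* Record an allocation from the block. */
static void
gen_block_alloc(GenBlock *block)
{
    block->nallocated++;
}

/* Record a free; returns true when the block can be discarded wholesale. */
static bool
gen_block_free(GenBlock *block)
{
    block->nfreed++;
    return block->nfreed == block->nallocated;
}
```

The point is that a free never touches a freelist; it only increments a counter, and the whole block is released once the counter catches up with the number of allocations, which is why this is so cheap for queue-like allocation patterns.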
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
On 7/25/17 5:04 PM, Tom Lane wrote: Tomas Vondra <tomas.von...@2ndquadrant.com> writes: On 7/25/17 12:55 AM, Tom Lane wrote: I think the planner basically assumes that reltuples is the live tuple count, so maybe we'd better change VACUUM to get in step. Attached is a patch that (I think) does just that. The disagreement was caused by VACUUM treating recently dead tuples as live, while ANALYZE treats both of those as dead. At first I was worried that this will negatively affect plans in the long-running transaction, as it will get underestimates (due to reltuples not including rows it can see). But that's a problem we already have anyway, you just need to run ANALYZE in the other session. This definitely will have some impact on plans, at least in cases where there's a significant number of unvacuumable dead tuples. So I think it's a bit late for v10, and I wouldn't want to back-patch at all. Please add to the next commitfest. I dare to disagree here, for two reasons. Firstly, the impact *is* already there, it only takes running ANALYZE. Or VACUUM ANALYZE. In both those cases we already end up with reltuples=n_live_tup. Secondly, I personally strongly prefer stable, predictable behavior over intermittent oscillations between two values. That's a major PITA on production, both to investigate and fix. So people already have this issue, although it only strikes randomly. And there's no way to fix it (well, except for fixing the cleanup, but that may not be possible). It is true we tend to run VACUUM more often than ANALYZE, particularly in situations where the cleanup can't proceed - ANALYZE will do its work and VACUUM will be triggered over and over again, so it "wins" this way. But I'm not sure that's something we should rely on. FWIW I personally see this as a fairly annoying bug, and would vote to backpatch it, although I understand people might object. But I don't quite see a reason not to fix this in v10. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
On 7/25/17 12:55 AM, Tom Lane wrote: Tomas Vondra <tomas.von...@2ndquadrant.com> writes: It seems to me that VACUUM and ANALYZE somewhat disagree on what exactly reltuples means. VACUUM seems to be thinking that reltuples = live + dead while ANALYZE apparently believes that reltuples = live The question is - which of the reltuples definitions is the right one? I've always assumed that "reltuples = live + dead" but perhaps not? I think the planner basically assumes that reltuples is the live tuple count, so maybe we'd better change VACUUM to get in step. Attached is a patch that (I think) does just that. The disagreement was caused by VACUUM treating recently dead tuples as live, while ANALYZE treats both of those as dead. At first I was worried that this will negatively affect plans in the long-running transaction, as it will get underestimates (due to reltuples not including rows it can see). But that's a problem we already have anyway, you just need to run ANALYZE in the other session. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index c5c194a..574ca70 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1261,7 +1261,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	vacrelstats->new_rel_tuples = vac_estimate_reltuples(onerel, false,
 														 nblocks,
 														 vacrelstats->tupcount_pages,
-														 num_tuples);
+														 num_tuples - nkeep);
 
 	/*
 	 * Release any remaining pin on visibility map page.
[HACKERS] VACUUM and ANALYZE disagreeing on what reltuples means
Hi, It seems to me that VACUUM and ANALYZE somewhat disagree on what exactly reltuples means. VACUUM seems to be thinking that

    reltuples = live + dead

while ANALYZE apparently believes that

    reltuples = live

This causes somewhat bizarre changes in the value, depending on which of those commands was executed last. To demonstrate the issue, let's create a simple table with 1M rows, delete 10% of the rows and then do a bunch of VACUUM / ANALYZE runs and check reltuples, n_live_tup and n_dead_tup in the catalogs. I've disabled autovacuum so that it won't interfere with this, and there's another transaction blocking VACUUM from actually cleaning any dead tuples.

    test=# create table t as select i from generate_series(1,1000000) s(i);
    test=# select reltuples, n_live_tup, n_dead_tup
             from pg_stat_user_tables join pg_class using (relname)
            where relname = 't';
     reltuples | n_live_tup | n_dead_tup
    -----------+------------+------------
         1e+06 |    1000000 |          0

So, that's nice. Now let's delete 10% of the rows, and run VACUUM and ANALYZE a few times.

    test=# delete from t where random() < 0.1;
    test=# vacuum t;
    test=# select reltuples, n_live_tup, n_dead_tup
             from pg_stat_user_tables join pg_class using (relname)
            where relname = 't';
     reltuples | n_live_tup | n_dead_tup
    -----------+------------+------------
         1e+06 |     900413 |      99587

    test=# analyze t;
     reltuples | n_live_tup | n_dead_tup
    -----------+------------+------------
        900413 |     900413 |      99587

    test=# vacuum t;
     reltuples | n_live_tup | n_dead_tup
    -----------+------------+------------
         1e+06 |     900413 |      99587

So, analyze and vacuum disagree. To further confuse the poor DBA, VACUUM always simply ignores the old values while ANALYZE combines the old and new values on large tables (and converges to the "correct" value after a few steps). This table is small (less than 30k pages), so ANALYZE does not do that. This is quite annoying, because people tend to look at reltuples while investigating bloat (e.g. because the check_postgres query mentioned on our wiki [1] uses reltuples in the formula). 
[1] https://wiki.postgresql.org/wiki/Show_database_bloat And when the cleanup is blocked for some reason (as in the example above), VACUUM tends to run much more often (because it can't clean up anything). So reltuples tends to be set to the higher value, which I'd argue is the wrong value for estimating bloat. I haven't looked at the code yet, but I've confirmed this happens both on 9.6 and 10. I haven't checked older versions, but I guess those are affected too. The question is - which of the reltuples definitions is the right one? I've always assumed that "reltuples = live + dead" but perhaps not? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
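For illustration, here is a from-memory C sketch of the kind of incremental update vac_estimate_reltuples performs when only part of the table was scanned - treat the exact formula and the names as my approximation, not a quote of the real code (the final rounding is also omitted):

```c
/* Hypothetical sketch of an incremental reltuples estimate: blend the old
 * tuple density with the density observed on the scanned pages, weighted
 * by the fraction of the table that was actually scanned.  This is an
 * approximation of what vac_estimate_reltuples does, not the actual code. */
static double
estimate_reltuples(double old_rel_pages, double old_rel_tuples,
                   double scanned_pages, double scanned_tuples,
                   double total_pages)
{
    double      old_density = old_rel_tuples / old_rel_pages;
    double      new_density = scanned_tuples / scanned_pages;
    double      multiplier = scanned_pages / total_pages;
    double      updated_density;

    /* move the old density toward the observed one, proportionally to
     * how much of the table we actually looked at */
    updated_density = old_density + (new_density - old_density) * multiplier;

    return updated_density * total_pages;
}
```

A partial scan only nudges the estimate toward the newly observed density, which is why repeated runs converge to the new value over a few steps rather than jumping straight to it, matching the convergence behavior described above.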
Re: [HACKERS] Pluggable storage
Hi, On 06/26/2017 05:18 PM, Alexander Korotkov wrote: Hackers, I see that the design question for PostgreSQL pluggable storages is very hard. IMHO it's mostly expected to be hard. Firstly, PostgreSQL is a mature product with many advanced features, and reworking a low-level feature without breaking something on top of it is hard by definition. Secondly, project policies and code quality requirements set the bar very high too, I think. > BTW, I think it's worth analyzing existing use cases of pluggable storages. I think that the most famous DBMS with a pluggable storage API is MySQL. This is why I decided to start with it. I've added a MySQL/MariaDB section on the wiki page. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB It appears that a significant part of MySQL storage engines are misuses. MySQL lacks various features like FDWs or writable views and so on. This is why developers created a lot of pluggable storages for those purposes. We definitely don't want something like this in PostgreSQL now. I created a special resume column where I expressed whether it would be nice to have something like this table engine in PostgreSQL. I don't want to discourage you, but I'm not sure how valuable this is. I agree it's valuable to have an overview of use cases for pluggable storage, but I don't think we'll get that from looking at MySQL. As you noticed, most of the storage engines are misuses, so it's difficult to learn anything valuable from them. You can argue that using FDWs to implement alternative storage engines is a misuse too, but at least that gives us a valid use case (columnar storage implemented using FDW). 
If anything, the MySQL storage engines should serve as a cautionary tale of how not to do things - there are also plenty of references in the "Restrictions and Limitations" section of the MySQL manual: https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Hi, On 6/23/17 10:38 AM, Teodor Sigaev wrote: 1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Currently TID has its own type in the system catalog. It seems possible that the storage claims the TID type which it uses. Then the AM could claim it too, so the core could use that information to decide the question of AM-storage compatibility. The storage could also claim that it has no TID type at all, so it couldn't be used with any access method; use case: compressed column-oriented storage. Isn't the fact that TID is an existing type defined in system catalogs a fairly insignificant detail? I mean, we could just as easily define a new 64-bit locator data type, and use that instead, for example. The main issue here is that we assume things about the TID contents, i.e. that it contains page/offset etc. And Bitmap nodes rely on that, to some extent - e.g. when prefetching data. > As I remember, only GIN depends on the TID format; other indexes use it as an opaque type. Except, at least, btree and GiST - they believe that internal pointers are the same as the outer ones (to heap). I think you're probably right - GIN does compress the posting lists by exploiting the TID redundancy (that it's a page/offset structure), and I don't think there are other index AMs doing that. But I'm not sure we can simply rely on that - it's possible people will try to improve other index types (e.g. by adding similar compression to them). Moreover we now support extensions defining custom index AMs, and we have no clue what people may do in those. So this would clearly require some sort of flag for each index AM. > Another dubious part - BitmapScan. It would be really great if you could explain why BitmapScans are dubious, instead of just labeling them as dubious. (I guess you mean Bitmap Heap Scans, right?) 
I see no conceptual issue with bitmap scans on arbitrary locator types, as long as there's sufficient locality information encoded in the value. What I mean by that is that for any two locator values A and B:

    (1) if (A < B) then (A is stored before B)
    (2) if (A is close to B) then (A is stored close to B)

Without these properties it's probably futile to try to do bitmap scans, because the bitmap would not result in a mostly sequential access pattern and things like prefetch would not be very efficient, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
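To make properties (1) and (2) concrete, here is a hypothetical C sketch of a 64-bit locator that satisfies both: the high bits order by physical block and the low bits by position within it, so plain integer comparison matches on-disk order. All names here are made up for illustration, not from any proposed patch:

```c
#include <stdint.h>

/* Hypothetical 64-bit locator: the high 48 bits identify the physical
 * block, the low 16 bits the position within it.  Comparing locators as
 * plain integers then agrees with storage order, which is exactly what a
 * bitmap scan needs to produce a mostly sequential access pattern. */
typedef uint64_t Locator;

static Locator
locator_make(uint64_t block, uint16_t offset)
{
    return (block << 16) | offset;
}

/* Extract the block part, e.g. for bitmap membership or prefetching. */
static uint64_t
locator_block(Locator loc)
{
    return loc >> 16;
}
```

A bitmap scan over such locators can group them by locator_block(), sort the blocks, and prefetch in ascending order - property (1) guarantees that this visits the storage sequentially.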
Re: [HACKERS] Pluggable storage
Hi, On 6/22/17 4:36 PM, Alexander Korotkov wrote: On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas <robertmh...@gmail.com> wrote: On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paqu...@gmail.com> > wrote: >> Putting that in a couple of words. >> 1. Table AM with a 6-byte TID. >> 2. Table AM with a custom locator format, which could be TID-like. >> 3. Table AM with no locators. >> >> Getting into having #1 first to work out would already be really >> useful for users. > > What exactly would be useful for *users*? Any kind of API itself is > completely useless for users, because they are users, not developers. > Storage API could be useful for developers to implement storage AMs which in > turn could be useful for users. What's your point? I assume that is what Michael meant. TBH, I don't understand what particular enhancements we expect from having #1. I'd say that's one of the things we're trying to figure out in this thread. Does it make sense to go with #1 in v1 of the patch, or do we have to implement #2 or #3 to get some real benefit for the users? > This is why it's hard for me to say if #1 is a good idea. It's also hard for me to say if the patch upthread is the right way of implementing #1. But I have a gut feeling that even if #1 is a good idea itself, it's definitely not what users expect from "pluggable storages". The question is "why" you think that. What features do you expect from a pluggable storage API that would be impossible to implement with option #1? I'm not trying to be annoying or anything like that - I don't know the answer, and discussing those things is exactly why this thread exists. I do agree #1 has limitations, and that it'd be great to get an API that supports all kinds of pluggable storage implementations. But I guess that'll take some time. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Hi, On 6/21/17 9:47 PM, Robert Haas wrote: ... > like int8 or numeric, it won't work at all. Even for other things, it's going to cause problems because the bit patterns won't be what the code is expecting; e.g. bitmap scans care about the structure of the TID, not just how many bits it is. (Due credit: Somebody, maybe Alvaro, pointed out this problem before, at PGCon.) Can you elaborate a bit more on these TID bit pattern issues? I do remember that not all TIDs are valid due to safeguards on individual fields, like for example

    Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits))

But perhaps there are some other issues? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Parallel Aggregation support for aggregate functions that use transitions not implemented for array_agg
Hi, On 6/7/17 5:52 AM, Regina Obe wrote: On 6/6/17 13:52, Regina Obe wrote: It seems CREATE AGGREGATE was expanded in 9.6 to support parallelization of aggregate functions using transitions, with the addition of serialfunc and deserialfunc to the aggregate definitions. https://www.postgresql.org/docs/10/static/sql-createaggregate.html I was looking at the PostgreSQL 10 source code for some example usages of this and was hoping that array_agg and string_agg would support the feature. I'm not sure how you would parallelize these, since in most uses you want to have a deterministic output order. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services Good point. If that's the reason it wasn't done, that's good, I just wasn't sure. But if you didn't have an ORDER BY in your aggregate usage, and you did have those transition functions, it shouldn't be any different from any other use case, right? I imagine you are right that most folks who use array_agg and string_agg usually combine it with array_agg(... ORDER BY ..) I think what TL had in mind is something like

    SELECT array_agg(x) FROM (
        SELECT x FROM bar ORDER BY y
    ) foo;

i.e. a subquery producing the data in a predictable order. > My main reason for asking is that most of the PostGIS geometry and raster aggregate functions use transitions and were patterned after array_agg. In the case of PostGIS the sorting is done internally, and really only to take advantage of things like cascaded union algorithms. That is always done though (so even if each worker does it on just its batch, that's still better than having only one worker). 
So I think it's still very beneficial to break the work into separate jobs, since in the end the gather will have, say, 2 biggish geometries or 2 biggish rasters to union if you have 2 workers, which is still better than having a million smallish geometries/rasters to union. I'm not sure I got your point correctly, but if you can (for example) sort the per-worker results as part of the "serialize" function, and benefit from that while combining them in the gather, then sure, that should be a huge win. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
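The merge-at-gather idea above is easy to sketch. Assuming (hypothetically) that each worker ships a sorted partial state, the combine step at the gather node is just a merge of a few sorted inputs rather than a sort of millions of elements:

```c
#include <stddef.h>

/* Minimal sketch (not PostGIS/PostgreSQL code): merge two sorted partial
 * results, as a combine function could do at the gather node.  With two
 * workers, the leader merges two big presorted states instead of sorting
 * a million small inputs itself. */
static size_t
merge_sorted(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t      i = 0, j = 0, k = 0;

    /* take the smaller head element until one input runs dry */
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    /* copy whatever remains of either input */
    while (i < na)
        out[k++] = a[i++];
    while (j < nb)
        out[k++] = b[j++];
    return k;
}
```

The merge is linear in the total input size, so the per-worker sorting cost is the part that actually gets parallelized.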
Re: [HACKERS] WIP: Data at rest encryption
Hi all, I've noticed this thread got resurrected a few days ago, but I haven't managed to read all the messages until today. I do have a bunch of comments, but let me share them as a single consistent message instead of sending a thousand responses to individual messages.

1) Threat model

Firstly, I think the thread would seriously benefit from an explanation and discussion of the threat model - what types of attacks it's meant to address, and what attacks it can't defend against. My understanding is that data-at-rest encryption generally addresses only the "someone stole the disk" case and pretty much nothing else. Moreover, I claim that encryption implemented at the database level is strictly weaker compared to LUKS or encrypted disks, because it simply reveals a lot more information even without decryption (file sizes, timestamps, etc.) That is a serious issue in practice, and researchers have been proving that for a long time now. I do recommend this paper from Cornell Tech as a great starting point (it cites many papers relevant to this thread):

    Why Your Encrypted Database Is Not Secure
    Paul Grubbs, Thomas Ristenpart, Vitaly Shmatikov
    http://eprint.iacr.org/2017/468.pdf

The paper explains how encryption schemes on general-purpose databases fail, due to exactly such side channels. MVCC, logging and other side channels turn all attackers into "persistent passive attackers". Now, this does not mean the feature is useless - nothing is perfect, and security is not a binary feature. It certainly makes attacks more difficult compared to a plaintext database. But it's untrue that it's basically LUKS, just implemented at the database level. I'm not suggesting that we should not pursue this idea, but the threat model is a crucial piece of information missing in this thread.

2) How do other databases do it?

It was repeatedly mentioned that other databases support this type of encryption. So how do they deal with the hard parts? 
For example, how do they get and protect the encryption key? I agree with Stephen that we should not require full key management from v1 of the patch; that's an incredibly difficult thing. And it largely depends on the hardware (e.g. it should be possible to move the key to TrustZone on ARM / SGX on Intel).

3) Why do users prefer this to FDE?

I suppose we're discussing this feature because we've been asked about it by users/customers who can't use FDE. Can we reach out to them and ask them about the reasons? Why can't they use FDE? It was mentioned in the thread that the FDE solutions are not portable between different systems, but honestly - is that an issue? You probably can't copy the datadir anyway, due to locale differences. If you're running multiple operating systems, FDE is just one of many differences.

4) Other solutions?

Clearly, FDE (at the block device level) and DB-level encryption are not the only solutions. There are also filesystem-level solutions now, for example. ext4 (since kernel 4.1) and f2fs (since kernel 4.2) allow encryption at the directory level, are transparent to user space, and include things like key management (well, that's provided by the kernel). NTFS can do something quite similar using EFS.

https://lwn.net/Articles/639427/
https://blog.quarkslab.com/a-glimpse-of-ext4-filesystem-level-encryption.html

Of course, if you happen to use another filesystem (e.g. XFS), this won't work for you. But then there's eCryptfs, for example:

https://en.wikipedia.org/wiki/ECryptfs

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
Hi, On 6/11/17 7:54 PM, Peter Geoghegan wrote: On Sun, Jun 11, 2017 at 10:36 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: Do you mean teaching the optimizer to do something like this?: Uh, no. I don't think we want to add any run-time checks. The point in this example is that we'd get a better rowcount estimate if we noticed that the FK constraint could be considered while estimating the size of the partsupp-to-aggregated-subquery join. Sorry for not considering the context of the thread more carefully. Robert said something about selectivity estimation and TPC-H to me, which I decided to research; I then rediscovered this thread. Clearly Q20 is designed to reward systems that do better with moving predicates into subqueries, as opposed to systems with better selectivity estimation. I do strongly recommend reading this paper analyzing the choke points of individual TPC-H queries: http://oai.cwi.nl/oai/asset/21424/21424B.pdf While it's slightly orthogonal to the issue at hand (a poor estimate in Q20 causing the choice of an inefficient plan), it's a great paper to read. I thought I'd already posted a link to this paper sometime in the past, but I don't see it in the archives. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] TPC-H Q20 from 1 hour to 19 hours!
Hi, On 5/25/17 6:03 AM, Robert Haas wrote: On Thu, Apr 6, 2017 at 4:37 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: Which brings me to the slightly suspicious bit. On 9.5, there's no difference between GROUP and GROUP+LIKE cases - the estimates are exactly the same in both cases. This is true too, but only without the foreign key between "partsupp" and "part", i.e. the two non-grouped relations in the join. And what's more, the difference (1737 vs. 16) is pretty much exactly 100x, which is the estimate for the LIKE condition. I don't follow this. How does the foreign key between partsupp and part change the selectivity of LIKE? So it kinda seems 9.5 does not apply this condition for semi-joins, while >= 9.6 does. Well, get_foreign_key_join_selectivity() does handle restrictions when calculating the joinrel size estimate in calc_joinrel_size_estimate(), so assuming there's some thinko, it might easily cause this. I haven't found any such thinko, but I don't dare to claim I fully understand what the current version of get_foreign_key_join_selectivity does :-/ If 9.5 and prior are ignoring some of the quals, that's bad, but I don't think I understand from your analysis why 7fa93eec4e0c9c3e801e3c51aa4bae3a38aaa218 regressed anything. It's been quite a bit of time since I looked into this, but I think my main point was that it's hard to call it a regression when both the old and new estimates are so utterly wrong. I mean, 9.5 estimates 160, 9.6 estimates 18. We might fix the post-9.6 estimate to return the same value as 9.5, and it might fix this particular query at this particular scale. But in the end it's just noise considering that the actual value is 120k (so 3 or 4 orders of magnitude off). 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WITH clause in CREATE STATISTICS
On 5/12/17 4:46 AM, Tom Lane wrote: Alvaro Herrera <alvhe...@2ndquadrant.com> writes: Hmm, yeah, makes sense. Here's a patch for this approach. ... Also, while reading the docs changes, I got rather ill from the inconsistent and not very grammatical handling of "statistics" as a noun, particularly the inability to decide from one sentence to the next if that is singular or plural. In the attached, I fixed this in the ref/*_statistics.sgml files by always saying "statistics object" instead. If people agree that that reads better, we should make an effort to propagate that terminology elsewhere in the docs, and into error messages, objectaddress.c output, etc. I'm fine with the 'statistics object' wording. I've been struggling with this bit while working on the patch, and I agree it sounds better and getting it consistent is definitely worthwhile. > Although I've not done anything about it here, I'm not happy about the handling of dependencies for stats objects. I do not think that cloning RemoveStatistics as RemoveStatisticsExt is sane at all. The former is meant to deal with cleaning up pg_statistic rows that we know will be there, so there's no real need to expend pg_depend overhead to track them. For objects that are only loosely connected, the right thing is to use the dependency system; in particular, this design has no real chance of working well with cross-table stats. Also, it's really silly to have *both* this hard-wired mechanism and a pg_depend entry; the latter is surely redundant if you have the former. IMO we should revert RemoveStatisticsExt and instead handle things by making stats objects auto-dependent on the individual column(s) they reference (not the whole table). I'm also of the opinion that having an AUTO dependency, rather than a NORMAL dependency, on the stats object's schema is the wrong semantics. There isn't any other case where you can drop a non-empty schema without saying CASCADE, and I'm mystified why this case should act that way. 
Yeah, it's a bit frankensteinian ... > Lastly, I tried the example given in the CREATE STATISTICS reference page, and it doesn't seem to work. Without the stats object, the two queries are both estimated at one matching row, whereas they really produce 100 and 0 rows respectively. With the stats object, they seem to both get estimated at ~100 rows, which is a considerable improvement for one case but a considerable disimprovement for the other. If this is not broken, I'd like to know why not. What's the point of an expensive extended-stats mechanism if it can't get this right? I assume you're talking about the functional dependencies, and in that case that's expected behavior, because functional dependencies do assume the conditions are "consistent" with the dependencies. This is an inherent limitation of functional dependencies, and perhaps a price for the simplicity of this statistics type. If your application executes queries that are likely not consistent with this, don't use functional dependencies. The simplicity is why dependencies were implemented first, mostly to introduce all the infrastructure. The other statistics types (MCV, histograms) don't have this limitation, but those did not make it into pg10. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
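To illustrate why inconsistent conditions break functional dependencies, here is a hypothetical sketch of how a dependency of degree f between columns a and b could be folded into a selectivity estimate. The formula is my approximation of the idea, not a quote of the actual dependencies code:

```c
/* Hypothetical sketch: given a functional dependency a => b of degree f
 * (the fraction of rows where the value of a implies the value of b),
 * combine the per-clause selectivities P(a) and P(b) as
 *
 *     s = P(a) * (f + (1 - f) * P(b))
 *
 * For f close to 1, P(b) is mostly ignored -- correct when the clause on
 * b is consistent with the dependency, but a large overestimate when it
 * isn't (e.g. a value of b that contradicts a and matches zero rows). */
static double
dependency_selectivity(double degree, double sel_a, double sel_b)
{
    return sel_a * (degree + (1.0 - degree) * sel_b);
}
```

With degree 1.0 the estimate is simply P(a), which is exactly the "~100 rows for both queries" behavior Tom observed: the clause on b cannot pull the estimate down, even when it is contradictory.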
Re: [HACKERS] CTE inlining
On 5/4/17 8:03 PM, Joe Conway wrote: On 05/04/2017 10:56 AM, Andrew Dunstan wrote: On 05/04/2017 01:52 PM, Joe Conway wrote: On 05/04/2017 10:33 AM, Alvaro Herrera wrote: I'm not sure what your point is. We know that for some cases the optimization barrier semantics are useful, which is why the proposal is to add a keyword to install one explicitly:

    with materialized r as (
        select json_populate_record(null::mytype, myjson) as x
        from mytable
    )
    select (x).* from r;

this would preserve the current semantics. I haven't been able to follow this incredibly long thread, so please excuse me if way off base, but are we talking about a CTE being silently rewritten as an inline expression, potentially, unless it is decorated with some new syntax? I would find that very disconcerting myself. For example, would this CTE potentially get rewritten with multiple evaluation as follows?

    DROP SEQUENCE IF EXISTS foo_seq;
    CREATE SEQUENCE foo_seq;

    WITH a(f1) AS (SELECT nextval('foo_seq'))
    SELECT a.f1, a.f1 FROM a;
     f1 | ?column?
    ----+----------
      1 |        1
    (1 row)

    ALTER SEQUENCE foo_seq RESTART;
    SELECT nextval('foo_seq'), nextval('foo_seq');
     nextval | ?column?
    ---------+----------
           1 |        2
    (1 row)

I think that would be a change in semantics, which we should definitely not be getting. Avoiding a change in semantics might be an interesting exercise, but we have lots of clever coders ... Well, I think my point is that I have always understood CTEs to be executed precisely once, producing a temporary result set that is then referenced elsewhere. I don't think that property of CTEs should change. Somewhere else in the thread someone mentioned predicate push down -- that makes sense, and maybe some clever coder can come up with a patch that does that, but I would not be in favor of CTEs being inlined and therefore evaluated multiple times. I agree with this, but there's a difference between "executed exactly once" and "producing the same result as if executed exactly once". 
I may be misunderstanding what other people proposed in this thread, but I think the plan was to only inline CTEs where we know it won't change the results, etc. So e.g. CTEs with volatile functions would not get inlined, which includes nextval() for example. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
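To illustrate the distinction, a sketch of which CTEs could and could not be inlined (the MATERIALIZED decorator is the syntax proposed upthread, not something the server accepts today; the table and column names in the first query are made up):

```sql
-- a candidate for inlining: read-only, no volatile functions, so the
-- planner could push the id predicate down into the CTE
WITH r AS (
    SELECT * FROM orders WHERE status = 'open'
)
SELECT * FROM r WHERE id = 42;

-- must NOT be inlined: nextval() is volatile, and inlining could
-- change how many times it is evaluated
WITH a(f1) AS (SELECT nextval('foo_seq'))
SELECT a.f1, a.f1 FROM a;

-- proposed explicit fence, preserving today's single-evaluation semantics
WITH MATERIALIZED r AS (
    SELECT json_populate_record(null::mytype, myjson) AS x FROM mytable
)
SELECT (x).* FROM r;
```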
Re: [HACKERS] CTE inlining
On 5/4/17 7:56 PM, Andrew Dunstan wrote: On 05/04/2017 01:52 PM, Joe Conway wrote: On 05/04/2017 10:33 AM, Alvaro Herrera wrote:

I'm not sure what your point is. We know that for some cases the optimization barrier semantics are useful, which is why the proposal is to add a keyword to install one explicitly:

with materialized r as
(
    select json_populate_record(null::mytype, myjson) as x
    from mytable
)
select (x).* from r;

this would preserve the current semantics.

I haven't been able to follow this incredibly long thread, so please excuse me if way off base, but are we talking about a CTE potentially being silently rewritten as an inline expression unless it is decorated with some new syntax? I would find that very disconcerting myself. For example, would this CTE potentially get rewritten with multiple evaluation as follows?

DROP SEQUENCE IF EXISTS foo_seq;
CREATE SEQUENCE foo_seq;

WITH a(f1) AS (SELECT nextval('foo_seq'))
SELECT a.f1, a.f1 FROM a;

 f1 | ?column?
----+----------
  1 |        1
(1 row)

ALTER SEQUENCE foo_seq RESTART;

SELECT nextval('foo_seq'), nextval('foo_seq');

 nextval | ?column?
---------+----------
       1 |        2
(1 row)

I think that would be a change in semantics, which we should definitely not be getting. Avoiding a change in semantics might be an interesting exercise, but we have lots of clever coders ...

nextval is a volatile function, and in my understanding the plan was not to inline CTEs with volatile functions (or CTEs doing writes etc.). That is, we'd guarantee the same results as we get now.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WITH clause in CREATE STATISTICS
On 5/3/17 11:36 PM, Alvaro Herrera wrote: Andrew Dunstan wrote: On 05/03/2017 04:42 PM, Tom Lane wrote: One other point is that as long as we've got reserved keywords introducing each clause, there isn't actually an implementation reason why we couldn't accept the clauses in any order. Not sure I want to document it that way, but it might not be a bad thing if the grammar was forgiving about whether you write the USING or ON part first ... +1 for allowing arbitrary order of clauses. I would document it with the USING clause at the end, and have that be what psql supports and pg_dump produces. Since there are no WITH options now we should leave that out until it's required. Ok, sounds good to me. Unless there are objections I'm going to have a shot at implementing this. Thanks for the discussion. Works for me. Do you also plan to remove the parentheses for the USING clause? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 5/3/17 9:54 PM, Andreas Karlsson wrote: On 05/03/2017 07:33 PM, Alvaro Herrera wrote:

1) we switch unmarked CTEs as inlineable by default in pg11. What seems likely to happen for a user that upgrades to pg11 is that 5 out of 10 CTE-using queries are going to become faster than with pg10, and they are going to be happy; 4 out of 10 are going to see no difference, but they didn't have to do anything about it; and the remaining query is going to become slower, either indistinguishably so (in which case they don't care and they remain happy because of the other improvements) or notably so, in which case they can easily figure where to add the MATERIALIZED option and regain the original performance.

2) unmarked CTEs continue to be an optimization barrier, but we add "WITH INLINED" so that they're inlineable. Some users may wonder about it and waste a lot of time trying to figure out which CTEs to add it to. They see a benefit in half the queries, which makes them happy, but they are angry that they had to waste all that time on the other queries.

3) We don't do anything, because we all agree that GUCs are not suitable. No progress. No anger, but nobody is happy either.

+1 for option 1. And while I would not like it if we had to combine it with a backwards compatibility GUC which enables the old behavior to get it merged, I still personally would prefer that over options 2 and 3.

Andreas

+1 to what Andreas says

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 5/2/17 11:23 PM, Merlin Moncure wrote: On Tue, May 2, 2017 at 12:05 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: On 5/2/17 6:34 PM, David Fetter wrote: On Tue, May 02, 2017 at 02:40:55PM +0200, Andreas Karlsson wrote: On 05/02/2017 04:38 AM, Craig Ringer wrote: On 1 May 2017 at 22:26, Andreas Karlsson <andr...@proxel.se> wrote: ...

I see some alternatives, none of them perfect.

1. Just remove the optimization fence and let people add OFFSET 0 to their queries if they want an optimization fence. This lets us keep pretending that we do not have query hints (and therefore do not have to formalize any syntax for them) while still allowing people to add optimization fences.

+1

I get that people with gigantic PostgreSQL installations with stringent performance requirements sometimes need to do odd things to squeeze out the last few percentage points of performance. As the people (well, at least the people close to the ground) at these organizations are fully aware, performance optimizations are extremely volatile with respect to new versions of software, whether it's PostgreSQL, Oracle, the Linux kernel, or what have you. They expect this, and they have processes in place to handle it. If they don't, it's pilot error. We should not be penalizing all our other users to maintain the fiction that people can treat performance optimizations as a "fire and forget" matter.

Agreed.

2. Add a decorator for WITH (e.g. "WITH MATERIALIZED x (...") to add an explicit optimization fence. This will for the first time add official support for a query hint in the syntax which is a quite big precedent.

Yep. It's one we should think very carefully before we introduce.

I think it's a mistake to see this as an introduction of query hints. Firstly, it's a question whether it qualifies as a hint. I wouldn't call it a hint, but let's assume there is a definition of query hints that includes WITH MATERIALIZED. More importantly, however, this is not introducing anything new.
It's just a different name for the current "WITH" semantics, and you can achieve the same behavior by "OFFSET 0". And people are already using these as hints, so I fail to see how this introduces anything new. In fact, if you see the optimization fence as an implicit query hint, this actually *removes* a hint (although most users are unaware of that behavior and use it unintentionally).

+1 down the line. More to the point, for several years now we've (or at least I, but I'm not the only one) have been advocating for the usage of CTE to avoid the undocumented and bizarre OFFSET 0 trick. Jerking this out from users without giving a simple mechanic to get the same behavior minus a major query rewrite is blatantly user hostile. I can't believe we're even contemplating it. Also a GUC is not a solution for pretty obvious reasons I think.

I'm not sure what you mean by "jerking this out from users". Isn't most of this thread about how to allow CTE inlining without hurting users unnecessarily? I think we agree that:

* Just removing the optimization fence and telling users to use OFFSET 0 instead is a no-go, just like removing the fencing and not providing any sensible replacement.

* GUC is not the solution.

Which leaves us with either WITH INLINE or WITH MATERIALIZE, or something along those lines. If we go with WITH INLINE then we're likely not solving anything, because most people will simply use WITH just like now, and will be subject to the fencing without realizing it. Or we will choose WITH MATERIALIZE, and then the users aware of the fencing (and using the CTEs for that purpose) will have to modify the queries. But does adding MATERIALIZE qualify as a major query rewrite?

Perhaps combining this with a GUC would be a solution. I mean, a GUC specifying the default behavior, and then INLINE / MATERIALIZE for individual CTEs in a query?
If you have an application intentionally using CTEs as a fence, just do ALTER DATABASE x SET enable_guc_fencing = on and you don't have to rewrite the queries. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 5/2/17 6:34 PM, David Fetter wrote: On Tue, May 02, 2017 at 02:40:55PM +0200, Andreas Karlsson wrote: On 05/02/2017 04:38 AM, Craig Ringer wrote: On 1 May 2017 at 22:26, Andreas Karlsson <andr...@proxel.se> wrote: ...

I see some alternatives, none of them perfect.

1. Just remove the optimization fence and let people add OFFSET 0 to their queries if they want an optimization fence. This lets us keep pretending that we do not have query hints (and therefore do not have to formalize any syntax for them) while still allowing people to add optimization fences.

+1

I get that people with gigantic PostgreSQL installations with stringent performance requirements sometimes need to do odd things to squeeze out the last few percentage points of performance. As the people (well, at least the people close to the ground) at these organizations are fully aware, performance optimizations are extremely volatile with respect to new versions of software, whether it's PostgreSQL, Oracle, the Linux kernel, or what have you. They expect this, and they have processes in place to handle it. If they don't, it's pilot error. We should not be penalizing all our other users to maintain the fiction that people can treat performance optimizations as a "fire and forget" matter.

Agreed.

2. Add a decorator for WITH (e.g. "WITH MATERIALIZED x (...") to add an explicit optimization fence. This will for the first time add official support for a query hint in the syntax which is a quite big precedent.

Yep. It's one we should think very carefully before we introduce.

I think it's a mistake to see this as an introduction of query hints. Firstly, it's a question whether it qualifies as a hint. I wouldn't call it a hint, but let's assume there is a definition of query hints that includes WITH MATERIALIZED. More importantly, however, this is not introducing anything new. It's just a different name for the current "WITH" semantics, and you can achieve the same behavior by "OFFSET 0".
And people are already using these as hints, so I fail to see how this introduces anything new. In fact, if you see the optimization fence as an implicit query hint, this actually *removes* a hint (although most users are unaware of that behavior and use it unintentionally).

3. Add a new GUC which can enable and disable the optimization fence. This is a very clumsy tool, but maybe good enough for some users, and some people here in this thread have complained about our similar GUCs.

Any GUC would be unable to distinguish one WITH clause from another. The hammer would then be guaranteed to be too big for precisely the cases where it's most needed. If I could, I'd give -1 million to a GUC-based approach, as that would make it entirely unusable in practice, I think.

Actually, I can give -1 million, so I'm giving it.

4. Add some new more generic query hinting facility. This is a lot of work and something which would be very hard to get consensus for.

Just the design of the thing would be the work of months at a minimum, assuming we got to some consensus at all. Maybe it's worth doing.

While I came to the conclusion that query hints may be quite useful in some situations, I'm pretty sure this is not a battle you'd like to fight.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 5/2/17 4:44 PM, Andrew Dunstan wrote: On 05/02/2017 10:13 AM, Merlin Moncure wrote: On Sun, Apr 30, 2017 at 6:21 PM, Andres Freund <and...@anarazel.de> wrote: On 2017-04-30 07:19:21 +0200, Pavel Stehule wrote:

why we cannot to introduce GUC option - enable_cteoptfence ?

Doesn't really solve the issue, and we've generally shied away from GUCs that influence behaviour after a few bad experiences. What if you want one CTE inlined, but another one not?

Yeah. Are we absolutely opposed to SQL syntax against WITH that allows or disallows fencing? for example, WITH [MATERIALIZED]. Pushing people to OFFSET 0 is a giant step backwards IMO, and as an implementation detail is also subject to change.

Agreed, it's an ugly-as-sin and completely non-obvious hack.

Isn't OFFSET 0 an implementation detail anyway? Who says the planner couldn't get smarter in the future and realize OFFSET 0 is a no-op? In that case replacing the CTE optimization fence with "OFFSET 0" would be akin to painting yourself into a corner, waiting for the paint to dry, walking over to another corner and painting yourself into that one.

cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 05/01/2017 06:22 AM, Pavel Stehule wrote: 2017-05-01 1:21 GMT+02:00 Andres Freund <and...@anarazel.de <mailto:and...@anarazel.de>>: On 2017-04-30 07:19:21 +0200, Pavel Stehule wrote:

> why we cannot to introduce GUC option - enable_cteoptfence ?

Doesn't really solve the issue, and we've generally shied away from GUCs that influence behaviour after a few bad experiences. What if you want one CTE inlined, but another one not?

It change behave in same sense like enable_nestloop, enable_hashjoin, ... with same limits.

Those (and also the other enable_*) GUCs are a great example why we should not use GUCs for tweaking planner behavior, except perhaps for the purpose of investigation. It's an extremely blunt tool. You typically want to affect just a single node in the query plan (say, one join), but those options don't allow you to do that. It's an all-or-nothing thing.

Even if you're OK with affecting the whole query, it's a separate control channel - it's not embedded in the query, the user has to set it somehow. So you either set it for the whole session (affecting all the other queries that don't really need it), or you set it before each query. Which however sucks for a number of reasons, e.g. if you have a slow query in the log, how do you know with what GUC values it was executed? (You don't, and there's no way to find out.)

Exactly the same issues would affect this new GUC. It would be impossible to use multiple CTEs in the query with different fencing behavior, and it would be just as difficult to investigate. So no more planner-affecting GUCs, please, particularly if we expect regular users to use them.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 04/30/2017 09:46 AM, Andres Freund wrote: Hi, On 2017-04-30 13:58:14 +0800, Craig Ringer wrote:

We have OFFSET 0 for anyone really depending on it, and at least when you read that you know to go "wtf" and look at the manual, whereas the CTE fence behaviour is invisible and silent.

I don't think that's a good idea. What if you need to prevent inlining of something that actually needs an offset? What if the behaviour of offset is ever supposed to change? Relying more on that seems to just be repeating the mistake around CTEs.

I agree with this. But OFFSET 0 would force people to modify the queries anyway, so why not just introduce a new keyword instead? Something like:

WITH FENCED a AS (SELECT ...)

But I think something like that was proposed not too long ago, and did not make it for some reason.

There are a lot of other CTE improvements that would be great. Say, being able to define indexes on them, but that's really a separate topic.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] CTE inlining
On 04/30/2017 06:28 AM, Tom Lane wrote: Craig Ringer <craig.rin...@2ndquadrant.com> writes:

- as you noted, it is hard to decide when it's worth inlining vs materializing for CTE terms referenced more than once.

[ raised eyebrow... ] Please explain why the answer isn't trivially "never". There's already a pretty large hill to climb here in the way of breaking peoples' expectations about CTEs being optimization fences. Breaking the documented semantics about CTEs being single-evaluation seems to me to be an absolute non-starter.

I'm not sure that's a universal expectation, though. I know there are people who actually do rely on that intentionally, no doubt about that. And we'd need to make it work for them. But I keep running into people who face serious performance issues exactly because they don't realize this and use CTEs as named subqueries. And when I tell them "optimization fence" they react "Whaaat?"

If I had to make up some numbers, I'd say the "Whaaat?" group is about 10x the group of people who intentionally rely on CTEs being optimization fences.

FWIW I don't know how to do this. There were multiple attempts at this in the past, none of them succeeded. But perhaps we could at least propagate some of the CTE features, so that the outside query can benefit from that (e.g. when the CTE is sorted, we could set the sortkeys). That wouldn't break the fence thing, but it would allow other stuff.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to-do item for explain analyze of hash aggregates?
On 04/24/2017 10:55 PM, Jeff Janes wrote: On Mon, Apr 24, 2017 at 12:13 PM, Tomas Vondra <tomas.von...@2ndquadrant.com <mailto:tomas.von...@2ndquadrant.com>> wrote: On 04/24/2017 08:52 PM, Andres Freund wrote: ...

I've wanted that too. It's not impossible at all.

Why wouldn't that be possible? We probably can't use exactly the same approach as Hash, because hashjoins use a custom hash table while hashagg uses dynahash IIRC. But why couldn't we measure the amount of memory by looking at the memory context, for example?

He said "not impossible", meaning it is possible.

Ah, the dreaded double negative ...

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] to-do item for explain analyze of hash aggregates?
On 04/24/2017 08:52 PM, Andres Freund wrote: On 2017-04-24 11:42:12 -0700, Jeff Janes wrote:

The explain analyze of the hash step of a hash join reports something like this:

-> Hash (cost=458287.68..458287.68 rows=24995368 width=37) (actual rows=24995353 loops=1)
   Buckets: 33554432 Batches: 1 Memory Usage: 2019630kB

Should the HashAggregate node also report on Buckets and Memory Usage? I would have found that useful several times. Is there some reason this is not wanted, or not possible?

I've wanted that too. It's not impossible at all.

Why wouldn't that be possible? We probably can't use exactly the same approach as Hash, because hashjoins use a custom hash table while hashagg uses dynahash IIRC. But why couldn't we measure the amount of memory by looking at the memory context, for example?

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Do we need multi-column frequency/histogram stats? WAS Re: [HACKERS] Statistics "dependency"
On 04/23/2017 04:16 PM, Bruce Momjian wrote: On Sun, Apr 23, 2017 at 10:01:16AM -0400, Bruce Momjian wrote: On Sun, Apr 23, 2017 at 11:44:12AM +0100, Simon Riggs wrote:

For us "functional dependency" would sound like something to do with functions (e.g. CREATE FUNCTION), so just "dependency" appears to me to be the best term for this. There are multiple statistics for dependency stored, hence "dependencies". I don't like it, but its the best term I can see at present.

OK, thank you for the reply, and I am sorry I forgot the previous discussion. I just wanted to re-check we had researched this. Thanks. (Email subject updated.)

Actually, I have a larger question that I was thinking about. Because we already have lots of per-column stats, and now the dependency score, is it possible to mix the per-column stats and dependency score in a way that multi-column frequency/histogram stats are not necessary? That might be a less costly approach I had not considered.

Certainly not. Functional dependencies are "global" statistics, and only a very specific type of statistics. A dependency only tells you that a particular column "implies" another column, i.e. knowledge of a value in A means there's only a single possible value in B. That has a number of implications:

* It only works for equality conditions, not inequalities and so on.

* It assumes the queries are "consistent" with the functional dependencies.

* There are dependencies/correlations that are not functional dependencies, where MCV lists/histograms would help.

Functional dependencies were the simplest type of extended statistics, and so they were the first one to implement (and introduce all the infrastructure). But we still need the other types. The other types of statistics actually track correlation between values in the columns, not just "column A implies column B".
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Statistics "dependency"
On 04/23/2017 12:44 PM, Simon Riggs wrote: On 23 April 2017 at 09:17, Dean Rasheed <dean.a.rash...@gmail.com> wrote: On 23 April 2017 at 03:37, Bruce Momjian <br...@momjian.us> wrote: In looking at the new multi-column statistics "dependency" option in Postgres 10, I am quite confused by the term "dependency". Wouldn't "correlation" be clearer and less confusing as "column dependency" already means something else. I also asked that exactly that question... Actually, the terms "dependency" and "correlation" are both quite broad terms that cover a whole range of other different things, and hence could be misleading. The precise term for this is "functional dependency" [1], so if anything, the option name should be "functional_dependencies" or some shortening of that, keeping a part of each of those words. ...and got that answer also. For us "functional dependency" would sound like something to do with functions (e.g. CREATE FUNCTION), so just "dependency" appears to me to be the best term for this. Not really. Functional dependency is a term well-defined in relational algebra, particularly in definition of normal forms. It has nothing to do with functions, and I'm sure it's not the only possible "ambiguity". I actually considered functional_dependency when working on this, and I think it might be a better choice, but seemed a bit longish. There are multiple statistics for dependency stored, hence "dependencies". I don't like it, but its the best term I can see at present. That is a good point, actually. It should probably be plural. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WITH clause in CREATE STATISTICS
On 04/21/2017 12:13 AM, Tom Lane wrote: Alvaro Herrera <alvhe...@2ndquadrant.com> writes:

Simon just pointed out that having the WITH clause appear in the middle of the CREATE STATISTICS command looks odd; apparently somebody else already complained on list about the same. Other commands put the WITH clause at the end, so perhaps we should do likewise in the new command. Here's a patch to implement that. I verified that if I change qualified_name to qualified_name_list, bison does not complain about conflicts, so this new syntax should support extension to multiple relations without a problem.

Yeah, WITH is fully reserved, so as long as the clause looks like WITH ( stuff... ) you're pretty much gonna be able to drop it wherever you want.

Discuss.

+1 for WITH at the end; the existing syntax looks weird to me too.

-1 from me. I like the current syntax more, and WHERE ... WITH seems a bit weird to me.

But more importantly, one thing Dean probably considered when proposing the current syntax was that we may add support for partial statistics, pretty much like partial indexes. And we don't allow WITH at the end (after WHERE) for indexes:

test=# create index on t (a) where a < 100 with (fillfactor=10);
ERROR:  syntax error at or near "with"
LINE 1: create index on t (a) where a < 100 with (fillfactor=10);
                                            ^
test=# create index on t (a) with (fillfactor=10) where a < 100;

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_statistic_ext.staenabled might not be the best column name
On 04/12/2017 03:36 PM, David Rowley wrote: "stakind" seems like a better name. I'd have personally gone with "statype" but pg_statistic already thinks stakind is better. +1 to stakind -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_stats_ext view does not seem all that useful
On 04/10/2017 12:12 PM, David Rowley wrote:

During my review and time spent working on the functional dependencies part of extended statistics I wondered what the use for the pg_stats_ext view was. I was unsure why the length of the serialised dependencies was useful. Perhaps we could improve the view, but I'm not all that sure what value it would add. Maybe we need to discuss this, but in the meantime, I've attached a patch which just removes the view completely.

Yeah, let's get rid of the view. It was quite useful before introducing the custom data types (and implicit casts to text), because pg_statistic_ext would simply return the whole bytea value. But since then the view kinda lost its purpose and no one really noticed.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strange parallel query behavior after OOM crashes
On 04/10/2017 01:39 PM, Kuntal Ghosh wrote: On Thu, Apr 6, 2017 at 6:50 AM, Robert Haas <robertmh...@gmail.com> wrote: On Wed, Apr 5, 2017 at 8:17 PM, Neha Khatri <nehakhat...@gmail.com> wrote:

The problem here seems to be the change in the max_parallel_workers value while the parallel workers are still under execution. So this poses two questions:

1. From a usecase point of view, why could there be a need to tweak max_parallel_workers exactly at the time when the parallel workers are at play?

2. Could there be a restriction on tweaking of max_parallel_workers while the parallel workers are at play? At least do not allow setting max_parallel_workers less than the current # of active parallel workers.

Well, that would be letting the tail wag the dog. The maximum value of max_parallel_workers is only 1024, and what we're really worried about here is seeing a value near PG_UINT32_MAX, which leaves a lot of daylight. How about just creating a #define that's used by guc.c as the maximum for the GUC, and here we assert that we're <= that value?

I've added a GUC parameter named MAX_PARALLEL_WORKER_LIMIT and asserted that the absolute difference between the two counts <= that value, i.e.,

abs((int)(register_count - terminate_count)) <= MAX_PARALLEL_WORKER_LIMIT;

Because it's possible that the register count has overflowed but the terminate count has not yet overflowed, the difference in unsigned integers will be near UINT32_MAX. Hence, we need the absolute difference after typecasting to integer. This value should be less than the max_parallel_workers upper limit. I've attached both the patches for better accessibility. PFA.

At first I was like 'WTF? Why do we need a new GUC just because of an assert?' but you're actually not adding a new GUC parameter, you're adding a constant which is then used as the max value for the two existing parallel GUCs. I think this is fine.
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance issue with postgres9.6
On 04/07/2017 06:31 PM, Merlin Moncure wrote: On Fri, Apr 7, 2017 at 5:16 AM, Prakash Itnal <prakash...@gmail.com> wrote:

Hello, We currently use postgres 9.3 in our products. Recently we upgraded to postgres 9.6. But with 9.6 we have seen a drastic reduction in throughput. After analyzing carefully I found that "planner time" in 9.6 is very high. Below are the details:

Scenario:
1. Create a table with 10 rows.
2. Execute the simple query: select * from subscriber where s_id = 100;
3. No update/delete/insert; tried vacuum, full vacuum; by default we enable auto-vacuum.

9.3: Avg of "Total runtime": 0.24ms [actual throughput: 650 TPS]
9.6: Avg of total time: 0.56ms (Avg of "Planning time": 0.38ms + Avg of "Execution time": 0.18ms) [actual throughput: 80 TPS]

I think your math is off. Looking at your attachments, planning time is 0.056ms, not 0.56ms. This is in no way relevant to performance on the order of your measured TPS. How are you measuring TPS?

Not sure where you got the 0.056ms. What I see is this in the 9.3 explains:

Total runtime: 0.246 ms

and this in those from 9.6:

Planning time: 0.396 ms
Execution time: 0.181 ms

That is roughly 0.25ms vs. 0.6ms (0.4 + 0.2), as reported by Prakash. Obviously, this is "just" a 2x slowdown, so it does not match the drop from 650 to 80 tps. Also, 0.25ms would be ~4000 tps, so I guess this was just an example of a query that slowed down.

Prakash, are you using packages (which ones?), or have you compiled from sources? Can you provide pg_config output from both versions, and also 'select * from pg_settings' (the full config)?

It might also be useful to collect profiles, i.e. (1) install debug symbols, (2) run the query in a loop, and (3) collect profiles from that one backend using 'perf'. I assume you're using the same hardware / machine for the tests?
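A sketch of those profiling steps (PIDs and paths are illustrative; this assumes debug symbols are installed and the perf tool is available):

```shell
# 1) find the PID of the backend running the test session
psql -At -c "SELECT pg_backend_pid();"

# 2) in that session, run the slow query in a loop (e.g. via pgbench
#    with a custom script, or a simple DO block), so there is
#    something to sample

# 3) attach perf to just that backend and record call graphs for ~30s
perf record -g -p <backend_pid> -- sleep 30

# 4) inspect where the time goes (planner vs. executor functions)
perf report --stdio | head -n 50
```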
regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] increasing the default WAL segment size
On 04/06/2017 11:45 PM, David Steele wrote:
> On 4/6/17 5:05 PM, Tomas Vondra wrote:
>> On 04/06/2017 08:33 PM, David Steele wrote:
>>> On 4/5/17 7:29 AM, Simon Riggs wrote:
>>>> 2. It's not clear to me the advantage of being able to pick varying
>>>> filesizes. I see great disadvantage in having too many options,
>>>> which greatly increases the chance of incompatibility, annoyance
>>>> and breakage. I favour a small number of values that have been
>>>> shown by testing to be sweet spots in performance and usability.
>>>> (1GB has been suggested)
>>>
>>> I'm in favor of 16, 64, 256, 1024.
>>
>> I don't see a particular reason for this, TBH. The sweet spots will
>> likely depend on hardware / OS configuration etc. Assuming there
>> actually are sweet spots - no one has demonstrated that yet.
>
> Fair enough, but my feeling is that this patch has never been about
> server performance, per se. Rather, it is about archive management and
> trying to stem the tide of WAL as servers get bigger and busier.
> Generally, archive commands have to make a remote connection to
> offload WAL, and that has a cost per segment.

Perhaps, although Robert also mentioned that the fsync at the end of each WAL segment is noticeable. But the thread is a bit difficult to follow, as different people have different ideas about the motivation of the patch, etc.

>> Also, I don't see how supporting additional WAL sizes increases the
>> chance of incompatibility. We already allow that, so either the tools
>> (e.g. backup solutions) assume WAL segments are always 16MB (in which
>> case they are essentially broken) or they support valid file sizes
>> (in which case they should have no issues with the new ones).
>
> I don't see how a compile-time option counts as "supporting that" in
> practice. How many people in the field are running custom builds of
> Postgres? And of those, how many have changed the WAL segment size?
> I've never encountered a non-standard segment size or talked to anyone
> who has. I'm not saying it has *never* happened but I would venture to
> say it's rare.

I agree it's rare, but I don't think that means we can just consider the option 'unsupported'. We're even mentioning it in the docs as a valid way to customize the granularity of WAL archival. I certainly know people who run custom builds, and some of them run with a custom WAL segment size. Some of them are our customers, some are not. And yes, some of them actually patched the code to allow 256MB WAL segments.

>> If we're going to do this, I'm in favor of deciding on some
>> reasonable upper limit (say, 1GB or 2GB sounds good), and allowing
>> all 2^n values up to that limit.
>
> I'm OK with that. I'm also OK with providing a few reasonable choices.
> I guess that means I'll just go with the majority opinion.

>>>> 3. New file allocation has been a problem raised with this patch
>>>> for some months now.
>>>
>>> I've been playing around with this and I don't think short tests
>>> show larger sizes off to advantage. Larger segments will definitely
>>> perform more poorly until Postgres starts recycling WAL. Once that
>>> happens I think performance differences should be negligible, though
>>> of course this needs to be verified with longer-running tests.
>>
>> I'm willing to do some extensive performance testing on the patch. I
>> don't see how that could happen in the next few days (before the
>> feature freeze), particularly considering we're interested in long
>> tests.
>
> Cool. I've been thinking about how to do some meaningful tests for
> this (mostly pgbench related). I'd like to hear what you are thinking.

My plan was to do some pgbench tests with different workloads, scales (in shared buffers, in RAM, exceeds RAM), and different storage configurations (SSD vs. HDD, WAL/datadir on the same/different device/fs, possibly also ext4/xfs).

>> The question however is whether we need to do this testing when we
>> don't actually change the default (at least the patch submitted on
>> 3/27 does seem to keep the 16MB default). I assume people specifying
>> a custom value when calling initdb are expected to know what they are
>> doing (and I don't see how we can prevent distros from choosing a bad
>> value in their packages - they could already do that with the
>> configure-time option).
>
> Just because we don't change the default doesn't mean that others
> won't. I still think testing for sizes other than 16MB is severely
> lacking and I don't believe caveat emptor is the way to go.

Aren't you mixing regression and performance testing? I agree we need to be sure all segment sizes are handled correctly, no argument here.

>> Do we actually have any infrastructure for that? Or do you plan to
>> add some new animals with different WAL segment sizes?
>
> I don't have plans to add animals. I think we'd need a way to tell
> 'make check' to use a different segment size for tests and then
> hopefully reconfigure some of the existing animals.

OK. My point was that we don't have that capability now, and the latest patch is not adding it either.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
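For what it's worth, the pgbench test matrix sketched in that exchange can be enumerated like this (a rough illustration only; the dimension labels are mine, not actual pgbench options):

```python
from itertools import product

# Enumerate the test matrix discussed above: dataset scale relative to
# memory, storage layout, and candidate WAL segment sizes. The labels
# are illustrative names, not pgbench flags or postgres settings.
scales = ["fits_shared_buffers", "fits_ram", "exceeds_ram"]
storage = ["ssd_shared_device", "hdd_shared_device", "wal_on_separate_device"]
segment_sizes_mb = [16, 64, 256, 1024]

matrix = list(product(scales, storage, segment_sizes_mb))
for scale, layout, seg in matrix[:3]:
    print(scale, layout, f"{seg}MB")
print(f"{len(matrix)} combinations total")
```

Even this modest matrix yields dozens of long-running benchmark configurations, which is why fitting it in before a feature freeze is hard.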
Re: [HACKERS] increasing the default WAL segment size
On 04/06/2017 08:33 PM, David Steele wrote:
> On 4/5/17 7:29 AM, Simon Riggs wrote:
>> On 5 April 2017 at 06:04, Beena Emerson <memissemer...@gmail.com> wrote:
>>> The WALfilename - LSN mapping disruption for higher values you mean?
>>> Is there anything else I have missed?
>>
>> I see various issues raised but not properly addressed
>>
>> 1. we would need to drop support for segment sizes < 16MB unless we
>> adopt a new incompatible filename format. I think at 16MB the naming
>> should be the same as now and that WALfilename -> LSN is very
>> important. For this release, I think we should just allow >= 16MB and
>> avoid the issue thru lack of time.
>
> +1.
>
>> 2. It's not clear to me the advantage of being able to pick varying
>> filesizes. I see great disadvantage in having too many options, which
>> greatly increases the chance of incompatibility, annoyance and
>> breakage. I favour a small number of values that have been shown by
>> testing to be sweet spots in performance and usability. (1GB has been
>> suggested)
>
> I'm in favor of 16, 64, 256, 1024.

I don't see a particular reason for this, TBH. The sweet spots will likely depend on hardware / OS configuration etc. Assuming there actually are sweet spots - no one has demonstrated that yet.

Also, I don't see how supporting additional WAL sizes increases the chance of incompatibility. We already allow that, so either the tools (e.g. backup solutions) assume WAL segments are always 16MB (in which case they are essentially broken) or they support valid file sizes (in which case they should have no issues with the new ones).

If we're going to do this, I'm in favor of deciding on some reasonable upper limit (say, 1GB or 2GB sounds good), and allowing all 2^n values up to that limit.

>> 3. New file allocation has been a problem raised with this patch for
>> some months now.
>
> I've been playing around with this and I don't think short tests show
> larger sizes off to advantage. Larger segments will definitely perform
> more poorly until Postgres starts recycling WAL. Once that happens I
> think performance differences should be negligible, though of course
> this needs to be verified with longer-running tests.

I'm willing to do some extensive performance testing on the patch. I don't see how that could happen in the next few days (before the feature freeze), particularly considering we're interested in long tests.

The question however is whether we need to do this testing when we don't actually change the default (at least the patch submitted on 3/27 does seem to keep the 16MB default). I assume people specifying a custom value when calling initdb are expected to know what they are doing (and I don't see how we can prevent distros from choosing a bad value in their packages - they could already do that with the configure-time option).

> If archive_timeout is set then there will also be additional time
> taken to zero out potentially larger unused parts of the segment. I
> don't see this as an issue, however, because if archive_timeout is
> being triggered then the system is very likely under lower than usual
> load.
>
>> Lack of clarity and/or movement on these issues is very, very close
>> to getting the patch rejected now. Let's not approach this with the
>> viewpoint that I or others want it to be rejected, let's work
>> forwards and get some solid changes that will improve the world
>> without causing problems.
>
> I would definitely like to see this go in, though I agree with Peter
> that we need a lot more testing. I'd like to see some build farm
> animals testing with different sizes at the very least, even if
> there's no time to add new tests.

Do we actually have any infrastructure for that? Or do you plan to add some new animals with different WAL segment sizes?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
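To make the "all 2^n values up to a limit" proposal above concrete, here is a minimal sketch, assuming the current 16MB floor and a hypothetical 1GB ceiling (the actual limits were still under discussion in this thread):

```python
# Enumerate the WAL segment sizes allowed under the "all 2^n values up
# to a limit" proposal. The 16MB floor matches the current default; the
# 1GB ceiling is one of the limits suggested in the discussion.
def valid_segment_sizes_mb(floor_mb=16, ceiling_mb=1024):
    size = floor_mb
    while size <= ceiling_mb:
        yield size
        size *= 2

print(list(valid_segment_sizes_mb()))
# [16, 32, 64, 128, 256, 512, 1024]
```

That is only seven values, which suggests the scheme is not much harder to test exhaustively than a hand-picked list like 16/64/256/1024.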