Re: [HACKERS] Optimize kernel readahead using buffer access strategy
Hi, I have revised this patch and re-run the performance tests; it now works correctly on Linux and compiles with no warnings. The new version adds a GUC option, enable_kernel_readahead. When this GUC is on (the default), reads use POSIX_FADV_NORMAL, i.e. the OS's general readahead. When it is off, reads use POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL, chosen from the buffer access strategy hint in Postgres, so readahead is optimized by Postgres itself. The parameter can be changed in any transaction, anywhere and at any time.

* Test server
Server: HP ProLiant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB (PC3-10600R-9)
Disk: 146GB (15k) * 4, RAID 1+0
RAID controller: P410i/256MB
OS: RHEL 6.4 (x86_64)
FS: ext4

* Test setting
I use "pgbench -c 8 -j 4 -T 2400 -S -P 10 -a". I also applied my accuracy patch in this test, so I executed the following commands before each benchmark:
1. CLUSTER all databases
2. TRUNCATE pgbench_history
3. CHECKPOINT
4. sync
5. CHECKPOINT

* postgresql.conf
shared_buffers = 2048MB
maintenance_work_mem = 64MB
wal_level = minimal
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* Performance test result
** In-memory database size
s=1000        |     1 |     2 |     3 |   avg
--------------+-------+-------+-------+------
readahead=on  | 39836 | 40229 | 40055 | 40040
readahead=off | 31259 | 29656 | 30693 | 30536
ratio         |   78% |   74% |   77% |   76%

** Over-memory database size
s=2000        |    1 |    2 |    3 |  avg
--------------+------+------+------+-----
readahead=on  | 1288 | 1370 | 1367 | 1341
readahead=off | 1683 | 1688 | 1395 | 1589
ratio         | 131% | 123% | 102% | 118%

s=3000        |    1 |    2 |    3 |  avg
--------------+------+------+------+-----
readahead=on  |  965 |  862 |  993 |  940
readahead=off | 1113 | 1098 |  935 | 1049
ratio         | 115% | 127% |  94% | 112%

Performance looks good except at scale factor 1000. When the readahead parameter is off, disk IO is kept to the necessary minimum, so it is faster than "readahead=on", which performs useless disk IO. For example, which is faster across many transactions: an 8KB random read or a 12KB random read from disk? It is self-evident that the former is faster.
At scale factor 1000 the buffers warm up more slowly with "readahead=off" than with "readahead=on", so it appears to perform worse. But that is inherent in this kind of performance measurement, and you can confirm it in the attached benchmark graphs. This parameter is useful when the buffer cache is relatively hot. If you want to see other trial graphs, I will send them. I will also add MacOS support and write documentation for this patch this week. (My MacOS machine is at home.)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

*** a/configure
--- b/configure
***************
*** 19937,19943 ****
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
--- 19937,19943 ----
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5

*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 9119,9125 ****
  copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
! 		smgrread(src, forkNum, blkno, buf);
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
--- 9119,9125 ----
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
!
 		smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD);
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,

*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/buf.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
***************
*** 451,457 ****
  ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  	if (track_io_timing)
  		INSTR_TIME_SET_CURRENT(io_start);
! 	smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  	if (track_io_timing)
  	{
--- 452,458 ----
  	if (track_io_timing)
  		INSTR_TIME_SET_CURRENT(io_start);
! 	smgrread(smgr, forkNum, block
Re: [HACKERS] Bug in VACUUM reporting of "removed %d row versions" in 9.2+
On 9 December 2013 21:24, Bruce Momjian wrote: > > Where are we on this? Good question, see below. >> In those commits, lazy_vacuum_heap() skipped pinned blocks, but then >> failed to report that accurately, claiming that the tuples were >> actually removed when they were not. That bug has masked the effect of >> the page skipping behaviour. Wow, this is more complex than just that. Can't foresee backpatching anything. We prune rows down to dead item pointers. Then we remove from the index, then we try to remove them from heap, but may skip removal for some blocks. The user description of that is just wrong and needs more thought and work. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Compression of tables
Hi

I have been wondering what the minimum useful heap table compression system would be for Postgres, in order to reduce the disk footprint of large, mostly static datasets. Do you think an approach similar to the static row-level compression of DB2 could make sense? I imagine something like this:

1. You have a table which already has data in it.
2. You run a COMPRESS operation, which builds a static dictionary, and rewrites the whole table with compressed frozen tuples. Frozen tuples have CTIDs just like regular tuples, and can be pointed to by indexes. They are decompressed on the fly when needed.

Clearly things get tricky once you need to update rows. Assume for simplicity that future UPDATEs and INSERTs produce normal, non-compressed tuples that would only be compressed by a subsequent COMPRESS operation. The question is how to deal with the existing compressed rows when UPDATEd or DELETEd. Some approaches:

1. Just don't allow updates of compressed rows (!).
2. Exclusively lock the whole page when trying to update any compressed row, while you explode it into regular uncompressed tuples on new pages which you can work on (!).
3. Pull the minimum header fields out of the compressed tuples so that the MVCC machinery can work, to support updates of compressed tuples. Perhaps just the t_xmax and t_ctid values (t_xmin == frozen is implied), so that a writer can update them. This means an overhead of at least 10 bytes per tuple over the compressed size (plus the item offsets in the page header).
4. Something far cleverer.

Well, these are straw-man suggestions really, and I probably don't understand enough about PG internals (MVCC and implications for VACUUM) to be making them. But I'm curious to know if anyone has researched something like this. For example, I have a system that occupies a couple of TB on disk, but only a few to a few hundred MB per day change, mostly adding data to an active partition.
I periodically run CLUSTER on any partition that has pg_stats.correlation < 0.9 (this effectively just re-CLUSTERs the active one). At the same time I would COMPRESS, and the DB could potentially fit on much smaller SSDs.

Most commercial database systems I encounter these days use compression of some sort (more sophisticated than the above, with dynamic dictionaries, and sometimes column-based storage etc).

Thanks
Thomas
Re: [HACKERS] Get more from indices.
Kyotaro HORIGUCHI wrote:
> I'm convinced of the validity of your patch. I'm satisfied with it. Thank you.

Thank you for the reply. Then, if there are no objections from others, I'll mark this as "Ready for Committer".

Thanks,

Best regards,
Etsuro Fujita
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
2013/12/10 Amit Kapila
> On Mon, Dec 9, 2013 at 10:54 AM, Pavel Stehule wrote:
> > 2013/12/9 Amit Kapila
> >>
> >> > There are two points that should be solved:
> >> >
> >> > a) introduction of a dependency on other objects in the schema - now CREATE FUNCTION is fully independent of others
> >> >
> >> > b) slow start - if we check all paths on start, then start can be slower - and some functions would not work due to dependencies on temporary tables.
> >> >
> >> > I am thinking about possibly marking a function by #option (we had the same idea)
> >> >
> >> > something like
> >> >
> >> > #option check_on_first_start
> >> > #option check_on_create
> >> > #option check_newer
> >>
> >> what exactly does check_newer mean? does it mean whenever a function is replaced (changed)?
> >
> > no, it means that the request for a check will always be ignored - some functions cannot be deeply checked due to the use of dynamic SQL or dynamically created data types - temporary tables created in functions.
>
> Thanks for the clarification; the 'newer' part of the name created the confusion. I understand that creating/identifying dependencies in some of the cases will be quite tricky. Do other similar languages for other databases do that for all cases (objects in dynamic statements)?

I am sorry, PL/pgSQL is really specific, due to its invisible SQL base and its mixing of two environments and languages in the user-visible area.

> Is the main reason for identifying/creating dependencies to mark a function as invalid when any dependent object is dropped/altered?

Currently, PG has no tool for checking dependencies between functions and other objects. This has positive effects - we have very simple deployment, which works very well in almost all use cases - and it works with our temporary tables implementation. There is a simple rule - dependent objects must exist before they are >>used at runtime<<. But the check should not be a runtime event, and we want to decrease false alarms.
So we have to expect that almost all necessary objects are already created - that is the reason we decided not to do the check at CREATE FUNCTION time. We would rather not introduce a new dependency if possible.

Regards
Pavel

> This thread has been going for quite some time, so please excuse me if I have asked anything which has been discussed previously.
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] logical changeset generation v6.7
Hello, sorry for annoying you with trivial questions. Your explanation made it far clearer to me. This will be the last message I send about this patch..

On 2013-12-05 22:03:51 +0900, Kyotaro HORIGUCHI wrote:
> > > - You assigned HeapTupleGetOid(tuple) into relid to read it in several points, but with no modification. Nevertheless, you left HeapTupleGetOid unreplaced elsewhere. I think a 'relid' used just for repeated reading has far smaller merit compared to the demerit of lowering readability. You'd better make them uniform, one way or the other.
> >
> > It's primarily to get the line lengths halfway under control.
>
> Mm. I'm afraid I couldn't catch your words; do you mean that IsSystemClass() or isTempNamespace() could change the NULL bitmap in the tuple?

Huh? No. I just meant that the source code lines are longer if you use "HeapTupleGetOid(tuple)" instead of just "relid". Anyway, that patch has since been committed...

Really? Sorry for the noise. Thank you, I understand that.

> > > = 0002:
> > >
> > > - You are identifying the wal_level with the expression 'wal_level >= WAL_LEVEL_LOGICAL', but it seems somewhat improper.
> >
> > Hm. Why?
>
> It actually does no harm and is somewhat trifling, so I don't think you should fix it.
>
> The reason for the comment is that the greater values for wal_level are undefined at the moment, so strictly speaking, such values should be handled as invalid ones.

Note that other checks for wal_level are written the same way. Consider how much bigger this patch would be if every usage of wal_level had to be changed because a new level had been added.

I know the objective. But it is not obvious that a future value will need this treatment. Anyway, we never know that until it actually comes, so I said it was trifling.

> > > - RelationIsAccessibleInLogicalDecoding and RelationIsLogicallyLogged are identical in code. Together with the name ambiguity, this is quite confusing and a cause of future misuse of these macros, I suppose.
> > > Plus the names seem too long.
> >
> > Hm, I don't think they are equivalent; rather the contrary. Note one returns false if IsCatalogRelation(), the other true.
>
> Oops, I'm sorry. I understand they are not the same. Then I have other questions. The name of the first one, 'RelationIsAccessibleInLogicalDecoding', doesn't seem to represent what its comment reads.
>
> | /* True if we need to log enough information to have access via decoding snapshot. */
>
> Making a macro name from this comment directly, I suppose it would be something like 'NeedsAdditionalInfoInLogicalDecoding' or more directly 'LogicalDecodingNeedsCids' or so..

The comment talks about logging enough information that it is accessible - just as the name does.

Though I'm not convinced by that... But since it also seems not so wrong and you say so, I'll pretend to be convinced :-p

> > > - In heap_insert, the information conveyed in rdata[3] seems better placed in rdata[2] because of the scarcity of the additional information. Only XLOG_HEAP_CONTAINS_NEW_TUPLE seems to be needed. The same applies to heap_multi_insert and heap_update.
> >
> > Could you explain a bit more what you mean by that? The reason it's a separate rdata entry is that otherwise a full page write will remove the information.
>
> Sorry, I missed the comment 'so that an eventual FPW doesn't remove the tuple's data'. Although, given the necessity of removal prevention, rewriting rdata[].buffer, which is required by design (correct?), with InvalidBuffer seems a bit outrageous to me and obfuscates its objective.

Well, it's added in a separate rdata element. Just as in dozens of other places.

Mmmm. Was there any rdata entry which has substantial content but whose .buffer is set to InvalidBuffer just to avoid removal by FPW? Although, as for that objection of mine, I have become accustomed to seeing it there, and I have come to think it is not so bad.. I'll put an end to it with this comment.
> > > - In heap_delete, at the point of adding the replica identity, same as the above, rdata[3] could be moved into rdata[2], making a new type like 'xl_heap_replident'.
> >
> > Hm. I don't think that'd be a good idea, because we'd then need special-case decoding code for deletes, because the WAL format would be different for inserts/updates and deletes.
>
> Hmm. Although one common xl_heap_replident is impractical, splitting a logically single entity into two or more XLogRecDatas also seems not to be a good idea.

That's done everywhere. There are basically two reasons to use separate rdata elements:
* the buffers are different
* the data pointer is different
The rdata chain elements don't survive in the WAL.

Well, I have come to see rdata's as simple containers holding fragments to be written into the WAL stream. Thanks for patiently answering such silly questions.

> Considering above, looking he
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
2013/12/9 Peter Eisentraut
> On 12/8/13, 12:01 PM, Pavel Stehule wrote:
> > But still I have no idea how to push the check without possible slowdown of
> > execution or code duplication
>
> Create a GUC parameter plpgsql.slow_checks or whatever (perhaps more
> specific), which people can turn on when they run their test suites.
>
> This doesn't really have to be all that much different from what we are
> currently doing in C with scan-build and address sanitizer, for example.

A main issue is placing these tests on the critical path. You have to check in every expression and in every internal switch. There are two main purposes:

a) ensure an expression/query is valid (not only syntactically valid)

b) search all expressions/queries - visit all possible paths in the code.

So you would have to place a new check everywhere in the plpgsql executor where there is an entry to some path.

A second issue: these checks decrease the readability of the plpgsql statement executor handlers. This code is relatively readable now; with the new checks it would be a little less clean. I think the fact that we already use two other large statement switches (printing, freeing expressions) is natural, and it is hard to write it better.

Regards
Pavel
Re: [HACKERS] [patch] Adding EXTRA_REGRESS_OPTS to all pg_regress invocations
Bruce Momjian writes:
> OK, Christoph has provided a full set of tested patches back to 8.4.
> Should I backpatch these? Peter says no, but two others say yes.

It's hard to paint that as a bug fix, so I'd vote for HEAD only.

regards, tom lane
Re: [HACKERS] stats for network traffic WIP
On Tue, Dec 10, 2013 at 10:59 AM, Fujii Masao wrote:
> On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron wrote:
>> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao wrote:
>>>
>>> Could you share the performance numbers? I'm really concerned about
>>> the performance overhead caused by this patch.
>>>
>>
>> I've tried pgbench in select mode with small data sets to avoid disk
>> io and didn't see any difference. That was on my old core2duo laptop
>> though .. I'll have to retry it on some server class multi core
>> hardware.
>
> When I ran pgbench -i -s 100 in four parallel, I saw the performance difference
> between the master and the patched one. I ran the following commands.
>
> psql -c "checkpoint"
> for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done
>
> The results are:
>
> * Master
> 10000000 of 10000000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).
>
> It took almost 14.0 seconds to store 10000000 tuples.
>
> * Patched
> 10000000 of 10000000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
> 10000000 of 10000000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).
>
> It took almost 15.0 seconds to store 10000000 tuples.
>
> Thus, I'm afraid that enabling network statistics would cause serious performance
> degradation. Thought?

Hmm, I don't think I pushed it this high; the performance numbers here are cause for worry. Another point worth mentioning is that we could isolate the few spots that cause the degradation and work on them, because I still feel that the patch as a whole does not cause a serious lapse; rather, a few specific points may.
However, the above numbers bring back the original performance concerns that were voiced. I guess I was testing with too low a number of clients for the gap to show up significantly.

Regards,
Atri
Re: [HACKERS] stats for network traffic WIP
On Tue, Dec 10, 2013 at 6:56 AM, Nigel Heron wrote:
> On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao wrote:
>>
>> Could you share the performance numbers? I'm really concerned about
>> the performance overhead caused by this patch.
>>
>
> I've tried pgbench in select mode with small data sets to avoid disk
> io and didn't see any difference. That was on my old core2duo laptop
> though .. I'll have to retry it on some server class multi core
> hardware.

When I ran pgbench -i -s 100 in four parallel, I saw the performance difference between the master and the patched one. I ran the following commands.

psql -c "checkpoint"
for i in $(seq 1 4); do time pgbench -i -s100 -q db$i & done

The results are:

* Master
10000000 of 10000000 tuples (100%) done (elapsed 13.91 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 14.03 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 14.01 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 14.13 s, remaining 0.00 s).

It took almost 14.0 seconds to store 10000000 tuples.

* Patched
10000000 of 10000000 tuples (100%) done (elapsed 14.90 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 15.05 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 15.42 s, remaining 0.00 s).
10000000 of 10000000 tuples (100%) done (elapsed 15.70 s, remaining 0.00 s).

It took almost 15.0 seconds to store 10000000 tuples.

Thus, I'm afraid that enabling network statistics would cause serious performance degradation. Thought?

Regards,
-- Fujii Masao
Re: [HACKERS] Get more from indices.
Thank you,

> > One is, you put the added code for get_relation_info() outside the block for
> > the condition (info->relam == BTREE_AM_OID) (though amcanorder would be ..
> By checking the following equation in build_index_paths(), the updated
> version of the patch guarantees that the result of an index scan is ordered:
>
> index_is_ordered = (index->sortopfamily != NULL);

Oops.. I had forgotten about it, although I've seen it many times... You're right.

> > > Another is, you changed pathkeys expansion to be an all-or-nothing decision.
> > > While this change should simplify the code slightly, it also dismisses
> > > the opportunity for partially-extended pathkeys. Could you let me know
> > > the reason why you did so. ...
> > We might be able to partially-extend the original
> > pathkey list optimally in something significant, but that seems useless
> > complexity to me. So, I modified the patch to do the all-or-nothing
> > decision.
>
> Here I mean the optimality for use in merge joins.

Ok, I'll follow your advice. I agree on the point about merit vs complexity. I'm convinced of the validity of your patch. I'm satisfied with it. Thank you.

regards,
-- Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 17:18, Claudio Freire wrote:
> On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood wrote:
>> Just one more... The Intel 520 with ext4:
>>
>> Without patch: ANALYZE pgbench_accounts 5s
>> With patch: ANALYZE pgbench_accounts 1s
>>
>> And double checking - with patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s
>>
>> These results look more like Heikki's, which suggests more benefit on SSD than spinning disks. Some more data points (apart from mine) would be good to see tho.
>
> Assuming ANALYZE is sampling less than 5% of the table, I'd say fadvising will always be a win. I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than "spindles", when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.

Yeah - was just fooling about with this on the velociraptor; it looks like somewhere in the 20s works well.

pgbench_accounts
eff_io_concurrency | analyze time (s)
-------------------+-----------------
                 8 | 43
                16 | 40
                24 | 36
                32 | 35
                64 | 35

Cheers
Mark
Re: [HACKERS] Time-Delayed Standbys
(2013/12/09 20:29), Andres Freund wrote:
> On 2013-12-09 19:51:01 +0900, KONDO Mitsumasa wrote:
>> Add my comment. We have to consider three situations:
>> 1. PITR
>> 2. replication standby
>> 3. replication standby with restore_command
>> I think this patch cannot delay in situation 1.
> Why?

I have three reasons.
1. It is written that way in the document. Can we remove it?
2. The name of this feature is "Time-delayed *standbys*", not "Time-delayed *recovery*". Can we change it?
3. I think it is unnecessary in master PITR. If recovery could be delayed in master PITR, the server would become master at an unexpected time instead of continuing recovery, which is meaningless.

I'd like to ask what you expect from this feature and how you intend to use it.

Regards,
-- Mitsumasa KONDO
NTT Open Source Software Center
Re: [HACKERS] ANALYZE sampling is too good
On Tue, Dec 10, 2013 at 12:13 AM, Mark Kirkwood wrote:
> Just one more...
>
> The Intel 520 with ext4:
>
> Without patch: ANALYZE pgbench_accounts 5s
> With patch: ANALYZE pgbench_accounts 1s
>
> And double checking -
> With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s
>
> These results look more like Heikki's. Which suggests more benefit on SSD
> than spinning disks. Some more data points (apart from mine) would be good
> to see tho.

Assuming ANALYZE is sampling less than 5% of the table, I'd say fadvising will always be a win. I'd also suggest a higher e_i_c for rotating media. Rotating media has longer latencies, and e_i_c has to be computed in terms of latency, rather than "spindles", when doing prefetch. For backward index scans, I found the optimum number for a single spindle to be about 20.
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On Tue, Dec 10, 2013 at 08:47:22AM +0800, Craig Ringer wrote:
> On 12/05/2013 11:25 PM, MauMau wrote:
> > Hello,
> >
> > My customers and colleagues sometimes (or often?) ask about the
> > following message:
> >
> > FATAL: the database system is starting up
>
> I would LOVE that message to go away, forever.
>
> It's a huge PITA for automated log monitoring, analysis, and alerting.
>
> The other one I'd personally like to change, but realise is harder to
> actually do, is to separate "ERROR"s caused by obvious user input issues
> from internal ERRORs like not finding the backing file for a relation,
> block read errors, etc.
>
> String pattern matching is a crude and awful non-solution, especially
> given the way PostgreSQL loves to output messages to the log in whatever
> language and encoding the current database connection is in.

Yes, this is certainly a challenge.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
On Mon, Dec 9, 2013 at 10:54 AM, Pavel Stehule wrote:
> 2013/12/9 Amit Kapila
>>
>> > There are two points that should be solved:
>> >
>> > a) introduction of a dependency on other objects in the schema - now CREATE FUNCTION is fully independent of others
>> >
>> > b) slow start - if we check all paths on start, then start can be slower - and some functions would not work due to dependencies on temporary tables.
>> >
>> > I am thinking about possibly marking a function by #option (we had the same idea)
>> >
>> > something like
>> >
>> > #option check_on_first_start
>> > #option check_on_create
>> > #option check_newer
>>
>> what exactly does check_newer mean? does it mean whenever a function is replaced (changed)?
>>
>
> no, it means that the request for a check will always be ignored - some functions cannot be deeply checked due to the use of dynamic SQL or dynamically created data types - temporary tables created in functions.

Thanks for the clarification; the 'newer' part of the name created the confusion. I understand that creating/identifying dependencies in some of the cases will be quite tricky. Do other similar languages for other databases do that for all cases (objects in dynamic statements)?

Is the main reason for identifying/creating dependencies to mark a function as invalid when any dependent object is dropped/altered?

This thread has been going for quite some time, so please excuse me if I have asked anything which has been discussed previously.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [patch] Adding EXTRA_REGRESS_OPTS to all pg_regress invocations
On Tue, Dec 10, 2013 at 12:08 PM, Bruce Momjian wrote:
> On Thu, Dec 5, 2013 at 09:52:27AM +0100, Christoph Berg wrote:
>> > The change is sane in itself. It won't affect anyone who doesn't use
>> > EXTRA_REGRESS_OPTS. Why would we want to make packagers do MORE
>> > work?
>>
>> The patch has been in the Debian/Ubuntu/apt.pg.o packages for some
>> time, for 8.3+. I'm attaching the patches used there.
>>
>> (Sidenote: To enable building of several package flavors in parallel
>> on the same machine we use
>>
>> make -C build check-world EXTRA_REGRESS_OPTS='--host=/tmp --port=$(shell perl -le 'print 1024 + int(rand(64000))')'
>>
>> so pg_regress' static per-version ports do not conflict. But 9.2's
>> contrib/pg_upgrade/{Makefile/test.sh} do not like --port in there, so
>> the 9.2 patch has an extra sed hack in there to remove --port for
>> pg_upgrade. That bit should probably not be applied for general use.
>> The rest is safe, though.)
>
> OK, Christoph has provided a full set of tested patches back to 8.4.
> Should I backpatch these? Peter says no, but two others say yes.

My 2c. Adding a new feature in a maintenance branch is usually not done, so I'd vote no.

Regards,
-- Michael
Re: [HACKERS] Proof of concept: standalone backend with full FE/BE protocol
On Fri, Dec 6, 2013 at 04:04:36PM +0100, Andres Freund wrote:
> On 2013-12-05 23:01:28 +0200, Heikki Linnakangas wrote:
> > On 12/05/2013 10:37 PM, Robert Haas wrote:
> > >On Thu, Dec 5, 2013 at 3:05 PM, Tom Lane wrote:
> > >>It might be unpleasant to use in some cases, though.
> > >
> > >Why would there be more than a few cases in the first place? Who is
> > >going to use this beyond psql, pg_dump(all), and pg_upgrade, and why?
> >
> > Well, you might want to use pgAdmin, or your other favorite admin tool. I'm
> > not sure how well it would work, and I think it's OK if we say "sorry, can't
> > do that", but it's not a crazy thing to want.
>
> Pgadmin wouldn't work, it uses multiple connections for anything but the
> most trivial tasks. You can't even send a manual sql query using only
> one connection.
>
> I think that's true for most of the non-trivial tools.

FYI, pg_upgrade in parallel mode needs multiple database connections too.

-- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Mon, Dec 9, 2013 at 7:31 PM, Peter Geoghegan wrote:
> I go to some lengths to avoid doing this with only a shared lock.

I should have said: I go to great lengths to do this with only a shared lock, and not an exclusive one (see the gc_count stuff).

-- Peter Geoghegan
Re: [HACKERS] Storing pg_stat_statements query texts externally, pg_stat_statements in core
On Sat, Dec 7, 2013 at 9:26 AM, Fujii Masao wrote: > The patch doesn't apply cleanly against the master due to recent change of > pg_stat_statements. Could you update the patch? Revision is attached, including changes recently discussed on other thread [1] to allow avoiding sending query text where that's not desirable and increase in default pg_stat_statements.max to 5000. I've found myself tweaking things a bit more here and there. I've added additional clarification around why we do an in-place garbage collection (i.e. why we don't just write a new file and atomically rename -- grep "over-engineer" to find what I mean). I also fully documented the pg_stat_statements function. If we want external tool authors to use it to avoid sending query texts, they have to know about it in the first place. I now use the pg_stat_tmp directory for my temp file (so pg_stat_statements behaves more like the statistics collector). The stats_temp_directory GUC is not used, for reasons that a patch comment explains. I've fuzz-tested this throughout with the "make installcheck-parallel" regression tests. I found it useful to run that in one terminal, while running pgbench with multiple clients in another. The pgbench script looks something like: """ select * from pg_stat_statements; select pg_stat_statements_reset(); """ pg_stat_statements_reset() was temporarily rigged to perform a garbage collection (and not a reset) for the benefit of this stress-test. The function pg_stat_statements(), the garbage collection function and pgss_store() will complain loudly if they notice anything out of the ordinary. I was not able to demonstrate any problems through this technique (though I think it focused my thinking on the right areas during development). 
Clearly, missed subtleties around managing the external file while locking in order to ensure correct behavior (that no session can observe something that it should or should not) will be something that a committer will want to pay particular attention to. I go to some lengths to to avoid doing this with only a shared lock. FYI, without hacking things up, it's quite hard to make external query text garbage collection occur - need_gc_qtexts() will almost always say no. It will only automatically happen once with pg_stat_statements.max set to 1000 when the standard regression tests are run. This is a really extreme workload in terms of causing pg_stat_statements churn, which shows just how unlikely garbage collection is in the real world. Still, it ought to be totally robust when it happens. In passing, I did this: /* !* Rename file into place, to atomically commit to serializing. */ Because at this juncture, there ought to be no file to replace (not since Magnus had pg_stat_statements not keep the main global file around for as long as the server is running, in the style of the statistics collector). Thanks [1] http://www.postgresql.org/message-id/cam3swzsvff-vbnc8sc-0cpwzxvqh49reybchb95t95fcrgs...@mail.gmail.com -- Peter Geoghegan pg_stat_statements_ext_text.v5.2013_12_09.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
On Mon, Dec 9, 2013 at 5:52 PM, MauMau wrote: > From: "Amit Kapila" > >> 1. isn't it better to handle as it is done in write_eventlog() which >> means if string is empty then >>use PostgreSQL. >> "evtHandle = RegisterEventSource(NULL, event_source ? event_source : >> "PostgreSQL");" > > > Thank you for reviewing. Yes, I did so with the first revision of this > patch (see the first mail of this thread.) I wanted to avoid duplicating > the default value in both the server and pg_ctl code. If user does not set > event_source, postgres -C returns the default value "PostgreSQL" in the > normal case, so I wanted to rely on it. I think it is better to keep it like what I suggested above, because in that case it will assign default name even if postgres -C fails due to some reason. > > >> 2. What will happen if user doesn't change the name in "event_source" >> or kept the same name, >>won't it hit the same problem again? So shouldn't it try to >> generate different name by appending >>version string to it? > > > Yes, but I assume that the user has to set his own name to identify his > instance uniquely. Even if version string is added, the same issue can > happen --- in the likely case where the user explicitly installs, for > example, PostgreSQL 9.3 as a standalone database, as well as some packaged > application that embeds PostgreSQL 9.3 which the user is unaware of. I mentioned it just to make sure that if such things can happen, it is good to either avoid it or document it in some way, so that user can understand better how to configure his system. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:32, Mark Kirkwood wrote: On 10/12/13 15:17, Mark Kirkwood wrote: On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 6.4s With patch: ANALYZE pgbench_accounts 6.4s Ah - I have just realized I was not setting effective_io_concurrency - so I'll redo the test. - Apologies. Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1). Good grief - repeating the test gave: Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 42s pretty consistent *now*.
No idea what was going on in the 1st run (maybe I happened to have it running at the same time as a checkpoint)? Anyway will stop now before creating more confusion. Just one more... The Intel 520 with ext4: Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 1s And double checking - With patch, but effective_io_concurrency = 1: ANALYZE pgbench_accounts 5s These results look more like Heikki's, which suggests more benefit on SSD than on spinning disks. Some more data points (apart from mine) would be good to see tho. Cheers Mark
Re: [HACKERS] [patch] Adding EXTRA_REGRESS_OPTS to all pg_regress invocations
On Thu, Dec 5, 2013 at 09:52:27AM +0100, Christoph Berg wrote: > > The change is sane in itself. It won't affect anyone who doesn't use > > EXTRA_REGRESS_OPTS. Why would we want to make packagers do MORE > > work? > > The patch has been in the Debian/Ubuntu/apt.pg.o packages for some > time, for 8.3+. I'm attaching the patches used there. > > (Sidenote: To enable building of several package flavors in parallel > on the same machine we use > > make -C build check-world EXTRA_REGRESS_OPTS='--host=/tmp --port=$(shell perl > -le 'print 1024 + int(rand(64000))')' > > so pg_regress' static per-version ports do not conflict. But 9.2's > contrib/pg_upgrade/{Makefile/test.sh} do not like --port in there, so > the 9.2 patch has an extra sed hack in there to remove --port for > pg_upgrade. That bit should probably not be applied for general use. > The rest is safe, though.) OK, Christoph has provided a full set of tested patches back to 8.4. Should I backpatch these? Peter says no, but two others say yes. -- Bruce Momjian http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:17, Mark Kirkwood wrote: On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 6.4s With patch: ANALYZE pgbench_accounts 6.4s Ah - I have just realized I was not setting effective_io_concurrency - so I'll redo the test. - Apologies. Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1). Good grief - repeating the test gave: Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 42s pretty consistent *now*.
No idea what was going on in the 1st run (maybe I happened to have it running at the same time as a checkpoint)? Anyway will stop now before creating more confusion. Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 02:37 PM, Robert Haas wrote: > I've never seen an n_distinct value of more than 5 digits, regardless > of reality. Typically I've seen 20-50k, even if the real number is > much higher. But the n_distinct value is only for non-MCVs, so if we > estimate the selectivity of column = 'rarevalue' to be > (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces > the estimate, and making the MCV list longer naturally makes mcvfrac > bigger. I'm not sure how important the > less-frequent-than-the-least-common-MCV part is, but I'm very sure > that raising the statistics target helps to solve the problem of > overestimating the prevalence of uncommon values in a very big table. I did an analysis of our ndistinct algorithm several years ago (~8.1), and to sum up: 1. We take far too small a sample to estimate ndistinct well for tables larger than 100,000 rows. 2. The estimation algorithm we have chosen is one which tends to be wrong in the downwards direction, rather strongly so. That is, if we could potentially have an ndistinct of 1000 to 100,000 based on the sample, our algorithm estimates 1500 to 3000. 3. Other algorithms exist. They tend to be wrong in other directions. 4. Nobody has done an analysis of whether it's worse, on average, to estimate low vs. high for ndistinct. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
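For background on the estimator being criticized here: as far as I recall, analyze.c applies the Haas–Stokes "Duj1" estimator when the sample contains values seen exactly once, D = n*d / (n - f1 + f1*n/N). A minimal sketch (the function name is hypothetical, and this omits the clamping analyze.c performs on the result):

```c
#include <assert.h>

/*
 * Haas-Stokes "Duj1" style distinct-value estimate:
 *   D = n*d / (n - f1 + f1*n/N)
 * where n  = number of rows in the sample,
 *       d  = number of distinct values seen in the sample,
 *       f1 = number of values seen exactly once in the sample,
 *       N  = estimated total rows in the table.
 */
static double
estimate_ndistinct(double n, double d, double f1, double N)
{
    double denom = (n - f1) + f1 * n / N;

    return n * d / denom;
}
```

The behavior matches the intuition in the thread: when no sampled value is unique (f1 = 0) the estimate collapses to d, and when every sampled value is unique (f1 = d = n) it extrapolates all the way to N — with most real distributions landing, often too low, somewhere in between.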
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:11, Mark Kirkwood wrote: On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 6.4s With patch: ANALYZE pgbench_accounts 6.4s Ah - I have just realized I was not setting effective_io_concurrency - so I'll redo the test. - Apologies. Redoing the test on the velociraptor gives me exactly the same numbers as before (effective_io_concurrency = 10 instead of 1). Cheers Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 15:04, Mark Kirkwood wrote: On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 6.4s With patch: ANALYZE pgbench_accounts 6.4s Ah - I have just realized I was not setting effective_io_concurrency - so I'll redo the test. - Apologies. Regards Mark
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:53, Mark Kirkwood wrote: On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s Redoing the filesystem on the 520 as btrfs didn't seem to make any difference either: (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 6.4s With patch: ANALYZE pgbench_accounts 6.4s
Re: [HACKERS] dblink performance regression
On 12/7/13 7:50 PM, Joe Conway wrote: On 12/07/2013 05:41 PM, Fabrízio de Royes Mello wrote: > On Sat, Dec 7, 2013 at 11:20 PM, Michael Paquier wrote: >>> >>>IMHO is more elegant create a procedure to encapsulate the code >>>to avoid redundancy. >>Yep, perhaps something like PQsetClientEncodingIfDifferent or >>similar would make sense. > >Well I think at this first moment we can just create a procedure >inside the dblink contrib and not touch in libpq. Maybe a libpq function could be done for 9.4, but not for back branches. Stupid question... why don't we just pass encoding in with the other connection parameters? That eliminates any ambiguity. The only issue would be if the user also passed that in via connstr... though now that I think about it, we currently silently ignore that parameter, which IMHO is bad. We should either respect it if the user passes it in (ie: not do anything at all), or we should throw an error. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On 12/10/2013 05:20 AM, Jim Nasby wrote: >> > > FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to > fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's > fancy new shared memory stuff). Apologies for posting the same as a new idea before I saw your post. I need to finish the thread before posting. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ANALYZE sampling is too good
On 12/06/2013 09:52 AM, Peter Geoghegan wrote: > Has anyone ever thought about opportunistic ANALYZE piggy-backing on > other full-table scans? That doesn't really help Greg, because his > complaint is mostly that a fresh ANALYZE is too expensive, but it > could be an interesting, albeit risky approach. It'd be particularly interesting, IMO, if autovacuum was able to do it using the synchronized scans machinery, so the analyze still ran in a separate proc and didn't impact the user's query. Have an autovac worker or two waiting for new scans to start on tables they need to analyze, and if one doesn't start within a reasonable time frame they start their own scan. We've seen enough issues with hint-bit setting causing unpredictable query times for user queries that I wouldn't want to add another "might write during a read-only query" behaviour. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:20, Mark Kirkwood wrote: On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4 Doing the same test as above, but on a 80GB Intel 520 (ext4 filesystem mounted with discard): (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 5s With patch: ANALYZE pgbench_accounts 5s
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
On 2013-12-09 19:14:58 -0300, Alvaro Herrera wrote: > Here's a revamped version of this patch. One thing I didn't do here is > revert the exporting of CreateMultiXactId, but I don't see any way to > avoid that. I don't so much have a problem with exporting CreateMultiXactId(), just with exporting it under its current name. It's already quirky to have both MultiXactIdCreate and CreateMultiXactId() in multixact.c but exporting it imo goes too far. > Andres mentioned the idea of sharing some code between > heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't > explored that. My idea would just be to have heap_tuple_needs_freeze() call heap_prepare_freeze_tuple() and check whether it returns true. Yes, that's slightly more expensive than the current heap_tuple_needs_freeze(), but it's only called when we couldn't get a cleanup lock on a page, so that seems ok. > ! static TransactionId > ! FreezeMultiXactId(MultiXactId multi, uint16 t_infomask, > ! TransactionId cutoff_xid, MultiXactId > cutoff_multi, > ! uint16 *flags) > ! { > ! if (!MultiXactIdIsValid(multi)) > ! { > ! /* Ensure infomask bits are appropriately set/reset */ > ! *flags |= FRM_INVALIDATE_XMAX; > ! return InvalidTransactionId; > ! } Maybe comment that we're sure to be only called when IS_MULTI is set, which will be unset by FRM_INVALIDATE_XMAX? I wondered twice whether we wouldn't just continually mark the buffer dirty this way. > ! else if (MultiXactIdPrecedes(multi, cutoff_multi)) > ! { > ! /* > ! * This old multi cannot possibly have members still running. > If it > ! * was a locker only, it can be removed without any further > ! * consideration; but if it contained an update, we might need > to > ! * preserve it. > ! */ > ! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) > ! { > ! *flags |= FRM_INVALIDATE_XMAX; > ! return InvalidTransactionId; Can you place an Assert(!MultiXactIdIsRunning(multi)) here? > ! if (ISUPDATE_from_mxstatus(members[i].status) && > !
!TransactionIdDidAbort(members[i].xid)) It makes me wary to see a DidAbort() without a previous InProgress() call. Also, after we crashed, doesn't DidAbort() possibly return false for transactions that were in progress before we crashed? At least that's how I always understood it, and how tqual.c is written. > ! /* We only keep lockers if they are still running */ > ! if (TransactionIdIsCurrentTransactionId(members[i].xid) || I don't think there's a need to check for TransactionIdIsCurrentTransactionId() - vacuum can explicitly *not* be run inside a transaction. > *** > *** 5443,5458 heap_freeze_tuple(HeapTupleHeader tuple, TransactionId > cutoff_xid, >* xvac transaction succeeded. >*/ > if (tuple->t_infomask & HEAP_MOVED_OFF) > ! HeapTupleHeaderSetXvac(tuple, > InvalidTransactionId); > else > ! HeapTupleHeaderSetXvac(tuple, > FrozenTransactionId); > > /* >* Might as well fix the hint bits too; usually > XMIN_COMMITTED >* will already be set here, but there's a small chance > not. >*/ > Assert(!(tuple->t_infomask & HEAP_XMIN_INVALID)); > ! tuple->t_infomask |= HEAP_XMIN_COMMITTED; > changed = true; > } > } > --- 5571,5586 >* xvac transaction succeeded. >*/ > if (tuple->t_infomask & HEAP_MOVED_OFF) > ! frz->frzflags |= XLH_FREEZE_XVAC; > else > ! frz->frzflags |= XLH_INVALID_XVAC; > Hm. Isn't this case inverted? I.e. shouldn't you set XLH_FREEZE_XVAC and XLH_INVALID_XVAC exactly the other way round? I really don't understand the moved in/off, since the code has been gone longer than I've followed the code... *** a/src/backend/access/rmgrdesc/mxactdesc.c > --- b/src/backend/access/rmgrdesc/mxactdesc.c > *** > *** 41,47 out_member(StringInfo buf, MultiXactMember *member) > appendStringInfoString(buf, "(upd) "); > break; > default: > ! appendStringInfoString(buf, "(unk) "); > break; > } > } > --- 41,47 > appendStringInfo(buf, "(unk) "); > break; > default: > ! appendStringInfo(buf, "(unk) ", member->status); > break; >
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/06/2013 03:02 AM, Josh Berkus wrote: > Heck, I'd be happy just to have a class of messages which specifically > means "OMG, there's something wrong with the server", that is, a flag > for messages which only occur when PostgreSQL encounters a bug, data > corruption, or platform error. Right now, I have to suss those out by > regex. +10 That's what I really need to see in the logs, too. Using SQLState might be reasonable, but I'd be concerned we'd find cases where the same standard-specified SQLState should be sent to the *client*, despite different underlying causes for the server of different levels of DBA concern. I'd rather not suppress logging of user-level errors, etc; they're important too. It's just necessary to be able to separate them from internal errors that should never happen when the system is operating correctly in order to do effective log-based alerting. After all, I *don't* want to get an SMS whenever the deadlock detector kicks in and someone's created a database in de_DE so the string pattern match doesn't kick in. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
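A crude approximation of the SQLState idea works today: log the SQLSTATE alongside each message (e.g. via %e in log_line_prefix) and alert on its two-character class instead of on message text, which sidesteps the localization problem entirely. A toy classifier along those lines — the set of "server trouble" classes chosen below is an assumption for illustration, not an official severity taxonomy:

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if a five-character SQLSTATE looks like "something is wrong
 * with the server" rather than a user-level error.  The class list is
 * a judgment call: XX = internal_error, 58 = external system error
 * (I/O and such), 53 = insufficient resources.  Constraint violations,
 * syntax errors, deadlocks etc. fall through as user-level noise.
 */
static int
is_server_trouble(const char *sqlstate)
{
    static const char *classes[] = {"XX", "58", "53"};

    if (sqlstate == NULL || strlen(sqlstate) != 5)
        return 0;
    for (size_t i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (strncmp(sqlstate, classes[i], 2) == 0)
            return 1;
    return 0;
}
```

The limitation Craig points out still applies: a few SQLSTATEs are fixed by the standard for the client's benefit, so the class alone cannot always distinguish "user did something odd" from "server is in trouble".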
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/05/2013 11:25 PM, MauMau wrote: > Hello, > > My customers and colleagues sometimes (or often?) ask about the > following message: > > FATAL: the database system is starting up I would LOVE that message to go away, forever. It's a huge PITA for automated log monitoring, analysis, and alerting. The other one I'd personally like to change, but realise is harder to actually do, is to separate "ERROR"s caused by obvious user input issues from internal ERRORs like not finding the backing file for a relation, block read errors, etc. String pattern matching is a crude and awful non-solution, especially given the way PostgreSQL loves to output messages to the log in whatever language and encoding the current database connection is in. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
Jim Nasby writes: > On 12/9/13 5:56 PM, Tom Lane wrote: >> How so? "FATAL" means "an error that terminates your session", which >> is exactly what these are. > Except in these cases the user never actually got a working session; their > request was denied. > To be clear, from the client standpoint it's certainly fatal, but not from > the server's point of view. This is fully expected behavior as far as the > server is concerned. (Obviously it might be an error that caused the > shutdown/recovery, but that's something different.) Right, but as already pointed out in this thread, these messages are worded from the client's point of view. "The client never got a working connection" seems to me to be an empty distinction. If you got SIGTERM'd before you could issue your first query, should that not be FATAL because you'd not gotten any work done? More generally, we also say FATAL for all sorts of entirely routine connection failures, like wrong password or mistyped user name. People don't seem to have a problem with those. Even if some do complain, the costs of changing that behavior after fifteen-years-and-counting would certainly exceed any benefit. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 13:14, Mark Kirkwood wrote: On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5Ghz Cpu 16GB Ram 500 GB Velociraptor 10K (cold os and pg cache both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Arrg - sorry forgot the important bits: Ubuntu 13.10 (kernel 3.11.0-14) filesystem is ext4
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/9/13 5:56 PM, Tom Lane wrote: Jim Nasby writes: Arguably 1-3 are inaccurate since they're not really about a backend dying... they occur during the startup phase; you never even get a functioning backend. That's a marked difference from other uses of FATAL. How so? "FATAL" means "an error that terminates your session", which is exactly what these are. Except in these cases the user never actually got a working session; their request was denied. To be clear, from the client standpoint it's certainly fatal, but not from the server's point of view. This is fully expected behavior as far as the server is concerned. (Obviously it might be an error that caused the shutdown/recovery, but that's something different.) -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
Mark Kirkwood writes: > I did a test run: > pgbench scale 2000 (pgbench_accounts approx 25GB). > postgres 9.4 > i7 3.5GHz CPU > 16GB RAM > 500 GB Velociraptor 10K > (cold OS and pg cache, both runs) > Without patch: ANALYZE pgbench_accounts 90s > With patch: ANALYZE pgbench_accounts 91s > So I'm essentially seeing no difference :-( What OS and filesystem? regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On 10/12/13 12:14, Heikki Linnakangas wrote: I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. I did a test run: pgbench scale 2000 (pgbench_accounts approx 25GB). postgres 9.4 i7 3.5GHz CPU 16GB RAM 500 GB Velociraptor 10K (cold OS and pg cache, both runs) Without patch: ANALYZE pgbench_accounts 90s With patch: ANALYZE pgbench_accounts 91s So I'm essentially seeing no difference :-( Regards Mark
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
Jim Nasby writes: > Arguably 1-3 are inaccurate since they're not really about a backend dying... > they occur during the startup phase; you never even get a functioning > backend. That's a marked difference from other uses of FATAL. How so? "FATAL" means "an error that terminates your session", which is exactly what these are. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
Heikki Linnakangas writes: > Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when > you're prefetching with posix_fadvise anyway. Yeah. If we're not reading consecutive blocks, readahead is unlikely to do anything anyhow. Claudio's comments do suggest that it might be a bad idea to issue a posix_fadvise when the next block to be examined *is* adjacent to the current one, though. regards, tom lane
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:45 PM, Heikki Linnakangas wrote: > Claudio Freire wrote: >> On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: >>> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very >>> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from >>> this on my laptop. Taking a 3 page sample of a table with 717717 pages >>> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the >>> patch, and less than a second with the patch, with >>> effective_io_concurrency=10. If anyone with a good test data set loaded >>> would like to test this and post some numbers, that would be great. >> >> Kernel version? > > 3.12, from Debian experimental. With an SSD drive and btrfs filesystem. > Admittedly not your average database server setup, so it would be nice to get > more reports from others. Yeah, read-ahead isn't relevant for SSD.
Re: [HACKERS] ANALYZE sampling is too good
Claudio Freire wrote: > On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: >> I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very >> easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from >> this on my laptop. Taking a 3 page sample of a table with 717717 pages >> (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the >> patch, and less than a second with the patch, with >> effective_io_concurrency=10. If anyone with a good test data set loaded >> would like to test this and post some numbers, that would be great. > > Kernel version? 3.12, from Debian experimental. With an SSD drive and btrfs filesystem. Admittedly not your average database server setup, so it would be nice to get more reports from others. > I raised this issue on LKML, and, while I got no news on this front, > they might have silently fixed it. I'd have to check the sources > again. Maybe. Or maybe the heuristic read ahead isn't significant/helpful, when you're prefetching with posix_fadvise anyway. - Heikki
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 8:14 PM, Heikki Linnakangas wrote: > On 12/09/2013 11:56 PM, Claudio Freire wrote: >> Without patches to the kernel, it is much better. >> >> posix_fadvise interferes with read-ahead, so posix_fadvise on, say, >> bitmap heap scans (or similarly sorted analyze block samples) run at 1 >> IO / block, ie horrible, whereas aio can do read coalescence and >> read-ahead when the kernel thinks it'll be profitable, significantly >> increasing IOPS. I've seen everything from a 2x to 10x difference. > > > How did you test that, given that we don't actually have an asynchronous I/O > implementation? I don't recall any recent patches floating around either to > do that. When Greg Stark investigated this back in 2007-2008 and implemented > posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a > significant speedup on the test data he used. What kind of a data > distribution gives a slowdown like that? That's basically my summarized experience from working on this[0] patch, and the feedback given there about competing AIO work. Sequential I/O was the biggest issue. I had to actively avoid fadvising on sequential I/O, or sequential-ish I/O, which was a huge burden on fadvise logic. > > I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very > easy, patch attached. Your mileage may vary, but I'm seeing a nice gain from > this on my laptop. Taking a 3 page sample of a table with 717717 pages > (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the > patch, and less than a second with the patch, with > effective_io_concurrency=10. If anyone with a good test data set loaded > would like to test this and post some numbers, that would be great. Kernel version? I raised this issue on LKML, and, while I got no news on this front, they might have silently fixed it. I'd have to check the sources again. 
[0] http://www.postgresql.org/message-id/col116-w162aeaa64173e77d4597eea3...@phx.gbl
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/6/13 7:38 AM, Andres Freund wrote: On 2013-12-06 22:35:21 +0900, MauMau wrote: From: "Tom Lane" No. They are FATAL so far as the individual session is concerned. Possibly some documentation effort is needed here, but I don't think any change in the code behavior would be an improvement. You are suggesting that we should add a note like "Don't worry about the following message. This is a result of normal connectivity checking", don't you? FATAL: the database system is starting up Uh. An explanation why you cannot connect to the database hardly seems like a superfluous log message. It is when *you* are not actually trying to connect but rather pg_ctl is (which is one of the use cases here). Arguably 1-3 are inaccurate since they're not really about a backend dying... they occur during the startup phase; you never even get a functioning backend. That's a marked difference from other uses of FATAL. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Problem with displaying "wide" tables in psql
On Mon, Dec 2, 2013 at 10:45 PM, Sergey Muraviov < sergey.k.murav...@gmail.com> wrote: > Hi. > > Psql definitely has a problem with displaying "wide" tables. > Even in expanded mode, they look horrible. > So I tried to solve this problem. > I get compiler warnings: print.c: In function 'print_aligned_vertical': print.c:1238: warning: ISO C90 forbids mixed declarations and code print.c: In function 'print_aligned_vertical': print.c:1238: warning: ISO C90 forbids mixed declarations and code But I really like this and am already benefiting from it. No point in having the string of hyphens between every record wrap to be 30 lines long just because one field somewhere down the list does so. And configuring the pager isn't much of a solution because the pager doesn't know that the hyphens are semantically different than the other stuff getting thrown at it. Cheers, Jeff
Re: [HACKERS] Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
On 12/8/13 3:08 PM, MauMau wrote: #5 is output when the DBA shuts down the replication standby server. #6 is output when the DBA shuts down the server if he is using any custom background worker. These messages are always output. What I'm seeing as a problem is that FATAL messages are output in a normal situation, which worries the DBA, and those messages don't help the DBA with anything. They merely worry the DBA. While "worry" is something to be avoided, not logging messages is not the correct solution. On the project side, looking for ways and places to better communicate to the user is worthwhile in the absence of adding additional complexity to the logging system. The user/DBA, though, is expected to learn how the process works, ideally in a testing environment where worry is much less a problem, and understand that some of what they are seeing are client-oriented messages that they should be aware of but that do not indicate server-level issues by themselves. I see. It might make sense to have the DBA learn how the system works, so that more DBAs can solve problems by themselves. However, I wonder how those messages (just stopping a server process by admin request) are useful for the DBA. If they are useful, why don't other background processes (e.g. archiver, writer, walsender, etc.) output such a message when stopping? For information, #5 and #6 are not client-oriented. If we want to keep the messages, I think LOG or DEBUG would be appropriate. How about altering "ereport(FATAL, ...)" to "ereport(LOG, ...); proc_exit(1)"? IMNSHO, this isn't a "worry" thing, it's a complete mis-categorization of what's happening. Do we output "FATAL" when you disconnect from the database? Something that is happening *by design* should not be considered as "fatal". I'm not saying we shouldn't log it at all, but it's silly to call it fatal just because the process is going away *when we did something that needs to make it go away*.
What makes it worse is what happens if you do something like pg_terminate_backend() on one of those? That *should* create a fatal error (or at least be treated the same way we treat any other use of pg_terminate_backend()). So in this case I'm betting the real problem is that an orderly shutdown is calling pg_terminate_backend() or the equivalent. While that's convenient from a code standpoint, it's inaccurate from an API standpoint. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:56 PM, Claudio Freire wrote: On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas wrote: On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference. How did you test that, given that we don't actually have an asynchronous I/O implementation? I don't recall any recent patches floating around either to do that. When Greg Stark investigated this back in 2007-2008 and implemented posix_fadvise() for bitmap heap scans, posix_fadvise certainly gave a significant speedup on the test data he used. What kind of a data distribution gives a slowdown like that? I took a stab at using posix_fadvise() in ANALYZE. It turned out to be very easy, patch attached. 
Your mileage may vary, but I'm seeing a nice gain from this on my laptop. Taking a 3 page sample of a table with 717717 pages (ie. slightly larger than RAM), ANALYZE takes about 6 seconds without the patch, and less than a second with the patch, with effective_io_concurrency=10. If anyone with a good test data set loaded would like to test this and post some numbers, that would be great. - Heikki diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c index 9845b0b..410d487 100644 --- a/src/backend/commands/analyze.c +++ b/src/backend/commands/analyze.c @@ -58,10 +58,18 @@ /* Data structure for Algorithm S from Knuth 3.4.2 */ typedef struct { + Relation rel; + ForkNumber forknum; + BlockNumber N;/* number of blocks, known in advance */ int n;/* desired sample size */ BlockNumber t;/* current block number */ int m;/* blocks selected so far */ + + int prefetch_target; + int start; + int end; + BlockNumber *prefetched; } BlockSamplerData; typedef BlockSamplerData *BlockSampler; @@ -87,10 +95,11 @@ static BufferAccessStrategy vac_strategy; static void do_analyze_rel(Relation onerel, VacuumStmt *vacstmt, AcquireSampleRowsFunc acquirefunc, BlockNumber relpages, bool inh, int elevel); -static void BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, +static void BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, BlockNumber nblocks, int samplesize); static bool BlockSampler_HasMore(BlockSampler bs); static BlockNumber BlockSampler_Next(BlockSampler bs); +static BlockNumber BlockSampler_Next_internal(BlockSampler bs); static void compute_index_stats(Relation onerel, double totalrows, AnlIndexData *indexdata, int nindexes, HeapTuple *rows, int numrows, @@ -954,8 +963,12 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr) * algorithm. 
*/ static void -BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize) +BlockSampler_Init(BlockSampler bs, Relation rel, ForkNumber forknum, + BlockNumber nblocks, int samplesize) { + bs->rel = rel; + bs->forknum = forknum; + bs->N = nblocks; /* measured table size */ /* @@ -965,17 +978,61 @@ BlockSampler_Init(BlockSampler bs, BlockNumber nblocks, int samplesize) bs->n = samplesize; bs->t = 0; /* blocks scanned so far */ bs->m = 0; /* blocks selected so far */ + + bs->prefetch_target = target_prefetch_pages; + if (target_prefetch_pages > 0) + bs->prefetched = palloc((bs->prefetch_target + 1) * sizeof(BlockNumber)); + else + bs->prefetched = NULL; + bs->start = bs->end = 0; } static bool BlockSampler_HasMore(BlockSampler bs) { + if (bs->end != bs->start) + return true; return (bs->t < bs->N) && (bs->m < bs->n); } +static void +BlockSampler_Prefetch(BlockSampler bs) +{ + int next; + + next = (bs->end + 1) % (bs->prefetch_target + 1); + while (next != bs->start && (bs->t < bs->N) && (bs->m < bs->
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 4:18 PM, Jeff Janes wrote: > My reading of the code is that if it is not in the MCV, then it is assumed > to have the average selectivity (about 1/n_distinct, but deflating top and > bottom for the MCV list). There is also a check that it is less than the > least common of the MCV, but I don't know why that situation would ever > prevail--that should always be higher or equal to the average selectivity. I've never seen an n_distinct value of more than 5 digits, regardless of reality. Typically I've seen 20-50k, even if the real number is much higher. But the n_distinct value is only for non-MCVs, so if we estimate the selectivity of column = 'rarevalue' to be (1-nullfrac-mcvfrac)/n_distinct, then making mcvfrac bigger reduces the estimate, and making the MCV list longer naturally makes mcvfrac bigger. I'm not sure how important the less-frequent-than-the-least-common-MCV part is, but I'm very sure that raising the statistics target helps to solve the problem of overestimating the prevalence of uncommon values in a very big table. > I think that parts of the planner are N^2 in the size of histogram (or was > that the size of the MCV list?). So we would probably need a way to use a > larger sample size to get more accurate n_distinct and MCV frequencies, but > not save the entire histogram that goes with that sample size. I think the saving the histogram part is important. As you say, the MCVs are important for a variety of planning purposes, such as hash joins. More than that, in my experience, people with large tables are typically very willing to spend more planning time to get a better plan, because mistakes are expensive and the queries are likely to run for a while anyway. 
People with small tables care about planning time, because it makes no sense to spend an extra 1ms planning a query unless you improve the plan by enough to save at least 1ms when executing it, and when the tables are small and access is expected to be fast anyway that's often not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pgsql: Fix a couple of bugs in MultiXactId freezing
Here's a revamped version of this patch. One thing I didn't do here is revert the exporting of CreateMultiXactId, but I don't see any way to avoid that. Andres mentioned the idea of sharing some code between heap_prepare_freeze_tuple and heap_tuple_needs_freeze, but I haven't explored that. -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services *** a/src/backend/access/heap/heapam.c --- b/src/backend/access/heap/heapam.c *** *** 5238,5251 heap_inplace_update(Relation relation, HeapTuple tuple) CacheInvalidateHeapTuple(relation, tuple, NULL); } /* ! * heap_freeze_tuple * * Check to see whether any of the XID fields of a tuple (xmin, xmax, xvac) ! * are older than the specified cutoff XID. If so, replace them with ! * FrozenTransactionId or InvalidTransactionId as appropriate, and return ! * TRUE. Return FALSE if nothing was changed. * * It is assumed that the caller has checked the tuple with * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD --- 5238,5448 CacheInvalidateHeapTuple(relation, tuple, NULL); } + #define FRM_NOOP0x0001 + #define FRM_INVALIDATE_XMAX 0x0002 + #define FRM_RETURN_IS_XID 0x0004 + #define FRM_RETURN_IS_MULTI 0x0008 + #define FRM_MARK_COMMITTED 0x0010 /* ! * FreezeMultiXactId ! * Determine what to do during freezing when a tuple is marked by a ! * MultiXactId. ! * ! * "flags" is an output value; it's used to tell caller what to do on return. ! * ! * Possible flags are: ! * FRM_NOOP ! * don't do anything -- keep existing Xmax ! * FRM_INVALIDATE_XMAX ! * mark Xmax as InvalidTransactionId and set XMAX_INVALID flag. ! * FRM_RETURN_IS_XID ! * The Xid return value is a single update Xid to set as xmax. ! * FRM_MARK_COMMITTED ! * Xmax can be marked as HEAP_XMAX_COMMITTED ! * FRM_RETURN_IS_MULTI ! * The return value is a new MultiXactId to set as new Xmax. ! * (caller must obtain proper infomask bits using GetMultiXactIdHintBits) ! */ ! static TransactionId ! 
FreezeMultiXactId(MultiXactId multi, uint16 t_infomask, ! TransactionId cutoff_xid, MultiXactId cutoff_multi, ! uint16 *flags) ! { ! TransactionId xid = InvalidTransactionId; ! int i; ! MultiXactMember *members; ! int nmembers; ! bool need_replace; ! int nnewmembers; ! MultiXactMember *newmembers; ! bool has_lockers; ! TransactionId update_xid; ! bool update_committed; ! ! *flags = 0; ! ! if (!MultiXactIdIsValid(multi)) ! { ! /* Ensure infomask bits are appropriately set/reset */ ! *flags |= FRM_INVALIDATE_XMAX; ! return InvalidTransactionId; ! } ! else if (MultiXactIdPrecedes(multi, cutoff_multi)) ! { ! /* ! * This old multi cannot possibly have members still running. If it ! * was a locker only, it can be removed without any further ! * consideration; but if it contained an update, we might need to ! * preserve it. ! */ ! if (HEAP_XMAX_IS_LOCKED_ONLY(t_infomask)) ! { ! *flags |= FRM_INVALIDATE_XMAX; ! return InvalidTransactionId; ! } ! else ! { ! /* replace multi by update xid */ ! xid = MultiXactIdGetUpdateXid(multi, t_infomask); ! ! /* wasn't only a lock, xid needs to be valid */ ! Assert(TransactionIdIsValid(xid)); ! ! /* ! * If the xid is older than the cutoff, it has to have aborted, ! * otherwise the tuple would have gotten pruned away. ! */ ! if (TransactionIdPrecedes(xid, cutoff_xid)) ! { ! Assert(!TransactionIdDidCommit(xid)); ! *flags |= FRM_INVALIDATE_XMAX; ! /* xid = InvalidTransactionId; */ ! } ! else ! { ! *flags |= FRM_RETURN_IS_XID; ! } ! } ! } ! ! /* ! * This multixact might have or might not have members still running, ! * but we know it's valid and is newer than the cutoff point for ! * multis. However, some member(s) of it may be below the cutoff for ! * Xids, so we need to walk the whole members array to figure out what ! * to do, if anything. ! */ ! ! nmembers = GetMultiXactIdMembers(multi, &members, false); ! if (nmembers <= 0) ! { ! /* Nothing worth keeping */ ! *flags |= FRM_INVALIDATE_XMAX; ! return InvalidTransactionId; ! } ! ! 
/* is there anything older than the cutoff? */ ! need_replace = false; ! for (i = 0; i < nmembers; i++) ! { ! if (TransactionIdPrecedes(members[i].xid, cutoff_xid)) ! { ! need_replace = true; ! break; ! } ! } ! ! /* ! * In the simplest case, there is no member older than the cutoff; we can ! * keep the existing MultiXactId as is. ! */ ! if (!need_replace) ! { ! *flags |= FRM_NOOP; ! pfree(members); ! return InvalidTransactionId; ! } ! ! /* ! * If the multi needs to be updated, figure out which members do we need ! * to keep. ! */ ! nnewmembers = 0; ! newmembers = palloc(sizeof(MultiXactMember) * nmembers); ! has_lockers = false; ! update_xid = InvalidTransactionId; ! update_com
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
On 12/8/13, 12:01 PM, Pavel Stehule wrote: > But still I have no idea how to push the check without a possible slowdown > of execution or code duplication Create a GUC parameter plpgsql.slow_checks or whatever (perhaps more specific), which people can turn on when they run their test suites. This doesn't really have to be all that much different from what we are currently doing in C with scan-build and address sanitizer, for example.
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:47 PM, Heikki Linnakangas wrote: > On 12/09/2013 11:35 PM, Jim Nasby wrote: >> >> On 12/8/13 1:49 PM, Heikki Linnakangas wrote: >>> >>> On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): >>> >>> >>> One simple thing we could do, without or in addition to changing the >>> algorithm, is to issue posix_fadvise() calls for the blocks we're >>> going to read. It should at least be possible to match the speed of a >>> plain sequential scan that way. >> >> >> Hrm... maybe it wouldn't be very hard to use async IO here either? I'm >> thinking it wouldn't be very hard to do the stage 2 work in the callback >> routine... > > > Yeah, other than the fact we have no infrastructure to do asynchronous I/O > anywhere in the backend. If we had that, then we could easily use it here. I > doubt it would be much better than posix_fadvising the blocks, though. Without patches to the kernel, it is much better. posix_fadvise interferes with read-ahead, so posix_fadvise on, say, bitmap heap scans (or similarly sorted analyze block samples) run at 1 IO / block, ie horrible, whereas aio can do read coalescence and read-ahead when the kernel thinks it'll be profitable, significantly increasing IOPS. I've seen everything from a 2x to 10x difference.
Re: [HACKERS] stats for network traffic WIP
On Sat, Dec 7, 2013 at 1:17 PM, Fujii Masao wrote: > > Could you share the performance numbers? I'm really concerned about > the performance overhead caused by this patch. > I've tried pgbench in select mode with small data sets to avoid disk io and didn't see any difference. That was on my old core2duo laptop though .. I'll have to retry it on some server class multi core hardware. I could create a new GUC to turn on/off this feature. Currently, it uses "track_counts". > Here are the comments from me: > > All the restrictions of this feature should be documented. For example, > this feature doesn't track the bytes of the data transferred by FDW. > It's worth documenting that kind of information. > OK. It also doesn't account for DNS resolution, Bonjour traffic and any traffic generated from PL functions that create their own sockets. > ISTM that this feature doesn't support SSL case. Why not? It does support SSL, see my_sock_read() and my_sock_write() in backend/libpq/be-secure.c > The amount of data transferred by walreceiver also should be tracked, > I think. I'll have to take another look at it. I might be able to create SSL BIO functions in libpqwalreceiver.c and change some other functions (eg. libpqrcv_send) to return byte counts instead of void to get it working. > I just wonder how conn_received, conn_backend and conn_walsender > are useful. I thought of it mostly for monitoring software usage (eg. cacti, nagios) to track connections/sec which might be used for capacity planning, confirm connection pooler settings, monitoring abuse, etc. Eg. If your conn_walsender is increasing and you have a fixed set of slaves it could show a network issue. The information is available in the logs if "log_connections" GUC is on but it requires parsing and access to log files to extract. With the increasing popularity of hosted postgres services without OS or log access, I think more metrics should be available through system views. -nigel. 
Re: [HACKERS] ANALYZE sampling is too good
On 12/09/2013 11:35 PM, Jim Nasby wrote: On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... Yeah, other than the fact we have no infrastructure to do asynchronous I/O anywhere in the backend. If we had that, then we could easily use it here. I doubt it would be much better than posix_fadvising the blocks, though. - Heikki
Re: [HACKERS] [RFC] Shouldn't we remove annoying FATAL messages from server log?
MauMau wrote: > From: "Greg Stark" >> On the client end the FATAL is pretty logical but in the logs it >> makes it sound like the entire server died. I agree that is easily misunderstood, especially since a FATAL problem is less severe than a PANIC; in common English usage, panic is what one might feel when faced with the prospect of speaking in public, while fatal generally means something that kills -- like a disease or a plane crash. There is the notion of a "fatal error" in English, though, which means an error that puts an end to what the person who makes such an error is attempting. >> FATAL is a term of art peculiar to Postgres. No, it is not; at least not in terms of being "characteristic of only one person, group, or thing". The Java logger from Apache called log4j has a FATAL level which is more serious than ERROR. The distinction is intended to indicate whether the application is likely to be able to continue running. Similarly, Sybase ASE has severity levels, where above a certain point they are described as "fatal" -- meaning that the application must acquire a new connection. It's probably used elsewhere as well; those are just a couple I happened to be familiar with. > I find it unnatural for a normal administration operation to emit > a FATAL message. It seems to be a fairly common term of art for a problem which requires a restart or reconnection. FATAL is used when the problem is severe enough that the process or connection must end. It seems to me to be what should consistently be used when a client connection or its process must be terminated for a reason other than a client-side request to terminate. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:18 PM, Jeff Janes wrote: > I don't recall ever tracing a bad plan down to a bad n_distinct. It does happen. I've seen it several times. -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
On 12/8/13 1:49 PM, Heikki Linnakangas wrote: On 12/08/2013 08:14 PM, Greg Stark wrote: The whole accounts table is 1.2GB and contains 10 million rows. As expected with rows_per_block set to 1 it reads 240MB of that containing nearly 2 million rows (and takes nearly 20s -- doing a full table scan for select count(*) only takes about 5s): One simple thing we could do, without or in addition to changing the algorithm, is to issue posix_fadvise() calls for the blocks we're going to read. It should at least be possible to match the speed of a plain sequential scan that way. Hrm... maybe it wouldn't be very hard to use async IO here either? I'm thinking it wouldn't be very hard to do the stage 2 work in the callback routine... -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] Bug in VACUUM reporting of "removed %d row versions" in 9.2+
Where are we on this? --- On Fri, May 10, 2013 at 04:37:58PM +0100, Simon Riggs wrote: > Commit d0dcb315db0043f10073a9a244cea138e9e60edd and previous > introduced a bug into the reporting of removed row versions. ('Twas > myself et al, before you ask). > > In those commits, lazy_vacuum_heap() skipped pinned blocks, but then > failed to report that accurately, claiming that the tuples were > actually removed when they were not. That bug has masked the effect of > the page skipping behaviour. > > Bug is in 9.2 and HEAD. > > Attached patch corrects that, with logic to move to the next block > rather than re-try the lock in a tight loop once per tuple, which was > mostly ineffective. > > Attached patch also changes the algorithm slightly to retry a skipped > block by sleeping and then retrying the block, following observation > of the effects of the current skipping algorithm once skipped rows are > correctly reported. > > It also adds a comment which explains the skipping behaviour. > > Viewpoints? 
> > -- > Simon Riggs http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > diff --git a/src/backend/commands/vacuumlazy.c > b/src/backend/commands/vacuumlazy.c > index 02f3cf3..f0d054a 100644 > --- a/src/backend/commands/vacuumlazy.c > +++ b/src/backend/commands/vacuumlazy.c > @@ -1052,15 +1052,15 @@ lazy_scan_heap(Relation onerel, LVRelStats > *vacrelstats, > static void > lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) > { > - int tupindex; > - int npages; > + int tupindex = 0; > + int npages = 0; > + int ntupskipped = 0; > + int npagesskipped = 0; > PGRUsageru0; > Buffer vmbuffer = InvalidBuffer; > > pg_rusage_init(&ru0); > - npages = 0; > > - tupindex = 0; > while (tupindex < vacrelstats->num_dead_tuples) > { > BlockNumber tblk; > @@ -1075,9 +1075,32 @@ lazy_vacuum_heap(Relation onerel, LVRelStats > *vacrelstats) >vac_strategy); > if (!ConditionalLockBufferForCleanup(buf)) > { > - ReleaseBuffer(buf); > - ++tupindex; > - continue; > + /* > + * If we can't get the lock, sleep, then try again just > once. > + * > + * If we can't get the lock the second time, skip this > block and > + * move onto the next one. This is possible because by > now we > + * know the tuples are dead and all index pointers to > them have been > + * removed, so it is safe to ignore them, even if not > ideal. 
> + */ > + VacuumCostBalance += VacuumCostLimit; > + vacuum_delay_point(); > + if (!ConditionalLockBufferForCleanup(buf)) > + { > + BlockNumber blkno = tblk; > + > + ReleaseBuffer(buf); > + tupindex++; > + for (; tupindex < vacrelstats->num_dead_tuples; > tupindex++) > + { > + ntupskipped++; > + tblk = > ItemPointerGetBlockNumber(&vacrelstats->dead_tuples[tupindex]); > + if (tblk != blkno) > + break; > + } > + npagesskipped++; > + continue; > + } > } > tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, > vacrelstats, > > &vmbuffer); > @@ -1098,9 +1121,9 @@ lazy_vacuum_heap(Relation onerel, LVRelStats > *vacrelstats) > } > > ereport(elevel, > - (errmsg("\"%s\": removed %d row versions in %d pages", > + (errmsg("\"%s\": removed %d row versions in %d pages > (skipped %d row versions in %d pages)", > RelationGetRelationName(onerel), > - tupindex, npages), > + tupindex - ntupskipped, npages, > ntupskipped, npagesskipped), > errdetail("%s.", > pg_rusage_show(&ru0)))); > } > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
On 12/9/13 1:08 PM, Pavel Stehule wrote: So presumably it would be check_never, not check_newer... :) BTW, it's not terribly hard to work around the temp table issue; you just need to create the expected table in the session when you create the function. But even in this case, I think it would still be good to check what we can, like at least basic plpgsql syntax. I'm sorry. You cannot create a temporary table - this check should not have any side effects - and creating a temporary table can run some event trigger. But there could be some hints for the check, like annotations or something similar. Or you can minimize the area where the check will be disabled. Sorry, I meant that the user can work around it by creating the table. I didn't mean to imply that we would magically create a temp table to do the checking. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On 12/6/13 3:21 AM, Andres Freund wrote: On 2013-12-05 17:52:34 -0800, Peter Geoghegan wrote: Has anyone ever thought about opportunistic ANALYZE piggy-backing on other full-table scans? That doesn't really help Greg, because his complaint is mostly that a fresh ANALYZE is too expensive, but it could be an interesting, albeit risky approach. What I've been thinking of is a) making it piggy back on scans vacuum is doing instead of doing separate ones all the time (if possible, analyze needs to be more frequent). Currently with quite some likelihood the cache will be gone again when revisiting. FWIW, if synchronize_seqscans is on I'd think it'd be pretty easy to fire up a 2nd backend to do the ANALYZE portion (or perhaps use Robert's fancy new shared memory stuff). -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Sat, Dec 7, 2013 at 11:46 AM, Robert Haas wrote: > On Tue, Dec 3, 2013 at 6:30 PM, Greg Stark wrote: > > I always gave the party line that ANALYZE only takes a small > > constant-sized sample so even very large tables should be very quick. > > But after hearing the same story again in Heroku I looked into it a > > bit further. I was kind of shocked by the numbers. > > > > ANALYZE takes a sample of 300 * statistics_target rows. That sounds > > pretty reasonable but with default_statistics_target set to 100 that's > > 30,000 rows. If I'm reading the code right, it takes this sample by > > sampling 30,000 blocks and then (if the table is large enough) taking > > an average of one row per block. Each block is 8192 bytes so that > > means it's reading 240MB of each table. That's a lot more than I > > realized. > > That is a lot. On the other hand, I question the subject line: > sometimes, our ANALYZE sampling is not good enough. Before we raised > the default statistics target from 10 to 100, complaints about bad > plan choices due to insufficiently-precise statistics were legion -- > and we still have people periodically proposing to sample a fixed > percentage of the table instead of a fixed amount of the table, even > on large tables, which is going the opposite direction. I think this > is because they're getting really bad n_distinct estimates, and no > fixed-size sample can reliably give a good one. > I don't recall ever tracing a bad plan down to a bad n_distinct. I have seen several that were due to bad frequency estimates in the MCV list, because hash join planning is extremely sensitive to that. Do we have some kind of catalog of generators of problematic data, so that changes can be tested on known problem sets? Perhaps a wiki page to accumulate them would be useful. For automated testing I guess the generator and query are the easy part; the hard part is the cost settings/caching/RAM needed to trigger the problem, and parsing and interpreting the results. 
> > More generally, I think the basic problem that people are trying to > solve by raising the statistics target is avoid index scans on > gigantic tables. Obviously, there are a lot of other situations where > inadequate statistics can cause problems, but that's a pretty > easy-to-understand one that we do not always get right. We know that > an equality search looking for some_column = 'some_constant', where > some_constant is an MCV, must be more selective than a search for the > least-frequent MCV. If you store more and more MCVs for a table, > eventually you'll have enough that the least-frequent one is pretty > infrequent, and then things work a lot better. > My reading of the code is that if it is not in the MCV, then it is assumed to have the average selectivity (about 1/n_distinct, but deflating top and bottom for the MCV list). There is also a check that it is less than the least common of the MCV, but I don't know why that situation would ever prevail--that should always be higher or equal to the average selectivity. > > This is more of a problem for big tables than for small tables. MCV > #100 can't have a frequency of greater than 1/100 = 0.01, but that's a > lot more rows on a big table than small one. On a table with 10 > million rows we might estimate something close to 100,000 rows when > the real number is just a handful; when the table has only 10,000 > rows, we just can't be off by as many orders of magnitude. Things > don't always work out that badly, but in the worst case they do. > > Maybe there's some highly-principled statistical approach which could > be taken here, and if so that's fine, but I suspect not. So what I > think we should do is auto-tune the statistics target based on the > table size. 
If, say, we think that the generally useful range for the > statistics target is something like 10 to 400, then let's come up with > a formula based on table size that outputs 10 for small tables, 400 > for really big tables, and intermediate values for tables in the > middle. > I think that parts of the planner are N^2 in the size of the histogram (or was that the size of the MCV list?). So we would probably need a way to use a larger sample size to get more accurate n_distinct and MCV frequencies, but not save the entire histogram that goes with that sample size. Cheers, Jeff
Re: [HACKERS] GIN improvements part 1: additional information
On 12/09/2013 11:34 AM, Alexander Korotkov wrote: On Mon, Dec 9, 2013 at 1:18 PM, Heikki Linnakangas wrote: Even if we use varbyte encoding, I wonder if it would be better to treat block + offset number as a single 48-bit integer, rather than encode them separately. That would allow the delta of two items on the same page to be stored as a single byte, rather than two bytes. Naturally it would be a loss on other values, but it would be nice to see some kind of an analysis on that. I suspect it might make the code simpler, too. Yeah, I had that idea, but I thought it was not a better option. Will try to do some analysis. The more I think about that, the more convinced I am that it's a good idea. I don't think it will ever compress worse than the current approach of treating block and offset numbers separately, and, although I haven't actually tested it, I doubt it's any slower. About the same amount of arithmetic is required in both versions. Attached is a version that does that. Plus some other minor cleanup. (we should still investigate using a completely different algorithm, though) - Heikki [Attachment: gin-packed-postinglists-19.gz]
Re: [HACKERS] UNNEST with multiple args, and TABLE with multiple funcs
On Thu, Dec 05, 2013 at 10:34:08PM -0500, Stephen Frost wrote: > * Tom Lane (t...@sss.pgh.pa.us) wrote: > > Noah Misch writes: > > > Two naming proposals, "ROWS FROM" and "TABLE FROM", got an ACK from more > > > than > > > one person apiece. I move that we settle on "ROWS FROM". > > > > I'm not sufficiently annoyed by "ROWS FROM" to object. Other opinions? > > Works well enough for me. Great. Here's the patch I'll be using. -- Noah Misch EnterpriseDB http://www.enterprisedb.com diff --git a/doc/src/sgml/queries.sgml b/doc/src/sgml/queries.sgml index b33de68..daba74b 100644 --- a/doc/src/sgml/queries.sgml +++ b/doc/src/sgml/queries.sgml @@ -647,7 +647,7 @@ FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 'blow')) - Table functions may also be combined using the TABLE + Table functions may also be combined using the ROWS FROM syntax, with the results returned in parallel columns; the number of result rows in this case is that of the largest function result, with smaller results padded with NULLs to match. @@ -655,7 +655,7 @@ FROM (VALUES ('anne', 'smith'), ('bob', 'jones'), ('joe', 'blow')) function_call WITH ORDINALITY AS table_alias (column_alias , ... ) -TABLE( function_call , ... ) WITH ORDINALITY AS table_alias (column_alias , ... ) +ROWS FROM( function_call , ... ) WITH ORDINALITY AS table_alias (column_alias , ... ) @@ -674,7 +674,7 @@ TABLE( function_call , ... ) UNNEST () had been called on each parameter - separately and combined using the TABLE construct. + separately and combined using the ROWS FROM construct. @@ -683,7 +683,7 @@ UNNEST( array_expression , ... If no table_alias is specified, the function - name is used as the table name; in the case of a TABLE() + name is used as the table name; in the case of a ROWS FROM() construct, the first function's name is used. @@ -731,20 +731,20 @@ SELECT * FROM vw_getfoo; function_call AS alias (column_definition , ... ) function_call AS alias (column_definition , ... ) -TABLE( ... 
function_call AS (column_definition , ... ) , ... ) +ROWS FROM( ... function_call AS (column_definition , ... ) , ... ) - When not using the TABLE() syntax, + When not using the ROWS FROM() syntax, the column_definition list replaces the column alias list that could otherwise be attached to the FROM item; the names in the column definitions serve as column aliases. - When using the TABLE() syntax, + When using the ROWS FROM() syntax, a column_definition list can be attached to each member function separately; or if there is only one member function and no WITH ORDINALITY clause, a column_definition list can be written in - place of a column alias list following TABLE(). + place of a column alias list following ROWS FROM(). diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml index 88ebd73..d6a17cc 100644 --- a/doc/src/sgml/ref/select.sgml +++ b/doc/src/sgml/ref/select.sgml @@ -56,7 +56,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionalias [ ( column_alias [, ...] ) ] ] [ LATERAL ] function_name ( [ argument [, ...] ] ) [ AS ] alias ( column_definition [, ...] ) [ LATERAL ] function_name ( [ argument [, ...] ] ) AS ( column_definition [, ...] ) -[ LATERAL ] TABLE( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] ) +[ LATERAL ] ROWS FROM( function_name ( [ argument [, ...] ] ) [ AS ( column_definition [, ...] ) ] [, ...] ) [ WITH ORDINALITY ] [ [ AS ] alias [ ( column_alias [, ...] ) ] ] from_item [ NATURAL ] join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ] @@ -390,7 +390,7 @@ TABLE [ ONLY ] table_name [ * ] Multiple function calls can be combined into a single FROM-clause item by surrounding them -with TABLE( ... ). The output of such an item is the +with ROWS FROM( ... ). The output of such an item is the concatenation of the first row from each function, then the second row from each function, etc. 
If some of the functions produce fewer rows than others, NULLs are substituted for the missing data, so @@ -410,18 +410,18 @@ TABLE [ ONLY ] table_name [ * ] -When using the TABLE( ... ) syntax, if one of the +When using the ROWS FROM( ... ) syntax, if one of the functions requires a column definition list, it's preferred to put the column definition list after the function call inside -TABLE( ... ). A column definition list can be placed -after the TABLE( ... ) construct only if there's just a -single function and no WITH ORDINALITY clause. +ROWS FROM( ... ). A column definition list can be placed +after the
[HACKERS] In-Memory Columnar Store
Hello! I want to announce my implementation of an In-Memory Columnar Store extension for PostgreSQL: Documentation: http://www.garret.ru/imcs/user_guide.html Sources: http://www.garret.ru/imcs-1.01.tar.gz Any feedback, bug reports and suggestions are welcome. The vertical representation of data is stored in PostgreSQL shared memory. This is why it is important to be able to utilize all available physical memory. Nowadays servers with 1 TB or more of RAM are not exotic, especially in the financial world. But there is a limitation in Linux with standard 4 kB pages on the maximal size of a mapped memory segment: 256 GB. It is possible to overcome this limitation either by creating multiple segments -- but that requires too many changes in the PostgreSQL memory manager -- or by just setting the MAP_HUGETLB flag (assuming that huge pages were allocated in the system). I found several messages related to the MAP_HUGETLB flag; the most recent one was from 21 November: http://www.postgresql.org/message-id/20131125032920.ga23...@toroid.org I wonder what is the current status of this patch?
Re: [HACKERS] shared memory message queues
On Sun, Dec 8, 2013 at 5:52 AM, Kohei KaiGai wrote: > 2013/12/6 Kohei KaiGai : >> What will happen if the sender tries to send a large chunk that needs to >> be split into multiple sub-chunks and the receiver concurrently detaches >> itself from the queue during the writes by the sender? >> It seems to me the sender gets SHM_MQ_DETACHED and only the >> earlier half of the chunk still remains on the queue even though >> its total length was already in the message queue. >> It may eventually lead to an infinite loop on the receiver side when another >> receiver appears later and reads the incomplete chunk. >> Is that a feasible scenario? If so, it might be a solution to prohibit >> enqueuing something without a receiver, and to reset the queue when a new >> receiver is attached. >> > Isn't it an intended usage to attach a peer process to a message > queue that had once been detached? > If so, it may be a solution to put ereport() in shm_mq_set_receiver() > and shm_mq_set_sender() to prohibit assigning a process to a > message queue with mq_detached = true. That would simplify the > situation. It's not intended that you should be able to attach a new reader or writer in place of an old one that detached. That would in fact be pretty tricky to support, because if the detached process was in the middle of reading or writing a message at the time it died, then there's no way to recover protocol sync. We could design some mechanism for that, but in the case of background workers connected to dynamic shared memory segments it isn't needed, because I assume that when the background worker croaks, you're going to tear down the dynamic shared memory segment and thus the whole queue will disappear; if the user retries the query, we'll create a whole new segment containing a whole new queue (or queues). Now, if we wanted to use these queues in permanent shared memory, we'd probably need to think a little bit harder about this. 
It is not impossible to make it work even as things stand, because you could reuse the same chunk of shared memory and just overwrite it with a newly-initialized queue. You'd need some mechanism to figure out when to do that, and it might be kind of ugly, but I think it'd be doable. That wasn't the design center for this attempt, though, and if we want to use it that way then we probably should spend some time figuring out how to support both a "clean" detach, where the reader or writer goes away at a message boundary, and possibly also a "dirty" detach, where the reader or writer goes away in the middle of a message. I view those as problems for future patches, though. > Regarding the test-shm-mq-v1.patch, setup_background_workers() > tries to launch nworkers background worker processes; however, it > may fail during launching if max_worker_processes is not large enough. > Is this a situation for attaching the BGWORKER_EPHEMERAL flag once > your patch gets committed? I dropped the proposal for BGWORKER_EPHEMERAL; I no longer think we need that. If not all of the workers can be registered, setup_background_workers() will throw an error when RegisterDynamicBackgroundWorker returns false. If the workers are successfully registered but don't get as far as connecting to the shm_mq, wait_for_workers_to_become_ready() will detect that condition and throw an error. If all of the workers start up and attached to the shared memory message queues but then later one of them dies, the fact that it got as far as connecting to the shm_mq means that the message queue's on_dsm_detach callback will run, which will mark the queues to which it is connected as detached. That will cause the workers on either side of it to exit also until eventually the failure propagates back around to the user backend. This is all a bit complex but I don't see a simpler solution. > Also, test_shm_mq_setup() waits for completion of starting up of > background worker processes. 
I'm uncertain whether it is really > needed, because this shared memory message queue allows to > send byte stream without receiver, and also blocks until byte > stream will come from the peer to be set later. That's actually a very important check. Suppose we've got just 3 worker processes, so that the message queues are connected like this: user backend -> worker 1 -> worker 2 -> worker 3 -> user backend When test_shm_mq_setup connects to the queues linking it to worker 1 and worker 3, it passes a BackgroundWorkerHandle to shm_mq_attach. As a result, if either worker 1 or worker 3 fails during startup, before attaching to the queue, the user backend would notice that and error out right away, even if it didn't do wait_for_workers_to_become_ready(). However, if worker 2 fails during startup, neither the user backend nor either of the other workers would notice that without wait_for_workers_to_become_ready(): the user backend isn't connected to worker 2 by a shm_mq at all, and workers 1 and 3 have no BackgroundWorkerHandle to pass to shm_mq_attach(),
Re: [HACKERS] What are multixactids?
On 12/9/13 1:05 PM, hubert depesz lubaczewski wrote: On Mon, Dec 09, 2013 at 07:59:10PM +0100, Andreas Karlsson wrote: I recommend you read the section in README.tuplock. 1. https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68 +1, even if it's just a link to the README. I also started wondering what these were while investigating the vacuum bugs. I'd submit a patch, but I don't have doc builds working :( -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] What are multixactids?
On 12/09/2013 08:05 PM, hubert depesz lubaczewski wrote: Thanks. Read that. Still, it would be good to have some information in normal docs, but I guess this has to do for now. It is mentioned several times in the documentation but I do not think it is explained anywhere. -- Andreas Karlsson
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
2013/12/9 Jim Nasby > On 12/8/13 11:24 PM, Pavel Stehule wrote: > >> > #option check_on_first_start >> > #option check_on_create >> > #option check_newer >> >> what exactly does check_newer mean — does it mean whenever a function is >> replaced (changed)? >> >> >> no, it means that the request for a check will be ignored forever - some functions >> cannot be deeply checked due to using dynamic SQL or dynamically created data >> types - temporary tables created in functions. >> > > So presumably it would be check_never, not check_newer... :) BTW, it's not > terribly hard to work around the temp table issue; you just need to create > the expected table in the session when you create the function. But even in > this case, I think it would still be good to check what we can, like at > least basic plpgsql syntax. > I'm sorry. You cannot create a temporary table - this check should not have any side effects - and creating a temporary table can run some event trigger. But there could be some hints for the check, like annotations or something similar. Or you can minimize the area where the check will be disabled. > > Do we really need first_start? ISTM that if you're dependent on run state > then you're basically out of luck. > I'm afraid that checking at creation time is not enough for plpgsql, and I have very good experience with checking on first start from plpgsql_lint usage when I wrote regression tests. A "first start" doesn't create a dependency on state - it just more precisely defines the time when checking will be done. Probably an option check_create_and_start could be useful. Regards Pavel > -- > Jim C. Nasby, Data Architect j...@nasby.net > 512.569.9461 (cell) http://jim.nasby.net >
Re: [HACKERS] What are multixactids?
On Mon, Dec 09, 2013 at 07:59:10PM +0100, Andreas Karlsson wrote: > As you can see from Peter's message it is explained in > README.tuplock[1]. Basically it is used whenever more than one lock > is acquired on the same tuple, as a reference to where the locks are > stored. It can store the updated/deleted Xid for the tuple so it needs > to be persisted. > I recommend you read the section in README.tuplock. > 1. > https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68 Thanks. Read that. Still, it would be good to have some information in normal docs, but I guess this has to do for now. Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/
Re: [HACKERS] How to do fast performance timing
On 12/9/13 7:33 AM, Benedikt Grundmann wrote: At Jane Street we have recently spent a lot of time trying to get a fast gettimeofday. I saw lots of references in various postgres hacker threads related to the lack of such a facility. The culmination of those efforts can be read here: https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli and https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.ml It's all OCaml, but the code is mostly imperative and very well documented. In particular we made an effort to document our assumptions. There are a few which are OCaml-specific, but a lot of the lessons we have learned here should be applicable to postgres. Looks interesting. I think this isn't nearly as big an issue in Postgres as it used to be, but I think there's also things we've been avoiding because of the overhead. IE: using IO response time to determine if something came from cache or not. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] What are multixactids?
On 12/09/2013 06:04 PM, hubert depesz lubaczewski wrote: Hi, when working on fixing the bug related to vacuum freeze, I found out that there is something called "MultiXactId". Searching docs showed that it is mentioned only once, in release notes to 9.3.2: http://www.postgresql.org/search/?u=%2Fdocs%2F9.3%2F&q=multixactid What's more - I found that Peter Eisentraut already once asked about them, and lack of documentation: http://postgresql.1045698.n5.nabble.com/MultiXactId-concept-underdocumented-td5766754.html So, my question is - what are multixactids, what are they used for, where can I find any documentation/explanation/whatever? It seems to be related in some way to the relfrozenxid/vacuum bug, but I can't comprehend the relation without knowing what multixactid actually is. As you can see from Peter's message it is explained in README.tuplock[1]. Basically it is used whenever more than one lock is acquired on the same tuple, as a reference to where the locks are stored. It can store the updated/deleted Xid for the tuple so it needs to be persisted. I recommend you read the section in README.tuplock. 1. https://github.com/postgres/postgres/blob/d9250da032e723d80bb0150b9276cc544df6a087/src/backend/access/heap/README.tuplock#L68 -- Andreas Karlsson
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:54 PM, Greg Stark wrote: > > This "some math" is straightforward basic statistics. The 95th > percentile confidence interval for a sample consisting of 300 samples > from a population of 1 million would be +/- 5.66%. A sample consisting > of 1000 samples would have a 95th percentile confidence interval of > +/- 3.1%. Incidentally I got this using an online sample size calculator. Google turns up several but this one seems the easiest to use: http://www.raosoft.com/samplesize.html -- greg
Re: [HACKERS] About shared cache invalidation mechanism
On Sun, Dec 8, 2013 at 1:55 AM, huaicheng Li wrote: > I know that all cache invalidation messages are stored in the > shmInvalidationBuffer ring buffer and that they should be consumed by all > other backends to keep their own caches fresh. Since there may be some > "stragglers" which process the SI messages quite slowly, we use the *catchup* > interrupt (signal) to accelerate their consumption of shared invalidation messages. Here > come my questions: > (1). When the current number of messages in the shmInvalidationBuffer > exceeds a threshold, it needs to be cleaned up by using SICleanupQueue. > After that, if the number still exceeds MAXNUMMESSAGES/2, the threshold will be > calculated by the following formula: > Threshold = (numMsgs/CLEANUP_QUANTUM + 1) * CLEANUP_QUANTUM > (2). For those slow backends, if their *nextMsgNum* value is less than > *lowbound*, they will be reset, and the *lowbound* is calculated by > lowbound = maxMsgNum - MAXNUMMESSAGES + minFree, > and if their *nextMsgNum* value is less than *minsig*, they will get catchup > signals to speed up; *minsig* is calculated by > minsig = maxMsgNum - MAXNUMMESSAGES/2 > > Here, I want to ask why threshold, lowbound and minsig are calculated like > that. Were there any performance considerations behind the design of the three > formulas? I have searched through the old mailing list archives, but found > nothing about these (these changes first appeared in PG 8.4); any help would > be appreciated. Since the invalidation queue is a ring buffer, it will eventually wrap around, with new invalidation messages overwriting previously-written ones. Once that happens, any backends that hadn't yet processed some message that's since been overwritten have to be reset. Since there's no longer any way of knowing exactly which parts of their backend-local cache need to be flushed, they'll just have to flush everything. That's the point of the lowbound. However, that's expensive, so we want to prevent it from happening. 
To do that, when we notice that a backend is way behind, we send it a catchup signal. Hopefully, it will respond to the catchup signal by reading the messages it hasn't yet processed from the queue. But if it doesn't do that, or doesn't do it quickly enough, and more messages keep arriving, then eventually the queue will wrap around and we'll have to just reset it. Does that help? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
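The formulas quoted in the question can be sketched as follows; the constant values below are made-up placeholders for illustration, not the actual settings in sinvaladt.c:

```python
# Sketch of the SI queue cleanup thresholds described in the question above.
# MAXNUMMESSAGES and CLEANUP_QUANTUM values are illustrative placeholders;
# only the three formulas come from the quoted text.
MAXNUMMESSAGES = 4096
CLEANUP_QUANTUM = MAXNUMMESSAGES // 16

def cleanup_decisions(max_msg_num, num_msgs, min_free):
    """Backends whose nextMsgNum is behind lowbound get reset (full cache
    flush); those behind minsig get a catchup signal instead."""
    lowbound = max_msg_num - MAXNUMMESSAGES + min_free
    minsig = max_msg_num - MAXNUMMESSAGES // 2
    # next cleanup point, rounded up to a multiple of CLEANUP_QUANTUM
    threshold = (num_msgs // CLEANUP_QUANTUM + 1) * CLEANUP_QUANTUM
    return lowbound, minsig, threshold

print(cleanup_decisions(10000, 2500, 64))  # (5968, 7952, 2560)
```

A backend between minsig and lowbound still has a chance to catch up before the ring wraps; one behind lowbound has already lost messages and must flush everything.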
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 6:03 PM, Josh Berkus wrote: > > It's also applicable for the other stats; histogram buckets constructed > from a 5% sample are more likely to be accurate than those constructed > from a 0.1% sample. Same with nullfrac. The degree of improved > accuracy would, of course, require some math to determine. This "some math" is straightforward basic statistics. The 95% confidence interval for a sample of 300 drawn from a population of 1 million would be +/- 5.66%. A sample of 1000 would have a 95% confidence interval of +/- 3.1%. The histogram and nullfrac answer the same kind of question as a political poll, "what fraction of the population falls within this subset?" This is why pollsters don't need to sample 15 million Americans to get a decent poll result. That's just not how the math works for these kinds of questions. n_distinct is an entirely different kettle of fish. It's a different kind of problem, and the error rate there *is* going to be dependent on the percentage of the total population that you sampled. Moreover, from the papers I have read I'm convinced any sample less than 50-80% is nearly useless, so I'm convinced you can't get good results without reading the whole table. -- greg
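The "some math" above is easy to reproduce; a quick sketch using the normal approximation with a finite population correction gives exactly the quoted figures:

```python
import math

def margin_of_error(n, population, p=0.5, z=1.96):
    """95% margin of error for an estimated proportion (worst case p=0.5),
    with the finite population correction."""
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

print(round(100 * margin_of_error(300, 1_000_000), 2))   # 5.66
print(round(100 * margin_of_error(1000, 1_000_000), 2))  # 3.1
```

Note that the population size barely matters: the correction factor is about 0.9995 for 1000 rows out of a million, which is why the absolute sample size, not the sampled fraction, drives the error for these proportion-style statistics.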
Re: [HACKERS] plpgsql_check_function - rebase for 9.3
On 12/8/13 11:24 PM, Pavel Stehule wrote: > #option check_on_first_start > #option check_on_create > #option check_newer What exactly does check_newer mean? Does it mean whenever a function is replaced (changed)? no, it means that the request for a check will always be ignored - some functions cannot be deeply checked because they use dynamic SQL or dynamically created data types, such as temporary tables created inside functions. So presumably it would be check_never, not check_newer... :) BTW, it's not terribly hard to work around the temp table issue; you just need to create the expected table in the session when you create the function. But even in this case, I think it would still be good to check what we can, like at least basic plpgsql syntax. Do we really need first_start? ISTM that if you're dependent on run state then you're basically out of luck. -- Jim C. Nasby, Data Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: [HACKERS] ANALYZE sampling is too good
On Mon, Dec 9, 2013 at 1:03 PM, Josh Berkus wrote: >> I really don't believe the 5% thing. It's not enough for n_distinct >> and it's *far* too high a value for linear properties like histograms >> or nullfrac etc. > > Actually, it is enough for n_distinct, or more properly, 5% is as good > as you can get for n_distinct unless you're going to jump to scanning > 50% or more. I'd like to see a proof of that result. Not because I'm hostile to changing the algorithm, but because you've made numerous mathematical claims on this thread that fly in the face of what Greg, myself, and others understand to be mathematically true - including this one. If our understanding is wrong, then by all means let's get that fixed. But you're not going to convince anyone here that we should rip out the existing algorithm and its peer-reviewed journal citations by making categorical assertions about the right way to do things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Extra functionality to createuser
On Sat, Dec 7, 2013 at 11:39 PM, Amit Kapila wrote: > On Fri, Dec 6, 2013 at 10:31 AM, Peter Eisentraut wrote: >> On Wed, 2013-11-20 at 11:23 -0500, Christopher Browne wrote: >>> I note that similar (with not quite identical behaviour) issues apply >>> to the user name. Perhaps the >>> resolution to this is to leave quoting issues to the administrator. >>> That simplifies the problem away. >> >> How about only one role name per -g option, but allowing the -g option >> to be repeated? > > I think that might simplify the problem and patch, but do you think > it is okay to have inconsistency > in the usage of options between the CREATE USER statement and this utility? Yes. In general, command-line utilities use a very different syntax for options-passing than SQL commands. Trying to make them consistent feels unnecessary or perhaps even counterproductive. And the proposed syntax is certainly a convention common to many other command-line utilities, so I think it's fine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] ANALYZE sampling is too good
Josh Berkus writes: > Reading 5% of a 200GB table is going to be considerably faster than > reading the whole thing, if that 5% is being scanned in a way that the > FS understands. Really? See the upthread point that reading one sector from each track has just as much seek overhead as reading the whole thing. I will grant that if you think that reading a *contiguous* 5% of the table is good enough, you can make it faster --- but I don't believe offhand that you can make this better without seriously compromising the randomness of your sample. Too many tables are loaded roughly in time order, or in other ways that make contiguous subsets nonrandom. > You do seem kind of hostile to the idea of full-page-sampling, going > pretty far beyond the "I'd need to see the math". Why? I'm detecting a lot of hostility to assertions unsupported by any math. For good reason. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance optimization of btree binary search
On Fri, Dec 6, 2013 at 4:53 PM, Peter Geoghegan wrote: > I had considered that something like Intel Speedstep technology had a > role here, but I'm pretty sure it steps up very aggressively when > things are CPU bound - I tested that against a Core 2 Duo desktop a > couple of years back, where it was easy to immediately provoke it by > moving around desktop windows or something. I decided to increase the default CPU governor from "ondemand" to "performance" for each of the 8 logical cores on this system. I then re-ran the benchmark. I saw markedly better, much more *consistent* performance for master [1]. I Googled for clues, and found this: https://communities.intel.com/community/datastack/blog/2013/08/05/how-to-maximise-cpu-performance-for-the-oracle-database-on-linux (It happens to mention Oracle, but I think it would equally well apply to any database). I strongly suspect this is down to Kernel version. I should highlight this: """ Another further CPU setting is the Energy/Performance Bias and Red Hat and Oracle users should note that the default setting has changed in the Linux kernel used between the releases of Red Hat/Oracle Linux 5 and Red Hat/Oracle Linux 6. (Some system BIOS options may include a setting to prevent the OS changing this value). In release 5 Linux did not set a value for this setting and therefore the value remained at 0 for a bias towards performance. In Red Hat 6 this behaviour has changed and the default sets a median range to move this bias more towards conserving energy (remember the same Linux kernel is present in both ultrabooks as well as servers and on my ultrabook I use powertop and the other Linux tools and configurations discussed here to maximise battery life) and reports the following in the dmesg output on boot. ... You can also use the tool to set a lower value to change the bias entirely towards performance (the default release 5 behaviour). 
""" If there is a regression in Postgres performance on more recent Linux kernels [2], perhaps this is it. I certainly don't recall hearing advice on this from the usual places. I'm surprised that turbo boost mode wasn't enabled very quickly on a workload like this. It makes a *huge* difference - at 4 clients (one per physical core), setting the CPU governor to "performance" increases TPS by a massive 40% compared to some earlier, comparable runs of master. These days Red Hat are even pointing out that CPU governor policy can be set via cron jobs [3]. I cannot account for why the benchmark I originally performed was consistent with the patch having helped to such a large degree, given the large number of runs involved, and their relatively long duration for a CPU/memory-bound workload. As I said, this machine is on dedicated hardware, and virtualization was not used. However, at this point I have no choice but to withdraw the patch from consideration, and not pursue this any further. Sorry for the noise. [1] http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/turbo/ [2] http://www.postgresql.org/message-id/529f7d58.1060...@agliodbs.com [3] https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_governors.html -- Peter Geoghegan
Re: [HACKERS] ANALYZE sampling is too good
Greg, > I really don't believe the 5% thing. It's not enough for n_distinct > and it's *far* too high a value for linear properties like histograms > or nullfrac etc. Actually, it is enough for n_distinct, or more properly, 5% is as good as you can get for n_distinct unless you're going to jump to scanning 50% or more. It's also applicable for the other stats; histogram buckets constructed from a 5% sample are more likely to be accurate than those constructed from a 0.1% sample. Same with nullfrac. The degree of improved accuracy would, of course, require some math to determine. > From a computer point of view it's too high to be > worth bothering. If we have to read 5% of the table we might as well > do a full scan anyways, it'll be marginally slower but much better > quality results. Reading 5% of a 200GB table is going to be considerably faster than reading the whole thing, if that 5% is being scanned in a way that the FS understands. Also, we can optimize this significantly by using the VM, as Robert (I think) suggested. In the advanced approaches section, there's also the idea of collecting analyze data from table pages while they're in memory anyway for other reasons. You do seem kind of hostile to the idea of full-page sampling, going pretty far beyond the "I'd need to see the math". Why? -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Extension Templates S03E11
On Mon, 2013-12-09 at 12:17 -0500, Robert Haas wrote: > On Sat, Dec 7, 2013 at 3:12 AM, Jeff Davis wrote: > > So if we do it this way, then we should pick a new name, like "package". > > That was my first reaction as well, when I looked at this a few years > ago, but I've since backed away from that position. You're certainly > correct that it's awkward to have a single kind of object that behaves > in two radically different ways, but it's also pretty awkward to have > the same "stuff" installed as one of two completely different types of > objects depending on who installed it and how. I think awkwardness is most visible in the resulting documentation and error messages. At the moment, I'm having a difficult time imagining how we explain how this works to users (or, when they make a mistake or don't get the results they expect, explain to them what they did wrong and how to fix it). Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] RFC: programmable file format for postgresql.conf
On Sat, Dec 7, 2013 at 3:28 AM, Álvaro Hernández Tortosa wrote: >>> "Right now, writing such a tool in a generic way gets so bogged down >>> just in parsing/manipulating the postgresql.conf file that it's hard to >>> focus on actually doing the tuning part." >> >> That was in 2008. I don't think that stance is accurate anymore. > > Just for me to learn about this: why is it not accurate anymore? This topic has been under active discussion for the last five years. I strongly recommend going back and skimming over the past discussions before trying to pick it up again. In particular go look up the discussion of SET PERSISTENT Since we have include files now you can just generate an auto-tune.conf and not try to parse or write the main config file. The reason previous efforts got bogged down in parsing/manipulating the postgresql.conf file was purely because they were trying to allow you to edit the file by hand and mix that with auto generated config. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
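The include-file approach mentioned above might look like this (auto-tune.conf is a hypothetical file name; the point is that the tool owns that file outright and never parses postgresql.conf itself):

```
# postgresql.conf -- hand-edited by the DBA; comments here are never
# touched by any tool
shared_buffers = 2GB            # sized by the DBA for this host

# the file below is regenerated wholesale by the (hypothetical) tuning tool
include 'auto-tune.conf'
```

Since the tool rewrites auto-tune.conf from scratch on every run, there are no comments to preserve and no parsing problem; PostgreSQL takes the last setting read, so the placement of the include decides which file wins on conflicts.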
Re: [HACKERS] Extension Templates S03E11
On Sat, Dec 7, 2013 at 3:12 AM, Jeff Davis wrote: > So if we do it this way, then we should pick a new name, like "package". That was my first reaction as well, when I looked at this a few years ago, but I've since backed away from that position. You're certainly correct that it's awkward to have a single kind of object that behaves in two radically different ways, but it's also pretty awkward to have the same "stuff" installed as one of two completely different types of objects depending on who installed it and how. If we're targeting deployment of user-written application code, then I can see that it might make sense to have a different concept than "extension" for that, because arguably it's a different problem, though it's no longer clear to me that it's all that much different. But if we're talking about deployment of the same PGXN code (or wherever upstream lives) either by a DBA who is also the sysadmin (and can thus run make install or yum install) or one who is not (and thus wishes to proceed entirely via libpq) then making those two different concepts seems like it might be slicing awfully thin. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] What are multixactids?
Hi, when working on fixing the bug related to vacuum freeze, I found out that there is something called "MultiXactId". Searching the docs showed that it is mentioned only once, in the release notes for 9.3.2: http://www.postgresql.org/search/?u=%2Fdocs%2F9.3%2F&q=multixactid What's more - I found that Peter Eisentraut already once asked about them, and the lack of documentation: http://postgresql.1045698.n5.nabble.com/MultiXactId-concept-underdocumented-td5766754.html So, my question is - what are multixactids, what are they used for, and where can I find any documentation/explanation/whatever? It seems to be related in some way to the relfrozenxid/vacuum bug, but I can't comprehend the relation without knowing what a multixactid actually is. Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/
Re: [HACKERS] RFC: programmable file format for postgresql.conf
On Fri, Dec 6, 2013 at 10:28 PM, Álvaro Hernández Tortosa wrote: > I think both could be used a lot, editing directly a rich configuration > file or using a GUI tool. I'm trying to suggest supporting both. I don't really understand how changing the file format fixes anything. You could make the file an INI file or an XML file and it would still be hard to edit programmatically, not because the current format is "hard to parse" in any meaningful sense, but because there's no way for a program to know how to make changes while preserving the comments. For example, suppose the user tries to set work_mem to 4MB. If there's an existing line in the config file for work_mem, it's fairly plausible to think that you might just replace everything on that line, up to the beginning of any comment, with a new work_mem setting. But what if, as in the default configuration file, there isn't any such setting? A human will go and find the line that says: #work_mem = 1MB ...and delete the hash mark, and replace 1MB with 4MB. No problem! But for a computer, editing comments is hard, and kind of iffy. After all, there might be multiple lines that look like the above, and how would you know which one to replace? There could even be something like this in the file: #In our installation, because we have very little memory, it's important not to do anything silly like set #work_mem = 64MB A configuration file editor that replaces that line will corrupt the comment, because no program can be smart enough to recognize the context the way a human will. Now, we could design something that gets it right, or close enough to right, 99% of the time. But previous discussions of this issue on this mailing list have concluded that people are not willing to accept that kind of solution, which IMHO is understandable. The only kind of change that I see as possibly helpful is some format that explicitly marks which comments go with which settings. 
For example, suppose we did this: <comment> work_mem min 64kB </comment> If you want to set the value, you remove the <comment> tags around it. And if you want to comment on the value, you can put whatever you like within the <comment> tags. Now, you've got a machine-editable format, assuming that people keep their comments in the <comment> section and not inside actual SGML comments. But that's ugly and overly verbose, so meh. Generally I don't regard trying to tinker with postgresql.conf as a useful way to spend time. Many people have strong and sometimes conflicting feelings about it, making it almost impossible to get consensus on any change. And while I'm sure some die-hard will disagree with me on this, the current format, imperfect as it is, is not really all *that* bad. We all have our bones to pick with it and I certainly wouldn't have picked this exact approach myself, but we could have done far worse. If it were clear what the next logical step to make it better was, or even if it were clear that the current format blew chunks, then I'd be all over putting energy into getting this fixed. But it isn't, and it doesn't, and the amount of collective energy that would need to be put into making any change here doesn't seem likely to be worth what we'd get out of it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
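The "replace everything on that line, up to the beginning of any comment" heuristic described above can be sketched like this (a hypothetical editor, not a real tool); note how it has to punt on the commented-out default rather than guess which comment to uncomment:

```python
import re

def set_guc(conf_text, name, value):
    """Naive postgresql.conf editor: if an active 'name = value' line
    exists, replace everything up to any trailing comment; otherwise
    append at the end rather than guess which '#name = ...' comment
    line was meant to be uncommented."""
    pattern = re.compile(rf'^(\s*){re.escape(name)}\s*=\s*[^#\n]*',
                         re.MULTILINE)
    if pattern.search(conf_text):
        # keep the trailing comment (if any) intact
        return pattern.sub(rf'\g<1>{name} = {value} ', conf_text, count=1)
    return conf_text.rstrip('\n') + f'\n{name} = {value}\n'

conf = "#work_mem = 1MB\nshared_buffers = 128MB  # per host\n"
print(set_guc(conf, 'work_mem', '4MB'))
```

The commented-out "#work_mem = 1MB" line is deliberately left alone and a fresh line appended instead, because deciding which comment a human meant as a disabled setting is exactly the part no program can get right.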
Re: [HACKERS] JSON decoding plugin
On 09-12-2013 13:12, Merlin Moncure wrote: > This is pretty neat. Couple minor questions: > *) Aren't you *en*coding data into json, not the other way around (decoding?) > Yes. The 'decoding' came from the functionality (logical decoding) and because the POC plugin is named 'test_decoding'. I also think that 'json_decoding' doesn't say much about the module's purpose. I confess that I don't like the name but can't come up with a good one. Maybe 'wal2json' or 'logrep2json'? Could you suggest something? > *) Consider generating a long bytea instead of explicitly writing a > 32kb sql into the patch. > I'll consider it for the next version. > *) You've built your own json serializer here. Maybe some code can be > shared with the json type? > Same here. I already took a look at the json datatype but decided that I wouldn't mess with the backend code before getting feedback on the general idea. > *) Consider removing 'plugin ' from the name of the plugin. > --plugin=json_decoding etc. > 'plugin' was an attempt to produce a unique name (it sucks but...). -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Re: [HACKERS] pg_archivecleanup bug
On Thu, Dec 5, 2013 at 6:15 PM, Tom Lane wrote: > But the other usages seem to be in assorted utilities, which > will need to do it right for themselves. initdb.c's walkdir() seems to > have it right and might be a reasonable model to follow. Or maybe we > should invent a frontend-friendly version of ReadDir() rather than > duplicating all the error checking code in ten-and-counting places? If there's enough uniformity in all of those places to make that feasible, it certainly seems wise to do it that way. I don't know if that's the case, though - e.g. maybe some callers want to exit and others do not. pg_resetxlog wants to exit; pg_archivecleanup and pg_standby most likely want to print an error and carry on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_archivecleanup bug
On Fri, Dec 6, 2013 at 11:10 AM, Tom Lane wrote: > Robert Haas writes: >> On Thu, Dec 5, 2013 at 6:15 PM, Tom Lane wrote: >>> In general, I think there is no excuse for code in the backend to use >>> readdir() directly; it should be using ReadDir(), which takes care of this >>> as well as error reporting. > >> My understanding is that the fd.c infrastructure can't be used in the >> postmaster. > > Say what? See ParseConfigDirectory for code that certainly runs in the > postmaster, and uses ReadDir(). Gosh, I could have sworn that I had calls into fd.c that were crashing and burning during development because they happened too early in postmaster startup. But it seems to work fine now, so I've pushed a fix for this and a few related issues. Please let me know if you think there are remaining issues. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] JSON decoding plugin
On Mon, Dec 9, 2013 at 7:03 AM, Euler Taveira wrote: > Hi, > > A few months ago, it was proposed [1] that it would be interesting to have a > json output plugin for logical decoding. Here it is. > > Each transaction is a JSON object that can contain xid (optional), > timestamp (optional), and a change array. Each change element is a > command that was decoded, and it can contain: kind (I/U/D), schema > (optional), table, columnnames, columntypes (optional), columnvalues, > and oldkeys (only for U/D). columnnames, columntypes and columnvalues > are arrays. oldkeys is an object that contains the following arrays: > keynames, keytypes (optional), and keyvalues. > > The JSON objects are serialized if you are decoding a series of > transactions. Here is an output example: > > { > "xid": 702, > "change": [ > { > "kind": "insert", > "schema": "public", > "table": "foo", > "columnnames": ["a", "b", "c"], > "columntypes": ["int4", "int4", "text"], > "columnvalues": [1, 2, "test"] > } > ,{ > "kind": "update", > "schema": "public", > "table": "foo", > "columnnames": ["a", "b", "c"], > "columntypes": ["int4", "int4", "text"], > "columnvalues": [1, 2, "test2"], > "oldkeys": { > "keynames": ["a", "b"], > "keytypes": ["int4", "int4"], > "keyvalues": [1, 2] > } > } > ] > } > { > "xid": 703, > "change": [ > { > "kind": "update", > "schema": "public", > "table": "foo", > "columnnames": ["a", "b", "c"], > "columntypes": ["int4", "int4", "text"], > "columnvalues": [1, 3, "test2"], > "oldkeys": { > "keynames": ["a", "b"], > "keytypes": ["int4", "int4"], > "keyvalues": [1, 2] > } > } > ] > } > { > "xid": 704, > "change": [ > { > "kind": "delete", > "schema": "public", > "table": "foo", > "oldkeys": { > "keynames": ["a", "b"], > "keytypes": ["int4", "int4"], > "keyvalues": [1, 3] > } > } > ] > } > > > Some data types were adapted to conform to the JSON spec. NaN and Infinity > are not valid JSON symbols, so their representation is NULL (as in some JSON > implementations). 
Due to JSON datatype simplicity, I represent the vast > majority of Postgres datatypes as string (However, I admit that we could > mimic the json datatype conversion rules). > > The oldkeys treatment follows what was defined by the commit [2]. It uses: > > (i) primary key (default behavior); > (ii) unique index (if REPLICA IDENTITY USING INDEX is defined for table); > (iii) full tuple (if REPLICA IDENTITY FULL is defined for table); > (iv) nothing means an error (if REPLICA IDENTITY NOTHING is defined for > table). > > The TOAST columns have a special treatment for UPDATEs. If a tuple that > contains a TOAST field is updated, the TOAST field is included iff it is > changed too. It means that unchanged TOAST fields are omitted from the > columns* arrays. This means less overhead while transmitting, > processing and applying changes. > > By design, (i) the output plugin doesn't know about aborted transactions and > (ii) subtransactions are reordered into a toplevel transaction and only > the committed pieces are passed to the plugin. > > You can test it by firing the regression tests (e.g. 'make test') or using > the following steps: > > postgresql.conf: > wal_level = logical > max_wal_senders = 2 > max_logical_slots = 2 > > start collecting WAL records: > > $ pg_recvlogical --slot=foo -d euler -f /dev/stdout > --plugin=json_decoding_plugin --init > > [execute some transactions] > > start printing decoded transactions: > > $ pg_recvlogical --slot=foo -d euler -f /dev/stdout --start > > stop collecting WAL records: > > $ pg_recvlogical --slot=foo -d euler -f /dev/stdout --stop > > > Comments? This is pretty neat. Couple minor questions: *) Aren't you *en*coding data into json, not the other way around (decoding?) *) Consider generating a long bytea instead of explicitly writing a 32kb sql into the patch. *) You've built your own json serializer here. Maybe some code can be shared with the json type? *) Consider removing 'plugin ' from the name of the plugin. --plugin=json_decoding etc.
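A consumer of the documented format is straightforward to sketch. The field names (kind, columnnames, columnvalues, oldkeys, ...) come from the plugin's output described above; the change_to_sql function and its escaping-free SQL rendering are purely illustrative:

```python
import json

def change_to_sql(change, schema_default='public'):
    """Render one decoded change (json output plugin format above) as SQL
    text.  Identifier/literal quoting is omitted -- illustration only."""
    tbl = f"{change.get('schema', schema_default)}.{change['table']}"
    kind = change['kind']
    if kind == 'insert':
        cols = ', '.join(change['columnnames'])
        vals = ', '.join(repr(v) for v in change['columnvalues'])
        return f"INSERT INTO {tbl} ({cols}) VALUES ({vals})"
    keys = change['oldkeys']
    where = ' AND '.join(f"{k} = {v!r}"
                         for k, v in zip(keys['keynames'],
                                         keys['keyvalues']))
    if kind == 'delete':
        return f"DELETE FROM {tbl} WHERE {where}"
    sets = ', '.join(f"{c} = {v!r}"
                     for c, v in zip(change['columnnames'],
                                     change['columnvalues']))
    return f"UPDATE {tbl} SET {sets} WHERE {where}"

# the xid 704 example from the quoted output
tx = json.loads('{"xid": 704, "change": [{"kind": "delete",'
                ' "schema": "public", "table": "foo",'
                ' "oldkeys": {"keynames": ["a", "b"],'
                ' "keyvalues": [1, 3]}}]}')
print(change_to_sql(tx['change'][0]))
```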
Re: [HACKERS] Backup throttling
On Fri, Dec 6, 2013 at 6:43 PM, Boszormenyi Zoltan wrote: > Hi, > > 2013-12-05 15:36 keltezéssel, Antonin Houska írta: > >> On 12/02/2013 02:23 PM, Boszormenyi Zoltan wrote: >>> >>> Hi, >>> >>> I am reviewing your patch. >> >> Thanks. New version attached. > > > I have reviewed and tested it and marked it as ready for committer. Here are the review comments: + -r + --max-rate You need to add something like rate. +The purpose is to limit impact of pg_basebackup +on a running master server. s/"master server"/"server" because we can take a backup from also the standby. I think that it's better to document the default value and the accepted range of the rate that we can specify. You need to change the protocol.sgml because you changed BASE_BACKUP replication command. +printf(_(" -r, --max-rate maximum transfer rate to transfer data directory\n")); You need to add something like =RATE just after --max-rate. +result = strtol(src, &after_num, 0); errno should be set to 0 just before calling strtol(). +if (errno_copy == ERANGE || result != (uint64) ((uint32) result)) +{ +fprintf(stderr, _("%s: transfer rate \"%s\" exceeds integer range\n"), progname, src); +exit(1); +} We can move this check after the check of "src == after_num" like parse_int() in guc.c does. If we do this, the local variable 'errno_copy' is no longer necessary. I think that it's better to output the hint message like "Valid units for the transfer rate are \"k\" and \"M\"." when a user specified wrong unit. +/* + * THROTTLING_SAMPLE_MIN / MAX_RATE_LOWER (in seconds) should be the + * longest possible time to sleep. Thus the cast to long is safe. + */ +pg_usleep((long) sleep); It's better to use the latch here so that we can interrupt immediately. Regards, -- Fujii Masao -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
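The parsing rules the review asks for (clear errno before strtol, reject overflow, accept only "k" and "M" units, range-check the result, hint on a bad unit) can be sketched in Python; the function name and the bounds are placeholders, and the real patch does this in C with strtol and ERANGE:

```python
def parse_max_rate(src, lower=32 * 1024, upper=1024 * 1024 * 1024):
    """Parse a --max-rate argument such as '100k' or '10M' into
    bytes/second.  lower/upper are made-up placeholders for the
    accepted range."""
    units = {'': 1, 'k': 1024, 'M': 1024 * 1024}
    num = src.rstrip('kM')
    unit = src[len(num):]
    if unit not in units or not num.isdigit():
        raise ValueError(f'invalid transfer rate "{src}". '
                         'Valid units for the transfer rate are '
                         '"k" and "M".')
    rate = int(num) * units[unit]
    if not lower <= rate <= upper:
        raise ValueError(f'transfer rate "{src}" is out of range')
    return rate

print(parse_max_rate('10M'))  # 10485760
```

In Python int() cannot overflow, so the ERANGE-style check collapses into the explicit range test; in C both checks are needed, which is why the review insists on errno = 0 before the strtol call.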
Re: [HACKERS] WITHIN GROUP patch
On 11/21/13, 5:04 AM, Atri Sharma wrote: > Please find attached the latest patch for WITHIN GROUP. This patch is > after fixing the merge conflicts. I would like to see more explanations and examples in the documentation. You introduce this feature with "Ordered set functions compute a single result from an ordered set of input values." But string_agg, for example, does that as well, so it's not clear how this is different. Between ordered aggregates, window functions, and this new feature, it can get pretty confusing. Also, the "hypothetical" part should perhaps be explained in more detail. The tutorial part of the documentation contains a nice introduction to window function. I suggest you add something like that as well. In func.sgml, please list the functions in alphabetical order. Also, don't write "should" when you mean "must". -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Time-Delayed Standbys
On 9 Dec 2013 12:16, "Craig Ringer" wrote: > The only way to "deal with" clock drift that isn't fragile in the face > of variable latency, etc, is to basically re-implement (S)NTP in order > to find out what the clock difference with the remote is. There's actually an entirely different way to "deal" with clock drift: treat "master time" and "slave time" as two different, incomparable spaces. Similar to how you would treat measurements in different units. If you do that then you can measure and manage the delay on the slave between receiving and applying a record, and also measure the amount of master server time which can be pending. These measurements don't depend at all on time sync between servers. The specified feature depends explicitly on the conversion between master and slave time spaces, so it's inevitable that sync would be an issue. It might be nice to print a warning on connection if the time is far out of sync, or to check periodically. But I don't think reimplementing NTP is a good idea.
[HACKERS] How to do fast performance timing
At Jane Street we have recently spent a lot of time trying to get a fast gettimeofday. I saw lots of references in various postgres hackers threads to the lack of such a facility. The culmination of those efforts can be read here: https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.mli and https://github.com/janestreet/core/blob/master/lib/time_stamp_counter.ml It's all OCaml, but the code is mostly imperative and very well documented. In particular, we made an effort to document our assumptions. There are a few which are OCaml-specific, but a lot of the lessons we have learned here should be applicable to postgres. Hope this will be useful, Cheers, Bene PS: We are releasing our code under the Apache license, so you should feel free to reuse the ideas.
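For a feel of the costs involved, here is a crude per-call benchmark of clock sources (nothing like the calibrated TSC reads in the linked OCaml code, but it shows the shape of the measurement):

```python
import time

def per_call_ns(clock, n=100_000):
    """Average per-call cost of a clock source, in nanoseconds.
    Loop overhead is included, so treat the numbers as relative."""
    start = time.perf_counter_ns()
    for _ in range(n):
        clock()
    return (time.perf_counter_ns() - start) / n

print(f"time.time       : {per_call_ns(time.time):6.0f} ns/call")
print(f"perf_counter_ns : {per_call_ns(time.perf_counter_ns):6.0f} ns/call")
```

The absolute numbers depend heavily on the platform and on whether the kernel can service the call via the vDSO; the point, as in the TSC library above, is that clock sources differ measurably in cost, so timing-heavy code benefits from picking the cheap one and calibrating it.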
Re: [HACKERS] Recovery to backup point
From: "Heikki Linnakangas"
> Thanks. Looks sane, although I don't much like the proposed interface to
> trigger this, setting recovery_target_time='backup_point'. What the code
> actually does is to stop recovery as soon as you reach consistency, which
> might not have anything to do with a backup. If you set it on a warm
> standby server, for example, it will end recovery as soon as it reaches
> consistency, but there was probably no backup taken at that point.

Thank you for reviewing so rapidly. I thought I would check for the end of backup in recoveryStopsHere(), by matching XLOG_BACKUP_END and ControlFile->backupStartPoint for backups taken on the primary, and comparing the current redo location with ControlFile->backupEndPoint for backups taken on the standby. However, that would duplicate much code in XLOG_BACKUP_END redo processing and checkRecoveryConsistency(). Besides, the code works only when the user explicitly requests recovery to the backup point, not when he starts the warm standby server. (I wonder if I'm answering correctly.)

> Hmm. I guess it's a nice work-around to use this option, but it doesn't
> really solve the underlying issue. The system might well reach consistency
> between deleting database files and the transaction commit, in which case
> you still have the same problem.

Yes, you're right. But I believe the trouble can be avoided most of the time.

> It would be nice to have a more robust fix for that. Perhaps we could use
> the safe_restartpoint machinery we have to not allow recovery to end until
> we see the commit record. I was really hoping to get rid of that machinery
> in 9.4, though, as it won't be needed for GIN and B-tree after the patches
> I have in the current commitfest are committed. In any case, that's a
> separate discussion and separate patch.

I think so, too. That still seems a bit difficult for me at the moment. If someone starts a discussion in a separate thread, I'd like to join it.
Regards
MauMau
Re: [HACKERS] [bug fix] pg_ctl always uses the same event source
From: "Amit Kapila"
> 1. isn't it better to handle as it is done in write_eventlog() which
> means if string is empty then use PostgreSQL.
> "evtHandle = RegisterEventSource(NULL, event_source ? event_source :
> "PostgreSQL");"

Thank you for reviewing. Yes, I did so with the first revision of this patch (see the first mail of this thread). I wanted to avoid duplicating the default value in both the server and pg_ctl code. If the user does not set event_source, postgres -C returns the default value "PostgreSQL" in the normal case, so I wanted to rely on it. I thought the second revision would be appreciated by PostgreSQL-based products like EnterpriseDB, because there are fewer source files to modify. But I don't mind which revision will be adopted.

> 2. What will happen if user doesn't change the name in "event_source" or
> kept the same name, won't it hit the same problem again? So shouldn't it
> try to generate different name by appending version string to it?

Yes, but I assume that the user has to set his own name to identify his instance uniquely. Even if a version string is added, the same issue can happen --- in the likely case where the user explicitly installs, for example, PostgreSQL 9.3 as a standalone database, as well as some packaged application that embeds PostgreSQL 9.3 which the user is unaware of. If the user installs multiple versions of PostgreSQL explicitly with the community installer, the installer can set event_source = 'PostgreSQL 9.3' and 'PostgreSQL 9.4' for each instance.

Regards
MauMau
Re: [HACKERS] Recovery to backup point
On 12/09/2013 02:03 PM, MauMau wrote:
> From: "Michael Paquier"
> As far as I recall, I don't think so. The problem and the way to solve
> that are clear. The only trick is to be sure that recovery is done just
> until a consistent point is reached, and to implement that cleanly. May I
> implement this feature and submit a patch for the next commitfest if I
> have time? Please feel free. I might as well participate in the review.
>
> I've done with the attached patch.

Thanks. Looks sane, although I don't much like the proposed interface to trigger this, setting recovery_target_time='backup_point'. What the code actually does is to stop recovery as soon as you reach consistency, which might not have anything to do with a backup. If you set it on a warm standby server, for example, it will end recovery as soon as it reaches consistency, but there was probably no backup taken at that point.

> I also confirmed that the problem I raised in the first mail of the below
> thread was solved with this patch.
>
> [bug fix] PITR corrupts the database cluster
> http://www.postgresql.org/message-id/F93E42280A9A4A5EB74FC7350C801A20%40maumau

Hmm. I guess it's a nice work-around to use this option, but it doesn't really solve the underlying issue. The system might well reach consistency between deleting database files and the transaction commit, in which case you still have the same problem.

It would be nice to have a more robust fix for that. Perhaps we could use the safe_restartpoint machinery we have to not allow recovery to end until we see the commit record. I was really hoping to get rid of that machinery in 9.4, though, as it won't be needed for GIN and B-tree after the patches I have in the current commitfest are committed. In any case, that's a separate discussion and separate patch.

- Heikki
Re: [HACKERS] Time-Delayed Standbys
On 12/04/2013 02:46 AM, Robert Haas wrote:
>> Thanks for your review Christian...
>
> So, I proposed this patch previously and I still think it's a good
> idea, but it got voted down on the grounds that it didn't deal with
> clock drift. I view that as insufficient reason to reject the
> feature, but others disagreed. Unless some of those people have
> changed their minds, I don't think this patch has much future here.

Surely that's the operating system / VM host / sysadmin / whatever's problem? The only way to "deal with" clock drift that isn't fragile in the face of variable latency, etc, is to basically re-implement (S)NTP in order to find out what the clock difference with the remote is. If we're going to do that, why not just let the OS deal with it?

It might well be worth complaining about obvious aberrations like timestamps in the local future - preferably by complaining and not actually dying. It does need to be able to cope with a *skewing* clock, but I'd be surprised if it had any issues there in the first place.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Recovery to backup point
From: "Michael Paquier"
> As far as I recall, I don't think so. The problem and the way to solve
> that are clear. The only trick is to be sure that recovery is done just
> until a consistent point is reached, and to implement that cleanly.

May I implement this feature and submit a patch for the next commitfest if I have time?

> Please feel free. I might as well participate in the review.

I've done with the attached patch. I also confirmed that the problem I raised in the first mail of the below thread was solved with this patch.

[bug fix] PITR corrupts the database cluster
http://www.postgresql.org/message-id/F93E42280A9A4A5EB74FC7350C801A20@maumau

I'm wondering if I can do this with cleaner and less code. I would be grateful if you could give me any advice.

Regards
MauMau

recover_to_backup.patch
Description: Binary data
Re: [HACKERS] Time-Delayed Standbys
On 2013-12-09 19:51:01 +0900, KONDO Mitsumasa wrote:
> Add my comment. We have to consider three situations.
>
> 1. PITR
> 2. replication standby
> 3. replication standby with restore_command
>
> I think this patch cannot delay in 1 situation.

Why?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services