RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> On 2019-Sep-03, Tsunakawa, Takayuki wrote:
> > I don't think it's rejected. It would be a pity (mottainai) to refuse
> > this, because it provides significant speedup despite its simple
> > modification.
>
> I don't necessarily disagree with your argumentation, but Travis is
> complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail, but I'm a bit at a loss.

First, to accurately count the maximum number of locks acquired in a transaction, we need to track the maximum number of entries in the hash table and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (which is not necessary for LockMethodLocalHash), we would have to add a spinlock to hashhdr and lock/unlock it when entering and removing entries in the hash table. That would spoil the effort to decrease contention via hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c instead of in the hash table?

Second, I couldn't understand the comment about the fill factor well. I can see that it's not correct to compare the number of hash buckets with the number of locks, but what can we do instead? I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the classic database textbook and seems intuitive.

So I attached the rebased version.

Regards
Takayuki Tsunakawa

Attachment: faster-locallock-scan_v3.patch
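For concreteness, here is a minimal sketch of the lock.c global-variable idea mentioned above (every name in it is hypothetical, not taken from any posted patch):

    /* lock.c: track the peak number of local locks in a transaction */
    static long numLocalLocksHeld = 0;  /* current count (hypothetical) */
    static long maxLocalLocksHeld = 0;  /* peak within this transaction */

    static inline void
    TrackLocalLockCount(long delta)
    {
        numLocalLocksHeld += delta;
        if (numLocalLocksHeld > maxLocalLocksHeld)
            maxLocalLocksHeld = numLocalLocksHeld;
    }

    /*
     * At transaction end, maxLocalLocksHeld can drive the decision to
     * rebuild LockMethodLocalHash; reset both counters afterwards.
     */

This would avoid touching dynahash (and hence the shared partitioned hash tables) entirely, at the cost of a little bookkeeping at each lock acquisition and release.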
RE: [bug fix??] Fishy code in tts_virtual_copyslot()
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I temporarily changed the Assert to be "==" rather than "<=", and
> it still passed check-world, so evidently we are not testing any
> cases where the descriptors are of different lengths. This explains
> the lack of symptoms. It's still a bug though, so pushed.

Thank you for committing.

> > BTW, I thought that in PostgreSQL coding convention, local variables
> > should be defined at the top of blocks, but this function writes
> > "for (int natt ...".
>
> Yeah, we've agreed to join the 21st century to the extent of allowing
> local for-loop variables.

That's good news. It'll help a bit in writing code more comfortably.

Regards
Takayuki Tsunakawa
[bug fix??] Fishy code in tts_virtual_copyslot()
Hello,

In the following code in execTuples.c, shouldn't srcdesc point to the source slot's tuple descriptor? The attached fix passes make check. What kind of failure could this cause?

BTW, I thought that in PostgreSQL coding convention, local variables should be defined at the top of blocks, but this function writes "for (int natt ...". I didn't modify it, because many other source files are written that way.

--
static void
tts_virtual_copyslot(TupleTableSlot *dstslot, TupleTableSlot *srcslot)
{
    TupleDesc   srcdesc = dstslot->tts_tupleDescriptor;

    Assert(srcdesc->natts <= dstslot->tts_tupleDescriptor->natts);

    tts_virtual_clear(dstslot);

    slot_getallattrs(srcslot);

    for (int natt = 0; natt < srcdesc->natts; natt++)
    {
        dstslot->tts_values[natt] = srcslot->tts_values[natt];
        dstslot->tts_isnull[natt] = srcslot->tts_isnull[natt];
    }
--

Regards
Takayuki Tsunakawa

Attachment: virtslot_tiny_fix.patch
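For reference, the fix described above is a one-liner; presumably virtslot_tiny_fix.patch amounts to this:

    -    TupleDesc   srcdesc = dstslot->tts_tupleDescriptor;
    +    TupleDesc   srcdesc = srcslot->tts_tupleDescriptor;

With that change, the Assert compares the source descriptor's natts against the destination's, which is what the copy loop actually assumes.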
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> Imagine an application which relies on Postgres, still does *not* start
> it as a service but uses "pg_ctl start" automatically. This could be
> triggered as part of another service startup which calls say system(),
> or as another script. Wouldn't that cause potentially a freeze?

Do you mean by "another service startup":

another service (Windows service) -> an application (not a Windows service) -> pg_ctl start

Then pg_ctl runs under the Windows service environment, and the popup won't appear.

> I am also not sure that I quite understand why a
> popup would never show up if a different service startup triggers pg_ctl
> start by itself that fails.

I experimented with the sample program that I showed in this thread. A program invoked by a Windows service also runs in the service.

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: David Steele [mailto:da...@pgmasters.net]
> > Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren,
> > just in case they are slow to respond to or ignore SIGINT/SIGTERM? That
> > matches the idea of pg_ctl's immediate shutdown.
>
> -1, at least not immediately. Archivers can be complex processes and
> they should be given the chance to do a graceful shutdown.

I agree that the user's archiver program should get the chance to stop gracefully in a smart or fast shutdown. But I think that in an immediate shutdown, everything should stop immediately. That's what I expect from the word "immediate."

If the grandchildren left running don't disturb the cleanup of PostgreSQL's resources (shared memory, file/directory access, etc.) or the restart of PostgreSQL, we may well be able to just advise the grandchildren to stop immediately with SIGINT/SIGTERM. However, in the failover of shared-disk HA clustering, for example, when the clustering software stops PostgreSQL with "pg_ctl stop -m immediate" and then tries to unmount the file systems for $PGDATA and the archived WAL, the unmount may take time or fail due to the lingering access from PostgreSQL's grandchildren.

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> SIGTERM, which needs to be adjusted. For another, its
> SIGQUIT handler does exit(1) not _exit(2), which seems rather
> dubious ... should we make it more like the rest? I think
> the reasoning there might've been that if some DBA decides to
> SIGQUIT the archiver, we don't need to force a database-wide
> reset; but why exactly should we tolerate that?

The postmaster doesn't distinguish among the archiver's nonzero return codes, and just restarts the archiver unless the postmaster is shutting down. So we can use _exit(2) like the other children.

Can't we use SIGKILL instead of SIGINT/SIGTERM to stop the grandchildren, just in case they are slow to respond to or ignore SIGINT/SIGTERM? That matches the idea of pg_ctl's immediate shutdown. (Windows cannot stop grandchildren because kill() in src/port/kill.c doesn't support process groups... but that's a separate topic.)

Regards
Takayuki Tsunakawa
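On POSIX systems, reaching the grandchildren is just a matter of signalling the process group; a minimal sketch (assuming the archiver was made a group leader with setpgid(), and eliding error handling):

    #include <signal.h>
    #include <sys/types.h>

    /*
     * Forcibly stop the archiver's whole process group, grandchildren
     * included.  ESRCH simply means everything is already gone.
     */
    static void
    kill_archiver_group(pid_t archiver_pid)
    {
        (void) kill(-archiver_pid, SIGKILL);    /* negative pid = group */
    }

This negative-pid form is exactly what kill() in src/port/kill.c cannot emulate on Windows, as noted above.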
RE: Libpq support to connect to standby server as priority
From: Alvaro Herrera from 2ndQuadrant [mailto:alvhe...@alvh.no-ip.org]
> Testing protocol version 2 is difficult! Almost every single test fails
> because of error messages being reported differently; and streaming
> replication (incl. pg_basebackup) doesn't work at all because it's not
> possible to establish replication connections. Manual inspection shows
> it behaves correctly.

Yeah, the code path for protocol v2 is sometimes annoying. I wish v2 support would be dropped soon. I know there was a discussion on it, but I didn't track the conclusion.

> Remaining patchset attached (per my count it's v13 of your patchset.)

I'm afraid those weren't attached.

> think we should merge one half of it with each of the other two patches
> where the changes are introduced (0003 and 0004). I'm not convinced
> that we need 0004-0006 to be separate commits.

It was hard to review those separate patches, so I think it's better to merge them. OTOH, I can understand Haribabu-san's idea that the community may not accept the latter patches, e.g., accept only 0001-0005.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Alvaro Herrera from 2ndQuadrant writes:
> > Well, IMV this is a backpatchable, localized bug fix.
>
> I dunno. This thread is approaching two years old, and a quick
> review shows few signs that we actually have any consensus on
> making behavioral changes here. If there is any consensus,
> it's that the SetErrorMode calls should depend on checking
> pgwin32_is_service(); but we haven't even seen a patch that
> does that, let alone tested it. I think we're way too close
> to beta4 wrap to be considering pushing such a patch the moment
> it appears.

We don't have to call pgwin32_is_service() to determine whether we call SetErrorMode() in postmaster.c, because:

* The dialog box doesn't appear under a Windows service, whether PostgreSQL itself starts as a service or another (server) application runs as a service and does "pg_ctl start."

* It's OK for the dialog box to appear when the user runs "pg_ctl start" at the command prompt. That usage is interactive and should not be used in production (postgres processes vanish when the user mistakenly presses Ctrl+C or closes the command prompt). Even now, the dialog box pops up if postmaster crashes before main().

> BTW, I also violently dislike taking Windows-specific stuff out of
> startup_hacks where it belongs and putting it into main() where
> it doesn't. I think the entire Windows bit in front of get_progname
> should migrate into startup_hacks. Surely the odds of failure
> inside get_progname are not worth worrying about --- they seem
> significantly less than the odds of failure inside
> pgwin32_is_service(), for instance.

Agreed. Fixed. I changed the CF status to "needs review."

Regards
Takayuki Tsunakawa

Attachment: crash_dump_before_main_v3.patch
RE: [bug fix] Produce a crash dump before main() on Windows
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The last patch submitted is here:
> https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1F8ECF73@G01JPEXMBYT05
> And based on the code paths it touches I would recommend to not play with
> REL_12_STABLE at this stage.

I'm reading the thread to see what I should do... Anyway, I don't think we should rush to include this fix in PG 12, either. But at the same time, I don't think it would be dangerous to put it in PG 12.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> Hmm ... is this patch rejected, or is somebody still trying to get it to
> committable state? David, you're listed as committer.

I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows:

[Tom's comment against v2]
FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks.

On the whole I don't think there's an adequate case for committing this patch.

I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. I'd rather see us do something to let hash_seq_search win across the board.

My responses:

* Extreme test case: Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after a generic plan is automatically created. That's the reason for this proposal.

* 0.5% slowdown with pgbench: I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.

* Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
  - AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
  - LockReleaseSession(): advisory locks
  - LockReleaseCurrentOwner(): ??
  - LockReassignCurrentOwner(): ??

Regards
Takayuki Tsunakawa
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Kyotaro Horiguchi [mailto:horikyota@gmail.com]
> Since we are allowing OPs to use arbitrary command as
> archive_command, providing a replacement with non-standard signal
> handling for a specific command doesn't seem a general solution
> to me. Couldn't we have pg_system(a tentative name), which
> intercepts SIGQUIT then sends SIGINT to children? Might be need
> to resend SIGQUIT after some interval, though..

The same idea that you refer to as pg_system occurred to me, too. But I wondered whether the archiver process can get the pid of its child (the shell? archive_command?) while keeping the capabilities of system() (i.e., the shell). Even if we fork() and then system(), doesn't the OS send SIGQUIT to all descendants of the archiver when postmaster sends SIGQUIT to the child process group?

> # Is there any means to view the whole of a thread from archive?
> # I'm a kind of reluctant to wander among messages like a rat in
> # a maze:p

You can see the whole thread by clicking the "Whole Thread" link.

Regards
Takayuki Tsunakawa
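For what it's worth, a rough sketch of such a pg_system() might look like this (entirely hypothetical; signal-safety and error handling are glossed over): run the command in its own process group so that a SIGQUIT caught by the wrapper can be forwarded to the children as SIGINT:

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigquit = 0;

    static void
    quit_handler(int signo)
    {
        got_sigquit = 1;
    }

    static int
    pg_system(const char *command)
    {
        pid_t   child;
        int     status = -1;

        signal(SIGQUIT, quit_handler);

        child = fork();
        if (child < 0)
            return -1;
        if (child == 0)
        {
            setpgid(0, 0);      /* run the command in its own group */
            execl("/bin/sh", "sh", "-c", command, (char *) NULL);
            _exit(127);         /* exec failed */
        }

        for (;;)
        {
            pid_t   res = waitpid(child, &status, WNOHANG);

            if (res == child || res < 0)
                break;          /* command finished, or error */
            if (got_sigquit)
                (void) kill(-child, SIGINT);    /* advise children to exit */
            usleep(10000);      /* 10 ms poll; not tuned */
        }
        return status;
    }

Whether the grandchildren would still receive the postmaster's original SIGQUIT depends on how the process groups end up arranged, which is exactly the open question above.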
RE: SIGQUIT on archiver child processes maybe not such a hot idea?
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> After investigation, the mechanism that's causing that is that the
> src/test/recovery/t/010_logical_decoding_timelines.pl test shuts
> down its replica server with a mode-immediate stop, which causes
> that postmaster to shut down all its children with SIGQUIT, and
> in particular that signal propagates to a "cp" command that the
> archiver process is executing. The "cp" is unsurprisingly running
> with default SIGQUIT handling, which per the signal man page
> includes dumping core.

We experienced this (a core dump left in the data directory by an archive command) years ago. Related to this, the example of using cp in the PostgreSQL manual is misleading, because cp doesn't reliably persist the WAL archive file.

> This makes me wonder whether we shouldn't be using some other signal
> to shut down archiver subprocesses. It's not real cool if we're
> spewing cores all over the place. Admittedly, production servers
> are likely running with "ulimit -c 0" on most modern platforms,
> so this might not be a huge problem in the field; but accumulation
> of core files could be a problem anywhere that's configured to allow
> server core dumps.

We enable core dumps in production to help investigation, just in case.

> Ideally, perhaps, we'd be using SIGINT not SIGQUIT to shut down
> non-Postgres child processes. But redesigning the system's signal
> handling to make that possible seems like a bit of a mess.
>
> Thoughts?

We're using a shell script and a command that's called from the shell script. That is:

archive_command = 'call some_shell_script.sh ...'

[some_shell_script.sh]
ulimit -c 0              # no core dumps from the copy command
trap 'exit 1' QUIT       # just exit on receipt of SIGQUIT
some_command ...         # copies and syncs the file

some_command also catches SIGQUIT and just exits. It copies and syncs the file.

I proposed something along this line as below, but I couldn't respond to Peter's review comments due to other tasks. Does anyone think it's worth resuming this?

https://www.postgresql.org/message-id/7E37040CF3804EA5B018D7A022822984@maumau

Regards
Takayuki Tsunakawa
RE: Why overhead of SPI is so large?
From: Konstantin Knizhnik [mailto:k.knizh...@postgrespro.ru]
> PL/pgSQL: 29044.361 ms
> C/SPI: 22785.597 ms
>
> The fact that difference between PL/pgSQL and function implemented in C
> using SPI is not so large was expected by me.

This PL/pgSQL overhead is not so significant compared with the threefold difference, but it makes me wish for a feature like Oracle's ALTER PROCEDURE ... COMPILE, which compiles the PL/SQL logic to native code. I've seen it give speedups of a few dozen percent.

Regards
Takayuki Tsunakawa
RE: [Patch] PQconnectPoll() is blocked if target_session_attrs is read-write
From: Matsumura, Ryo [mailto:matsumura@jp.fujitsu.com]
> Detail:
> If target_session_attrs is set to read-write, PQconnectPoll() calls
> PQsendQuery("SHOW transaction_read_only") althogh previous return value
> was PGRES_POLLING_READING not WRITING.

The current code probably assumes that the PQsendQuery() that sends "SHOW transaction_read_only" shouldn't block, because the message is small enough to fit in the socket send buffer. Likewise, the code in the CONNECTION_AWAITING_RESPONSE case sends authentication data using pg_fe_sendauth() without checking for write-ready status. OTOH, the code in the CONNECTION_MADE case waits for write-ready status before sending the startup packet. That's because the startup packet could get large enough to cause pqPacketSend() to block. So, I don't think the fix is necessary.

> In result, PQsendQuery() may be blocked in pqsecure_raw_write().

FWIW, if PQsendQuery() blocked during connection establishment, I think it would block in poll() called from pqWait(), because libpq's socket is set non-blocking.

Regards
Takayuki Tsunakawa
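The CONNECTION_MADE behavior described above is the usual non-blocking send pattern; a hedged sketch of the idea:

    #include <sys/select.h>

    /*
     * Wait until a non-blocking socket can accept more data before
     * sending a message that might not fit in the send buffer.
     * PQconnectPoll achieves the same effect by returning
     * PGRES_POLLING_WRITING and letting the caller poll the socket.
     */
    static int
    wait_until_write_ready(int sock)
    {
        fd_set  wfds;

        FD_ZERO(&wfds);
        FD_SET(sock, &wfds);
        return select(sock + 1, NULL, &wfds, NULL, NULL);
    }

Skipping this wait for a message known to fit in the send buffer (like the small "SHOW transaction_read_only") is precisely the assumption the current code makes.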
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Another counter-argument to this is that there's already an
> unexplainable slowdown after you run a query which obtains a large
> number of locks in a session or use prepared statements and a
> partitioned table with the default plan_cache_mode setting. Are we not
> just righting a wrong here? Albeit, possibly 1000 queries later.
>
> I am, of course, open to other ideas which solve the problem that v5
> has, but failing that, I don't see v6 as all that bad. At least all
> the logic is contained in one function. I know Tom wanted to steer
> clear of heuristics to reinitialise the table, but most of the stuff
> that was in the patch back when he complained was trying to track the
> average number of locks over the previous N transactions, and his
> concerns were voiced before I showed the 7% performance regression
> with unconditionally rebuilding the table.

I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown.

OTOH, how about my original patch that is based on the local lock list? I expect that it wouldn't cause such a significant slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> For the use case we've been measuring with partitioned tables and the
> generic plan generation causing a sudden spike in the number of
> obtained locks, then having plan_cache_mode = force_custom_plan will
> cause the lock table not to become bloated. I'm not sure there's
> anything interesting to measure there.

I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value):

* plan_cache_mode = force_custom_plan: LocalLockHash won't bloat. The query execution time is steady.

* plan_cache_mode = auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution, when LocalLockHash is shrunk.

Depending on the number of transactions and what each transaction does, I thought the difference would or would not be noticeable.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I personally don't think that's true. The only way you'll notice the
> LockReleaseAll() overhead is to execute very fast queries with a
> bloated lock table. It's pretty hard to notice that a single 0.1ms
> query is slow. You'll need to execute thousands of them before you'll
> be able to measure it, and once you've done that, the lock shrink code
> will have run and the query will be performing optimally again.

Maybe so. Will the difference be noticeable between plan_cache_mode = auto (the default) and plan_cache_mode = force_custom_plan?

> I voice my concerns with v5 and I wasn't really willing to push it
> with a known performance regression of 7% in a fairly valid case. v6
> does not suffer from that.

You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput.

> > Where else does the extra overhead come from?
>
> hash_get_num_entries(LockMethodLocalHash) == 0 &&
> + hash_get_max_bucket(LockMethodLocalHash) >
> + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD)
>
> that's executed every time, not every 1000 times.

I see. Thanks.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I went back to the drawing board on this and I've added some code that
> counts the number of times we've seen the table to be oversized and
> just shrinks the table back down on the 1000th time. 6.93% / 1000 is
> not all that much.

I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder: "Why is the same query fast at first in the session (5 or 6 executions), then slower for a while, and then fast again? Is there something to tune? Is something wrong with my app (e.g., how I use prepared statements)?" So I prefer v5.

> Of course, not all the extra overhead might be from rebuilding the
> table, so here's a test with the updated patch.

Where else does the extra overhead come from?

Regards
Takayuki Tsunakawa
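The counting heuristic being discussed can be pictured with a short sketch; the function names follow the patch fragments quoted elsewhere in this thread, while the counter and constant are hypothetical:

    #define MAX_OVERSIZE_OBSERVATIONS 1000  /* hypothetical */

    static int  oversize_count = 0;

    /* Called from LockReleaseAll() once the hash table is empty. */
    static void
    MaybeShrinkLocalLockHash(void)
    {
        if (hash_get_max_bucket(LockMethodLocalHash) >
                LOCKMETHODLOCALHASH_SHRINK_THRESHOLD &&
            ++oversize_count >= MAX_OVERSIZE_OBSERVATIONS)
        {
            CreateLocalLockHash();  /* rebuild at the initial size */
            oversize_count = 0;
        }
    }

The objection above is precisely that the user-visible effect of this heuristic (fast, then slower for about a thousand transactions, then fast again) is unexplainable to users.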
RE: Zedstore - compressed in-core columnar storage
From: Ashwin Agrawal [mailto:aagra...@pivotal.io]
> The objective is to gather feedback on design and approach to the same.
> The implementation has core basic pieces working but not close to complete.

Thank you for proposing a very interesting topic. Are you thinking of including this in PostgreSQL 13 if possible?

> * All Indexes supported
...
> work. Btree indexes can be created. Btree and bitmap index scans work.

Does Zedstore allow creating indexes of the existing types on the table (btree, GIN, BRIN, etc.) and performing index scans (point queries, range queries, etc.)?

> * Hybrid row-column store, where some columns are stored together, and
> others separately. Provide flexibility of granularity on how to
> divide the columns. Columns accessed together can be stored
> together.
...
> This way of laying out the data also easily allows for hybrid row-column
> store, where some columns are stored together, and others have a dedicated
> B-tree. Need to have user facing syntax to allow specifying how to group
> the columns.
...
> Zedstore Table can be
> created using command:
>
> CREATE TABLE (column listing) USING zedstore;

Are you aiming to enable Zedstore to be used for HTAP, i.e., the same table can be accessed simultaneously for both OLTP and analytics with minimal performance impact on OLTP? (I got that impression from the word "hybrid.") If yes, is the assumption that only a limited number of columns are stored in columnar format (for efficient scanning), while many other columns are stored in row format for efficient tuple access? Are those row-formatted columns stored in the same file as the column-formatted columns, or in a separate file?

Regarding the column grouping, should I imagine something like HBase and Cassandra? How could the current CREATE TABLE syntax support column grouping? (I guess CREATE TABLE needs a syntax for columnar store, and Zedstore needs to be incorporated in core, not as an extension...)

> A column store uses the same structure but we have *multiple* B-trees, one
> for each column, all indexed by TID. The B-trees for all columns are stored
> in the same physical file.

Did you conclude that it's not a good idea to have a different file for each group of columns? Is that because we can't expect physical adjacency of data blocks on disk even if we separate a column into a separate file? I thought a separate file for each group of columns would be easier and less error-prone to implement and debug. Adding and dropping a column group would also be very easy and fast.

Regards
Takayuki Tsunakawa
RE: Commitfest 2019-07, the first of five* for PostgreSQL 13
From: Stephen Frost [mailto:sfr...@snowman.net]
> sh, don't look now, but there might be a "Resend email" button in the
> archives now that you can click to have an email sent to you...
>
> Note that you have to be logged in, and the email will go to the email
> address that you're logging into the community auth system with.
>
> (thank you Magnus)

Thank you so much, Magnus. This is very convenient. I'm forced to use Outlook at work, which doesn't allow replying to a downloaded email. Your help eliminates the need to save all emails.

Regards
Takayuki Tsunakawa
RE: [bug fix] Produce a crash dump before main() on Windows
From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> Tsunakawa/Haribabu - By reading this thread briefly, it seems we need
> some more inputs from other developers on whether to fix this or not,
> so ideally the status of this patch should be 'Needs Review'. Why it
> is in 'Waiting on Author' state? If something is required from
> Author, then do we expect to see the updated patch in the next few
> days?

Thank you for paying attention to this. I think the patch is good, but someone else may have a different solution. So I marked it as needs review.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> v5 is attached.

Thank you, it looks good. I find it ready for committer (I noticed the status is already set to that).

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> I've revised the patch to add a new constant named
> LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash

Thank you, and good performance. The patch passed make check. I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't).

(1)
+#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64

How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value that triggers shrinkage? Code in PostgreSQL seems to use the term "threshold".

(2)
+/* Complain if the above are not set to something sane */
+#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE
+#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE"
+#endif

I don't think these are necessary, because these values are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros:

#define ALLOCSET_DEFAULT_MINSIZE   0
#define ALLOCSET_DEFAULT_INITSIZE  (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE   (8 * 1024 * 1024)

(3)
+	if (hash_get_num_entries(LockMethodLocalHash) == 0 &&
+		hash_get_max_bucket(LockMethodLocalHash) >
+		LOCKMETHODLOCALHASH_SHRINK_SIZE)
+		CreateLocalLockHash();

I get the impression that "Create" just creates something where there's nothing. How about Init or Recreate?

Regards
Takayuki Tsunakawa
RE: [PATCH] Speedup truncates of relation forks
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> for (i = 0; i < NBuffers; i++)
> {
> (snip)
>     buf_state = LockBufHdr(bufHdr);
>
>     /* check with the lower bound and skip the loop */
>     if (bufHdr->tag.blockNum < minBlock)
>     {
>         UnlockBufHdr(bufHdr, buf_state);
>         continue;
>     }
>
>     for (k = 0; k < nforks; k++)
>     {
>         if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
>             bufHdr->tag.forkNum == forkNum[k] &&
>             bufHdr->tag.blockNum >= firstDelBlock[k])
>
> But since we acquire the buffer header lock after all and the number
> of the internal loops is small (at most 3 for now) the benefit will
> not be big.

Yeah, so I think we can just compare the block number without locking the buffer header here.

Regards
Takayuki Tsunakawa
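A minimal sketch of that idea: do the racy, unlocked read first as a cheap pre-filter, and re-verify everything under the buffer header lock (this mirrors the unlocked relfilenode pre-check that DropRelFileNodeBuffers already performs, which is safe because no new buffers for the relation can appear during the drop):

    /* Unlocked, possibly stale read: cheap pre-filter only. */
    if (bufHdr->tag.blockNum < minBlock)
        continue;

    buf_state = LockBufHdr(bufHdr);

    /* Recheck all tag fields under the buffer header lock. */
    if (RelFileNodeEquals(bufHdr->tag.rnode, rnode.node) &&
        bufHdr->tag.forkNum == forkNum[k] &&
        bufHdr->tag.blockNum >= firstDelBlock[k])
        InvalidateBuffer(bufHdr);   /* releases the spinlock */
    else
        UnlockBufHdr(bufHdr, buf_state);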
RE: [PATCH] Speedup truncates of relation forks
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> We do RelationTruncate() also when we truncate heaps that are created
> in the current transactions or has a new relfilenodes in the current
> transaction. So I think there is a room for optimization Thomas
> suggested, although I'm not sure it's a popular use case.

Right, and I can't think of a use case that motivates the optimization, either.

> I've not look at this patch deeply but in DropRelFileNodeBuffer I
> think we can get the min value of all firstDelBlock and use it as the
> lower bound of block number that we're interested in. That way we can
> skip checking the array during scanning the buffer pool.

That sounds reasonable, although I haven't examined the code, either.

> Don't we use each elements of nblocks for each fork? That is, each
> fork uses an element at its fork number in the nblocks array and sets
> InvalidBlockNumber for invalid slots, instead of passing the valid
> number of elements. That way the following code that exist at many
> places,

I think the current patch tries to reduce the loop count in DropRelFileNodeBuffers() by passing the number of target forks.

Regards
Takayuki Tsunakawa
RE: [PATCH] Speedup truncates of relation forks
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> Years ago I've implemented an optimization for many DROP TABLE commands
> in a single transaction - instead of scanning buffers for each relation,
> the code now accumulates a small number of relations into an array, and
> then does a bsearch for each buffer.
>
> Would something like that be applicable/useful here? That is, if we do
> multiple TRUNCATE commands in a single transaction, can we optimize it
> like this?

Unfortunately not. VACUUM and autovacuum handle each table in a different transaction.

BTW, what we really want to do is to keep the failover time within 10 seconds. The customer periodically TRUNCATEs tens of thousands of tables. If failover unluckily happens immediately after those TRUNCATEs, the recovery on the standby could take much longer. Your past improvement seems likely to prevent that problem, provided the customer TRUNCATEs the tables in the same transaction. On the other hand, it's quite possible that the customer can only TRUNCATE a single table per transaction, and thus runs as many transactions as there are TRUNCATEd tables. So we also want to speed up each TRUNCATE by touching only the buffers for the table, not scanning the whole of shared buffers. Andres proposed one method that uses a radix tree, but we don't have an idea of how to do it yet.

Speeding up each TRUNCATE and its recovery is a different topic. The patch proposed here is one possible improvement to shorten the failover time.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> Thanks for the info, so I marked the patch as committed.

Thanks a lot for your hard work! This one felt relatively tough despite the simplicity of the patch. I'm starting to feel the difficulty of, and fatigue from, developing in the community...

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com]
> "vacuum_truncate" gets my vote too.

+1

From: 'Andres Freund' [mailto:and...@anarazel.de]
> Personally I think the name just needs some committer to make a
> call. This largely is going to be used after encountering too many
> cancellations in production, and researching the cause. Minor spelling
> differences don't seem to particularly matter here.

Absolutely. Thank you.

From: 'Andres Freund' [mailto:and...@anarazel.de]
> I think it needs to be an autovac option. The production issue is that
> autovacuums constantly cancel queries on the standbys despite
> hot_standby_feedback if you have a workload that does frequent
> truncations. If there's no way to configure it in a way that autovacuum
> takes it into account, people will just disable autovacuum and rely
> entirely on manual scripts. That's what already happens - leading to a
> much bigger foot-gun than disabling truncation. FWIW, people already in
> production use the workaround of configuring snapshot_too_old as that,
> for undocumented reasons, disables truncations. That's not better
> either.

Right. We don't want autovacuum to be considered a criminal.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> And, as far as I can see from a quick review of the thread,
> we don't really have consensus on the names and behaviors.

The consensus on the name seems to be to use "truncate" rather than "shrink" (a few people kindly said they like "shrink", and I'm OK with either name), and there's no dispute about the behavior. Do you see any other open point?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> It would be good to get your view on the
> shrink_bloated_locallocktable_v3.patch I worked on last night. I was
> unable to measure any overhead to solving the problem that way.

Thanks, it looks super simple and good. As I understand it, the ideas behind your patch are:

* Transactions that touch many partitions and/or tables are a special event, not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to its original size.

* Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations).

TBH, the linked list approach feels more intuitive to me, because the resulting code looks like what it wants to do (efficiently iterate over the acquired locks) and is based on the classic textbook. But your approach seems to put people at ease, so I'm OK with your patch. I'm registering you as another author and myself as a reviewer, and marking this ready for committer.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: 'Andres Freund' [mailto:and...@anarazel.de]
> On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> > I think the linked list of LOCALLOCK approach is natural, simple, and
> > good.
>
> Did you see that people measured slowdowns?

Yeah, a 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within the noise, as was mentioned. If we want to remove even the noise, we may have to think of removing LocalLockHash completely, but that doesn't seem feasible...

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win. I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
>
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought so many partitions indicated a bad application design, but their reasoning sounded valid (or at least I couldn't insist it was bad). PostgreSQL is now expected to handle such huge workloads.

From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries. We also
> probably fairly easily could win some space back - e.g. make nLocks 32
> bits.

+1

From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
>
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good. In Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we can find the following sentence in section 8.4.5, "Lock Manager Internal Logic"; the sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks:

"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."

Handling this issue with the LOCALLOCK linked list is also more natural than resizing the hash table. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism to find a particular item quickly, so it was simply wrong to use the hash table to iterate over all grabbed locks. Also, the hash table got big because some operation in the session needed it, and some subsequent operations in the same session may need it again. So shrinking the hash table wouldn't really put us at ease.

Regards
Takayuki Tsunakawa
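As a sketch, the linked-list idea needs only a list link inside each LOCALLOCK and a backend-local list head, using PostgreSQL's ilist.h (field and variable names here are hypothetical):

    #include "lib/ilist.h"

    typedef struct LOCALLOCK
    {
        /* ... existing fields ... */
        dlist_node  lockLink;   /* hypothetical: chains all held locks */
    } LOCALLOCK;

    static dlist_head heldLocalLocks = DLIST_STATIC_INIT(heldLocalLocks);

    /*
     * On first-time acquisition of a lock:
     *     dlist_push_tail(&heldLocalLocks, &locallock->lockLink);
     *
     * In LockReleaseAll(), iterate only over what is actually held:
     *     dlist_foreach_modify(iter, &heldLocalLocks) ...
     */

This makes release time proportional to the number of locks actually held, at the price Tom measured: the two extra pointers grow sizeof(LOCALLOCK) from 72 to 88 bytes on a 64-bit machine.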
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
Hi Andres, Fujii-san, and any committer,

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-08 09:52:27 +0900, Fujii Masao wrote:
> > I'm thinking to commit this patch at first. We can change the term
> > and add the support of "TRUNCATE" option for VACUUM command later.
>
> I hope you realize feature freeze is in a few hours...

Ouch! Could you take care of committing the patch, please? I wouldn't be able to express the sadness and tiredness if this were pushed to PG 13 just because of the parameter name...

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I have just committed the GUC and libpq portion for TCP_USER_TIMEOUT after
> a last lookup, and I have cleaned up a couple of places.

Thank you for the further cleanup and committing.

> For the socket_timeout stuff, its way of solving the problem it thinks it
> solves does not seem right to me, and this thread has not reached a consensus
> anyway, so I have discarded the issue.
>
> I am marking the CF entry as committed. In the future, it would be better
> to not propose multiple concepts on the same thread, and if the
> socket_timeout business is resubmitted, I would suggest a completely new
> CF entry, and a new thread.

Understood. Looking back at the review process, tcp_user_timeout and socket_timeout should indeed have been handled in separate threads.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Michael Paquier [mailto:mich...@paquier.xyz]
> The first letter should be upper-case.

Thank you for taking care of this patch, and sorry for causing you the trouble of fixing that...

> to me that socket_timeout_v14.patch should be rejected as it could cause
> a connection to go down with no actual reason and that the server should
> be in charge of handling timeouts. Is my impression right?

No, the connection goes down for the good reason that the client could not get a response within a tolerable amount of time.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Amit-san, Imai-san,

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> I was able to detect it as follows.
>
> plan_cache_mode = auto
>     HEAD: 1915 tps
>  Patched: 2394 tps
>
> plan_cache_mode = custom (non-problematic: generic plan is never created)
>     HEAD: 2402 tps
>  Patched: 2393 tps

Thanks a lot for the very quick confirmation. I'm relieved to still see good results.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > "VACUUM" needs or "vacuum" is more appropriate here? Looking at the same file and some other files, "vacuum" looks appropriate because it represents the vacuum action, not the specific VACUUM command. > The format of the documentation of new option is a bit weird. Could it > be possible to adjust it around 80 characters per line like other > description? Ah, fixed. It's easy to overlook the style with the screen reader software... I've been wondering if there are good settings for editing .sgml in Emacs that, for example, puts appropriate number of spaces at the beginning of each line when is pressed, automatically break the long line and put spaces, etc. From: Julien Rouhaud [mailto:rjuju...@gmail.com] > also, the documentation should point out that freeing is not > guaranteed. Something like > > + The default is true. If true, VACUUM will try to free empty > pages at the end of the table. That's nice. Done. > > I'm not sure the consensus we got here but we don't make the vacuum > > command option for this? > > I don't think here's a clear consensus, but my personal vote is to add > it, with SHRINK_TABLE = [ force_on | force_off | default ] (unless a > better proposal has been made already) IMO, which I mentioned earlier, I don't think the VACUUM option is necessary because: (1) this is a table property which is determined based on the expected workload, not the one that people want to change per VACUUM operation (2) if someone wants to change the behavior for a particular VACUUM operation, he can do it using ALTER TABLE SET. Anyway, we can add the VACUUM option separately if we want it by all means. I don't it to be the blocker for this feature to be included in PG 12, because the vacuum truncaton has been bothering us like others said in this and other threads... Regards Takayuki Tsunakawa disable-vacuum-truncation_v6.patch Description: disable-vacuum-truncation_v6.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

That's strange... Peter, Imai-san, can you compare your test procedures? Peter, could you instead check for the performance improvement with David's method of March 26?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter,

From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> I did a bit of performance testing, both a plain pgbench and the
> suggested test case with 4096 partitions. I can't detect any
> performance improvements. In fact, within the noise, it tends to be
> just a bit on the slower side.
>
> So I'd like to kick it back to the patch submitter now and ask for more
> justification and performance analysis.
>
> Perhaps "speeding up planning with partitions" needs to be accepted first?

David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Fujii Masao [mailto:masao.fu...@gmail.com]
> reloption for TOAST is also required?

# I've come back to the office earlier than planned...

Hmm, there's no reason not to provide toast.vacuum_shrink_enabled. Done in the attached patch.

Regards
Takayuki Tsunakawa

Attachment: disable-vacuum-truncation_v5.patch
RE: Libpq support to connect to standby server as priority
Hi Hari-san,

I've reviewed all the files. The patch would be OK when the following have been fixed, except for the complexity of fe-connect.c (which probably cannot be improved). Unfortunately, I'll be absent next week. The earliest date I can do the test will be April 8 or 9. I hope someone could take care of this patch...

(1) 0001
With this read-only option type, application can connect to
to a read-only server in the list of hosts, in case
...
before issuing the SHOW transaction_readonly to find out whether

"to" appears twice in a row.
transaction_readonly -> transaction_read_only

(2) 0001
+succesful connection or failure.

succesful -> successful

(3) 0008
to conenct to a standby server with a faster check instead of

conenct -> connect

(4) 0008
Logically, recovery exit should be notified after the following statement:

    XLogCtl->SharedRecoveryInProgress = false;

(5) 0008
+	/* Update in_recovery status. */
+	if (LocalRecoveryInProgress)
+		SetConfigOption("in_recovery",
+						"on",
+						PGC_INTERNAL, PGC_S_OVERRIDE);

This SetConfigOption() is called for every RecoveryInProgress() call on the standby. It may cause undesirable overhead. How about just calling SetConfigOption() once in InitPostgres() to set the initial value for in_recovery? InitPostgres() and its subsidiary functions call SetConfigOption() likewise.

(6) 0008
async.c is for LISTEN/UNLISTEN/NOTIFY. How about adding the new functions in postgres.c, like RecoveryConflictInterrupt()?

(7) 0008
+	if (pid != 0)
+	{
+		(void) SendProcSignal(pid, reason, procvxid.backendId);
+	}

The braces are not necessary, because the block only contains a single statement.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
Nagaura-san,

The socket_timeout patch needs the following fixes. Now that others have already tested these patches successfully, they appear committable to me.

(1)
+	else
+		goto iiv_error;
...
+
+iiv_error:
+	conn->status = CONNECTION_BAD;
+	printfPQExpBuffer(&conn->errorMessage,
+					  libpq_gettext("invalid integer value for socket_timeout\n"));
+	return false;

This goto and its corresponding iiv_error label are redundant. You can just set the error message and return at the call site of parse_int_param(), i.e.:

	if (!parse_int_param(...))
	{
		/* set the error message */
		return false;
	}
	if (conn->socket_timeout > 0 && conn->socket_timeout < 2)
		conn->socket_timeout = 2;

The reason why the oom_error label is present is that it is used at multiple places to avoid repeating the same error-processing code.

(2)
+	conn->sock = -1;

Use PGINVALID_SOCKET instead of -1.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
Nagaura-san,

The client-side tcp_user_timeout patch looks good. The server-side tcp_user_timeout patch needs the following fixes:

(1)
+	GUC_UNIT_MS | GUC_NOT_IN_SAMPLE
+	12000, 0, INT_MAX,

GUC_NOT_IN_SAMPLE should be removed, because the parameter appears in postgresql.conf.sample. The default value should be 0, not 12000.

(2)
Now that you've changed show_tcp_usertimeout() to use pq_gettcpusertimeout(), you need to modify pq_settcpusertimeout() as well; just imitate pq_setkeepalivesidle(). Otherwise, SHOW would display a wrong value.

Regards
Takayuki Tsunakawa
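A hedged sketch of what imitating pq_setkeepalivesidle() could look like (simplified; the real function needs more guards, e.g. for Unix-domain sockets):

    int
    pq_settcpusertimeout(int timeout, Port *port)
    {
        if (port == NULL || port->sock == PGINVALID_SOCKET)
            return STATUS_OK;

    #ifdef TCP_USER_TIMEOUT
        if (port->tcp_user_timeout == timeout)
            return STATUS_OK;       /* nothing to do */

        if (setsockopt(port->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                       (char *) &timeout, sizeof(timeout)) < 0)
            return STATUS_ERROR;

        /* Remember the value so that SHOW reports it correctly. */
        port->tcp_user_timeout = timeout;
    #endif

        return STATUS_OK;
    }

The key line for the review comment is the final assignment: caching the value in Port is what keeps SHOW tcp_user_timeout from displaying a stale number.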
RE: Timeout parameters
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com]
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> > +               (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> > +{
> > +    char sebuf[256];
> > +
> > +    appendPQExpBuffer(&conn->errorMessage,
> > +                      libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
> >
> > I suppose that the reason ENOPROTOOPT is excluded from error
> > condition is that the system call may fail with that errno on
> > older kernels, but I don't think that that justifies ignoring the
> > failure.
>
> I think that's for the case where the module is built on an OS that supports
> TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is used
> on an older OS that doesn't support TCP_USER_TIMEOUT. I remember I was
> sometimes able to do such a thing on Linux and Solaris. If we don't have
> to handle such usage, I agree about removing the special handling of
> ENOPROTOOPT.

Oops, I agree that we should return an error even in the ENOPROTOOPT case, because setsockopt() is called only when the user specifies tcp_user_timeout.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> +if (setsockopt(conn->sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
> +               (char *) &timeout, sizeof(timeout)) < 0 && errno != ENOPROTOOPT)
> +{
> +    char sebuf[256];
> +
> +    appendPQExpBuffer(&conn->errorMessage,
> +                      libpq_gettext("setsockopt(TCP_USER_TIMEOUT) failed: %s\n"),
>
> I suppose that the reason ENOPROTOOPT is excluded from error
> condition is that the system call may fail with that errno on
> older kernels, but I don't think that that justifies ignoring the
> failure.

I think that's for the case where the module is built on an OS that supports TCP_USER_TIMEOUT (#ifdef TCP_USER_TIMEOUT is true), and the module is used on an older OS that doesn't support TCP_USER_TIMEOUT. I remember I was sometimes able to do such a thing on Linux and Solaris. If we don't have to handle such usage, I agree about removing the special handling of ENOPROTOOPT.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> You're both right and I'm wrong.
>
> However, I think it would be better to stick with the term 'truncate'
> which is widely-used already, rather than introducing a new term.

Yeah, I have the same feeling. OTOH, as I mentioned earlier in this thread, "shrink" is used instead of "truncate" in the PostgreSQL documentation, so I chose shrink. To repeat myself, I'm comfortable with either word. I'd like the committer to choose whichever he thinks better.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
I've looked through 0004-0007. I've only found the following:

(5) 0005
With this read-only option type, application can connect to
connecting to a read-only server in the list of hosts, in case
if there is any read-only servers available, the connection
attempt fails.

"connecting to" can be removed.
"in case if there is any read-only servers" -> "if there is no read-only server"

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> while going through the old patch where the GUC_REPORT is implemented, Tom
> has commented the logic of sending the signal to all backends to process
> the hot standby exit with SIGHUP, if we add the logic of updating the GUC
> variable value in SIGHUP, we may need to change all the SIGHUP handler code
> paths. It is also possible that there is no need to update the variable
> for other processes except backends.
>
> If we go with adding the new SIGUSR1 type to check and update the GUC
> variable, it can work for most of the backends and background workers.
>
> opinions

SIGUSR1 looks reasonable. We can consider it as notifying that the server status has changed.

I've fully reviewed 0001-0003, and my comments follow. I'll review 0004-0007.

(1) 0001
before issuing the transaction_readonly to find out whether
the server is read-write or not is restuctured under a new

transaction_readonly -> "SHOW transaction_read_only"
restuctured -> restructured

(2) 0001
+succesful connection or failure.
+successful connection; if it returns on, means server

succesful -> successful
means -> it means

(3) 0003
+But for servers version 12 or greater uses the transaction_read_only
+GUC that is reported by the server upon successful connection.

GUC doesn't seem to be a term to be used in the user manual. Instead:

uses the value of the transaction_read_only configuration parameter that is...

as in:

https://www.postgresql.org/docs/devel/libpq-connect.html

client_encoding
This sets the client_encoding configuration parameter for this connection.

application_name
Specifies a value for the application_name configuration parameter.

(4) 0003
	bool	std_strings;	/* standard_conforming_strings */
+	bool	transaction_read_only;	/* session_read_only */

Looking at the comment for std_strings, it's better to change the comment for transaction_read_only so that it names the backing configuration parameter.

Regards
Takayuki Tsunakawa
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> On Wed, Mar 27, 2019 at 2:30 AM Robert Haas wrote:
> > On Tue, Mar 26, 2019 at 11:23 AM Masahiko Sawada wrote:
> > > > I don't see a patch with the naming updated, here or there, and I'm
> > > > going to be really unhappy if we end up with inconsistent naming
> > > > between two patches that do such fundamentally similar things. -1
> > > > from me to committing either one until that inconsistency is resolved.
> > >
> > > Agreed. I've just submitted the latest version patch that adds
> > > INDEX_CLEANUP option and vacuum_index_cleanup reloption. I already
> > > mentioned on that thread but I agreed with adding phrase positively
> > > than negatively. So if we got consensus on such naming the new options
> > > added by this patch could be something like SHRINK option (with
> > > true/false) and vacuum_shrink reloption.
> >
> > No, that's just perpetuating the problem. Then you have an option
> > SHRINK here that you set to TRUE to skip something, and an option
> > INDEX_CLEANUP over there that you set to FALSE to skip something.
>
> Well, I imagined that both INDEX_CLEANUP option and SHRINK option (or
> perhaps TRUNCATE option) should be true by default. If we want to skip
> some operation of vacuum we can set each options to false like "VACUUM
> (INDEX_CLEANUP false, SHRINK true, VERBOSE true)". I think that
> resolves the problem but am I missing something?

I have almost the same view as Sawada-san. The reloption vacuum_shrink_enabled is a positive name and follows the naming style of other reloptions. I hope this matches the style you have in mind.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> Here a benchmark doing that using pgbench's script weight feature.

Wow, I didn't know that pgbench had evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool!

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > My understanding of what David wrote is that the slowness of bloated hash > table is hard to notice, because planning itself is pretty slow. With the > "speeding up planning with partitions" patch, planning becomes quite fast, > so the bloated hash table overhead and so your patch's benefit is easier > to notice. This patch is clearly helpful, but it's just hard to notice > it > when the other big bottleneck is standing in the way. Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow due to the custom planning time proportional to the number of partitions. Thanks for waking up my sleeping head! Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut wrote:
> > Perhaps "speeding up planning with partitions" needs to be accepted first?
>
> Yeah, I think it likely will require that patch to be able to measure
> the gains from this patch.
>
> If planning a SELECT to a partitioned table with a large number of
> partitions using PREPAREd statements, when we attempt the generic plan
> on the 6th execution, it does cause the local lock table to expand to
> fit all the locks for each partition. This does cause the
> LockReleaseAll() to become slow due to the hash_seq_search having to
> skip over many empty buckets. Since generating a custom plan for a
> partitioned table with many partitions is still slow in master, then I
> very much imagine you'll struggle to see the gains brought by this
> patch.

Thank you David for explaining. Although I may not understand the effect of the "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session:

1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash.
2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1;
3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction.

Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table.

Regards
Takayuki Tsunakawa
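A minimal sketch of that sequence (table and statement names invented for illustration; assume t has many, say 1,000, hash partitions):

  CREATE TABLE t (pkey int PRIMARY KEY, val int) PARTITION BY HASH (pkey);
  -- CREATE TABLE t_0 PARTITION OF t FOR VALUES WITH (MODULUS 1000, REMAINDER 0); ... and so on

  SELECT count(*) FROM t;                             -- locks every partition; LockMethodLocalHash expands
  PREPARE q(int) AS SELECT * FROM t WHERE pkey = $1;
  BEGIN; EXECUTE q(1); COMMIT;                        -- LockReleaseAll() scans the bloated hash table
  BEGIN; EXECUTE q(2); COMMIT;                        -- ... and again at every transaction end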
RE: Re: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: David Steele [mailto:da...@pgmasters.net]
> This patch appears to have been stalled for a while.
>
> Takayuki -- the ball appears to be in your court. Perhaps it would be
> helpful to summarize what you think are next steps?

disable_index_cleanup is handled by Sawada-san in another thread. I understand I've reflected all review comments in the latest patch, and replied to the opinions/proposals, so the patch status is kept "needs review." (I hope no new fire will break out...)

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Robert Haas [mailto:robertmh...@gmail.com]
> I really dislike having both target_session_attrs and
> target_server_type. It doesn't solve any actual problem. master,
> slave, prefer-slave, or whatever you like could be put in
> target_session_attrs just as easily, and then we wouldn't end up with
> two keywords doing closely related things. 'master' is no more or
> less a server attribute than 'read-write'.

Hmm, that may be OK. At first, I felt it strange to treat the server type (primary or standby) as a session attribute. But we can see the server type as one attribute of the session being established. I'm inclined to agree with:

target_session_attrs = {any | read-write | read-only | prefer-read | primary | standby | prefer-standby}

Regards
Takayuki Tsunakawa
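With the merged option, a failover-ready connection could then be written in the style of the existing psql examples (prefer-standby is one of the values proposed here, not yet an accepted one):

  => \c "host=host1,host2 target_session_attrs=prefer-standby"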
RE: [survey] New "Stable" QueryId based on normalized query text
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> > > needs.1: stable across different databases,
> >
> > Does this mean different database clusters, not different databases in
> > a single database cluster?
>
> Does this mean you want a different QueryID for the same-looking
> query for another database in the same cluster?

(I'm afraid this question may be targeted at legrand legrand, not me...) I think the same query text can have either the same or a different QueryID in different databases in the database cluster. Even if the QueryID value is the same, we can use the DatabaseID to choose the desired information.

Regards
Takayuki Tsunakawa
RE: [survey] New "Stable" QueryId based on normalized query text
From: legrand legrand [mailto:legrand_legr...@hotmail.com]
> There are many projects that use an alternate QueryId
> distinct from the famous pg_stat_statements jumbling algorithm.

I'd like to welcome a standard QueryID that DBAs and extension developers can depend on. Are you surveying these needs so that you can develop a QueryID that meets as many of them as possible?

> needs.1: stable across different databases,

Does this mean different database clusters, not different databases in a single database cluster?

needs.5: minimal overhead to calculate
needs.6: doesn't change across database server restarts
needs.7: same value on both the primary and standby?

> norm.9: comments aware

Is this to distinguish queries that have different comments for optimizer hints? If yes, I agree.

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] > Fixed. Rebased on HEAD. Regards Takayuki Tsunakawa 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch 0002-speed-up-LOCALLOCK-scan.patch Description: 0002-speed-up-LOCALLOCK-scan.patch
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > Your changes in LOCALLOCK still refer to PGPROC, from your first version > of the patch. > > I think the reordering of struct members could be done as a separate > preliminary patch. > > Some more documentation in the comment before dlist_head LocalLocks to > explain this whole mechanism would be nice. Fixed. > You posted a link to some performance numbers, but I didn't see the test > setup explained there. I'd like to get some more information on this > impact of this. Is there an effect with 100 tables, or do you need 10? Imai-san, can you tell us the test setup? Regards Takayuki Tsunakawa 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch Description: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch 0002-speed-up-LOCALLOCK-scan.patch Description: 0002-speed-up-LOCALLOCK-scan.patch
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think so. I think it's just a weirdly-designed parameter
> without a really compelling use case. Enforcing limits on the value
> of the parameter doesn't fix that. Most of the reviewers who have
> opined so far have been somewhere between cautious and negative about
> the value of that parameter, so I think we should just not add it. At
> least for now.

I don't think socket_timeout is so bad. I think Nagaura-san and I presented the use case, giving an answer to every question and concern. OTOH, it may be better to commit the tcp_user_timeout patch when Nagaura-san has refined the documentation, and then continue with socket_timeout.

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Target_session_attrs    Target_server_type
>
> read-write              prefer-slave, slave
> prefer-read             master, slave
> read-only               master, prefer-slave
>
> I know that some of the cases above are possible, like a master server
> that by default accepts read-only sessions. Instead of putting in a check
> to validate which combinations are right, how about allowing all the
> combinations, and in case such a combination is not satisfiable (meaning
> there shouldn't be any server that supports the requirement), the
> connection fails.
>
> Comments?

I think that's OK. To follow the existing naming, it seems better to use "primary" and "standby" instead of master and slave -- primary_conninfo, synchronous_standby_names, hot_standby, max_standby_streaming_delay and such.

> And also as we need to support the new option to connect to servers < 12
> as well, this option sends the command "select pg_is_in_recovery()" to the
> server to find out whether the server is in recovery mode or not?

The query looks good. OTOH, I think we can return an error when target_server_type is specified against older servers, because the parameter is new, if supporting older servers would make the libpq code uglier.

> And also regarding the implementation point of view, the new
> target_server_type option validation is separately handled, meaning the
> check for the required server is handled in a separate switch case; when
> both options are given, first find out the required server for
> target_session_attrs and validate the same again with target_server_type?

Logically, it seems the order should be the reverse; check the server type first, then the session attributes, considering other session attributes in the future.

Regards
Takayuki Tsunakawa
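For reference, the fallback check against older servers uses an existing function:

  SELECT pg_is_in_recovery();  -- true on a standby (still in recovery), false on a primary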
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > Based on your comment it seems to me that 'socket_timeout' should be > connected with statement_timeout. I mean that end-user should wait > statement_timeout + 'socket_timeout' for returning control. It looks much > more safer for me. I'm afraid we cannot enforce that relationship programmatically, so I think the documentation should warn that socket_timeout be longer than statement_timeout. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > In case of failure PQcancel() terminates in 'socket_timeout'. So, control > to the end-user in such a failure situation will be returned in 2 * > 'socket_timeout' interval. It is much better than hanging forever in some > specific cases. Moreover, such solution will not lead to the overloading > of PostgreSQL server by abnormally ignored 'heavy' queries results by > end-users. Oops, unfortunately, PQcancel() does not follow any timeout parameters... It uses a blocking socket. Also, I still don't think it's a good idea to request cancellation. socket_timeout should be sufficiently longer than the usually expected query execution duration. And long-running queries should be handled by statement_timeout which indicates the maximum tolerable query execution duration. For example, if the usually expected query execution time is 100 ms, statement_timeout can be set to 3 seconds and socket_timeout to 5 seconds. Regards Takayuki Tsunakawa
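With those example values, the settings would look like this (socket_timeout is the libpq connection parameter proposed in this thread, not an existing one; statement_timeout is an existing server parameter):

  => \c "host=db1 dbname=mydb socket_timeout=5"   -- proposed: last-resort client-side limit on each socket wait
  => SET statement_timeout = '3s';                -- existing: the server cancels queries running longer than this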
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu]
> Do you mind me asking whether you have considered that solving your problem
> can lead to problems in other user applications?
> Let's imagine a possible problem:
> 1. The end-user sets 'socket_timeout' only for the current session.
> 2. 'socket_timeout' is much shorter than 'keep_alive', 'tcp_user_timeout'
>    and 'statement_timeout'.
> 3. The end-user invokes a 'heavy' query which is not possible for the
>    PostgreSQL server to process within 'socket_timeout'.
> 4. If the query fails due to the timeout, the session is terminated and
>    the query repeated.
>
> As the query is not cancelled, the PostgreSQL server will process it till
> completion or 'statement_timeout' expiration. In such a case the PostgreSQL
> server can be overloaded by 'unreliable' users, and other end-users
> will not be able to work with the PostgreSQL server as they expected.

Yes, so I think it would be necessary to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout -- emphasizing that socket_timeout is not for canceling long-running queries but for returning control to the client.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> Now you might say - what if the server is stopped not because of
> SIGSTOP but because of some other reason, like it's waiting for a
> lock? Well, in that case, the database server is still functioning,
> and you will not want the connection to be terminated if no notifies
> are sent during the lock wait but not terminated if they are sent. At
> least, I can't see why anyone would want that.

Yes, so I think it would be kind to describe how to set socket_timeout in relation to other timeout parameters -- socket_timeout > statement_timeout > lock_timeout, for example.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com] > One other thing -- I looked a bit into the pgsql-jdbc implementation > of a similarly-named option, and it does seem to match what you are > proposing here. I wonder what user experiences with that option have > been like. One case I faintly recall is that some SQL execution got stuck in NFS access in the server, and statement_timeout didn't work (probably due to inability to cancel NFS read/write operations). The user chose to use PgJDBC's socketTimeout. I'm not sure whether this use case is fundamental, because I wonder if NFS mount options could have been the solution. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > > For example, OS issues such as abnormally (buggy) slow process scheduling > or paging/swapping that prevent control from being passed to postgres. Or, > abnormally long waits on lwlocks in postgres. statement_timeout doesn't > take effect while waiting on a lwlock. I have experienced both. And, any > other issues in the server that we haven't experienced and not predictable. > > For me all mentioned by Takayuki Tsunakawa problems looks like a lack of > design of end-user application or configuration of DB server. It is not > a problem of PostgreSQL. Certainly, my cases could be avoided by the OS and database configuration. But we cannot say all such problems can be avoided by configuration in advance. The point is to provide a method to get ready for any situation. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr] > I think that the typical use-case of \c is to connect to another database > on the same host, at least that what I do pretty often. The natural > expectation is that the same "other" connection parameters are used, > otherwise it does not make much sense, and there is already a whole logic > of reusing the previous settings in psql, at least wrt describing the > target (host, port, user…). I agree about the natural expectation. > > Anyway, I think this would be an improvement for psql's documentation > or > > new feature for psql. What do you think? > > I think that we should fix the behavior rather than document the current > weirdness. I do not think that we should introduce more weirdness. Do you think all connection parameter settings should be reused by default, or when -reuse-previous=on is specified? Do you think this is a bug to be backported to previous versions, or a new feature? I think I'll make a patch separately from this thread, if time permits. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> If so, in turn the socket_timeout doesn't work as expected? I
> understand that what is proposed here is to disconnect after that
> time of waiting for *the first tuple* of a query, regardless of
> whether it is a long query or a network failure. One of the problems
> raised here is that the socket_timeout patch doesn't work that way?

No, the proposal is to set the timeout for every socket read/write, like PgJDBC. It is not restricted to an SQL command execution; it applies to any communication with the server that could block the client.

> I can't imagine the cases where applications cannot be rebuilt for
> some reason. (E.g. the source code has been lost or the source code
> no longer builds on a modern environment..)
>
> If an application is designed to have such a requirement, mustn't
> it have implemented that by itself by some means? For example, an
> application on JDBC can be designed to kill a querying thread
> that takes longer than a threshold.

Any paid or free applications whose source code is closed. Also, ordinary users don't want to modify the psql and pgAdmin source code by themselves, do they?

> Doesn't TCP_KEEPALIVE work that way?
>
> statement_timeout works on a live connection, TCP_USER_TIMEOUT
> works for an already failed network, but if the network fails
> after a backend has already sent the ack for a query request, it
> doesn't work. TCP_KEEPALIVE would work for the case where no
> packet reaches either side for a certain time. Are there any
> other situations to save?

For example, OS issues such as abnormally (buggy) slow process scheduling or paging/swapping that prevent control from being passed to postgres. Or, abnormally long waits on lwlocks in postgres. statement_timeout doesn't take effect while waiting on an lwlock. I have experienced both. And, any other issues in the server that we haven't experienced and that are not predictable.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> But that's not what it will do. As long as the server continues to
> dribble out protocol messages from time to time, the timeout will
> never fire no matter how much time passes. I saw a system once where
> every 8kB read took many seconds to complete; such a system could
> dribble out sequential scan results over an arbitrarily long period of
> time without ever tripping the timeout.

I understood that the example is about a SELECT that returns multiple rows. If so, statement_timeout should handle it, shouldn't it?

> If you really want to return
> control to the user in any situation, what you can do is use the libpq
> APIs in asynchronous mode which, barring certain limitations of the
> current implementation, will actually allow you to maintain control
> over the connection at all times.

Maybe. But the users aren't often in a situation to modify the application to use the libpq asynchronous APIs.

> I think the use case for a timeout that has both false positives (i.e.
> it will fire even when there's no problem, as when the connection is
> legitimately idle) and false negatives (i.e. it will fail to trigger
> when there is a problem, as when there are periodic notices or
> notifies from the server connection) is extremely limited if not
> nonexistent, and I think the potential for users to be confused is
> really high.

My understanding is that the false positive case doesn't occur, because libpq doesn't wait on the socket while the client is idle and not communicating an SQL request/response. As for the false negative case, resetting the timer upon receipt of notices or notifies is good, because they show that the database server is functioning. socket_timeout is not a mechanism to precisely limit the duration of a query request/response. It is kind of a stop-gap, last resort to assure the return of control within a reasonable amount of time, rather than minutes or hours.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Robert Haas [mailto:robertmh...@gmail.com]
> The first thing I notice about the socket_timeout patch is that the
> documentation is definitely wrong:

Agreed. I suppose the description should be clearer about:

* the purpose and what situation this timeout will help: not for canceling a long-running query or recovering from a network failure.
* the relationship with other timeout parameters: e.g., socket_timeout > connect_timeout, socket_timeout > statement_timeout, socket_timeout > TCP keepalive/user timeout.

> Actually, I think that socket_timeout_v8.patch is a bad idea and
> should be rejected, not because it doesn't send a cancel to the server
> but just because it's not the right way of solving either of the
> problems that it purports to solve. If you want a timeout for
> queries, you need that to be administered by the server, which knows
> when it starts and finishes executing a query; you cannot substitute a
> client-side judgement based on the last time we received a byte of
> data. Maybe a client-side judgement would work if the timer were
> armed when we send a Query or Execute message and disarmed when we
> receive ReadyForQuery. And, if you want a network problem detector,
> then you should use keepalives and perhaps the TCP_USER_TIMEOUT stuff,
> so that you don't kill perfectly good connections that just happen to
> be idle.

I think the purpose of socket_timeout is to avoid an infinite or unduly long wait and return a response to users, where the other existing timeout parameters wouldn't help. For example, the OS's process scheduling or paging/swapping problems might cause long waits before postgres gets control (e.g. due to Transparent HugePage (THP)). Within postgres, the unfair lwlock can cause unexpectedly long waits (I experienced dozens of seconds per wait on ProcArrayLock, which was unbelievable.) Someone may say that those should be fixed or improved instead of introducing this parameter, but I think it's good to have this as a stop-gap measure. In other words, we can suggest setting this parameter when the user asks "How can I return control to the end user in any situation?"

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Fabien COELHO [mailto:coe...@cri.ensmp.fr]
> >> If the user reconnects, eg "\c db", the setting is lost. The
> >> re-connection handling should probably take care of this parameter,
> >> and maybe others.
>
> > I think your opinion is reasonable, but it seems not in this thread.
>
> I think that your patch is responsible for the added option at least.
>
> I agree that the underlying issue, that other parameters should probably
> also be reused, which would be a bug fix, does not belong to this thread.

This doesn't seem to be a bug. \connect just establishes a new connection, not reusing the previous settings for most connection parameters. As the examples in the following manual page show, the user needs to specify the necessary connection parameters.

https://www.postgresql.org/docs/devel/app-psql.html

=> \c service=foo
=> \c "host=localhost port=5432 dbname=mydb connect_timeout=10 sslmode=disable"

But I'm afraid the description of \connect may not be clear enough about connection parameters, and that may cause users to expect the reuse of all connection parameter settings. Anyway, I think this would be an improvement for psql's documentation or a new feature for psql. What do you think?

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi/出利葉 健
> [Size=800, iter=1,000,000]
> Master |15.763
> Patched|16.262 (+3%)
>
> [Size=32768, iter=1,000,000]
> Master |61.3076
> Patched|62.9566 (+2%)

What's the unit, second or millisecond? Why do the numbers of digits to the right of the decimal point differ? Is the measurement correct? I'm wondering because the difference is larger in the latter case. Isn't the accounting processing almost the same in both cases?

* former: 16.262 - 15.763 = 0.499
* latter: 62.9566 - 61.3076 = 1.649

> At least compared to the previous HashAgg version, the overhead is smaller.
> It has some overhead but the increase is a little bit, by 2 or 3%?

I think the overhead is sufficiently small. It may get even smaller with a trivial tweak. You added the new member usedspace at the end of MemoryContextData. The original size of MemoryContextData is 72 bytes, and Intel Xeon's cache line is 64 bytes. So, the new member will be on a separate cache line. Try putting usedspace before the name member.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > Robert used the phrase "attractive nuisance", which maybe sounds like a > good thing to have to a non native speaker, but it actually isn't -- he > was saying we should avoid a GUC at all, and I can see the reason for > that. I think we should have a VACUUM option and a reloption, but no > GUC. Uh, thanks. I've just recognized I didn't know the meaning of "nuisance." I've looked up the meaning in the dictionary. Nuisance is like a trouble maker... Why do you think that it's better for VACUUM command to have the option? I think it's a table property whose value is determined based on the application workload, not per VACUUM execution. Rather, I think GUC is more useful to determine the behavior of the entire database and/or application. If we want to change a given execution of VACUUM, then we can ALTER TABLE SET, VACUUM, and ALTER TABLE SET back. Regards Takayuki Tsunakawa
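A sketch of the workaround just described, using the reloption name from this thread (not an existing option):

  ALTER TABLE mytable SET (vacuum_shrink_enabled = off);
  VACUUM mytable;                                   -- truncation is skipped for this run
  ALTER TABLE mytable SET (vacuum_shrink_enabled = on);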
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So could you consider adding an option for the VACUUM command as well
> as vacuumdb? The interaction with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not confident whether this is the same as the above, but what I imagined is this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default. This follows the naming style "verb_object" like log_connections and enable_xxx. We may want to use enable_vacuum_shrink or something like that, but enable_xxx seems to be used solely for planner control. Plus, vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular table. If the reloption is not set, the GUC takes effect.

* As a consequence, the user can change the behavior of the VACUUM command by SETting the GUC in the same session in advance, when the reloption is not set. If the reloption is set, the user can ALTER TABLE SET, VACUUM, and ALTER TABLE again to restore the table's setting. But I don't think this use case (changing whether to shrink per VACUUM command execution) is necessary; it is merely possible.

Regards
Takayuki Tsunakawa
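A sketch of the proposed precedence (both parameter names are proposals from this thread, not existing ones):

  SET vacuum_shrink_table = off;                    -- proposed USERSET GUC: session-wide default
  VACUUM t1;                                        -- no reloption set, so the GUC applies: no truncation
  ALTER TABLE t2 SET (vacuum_shrink_enabled = on);
  VACUUM t2;                                        -- the reloption overrides the GUC: truncation happens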
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench, which he made a while ago in the similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
>
> This tool is a little bit outdated so I fixed it, but basically I followed him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for a 32kB area 1,000,000 times.
> - and measure this time
>
> The result shows that master is 30 times faster than the patched one.
> So as Andres mentioned in the upper thread it seems it has overhead.
>
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
>
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms

I'm afraid the measurement is not correct. First, the older discussion below shows that the accounting overhead is much, much smaller, even with a more complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAgg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/free of memory > 8 KB calls malloc()/free(). I guess the accounting overhead will be more likely to be hidden under the overhead of malloc() and free(). What we'd like to know is the overhead when malloc() and free() are not called.

And are you sure you didn't enable assert checking?

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] > Attached are the updated patches. Thanks, all look fixed. > The target_server_type option yet to be implemented. Please let me review once more and proceed to testing when the above is added, to make sure the final code looks good. I'd like to see how complex the if conditions in multiple places would be after adding target_server_type, and consider whether we can simplify them together with you. Even now, the if conditions seem complicated to me... that's probably due to the existence of read_write_host_index. Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me. It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

Ouch, I sent my previous mail before reading this. I can understand it may be cumbersome to identify and specify each table, so I tend to agree with having the parameter in postgresql.conf, as USERSET to allow ALTER DATABASE/USER SET to tune specific databases and applications. But should the vacuuming of system catalogs also follow this setting?

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz]
> This makes the test page-size sensitive. While we don't ensure that tests
> can be run with different page sizes, we should make a maximum effort to
> keep the tests compatible if that's easy enough. In this case you could
> just use > 0 as the base comparison. I can fix that by myself, so no need
> to send a new version.

Good idea. Done.

> Should we also document that the parameter is effective for autovacuum?
> The name can lead to confusion regarding that.

I'm not sure of the need, because autovacuum is just an automatic execution of vacuum, and existing vacuum_xxx parameters also apply to autovacuum. But being specific is good anyway, so I added a reference to autovacuum in the description.

> Also, shouldn't the relopt check happen in should_attempt_truncation()?
> It seems to me that if we use this routine somewhere else then it should
> be controlled by the option.

That makes sense. Done.

> At the same time, we also have REL_TRUNCATE_FRACTION and
> REL_TRUNCATE_MINIMUM which could be made equally user-tunable.
> That's more difficult to control, still why don't we also consider this
> part?

I thought of it, too. But I didn't have a good idea on how to explain those parameters. I'd like to separate it.

> Another thing that seems worth thinking about is a system-level GUC, and
> an option in the VACUUM command to control if truncation should happen or
> not. We have a lot of infrastructure to control such options between vacuum
> and autovacuum, so it could be a waste to not consider potential synergies.

Being able to specify this parameter in postgresql.conf and with SET (especially ALTER DATABASE/USER to target specific databases/applications) might be useful, but I'm not sure... I'm less confident about whether the VACUUM command should accept this, because it is a property of a table under a specific workload, not a changeable property of each VACUUM run. Anyway, I expect it won't be difficult to add that configurability without contradicting the design, so I'm inclined to separate it.

From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Yeah, that would work. Or, it's kind of hacky, but rolling back the
> insertion instead of INSERT and DELETE might also work.

That's good, because it allows us to keep running the reloptions test in parallel with other tests. Done.

Regards
Takayuki Tsunakawa

disable-vacuum-truncation_v4.patch
Description: disable-vacuum-truncation_v4.patch
RE: [RFC] [PATCH] Flexible "partition pruning" hook
From: Mike Palmiotto [mailto:mike.palmio...@crunchydata.com]
> Attached is a patch which attempts to solve a few problems:
>
> 1) Filtering out partitions flexibly based on the results of an external
> function call (supplied by an extension).
> 2) Filtering out partitions from pg_inherits based on the same function call.
> 3) Filtering out partitions from a partitioned table BEFORE the partition
> is actually opened on-disk.

What concrete problems would you expect this patch to solve? What kind of extensions do you imagine? I'd like to hear about some examples. For example, "PostgreSQL 12 will not be able to filter out enough partitions when planning/executing a SELECT ... WHERE ... statement, but an extension like this can extract just one partition early."

Would this help with the following issues in PostgreSQL 12?

* UPDATE/DELETE planning takes time in proportion to the number of partitions, even when the partition actually accessed during query execution is only one.
* Making a generic plan takes a prohibitively long time (IIRC, about 12 seconds when the number of partitions is 1,000 or 8,000).

Regards
Takayuki Tsunakawa
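For example, the UPDATE issue shows up like this (a sketch; parted is an imagined table with many partitions):

  PREPARE upd(int) AS UPDATE parted SET val = val + 1 WHERE pkey = $1;
  EXECUTE upd(1);  -- planning expands every partition even though only one row in one partition changes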
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Masahiko Sawada [mailto:sawada.m...@gmail.com] > This test expects that the inserted tuple is always reclaimed by > subsequent vacuum, but it's not always true if there are concurrent > transactions. So size of the reloptions_test table will not be 0 if > the tuple is not vacuumed. In my environment this test sometimes > failed with 'make check -j 4'. Hmm, "make check -j4" certainly fails on my poor VM, too. Modifying src/test/regress/parallel_schedule to put the reloptions test on a separate line seems to have fixed this issue. Do you think this is the correct remedy? Regards Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz] On Mon, Feb 25, 2019 at 03:59:21PM +0900, Masahiko Sawada wrote: > > Also, I think that this test may fail in case where concurrent > > transactions are running. So maybe should not run it in parallel to > > other tests. > > That's why autovacuum is disabled in this specific test, no? A comment > may be a good idea. Exactly. The table is disabled for autovacuum to avoid being influenced by autovacuum. Regards Takayuki Tsunakawa
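For illustration, such a test table would combine the proposed reloption with the existing autovacuum_enabled reloption (vacuum_shrink_enabled is this thread's proposal, not an existing option):

  CREATE TABLE reloptions_test (i int)
    WITH (vacuum_shrink_enabled = off, autovacuum_enabled = off);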
RE: Protect syscache from bloating with negative cache entries
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyot...@lab.ntt.co.jp]
> - If you find the process "bloat"s too much and you (intuitively)
> suspect the cause is the system cache, set it to a certain shorter
> value, say 1 minute, and set catalog_cache_memory_target
> to the allowable amount of memory for each process. The memory
> usage will be stable at an (un)certain amount above the target.

Could you guide me on how to tune these parameters in an example scenario? Let me take the original problematic case referenced at the beginning of this thread. That is:

* A PL/pgSQL function that creates a temp table, accesses it, (accesses other non-temp tables,) and drops the temp table.
* An application repeatedly begins a transaction, calls the stored function, and commits the transaction.

With the v16 patch applied, and leaving the catalog_cache_xxx parameters set to their defaults, CacheMemoryContext continued to increase as follows:

CacheMemoryContext: 1065016 total in 9 blocks; 104168 free (17 chunks); 960848 used
CacheMemoryContext: 8519736 total in 12 blocks; 3765504 free (19 chunks); 4754232 used
CacheMemoryContext: 25690168 total in 14 blocks; 8372096 free (21 chunks); 17318072 used
CacheMemoryContext: 42991672 total in 16 blocks; 11741024 free (21761 chunks); 31250648 used

How can I make sure that this context won't exceed, say, 10 MB to avoid OOM? I'm afraid that once the catcache hash table becomes large in a short period, the eviction would happen less frequently, leading to memory bloat.

Regards
Takayuki Tsunakawa
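A sketch of the workload in question (names invented; each call creates and drops a temp table, leaving dead cache entries behind):

  CREATE FUNCTION use_temp_table() RETURNS bigint AS $$
  DECLARE
    n bigint;
  BEGIN
    CREATE TEMP TABLE tmp_work (id int);
    INSERT INTO tmp_work VALUES (1);
    SELECT count(*) INTO n FROM tmp_work;  -- touches the temp table's catalog entries
    DROP TABLE tmp_work;
    RETURN n;
  END;
  $$ LANGUAGE plpgsql;

  -- the application repeats this:
  BEGIN;
  SELECT use_temp_table();
  COMMIT;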
RE: Protect syscache from bloating with negative cache entries
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't understand the idea that we would add something to PostgreSQL
> without proving that it has value. Sure, other systems have somewhat
> similar systems, and they have knobs to tune them. But, first, we
> don't know that those other systems made all the right decisions, and
> second, even if they are, that doesn't mean that we'll derive similar
> benefits in a system with a completely different code base and many
> other internal differences.

I understand that general idea. So, I don't have an idea why the proposed approach, eviction based only on elapsed time and only at hash table expansion, is better for PostgreSQL's code base and other internal differences...

> You need to demonstrate that each and every GUC you propose to add has
> a real, measurable benefit in some plausible scenario. You can't just
> argue that other people have something kinda like this so we should
> have it too. Or, well, you can argue that, but if you do, then -1
> from me.

The benefits of the size limit are:

* Controllable and predictable memory usage. The DBA can be sure that OOM won't happen.
* Smoothed (non-abnormal) transaction response time. This is due to the elimination of bulk eviction of cache entries.

I'm not sure how to tune catalog_cache_prune_min_age and catalog_cache_memory_target. Let me pick up a test scenario in a later mail in response to Horiguchi-san.

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Michael Paquier [mailto:mich...@paquier.xyz] > I don't think that we want to use a too generic name and it seems more natural > to reflect the context where it is used in the parameter name. > If we were to shrink with a similar option for other contexts, we would > most likely use a different option. Depending on the load pattern, users > should also be able to disable or enable a subset of contexts as well. > > So I agree with Julien that [auto]vacuum_shrink_enabled is more adapted > for this stuff. OK, I renamed it to vacuum_shrink_enabled. From: Julien Rouhaud [mailto:rjuju...@gmail.com] > One last thing, I think we should at least add one regression test for > this setting. The one you provided previously seems perfectly suited. Thanks, added. Regards Takayuki Tsunakawa disable-vacuum-truncation_v3.patch Description: disable-vacuum-truncation_v3.patch
RE: Libpq support to connect to standby server as priority
Hi Hari-san,

I've reviewed all the files. I think I'll proceed to testing when I've reviewed the revised patch and the patch for target_server_type.

(1) patch 0001

  CONNECTION_CHECK_WRITABLE,   /* Check if we could make a writable
                                * connection. */
+ CONNECTION_CHECK_TARGET,     /* Check if we have a proper target
+                               * connection */
  CONNECTION_CONSUME           /* Wait for any pending message and consume
                                * them. */

According to the following comment, a new enum value should be added at the end.

  /*
   * Although it is okay to add to these lists, values which become unused
   * should never be removed, nor should constants be redefined - that would
   * break compatibility with existing code.
   */

(2) patch 0002

It seems to align better with the existing code to set the default value of pg_conn.requested_session_type explicitly in makeEmptyPGconn(), even if the default value is 0. Although I feel it's redundant, other member variables do so.

(3) patch 0003

  IntervalStyle was not reported by releases before 8.4;
 -application_name was not reported by releases before 9.0.)
 +application_name was not reported by releases before 9.0
 +transaction_read_only was not reported by releases before 12.0.)

";" is missing at the end of the third line.

(4) patch 0004

 -	/* Type of connection to make. Possible values: any, read-write. */
 +	/* Type of connection to make. Possible values: any, read-write, perfer-read. */
  	char *target_session_attrs;

perfer -> prefer

(5) patch 0004

  @@ -3608,6 +3691,9 @@ makeEmptyPGconn(void)
  		conn = NULL;
  	}

  +	/* Initial value */
  +	conn->read_write_host_index = -1;

The new member should be initialized earlier in this function. Otherwise, as you can see in the above fragment, conn can be NULL in an out-of-memory case.

(6) patch 0004

Don't we add read-only as well as prefer-read, which correspond to Slave or Secondary of PgJDBC's targetServerType? I thought the original proposal was to add both.

(7) patch 0004

  @@ -2347,6 +2367,7 @@ keep_going: /* We will come back to here until there is
  		conn->try_next_addr = true;
  		goto keep_going;
  	}
  +
  	appendPQExpBuffer(&conn->errorMessage,
  			  libpq_gettext("could not create socket: %s\n"),

Is this an unintended newline addition?

(8) patch 0004

  +	const char *type = (conn->requested_session_type == SESSION_TYPE_PREFER_READ) ?
  +		"read-only" : "writable";

I'm afraid these strings are not translatable into other languages.

(9) patch 0004

  +	/* Record read-write host index */
  +	if (conn->read_write_host_index == -1)
  +		conn->read_write_host_index = conn->whichhost;

At this point, the session can be either read-write or read-only, can't it? Add the check "!conn->transaction_read_only" to this if condition?

(10) patch 0004

  +	if ((conn->target_session_attrs != NULL) &&
  +		(conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
  +		(conn->read_write_host_index != -2))

The first condition is not necessary because the second one exists. The parentheses surrounding each of these conditions are redundant. It would be better to remove them for readability. This also applies to the following part:

  +	if (((strncmp(val, "on", 2) == 0) &&
  +		 (conn->requested_session_type == SESSION_TYPE_READ_WRITE)) ||
  +		((strncmp(val, "off", 3) == 0) &&
  +		 (conn->requested_session_type == SESSION_TYPE_PREFER_READ) &&
  +		 (conn->read_write_host_index != -2)))

Regards
Takayuki Tsunakawa
RE: Libpq support to connect to standby server as priority
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com] Here I attached first set of patches that implemented the prefer-read option > after reporting the transaction_read_only GUC to client. Along the lines > of adding prefer-read option patch, Great, thank you! I'll review and test it. > 3. Existing read-write code is modified to use the new reported GUC instead > of executing the show command. Is the code path left to use SHOW for older servers? Regards Takayuki Tsunakawa
RE: Problem during Windows service start
Hi Higuchi-san,

(1)
What made you think this problem rarely occurs in PG 10 or later? Looking at the following code, this seems to happen in PG 10+ too.

  if (do_wait)
  {
      write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Waiting for server startup...\n"));
      if (wait_for_postmaster(postmasterPID, true) != POSTMASTER_READY)
      {
          write_eventlog(EVENTLOG_ERROR_TYPE, _("Timed out waiting for server startup\n"));
          pgwin32_SetServiceStatus(SERVICE_STOPPED);
          return;
      }
      write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server started and accepting connections\n"));
  }

(2)
What state should we consider SERVICE_RUNNING as? Isn't it the state where the server has completed startup processing and accepts connections? If not, how is it different from SERVICE_STARTING? (I know that when -w (wait) is not specified, the status becomes SERVICE_RUNNING whether or not the server completes startup processing...)

(3)
+ write_eventlog(EVENTLOG_INFORMATION_TYPE, _("Server startup timed out but might continue in the background\n"));

This message is new, isn't it? I think the existing message "Timed out..." is enough.

(4)
+ write_eventlog(EVENTLOG_ERROR_TYPE, _("Server startup failed. Examine the log output.\n"));

The first sentence (phenomenon) and the second sentence (detail or action) should be separated with a newline. Below are some examples in pg_ctl.c. Note that write_stderr() writes to the eventlog when running under a Windows service.

  write_stderr(_("%s: could not start server\n"
                 "Examine the log output.\n"),

  write_stderr(_("The program \"%s\" was found by \"%s\"\n"
                 "but was not the same version as %s.\n"
                 "Check your installation.\n"),

Regards
Takayuki Tsunakawa
RE: reloption to prevent VACUUM from truncating empty pages at the end of relation
From: Julien Rouhaud [mailto:rjuju...@gmail.com]
> FWIW, I prefer shrink over truncate, though I'd rather go with
> vacuum_shrink_enabled as suggested previously.

Thanks. I'd like to leave it to a committer to choose the name. FWIW, I chose shrink_enabled rather than vacuum_shrink_enabled because this property may be used in other shrink situations in the future. What I imagined was that with zheap, DELETE or some maintenance operation, not vacuum, may try to shrink the table. I meant this property to indicate "whether this table shrinks or not" regardless of the specific operation that can shrink the table.

> I'm not sure that I get this comment. Since both require a
> ShareUpdateExclusiveLock, you can't change the parameter while a
> VACUUM is active on that table. Did you want to use another lock mode?

No, changing the parameter acquires a ShareUpdateExclusive lock. I just imitated the description for n_distinct in the same comment block. The meaning is that the setting cannot be changed during VACUUM, so an in-flight VACUUM is not affected.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> socket_timeout (integer)

libpq documentation does not write the data type on the parameter name line.

> Terminate any connection that has been inactive for more than the specified
> number of seconds to prevent the client from infinite waiting for individual
> socket read operations due to a dead connection. This can be used both as
> a force global query timeout and network problems detector. A value of zero
> (the default) turns this off.
>
> or
>
> Controls the number of seconds of connection inactivity to prevent the client
> from infinite waiting for individual socket read operations due to a dead
> connection. This can be used both as a force global query timeout and network
> problems detector. A value of zero (the default) turns this off.

The second one is better, but:

* Just "connection inactivity" (i.e. the application hasn't submitted any request to the server for a long time) does not terminate the connection.
* "due to dead connection" is not the cause of the timeout. If the timeout occurs, consider the connection dead (see keepalives_count).
* It is not restricted to "read" operations, is it?

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: mikalaike...@ibagroup.eu [mailto:mikalaike...@ibagroup.eu] > I am not very familiar with the PostgreSQL source code. Nevertheless, the > main idea of this parameter is clear for me - closing a connection when > the PostgreSQL server does not response due to any reason. However, I have > not found in the discussion a reference that this parameter can be applied > to the TCP as well as to the UNIX-domain sockets. Moreover, this parameter > works out of communication layer. When we consider TCP communication, the > failover is covered by keep_alive and tpc_user_timeout parameters. Right. This timeout is for individual socket operations, and the last resort to forcibly terminate the connection when other timeouts don't work (TCP keepalive, tcp user timeout, statement_timeout). Also, there's no reason to restrict this timeout to TCP. > According to it, we should not use 'tcp' prefix in this parameter name, > 'socket' sub string is not suitable too. > > This parameter works on the client side. So the word 'client' is a good > candidate for using in this parameter name. > This parameter affects only when we send a 'query' to the pg server. No, this timeout works not only for queries but for any socket operations to communicate with the server. For example, libpq sends query cancellation message to the server, receives server-side parameter value change notifications, etc. So, like PgJDBC and Oracle, I think the name suggesting socket or communication is good. That is, socket_timeout or communication_timeout is OK. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk/ジャミソン カーク
> Although I did review and followed the suggested way in a previous email
> way back (which uses the root user) and it worked as intended, I'd also like
> to hear feedback from Fabien on whether it's alright without the test
> script, or if there's another way we can test it (maybe none?).
> Then I think it's in good shape.

I'm afraid there's no way to incorporate the test into the regression test. If there were, Fabien would have responded earlier.

> However, what I meant by my statement above was to have a separate
> decision in committing the two parameters, because based on the thread
> tcp_user_timeout has made more progress when it comes to reviews.
> So once it's ready, it can be committed separately without waiting for
> the socket_timeout param to be ready. (I don't think that requires a
> separate thread)

I see, thanks.

> As for the place, my knowledge is limited so I won't be able to provide
> substantial help on it. Fabien pointed out some comments about it that
> need clarification though. Quoting what Fabien wrote:
>
> "The implementation looks awkward, because part of the logic of pqWaitTimed
> is reimplemented in pqWait. Also, I do not understand the computation
> of finish_time, which seems to assume that pqWait is going to be called
> immediately after sending a query, which may or may not be the case, and
> if it is not, the time() call there is not the start of the statement."

The patch looks good in that respect. This timeout is for individual socket operations like PgJDBC and Oracle, not the query timeout. That's what the parameter name suggests.

> How about the one below? I got it from the combined explanations above.
> I did not include the differences though. This can be improved as the
> discussion continues along the way.
>
> tcp_socket_timeout (integer)
>
> Terminate and restart any session that has been idle for more than
> the specified number of milliseconds to prevent the client from infinite
> waiting for the server due to a dead connection. This can be used both as
> a brute force global query timeout and for detecting network problems.
> A value of zero (the default) turns this off.

Looks better than the original one. However:

* The unit is not milliseconds, but seconds.
* It doesn't restart the session.
* Choose words that better align with the existing libpq parameters. e.g. "connection" should be used instead of "session" as in the keepalive_xxx parameters, because there's no concept of a database session at the socket level.
* Indicate that the timeout is for individual socket operations.

Regards
Takayuki Tsunakawa
RE: Timeout parameters
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com] > > Maybe. Could you suggest good description? > Clients wait until the socket become readable when they try to get results > of their query. > If the socket state get readable, clients read results. > (See src/interfaces/libpq/fe-exec.c, fe-misc.c) > The current pg uses poll() to monitor socket states. > "socket_timeout" is used as timeout argument of poll(). > (See [1] about poll() and its arguments) > > Because "tcp_user_timeout" handles the timeout before and after the above > procedure, > there are different monitoring target between "socket_timeout" and > "tcp_user_timeout". > When the server is saturated, "statement_timeout" doesn't work while > "socket_timeout" does work. > In addition to this, the purpose of "statement_timeout" is to reduce > server's burden, I think. OK, I understand your intent. What I asked Kirk-san is to suggest a description for socket_timeout parameter that the user would see in PostgreSQL documentation. > My proposal is to enable clients to release their resource which is used > communication w/ the saturated server. I thought the primary purpose is to prevent applications from hanging too long, so that the user does not have to wait infinitely. Releasing resources is a secondary purpose. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com] > BTW, tcp_user_timeout parameter of servers and clients have same name in > my current implementation. > I think it would be better different name rather than same name. > I'll name them as the following a) or b): > a) server_tcp_user_timeout and client_tcp_user_timeout > b) tcp_user_timeout and user_timeout > b) is the same as the naming convention of keepalive, but it is not > user-friendly. > Do you come up with better name? > Or opinion? a) is not always accurate, because libpq is also used in the server. For example, postgres_fdw and WAL receiver in streaming replication. I'm OK with either the current naming or b). Frankly, I felt a bit strange when I first saw the keepalive parameters, wondering why the same names were not chosen. Regards Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Robert Haas [mailto:robertmh...@gmail.com]
> That might be enough to justify having the parameter. But I'm not
> quite sure how high the value would need to be set to actually get the
> benefit in a case like that, or what happens if you set it to a value
> that's not quite high enough. I think it might be good to play around
> some more with cases like this, just to get a feeling for how much
> time you can save in exchange for how much memory.

Why don't we consider this just like the database cache and other DBMSs' dictionary caches? That is:

* If you want to avoid infinite memory bloat, set the upper limit on the size.
* To find a better limit, check the hit ratio with the statistics view (based on Horiguchi-san's original 0004 patch, although that seems to need modification anyway).

Why do people try to get away from a familiar idea... Am I missing something?

Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would look or not? That is:

* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with corrections based on feedback
* 0003: merge the original 0003 and 0005, with corrections based on feedback

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi/出利葉 健 > I checked it with perf record -avg and perf report. > The following shows top 20 symbols during benchmark including kernel space. > The main difference between master (unpatched) and patched one seems that > patched one consumes cpu catcache-evict-and-refill functions including > SearchCatCacheMiss(), CatalogCacheCreateEntry(), > CatCacheCleanupOldEntries(). > So it seems to me that these functions needs further inspection > to suppress the performace decline as much as possible Thank you. It's good to see the expected functions, rather than strange behavior. The performance drop is natural just like the database cache's hit ratio is low. The remedy for performance by the user is also the same as the database cache -- increase the catalog cache. Regards Takayuki Tsunakawa
RE: Timeout parameters
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> 1) tcp_user_timeout parameter
> I think this can be "committed" separately when it's finalized.

Do you mean you've reviewed and tested the patch by simulating a communication failure in the way Nagaura-san suggested?

> 2) tcp_socket_timeout parameter
> - whether (a) it should abort the connection from pqWait() or other means, or
>   (b) cancel the statement similar to how psql does it, as suggested by Fabien

We have no choice but to terminate the connection, because we can't tell whether we can recover from the problem and continue to use the connection (e.g. a long-running query) or not (a permanent server or network failure). Regarding the place, pqWait() is the best (and possibly only) place. The purpose of this feature is to avoid waiting for a response from the server forever (or too long) in any case, as a last resort.

Oracle has similar parameters called SQLNET.RECV_TIMEOUT and SQLNET.SEND_TIMEOUT. From those names, I guess they use the SO_RCVTIMEO and SO_SNDTIMEO socket options. However, we can't use them because libpq uses non-blocking sockets and poll(), while SO_RCVTIMEO/SO_SNDTIMEO do not have an effect on poll():

[excerpt from "man 7 socket"]
--
SO_RCVTIMEO and SO_SNDTIMEO
Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval. If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK just as if the socket was specified to be non-blocking. If the timeout is set to zero (the default) then the operation will never timeout. Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), etc.
--

> - proper parameter name
> Based on your description below, I agree with Fabien that it's somehow
> the application/client side query timeout

I think the name is good because it indicates the socket-level timeout. That's just like PgJDBC and Oracle, and I didn't feel strange when I read their manuals.

> Perhaps you could also clarify a bit more through documentation on how
> socket_timeout handles the timeout differently from statement_timeout
> and tcp_user_timeout.

Maybe. Could you suggest a good description?

Regards
Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> Hm. Putting a list header for a purely-local data structure into shared
> memory seems quite ugly. Isn't there a better place to keep that?

Agreed. I've put it in a global variable.

> Do we really want a dlist here at all? I'm concerned that bloating
> LOCALLOCK will cost us when there are many locks involved. This patch
> increases the size of LOCALLOCK by 25% if I counted right, which does
> not seem like a negligible penalty.

To delete the LOCALLOCK in RemoveLocalLock(), we need a dlist; an slist requires the list iterator to be passed down from the callers (see the sketch below).

From: Andres Freund [mailto:and...@anarazel.de]
> Sure, we should do that. I don't buy the "illogical" bit, just moving
> hashcode up to after tag isn't more or less logical, and saves most of
> the padding, and moving the booleans to the end isn't better/worse
> either.

Thanks, I've done it.

From: Simon Riggs [mailto:si...@2ndquadrant.com]
> Can we use our knowledge of the structure of locks, i.e. partition locks
> are all children of the partitioned table, to do a better job?

I couldn't come up with an idea.

Regards
Takayuki Tsunakawa

faster-locallock-scan_v2.patch
Description: faster-locallock-scan_v2.patch
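A minimal sketch of the dlist-versus-slist point, using PostgreSQL's lib/ilist.h (the struct, field, and list-header names are illustrative, not the patch's actual code):
--
#include "postgres.h"
#include "lib/ilist.h"

/* Hypothetical stand-in for LOCALLOCK with the added list link. */
typedef struct MyLocalLock
{
	/* ... the existing LOCALLOCK fields would go here ... */
	dlist_node	node;		/* link in the backend-local lock list */
} MyLocalLock;

static dlist_head session_locallocks;	/* illustrative list header */

static void
remove_my_local_lock(MyLocalLock *locallock)
{
	/*
	 * With a dlist, an entry unlinks itself in O(1) given only its own
	 * node -- no list head and no iterator needed, so this works from any
	 * call site.  With an slist, we would need either an O(n)
	 * slist_delete(&session_locallocks, ...) scan, or to change every
	 * caller to pass down an slist_mutable_iter for slist_delete_current().
	 */
	dlist_delete(&locallock->node);
}
--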
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> On 2/12/19 7:33 AM, Tsunakawa, Takayuki wrote:
> > Imai-san confirmed performance improvement with this patch:
> >
> > https://commitfest.postgresql.org/22/1993/
>
> Can you quantify the effects? That is, how much slower/faster does it get?

Ugh, sorry, I wrote the wrong URL. The correct page is:

https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A512787EC%40g01jpexmbkw24

The quoted figures are:

[v20 + faster-locallock-scan.patch]
auto: 9,069 TPS
custom: 9,015 TPS

[v20]
auto: 8,037 TPS
custom: 8,933 TPS

In the original problematic case, plan_cache_mode = auto (the default), we see about a 13% improvement (9,069 vs. 8,037 TPS).

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> number of tables | 100         | 1000        | 10000
> -----------------+-------------+-------------+------------
> TPS (master)     | 10966       | 10654       | 9099
> TPS (patch)      | 11137 (+1%) | 10710 (+0%) | 772 (-91%)
>
> It seems that before the cache exceeds the limit (no pruning at 100 and 1000),
> the results are almost the same as master, but after exceeding the limit (at 10000)
> the decline happens.

How many concurrent clients? Can you show perf's call-graph sampling profiles of both the unpatched and patched versions, to confirm that the bottleneck is around catcache eviction and refill?

Regards
Takayuki Tsunakawa
RE: Protect syscache from bloating with negative cache entries
From: Tomas Vondra [mailto:tomas.von...@2ndquadrant.com]
> 0.7% may easily be just a noise, possibly due to differences in layout
> of the binary. How many runs? What was the variability of the results
> between runs? What hardware was this tested on?

3 runs, with a variability of about +-2%. Luckily, all three runs (coincidentally?) showed a slight performance decrease with the patched version. The figures I wrote are the highest ones. The hardware is a very poor man's VM:

CPU: 4-core Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
RAM: 4GB

> FWIW I doubt tests with such a small schema are proving anything -
> the cache/lists are likely tiny. That's why I tested with a much larger
> number of relations.

Yeah, it takes many relations to exercise the repeated catcache eviction and new entry creation, which would show users the need to increase catalog_cache_max_size. I tested with a small number of relations to see the impact of the additional processing while the catcache is not full -- the memory accounting and the catcache LRU chain maintenance.

Regards
Takayuki Tsunakawa