Re: [HACKERS] Avoiding repeated snapshot computation
On Thu, Aug 16, 2012 at 9:02 PM, Bruce Momjian wrote:
> Did we ever make a decision on this patch?

I committed it as 1fc3d18faa8f4476944bc6854be0f7f6adf4aec8.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Avoiding repeated snapshot computation
Did we ever make a decision on this patch?

---

On Sat, Nov 26, 2011 at 09:22:50PM +0530, Pavan Deolasee wrote:
> In some recent benchmarks and profile data, I saw GetSnapshotData at the
> very top or near the top. For smaller numbers of clients it can account
> for 10-20% of time, but with more clients I have seen it taking up as much
> as 40% of sample time. Unfortunately, the machine on which I was running
> these tests is currently not available, so I don't have the exact numbers,
> but the observation is broadly correct. Our recent work on separating the
> hot members of PGPROC into a separate array will definitely reduce data
> cache misses and reduce the GetSnapshotData time, but it probably still
> accounts for a large enough critical section for a highly contended lock.
>
> I think now that we have reduced the run time of the function itself, we
> should try to reduce the number of times the function is called. Robert
> proposed a way to reduce the number of calls per transaction. I think we
> can go one step further and reduce the number across transactions.
>
> One major problem today could be the way LWLock works. If the lock is
> currently held in SHARED mode by some backend and some other backend now
> requests it in SHARED mode, it will get it immediately. That's probably
> the right thing to do, because you don't want a reader to wait when the
> lock is readily available. But in the case of GetSnapshotData(), every
> reader is doing exactly the same thing: it computes a snapshot based on
> the same shared state and would compute exactly the same snapshot (if we
> ignore the fact that we don't include the caller's XID in the xip array,
> but that's a minor detail). And because of the way LWLock works, more and
> more readers get in to compute the snapshot until the exclusive waiters
> get a window to sneak in, as more and more processes slowly start sleeping
> for exclusive access. To depict it, four transactions make overlapping
> calls to GetSnapshotData(), so the total critical section starts when the
> first caller enters it and ends when the last caller exits:
>
> Txn1 --[ SHARED ]---------------------
> Txn2 ------[ SHARED ]-----------------
> Txn3 ---------[ SHARED ]--------------
> Txn4 -------------[ SHARED ]----------
>      |<-------- Total Time -------->|
>
> A couple of ideas come to mind to solve this issue.
>
> A snapshot once computed remains valid for every caller, irrespective of
> its origin, until at least one transaction ends. So we can store the last
> computed snapshot in some shared area and reuse it for all subsequent
> GetSnapshotData calls. The shared snapshot gets invalidated when some
> transaction ends by calling ProcArrayEndTransaction(). I tried this
> approach and saw a 15% improvement for 32-80 clients on the 32-core HP IA
> box with pgbench -s 100 -N tests. Not bad, but I think this can be
> improved further.
>
> What we can do is: when a transaction comes to compute its snapshot, it
> checks whether some other transaction is already computing one. If so, it
> just sleeps on the lock. When the other process finishes computing the
> snapshot, it saves the snapshot in a shared area and wakes up all
> processes waiting for the snapshot. Those processes then just copy the
> snapshot from the shared area and are done. This will not only reduce the
> total CPU consumption by avoiding repetitive work, but will also reduce
> the total time for which ProcArrayLock is held in SHARED mode by avoiding
> a pipeline of GetSnapshotData calls. I am currently trying the shared work
> queue mechanism to implement this, but I am sure we can do it in some
> other way too.
>
> Thanks,
> Pavan
>
> --
> Pavan Deolasee
> EnterpriseDB     http://www.enterprisedb.com

--
Bruce Momjian        http://momjian.us
EnterpriseDB         http://enterprisedb.com

+ It's impossible for everything to be true. +
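For readers following along, a minimal sketch of the first idea - caching the last computed snapshot in shared memory - might look like the following. All names here are illustrative, not from the actual patch; ComputeSnapshotFromProcArray and PublishSharedSnapshot stand in for the real work, and a production version would also need to arbitrate between concurrent publishers (e.g. via the shared work queue Pavan mentions):

/*
 * Illustrative sketch only - not the actual patch.  A single shared slot
 * caches the last snapshot computed under ProcArrayLock.  It stays valid
 * until ProcArrayEndTransaction() clears 'valid' (holding the lock
 * exclusively), at which point the next reader recomputes and republishes.
 */
#define MAX_CACHED_XIDS 1024	/* illustrative bound */

typedef struct SharedSnapshotSlot
{
	bool		valid;			/* cleared by ProcArrayEndTransaction() */
	TransactionId xmin;
	TransactionId xmax;
	uint32		xcnt;
	TransactionId xip[MAX_CACHED_XIDS];
} SharedSnapshotSlot;

static SharedSnapshotSlot *sharedSlot;	/* lives in shared memory */

Snapshot
GetSnapshotDataCached(Snapshot snapshot)
{
	LWLockAcquire(ProcArrayLock, LW_SHARED);
	if (sharedSlot->valid)
	{
		/* Fast path: copy the cached snapshot instead of recomputing. */
		snapshot->xmin = sharedSlot->xmin;
		snapshot->xmax = sharedSlot->xmax;
		snapshot->xcnt = sharedSlot->xcnt;
		memcpy(snapshot->xip, sharedSlot->xip,
			   sharedSlot->xcnt * sizeof(TransactionId));
	}
	else
	{
		/*
		 * Slow path: scan the ProcArray and publish the result.  Both
		 * helpers are hypothetical.
		 */
		ComputeSnapshotFromProcArray(snapshot);
		PublishSharedSnapshot(sharedSlot, snapshot);
	}
	LWLockRelease(ProcArrayLock);
	return snapshot;
}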
Re: [HACKERS] Avoiding repeated snapshot computation
On Sat, Nov 26, 2011 at 5:49 PM, Andres Freund wrote:
> On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
>> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund wrote:
>> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> >> I'd just as soon keep the fields in a logical order.
>> >
>> > Btw, I don't think the new order is necessarily worse than the old one.
>>
>> You forgot to attach the benchmark results.
>>
>> My impression is that cache lines on modern hardware are 64 or 128
>> *bytes*, in which case this wouldn't matter a bit.
> All current x86 cpus use 64 bytes. The 2nd 128-bit reference was a typo.
> Sorry for that.
> And why is 72 => 56 *bytes* (I even got that one right) not relevant for
> 64-byte cachelines?

OK, so I got around to looking at this again today; sorry for the delay. I
agree that 72 -> 56 bytes is very relevant for 64-byte cache lines. I had
not realized the structure was as big as that.

> And yea, I didn't add benchmark results. I don't think I *have* to do that
> when making suggestions to somebody trying to improve something specific.
> I also currently don't have hardware where I can sensibly run at a high
> enough concurrency to see that GetSnapshotData takes ~40% of runtime.
> Additional cacheline references around synchronized access can hurt, to my
> knowledge...

I tested this on Nate Boley's 32-core box today, with 32 clients running a
select-only pgbench test. Results of individual 5-minute runs:

results.master.32:tps = 171701.947919 (including connections establishing)
results.master.32:tps = 22.430112 (including connections establishing)
results.master.32:tps = 218257.478461 (including connections establishing)
results.master.32:tps = 226425.964855 (including connections establishing)
results.master.32:tps = 218687.662373 (including connections establishing)
results.master.32:tps = 219819.451605 (including connections establishing)
results.master.32:tps = 216800.131634 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 181997.531357 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 171505.145440 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 226970.348605 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 169725.115763 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 219548.690522 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 175440.705722 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 217154.173823 (including connections establishing)

There's no statistically significant difference between those two data sets;
if anything, the results with the patch look like they might be worse,
although I believe that's an artifact - some runs seem to mysteriously come
out in the 170-180 range instead of the 215-225 range, with nothing in
between. But even if you average only the good runs from each set, there's
no material difference.

Having said that, I am coming around to the view that we should apply this
patch anyway, just because it reduces memory consumption. Since the size
change crosses a power-of-two boundary, I believe it will actually cut the
size of a palloc'd chunk for a SnapshotData object from 128 bytes to 64
bytes. I doubt we can measure the benefit of that even on a microbenchmark,
unless someone has a better idea than I do for making PostgreSQL take
ridiculous numbers of snapshots. It still seems like a good idea, though: a
penny saved is a penny earned.

With respect to Tom's objections downthread, I don't think the new ordering
is significantly more confusing than the old one. xcnt/subxcnt/xip/subxip
doesn't seem measurably less clear than xcnt/xip/subxcnt/subxip, and we're
not normally in the habit of letting such concerns get in the way of
squeezing out alignment padding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
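To see why crossing a power-of-two boundary matters here: PostgreSQL's aset.c allocator rounds small palloc requests up to the next power of two, so a 72-byte struct occupies a 128-byte chunk while a 56-byte one fits in 64. A rough standalone illustration of that rounding (simplified; the real allocator also adds per-chunk header overhead):

#include <stdio.h>

/*
 * Rough illustration of aset.c's behavior: small requests are rounded up
 * to the next power of two.  Details simplified for illustration.
 */
static size_t
chunk_size(size_t request)
{
	size_t		size = 8;		/* smallest chunk class, illustrative */

	while (size < request)
		size <<= 1;
	return size;
}

int
main(void)
{
	printf("72-byte struct -> %zu-byte chunk\n", chunk_size(72));	/* 128 */
	printf("56-byte struct -> %zu-byte chunk\n", chunk_size(56));	/* 64 */
	return 0;
}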
Re: [HACKERS] Avoiding repeated snapshot computation
On Tue, Nov 29, 2011 at 7:12 AM, Pavan Deolasee wrote:
> I think that's a good idea. We need a representation that needs minimum
> processing to derive the snapshot.

I was looking over the generated code for GetSnapshotData to see if there
is any low-hanging fruit for micro-optimization. The assembly mostly looks
pretty tight, but there are three function calls, to TransactionIdPrecedes
and TransactionIdFollowsOrEquals. All the parameters are known to be normal
xids, so there are duplicated checks for that, plus a lot of movs for the
calling convention. I wonder if replacing them with special-case macros
would be a good idea. In that case each whole check compiles down to one
cmp instruction.

I'm running a set of benchmarks now on my laptop, but I guess the
difference will mostly become noticeable on beefier hardware where
ProcArrayLock is heavily contended.

Attached is a patch, if anyone wishes to give it a go.

--
Ants Aasma

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 19ff524..59ff996 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1324,7 +1324,7 @@ GetSnapshotData(Snapshot snapshot)
 		/* Update globalxmin to be the smallest valid xmin */
 		xid = pgxact->xmin; /* fetch just once */
 		if (TransactionIdIsNormal(xid) &&
-			TransactionIdPrecedes(xid, globalxmin))
+			TransactionIdPrecedesBothNormal(xid, globalxmin))
 			globalxmin = xid;
 
 		/* Fetch xid just once - see GetNewTransactionId */
@@ -1341,11 +1341,11 @@ GetSnapshotData(Snapshot snapshot)
 		 */
 		if (TransactionIdIsNormal(xid))
 		{
-			if (TransactionIdFollowsOrEquals(xid, xmax))
+			if (TransactionIdFollowsOrEqualsBothNormal(xid, xmax))
 				continue;
 			if (pgxact != MyPgXact)
 				snapshot->xip[count++] = xid;
-			if (TransactionIdPrecedes(xid, xmin))
+			if (TransactionIdPrecedesBothNormal(xid, xmin))
 				xmin = xid;
 		}
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index c038fd9..31401f9 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -44,6 +44,9 @@
 #define TransactionIdStore(xid, dest)	(*(dest) = (xid))
 #define StoreInvalidTransactionId(dest) (*(dest) = InvalidTransactionId)
 
+#define TransactionIdPrecedesBothNormal(id1, id2) ((int32) ((id1) - (id2)) < 0)
+#define TransactionIdFollowsOrEqualsBothNormal(id1, id2) ((int32) ((id1) - (id2)) >= 0)
+
 /* advance a transaction ID variable, handling wraparound correctly */
 #define TransactionIdAdvance(dest)	\
 	do { \
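For context, the out-of-line comparator these macros specialize looks roughly like this (paraphrased from src/backend/access/transam/transam.c of that era). The macros win by skipping the is-it-normal checks and the function-call overhead, which is safe in GetSnapshotData because both arguments have already been verified to be normal XIDs:

bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
	/*
	 * If either ID is a permanent XID, just do unsigned comparison.  If
	 * both are normal, do a modulo-2^32 comparison so that XIDs compare
	 * correctly across wraparound.
	 */
	int32		diff;

	if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
		return (id1 < id2);

	diff = (int32) (id1 - id2);
	return (diff < 0);
}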
Re: [HACKERS] Avoiding repeated snapshot computation
Hi,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> This may not be necessary, but can you please share the modified config
> you used for the last run.

I just copied the .conf I had for some independent development.

max_connections = 100
listen_addresses = ''
port = 5432
shared_buffers = 2GB
temp_buffers = 64MB
work_mem = 96MB
maintenance_work_mem = 1GB
effective_cache_size = 20GB
log_line_prefix = '%d %a %c %v %x %m '
log_statement = 'ddl'

# decreases variance
track_activities = off
track_counts = off
autovacuum = off
update_process_title = off

logging_collector = off
log_destination = stderr
log_min_messages = error
log_checkpoints = on
log_autovacuum_min_duration = 0
log_error_verbosity = verbose

synchronous_commit = off
checkpoint_segments = 30
checkpoint_timeout = 30min
checkpoint_completion_target = 0.8

But about the only thing really important is the shared_buffers difference
(commenting it out nearly yielded the former speed).

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
> On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner wrote:
> > Andres Freund wrote:
> >> I would like to see somebody running a benchmark on a machine with
> >> higher concurrency...
> >
> > Yeah, me too. I don't find it at all hard to believe that we will see
> > significant performance benefit by changing the PGPROC structure so
> > that all parts of it can be accessed through one cache line rather than
> > two. The fact that we saw improvement when changing it down to two,
> > even though there is extra indirection in some places, seems like
> > pretty solid evidence on the face of it.
>
> I think there is a fundamental difference between the PGPROC patch that
> we did and rearranging the SnapshotData members as Andres has proposed.
> The PGPROC structure is a shared-memory area, very heavily accessed by
> GetSnapshotData (and some others); access to it increases linearly as the
> number of clients goes up. The SnapshotData structure lives mostly in
> backend-local memory (unless we do something like what I suggested in the
> OP), so fitting it in a single cache line may or may not have much
> impact. We don't even guarantee that it starts at a cacheline boundary,
> which means in most cases it will still be spread over multiple cache
> lines.

Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see
the earlier reformatted mail from Gurjeet), and the benefits were definitely
smaller without concurrency. But I agree - the benefits won't be even
remotely as big as the PGPROC split-up patch.

You're also right about the alignment not being predictable enough. Perhaps
pg should start using/introducing a force_cacheline_align macro, which can
easily be implemented on most compilers? Especially for SnapshotData and
friends, which are allocated statically, that should be good enough.
Something along the lines of:

#ifdef __GNUC__
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#elif defined(_MSC_VER)
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)
#endif

#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)

Looks complete enough for now.
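A minimal standalone sketch of how such a macro would be applied follows; MyAlignedStruct is purely illustrative, and only the GCC branch is shown. With the attribute in place, both sizeof() and the address of any instance are rounded up to the 64-byte cacheline size:

#include <stdio.h>

#ifdef __GNUC__
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#else
#define force_align_to(alignment)
#endif

#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)

/* Illustrative struct; the attribute pads and aligns it to 64 bytes. */
typedef struct MyAlignedStruct
{
	int			a;
	int			b;
} force_cacheline_align MyAlignedStruct;

int
main(void)
{
	MyAlignedStruct s;

	/* Expect sizeof = 64 and an address that is a multiple of 64. */
	printf("sizeof = %zu, addr %% 64 = %zu\n",
		   sizeof(MyAlignedStruct), (size_t) &s % 64);
	return 0;
}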
Back to PGPROC: Looking at the current PGPROC layout on x86-64 with the
Linux ABI (generated by pahole):

struct PGPROC {
	SHM_QUEUE             links;                   /*    0    16 */
	PGSemaphoreData       sem;                     /*   16     8 */
	int                   waitStatus;              /*   24     4 */
	Latch                 procLatch;               /*   28    12 */
	LocalTransactionId    lxid;                    /*   40     4 */
	int                   pid;                     /*   44     4 */
	int                   pgprocno;                /*   48     4 */
	BackendId             backendId;               /*   52     4 */
	Oid                   databaseId;              /*   56     4 */
	Oid                   roleId;                  /*   60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	bool                  recoveryConflictPending; /*   64     1 */
	bool                  lwWaiting;               /*   65     1 */
	bool                  lwExclusive;             /*   66     1 */

	/* XXX 5 bytes hole, try to pack */

	struct PGPROC *       lwWaitLink;              /*   72     8 */
	LOCK *                waitLock;                /*   80     8 */
	PROCLOCK *            waitProcLock;            /*   88     8 */
	LOCKMODE              waitLockMode;            /*   96     4 */
	LOCKMASK              heldLocks;               /*  100     4 */
	XLogRecPtr            waitLSN;                 /*  104     8 */
	int                   syncRepState;            /*  112     4 */

	/* XXX 4 bytes hole, try to pack */

	SHM_QUEUE             syncRepLinks;            /*  120    16 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	SHM_QUEUE             myProcLocks[16];         /*  136   256 */
	/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
	struct XidCache       subxids;                 /*  392   256 */
	/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
	LWLockId              backendLock;             /*  648     4 */

	/* XXX 4 bytes hole, try to pack */

	uint64                fpLockBits;              /*  656     8 */
	Oid                   fpRelId[16];             /*  664    64 */
	/* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
	bool                  fpVXIDLock;              /*  728     1 */

	/* XXX 3 bytes hole, try to pack */

	LocalTransactionId    fpLocalTransactionId;    /*
Re: [HACKERS] Avoiding repeated snapshot computation
Pavan Deolasee wrote:
> On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane wrote:
> > Pavan Deolasee writes:
> >> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas wrote:
> >>> Furthermore, it's hard to understand how this could be a net
> >>> improvement in the general case, because now both B and F are copying
> >>> everything twice (once to the shared area and once to backend-local
> >>> memory) instead of just once (to backend-local memory) and C and D
> >>> are sleeping (yikes!).
> >
> >> Yes, B and F pay the price of a double copy. But I think it can be a
> >> net saving because C and D (and many more, hopefully) don't need to
> >> recompute the snapshot again by going over a potentially large
> >> ProcArray.
> >
> > Like Robert, I'm finding it hard to believe that this isn't going to be
> > a net pessimization in as many or more cases as it'd be a win. Also,
> > didn't we just get rid of most of the issue of "going over a large
> > ProcArray" with the move of hot members to a separate array?
>
> Yeah, separating the PGPROC members has helped a lot. But that does not
> reduce the number of GetSnapshotData calls. It only makes each call less
> expensive. As I said, I had seen a 15-20% improvement with a not even
> slightly tuned patch, so I am optimistic about the potential of the
> approach.

Agreed. I think there is great potential here. We have been stumped for
years about how to reduce this overhead, and it seems you are now on a
promising path.

--
Bruce Momjian        http://momjian.us
EnterpriseDB         http://enterprisedb.com

+ It's impossible for everything to be true. +
Re: [HACKERS] Avoiding repeated snapshot computation
On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane wrote:
> Pavan Deolasee writes:
>> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas wrote:
>>> Furthermore, it's hard to understand how this could be a net
>>> improvement in the general case, because now both B and F are copying
>>> everything twice (once to the shared area and once to backend-local
>>> memory) instead of just once (to backend-local memory) and C and D are
>>> sleeping (yikes!).
>
>> Yes, B and F pay the price of a double copy. But I think it can be a net
>> saving because C and D (and many more, hopefully) don't need to
>> recompute the snapshot again by going over a potentially large
>> ProcArray.
>
> Like Robert, I'm finding it hard to believe that this isn't going to be a
> net pessimization in as many or more cases as it'd be a win. Also, didn't
> we just get rid of most of the issue of "going over a large ProcArray"
> with the move of hot members to a separate array?

Yeah, separating the PGPROC members has helped a lot. But that does not
reduce the number of GetSnapshotData calls. It only makes each call less
expensive. As I said, I had seen a 15-20% improvement with a not even
slightly tuned patch, so I am optimistic about the potential of the
approach.

> In the end, getting a snapshot is exactly about copying information out
> of shared memory. Perhaps it would be more productive to think about how
> we could further modify the ProcArray representation to reduce the
> impedance mismatch between it and the snapshot representation, rather
> than introducing a second shared-memory copy of the same information.

I think that's a good idea. We need a representation that needs minimum
processing to derive the snapshot.

One idea is to maintain a bitmap of currently running transactions in
shared memory. The bitmap has to be significantly larger than the ProcArray
itself because there can be gaps, but even then, since it's just one bit
per XID, a bitmap can represent a window of, say, 16K or 32K consecutive
XIDs. We can also have a summary bitmap to quickly find whether there is at
least one running transaction in any given chunk. If there are long-running
transactions that don't fit in the bitmap, we can track them separately in
an array. A snapshot is then just a copy of this bitmap along with some
additional information.

Will try to think more about it and see if I can test this idea. But any
comments welcome.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
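For concreteness, a rough standalone sketch of the bitmap representation described above; all names, window sizes, and chunk sizes here are illustrative assumptions, not from any patch. Taking a snapshot then reduces to a memcpy of the window:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

/*
 * One bit per XID in a window of 32K consecutive XIDs, plus a summary
 * bitmap with one bit per 256-XID chunk so readers can skip empty regions.
 */
#define XID_WINDOW	32768
#define CHUNK_SIZE	256
#define NUM_CHUNKS	(XID_WINDOW / CHUNK_SIZE)

typedef struct RunningXidsBitmap
{
	TransactionId base;						/* first XID in the window */
	uint8_t		bits[XID_WINDOW / 8];		/* 1 = transaction in progress */
	uint8_t		summary[NUM_CHUNKS / 8];	/* 1 = chunk has any bit set */
} RunningXidsBitmap;

static void
bitmap_set_running(RunningXidsBitmap *bm, TransactionId xid)
{
	uint32_t	off = xid - bm->base;	/* assumes xid is inside the window */

	bm->bits[off / 8] |= 1 << (off % 8);
	bm->summary[(off / CHUNK_SIZE) / 8] |= 1 << ((off / CHUNK_SIZE) % 8);
}

static bool
bitmap_is_running(const RunningXidsBitmap *bm, TransactionId xid)
{
	uint32_t	off = xid - bm->base;
	uint32_t	chunk = off / CHUNK_SIZE;

	/* Summary bit lets us skip whole empty 256-XID chunks quickly. */
	if (!(bm->summary[chunk / 8] & (1 << (chunk % 8))))
		return false;
	return (bm->bits[off / 8] & (1 << (off % 8))) != 0;
}

/* Taking a "snapshot" is then essentially a flat copy of the window. */
static void
bitmap_snapshot(const RunningXidsBitmap *src, RunningXidsBitmap *dst)
{
	memcpy(dst, src, sizeof(*src));
}

(Clearing a bit at transaction end would also have to rescan its chunk to maintain the summary bitmap; that is omitted here for brevity.)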
Re: [HACKERS] Avoiding repeated snapshot computation
On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner wrote:
> Andres Freund wrote:
>
>> I would like to see somebody running a benchmark on a machine with
>> higher concurrency...
>
> Yeah, me too. I don't find it at all hard to believe that we will see
> significant performance benefit by changing the PGPROC structure so that
> all parts of it can be accessed through one cache line rather than two.
> The fact that we saw improvement when changing it down to two, even
> though there is extra indirection in some places, seems like pretty solid
> evidence on the face of it.

I think there is a fundamental difference between the PGPROC patch that we
did and rearranging the SnapshotData members as Andres has proposed. The
PGPROC structure is a shared-memory area, very heavily accessed by
GetSnapshotData (and some others); access to it increases linearly as the
number of clients goes up. The SnapshotData structure lives mostly in
backend-local memory (unless we do something like what I suggested in the
OP), so fitting it in a single cache line may or may not have much impact.
We don't even guarantee that it starts at a cacheline boundary, which means
in most cases it will still be spread over multiple cache lines.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
Re: [HACKERS] Avoiding repeated snapshot computation
Andres Freund wrote:

> I would like to see somebody running a benchmark on a machine with
> higher concurrency...

Yeah, me too. I don't find it at all hard to believe that we will see
significant performance benefit by changing the PGPROC structure so that
all parts of it can be accessed through one cache line rather than two. The
fact that we saw improvement when changing it down to two, even though
there is extra indirection in some places, seems like pretty solid evidence
on the face of it.

I went in to the office on Sunday to try to get a few hours of benchmarks
for this on our new monster machine, but the DBA responsible for putting it
into production was on it first, loading it with production data. At this
point it will get really hard to schedule any more of the 20-hour runs that
were the basis of most of my recent numbers, but I may be able to shut down
production use for a two- or three-hour window on a weekend to run an
abbreviated set, focusing on the higher concurrency levels. Ideally that
would be on top of the other pending performance patches.

Based on my earlier rounds of benchmarking, I expect that this approach
will show the greatest benefit at the higher concurrency levels, where
cache lines are most stressed.

-Kevin
Re: [HACKERS] Avoiding repeated snapshot computation
Hi Gurjeet,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund wrote:
> > Being prodded like that, I ran a very quick benchmark on my
> > workstation. Unfortunately that means I cannot work during that time,
> > which is why I kept it rather short...
> >
> > That machine has 2 E5520 @ 2.27GHz, which means 2 (sockets) * 4 (cores)
> > * 2 (threads), and 24GB of ram.
> >
> > Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20
> > pgbench
> >
> > pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
> >
> > origsnap:    92825.743958 93145.110901 93389.915474 93175.482351
> > reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489
> >
> > pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60
> >
> > origsnap:    81846.743329 81545.175672 81702.755226 81576.576435
> >              81228.154119 81546.047708 81421.436262
> > reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
> >              81762.421592 81496.333144 81732.088876
> >
> > At that point I noticed I had accidentally run with a nearly stock
> > config... An even shorter run with a more appropriate config yielded:
> >
> > pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20
> >
> > origsnap:    102234.664020 102003.449741 102119.509053 101722.410387
> >              101973.651318 102056.440561
> > reordersnap: 103444.877879 103385.08     103302.318923 103372.659486
> >              103330.157612 103313.833821
> >
> > Looks like a win to me. Even on this comparatively small machine.
>
> This may not be necessary, but can you please share the modified config
> you used for the last run?

Can't right now because I don't have access from here, but I will do so. I
don't think it's particularly interesting though.

> I tabulated your last results to make them more readable, and added
> columns to show the improvement.
>
> origsnap         reordersnap      diff          %age improvement
> ----------------------------------------------------------------
> 102234.66402     103444.877879    1210.213859   1.1837607827
> 102003.449741    103385.08        1382.439067   1.3552865815
> 102119.509053    103302.318923    1182.80987    1.1582604352
> 101722.410387    103372.659486    1650.249099   1.6223063263
> 101973.651318    103330.157612    1356.506294   1.3302517626
> 102056.440561    103313.833821    1257.39326    1.2320567454
>
> That looks like a win to me too. We're getting a little over 1%
> improvement for free!

At least if we can convince Tom that such a change would be ok ;) I would
like to see somebody running a benchmark on a machine with higher
concurrency...

> Maybe submitting this patch to the commitfest might help get some serious
> consideration from a reviewer.

It wasn't actually meant as an actual patch but as a comment, but well ;).
Will add it.

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund wrote:
> On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> > On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund wrote:
> > > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> > >> I'd just as soon keep the fields in a logical order.
> > >
> > > Btw, I don't think the new order is necessarily worse than the old
> > > one.
> >
> > You forgot to attach the benchmark results.
> >
> > My impression is that cache lines on modern hardware are 64 or 128
> > *bytes*, in which case this wouldn't matter a bit.
> >
> > But testing is even better than guessing.
>
> Being prodded like that, I ran a very quick benchmark on my workstation.
> Unfortunately that means I cannot work during that time, which is why I
> kept it rather short...
>
> That machine has 2 E5520 @ 2.27GHz, which means 2 (sockets) * 4 (cores) *
> 2 (threads), and 24GB of ram.
>
> Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20
> pgbench
>
> pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
>
> origsnap:    92825.743958 93145.110901 93389.915474 93175.482351
> reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489
>
> pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60
>
> origsnap:    81846.743329 81545.175672 81702.755226 81576.576435
>              81228.154119 81546.047708 81421.436262
> reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
>              81762.421592 81496.333144 81732.088876
>
> At that point I noticed I had accidentally run with a nearly stock
> config... An even shorter run with a more appropriate config yielded:
>
> pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20
>
> origsnap:    102234.664020 102003.449741 102119.509053 101722.410387
>              101973.651318 102056.440561
> reordersnap: 103444.877879 103385.08     103302.318923 103372.659486
>              103330.157612 103313.833821
>
> Looks like a win to me. Even on this comparatively small machine.

This may not be necessary, but can you please share the modified config you
used for the last run?

I tabulated your last results to make them more readable, and added columns
to show the improvement.

origsnap         reordersnap      diff          %age improvement
----------------------------------------------------------------
102234.66402     103444.877879    1210.213859   1.1837607827
102003.449741    103385.08        1382.439067   1.3552865815
102119.509053    103302.318923    1182.80987    1.1582604352
101722.410387    103372.659486    1650.249099   1.6223063263
101973.651318    103330.157612    1356.506294   1.3302517626
102056.440561    103313.833821    1257.39326    1.2320567454

That looks like a win to me too. We're getting a little over 1% improvement
for free!

Maybe submitting this patch to the commitfest might help get some serious
consideration from a reviewer.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Avoiding repeated snapshot computation
Andres Freund wrote:

> All current x86 cpus use 64 bytes.

That matches what I found in recent research on the topic.

-Kevin
Re: [HACKERS] Avoiding repeated snapshot computation
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> >
> > Btw, I don't think the new order is necessarily worse than the old one.
>
> You forgot to attach the benchmark results.
>
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
>
> But testing is even better than guessing.

Being prodded like that, I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during that time, which is why I
kept it rather short...

That machine has 2 E5520 @ 2.27GHz, which means 2 (sockets) * 4 (cores) * 2
(threads), and 24GB of ram.

Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench

pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60

origsnap:    92825.743958 93145.110901 93389.915474 93175.482351
reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60

origsnap:    81846.743329 81545.175672 81702.755226 81576.576435
             81228.154119 81546.047708 81421.436262
reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
             81762.421592 81496.333144 81732.088876

At that point I noticed I had accidentally run with a nearly stock
config... An even shorter run with a more appropriate config yielded:

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20

origsnap:    102234.664020 102003.449741 102119.509053 101722.410387
             101973.651318 102056.440561
reordersnap: 103444.877879 103385.08     103302.318923 103372.659486
             103330.157612 103313.833821

Looks like a win to me. Even on this comparatively small machine.

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> >
> > Btw, I don't think the new order is necessarily worse than the old one.
>
> You forgot to attach the benchmark results.
>
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.

All current x86 cpus use 64 bytes. The second 128-bit reference was a typo;
sorry for that. And why is 72 => 56 *bytes* (I even got that one right) not
relevant for 64-byte cachelines?

And yeah, I didn't add benchmark results. I don't think I *have* to do that
when making suggestions to somebody trying to improve something specific. I
also currently don't have hardware where I can sensibly run at a high
enough concurrency to see that GetSnapshotData takes ~40% of runtime.
Additional cacheline references around synchronized access can hurt, to my
knowledge...

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund wrote:
> On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> I'd just as soon keep the fields in a logical order.
> Btw, I don't think the new order is necessarily worse than the old one.

You forgot to attach the benchmark results.

My impression is that cache lines on modern hardware are 64 or 128 *bytes*,
in which case this wouldn't matter a bit.

But testing is even better than guessing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Avoiding repeated snapshot computation
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> I'd just as soon keep the fields in a logical order.

Btw, I don't think the new order is necessarily worse than the old one.

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> Andres Freund writes:
> > You could also try whether it makes a difference to reduce SnapshotData
> > to one cacheline instead of two. The data itself fits into one without
> > problems. Trivial patch attached.
> On what grounds do you argue that this patch gets SnapshotData into one
> cacheline today (and on what hardware), or that it will continue to do so
> in the future? If we're this close to the edge then any future addition
> to the struct will destroy the point. I'd just as soon keep the fields in
> a logical order.

To my knowledge there is no current, supported hardware (i.e. I don't count
Itanium) that doesn't have 64-byte cachelines, except in the embedded
market. I also don't see a movement towards 128-bit CPUs or anything else
requiring bigger pointers. And if platforms with 128-byte cachelines or
such were to gain popularity, nothing would be lost by this change. Which
change are you "afraid" of?

Now, the holes I talked about obviously were on a 64-bit machine. There are
a few situations where improving the layout to fit 8-byte alignment for
pointers makes the situation worse for 32-bit, but looking a bit carefully
at the changes one makes should catch those problems.

Also, it's not *that* close to overflowing again. It was 72 bytes before;
now it's 56 on a 64-bit x86 machine with the Linux ABI.

I agree that logical order is important. I don't propose changing all
structs in pg mechanically. It's sad that there is no sensible method to
let the compiler safely reorder struct members across compilers (and
compiler versions) and platforms...

Andres
Re: [HACKERS] Avoiding repeated snapshot computation
Andres Freund writes:
> You could also try whether it makes a difference to reduce SnapshotData
> to one cacheline instead of two. The data itself fits into one without
> problems. Trivial patch attached.

On what grounds do you argue that this patch gets SnapshotData into one
cacheline today (and on what hardware), or that it will continue to do so
in the future? If we're this close to the edge then any future addition to
the struct will destroy the point. I'd just as soon keep the fields in a
logical order.

			regards, tom lane
Re: [HACKERS] Avoiding repeated snapshot computation
Hi,

On Saturday, November 26, 2011 04:52:50 PM Pavan Deolasee wrote:
> I think now that we have reduced the run time of the function itself, we
> should now try to reduce the number of times the function is called.
> Robert proposed a way to reduce the number of calls per transaction. I
> think we can go one more step further and reduce the number across
> transactions.

You could also try whether it makes a difference to reduce SnapshotData to
one cacheline instead of two. The data itself fits into one without
problems. Trivial patch attached.

Generally, I think we should check this for most of the commonly used
structures; we have many with too much padding.

Andres

diff --git a/src/include/utils/snapshot.h b/src/include/utils/snapshot.h
index 859e52a..b9c2209 100644
--- a/src/include/utils/snapshot.h
+++ b/src/include/utils/snapshot.h
@@ -46,13 +46,14 @@ typedef struct SnapshotData
 	 */
 	TransactionId xmin;			/* all XID < xmin are visible to me */
 	TransactionId xmax;			/* all XID >= xmax are invisible to me */
-	uint32		xcnt;			/* # of xact ids in xip[] */
 	TransactionId *xip;			/* array of xact IDs in progress */
+	uint32		xcnt;			/* # of xact ids in xip[] */
 	/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
 	int32		subxcnt;		/* # of xact ids in subxip[] */
 	TransactionId *subxip;		/* array of subxact IDs in progress */
 	bool		suboverflowed;	/* has the subxip array overflowed? */
 	bool		takenDuringRecovery;	/* recovery-shaped snapshot? */
+	bool		copied;			/* false if it's a static snapshot */
 
 	/*
 	 * note: all ids in subxip[] are >= xmin, but we don't bother filtering
@@ -61,7 +62,6 @@ typedef struct SnapshotData
 	 * out any that are >= xmax
 	 */
 	CommandId	curcid;			/* in my xact, CID < curcid are visible */
 	uint32		active_count;	/* refcount on ActiveSnapshot stack */
 	uint32		regd_count;		/* refcount on RegisteredSnapshotList */
-	bool		copied;			/* false if it's a static snapshot */
 } SnapshotData;
Re: [HACKERS] Avoiding repeated snapshot computation
Pavan Deolasee writes:
> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas wrote:
>> Furthermore, it's hard to understand how this could be a net improvement
>> in the general case, because now both B and F are copying everything
>> twice (once to the shared area and once to backend-local memory) instead
>> of just once (to backend-local memory) and C and D are sleeping
>> (yikes!).

> Yes, B and F pay the price of a double copy. But I think it can be a net
> saving because C and D (and many more, hopefully) don't need to recompute
> the snapshot again by going over a potentially large ProcArray.

Like Robert, I'm finding it hard to believe that this isn't going to be a
net pessimization in as many or more cases as it'd be a win. Also, didn't
we just get rid of most of the issue of "going over a large ProcArray" with
the move of hot members to a separate array?

In the end, getting a snapshot is exactly about copying information out of
shared memory. Perhaps it would be more productive to think about how we
could further modify the ProcArray representation to reduce the impedance
mismatch between it and the snapshot representation, rather than
introducing a second shared-memory copy of the same information.

			regards, tom lane
Re: [HACKERS] Avoiding repeated snapshot computation
On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas wrote:
> On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee wrote:
>> What we can do is when a transaction comes to compute its snapshot, it
>> checks if some other transaction is already computing a snapshot for
>> itself. If so, it just sleeps on the lock. When the other process
>> finishes computing the snapshot, it saves the snapshot in a shared area
>> and wakes up all processes waiting for the snapshot. All those processes
>> then just copy the snapshot from the shared area and they are done. This
>> will not only reduce the total CPU consumption by avoiding repetitive
>> work, but would also reduce the total time for which ProcArrayLock is
>> held in SHARED mode by avoiding a pipeline of GetSnapshotData calls. I
>> am currently trying the shared work queue mechanism to implement this,
>> but I am sure we can do it in some other way too.
>
> I don't quite understand how this is going to work.

I will try, keeping it simple.

> Transaction A ends, invalidating the shared snapshot. Now transactions B,
> C, and D come along wishing to take snapshots. B begins taking a
> snapshot, so C and D wait (blech!)

Yeah, C and D wait. But that's better than concurrently doing the same
computation.

> for that to be complete. Then, they start copying the snapshot.

And they are holding the ProcArrayLock in shared mode.

> Meanwhile, transaction E ends, invalidating the shared snapshot,

E can't end until B, C, and D are done copying the snapshot.

> and then transaction F comes along, wanting to take a snapshot. If F
> constructs a new shared snapshot, then doesn't that overwrite the same
> memory area C and D were busy copying? It seems like F needs some way to
> know that C and D are done with the old shared snapshot before it starts
> writing a new one.

Right. And that can easily be achieved by holding a shared lock on the
ProcArray. The snapshot is invalidated while holding the lock exclusively
and set/copied while holding it in shared mode. I am assuming setting a
boolean (valid/invalid) can safely be done with a shared lock. But I may be
wrong.

> Furthermore, it's hard to understand how this could be a net improvement
> in the general case, because now both B and F are copying everything
> twice (once to the shared area and once to backend-local memory) instead
> of just once (to backend-local memory) and C and D are sleeping (yikes!).

Yes, B and F pay the price of a double copy. But I think it can be a net
saving because C and D (and many more, hopefully) don't need to recompute
the snapshot again by going over a potentially large ProcArray. So, as I
demonstrated in the illustration, the total time for which the ProcArray
lock is held will be significantly less with large numbers of clients.

> That could maybe be a gain at very high concurrency when spinlock
> contention is eating us alive, but it doesn't seem like a good idea in
> general. Maybe I'm missing something; did you mean to attach a patch?

I have a patch that I am attaching. It contains some unrelated changes
(don't know why; maybe I took a diff against some other branch), but you
will get the idea. This needs improvement, though, because it just checks
whether a valid shared snapshot is available and copies that; if not, it
goes ahead and computes one for itself. We need something more intelligent
to know that a snapshot computation is in progress and that we should wait
instead of building our own. This patch had shown a 15-20% improvement on
the HP box that we are using for the benchmarks.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

[Attachment: shared-snapshot-v2.patch]
Re: [HACKERS] Avoiding repeated snapshot computation
On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee wrote:
> What we can do is when a transaction comes to compute its snapshot, it
> checks if some other transaction is already computing a snapshot for
> itself. If so, it just sleeps on the lock. When the other process
> finishes computing the snapshot, it saves the snapshot in a shared area
> and wakes up all processes waiting for the snapshot. All those processes
> then just copy the snapshot from the shared area and they are done. This
> will not only reduce the total CPU consumption by avoiding repetitive
> work, but would also reduce the total time for which ProcArrayLock is
> held in SHARED mode by avoiding a pipeline of GetSnapshotData calls. I am
> currently trying the shared work queue mechanism to implement this, but I
> am sure we can do it in some other way too.

I don't quite understand how this is going to work. Transaction A ends,
invalidating the shared snapshot. Now transactions B, C, and D come along
wishing to take snapshots. B begins taking a snapshot, so C and D wait
(blech!) for that to be complete. Then they start copying the snapshot.
Meanwhile, transaction E ends, invalidating the shared snapshot, and then
transaction F comes along, wanting to take a snapshot. If F constructs a
new shared snapshot, then doesn't that overwrite the same memory area C and
D were busy copying? It seems like F needs some way to know that C and D
are done with the old shared snapshot before it starts writing a new one.
Furthermore, it's hard to understand how this could be a net improvement in
the general case, because now both B and F are copying everything twice
(once to the shared area and once to backend-local memory) instead of just
once (to backend-local memory), and C and D are sleeping (yikes!). That
could maybe be a gain at very high concurrency, when spinlock contention is
eating us alive, but it doesn't seem like a good idea in general. Maybe I'm
missing something; did you mean to attach a patch?

I had a different idea for avoiding repeated snapshot computation: save the
snapshot in backend-private memory. When a process takes ProcArrayLock in
exclusive mode, it increments a 64-bit counter. When a process takes
ProcArrayLock in shared mode, it can check whether the counter has changed;
if not, it can reuse the snapshot it took before. I think there might be a
way to tinker with the snapshot management so as to do this without any
more copying than we do presently, since CurrentSnapshotData and
SecondarySnapshotData are basically treated as scratch-pads anyhow.

Another idea would be to try to avoid taking ProcArrayLock at all. For
example, suppose we again have a 64-bit counter in shared memory. A
transaction wishing to do ProcArrayEndTransaction() takes ProcArrayLock in
exclusive mode, increments the counter, memory barrier, clears its XID,
memory barrier, increments the counter, releases ProcArrayLock. A
transaction wishing to take a snapshot first reads the counter. If the
value read from the counter is even, then we continue like this: memory
barrier, take snapshot, memory barrier, recheck counter. If we find that
the counter is unchanged, then nobody did ProcArrayEndTransaction() while
we were taking the snapshot, and we're good to go. If the counter changed,
then we take ProcArrayLock in shared mode and retake the snapshot. (We also
take ProcArrayLock in shared mode if the initial value we read was odd.)

This seems like it would all but eliminate contention on the spinlock
protecting ProcArrayLock, but I'm not sure how much overhead it would add
in other cases. In particular, if we have to retake the snapshot more than
very occasionally, it's not going to be good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
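For concreteness, here is a compact sketch of the counter-based protocol in Robert's second idea, written with C11 atomics rather than PostgreSQL's own primitives. All names are illustrative, clear_my_xid() and take_snapshot() are hypothetical stand-ins for the real work, and production code would need more carefully chosen barriers around the snapshot reads:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Seqlock-style sketch: the counter is even when no
 * ProcArrayEndTransaction() is in progress and odd while one is underway.
 * In a real implementation it would live in shared memory.
 */
static _Atomic uint64_t snapshot_counter;

extern void clear_my_xid(void);		/* hypothetical helper */
extern void take_snapshot(void);	/* hypothetical helper */

/* Writer side: run inside ProcArrayEndTransaction(), under the lock. */
void
proc_array_end_transaction(void)
{
	/* Counter becomes odd: an XID clear is in progress. */
	atomic_fetch_add_explicit(&snapshot_counter, 1, memory_order_release);
	clear_my_xid();
	/* Counter becomes even again: state is consistent. */
	atomic_fetch_add_explicit(&snapshot_counter, 1, memory_order_release);
}

/* Reader side: returns true if the optimistic (lock-free) snapshot held. */
bool
try_take_snapshot_lockfree(void)
{
	uint64_t	before = atomic_load_explicit(&snapshot_counter,
											  memory_order_acquire);

	if (before % 2 != 0)
		return false;			/* a writer is mid-update */

	take_snapshot();

	uint64_t	after = atomic_load_explicit(&snapshot_counter,
											 memory_order_acquire);

	/* If the counter moved, someone cleared an XID while we scanned. */
	return after == before;
}

If try_take_snapshot_lockfree() returns false, the caller falls back to taking ProcArrayLock in shared mode and retaking the snapshot, exactly as Robert describes.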