Re: [HACKERS] Load distributed checkpoint V4
Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

Thanks for making my patch clearly understandable!

> We might want to call GetCheckpointProgress something else, though. It
> doesn't return the amount of progress made, but rather the amount of
> progress we should've made up to that point or we're in danger of not
> completing the checkpoint in time.

GetCheckpointProgress might be a bad name; it returns the progress we should have made by that time, not the progress actually made. How about GetCheckpointTargetProgress?

> However, if we're already past checkpoint_write_percent at the beginning
> of the nap, I think we should clamp the nap time so that we don't run
> out of time until the next checkpoint because of sleeping.

Yeah, I'm thinking the nap time should be clamped to (100.0 - ckpt_progress_at_nap_start - checkpoint_sync_percent). I think exceeding checkpoint_write_percent is not so important here, so I only care about the end of the checkpoint.

> In the sync phase, we sleep between each fsync until enough
> time/segments have passed, assuming that the time to fsync is
> proportional to the file length. I'm not sure that's a very good
> assumption. We might have one huge file with only very little changed
> data, for example a logging table that is just occasionally appended to.
> If we begin by fsyncing that, it'll take a very short time to finish,
> and we'll then sleep for a long time. If we then have another large file
> to fsync, but that one has all pages dirty, we risk running out of time
> because of the unnecessarily long sleep. The segmentation of relations
> limits the risk of that, though, by limiting the max. file size, and I
> don't really have any better suggestions.

It is difficult to estimate fsync costs. We need additional statistics to do it. For example, if we recorded the number of write() calls for each segment, we might use that value as the number of dirty pages in the segment.
We don't have per-file write statistics now, but if we had that information, we could use it to control checkpoints more cleverly.

> Should we try doing something similar for the sync phase? If there's
> only 2 small files to fsync, there's no point sleeping for 5 minutes
> between them just to use up the checkpoint_sync_percent budget.

Hmmm... if we add a new parameter like kernel_write_throughput [kB/s] and clamp the maximum sleep to size-of-segment / kernel_write_throughput (*1), we can avoid unnecessary sleeping in the fsync phase. Do we want to have such a new parameter? I think we have plenty of GUC variables even now. I don't want to add new parameters any more if possible...

(*1) dirty-area-in-segment / kernel_write_throughput is better.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at
http://www.postgresql.org/about/donate
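The clamping arithmetic discussed above can be sketched roughly as follows. Note that the function and variable names here are made up for illustration only (the patch's actual code is structured differently); all values are percentages of the checkpoint interval.

```c
/* Hypothetical sketch of the nap-time clamp discussed above.
 * progress_at_nap_start: how far through the checkpoint interval we
 *   already are when the nap begins (0..100).
 * sync_percent: the budget reserved for the fsync phase at the end
 *   (corresponds to checkpoint_sync_percent).
 * requested_nap: how long we would like to nap, as a percentage of
 *   the interval.
 * Returns the nap actually allowed, so the sync phase still fits
 * before the next checkpoint is due. */
static double
clamp_nap_percent(double progress_at_nap_start, double sync_percent,
                  double requested_nap)
{
    /* Budget left before the next checkpoint must start, reserving
     * room for the sync phase. */
    double budget = 100.0 - progress_at_nap_start - sync_percent;

    if (budget < 0.0)
        budget = 0.0;           /* already out of time: don't nap at all */
    return (requested_nap < budget) ? requested_nap : budget;
}
```

With a 20% sync budget, a nap requested at the 50% mark is cut to 30%, and a nap requested past the 80% mark is skipped entirely.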
Re: [HACKERS] Dead Space Map version 3 (simplified)
Hiroki Kataoka <[EMAIL PROTECTED]> wrote:

> Heikki Linnakangas wrote:
> > But does it work for tables that have a small hot part that's updated
> > very frequently?
>
> I think there is no problem. Bloating will make pages including the
> unnecessary area which will not be accessed. Soon, those pages will be
> registered into DSM.

The usage_count of a page is always incremented when the page is accessed, even if it contains no useful tuples. Dead tuples in the page are not retrieved, but the page itself is still counted as accessed.

Of course, extra DSM synchronizations are not needed if we assume that we need to vacuum tables less frequently than checkpoints. However, in some cases we need to vacuum within 1min while checkpoint_timeout is set to 15-60min.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Re: [HACKERS] Dead Space Map version 3 (simplified)
Heikki Linnakangas <[EMAIL PROTECTED]> wrote:

> We discussed it a long time ago already, but I really wished the DSM
> wouldn't need a fixed size shared memory area. It's one more thing the
> DBA needs to tune manually. It also means we need to have an algorithm
> for deciding what to keep in the DSM and what to leave out.

I'm planning some kind of flexible memory management for 8.4, but I can't finish it by the deadline of 8.3. I think we need to brush up memory management in many places; it is required not only by DSM but also by FSM and the modules using SLRU.

> And I don't see a good way to extend the current approach to implement
> the index-only-scans that we've been talking about, and the same goes
> for recovery. :(

Yes, we need a strictly accurate DSM to achieve index-only scans. I went for that at the beginning, but index-only scans won't be done in 8.3. That's why I gave it up -- an accurate DSM merely introduces overhead for now.

> The way you update the DSM is quite interesting. When a page is dirtied,
> the BM_DSM_DIRTY flag is set in the buffer descriptor. The corresponding
> bit in the DSM is set lazily in FlushBuffer whenever BM_DSM_DIRTY is
> set. That's a clever way to avoid contention on updates. But does it
> work for tables that have a small hot part that's updated very
> frequently?

Hm, I provided the min_dsm_target parameter, which specifies the minimum size of tables whose dead space is tracked. But it is useless in such cases. I intended it for a small hot *table*, but I forgot about a small hot *part* of a large table.

> A straightforward fix would be to scan the buffer cache for buffers
> marked with BM_DSM_DIRTY to update the DSM before starting the vacuum
> scan.

It requires a sequential search of the buffer pool, but it would pay off.

> It might not be a problem in practice, but it bothers me that the DSM
> isn't 100% accurate.
> You end up having a page with dead tuples on it marked as non-dirty in
> the DSM at least when a page is vacuumed but there's some RECENTLY_DEAD
> tuples on it that become dead later on. There might be other scenarios
> as well.

Quite so. RECENTLY_DEAD tuples should be considered as dead tuples in the DSM. We might apply the same to DELETE_IN_PROGRESS tuples if we assume commits occur more frequently than rollbacks.

Additionally, when the HEAP_XMIN_INVALID or HEAP_XMAX_COMMITTED flags are added, we might need to re-add BM_DSM_DIRTY for the page. That occurs when pages with inserted tuples are written before the transaction is done, and the transaction then rolls back.

> If I'm reading the code correctly, DSM makes no attempt to keep the
> chunks ordered by block number. If that's the case, vacuum needs to be
> modified because it currently relies on the fact that blocks are scanned
> and the dead tuple list is therefore populated in order.

Vacuum still scans heaps in block order and picks up the corresponding DSM chunks. Therefore the order of DSM chunks is not important. This method is not efficient for huge tables with small dead spaces, but I don't think it becomes a serious issue.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
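The "scan the buffer cache before vacuum" fix suggested above could look roughly like this sketch. The types and names here are simplified stand-ins, not the real buffer-manager structures (the real BufferDesc, locking, and DSM layout are quite different); the point is only the control flow of folding buffer-level dirty flags into the map before the vacuum scan starts.

```c
#define BM_DSM_DIRTY  0x01      /* stand-in for the real flag bit */
#define NBLOCKS       8

typedef struct BufDesc
{
    int flags;                  /* buffer-header flags */
    int blockno;                /* which heap block this buffer caches */
} BufDesc;

/* One entry per heap block; simplified to a byte instead of a bitmap. */
static unsigned char dsm_bits[NBLOCKS];

/* Walk the whole buffer pool once and absorb any pending BM_DSM_DIRTY
 * flags into the DSM, so the map is as up to date as possible before
 * the vacuum scan begins. */
static void
dsm_absorb_buffer_flags(BufDesc *buffers, int nbuffers)
{
    for (int i = 0; i < nbuffers; i++)
    {
        if (buffers[i].flags & BM_DSM_DIRTY)
        {
            dsm_bits[buffers[i].blockno] = 1;   /* block may contain dead space */
            buffers[i].flags &= ~BM_DSM_DIRTY;  /* flag has been absorbed */
        }
    }
}
```

This is the sequential search of the buffer pool mentioned in the reply: one pass over shared buffers per vacuum, which is cheap relative to the heap scan it makes more accurate.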
Re: [HACKERS] [BUGS] BUG #3244: problem with PREPARE
"William Lawrance" <[EMAIL PROTECTED]> writes:

> This program that does "PQprepare" and then "PQexecPrepared" has worked
> previously, but doesn't work now.
> ...
> strcpy(openStmt, "declare C1 cursor for select cola"
>        " from tprep"
>        " where cola = $1");
> res = PQprepare(conn, "stmtopen", openStmt, 0, 0);

I looked into this a bit and found that the issue comes from my recent changes in support of plan caching. To simplify matters, I instituted a rule that utility statements don't have any interesting transformations done at parse analysis time; see this new comment in analyze.c:

 * For optimizable statements, we are careful to obtain a suitable lock on
 * each referenced table, and other modules of the backend preserve or
 * re-obtain these locks before depending on the results. It is therefore
 * okay to do significant semantic analysis of these statements. For
 * utility commands, no locks are obtained here (and if they were, we could
 * not be sure we'd still have them at execution). Hence the general rule
 * for utility commands is to just dump them into a Query node untransformed.
 * parse_analyze does do some purely syntactic transformations on CREATE TABLE
 * and ALTER TABLE, but that's about it. In cases where this module contains
 * mechanisms that are useful for utility statements, we provide separate
 * subroutines that should be called at the beginning of utility execution;
 * an example is analyzeIndexStmt.

This means that "preparing" a DECLARE CURSOR is now effectively a no-op; it doesn't do much more than detect basic syntax errors that the Bison grammar can catch. If you run this program without having created the tprep table, the PQprepare doesn't fail! But the bigger problem, at least for Bill's complaint, is that we also don't notice, let alone assign datatypes to, any parameter symbols appearing in the query.
I don't see any particular problem in this for the other command types that had their analyze-time processing removed; there's no value in a parameter in CREATE VIEW, for example. But evidently there's some interest in having parameters in prepared DECLARE CURSOR commands.

The easiest answer I can think of at the moment is to run parse analysis for a DECLARE CURSOR and then throw away the result. To avoid this overhead in cases where it's useless, we could probably teach analyze.c to do it only if p_variableparams is true (which essentially would mean that the DECLARE CURSOR came in via PQprepare or equivalent, and not as a simply executable statement).

Plan B would be to promote DECLARE CURSOR to an "optimizable statement" that is treated under the same rules as SELECT/UPDATE/etc, in particular that we assume locks obtained at analysis are held through to execution. This might be a cleaner answer overall, but I have no idea right now about the effort required or any possible downsides.

Comments, better ideas?

regards, tom lane
Re: [PATCHES] [HACKERS] Full page writes improvement, code update
Hi,

I don't insist on the name and the default of the GUC parameter. I'm afraid wal_fullpage_optimization = on (default) may cause some confusion because the default behavior of WAL itself becomes a bit different. I'd like to have some more opinions on this.

Zeugswetter Andreas ADI SD wrote:

>> With DBT-2 benchmark, I've already compared the amount of WAL. The
>> result was as follows:
>>
>> Amount of WAL after 60min. run of DBT-2 benchmark
>> wal_add_optimization_info = off (default): 3.13GB

> how about wal_fullpage_optimization = on (default)

>> wal_add_optimization_info = on (new case): 3.17GB
>>   -> can be optimized to 0.31GB by pg_compresslog
>>
>> So the difference will be around a couple of percent. I think this is
>> a very good figure. For information:
>> DB Size: 12.35GB (120WH)
>> Checkpoint timeout: 60min.
>> Checkpoint occurred only once in the run.

> Unfortunately I think DBT-2 is not a good benchmark to test the
> disabled wal optimization. The test should contain some larger rows
> (maybe some updates on large toasted values), and maybe more frequent
> checkpoints. Actually the poor ratio between full pages and normal WAL
> content in this benchmark is strange to begin with. Tom fixed a bug
> recently, and it would be nice to see the new ratio.
>
> Have you read Tom's comment on not really having to be able to
> reconstruct all record types from the full page image? I think that
> sounded very promising (e.g. start out with only heap insert/update).
> Then:
> - we would not need the wal optimization switch (the full page flag
>   would always be added depending only on backup)
> - pg_compresslog would only remove such "full page" images where it
>   knows how to reconstruct a "normal" WAL record from
> - with time and effort pg_compresslog would be able to compress
>   [nearly] all record types' full images (no change in backend)

>> I don't think replacing LSN works fine. For full recovery to the
>> current time, we need both the archive log and WAL. Replacing LSN will
>> make the archive log's LSN inconsistent with WAL's LSN and the
>> recovery will not work.
> WAL recovery would have had to be modified (decouple LSN from WAL
> position during recovery). An "archive log" would have been a valid WAL
> (with appropriate LSN advance records).

>> Reconstruction to regular WAL is proposed as pg_decompresslog. We
>> should be careful enough not to make redo routines confused by the
>> dummy full page writes, as Simon suggested. So far, it works fine.

> Yes, Tom didn't like "LSN replacing" either. I withdraw my concern
> regarding pg_decompresslog. Your work in this area is extremely
> valuable and I hope my comments are not discouraging.

Thank you Andreas

--
Koichi Suzuki
Re: [HACKERS] Postgres SQL Parser
Uma Krishnan wrote:

> Hello I'm trying to implement a new SortMerge algorithm, and see how it
> compares with the current algorithm. In doing so, rather than using
> select, I'd like to introduce a new verb so that Select is not modified.
>
> I looked at the source code, but couldn't find the place where the
> Postgres SQL grammar is defined.

src/backend/parser/gram.y

If you intend this work to be incorporated into PostgreSQL you should discuss it first on this list, before you cut a single line of code - the bar for new verbs is likely to be set fairly high.

cheers

andrew
Re: [HACKERS] Postgres SQL Parser
Uma Krishnan wrote:

> Hello I'm trying to implement a new SortMerge algorithm, and see how it
> compares with the current algorithm. In doing so, rather than using
> select, I'd like to introduce a new verb so that Select is not modified.
>
> I looked at the source code, but couldn't find the place where the
> Postgres SQL grammar is defined.

src/backend/parser/gram.y

I recommend you read the developer's FAQ.

--
Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
[HACKERS] Postgres SQL Parser
Hello, I'm trying to implement a new SortMerge algorithm and see how it compares with the current algorithm. In doing so, rather than using select, I'd like to introduce a new verb so that Select is not modified.

I looked at the source code, but couldn't find the place where the Postgres SQL grammar is defined. Please help.

Thanks
Re: [HACKERS] [Fwd: PGBuildfarm member narwhal Branch HEAD Status changed from OK to InstallCheck failure]
Dave Page <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> I was afraid of that. Well, at least get a dump of page 104 in that
>> index so we can see what's on-disk.

> Sure - I'll have to try with 8.1/8.2 unless you have a pg_filedump
> that'll work with -HEAD?

No, I don't, but a plain hex/ascii dump is probably the best thing anyway, since we know the page header is wrong. So use any old version of pg_filedump with -d switch.

regards, tom lane
Re: [HACKERS] [Fwd: PGBuildfarm member narwhal Branch HEAD Status changed from OK to InstallCheck failure]
Tom Lane wrote:
> Dave Page <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> If you want to poke at it, I'd suggest changing the ERROR to PANIC
>>> (it's in bufmgr.c) to cause a core dump, run installchecks till you
>>> get a panic, and then look around in the dump to see what you can
>>> find. It'd be particularly interesting to see what the buffer
>>> actually contains. Also you could look at the corresponding page of
>>> the disk file (which in theory should be the same as the buffer
>>> contents, since this error check is only made just after a read() ...)
>
>> Hmm, I'll give it a go when I'm back in the office, but bear in mind
>> this is a Mingw build on which debugging is nigh-on impossible.
>
> I was afraid of that. Well, at least get a dump of page 104 in that
> index so we can see what's on-disk.

Sure - I'll have to try with 8.1/8.2 unless you have a pg_filedump that'll work with -HEAD?

/D
[HACKERS] Please add EXISTS optimization to TODO list
One of the few situations where I experience poor performance under PostgreSQL, compared to other, commercial databases, is when an EXISTS predicate is used. Actually, these often do perform quite well, but there are some situations where there are optimizations available which other products detect and use, which PostgreSQL misses.

Looking back at the mailing list archives, it appears that optimizations similar to what other products use for both IN and EXISTS were added to the IN predicate circa 2003:

http://archives.postgresql.org/pgsql-hackers/2003-08/msg00438.php
-Make IN/NOT IN have similar performance to EXISTS/NOT EXISTS (Tom)

There are many situations where the EXISTS predicate and an IN predicate with a table subquery produce identical results. After reviewing the SQL standard and thinking about it a bit, I think that the equivalence holds unless:

(1) the equivalent IN predicate is capable of producing a result of UNKNOWN, and

(2) the predicate is used in a context where the difference between UNKNOWN and FALSE is significant.

One additional point: Martijn van Oosterhout pointed out in an earlier email that if the subquery contains any non-immutable functions there could be a difference, although he described that issue as "minor".

The most common case for UNKNOWN logical values is when comparisons involve a NULL on either or both sides of an operator. In some cases, such as the one I posted about a month ago, it is quickly clear to a human reader that none of the above conditions exist -- the columns are all NOT NULL (so the result of the comparisons can never be UNKNOWN), they are used in a context where UNKNOWN and FALSE produce the same results, and no non-immutable functions are invoked.
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01408.php

The plan for the EXISTS predicate (running in 8.2.3) has a cost of 72736.37, while the logically equivalent query using IN has a cost of 36.38 (and these do approximate reality), so if the faster option was visible to the planner, it would be chosen.

Some would argue that I should just change the query to use IN. There are three problems with that.

(1) It requires the use of a multi-value row value constructor, which is a construct not supported by all databases we currently use. (We have a heterogeneous environment, where the same queries must run on multiple platforms.) We have a somewhat ugly but valid query to use as a workaround which runs on all of our databases; it has a PostgreSQL cost of 130.98, which is tolerable, if not optimal.

(2) I have seen a number of cases where the logically equivalent EXISTS predicate performs better than the IN predicate. The failure to recognize equivalence and to cost both approaches risks suboptimal performance. Avoiding that requires careful testing of both forms to coerce the planner into choosing the best plan.

(3) I have been trying to move our application programmers away from a focus on how they want to navigate through the data, toward declaring what they want as the result. (Some programmers routinely use cursors to navigate each table, row by row, based on what they think is the best plan, stuffing the data into a temporary table as they go, then selecting from the temporary table, when a single SELECT statement with a few subqueries will produce the desired data.) The current situation with these predicates diverts the focus from "what to show" back to "how to get it".

I hate to see any queries run slower on PostgreSQL than on other databases, so I'm suggesting we address this. We are talking about an optimization that I've seen in some other products for at least 15 years.
-Kevin
Re: [HACKERS] contrib/uuid-ossp: immutable vs. volatile
Marko Kreen wrote:
> The UUID generation functions are tagged IMMUTABLE, shouldn't they be
> VOLATILE?

Some of them should be. Let me recheck that.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
[HACKERS] contrib/uuid-ossp: immutable vs. volatile
The UUID generation functions are tagged IMMUTABLE, shouldn't they be VOLATILE?

--
marko
[HACKERS] Re: [BUGS] BUG #3242: FATAL: could not unlock semaphore: error code 298
Marcin Waldowski wrote:
> Magnus Hagander wrote:
>> Tom Lane wrote:
>>> Magnus Hagander <[EMAIL PROTECTED]> writes:
>>>> No, it's definitely the right primitive. But we're creating it with
>>>> a max count of 1.
>>> That's definitely wrong. There are at least three reasons for a PG
>>> process's semaphore to be signaled (heavyweight lock release, LWLock
>>> release, pin count waiter), and at least two of them can occur
>>> concurrently (eg, if deadlock checker fires, it will need to take
>>> LWLocks, but there's nothing saying that the original lock won't be
>>> released while it waits for an LWLock).
>>>
>>> The effective max count on Unixen is typically in the thousands,
>>> and I'd suggest the same on Windows unless there's some efficiency
>>> reason to keep it small (in which case, maybe ten would do).
>>
>> AFAIK there's no problem with huge numbers (it takes an int32, and the
>> documentation says nothing about a limit - I'm sure it's just a 32-bit
>> counter in the kernel). I'll give that a shot.
>
> Magnus, Tom, thank you for finding what causes the problem :) I hope
> that was also the reason why the other transactions were hung (because
> that is the more important issue, I think).
>
>> Marcin - can you test a source patch? Or should I try to build you a
>> binary for testing? It'd be good if you can confirm that it works
>> before we commit anything, I think.
>
> Of course I will check the fix :) I will be able to do tests on Monday.
> I think a source patch should be enough, although I've never built
> PostgreSQL on Windows (I definitely should try). If I have problems
> then I will ask you for a binary.

Great, please try the attached trivial patch.
//Magnus

Index: src/backend/port/win32_sema.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/port/win32_sema.c,v
retrieving revision 1.4
diff -c -r1.4 win32_sema.c
*** src/backend/port/win32_sema.c	5 Jan 2007 22:19:35 -	1.4
--- src/backend/port/win32_sema.c	22 Apr 2007 18:19:13 -
***************
*** 82,88 ****
  	sec_attrs.bInheritHandle = TRUE;
  
  	/* We don't need a named semaphore */
! 	cur_handle = CreateSemaphore(&sec_attrs, 1, 1, NULL);
  	if (cur_handle)
  	{
  		/* Successfully done */
--- 82,88 ----
  	sec_attrs.bInheritHandle = TRUE;
  
  	/* We don't need a named semaphore */
! 	cur_handle = CreateSemaphore(&sec_attrs, 1, 32767, NULL);
  	if (cur_handle)
  	{
  		/* Successfully done */
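The patch above only raises the CreateSemaphore maximum count from 1 to 32767. The reason a maximum above 1 matters is the scenario Tom describes: two wakeup reasons can be signaled before the process gets around to waiting, and with a max count of 1 the second signal is silently lost. A small POSIX illustration of the counting behavior being restored (sem_* here, not the Windows API used in the patch):

```c
#include <semaphore.h>

/* POSIX unnamed semaphores have no tiny maximum count, so two signals
 * posted before anyone waits are both remembered -- the behavior the
 * patch above restores on Windows by raising the CreateSemaphore max. */
static int
signals_remembered(void)
{
    sem_t sem;
    int value;

    sem_init(&sem, 0, 0);   /* not shared between processes, count 0 */
    sem_post(&sem);         /* e.g. heavyweight lock release */
    sem_post(&sem);         /* e.g. concurrent LWLock release */
    sem_getvalue(&sem, &value);
    sem_destroy(&sem);
    return value;           /* both posts are retained in the count */
}
```

With a maximum count of 1, the second post would have been dropped, and a process waiting for two distinct events would block forever after consuming the first.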
Re: [HACKERS] [Fwd: PGBuildfarm member narwhal Branch HEAD Status changed from OK to InstallCheck failure]
Dave Page <[EMAIL PROTECTED]> writes:
> Tom Lane wrote:
>> If you want to poke at it, I'd suggest changing the ERROR to PANIC
>> (it's in bufmgr.c) to cause a core dump, run installchecks till you
>> get a panic, and then look around in the dump to see what you can find.
>> It'd be particularly interesting to see what the buffer actually
>> contains. Also you could look at the corresponding page of the disk
>> file (which in theory should be the same as the buffer contents,
>> since this error check is only made just after a read() ...)

> Hmm, I'll give it a go when I'm back in the office, but bear in mind
> this is a Mingw build on which debugging is nigh-on impossible.

I was afraid of that. Well, at least get a dump of page 104 in that index so we can see what's on-disk.

regards, tom lane
Re: [HACKERS] [Fwd: PGBuildfarm member narwhal Branch HEAD Status changed from OK to InstallCheck failure]
Tom Lane wrote:
> Dave Page <[EMAIL PROTECTED]> writes:
>> I've been seeing this failure intermittently on Narwhal HEAD, and once
>> on 8.1. Other branches have been OK, as have other animals running on
>> the same physical box. Narwhal-HEAD is run more often than any other
>> builds however.
>
>> Anyone have any idea what might be wrong? It seems unlikely to be a
>> hardware issue given that it's the exact same test failures each time.
>
> Yeah, I'd been wondering about that too, but have no clue what's up.
> It seems particularly odd that all the failures are in installcheck
> not check.
>
> If you want to poke at it, I'd suggest changing the ERROR to PANIC
> (it's in bufmgr.c) to cause a core dump, run installchecks till you
> get a panic, and then look around in the dump to see what you can find.
> It'd be particularly interesting to see what the buffer actually
> contains. Also you could look at the corresponding page of the disk
> file (which in theory should be the same as the buffer contents,
> since this error check is only made just after a read() ...)

Hmm, I'll give it a go when I'm back in the office, but bear in mind this is a Mingw build on which debugging is nigh-on impossible.

Regards, Dave.
pgsql-hackers@postgresql.org
rancpine cui wrote:
> Hello
> I am confused about the ideas of "session"&"backend"&"process" when I
> read the lmgr part of source code. What's the difference among them?

They are all the same.

--
Bruce Momjian <[EMAIL PROTECTED]> http://momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
pgsql-hackers@postgresql.org
Hello,
I am confused about the concepts of "session", "backend", and "process" when reading the lmgr part of the source code. What's the difference among them?

Thanks,
Ranc.