Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
Josh Berkus wrote:
> Jim,
>
>> I agree about splitting the utilities, except that I think the database
>> should be able to generate UUIDs somehow.
>
> There is a GUID add-in, and someone is working on a 2nd one.  UUIDs are
> not part of the SQL standard, and we've only seen sporadic demand for
> them (and different types each time) so I can't imagine one making it
> further than contrib real soon.
>
> Also, one could argue that UUIDs are a foot gun, so they're not exactly
> the type of thing we want to advocate in advance of demand.

Martijn van Oosterhout wrote:
> It seems to me that maybe the backend should include a 16-byte fixed
> length object (after all, we've got 1, 2, 4 and 8 bytes already) and
> then people can use that to build whatever they like, using domains,
> for example...

So how about the split? I.e. just add a 16-byte data type and forget all
about UUIDs for now.

Regards,
Thomas Hallgren

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?
       http://www.postgresql.org/docs/faq
Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
Jim,

>> I agree about splitting the utilities, except that I think the database
>> should be able to generate UUIDs somehow.

There is a GUID add-in, and someone is working on a 2nd one.  UUIDs are
not part of the SQL standard, and we've only seen sporadic demand for
them (and different types each time) so I can't imagine one making it
further than contrib real soon.

Also, one could argue that UUIDs are a foot gun, so they're not exactly
the type of thing we want to advocate in advance of demand.

--Josh Berkus
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> I'll get back to you with kernel build information tomorrow.  We'll also
> try to talk to some kernel hackers about this.

Some googling turned up recent discussions about race conditions in Linux
NFS code:

http://threebit.net/mail-archive/linux-kernel/msg05313.html
http://lkml.org/lkml/2006/3/1/381

I don't know the kernel nearly well enough to guess if these are related ...

			regards, tom lane
Re: Longer startup delay (was Re: [HACKERS] Single Index Tuple Chain (SITC) method)
Alvaro Herrera <[EMAIL PROTECTED]> writes:
> Speaking of which, I think I've noticed a longer delay in server start
> after initdb.  I haven't measured nor profiled it, but I think it may be
> because of the heap_inplace_update xlogging that we weren't doing
> previously.

Can't say that I've noticed anything like that.  I have gotten a nagging
sense lately that there are strange delays happening at random times
during the regression tests.  Like you I'm not quite sure about this, but
I run "make installcheck" often enough to be fairly attuned to it, and it
just seems like it's sometimes hanging up for a couple of seconds for no
apparent reason.  OTOH we've stuck some new, slow queries into the tests
recently and maybe I just haven't acclimated to that.

			regards, tom lane
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 21:59 -0400, Tom Lane wrote:
> [ back to the start of the thread... ]
>
> BTW, a couple of thoughts here:
>
> * If my theory about the low-level cause is correct, then reindexing
> sl_log_1 would make the "duplicate key" errors go away, but nonetheless
> you'd have lost data --- the overwritten rows would be gone.  I suppose
> that this would result in the slave missing some rows that are present
> on the master.  Have you tried comparing slave and master databases to
> see if you can find any discrepancies?

Haven't done that yet - in test we tend to restart the old subscriber as
the new provider and rebuild the cluster.  I'll check the logs from our
production failure to figure out what to compare and see what I can
discover.

> * One way that the problem could happen would be if a race condition in
> the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
> the true end-of-file (as determined by another process' write()
> extending the EOF just slightly earlier --- ie, lseek fails to note the
> effects of the just-completed write, and returns the prior EOF value).
> PG does have internal locking that should guarantee that the lseek is
> not done until after the write completes ... but could there be a bug in
> the kernel allowing stale data to be returned?  The SMP hardware is
> relevant (maybe one processor sees different data than the other) and
> frankly I don't trust NFS very far at all for questions such as this.
> It'd be interesting to see if you can reproduce the problem in a
> database on local storage.

Unfortunately we haven't got any local storage that can stand the sort of
loads we are putting through.  With slower storage the CPUs mostly sit
idle and we are very unlikely to trigger a timing-based bug if that's
what it is.

I'll get back to you with kernel build information tomorrow.  We'll also
try to talk to some kernel hackers about this.

Many thanks for your efforts so far.
--
Marc
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> By dike out, you mean remove?  Please confirm and I'll try it.

Right, just remove (or comment out) the lines I quoted.

> We ran this system happily for nearly a year on the
> previous kernel without experiencing this problem (tcp lockups are a
> different matter).

I'm starting to think that a kernel bug is a realistic explanation, see
other message.  Whose kernel build is this exactly?

			regards, tom lane
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 21:47 -0400, Tom Lane wrote:
> One easy thing that would be worth trying is to build with
> --enable-cassert and see if any Asserts get provoked during the
> failure case.  I don't have a lot of hope for that, but it's
> something that would require only machine time not people time.

I'll try this tomorrow.

> A couple other things to try, given that you can provoke the failure
> fairly easily:
>
> 1. In studying the code, it bothers me a bit that P_NEW is the same as
> InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
> interlocked, but it's fairly easy to see how something like this could
> happen if there are any places where InvalidBlockNumber is
> unintentionally passed to ReadBuffer --- that would look like a P_NEW
> call and it *wouldn't* be interlocked.  So it would be worth changing
> P_NEW to "(-2)" (this should just take a change in bufmgr.h and
> recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
> at the head of ReadBufferInternal().  Then rebuild with asserts enabled
> and see if the failure case provokes that assert.

I'll try this too.

> 2. I'm also eyeing this bit of code in hio.c:
>
>     /*
>      * If the FSM knows nothing of the rel, try the last page before
>      * we give up and extend.  This avoids one-tuple-per-page syndrome
>      * during bootstrapping or in a recently-started system.
>      */
>     if (targetBlock == InvalidBlockNumber)
>     {
>         BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
>
>         if (nblocks > 0)
>             targetBlock = nblocks - 1;
>     }
>
> If someone else has just extended the relation, it's possible that this
> will allow a process to get to the page before the intended extender has
> finished initializing it.  AFAICT that's not harmful because the page
> will look like it has no free space ... but it seems a bit fragile.
> If you dike out the above-mentioned code, can you still provoke the
> failure?

By dike out, you mean remove?  Please confirm and I'll try it.

> A different line of attack is to see if you can make a self-contained
> test case so other people can try to reproduce it.  More eyeballs on the
> problem are always better.

Can't really see this being possible.  This is clearly a very unusual
problem and without similar hardware I doubt that anyone else will
trigger it.  We ran this system happily for nearly a year on the previous
kernel without experiencing this problem (tcp lockups are a different
matter).  Also the load is provided by a bunch of servers and robots
simulating rising and falling load.

> Lastly, it might be interesting to look at the WAL logs for the period
> leading up to a failure.  This would give us an idea of what was
> happening concurrently with the processes that seem directly involved.

Next time we reproduce it, I'll take a copy of the WAL files too.

--
Marc
Re: [HACKERS] Index corruption
[ back to the start of the thread... ]

Marc Munro <[EMAIL PROTECTED]> writes:
> We have now experienced index corruption on two separate but identical
> slony clusters.  In each case the slony subscriber failed after
> attempting to insert a duplicate record.  In each case reindexing the
> sl_log_1 table on the provider fixed the problem.

> Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875
> Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux
> ...
> NetApp FAS270 OnTap 7.0.3
> Mounted with the NFS options
> rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac
> Jumbo frames 8192 MTU.
> All postgres data and logs are stored on the netapp.

BTW, a couple of thoughts here:

* If my theory about the low-level cause is correct, then reindexing
sl_log_1 would make the "duplicate key" errors go away, but nonetheless
you'd have lost data --- the overwritten rows would be gone.  I suppose
that this would result in the slave missing some rows that are present
on the master.  Have you tried comparing slave and master databases to
see if you can find any discrepancies?

* One way that the problem could happen would be if a race condition in
the kernel allowed an lseek(fd, 0, SEEK_END) to return a value less than
the true end-of-file (as determined by another process' write()
extending the EOF just slightly earlier --- ie, lseek fails to note the
effects of the just-completed write, and returns the prior EOF value).
PG does have internal locking that should guarantee that the lseek is
not done until after the write completes ... but could there be a bug in
the kernel allowing stale data to be returned?  The SMP hardware is
relevant (maybe one processor sees different data than the other) and
frankly I don't trust NFS very far at all for questions such as this.
It'd be interesting to see if you can reproduce the problem in a
database on local storage.

			regards, tom lane
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> If there's anything we can do to help debug this we will.  We can apply
> patches, different build options, etc.

One easy thing that would be worth trying is to build with
--enable-cassert and see if any Asserts get provoked during the failure
case.  I don't have a lot of hope for that, but it's something that would
require only machine time not people time.

A couple other things to try, given that you can provoke the failure
fairly easily:

1. In studying the code, it bothers me a bit that P_NEW is the same as
InvalidBlockNumber.  The intended uses of P_NEW appear to be adequately
interlocked, but it's fairly easy to see how something like this could
happen if there are any places where InvalidBlockNumber is
unintentionally passed to ReadBuffer --- that would look like a P_NEW
call and it *wouldn't* be interlocked.  So it would be worth changing
P_NEW to "(-2)" (this should just take a change in bufmgr.h and
recompile) and adding an "Assert(blockNum != InvalidBlockNumber)"
at the head of ReadBufferInternal().  Then rebuild with asserts enabled
and see if the failure case provokes that assert.

2. I'm also eyeing this bit of code in hio.c:

    /*
     * If the FSM knows nothing of the rel, try the last page before
     * we give up and extend.  This avoids one-tuple-per-page syndrome
     * during bootstrapping or in a recently-started system.
     */
    if (targetBlock == InvalidBlockNumber)
    {
        BlockNumber nblocks = RelationGetNumberOfBlocks(relation);

        if (nblocks > 0)
            targetBlock = nblocks - 1;
    }

If someone else has just extended the relation, it's possible that this
will allow a process to get to the page before the intended extender has
finished initializing it.  AFAICT that's not harmful because the page
will look like it has no free space ... but it seems a bit fragile.
If you dike out the above-mentioned code, can you still provoke the
failure?

A different line of attack is to see if you can make a self-contained
test case so other people can try to reproduce it.  More eyeballs on the
problem are always better.

Lastly, it might be interesting to look at the WAL logs for the period
leading up to a failure.  This would give us an idea of what was
happening concurrently with the processes that seem directly involved.

			regards, tom lane
Re: [HACKERS] index corruption
On Thu, 2006-06-29 at 19:59 -0400, Tom Lane wrote:
> Ummm ... you did restart the server?  "select version();" would be
> the definitive test.

Can't say I blame you for the skepticism but I have confirmed it again.

test=# select version();
                                 version
--------------------------------------------------------------------------
 PostgreSQL 8.0.8 on x86_64-redhat-linux-gnu, compiled by GCC gcc (GCC)
 3.4.6 20060404 (Red Hat 3.4.6-3)
(1 row)

--
Marc
[HACKERS] [Re: Index corruption]
Ooops: forgot to cc this to the list.

On Thu, 2006-06-29 at 19:27 -0400, Tom Lane wrote:
> Are you *certain* this slave isn't running 8.0.2 or older?  If you can
> verify that, then I guess we need to look for another mechanism that
> could cause the same kind of thing.

Certain.  We built new rpms for 8.0.8 and installed them on both database
servers (and I just checked them again to make sure).

If there's anything we can do to help debug this we will.  We can apply
patches, different build options, etc.

--
Marc
Re: [HACKERS] Index corruption
I wrote:
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Further study makes this look more probable.  It seems that most of the
activity around the trouble spot consists of transactions inserting
exactly 13 rows into the table; for instance here are traces of two
successive transactions:

Block   Item    XMIN            CMIN
27800   7       600921860       3
27800   8       600921860       7
27800   9       600921860       11
27800   10      600921860       15
27800   11      600921860       19
27800   12      600921860       23
27800   13      600921860       27
27800   14      600921860       31
27800   15      600921860       35
27800   16      600921860       39
27800   17      600921860       43
27800   18      600921860       47
27800   19      600921860       51
27800   20      600921870       3
27800   21      600921870       7
27800   22      600921870       11
27800   23      600921870       15
27800   24      600921870       19
27800   25      600921870       23
27800   26      600921870       27
27800   27      600921870       31
27800   28      600921870       35
27800   29      600921870       39
27800   30      600921870       43
27800   31      600921870       47
27800   32      600921870       51

The pattern of CMIN values is the same for all these transactions.  But
look at the rows inserted by 600921856 and 600921858, the two
transactions that seem to be involved with the problem on page 27806:

Block   Item    XMIN            CMIN
27787   50      600921856       3
27795   41      600921858       3
27795   42      600921858       7
27795   43      600921858       11
27795   44      600921858       15
27795   45      600921858       19
27795   46      600921858       23
27795   47      600921858       27
27795   48      600921858       31
27795   49      600921858       35
27795   50      600921858       39
27795   51      600921858       43
27806   1       600921858       47
27806   2       600921856       15
27806   3       600921856       19
27806   4       600921856       23
27806   5       600921856       27
27806   6       600921856       31
27806   7       600921856       35
27806   8       600921856       39
27806   9       600921856       43
27806   10      600921856       47
27806   11      600921856       51
27806   12      600921858       51

(27787,50) and (27795,51) are both the last rows on their pages.

What's evidently happened is that the two transactions filled those pages
and then both seized on 27806 as the next page to insert into.  I think
that 600921856 tried to insert its CMIN 7 and 11 rows as items 1 and 2 on
that page, and then something wiped the page, then 600921858 inserted its
CMIN 47 row as item 1, and then 600921856 got control back and finished
inserting its rows.

Further study of the indexes shows that there are two entries in each
index pointing to each of (27806,1) and (27806,2) --- but since the xxid
values are different for the two (27806,1) entries, those didn't show up
as duplicates in my first look.

Given the apparent connection to vacuuming, this is looking a WHOLE lot
like this bug fixed in 8.0.3:

2005-05-07 17:32  tgl

	* src/backend/: access/heap/hio.c, commands/vacuumlazy.c
	(REL7_3_STABLE), access/heap/hio.c, access/nbtree/nbtpage.c,
	access/nbtree/nbtree.c, commands/vacuumlazy.c (REL7_4_STABLE),
	access/heap/hio.c, commands/vacuumlazy.c (REL7_2_STABLE),
	access/heap/hio.c, access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c (REL8_0_STABLE), access/heap/hio.c,
	access/nbtree/nbtpage.c, access/nbtree/nbtree.c,
	commands/vacuumlazy.c: Repair very-low-probability race condition
	between relation extension and VACUUM: in the interval between
	adding a new page to the relation and formatting it, it was
	possible for VACUUM to come along and decide it should format the
	page too.  Though not harmful in itself, this would cause data loss
	if a third transaction were able to insert tuples into the vacuumed
	page before the original extender got control back.

Are you *certain* this slave isn't running 8.0.2 or older?  If you can
verify that, then I guess we need to look for another mechanism that
could cause the same kind of thing.

			regards, tom lane
Re: [HACKERS] Index corruption
On Fri, 2006-06-30 at 00:37 +0300, Hannu Krosing wrote:
> Marc: do you have triggers on some replicated tables ?

We have a non-slony trigger on only 2 tables, neither of them involved in
this transaction.  We certainly have no circular trigger structures.

> I remember having some corruption in a database with weird circular
> trigger structures, some of them being slony log triggers.
>
> The thing that seemed to mess up something inside there, was when a
> change on a parent row fired a trigger that changed child table rows,
> and those rows fired another trigger that changed the same parent row
> again.

--
Marc
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 17:23, Tom Lane wrote:
> Marc Munro <[EMAIL PROTECTED]> writes:
>> Tom,
>> we have a newer and much smaller (35M) file showing the same thing:
>
> Thanks.  Looking into this, what I find is that *both* indexes have
> duplicated entries for the same heap tuple:
> ...
> However, the two entries in idx1 contain different data!!
>
> What I speculate right at the moment is that we are not looking at index
> corruption at all, but at heap corruption: somehow, the first insertion
> into ctid (27806,2) got lost and the same ctid got re-used for the next
> inserted row.  We fixed one bug like this before ...

Marc: do you have triggers on some replicated tables ?

I remember having some corruption in a database with weird circular
trigger structures, some of them being slony log triggers.

The thing that seemed to mess up something inside there, was when a
change on a parent row fired a trigger that changed child table rows, and
those rows fired another trigger that changed the same parent row again.

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 16:42, Chris Browne wrote:
> [EMAIL PROTECTED] (Marc Munro) writes:
>> As you see, slony is attempting to enter one tuple
>> ('374520943','22007','0') two times.
>>
>> Each previous time we have had this problem, rebuilding the indexes on
>> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
>> the table this time as I do not want to destroy any usable evidence.
>
> We have seen this phenomenon on 7.4.8 several times; pulled dumps of
> sl_log_1 and index files that Jan Wieck looked at, which alas hasn't
> led to a fix.
>
> He did, mind you, find some concurrency pattern that led, if memory
> serves, to 7.4.12's release.  We had experienced cases where there was
> some worse corruption that required that we rebuild replicas from
> scratch :-(.

How well did you check the C-language triggers and special slony
functions for possibly corrupting some backend/shared-mem structures ?

--
Hannu Krosing
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes:
> Tom,
> we have a newer and much smaller (35M) file showing the same thing:

Thanks.  Looking into this, what I find is that *both* indexes have
duplicated entries for the same heap tuple:

idx1:

 Item 190 -- Length: 24  Offset: 3616 (0x0e20)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e20: 9e6c 02001800 0100 0057d123   ...l.W.#
  0e30: 1a781200                      .x..

 Item 191 -- Length: 24  Offset: 3592 (0x0e08)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 24
  Has Nulls: 0  Has Varwidths: 0

  0e08: 9e6c 02001800 0100 0057d123   ...l.W.#
  0e18: 2e781200                      .x..

idx2:

 Item 127 -- Length: 16  Offset: 6144 (0x1800)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  1800: 9e6c 02001000 0057d123   ...l.W.#

 Item 128 -- Length: 16  Offset: 6128 (0x17f0)  Flags: USED
  Block Id: 27806  linp Index: 2  Size: 16
  Has Nulls: 0  Has Varwidths: 0

  17f0: 9e6c 02001000 0057d123   ...l.W.#

However, the two entries in idx1 contain different data!!

What I speculate right at the moment is that we are not looking at index
corruption at all, but at heap corruption: somehow, the first insertion
into ctid (27806,2) got lost and the same ctid got re-used for the next
inserted row.  We fixed one bug like this before ...

			regards, tom lane
Re: [HACKERS] Index corruption
[EMAIL PROTECTED] (Marc Munro) writes:
> As you see, slony is attempting to enter one tuple
> ('374520943','22007','0') two times.
>
> Each previous time we have had this problem, rebuilding the indexes on
> slony log table (sl_log_1) has fixed the problem.  I have not reindexed
> the table this time as I do not want to destroy any usable evidence.

We have seen this phenomenon on 7.4.8 several times; pulled dumps of
sl_log_1 and index files that Jan Wieck looked at, which alas hasn't led
to a fix.

He did, mind you, find some concurrency pattern that led, if memory
serves, to 7.4.12's release.  We had experienced cases where there was
some worse corruption that required that we rebuild replicas from
scratch :-(.
--
output = reverse("gro.mca" "@" "enworbbc")
http://cbbrowne.com/info/advocacy.html
"There is no psychiatrist in the world like a puppy licking your face."
-- Ben Williams
Re: [HACKERS] Some questions to developers
Adding -advocacy

On Thu, Jun 29, 2006 at 04:48:01PM +0400, Nikolay Samokhvalov wrote:
> Recently an interview with Marten Mickos (mysql's ceo) appeared in
> russian IT news (eg http://citcity.ru/12776/).  I'd like to ask some
> [similar] questions to PostgreSQL core developers and write an article
> in Russian.  Actually, I'm sure that there is a great lack of PG news
> in our mass media.  Moreover, it seems that Anniversary is a good
> opportunity to raise a wave of interest.

Would you be willing to also publish in English?  I should be able to get
support from Pervasive for this, too -- if developers would rather do a
phone interview than an email interview, for example.  We can also help
with editing, as well as producing collateral for trade shows and
what-not.
--
Jim C. Nasby, Sr. Engineering Consultant      [EMAIL PROTECTED]
Pervasive Software      http://pervasive.com      work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
Re: [HACKERS] Index corruption
We have reproduced the problem again.  This time it looks like vacuum is
not part of the picture.

From the provider's log:

2006-06-29 14:02:41 CST DEBUG2 cleanupThread:  101.057 seconds for vacuuming

And from the subscriber's:

2006-06-29 13:00:43 PDT ERROR remoteWorkerThread_1: "insert into
"public"."table_trans_attribute" (table_transaction_id,attribute_type,value)
values ('374740387','22008','4000');

If my maths is correct and the logs are honest, the vacuum would have
started at 14:01:00 CST (13:01 PDT), about 20 seconds after we first
encountered the problem.  The clocks on the two machines, though in
different timezones, are currently synced.

Tom, I will create another tarball of the sl_log_1 table and indexes.
Should be quite a bit smaller than the previous one.

--
Marc
Re: [HACKERS] Single Index Tuple Chain (SITC) method
Martijn van Oosterhout wrote:
>> You can't truncate a tuple to just the header, or at least it's not
>> going to be very useful to do it, unless you can also move other tuples
>> to coalesce the free space on the page.  Which means you need a
>> VACUUM-strength page lock.  If you're trying to do this in foreground
>> queries, you get into the same performance and deadlock issues I already
>> mentioned.  And I think the net-increase-in-WAL-traffic point would
>> apply too, since VACUUM will still need to clean the page when it
>> removes the header.
>
> Well, I was only thinking of having the bgwriter do it in the
> background, just before writing the block to disk.  I'm hoping that it
> only tries to write out pages not recently used, so hopefully there
> would be very little contention there.
>
> And perhaps you can avoid the xlogging for the same reason as I
> suggested above.

Non-visible members of the SITC chains could also be removed by the
background writer.  Adding to the chain does not require a vacuum-level
lock; only reusing the space requires it.

The thing that has always held us back from more aggressive tuple reuse
is the inability of vacuum to do small cleanups --- it has to scan
indexes so it wants to do many pages at once.  If you have SITC, there
are new possibilities for tuple reuse.

--
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Single Index Tuple Chain (SITC) method
Martijn van Oosterhout wrote:
> One thing I am confused about, currently the ctid chain follows tuple
> history so that transactions can find the latest version of any tuple,
> even if the key fields have changed.  This proposal breaks that, I'm not
> sure how important that is though.

No, SITC doesn't break the UPDATE chain, it merely doesn't set the
SITC_NOT_TAIL bit on the tuple, so an index scan knows that is the last
tuple for that index entry.

--
  Bruce Momjian
Re: [HACKERS] Single Index Tuple Chain (SITC) method
On Thu, Jun 29, 2006 at 12:35:12PM -0400, Tom Lane wrote:
> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once --- and it won't cut the
> I/O demand from VACUUM any, either, since VACUUM still has to scan the
> index.  AFAICS this wouldn't make VACUUM either cheaper or less
> necessary, so I'm not sure I see the point.

Ok, I'm going to suggest something that's either radical, or very dumb:
does this truncation really need to be xlogged?  After all, just like
hint bits, nothing is being done that won't be recovered from later if it
doesn't happen.

The only possible thing I can imagine is that during xlog replay you end
up trying to add more tuples than appear to fit.  But if you have a
vacuum page procedure you could call it again to try to compress it down.
Then again, perhaps visibility checks are not safe within xlog replay
state.

I'm hoping that overall disk traffic reduces because total disk space
used by tables/indexes reduces.

> You can't truncate a tuple to just the header, or at least it's not
> going to be very useful to do it, unless you can also move other tuples
> to coalesce the free space on the page.  Which means you need a
> VACUUM-strength page lock.  If you're trying to do this in foreground
> queries, you get into the same performance and deadlock issues I already
> mentioned.  And I think the net-increase-in-WAL-traffic point would
> apply too, since VACUUM will still need to clean the page when it
> removes the header.

Well, I was only thinking of having the bgwriter do it in the background,
just before writing the block to disk.  I'm hoping that it only tries to
write out pages not recently used, so hopefully there would be very
little contention there.

And perhaps you can avoid the xlogging for the same reason as I suggested
above.

Have a nice day,
--
Martijn van Oosterhout   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to
> litigate.
Longer startup delay (was Re: [HACKERS] Single Index Tuple Chain (SITC) method)
Tom Lane wrote:
> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once

Speaking of which, I think I've noticed a longer delay in server start
after initdb.  I haven't measured nor profiled it, but I think it may be
because of the heap_inplace_update xlogging that we weren't doing
previously.

--
Alvaro Herrera                 http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] Single Index Tuple Chain (SITC) method
On Thu, 2006-06-29 at 12:35, Tom Lane wrote:
> Martijn van Oosterhout writes:
> >> Tom - what do you think of the other related idea, that of reusing
> >> dead index entries ?
>
> Possibly workable for btree now that we do page-at-a-time index scans;
> however I'm pretty hesitant to build any large infrastructure atop that
> change until we've got more performance results.  We might yet end up
> reverting it.
>
> Another issue is that this would replace a simple hint-bit setting with
> an index change that requires a WAL entry.  There'll be more WAL traffic
> altogether from backends retail-deleting index tuples than there would
> be from VACUUM cleaning the whole page at once --- and it won't cut the
> I/O demand from VACUUM any, either, since VACUUM still has to scan the
> index.  AFAICS this wouldn't make VACUUM either cheaper or less
> necessary, so I'm not sure I see the point.

How can it generate more traffic?  When you replace a dead index entry
with a live one, you just reuse space - you would have to WAL-log the
index in both cases (adding a new entry or replacing a dead entry),
especially in the case where you replace an index entry with the same
value.

--
Hannu Krosing
Re: [HACKERS] [GENERAL] UUID's as primary keys
On Thu, Jun 29, 2006 at 06:40:13PM +0200, Thomas Hallgren wrote: > Martijn van Oosterhout wrote: > >On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote: > > > >>I have to concur with this. Assume you use a bytea for a UUID that in > >>turn is used as a primary key. The extra overhead will be reflected in > >>all indexes, all foreign keys, etc. In a normalized database some tables > >>may consist of UUID columns only. > >> > > > >So you create a UUID type. It's cheap enough to create new types after > >all, that's one of postgresql's strengths. > It would be a whole lot easier if I could use a domain. It seems to me that maybe the backend should include a 16-byte fixed length object (after all, we've got 1, 2, 4 and 8 bytes already) and then people can use that to build whatever they like, using domains, for example... Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate. signature.asc Description: Digital signature
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes: > As you see, slony is attempting to enter one tuple > ('374520943','22007','0') two times. > Each previous time we have had this problem, rebuilding the indexes on > slony log table (sl_log_1) has fixed the problem. I have not reindexed > the table this time as I do not want to destroy any usable evidence. Good. How big is the log table --- would it be possible for you to send me the physical table and index files? (Not a pg_dump, the actual disk files) regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] [GENERAL] UUID's as primary keys
Martijn van Oosterhout wrote: On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote: I have to concur with this. Assume you use a bytea for a UUID that in turn is used as a primary key. The extra overhead will be reflected in all indexes, all foreign keys, etc. In a normalized database some tables may consist of UUID columns only. So you create a UUID type. It's cheap enough to create new types after all, that's one of postgresql's strengths. It would be a whole lot easier if I could use a domain. What I'm saying is that it's easier to create new fixed length types for the cases that need it, than it is to redo the entire type handling of the backend. Of course. But it's a matter of who does what. Your reasoning push the burden to the users. Regards, Thomas Hallgren ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
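For concreteness, the domain-based approach Thomas would prefer can be sketched like this; note this is only an illustration — a domain over bytea still pays the varlena header overhead the thread is complaining about, which is exactly why a fixed-length 16-byte base type is being proposed:

```sql
-- Hypothetical: a UUID-shaped domain over bytea, constrained to 16 bytes.
CREATE DOMAIN uuid_like AS bytea
    CHECK (octet_length(VALUE) = 16);

-- Usable anywhere a type is, including keys:
CREATE TABLE widget (
    id uuid_like PRIMARY KEY,
    parent_id uuid_like REFERENCES widget(id)
);
```

If a fixed-length 16-byte base type existed, the same CREATE DOMAIN could simply target it instead of bytea and the per-value overhead would disappear.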
Re: [HACKERS] Single Index Tuple Chain (SITC) method
Martijn van Oosterhout writes: >> Tom - what do you think of the other related idea, that of reusing dead >> index entries ? Possibly workable for btree now that we do page-at-a-time index scans; however I'm pretty hesitant to build any large infrastructure atop that change until we've got more performance results. We might yet end up reverting it. Another issue is that this would replace a simple hint-bit setting with an index change that requires a WAL entry. There'll be more WAL traffic altogether from backends retail-deleting index tuples than there would be from VACUUM cleaning the whole page at once --- and it won't cut the I/O demand from VACUUM any, either, since VACUUM still has to scan the index. AFAICS this wouldn't make VACUUM either cheaper or less necessary, so I'm not sure I see the point. > I'd like to know about this too, including ideas about truncating > tuples to just the header. You can't truncate a tuple to just the header, or at least it's not going to be very useful to do it, unless you can also move other tuples to coalesce the free space on the page. Which means you need a VACUUM-strength page lock. If you're trying to do this in foreground queries, you get into the same performance and deadlock issues I already mentioned. And I think the net-increase-in-WAL-traffic point would apply too, since VACUUM will still need to clean the page when it removes the header. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
On Thu, Jun 29, 2006 at 09:12:32AM +0200, Thomas Hallgren wrote: > The split make sense since clients often have powerful UUID utilities handy > and hence have limited or no use for such utilities in the database (true > for all .NET and Java clients). Some PL's will also enable such packages > out of the box. I agree about splitting the utilities, except that I think the database should be able to generate UUIDs somehow. -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.comwork: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461 ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
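As an aside, the client-side generation Thomas alludes to really is trivial in most client stacks; a minimal sketch in Python (used here only as an example client language — the thread mentions Java and .NET clients):

```python
import uuid

# Generate a random (version 4) UUID on the client and get the raw
# 16 bytes that a fixed-length 16-byte database type would store.
u = uuid.uuid4()
raw = u.bytes

print(len(raw))    # number of payload bytes: 16
print(str(u))      # canonical 36-character text form
```

The 16-byte `bytes` form is what would feed a fixed-length column; the 36-character text form is what a varchar-based workaround ends up storing.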
Re: [HACKERS] Index corruption
On Thu, 2006-06-29 at 12:11 -0400, Tom Lane wrote: > OK, so it's not an already-known problem. > > > We were able to corrupt the index within 90 minutes of starting up our > > cluster. A slony-induced vacuum was under way on the provider at the > > time the subscriber failed. > > You still haven't given any details. What do you mean by "corrupt the > index" --- what actual symptoms are you seeing? We have a two node slony cluster with load tests running against the provider node. After resyncing the subscriber, and running load tests for about an hour, the slony subscriber begins to fail. From the log file:

2006-06-28 17:58:43 PDT ERROR remoteWorkerThread_1:
"insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22001','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22002','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22003','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22004','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22005','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22006','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520943','22007','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22009','0');
insert into "public"."table_trans_attribute" (table_transaction_id,attribute_type,value) values ('374520942','22010','0');
"
ERROR: duplicate key violates unique constraint "table_trans_attr_pk"

As you see, slony is attempting to enter one tuple ('374520943','22007','0') two times. Each previous time we have had this problem, rebuilding the indexes on the slony log table (sl_log_1) has fixed the problem. I have not reindexed the table this time as I do not want to destroy any usable evidence. __ Marc
Re: [HACKERS] [GENERAL] UUID's as primary keys
Greg Stark <[EMAIL PROTECTED]> writes: > In the current setup the only reason for Postgres to have this data type at > all is purely for legacy compatibility. Yes. So? regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Index corruption
Marc Munro <[EMAIL PROTECTED]> writes: > On Tom Lane's advice, we upgraded to Postgres 8.0.8. OK, so it's not an already-known problem. > We were able to corrupt the index within 90 minutes of starting up our > cluster. A slony-induced vacuum was under way on the provider at the > time the subscriber failed. You still haven't given any details. What do you mean by "corrupt the index" --- what actual symptoms are you seeing? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Index corruption
On Tom Lane's advice, we upgraded to Postgres 8.0.8. We also upgraded slony to 1.1.5, due to some rpm issues. Apart from that everything is as described below. We were able to corrupt the index within 90 minutes of starting up our cluster. A slony-induced vacuum was under way on the provider at the time the subscriber failed. What can we do to help identify the cause of this? We have a test system that seems able to reproduce this fairly easily. __ Marc On Wed, 2006-06-28 at 09:28 -0700, Marc Munro wrote: > We have now experienced index corruption on two separate but identical > slony clusters. In each case the slony subscriber failed after > attempting to insert a duplicate record. In each case reindexing the > sl_log_1 table on the provider fixed the problem. > > The latest occurrence was on our production cluster yesterday. This has > only happened since we performed kernel upgrades and we are uncertain > whether this represents a kernel bug, or a postgres bug exposed by > different timings in the new kernel. > > Our systems are: > > Sun v40z 4 x Dual Core AMD Opteron(tm) Processor 875 > Kernel 2.6.16.14 #8 SMP x86_64 x86_64 x86_64 GNU/Linux > kernel boot option: elevator=deadline > 16 Gigs of RAM > postgresql-8.0.3-1PGDG > Bonded e1000/tg3 NICs with 8192 MTU. > Slony 1.1.0 > > NetApp FAS270 OnTap 7.0.3 > Mounted with the NFS options > rw,nfsvers=3,hard,rsize=32768,wsize=32768,timeo=600,tcp,noac > Jumbo frames 8192 MTU. > > All postgres data and logs are stored on the netapp. > > In the latest episode, the index corruption was coincident with a > slony-induced vacuum. I don't know if this was the case with our test > system failures. > > __ > Marc signature.asc Description: This is a digitally signed message part
Re: [HACKERS] [GENERAL] UUID's as primary keys
Martijn van Oosterhout writes: > On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote: > > I have to concur with this. Assume you use a bytea for a UUID that in > > turn is used as a primary key. The extra overhead will be reflected in > > all indexes, all foreign keys, etc. In a normalized database some tables > > may consist of UUID columns only. > > So you create a UUID type. It's cheap enough to create new types after > all, that's one of postgresql's strengths. What I'm saying is that it's > easier to create new fixed length types for the cases that need it, > than it is to redo the entire type handling of the backend. I guess my motivation here is that I feel currently char(n) is basically broken in Postgres. Sure it satisfies the letter of the specification, but it's failing to actually achieve anything for the users. There's no point at all in using char(n) in Postgres since it takes exactly the same amount of space as varchar() if you're always stuffing it full and more space if you're not. In the current setup the only reason for Postgres to have this data type at all is purely for legacy compatibility. It doesn't actually "work" in that it doesn't provide the space savings it's intended to and that would give users an actual reason to use it in new databases. -- greg ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] vacuum, performance, and MVCC
Hannu Krosing wrote: > > > But we still have to think about similar cases (index entries pointing > > > inside CITC chains), unless we plan to disallow adding indexes to > > > tables. > > > > CREATE INDEX has to undo any chains where the new indexed columns change > > in the chain, and add index entries to remove the chain. > > Yes, that would be the most straightforward solution. > > It could be better in some cases, if we could avoid adding entries to > other indexes. Maybe we can just reset some flags, so that some SITC ops > like finding tuples by the CITC index pointer still work while adding > new entries won't. > > But it will be tricky to make this work for bitmap index scans. > > So yes, index build is a slow operation anyway, so making it even a > little slower is probably not a big problem. And most CITC chains will > have only one visible row at a time, this will probably not be a big > issue. Consider that index scans coming from different indexes have to span the same SITC. I see no clean way to avoid making all the indexes consistent, and as you said, CREATE INDEX isn't a frequent operation. -- Bruce Momjian [EMAIL PROTECTED] EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] [GENERAL] UUID's as primary keys
On Thu, Jun 29, 2006 at 03:54:36PM +0200, Thomas Hallgren wrote: > I have to concur with this. Assume you use a bytea for a UUID that in > turn is used as a primary key. The extra overhead will be reflected in > all indexes, all foreign keys, etc. In a normalized database some tables > may consist of UUID columns only. So you create a UUID type. It's cheap enough to create new types after all, that's one of postgresql's strengths. What I'm saying is that it's easier to create new fixed length types for the cases that need it, than it is to redo the entire type handling of the backend. And for people that want char(1), they should be using "char", which really is one byte (excluding padding, of course). Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
Re: [HACKERS] Compilation of source code for windows
Hi. If you are using the 8.1.4 source, you probably need to pull the stable branch from CVS or apply the patch referred to here: http://archives.postgresql.org/pgsql-patches/2006-05/msg00232.php Regards, Hiroshi Saito - Original Message - From: Sandeep Jakkaraju(Navolve) To: pgsql-hackers@postgresql.org Sent: Thursday, June 29, 2006 4:29 PM Subject: [HACKERS] Compilation of source code for windows Hi All, I am unable to compile the source code for Windows. The command in the INSTALL file, nmake /f win32.mak, is not working!!! Can anyone help??? Thanks in advance, sandeep
Re: [HACKERS] session id and global storage
Andrew Dunstan <[EMAIL PROTECTED]> writes: > Rodrigo De Leon wrote: >> You could do this: >> >> SELECT procpid||' '||backend_start >> FROM pg_stat_activity >> WHERE datname = current_database() >> AND usename = session_user >> AND client_addr = inet_client_addr() >> AND client_port = inet_client_port(); > That's pretty roundabout. Indeed. Use pg_backend_pid() instead: SELECT whatever FROM pg_stat_activity WHERE procpid = pg_backend_pid(); A difficulty with this in existing releases is that pg_stat_activity lags behind reality, so that you won't see your session listed in it until you've been connected at least half a second or so. 8.2 won't have that problem. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [GENERAL] UUID's as primary keys
Greg Stark wrote: Martijn van Oosterhout writes: To be honest, it seems like a lot of work to save the four bytes of overhead for the varlena structure on disk if you're going to need it in memory anyway. And anything like RAW(16) which people want for UUIDs, if it's going to have a lot of functions associated with it, may as well just be a new type. For large databases storage density leads directly to speed. Saving four bytes of overhead on a 16-byte data structure would mean a 20% speed increase. Even if that's only helpful on a tenth of the columns you're still talking about a 2% speed increase for all queries on the table. A lot of databases use CHAR(1) for flags. The overhead is even worse there. I have to concur with this. Assume you use a bytea for a UUID that in turn is used as a primary key. The extra overhead will be reflected in all indexes, all foreign keys, etc. In a normalized database some tables may consist of UUID columns only. Regards, Thomas Hallgren ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] session id and global storage
That's pretty roundabout. We already expose (hex coded) pid.starttime as a session identifier in log_line_prefix (it's the %c escape) so I don't see any reason not to provide either the same thing directly in a function, or at least to expose the backend pid. That would be nice. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] [GENERAL] UUID's as primary keys
Martijn van Oosterhout writes: > A tuple is just an array of datums, with some header information. The > problems come when you don't have a tuple anymore, but only the datum, > like in arguments for functions. > > I think it's more a case that most places that deal with datums simply > don't know about typmods. For example, the return type of a function > can only be char, not char(16). If you consider the case of a function > returning a RAW, the caller will have no way of knowing the typmod, > they do know the type though. > > To be honest, it seems like a lot of work to save the four bytes of > overhead for the varlena structure on disk if you're going to need it > in memory anyway. And anything like RAW(16) which people want for > UUIDs, if it's going to have a lot of functions associated with it, may > as well just be a new type. For large databases storage density leads directly to speed. Saving four bytes of overhead on a 16-byte data structure would mean a 20% speed increase. Even if that's only helpful on a tenth of the columns you're still talking about a 2% speed increase for all queries on the table. A lot of databases use CHAR(1) for flags. The overhead is even worse there. > Consider where we currently we have a "Filter Cond" on a "Seq Scan". > Currently the filter can access the datums directly on the disk page, with > what you're proposing, it can't. Well it only can't if the data type has conversion functions. I'm not sure how complex it would be having pointers that *getattr sometimes return pointers to the disk page and sometimes return pointers to a palloced copy though. -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
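Greg's 20% figure follows directly from the varlena header arithmetic; a quick sanity check (Python used purely as a calculator, with the sizes as discussed in the thread):

```python
# A bytea-backed UUID carries a 4-byte varlena header plus 16 bytes
# of payload; a hypothetical fixed-length type would store 16 bytes.
varlena_total = 4 + 16   # 20 bytes per value on disk
fixed_total = 16

saving = (varlena_total - fixed_total) / varlena_total
print(f"{saving:.0%}")   # 20%
```

For a CHAR(1) flag the ratio is far worse: 4 bytes of header for 1 byte of payload, i.e. 80% overhead per value before alignment.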
Re: [HACKERS] session id and global storage
Rodrigo De Leon wrote: Hi, I cant find any function, which tells me something like session id. Is there something like that? I need it in my AM, because I need to know, if something which I wrote in file was written in this current session or previously. How about select procpid||' '||backend_start from pg_stat_activity; Yours, Laurenz Albe Something like this would be maybe possible, but this select can return more rows, when the user is connected with more instances... You could do this: SELECT procpid||' '||backend_start FROM pg_stat_activity WHERE datname = current_database() AND usename = session_user AND client_addr = inet_client_addr() AND client_port = inet_client_port(); That's pretty roundabout. We already expose (hex coded) pid.starttime as a session identifier in log_line_prefix (it's the %c escape) so I don't see any reason not to provide either the same thing directly in a function, or at least to expose the backend pid. If you need it in backend C code, the data can be fetched from MyProcPid and MyProcPort->session_start.tv_sec cheers andrew ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] session id and global storage
Hi, I can't find any function that tells me something like a session id. Is there something like that? I need it in my AM because I need to know whether something I wrote to a file was written in the current session or previously. How about select procpid||' '||backend_start from pg_stat_activity; Yours, Laurenz Albe Something like this might be possible, but this select can return more rows when the user is connected with more instances... David Hoksza You could do this: SELECT procpid||' '||backend_start FROM pg_stat_activity WHERE datname = current_database() AND usename = session_user AND client_addr = inet_client_addr() AND client_port = inet_client_port(); Regards, Rodrigo
[HACKERS] Some questions to developers
Hello, Recently an interview with Marten Mickos (MySQL's CEO) appeared in Russian IT news (e.g. http://citcity.ru/12776/). I'd like to ask some [similar] questions to the PostgreSQL core developers and write an article in Russian. Actually, I'm sure that there is a great lack of PG news in our mass media. Moreover, it seems that the Anniversary is a good opportunity to raise a wave of interest. -- Best regards, Nikolay
Re: [HACKERS] Single Index Tuple Chain (SITC) method
> > And anyway, ctid is a usable unique row identifier only within > > read-only transactions, or not ? Actually, for as long as no vacuum comes along. This would change with SITC. (Maybe it would help to only reuse old versions of the same row; then anybody holding a ctid would at least still be looking at a version of the same row, and should thus be able to follow the update chain.) > Err, no. The ctid is the only identifier of a tuple in any > case. When you do a delete, the tuple to be deleted is > indicated by the ctid field which has been passed up from the > base table through the rest of the query. When you reach the > top the ctid better refer to the same tuple or you'll delete > the wrong one. UPDATE is the same. For all these purposes you will be holding the ctid of a visible (to someone) tuple. Those don't qualify for a new SITC tuple anyway. > For all intents and purposes, the CTID of a tuple can't change > unless you're 100% certain no-one is using it in any way. For all I know, no one is using dead tuples except for visibility lookup. We would need to make sure that other backends see the new tuple either as dead or txopen as long as the contents are not valid. I think we could do that without a vacuum lock on platforms that support 4-byte atomic operations. Andreas
Re: [HACKERS] Single Index Tuple Chain (SITC) method
On Thu, Jun 29, 2006 at 01:39:51AM +0300, Hannu Krosing wrote: > And anyway, ctid is a usable unique row identifier only within read-only > transactions, or not ? Err, no. The ctid is the only identifier of a tuple in any case. When you do a delete, the tuple to be deleted is indicated by the ctid field which has been passed up from the base table through the rest of the query. When you reach the top the ctid better refer to the same tuple or you'll delete the wrong one. UPDATE is the same. In READ COMMITTED mode, the tuple is rechecked for visibility and if it's invisible, the ctid chain is followed to find the visible one (which may not necessarily be the last one). For all intents and purposes, the CTID of a tuple can't change unless you're 100% certain no-one is using it in any way. That's what the vacuum lock is for. > Is it correct to assume, that only one row version can be in process of > being modified at any one time? No, different transactions may be updating differing versions, depending on what was visible at the time. In serialisable transactions you'll get a serialisation failure though, and for read committed, the query will be rerun for the latest version of the tuple. One thing I am confused about: currently the ctid chain follows tuple history so that transactions can find the latest version of any tuple, even if the key fields have changed. This proposal breaks that; I'm not sure how important that is though. > Tom - what do you think of the other related idea, that of reusing dead > index entries ? I'd like to know about this too, including ideas about truncating tuples to just the header. Have a nice day. -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
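The ctid behaviour described above is easy to observe from SQL; a hedged psql-style sketch (the item-pointer values in the comments are illustrative, not guaranteed):

```sql
-- ctid is a tuple's physical address: (block number, item pointer).
CREATE TABLE t (id int PRIMARY KEY, val text);
INSERT INTO t VALUES (1, 'a');

SELECT ctid FROM t WHERE id = 1;     -- e.g. (0,1)

-- An UPDATE writes a new tuple version at a new ctid; the old
-- version's t_ctid header field points to it (the update chain).
UPDATE t SET val = 'b' WHERE id = 1;
SELECT ctid FROM t WHERE id = 1;     -- e.g. (0,2)
```

This is why a scheme that moves live tuples to different item pointers is dangerous: anyone still holding the pre-update ctid needs the chain to remain followable.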
Re: [HACKERS] vacuum, performance, and MVCC
On Tue, 2006-06-27 at 12:16, Bruce Momjian wrote: > Hannu Krosing wrote: > > On Tue, 2006-06-27 at 10:38, Hannu Krosing wrote: > > > On Mon, 2006-06-26 at 23:08, Bruce Momjian wrote: > > > > Jim C. Nasby wrote: > > > > > On Mon, Jun 26, 2006 at 02:32:59PM -0400, Bruce Momjian wrote: > > > > > > > > > > > > It is certainly possible to do what you are suggesting, that is > > > > > > have two > > > > > > index entries point to same chain head, and have the index access > > > > > > routines figure out if the index qualifications still hold, but that > > > > > > seems like a lot of overhead. > > > > > > I think Jim meant not 2 pointing to the same head, but 2 pointing into > > > the same chain. Say we have table with (id serial, ts timestamp) where > > > ts changes at each update and id does not. > > > > > > So after 3 updates on one page we have one CITC/ITPC head with pointers > > > from both indexes and two follow-up tuples with pointers from only ts > > > index. > > > > > > The problem with this setup is, that we can't reuse any of those > > > follow-up tuples without index cleanup. > > > > But we still have to think about similar cases (index entries pointing > > inside CITC chains), unless we plan to disallow adding indexes to > > tables. > > CREATE INDEX has to undo any chains where the new indexed columns change > in the chain, and add index entries to remove the chain. Yes, that would be the most straightforward solution. It could be better in some cases, if we could avoid adding entries to other indexes. Maybe we can just reset some flags, so that some SITC ops like finding tuples by the CITC index pointer still work while adding new entries won't. But it will be tricky to make this work for bitmap index scans. So yes, index build is a slow operation anyway, so making it even a little slower is probably not a big problem. And most CITC chains will have only one visible row at a time, so this will probably not be a big issue. -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Re: [HACKERS] Single Index Tuple Chain (SITC) method
On Wed, 2006-06-28 at 18:19, Tom Lane wrote: > Bruce Momjian <[EMAIL PROTECTED]> writes: > > Here is an overview of the SITC method: > > http://momjian.us/cgi-bin/pgsitc > > A pretty fundamental problem is that the method assumes it's OK to > change the CTID of a live tuple (by swapping its item pointer with some > expired version). It is not --- this will break: > * active UPDATEs and DELETEs that may have fetched the CTID > but not yet completed processing to decide whether to change > the tuple; > * pending AFTER ROW triggers, such as foreign key checks; > * ODBC as well as other applications that assume CTID is a > usable unique row identifier within transactions. We should *always* return the ctid of the CITC head, as this is the one that does not change. And anyway, ctid is a usable unique row identifier only within read-only transactions, or not? > VACUUM FULL can get away with moving tuples to new CTIDs because it takes > AccessExclusiveLock, so there can be no open transactions with knowledge > of current CTIDs in the table. This is not OK for something that's > supposed to happen in plain UPDATEs, though. Would it still be a problem if we *always* refer to the whole CITC chain by its externally visible ctid, and look up the real tuple inside the tuple fetch op at every access? (1) If we had some special bits for tuples at the CITC chain head and inside the CITC but not at the head, then even a seqscan could ignore non-head CITC chain members in its find-next-tuple op and do the real tuple lookup in some inner function when it hits the CITC head. Is it correct to assume that only one row version can be in the process of being modified at any one time? > Another problem is you can't recycle tuples, nor item ids, without > taking a VACUUM-style lock on the page (LockBufferForCleanup). If > anyone else is holding a pin on the page they risk getting totally > confused --- for instance, a seqscan will either miss a tuple or scan it > twice depending on which direction you're juggling item ids around it. I think (1) above solves this, at the cost of looking twice at CITC-internal tuple headers. > The concurrency loss involved in LockBufferForCleanup is OK for > background-maintenance operations like VACUUM, but I seriously doubt > anyone will find it acceptable for UPDATE. It could easily create > application-level deadlocks, too. (VACUUM is safe against that because > it only holds one lock.) Tom - what do you think of the other related idea, that of reusing dead index entries? -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Re: [HACKERS] [GENERAL] UUID's as primary keys
On Thu, Jun 29, 2006 at 02:40:15AM -0400, Greg Stark wrote: > > Greg Stark <[EMAIL PROTECTED]> writes: > > No, it doesn't, and we'd pay a nonzero price for allowing that. > > Currently the executor doesn't have to care (much) about whether a > > tuple is on-disk or in-memory --- the individual datums look the same > > either way. Allowing them to be different would force a lot of > > format conversion steps that currently need not happen. > > Is there ever a case where an entire tuple is passed around without knowing > the typmod of an attribute in the tuple? A tuple is just an array of datums, with some header information. The problems come when you don't have a tuple anymore, but only the datum, like in arguments for functions. I think it's more a case that most places that deal with datums simply don't know about typmods. For example, the return type of a function can only be char, not char(16). If you consider the case of a function returning a RAW, the caller will have no way of knowing the typmod; they do know the type though. To be honest, it seems like a lot of work to save the four bytes of overhead for the varlena structure on disk if you're going to need it in memory anyway. And anything like RAW(16) which people want for UUIDs, if it's going to have a lot of functions associated with it, may as well just be a new type. I think time would be much better spent finding a way of allowing user-defined types to be created without using C functions. > The conversion would only really have to happen when the attribute is fetched > or stored, not when the tuple is being passed around wholesale. But I have a > feeling that would be more intrusive than just making the entire system typmod > aware. I'm not sure if tuples are ever passed wholesale very far. The first node to actually do anything with it (any join, expression or condition test) is going to need to deconstruct it. Consider where we currently have a "Filter Cond" on a "Seq Scan". Currently the filter can access the datums directly on the disk page; with what you're proposing, it can't. Have a nice day, -- Martijn van Oosterhout http://svana.org/kleptog/ > From each according to his ability. To each according to his ability to > litigate.
Re: [HACKERS] Single Index Tuple Chain (SITC) method
> > Here is an overview of the SITC method:
> > http://momjian.us/cgi-bin/pgsitc
>
> A pretty fundamental problem is that the method assumes it's OK to
> change the CTID of a live tuple (by swapping its item pointer with
> some expired version). It is not --- this will break:

I am having difficulty visualizing that. The plan is not to change
CTIDs; only the CTID's offset into the page is to be changed. The CTID
of the new version is one that was up to now invisible to all backends,
so no one can actually have remembered that CTID. Also, you would first
insert the slot content and then change the CTID offset (this offset
change might need to be made atomic).

Andreas

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings
[HACKERS] Compilation of source code for windows
Hi all,

I am unable to compile the source code for Windows. The command in the
INSTALL file,

    nmake /f win32.mak

is not working. Can anyone help?

Thanks in advance,
sandeep
Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
> > If there is interest - I'm sure Nathan and I would be willing to
> > put it on pgfoundry, and at some point give it up for inclusion
> > into PostgreSQL.
>
> One requirement would be that it runs on Windows. Is that something
> you have tested?

In case it influences anyone, MySQL 5 already has built-in UUID support:

http://dev.mysql.com/doc/refman/5.0/en/miscellaneous-functions.html#id2899901

Chris
Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
[EMAIL PROTECTED] wrote:
> On Thu, Jun 29, 2006 at 02:02:32AM -0400, [EMAIL PROTECTED] wrote:
> > It was written by Nathan Wagner <[EMAIL PROTECTED]> and myself, and
> > is based off the OSSP ( http://www.ossp.org/ ) UUID implementation.
> > I'm not an expert on the license, but it seems acceptable to me:
> > ...
> > If there is interest - I'm sure Nathan and I would be willing to
> > put it on pgfoundry, and at some point give it up for inclusion
> > into PostgreSQL.
>
> This might require a little bit of research. It appears that the
> development version of OSSP UUID may provide its own PostgreSQL
> 'bindings'. I may try to contact the author of OSSP UUID and see
> whether any changes we have that he does not can be rolled into his
> version...
>
> Cheers,
> mark

I'm thinking ahead to possible objections to inclusion in core. One
objection might be that a fully blown UUID implementation is a lot of
code. Code that needs to be maintained, and it increases the size of the
binary, etc. A solution to that might be to break the whole thing in
two:

1. The actual type

   A plain scalar type that stores 16 bytes. It's complete with standard
   operators for comparison (natural order), and the text representation
   would be a 32-character hexadecimal string. This type should make no
   interpretation whatsoever of what it stores, and its only association
   with UUIDs is the storage size.

2. UUID utilities

   Various ways of representing, generating, and extracting partial
   information from UUIDs. Should have support for variants #0, #1, and
   #2 (the OSSP-based code sounds like a good candidate).

The split makes sense since clients often have powerful UUID utilities
handy and hence have limited or no use for such utilities in the
database (true for all .NET and Java clients). Some PLs will also enable
such packages out of the box.

The actual type would be extremely generic, lightweight, and easy to
implement. No portability issues whatsoever. The only difficulty that I
see is naming it :-).

Regards,
Thomas Hallgren
Re: [HACKERS] Fixed length datatypes. WAS [GENERAL] UUID's as
[EMAIL PROTECTED] wrote:
> On Wed, Jun 28, 2006 at 01:12:17PM -0500, Jim C. Nasby wrote:
> > On Wed, Jun 28, 2006 at 01:49:55PM -0400, Andrew Dunstan wrote:
> > > Personally I don't buy the misuse objection - we already have
> > > plenty of things that can be misused. As long as there is a
> > > reasonable valid use and we can make it portable enough, I think
> > > there is a good case for including it.
> >
> > Well, since Mark has one, how about we consider adding it in? If
> > nothing else, can you please put your stuff on pgFoundry so others
> > can find it, Mark?
>
> It was written by Nathan Wagner <[EMAIL PROTECTED]> and myself, and is
> based off the OSSP ( http://www.ossp.org/ ) UUID implementation. I'm
> not an expert on the license, but it seems acceptable to me:
>
> "Permission to use, copy, modify, and distribute this software for any
> purpose with or without fee is hereby granted, provided that the above
> copyright notice and this permission notice appear in all copies."
>
> I haven't tested to see how portable the OSSP UUID implementation is.
> In their words:
>
> "OSSP uuid was already written with maximum portability in mind, so
> there should be no great effort required to get it running on any Unix
> platform with a reasonable POSIX API. Additionally, the portability
> was tested by successfully building and running it on the following
> particular Unix platforms (syntax is "<arch>-<os><version>
> (<compilers>)"): alpha-tru644.0 (cc), alpha-tru645.1 (gcc, cc),
> hppa-hpux11.11 (cc), ia64-hpux11.23 (cc), ix86-debian2.2 (gcc, icc),
> ix86-debian3.0 (gcc), ix86-debian3.1 (gcc), ix86-freebsd4.9 (gcc),
> ix86-freebsd5.2 (gcc, icc), ix86-netbsd1.6 (gcc), ix86-qnx6.2 (gcc),
> ix86-solaris10 (gcc), ix86-unixware7.1.3 (cc), mips64-irix6.5 (gcc),
> sparc64-solaris8 (gcc, forte), sparc64-solaris9 (gcc)"
>
> I've put it through a fair amount of testing, including using it
> within compound indexes (expecting the index to be used for at least
> '='), constructing many UUIDs quickly in a sequence, and converting it
> to and from string form.
>
> We chose to implement our own encode / decode routines for performance
> reasons. With the exception of testing it on a wider range of
> platforms, I would call the module stable.
>
> If there is interest - I'm sure Nathan and I would be willing to put
> it on pgfoundry, and at some point give it up for inclusion into
> PostgreSQL.

One requirement would be that it runs on Windows. Is that something you
have tested?

Regards,
Thomas Hallgren