Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression
OK, where are we on this patch?

It is without changes. This task has to be done by somebody who knows the PostgreSQL cache system better than me. Regards Pavel

--- Pavel Stehule wrote:

Pavel Stehule [EMAIL PROTECTED] writes: This patch doesn't seem to cope with cases where the supplied tuple has the wrong number of columns, and it doesn't look like it's being careful about dropped columns either. Also, that's a mighty bizarre-looking choice of cache memory context in coerce_to_tuple ... but then again, why are you bothering with a cache at all for temporary arrays?

I am sorry, Tom. But I don't understand. I can check the number of columns, of course, and I'll do it. What cache for temporary arrays do you mean?

I thought that making coerce_to_tuple depend on estate->err_func was pretty bizarre, and that there was no need for any cache at all for arrays that need only live as long as the function runs. All you are saving here is a palloc/pfree cycle, which is not worth the obscurantism and risk of bugs (are you sure natts can never change?).

No, the cache there is ugly. But I don't have an idea for a more effective implementation of it :-(. The first version of this patch was cleaner, and a little bit slower. This cache saves 10%.

BTW, if you want this patch to make it into 8.2, it needs to be fixed and resubmitted *very* soon.

I understand, but I am not able to work on it in the next four days. And I need help with it from Neil. It will be for 8.3. Thank you, Pavel

-- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com
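For readers unfamiliar with the thread: the patch lets a plpgsql RETURN take an arbitrary expression of the function's result type, rather than only a previously assigned variable. A minimal sketch of what this would enable (the type and function names are invented for illustration; stock 8.2 plpgsql would reject the expression form):

```sql
CREATE TYPE pair AS (a integer, b integer);

CREATE OR REPLACE FUNCTION make_pair(x integer) RETURNS pair AS $$
BEGIN
    -- With the patch, RETURN accepts any expression coercible to the
    -- result type, not just a plain row/record variable:
    RETURN ROW(x, x + 1);
END;
$$ LANGUAGE plpgsql;

SELECT * FROM make_pair(1);
```

Presumably coerce_to_tuple, discussed above, is where such a result row gets matched against the declared composite type, which is why the wrong-column-count and dropped-column cases matter.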
Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs, updated patch
Tom Lane wrote: I'm testing the patch currently. I was a bit surprised to find the without_oid test failing, but it makes sense because I'm using a MAXALIGN=8 machine. I suppose Heikki tested on MAXALIGN=4.

That's right. Thanks for the review!

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs, updated patch
As for what I think we *should* do near-term, I'm pretty strongly tempted to suggest that we just throw an error if a subtransaction tries to upgrade an upper transaction's shared lock to exclusive. So when an RI check locks a parent, you would not be able to update the parent in a later subtrans.

I can imagine that the error would be a problem in a SELECT FOR UPDATE loop, because there you usually want to update the row.

Andreas
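Concretely, the case under discussion can be sketched like this (table and column names are invented; in the RI case the upper transaction would take the shared lock implicitly via a foreign-key check rather than via FOR SHARE):

```sql
BEGIN;
SELECT * FROM parent WHERE id = 1 FOR SHARE;  -- upper transaction holds a shared row lock
SAVEPOINT s1;
-- Upgrading to an exclusive row lock from inside the subtransaction
-- is what the proposal would turn into an error:
UPDATE parent SET val = val + 1 WHERE id = 1;
ROLLBACK TO s1;
COMMIT;
```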
Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.
On Thursday, February 8, 2007, at 16:32, Bruce Momjian wrote: Log Message: --- Add lock matrix to documentation.

This needs some revisions. The table needs to be mentioned somewhere in the text, so the reader knows when or why to refer to it. Also, the cryptic abbreviations need to be expanded or explained. And then the concept of lock compatibility, as the table puts it, is not used anywhere else in the documentation. The table should be put in terms of conflicts instead.

-- Peter Eisentraut http://developer.postgresql.org/~petere/
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
From: Magnus Hagander [EMAIL PROTECTED]

Right. Which is why you're likely to see better performance if you keep shared buffers smaller. There is something in dealing with it that's slow on win32, per reports from the field. It needs to be investigated further... We've had reports that it's slow with large shared_buffers, yes.

That's shocking news. I'm sad. I wonder whether the field you are talking about set Windows to use more memory for programs than for the filesystem cache, which is selectable from the [System] applet of Control Panel (oh, I wonder how my machine is set in this respect... I have to check.) If the filesystem cache is preferred, the following scenario may be possible:

1. PostgreSQL tries to read data from disk into the database cache.
2. The kernel tries to allocate filesystem buffers by paging out PostgreSQL's memory (possibly shared buffers).
3. PostgreSQL finds data requested by its clients in the database cache, and tries to get it from memory.
4. But the shared buffers are paged out, and page-ins happen.

Are you sure you're not running this on, for example, IDE disks with a write cache that lies? Windows will write through that write cache even if the disk lies, whereas most Linux versions won't. At least that used to be the case not too long ago, but there has also been talk about fixing that in Linux, so maybe that's done...

I'm using a PC server whose disks are all SCSI. It has no IDE disk.

Also note that when you run pgbench on the local machine, you take a much higher hit from the fact that context switching between processes is a lot more expensive on Windows than it is on Linux. But it shouldn't be big enough to explain the huge difference you had in your test.

Yes, I suspect it, too. Incidentally, Oracle uses one multi-threaded server process on Windows, while it employs a multi-process architecture elsewhere. SQL Server is of course multi-threaded. SRA's original PostgreSQL for Windows (based on 7.x) was also multi-threaded.
Re: [HACKERS] Proposal: Commit timestamp
On 2/8/2007 11:41 PM, Richard Troy wrote: On Thu, 8 Feb 2007, Joshua D. Drake wrote: Well how deep are we talking here? My understanding of what Jan wants to do is simple. Be able to declare which triggers are fired depending on the state of the cluster. In Jan's terms, the Origin or Subscriber. In Replicator terms, the Master or Slave. This is useful because I may have a trigger on the Master and the same trigger on the Slave. You do not want the trigger to fire on the Slave because we are doing data replication. In short, we replicate the result, not the action. However, you may want triggers that are on the Slave to fire separately. A reporting server that generates materialized views is a good example. Don't tie up the Master with what a Slave can do.

It'd be great if Jan considers the blending of replication;

Please elaborate. I would really like to get all you can contribute. I have no clue what got you into what you are doing here.

Jan, some sleep now and then might be helpful to your public disposition.

Richard, don't embarrass Bruce. He doesn't need your help. I have been with this project and have known Bruce Momjian for more than 10 years. Every now and then, Bruce and I get into some sort of eventually publicly visible dispute that doesn't really mean much. I'll probably spend next Christmas with him and his family again, play a few rounds of backgammon with Wilma (whom I really owe a revenge), hopefully don't interfere too much with Christine's work (especially when it involves handling food over a white carpet), and none of us will even remember this crap. Our friendship has been through some real tests. Any real problem we would have, we'd never discuss here. We would just meet and talk.

Jan

-- Jan Wieck
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
Re: [HACKERS] Archive log compression keeping physical log available in the crash recovery
Our evaluation result is as follows:

Database size: 2GB
WAL size (after 10-hour pgbench run): 48.3GB
gzipped size: 8.8GB
removal of the physical log: 2.36GB
full_page_writes=off log size: 2.42GB

I'm planning to do archive log size evaluation with other benchmarks such as DBT-2 as well.

Looks promising :-) Did you use the standard 5-minute checkpoint_timeout? Very nice would be a run with checkpoint_timeout increased to 30 min, because that is what you would tune if you are concerned about full-page overhead.

Andreas
Re: [HACKERS] Proposal: Commit timestamp
On 2/7/2007 7:13 AM, José Orlando Pereira wrote: On Saturday 03 February 2007, Bruce Momjian wrote: Jan Wieck wrote: I don't have any such paper and the proof of concept will be the implementation of the system. I do however see enough resistance against this proposal to withdraw the commit timestamp at this time. The new replication system will therefore require the installation of a patched, non-standard PostgreSQL version, compiled from sources cluster-wide in order to be used. I am aware that this will dramatically reduce its popularity, but it is impossible to develop this essential feature as an external module. I thank everyone for their attention.

Going and working on it on your own doesn't seem like the proper solution. I don't see people objecting to adding it, but they want it to work, which I am sure you want too. You have to show how it will work and convince others of that, and then you have a higher chance it will work, and be in the PostgreSQL codebase.

Hi, would it be possible to solve the problem using the GORDA on-commit hook? Jan would be able to reliably obtain a commit timestamp with the desired semantics and store it in a regular table within transaction boundaries.

I am not sure, I would have to look at what exactly that hook provides. The key to a Lamport timestamp is that it is advancing in commit order (plus some other things ... of course). If the hook can guarantee that the calls are made always in commit order, serialized without any race condition possible, it would probably be suitable.

Jan

PostgreSQL would not have to commit to a specific timestamp semantics and the patch is quite small. Regards,
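To make the ordering requirement concrete, here is a sketch of the kind of record such a hook could maintain, assuming (hypothetically) that the hook runs serialized at commit time; the commit_log table and its columns are invented for illustration and are not part of either proposal:

```sql
CREATE TABLE commit_log (
    commit_seq bigserial PRIMARY KEY,  -- assigned in commit order: the Lamport property
    xid        bigint NOT NULL,        -- the committing transaction
    commit_ts  timestamptz NOT NULL    -- wall-clock timestamp; ties broken by commit_seq
);

-- Executed by the hook, under a serializing lock, just before commit:
-- INSERT INTO commit_log (xid, commit_ts)
--     VALUES (<current transaction id>, clock_timestamp());
```

If two hook invocations could race, commit_seq would no longer be guaranteed to advance in commit order, which is exactly the property Jan says the hook must provide.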
Re: [HACKERS] Variable length varlena headers redux
Bruce Momjian wrote: Uh, I thought the approach was to create type-specific in/out functions, and add casting so that every time they were referenced, they would expand to a varlena structure in memory.

Oh, one more thing. You are going to need to teach the code that walks through a tuple's attributes about the short header types. I think you should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro code there too. (As an example, see the macro att_addlength().) I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space.

--- Gregory Stark wrote:

I've been looking at this again and had a few conversations about it. This may be easier than I had originally thought, but there's one major issue that's bugging me. Do you see any way to avoid having every user function everywhere use a new macro API instead of VARDATA/VARATT_DATA and VARSIZE/VARATT_SIZEP? The two approaches I see are either:

a) To have two sets of macros, one of which, VARATT_DATA and VARATT_SIZEP, are for constructing new tuples and behave exactly as they do now. So you always construct a four-byte header datum. Then in heap_form*tuple we check if you can use a shorter header and convert. VARDATA/VARSIZE would be for looking at existing datums and would interpret the header bits. This seems very fragile, since one stray call site using VARATT_DATA to find the data in an existing datum would cause random bugs that only occur rarely in certain circumstances. It would even work as long as the size is filled in with VARATT_SIZEP first, which it usually is, but fail if someone changes the order of the statements.

or b) Throw away VARATT_DATA and VARATT_SIZEP and make all user functions everywhere change over to a new macro API. That seems like a pretty big burden. It's safer, but means every contrib module would have to be updated and so on.
I'm hoping I'm missing something and there's a way to do this without breaking the API for every user function.

-- Start of included mail From: Tom Lane [EMAIL PROTECTED] To: Gregory Stark [EMAIL PROTECTED] cc: Gregory Stark [EMAIL PROTECTED], Bruce Momjian [EMAIL PROTECTED], Peter Eisentraut [EMAIL PROTECTED], pgsql-hackers@postgresql.org, Martijn van Oosterhout kleptog@svana.org Subject: Re: [HACKERS] Fixed length data types issue Date: Mon, 11 Sep 2006 13:15:43 -0400

Gregory Stark [EMAIL PROTECTED] writes: In any case it seems a bit backwards to me. Wouldn't it be better to preserve bits in the case of short length words where they're precious rather than long ones? If we make 0xxx the 1-byte case it means ...

Well, I don't find that real persuasive: you're saying that it's important to have a 1-byte not 2-byte header for datums between 64 and 127 bytes long. Which is by definition less than a 2% savings for those values. I think it's more important to pick bit patterns that reduce the number of cases heap_deform_tuple has to think about while decoding the length of a field --- every "if" in that inner loop is expensive.

I realized this morning that if we are going to preserve the rule that 4-byte-header and compressed-header cases can be distinguished from the data alone, there is no reason to be very worried about whether the 2-byte cases can represent the maximal length of an in-line datum. If you want to do 16K inline (and your page is big enough for that) you can just fall back to the 4-byte-header case. So there's no real disadvantage if the 2-byte headers can only go up to 4K or so. This gives us some more flexibility in the bitpattern choices.

Another thought that occurred to me is that if we preserve the convention that a length word's value includes itself, then for a 1-byte header the bit pattern 1000 is meaningless --- the count has to be at least 1.
So one trick we could play is to take over this value as the signal for "toast pointer follows", with the assumption that the tuple-decoder code knows a priori how big a toast pointer is. I am not real enamored of this, because it certainly adds one case to the inner heap_deform_tuple loop and it'll give us problems if we ever want more than one kind of toast pointer. But it's a possibility.

Anyway, a couple of encodings that I'm thinking about now involve limiting uncompressed data to 1G (same as now), so that we can play with the first 2 bits instead of just 1:

00xx  4-byte length word, aligned, uncompressed data (up to 1G)
01xx  4-byte length word, aligned, compressed data (up to 1G)
100x  1-byte length word,
[HACKERS] Database backup mechanism
Hi Folks, We have a requirement to deal with large databases, of the size of terabytes, when we go into production. What is the best database back-up mechanism, and what are the possible issues? pg_dump can back up a database, but the dump file is limited by the OS file-size limit. What about the option of compressing the dump file? How much time does it generally take for large databases? I heard that it could take way too long (even a day or more). I haven't tried it out, though. What about taking a zipped back-up of the database directory? We tried this out, but the checkpoint data in the pg_xlog directory is also being backed up. Since these logs keep on increasing from day 1 of database creation, the back-up size is increasing drastically. Can we back up certain subdirectories without loss of information or consistency? Any quick comments/suggestions in this regard would be very helpful. Thanks in advance, Ravi Kumar Mandala
Re: [HACKERS] Proposal: Commit timestamp
Richard Troy wrote: It'd be great if Jan considers the blending of replication; any given DB instance shouldn't be only a master/originator or only a slave/subscriber. A solution that lets you blend replication strategies in a single db is, from my point of view, very important. Perhaps if more people read Jan's posts he wouldn't be so frustrated. http://archives.postgresql.org/pgsql-hackers/2007-01/msg01302.php He clearly describes that the master/slave setting is per session, not per database. Kris Jurka
Re: [HACKERS] doxygen.postgresql.org
Bruce Momjian wrote: Luke Lonergan wrote: Yay! This rocks IMO, but I'm a borderline PHB so what do I know ;-) You think, Oh, this will make my developers more productive. :-) Well, I just had a university contact thank us for setting it up. So, thanks! --Josh Berkus
Re: [HACKERS] Proposal: Commit timestamp
On Feb 8, 2007, at 8:41 PM, Richard Troy wrote: It'd be great if Jan considers the blending of replication; any given DB instance shouldn't be only a master/originator or only a slave/subscriber. A solution that lets you blend replication strategies in a single db is, from my point of view, very important.

It might be constructive to define what a minimal complete set of replication primitives actually is, in addition to which ones should be implemented. In addition to master/slave models, you have Paxos algorithms and dynamic reconfiguration models in the literature that can utilize many of the same primitives but which are very different in implementation. I see the value of Jan's proposal, but perhaps it would be better to step back and make some assertions about the nature of the core capabilities that will be supported in some broader picture. Having a theoretically (mostly) complete set of usable primitives would be an incredibly powerful feature set. Cheers, J. Andrew Rogers [EMAIL PROTECTED]
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
On Fri, Feb 09, 2007 at 01:06:13PM +0900, Takayuki Tsunakawa wrote: On Thu, Feb 08, 2007 at 09:50:26PM +0900, Takayuki Tsunakawa wrote: When I try to start PostgreSQL 8.2.1 on Windows 2003 Server with shared_buffers=1024MB, I get the following error messages in the Event Log (with log_min_messages=debug5) and can't start PostgreSQL:

Is this for testing, or for production? From what I've heard, you would normally never want that much shared memory - I've seen more reports that you should keep it as low as possible, really. For performance reasons.

For testing. I wanted to place all data in shared buffers to eliminate reads from disk while I run pgbench repeatedly (actually most reads should come from the kernel cache, though.)

Right. Which is why you're likely to see better performance if you keep shared buffers smaller. There is something in dealing with it that's slow on win32, per reports from the field. It needs to be investigated further...

Does PostgreSQL for Windows have any problem when using a large database cache, unlike the UNIX versions?

We've had reports that it's slow with large shared_buffers, yes.

I'm excited about your current great work to enable building all of PostgreSQL with MSVC. I thought you were aiming at making PostgreSQL 64-bit on Windows in the near future (though you may not have said so on the mailing list.) I'm afraid MinGW will not extend to 64-bit (for x64 and Itanium), at least reliably and immediately, due to the difference of data model -- 'long' is still 32-bit in 64-bit applications on Windows. I thought Magnus-san got worried about it and started the activity of completely switching to MSVC.

Well, that's one of the reasons, yes. But being able to build 64-bit won't automatically mean that larger shared buffers is the way to go.

BTW, the current PostgreSQL for Windows is very slow, isn't it? I compared the performance of PostgreSQL 8.2.x for Linux (RHEL4 for x86, kernel 2.6.x) and Windows Server 2003.
I ran 'pgbench -c32 -t500' on the same machine with the same disk layout for data files and WAL, i.e. they are stored on separate disks. The settings in postgresql.conf are the same, except for wal_sync_method -- it is set to open_sync on Linux and open_datasync on Windows, because they are the best for each platform. The Linux version shows 1100 tps, but the Windows version shows only 450 tps.

It's certainly slower than on Linux, but that's a larger difference than I've usually seen. Are you sure you're not running this on, for example, IDE disks with a write cache that lies? Windows will write through that write cache even if the disk lies, whereas most Linux versions won't. At least that used to be the case not too long ago, but there has also been talk about fixing that in Linux, so maybe that's done...

Also note that when you run pgbench on the local machine, you take a much higher hit from the fact that context switching between processes is a lot more expensive on Windows than it is on Linux. But it shouldn't be big enough to explain the huge difference you had in your test.

//Magnus
Re: [HACKERS] Archive log compression keeping physical log available in the crash recovery
Further information about the following evaluation. Pgbench throughput was as follows:

Full WAL archiving (full_page_writes=on), 48.3GB archive: 123 TPS
Gzip WAL compression, 8.8GB archive: 145 TPS
Physical log removal, 2.36GB archive: 148 TPS
full_page_writes=off, 2.42GB archive: 161 TPS

Koichi Suzuki wrote: Sorry for the late response; gzip can reduce the archive log size to about one fourth. My point is that it can still be large enough. Removing physical log records (by replacing them with logical log records) from the archive log will shrink the size of the archive log to one twentieth, in the case of a pgbench test of about ten hours (3,600,000 transactions) with a database size of about 2GB. In the case of gzip, maybe because of the higher CPU load, total throughput for gzip is less than just copying WAL to the archive. In our case, throughput seems to be slightly higher than just copying (preserving the physical log) or gzip. I'll gather the measurement results and try to post them. The size of the archive log seems not to be affected by the size of the database, but just by the number of transactions. In the case of full_page_writes=on and full_page_compress=on, compressed archive log size seems to depend only on the number of transactions and the transaction characteristics.

Our evaluation result is as follows:

Database size: 2GB
WAL size (after 10-hour pgbench run): 48.3GB
gzipped size: 8.8GB
removal of the physical log: 2.36GB
full_page_writes=off log size: 2.42GB

The reason why the archive log size in our case is slightly smaller than with full_page_writes=off is that we remove not only the physical logs but also each page header and the dummy part at the tail of each log segment. Further, we can apply gzip to this archive (2.36GB). The final size is 0.75GB, less than one sixtieth of the original WAL. The overall duration to gzip the WAL (48.3GB to 8.8GB) was about 4000 sec, and our compression to 2.36GB needed about 1010 sec, slightly less than a plain cat command (1386 sec).
When gzip is combined with our compression (48.3GB to 0.75GB), the total duration was about 1330 sec. This shows that physical log removal is a good selection for the following cases: 1) you need the same crash recovery possibility as full_page_writes=on, and 2) you need to shrink the size of the archive log for a longer period of storage. Of course, if we care about crash recovery on a PITR slave, we still need physical log records in the archive log. In this case, because the archive log is not intended to be kept long, its size will not be an issue. I'm planning to do archive log size evaluation with other benchmarks such as DBT-2 as well. Materials for this have already been posted to HACKERS and PATCHES. I hope you try this.

Jim Nasby wrote: I thought the drive behind full_page_writes = off was to reduce the amount of data being written to pg_xlog, not to shrink the size of a PITR log archive. ISTM that if you want to shrink a PITR log archive you'd be able to get good results by (b|g)zip'ing the WAL files in the archive. A quick test on my laptop shows over a 4x reduction in size. Presumably that'd be even larger if you increased the size of WAL segments.

On Jan 29, 2007, at 2:15 AM, Koichi Suzuki wrote: This is a proposal for archive log compression keeping physical log in WAL. In PostgreSQL 8.2, the full_page_writes option came back to cut the physical log out of both WAL and archive log. To deal with partial writes during the online backup, physical log is written only during the online backup. Although this dramatically reduces the log size, it can risk the crash recovery. If any page is inconsistent because of a fault, crash recovery doesn't work, because full page images are necessary to recover the page in such a case. For critical use, especially in commercial use, we don't like to risk the crash recovery chance, while reducing the archive log size will be crucial too for larger databases. WAL size itself may be less critical, because WAL segments are reused cyclically.
Here, I have a simple idea to reduce the archive log size while keeping the physical log in xlog:

1. Create a new GUC: full_page_compress.
2. Turn on both full_page_writes and full_page_compress: the physical log will be written to WAL at the first write to a page after a checkpoint, just as with conventional full_page_writes ON.
3. Unless the physical log is written during the online backup, it can be removed from the archive log. One bit in XLR_BKP_BLOCK_MASK (XLR_BKP_REMOVABLE) is available to indicate this (out of four, only three of them are in use) and this mark can be set in XLogInsert().

With both full_page_writes and full_page_compress on, both the logical log and the physical log will be written to WAL with the XLR_BKP_REMOVABLE flag on. Having both physical and logical log in the same WAL is not harmful to crash recovery. In crash recovery, the physical log is used if it's available. The logical log is used in archive recovery, as the corresponding physical log
Re: [HACKERS] HOT for PostgreSQL 8.3
On 2/9/07, Simon Riggs [EMAIL PROTECTED] wrote: On Wed, 2007-02-07 at 14:17 -0500, Tom Lane wrote: ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from "remove these TIDs when you find them", there could be "replace these TIDs with those TIDs when you find them". This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. So if you can solve the former, I think you can make this work too.

This is looking like the best option out of the many, since it doesn't have any serious restrictions or penalties. Let's see what Pavan thinks, since he's been working on this aspect.

ISTM that there are two related issues that we need to solve to make progress:

- We must make de-HOTifying or CHILLing crash safe
- Concurrent index scans should work correctly with CHILLing operations

I think the first issue can be addressed along the lines of what Heikki suggested. We can CHILL one tuple at a time. I am thinking of a two-step process. In the first step, the root tuple and the heap-only tuple (which needs CHILLing) are marked with a special flag, CHILL_IN_PROGRESS. This operation is WAL-logged. We then insert appropriate index entries for the tuple under consideration. In the second step, the HEAP_UPDATED_ROOT and HEAP_ONLY_TUPLE flags on the heap tuples are adjusted and the CHILL_IN_PROGRESS flags are cleared. During normal operations, if the CHILL_IN_PROGRESS flag is found set, we might need to do some more work to figure out whether the index insert operations were successful or not. If we find that there are missing index entries for the tuple under consideration for CHILLing, then those could be added now and the flags set/reset appropriately.

The second problem of concurrent index scans seems a bit more complex.
We need a mechanism so that no tuples are missed and no tuples are returned twice. Since CHILLing of a tuple adds a new access path to the tuple from the index, a concurrent index scan may return a tuple twice. How about grabbing an AccessExclusiveLock during the CHILLing operation? This would prevent any concurrent index scans. Since CHILLing of a large table can take a long time, the operation can be spread across time with periodic acquire/release of the lock. This would prevent starvation of other backends. Since CHILLing is required only for CREATE INDEX and stub cleanup, I am assuming that it's OK for it to be lazy in nature. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] HOT for PostgreSQL 8.3
On Thu, 2007-02-08 at 14:47 +, Heikki Linnakangas wrote: However, the easiest solution would be to make CREATE INDEX wait until the old tuple is dead. That should be ok at least for concurrent CREATE INDEX, because it already has that kind of a wait between the 1st and 2nd phase.

I'm not sure this part of the idea is possible; the rest sounded good. Looking at DefineIndex(), the wait happens only for transactions that already have a lock on the table being indexed, which may not be very many. AFAICS the ref page for CREATE INDEX CONCURRENTLY isn't fully accurate (any more?) when it says [CREATE INDEX] "must wait for all existing transactions to terminate".

Waiting until an arbitrary Xid dies could be deadlock-prone, if the lock isn't taken carefully. Imagine a pg_dump coming towards you and then waiting on the locked table. You'd need to wait at the beginning of the command, before locks were taken. However, that would mean CREATE INDEX would only be possible outside of transaction blocks, which I don't think is acceptable. I wanted this for VACUUM FULL also, but the same problem exists.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Variable length varlena headers redux
Bruce Momjian [EMAIL PROTECTED] writes: Bruce Momjian wrote: Uh, I thought the approach was to create type-specific in/out functions, and add casting so that every time they were referenced, they would expand to a varlena structure in memory.

Are you talking about actual casts? Because that would lead to all kinds of strange places with indexes and function lookups and so on. Or are you just talking about code in the macro API to datum?

Oh, one more thing. You are going to need to teach the code that walks through a tuple's attributes about the short header types. I think you should set pg_type.typlen = -3 (vs. -1 for varlena) and put your macro code there too. (As an example, see the macro att_addlength().)

I thought of doing this. It would let us, for example, treat text/varchar, bpchar, and numeric specially but leave other data types unchanged. That does help somewhat, but unfortunately text is the problem case. There's tons of code that generates text without using textin. All of pgcrypto, for example.

I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space.

I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype-specific code generate 4-byte headers, much as you describe. However, that doesn't get us away from having VARDATA/VARSIZE aware of the new headers, since heap_deform*tuple and the other entry points which extract individual attributes return pointers to the datum in the tuple. They can't expand the header to a 4-byte header on the fly. I thought of doing it in DETOAST_DATUM on the theory that everyone's going to be calling it on their arguments. However, there are other cases than just arguments. Other functions might call, say, text_concat() and then call VARDATA() on the result.
Even if we only ever generate short headers on heap_form*tuple and always expand them on DETOAST we could have code that passes around tuples that it knows are entirely in memory and therefore not toasted. I'm thinking of plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour but it still seems sort of fragile to me. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Proposal: TABLE functions
Hello, it can be simpler than I thought. I need only one flag, and if it's true then I don't create language variables for OUT params. But I need one more column in pg_proc. Currently a lot of columns in pg_proc are bool. What about one binary column for other options? I hope next versions can support autonomous transactions, which need a flag too. Regards Pavel Stehule Pavel Stehule wrote: Hello, Currently PostgreSQL supports set-returning functions. ANSI SQL 2003 comes with a new type of function - table functions. With this syntax CREATE FUNCTION foo() RETURNS TABLE (c1 t1, ... ) Yeah this should be pretty easy because a table is just a composite type. You can already do this: CREATE TABLE foo (id bigint, first_name text); CREATE FUNCTION foo() RETURNS SETOF foo... The equivalent PostgreSQL statements are: CREATE TYPE tmptype AS (c1 t1, ...) CREATE FUNCTION ... RETURNS SETOF tmptype AS ... All necessary infrastructure is done. The implementation probably needs only small changes in the parser. This feature doesn't need any changes in SQL functions. I expect they will be more readable and consistent. CREATE OR REPLACE FUNCTION foo(f integer) RETURNS TABLE(a int, b int) AS $$ SELECT a, b FROM footab WHERE a f; $$ LANGUAGE sql; plpgsql RETURN has to be enhanced for table expressions. CREATE OR REPLACE FUNCTION foo(f integer) RETURNS TABLE(a int, b int) AS -- they are not variables! $$ BEGIN RETURN TABLE(SELECT a, b -- it's secure, a,b are not variables FROM footab WHERE a f); END; $$ LANGUAGE plpgsql; RETURN NEXT can be used without changes. This feature doesn't allow combining the RETURN TABLE and RETURN NEXT statements. Table functions can have only IN arguments. Advantages: * conformance with ANSI SQL 2003 * less probability of collision between varnames and colnames Regards Pavel Stehule
-- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/
Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.
On Fri, 9 Feb 2007, Peter Eisentraut wrote: On Thursday, 8 February 2007 16:32, Bruce Momjian wrote: Log Message: --- Add lock matrix to documentation. This needs some revisions. The table needs to be mentioned somewhere in the text, so the reader knows when or why to refer to it. Also, the cryptic abbreviations need to be expanded or explained. And then the concept of lock compatibility, as the table puts it, is not used anywhere else in the documentation. The table should be put in terms of conflicts instead. Another version with expanded abbreviations is http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove UPDATE EXCLUSIVE. While compatibility matrix is a commonly accepted term, I agree that using conflicts would be better in the context of our docs. Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83
Re: [HACKERS] Database backup mechanism
[EMAIL PROTECTED] wrote: We have a requirement to deal with large databases, terabytes in size, when we go into production. What is the best database back-up mechanism and what are the possible issues? It depends. Make sure you read Chapter 23. Backup and Restore of the user manual: http://www.postgresql.org/docs/8.2/interactive/backup.html It discusses pg_dump and restore, as well as file system level backup. You'll probably want to set up continuous archiving, which allows you to take a file-system level backup without shutting down the database. What about taking a zipped back-up of the database directory? We tried this out but the checkpoint data in the pg_xlog directory is also being backed up. Since these logs keep on increasing from day 1 of database creation, the back-up size is increasing drastically. The number of WAL files in the pg_xlog directory is controlled by the checkpoint_segments configuration parameter. Can we back up certain subdirectories without loss of information or consistency..? No. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Database backup mechanism
I would also be interested in any creative ways to reduce the size and time to backup databases/clusters. We were just having a conversation about this yesterday. We were mulling over things like using rsync to only backup files in the database directory tree that actually changed. Or maybe doing a selective backup of files based on modified times, etc, but were unsure if this would be a safe, reliable way to backup a reduced set of data. Doug Knight WSI Inc. Andover, MA On Fri, 2007-02-09 at 12:45 +0530, [EMAIL PROTECTED] wrote: Hi Folks, We have a requirement to deal with large databases, terabytes in size, when we go into production. What is the best database back-up mechanism and what are the possible issues? pg_dump can back up a database but the dump file is limited by the OS file-size limit. What about the option of compressing the dump file? How much time does it generally take for large databases? I heard that it would be way too long (even one or two days). I haven't tried it out, though. What about taking a zipped back-up of the database directory? We tried this out but the checkpoint data in the pg_xlog directory is also being backed up. Since these logs keep on increasing from day 1 of database creation, the back-up size is increasing drastically. Can we back up certain subdirectories without loss of information or consistency..? Any quick comments/suggestions in this regard would be very helpful. Thanks in advance, Ravi Kumar Mandala
Re: [HACKERS] HOT for PostgreSQL 8.3
Tom Lane wrote: ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there as well, but if the old and new tid happen to be on a different bitmap page, it requires some care to avoid deadlocks. Also, we'd need more work mem for vacuum. This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. Not exactly. De-HOT-ifying, or chilling, a child means inserting new index entries. But if we're just replacing the tids from the existing index entries, it's ok if we crash after replacing some but not all of them. The next vacuum would replace the rest of the pointers, and remove the old root tuple. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Archive log compression keeping physical log available in the crash recovery
On Fri, Feb 09, 2007 at 01:00:10PM +0900, Koichi Suzuki wrote: Further, we can apply gzip to this archive (2.36GB). Final size is 0.75GB, less than one sixtieth of the original WAL. Note that if you were compressing on the fly, you'll have to tell gzip to regularly flush its buffers to make sure all the data actually hits disk. That cuts into your compression ratio... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] Hierarchical Queries--Status
On Thu, 2007-02-08 at 20:49 -0500, Bruce Momjian wrote: Who is working on this item? Jonah was trying to complete this for 8.3, but I believe that he has handed it on to Gregory Stark - I think http://archives.postgresql.org/pgsql-hackers/2007-01/msg01586.php is the latest update. Kind regards, Mark.
Re: [HACKERS] Database backup mechanism
[EMAIL PROTECTED] wrote: Hi Folks, We have a requirement to deal with large databases, terabytes in size, when we go into production. What is the best database back-up mechanism and what are the possible issues? pg_dump can back up a database but the dump file is limited by the OS file-size limit. What about the option of compressing the dump file? How much time does it generally take for large databases? I heard that it would be way too long (even one or two days). I haven't tried it out, though. What about taking a zipped back-up of the database directory? We tried this out but the checkpoint data in the pg_xlog directory is also being backed up. Since these logs keep on increasing from day 1 of database creation, the back-up size is increasing drastically. Can we back up certain subdirectories without loss of information or consistency..? Any quick comments/suggestions in this regard would be very helpful. Please ask in the correct forum, either pgsql-general or pgsql-admin. This list is strictly for discussion of development of postgres, not usage questions. (If all you need is a pg_dump backup, maybe you could just pipe its output to something like 'split -a 5 -b 1000m - mybackup') cheers andrew
Re: [HACKERS] Variable length varlena headers redux
Greg Stark wrote: Bruce Momjian [EMAIL PROTECTED] writes: Bruce Momjian wrote: Uh, I thought the approach was to create type-specific in/out functions, and add casting so every time they were referenced, they would expand to a varlena structure in memory. Are you talking about actual casts? Because that would lead to all kinds of strange places with indexes and function lookups and so on. Or are you just talking about code in the macro api to datum? I am thinking of auto-casts, sort of like how varchar, char, and text are all internally treated as interchangeable. Oh, one more thing. You are going to need to teach the code that walks through a tuple's attributes about the short header types. I think you should set pg_type.typlen = -3 (vs -1 for varlena) and put your macro code there too. (As an example, see the macro att_addlength().) I thought of doing this. It would let us, for example, handle text/varchar, bpchar, and numeric specially but leave other data types unchanged. That does help somewhat but unfortunately text is the problem case. There's tons of code that generates text without using textin. All of pgcrypto for example. Well, TEXT can't use short headers. I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space. I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype specific code generate 4-byte headers much as you describe. Yep. However that doesn't get us away from having VARDATA/VARSIZE aware of the new headers. Since heap_deform*tuple and the other entry points which extract individual attributes return pointers to the datum in the tuple, they can't expand the header to a 4-byte header on the fly. Yep, you are going to have to special-case those call points to test typlen and use your short macros there if == -3. 
I thought of doing it in DETOAST_DATUM on the theory that everyone's going to be calling it on their arguments. However there are other cases than just arguments. Other functions might call, say, text_concat() and then call VARDATA() on the result. Right, I think all the in-memory stuff has to be varlena. Even if we only ever generate short headers on heap_form*tuple and always expand them on DETOAST we could have code that passes around tuples that it knows are entirely in memory and therefore not toasted. I'm thinking of plpgsql here primarily. Perhaps it would be enough to outlaw this behaviour but it still seems sort of fragile to me. Yea, we might need some cleanup, but the cleanup is just to do things properly. I am unclear on the case you are describing. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.
On 2/9/07, Oleg Bartunov oleg@sai.msu.su wrote: On Fri, 9 Feb 2007, Peter Eisentraut wrote: On Thursday, 8 February 2007 16:32, Bruce Momjian wrote: Log Message: --- Add lock matrix to documentation. This needs some revisions. The table needs to be mentioned somewhere in the text, so the reader knows when or why to refer to it. Also, the cryptic abbreviations need to be expanded or explained. And then the concept of lock compatibility, as the table puts it, is not used anywhere else in the documentation. The table should be put in terms of conflicts instead. Another version with expanded abbreviations is http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove UPDATE EXCLUSIVE. While compatibility matrix is a commonly accepted term, I agree that using conflicts would be better in the context of our docs. How about changing 'current lock mode' to 'opposing lock mode'? 'current' kind of suggests that you are escalating your own lock. merlin
Re: [HACKERS] [PATCHES] [pgsql-patches] Phantom Command IDs, updated patch
Zeugswetter Andreas ADI SD [EMAIL PROTECTED] writes: So when a RI check locks a parent, you would not be able to update the parent in a later subtrans. I can imagine that the error would be a problem in a select for update loop, because there you usually want to update the row. No, it would not, because select for update would acquire exclusive lock in the first place. regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Add lock matrix to documentation.
Oleg Bartunov wrote: On Fri, 9 Feb 2007, Peter Eisentraut wrote: On Thursday, 8 February 2007 16:32, Bruce Momjian wrote: Log Message: --- Add lock matrix to documentation. This needs some revisions. The table needs to be mentioned somewhere in the text, so the reader knows when or why to refer to it. Also, the cryptic abbreviations need to be expanded or explained. And then the concept of lock compatibility, as the table puts it, is not used anywhere else in the documentation. The table should be put in terms of conflicts instead. Another version with expanded abbreviations is http://mira.sai.msu.su/~megera/pgsql/lockmatrix/c2.html, if we remove UPDATE EXCLUSIVE. While compatibility matrix is a commonly accepted term, I agree that using conflicts would be better in the context of our docs. If UE is moved 2 or 3 places down/right, the matrix will be closer to triangular. It would look better with non-conflicts being blank (or &nbsp; in HTML) rather than 'O', I think - the conflicts will stand out more. cheers andrew
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
On Fri, Feb 09, 2007 at 07:31:33PM +0900, Takayuki Tsunakawa wrote: 1. PostgreSQL tries to read data from disk into database cache. 2. The kernel tries to allocate filesystem buffers by paging out PostgreSQL's memory (possibly shared buffers). 3. PostgreSQL finds data requested by its clients in database cache, and tries to get it in memory. 4. But the shared buffers are paged out, and page-ins happen. One of the biggest issues with large shared memory segments is that we can't be sure they'll actually stay in memory. So your shared memory block should not be too much larger than your working set. So yes, if you make a really large segment, the problem you describe may happen. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate.
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
Takayuki Tsunakawa [EMAIL PROTECTED] writes: I wonder whether the field you are talking about sets Windows to use more memory for programs than for filesystem cache, which is selectable from the [System] applet of Control Panel (Oh, I wonder how my machine is set in this respect... have to check.) If filesystem cache is preferred, the following scenario may be possible: 1. PostgreSQL tries to read data from disk into database cache. 2. The kernel tries to allocate filesystem buffers by paging out PostgreSQL's memory (possibly shared buffers). 3. PostgreSQL finds data requested by its clients in database cache, and tries to get it in memory. 4. But the shared buffers are paged out, and page-ins happen. It's certainly true that if shared_buffers is large enough to make the kernel try to swap out parts of the shared buffer array, then you've got a counterproductive situation resulting in net *more* I/O than if you'd used a smaller setting. On some Unixen shared memory is implicitly locked in RAM, and on others it's possible to request locking it (though I'm not sure we try to at the moment). Perhaps it's always swappable on Windows? Or maybe Windows is just more eager to swap it out? regards, tom lane
Re: [HACKERS] [PATCHES] How can I use 2GB of shared buffers on Windows?
On Fri, Feb 09, 2007 at 09:49:25AM -0500, Tom Lane wrote: Takayuki Tsunakawa [EMAIL PROTECTED] writes: I wonder whether the field you are talking about sets Windows to use more memory for programs than for filesystem cache, which is selectable from the [System] applet of Control Panel (Oh, I wonder how my machine is set in this respect... have to check.) If filesystem cache is preferred, the following scenario may be possible: 1. PostgreSQL tries to read data from disk into database cache. 2. The kernel tries to allocate filesystem buffers by paging out PostgreSQL's memory (possibly shared buffers). 3. PostgreSQL finds data requested by its clients in database cache, and tries to get it in memory. 4. But the shared buffers are paged out, and page-ins happen. It's certainly true that if shared_buffers is large enough to make the kernel try to swap out parts of the shared buffer array, then you've got a counterproductive situation resulting in net *more* I/O than if you'd used a smaller setting. On some Unixen shared memory is implicitly locked in RAM, and on others it's possible to request locking it (though I'm not sure we try to at the moment). Perhaps it's always swappable on Windows? Or maybe Windows is just more eager to swap it out? The way it is, it is definitely always swappable. I've been thinking of digging into that, but haven't had the time. There are API calls to mark memory as non-swappable, but I'm not sure it works on shared memory the way we do it. Apart from saying that, I will refrain from speculating more on *why* it's slower with more shared memory before someone (yeah, I realise that could be me) does some actual investigation into what happens. //Magnus
Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression
Pavel Stehule wrote: OK, where are we on this patch? Without changes. This task has to be done by somebody who knows the PostgreSQL cache system better than me. How about you submit a version without any caching, but which works correctly; and we worry about optimizations later? --- Pavel Stehule wrote: Pavel Stehule [EMAIL PROTECTED] writes: This patch doesn't seem to cope with cases where the supplied tuple has the wrong number of columns, and it doesn't look like it's being careful about dropped columns either. Also, that's a mighty bizarre-looking choice of cache memory context in coerce_to_tuple ... but then again, why are you bothering with a cache at all for temporary arrays? I am sorry, Tom. But I don't understand. I can check the number of columns, of course, and I'll do it. What cache for temporary arrays do you mean? I thought that making coerce_to_tuple depend on estate->err_func was pretty bizarre, and that there was no need for any cache at all for arrays that need only live as long as the function runs. All you are saving here is a palloc/pfree cycle, which is not worth the obscurantism and risk of bugs (are you sure natts can never change?). No, the cache there is ugly. But I don't have an idea for a more effective implementation of it :-(. The first version of this patch was cleaner, and a little bit slower. This cache saves 10%. BTW, if you want this patch to make it into 8.2, it needs to be fixed and resubmitted *very* soon. I understand, but I am not able to work on it in the next four days. And I need help with it from Neil. It will be for 8.3. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Proposal: TABLE functions
Pavel Stehule [EMAIL PROTECTED] writes: it can be simpler than I thought. I need only one flag, and if it's true then I don't create language variables for OUT params. But I need one more column in pg_proc. I thought you said this was just syntactic sugar for capabilities we already had? Currently a lot of columns in pg_proc are bool. What about one binary column for other options? I hope next versions can support autonomous transactions, which need a flag too. I think stored procedures of that sort aren't functions at all, and probably don't belong in pg_proc. regards, tom lane
Re: [HACKERS] Variable length varlena headers redux
Greg Stark [EMAIL PROTECTED] writes: Bruce Momjian [EMAIL PROTECTED] writes: I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space. I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype specific code generate 4-byte headers much as you describe. I thought we had a solution for all this, namely to make the short-form headers be essentially a TOAST-compressed representation. The format with 4-byte headers is still legal but just not compressed. Anyone who fails to detoast an input argument is already broken, so there's no code compatibility hit taken. regards, tom lane
Re: [HACKERS] referential Integrity and SHARE locks
On 2/8/2007 2:46 PM, Marc Munro wrote: On Thu, 2007-08-02 at 14:33 -0500, Tom Lane wrote: Marc Munro [EMAIL PROTECTED] writes: Yes in this case, T1 must abort because the record it was going to update has disappeared from underneath it. I don't see how this is significantly different from the same race for the record if the table had no RI constraints. The only difference that I can see, is that T1 now has some locks that it must relinquish as the transaction aborts. No, the difference is there would have been no error at all before; if the record were deleted before T1 got to it then it wouldn't have attempted to update it. I really don't think you can make it work to perform updates or deletes on a record you have not yet locked. The record would be locked before the update or delete is attempted, however it would not be locked until the referential integrity constraints have succeeded in acquiring their locks. It is becoming clear to me that I am missing something but I still don't know what it is. If anyone can see it and explain it I'd really appreciate it. I think you are missing the fact that the exclusive row lock on UPDATE is taken before any triggers are fired at all, even BEFORE ROW triggers. This is necessary in order to prevent the row being updated or removed concurrently while the triggers are executing. Since BEFORE ROW triggers can modify the content of the row (including the foreign key), the RI check and lock of the referenced row cannot happen before other BR triggers are completed. In order to make your idea fly, the RI check trigger on INSERT or UPDATE would have to be fired before taking the row lock considering the NEW values for referencing columns as they are thus far. Since the row isn't locked at this time, it can change or disappear while the RI trigger is executing, so the check and lock has to be redone later with the actual row that got locked and after all BR triggers are done with it. 
Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] #
Re: [HACKERS] Database backup mechanism
Doug Knight wrote: I would also be interested in any creative ways to reduce the size and time to backup databases/clusters. We were just having a conversation about this yesterday. We were mulling over things like using rsync to only backup files in the database directory tree that actually changed. Or maybe doing a selective backup of files based on modified times, etc, but were unsure if this would be a safe, reliable way to backup a reduced set of data. Most of your virtual-layer filesystems should offer a snapshot facility that lets the database think it's writing to the files while you see a static version. You could rsync that against an older file-level copy as the base copy in a PITR backup. Note - even with a snapshot facility you need to use PITR or stop the database to get a guaranteed working copy. -- Richard Huxton Archonet Ltd
Re: [HACKERS] HOT for PostgreSQL 8.3
Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there That isn't simple (maybe not even possible) for GIN. -- Teodor Sigaev E-mail: [EMAIL PROTECTED] WWW: http://www.sigaev.ru/
Re: [HACKERS] HOT for PostgreSQL 8.3
Teodor Sigaev [EMAIL PROTECTED] writes: Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there That isn't simple (may be, even possible) from GIN. I suspect that those pushing this idea only care about btrees anyway, so one possible answer is that HOT is only possible when the table has only btree indexes --- or at least, only indexes of AMs that support the replace-these-TIDs operation. (Yet another pg_am flag...) regards, tom lane
[HACKERS] possible TODOs
From recent IRC discussion: . provide encode() for text values in addition to current bytea . provide cast from text to bytea Thoughts? cheers andrew
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 10:17 -0500, Tom Lane wrote: Teodor Sigaev [EMAIL PROTECTED] writes: Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there That isn't simple (may be, even possible) from GIN. I suspect that those pushing this idea only care about btrees anyway, so one possible answer is that HOT is only possible when the table has only btree indexes --- or at least, only indexes of AMs that support the replace-these-TIDs operation. (Yet another pg_am flag...) Well, that's me. Yes, I think b-trees-only is acceptable. Realistically, very frequent updating and full text indexing are easily separable use cases, at least into separate tables. HOT should be of use in Data Warehousing applications also, when summary tables are maintained alongside detailed data, but that also sounds like HOT and bitmap indexes would be separable at the table level without difficulty. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] possible TODOs
On Friday, 9 February 2007 16:45, Andrew Dunstan wrote: . provide encode() for text values in addition to current bytea . provide cast from text to bytea I'm not sure how exactly you want to define these, but some possible definitions will be dependent on the server encoding, which sounds troublesome. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Re: [HACKERS] possible TODOs
Andrew Dunstan [EMAIL PROTECTED] writes: From recent IRC discussion: . provide encode() for text values in addition to current bytea . provide cast from text to bytea What would these be meant to do? In particular I'm wondering if the proposed cast is meant to be a binary-compatible transformation (thereby exposing encoding issues), or if it's supposed to do the same backslash conversions as I/O would, or what. regards, tom lane
Re: [HACKERS] possible TODOs
Peter Eisentraut wrote: On Friday, 9 February 2007 16:45, Andrew Dunstan wrote: . provide encode() for text values in addition to current bytea . provide cast from text to bytea I'm not sure how exactly you want to define these, but some possible definitions will be dependent on the server encoding, which sounds troublesome. Well, the cast wasn't my suggestion, so I'll let that one slide - someone else can pick it up if they like :P . The encode() suggestion was to have it have the same effect as: encode(byteain(varcharout(textvalue)), 'encodingtype') Would it be such a worry to have that dependent on server encoding? cheers andrew
Re: [HACKERS] Re-ordering of OR conditions
Jim Nasby [EMAIL PROTECTED] writes: IF I run the following with the a 2900 condition first, the more expensive EXISTS only gets executed when needed, but if I change the order of the OR's, the EXISTS is always executed. It would be good if the optimizer could re-order the OR conditions based on estimated cost (granted, this wouldn't work very well if you've got functions in the OR, but it'd still be useful): I looked at this for a bit. It's in principle do-able but I'm not sure it's a good idea. The problem is that while AND'ed condition lists are usually fairly short and hence cheap to sort, OR'ed condition lists are not infrequently very long --- nobody blinks an eye at hundreds of items in an IN-list for instance. I'm afraid we'd waste a lot more cycles sorting than we could hope to regain. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Variable length varlena headers redux
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: Bruce Momjian [EMAIL PROTECTED] writes: I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space. I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype specific code generate 4-byte headers much as you describe. I thought we had a solution for all this, namely to make the short-form headers be essentially a TOAST-compressed representation. The format with 4-byte headers is still legal but just not compressed. Anyone who fails to detoast an input argument is already broken, so there's no code compatibility hit taken. It's not just input arguments though. A function could call DirectFunctionCall* and rightfully expect the return value not to need detoasting. I suppose this leads me to *only* generate short headers at heap_form*tuple time. Then DirectFunctionCall isn't relevant and most of the user code is perfectly safe. There could still be cases where a heaptuple is passed around in pl_exec.c or somewhere but if it's subsequently deformed whoever looks at it hopefully wouldn't be too surprised for it to be mandatory that they go through pg_detoast_datum. It'll happen as long as they use the DatumGetFoo macros anyways. It does mean that anyone going through a heap_form*tuple/heap_deform*tuple cycle may generate more copies and memory allocations than they expected. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Proposal: TABLE functions
Pavel Stehule [EMAIL PROTECTED] writes: it can be simpler than I thought. I need only one flag, and if it's true then I don't create language variables for OUT params. But I need one more column in pg_proc. I thought you said this was just syntactic sugar for capabilities we already had? My mistake. I am sorry. I have to store a flag somewhere: one bit which signals "don't use OUT arguments as function parameters". The rest is only a game in the parser. Currently a lot of the columns in pg_proc are bool. What about one binary column for such options? I expect future versions can support autonomous transactions, which need a flag too. I think stored procedures of that sort aren't functions at all, and probably don't belong in pg_proc. Why not? Some people use an ugly implementation of them in plperlu and DBI. pg_proc and the related infrastructure work well; they only lack a bit of adaptability. I think one general option byte, plus one byte reserved for language handlers, could be interesting. Regards Pavel Stehule ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Proposal: TABLE functions
Pavel Stehule [EMAIL PROTECTED] writes: I thought you said this was just syntactic sugar for capabilities we already had? My mistake. I am sorry. I have to store somewhere flag. One bit, which signalise don't use OUT arguments as function's parameters. Huh? What exactly is the meaning of the arguments then? It sounds to me like this might be better thought of as a new proargmode value, but I'm quite unsure what you're talking about ... regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 13:39 +, Heikki Linnakangas wrote: Tom Lane wrote: ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there as well, but if the old and new tid happen to be on a different bitmap page, it requires some care to avoid deadlocks. Grouped Item Indexes cope with this easily also, yes? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
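The extended index VACUUM interface Tom proposes, with both "remove these TIDs" and "replace these TIDs with those TIDs", can be sketched as a single bulk pass over an index (illustrative Python, not backend C; the entry layout and names are invented for the example):

```python
def index_bulk_update(index_entries, remove_tids, replace_map):
    # Sketch of the proposed extension: each index entry is (key, tid).
    # TIDs in remove_tids are dropped (classic vacuum removal); TIDs in
    # replace_map are swung from the dead HOT root to a surviving child.
    out = []
    for key, tid in index_entries:
        if tid in remove_tids:
            continue                                   # remove when found
        out.append((key, replace_map.get(tid, tid)))   # replace when found
    return out

entries = [("k1", (1, 1)), ("k2", (1, 4))]
# Swing k1's pointer from root (1,1) to child (1,3); remove dead (1,4).
print(index_bulk_update(entries, {(1, 4)}, {(1, 1): (1, 3)}))
```

The atomicity point in the thread is that each such swap must be atomic per entry; as Hannu notes later, crashing partway is tolerable precisely because the next vacuum can finish replacing the remaining pointers.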
Re: [HACKERS] Hierarchical Queries--Status
Mark Cave-Ayland [EMAIL PROTECTED] writes: On Thu, 2007-02-08 at 20:49 -0500, Bruce Momjian wrote: Who is working on this item? Jonah was trying to complete this for 8.3, but I believe that he has handed it onto Gregory Stark - I think http://archives.postgresql.org/pgsql-hackers/2007-01/msg01586.php is the latest update. There's also http://archives.postgresql.org/pgsql-patches/2007-02/msg00086.php -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Proposal: TABLE functions
I thought you said this was just syntactic sugar for capabilities we already had? My mistake. I am sorry. I have to store a flag somewhere: one bit which signals "don't use OUT arguments as function parameters". Huh? What exactly is the meaning of the arguments then? It sounds to me like this might be better thought of as a new proargmode value, but I'm quite unsure what you're talking about ... My basic idea was: CREATE FUNCTION aaa(IN a1, OUT a, OUT b) RETURNS SETOF RECORD AS $$ .. is similar to CREATE FUNCTION aaa(IN a1) RETURNS SETOF RECORD AS $$ from the executor's perspective there isn't any difference. But PL languages have to create only the IN variables; it's protection against identifier name collisions. With a special flag I don't need any changes in the executor, and only a small change in the PL compile routines. A special proargmode can be a solution too. Then I don't need a new column in pg_proc, but I have to modify the executor and need more changes in the output routines in the PLs. I'll go the special-proargmode way. It's a good idea. Thank You Pavel Stehule ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Re-ordering of OR conditions
On Feb 9, 2007, at 10:46 AM, Tom Lane wrote: Jim Nasby [EMAIL PROTECTED] writes: IF I run the following with the a 2900 condition first, the more expensive EXISTS only gets executed when needed, but if I change the order of the OR's, the EXISTS is always executed. It would be good if the optimizer could re-order the OR conditions based on estimated cost (granted, this wouldn't work very well if you've got functions in the OR, but it'd still be useful): I looked at this for a bit. It's in principle do-able but I'm not sure it's a good idea. The problem is that while AND'ed condition lists are usually fairly short and hence cheap to sort, OR'ed condition lists are not infrequently very long --- nobody blinks an eye at hundreds of items in an IN-list for instance. I'm afraid we'd waste a lot more cycles sorting than we could hope to regain. Do people actually do that with OR lists though? My understanding is that now IN lists are converted to arrays, so I'd think that wouldn't be an issue there. Is it easy for the planner to discern between simple OR expressions and stuff like EXISTS? If so it might be worth automatically pushing EXISTS to the end... -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOT for PostgreSQL 8.3
Simon Riggs wrote: On Fri, 2007-02-09 at 13:39 +, Heikki Linnakangas wrote: Tom Lane wrote: ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there as well, but if the old and new tid happen to be on a different bitmap page, it requires some care to avoid deadlocks. Grouped Item Indexes cope with this easily also, yes? Yes, as long as the old and the new tid point to the same page. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] [PATCHES] plpgsql, return can contains any expression
Pavel Stehule wrote: OK, where are we on this patch? Without changes. This task needs to be done by somebody who knows the PostgreSQL cache system better than I do. How about you submit a version without any caching, but which works correctly; and we worry about optimizations later? I can update and send a simple version. Regards Pavel Stehule ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Variable length varlena headers redux
Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: Bruce Momjian [EMAIL PROTECTED] writes: I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space. I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype specific code generate 4-byte headers much as you describe. I thought we had a solution for all this, namely to make the short-form headers be essentially a TOAST-compressed representation. The format with 4-byte headers is still legal but just not compressed. Anyone who fails to detoast an input argument is already broken, so there's no code compatibility hit taken. Uh. So I don't see how to make this work on a little-endian machine. If the leading bits are 0 we don't know if they're toast flags or bits of the least significant byte of a longer length. If we store all lengths in network byte order that problem goes away but then user code that does VARATT_SIZEP(datum) = len is incorrect. If we declare the in-memory format to be host byte order and the on-disk format to be network byte order then every single varlena datum needs to be copied when heap_deform*tuple runs. If we only do this for a new kind of varlena then only text/varchar/char/numeric datums would need to be copied but that's still a lot. -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
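The little-endian ambiguity Greg describes can be shown concretely (a tiny illustrative Python model of a 4-byte varlena length word, not PostgreSQL code; the 0xC0 flag mask is an assumption for the example):

```python
import struct

def first_byte(length, byte_order):
    # Lay out a 4-byte length word and look at the first byte in memory,
    # which is where any header flag bits would have to live.
    fmt = "<I" if byte_order == "little" else ">I"
    return struct.pack(fmt, length)[0]

# Little-endian: the first byte in memory is the *least* significant
# length byte, so its top bits vary with the length -- an ordinary
# length like 0xC3 looks exactly like a byte with flag bits set.
print(hex(first_byte(0xC3, "little") & 0xC0))  # nonzero: collides with flags

# Network (big-endian) order: any length below 2^30 leaves the top two
# bits of the first byte zero, so they are free to serve as flags.
print(hex(first_byte(0xC3, "big") & 0xC0))          # 0x0
print(hex(first_byte((1 << 30) - 1, "big") & 0xC0)) # still 0x0
```

This is why storing lengths in network byte order resolves the flag ambiguity but breaks code that writes `VARATT_SIZEP(datum) = len` directly, since that macro assumes a host-order length word.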
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 13:39, Heikki Linnakangas wrote: Tom Lane wrote: ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. Implementing the replace these TIDs operation atomically would be simple, except for the new bitmap index am. It should be possible there as well, but if the old and new tid happen to be on a different bitmap page, it requires some care to avoid deadlocks. Also, we'd need more work mem for vacuum. Why do we need to muck around with indexes at all? What are the problems with just shuffling the last (and only visible) tuple to replace the HOT-chain root and be done with it? Can there be some problems with seqscans moving one tuple at a time or doing revisits of the same (by TID) tuple? If there are, can't these be fixed by creative use of ctid chains from the original live one to the new live one? This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. Not exactly. De-HOT-ifying, or chilling, a child means inserting new index entries. If we can just move the tuple inside the page we can avoid even that. But if we're just replacing the tids in the existing index entries, it's ok if we crash after replacing some but not all of them. The next vacuum would replace the rest of the pointers, and remove the old root tuple. -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] HOT for PostgreSQL 8.3
Hannu Krosing [EMAIL PROTECTED] writes: What are the problems with just shuffling the last (and only visible) tuple to replace the HOT-hain root and be done with it ? ctid stops being a reliable identifier. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposal: Commit timestamp
On Fri, 9 Feb 2007, Jan Wieck wrote: [ I wrote ] It'd be great if Jan considers the blending of replication; Please elaborate. I would really like to get all you can contribute. Thanks Jan, prefaced that I really haven't read everything you've written on this (or what other people are doing, either), and that I've got a terrible flu right now (fever, etc), I'll give it a go - hopefully it's actually helpful. To wit: In general terms, blending of replication [techniques] means to me that one can have a single database instance serve as a master and as a slave (to use only one set of terminology), and as a multi-master, too, all simultaneously, letting the DBA / Architect choose which portions serve which roles (purposes). All replication features would respect the boundaries of such choices automatically, as it's all blended. In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatibility, if a schema isn't specified, it indicates that the command pertains to the entire database. At the very least, a schema division strategy for replication leverages an existing DB-component binding/dividing mechanism that most everyone is familiar with. While there are/may be database-wide, nay, installation-wide constructs as in your Commit Timestamp proposal, I don't see that there's any conflict - at least, from what I understand of existing systems and proposals to date. 
HTH, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 [EMAIL PROTECTED], http://ScienceTools.com/ ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 13:47 -0500, Tom Lane wrote: Hannu Krosing [EMAIL PROTECTED] writes: What are the problems with just shuffling the last (and only visible) tuple to replace the HOT-chain root and be done with it ? ctid stops being a reliable identifier. Yes, that sums it up. The issue can be overcome in the internals, but the only means of doing so that seemed workable required changing the output of the CTID special column. Some applications use the tid datatype directly to relocate rows within a transaction using the tidscan. The ODBC driver uses that concept to implement Updateable Cursors from the client. We can change that, but we'd break all the other ones we don't know about. This issue was one of the major contributing factors to the size of the previous HOT prototype, making it more invasive than is really desirable. Pointer-swinging avoids those issues and seems workable, even if it is a pain to have to visit the index during VACUUM. So changing CTID isn't a bridge we really need to cross, for which I'm glad. Just as a matter of record, the tuple identifier would need to include (block, itemid, xmin, cmin) to make this idea work, with the itemid being that of the root tuple and the xmin and cmin being those of the tuple in the chain that is being referenced. This would've then allowed callers to relocate a specific tuple, even when the update chain had changed between block accesses. Anyway, glad we're not going there anymore. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Proposal: Commit timestamp
On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote: Jan Wieck wrote: On 2/7/2007 10:35 PM, Bruce Momjian wrote: I find the term logical proof of its correctness too restrictive. It sounds like some formal academic process that really doesn't work well for us. Thank you. My intuition is that it might be possible to prove that _nothing_ can provide guaranteed ordering when there is disconnected operation. However, I think that the clock-based ordering Jan has described could provide _probable_ ordering under disconnected operation. I can see three variables in the equation that would determine the probability of correctness for the ordering. 1) clock drift rate between disconnected clusters 2) disconnection time 3) transaction rate on the tables, or even rows involved There are probably more. I think that if Jan implements what he's described then a very interesting follow-up would be to do the statistical analysis necessary to quantify the risk of incorrect ordering while disconnected. (I've got x ms/ms relative clock drift, and y tps. How long can I run disconnected before falling under 99.999% probability of correctly ordered transactions?) No, I _now_ understand the use case, but when the patch was posted, the use case was missing. I would like to see a repost with the patch, and a description of its use so we can all move forward on that. An additional use case for an on-commit timestamp is in the analysis of billing transactions in highly concurrent systems. For example, imagine your billing period is monthly and you have transactions which start before and end after the end-of-month. Having the on-commit timestamp for these transactions may help when attempting to reconcile between transactions and account activities. Andrew ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Proposal: Commit timestamp
Richard Troy wrote: In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatability, if a schema isn't specified, it indicates that command pertains to the entire database. I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. cheers andrew ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Proposal: Commit timestamp
On 2/9/2007 2:27 PM, Richard Troy wrote: In general terms, blending of replication [techniques] means to me that one can have a single database instance serve as a master and as a slave (to use only one set of terminology), and as a multi-master, too, all simultaneously, letting the DBA / Architect choose which portions serve which roles (purposes). All replication features would respect the boundaries of such choices automatically, as it's all blended. That is specifically what the changes to pg_trigger and pg_rewrite take into account. However much you blend different techniques, a single transaction on one server will always fall into one of three categories. 1) It could be the original operation done by the client application. 2) It could be the actions performed by the replication engine to replay a remote transaction. And 3) it can be an administrative operation that must not be propagated at all. No matter how many different models you have in parallel, one single transaction will be either a master, a slave or an isolated local thing. The proposed changes make it possible to tell the session which of these three roles it is playing, and the triggers and rules can be configured to fire during the master/local role, the slave role, always or never. That functionality will work for master-slave as well as multi-master. Although my current plan isn't creating such a blended system, the proposed trigger and rule changes are designed to support exactly that in a 100% backward compatible way. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
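The session-role/firing-mode rule Jan describes can be sketched as a small decision table (illustrative Python with invented names; the actual proposal concerns pg_trigger and pg_rewrite configuration, not this API):

```python
def trigger_fires(session_role, fire_mode):
    # A session replays in exactly one role: "origin" (the client's
    # original work), "replica" (replaying a remote transaction), or
    # "local" (administrative work that must not propagate).
    assert session_role in ("origin", "replica", "local")
    if fire_mode == "always":
        return True
    if fire_mode == "never":
        return False
    if fire_mode == "origin":
        # Fire for master/local work only -- the common case for
        # application triggers that must not re-fire on replay.
        return session_role in ("origin", "local")
    if fire_mode == "replica":
        # Fire only while replaying remote work (slave-side logic).
        return session_role == "replica"
    raise ValueError("unknown fire mode: %s" % fire_mode)

print(trigger_fires("replica", "origin"))  # False: no re-firing on replay
print(trigger_fires("local", "origin"))    # True
```

Because the default mode behaves like today's unconditional firing, this decomposition is what makes the change backward compatible.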
Re: [HACKERS] Proposal: Commit timestamp
On Fri, 9 Feb 2007, Andrew Dunstan wrote: Richard Troy wrote: In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatibility, if a schema isn't specified, it indicates that the command pertains to the entire database. I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. Well, Andrew, replication _is_ an application. Or, you could think of replication as an extension to an application. I was under the impression that _users_ decide to put tables in schema spaces based upon _user_ need, and that Postgres developers' use of them for other purposes was encroaching on user choices, not the other way around. Either way, claiming need like this strikes me as stuck-in-a-rut or dogmatic thinking. Besides, don't we have schema nesting to help resolve any such care? And, what do you mean by design them for replication? While I'm in no way stuck on blending replication strategies via schemas, it does strike me as an appropriate concept and I'd prefer to have it evaluated based on technical merit - possibly citing workarounds or solutions to technical issues, which is what I gather has been the tradition of this group: Use case first, technical merit second... Other alternatives, ISTM, will have virtually the same look/feel as a schema from an external perspective, and the more I think of it the more I think using schemas is a sound, clean approach. 
That it offends someone's sense of aesthetics seems to me a poor rationale for not choosing it. Another question might be: What's lacking in the implementation of schemas that makes this a poor choice, and what could be done about it without much effort? Regards, Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 [EMAIL PROTECTED], http://ScienceTools.com/ ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposal: Commit timestamp
On Fri, 9 Feb 2007, Jan Wieck wrote: No matter how many different models you have in parallel, one single transaction will be either a master, a slave or an isolated local thing. The proposed changes allow to tell the session which of these three roles it is playing and the triggers and rules can be configured to fire during master/local role, slave role, always or never. That functionality will work for master-slave as well as multi-master. Although my current plan isn't creating such a blended system, the proposed trigger and rule changes are designed to support exactly that in a 100% backward compatible way. Jan Fantastic! ...At some point you'll be thinking of the management end - turning it on or off, etc. That might be where these other points come more into play. Richard -- Richard Troy, Chief Scientist Science Tools Corporation 510-924-1363 or 202-747-1263 [EMAIL PROTECTED], http://ScienceTools.com/ ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Re-ordering of OR conditions
On Fri, 2007-02-09 at 11:46 -0500, Tom Lane wrote: Jim Nasby [EMAIL PROTECTED] writes: IF I run the following with the a 2900 condition first, the more expensive EXISTS only gets executed when needed, but if I change the order of the OR's, the EXISTS is always executed. It would be good if the optimizer could re-order the OR conditions based on estimated cost (granted, this wouldn't work very well if you've got functions in the OR, but it'd still be useful): I looked at this for a bit. It's in principle do-able but I'm not sure it's a good idea. The problem is that while AND'ed condition lists are usually fairly short and hence cheap to sort, OR'ed condition lists are not infrequently very long --- nobody blinks an eye at hundreds of items in an IN-list for instance. I'm afraid we'd waste a lot more cycles sorting than we could hope to regain. Seems like the planner could decide ahead of time whether sorting the conditions at execution time was likely to be effective or not. Perhaps limiting it to at most 5 conditions, where at least one of those was a function or a join condition? That would be a fairly cheap test at planning time, but potentially a good win at execution time. The OR'ed condition is common condition when the schema uses complex sub-classing. Now we have function costs it seems more likely this idea would get used in practice. Anyway, not necessarily for you to do, but sounds like a useful idea all the same. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
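Simon's heuristic -- only sort short OR lists, and only when at least one arm is expensive -- can be sketched as follows (illustrative Python, not planner code; the condition representation, the `expensive` flag, and the threshold of 5 are all invented or assumed from the thread):

```python
def maybe_reorder_or_conditions(conds, max_len=5):
    # Cheap planning-time test: skip the sort entirely for long OR
    # lists (the IN-list case Tom worries about) and for lists with
    # no expensive arm, where reordering can't win anything.
    if len(conds) > max_len or not any(c["expensive"] for c in conds):
        return conds
    # Otherwise put the cheapest arms first: OR evaluation
    # short-circuits, so an early match skips the expensive arms.
    return sorted(conds, key=lambda c: c["cost"])

conds = [
    {"name": "EXISTS (SELECT ...)", "cost": 1000.0, "expensive": True},
    {"name": "a > 2900",            "cost": 0.01,   "expensive": False},
]
print(maybe_reorder_or_conditions(conds)[0]["name"])  # a > 2900
```

This keeps the planning-time cost bounded while still capturing Jim's original case of a cheap comparison OR'ed with an EXISTS sub-select.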
Re: [HACKERS] Proposal: Commit timestamp
On 2/9/2007 2:19 PM, Andrew Hammond wrote: On Feb 7, 8:12 pm, [EMAIL PROTECTED] (Bruce Momjian) wrote: Jan Wieck wrote: On 2/7/2007 10:35 PM, Bruce Momjian wrote: I find the term logical proof of its correctness too restrictive. It sounds like some formal academic process that really doesn't work well for us. Thank you. My intuition is that it might be possible to prove that _nothing_ can provide guaranteed ordering when there is disconnected operation. As a matter of physics, for two events happening outside of each other's event horizon, the question of which happened first is pointless. However, I think that the clock-based ordering Jan has described could provide _probable_ ordering under disconnected operation. I can see three variables in the equation that would determine the probability of correctness for the ordering. That precisely is the intended functionality. And I can describe exactly when two conflicting actions will result in the wrong row persisting. This will happen when the second update to the logically same row is performed on the server whose Lamport timestamp lags behind by more than the time between the two conflicting commits. Example: User fills out a form, submits, hits back button, corrects input and submits again within 3 seconds. Load balancing sends both requests to different servers and the first server is 3.0001 seconds ahead ... the user's typo will be the winner. My Lamport timestamp conflict resolution will not be able to solve this problem. However, when this happens, one thing is guaranteed. The update from the second server, arriving on the first for replication, will be ignored because a locally generated row is newer. This fact can be used as an indicator that there is a possible conflict that was resolved using the wrong data (business-process-wise). All nodes in the cluster will end up using the same wrong row, so at least they are consistently wrong. 
Nevertheless, being able to identify possible problem cases this way will allow to initiate further action including but not limited to human intervention. If this is not an acceptable risk for the application, other resolution methods will be needed. But I think in many cases, this form of default resolution will be good enough. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
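The last-update-wins rule and the "remote update ignored" indicator Jan describes can be sketched in a few lines (illustrative Python; the row representation and the origin-node-id tie-breaker are assumptions, not part of the proposal's wire format):

```python
def apply_remote_update(local_row, remote_row):
    # Last-update-wins on the Lamport timestamp, with the origin node
    # id as an arbitrary but deterministic tie-breaker so every node
    # resolves the conflict identically ("consistently wrong" at worst).
    if (remote_row["ts"], remote_row["origin"]) > (local_row["ts"], local_row["origin"]):
        return remote_row, False
    # Locally generated row is newer: the remote update is ignored.
    # This is the detectable signal of a possibly mis-resolved conflict.
    return local_row, True

local  = {"ts": 105, "origin": 1, "val": "typo"}   # fast server, wrong data
remote = {"ts": 102, "origin": 2, "val": "fix"}    # lagging server, right data
row, ignored = apply_remote_update(local, remote)
print(row["val"], ignored)  # typo True -- wrong row wins, but detectably
```

The `ignored` flag is what would let an operator (or follow-up tooling) flag such commits for human review, as the post suggests.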
Re: [HACKERS] Proposal: Commit timestamp
Richard Troy wrote: On Fri, 9 Feb 2007, Andrew Dunstan wrote: Richard Troy wrote: In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatability, if a schema isn't specified, it indicates that command pertains to the entire database. I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. Well, Andrew, replication _is_ an application. Or, you could think of replication as an extension to an application. No, I don't think of it as either. It's a utility, more an extension of the DBMS than of the application. You don't replicate for the sake of replicating. I was under the impression that_users_ decide to put tables in schema spaces based upon _user_ need, and that Postgres developer's use of them for other purposes was incroaching on user choices, not the other way around. That's exactly what you would be doing with this proposal, encroaching on what I regard as user space. Either way, claiming need like this strikes me as stuck-in-a-rut or dogmatic thinking. Besides, don't we have schema nesting to help resolve any such care? No. We do now have schema nesting, for this or any other purpose. Where did you get that idea? If we did I would not be so resistant to using them for this purpose, but as it is, if you hijack schemas for replication segregation you will detract from their more obvious use in name segregation. cheers andrew ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Commit timestamp
I wrote: We do now have schema nesting, for this or any other purpose. s/now/not/ (of course) cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Variable length varlena headers redux
Greg Stark wrote: Tom Lane [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: Bruce Momjian [EMAIL PROTECTED] writes: I know it is kind of odd to have a data type that is only used on disk, and not in memory, but I see this as a baby varlena type, used only to store and get varlena values using less disk space. I was leaning toward generating the short varlena headers primarily in heap_form*tuple and just having the datatype-specific code generate 4-byte headers much as you describe. I thought we had a solution for all this, namely to make the short-form headers be essentially a TOAST-compressed representation. The format with 4-byte headers is still legal but just not compressed. Anyone who fails to detoast an input argument is already broken, so there's no code compatibility hit taken. Uh. So I don't see how to make this work on a little-endian machine. If the leading bits are 0 we don't know if they're toast flags or bits of the least significant byte of a longer length. If we store all lengths in network byte order that problem goes away, but then user code that does VARATT_SIZEP(datum) = len is incorrect. If we declare the in-memory format to be host byte order and the on-disk format to be network byte order, then every single varlena datum needs to be copied when heap_deform*tuple runs. If we only do this for a new kind of varlena then only text/varchar/char/numeric datums would need to be copied, but that's still a lot. I wonder if we need to reorder the TOAST structure to have the bits we need at the start of the structure so we can be sure they are first. For example, what if we split varattrib.va_header, which is int32 now, into four 'char' fields, and just reassemble it in the toast code. That would be pretty localized. I had forgotten about hooking into the TOAST system, but since we are going to be expanding the headers of these types when they get into memory, it does make sense. 
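The endianness problem described above can be made concrete with a few lines of standalone C (an illustration only, not PostgreSQL source; the function names are made up for this sketch):

```c
#include <stdint.h>
#include <string.h>

/* On a little-endian machine the first in-memory byte of an int32
 * length word holds the LOW-order bits, so any length with those bits
 * set is indistinguishable from a first byte carrying TOAST flag bits. */
static uint8_t first_byte_of_len(uint32_t len)
{
    uint8_t buf[4];

    memcpy(buf, &len, sizeof(len));
    return buf[0];
}

/* In network (big-endian) byte order the first byte always holds the
 * HIGH-order bits, which are zero for any length below 2^24 -- leaving
 * the top bits of that byte free to act as flags. */
static uint8_t first_byte_of_len_be(uint32_t len)
{
    return (uint8_t) (len >> 24);
}
```

On a little-endian host the first function returns the low-order byte of the length, which is exactly where any flag bits would have to live; with the big-endian layout the leading byte is free whenever the length is small.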
-- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposal: Commit timestamp
On 2/9/2007 3:25 PM, Andrew Dunstan wrote: Richard Troy wrote: On Fri, 9 Feb 2007, Andrew Dunstan wrote: Richard Troy wrote: In more specific terms, and I'm just brainstorming in public here, perhaps we can use the power of Schemas within a database to manage such divisions; commands which pertain to replication can/would include a schema specifier and elements within the schema can be replicated one way or another, at the whim of the DBA / Architect. For backwards compatibility, if a schema isn't specified, it indicates that the command pertains to the entire database. I understand that you're just thinking aloud, but overloading namespaces in this way strikes me as awful. Applications and extensions, which are the things that have need of namespaces, should not have to care about replication. If we have to design them for replication we'll be on a fast track to nowhere IMNSHO. Well, Andrew, replication _is_ an application. Or, you could think of replication as an extension to an application. No, I don't think of it as either. It's a utility, more an extension of the DBMS than of the application. You don't replicate for the sake of replicating. I was under the impression that _users_ decide to put tables in schema spaces based upon _user_ need, and that Postgres developers' use of them for other purposes was encroaching on user choices, not the other way around. That's exactly what you would be doing with this proposal, encroaching on what I regard as user space. I'd never use a schema for that. Look at Slony-I. It uses a user-selectable schema for the stuff it needs to store in the database, to keep Slony objects separate from user objects. But it organizes the user tables in what are called sets. A set can be any combination of tables and sequences from any number of namespaces. If I were to do it again, I would use names instead of numbers to identify sets, and I would allow for summary sets containing groups of simple sets. 
On the other issue, replication is part of the overall infrastructure and thereby part of the solution to a business problem. Like the hardware, database, and application, it solves a piece of the puzzle that none of the others is particularly good at. It will perform better or eventually not at all, depending on how much consideration the distributed nature of the business model was given when the application was designed. Applications that happened by accident rather than being designed usually don't allow any kind of distribution. And yes, there are people who replicate for the sake of it. It is the kind of people who ask for sync multi-master no matter what their actual problem might be, because it sounds most prestigious. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 13:16 +0530, Pavan Deolasee wrote: The second problem of concurrent index scans seems a bit more complex. We need a mechanism so that no tuples are missed and no tuples are returned twice. Since CHILLing of a tuple adds a new access path to the tuple from the index, a concurrent index scan may return a tuple twice. How about grabbing an AccessExclusiveLock during the CHILLing operation? This would prevent any concurrent index scans. Since CHILLing of a large table can take a long time, the operation can be spread across time with periodic acquire/release of the lock. This would prevent starvation of other backends. Since CHILLing is required only for CREATE INDEX and stub-cleanup, I am assuming that it's OK for it to be lazy in nature. We've just spoken about this, so just wanted to add those thoughts here. A pointer-swing operation will begin when we see a tuple that both has status HEAPTUPLE_DEAD and is marked HEAP_UPDATE_ROOT. Perhaps that requires a new status from HeapTupleSatisfiesVacuum()? We chill all tuples in the chain, up to the new root. We mark those tuples, so that HeapTupleSatisfiesVacuum() will describe them as HEAPTUPLE_RECENTLY_DEAD. So the current Vacuum won't immediately remove them, but they'll go away in the future as part of an on-demand block vacuum or Vacuum. That's similar to the way we handle HALF_DEAD index pages. The index scans are page-at-a-time, so when we pointer-swing from the root tuple to one of the HOT tuples we'll be OK. We're switching a specific index tuple, so there's no multi-page locking on the index to consider. Right now, I wouldn't want to assume that the way tuples are marked prior to pointer-swinging is exactly the same as the chilling required by CREATE INDEX: CHILL_IN_PROGRESS. It may well be, but I'm wary that we assume they are exactly the same and introduce a subtle bug. 
-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] HOT for PostgreSQL 8.3
Tom Lane wrote: Removing the root tuple will require a VACUUM *FULL*. That seems unacceptable ... it won't take too long for your table to fill up with stubs, and we don't want to return to the bad old days when periodic VACUUM FULL was unavoidable. ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. So if you can solve the former, I think you can make this work too. I need clarification here. Is removing a dead heap tuple always going to require an index scan, or was this just for chilling a row (adding an index)? -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Variable length varlena headers redux
Bruce Momjian [EMAIL PROTECTED] writes: Uh. So I don't see how to make this work on a little-endian machine. If the leading bits are 0 we don't know if they're toast flags or bits of the least significant byte of a longer length. ... I had forgotten about hooking into the TOAST system, but since we are going to be expanding the headers of these types when they get into memory, it does make sense. Ok, I guess this can work if we guarantee that in-memory datums always have 4-byte headers. That means that heap_deform*tuple always copies the datum if it's this type of datum. That means we never have pointers to short-varlena datums inside tuples. I'm not sure if there are parts of the system that assume that the datums they get out of heap_deform*tuple are pointers into the tuple or not. I haven't come across any in my travels thus far. That seems like an awful lot of copying and pallocs that aren't there currently though. And it'll make us reluctant to change over frequently used data types like text -- which are precisely the ones that would gain us the most. It seems to me that it might be better to change to storing varlena lengths in network byte order instead. That way we can dedicate the leading bits to toast flags and read more bytes as necessary. I think the way to do this would be to throw out the VARATT_SIZEP macro and replace it with VARATT_SET_SIZE(datum,size). VARSIZE would just call ntohl (or ntohs if the leading bits on the first byte indicated...) That does mean touching every piece of data type code. And invalidating every piece of user code. :( At least it's fairly mechanical. And it has the advantage of not being at all fragile -- unfixed code won't even compile. While we're at it I would suggest taking out the VARHDRSZ offset. Just store the size of the data payload. The constant VARHDRSZ offset no longer makes sense since it won't actually be the size of the varlena header anyway. 
And predicting the actual size of the varlena header will be annoying and bug-prone since it depends on the resulting value you calculate. (Incidentally, this would actually make EnterpriseDB somewhat sad since we want pg_migrator to work for 8.3. But it wouldn't be out of the realm of possibility to go through the database and switch varlena headers to network byte order. There's no need to compress them, just leave the 4-byte format in place with the bytes swapped around.) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
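One possible shape for the VARATT_SET_SIZE/VARSIZE pair proposed above, sketched as standalone C. The exact encoding (a set high bit in the first byte marking a 1-byte short header, otherwise a big-endian 4-byte payload length with the top bits free for flags) is an assumption for illustration, not something settled in the thread:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical varlena header encoding, most-significant byte first,
 * so the flag bit in the first byte is unambiguous regardless of host
 * endianness.  Stores the payload size only (no VARHDRSZ offset), as
 * suggested in the mail.  Returns the number of header bytes used. */
static size_t varatt_set_size(uint8_t *hdr, uint32_t payload_len)
{
    if (payload_len <= 0x7F)
    {
        hdr[0] = (uint8_t) (0x80 | payload_len);    /* 1-byte short form */
        return 1;
    }
    hdr[0] = (uint8_t) ((payload_len >> 24) & 0x7F); /* flag bit clear */
    hdr[1] = (uint8_t) (payload_len >> 16);
    hdr[2] = (uint8_t) (payload_len >> 8);
    hdr[3] = (uint8_t) payload_len;
    return 4;
}

/* Decode either form: peek at the leading bits, then read as many
 * bytes as necessary -- the behavior ntohl/ntohs would be wrapped in. */
static uint32_t varatt_size(const uint8_t *hdr)
{
    if (hdr[0] & 0x80)                               /* short form */
        return hdr[0] & 0x7F;
    return ((uint32_t) hdr[0] << 24) | ((uint32_t) hdr[1] << 16) |
           ((uint32_t) hdr[2] << 8)  |  (uint32_t) hdr[3];
}
```

Note that writing through varatt_set_size rather than `VARATT_SIZEP(datum) = len` is exactly the mechanical-but-pervasive source change the mail anticipates.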
Re: [HACKERS] HOT for PostgreSQL 8.3
On Fri, 2007-02-09 at 18:10 -0500, Bruce Momjian wrote: Tom Lane wrote: Removing the root tuple will require a VACUUM *FULL*. That seems unacceptable ... it won't take too long for your table to fill up with stubs, and we don't want to return to the bad old days when periodic VACUUM FULL was unavoidable. ISTM we could fix that by extending the index VACUUM interface to include two concepts: aside from remove these TIDs when you find them, there could be replace these TIDs with those TIDs when you find them. This would allow pointer-swinging to one of the child tuples, after which the old root could be removed. This has got the same atomicity problem as for CREATE INDEX, because it's the same thing: you're de-HOT-ifying the child. So if you can solve the former, I think you can make this work too. I need clarification here. Is removing a dead heap tuple always going to require an index scan, or was this just for chilling a row (adding an index)? We can remove a tuple marked HEAP_ONLY_TUPLE when its status is HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a TupleStub, but not removed. Multiple tuples in the chain can be removed, though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the first non-removed tuple in the chain. All of that can be done while we hold a CleanupLock on the block, without reference to the indexes; this can be performed on demand, when we attempt an UPDATE. This is similar to what already happens prior to a btree split operation. (This could also be performed by bgwriter, but that isn't proposed at this time because the buffer transit time through the cache is often not long enough to allow tuples to die and get benefit from space reuse.) TupleStubs can be marked for removal by a pointer-swing operation during normal VACUUM, i.e. it will require touching the indexes. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
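As a toy model only (plain C with a singly linked list standing in for the update chain; none of these names exist in the server), the index-free prune step described above might look like:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: the update chain is a linked list; dead HEAP_ONLY members
 * are unlinked and the root's next pointer (standing in for t_ctid) is
 * repointed at the first surviving tuple.  The root itself is kept as
 * a stub, because index entries still point at it. */
typedef struct ChainTuple
{
    bool dead;                  /* HEAPTUPLE_DEAD in the real system */
    struct ChainTuple *next;    /* stand-in for t_ctid */
} ChainTuple;

/* Prune the chain hanging off the root; returns tuples removed. */
static int prune_chain(ChainTuple *root)
{
    int removed = 0;
    ChainTuple *cur = root->next;

    while (cur != NULL && cur->dead)
    {
        ChainTuple *victim = cur;

        cur = cur->next;
        free(victim);
        removed++;
    }
    root->next = cur;           /* repoint the stub at first survivor */
    return removed;
}
```

The point of the sketch is that the whole operation touches only the chain itself, which is why it can run under a block-level CleanupLock without visiting any index.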
Re: [HACKERS] Variable length varlena headers redux
Gregory Stark [EMAIL PROTECTED] writes: That seems like an awful lot of copying and pallocs that aren't there currently though. And it'll make us reluctant to change over frequently used data types like text -- which are precisely the ones that would gain us the most. It seems to me that it might be better to change to storing varlena lengths in network byte order instead. That way we can dedicate the leading bits to toast flags and read more bytes as necessary. This'll add its own overhead ... but probably less than pallocs and data-copying would. And I agree we can find (pretty much) all the places that need changing by the expedient of deliberately renaming the macros and struct fields. One problem I foresee is that I think you are about to propose that VARDATA depend on the length already having been inserted, which it does not now; and simple renamings won't detect ordering errors for that. Also I believe there are places that over-allocate memory, fill in the data, and only then set the length; something you will not easily be able to change. It might work if we assume that *creation* of a varlena value always produces the 4-byte-header form and only reading of a value that might be on disk needs to cope with the short-header forms. However this seems to require two different forms of VARDATA depending on whether one is preparing or reading a value. Ugh. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HOT for PostgreSQL 8.3
Simon Riggs wrote: I need clarification here. Is removing a dead heap tuple always going to require an index scan, or was this just for chilling a row (adding an index)? We can remove a tuple marked HEAP_ONLY_TUPLE when its status is HEAPTUPLE_DEAD. The HEAP_UPDATE_ROOT tuple can be reduced to a TupleStub, but not removed. Multiple tuples in the chain can be removed, though the HEAP_UPDATE_ROOT's t_ctid must be modified to point to the first non-removed tuple in the chain. All of that can be done while we hold a CleanupLock on the block, without reference to the indexes; this can be performed on demand, when we attempt an UPDATE. This is similar to what already happens prior to a btree split operation. (This could also be performed by bgwriter, but that isn't proposed at this time because the buffer transit time through the cache is often not long enough to allow tuples to die and get benefit from space reuse.) TupleStubs can be marked for removal by a pointer-swing operation during normal VACUUM, i.e. it will require touching the indexes. OK, that sounds like a good plan. Thanks. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Variable length varlena headers redux
Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: That seems like an awful lot of copying and pallocs that aren't there currently though. And it'll make us reluctant to change over frequently used data types like text -- which are precisely the ones that would gain us the most. It seems to me that it might be better to change to storing varlena lengths in network byte order instead. That way we can dedicate the leading bits to toast flags and read more bytes as necessary. This'll add its own overhead ... but probably less than pallocs and data-copying would. And I agree we can find (pretty much) all the places that need changing by the expedient of deliberately renaming the macros and struct fields. I think we should go with the pallocs and see how it performs. That is certainly going to be easier to do, and we can test it pretty easily. One palloc optimization idea would be to split out the representation so the length is stored separately from the data in memory; we could use an int32 for the length, and point to the shared buffer for the data. However, I don't think our macros can handle that, so it might be a non-starter. In any case, I think we should find out if the palloc is a problem before avoiding it. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
[HACKERS] Foreign keys for non-default datatypes, redux
Almost a year ago, we talked about the problem that referential integrity should be selecting comparison operators on the basis of b-tree index opclasses, instead of assuming that the appropriate operator is always named =: http://archives.postgresql.org/pgsql-hackers/2006-02/msg00960.php http://archives.postgresql.org/pgsql-hackers/2006-03/msg00161.php I'm about to go off and implement that at last. To refresh folks' memory, what I think we agreed to was that at the time of definition of a foreign-key constraint, we should identify the specific equality operator to be used for (each column of) the constraint. The method for doing this is to be: * First, identify the unique index that is relied on to enforce uniqueness of the PK entries (we do this already of course). * Look to see if there is an equality operator in this index's opfamily accepting exactly the PK and FK data types (ie, PK = FK). If so, use that. * Else, check to see if there is an implicit promotion from the FK datatype to the PK datatype. If so, use the equality operator PK = PK, which must exist since the opfamily supports an index on the PK datatype. * Else fail (this means that the present warning about inefficient foreign keys will become a hard error). The good thing about this proposal is that we know that we have identified an operator whose notion of equality is compatible with the notion of equality being enforced by the unique index, and thus a lot of potential gotchas with nondefault opclasses go away. My intention is that we'd record pg_depend entries making the RI constraint dependent on not only the index, but the specific operators to use. This would not have been too critical a year ago given that opclasses were effectively immutable; but in the current opfamily design it's entirely likely that we'd select cross-type equality operators that are considered loose and potentially droppable from the opfamily. 
So we need dependencies to prevent the operators from disappearing out from under us. (Come to think of it, we might want to record dependencies on the casts too, if we're using implicit casts?) What I'm thinking about right now is that the ri_triggers.c routines need to be able to find out which operators they're supposed to use, so that they can construct the RI queries correctly. We could possibly have them dredge the information out of pg_depend, but this seems inefficient, and I'm not entirely sure how one would match up operators with columns given only the pg_depend entries. What I'd like to propose instead is: * Add an oid[] column to pg_constraint that stores the equality operator OIDs for a foreign-key constraint, in the same column order as conkey[] and confkey[]. * Add an OID column to pg_trigger giving the OID of the constraint owning the trigger (or 0 if none). Add this information to struct Trigger as well, so that it gets passed to trigger functions. Given the pg_constraint OID, the RI triggers could fetch the constraint row and look at conkey[], confkey[], and the new operator oid[] array to determine what they need to know. This would actually mean that they don't need pg_trigger.tgargs at all. I am pretty strongly tempted to stop storing anything in tgargs for RI triggers --- it's ugly, and updating the info during RENAME commands is a pain in the rear. On the other hand removing it might break client-side code that expects to look at tgargs to learn about FK constraints. I'd personally think that pg_constraint is a lot easier to work with, but there might be some code out there left over from way back before pg_constraint existed --- anyone know of any such issue? regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
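The three-step operator selection rule proposed above can be sketched in standalone C. The struct, the fk_promotes_to_pk flag, and the numeric "OIDs" used below are all hypothetical stand-ins for the real catalog lookups (pg_amop and the cast machinery):

```c
#include <stddef.h>

/* Toy model: an opfamily is a table of equality operators keyed by
 * (left, right) input types; type and operator OIDs are plain ints. */
typedef struct EqOp
{
    int lefttype;
    int righttype;
    int opno;
} EqOp;

/* Look up an equality operator for the given input types;
 * 0 stands in for InvalidOid. */
static int find_op(const EqOp *fam, size_t n, int lt, int rt)
{
    for (size_t i = 0; i < n; i++)
        if (fam[i].lefttype == lt && fam[i].righttype == rt)
            return fam[i].opno;
    return 0;
}

/* Select the equality operator for one PK/FK column pair, per the
 * three-step rule.  fk_promotes_to_pk stands in for the implicit-cast
 * check.  Returns 0 when the constraint must be rejected outright. */
static int select_ri_operator(const EqOp *fam, size_t n,
                              int pktype, int fktype,
                              int fk_promotes_to_pk)
{
    int op = find_op(fam, n, pktype, fktype);   /* 1: exact PK = FK */

    if (op != 0)
        return op;
    if (fk_promotes_to_pk)                      /* 2: implicit FK -> PK */
        return find_op(fam, n, pktype, pktype); /*    PK = PK must exist */
    return 0;                                   /* 3: hard error */
}
```

Because step 2 falls back to the PK = PK operator from the same opfamily as the unique index, the chosen notion of equality is guaranteed to match the one the index enforces, which is the whole point of the proposal.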
Re: [HACKERS] Variable length varlena headers redux
Bruce Momjian wrote: Tom Lane wrote: Gregory Stark [EMAIL PROTECTED] writes: That seems like an awful lot of copying and pallocs that aren't there currently though. And it'll make us reluctant to change over frequently used data types like text -- which are precisely the ones that would gain us the most. It seems to me that it might be better to change to storing varlena lengths in network byte order instead. That way we can dedicate the leading bits to toast flags and read more bytes as necessary. This'll add its own overhead ... but probably less than pallocs and data-copying would. And I agree we can find (pretty much) all the places that need changing by the expedient of deliberately renaming the macros and struct fields. I think we should go with the pallocs and see how it performs. That is certainly going to be easier to do, and we can test it pretty easily. One palloc optimization idea would be to split out the representation so the length is stored separately from the data in memory; we could use an int32 for the length, and point to the shared buffer for the data. However, I don't think our macros can handle that, so it might be a non-starter. In any case, I think we should find out if the palloc is a problem before avoiding it. Another idea for reducing pallocs is that we know every short column is at most 128 + 4 = 132 bytes, so we could allocate a 132-byte buffer for every short column in the scan, and just re-use the buffer for every row. -- Bruce Momjian [EMAIL PROTECTED] http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
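A standalone sketch of that buffer-reuse idea, assuming a 1-byte short header with the payload length in the low 7 bits and the high bit set (the actual on-disk encoding was still under discussion, so this is an assumption):

```c
#include <stdint.h>
#include <string.h>

/* Per the mail: a short datum expanded to 4-byte-header form fits in
 * 128 + 4 = 132 bytes, so one fixed scratch buffer per short column
 * can be refilled for each row instead of palloc'ing a fresh copy. */
#define SHORT_MAX_PAYLOAD 128
#define SCRATCH_SIZE      (SHORT_MAX_PAYLOAD + 4)

typedef struct
{
    uint8_t buf[SCRATCH_SIZE];
} ColumnScratch;

/* Expand a short datum (hypothetical 1-byte header: high bit set,
 * payload length in the low 7 bits) into the scratch buffer with a
 * plain 4-byte host-order header.  Returns the payload length. */
static uint32_t expand_short_datum(ColumnScratch *scratch,
                                   const uint8_t *short_datum)
{
    uint32_t len = short_datum[0] & 0x7F;

    memcpy(scratch->buf, &len, 4);                 /* 4-byte header */
    memcpy(scratch->buf + 4, short_datum + 1, len);
    return len;
}
```

Each row's expansion simply overwrites the previous one, so the per-row cost is two memcpys rather than a palloc/pfree cycle.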
[HACKERS] RFC: Temporal Extensions for PostgreSQL
Temporal Extensions for PostgreSQL by: Warren Turkal I would like to see a comprehensive solution to time-varying (or temporal) tables in PostgreSQL. I specifically want to see support for valid-time, transaction-time, and bitemporal (valid-time and transaction-time) tables. I will be deferring the descriptions of much of the functionality to Dr. Richard T. Snodgrass's _Developing Time-Oriented Database Applications in SQL_ at [1]. The mangled pages 30-31 are at [2]. a) Functionality Dr. Richard T. Snodgrass has worked on defining the semantics of temporal databases very completely in several writings. He was also involved in an unsuccessful effort to standardize temporal extensions to SQL. I believe his book does a good job of presenting the semantics of temporal databases and describing extensions to SQL that make the data much more natural to work with. b) How current solutions fall flat Current solutions fall flat due to the extreme complexity of implementing valid-time and transaction-time semantics on tables by adding columns to track all of the data. Please see chapter 11 of [1] for a more complete description of this complexity. Chapter 12 of [1] goes on to lay out new syntax for SQL that will make dealing with data of this nature much more natural. c) Examples

-- create normal table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    description TEXT
);

-- Add valid-time support to the table with granularity of timestamp.
ALTER TABLE products ADD VALIDTIME PERIOD(TIMESTAMP WITH TIME ZONE);

-- Insert row valid from 2006-01-01 to just before 2007-01-01
VALIDTIME PERIOD '[2006-01-01 - 2007-01-01)'
INSERT INTO products ( description ) VALUES ( 'red ball' );

-- Insert row valid from 2007-01-01 to just before 2008-01-01
-- Should be smart enough to realize that id=777 does not conflict in this
-- time of validity.
VALIDTIME PERIOD '[2007-01-01 - 2008-01-01)'
INSERT INTO products ( id, description ) VALUES ( 777, 'blue ball' );

-- Select history of products with id=777
VALIDTIME SELECT * FROM products WHERE id=777;

 id  | description |       valid_period
-----+-------------+---------------------------
 777 | red ball    | [2006-01-01 - 2007-01-01)
 777 | blue ball   | [2007-01-01 - 2008-01-01)

-- Select current products with id=777
-- The date when the query was run was 2007-02-10.
SELECT * FROM products WHERE id=777;

 id  | description
-----+-------------
 777 | blue ball

There are many more details in chapter 12 of [1]. d) New stuff (dependencies, indices, syntax, libraries) One of the base-level additions is the PERIOD datatype. I think that implementing temporal support is reliant on developing such a type. The description of this datatype is laid out in chapter 4 of [1]. The SQL syntax is present in chapter 12 of [1]. I see this as the first piece that needs to be implemented in order to take steps toward a DBMS that supports full temporal capabilities. I think that PERIOD can largely reuse the datetime functionality for parsing of literals and for comparisons. The RTREE seems to nicely incorporate the needed indexing of the PERIOD type. The syntax of the parser will have to be extended to handle the PERIOD literals and constructor. I don't believe any additional libraries will be required. There are also extensions to the syntax of table creation, table altering, querying, inserting, and updating on temporal tables. These are all discussed in some detail in chapter 12 of [1]. I don't think that any of these changes will require new libraries. The semantics of temporal tables and querying them could have a dramatic effect on how things like primary keys and unique constraints work. I would like to get some comments about this from the community. e) See Also Additional resources can be found at Dr. Richard T. Snodgrass's website at [3], including the SQL valid-time table support spec at [4] and the SQL transaction-time table support spec at [5]. Thoughts? Questions? Comments? 
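The half-open PERIOD semantics used in the examples above (a period contains its start but not its end, which is why id=777 does not conflict across adjacent periods) can be captured in a few lines of C, with instants reduced to plain integers for illustration:

```c
#include <stdbool.h>

/* Minimal sketch of a half-open period value: [start, end). */
typedef struct
{
    int start;
    int end;
} Period;

/* Two half-open periods overlap iff each starts before the other ends. */
static bool period_overlaps(Period a, Period b)
{
    return a.start < b.end && b.start < a.end;
}

/* A half-open period contains an instant t iff start <= t < end. */
static bool period_contains(Period p, int t)
{
    return p.start <= t && t < p.end;
}
```

Adjacent periods such as [2006, 2007) and [2007, 2008) share a boundary but do not overlap, which is exactly the property a temporal unique constraint would rely on.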
[1]http://www.cs.arizona.edu/~rts/tdbbook.pdf [2]http://www.cs.arizona.edu/~rts/pp30-31.pdf [3]http://www.cs.arizone.edu/~rts/ [4]ftp://ftp.cs.arizona.edu/tsql/tsql2/sql3/mad146.pdf [5]ftp://ftp.cs.arizona.edu/tsql/tsql2/sql3/mad147.pdf Thanks, wt -- Warren Turkal (w00t) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq