Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Mon, 2007-12-17 at 09:20, Simon Riggs wrote:
> On Sat, 2007-12-15 at 13:32 +0100, Albert Cervera i Areny wrote:
> > > Read-Only Tables
> > > Postgres supports the concept of freezing tuples, so they can live forever within the database without needing further writes. Currently there is no command that will guarantee that a table has been completely frozen. This makes it difficult to reliably write data files to WORM media for longer term archiving. (WORM means Write-Once, Read-Many.) It's also a pain having to VACUUM a large table again just because a small number of rows need to be frozen.
> > I'm not an expert at all, but I'd like to understand this: do you plan that READ-ONLY tables wouldn't even store transaction information? That should save quite a lot of space. Maybe when the table is moved to the compressed tablespace, MVCC information could be dropped too? Of course that would prevent future inserts and updates, though.
> It could, but it's a lot of work for little gain. The tuple headers look like they will compress fairly well, so why bother to remove them at all?

One place for removing them would be if we do column stores, where there would be one header per column instead of one per tuple.

-- Hannu

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster
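The claim that near-identical tuple headers compress well can be sanity-checked with a small sketch. The 23-byte header and 60-byte payload sizes below are illustrative assumptions, not the real on-disk layout:

```python
import random
import zlib

HEADER = bytes(23)       # stand-in for one frozen tuple's header (assumed size)
rng = random.Random(0)   # deterministic "user data" that itself compresses poorly

# 100 tuples: identical headers interleaved with incompressible payloads.
raw = b"".join(HEADER + rng.randbytes(60) for _ in range(100))
compressed = zlib.compress(raw, 9)

# The ~2300 bytes of repeated header all but disappear; the payload does not.
print(len(raw), "->", len(compressed))
```

Even against random payloads, the repeated headers are the part the compressor removes almost for free, which is the point being made above.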
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Sat, 2007-12-15 at 13:32 +0100, Albert Cervera i Areny wrote:
> > Read-Only Tables
> > Postgres supports the concept of freezing tuples, so they can live forever within the database without needing further writes. Currently there is no command that will guarantee that a table has been completely frozen. This makes it difficult to reliably write data files to WORM media for longer term archiving. (WORM means Write-Once, Read-Many.) It's also a pain having to VACUUM a large table again just because a small number of rows need to be frozen.
> I'm not an expert at all, but I'd like to understand this: do you plan that READ-ONLY tables wouldn't even store transaction information? That should save quite a lot of space. Maybe when the table is moved to the compressed tablespace, MVCC information could be dropped too? Of course that would prevent future inserts and updates, though.

It could, but it's a lot of work for little gain. The tuple headers look like they will compress fairly well, so why bother to remove them at all?

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] WORM and Read Only Tables (v0.1)
> Read-Only Tables
> Postgres supports the concept of freezing tuples, so they can live forever within the database without needing further writes. Currently there is no command that will guarantee that a table has been completely frozen. This makes it difficult to reliably write data files to WORM media for longer term archiving. (WORM means Write-Once, Read-Many.) It's also a pain having to VACUUM a large table again just because a small number of rows need to be frozen.

I'm not an expert at all, but I'd like to understand this: do you plan that READ-ONLY tables wouldn't even store transaction information? That should save quite a lot of space. Maybe when the table is moved to the compressed tablespace, MVCC information could be dropped too? Of course that would prevent future inserts and updates, though.

-- Albert Cervera i Areny http://www.NaN-tic.com
Re: [HACKERS] WORM and Read Only Tables (v0.1)
Simon,

> Use Case: VLDB with tons of (now) read only data, some not. Data needs to be accessible, but data itself is rarely touched, allowing storage costs to be minimised via a storage hierarchy of progressively cheaper storage.

There are actually two cases to optimize for:
1) write-once-read-many (WORM)
2) write-once-read-seldom (WORS)

The second case is becoming extremely popular due to the presence of government-mandated records databases. For example, I'm currently working on one call completion records database which will hold 75TB of data, of which we expect less than 1% to *ever* be queried.

One of the other things I'd like to note is that for WORM, conventional storage is never going to approach column-store DBs for general performance. So, should we be working on incremental improvements like the ones you propose, or should we be working on integrating a c-store into PostgreSQL on a per-table basis?

-- Josh the Fuzzy Berkus San Francisco
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 20:30 -0800, Josh Berkus wrote:
> There's actually 2 cases to optimize for: 1) write-once-read-many (WORM) 2) write-once-read-seldom (WORS). The 2nd case is becoming extremely popular due to the presence of government-mandated records databases. For example, I'm currently working on one call completion records database which will hold 75TB of data, of which we expect less than 1% to *ever* be queried.

Well, that's exactly the use case I'm writing for. I called that an archival data store in my post on VLDB Features. WORM is a type of storage that might be used, so it would be somewhat confusing if we use it as the name of a specific use case.

Getting partitioning/read-only right will allow 70+TB of that to be on tape or similar, which with compression can be reduced to maybe 20TB? I don't want to promise any particular compression ratio, but it will make a substantial difference, as I'm sure you realise.

> One of the other things I'd like to note is that for WORM, conventional storage is never going to approach column-store DBs for general performance. So, should we be working on incremental improvements like the ones you propose, or should we be working on integrating a c-store into PostgreSQL on a per-table basis?

What I'm saying is that there are some features that all VLDBs need. If we had a column-store DB we would still need partitioning as well, or the data structures would become unmanageable. Plus partitioning can allow the planner to avoid de-archiving/spinning up data and help reduce storage costs.

Radical can be good, but it can take more time also. I dare say it would be harder for the community to accept also. So I look for worthwhile change in acceptable size chunks.
-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] WORM and Read Only Tables (v0.1)
> There are a number of nasty limitations for partitions currently (not the least of which is that real uniqueness guarantees are impractical),

Just to add another opinion to this statement, because it imho sounds overly pessimistic: uniqueness is currently perfectly practical when the unique index contains the column[s] that is/are used in a non-overlapping partitioning scheme. If you cannot create separate unique indexes on each partition that guarantee global uniqueness because of the chosen partitioning scheme, you can often reconsider your scheme (e.g. use natural keys instead of serials).

Other db software allows creating global indexes, or indexes with separate partitioning schemes, but this is then often a pain. When you drop/attach/detach a partition such an index needs to be recreated or reorganized. This then makes a large, slow transaction out of attach/detach partition. If you don't need to attach/detach, there is still one other argument against the huge global index, which is fault isolation.

There is imho large room to make it better than others :-) And I think we should not regard them as positive examples, because that narrows the view.

Andreas
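Andreas's condition — per-partition unique indexes imply global uniqueness when the ranges don't overlap and the unique key contains the partitioning column — can be illustrated with a toy sketch (the table names and ranges are hypothetical, not any real schema):

```python
# Toy model: each partition owns a disjoint key range and its own "unique index".
partitions = {
    "calls_2006": (range(0, 1_000), set()),
    "calls_2007": (range(1_000, 2_000), set()),
}

def insert(key: int) -> str:
    """Route by the partitioning column, then check only the local index."""
    for name, (keyrange, local_index) in partitions.items():
        if key in keyrange:
            if key in local_index:
                raise ValueError(f"duplicate key {key} in {name}")
            local_index.add(key)
            return name
    raise ValueError(f"no partition accepts key {key}")

insert(42)
insert(1_042)
try:
    insert(42)        # a global duplicate is necessarily a local duplicate
except ValueError as err:
    print(err)        # -> duplicate key 42 in calls_2006
```

Because every key can land in exactly one partition, the local check is sufficient; this is exactly the property that breaks if the ranges ever overlap, which is Andrew's worry downthread.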
Re: [HACKERS] WORM and Read Only Tables (v0.1)
> Getting partitioning/read-only right will allow 70+TB of that to be on tape or similar, which with compression can be reduced to maybe 20TB? I don't want to promise any particular compression ratio, but it will make a substantial difference, as I'm sure you realise.

Wouldn't one very substantial requirement of such storage be to have it independent of db version, or even db product? Keeping old hardware and software around can be quite expensive. So, wouldn't a virtual table interface be a better match for such a problem? Such a virtual table should be allowed to be part of a partitioning scheme, have native or virtual indexes, ...

Andreas
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, Dec 12, 2007 at 12:14:43PM +0100, Zeugswetter Andreas ADI SD wrote:
> Uniqueness is currently perfectly practical, when the unique index contains the column[s] that is/are used in a non overlapping partitioning scheme.

Well, yes, assuming you have no bugs. Part of the reason I want the database to handle this for me is because, where I've come from, the only thing I can be sure of is that there will be bugs. There'll even be bugs before there is running code. One bug I can easily imagine is that the non-overlapping partitioning scheme has a bug in it, such that it turns out there _is_ an overlap some time.

All of that said, I agree with you, particularly about the alternative ways things can suck instead :-/

A
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, Dec 12, 2007 at 12:58:11PM +0100, Zeugswetter Andreas ADI SD wrote:
> Wouldn't one very substantial requirement of such storage be to have it independent of db version, or even db product? Keeping old hardware and software around can be quite expensive.

This was one of the explicit requirements I had when I wrote my pie-in-the-sky outline. Hrm. I wonder if I can get permission to post it. Let me find out. The requirement was, anyway, that we be able to read old versions of archived rows. IIRC there was an implementation choice, whether we would _never_ allow such rows to be SET READ WRITE or whether they'd be immediately upgraded to the present format on SET READ WRITE.

A
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, 2007-12-12 at 11:22 -0500, Andrew Sullivan wrote:
> One bug I can easily imagine is that the non-overlapping partitioning scheme has a bug in it, such that it turns out there _is_ an overlap some time.

Enforcing uniqueness with a global index has a number of disadvantages. The worst of these is that the index continues to get bigger and bigger as the total data volume increases. You have to index all partitions, plus each index entry needs to include a partition id as well as the index key. So not only is it big, it's huge. Huge indexes are slow, so an index with terabytes of data in it is going to be almost unusable.

The best thing to do would be to sit down and work out exactly how big and deep such an index would be in the case you're thinking of, so we can tell whether it is very bad or merely bad. I seem to be the only one saying global indexes are bad, so if people that want them can do the math and honestly say they want them, then I will listen.

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
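Simon's "do the math" can be sketched for the 75TB case mentioned upthread. All the per-row and per-entry sizes below are assumptions picked for round numbers, not measurements:

```python
import math

ROW_BYTES   = 100          # assumed average call record size
ENTRY_BYTES = 32           # assumed key + partition id + item overhead
PAGE_BYTES  = 8192         # PostgreSQL default block size

rows       = 75 * 10**12 // ROW_BYTES            # ~750 billion rows
fanout     = PAGE_BYTES // ENTRY_BYTES           # ~256 entries per index page
leaf_pages = math.ceil(rows / fanout)
depth      = math.ceil(math.log(rows, fanout))   # b-tree levels, root to leaf

leaf_tib = leaf_pages * PAGE_BYTES / 2**40
print(f"~{depth} levels, leaf level alone ~{leaf_tib:.0f} TiB")
```

Under these assumptions the tree is about 5 levels deep with a leaf level in the low tens of TiB — the depth alone is not catastrophic; it is the footprint and the random I/O needed to keep such an index current that hurt.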
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, 2007-12-12 at 10:48 -0800, Josh Berkus wrote:
> Well, in theory we need this capability for upgrade-in-place too. While that project has kind of stalled for the moment, we'll pick it back up again soon.

Who was working on it?

-- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
Re: [HACKERS] WORM and Read Only Tables (v0.1)
Andrew,

> The requirement was, anyway, that we be able to read old versions of archived rows. IIRC there was an implementation choice, whether we would _never_ allow such rows to be SET READ WRITE or whether they'd be immediately upgraded to the present format on SET READ WRITE.

Well, in theory we need this capability for upgrade-in-place too. While that project has kind of stalled for the moment, we'll pick it back up again soon.

-- Josh Berkus PostgreSQL @ Sun San Francisco
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, 12 Dec 2007 19:07:57 + Simon Riggs [EMAIL PROTECTED] wrote:
> I seem to be the only one saying global indexes are bad, so if people that want them can do the math and honestly say they want them, then I will listen.

Global indexes are bad for certain situations; for others they are required. Constraint Exclusion/Partitioning is not only for ginormous tables. It can also be used for maintenance efficiency, micro optimizations and just general data architecture.

Joshua D. Drake

-- The PostgreSQL Company: Since 1997, http://www.commandprompt.com/ Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240 Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
Re: [HACKERS] WORM and Read Only Tables (v0.1)
Simon,

> Who was working on it?

Zdenec and Dhanaraj from Sun, and someone from EDB (I'm not sure who, maybe Korry?). Unfortunately, both companies have shifted staff around and we need to re-start work. Of course, if hackers other than those from EDB and Sun want to attack the problem, the more the merrier.

-- Josh Berkus PostgreSQL @ Sun San Francisco
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Wed, Dec 12, 2007 at 07:07:57PM +, Simon Riggs wrote:
> Enforcing uniqueness with a global index has a number of disadvantages.

This is why I was trying to talk about constraints rather than global indexes. Just because we happen to implement them that way today does not mean that such constraints need be implemented that way in every case. I think especially for the sort of detached rows scenario I was dreaming about, a global index is never going to be good.

A
[HACKERS] WORM and Read Only Tables (v0.1)
Many applications have the need to archive data after it has been through the initial flurry of reads and updates that follows its original insertion. Currently there is no specific feature support to meet this requirement, so I propose to add this for 8.4.

Use Case: VLDB with tons of (now) read only data, some not. Data needs to be accessible, but data itself is rarely touched, allowing storage costs to be minimised via a storage hierarchy of progressively cheaper storage.

Features
- Read Only Tables
- Compressed Tablespaces
- Attaching table
- Per-Tablespace Access Costing
- Performance Tuning

Read-Only Tables
----------------
Postgres supports the concept of freezing tuples, so they can live forever within the database without needing further writes. Currently there is no command that will guarantee that a table has been completely frozen. This makes it difficult to reliably write data files to WORM media for longer term archiving. (WORM means Write-Once, Read-Many.) It's also a pain having to VACUUM a large table again just because a small number of rows need to be frozen.

So we need a DDL command that will ensure all tuples are frozen and then mark the table as read-only. Ideally, we would like to do this in a way that doesn't hold long full table locks, since we want the data to remain accessible at all times. So...

  VACUUM FREEZE table SET READ ONLY;

would be my first thought, but I'm guessing everybody will press me towards supporting the more obvious

  ALTER TABLE table SET READ ONLY;

This command will place a ShareLock (only) on the table, preventing anybody from writing to the table while we freeze it. The ShareLock is incompatible with any transaction that has written to the table, so when we acquire the lock all writers to the table will have completed. We then run the equivalent of a VACUUM FREEZE, which will then be able to freeze *all* rows in one pass (rather than all except the most recent).
On completion of the freeze pass we will then update the pg_class entry to show that it is now read-only, so we will emulate the way VACUUM does this.

This form of the ALTER TABLE command will need to be mangled so it can only run outside of a transaction block, and also so it takes only a ShareLock rather than an AccessExclusiveLock.

Reversing the process is simpler, since we only have to turn off the flag in pg_class:

  ALTER TABLE table SET READ WRITE;

possibly able to do this without grabbing an AccessExclusiveLock, though that isn't an important part of this implementation.

Read-only tables never need VACUUMing, so we would be able to make autovacuum and explicit vacuum ignore them.

Read-only tables may not be written to, yet would still allow implicit or explicit INSERT, UPDATE and DELETE privileges to be held on the table. Attempts to write to the table will result in a specific "read only table cannot be modified" ERROR. This allows a table to be placed into read-only mode for long periods of time and flipped back to read-write if some updates are required. That is useful for various performance reasons, see later. We can't use the privilege mechanism to prevent writes since superusers bypass them. (Thoughts?)

Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only tables will be ignored, since they are effectively already there. So we don't need to change the internals of the locking, nor edit the RI code to remove the call to SHARE lock referenced tables. Do this during post-parse analysis.

Tables can be copied to WORM media by using

  ALTER TABLE table SET TABLESPACE tblspc;

This would also use a ShareLock rather than an AccessExclusiveLock, piggy-backing off the work mentioned above. Running SET TABLESPACE and SET READ ONLY at the same time might sound like a good plan, but ISTM it will require two fairly different code paths, so if we do it at all it will be a later addition.
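Why a ShareLock suffices can be read off PostgreSQL's table-level lock conflict matrix; a minimal encoding of the relevant row (from the documented lock modes, not from this proposal) looks like:

```python
# Subset of PostgreSQL's table-level lock conflict matrix: the modes that
# conflict with SHARE (ShareLock).
conflicts_with_share = {
    "RowExclusiveLock",          # taken by INSERT/UPDATE/DELETE
    "ShareUpdateExclusiveLock",  # taken by VACUUM, among others
    "ShareRowExclusiveLock",
    "ExclusiveLock",
    "AccessExclusiveLock",
}

def share_lock_blocks(mode: str) -> bool:
    return mode in conflicts_with_share

# Writers are excluded while we freeze...
assert share_lock_blocks("RowExclusiveLock")
# ...but plain readers (SELECT takes AccessShareLock) sail through.
assert not share_lock_blocks("AccessShareLock")
```

That combination — block all writers, keep all readers running — is exactly the "data remains accessible at all times" property the proposal wants.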
Compressed Tablespaces
----------------------
Frequently with large data archives there is a requirement to reduce the footprint of the data to allow longer term storage costs to be reduced. For insert-only data we might imagine we can reduce the size of tables by removing unused tuple header information. Although that is possible, repeated headers compress fairly well, so it seems easier to tackle the problem directly by having compressed tables.

Using a streaming library like zlib, it will be easy to read/write data files into a still-usable form but with much reduced size. Access to a compressed table only makes sense as a SeqScan. That would be handled by introducing tablespace-specific access costs, discussed below. Indexes on compressed tables would still be allowed, but would hardly ever be used.

Access would probably be via tablespace-specific storage managers. So implement mdcompress.c alongside md.c in src/backend/storage/smgr. If that implementation route was chosen, it would then allow the compression option to be made at tablespace level, so commands would be: CREATE
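A rough sketch of the streaming idea using Python's zlib bindings (page contents are fabricated; a real smgr would work in C against 8KB blocks):

```python
import zlib

pages = [bytes([i]) * 8192 for i in range(16)]   # fake, highly regular heap pages

# Write path: pages go through one compressor in file order.
comp = zlib.compressobj(6)
stored = b"".join(comp.compress(p) for p in pages) + comp.flush()

# Read path: a SeqScan inflates the stream front to back.
restored = zlib.decompressobj().decompress(stored)
assert restored == b"".join(pages)

# There are no per-page offsets in the stream, which is why random
# index probes into such a table would be so unattractive.
print(f"{len(pages) * 8192} bytes stored as {len(stored)}")
```

The absence of page boundaries in the compressed stream is what makes "SeqScan only" the natural access pattern, and hence why tablespace-specific access costs are the proposed planner hook.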
Re: [HACKERS] WORM and Read Only Tables (v0.1)
Simon Riggs [EMAIL PROTECTED] writes:
> So... VACUUM FREEZE table SET READ ONLY; would be my first thought, but I'm guessing everybody will press me towards supporting the more obvious ALTER TABLE table SET READ ONLY; This command will place a ShareLock (only) on the table, preventing anybody from writing to the table while we freeze it. The ShareLock is incompatible with any transaction that has written to the table, so when we acquire the lock all writers to the table will have completed. We then run the equivalent of a VACUUM FREEZE which will then be able to freeze *all* rows in one pass (rather than all except the most recent). On completion of the freeze pass we will then update the pg_class entry to show that it is now read-only, so we will emulate the way VACUUM does this.

To be clear: if it meets a block for which a tuple is not freezable -- that is, it has an xmin or xmax more recent than the global xmin -- then it needs to block, waiting for the backend with that recent xmin. Then presumably it needs to update its concept of recent global xmin going forward. You might be best off grabbing a list of txid-xmin pairs when you start and sorting them by xmin, so you can loop through them, sleeping until you reach the first txid with an xmin large enough to continue.

> Reversing the process is simpler, since we only have to turn off the flag in pg_class:

I'm not sure how this interacts with:

> Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only tables will be ignored, since they are effectively already there. So we don't need to change the internals of the locking, nor edit the RI code to remove the call to SHARE lock referenced tables. Do this during post-parse analysis.

Since queries which think they hold FOR SHARE tuple locks will be magically losing their share locks if you turn off the read-only flag. Do you need to obtain an exclusive lock on the table to turn it read-write?
-- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
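Greg's snapshot-and-sort suggestion, sketched with made-up xids (there is no real backend list or ProcArray here, just the ordering logic):

```python
# Snapshot of concurrent backends at the start of the freeze pass: made-up
# (txid, xmin) pairs, where xmin is the oldest xid each backend can still see.
running = [(204, 180), (201, 197), (203, 150)]
freeze_limit = 200          # xmins at or above this no longer block freezing

waited = []
for txid, xmin in sorted(running, key=lambda pair: pair[1]):
    if xmin >= freeze_limit:
        break               # everything later in the sorted list is new enough
    # In the real thing: sleep until transaction `txid` completes.
    waited.append(txid)

print("waited for, in xmin order:", waited)
```

Sorting once up front means each blocking backend is waited on at most once, oldest xmin first, instead of repeatedly re-polling the whole set.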
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 11:12 +, Simon Riggs wrote:
> Features - Read Only Tables - Compressed Tablespaces

I wonder if, instead of read-only tables, it wouldn't be better to have some kind of automatic partitioning which permits having different chunks of the table data in different tablespaces, and a freeze command which effectively moves the data from the (normally small) active chunk to the archive chunk when its transaction id is older than a predefined threshold? Then put the active chunk on a high-performance file system and the archive tablespace on a compressed/slow/cheap file system and you're done. Allow even the archive chunk to be updateable, and put new tuple data in the active chunk. It would work just fine for cases where the old data is rarely updated/deleted...

Another advantage I guess would be that active data would more likely stay in cache, as updated records would stay together and not be spread over the inactive data.

Cheers, Csaba.
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote:
> Another advantage I guess would be that active data would more likely stay in cache, as updated records would stay together and not spread over the inactive.

And I forgot to mention that vacuum could mostly skip the archive part and only vacuum the active part, which would drastically reduce the cost of vacuuming big active tables.

Cheers, Csaba.
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 13:44, Csaba Nagy wrote:
> I wonder if instead of read-only tables wouldn't it be better to have some kind of automatic partitioning which permits to have different chunks of the table data in different tablespaces, and a freeze command which effectively moves the data from the (normally small) active chunk to the archive chunk when its transaction id is older than a predefined threshold?

This would be doable using Simon's proposed commands.

> Then put the active chunk on a high performance file system and the archive tablespace on a compressed/slow/cheap file system and you're done. Allow even the archive chunk to be updateable, and put new tuple data in the active chunk. It would work just fine for cases where the old data is rarely updated/deleted...

You can't update a table on a read-only (write-once) partition, at least not with the current header structure.
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 14:58 +0200, Hannu Krosing wrote:
> You can't update a table on a read-only (write-once) partition, at least not with current header structure.

OK, but that's what I'm challenging: why do you need a write-once partition? Do you mean tapes by that? OK, it means I was thinking in completely different usage scenarios then...

Cheers, Csaba.
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On 12/11/07, Simon Riggs [EMAIL PROTECTED] wrote:
> Compressed Tablespaces
> Using a streaming library like zlib, it will be easy to read/write data files into a still-usable form but with much reduced size. Access to a compressed table only makes sense as a SeqScan. That would be handled by introducing tablespace-specific access costs, discussed below. Indexes on compressed tables would still be allowed, but would hardly ever be used.

I've actually been wanting this lately, for a couple of reasons. One is reduced disk footprint, but the other is reduced I/O, similar to how TOAST helps with large fields now. (In my particular scenario, TOAST can't help due to small field sizes.) It would be useful to have available even on read/write data.

To that end, it would probably make more sense to use a block compression algorithm rather than a streaming one. Block-based algorithms can generally get better compression than streaming ones as well, at least when fed large enough blocks. I'm not familiar with the implementation issues, other than the obvious one that variable block sizes make the I/O subsystem look very different, so I don't know if there's a major tradeoff between the two strategies (even just for read-only).

> I'm open to arguments that we don't need this at all because filesystem utilities exist that do everything we need. Your experience will be good to hear about in regard to this feature.

Some filesystems do support transparent compression, but they're not always available. It would be nice to have compression on unsophisticated systems with cheap hardware.
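The block-based alternative can be sketched the same way: compress each 8KB block independently and keep a small map of offsets, so any single block becomes readable without touching the rest (sizes and the map layout are invented for illustration):

```python
import random
import zlib

BLOCK = 8192
rng = random.Random(0)
# Mildly compressible fake data: bytes drawn from a narrow value range.
data = bytes(rng.randrange(64) for _ in range(BLOCK * 8))

# Write path: each block compressed on its own, plus an (offset, length) map.
stored, block_map = b"", []
for i in range(0, len(data), BLOCK):
    chunk = zlib.compress(data[i:i + BLOCK])
    block_map.append((len(stored), len(chunk)))
    stored += chunk

# Read path: fetch block 5 alone -- no preceding blocks are inflated,
# which is exactly what a pure streaming format cannot offer.
off, length = block_map[5]
page5 = zlib.decompress(stored[off:off + length])
assert page5 == data[5 * BLOCK:6 * BLOCK]
```

The trade-off raised above shows up directly: compressed blocks have variable length, so the storage manager needs the extra offset map and can no longer compute a block's file position from its number.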
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On 11/12/2007, Csaba Nagy [EMAIL PROTECTED] wrote:
> OK, but that's what I'm challenging, why do you need a write once partition? You mean by that tapes? OK, it means I was thinking in completely different usage scenarios then...

I think DVD or CD would make sense; tapes have the added limitation of being sequential access only.

Peter Childs
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On 11/12/2007, Simon Riggs [EMAIL PROTECTED] wrote:
> Attach
> Writing tables on one system and then moving that data to other systems is fairly common. If we supported read-only tables then you might consider how you would publish new versions to people. For now, we just want to consider how we will upgrade from one release to another without needing to unload and reload potentially many Terabytes of data. We can't delete the old data until the new data is successfully loaded, so we will have a huge temporary storage cost. This could go very slowly if we use cheaper storage, plus reloading the data means we have to re-freeze it again also. So we need a way of attaching the old tables to the new database. We might call this binary upgrade, or we might be slightly less ambitious and talk about just moving the old read-only data. That's all I want to do at this stage. I'm mentioning this here now to see what comes out in debate, and what others are planning to work on in this area.

This sounds like allowing new versions to read old versions' file structure, probably, I guess, on a tablespace-by-tablespace basis. Another advantage might be the ability to load an old PITR backup into a new version across major versions, but I'm not sure that would work. Maybe we need a command to upgrade a tablespace to a new version's file format? We need to be careful we don't stunt future progress by fixing the file format too much.

Peter Childs
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 11:49 +, Gregory Stark wrote:
> To be clear, if it meets a block for which a tuple is not freezable -- that is, it has an xmin or xmax more recent than the global xmin -- then it needs to block, waiting for the backend with that recent xmin. Then presumably it needs to update its concept of recent global xmin going forward. You might be best off grabbing a list of txid-xmin pairs when you start and sorting them by xmin so you can loop through them, sleeping until you reach the first txid with an xmin large enough to continue.

D'oh. Completely agreed. Mea culpa. I had that bit in my original design, but I was clearly looking elsewhere on this. I'd been trying to think about how to do this since about 2 years ago and it was only the CREATE INDEX CONCURRENTLY stuff that showed me how. Thanks for nudging me.

> > Reversing the process is simpler, since we only have to turn off the flag in pg_class:
> I'm not sure how this interacts with:
> > Requests for tuple-level SHARE locks (e.g. FOR SHARE) against read-only tables will be ignored, since they are effectively already there.
So we don't need to change the internals of the locking, nor edit the RI code to remove the call to SHARE lock referenced tables. Do this during post-parse analysis. Since queries which think they hold FOR SHARE tuple locks will be magically losing their share locks if you turn off the read-only flag. Do you need to obtain an exclusive lock on the table to turn it read-write? Agreed. I wasn't suggesting implementing it without one, just noting that it might have been possible, but it seems not, as you say. I don't think it's important to be able to do that with less than AccessExclusiveLock. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
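The wait strategy Greg suggests above can be sketched in a few lines. This is an illustrative sketch only, not PostgreSQL source; `wait_for_old_snapshots`, `is_still_running`, and the polling approach are all hypothetical stand-ins for the backend's real transaction-tracking machinery:

```python
import time

def wait_for_old_snapshots(running_xacts, is_still_running, poll_interval=1.0):
    """Wait out every transaction captured at the start of the freeze.

    running_xacts: list of (txid, xmin) pairs snapshotted once, up front.
    """
    # Sort by xmin so we block on the oldest snapshot holders first, as
    # suggested: sleep through them in order rather than re-scanning.
    for txid, xmin in sorted(running_xacts, key=lambda t: t[1]):
        while is_still_running(txid):
            time.sleep(poll_interval)
    # Every transaction captured at the start has now completed, so no
    # remaining snapshot can need the unfrozen xmin/xmax values.
```

The point of sorting by xmin is that once the oldest holder is gone, later entries in the list are often already finished, so the loop degenerates to cheap checks.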
Re: [HACKERS] WORM and Read Only Tables (v0.1)
Simon Riggs [EMAIL PROTECTED] writes: This command will place a ShareLock (only) on the table, preventing anybody from writing to the table while we freeze it. The ShareLock is incompatible with any transaction that has written to the table, so when we acquire the lock all writers to the table will have completed. We then run the equivalent of a VACUUM FREEZE which will then be able to freeze *all* rows in one pass (rather than all except the most recent). This breaks MVCC. The fact that a transaction has completed is not license to discard tuple xmin immediately. regards, tom lane
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 10:19 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: This command will place a ShareLock (only) on the table, preventing anybody from writing to the table while we freeze it. The ShareLock is incompatible with any transaction that has written to the table, so when we acquire the lock all writers to the table will have completed. We then run the equivalent of a VACUUM FREEZE which will then be able to freeze *all* rows in one pass (rather than all except the most recent). This breaks MVCC. The fact that a transaction has completed is not license to discard tuple xmin immediately. Yeh, agreed. I knew I'd solved that bit, so I was focused elsewhere. Sloppy, so apologies. I was originally planning to put a wait in at the beginning, as is used by CREATE INDEX CONCURRENTLY, though I prefer Greg's variant because it's more forgiving. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
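Tom's objection can be made concrete with a toy visibility check. This is a deliberately simplified model, not the real HeapTupleSatisfies* logic: a snapshot taken while transaction 100 was still running must not see its rows even after 100 commits, but freezing the tuple (modelled here as xmin None) makes it visible to everyone, including that old snapshot:

```python
def visible(tuple_xmin, snapshot_xmin, committed):
    """Toy MVCC rule: is a tuple visible to a given snapshot?"""
    # A frozen tuple is unconditionally visible -- that is the problem.
    if tuple_xmin is None:
        return True
    # Otherwise, visible only if the inserter committed *before* the
    # snapshot started (its xid is below the snapshot's xmin).
    return tuple_xmin in committed and tuple_xmin < snapshot_xmin

old_snapshot_xmin = 100   # snapshot taken while xact 100 was in progress
committed = {100}         # xact 100 has since committed

assert not visible(100, old_snapshot_xmin, committed)  # correctly invisible
assert visible(None, old_snapshot_xmin, committed)     # frozen: wrongly visible
```

So the freeze pass has to wait until no snapshot older than the writers remains, which is what the sorted-xmin wait above buys.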
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 13:44 +0100, Csaba Nagy wrote: On Tue, 2007-12-11 at 11:12 +, Simon Riggs wrote: Features - Read Only Tables - Compressed Tablespaces I wonder if, instead of read-only tables, it wouldn't be better to have some kind of automatic partitioning That's definitely on my list of requirements for partitioning. which permits having different chunks of the table data in different tablespaces, and a freeze command which effectively moves the data from the (normally small) active chunk to the archive chunk when its transaction id is older than a predefined threshold? As Hannu says, this is exactly what the other features will allow, so Yes! -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
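The active/archive scheme Csaba describes amounts to a simple age-threshold sweep. A minimal sketch, with hypothetical names and an illustrative xmin comparison standing in for real transaction-id age arithmetic:

```python
def migrate_frozen(active, archive, freeze_threshold_xid):
    """Move rows older than the threshold from the active chunk to the archive."""
    still_active = []
    for row in active:
        if row["xmin"] < freeze_threshold_xid:
            archive.append(row)        # old enough: belongs in the read-only chunk
        else:
            still_active.append(row)   # recent rows stay in the writable chunk
    active[:] = still_active           # shrink the active chunk in place
```

The appeal is that the archive chunk, once written, never changes, so it can live on compressed or WORM storage while the small active chunk keeps normal write behaviour.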
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, Dec 11, 2007 at 11:12:46AM +, Simon Riggs wrote: Read-Only Tables In the past when this topic came up, there was some discussion of doing this at a level somewhere below the table horizon. There are a number of nasty limitations for partitions currently (not the least of which is that real uniqueness guarantees are impractical), so allowing users to specify some segment of the table to be read only without imposing it on the whole table would be awful nice. I seem to recall Jan had an idea on how to do it, but I could be wrong. Also, doing this at the tuple, rather than table-wide, level might lead to additional capabilities in this area: Attach -- Writing tables on one system and then moving that data to other systems is fairly common. If we supported read-only tables then you might consider how you would publish new versions to people. Some time ago I was speculating on pie-in-the-sky features I might like in Postgres, and it was something like this attach. But the idea was somehow related to the read-only tuples. In my specific case, I have piles and piles of mostly useless data. Sometimes, however, some of that data is possibly useful in retrospect. So the suggestion was to have tables that could be mostly offline -- archived somewhere -- but for which we had enough metadata online to say, You have some data that might match in catalog C. Go mount it, and I'll check. I think this is subtly different from the attach case you're outlining? A
Re: [HACKERS] WORM and Read Only Tables (v0.1)
On Tue, 2007-12-11 at 14:25 -0500, Andrew Sullivan wrote: On Tue, Dec 11, 2007 at 11:12:46AM +, Simon Riggs wrote: Read-Only Tables In the past when this topic came up, there was some discussion of doing this at a level somewhere below the table horizon. There are a number of nasty limitations for partitions currently (not the least of which is that real uniqueness guarantees are impractical), so allowing users to specify some segment of the table to be read only without imposing it on the whole table would be awful nice. I seem to recall Jan had an idea on how to do it, but I could be wrong. Also, doing this at the tuple, rather than table-wide, level might lead to additional capabilities in this area: Seems fair comment. I'll move forward my thoughts on partitioning, so we can decide whether to do things this way, or below the table horizon as you say. I've got some detailed notes on this already which showed it was roughly balanced between the two ways. I'll write it up so we can see; the answer might change during the writing. Attach -- Writing tables on one system and then moving that data to other systems is fairly common. If we supported read-only tables then you might consider how you would publish new versions to people. Some time ago I was speculating on pie-in-the-sky features I might like in Postgres, and it was something like this attach. But the idea was somehow related to the read-only tuples. In my specific case, I have piles and piles of mostly useless data. Sometimes, however, some of that data is possibly useful in retrospect. So the suggestion was to have tables that could be mostly offline -- archived somewhere -- but for which we had enough metadata online to say, You have some data that might match in catalog C. Go mount it, and I'll check. I think this is subtly different from the attach case you're outlining? Yes it is, but I had hoped that what you're asking for is catered for here. 
If you have a hierarchical storage manager, you can just access the file, whereupon the actual file will be de-archived to allow access. Or maybe you have it on MAID storage, so we spin up the disks to allow access to the files. So I guess I was expecting the de-archive to be automated at the file system level. http://en.wikipedia.org/wiki/Hierarchical_Storage_Management Regrettably, I'm not aware of an open source HSM, though XFS has some hooks in it that mention this. That's an old IBM term, so some people might refer to this concept as tiered storage. Since I was planning to modify smgr to allow different kinds of tablespaces it should be possible to make the file access issue some kind of a call out to mount the appropriate files. What would that call out look like? Would that be a DMAPI/XDSM implementation, or something simpler as was used for PITR, or a roll-your-own plug-in hook? I can see I'll have to re-wire smgr_nblocks() for non-md smgrs to access pg_class.relpages rather than issue a seek, which will need to issue an open. That way we can do planning without actually accessing the table. (Maybe that's presuming we're dealing with tables, oh well.) -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
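The smgr_nblocks() re-wiring Simon mentions can be sketched abstractly. This is a hypothetical model, not the real smgr API: the idea is simply that when the underlying file may be archived offline, the block count comes from the catalog's relpages estimate so that planning never forces a mount or de-archive:

```python
def smgr_nblocks(relation, storage_is_online, seek_end, catalog_relpages):
    """Return the relation's size in blocks without touching offline storage."""
    if storage_is_online(relation):
        return seek_end(relation)        # normal md path: seek to end of file
    # Offline (HSM/MAID) path: trust pg_class.relpages so the planner can
    # cost the scan; the mount call-out fires only on actual data access.
    return catalog_relpages(relation)
```

Using relpages is safe here precisely because the table is read-only: the catalog value cannot drift from the true file size after the freeze.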