Re: [HACKERS] Table and Index compression
For future reference, and since this keeps appearing every few months: the license of LZO is not acceptable for inclusion in or use with PostgreSQL. You need to find a different library if you want to pursue this further.

Yes, I know about the license... I used LZO for tests, but since my little experiment with compression didn't give any really motivating results, I won't pursue this further.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Table and Index compression
On Tuesday 11 August 2009 13:05:39 Pierre Frédéric Caillaud wrote:
> Well, here is the patch. I've included a README, which I paste here.
> If someone wants to play with it (after the CommitFest...) feel free to do so.
> While it was an interesting thing to try, I don't think it has enough potential to justify more effort...
>
> * How to test
> - apply the patch
> - copy minilzo.c and minilzo.h to src/backend/storage/smgr

For future reference, and since this keeps appearing every few months: the license of LZO is not acceptable for inclusion in or use with PostgreSQL. You need to find a different library if you want to pursue this further.
Re: [HACKERS] Table and Index compression
Well, here is the patch. I've included a README, which I paste here. If someone wants to play with it (after the CommitFest...) feel free to do so. While it was an interesting thing to try, I don't think it has enough potential to justify more effort...

* How to test
- apply the patch
- copy minilzo.c and minilzo.h to src/backend/storage/smgr
- configure & make
- enjoy

* How it works
- pg block size set to 32K
- an extra field is added in the header telling the compressed length. THIS IS BAD, this information should be stored in a separate fork of the relation, because:
  - it would then be backwards compatible
  - the number of bytes to read from a compressed page would be known in advance
- the table file is sparse
- the page header is not compressed
- pages are written at their normal positions, but only the compressed bytes are written
- if compression gains nothing, the un-compressed page is stored
- the filesystem doesn't store the un-written blocks

* Benefits
- Sparse file holes are not cached, so OS disk cache efficiency is at least x2
- Random access is faster, having a better probability to hit cache (sometimes a bit faster, sometimes it's spectacular)
- Yes, it does save space (> 50%)

* Problems
- Biggest problem: any write to a table that writes data that compresses less than whatever was there before can fail on a disk-full error.
- ext3 sparse file handling isn't as fast as I wish it would be: on seq scans, even if it reads 2x less data, and decompresses very fast, it's still slower...
- many seq scans (especially with aggregates) are CPU bound anyway
- therefore, some kind of background reader-decompressor would be needed
- pre-allocation has to be done to avoid extreme fragmentation of the file, which kind of defeats the purpose
- it still causes fragmentation

* Conclusion (for now)
It was a nice thing to try, but I believe it would be better if this was implemented directly in the filesystem, on the condition that it should be implemented well (ie not like NTFS compression).

[Attachment: pg_8.4.0_compression_patch_v001.tar.gz -- GNU Zip compressed data]
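The "if compression gains nothing, the un-compressed page is stored" rule above can be sketched as follows. This is only an illustration, not the patch: zlib stands in for the minilzo the patch actually uses, and the 4K threshold mirrors the smallest unit the filesystem can leave as a hole.

```python
import os
import zlib

PAGE_SIZE = 32 * 1024   # pg block size in the patch
FS_BLOCK = 4 * 1024     # smallest unit the filesystem can leave as a hole

def bytes_to_store(page: bytes) -> bytes:
    """Return the bytes actually written for one 32K page.

    The page is stored compressed only if that frees at least one
    filesystem block; otherwise it is stored raw, so a page never
    grows past PAGE_SIZE on disk.
    """
    assert len(page) == PAGE_SIZE
    compressed = zlib.compress(page)          # the patch uses minilzo here
    if len(compressed) <= PAGE_SIZE - FS_BLOCK:
        return compressed                     # saves >= 1 FS block
    return page                               # no useful gain: store raw

# A page full of near-identical timestamps compresses; random data does not.
redundant = (b"2009-08-11 13:05:39\x00" * (PAGE_SIZE // 20)).ljust(PAGE_SIZE, b"\x00")
print(len(bytes_to_store(redundant)) < PAGE_SIZE)                 # True: compressed
print(len(bytes_to_store(os.urandom(PAGE_SIZE))) == PAGE_SIZE)    # True: stored raw
```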
Re: [HACKERS] Table and Index compression
Pierre,

> I added a field in PageHeader which contains :
> - 0 to indicate a non-compressed page
> - length of compressed data if compressed
>
> If compression gains nothing (ie gains less than 4K), the page is stored raw.
>
> It seems that only pages having a PageHeader are handled by md.c, so it should work (am I mistaken ?)

Well, there's the issue of upgradability; this would require an incompatible change to the on-disk format. So we don't want to go further down this route unless we're sure it's worthwhile.

-- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 04:17:18PM +0200, Pierre Frédéric Caillaud wrote:
> I'm answering my own question : at the beginning of the run, postgres creates an 800MB temporary file, then it fills the table, then deletes the temp file.
> Is this because I use generate_series to fill the test table ?

Doh, yes. A set-returning function's result is written to a temp location first and then read back again once the function returns. You'll have more luck if you do:

SELECT now() + '1 sec'::INTERVAL, (1+random()*8), random()*1000,
       n+random()*1, n+random()*1000, n
FROM ( SELECT generate_series( 1, 6000 ) ) x(n);

-- Sam http://samason.me.uk/
Re: [HACKERS] Table and Index compression
On Fri, 07 Aug 2009 15:42:35 +0200, Kevin Grittner wrote:

>> tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives, but it does it too if I put the tablespace and data on the same volume.
>> it starts out relatively fast :
>>
>> si so bi bo in cs us sy id wa
>> 0 0 0 43680 2796 19162 42 18 37 3
>> 0 0 0 45515 3165 20652 44 17 35 4
>> 0 0 0 43130 3046 21991 43 17 38 2
>>
>> then here it starts to slow down : check "bo" output
>>
>> 0 0 181 24439 577 3541 31 6 40 23
>> 0 0 176 17258 292 1324 31 4 43 22
>> 0 0 0 18626 162 693 35 3 49 12
>> 0 0 1 21554 235 1362 31 5 50 14
>> 0 0 0 19177 324 2053 35 4 50 12
>> 0 0 0 19208 206 1155 36 4 48 12
>> 0 0 1 20740 215 1117 33 4 50 13
>> 0 0 0 20154 258 1100 32 4 50 14
>> 0 0 0 20355 316 2056 34 5 49 12
>>
>> ... and it stays like this until the end of the INSERT...
>
> I don't know if this is it, but we tend to see outrageously high performance at the start of a benchmark because of the battery-backed cache in the RAID controller. Every write comes back immediately after copying the data to RAM. After a while the cache gets filled and you settle down to a steady state. If it's not BBU with write-back enabled, perhaps you have drives that lie about write completion?
>
> -Kevin

I'm answering my own question : at the beginning of the run, postgres creates an 800MB temporary file, then it fills the table, then deletes the temp file.

Is this because I use generate_series to fill the test table ?
Re: [HACKERS] Table and Index compression
Pierre Frédéric Caillaud wrote:
> tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives, but it does it too if I put the tablespace and data on the same volume.
> it starts out relatively fast :
>
> si so bi bo in cs us sy id wa
> 0 0 0 43680 2796 19162 42 18 37 3
> 0 0 0 45515 3165 20652 44 17 35 4
> 0 0 0 43130 3046 21991 43 17 38 2
>
> then here it starts to slow down : check "bo" output
>
> 0 0 181 24439 577 3541 31 6 40 23
> 0 0 176 17258 292 1324 31 4 43 22
> 0 0 0 18626 162 693 35 3 49 12
> 0 0 1 21554 235 1362 31 5 50 14
> 0 0 0 19177 324 2053 35 4 50 12
> 0 0 0 19208 206 1155 36 4 48 12
> 0 0 1 20740 215 1117 33 4 50 13
> 0 0 0 20154 258 1100 32 4 50 14
> 0 0 0 20355 316 2056 34 5 49 12
>
> ... and it stays like this until the end of the INSERT...

I don't know if this is it, but we tend to see outrageously high performance at the start of a benchmark because of the battery-backed cache in the RAID controller. Every write comes back immediately after copying the data to RAM. After a while the cache gets filled and you settle down to a steady state. If it's not BBU with write-back enabled, perhaps you have drives that lie about write completion?

-Kevin
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 03:29:44PM +0200, Pierre Frédéric Caillaud wrote:
> vmstat output :

Sorry, I don't know enough of PG's internals to suggest anything here, but iostat may give you more details as to what's going on.

-- Sam http://samason.me.uk/
Re: [HACKERS] Table and Index compression
Not strictly related to compression, but I've noticed something really strange... pg 8.4 (vanilla) is doing it, and my compressed version is doing it too.

tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives, but it does it too if I put the tablespace and data on the same volume.

The traditional test table :

BEGIN;
CREATE TABLE dwb (
    bid SERIAL,
    aid INTEGER NOT NULL,
    ts TIMESTAMP NOT NULL,
    i1 INTEGER NOT NULL,
    i2 INTEGER NOT NULL,
    i3 INTEGER NOT NULL,
    i4 INTEGER NOT NULL
) WITHOUT OIDS;

The traditional test data :

INSERT INTO dwb (ts,aid,i1,i2,i3,i4)
SELECT now() + '1 sec'::INTERVAL, (1+random()*8), random()*1000,
       n+random()*1, n+random()*1000, n
FROM generate_series( 1, 6000 ) AS n;

vmstat output : it starts out relatively fast :

si so bi bo in cs us sy id wa
0 0 0 43680 2796 19162 42 18 37 3
0 0 0 45515 3165 20652 44 17 35 4
0 0 0 43130 3046 21991 43 17 38 2

then here it starts to slow down : check "bo" output

0 0 181 24439 577 3541 31 6 40 23
0 0 176 17258 292 1324 31 4 43 22
0 0 0 18626 162 693 35 3 49 12
0 0 1 21554 235 1362 31 5 50 14
0 0 0 19177 324 2053 35 4 50 12
0 0 0 19208 206 1155 36 4 48 12
0 0 1 20740 215 1117 33 4 50 13
0 0 0 20154 258 1100 32 4 50 14
0 0 0 20355 316 2056 34 5 49 12

... and it stays like this until the end of the INSERT... It's not writing any xlog since the table was created after the BEGIN... I'm trying to benchmark insert speed, but no luck. This volume does about 100 MB/s sustained write speed, so ?..
Re: [HACKERS] Table and Index compression
> For reference what I'm picturing is this:
>
> When a table is compressed it's marked read-only which bars any new tuples from being inserted or existing tuples being deleted. Then it's frozen and any pages which contain tuples which can't be frozen are waited on until they can be. When it's finished every tuple has to be guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to each block. It could just be a directly addressed array of offsets and lengths. All block lookups have to first load the page of the indirection map, then read the appropriate section of the original file and decompress it into shared buffers.

I had pondered the idea of a fork storing the compressed status of each page, because it has advantages :
- no need to change the page layout to insert a "is compressed" flag
- possible to compress any data, not just standard pages
- if you know the compressed size of a page in advance, it is much easier to prefetch it entirely and not just the first chunk, or read too much...

> From a programming point of view this is nice and simple. From a user's point of view it's a bit of a pain since it means you have to rewrite your whole table when you want to compress it. And it means you have to rewrite it all again if you decide you want to set it back to read-write. My experience with people who have very large tables is that they design their whole process around the goal of avoiding having to move the data once it's written.

Note that if a table is huge, it is always cut in (currently) 1GB slices, so you could operate on one slice at a time, then release a lock, let the backlog of queries flow, and resume. Realtime compression would be much less of a hassle to use, though...
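Greg's "directly addressed array of offsets and lengths" could be as small as one fixed-width entry per logical block. A toy sketch, not the real thing: zlib instead of a real page compressor, in-memory files instead of relation forks:

```python
import io
import struct
import zlib

PAGE_SIZE = 32 * 1024
ENTRY = struct.Struct("<QI")  # per block: (offset: u64, compressed length: u32)

def write_compressed(pages):
    """Compress pages back to back; return (data file, indirection fork)."""
    data, fork = io.BytesIO(), io.BytesIO()
    for page in pages:
        blob = zlib.compress(page)
        fork.write(ENTRY.pack(data.tell(), len(blob)))  # record where it landed
        data.write(blob)
    return data, fork

def read_block(data, fork, blkno):
    """Look up block blkno in the fork, then decompress just that slice."""
    fork.seek(blkno * ENTRY.size)
    offset, length = ENTRY.unpack(fork.read(ENTRY.size))
    data.seek(offset)
    return zlib.decompress(data.read(length))

pages = [bytes([i]) * PAGE_SIZE for i in range(4)]
data, fork = write_compressed(pages)
print(read_block(data, fork, 2) == pages[2])  # True
```

The fork is tiny (12 bytes per 32K block here), so it caches well; the cost is one extra lookup per block read, which is the trade-off discussed above.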
Re: [HACKERS] Table and Index compression
On Fri, Aug 7, 2009 at 8:18 AM, Greg Stark wrote:
> For reference what I'm picturing is this:
>
> When a table is compressed it's marked read-only which bars any new tuples from being inserted or existing tuples being deleted. Then it's frozen and any pages which contain tuples which can't be frozen are waited on until they can be. When it's finished every tuple has to be guaranteed to be fully frozen.
>
> Then the relation is rewritten in compressed form. Each block is compressed one by one and written one after the other to disk.
>
> At the same time a new fork is written which contains a pointer to each block. It could just be a directly addressed array of offsets and lengths. All block lookups have to first load the page of the indirection map, then read the appropriate section of the original file and decompress it into shared buffers.
>
> From a programming point of view this is nice and simple. From a user's point of view it's a bit of a pain since it means you have to rewrite your whole table when you want to compress it. And it means you have to rewrite it all again if you decide you want to set it back to read-write. My experience with people who have very large tables is that they design their whole process around the goal of avoiding having to move the data once it's written.

If you add an indirection table, it's not strictly necessary for the table to be read-only, though if you want to make it read-write you'd need to think about how to defragment.

...Robert
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 12:59:57PM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason wrote:
> >> Well most users want compression for the space savings. So running out of space sooner than without compression when most of the space is actually unused would disappoint them.
> >
> > Note, that as far as I can tell for a filesystem you only need to keep enough reserved for the amount of uncompressed dirty buffers you have in memory. As space runs out in the filesystem all that happens is that the amount of (uncompressed?) dirty buffers you can safely have around decreases.
>
> And when it drops to zero?

That was why I said you need to have one page left "to handle the base case". I was treating the inductive case as the interesting common case and considered the base case of lesser interest.

> > In PG's case, it would seem possible to do the compression and then check to see if the resulting size is greater than 4kB. If it is you write into the 4kB page size and write uncompressed data. Upon reading you do the inverse, if it's 4kB then no need to decompress. I believe TOAST does this already.
>
> It does, as does gzip and afaik every compression system.

It's still a case that needs to be handled explicitly by the code. Just for reference, gzip does not appear to do this when I test it:

  echo -n 'a' | gzip > tmp.gz
  gzip -l --verbose tmp.gz

says the compression ratio is "-200%" (an empty string results in an infinite increase in size yet gets displayed as "0%" for some strange reason). It's only when you hit six 'a's that you start to get positive ratios. Note that this is taking headers into account; the compressed size is 23 bytes for both 'aaa' and 'aa' but the uncompressed size obviously changes. gzip does indeed have a "copy" method, but it doesn't seem to be being used.
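The same small-input expansion is easy to reproduce with zlib; an illustrative aside, and note zlib's framing differs from gzip's container, so the exact byte counts don't match the 23 bytes quoted above:

```python
import zlib

# Tiny inputs grow: the stream's own framing outweighs any savings.
for n in (1, 3, 6, 100):
    out = zlib.compress(b"a" * n)
    print(n, len(out), len(out) > n)  # expansion for tiny n, savings for n=100
```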
-- Sam http://samason.me.uk/
Re: [HACKERS] Table and Index compression
For reference what I'm picturing is this:

When a table is compressed it's marked read-only which bars any new tuples from being inserted or existing tuples being deleted. Then it's frozen and any pages which contain tuples which can't be frozen are waited on until they can be. When it's finished every tuple has to be guaranteed to be fully frozen.

Then the relation is rewritten in compressed form. Each block is compressed one by one and written one after the other to disk.

At the same time a new fork is written which contains a pointer to each block. It could just be a directly addressed array of offsets and lengths. All block lookups have to first load the page of the indirection map, then read the appropriate section of the original file and decompress it into shared buffers.

From a programming point of view this is nice and simple. From a user's point of view it's a bit of a pain since it means you have to rewrite your whole table when you want to compress it. And it means you have to rewrite it all again if you decide you want to set it back to read-write. My experience with people who have very large tables is that they design their whole process around the goal of avoiding having to move the data once it's written.
Re: [HACKERS] Table and Index compression
On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason wrote:
>> Well most users want compression for the space savings. So running out of space sooner than without compression when most of the space is actually unused would disappoint them.
>
> Note, that as far as I can tell for a filesystem you only need to keep enough reserved for the amount of uncompressed dirty buffers you have in memory. As space runs out in the filesystem all that happens is that the amount of (uncompressed?) dirty buffers you can safely have around decreases.

And when it drops to zero?

> In PG's case, it would seem possible to do the compression and then check to see if the resulting size is greater than 4kB. If it is you write into the 4kB page size and write uncompressed data. Upon reading you do the inverse, if it's 4kB then no need to decompress. I believe TOAST does this already.

It does, as does gzip and afaik every compression system.

-- greg http://mit.edu/~gsstark/resume.pdf
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 11:49:46AM +0100, Greg Stark wrote:
> On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason wrote:
> > When you choose a compression algorithm you know how much space a worst case compression will take (i.e. lzo takes up to 8% more for a 4kB block size). This space should be reserved in case of situations like the above and the filesystem shouldn't over-commit on this.
> >
> > Never had to think about this before though so I'm probably missing something obvious.
>
> Well most users want compression for the space savings. So running out of space sooner than without compression when most of the space is actually unused would disappoint them.

Note, that as far as I can tell for a filesystem you only need to keep enough reserved for the amount of uncompressed dirty buffers you have in memory. As space runs out in the filesystem all that happens is that the amount of (uncompressed?) dirty buffers you can safely have around decreases.

In practical terms, this says that performance drops off when there is less free space than the size of the filesystem's cache, and I think you have to reserve exactly one block to handle the base case. But there are so many problems associated with completely filling a filesystem that I'm not sure if this would really matter.

> Also, I'm puzzled why the space increase would be proportional to the amount of data and be more than 300 bytes. There's no reason it wouldn't be a small fixed amount. The ideal is you set aside one bit -- if the bit is set the rest is compressed and has to save at least one bit. If the bit is not set then the rest is uncompressed. Maximum bloat is 1-bit. In real systems it's more likely to be a byte or a word.

It'll depend on the compression algorithm; lz algorithms are dictionary based so you'd have a single entry for the incompressible data and then a pointer to the entry.

In PG's case, it would seem possible to do the compression and then check to see if the resulting size is greater than 4kB. If it is you write into the 4kB page size and write uncompressed data. Upon reading you do the inverse, if it's 4kB then no need to decompress. I believe TOAST does this already.

-- Sam http://samason.me.uk/
Re: [HACKERS] Table and Index compression
> Also, I'm puzzled why the space increase would be proportional to the amount of data and be more than 300 bytes. There's no reason it wouldn't be a small fixed amount. The ideal is you set aside one bit -- if the bit is set the rest is compressed and has to save at least one bit. If the bit is not set then the rest is uncompressed. Maximum bloat is 1-bit. In real systems it's more likely to be a byte or a word.

I'm working on cleaning the patch... I added a field in PageHeader which contains :
- 0 to indicate a non-compressed page
- length of compressed data if compressed

If compression gains nothing (ie gains less than 4K), the page is stored raw.

It seems that only pages having a PageHeader are handled by md.c, so it should work (am I mistaken ?)
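The header field described here can be modelled as a single length word in front of the stored bytes. This is only a sketch of the encoding, not the real PageHeaderData layout:

```python
import struct

COMPRESSED_LEN = struct.Struct("<H")  # 0 = raw page, else compressed byte count

def tag_page(payload: bytes, page_size: int = 32768) -> bytes:
    """Prefix the stored bytes with the length field described above."""
    n = len(payload) if len(payload) < page_size else 0  # 0 marks "stored raw"
    return COMPRESSED_LEN.pack(n) + payload

def bytes_to_read(header: bytes, page_size: int = 32768) -> int:
    """How many bytes the reader would need to fetch for this page."""
    (n,) = COMPRESSED_LEN.unpack(header[:COMPRESSED_LEN.size])
    return n or page_size

print(bytes_to_read(tag_page(b"x" * 1000)))    # 1000: compressed page
print(bytes_to_read(tag_page(b"x" * 32768)))   # 32768: stored raw
```

This also shows the backwards-compatibility objection: the word has to live at a fixed offset in every page, which is exactly why a separate fork was suggested instead.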
Re: [HACKERS] Table and Index compression
On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason wrote:
> When you choose a compression algorithm you know how much space a worst case compression will take (i.e. lzo takes up to 8% more for a 4kB block size). This space should be reserved in case of situations like the above and the filesystem shouldn't over-commit on this.
>
> Never had to think about this before though so I'm probably missing something obvious.

Well most users want compression for the space savings. So running out of space sooner than without compression when most of the space is actually unused would disappoint them.

Also, I'm puzzled why the space increase would be proportional to the amount of data and be more than 300 bytes. There's no reason it wouldn't be a small fixed amount. The ideal is you set aside one bit -- if the bit is set the rest is compressed and has to save at least one bit. If the bit is not set then the rest is uncompressed. Maximum bloat is 1-bit. In real systems it's more likely to be a byte or a word.

-- greg http://mit.edu/~gsstark/resume.pdf
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 10:33:33AM +0100, Greg Stark wrote:
> 2009/8/7 Pierre Frédéric Caillaud :
> > Also, about compressed NTFS : it can give you disk-full errors on read().
>
> I suspect it's unavoidable for similar reasons to the problems Postgres faces. When you issue a read() you have to find space in the filesystem cache to hold the data. Some other data has to be evicted. If that data doesn't compress as well as it did previously it could take more space and cause the disk to become full.
>
> This also implies that fsync() could generate that error...

If that's indeed how it works it seems like a broken file system that needs to get its block accounting in order. When you choose a compression algorithm you know how much space a worst case compression will take (i.e. lzo takes up to 8% more for a 4kB block size). This space should be reserved in case of situations like the above and the filesystem shouldn't over-commit on this.

Never had to think about this before though so I'm probably missing something obvious.

-- Sam http://samason.me.uk/
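For reference, the "up to 8% more for a 4kB block" figure matches the worst-case output bound given in the LZO documentation, output <= input + input/16 + 64 + 3 bytes (treat the exact constants as LZO-specific):

```python
def lzo_worst_case(n: int) -> int:
    """Upper bound on LZO1X output size for n input bytes (per the LZO FAQ)."""
    return n + n // 16 + 64 + 3

n = 4 * 1024
bound = lzo_worst_case(n)
print(bound)                              # 4419 bytes
print(round(100 * (bound - n) / n, 1))    # ~7.9% overhead, i.e. "up to 8%"
```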
Re: [HACKERS] Table and Index compression
2009/8/7 Pierre Frédéric Caillaud :
> Also, about compressed NTFS : it can give you disk-full errors on read().

I suspect it's unavoidable for similar reasons to the problems Postgres faces. When you issue a read() you have to find space in the filesystem cache to hold the data. Some other data has to be evicted. If that data doesn't compress as well as it did previously it could take more space and cause the disk to become full.

This also implies that fsync() could generate that error...

> Back to the point of how to handle disk full errors :
> - we could write a file the size of shared_buffers at startup
> - if a write() reports disk full, delete the file above
> - we now have enough space to flush all of shared_buffers
> - flush and exit gracefully

Unfortunately that doesn't really help. That only addresses the issue for a single backend (or as many as are actually running when the error starts). The next connection could read in new data and expand that and now you have no slop space.

Put another way, we don't want to exit at all, gracefully or not. We want to throw an error, abort the transaction (or subtransaction) and keep going.

-- greg http://mit.edu/~gsstark/resume.pdf
Re: [HACKERS] Table and Index compression
On Fri, Aug 07, 2009 at 10:36:39AM +0200, Pierre Frédéric Caillaud wrote:
> Also, about compressed NTFS : it can give you disk-full errors on read(). While this may appear stupid, it is in fact very good.

Is this not just because they've broken the semantics of read?

> As a side note, I have also tested lzjb (the ZFS compressor) and lzo is much faster, and compresses much better (sometimes 2x better).

Disks are fast and cheap; a basic IDE disk runs at over 100MB/s now, and it's doing this in the background while your CPU is doing other stuff. If you're also decompressing stuff you're serializing even more and you're doing so with a much power-hungrier device (the CPU). How fast is decompression (as that seems to be your selling point)? Lzo claims to run at about a third of main memory bandwidth, which is nice; however, research projects found this to be far too slow and were only getting positive results when decompression stayed in secondary cache. Basically decompression has to run at several GB/s for it to have much measurable benefit.

> I made a quick check before implementing it, using python scripts to play with sparse files on ext3 :
>
> - writing a sparse file is a bit (not much) slower than a regular file,
> - reading from a non-fragmented sparse file is as fast as reading a regular file
> - holes do not eat OS disk cache (which is the most interesting point)
> - reading from cache is as fast as reading a regular file (and faster if you don't read the holes because you know they are holes, which is the case here)

Numbers?

> And, also, growing a sparse file by plugging the holes in it WILL allocate blocks all over the place and render IO extremely inefficient. You can defrag it, of course (using fs-specific tools or just cpio), but that's not "high-availability"...

That would not seem too difficult to solve.

> I forgot to talk about SSDs in the previous message. SSDs are quite expensive, but seek really fast.

SSDs are about decreasing latency; if you're putting compression in there you're pushing latency up as well. If you don't care about latency you get traditional rotating media.

-- Sam http://samason.me.uk/
Re: [HACKERS] Table and Index compression
On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark wrote:
>> I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.
>
> How much benefit does this approach have over using TOAST compression more aggressively?
>
> ...Robert

The two are different :

- TOAST compresses a large column value. To store a 100KB text file, TOAST is great.
- Page compression compresses whole pages. Suppose you have a table with a TIMESTAMP and a few INT columns. The rows are small enough to make per-row compression useless, and TOAST cannot compress non-varlenas anyway. However, if (for instance) the timestamp is the row insertion date, and you INSERT several rows per second, most timestamps on a page will have lots of bytes in common. Also, row headers (which are larger than the rows) will have much redundant data. Page compression can exploit this, without the need for the rest of the code to know about it. Page compression can also handle indexes, etc.

Also, external TOAST is nice if you seldom need the field : for instance, you search on in-page columns, get the row you need, and fetch it. Suppose you have a forum : in this case, when you display a topic page, you need all the posts' text. It would be a very bad idea to store them in a separate TOAST table, because it would create more random IO. Storing the posts in the page means less IO, and if you regularly CLUSTER your table, all the posts you need to display a topic page are on the same (or adjacent) postgres page. In this case, individual post text can be TOASTed too, but compression tends to work better with longer blocks, so compressing the whole page will be more efficient.
Re: [HACKERS] Table and Index compression
First, a few things that I forgot to mention in the previous message :

> I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.

Sparse files allow something great : to test this concept in real life, with a length of code that is shorter than this email.

> The main problem with using sparse files is that currently postgres is careful to allocate blocks early so it can fail if there's not enough space. With your sparse file solution Postgres might only find out there's no space after it has already committed a transaction. bgwriter has no good course of action to take if it finds out there's nowhere to put the data it has in shared buffers.

Yes, that is a big problem. Note that the same problem can happen if you use a filesystem that supports compression :
- you preallocate a file full of empty blocks
- the filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on postgres" debate...

Also, about compressed NTFS : it can give you disk-full errors on read(). While this may appear stupid, it is in fact very good. When you read() something compressed, NTFS will reserve the space on disk needed to re-write this data, assuming the worst compression, so your app fails early (on read, not on write). If there is not enough space on the volume, read() will fail with a disk full error. Many applications do not expect this. On the other hand, filling a disk completely is never a good idea, since it generally triggers fragmentation so extreme that the only way to recover is to go buy another harddisk and copy your data.

As a side note, I have also tested lzjb (the ZFS compressor) and lzo is much faster, and compresses much better (sometimes 2x better).
Back to the point of how to handle disk full errors:

- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully

This will waste disk space. Of course... but the purpose of compression here is not to save disk space, it is to have a faster database. Also, if you use compression:

- you'll do it because your database is much bigger than shared_buffers, so the wasted space is not huge;
- you'll use a smaller shared_buffers than usual, because shared_buffers contains uncompressed pages, and you'll want to use lots of OS disk cache to cache compressed data.

Doesn't seem too crazy to me.

> But I think even if you solve that it's not really a good long-term solution. We don't know how the OS handles block allocation for this type of file. I'm actually moderately surprised it isn't skipping enough blocks assuming you'll allocate them eventually. Even if it does handle it the way you expect what happens when you do grow a block, it'll have to allocate it way out of the way and we have no way to repair that discontinuity later.

I made a quick check before implementing it, using python scripts to play with sparse files on ext3:

- writing a sparse file is a bit (not much) slower than writing a regular file;
- reading from a non-fragmented sparse file is as fast as reading a regular file;
- holes do not eat OS disk cache (which is the most interesting point);
- reading from cache is as fast as reading a regular file (and faster if you skip the holes because you know they are holes, which is the case here).

And, also, growing a sparse file by plugging its holes WILL allocate blocks all over the place and render IO extremely inefficient. You can defrag it, of course (using fs-specific tools, or just cpio), but that's not "high-availability"...

> Also, the way you've preallocated blocks effectively nails the maximum compression at 2x.
> That seems to be leaving a lot of money on the table.

Yes, it does, but there are a few subtleties.

First, about preallocation: I had to implement this because of the way the files are extended. Since an empty block is written first, it compresses well, to 1 OS page. When, later, the block is written again with real data, it compresses to a larger number of pages, which triggers allocation all over the place. In testing, the resulting sparse file is slow, of the "forget it" kind...

The file could also be non-sparse: in that case, no disk space is saved. Not saving disk space is OK with me: disks are really cheap these days. I know this sounds crazy ;) but as I said above, compression is meant to save expensive resources. When you use H-264 to compress 1TB of raw video to 1GB of compressed video, you save an expensive resource (0.999 TB of harddisk). But if you use lzo to compress 40GB of database down to 20GB, you save mayb
Re: [HACKERS] Table and Index compression
I'm curious what advantages there are to building compression into the database itself, rather than using filesystem-based compression. ZFS articles[1] discuss how enabling compression improves performance; Btrfs has compression features on Linux as well[2]; and NTFS offers it on Windows.

[1] http://blogs.sun.com/observatory/entry/zfs_compression_a_win_win
[2] http://lwn.net/Articles/305697/
Re: [HACKERS] Table and Index compression
On 8/6/09 1:03 PM, Greg Stark wrote:
> One possibility is to handle only read-only tables. That would make things a *lot* simpler. But it sure would be inconvenient if it's only useful on large static tables but requires you to rewrite the whole table -- just what you don't want to do with large static tables -- to get the benefit.

While less flexible, I could see combining this with partitioning to still be useful. If we could rewrite specific partitions as compressed, there are a lot of cumulative-data applications that would benefit. Not as exciting as being able to compress the whole thing, of course.

-- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com
Re: [HACKERS] Table and Index compression
Robert Haas wrote:
> On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark wrote:
>> I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.
>
> How much benefit does this approach have over using TOAST compression more aggressively?

I was wondering the same thing. It seems like compressing a page at a time should allow more space savings than a column at a time, and possibly do it faster.

One question I have banging around in my head is what to do with out-of-line storage. Sometimes you have a large column which you know contains data which is already compressed and/or encrypted, so attempting compression would give little or no benefit; so I'm inclined to think that if we do page compression, it shouldn't deal with toast tables. We could leave them to the current techniques.

That leaves some subtle problems with how to deal with a datum which currently compresses from, say, several kB down to one or two hundred bytes. Current TOAST logic would typically compress and inline it. What would we do if we're trying to push heap compression to the page level?

-Kevin
Re: [HACKERS] Table and Index compression
On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark wrote:
> I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.

How much benefit does this approach have over using TOAST compression more aggressively?

...Robert
Re: [HACKERS] Table and Index compression
I like the idea too, but I think there are some major problems to solve. In particular I think we need a better solution to blocks growing than sparse files.

The main problem with using sparse files is that currently postgres is careful to allocate blocks early so it can fail if there's not enough space. With your sparse file solution Postgres might only find out there's no space after it has already committed a transaction. bgwriter has no good course of action to take if it finds out there's nowhere to put the data it has in shared buffers.

But I think even if you solve that, it's not really a good long-term solution. We don't know how the OS handles block allocation for this type of file. I'm actually moderately surprised it isn't skipping enough blocks assuming you'll allocate them eventually. Even if it does handle it the way you expect, what happens when you do grow a block? It'll have to be allocated way out of the way, and we have no way to repair that discontinuity later.

Also, the way you've preallocated blocks effectively nails the maximum compression at 2x. That seems to be leaving a lot of money on the table.

To handle read-write tables I think we would need to directly implement the kind of indirection layer that you're currently getting out of the filesystem's block layer. That would let you allocate enough blocks to hold the data uncompressed and then free up those blocks once you're sure the data is compressible.

One possibility is to handle only read-only tables. That would make things a *lot* simpler. But it sure would be inconvenient if it's only useful on large static tables but requires you to rewrite the whole table -- just what you don't want to do with large static tables -- to get the benefit.
Re: [HACKERS] Table and Index compression
Pierre,

On Thu, Aug 6, 2009 at 11:39 AM, PFC wrote:
> The best for this is lzo: very fast decompression, a good compression ratio on a sample of postgres tables and indexes, and a license that could work.

The license of lzo doesn't allow us to include it in PostgreSQL without relicensing PostgreSQL as GPL. I'm not sure what you mean by "a license that could work".

Note that this doesn't change the interest of your approach. It's just that I'm not sure we can find a BSD-licensed compression library with acceptable performance (this has been discussed here many times).

-- Guillaume
Re: [HACKERS] Table and Index compression
On 8/6/09 2:39 AM, PFC wrote:
> With the talk about adding compression to pg_dump lately, I've been wondering if tables and indexes could be compressed too. So I've implemented a quick on-the-fly compression patch for postgres

I find this very interesting, and would like to test it further on some client workloads before you/we put more work into completing it. I think if we can implement a compressed database as an option (perhaps at initdb time, perhaps at tablespace creation time), it will be very attractive.

Where is the patch? BTW, who are you actually?

-- Josh Berkus PostgreSQL Experts Inc. www.pgexperts.com