Re: [HACKERS] Table and Index compression

2009-08-12 Thread Peter Eisentraut
On Tuesday 11 August 2009 13:05:39 Pierre Frédéric Caillaud wrote:
   Well, here is the patch. I've included a README, which I paste here.
   If someone wants to play with it (after the CommitFest...) feel free to
 do so.
   While it was an interesting thing to try, I don't think it has enough
 potential to justify more effort...


 * How to test

 - apply the patch
 - copy minilzo.c and minilzo.h to
 src/backend/storage/smgr

For future reference, and since this keeps appearing every few months: The 
license of LZO is not acceptable for inclusion or use with PostgreSQL.  You 
need to find a different library if you want to pursue this further.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-12 Thread Pierre Frédéric Caillaud


For future reference, and since this keeps appearing every few months: The
license of LZO is not acceptable for inclusion or use with PostgreSQL.  You
need to find a different library if you want to pursue this further.


	Yes, I know about the license... I used LZO for tests, but since my  
little experiment with compression didn't give any really motivating  
results, I won't pursue this further.



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-11 Thread Pierre Frédéric Caillaud


Well, here is the patch. I've included a README, which I paste here.
	If someone wants to play with it (after the CommitFest...) feel free to  
do so.
	While it was an interesting thing to try, I don't think it has enough  
potential to justify more effort...



* How to test

- apply the patch
- copy minilzo.c and minilzo.h to
src/backend/storage/smgr

- configure && make
- enjoy

* How it works

- pg block size set to 32K
- an extra field is added in the header telling the compressed length

THIS IS BAD, this information should be stored in a separate fork of the  
relation, because

- it would then be backwards compatible
	- the number of bytes to read from a compressed page would be known in  
advance


- the table file is sparse
- the page header is not compressed
- pages are written at their normal positions, but only the compressed  
bytes are written

- if compression gains nothing, un-compressed page is stored
- the filesystem doesn't store the un-written blocks
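
For illustration, here is a rough sketch of that write path (hypothetical
helper, not the code from the patch; it assumes miniLZO's lzo1x_1_compress(),
a 32K BLCKSZ, lzo_init() already called at startup, and no error handling) :

/* Sketch only: write one 32K page, storing just the compressed bytes at
 * the page's normal offset so the filesystem can keep the rest as a hole. */
#include <unistd.h>
#include "minilzo.h"

#define BLCKSZ 32768

static void
write_compressed_page(int fd, unsigned int blocknum, const char *page)
{
    static unsigned char wrkmem[LZO1X_1_MEM_COMPRESS];
    unsigned char dst[BLCKSZ + BLCKSZ / 16 + 64 + 3];  /* LZO worst case */
    lzo_uint      dstlen = sizeof(dst);

    lzo1x_1_compress((unsigned char *) page, BLCKSZ, dst, &dstlen, wrkmem);

    if (dstlen >= BLCKSZ)
        pwrite(fd, page, BLCKSZ, (off_t) blocknum * BLCKSZ);  /* no gain: store raw */
    else
        pwrite(fd, dst, dstlen, (off_t) blocknum * BLCKSZ);   /* tail left unwritten */
}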

* Benefits

- Sparse file holes are not cached, so OS disk cache efficiency is at  
least x2
- Random access is faster, having a better probability to hit cache  
(sometimes a bit faster, sometimes it's spectacular)

- Yes, it does save space (~50%)

* Problems

- Biggest problem : any write to a table that writes data that compresses  
less than whatever was there before can fail on a disk full error.


- ext3 sparse file handling isn't as fast as I wish it would be : on seq  
scans, even if it reads 2x less data, and decompresses very fast, it's  
still slower...


- many seq scans (especially with aggregates) are CPU bound anyway

- therefore, some kind of background-reader-decompressor would be needed

- pre-allocation has to be done to avoid extreme fragmentation of the  
file, which kind of defeats the purpose


- it still causes fragmentation

* Conclusion (for now)

It was a nice thing to try, but I believe it would be better if this was  
implemented directly in the filesystem, on the condition that it should be  
implemented well (ie not like NTFS compression).






pg_8.4.0_compression_patch_v001.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud


First, a few things that I forgot to mention in the previous message :






I like the idea too, but I think there are some major problems to
solve. In particular I think we need a better solution to blocks
growing than sparse files.


Sparse files allow something great : to test this concept in real life,  
with a length of code that is shorter than this email.



The main problem with using sparse files is that currently postgres is
careful to allocate blocks early so it can fail  if there's not enough
space. With your sparse file solution Postgres might only find out
there's no space after it has already committed a transaction.
bgwriter has no good course of action to take if it finds out there's
nowhere to put the data it has in shared buffers.


Yes, that is a big problem.

Note that the same problem can happen if you use a filesystem that  
supports compression :


- you preallocate a file full of empty blocks
- filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS with
postgres" debate...


Also, about compressed NTFS : it can give you disk-full errors on read().
While this may appear stupid, it is in fact very good.
When you read() something compressed, NTFS will reserve the space on disk  
to re-write this data, assuming the worst compression, so your app fails  
early (on read, not on write). If there is not enough space on the volume,  
read() will fail with a disk full error. Many applications do not expect  
this.


On the other hand, filling a disk completely is never a good idea, since
it generally triggers fragmentation so extreme that the only way to  
recover is to go buy another harddisk and copy your data.


As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
much faster, and compresses much better (sometimes 2x better).


Back to the point of how to handle disk full errors :
- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully
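
Something along these lines, just to make the idea concrete (file name and
helper names are made up, error handling and locking omitted) :

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define SLOP_FILE "base/disk_full_reserve"

/* Reserve shared_buffers worth of disk space at startup. */
static void
reserve_slop_space(off_t shared_buffers_bytes)
{
    int fd = open(SLOP_FILE, O_CREAT | O_WRONLY, 0600);
    posix_fallocate(fd, 0, shared_buffers_bytes);  /* really allocates blocks */
    close(fd);
}

/* On ENOSPC, give the reservation back so dirty buffers can still be flushed. */
static ssize_t
write_with_slop(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 && errno == ENOSPC)
    {
        unlink(SLOP_FILE);              /* free the reserved space ...    */
        n = pwrite(fd, buf, len, off);  /* ... and retry once before exit */
    }
    return n;
}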

This will waste disk space. Of course... but the purpose of compression is  
not to save disk space, it is to have a faster database.

Also, if you use compression,
- you'll do it because your database is much bigger than shared_buffers,  
so the wasted space is not huge.
- you'll use a smaller shared_buffers than usual, because shared_buffers  
contains uncompressed pages, and you'll want to use lots of OS disk cache  
to cache compressed data.


Doesn't seem too crazy to me.


But I think even if you solve that it's not really a good long-term
solution. We don't know how the OS handles block allocation for this
type of file. I'm actually moderately surprised it isn't skipping
enough blocks assuming you'll allocate them eventually. Even if it
does handle it the way you expect, what happens when you do grow a
block? It'll have to allocate it way out of the way and we have no way
to repair that discontinuity later.


I made a quick check before implementing it, using python scripts to play  
with sparse files on ext3 :


- writing a sparse file is a bit (not much) slower than a regular file,
- reading from a non-fragmented sparse file is as fast as reading a  
regular file

- holes do not eat OS disk cache (which is the most interesting point)
- reading from cache is as fast as reading a regular file (and faster if  
you don't read the holes because you know they are holes, which is the  
case here)
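
The C equivalent of that kind of check is tiny; something like this standalone
toy (not patch code) shows whether the hole really stays unallocated :

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    char        block[32768] = {0};
    struct stat st;
    int         fd = open("sparse_test", O_CREAT | O_RDWR | O_TRUNC, 0600);

    /* write a single 32K block 256 MB into the file, leaving a hole before it */
    pwrite(fd, block, sizeof(block), 256L * 1024 * 1024);
    fstat(fd, &st);
    printf("apparent size: %lld bytes, allocated: %lld bytes\n",
           (long long) st.st_size, (long long) st.st_blocks * 512);
    close(fd);
    unlink("sparse_test");
    return 0;
}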


And, also, growing a sparse file by plugging the holes in it WILL allocate  
blocks all over the place and render IO extremely inefficient.
You can defrag it, of course (using fs-specific tools or just cpio), but  
that's not high-availability...



Also, the way you've preallocated blocks effectively nails the maximum
compression at 2x. That seems to be leaving a lot of money on the
table.


Yes, it does, but there are a few subtleties.

First, about preallocation : I had to implement this, because of the way  
the files are extended.
Since first an empty block is written, it will compress well, to 1 OS  
page. When, later, this block is written again with data, it will compress  
to a larger number of pages, which will trigger allocation all over the
place. In testing, the resulting sparse file is slow, of the "forget it"
kind...


The file could also be non-sparse : in this case, no disk space is saved.

Not saving disk space is OK with me : disks are really cheap these days.

I know this sounds crazy ;) but as I said above, compression is meant to  
save expensive resources. When you use H-264 to compress 1TB of raw video  
to 1GB of compressed video, you save an expensive resource (0.999 TB of  
harddisk). But if you use lzo to compress 40GB of database down to 20GB,  
you save maybe $3 

Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud

On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark <gsst...@mit.edu> wrote:

I like the idea too, but I think there are some major problems to
solve. In particular I think we need a better solution to blocks
growing than sparse files.


How much benefit does this approach have over using TOAST compression
more aggressively?

...Robert




The two are different :

- TOAST compresses a large column value.

To store a 100KB text file, TOAST is great.

- page compression compresses whole pages.

Suppose you have a table with a TIMESTAMP, and a few INT columns. The rows  
are small enough to make per-row compression useless, and TOAST cannot  
compress non-varlenas anyway. However, if (for instance) the timestamp is  
the row insertion date, and you INSERT several rows per second, most  
timestamps on a page will have lots of bytes in common. Also, row headers  
(which are larger than the rows) will have much redundant data. Page  
compression can exploit this, without the need for the rest of the code to  
know about it.


Page compression can also handle indexes, etc.

Also, External TOAST is nice if you seldom need the field : for instance,  
you search on in-page columns, get the row you need, and fetch it.
Suppose you have a forum : in this case, when you display a topic page,  
you need all the posts text. It would be a very bad idea to store them in  
a separate TOAST table, because it would create more random IO. Storing  
the posts in the page means less IO, and if you regularly CLUSTER your  
table, all the posts you need to display a topic page are on the same (or  
adjacent) postgres page. In this case, individual post text can be  
TOASTed, too, but compression tends to work better with longer blocks, so  
compressing the whole page will be more efficient.




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:36:39AM +0200, Pierre Frédéric Caillaud wrote:
 Also, about compressed NTFS : it can give you disk-full errors on read().
 While this may appear stupid, it is in fact very good.

Is this not just because they've broken the semantics of read?

 As a side note, I have also tested lzjb (the ZFS compressor) and lzo is  
 much faster, and compresses much better (sometimes 2x better).

Disks are fast and cheap; a basic IDE disk runs at over 100MB/s now, and
it's doing this in the background while your CPU is doing other stuff.
If you're also decompressing stuff you're serializing even more and
you're doing so with a much power hungrier device (the CPU).

How fast is decompression (as that seems to be your selling point)?  Lzo
claims to run at about a third of main memory bandwidth which is
nice, however research projects found this to be far too slow and were
only getting positive results when decompression stayed in secondary
cache.  Basically decompression has to run at several GB/s for it to
have much measurable benefit.

 I made a quick check before implementing it, using python scripts to play  
 with sparse files on ext3 :
 
 - writing a sparse file is a bit (not much) slower than a regular file,
 - reading from a non-fragmented sparse file is as fast as reading a  
 regular file
 - holes do not eat OS disk cache (which is the most interesting point)
 - reading from cache is as fast as reading a regular file (and faster if  
 you don't read the holes because you know they are holes, which is the  
 case here)

Numbers?

 And, also, growing a sparse file by plugging the holes in it WILL allocate  
 blocks all over the place and render IO extremely inefficient.
 You can defrag it, of course (using fs-specific tools or just cpio), but  
 that's not high-availability...

That would not seem too difficult to solve.

 I forgot to talk about SSDs in the previous message. SSDs are quite  
 expensive, but seek really fast.

SSDs are about decreasing latency; if you're putting compression in
there you're pushing latency up as well.  If you don't care about
latency you get traditional rotating media.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Greg Stark
2009/8/7 Pierre Frédéric Caillaud <li...@peufeu.com>:

 Also, about compressed NTFS : it can give you disk-full errors on read().

I suspect it's unavoidable for similar reasons to the problems
Postgres faces. When you issue a read() you have to find space in the
filesystem cache to hold the data. Some other data has to be evicted.
If that data doesn't compress as well as it did previously it could
take more space and cause the disk to become full.

This also implies that fsync() could generate that error...

 Back to the point of how to handle disk full errors :
 - we could write a file the size of shared_buffers at startup
 - if a write() reports disk full, delete the file above
 - we now have enough space to flush all of shared_buffers
 - flush and exit gracefully

Unfortunately that doesn't really help. That only addresses the issue
for a single backend (or as many as are actually running when the
error starts). The next connection could read in new data and expand
that and now you have no slop space.

Put another way, we don't want to exit at all, gracefully or not. We
want to throw an error, abort the transaction (or subtransaction) and
keep going.


-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 10:33:33AM +0100, Greg Stark wrote:
 2009/8/7 Pierre Frédéric Caillaud <li...@peufeu.com>:
  Also, about compressed NTFS : it can give you disk-full errors on read().
 
 I suspect it's unavoidable for similar reasons to the problems
 Postgres faces. When you issue a read() you have to find space in the
 filesystem cache to hold the data. Some other data has to be evicted.
 If that data doesn't compress as well as it did previously it could
 take more space and cause the disk to become full.
 
 This also implies that fsync() could generate that error...

If that's indeed how it works, it seems like a broken file system that
needs to get its block accounting in order.

When you choose a compression algorithm you know how much space a worst
case compression will take (i.e. lzo takes up to 8% more for a 4kB block
size).  This space should be reserved in case of situations like the
above and the filesystem shouldn't over-commit on this.
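
For what it's worth, the worst-case bound usually quoted for LZO (the exact
formula below is from memory, so treat it as an assumption) works out to
roughly that figure for a 4kB block:

/* Commonly quoted LZO worst case: out_max = in + in/16 + 64 + 3.
 * For in = 4096: 4096 + 256 + 64 + 3 = 4419 bytes, i.e. about 7.9% bigger,
 * which matches the "up to 8%" figure above. */
#define LZO_WORST_CASE(in)  ((in) + (in) / 16 + 64 + 3)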

Never had to think about this before though so I'm probably missing
something obvious.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Greg Stark
On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason <s...@samason.me.uk> wrote:
 When you choose a compression algorithm you know how much space a worst
 case compression will take (i.e. lzo takes up to 8% more for a 4kB block
 size).  This space should be reserved in case of situations like the
 above and the filesystem shouldn't over-commit on this.

 Never had to think about this before though so I'm probably missing
 something obvious.

Well most users want compression for the space savings. So running out
of space sooner than without compression when most of the space is
actually unused would disappoint them.

Also, I'm puzzled why the space increase would be proportional
to the amount of data and be more than 300 bytes. There's no reason it
wouldn't be a small fixed amount. The ideal is you set aside one bit
-- if the bit is set the rest is compressed and has to save at least
one bit. If the bit is not set then the rest is uncompressed. Maximum
bloat is 1-bit. In real systems it's more likely to be a byte or a
word.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud



Also, I'm puzzled why the space increase would be proportional
to the amount of data and be more than 300 bytes. There's no reason it
wouldn't be a small fixed amount. The ideal is you set aside one bit
-- if the bit is set the rest is compressed and has to save at least
one bit. If the bit is not set then the rest is uncompressed. Maximum
bloat is 1-bit. In real systems it's more likely to be a byte or a
word.


I'm working on cleaning the patch...

I added a field in PageHeader which contains :
- 0 to indicate a non-compressed page
- length of compressed data if compressed

	If compression gains nothing (ie gains less than 4K), the page is stored  
raw.
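
	In other words, something like this (field and function names are just
illustrative, not the actual layout in the patch) :

#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 32768

/* extra page-header field: 0 = page stored raw, else compressed length */
typedef struct PageCompressionField
{
    uint16_t    pd_compressed_len;
} PageCompressionField;

/* store compressed only if it saves at least one 4K OS page */
static bool
worth_compressing(uint32_t compressed_len)
{
    return compressed_len <= BLCKSZ - 4096;
}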


	It seems that only pages having a PageHeader are handled by md.c, so it  
should work (am I mistaken ?)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 11:49:46AM +0100, Greg Stark wrote:
 On Fri, Aug 7, 2009 at 11:29 AM, Sam Mason <s...@samason.me.uk> wrote:
  When you choose a compression algorithm you know how much space a worst
  case compression will take (i.e. lzo takes up to 8% more for a 4kB block
  size).  This space should be reserved in case of situations like the
  above and the filesystem shouldn't over-commit on this.
 
  Never had to think about this before though so I'm probably missing
  something obvious.
 
 Well most users want compression for the space savings. So running out
 of space sooner than without compression when most of the space is
 actually unused would disappoint them.

Note that, as far as I can tell, for a filesystem you only need to keep
enough reserved for the amount of uncompressed dirty buffers you have in
memory.  As space runs out in the filesystem all that happens is that
the amount of (uncompressed?) dirty buffers you can safely have around
decreases.  In practical terms, this says that performance drops off
when there is less free space than the size of the filesystem's cache
and I think you have to reserve exactly one block to handle the base
case.  But there are so many problems associated with completely filling
a filesystem that I'm not sure if this would really matter.

 Also, I'm puzzled why the space increase would be proportional
 to the amount of data and be more than 300 bytes. There's no reason it
 wouldn't be a small fixed amount. The ideal is you set aside one bit
 -- if the bit is set the rest is compressed and has to save at least
 one bit. If the bit is not set then the rest is uncompressed. Maximum
 bloat is 1-bit. In real systems it's more likely to be a byte or a
 word.

It'll depend on the compression algorithm; lz algorithms are dictionary
based so you'd have a single entry for the incompressible data and then
a pointer to the entry.

In PG's case, it would seem possible to do the compression and then
check to see if the resulting size is greater than 4kB.  If it is, you
write the uncompressed data into the 4kB page as-is.  Upon reading
you do the inverse, if it's 4kB then no need to decompress.  I believe
TOAST does this already.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Greg Stark
On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason <s...@samason.me.uk> wrote:
 Well most users want compression for the space savings. So running out
 of space sooner than without compression when most of the space is
 actually unused would disappoint them.

 Note, that as far as I can tell for a filesystems you only need to keep
 enough reserved for the amount of uncompressed dirty buffers you have in
 memory.  As space runs out in the filesystem all that happens is that
 the amount of (uncompressed?) dirty buffers you can safely have around
 decreases.

And when it drops to zero?

 In PG's case, it would seem possible to do the compression and then
 check to see if the resulting size is greater than 4kB.  If it is you
 write into the 4kB page size and write uncompressed data.  Upon reading
 you do the inverse, if it's 4kB then no need to decompress.  I believe
 TOAST does this already.

It does, as does gzip and afaik every compression system.

-- 
greg
http://mit.edu/~gsstark/resume.pdf

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Greg Stark
For reference what I'm picturing is this:

When a table is compressed it's marked read-only which bars any new
tuples from being inserted or existing tuples being deleted. Then it's
frozen and any pages which contain tuples which can't be frozen are
waited on until they can be. When it's finished every tuple has to be
guaranteed to be fully frozen.

Then the relation is rewritten in compressed form. Each block is
compressed one by one and written one after the other to disk.

At the same time a new fork is written which contains a pointer to
each block. It could just be a directly addressed array of offsets and
lengths. All block lookups have to first load the page of the
indirection map, then read the appropriate section of the original
file and decompress it into shared buffers.
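
A rough sketch of what that lookup could look like (entry layout, names and
the decompressor helper are all hypothetical):

#include <stdint.h>
#include <unistd.h>

#define BLCKSZ 32768

/* one directly addressed entry per block in the new fork */
typedef struct BlockMapEntry
{
    uint64_t    offset;     /* byte position of the compressed block */
    uint32_t    length;     /* compressed length in bytes */
} BlockMapEntry;

/* hypothetical decompressor: expands src into a BLCKSZ page */
extern void decompress_block(const unsigned char *src, uint32_t srclen,
                             char *page, uint32_t pagelen);

static void
read_compressed_block(int map_fd, int rel_fd, uint32_t blkno, char *page)
{
    BlockMapEntry   e;
    unsigned char   buf[BLCKSZ];

    /* look up the block's location in the indirection map */
    pread(map_fd, &e, sizeof(e), (off_t) blkno * sizeof(e));

    /* fetch exactly the compressed bytes, then decompress into shared buffers */
    pread(rel_fd, buf, e.length, (off_t) e.offset);
    decompress_block(buf, e.length, page, BLCKSZ);
}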

From a programming point of view this is nice and simple. From a
user's point of view it's a bit of a pain since it means you have to
rewrite your whole table when you want to compress it. And it means
you have to rewrite it all again if you decide you want to set it back
to read-write. My experience with people who have very large tables is
that they design their whole process around the goal of avoiding
having to move the data once it's written.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 12:59:57PM +0100, Greg Stark wrote:
 On Fri, Aug 7, 2009 at 12:48 PM, Sam Mason <s...@samason.me.uk> wrote:
  Well most users want compression for the space savings. So running out
  of space sooner than without compression when most of the space is
  actually unused would disappoint them.
 
  Note, that as far as I can tell for a filesystems you only need to keep
  enough reserved for the amount of uncompressed dirty buffers you have in
  memory.  As space runs out in the filesystem all that happens is that
  the amount of (uncompressed?) dirty buffers you can safely have around
  decreases.
 
 And when it drops to zero?

That was why I said you need to have one page left to handle the base
case.  I was treating the inductive case as the interesting common case
and considered the base case of lesser interest.

  In PG's case, it would seem possible to do the compression and then
  check to see if the resulting size is greater than 4kB.  If it is you
  write into the 4kB page size and write uncompressed data.  Upon reading
  you do the inverse, if it's 4kB then no need to decompress.  I believe
  TOAST does this already.
 
 It does, as does gzip and afaik every compression system.

It's still a case that needs to be handled explicitly by the code.  Just
for reference, gzip does not appear to do this when I test it:

  echo -n 'a' | gzip > tmp.gz
  gzip -l --verbose tmp.gz

says the compression ratio is -200% (an empty string results in
an infinite increase in size yet gets displayed as 0% for some
strange reason).  It's only when you hit six 'a's that you start to get
positive ratios.  Note that that this is taking headers into account;
the compressed size is 23 bytes for both 'aaa' and 'aa' but the
uncompressed size obviously changes.

gzip does indeed have a copy method, but it doesn't seem to be being
used.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Robert Haas
On Fri, Aug 7, 2009 at 8:18 AM, Greg Stark <gsst...@mit.edu> wrote:
 For reference what I'm picturing is this:

 When a table is compressed it's marked read-only which bars any new
 tuples from being inserted or existing tuples being deleted. Then it's
 frozen and any pages which contain tuples which can't be frozen are
 waited on until they can be. When it's finished every tuple has to be
 guaranteed to be fully frozen.

 Then the relation is rewritten in compressed form. Each block is
 compressed one by one and written one after the other to disk.

 At the same time a new fork is written which contains a pointer to
 each block. It could just be a directly addressed array of offsets and
 lengths. All block lookups have to first load the page of the
 indirection map, then read the appropriate section of the original
 file and decompress it into shared buffers.

 From a programming point of view this is nice and simple. From a
 user's point of view it's a bit of a pain since it means you have to
 rewrite your whole table when you want to compress it. And it means
 you have to rewrite it all again if you decide you want to set it back
 to read-write. My experience with people who have very large tables is
 that they design their whole process around the goal of avoiding
 having to move the data once it's written.

If you add an indirection table, it's not strictly necessary for the
table to be read-only, though if you want to make it read-write you'd
need to think about how to defragment.

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud

For reference what I'm picturing is this:

When a table is compressed it's marked read-only which bars any new
tuples from being inserted or existing tuples being deleted. Then it's
frozen and any pages which contain tuples which can't be frozen are
waited on until they can be. When it's finished every tuple has to be
guaranteed to be fully frozen.

Then the relation is rewritten in compressed form. Each block is
compressed one by one and written one after the other to disk.

At the same time a new fork is written which contains a pointer to
each block. It could just be a directly addressed array of offsets and
lengths. All block lookups have to first load the page of the
indirection map, then read the appropriate section of the original
file and decompress it into shared buffers.


	I had pondered the idea of a fork storing the compressed status of each  
page, because it has advantages :

- no need to change the page layout to insert an "is compressed" flag
- possible to compress any data, not just standard pages
	- if you know the compressed size of a page in advance, it is much easier  
to prefetch it entirely and not just the first chunk, or read too much...
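
	For instance, with the per-page compressed length available from such a
fork, a prefetch could hint exactly the right byte range (sketch with made-up
names; posix_fadvise() is the only real call here) :

#include <fcntl.h>
#include <stdint.h>

#define BLCKSZ 32768

static void
prefetch_compressed_page(int rel_fd, uint32_t blkno, uint16_t compressed_len)
{
    /* pages still sit at their normal 32K-aligned position in the file */
    off_t   start = (off_t) blkno * BLCKSZ;
    off_t   len   = compressed_len ? compressed_len : BLCKSZ;

    posix_fadvise(rel_fd, start, len, POSIX_FADV_WILLNEED);
}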



From a programming point of view this is nice and simple. From a
user's point of view it's a bit of a pain since it means you have to
rewrite your whole table when you want to compress it. And it means
you have to rewrite it all again if you decide you want to set it back
to read-write. My experience with people who have very large tables is
that they design their whole process around the goal of avoiding
having to move the data once it's written.


	Note that if a table is huge, it is always cut in (currently) 1GB slices,  
so you could operate on one slice at a time, then release a lock, let the  
backlog of queries flow, and resume.


Realtime compression would be much less of a hassle to use, though...



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud


Not strictly related to compression, but I've noticed something really
strange...

pg 8.4 (vanilla) is doing it, and my compressed version is doing it too.
tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
but it does it too if I put the tablespace and data on the same volume.

The traditional test table :

BEGIN; CREATE TABLE dwb (
bid SERIAL,
aid INTEGER NOT NULL,
ts TIMESTAMP NOT NULL,
i1 INTEGER NOT NULL,
i2 INTEGER NOT NULL,
i3 INTEGER NOT NULL,
i4 INTEGER NOT NULL
) WITHOUT OIDS;

The traditional test data :

INSERT INTO dwb (ts,aid,i1,i2,i3,i4)
SELECT now() + '1 sec'::INTERVAL, (1+random()*8), random()*1000,
n+random()*1, n+random()*1000, n
  FROM generate_series( 1, 6000 ) AS n;

vmstat output :

it starts out relatively fast :

  si  so    bi    bo    in    cs us sy id wa
   0   0     0 43680  2796 19162 42 18 37  3
   0   0     0 45515  3165 20652 44 17 35  4
   0   0     0 43130  3046 21991 43 17 38  2

then here it starts to slow down : check bo output

   0   0   181 24439   577  3541 31  6 40 23
   0   0   176 17258   292  1324 31  4 43 22
   0   0     0 18626   162   693 35  3 49 12
   0   0     1 21554   235  1362 31  5 50 14
   0   0     0 19177   324  2053 35  4 50 12
   0   0     0 19208   206  1155 36  4 48 12
   0   0     1 20740   215  1117 33  4 50 13
   0   0     0 20154   258  1100 32  4 50 14
   0   0     0 20355   316  2056 34  5 49 12

... and it stays like this until the end of the INSERT...

It's not writing any xlog since the table was created after the BEGIN...

I'm trying to benchmark insert speed, but no luck. This volume does about
100 MB/s sustained write speed, so ?..

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 03:29:44PM +0200, Pierre Frédéric Caillaud wrote:
 vmstat output :

Sorry, I don't know enough of PGs internals to suggest anything here,
but iostat may give you more details as to what's going on.

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Kevin Grittner
Pierre Frédéric Caillaud <li...@peufeu.com> wrote:
 
 tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
 but it does it too if I put the tablespace and data on the same
 volume.
 
 it starts out relatively fast :
 
   si  so    bi    bo    in    cs us sy id wa
    0   0     0 43680  2796 19162 42 18 37  3
    0   0     0 45515  3165 20652 44 17 35  4
    0   0     0 43130  3046 21991 43 17 38  2
 
 then here it starts to slow down : check bo output
 
    0   0   181 24439   577  3541 31  6 40 23
    0   0   176 17258   292  1324 31  4 43 22
    0   0     0 18626   162   693 35  3 49 12
    0   0     1 21554   235  1362 31  5 50 14
    0   0     0 19177   324  2053 35  4 50 12
    0   0     0 19208   206  1155 36  4 48 12
    0   0     1 20740   215  1117 33  4 50 13
    0   0     0 20154   258  1100 32  4 50 14
    0   0     0 20355   316  2056 34  5 49 12
 
 ... and it stays like this until the end of the INSERT...
 
I don't know if this is it, but we tend to see outrageously high
performance at the start of a benchmark because of the battery-backed
cache in the RAID controller.  Every write comes back immediately
after copying the data to RAM.  After a while the cache gets filled
and you settle down to a steady state.  If it's not BBU with
write-back enabled, perhaps you have drives that lie about write
completion?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Pierre Frédéric Caillaud
On Fri, 07 Aug 2009 15:42:35 +0200, Kevin Grittner  
kevin.gritt...@wicourts.gov wrote:



Pierre Frédéric Caillaud <li...@peufeu.com> wrote:


tablespace is a RAID5 of 3 drives, xlog is on a RAID1 of 2 drives,
but it does it too if I put the tablespace and data on the same
volume.



it starts out relatively fast :

  si  so    bi    bo    in    cs us sy id wa
   0   0     0 43680  2796 19162 42 18 37  3
   0   0     0 45515  3165 20652 44 17 35  4
   0   0     0 43130  3046 21991 43 17 38  2

then here it starts to slow down : check bo output

   0   0   181 24439   577  3541 31  6 40 23
   0   0   176 17258   292  1324 31  4 43 22
   0   0     0 18626   162   693 35  3 49 12
   0   0     1 21554   235  1362 31  5 50 14
   0   0     0 19177   324  2053 35  4 50 12
   0   0     0 19208   206  1155 36  4 48 12
   0   0     1 20740   215  1117 33  4 50 13
   0   0     0 20154   258  1100 32  4 50 14
   0   0     0 20355   316  2056 34  5 49 12

... and it stays like this until the end of the INSERT...

I don't know if this is it, but we tend to see outrageously high
performance at the start of a benchmark because of the battery-backed
cache in the RAID controller.  Every write comes back immediately
after copying the data to RAM.  After a while the cache gets filled
and you settle down to a steady state.  If it's not BBU with
write-back enabled, perhaps you have drives that lie about write
completion?
-Kevin



I'm answering my own question : at the beginning of the run, postgres  
creates a 800MB temporary file, then it fills the table, then deletes the  
temp file.

Is this because I use generate_series to fill the test table ?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Sam Mason
On Fri, Aug 07, 2009 at 04:17:18PM +0200, Pierre Frédéric Caillaud wrote:
 I'm answering my own question : at the beginning of the run, postgres  
 creates a 800MB temporary file, then it fills the table, then deletes the  
 temp file.
 Is this because I use generate_series to fill the test table ?

Doh, yes.  A function's result is written to a temp location first and
then read back again once the function returns success.  You'll have
more luck if you do:

  SELECT now() + '1 sec'::INTERVAL, (1+random()*8), random()*1000,
         n+random()*1, n+random()*1000, n
  FROM (SELECT generate_series( 1, 6000 )) x(n);

-- 
  Sam  http://samason.me.uk/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-07 Thread Josh Berkus
Pierre,

 I added a field in PageHeader which contains :
 - 0 to indicate a non-compressed page
 - length of compressed data if compressed
 
 If compression gains nothing (ie gains less than 4K), the page is
 stored raw.
 
 It seems that only pages having a PageHeader are handled by md.c, so
 it should work (am I mistaken ?)

Well, there's the issue of upgradability; this would require us to have
an incompatible upgrade of on-disk formats.  So we don't want to go
further down this route unless we're sure it's worthwhile.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Josh Berkus
On 8/6/09 2:39 AM, PFC wrote:
 
 
 With the talk about adding compression to pg_dump lately, I've been
 wondering if tables and indexes could be compressed too.
 So I've implemented a quick on-the-fly compression patch for postgres

I find this very interesting, and would like to test it further on some
client workloads, before you/we put more work into completing it.

I think if we can implement compressed database as an option (perhaps at
initdb time, perhaps at tablespace creation time) then it will be very
attractive.

Where is the patch?

BTW, who are you actually?

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Guillaume Smet
Pierre,

On Thu, Aug 6, 2009 at 11:39 AM, PFC <li...@peufeu.com> wrote:
 The best for this is lzo : very fast decompression, a good compression ratio
 on a sample of postgres table and indexes, and a license that could work.

The license of lzo doesn't allow us to include it in PostgreSQL
without relicensing PostgreSQL as GPL.

I'm not sure what you mean by "a license that could work".

Note that it doesn't change the interest of your approach. It's just
that I'm not sure we can find a performance-acceptable BSD licensed
compression library (it was discussed a lot of times here).

-- 
Guillaume

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Greg Stark
I like the idea too, but I think there are some major problems to
solve. In particular I think we need a better solution to blocks
growing than sparse files.

The main problem with using sparse files is that currently postgres is
careful to allocate blocks early so it can fail  if there's not enough
space. With your sparse file solution Postgres might only find out
there's no space after it has already committed a transaction.
bgwriter has no good course of action to take if it finds out there's
nowhere to put the data it has in shared buffers.

But I think even if you solve that it's not really a good long-term
solution. We don't know how the OS handles block allocation for this
type of file. I'm actually moderately surprised it isn't skipping
enough blocks assuming you'll allocate them eventually. Even if it
does handle it the way you expect, what happens when you do grow a
block? It'll have to allocate it way out of the way and we have no way
to repair that discontinuity later.

Also, the way you've preallocated blocks effectively nails the maximum
compression at 2x. That seems to be leaving a lot of money on the
table.

To handle read-write tables I think we would need to directly
implement the kind of indirection layer that you're getting out of the
filesystem's block layer currently. That would let you allocate enough
blocks to hold the data uncompressed and then free up those blocks
once you're sure the data is compressible.

One possibility is to handle only read-only tables. That would make
things a *lot* simpler. But it sure would be inconvenient if it's only
useful on large static tables but requires you to rewrite the whole
table -- just what you don't want to do with large static tables -- to
get the benefit.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Robert Haas
On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark <gsst...@mit.edu> wrote:
 I like the idea too, but I think there are some major problems to
 solve. In particular I think we need a better solution to blocks
 growing than sparse files.

How much benefit does this approach have over using TOAST compression
more aggressively?

...Robert

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Kevin Grittner
Robert Haas <robertmh...@gmail.com> wrote:
 On Thu, Aug 6, 2009 at 4:03 PM, Greg Stark <gsst...@mit.edu> wrote:
 I like the idea too, but I think there are some major problems to
 solve. In particular I think we need a better solution to blocks
 growing than sparse files.
 
 How much benefit does this approach have over using TOAST
 compression more aggressively?
 
I was wondering the same thing.  It seems like compressing a page at a
time should allow more space savings than a column at a time, and
possibly do it faster.
 
One question I have banging around in my head is what to do with
out-of-line storage.  Sometimes you have a large column which you know
contains data which is already compressed and/or encrypted, so
attempting compression would give little or no benefit; so I'm
inclined to think that if we do page compression, it shouldn't deal
with toast tables.  We could leave them to the current techniques. 
That leaves some subtle problems with how to deal with a datum which
currently compresses from, say, several kB down to one or two hundred
bytes.  Current TOAST logic would typically compress and inline it.
What would we do if we're trying to push heap compression to the page
level?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Josh Berkus
On 8/6/09 1:03 PM, Greg Stark wrote:
 One possibility is to handle only read-only tables. That would make
 things a *lot* simpler. But it sure would be inconvenient if it's only
 useful on large static tables but requires you to rewrite the whole
 table -- just what you don't want to do with large static tables -- to
 get the benefit.

While less flexible, I could see combining this with partitioning to
still be useful.  If we could rewrite specific partitions as compressed,
then there are a lot of cumulative-data applications which it would benefit.

Not as exciting as being able to compress the whole thing, of course.

-- 
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Table and Index compression

2009-08-06 Thread Ron Mayer
I'm curious what advantages there are in building compression into
the database itself, rather than using filesystem-based compression.

I see ZFS articles[1] discuss how enabling compression
improves performance with ZFS; for Linux, Btrfs has compression
features as well[2]; and on Windows, NTFS does too.

[1]http://blogs.sun.com/observatory/entry/zfs_compression_a_win_win
[2]http://lwn.net/Articles/305697/



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers