Be forewarned - this is probably a very long post, and I'm just a mere mortal (i.e. an admin) who doesn't write copious amounts of C code. Take the following post and its suggestions with a grain of salt.

So I've been seeing/hearing all of the hoopla over vertical databases (column stores), and how they'll not only slice bread but also make toast, etc. I've done some quick searches for past articles on "C-Store", "Vertica", "Column Store", and "Vertical Database", and have seen little discussion on this. And then a funny thought occurs to me - when I look at the directory structure and file layout of a PostgreSQL database, I see that each OID corresponds to a table, which corresponds to (generally) a single file. Then I have a second funny thought - what if there was a low-friction, low-cost-of-implementation way to bring similar advantages to PostgreSQL without major alterations, recoding, etc? Finally it occurs to me that PostgreSQL already does something similar but it could do it so much better, with only one language change and minor changes to the storage layout. So here's my plum-crazy proposal (and I've made some before - see http://archives.postgresql.org/pgsql-odbc/2006-10/msg00040.php - and they not only made it into production, but they are in active use by me on a weekly basis - Thanks Hiroshi!!!), bear with me...

Make one very small syntactic change to "CREATE TABLE" that adds a new keyword, "COLUMN-STORE" or something similar. I don't care where it appears as long as it's after the "CREATE TABLE". You would not have to break any existing SQL conventions, PostgreSQL would remain SQL compliant, and given the odd wording, I highly doubt that the folks who maintain the SQL standard will ever claim that keyword. If adding COLUMN-STORE is objectionable because it would "cloud the compliance of the language", then simply move the functionality into tablespaces. In hindsight, it might even belong there instead: rather than specifying it per table, create a tablespace that has a "Column Storage" attribute set to active. When inactive, it uses the traditional one-file-per-table layout.
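For concreteness, the hypothetical syntax might look something like the following sketch. The keyword, the tablespace option, and all names here are invented for illustration - nothing like this exists in PostgreSQL today:

```sql
-- Per-table variant (hypothetical keyword):
CREATE TABLE sales (
    sale_id  integer,
    region   text,
    amount   numeric
) COLUMN-STORE;

-- Tablespace variant (hypothetical "column_storage" option):
CREATE TABLESPACE olap_space LOCATION '/data/olap'
    WITH (column_storage = true);
CREATE TABLE sales (
    sale_id  integer,
    region   text,
    amount   numeric
) TABLESPACE olap_space;
```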

Make each table column capable of receiving an OID. This will be critical for the following steps...

If a table is created with "COLUMN-STORE" as an option, it will continue to behave the way it always has, but the storage will be different: each column in the table will be represented by a single file, with the file name being (naturally) the OID. INSERT/UPDATE/DELETE would function as they always have; nothing would change except how the data is stored. The existing TOAST mechanisms continue to work, because the engine would treat each file as a single-column table! One additional "column" would be added to the store - an invisible one that tracks both the OID for the "rows" in this type of setup and the state of each row. Let's call this the "Control Column". Given that the metadata size for a row in the Control Column would be fixed/constant, we won't have to worry about what is in the other "columns" and "rows"; they can be any size. BTW, the Control Column would be just another column from the storage engine's point of view - it just happens to be one that no one can see, other than the database (and maybe the administrator).
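To make the layout concrete, here is a toy model of the idea in Python - one file per column, named by a fake OID, plus a hidden "Control Column" file of fixed-size metadata. The class, the OID numbering, and pickle-based files are all my own invention for illustration; they are not how PostgreSQL actually stores heap data:

```python
import os
import pickle
import tempfile

class ColumnStoreTable:
    """Toy model of the proposal: one file per column, named by a fake
    OID, plus a hidden "Control Column" file holding fixed-size row
    metadata (row OID + live flag)."""

    def __init__(self, directory, columns):
        self.directory = directory
        self.columns = columns
        # Fake OIDs; in PostgreSQL the catalog would assign these.
        self.oids = {c: 16384 + i for i, c in enumerate(columns)}
        self.control_oid = 16384 + len(columns)  # the hidden column
        self.next_row_oid = 1
        for oid in [*self.oids.values(), self.control_oid]:
            self._save(oid, [])

    def _path(self, oid):
        return os.path.join(self.directory, str(oid))

    def _load(self, oid):
        with open(self._path(oid), "rb") as f:
            return pickle.load(f)

    def _save(self, oid, values):
        with open(self._path(oid), "wb") as f:
            pickle.dump(values, f)

    def insert(self, row):
        # The row is split into one append per column file; the control
        # column gets a matching entry to keep every file in step.
        for c in self.columns:
            vals = self._load(self.oids[c])
            vals.append(row[c])
            self._save(self.oids[c], vals)
        control = self._load(self.control_oid)
        control.append({"row_oid": self.next_row_oid, "live": True})
        self._save(self.control_oid, control)
        self.next_row_oid += 1

    def scan(self, column):
        # An OLAP-style scan of one column touches only two files:
        # the column itself and the control column (to skip dead rows).
        control = self._load(self.control_oid)
        vals = self._load(self.oids[column])
        return [v for v, meta in zip(vals, control) if meta["live"]]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        t = ColumnStoreTable(d, ["region", "amount"])
        t.insert({"region": "east", "amount": 10})
        t.insert({"region": "west", "amount": 25})
        print(t.scan("amount"))       # one column read, two files touched
        print(sorted(os.listdir(d)))  # one file per column + the control column
```

The point of the sketch is only the file layout: a query that aggregates one column never has to read the files for the others.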

When you go to VACUUM a table, you would treat each column as a single-column table, so if a row is a candidate for VACUUM reclamation, each "column" is adjusted by an equal amount. Under no circumstances would the columns be "out of sync": if a record goes, its entry in every column goes with it. This sounds disk-intensive at first, until you realize that the admin will have made a conscious decision to use this format, and understands the advantages/drawbacks of the method. So it takes a little longer to VACUUM - I don't care, because as an admin I will have specified this layout for a reason: to do OLAP, not OLTP. Which means I rarely VACUUM it. Add to this the efficiency you would gain by packing more records into each buffer per read, and the losses you take in re-reading data would not amount to nearly as much as you might think.
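The lockstep behavior can be sketched in a few lines, assuming an in-memory stand-in for the per-column files (plain lists instead of files, with the control-column entries as dicts - all names invented for illustration). Every column is compacted by exactly the same row positions, driven solely by the live/dead flags in the control column, so the columns can never drift apart:

```python
def vacuum(columns, control):
    """Toy VACUUM pass for the proposed layout.
    columns: dict mapping column name -> list of values (one "file" each).
    control: list of {"row_oid": int, "live": bool}, same length as each column.
    Returns compacted (columns, control) with dead rows reclaimed."""
    # One scan of the control column decides which positions survive...
    keep = [i for i, meta in enumerate(control) if meta["live"]]
    # ...and every column is compacted by exactly those positions.
    new_columns = {name: [vals[i] for i in keep]
                   for name, vals in columns.items()}
    new_control = [control[i] for i in keep]
    return new_columns, new_control

if __name__ == "__main__":
    cols = {"region": ["east", "west", "north"], "amount": [10, 25, 7]}
    ctl = [{"row_oid": 1, "live": True},
           {"row_oid": 2, "live": False},  # deleted row awaiting reclamation
           {"row_oid": 3, "live": True}]
    cols, ctl = vacuum(cols, ctl)
    print(cols)  # row 2's entry is gone from every column at once
```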

DELETE would simply mark a row as deleted in the "Control Column". If the storage engine needed to reclaim a row, it would not have to look any further than the Control Column to find an empty spot where it could overwrite data.
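A minimal sketch of that, again assuming an in-memory stand-in for the control column (a list of invented {"row_oid", "live"} entries). Note that neither operation touches any per-column data:

```python
def delete(control, row_oid):
    """DELETE just flips the live flag in the control column."""
    for meta in control:
        if meta["row_oid"] == row_oid and meta["live"]:
            meta["live"] = False
            return True
    return False  # no such live row

def find_reusable_slot(control):
    """Reclaiming space only requires scanning the control column for a
    dead entry; no per-column file needs to be read to locate the spot."""
    for i, meta in enumerate(control):
        if not meta["live"]:
            return i
    return None
```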

INSERT/UPDATE continue to work as they always have. The storage engine would perceive each "column" as a single-column table, meaning that the existing TOAST mechanisms continue to work! Nothing needs to change there. The real change is that each row write would be "split up" into individual per-column updates, with the "Control Column" keeping all of the records in sync.
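Here is how an UPDATE might be "split up", as a toy sketch assuming PostgreSQL's usual MVCC behaviour (write a new row version, mark the old one dead) and the same kind of in-memory stand-in for the per-column files and control column (all names invented):

```python
def update(columns, control, row_oid, changes, new_row_oid):
    """Toy MVCC-style UPDATE for the proposed layout.
    columns: dict mapping column name -> list of values (one "file" each).
    control: list of {"row_oid": int, "live": bool} entries.
    changes: dict of column name -> new value for the changed columns."""
    # Find the live version of the row via the control column.
    idx = next(i for i, meta in enumerate(control)
               if meta["row_oid"] == row_oid and meta["live"])
    control[idx]["live"] = False            # old version becomes dead
    for name, vals in columns.items():      # one append per column file:
        vals.append(changes.get(name, vals[idx]))  # new or carried-over value
    control.append({"row_oid": new_row_oid, "live": True})
```

The single control-column entry at the end is what ties the per-column appends back together into one row version.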

Why bother with this? Because, when all is said and done, you will find yourself with a rough equivalent of a column-store database, with all of the OLAP goodness that people are looking for. You have little if any impact on the admins' or users' perception, other than a flag that was checked somewhere and forgotten about. From the storage engine's perspective, you have many, many small 1-column tables to take care of, and they all update at the same "place" at the same "time" to keep the records in sync when you recompose a row. TOAST and large-object storage work the same as before; nothing changes, and that's as it should be.

All with what would be (hopefully) a minor change to the storage backend. We're not talking about brain surgery on existing, tested code, but rather a new feature that uses existing features in place. As TOAST improves, so does this feature. As caching improves, so does the feature again. And so on.

I've been in a bit of a hurry to blurt all of this out, and I'm sure I've forgotten something along the way, so if you find something missing, please be patient - I had to write all of this in about 20 minutes and didn't have a lot of time.
