We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.

I've been trying to figure out a plan to enable native column stores (CS or "colstore") for Postgres. Motivations:

* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance

There already are some third-party CS implementations for Postgres; some of these work on top of the FDW interface, others are simply proprietary forks. Since I don't have access to any of their code, there's not much I can learn from them. If people with insider knowledge of them can chime in, perhaps we can work together -- collaboration is very welcome.

We're not interested in perpetuating the idea that a CS needs to go through the FDW mechanism. Even though that would simplify the implementation considerably, it's almost certain to introduce too many limitations.

Simply switching all our code to use columnar storage rather than row-based storage is unlikely to go well. We're aiming at letting some columns of a table be part of a CS, while the rest continue to live in the heap. At the same time, we're aiming at opening the way for different CS implementations instead of trying to provide a single one-size-fits-all one.

There are several parts to this:

1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor

The Column Store Manager API
----------------------------

Since we want to have pluggable implementations, we need a registry of store implementations. I propose we add a catalog pg_cstore_impl with OID, name, and a bunch of function references to "open" a store, "getvalue" from it, "getrows" (to which we pass a qual and get a bunch of tuple IDs back), and "putvalue". This is in line with our procedural language support.
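To make the shape of this more concrete, here's a minimal sketch of how those function references might surface in C; every name here (ColumnStoreRoutine, the cs_* members, ColumnStoreHandle) is a hypothetical placeholder, not a settled design:

    /* Purely a sketch; all names and signatures are placeholders. */
    typedef struct ColumnStoreHandle ColumnStoreHandle;    /* opaque, per-implementation */

    typedef struct ColumnStoreRoutine
    {
        /* "open" a store, returning implementation-private state */
        ColumnStoreHandle *(*cs_open) (Oid cstoreid, int lockmode);

        /* "getvalue": fetch one column value for the row identified by tid */
        Datum (*cs_getvalue) (ColumnStoreHandle *store, AttrNumber attnum,
                              ItemPointer tid, bool *isnull);

        /* "getrows": pass a qual, get a bunch of tuple IDs back */
        List *(*cs_getrows) (ColumnStoreHandle *store, List *quals);

        /* "putvalue": store one column value for the row identified by tid */
        void (*cs_putvalue) (ColumnStoreHandle *store, AttrNumber attnum,
                             ItemPointer tid, Datum value, bool isnull);
    } ColumnStoreRoutine;

Each pg_cstore_impl row would then point at a handler function returning one of these, much as a procedural language's pg_language row points at its call handler.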
One critical detail is what will be used to identify a heap row when talking to a CS implementation. There are two main possibilities:

1. use CTIDs
2. use some logical tuple identifier (LTID)

Using CTIDs is simpler. One disadvantage is that every UPDATE of a row needs to let the CS know about the new location of the tuple, so that the value is known to be associated with the new tuple location as well as the old one. This needs to happen even if the value of the column itself is not changed.

Using logical tuple identifiers solves this problem: an update does not change the LTID, so the colstore needn't be involved unless the attribute(s) in the colstore are being changed. The downside is that the logical tuple identifier must come from somewhere. We could use some user attribute, if there's an appropriate one, but it's probably not a good idea to simply use whatever primary key the user has specified. (Also, having an UPDATE change the primary key would be troublesome.) We could also offer the choice of an autogenerated value that's not user-visible; we currently don't have non-user-visible columns, so this would be additional implementation effort. Furthermore, we should think about interactions between this and the IDENTITY stuff we currently have for replication -- my impression is that IDENTITY does not necessarily represent a useful identifier for column store purposes.

All in all, it seems prudent to limit the initial implementation to CTIDs only, and leave LTIDs for a later stage.

Cataloguing Column Stores
-------------------------

Each table with columns in a separate store will have relhasstore=t. This is a hint to fetch the rows in pg_cstore for that table when its relcache entry is built.

The new catalog pg_cstore looks like this:

    cstname | cststoreid | cstrelid | cstnatts | cstatts

* cstname is the store name; unique within each relation.
* cststoreid is the OID of the pg_cstore_impl row.
* cstrelid is the OID of the table this cstore is for.
* cstnatts is the number of columns in the store.
* cstatts is an array of the attnums contained in this store.

This design permits having multiple stores for a table, and one or more columns in a store. We will focus on the case where a table has a single column store, and a column store has a single column, because we believe this simplifies several things.
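For illustration, the catalog definition might look something like this in genbki terms (a sketch only: the OID is an arbitrary placeholder, and whether int2vector is the right representation for cstatts is an open question):

    /* Hypothetical src/include/catalog/pg_cstore.h (sketch) */
    CATALOG(pg_cstore,3500)
    {
        NameData    cstname;     /* store name, unique within the relation */
        Oid         cststoreid;  /* OID of the pg_cstore_impl row */
        Oid         cstrelid;    /* OID of the table this store is for */
        int16       cstnatts;    /* number of columns in the store */
        int2vector  cstatts;     /* attnums of the columns in the store */
    } FormData_pg_cstore;

    typedef FormData_pg_cstore *Form_pg_cstore;

With the single-store, single-column restriction above, each table would initially have at most one such row, with cstnatts = 1.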
Query Processing
----------------

Rewriter

Parsing occurs as it does currently. During query rewrite, specifically at the bottom of the per-relation loop in fireRIRrules(), we will modify the query tree: each relation RTE containing a colstore will be replaced with a JoinExpr containing the relation as left child and the colstore as right child (1). The colstore RTE will be of a new RTEKind. For each such change, all Var nodes that point to attnums stored in the colstore will be modified so that they reference the RTE of the colstore instead (2).

(1) This seems very similar to what convert_ANY_sublink_to_join() does.
(2) This is very similar to what ChangeVarNodes() does, except that we modify only some of the Var nodes pointing to the relation, not all of them.

The addition of the new RTE will use a new function addRangeTableEntryForColstore(). This seems a bit odd, because that's a parse-time function, but I see that we're already using -ForSubquery in convert_ANY_sublink_to_join(), so it's probably okay.

Another thing worth mentioning is that the colstore and the relation each need a junk attribute which we use to join them. This will be the CTID for now, as indicated above.
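To sketch what that splice might look like (very roughly -- alias bookkeeping, the jointree surgery, and the junk-attribute plumbing are all waved away, and make_ctid_join_qual(), ChangeColstoreVarNodes(), and colstore_attnums() are invented names for code that doesn't exist):

    /* Rough sketch only; several helpers here are invented names. */
    static void
    splice_colstore_join(Query *parsetree, int rti, Relation rel)
    {
        RangeTblEntry *csrte;
        JoinExpr   *join = makeNode(JoinExpr);
        RangeTblRef *lref = makeNode(RangeTblRef);
        RangeTblRef *rref = makeNode(RangeTblRef);

        /* add an RTE of the new RTEKind for the relation's column store */
        csrte = addRangeTableEntryForColstore(parsetree, rel);
        parsetree->rtable = lappend(parsetree->rtable, csrte);

        lref->rtindex = rti;                             /* the heap relation */
        rref->rtindex = list_length(parsetree->rtable);  /* the colstore */

        /* inner join of heap and store on the junk CTID attribute */
        join->jointype = JOIN_INNER;
        join->larg = (Node *) lref;
        join->rarg = (Node *) rref;
        join->quals = make_ctid_join_qual(rti, rref->rtindex);   /* invented */

        /*
         * ... replace the relation's entry in the join tree with 'join',
         * then repoint only the Vars whose attnums live in the store ...
         */
        ChangeColstoreVarNodes((Node *) parsetree, rti, rref->rtindex,
                               colstore_attnums(rel));           /* invented */
    }

Conceptually, for a table t(a,b) with b kept in a colstore, a query like SELECT a, b FROM t WHERE b > 10 would end up shaped as if it had been written SELECT t.a, cs.b FROM t JOIN cs ON t.ctid = cs.ctid WHERE cs.b > 10, except that the change happens in the query tree rather than at the SQL level.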
Planner

I still haven't formed a clear idea of what needs to change in planner code to handle column stores specially. So far I have the idea that we will need to distinguish them from other RT kinds by using a new value RELOPT_COLSTORE for RelOptKind. I think it's not okay to use RELOPT_BASEREL, because a colstore is not a full-blown relation; and "otherrel" also seems inappropriate, because we do need colstores to appear in the join tree, as described above, which current otherrels (Append/MergeAppend children) do not. Some things such as inner join removal should work on colstore RTEs just like on plain relations.

Unless the optimization problems are shown to be easily solvable, we will probably disallow having column stores on inherited relations (esp. partitioned relations). We can improve the optimizer later to allow for this, after the basics are in place.

Executor

We have a few ideas for a couple of new executor nodes that will need to operate on colstore RTEs. For example, we can delay accessing columns in a store until we need to project them; if rows from the main relation are filtered somewhere upstream, any columns in the store needn't be accessed for the filtered rows. We have been using the name LateColumnMaterialization for this.

A completely different consideration is offloading some query quals to a lone column store scan, which can be combined using BitmapOr or BitmapAnd with other quals on the relation; the idea is very similar to BitmapIndexScan in that it takes a qual and produces a list of TIDs. We're calling this BitmapColumnScan.

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 318633.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services