On 1/16/17 10:09 PM, Haribabu Kommi wrote:
Yes, that's correct. Currently, with this approach it is not possible to ditch the heap completely. This approach is useful for cases where the user wants to store only some columns as part of a clustered index.

Ahh, that's unfortunate. Billion-row+ tables are becoming rather common, and at 24 bytes of tuple header per row, that 24GB of heap overhead starts becoming very painful. It's actually a lot worse considering there will be at least one index on the table, so 100GB+ of overhead isn't that uncommon.

    Another complication is that one of the big advantages of a CSTORE
    is allowing analysis to be done efficiently on a column-by-column
    (as opposed to row-by-row) basis. Does your patch by chance provide
    that?

Not in the base patch that I shared, but the follow-up patches provide column-by-column data access using the custom plan methods.

Great, that's something else that a column store really needs to be successful. Something else I suspect is necessary is a faster/better way to eliminate chunks of rows from scans.

Just as an example, with my simple array-based approach, you can store a range type along with each array that contains the min and max values for the array. That means any query that wants values between 50 and 100 can include a clause that filters on range types that overlap with [50,100]. That can be indexed very efficiently and is fast to run checks against.
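
To make that concrete, here's a rough sketch of what I mean (the table and column names are made up for illustration):

    -- Each row stores a chunk of values plus a range covering the
    -- chunk's min and max.
    CREATE TABLE measurements_cstore (
        chunk_id  bigserial PRIMARY KEY,
        vals      int[] NOT NULL,
        val_range int4range NOT NULL  -- int4range(min(vals), max(vals), '[]')
    );
    CREATE INDEX ON measurements_cstore USING gist (val_range);

    -- A query for values between 50 and 100 can prune whole chunks
    -- via the GiST index before ever unnesting an array:
    SELECT m.chunk_id, v
      FROM measurements_cstore m,
           LATERAL unnest(m.vals) AS v
     WHERE m.val_range && int4range(50, 100, '[]')
       AND v BETWEEN 50 AND 100;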

    Generally speaking, I do think the idea of adding support for this
    as an "index" is a really good starting point, since that part of

... as discussed elsewhere in the thread, adding a bunch of hooks is probably not a good way to do this. :/

    That would be a great way to gain knowledge on what users would want
    to see in a column store, something else I suspect we need. It would
    also be far less code than what you or Alvaro are proposing. When it
    comes to large changes that don't have crystal-clear requirements, I
    think that's really important.

The main use case of this patch is to support mixed-workload environments, where both OLTP and OLAP queries are possible. The advantage of the proposed patch design is that it provides good performance for OLAP queries without affecting OLTP.

Yeah, that's a big part of what I was envisioning with my array-based approach. In simple terms, there would be a regular row-based table and an array-based table, with a view that allows seamless querying of both (re-presenting the array storage on a per-row basis). There would be a periodic process that moves entire sets of rows from row storage into array storage.
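
A minimal sketch of that arrangement (names are hypothetical):

    CREATE TABLE events_row (
        id  bigint NOT NULL,
        val int
    );

    -- Element i of ids pairs with element i of vals.
    CREATE TABLE events_array (
        ids  bigint[] NOT NULL,
        vals int[]    NOT NULL
    );

    -- The view re-presents the array storage on a per-row basis.
    CREATE VIEW events AS
        SELECT id, val FROM events_row
        UNION ALL
        SELECT u.id, u.val
          FROM events_array a,
               LATERAL unnest(a.ids, a.vals) AS u(id, val);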

If you updated or deleted a row that was part of an array, the contents of the entire array could be moved back into row-based storage. After a period of time, rows would get moved back into array storage. Or the array could be modified in place, but you need to be very careful about bloating the array storage if you do that.
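
Continuing the sketch above, "exploding" a chunk back into row storage before touching one of its rows could look something like this (the id 42 is just a placeholder):

    -- Pull the chunk containing row 42 out of array storage and
    -- re-insert its contents as plain rows; the UPDATE/DELETE then
    -- runs against events_row as usual.
    WITH moved AS (
        DELETE FROM events_array
         WHERE ids @> ARRAY[42::bigint]
        RETURNING ids, vals
    )
    INSERT INTO events_row (id, val)
    SELECT u.id, u.val
      FROM moved,
           LATERAL unnest(moved.ids, moved.vals) AS u(id, val);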

The big missing piece here is getting the planner to intelligently handle a mixed row/column store. As I mentioned, you can easily add range type fields to greatly increase performance, but they won't do any good unless the appropriate filters get added. It's not THAT hard to do that by hand, but it'd be great if there was a more automated method. Such a method might also be very useful for transforming expressions like date_part('quarter', ...) into something that could use existing indexes.
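
As a hand-done example of that kind of transform (assuming a hypothetical table orders with a timestamptz column ts and a plain btree index on ts):

    -- Before: the expressions hide ts from the index.
    SELECT count(*) FROM orders
     WHERE date_part('year', ts) = 2017
       AND date_part('quarter', ts) = 2;

    -- After: an equivalent sargable range that can use the index.
    SELECT count(*) FROM orders
     WHERE ts >= '2017-04-01' AND ts < '2017-07-01';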
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)

