Re: [HACKERS] On columnar storage (2)
Hey Alvaro,

I was referring to https://wiki.postgresql.org/wiki/ColumnOrientedSTorage . And yes, I'll be at the next FOSDEM / pgconf.eu for sure. :-)

Bert

On Thu, Mar 3, 2016 at 3:40 PM, Alvaro Herrera wrote:
> Bert wrote:
> > Alvaro,
> > You wrote that a wiki page would be opened regarding this. But I still
> > cannot find such a page (except for an old page which hasn't changed in
> > the last year). Is there already something we can look at?
>
> Yeah, I haven't done that yet. I will post here as soon as I get that
> done. Happy to share another beer to discuss, next time I'm over there.
> I'm also going to have code to share for you to test by then!
>
> What's the other page you mention?
>
> --
> Álvaro Herrera http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Bert Desmet
0477/305361
Re: [HACKERS] On columnar storage (2)
Haribabu Kommi wrote:
> The performance report is taken on the WIP columnar storage patch for
> PostgreSQL. Only the storage part of the code is finished. To test the
> performance, we used a custom plan to generate plans that can use the
> columnar storage. This is how we ran the performance test.

Quickly eyeballing your results, I think they are similar to ours: there are some performance gains, but nothing spectacular. That's why I want to take another, more invasive approach that buys us more. The wiki page I'm to write will describe our rough plan for that. Your input on it will be appreciated.

> I want to integrate this patch with the syntax proposed by Alvaro for
> columnar storage and share it with the community. Before that, I want to
> share the current storage design with the community for review by
> preparing a README file. I will try to send this soon.

Please do, thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage (2)
Bert wrote:
> Alvaro,
> You wrote that a wiki page would be opened regarding this. But I still
> cannot find such a page (except for an old page which hasn't changed in
> the last year). Is there already something we can look at?

Yeah, I haven't done that yet. I will post here as soon as I get that done. Happy to share another beer to discuss, next time I'm over there. I'm also going to have code to share for you to test by then!

What's the other page you mention?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage (2)
On Thu, Mar 3, 2016 at 7:46 PM, Bert wrote:
> Thank you for the performance test. But please note that the patch is
> 'thrown away' and will be totally rewritten. I have no idea of the
> status of the second / third attempt, however.
> However, what is interesting is that for some queries this patch is
> already on par with VCI. Which db is that exactly?

The performance report is taken on the WIP columnar storage patch for PostgreSQL. Only the storage part of the code is finished. To test the performance, we used a custom plan to generate plans that can use the columnar storage. This is how we ran the performance test.

I want to integrate this patch with the syntax proposed by Alvaro for columnar storage and share it with the community. Before that, I want to share the current storage design with the community for review by preparing a README file. I will try to send this soon.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] On columnar storage (2)
Hello Haribabu,

Thank you for the performance test. But please note that the patch is 'thrown away' and will be totally rewritten. I have no idea of the status of the second / third attempt, however.

However, what is interesting is that for some queries this patch is already on par with VCI. Which db is that exactly?

Alvaro,
You wrote that a wiki page would be opened regarding this. But I still cannot find such a page (except for an old page which hasn't changed in the last year). Is there already something we can look at?

Bert

On Thu, Mar 3, 2016 at 6:07 AM, Haribabu Kommi wrote:
> On Mon, Feb 1, 2016 at 12:11 AM, Alvaro Herrera wrote:
> > So we discussed some of this stuff during the developer meeting in
> > Brussels and the main conclusion is that we're going to split this up in
> > multiple independently useful pieces, and write up the general roadmap
> > in the wiki so that we can discuss in detail on-list.
> >
> > I'm marking this as Returned with Feedback now.
> >
> > Thanks everybody,
>
> Here I attached the DBT-3 performance report that is measured on the
> prototype patch that is written for columnar storage, as I mentioned in
> my earlier mail, with the WOS and ROS design.
>
> Currently, to measure the benefits of this design, we did the following:
> 1. Created the columnar storage index similar to other index methods
> 2. Used a custom plan to generate plans that can use the columnar storage
> 3. Optimized parallelism to use the columnar storage
>
> The code is not fully ready yet; I posted the performance results to get
> a view from the community on whether this approach is really beneficial.
>
> I will provide the full details of the design and WIP patches later.
>
> Regards,
> Hari Babu
> Fujitsu Australia

--
Bert Desmet
0477/305361
Re: [HACKERS] On columnar storage (2)
On Mon, Feb 1, 2016 at 12:11 AM, Alvaro Herrera wrote:
> So we discussed some of this stuff during the developer meeting in
> Brussels and the main conclusion is that we're going to split this up in
> multiple independently useful pieces, and write up the general roadmap
> in the wiki so that we can discuss in detail on-list.
>
> I'm marking this as Returned with Feedback now.
>
> Thanks everybody,

Here I attached the DBT-3 performance report that is measured on the prototype patch that is written for columnar storage, as I mentioned in my earlier mail, with the WOS and ROS design.

Currently, to measure the benefits of this design, we did the following:
1. Created the columnar storage index similar to other index methods
2. Used a custom plan to generate plans that can use the columnar storage
3. Optimized parallelism to use the columnar storage

The code is not fully ready yet; I posted the performance results to get a view from the community on whether this approach is really beneficial.

I will provide the full details of the design and WIP patches later.

Regards,
Hari Babu
Fujitsu Australia

DBT3_performance_vci_community.xls
Description: MS-Excel spreadsheet
Re: [HACKERS] On columnar storage (2)
So we discussed some of this stuff during the developer meeting in Brussels and the main conclusion is that we're going to split this up in multiple independently useful pieces, and write up the general roadmap in the wiki so that we can discuss in detail on-list.

I'm marking this as Returned with Feedback now.

Thanks everybody,

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage (2)
On Mon, Dec 28, 2015 at 2:15 PM, Alvaro Herrera wrote:
>> 1. CS API.
>> I agree with you that the FDW API seems not to be enough to efficiently
>> support work with CS. At least we need batch insert.
>> But maybe it is better to extend the FDW API rather than creating a
>> special API for CS?
>
> The patch we have proposed thus far does not mess with executor
> structure too much, so probably it would be possible to add some things
> here and there to the FDW API and it might work. But in the long term I
> think the columnar storage project is more ambitious; for instance, I'm
> sure we will want to be able to vectorise certain operations, and the
> FDW API will become a bottleneck, so to speak. I'm thinking of
> vectorisation in two different ways: one is that some operations such as
> computing aggregates over large data sets can work a lot faster if you
> feed the value of one column for multiple tuples at a time in columnar
> format; that way you can execute the operation directly in the CPU
> (this requires specific support from the aggregate functions.)
> For this to work, the executor needs to be rejigged so that multiple
> values (tuples) can be passed at once.
>
> The other aspect of vectorisation is that one input tuple might have
> been split across several data origins, so that one half of the tuple is
> in columnar format and the other half is in row format; that lets you do
> very fast updates on the row-formatted part, while allowing fast reads
> for the columnar format, for instance. (It's well known that
> column-oriented storage does not go well with updates; some
> implementations even disallow updates and deletes altogether.)
>
> Currently within the executor a tuple is a TupleTableSlot which contains
> one Datum array, which has all the values coming out of the HeapTuple;
> but for split-storage tuples, we will need to have a TupleTableSlot that
> has multiple "Datum arrays" (in a way --- because, actually, once we get
> to vectorise as in the preceding paragraph, we no longer have a Datum
> array, but some more complex representation).
>
> I think that if we try to make the FDW API address all these concerns,
> while at the same time *also* serving the needs of external data
> sources, insanity will ensue.

I think the opposite. Suppose we add vectorization support (or whatever other feature, could be asynchronous execution or faster-than-light travel or whatever) to the executor. Well, are we going to say that FDWs can't get access to that feature? I think that would be an extremely surprising decision. Presumably, if we add cool capabilities to the executor, we want FDWs to be able to get access to those new capabilities just as built-in tables can. So, we'll probably think about what new FDW methods - optional methods, probably - would be needed to expose the new capabilities and add them.

Now, there may still be some reason why it doesn't make sense to have the columnar store stuff go through the FDW API. It's sorta doing something different. If you tilt your head right, a table with a columnar store smells a lot like two tables that will frequently need to be joined; and if we were to implement it that way, then one of those tables would just be a table, and the other one would be a "foreign table" that actually has backing storage. If we don't do it that way, then I'm curious what my mental model for this feature should be. We don't have any concept currently of an "incomplete tuple" that includes only a subset of the columns. Some of the columns can be TOAST pointers that have to be expanded before use, but they can't be left out altogether...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
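The "two tables that will frequently need to be joined" mental model above can be sketched in a few lines of Python (a toy illustration only, not PostgreSQL code; all names and values are invented):

```python
# Toy model of the split-tuple view: a row-wise heap plus a columnar side
# store, linked by tuple id (standing in for ctid).
heap = {1: {"id": 101, "name": "alice"},
        2: {"id": 102, "name": "bob"}}
colstore = {"salary": {1: 50000, 2: 60000},   # column -> {tid: value}
            "dept":   {1: "eng",  2: "ops"}}

def fetch(tid):
    """Reassemble a complete tuple by joining both halves on tid."""
    row = dict(heap[tid])                     # row-formatted half
    for col, values in colstore.items():      # column-formatted half
        row[col] = values[tid]
    return row

print(fetch(1))  # {'id': 101, 'name': 'alice', 'salary': 50000, 'dept': 'eng'}
```

Reading only `colstore["salary"]` never touches the heap at all, which is the case the "incomplete tuple" question is really about.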
Re: [HACKERS] On columnar storage (2)
On 12/28/15 1:15 PM, Alvaro Herrera wrote:
> Currently within the executor a tuple is a TupleTableSlot which contains
> one Datum array, which has all the values coming out of the HeapTuple;
> but for split-storage tuples, we will need to have a TupleTableSlot that
> has multiple "Datum arrays" (in a way --- because, actually, once we get
> to vectorise as in the preceding paragraph, we no longer have a Datum
> array, but some more complex representation).
>
> I think that if we try to make the FDW API address all these concerns,
> while at the same time *also* serving the needs of external data
> sources, insanity will ensue.

Are you familiar with DataFrames in Pandas[1]? They're a collection of Series[2], which are essentially vectors. (Technically, they're more complex than that because you can assign arbitrary indexes.) So instead of the normal collection of rows, a DataFrame is a collection of columns. Series are also sparse (like our tuples), but the sparse value can be anything, not just NULL (or NaN in panda-speak). There's also DataFrames in R; not sure how equivalent they are.

I mention this because there's a lot being done with dataframes, and they might be a good basis for a columnstore API, killing two birds with one stone.

BTW, the underlying python type for Series is ndarrays[3], which are specifically designed to interface to things like C arrays. So a column store could potentially be accessed directly. Aside from potential API inspiration, it might be useful to prototype a columnstore using Series (or maybe ndarrays).

[1] http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
[2] http://pandas.pydata.org/pandas-docs/stable/api.html#series
[3] http://docs.scipy.org/doc/numpy-1.10.0/reference/internals.html

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
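The pandas suggestion is easy to try out: a DataFrame really is a collection of per-column Series backed by contiguous ndarrays (a quick illustration with made-up data; the column names are invented):

```python
import pandas as pd

# A DataFrame is column-oriented: each column is a Series over an ndarray.
df = pd.DataFrame({"symbol": ["AAA", "AAB", "AAA"],
                   "close":  [12.0, 8.0, 11.5],
                   "volume": [100, 200, 50]})

col = df["close"]              # pull one column; no row reassembly needed
arr = col.to_numpy()           # the underlying contiguous array

# Columnar layout makes whole-column operations a single vectorised pass:
notional = float((df["close"] * df["volume"]).sum())
print(type(col).__name__, notional)   # Series 3375.0
```

Exposing `arr` directly to C code is roughly what "a column store could potentially be accessed directly" would mean in practice.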
Re: [HACKERS] On columnar storage (2)
Konstantin Knizhnik wrote:
> 3. Transpose of data and role of CS.
> Let's look once again at the Quote example above. Data is received in
> time-ascending order, but most queries require grouping it by symbol, so
> at some stage we have to "transpose" the data. To efficiently append
> data to a timeseries we need to buffer it somewhere and then append a
> range of values. In the Fujitsu approach, two different representations
> of data are used: reader-optimized and writer-optimized. In the IMCS
> approach, CS is just a temporary projection of normal PostgreSQL tables,
> so we do not need to worry about durability - it is enforced by
> PostgreSQL.
>
> So the question is whether the CS should be the only storage for the
> data or just a (maybe transient) copy of a normal table?

Our original plan was that a CS was the primary storage of data, not a duplicate. However, after some discussion it became apparent that there are several use cases that are better served by allowing redundant storage, i.e. having CSs that are just a reader-optimized copy of data that exists elsewhere. While I'm not a fan of that approach, I think it would be good to leave the door open for a future implementation of it. However, I think it'll bring interesting challenges on the optimizer side, so I'm not promising to work on it.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage (2)
Konstantin Knizhnik wrote:

Hi,

> Maybe you know that I have implemented IMCS (in-memory columnar store)
> as a PostgreSQL extension. It was not so successful, mostly because
> people prefer to use standard SQL rather than some special functions for
> accessing columnar storage (CS). Now I am thinking about a second
> reincarnation of IMCS, based on FDW and CSP (custom nodes). This is why
> I am very interested in your patch.

Great to hear.

> I have investigated the previous version of the patch and have some
> questions. I will be pleased if you can clarify them for me:
>
> 1. CS API.
> I agree with you that the FDW API seems not to be enough to efficiently
> support work with CS. At least we need batch insert.
> But maybe it is better to extend the FDW API rather than creating a
> special API for CS?

The patch we have proposed thus far does not mess with executor structure too much, so probably it would be possible to add some things here and there to the FDW API and it might work. But in the long term I think the columnar storage project is more ambitious; for instance, I'm sure we will want to be able to vectorise certain operations, and the FDW API will become a bottleneck, so to speak. I'm thinking of vectorisation in two different ways: one is that some operations such as computing aggregates over large data sets can work a lot faster if you feed the value of one column for multiple tuples at a time in columnar format; that way you can execute the operation directly in the CPU (this requires specific support from the aggregate functions.) For this to work, the executor needs to be rejigged so that multiple values (tuples) can be passed at once.

The other aspect of vectorisation is that one input tuple might have been split across several data origins, so that one half of the tuple is in columnar format and the other half is in row format; that lets you do very fast updates on the row-formatted part, while allowing fast reads for the columnar format, for instance. (It's well known that column-oriented storage does not go well with updates; some implementations even disallow updates and deletes altogether.)

Currently within the executor a tuple is a TupleTableSlot which contains one Datum array, which has all the values coming out of the HeapTuple; but for split-storage tuples, we will need to have a TupleTableSlot that has multiple "Datum arrays" (in a way --- because, actually, once we get to vectorise as in the preceding paragraph, we no longer have a Datum array, but some more complex representation).

I think that if we try to make the FDW API address all these concerns, while at the same time *also* serving the needs of external data sources, insanity will ensue.

> 2. Horizontal <-> vertical data mapping. As far as I understand this
> patch, the model of CS assumes that some table columns are stored in
> horizontal format (in the heap), some in vertical format (in the CS),
> and there is a one-to-one mapping between the horizontal and vertical
> parts of a row using the CTID.

Yes, that part needs to go away. We will deal with this eventually; the patch I posted was just some very basic infrastructure. In the future we would like to have real support for not having to translate between column-oriented and row-oriented formats, at least for some operations. (I expect that we will leave most code as it is currently and require translation, while other parts that have been optimized are able to skip the translation step. As things mature we make more things understand the new format without translation.) This is also dependent on being able to vectorise the executor.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
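The aggregates-over-a-column case described above can be illustrated outside the executor (a sketch in Python/NumPy with made-up data; in PostgreSQL itself the equivalent tight loop would run in C inside the aggregate's transition code):

```python
import numpy as np

# Row-at-a-time: the executor hands the aggregate one value per tuple.
rows = [{"a": i, "b": 2 * i} for i in range(1000)]
total = 0
for row in rows:               # one transition-function call per tuple
    total += row["a"]

# Column-at-a-time: pass the whole column in one call, so the work is a
# tight loop over contiguous memory that the CPU can vectorise.
col_a = np.arange(1000)
vtotal = int(col_a.sum())      # one call covering 1000 values

print(total, vtotal)           # 499500 499500
```

Both paths compute the same sum; the difference is per-tuple call overhead versus one pass over a contiguous column, which is exactly why the aggregate functions would need specific columnar support.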
Re: [HACKERS] On columnar storage (2)
Hi Alvaro,

As you may know, I have implemented IMCS (in-memory columnar store) as a PostgreSQL extension. It was not so successful, mostly because people prefer to use standard SQL rather than some special functions for accessing columnar storage (CS). Now I am thinking about a second reincarnation of IMCS, based on FDW and CSP (custom nodes). This is why I am very interested in your patch.

I have investigated the previous version of the patch and have some questions. I will be pleased if you can clarify them for me:

1. CS API.
I agree with you that the FDW API seems not to be enough to efficiently support work with CS. At least we need batch insert.
But maybe it is better to extend the FDW API rather than creating a special API for CS?

2. Horizontal <-> vertical data mapping. As far as I understand this patch, the model of CS assumes that some table columns are stored in horizontal format (in the heap), some in vertical format (in the CS), and there is a one-to-one mapping between the horizontal and vertical parts of a row using the CTID. But having been involved in several projects requiring OLAP, I found out that in most cases it is more convenient to have a one-to-many mapping. Assume some trading system dealing with stock quotes. The data looks something like this:

Symbol  Day         Open  Close  High  Low   Volume
AAA     12/22/2015  10.0  12.0   13.0   8.0  100
AAB     12/22/2015   9.0   8.0   10.0   9.0  200
...
AAA     12/23/2015  12.0  11.5   12.5  11.0   50
AAB     12/23/2015   8.0   8.8    8.5   8.0  300

So it can be represented using the following table:

create table Quote (Symbol char(10), Day date, Open real, High real, Low real, Close real, Volume integer);

Most likely we need to calculate some statistic for a particular symbol or set of symbols. For example, a portfolio is a set of symbols, and we need to somehow analyze the instruments in this portfolio. There are several thousand symbols, tens of instruments in a portfolio, and tens of thousands of quotes per symbol (in other cases the timeseries are much larger - millions of elements).

How can we efficiently execute a query like:

select Symbol, sum(Close*Volume)/sum(Volume) as VWAP
from Quote
where Day between '01/01/2001' and '01/01/2010'
  and Symbol in ('AAA', 'AAB', 'ABB', ...)
group by Symbol;

If we have an index on Symbol, then it will contain a lot of duplicates. And it is not clear how to efficiently combine an index scan by symbol name with a time slice. One possible solution is to embed the timeseries into tuples. In this case we will have something like this:

create table Quote (Symbol char(10), Day timeseries(date), Open timeseries(real), High timeseries(real), Low timeseries(real), Close timeseries(real), Volume timeseries(integer));

Here we use a nonexistent type, timeseries. It is somewhat similar to an array, but its content is stored in columnar storage rather than in the record's TOAST. In this case we can efficiently locate records by symbol (there are only a few thousand entries in the table) and then perform CS operations on the located timeseries. So here we also split the tuple into a horizontal and a vertical part; in the horizontal part we store just the identifier of the timeseries. A query plan should combine standard nodes with custom CS nodes. Mixing horizontal and vertical operations significantly complicates the optimizer and restricts flexibility: with the proposed representation it is difficult to efficiently calculate some characteristic for all symbols in a specified time range. This is why I am not sure that it is the only possible and most efficient approach. But in any case there should be some efficient plan for queries like the one above.

3. Transpose of data and role of CS.
Let's look once again at the Quote example above. Data is received in time-ascending order, but most queries require grouping it by symbol, so at some stage we have to "transpose" the data. To efficiently append data to a timeseries we need to buffer it somewhere and then append a range of values. In the Fujitsu approach, two different representations of data are used: reader-optimized and writer-optimized. In the IMCS approach, CS is just a temporary projection of normal PostgreSQL tables, so we do not need to worry about durability - it is enforced by PostgreSQL.

So the question is whether the CS should be the only storage for the data or just a (maybe transient) copy of a normal table?

Best regards,
Konstantin

On 22.12.2015 17:43, Alvaro Herrera wrote:
> Michael Paquier wrote:
> > On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes wrote:
> > > Could we get this rebased past the merge of the parallel execution commits?
> >
> > +1. Alvaro, Tomas, Simon, what are the next plans with those patches?
>
> Yeah, I've been working intermittently on getting the whole tree rebased
> and squashed, because after the last submission we made a lot of
> progress. I'll repost later. I think it should be marked "returned with
> feedback" for now.
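The VWAP computation in the query above falls out naturally once each field is a per-symbol array, which is essentially what the proposed timeseries columns would provide (a toy Python sketch; the numbers are taken from the example table, and the dict-of-arrays layout is only an invented stand-in for the real storage):

```python
# Per-symbol timeseries held as parallel arrays -- the "vertical part"
# that a timeseries column would give us.
quotes = {
    "AAA": {"close": [12.0, 11.5], "volume": [100, 50]},
    "AAB": {"close": [8.0, 8.8],   "volume": [200, 300]},
}

def vwap(symbol):
    """sum(Close*Volume)/sum(Volume) over one symbol's arrays."""
    q = quotes[symbol]
    notional = sum(c * v for c, v in zip(q["close"], q["volume"]))
    return notional / sum(q["volume"])

print(round(vwap("AAA"), 4))  # (12.0*100 + 11.5*50) / 150 -> 11.8333
```

Locating the symbol is one cheap horizontal lookup; the per-symbol aggregation is then a pure vertical scan, which is the split the proposal aims for.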
Re: [HACKERS] On columnar storage (2)
On Tue, Dec 22, 2015 at 11:43 PM, Alvaro Herrera wrote:
> Michael Paquier wrote:
> > On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes wrote:
> > > Could we get this rebased past the merge of the parallel execution commits?
> >
> > +1. Alvaro, Tomas, Simon, what are the next plans with those patches?
>
> Yeah, I've been working intermittently on getting the whole tree rebased
> and squashed, because after the last submission we made a lot of
> progress. I'll repost later. I think it should be marked "returned
> with feedback" for now.

Ok. Noted. Thanks.

--
Michael
Re: [HACKERS] On columnar storage (2)
On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes wrote:
> Could we get this rebased past the merge of the parallel execution commits?

+1. Alvaro, Tomas, Simon, what are the next plans with those patches?

--
Michael
Re: [HACKERS] On columnar storage (2)
Michael Paquier wrote:
> On Wed, Dec 9, 2015 at 3:10 PM, Jeff Janes wrote:
> > Could we get this rebased past the merge of the parallel execution commits?
>
> +1. Alvaro, Tomas, Simon, what are the next plans with those patches?

Yeah, I've been working intermittently on getting the whole tree rebased and squashed, because after the last submission we made a lot of progress. I'll repost later. I think it should be marked "returned with feedback" for now.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage (2)
Could we get this rebased past the merge of the parallel execution commits?

Thanks,
Jeff
Re: [HACKERS] On columnar storage (2)
On Tue, Sep 1, 2015 at 8:53 AM, Alvaro Herrera wrote:
> As discussed in
> https://www.postgresql.org/message-id/20150611230316.gm133...@postgresql.org
> we've been working on implementing columnar storage for Postgres.
> Here's some initial code to show our general idea, and to gather
> comments related to what we're building. This is not a complete patch,
> and we don't claim that it works! This is in very early stages, and we
> have a lot of work to do to get this in working shape.
>
> This was proposed during the Developer's Unconference in Ottawa earlier
> this year. While some questions were raised about some elements of our
> design, we don't think they were outright objections, so we have pressed
> forward on the expectation that any limitations can be fixed before this
> is final if they are critical, or in subsequent commits if not.
>
> The commit messages for each patch should explain what we've done in
> enough technical detail, and hopefully provide a high-level overview of
> what we're developing.
>
> The first few pieces are "ready for comment" -- feel free to speak up
> about the catalog additions, the new COLUMN STORE bits we added to the
> grammar, the way we handle column stores in the relcache, or the
> mechanics to create column store catalog entries.
>
> The latter half of the patch series is much less well cooked yet; for
> example, the colstore_dummy module is just a simple experiment to let us
> verify that the API is working. The planner and executor code are
> mostly stubs, and we are not yet sure what the executor nodes are that
> we would like to have: while we have discussed this topic internally a
> lot, we haven't yet formed final opinions, and of course the stub
> implementations are not doing the proper things, and in many cases they
> are even not doing anything at all.

Fujitsu is also interested in implementing a columnar storage extension. First we thought of implementing this extension using index access methods.

The following is the basic design idea of the columnar extension; it may need to be redesigned according to columnar access methods. We create a vertical columnar index (VCI) on a table, with the specified columns that need to be stored in columnar format. To provide a performance benefit for both read and write operations, the data is stored in two formats: 1) write optimized storage (WOS) and 2) read optimized storage (ROS). This is useful where newly added data has a good chance of being modified. Because of the two storages, we need to maintain two entries in the pg_class table: one for the WOS and others for the columns in columnar storage.

Insert: In the write optimized storage, the data of all columns that are part of the VCI is stored in a row-wise format. All newly added data is stored in the WOS relation along with xmin/xmax information. If the user wants to update/delete the newly added data, it doesn't affect performance much compared to deleting the data from columnar storage. Tuples which don't have multiple copies, or frozen data, will be moved from WOS to ROS periodically by a background worker process or the autovacuum process. Every column's data is stored separately in its own relation file. No transaction information is present in the ROS; the data in the ROS can be referred to by tuple ID. In this approach, the column data is present in both the heap and the columnar storage, whereas with columnar access methods the column data isn't present in the heap.

Select: Because of the two storage formats, during a select operation the data in the WOS is converted into a local ROS for the statement to be executed. The conversion cost depends upon the number of tuples present in the WOS file. This may add some performance overhead for select statements.

Delete: During a delete operation, whenever data is deleted in the heap, the data in the WOS file is marked as deleted at the same time, similar to the heap. But if the data has already been migrated from WOS to ROS, then we maintain a delete vector to store the tuple ID, transaction information, etc. During a data read from the ROS file, each record is checked against the delete vector to confirm whether it is visible. All the delete vector data is applied to the ROS periodically.

The concept of the columnar extension is from Fujitsu Labs, Japan. Any comments on further evaluation of this approach with respect to columnar access methods?

Regards,
Hari Babu
Fujitsu Australia
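The WOS/ROS split with a delete vector, as described above, can be sketched roughly as follows (a toy Python model of the insert, migrate, delete, and visibility-check flow; all structures and names are invented stand-ins for the real design):

```python
# Toy model of the two-store design: new rows land in a row-wise WOS; a
# background step migrates them into per-column ROS arrays; deletes of
# already-migrated rows go into a delete vector consulted at read time.
wos = []                        # [(tid, {col: value})], row-wise
ros = {"a": [], "b": []}        # column -> values, parallel to ros_tids
ros_tids = []                   # tuple ids, one per migrated row
delete_vector = set()           # tids deleted after migration

def insert(tid, row):
    wos.append((tid, row))      # new data goes to WOS, row format

def migrate():
    """Background step: move everything from WOS into the column-wise ROS."""
    for tid, row in wos:
        ros_tids.append(tid)
        for col in ros:
            ros[col].append(row[col])
    wos.clear()

def delete(tid):
    delete_vector.add(tid)      # migrated rows are marked, not rewritten

def scan(col):
    """Read one ROS column, filtering through the delete vector."""
    return [v for tid, v in zip(ros_tids, ros[col])
            if tid not in delete_vector]

insert(1, {"a": 10, "b": 1})
insert(2, {"a": 20, "b": 2})
migrate()
delete(1)
print(scan("a"))  # [20]
```

The point of the shape is that deletes stay cheap (a set insertion) while column scans stay sequential, at the cost of a per-tuple visibility check, mirroring the trade-off in the description above.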
Re: [HACKERS] On columnar storage
Jim Nasby wrote:
> Related to the idea of an 'auto join', I do wish we had the ability to
> access columns in a referenced FK table from a referring key; something
> like SELECT customer_id.first_name FROM invoice (which would be
> translated to SELECT first_name FROM invoice JOIN customer USING
> (customer_id)).

We already have this feature -- you just need to set the add_missing_from GUC. (... actually we removed that feature because it was considered dangerous or something.)

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
On 6/17/15 2:04 PM, Alvaro Herrera wrote: Jim Nasby wrote: Related to idea of an 'auto join', I do wish we had the ability to access columns in a referenced FK table from a referring key; something like SELECT customer_id.first_name FROM invoice (which would be translated to SELECT first_name FROM invoice JOIN customer USING( customer_id )). We already have this feature -- you just need to set the add_missing_from GUC. (... actually we removed that feature because it was considered dangerous or something.) That was removed in 9.0. Even so, I don't believe it added the JOIN, so you'd suddenly get a Cartesian product. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
On 6/14/15 10:22 AM, Alvaro Herrera wrote: To me, it feels like there are two different features here that would be better separated. First, there's the idea of having a table that gets auto-joined to other tables whenever you access it, so that the user sees one really wide table but really the data is segregated by column groups under the hood. That's a neat idea. Thanks. (It also seems pretty tricky to implement.) I look at it as a form of vertical partitioning; the big difference being whether you normalize the columns out or not (or to use data warehouse parlance, slow vs fast changing dimensions). Perhaps it would be useful to vet this out as a userspace extension first since that would presumably be much easier. I believe we actually have all the backend infrastructure that would be needed for this now that views are smart enough to exclude tables that aren't referenced at all. I suspect that even a 'dumb userspace' approach would still expose a lot of the planner problems we'll run into (join explosion and filtering through the join come to mind). Related to idea of an 'auto join', I do wish we had the ability to access columns in a referenced FK table from a referring key; something like SELECT customer_id.first_name FROM invoice (which would be translated to SELECT first_name FROM invoice JOIN customer USING( customer_id )). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Data in Trouble? Get it in Treble! http://BlueTreble.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
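For what it's worth, the expansion Jim describes can be demonstrated with any SQL engine; here is a sketch using Python's sqlite3 as a stand-in, with an invented customer/invoice layout. The shorthand SELECT customer_id.first_name FROM invoice is not valid SQL anywhere today; only the expanded join below actually runs.

```python
# Demonstrate the proposed FK shorthand's expansion using sqlite3 as a
# stand-in SQL engine (table layout is invented for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE invoice  (invoice_id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer);
    INSERT INTO customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO invoice  VALUES (10, 1), (11, 2), (12, 1);
""")

# The expansion: follow the FK and join automatically.
rows = conn.execute("""
    SELECT first_name
    FROM invoice JOIN customer USING (customer_id)
    ORDER BY invoice_id
""").fetchall()
print(rows)  # [('Alice',), ('Bob',), ('Alice',)]
```

Without the JOIN clause (as with the old add_missing_from behavior) the two tables would be combined as a Cartesian product, which is exactly the hazard noted upthread.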
Re: [HACKERS] On columnar storage
http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf In sketch: There is the concept of a Write-Optimized-Store (WOS) and Read-Optimized-Store (ROS), and a TupleMover that moves records from WOS to ROS (somewhat like vacuum), and from ROS to WOS for updates. It seems to me that heap is naturally a WOS, and only vacuuming for a column-backed heap table would move records from the heap into the column store. Of course, there would need to be a deeper vacuum when the column store itself needs to be vacuumed. When a record in column store needs to be updated, a top-level transaction moves the record into the heap by marking the row as deleted in the column store and inserting the record into the heap store. The updates could then proceed according to the current heap transactional logic. I am not sure if this makes sense, but it seems plausible and 1/ retains the heap transactional logic code which is very hard to get right 2/ makes column store essentially a storage optimization that users do not need to be too concerned with; heap is kept small and old data are moved into column store automatically 3/ no need to keep 20+ bytes of visibility info on the rows in column store 4/ instead of column store, this could be a heap (without visibility) store if you prefer rows I haven't thought about the indexing aspect of this. From a DW angle, I am more interested in a heap store that is backed by multiple column stores via partition keys. Regards, -cktan On Mon, Jun 15, 2015 at 12:02 AM, Amit Kapila amit.kapil...@gmail.com wrote: On Fri, Jun 12, 2015 at 10:58 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: Amit Kapila wrote: On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: One critical detail is what will be used to identify a heap row when talking to a CS implementation. There are two main possibilities: 1. use CTIDs 2. use some logical tuple identifier Using CTIDs is simpler.
One disadvantage is that every UPDATE of a row needs to let the CS know about the new location of the tuple, so that the value is known to be associated with the new tuple location as well as the old. This needs to happen even if the value of the column itself is not changed. Isn't this somewhat similar to index segment? Not sure what you mean with index segment. The part similar to index segment is reference to heap for visibility information and tuple id (TID). Have I misunderstood something? Will the column store obey a snapshot model similar to current heap tuples, and if so, will it derive the transaction information from the heap tuple? Yes, visibility will be tied to the heap tuple -- a value is accessed only when its corresponding heap row has already been determined to be visible. Won't it be possible that all columns of a table belong to the column store? I think for such a case heap will just be used to store transaction information (visibility info) for a column store tuple, and depending on how the column store is organized, the reference to this information needs to be stored in the column store (the same row reference might need to be stored for each column value). Also, any write operation could lead to much more I/O because of updates at 2 different locations (one in the column store and the other in heap). One interesting point that arises from this is about vacuum: when are we able to remove a value from the store? Yes, that could also be quite tricky to handle; maybe one naive way could be to make a list of all TIDs from heap that need to be expired and then search for references to all those TIDs in the column store. I understand your point about re-using the existing transaction infrastructure for the column store by keeping that information in heap as it is done now, but I think that won't be free either.
Another point to consider here is whether the column store needs transactional consistency: are other commercial/open-source column store implementations transactionally consistent? And if yes, can't we think of doing it in a way where data could be present both in heap as well as in the column store? (I understand that it could lead to duplicate data; OTOH, such an implementation anyway eliminates the need for indexes, so it may be worth considering.) With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
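The UPDATE remapping cost Alvaro mentions can be made concrete with a toy model (the structures below are hypothetical, not patch code): if the column store keys values by heap CTID, every heap UPDATE must re-associate the stored value with the tuple's new CTID, even when the columnar value itself did not change.

```python
# Toy model (hypothetical names) of CTID-keyed column storage: a heap
# UPDATE moves the tuple to a new CTID, so the column store entry must be
# re-keyed even though the columnar value is untouched.

heap = {}      # ctid -> non-columnar attributes
colstore = {}  # ctid -> value of the columnar attribute

def cs_insert(ctid, row, colval):
    heap[ctid] = row
    colstore[ctid] = colval

def heap_update(old_ctid, new_ctid, new_row):
    """UPDATE relocates the heap tuple; the CS must learn the new location."""
    del heap[old_ctid]
    heap[new_ctid] = new_row
    # Required even if the columnar column was not part of the UPDATE:
    colstore[new_ctid] = colstore.pop(old_ctid)

cs_insert((0, 1), {"qty": 5}, "widget")
heap_update((0, 1), (0, 2), {"qty": 6})
assert colstore == {(0, 2): "widget"}
```

A logical tuple identifier (the second option upthread) would avoid this write amplification at the price of an extra mapping layer.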
Re: [HACKERS] On columnar storage
On 2015-06-11 20:03:16 -0300, Alvaro Herrera wrote: Rewriter Parsing occurs as currently. During query rewrite, specifically at the bottom of the per-relation loop in fireRIRrules(), we will modify the query tree: each relation RTE containing a colstore will be replaced with a JoinExpr containing the relation as left child and the colstore as right child (1). The colstore RTE will be of a new RTEKind. For each such change, all Var nodes that point to attnums stored in the colstore will modified so that they reference the RTE of the colstore instead (2). FWIW, I think this is a pretty bad place to tackle this. For one I think we shouldn't add more stuff using the rewriter unless it's clearly the best interface. For another, doing things in the rewriter will make optimizing things much harder - the planner will have to reconstruct knowledge which of the joins are column store joins and such. Why do you want to do things there? Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Andres Freund and...@anarazel.de writes: On 2015-06-11 20:03:16 -0300, Alvaro Herrera wrote: Parsing occurs as currently. During query rewrite, specifically at the bottom of the per-relation loop in fireRIRrules(), we will modify the query tree: each relation RTE containing a colstore will be replaced with a JoinExpr containing the relation as left child and the colstore as right child (1). The colstore RTE will be of a new RTEKind. For each such change, all Var nodes that point to attnums stored in the colstore will modified so that they reference the RTE of the colstore instead (2). FWIW, I think this is a pretty bad place to tackle this. For one I think we shouldn't add more stuff using the rewriter unless it's clearly the best interface. For another, doing things in the rewriter will make optimizing things much harder - the planner will have to reconstruct knowledge which of the joins are column store joins and such. As a comparison point, one of my Salesforce colleagues just put in a somewhat similar though single-purpose thing, to expand what originally is a simple table reference into a join (against a system catalog that's nowhere mentioned in the original query). In our case, we wanted to force a scan on a large table to have a constraint on the leading primary key column; if the query has such a constraint already, then fine, else create one by joining to a catalog that lists the allowed values of that column. We started out by trying to do it in the rewriter, and that didn't work well at all. We ended up actually doing it at createplan.c time, which is conceptually ugly, but there was no good place to do it earlier without duplicating a lot of indexqual analysis. But the thing that made that painful was that the transformation was optional, and indeed might happen or not happen for a given query depending on the selected plan shape. AFAICT the transformation Alvaro is proposing is unconditional, so it might be all right to do it in the rewriter. 
As you say, if the planner needs to reconstruct what happened, that would be a strike against this way, but it's not clear from here whether any additional info is needed beyond the already-suggested extra RTEKind. Another model that could be followed is expansion of inheritance-tree references, which happens early in the planner. In that case the planner does keep additional information about what it did (the appendrel data structures), so that could be a good model if this code needs to do likewise. The existing join-alias-var flattening logic in the planner might be of interest as well for the variable-substitution business, which I suspect is the main reason Alvaro is proposing doing it in the rewriter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Won't this cause issues to MergeAppend optimizations? Like what? Well, as I understand, MergeAppend needs to know the sort order of the child node, right? But that's available only on the relation RTE, not on the colstore-join RTE. Uh, what? Sort order is a property of a path, not an RTE. And we have always understood which join types preserve sort order. And if there are such issues, why do you think you wouldn't be expected to solve them? Precisely. If I simply reject having column stores in partitioned tables, then I don't *need* to solve them. You misunderstood the thrust of my comment, which basically is that I doubt anyone will think that rejecting that combination is an acceptable implementation restriction. It might be all right if it doesn't work very well in v0, but not if the implementation is designed so that it can never be fixed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Hi, On 06/13/15 00:07, Michael Nolan wrote: On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do. I've been trying to figure out a plan to enable native column stores (CS or colstore) for Postgres. Motivations: * avoid the 32 TB limit for tables * avoid the 1600 column limit for tables * increased performance Are you looking to avoid all hardware-based limits, or would using a 64 bit row pointer be possible? That would give you 2^64 or 1.8E19 unique rows over whatever granularity/uniqueness you use (per table, per database, etc.) -- Mike Nolan. I don't think the number of tuples is the main problem here, it's the number of pages a single relation can have. Looking at the number of rows as a direct function of TID size is misleading, because the TID is split into two fixed parts - page number (32 bits) and tuple number (16 bits). For the record, 2^48 is 281,474,976,710,656, which ought to be enough for anybody, but we waste a large part of that because we assume there might be up to 2^16 tuples per page, although the actual limit is way lower (~290 for 8kB pages, and ~1200 for 32kB pages). So we can only have ~4 billion pages, which is where the 32TB limit comes from (with 32kB pages it's 128TB). Longer TIDs are one straightforward way to work around this limit, assuming you add the bits to the 'page number' field. Adding 16 bits (thus using 64-bit pointers) would increase the limit 2^16 times, to about 2048 petabytes (with 8kB pages). But that of course comes with a cost, because you have to keep those larger TIDs in indexes etc. Another option might be to split the 48 bits differently, by moving 5 bits to the page number part of the TID (so that we expect ~2048 tuples per page at most). That'd increase the limit to 1PB (4PB with 32kB pages).
The column store approach is somewhat orthogonal to this, because it splits the table vertically into multiple pieces, each stored in a separate relfilenode and thus using a separate sequence of page numbers. And of course, the usual 'horizontal' partitioning has a very similar effect (separate filenodes). regards -- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
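Tomas's arithmetic is easy to verify mechanically; the sketch below just redoes the calculations above (the helper function name is illustrative).

```python
# Re-derive the table-size limits from TID layout: a TID is a page number
# plus a 16-bit tuple offset, and max table size = 2^page_bits * page_size.

def max_table_size(page_bits, page_size):
    return (2 ** page_bits) * page_size

KB, TB, PB = 2 ** 10, 2 ** 40, 2 ** 50

# 32-bit page numbers: the familiar 32TB limit (128TB with 32kB pages).
assert max_table_size(32, 8 * KB) == 32 * TB
assert max_table_size(32, 32 * KB) == 128 * TB

# 64-bit TIDs: 16 extra page-number bits -> ~2048 petabytes with 8kB pages.
assert max_table_size(32 + 16, 8 * KB) == 2048 * PB

# Re-splitting the 48 bits: move 5 offset bits to the page number
# (~2048 tuples per page at most) -> 1PB with 8kB pages, 4PB with 32kB.
assert max_table_size(32 + 5, 8 * KB) == 1 * PB
assert max_table_size(32 + 5, 32 * KB) == 4 * PB
```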
Re: [HACKERS] On columnar storage
Robert Haas wrote: On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote: I've been trying to figure out a plan to enable native column stores (CS or colstore) for Postgres. Motivations: * avoid the 32 TB limit for tables * avoid the 1600 column limit for tables * increased performance To me, it feels like there are two different features here that would be better separated. First, there's the idea of having a table that gets auto-joined to other tables whenever you access it, so that the user sees one really wide table but really the data is segregated by column groups under the hood. That's a neat idea. Thanks. (It also seems pretty tricky to implement.) Second, there's the idea of a way of storing tuples that is different from PostgreSQL's usual mechanism - i.e. a storage manager API. I understand your concerns about going through the FDW API so maybe that's not the right way to do it, but it seems to me that in the end you are going to end up with something awfully similar to that + a local relfilenode + WAL logging support. I'm not clear on why you want to make the column store API totally different from the FDW API; there may be a reason, but I don't immediately see it. I just don't see that the FDW API is such a good fit for what I'm trying to do. Anything using the FDW API needs to implement its own visibility checking, for instance. I want to avoid that, because it's its own complex problem. Also, it doesn't look like the FDW API supports things that I want to do (neither my proposed LateColumnMaterialization nor my proposed BitmapColumnScan). I would have to extend the FDW API, then contort my stuff so that it fits in the existing API; then I will need to make sure that existing FDWs are not broken by the changes I would propose elsewhere. Round peg, square hole is all I see here. All in all, this seems too much additional work, just to make two things that are really addressing different problems go through the same code.
You're correct about local WAL logging. We will need a solution to that problem. I was hoping to defer that until we had something like Alexander Korotkov's proposed pluggable WAL stuff. Each of these two features is independently useful. If you just had the first feature, you could use the existing table format as your columnar store. I'm sure it's possible to do a lot better in some cases, but there could easily be cases where that's a really big win, because the existing table format has far more sophisticated indexing capabilities than any columnar store is likely to have in an early version. Yeah, sure, it's pretty likely that the first experimental colstore implementation will just be based on existing infrastructure. The second capability, of course, opens up all sorts of interesting possibilities, like compressed read-only storage or index-organized tables. And it would also let you have an all-columnar table, similar to what Citus's cstore_fdw does, which doesn't seem like it would be supported in your design, and could be a pretty big win for certain kinds of tables. Well, I would like to know about those use cases that Citus stuff is good at, so that we can make sure they work reasonably under my proposed design. Maybe I have to require vacuuming so that the all-visible bits are set, so that a column scan can skip visibility checking for most of the underlying heap tuples to produce a large aggregation report. That seems pretty reasonable to me. BTW, I'm not sure if it's a good idea to call all of this stuff cstore. The name is intuitive, but cstore_fdw has enough traction already that it may create some confusion. I'm not thinking of calling anything user-visible with the name cstore. It's just a development term. Surely we're not reserving names from what is used in third party code. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
On 06/14/15 17:22, Alvaro Herrera wrote: Robert Haas wrote: ,,, Second, there's the idea of a way of storing tuples that is different from PostgreSQL's usual mechanism - i.e. a storage manager API. I understand your concerns about going through the FDW API so maybe that's not the right way to do it, but it seems to me that in the end you are going to end up with something awfully similar to that + a local relfilenode + WAL logging support. I'm not clear on why you want to make the column store API totally different from the FDW API; there may be a reason, but I don't immediately see it. I just don't see that the FDW API is such a good fit for what I'm trying to do. Anything using the FDW API needs to implement its own visibility checking, for instance. I want to avoid that, because it's its own complex problem. Also, it doesn't look like the FDW API supports things that I want to do (neither my proposed LateColumnMaterialization nor my proposed BitmapColumnScan). I would have to extend the FDW API, then contort my stuff so that it fits in the existing API; then I will need to make sure that existing FDWs are not broken by the changes I would propose elsewhere. Round peg, square hole is all I see here. All in all, this seems too much additional work, just to make to things that are really addressing different problems go through the same code. You're correct about local WAL logging. We will need a solution to that problem. I was hoping to defer that until we had something like Alexander Korotkov's proposed pluggable WAL stuff. Probably worth mentioning we don't expect the column store API to be used only by extensions, but even by code from within the core which can use the current WAL infrastructure etc. 
-- Tomas Vondra http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
Andres Freund wrote: On 2015-06-11 20:03:16 -0300, Alvaro Herrera wrote: Rewriter Parsing occurs as currently. During query rewrite, specifically at the bottom of the per-relation loop in fireRIRrules(), we will modify the query tree: each relation RTE containing a colstore will be replaced with a JoinExpr containing the relation as left child and the colstore as right child (1). The colstore RTE will be of a new RTEKind. For each such change, all Var nodes that point to attnums stored in the colstore will modified so that they reference the RTE of the colstore instead (2). FWIW, I think this is a pretty bad place to tackle this. For one I think we shouldn't add more stuff using the rewriter unless it's clearly the best interface. For another, doing things in the rewriter will make optimizing things much harder - the planner will have to reconstruct knowledge which of the joins are column store joins and such. What do you mean reconstruct? That bit will be explicit -- I'm thinking the planner will have specialized bits to deal with such nodes, i.e. it will know there's a one-to-one relationship between colstore tuples and heap tuples, so it will know how to move nodes around. Or at least, that's how I'm hoping it will be ... Why do you want to do things there? Because I see no better place. Isn't the rewriter the place where we modify the query tree, mostly? Another choice I considered was subquery_planner: in the spot where expand_inherited_tables() is called, add a new call to expand the RTEs that correspond to tables containing stores. But I had second thoughts because that function says it's safe because it only adds baserels, not joins, and I'm adding joins. I guess I could use a spot earlier than wherever it is that joins are checked, but this planner code is recursive anyway and for this I must do a single pass. 
-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
Tom Lane wrote: I wrote: Another model that could be followed is expansion of inheritance-tree references, which happens early in the planner. Actually ... if you intend to allow column storage to work with inherited tables (and if you don't, you'd better have a darn good reason why not), I think you probably want to do this join insertion *after* inheritance expansion, so you can join child column stores only to the appropriate child heap table, and not to the entire inheritance tree. Won't this cause issues to MergeAppend optimizations? I guess we could have the join be of a new type that's transparent to such optimizations but, you see, that seems complicated enough that I want to avoid that ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: Actually ... if you intend to allow column storage to work with inherited tables (and if you don't, you'd better have a darn good reason why not), I think you probably want to do this join insertion *after* inheritance expansion, so you can join child column stores only to the appropriate child heap table, and not to the entire inheritance tree. Won't this cause issues to MergeAppend optimizations? Like what? Well, as I understand, MergeAppend needs to know the sort order of the child node, right? But that's available only on the relation RTE, not on the colstore-join RTE. Though now that I think about it, maybe I can push that info from the relation RTE to the colstore-join RTE, since I know the ordering will be the same. And if there are such issues, why do you think you wouldn't be expected to solve them? Precisely. If I simply reject having column stores in partitioned tables, then I don't *need* to solve them. Later in the project, when some planner hacker decides to join, I can ask them for advice on how to tackle them. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
I wrote: Another model that could be followed is expansion of inheritance-tree references, which happens early in the planner. Actually ... if you intend to allow column storage to work with inherited tables (and if you don't, you'd better have a darn good reason why not), I think you probably want to do this join insertion *after* inheritance expansion, so you can join child column stores only to the appropriate child heap table, and not to the entire inheritance tree. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Alvaro Herrera alvhe...@2ndquadrant.com writes: Another choice I considered was subquery_planner: in the spot where expand_inherited_tables() is called, add a new call to expand the RTEs that correspond to tables containing stores. But I had second thoughts because that function says it's safe because it only adds baserels, not joins, and I'm adding joins. I think that comment is only justifying why we don't need to redetermine hasJoinRTEs/hasLateralRTEs/hasOuterJoins afterwards. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Tom Lane wrote: Alvaro Herrera alvhe...@2ndquadrant.com writes: Amit Kapila wrote: Will the column store obey snapshot model similar to current heap tuples, if so will it derive the transaction information from heap tuple? Yes, visibility will be tied to the heap tuple -- a value is accessed only when its corresponding heap row has already been determined to be visible. One interesting point that raises from this is about vacuum: when are we able to remove a value from the store? I have some not-completely-formed ideas about this. Hm. This seems not terribly ambitious --- mightn't a column store extension wish to store tables *entirely* in the column store, rather than tying them to a perhaps-vestigial heap table? Well, yes, it might, but that opens a huge can of worms. What heapam offers is not just tuple storage, but a lot of functionality on top of that -- in particular, tuple locking and visibility. I am certainly not considering re-implementing any of that. We might eventually go there, but we will *additionally* need different implementations of those things, and I'm pretty sure that will be painful, so I'm trying to stay away from that. Alvaro Herrera alvhe...@2ndquadrant.com writes: Tom Lane wrote: I can't help thinking that this could tie in with the storage level API that I was waving my arms about last year. Or maybe not --- the goals are substantially different --- but I think we ought to reflect on that rather than just doing a narrow hack for column stores used in the particular way you're describing here. I can't seem to remember this proposal you mention. Care to be more specific? Perhaps a link to archives is enough. I never got to the point of having a concrete proposal, but there was a discussion about it at last year's PGCon unconference; were you there? No, regrettably I wasn't there. 
Anyway the idea was to try to cut a clearer line between heap storage and the upper levels of the system, particularly the catalog/DDL code that we have so much of. Based on Salesforce's experience so far, it's near impossible to get rid of HeapTuple as the lingua franca for representing rows in the upper system levels, so we've not really tried; but it would be nice if the DDL code weren't so much in bed with heap-specific knowledge, like the wired-into-many-places assumption that row insert and update actions require index updates but deletions don't. Agreed on both counts. As far as catalog code goes, removing direct mapping from HeapTuple to C structs would require a huge rewrite of tons of code. Unless we're considering rewriting small pieces of specific catalog handling at a time, it seems unlikely that we will have columns from system catalogs in column stores. (It seems reasonable that as soon as we have column stores, we can have particular catalog code to work with columnar storage, but I don't think there's much need for that currently.) I agree with your second point also --- it might be good to have a layer in between, and it seems not completely unreasonable. It would require touching lots of places but not hugely transforming things. (I think it's not in the scope of the things I'm currently after, though.) We're also not very happy with the general assumption that a TID is an adequate row identifier (as our storage engine does not have TIDs), so I'm a bit disappointed to see you doubling down on that restriction rather than trying to lift it. Well, in the general design, there is room for different tuple identifiers. I'm just not implementing it for the first version. Now much of this pain only comes into play if one is trying to change the underlying storage format for system catalogs, which I gather is not considered in your proposal. 
But if you want one format for catalogs and another for user tables then you have issues like how do you guarantee atomic commit and crash safety across multiple storage engines. That way lies a mess, especially if you're trying to keep the engines at arms' length which is what a pluggable architecture implies. MySQL is a cautionary example we should be keeping in mind while thinking about this. Right. I don't want a separate storage engine that needs to reimplement transactions (as is the case in MySQL), or visibility rules. I don't want to have a different format for tables or catalogs; both would still be based on the current heapam API. I simply want to extend the API so that I can have some columns in a separate place. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
On 2015-06-14 17:36, Tomas Vondra wrote:
> On 06/14/15 17:22, Alvaro Herrera wrote:
>> Robert Haas wrote:
>>> [...] Second, there's the idea of a way of storing tuples that is different from PostgreSQL's usual mechanism - i.e. a storage manager API. I understand your concerns about going through the FDW API, so maybe that's not the right way to do it, but it seems to me that in the end you are going to end up with something awfully similar to that + a local relfilenode + WAL logging support. I'm not clear on why you want to make the column store API totally different from the FDW API; there may be a reason, but I don't immediately see it.
>>
>> I just don't see that the FDW API is such a good fit for what I'm trying to do. Anything using the FDW API needs to implement its own visibility checking, for instance. I want to avoid that, because it's its own complex problem. Also, it doesn't look like the FDW API supports things that I want to do (neither my proposed LateColumnMaterialization nor my proposed BitmapColumnScan). I would have to extend the FDW API, then contort my stuff so that it fits in the existing API; then I would need to make sure that existing FDWs are not broken by the changes I would propose elsewhere. Round peg, square hole is all I see here. All in all, this seems too much additional work, just to make two things that are really addressing different problems go through the same code.
>>
>> You're correct about local WAL logging. We will need a solution to that problem. I was hoping to defer that until we had something like Alexander Korotkov's proposed pluggable WAL stuff.
>
> Probably worth mentioning that we don't expect the column store API to be used only by extensions, but also by code within core, which can use the current WAL infrastructure etc. I think it would be even OK if the initial implementation had restrictions such that extensions can't use it. I think we want an implementation in core first; support for extensions can come later.

But it's still worth having an API, since a single implementation probably won't fit every purpose (see indexes).

-- 
Petr Jelinek  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] On columnar storage
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Actually ... if you intend to allow column storage to work with inherited tables (and if you don't, you'd better have a darn good reason why not), I think you probably want to do this join insertion *after* inheritance expansion, so you can join child column stores only to the appropriate child heap table, and not to the entire inheritance tree.
>
> Won't this cause issues to MergeAppend optimizations?

Like what? And if there are such issues, why do you think you wouldn't be expected to solve them?

			regards, tom lane
Re: [HACKERS] On columnar storage
Tom Lane wrote:
> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
>>>> Won't this cause issues to MergeAppend optimizations?
>>> Like what?
>>
>> Well, as I understand, MergeAppend needs to know the sort order of the child node, right? But that's available only on the relation RTE, not on the colstore-join RTE.
>
> Uh, what? Sort order is a property of a path, not an RTE.

Evidently I need to do more digging ... but that makes plenty of sense.

> And we have always understood which join types preserve sort order.

That's obvious now that you say it.

> You misunderstood the thrust of my comment, which basically is that I doubt anyone will think that rejecting that combination is an acceptable implementation restriction. It might be all right if it doesn't work very well in v0, but not if the implementation is designed so that it can never be fixed.

Gotcha.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
On Sun, Jun 14, 2015 at 10:30 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>> Are you looking to avoid all hardware-based limits, or would using a 64-bit row pointer be possible? That would give you 2^64, or 1.8E19, unique rows over whatever granularity/uniqueness you use (per table, per database, etc.)
>> --
>> Mike Nolan
>
> I don't think the number of tuples is the main problem here, it's the number of pages a single relation can have. Looking at the number of rows as a direct function of TID size is misleading, because the TID is split into two fixed parts - page number (32 bits) and tuple number (16 bits). For the record, 2^48 is 281,474,976,710,656, which ought to be enough for anybody, but we waste a large part of that because we assume there might be up to 2^16 tuples per page, although the actual limit is way lower (~290 for 8kB pages, ~1200 for 32kB pages). So we can only have ~4 billion pages, which is where the 32TB limit comes from (with 32kB pages it's 128TB).
>
> Longer TIDs are one straightforward way to work around this limit, assuming you add the bits to the 'page number' field. Adding 16 bits (thus using 64-bit pointers) would increase the limit 2^16 times, to about 2048 petabytes (with 8kB pages). But that of course comes with a cost, because you have to keep those larger TIDs in indexes etc. Another option might be to split the 48 bits differently, by moving 5 bits to the page number part of the TID (so that we expect ~2048 tuples per page at most). That'd increase the limit to 1PB (4PB with 32kB pages).
>
> The column store approach is somewhat orthogonal to this, because it splits the table vertically into multiple pieces, each stored in a separate relfilenode and thus using a separate sequence of page numbers. And of course, the usual 'horizontal' partitioning has a very similar effect (separate filenodes).
>
> regards
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Thanks for the reply. It's been a while since my last data structures course (1971), but I do remember a few things. I have never personally needed more than 1500 columns in a table, but can see how some might. Likewise, the 32TB limit hasn't affected me yet, either. I doubt either ever will.

Solving either or both of those seems like it may at some point require a larger bit space for (at least some) TIDs, which is why I was wondering if a goal here is to eliminate all (practical) limits. It probably doesn't make sense to force all users to use that larger bit space (with the associated space and performance penalties). If there's a way to do this, then you are all truly wizards.

(This all reminds me of how the IP4 bit space was parcelled up into Class A, B, C and D addresses, at a time when people thought 32 bits would last us forever. Maybe 128 bits actually will.)
--
Mike Nolan
Re: [HACKERS] On columnar storage
Tom Lane wrote:
> I can't help thinking that this could tie in with the storage level API that I was waving my arms about last year. Or maybe not --- the goals are substantially different --- but I think we ought to reflect on that rather than just doing a narrow hack for column stores used in the particular way you're describing here.

I can't seem to remember this proposal you mention. Care to be more specific? Perhaps a link to the archives is enough.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Tom Lane wrote:
>> I can't help thinking that this could tie in with the storage level API that I was waving my arms about last year. Or maybe not --- the goals are substantially different --- but I think we ought to reflect on that rather than just doing a narrow hack for column stores used in the particular way you're describing here.
>
> I can't seem to remember this proposal you mention. Care to be more specific? Perhaps a link to the archives is enough.

I never got to the point of having a concrete proposal, but there was a discussion about it at last year's PGCon unconference; were you there? Anyway, the idea was to try to cut a clearer line between heap storage and the upper levels of the system, particularly the catalog/DDL code that we have so much of. Based on Salesforce's experience so far, it's near impossible to get rid of HeapTuple as the lingua franca for representing rows in the upper system levels, so we've not really tried; but it would be nice if the DDL code weren't so much in bed with heap-specific knowledge, like the wired-into-many-places assumption that row insert and update actions require index updates but deletions don't.

We're also not very happy with the general assumption that a TID is an adequate row identifier (as our storage engine does not have TIDs), so I'm a bit disappointed to see you doubling down on that restriction rather than trying to lift it.

Now, much of this pain only comes into play if one is trying to change the underlying storage format for system catalogs, which I gather is not considered in your proposal. But if you want one format for catalogs and another for user tables, then you have issues like how to guarantee atomic commit and crash safety across multiple storage engines. That way lies a mess, especially if you're trying to keep the engines at arm's length, which is what a pluggable architecture implies. MySQL is a cautionary example we should be keeping in mind while thinking about this.

I don't really have any answers in this area, just questions.

			regards, tom lane
Re: [HACKERS] On columnar storage
Merlin Moncure wrote:
> On Thu, Jun 11, 2015 at 6:03 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>> We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.
>
> Quick thought. We already support out-of-line columnar storage in a fashion with 'toast'. Obviously that's a long way from where you want to go, but have you ruled out extending that system?

TOAST uses pointers in the heap row. A toasted column is still present in the heap -- there's no way to get across the 1600-column limitation if we rely on anything stored in the heap. Hence the design of the feature at hand is that the columnar storage somehow points to the heap, rather than the other way around.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
On Thu, Jun 11, 2015 at 7:03 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.
>
> I've been trying to figure out a plan to enable native column stores (CS or colstore) for Postgres. Motivations:
> * avoid the 32 TB limit for tables
> * avoid the 1600 column limit for tables
> * increased performance

Are you looking to avoid all hardware-based limits, or would using a 64-bit row pointer be possible? That would give you 2^64, or 1.8E19, unique rows over whatever granularity/uniqueness you use (per table, per database, etc.).
--
Mike Nolan
Re: [HACKERS] On columnar storage
On Thu, Jun 11, 2015 at 6:03 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.

Quick thought. We already support out-of-line columnar storage in a fashion with 'toast'. Obviously that's a long way from where you want to go, but have you ruled out extending that system?

merlin
Re: [HACKERS] On columnar storage
Amit Kapila wrote:
> On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
>> There are several parts to this:
>> 1. the CSM API
>> 2. Cataloguing column stores
>> 3. Query processing: rewriter, optimizer, executor
>
> I think another important point is the format of column stores - is it the page format used by index/heap, and how are they organised?

Not really. That stuff is part of the column store implementation itself; we're not tackling that part just yet. Eventually there might be an implementation using ORC or other formats. That doesn't matter at this point -- we only need something that implements the specified API.

>> One critical detail is what will be used to identify a heap row when talking to a CS implementation. There are two main possibilities:
>> 1. use CTIDs
>> 2. use some logical tuple identifier
>>
>> Using CTIDs is simpler. One disadvantage is that every UPDATE of a row needs to let the CS know about the new location of the tuple, so that the value is known to be associated with the new tuple location as well as the old. This needs to happen even if the value of the column itself is not changed.
>
> Isn't this somewhat similar to an index segment?

Not sure what you mean by "index segment". A column store is not an index -- it is the primary storage for the column in question. The heap does not have a copy of the data.

> Will the column store obey the snapshot model similar to current heap tuples, and if so, will it derive the transaction information from the heap tuple?

Yes, visibility will be tied to the heap tuple -- a value is accessed only when its corresponding heap row has already been determined to be visible. One interesting point that arises from this is about vacuum: when are we able to remove a value from the store? I have some not-completely-formed ideas about this.

-- 
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] On columnar storage
Alvaro Herrera <alvhe...@2ndquadrant.com> writes:
> Amit Kapila wrote:
>> Will the column store obey the snapshot model similar to current heap tuples, and if so, will it derive the transaction information from the heap tuple?
>
> Yes, visibility will be tied to the heap tuple -- a value is accessed only when its corresponding heap row has already been determined to be visible. One interesting point that arises from this is about vacuum: when are we able to remove a value from the store? I have some not-completely-formed ideas about this.

Hm. This seems not terribly ambitious --- mightn't a column store extension wish to store tables *entirely* in the column store, rather than tying them to a perhaps-vestigial heap table? Perhaps that's a necessary restriction to get to something implementable, but considering that the proposal mentions pluggable column stores, I should think you'd not want to tie it down that much.

I can't help thinking that this could tie in with the storage level API that I was waving my arms about last year. Or maybe not --- the goals are substantially different --- but I think we ought to reflect on that rather than just doing a narrow hack for column stores used in the particular way you're describing here.

			regards, tom lane
[HACKERS] On columnar storage
We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.

I've been trying to figure out a plan to enable native column stores (CS or colstore) for Postgres. Motivations:

* avoid the 32 TB limit for tables
* avoid the 1600 column limit for tables
* increased performance

There already are some third-party CS implementations for Postgres; some of these work on top of the FDW interface, others are simply proprietary forks. Since I don't have access to any of their code, there's not much I can learn from them. If people with insider knowledge on them can chime in, perhaps we can work together -- collaboration is very welcome.

We're not interested in perpetuating the idea that a CS needs to go through the FDW mechanism. Even if there's a lot of simplicity of implementation, it's almost certain to introduce too many limitations.

Simply switching all our code to use columnar storage rather than row-based storage is unlikely to go well. We're aiming at letting some columns of tables be part of a CS, while other parts would continue to be in the heap. At the same time, we're aiming at opening the way for different CS implementations instead of trying to provide a single one-size-fits-all one.

There are several parts to this:

1. the CSM API
2. Cataloguing column stores
3. Query processing: rewriter, optimizer, executor

The Column Store Manager API

Since we want to have pluggable implementations, we need to have a registry of store implementations. I propose we add a catalog pg_cstore_impl with OID, name, and a bunch of function references to open a store, getvalue from it, getrows (to which we pass a qual and get a bunch of tuple IDs back), and putvalue. This is in line with our procedural language support.

One critical detail is what will be used to identify a heap row when talking to a CS implementation. There are two main possibilities:

1. use CTIDs
2. use some logical tuple identifier

Using CTIDs is simpler. One disadvantage is that every UPDATE of a row needs to let the CS know about the new location of the tuple, so that the value is known to be associated with the new tuple location as well as the old. This needs to happen even if the value of the column itself is not changed.

Using logical tuple identifiers solves this problem: an update does not change the LTID, so the colstore needn't be involved unless the attribute(s) in the colstore are being changed. The downside is that the logical tuple identifier must come from somewhere. We could use some user attribute, if there's something appropriate, but it's probably not good to simply use any primary key that the user has specified. (Also, having an UPDATE change the primary key would be troublesome.) We could also offer the choice of having an autogenerated value that's not user-visible; we currently don't have non-user-visible columns, so this would require additional implementation effort. Furthermore, we should think about interactions between this and the IDENTITY stuff we currently have for replication -- my impression is that IDENTITY does not necessarily represent a useful identifier for column store purposes.

All in all, it seems prudent to limit the initial implementation to use CTIDs only, and leave LTIDs for a later stage.

Cataloguing Column Stores

Each table with columns in a separate store will have relhasstore=t. This hints construction of its relcache entry to obtain rows from pg_cstore for that table. The new catalog pg_cstore looks like this:

    cstname | cststoreid | cstrelid | cstnatts | cstatts

* cstname is the store name, unique within each relation.
* cststoreid is the OID of the pg_cstore_impl row.
* cstrelid is the OID of the table that this cstore is for.
* cstnatts is the number of columns in the store.
* cstatts is an array of the attnums contained in this store.

This design permits having multiple stores for a table, and one or more columns in a store. We will focus on the case where a table has a single column store and a column store has a single column, because we believe this simplifies several things.

Query Processing

Rewriter

Parsing occurs as currently. During query rewrite, specifically at the bottom of the per-relation loop in fireRIRrules(), we will modify the query tree: each relation RTE containing a colstore will be replaced with a JoinExpr containing the relation as left child and the colstore as right child (1). The colstore RTE will be of a new RTEKind. For each such change, all Var nodes that point to attnums stored in the colstore will be modified so that they reference the RTE of the colstore instead (2).

(1) This seems very similar to what convert_ANY_sublink_to_join() does.
(2) This is very similar to what ChangeVarNodes does, except that we modify only some of the var nodes pointing to the relation,
Re: [HACKERS] On columnar storage
On Fri, Jun 12, 2015 at 4:33 AM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> [...]
> There are several parts to this:
> 1. the CSM API
> 2. Cataloguing column stores
> 3. Query processing: rewriter, optimizer, executor

I think another important point is the format of column stores - is it the page format used by index/heap, and how are they organised?

> The Column Store Manager API
>
> Since we want to have pluggable implementations, we need to have a registry of store implementations. I propose we add a catalog pg_cstore_impl with OID, name, and a bunch of function references to open a store, getvalue from it, getrows (to which we pass a qual and get a bunch of tuple IDs back), and putvalue. This is in line with our procedural language support.
>
> One critical detail is what will be used to identify a heap row when talking to a CS implementation. There are two main possibilities:
> 1. use CTIDs
> 2. use some logical tuple identifier
>
> Using CTIDs is simpler. One disadvantage is that every UPDATE of a row needs to let the CS know about the new location of the tuple, so that the value is known to be associated with the new tuple location as well as the old. This needs to happen even if the value of the column itself is not changed.

Isn't this somewhat similar to an index segment?

Will the column store obey the snapshot model similar to current heap tuples, and if so, will it derive the transaction information from the heap tuple?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] On columnar storage
Josh,

* Josh Berkus (j...@agliodbs.com) wrote:
> On 06/11/2015 04:03 PM, Alvaro Herrera wrote:
>> We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.
>
> Added to: https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Unconference#Topics

I believe it was already there? Look for 'Native Columnar Storage'.

Thanks!

Stephen
Re: [HACKERS] On columnar storage
On 06/11/2015 04:03 PM, Alvaro Herrera wrote:
> We hope to have a chance to discuss this during the upcoming developer unconference in Ottawa. Here are some preliminary ideas to shed some light on what we're trying to do.

Added to: https://wiki.postgresql.org/wiki/PgCon_2015_Developer_Unconference#Topics

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com