Re: [HACKERS] Pluggable storage
On Wed, Oct 25, 2017 at 1:59 PM, Amit Kapila wrote:
>> Another thing to consider is that, if we could replace satisfiesfunc,
>> it would probably break some existing code.  There are multiple places
>> in the code that compare snapshot->satisfies to
>> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>>
>> I think the storage API should just leave snapshots alone.  If a
>> storage engine wants to call HeapTupleSatisfiesVisibility() with that
>> snapshot, it can do so.  Otherwise it can switch on
>> snapshot->satisfies and handle each case however it likes.
>
> How will it switch satisfies at runtime?  There are places where we
> might know which visibility function (*MVCC, *Dirty, etc.) needs to be
> called, but I think there are other places (like heap_fetch) where it
> is not clear and we decide based on what is stored in
> snapshot->satisfies.

An alternative storage engine needs to provide its own implementation
of heap_fetch, and that replacement implementation can implement MVCC
and other snapshot behavior in any way it likes.

My point here is that I think it's better if the table access method
stuff doesn't end up modifying snapshots.  I think it's fine for a
table access method to get passed a standard snapshot.  Some code may
be needed to cater to the access method's specific needs, but that
code can live inside the table access method, without contaminating
the snapshot stuff.  We have to try to draw some boundary around table
access methods -- we don't want to end up teaching everything in the
system about them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Oct 25, 2017 at 11:37 AM, Robert Haas wrote:
> On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila wrote:
>> I think what we need here is a way to register satisfies function
>> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
>
> I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
> HeapTuple as an argument, and it cares in detail about that tuple's
> xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
> because an alternative storage engine is likely to use a different
> format than HeapTuple and to not have hint bits (or at least not in
> the same form we have them now).  Also, it doesn't necessarily have a
> Boolean answer to the question "can this snapshot see this tuple?".
> It may be more like "given this TID, what tuple if any can I see
> there?" or "given this tuple, what version of it would I see with
> this snapshot?".
>
> Another thing to consider is that, if we could replace satisfiesfunc,
> it would probably break some existing code.  There are multiple places
> in the code that compare snapshot->satisfies to
> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>
> I think the storage API should just leave snapshots alone.  If a
> storage engine wants to call HeapTupleSatisfiesVisibility() with that
> snapshot, it can do so.  Otherwise it can switch on
> snapshot->satisfies and handle each case however it likes.

How will it switch satisfies at runtime?  There are places where we
might know which visibility function (*MVCC, *Dirty, etc.) needs to be
called, but I think there are other places (like heap_fetch) where it
is not clear and we decide based on what is stored in
snapshot->satisfies.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
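[The runtime dispatch Amit asks about boils down to comparing the
snapshot's function pointer against the known visibility routines,
which is also what the existing checks against
HeapTupleSatisfiesHistoricMVCC/HeapTupleSatisfiesMVCC do.  A minimal,
self-contained sketch in C -- all names here (DemoTuple, DemoSnapshot,
my_am_tuple_visible) are hypothetical stand-ins, not the real
PostgreSQL declarations or code from the patch:]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for SnapshotData and SnapshotSatisfiesFunc. */
typedef struct DemoTuple
{
    bool    committed;      /* would be derived from xmin/xmax in reality */
    bool    own_change;     /* changed by the current "transaction" */
} DemoTuple;

typedef bool (*SatisfiesFunc) (DemoTuple *tup);

typedef struct DemoSnapshot
{
    SatisfiesFunc satisfies;
} DemoSnapshot;

static bool
DemoSatisfiesMVCC(DemoTuple *tup)
{
    return tup->committed;
}

static bool
DemoSatisfiesDirty(DemoTuple *tup)
{
    return tup->committed || tup->own_change;
}

/*
 * A storage engine's fetch path can classify an incoming snapshot by
 * comparing the function pointer, then run its own visibility logic
 * for that snapshot type instead of calling the heap routine.
 */
static bool
my_am_tuple_visible(DemoTuple *tup, DemoSnapshot *snapshot)
{
    if (snapshot->satisfies == DemoSatisfiesMVCC)
        return tup->committed;                      /* engine-specific MVCC */
    if (snapshot->satisfies == DemoSatisfiesDirty)
        return tup->committed || tup->own_change;   /* engine-specific dirty */
    return snapshot->satisfies(tup);                /* fall back */
}
```

[The pointer comparison leaves the snapshot itself untouched, which is
the boundary Robert argues for above; the cost is that the engine must
know the full set of satisfies routines it may be handed.]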
Re: [HACKERS] Pluggable storage
On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila wrote:
> I think what we need here is a way to register satisfies function
> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.

I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
HeapTuple as an argument, and it cares in detail about that tuple's
xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
because an alternative storage engine is likely to use a different
format than HeapTuple and to not have hint bits (or at least not in
the same form we have them now).  Also, it doesn't necessarily have a
Boolean answer to the question "can this snapshot see this tuple?".
It may be more like "given this TID, what tuple if any can I see
there?" or "given this tuple, what version of it would I see with this
snapshot?".

Another thing to consider is that, if we could replace satisfiesfunc,
it would probably break some existing code.  There are multiple places
in the code that compare snapshot->satisfies to
HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.

I think the storage API should just leave snapshots alone.  If a
storage engine wants to call HeapTupleSatisfiesVisibility() with that
snapshot, it can do so.  Otherwise it can switch on
snapshot->satisfies and handle each case however it likes.

I don't see how generalizing a Snapshot for other storage engines
really buys us anything except complexity and the danger of reducing
performance for the existing heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Oct 14, 2017 at 1:09 AM, Alexander Korotkov wrote:
> On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas wrote:
>> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>>>> Fully agreed.
>>>
>>> If we implement that interface, where does that leave EvalPlanQual()?
>
> From the first glance, it seems that pluggable storage should override
> EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.

I think there is more to it.  Currently, EState->es_epqTuple is a
HeapTuple which is filled as part of the EvalPlanQual mechanism and
then later used during the scan.  We need to make it pluggable in some
way so that other heaps can work.  We also need some work for
EvalPlanQualFetchRowMarks as that also seems to be tightly coupled
with HeapTuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi wrote:
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>>> Currently I added a snapshot_satisfies API to find out whether the
>>> tuple satisfies the visibility or not with different types of
>>> visibility routines.  I feel these are somehow enough to develop a
>>> different storage method like UNDO.  The storage methods can decide
>>> internally how to provide the visibility.
>>>
>>> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
>>> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
>>> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>>> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
>>> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
>>> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
>>> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
>>> +
>>> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>>> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>>>
>>> Currently no changes are carried out in snapshot logic as that is
>>> kept separate from storage API.
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
> OK, understood.
> I will check along these lines and come up with storage APIs.

I think what we need here is a way to register satisfies function
(SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
That is the core API to decide visibility with respect to different
storage engines.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>>> Fully agreed.
>>
>> If we implement that interface, where does that leave EvalPlanQual()?

From the first glance, it seems that pluggable storage should override
EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.

>> Do those semantics have to be preserved?
>
> For a general-purpose heap storage format, I would say yes.

+1

> I mean, we don't really have control over how people use the API.  If
> somebody decides to implement a storage API that breaks EvalPlanQual
> semantics horribly, I can't stop them, and I don't want to stop them.
> Open source FTW.

Yeah.  We don't have any kind of "safe extensions".  Any extension can
break things really horribly.  For me that means the user should
absolutely trust the extension developer.

> But I don't really want that code in our tree, either.

We keep things in our tree as correct as we can.  And for sure, we
should follow this policy for pluggable storages too.

> I think a storage engine is and should be about the format in which
> data gets stored on disk, and that it should only affect the
> performance of queries not the answers that they give.

Pretty much the same idea as index access methods.  They also affect
performance, but not query answers.  When that's not true, the
situation is considered a bug, and it needs to be fixed.

> I am sure there will be cases where, for reasons of implementation
> complexity, that turns out not to be true, but I think in general we
> should try to avoid it as much as we can.

I think in some cases we can tolerate missing features (and document
them), but we shouldn't tolerate wrong features.  For instance, we may
have some pluggable storage which doesn't support transactions at all
(and that should be documented for sure), but we shouldn't have a
pluggable storage whose transaction support is wrong.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 9:37 PM, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh wrote:
>> For some other storage engine, if we maintain the older version in
>> different storage, undo for example, and don't require a new index
>> entry, should we still call it HOT-chain?
>
> I would say, emphatically, no.  HOT is a creature of the existing
> heap.  If it's creeping into storage APIs they are not really
> abstracted from what we have currently.

+1, a different storage may need to insert entries into only *some* of
the indexes, and these new index entries may have either the same or
new TIDs.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>> Fully agreed.
>
> If we implement that interface, where does that leave EvalPlanQual()?
> Do those semantics have to be preserved?

For a general-purpose heap storage format, I would say yes.

I mean, we don't really have control over how people use the API.  If
somebody decides to implement a storage API that breaks EvalPlanQual
semantics horribly, I can't stop them, and I don't want to stop them.
Open source FTW.  But I don't really want that code in our tree,
either.

I think a storage engine is and should be about the format in which
data gets stored on disk, and that it should only affect the
performance of queries, not the answers that they give.  I am sure
there will be cases where, for reasons of implementation complexity,
that turns out not to be true, but I think in general we should try to
avoid it as much as we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh wrote:
> For some other storage engine, if we maintain the older version in
> different storage, undo for example, and don't require a new index
> entry, should we still call it HOT-chain?

I would say, emphatically, no.  HOT is a creature of the existing
heap.  If it's creeping into storage APIs they are not really
abstracted from what we have currently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 2:23 PM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
>> However I imply that alternative storage would share our "MVCC
>> model".  So, it should share our transactional model including
>> transactions, subtransactions, snapshots etc.  Therefore, if
>> alternative storage is transactional, then in particular it should
>> be able to fetch tuple with given TID according to given snapshot.
>> However, how it's implemented internally is a black box for us.
>> Thus, we don't insist that tuple should have different TID after
>> update; we don't insist there is any analogue of HOT; we don't
>> insist alternative storage needs vacuum (or if even it needs vacuum,
>> it might be performed in completely different way) and so on.
>
> Fully agreed.

If we implement that interface, where does that leave EvalPlanQual()?
Do those semantics have to be preserved?

-- 
Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi wrote:
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
> OK, understood.
> I will check along these lines and come up with storage APIs.

I've some doubts regarding the following function hook:

+typedef bool (*hot_search_buffer_hook) (ItemPointer tid, Relation relation,
+        Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
+        bool *all_dead, bool first_call);

As per my understanding, with the HOT feature a new tuple placed on
the same page and with all indexed columns the same as its parent row
version does not get new index entries (README.HOT).  For some other
storage engine, if we maintain the older version in different storage,
undo for example, and don't require a new index entry, should we still
call it a HOT-chain?  If not, IMHO, we may have something like
*search_buffer_hook(tid,,storageTuple,...).  Depending on the
underlying storage, one can traverse the HOT-chain or undo-chain or
some other multi-version strategy for non-key updates.

After a successful index search, most of the index AMs set
(HeapTupleData)xs_ctup->t_self of IndexScanDescData to the tupleid in
the storage.  IMHO, some changes are needed here to make it generic.

@@ -328,47 +376,27 @@ ExecStoreTuple(HeapTuple tuple,
 	Assert(tuple != NULL);
 	Assert(slot != NULL);
 	Assert(slot->tts_tupleDescriptor != NULL);
+	Assert(slot->tts_storageslotam != NULL);
 	/* passing shouldFree=true for a tuple on a disk page is not sane */
 	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

For some storage engine, isn't it possible that the buffer is valid
and the tuple to be stored is formed in memory (for example, a tuple
formed from UNDO, an in-memory decrypted tuple, etc.)?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>> Currently I added a snapshot_satisfies API to find out whether the
>> tuple satisfies the visibility or not with different types of
>> visibility routines.  I feel these are somehow enough to develop a
>> different storage method like UNDO.  The storage methods can decide
>> internally how to provide the visibility.
>>
>> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
>> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
>> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
>> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
>> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
>> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
>> +
>> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>>
>> Currently no changes are carried out in snapshot logic as that is
>> kept separate from storage API.
>
> That seems like a strange choice of API.  I think it should be more
> integrated with the scan logic.  For example, if I'm doing an index
> scan, and I get a TID, then I should be able to just say "here's a
> TID, give me any tuples associated with that TID that are visible to
> the scan snapshot".  Then for the current heap it will do
> heap_hot_search_buffer, and for zheap it will walk the undo chain and
> return the relevant tuple from the chain.

OK, understood.
I will check along these lines and come up with storage APIs.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
> Currently I added a snapshot_satisfies API to find out whether the
> tuple satisfies the visibility or not with different types of
> visibility routines.  I feel these are somehow enough to develop a
> different storage method like UNDO.  The storage methods can decide
> internally how to provide the visibility.
>
> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
> +
> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>
> Currently no changes are carried out in snapshot logic as that is
> kept separate from storage API.

That seems like a strange choice of API.  I think it should be more
integrated with the scan logic.  For example, if I'm doing an index
scan, and I get a TID, then I should be able to just say "here's a
TID, give me any tuples associated with that TID that are visible to
the scan snapshot".  Then for the current heap it will do
heap_hot_search_buffer, and for zheap it will walk the undo chain and
return the relevant tuple from the chain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
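[The interface Robert sketches -- "here's a TID, give me whatever
tuple at that TID is visible to the scan snapshot" -- can be expressed
as a single fetch callback that walks a version chain internally.
Everything below is a simplified illustration with hypothetical
DemoTid/DemoTuple types and a boolean standing in for the snapshot
test; it is not the real heap_hot_search_buffer or zheap code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for ItemPointer and a versioned tuple. */
typedef struct DemoTid { unsigned block; unsigned offset; } DemoTid;

typedef struct DemoTuple
{
    DemoTid     tid;
    int         version;        /* newer versions chain to older ones */
    bool        visible;        /* stands in for the snapshot test */
    struct DemoTuple *prior;    /* HOT chain / undo chain analogue */
} DemoTuple;

/*
 * "Given this TID, return the version visible to the snapshot."  For the
 * current heap this would walk the HOT chain via heap_hot_search_buffer;
 * an undo-based engine would walk its undo chain instead.  Either way,
 * the chain walk is private to the storage method.
 */
static DemoTuple *
fetch_visible(DemoTuple *newest, DemoTid tid)
{
    for (DemoTuple *t = newest; t != NULL; t = t->prior)
    {
        if (t->tid.block == tid.block && t->tid.offset == tid.offset &&
            t->visible)
            return t;
    }
    return NULL;
}
```

[The caller (an index scan, say) never learns whether the engine has
HOT, undo, or something else -- it only hands over a TID and a
snapshot, which is the abstraction boundary the thread converges on.]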
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 8:23 AM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
>> It's probable that we imply different meanings to "MVCC
>> implementation".  While writing "MVCC implementation" I meant that,
>> for instance, alternative storage may implement UNDO chains to store
>> versions of the same row.  Correspondingly, it may not have any
>> analogue of our HOT.
>
> Yes, the zheap project on which EnterpriseDB is working has precisely
> this characteristic.
>
>> However I imply that alternative storage would share our "MVCC
>> model".  So, it should share our transactional model including
>> transactions, subtransactions, snapshots etc.  Therefore, if
>> alternative storage is transactional, then in particular it should
>> be able to fetch tuple with given TID according to given snapshot.
>> However, how it's implemented internally is a black box for us.
>> Thus, we don't insist that tuple should have different TID after
>> update; we don't insist there is any analogue of HOT; we don't
>> insist alternative storage needs vacuum (or if even it needs vacuum,
>> it might be performed in completely different way) and so on.
>
> Fully agreed.

Currently I added a snapshot_satisfies API to find out whether the
tuple satisfies the visibility or not with different types of
visibility routines.  I feel these are somehow enough to develop a
different storage method like UNDO.  The storage methods can decide
internally how to provide the visibility.

+ amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
+ amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
+ amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
+ amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
+ amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
+ amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
+ amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
+
+ amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
+ amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;

Currently no changes are carried out in snapshot logic as that is kept
separate from storage API.

Regards,
Hari Babu
Fujitsu Australia
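[For reference, the registration scheme quoted above amounts to an
enum-indexed table of visibility callbacks hung off the AM routine
struct.  A compilable toy version follows; the Demo* types and names
are simplified stand-ins (the real patch's functions take HeapTuple,
buffer, and snapshot arguments):]

```c
#include <assert.h>
#include <stdbool.h>

/* Visibility classes, indexing the callback table. */
typedef enum DemoVisibilityType
{
    DEMO_MVCC_VISIBILITY,
    DEMO_SELF_VISIBILITY,
    DEMO_ANY_VISIBILITY,
    DEMO_NUM_VISIBILITY         /* array size, not a real entry */
} DemoVisibilityType;

typedef struct DemoTuple { bool committed; } DemoTuple;
typedef bool (*DemoSatisfiesFunc) (DemoTuple *tup);

static bool DemoSatisfiesMVCC(DemoTuple *t) { return t->committed; }
static bool DemoSatisfiesSelf(DemoTuple *t) { (void) t; return true; }
static bool DemoSatisfiesAny(DemoTuple *t)  { (void) t; return true; }

typedef struct DemoAmRoutine
{
    DemoSatisfiesFunc snapshot_satisfies[DEMO_NUM_VISIBILITY];
} DemoAmRoutine;

/* Each storage method's handler fills the table with its own routines. */
static DemoAmRoutine *
demo_heap_handler(void)
{
    static DemoAmRoutine routine;

    routine.snapshot_satisfies[DEMO_MVCC_VISIBILITY] = DemoSatisfiesMVCC;
    routine.snapshot_satisfies[DEMO_SELF_VISIBILITY] = DemoSatisfiesSelf;
    routine.snapshot_satisfies[DEMO_ANY_VISIBILITY]  = DemoSatisfiesAny;
    return &routine;
}
```

[A different storage method would fill the same slots with its own
routines.  Robert's objection in the follow-up is that these callbacks
still traffic in heap-format tuples, which is what the TID-based fetch
proposal later in the thread avoids.]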
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
> It's probable that we imply different meanings to "MVCC
> implementation".  While writing "MVCC implementation" I meant that,
> for instance, alternative storage may implement UNDO chains to store
> versions of the same row.  Correspondingly, it may not have any
> analogue of our HOT.

Yes, the zheap project on which EnterpriseDB is working has precisely
this characteristic.

> However I imply that alternative storage would share our "MVCC
> model".  So, it should share our transactional model including
> transactions, subtransactions, snapshots etc.  Therefore, if
> alternative storage is transactional, then in particular it should
> be able to fetch tuple with given TID according to given snapshot.
> However, how it's implemented internally is a black box for us.
> Thus, we don't insist that tuple should have different TID after
> update; we don't insist there is any analogue of HOT; we don't
> insist alternative storage needs vacuum (or if even it needs vacuum,
> it might be performed in completely different way) and so on.

Fully agreed.

> During conversations with you at PGCon and other conferences I had
> the impression that you share this view on pluggable storages and
> MVCC.  Probably, we just express this view in different words.  Or
> alternatively I might understand you terribly wrong.

No, it sounds like we are on the same page.  I'm only hoping that we
don't end up with a bunch of storage engines that each use a different
XID space or something icky like that.  I don't think the API should
try to cater to that sort of development.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Wed, Oct 11, 2017 at 11:08 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
>> For me, it's crucial point that pluggable storages should be able to
>> have different MVCC implementation, and correspondingly have full
>> control over its interactions with indexes.
>> Thus, it would be good if we would get consensus on that point.  I'd
>> like other discussion participants to comment whether they
>> agree/disagree and why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I
> think it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.  I don't know
> that it would be a very good idea to have two different full-fledged
> MVCC implementations, though.  Like Tom says, that would be
> replicating a lot of the awfulness of the MySQL model.

It's probable that we imply different meanings to "MVCC
implementation".  While writing "MVCC implementation" I meant that,
for instance, alternative storage may implement UNDO chains to store
versions of the same row.  Correspondingly, it may not have any
analogue of our HOT.

However I imply that alternative storage would share our "MVCC model".
So, it should share our transactional model including transactions,
subtransactions, snapshots etc.  Therefore, if alternative storage is
transactional, then in particular it should be able to fetch a tuple
with a given TID according to a given snapshot.  However, how it's
implemented internally is a black box for us.  Thus, we don't insist
that a tuple should have a different TID after update; we don't insist
there is any analogue of HOT; we don't insist alternative storage
needs vacuum (or even if it needs vacuum, it might be performed in a
completely different way) and so on.

During conversations with you at PGCon and other conferences I had the
impression that you share this view on pluggable storages and MVCC.
Probably, we just express this view in different words.  Or
alternatively I might understand you terribly wrong.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Oct 11, 2017 at 1:08 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
>> For me, it's crucial point that pluggable storages should be able to
>> have different MVCC implementation, and correspondingly have full
>> control over its interactions with indexes.
>> Thus, it would be good if we would get consensus on that point.  I'd
>> like other discussion participants to comment whether they
>> agree/disagree and why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I
> think it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.

I agree with Alexander -- if you're going to have a new MVCC
implementation, you have to do significant work within index access
methods.  Adding "retail index tuple deletion" is probably just the
beginning.  ISTM that you need something like InnoDB's purge thread
when index values change, since two versions of the same index tuple
(each with distinct attribute values) have to physically co-exist for
a time.

> I don't know that it would be a very good idea to have two different
> full-fledged MVCC implementations, though.  Like Tom says, that would
> be replicating a lot of the awfulness of the MySQL model.

It's not just the MySQL model, FWIW.  SQL-on-Hadoop systems like
Impala, certain NoSQL systems, and AFAIK any database system that
claims to have pluggable storage all do it this way.  That is, core
transaction management functions (e.g. MVCC snapshot acquisition) are
outsourced to the storage engine.  It *is* very cumbersome, but that's
what they do.

-- 
Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
> For me, it's crucial point that pluggable storages should be able to
> have different MVCC implementation, and correspondingly have full
> control over its interactions with indexes.
> Thus, it would be good if we would get consensus on that point.  I'd
> like other discussion participants to comment whether they
> agree/disagree and why.
> Any comments?

I think it's good for new storage managers to have full control over
interactions with indexes.  I'm not sure about the MVCC part.  I think
it would be legitimate to want a storage manager to ignore MVCC
altogether - e.g. to build a non-transactional table.  I don't know
that it would be a very good idea to have two different full-fledged
MVCC implementations, though.  Like Tom says, that would be
replicating a lot of the awfulness of the MySQL model.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Mon, Oct 9, 2017 at 5:32 PM, Tom Lanewrote: > Alexander Korotkov writes: > > For me, it's crucial point that pluggable storages should be able to have > > different MVCC implementation, and correspondingly have full control over > > its interactions with indexes. > > Thus, it would be good if we would get consensus on that point. I'd like > > other discussion participants to comment whether they agree/disagree and > > why. > > Any comments? > > TBH, I think that's a good way of ensuring that nothing will ever get > committed. You're trying to draw the storage layer boundary at a point > that will take in most of the system. If we did build it like that, > what we'd end up with would be very reminiscent of mysql's storage > engines, complete with inconsistent behaviors and varying feature sets > across engines. I don't much want to go there. > However, if we insist that pluggable storage should have the same MVCC implementation, interacts with indexes the same way and also use TIDs as tuple identifiers, then what useful implementations might we have? Per-page heap compression and encryption? Or different heap page layout? Or tuple format? OK, but that doesn't justify such wide API as it's implemented in the current version of patch in this thread. If we really want to restrict applicability of pluggable storages that way, then we probably should give up with "pluggable storages" and make it "pluggable heap page format" at I proposed upthread. Implementation of alternative storage would be hard and challenging task. Yes, it would include reimplementation of significant part of the system. But that seems inevitable if we're going to implement alternative really storages (not just hacks over existing storage). And I don't think that our pluggable storages would be reminiscent of mysql's storage engines while we're keeping two properties: 1) All the storages use the same WAL stream, 2) All the storages use same transactions and snapshots. 
If we keep these two properties, we would need neither 2PC to run transactions across different storages nor a separate log for replication. These two are major drawbacks of the MySQL model. Varying feature sets across engines seem inevitable and natural. We have to invent alternative storages to get features that are hard to have in our current storage. So, no wonder that feature sets would vary... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Alexander Korotkov writes: > For me, it's crucial point that pluggable storages should be able to have > different MVCC implementation, and correspondingly have full control over > its interactions with indexes. > Thus, it would be good if we would get consensus on that point. I'd like > other discussion participants to comment whether they agree/disagree and > why. > Any comments? TBH, I think that's a good way of ensuring that nothing will ever get committed. You're trying to draw the storage layer boundary at a point that will take in most of the system. If we did build it like that, what we'd end up with would be very reminiscent of mysql's storage engines, complete with inconsistent behaviors and varying feature sets across engines. I don't much want to go there. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Sep 27, 2017 at 7:51 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > I took a look on this patch. I've following notes for now. > > tuple_insert_hook tuple_insert; /* heap_insert */ >> tuple_update_hook tuple_update; /* heap_update */ >> tuple_delete_hook tuple_delete; /* heap_delete */ > > > I don't think this set of functions provides good enough level of > abstraction for storage AM. This functions encapsulate only low-level work > of insert/update/delete tuples into heap itself. However, it still assumed > that indexes are managed outside of storage AM. I don't think this is > right, assuming that one of most demanded storage API usage would be > different MVCC implementation (for instance, UNDO log based as discussed > upthread). Different MVCC implementation is likely going to manage indexes > in a different way. For example, storage AM utilizing UNDO would implement > in-place update even when indexed columns are modified. Therefore this > piece of code in ExecUpdate() wouldn't be relevant anymore. > > /* >> * insert index entries for tuple >> * >> * Note: heap_update returns the tid (location) of the new tuple in >> * the t_self field. >> * >> * If it's a HOT update, we mustn't insert new index entries. >> */ >> if ((resultRelInfo->ri_NumIndices > 0) && >> !storage_tuple_is_heaponly(resultRelationDesc, >> tuple)) >> recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid), >> estate, false, NULL, NIL); > > > I'm firmly convinced that this logic should be encapsulated into storage > AM altogether with inserting new index tuples on storage insert. Also, HOT > should be completely encapsulated into heapam. It's quite evident for me > that storage utilizing UNDO wouldn't have anything like our current HOT. > Therefore, I think there shouldn't be hot_search_buffer() API function. > tuple_fetch() may cover hot_search_buffer(). That might require some > signature change of tuple_fetch() (probably, extra arguments). 
For me, it's a crucial point that pluggable storages should be able to have different MVCC implementations, and correspondingly have full control over their interactions with indexes. Thus, it would be good if we could get consensus on that point. I'd like the other discussion participants to comment on whether they agree/disagree and why. Any comments? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi! I took a look at this patch. I have the following notes for now. tuple_insert_hook tuple_insert; /* heap_insert */ > tuple_update_hook tuple_update; /* heap_update */ > tuple_delete_hook tuple_delete; /* heap_delete */ I don't think this set of functions provides a good enough level of abstraction for a storage AM. These functions encapsulate only the low-level work of inserting/updating/deleting tuples in the heap itself. However, it is still assumed that indexes are managed outside of the storage AM. I don't think this is right, given that one of the most demanded usages of the storage API would be a different MVCC implementation (for instance, UNDO log based, as discussed upthread). A different MVCC implementation is likely going to manage indexes in a different way. For example, a storage AM utilizing UNDO would implement in-place update even when indexed columns are modified. Therefore this piece of code in ExecUpdate() wouldn't be relevant anymore. /* > * insert index entries for tuple > * > * Note: heap_update returns the tid (location) of the new tuple in > * the t_self field. > * > * If it's a HOT update, we mustn't insert new index entries. > */ > if ((resultRelInfo->ri_NumIndices > 0) && > !storage_tuple_is_heaponly(resultRelationDesc, tuple)) > recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid), > estate, false, NULL, NIL); I'm firmly convinced that this logic should be encapsulated into the storage AM, together with inserting new index tuples on storage insert. Also, HOT should be completely encapsulated into heapam. It's quite evident to me that a storage utilizing UNDO wouldn't have anything like our current HOT. Therefore, I think there shouldn't be a hot_search_buffer() API function. tuple_fetch() may cover hot_search_buffer(). That might require some signature change of tuple_fetch() (probably, extra arguments). LockTupleMode and HeapUpdateFailureData shouldn't be private to heapam. 
Any full-weight OLTP storage AM should support our tuple lock modes and should be able to report update failures. HeapUpdateFailureData should be renamed to something like StorageUpdateFailureData. The contents of HeapUpdateFailureData seem general enough to be supported by any storage with an ItemPointer tuple locator. storage_setscanlimits() is used only during index build. I think that since a storage AM may have a different MVCC implementation, the storage AM should decide how to communicate with indexes, including index build. Therefore, instead of exposing storage_setscanlimits(), the whole IndexBuildHeapScan() should be encapsulated into the storage AM. Also, BulkInsertState should be private to heapam. Other storages may have a different state for bulk insert. At the API level we might have some abstract pointer instead of BulkInsertState, while having GetBulkInsertState and others as API methods. storage_freeze_tuple() is called only once, from rewrite_heap_tuple(). That makes me think that the tuple_freeze API method is wrong as an abstraction. We probably should make rewrite_heap_tuple() or even the whole rebuild_relation() an API method... Heap reloptions are untouched for now. A storage AM should be able to provide its own specific options, just like index AMs do. That's all I have for now. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 8:26 AM, Haribabu Kommi wrote: > - Minor: don't think the _function suffix for Storis necessary, just >> makes things long, and every member has it. Besides that, it's also >> easy to misunderstand - for a second I understood >> scan_getnext_function to be about getting the next function... >> > > OK. How about adding _hook? > I've answered Andres on why I think the _function suffix is OK for now. And I don't particularly like the _hook suffix for this purpose, because those functions are parts of an API implementation, not hooks. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi, Thank you for the review on this subject. I think it's extremely important for PostgreSQL to eventually get a pluggable storage API. In general I agree with all your points, but I'd like to make a couple of comments. On Tue, Aug 15, 2017 at 9:53 AM, Andres Freund wrote: > - I don't think we should introduce this without a user besides > heapam. The likelihood that API will be usable by anything else > without a testcase seems fairly remote. I think some src/test/modules > type implementation of a per-session, in-memory storage - relatively > easy to implement - or such is necessary. > +1 for having a user before committing the API. However, I'd like to note that the sample storage implementation should do something really different from our current heap. In particular, if a per-session, in-memory storage were just another way to keep the heap in local buffers, it wouldn't be OK for me, because that kind of storage could be achieved much more easily without such a complex API. But if the per-session, in-memory storage would, for instance, utilize a different MVCC implementation, that would be a very good sample of storage API usage. > - Minor: don't think the _function suffix for Storis necessary, just > makes things long, and every member has it. Besides that, it's also > easy to misunderstand - for a second I understood > scan_getnext_function to be about getting the next function... > The _function suffix looks long to me too. But we should look at this question from a uniformity point of view. FdwRoutine, TsmRoutine, and IndexAmRoutine use the _function suffix. This is why I think we should use the _function suffix for StorageAmRoutine unless we're going to change that for the other *Routines too. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Sep 14, 2017 at 8:17 AM, Haribabu Kommi wrote: > Instead of modifying the Bitmap Heap and Sample scans to avoid referring to the internal members of the HeapScanDesc, I divided the HeapScanDesc into two parts. > > 1. StorageScanDesc > 2. HeapPageScanDesc > > The StorageScanDesc contains the minimal information that is required outside the storage routine, and this must be provided by all storage routines. This structure contains minimal information such as relation, snapshot, buffer, etc. > > The HeapPageScanDesc contains the extra information that is required for Bitmap Heap and Sample scans to work. This structure contains the information on blocks, visible offsets, etc. Currently this structure is used only in the Bitmap Heap and Sample scans and their supporting contrib modules, except the pgstattuple module. pgstattuple needs some additional changes. > > I added an additional storage API to return the HeapPageScanDesc as required by the Bitmap Heap and Sample scans, and this API is called only in these two scans. Also, these scan methods are chosen by the planner only when the storage routine supports returning a HeapPageScanDesc. Currently I implemented the planner support only for Bitmap; it is yet to be done for the Sample scan. > > With the above approach, I removed all the references to HeapScanDesc outside the heap. The changes for this approach are available in 0008-Remove-HeapScanDesc-usage-outside-heap.patch. > > Suggestions/comments on the above approach? For me, that's an interesting idea. Naturally, the way BitmapHeapScan and SampleScan work, even at a very high level, is applicable only for some storage AMs (i.e. heap-like storage AMs). For example, an index-organized table wouldn't ever support BitmapHeapScan, because it refers to tuples by PK values, not TIDs. However, in this case, the storage AM might have some alternative to our BitmapHeapScan. 
So, an index-organized table might have some compressed representation of an ordered set of PK values and use it for bulk fetches from the PK index. Therefore, I think it would be nice to make BitmapHeapScan a heap-storage-AM-specific scan method, while other storage AMs could provide their own storage-AM-specific scan methods. Probably that would be too much for this patchset and should be done during one of the next work cycles on storage AMs (I'm sure that such a huge project as pluggable storage AMs will have multiple iterations). Similarly, SampleScans contain storage-AM-specific logic. For instance, our SYSTEM sampling method fetches random blocks from the heap, providing a high-performance way to sample the heap. Coming back to the example of an index-organized table, it could provide its own storage-AM-specific table sampling methods, including sophisticated PK tree traversal fetching random small ranges of PKs. Given that tablesample methods are already pluggable, making them storage-AM-specific would lead to user-visible changes; i.e., a tablesample method would be created for a particular storage AM or set of storage AMs. However, I haven't yet figured out what exactly the API should look like... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Sat, Sep 9, 2017 at 1:23 PM, Haribabu Kommi wrote: > > I rebased the patch to the latest master and also fixed the duplicate OID > and some slot fixes. Updated patches are attached. > While analyzing the removal of HeapScanDesc usage outside the heap modules, the most used member is "rs_cbuf", for the purposes of locking the buffer, visibility checks and marking it as dirty. The buffer is tightly integrated with visibility. A buffer may not be required for some storage routines where the data is always in memory, etc. I found references to its structure members in the following places. I feel that, other than the following 4 members, the rest of them need to be moved into their corresponding storage routines.

Relation rs_rd; /* heap relation descriptor */
Snapshot rs_snapshot; /* snapshot to see */
int rs_nkeys; /* number of scan keys */
ScanKey rs_key; /* array of scan key descriptors */

But currently I am treating "rs_cbuf" also as a needed member and am expecting all storage routines to provide it. Or we may need another approach to mark the buffer as dirty. Suggestions? The following are the rest of the members that are used outside the heap.

BlockNumber rs_nblocks; /* total number of blocks in rel */
  pgstattuple.c, tsm_system_rows.c, tsm_system_time.c, system.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- mostly for checking the number of blocks in a rel.
BufferAccessStrategy rs_strategy; /* access strategy for reads */
  pgstattuple.c
bool rs_pageatatime; /* verify visibility page-at-a-time? */
BlockNumber rs_startblock; /* block # to start at */
bool rs_syncscan; /* report location to syncscan logic? */
bool rs_inited; /* false = scan not init'd yet */
  nodeSamplescan.c
HeapTupleData rs_ctup; /* current tuple in scan, if any */
  genam.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- used for retrieving the last scanned tuple.
BlockNumber rs_cblock; /* current block # in scan, if any */
  index.c, nodeSamplescan.c
Buffer rs_cbuf; /* current buffer in scan, if any */
  pgrowlocks.c, pgstattuple.c, genam.c, index.c, cluster.c, tablecmds.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- mostly used for locking the buffer.
ParallelHeapScanDesc rs_parallel; /* parallel scan information */
  nodeSeqscan.c
int rs_cindex; /* current tuple's index in vistuples */
  nodeBitmapHeapscan.c
int rs_ntuples; /* number of visible tuples on page */
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
  tsm_system_rows.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- used to retrieve the offsets, mainly in bitmap and sample scans.

I think the usage of the rest of the above members outside the heap can be changed once the Bitmap and Sample scans are modified to use the storage routines when returning tuples, instead of their own implementations. I feel these scans are the major users of the rest of the members. This approach may need some more APIs to get rid of the Bitmap and Sample scans' own implementations. Suggestions? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Fri, Sep 1, 2017 at 1:51 PM, Haribabu Kommi wrote: > Here I attached new set of patches that are rebased to the latest master. Hi Haribabu, Could you please post a new rebased version? 0007-Scan-functions-are-added-to-storage-AM.patch conflicts with recent changes. Thanks! -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 11:59 PM, Amit Kapila wrote: > On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi > wrote: > > > > > > On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila > > wrote: > >> > >> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi > >> wrote: > >> > > >> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > > >> > wrote: > >> >> > >> >> > >> >> Also, it is quite possible that some of the storage Am's don't even > >> >> want to return bool as a parameter from HeapTupleSatisfies* API's. I > >> >> guess what we need here is to provide a way so that different storage > >> >> am's can register their function pointer for an equivalent to > >> >> satisfies function. So, we need to change > >> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different > >> >> handlers can register their function instead of using that directly. > >> >> I think that should address the problem you are planning to solve by > >> >> omitting buffer parameter. > >> > > >> > > >> > Thanks for your suggestion. Yes, it is better to go in the direction > of > >> > SnapshotSatisfiesFunc. > >> > > >> > I verified the above idea of implementing the Tuple visibility > functions > >> > and assign them into the snapshotData structure based on the snapshot. > >> > > >> > The Tuple visibility functions that are specific to the relation are > >> > available > >> > with the RelationData structure and this structure may not be > available, > >> > > >> > >> Which functions are you referring here? I don't see anything in > >> tqual.h that uses RelationData. > > > > > > > > With storage API's, the tuple visibility functions are available with > > RelationData > > and those are needs used to update the SnapshotData structure > > SnapshotSatisfiesFunc member. > > > > But the RelationData is not available everywhere, where the snapshot is > > created, > > but it is available every place where the tuple visibility is checked. 
> So I > > just changed > > the way of checking the tuple visibility with the information of > snapshot by > > calling > > the corresponding tuple visibility function from RelationData. > > > > If SnapshotData provides MVCC, then the MVCC specific tuple visibility > > function from > > RelationData is called. The SnapshotSatisfiesFunc member is changed to a > > enum > > that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc. > > Whenever > > the visibility check is needed, the corresponding function is called. > > > > It will be easy to understand and see if there is some better > alternative once you have something in the form of a patch. Sorry for the delay. I will submit the new patch series, with all comments given upthread, to the upcoming commitfest. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi wrote: > > > On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila > wrote: >> >> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi >> wrote: >> > >> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila >> > wrote: >> >> >> >> >> >> Also, it is quite possible that some of the storage Am's don't even >> >> want to return bool as a parameter from HeapTupleSatisfies* API's. I >> >> guess what we need here is to provide a way so that different storage >> >> am's can register their function pointer for an equivalent to >> >> satisfies function. So, we need to change >> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different >> >> handlers can register their function instead of using that directly. >> >> I think that should address the problem you are planning to solve by >> >> omitting buffer parameter. >> > >> > >> > Thanks for your suggestion. Yes, it is better to go in the direction of >> > SnapshotSatisfiesFunc. >> > >> > I verified the above idea of implementing the Tuple visibility functions >> > and assign them into the snapshotData structure based on the snapshot. >> > >> > The Tuple visibility functions that are specific to the relation are >> > available >> > with the RelationData structure and this structure may not be available, >> > >> >> Which functions are you referring here? I don't see anything in >> tqual.h that uses RelationData. > > > > With storage API's, the tuple visibility functions are available with > RelationData > and those are needs used to update the SnapshotData structure > SnapshotSatisfiesFunc member. > > But the RelationData is not available everywhere, where the snapshot is > created, > but it is available every place where the tuple visibility is checked. So I > just changed > the way of checking the tuple visibility with the information of snapshot by > calling > the corresponding tuple visibility function from RelationData. 
> > If SnapshotData provides MVCC, then the MVCC specific tuple visibility > function from > RelationData is called. The SnapshotSatisfiesFunc member is changed to a > enum > that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc. > Whenever > the visibility check is needed, the corresponding function is called. > It will be easy to understand and see if there is some better alternative once you have something in the form of a patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila wrote: > On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi > wrote: > > > > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > > wrote: > >> > >> > >> Also, it is quite possible that some of the storage Am's don't even > >> want to return bool as a parameter from HeapTupleSatisfies* API's. I > >> guess what we need here is to provide a way so that different storage > >> am's can register their function pointer for an equivalent to > >> satisfies function. So, we need to change > >> SnapshotData.SnapshotSatisfiesFunc in some way so that different > >> handlers can register their function instead of using that directly. > >> I think that should address the problem you are planning to solve by > >> omitting buffer parameter. > > > > > > Thanks for your suggestion. Yes, it is better to go in the direction of > > SnapshotSatisfiesFunc. > > > > I verified the above idea of implementing the Tuple visibility functions > > and assign them into the snapshotData structure based on the snapshot. > > > > The Tuple visibility functions that are specific to the relation are > > available > > with the RelationData structure and this structure may not be available, > > > > Which functions are you referring here? I don't see anything in > tqual.h that uses RelationData. With the storage APIs, the tuple visibility functions are available with RelationData, and those need to be used to update the SnapshotData structure's SnapshotSatisfiesFunc member. But the RelationData is not available everywhere the snapshot is created; it is, however, available every place where tuple visibility is checked. So I just changed the way of checking tuple visibility: using the information in the snapshot, the corresponding tuple visibility function from RelationData is called. If SnapshotData specifies MVCC, then the MVCC-specific tuple visibility function from RelationData is called. 
The SnapshotSatisfiesFunc member is changed to an enum that holds the tuple visibility type, such as MVCC, DIRTY, SELF, etc. Whenever a visibility check is needed, the corresponding function is called. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Aug 15, 2017 at 4:53 PM, Andres Freund wrote: > Hi, > > > On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote: > > Here I attached WIP patches to support pluggable storage. The patch > series > > are may not work individually. Still so many things are under > development. > > These patches are just to share the approach of the current development. > > Making a pass through the patchset to get a feel where this at, and > where this is headed. I previously skimmed the thread to get a rough > sense on what's discused, but not in a very detailed manner. > Thanks for the review. General: > > - I think one important discussion we need to have is what kind of > performance impact we're going to accept introducing this. It seems > very likely that this'll cause some slowdown. We can kind of > alleviate that by doing some optimizations at the same time, but > nevertheless, this abstraction is going to cost. > OK. Maybe to take some decision we need performance figures; I will measure the performance once the APIs are stabilized. - I don't think we should introduce this without a user besides > heapam. The likelihood that API will be usable by anything else > without a testcase seems fairly remote. I think some src/test/modules > type implementation of a per-session, in-memory storage - relatively > easy to implement - or such is necessary. > Sure, I will add a test module once the APIs are stabilized. > - I think, and detailed some of that, we're going to need some cleanups > that go in before this, to decrease the size / increase the quality of > the new APIs. It's going to get more painful to change APIs > subsequently. > > - We'll have to document clearly that these APIs are going to change for > a while, even after the release introducing them. > Yes, that's correct. Because this is the first time we are developing storage APIs to support pluggable storage, they may need some refinement based on usage to support different storage methods. 
> StorageAm - Scan stuff: > > - I think API isn't quite right. There's a lot of granular callback > functionality like scan_begin_function / scan_begin_catalog / > scan_begin_bm - these largely are convenience wrappers around the same > function, and I don't think that would, or rather should, change in > any abstracted storage layer. So I think there needs to be some > unification first (pretty close w/ beginscan_internal already, but > perhaps we should get rid of a few of these wrappers). > OK. I will change the API to add a function for beginscan_internal and replace the rest of the functions' usages with beginscan_internal. There are also many bool flags that are passed to beginscan_internal; I will try to optimize those as well. > - Some of the exposed functionality, e.g. scan_getpage, > scan_update_snapshot, scan_rescan_set_params looks like it should just > be excised, i.e. there's no reason for it to exist. > Currently these APIs are used only in Bitmap and Sample scans. These scan methods fully depend on the heap format. I will check how to remove these APIs. - Minor: don't think the _function suffix for Storis necessary, just > makes things long, and every member has it. Besides that, it's also > easy to misunderstand - for a second I understood > scan_getnext_function to be about getting the next function... > OK. How about adding _hook? > - Scans are still represented as HeapScanDesc - I don't think that's > going to fly. Either this needs to be an opaque type (i.e. a struct > that's not defined, just forward declared), or it needs to be a base > struct that individual AMs embed in their own structs. Individual AMs > definitely are going to need different pieces of data. > Currently the internal members of the HeapScanDesc are directly used in many places, especially in Bitmap and Sample scans. I have yet to write the code in the best way to handle these scan methods; then removing their usage will be easy. 
> Storage AM - tuple stuff: > > - tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are > each individual functions, that seems pretty painful to maintain, and > v/ likely to just grow and grow. Not sure what the solution is, but > this seems like a hard sell. > OK. How about adding one API that takes some flags representing what type of data is needed from the tuple, and returns the corresponding data as void *? The caller must cast the data to the corresponding type before using it. - The three *speculative* functions don't quite seem right to me, nor do > I understand: > +* > +* Setting a tuple's speculative token is a slot-only operation, > so no need > +* for a storage AM method, but after inserting a tuple containing > a > +* speculative token, the insertion must be completed by these > routines: > +*/ > I don't see anything related to
Re: [HACKERS] Pluggable storage
On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi wrote: > > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > wrote: >> >> >> Also, it is quite possible that some of the storage Am's don't even >> want to return bool as a parameter from HeapTupleSatisfies* API's. I >> guess what we need here is to provide a way so that different storage >> am's can register their function pointer for an equivalent to >> satisfies function. So, we need to change >> SnapshotData.SnapshotSatisfiesFunc in some way so that different >> handlers can register their function instead of using that directly. >> I think that should address the problem you are planning to solve by >> omitting buffer parameter. > > > Thanks for your suggestion. Yes, it is better to go in the direction of > SnapshotSatisfiesFunc. > > I verified the above idea of implementing the Tuple visibility functions > and assign them into the snapshotData structure based on the snapshot. > > The Tuple visibility functions that are specific to the relation are > available > with the RelationData structure and this structure may not be available, > Which functions are you referring here? I don't see anything in tqual.h that uses RelationData. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila wrote: > On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi > wrote: > >> > >> Why do we need to store handler function in TupleDesc? As of now, the > >> above patch series has it available in RelationData and > >> TupleTableSlot, I am not sure if instead of that keeping it in > >> TupleDesc is a good idea. Which all kind of places require TupleDesc > >> to contain handler? If those are few places, can we think of passing > >> it as a parameter? > > > > > > Till now I am to able to proceed without adding any storage handler > > functions to > > TupleDesc structure. Sure, I will try the way of passing as a parameter > when > > there is a need of it. > > > > Okay, I think it is better if you discuss such locations before > directly modifying those. > Sure. I will check with the community before making any such changes. > > During the progress of the patch, I am facing problems in designing the > storage API > regarding the Buffer. For example To replace all the > HeapTupleSatisfiesMVCC > and > related functions with function pointers, In HeapTuple format, the tuple > may > belongs > to one buffer, so the buffer is passed to the HeapTupleSatifisifes*** > functions along > with buffer, But in case of other storage formats, the single buffer may > not > contains > the actual data. > > Also, it is quite possible that some of the storage Am's don't even > want to return bool as a parameter from HeapTupleSatisfies* API's. I > guess what we need here is to provide a way so that different storage > am's can register their function pointer for an equivalent to > satisfies function. So, we need to change > SnapshotData.SnapshotSatisfiesFunc in some way so that different > handlers can register their function instead of using that directly. > I think that should address the problem you are planning to solve by > omitting buffer parameter. > Thanks for your suggestion. 
Yes, it is better to go in the direction of SnapshotSatisfiesFunc.

I verified the above idea of implementing the tuple visibility functions
and assigning them into the SnapshotData structure based on the snapshot.

The tuple visibility functions that are specific to the relation are
available with the RelationData structure, and this structure may not be
available, so I changed the SnapshotData structure to hold an enum that
represents what type of snapshot it is, instead of storing a pointer to
the tuple visibility function. Whenever there is a need to check tuple
visibility, the storage AM handler pointer corresponding to the snapshot
type is called and the result is obtained as before.

>> This buffer is used to set the hint bits and mark the buffer as dirty.
>> In case the buffer is not available, the performance of the following
>> queries may suffer if the hint bits are not set.
>
> I don't think it is advisable to change that for the current heap.

I didn't change the prototype of the existing functions. Currently the
tuple visibility functions assume that the Buffer is always valid, but
that may not be correct depending on the storage.

>> And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
>> related functions to check the tuple visibility, but currently
>> returning a buffer from the above heap_** functions is not possible
>> for other formats.
>
> Why not? I mean, if we consider that all the formats we are worried
> about at this stage have a TID (block number, tuple location), then we
> can get the buffer. We might want to consider passing the TID as a
> parameter to these APIs if required to make that possible. You also
> agreed above [1] that we can first design the API considering storage
> formats having TID.

The current approach is to support the storages that support TIDs.
But what I mean here is that in some storage methods (for example, column
storage), the tuple is not present in a single buffer; the tuple data may
be computed from many buffers before returning the slot/storageTuple
(unless we change everything to slots). Any code after the storage methods
that expects a valid Buffer may need changes to first check whether it is
valid and act accordingly.

>> And also for the HeapTuple data, the tuple data is copied into a
>> palloc'ed buffer instead of pointing directly to the page. So, is
>> returning a Buffer valid here or not?
>
> Yeah, but I think for the sake of compatibility and not changing too
> much in the current APIs' signatures, we should try to avoid it.

Currently I am trying to avoid changing the current APIs' signatures. Most
of the signature changes are something like HeapTuple -> StorageTuple and
so on.

>> Currently I am proceeding to remove the Buffer as a parameter in the
>> API and proceed further. In case it affects the performance, we need
>> to find out a
Re: [HACKERS] Pluggable storage
Hi,

On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote:
> Here I attached WIP patches to support pluggable storage. The patch
> series may not work individually. Still so many things are under
> development. These patches are just to share the approach of the
> current development.

Making a pass through the patchset to get a feel for where this is at,
and where this is headed. I previously skimmed the thread to get a rough
sense of what's discussed, but not in a very detailed manner.

General:

- I think one important discussion we need to have is what kind of
  performance impact we're going to accept introducing this. It seems
  very likely that this'll cause some slowdown. We can kind of alleviate
  that by doing some optimizations at the same time, but nevertheless,
  this abstraction is going to cost.

- I don't think we should introduce this without a user besides heapam.
  The likelihood that the API will be usable by anything else without a
  testcase seems fairly remote. I think some src/test/modules type
  implementation of a per-session, in-memory storage - relatively easy
  to implement - or such is necessary.

- I think, and detailed some of that, we're going to need some cleanups
  that go in before this, to decrease the size / increase the quality of
  the new APIs. It's going to get more painful to change APIs
  subsequently.

- We'll have to document clearly that these APIs are going to change for
  a while, even after the release introducing them.

StorageAm - Scan stuff:

- I think the API isn't quite right. There's a lot of granular callback
  functionality like scan_begin_function / scan_begin_catalog /
  scan_begin_bm - these largely are convenience wrappers around the same
  function, and I don't think that would, or rather should, change in
  any abstracted storage layer. So I think there needs to be some
  unification first (pretty close w/ beginscan_internal already, but
  perhaps we should get rid of a few of these wrappers).

- Some of the exposed functionality, e.g.
  scan_getpage, scan_update_snapshot, scan_rescan_set_params, looks like
  it should just be excised, i.e. there's no reason for it to exist.

- Minor: I don't think the _function suffix for the StorageAm members is
  necessary; it just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

- Scans are still represented as HeapScanDesc - I don't think that's
  going to fly. Either this needs to be an opaque type (i.e. a struct
  that's not defined, just forward declared), or it needs to be a base
  struct that individual AMs embed in their own structs. Individual AMs
  definitely are going to need different pieces of data.

Storage AM - tuple stuff:

- tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are
  each individual functions; that seems pretty painful to maintain, and
  very likely to just grow and grow. Not sure what the solution is, but
  this seems like a hard sell.

- The three *speculative* functions don't quite seem right to me, nor do
  I understand:
  +*
  +* Setting a tuple's speculative token is a slot-only operation, so no need
  +* for a storage AM method, but after inserting a tuple containing a
  +* speculative token, the insertion must be completed by these routines:
  +*/
  I don't see anything related to slots here?

Storage AM - slot stuff:

- I think there's a number of wrapper functions (slot_getattr,
  slot_getallattrs, getsomeattrs, attisnull) around the same
  functionality - that bloats the API and causes slowdowns. Instead we
  need something like slot_virtualize_tuple(int16 upto), and the rest
  should just be wrappers.

- I think it's wrong to have the slot functionality defined on the
  StorageAm level. That'll cause even more indirect function calls (=>
  slowness), and besides that the TupleTableSlot struct will need
  members for each and every AM.
  I think the solution is to instead have an AM-level "create slot"
  function, where the returned slot is allocated by the AM, with a base
  member of TupleTableSlot with basically just tts_nvalid, tts_values,
  tts_isnull as members. Those are the only members that can be accessed
  without functions. Access to the individual functions (say store_tuple)
  would then be direct members of the TupleTableSlot interface. While
  that costs a bit of memory, it removes one indirection from an already
  performance-critical path.

- MinimalTuples should be one type of slot for the above, except it's
  not created by a StorageAm but by a function returning a
  TupleTableSlot. This should remove the need for the
  slot_copy_min_tuple, slot_is_physical_tuple functions.

- Right now TupleTableSlots are an executor datastructure, but these
  patches (rightly!) make it much more widely used. So I think it needs
  to be moved outside of executor/, and probably renamed to something like
Re: [HACKERS] Pluggable storage
On Sun, Aug 13, 2017 at 5:21 PM, Amit Kapila wrote:
> On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi wrote:
>>
>> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
>>>
>>> +typedef struct StorageAmRoutine
>>> +{
>>>
>>> In this structure, you have already covered most of the APIs that a
>>> new storage module needs to provide, but I think there could be more.
>>> One such API could be heap_hot_search. This seems specific to the
>>> current heap where we have the provision of HOT. I think we can
>>> provide a new API tuple_search or something like that.
>>
>> Thanks for the review.
>>
>> Yes, the StorageAmRoutine needs more function pointers. Currently I am
>> adding all the functions that are present in heapam.h and some slot
>> related functions from tuptable.h.
>
> Hmm, this API is exposed via heapam.h. Am I missing something?

Sorry, I was not explaining it clearly in my earlier mail. Yes, you are
right that the heap_hot_search function exists in heapam.h, but I am yet
to add all the exposed functions that are present in the heapam.h file to
the StorageAmRoutine structure.

>> Once I stabilize the code and APIs that are currently added, then I
>> will further enhance it with the remaining functions that are
>> necessary to support the pluggable storage API.
>
> Sure, but I think if we find anything during development/review, then
> we should either add it immediately or at the very least add a FIXME
> in the patch to avoid forgetting the finding.

OK. I will add all the functions that are identified till now.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi wrote:
>
> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
>>
>> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>>>
>>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera wrote:
>>>>
>>>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>>>> that he will be able to further this goal some more.
>>>
>>> Thanks Alvaro for sharing your development patch.
>>>
>>> Most of the patch design is the same as described by Alvaro in the
>>> first mail [1]. I will detail the modifications, pending items and
>>> open items (needing discussion) to implement proper pluggable storage.
>>>
>>> Here I attached WIP patches to support pluggable storage. The patch
>>> series may not work individually. Still so many things are under
>>> development. These patches are just to share the approach of the
>>> current development.
>>
>> +typedef struct StorageAmRoutine
>> +{
>>
>> In this structure, you have already covered most of the APIs that a
>> new storage module needs to provide, but I think there could be more.
>> One such API could be heap_hot_search. This seems specific to the
>> current heap where we have the provision of HOT. I think we can
>> provide a new API tuple_search or something like that.
>
> Thanks for the review.
>
> Yes, the StorageAmRoutine needs more function pointers. Currently I am
> adding all the functions that are present in heapam.h and some slot
> related functions from tuptable.h.

Hmm, this API is exposed via heapam.h. Am I missing something?

> Once I stabilize the code and APIs that are currently added, then I
> will further enhance it with the remaining functions that are
> necessary to support the pluggable storage API.

Sure, but I think if we find anything during development/review, then we
should either add it immediately or at the very least add a FIXME in the
patch to avoid forgetting the finding.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi wrote:
>> Why do we need to store the handler function in TupleDesc? As of now,
>> the above patch series has it available in RelationData and
>> TupleTableSlot; I am not sure if keeping it in TupleDesc instead is a
>> good idea. Which kinds of places require TupleDesc to contain the
>> handler? If those are few places, can we think of passing it as a
>> parameter?
>
> Till now I am able to proceed without adding any storage handler
> functions to the TupleDesc structure. Sure, I will try the way of
> passing it as a parameter when there is a need for it.

Okay, I think it is better if you discuss such locations before directly
modifying those.

> During the progress of the patch, I am facing problems in designing the
> storage API regarding the Buffer. For example, to replace all the
> HeapTupleSatisfiesMVCC and related functions with function pointers: in
> the HeapTuple format, the tuple may belong to one buffer, so the buffer
> is passed to the HeapTupleSatisfies*** functions along with the tuple.
> But in the case of other storage formats, a single buffer may not
> contain the actual data.

Also, it is quite possible that some of the storage AMs don't even want
to return bool as a parameter from the HeapTupleSatisfies* APIs. I guess
what we need here is to provide a way so that different storage AMs can
register their function pointer for an equivalent to the satisfies
function. So, we need to change SnapshotData.SnapshotSatisfiesFunc in
some way so that different handlers can register their function instead
of using that directly. I think that should address the problem you are
planning to solve by omitting the buffer parameter.

> This buffer is used to set the hint bits and mark the buffer as dirty.
> In case the buffer is not available, the performance of the following
> queries may suffer if the hint bits are not set.

I don't think it is advisable to change that for the current heap.
> And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
> related functions to check the tuple visibility, but currently
> returning a buffer from the above heap_** functions is not possible
> for other formats.

Why not? I mean, if we consider that all the formats we are worried about
at this stage have a TID (block number, tuple location), then we can get
the buffer. We might want to consider passing the TID as a parameter to
these APIs if required to make that possible. You also agreed above [1]
that we can first design the API considering storage formats having TID.

> And also for the HeapTuple data, the tuple data is copied into a
> palloc'ed buffer instead of pointing directly to the page. So, is
> returning a Buffer valid here or not?

Yeah, but I think for the sake of compatibility and not changing too much
in the current APIs' signatures, we should try to avoid it.

> Currently I am proceeding to remove the Buffer as a parameter in the
> API and proceed further. In case it affects the performance, we need
> to find out a different approach in handling the hint bits.

Leaving aside the performance concern, I am not convinced that it is a
good idea to remove Buffer as a parameter from the APIs you have
mentioned above. Would you mind thinking once again, keeping the
suggestions provided above in this email, to see if we can avoid removing
Buffer as a parameter?

[1] - https://www.postgresql.org/message-id/CAJrrPGd8%2Bi8sqZCdhfvBhs2d1akEb_kEuBvgRHSPJ9z2Z7VBJw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>>> that he will be able to further this goal some more.
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is the same as described by Alvaro in the
>> first mail [1]. I will detail the modifications, pending items and
>> open items (needing discussion) to implement proper pluggable storage.
>>
>> Here I attached WIP patches to support pluggable storage. The patch
>> series may not work individually. Still so many things are under
>> development. These patches are just to share the approach of the
>> current development.
>
> +typedef struct StorageAmRoutine
> +{
>
> In this structure, you have already covered most of the APIs that a
> new storage module needs to provide, but I think there could be more.
> One such API could be heap_hot_search. This seems specific to the
> current heap where we have the provision of HOT. I think we can
> provide a new API tuple_search or something like that.

Thanks for the review.

Yes, the StorageAmRoutine needs more function pointers. Currently I am
adding all the functions that are present in heapam.h and some slot
related functions from tuptable.h. Once I stabilize the code and APIs
that are currently added, then I will further enhance it with the
remaining functions that are necessary to support the pluggable storage
API.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Mon, Aug 7, 2017 at 11:12 PM, Amit Kapila wrote:
> On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi wrote:
>>
>> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
>>>
>>> > 1. Design an API that returns values/nulls array and convert that
>>> > into a HeapTuple whenever it is required in the upper layers. All
>>> > the existing heap form/deform tuples are used for every tuple with
>>> > some adjustments.
>>>
>>> So, this would have the additional cost of form/deform. Also, how
>>> would it have fewer changes as compared to what you have described
>>> earlier?
>>
>> Yes, it has the additional cost of form/deform. It is the same
>> approach that was described earlier. But I had an intention of
>> modifying everywhere the HeapTuple is accessed. With the other
>> prototype changes of removing HeapTuple usage from triggers, I
>> realized that it needs a clear design to proceed further, instead of
>> combining those changes with the pluggable storage API.
>>
>> - heap_getnext is kept as it is and is used only for system table
>>   access.
>> - heap_getnext_slot is introduced to return the slot whenever data is
>>   found, otherwise NULL. This function is used in all the places from
>>   the executor and so on.
>> - The TupleTableSlot structure is modified to contain a void* tuple
>>   instead of HeapTuple. It also contains the storage handler
>>   functions.
>> - heap_insert and similar functions can take a Slot as an argument
>>   and perform the insert operation.
>>
>> In the cases where a TupleTableSlot cannot be sent, form a HeapTuple
>> from the data and send it, and also note down that it is HeapTuple
>> data, not the tuple from the storage.
>>
>> ..
>>
>>> > 3. Design an API that returns StorageTuple(void *) but the
>>> > necessary format information of that tuple can be got from the
>>> > tupledesc.
>>> > Wherever the tuple is present, there exists a tupledesc in most of
>>> > the cases. How about adding some kind of information in tupledesc
>>> > to find out the tuple format and call the necessary functions?
>>
>> heap_getnext returns StorageTuple instead of HeapTuple. The tuple
>> type information is available in the TupleDesc structure.
>>
>> All heap_insert and similar functions accept TupleTableSlot as input
>> and perform the insert operation. This approach is almost the same as
>> the first approach, except the storage handler functions are stored
>> in TupleDesc.
>
> Why do we need to store the handler function in TupleDesc? As of now,
> the above patch series has it available in RelationData and
> TupleTableSlot; I am not sure if keeping it in TupleDesc instead is a
> good idea. Which kinds of places require TupleDesc to contain the
> handler? If those are few places, can we think of passing it as a
> parameter?

Till now I am able to proceed without adding any storage handler
functions to the TupleDesc structure. Sure, I will try the way of passing
it as a parameter when there is a need for it.

During the progress of the patch, I am facing problems in designing the
storage API regarding the Buffer. For example, to replace all the
HeapTupleSatisfiesMVCC and related functions with function pointers: in
the HeapTuple format, the tuple may belong to one buffer, so the buffer
is passed to the HeapTupleSatisfies*** functions along with the tuple.
But in the case of other storage formats, a single buffer may not
contain the actual data.

This buffer is used to set the hint bits and mark the buffer as dirty. In
case the buffer is not available, the performance of the following
queries may suffer if the hint bits are not set.
And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
related functions to check the tuple visibility, but currently returning
a buffer from the above heap_** functions is not possible for other
formats. And also for the HeapTuple data, the tuple data is copied into a
palloc'ed buffer instead of pointing directly to the page. So, is
returning a Buffer valid here or not?

Currently I am proceeding to remove the Buffer as a parameter in the API
and proceed further. In case it affects the performance, we need to find
out a different approach in handling the hint bits.

Comments?

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>
> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera wrote:
>>
>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>> that he will be able to further this goal some more.
>
> Thanks Alvaro for sharing your development patch.
>
> Most of the patch design is the same as described by Alvaro in the
> first mail [1]. I will detail the modifications, pending items and open
> items (needing discussion) to implement proper pluggable storage.
>
> Here I attached WIP patches to support pluggable storage. The patch
> series may not work individually. Still so many things are under
> development. These patches are just to share the approach of the
> current development.

+typedef struct StorageAmRoutine
+{

In this structure, you have already covered most of the APIs that a new
storage module needs to provide, but I think there could be more. One
such API could be heap_hot_search. This seems specific to the current
heap where we have the provision of HOT. I think we can provide a new API
tuple_search or something like that.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi wrote:
>
> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
>>
>> > 1. Design an API that returns values/nulls array and convert that
>> > into a HeapTuple whenever it is required in the upper layers. All
>> > the existing heap form/deform tuples are used for every tuple with
>> > some adjustments.
>>
>> So, this would have the additional cost of form/deform. Also, how
>> would it have fewer changes as compared to what you have described
>> earlier?
>
> Yes, it has the additional cost of form/deform. It is the same approach
> that was described earlier. But I had an intention of modifying
> everywhere the HeapTuple is accessed. With the other prototype changes
> of removing HeapTuple usage from triggers, I realized that it needs a
> clear design to proceed further, instead of combining those changes
> with the pluggable storage API.
>
> - heap_getnext is kept as it is and is used only for system table
>   access.
> - heap_getnext_slot is introduced to return the slot whenever data is
>   found, otherwise NULL. This function is used in all the places from
>   the executor and so on.
> - The TupleTableSlot structure is modified to contain a void* tuple
>   instead of HeapTuple. It also contains the storage handler functions.
> - heap_insert and similar functions can take a Slot as an argument and
>   perform the insert operation.
>
> In the cases where a TupleTableSlot cannot be sent, form a HeapTuple
> from the data and send it, and also note down that it is HeapTuple
> data, not the tuple from the storage.
>
> ..
>
>> > 3. Design an API that returns StorageTuple(void *) but the necessary
>> > format information of that tuple can be got from the tupledesc.
>> > Wherever the tuple is present, there exists a tupledesc in most of
>> > the cases.
>> > How about adding some kind of information in tupledesc to find out
>> > the tuple format and call the necessary functions?
>
> heap_getnext returns StorageTuple instead of HeapTuple. The tuple type
> information is available in the TupleDesc structure.
>
> All heap_insert and similar functions accept TupleTableSlot as input
> and perform the insert operation. This approach is almost the same as
> the first approach, except the storage handler functions are stored in
> TupleDesc.

Why do we need to store the handler function in TupleDesc? As of now, the
above patch series has it available in RelationData and TupleTableSlot; I
am not sure if keeping it in TupleDesc instead is a good idea. Which
kinds of places require TupleDesc to contain the handler? If those are
few places, can we think of passing it as a parameter?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
> On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi wrote:
>>
>> I am finding out that eliminating the HeapTuple usage in the upper
>> layers needs some major changes. How about not changing anything in
>> the upper layers of storage currently and just supporting the
>> pluggable tuple with one of the following approaches for the first
>> version?
>
> It is not very clear to me how any of the below alternatives are better
> or worse as compared to the approach you are already working on. Do you
> mean to say that we will get away without changing all the places which
> take HeapTuple as input or return it as output with some of the below
> approaches?

Yes. With the following approaches, my intention is to reduce the changes
of HeapTuple everywhere, to support the pluggable storage API with
minimal changes and improve further later. But unless we change the
HeapTuple usage everywhere properly, maybe we can't see the benefits of
pluggable storages.

>> 1. Design an API that returns values/nulls array and convert that
>> into a HeapTuple whenever it is required in the upper layers. All the
>> existing heap form/deform tuples are used for every tuple with some
>> adjustments.
>
> So, this would have the additional cost of form/deform. Also, how
> would it have fewer changes as compared to what you have described
> earlier?

Yes, it has the additional cost of form/deform. It is the same approach
that was described earlier. But I had an intention of modifying
everywhere the HeapTuple is accessed. With the other prototype changes of
removing HeapTuple usage from triggers, I realized that it needs a clear
design to proceed further, instead of combining those changes with the
pluggable storage API.

- heap_getnext is kept as it is and is used only for system table
  access.
- heap_getnext_slot is introduced to return the slot whenever data is
  found, otherwise NULL. This function is used in all the places from
  the executor and so on.
- The TupleTableSlot structure is modified to contain a void* tuple
  instead of HeapTuple. It also contains the storage handler functions.
- heap_insert and similar functions can take a Slot as an argument and
  perform the insert operation.

In the cases where a TupleTableSlot cannot be sent, form a HeapTuple from
the data and send it, and also note down that it is HeapTuple data, not
the tuple from the storage.

> 2. Design an API that returns StorageTuple(void *) with the first
> member representing the TYPE of the storage, so that corresponding
> registered function calls can be called to deform/form the tuple
> whenever there is a need for the tuple.
>
> Do you intend to say that we store such information in the disk tuple
> or only in the in-memory version of the same?

Only the in-memory version.

> Also, what makes you think that we would need hooks only for form and
> deform? Right now, in many cases the tuple will directly point to a
> disk page, and we deal with it by retaining the pin on the
> corresponding buffer; what if some kinds of tuple don't follow that
> rule? For example, to support in-place updates, we might always need a
> separate copy of the tuple rather than one pointing to a disk page.

In any of the approaches, except for system tables, we are going to
remove the direct disk access of the tuple, either by replacing the tuple
with a slot or something similar. Otherwise it will be difficult to
support, and it also doesn't provide much performance benefit. All the
storage handler functions need to be maintained separately and accessed
using the handler ID.

heap_getnext returns StorageTuple and not HeapTuple. The StorageTuple can
be mapped to HeapTuple or another tuple based on the first member's type.
heap_insert and similar functions take a StorageTuple as input and
internally decide, based on the type of the tuple, how to perform the
insert operation. Instead of storing the handler functions inside a
relation/slot, the function pointers can be accessed directly based on
the storage type data. This works in every case where the tuple is
accessed, but the problem is that it may need changes wherever the tuple
is accessed.

> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be got from the tupledesc.
> Wherever the tuple is present, there exists a tupledesc in most of the
> cases. How about adding some kind of information in tupledesc to find
> out the tuple format and call the necessary functions?

heap_getnext returns StorageTuple instead of HeapTuple. The tuple type
information is available in the TupleDesc structure.

All heap_insert and similar functions accept TupleTableSlot as input and
perform the insert operation. This approach is almost the same as the
first approach, except the storage handler functions are stored in
Re: [HACKERS] Pluggable storage
On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi wrote:
>
> On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas wrote:
>>
> I am finding out that eliminating the HeapTuple usage in the upper
> layers needs some major changes. How about not changing anything in the
> upper layers of storage currently and just supporting the pluggable
> tuple with one of the following approaches for the first version?

It is not very clear to me how any of the below alternatives are better
or worse as compared to the approach you are already working on. Do you
mean to say that we will get away without changing all the places which
take HeapTuple as input or return it as output with some of the below
approaches?

> 1. Design an API that returns values/nulls array and convert that into
> a HeapTuple whenever it is required in the upper layers. All the
> existing heap form/deform tuples are used for every tuple with some
> adjustments.

So, this would have the additional cost of form/deform. Also, how would
it have fewer changes as compared to what you have described earlier?

> 2. Design an API that returns StorageTuple(void *) with the first
> member representing the TYPE of the storage, so that corresponding
> registered function calls can be called to deform/form the tuple
> whenever there is a need for the tuple.

Do you intend to say that we store such information in the disk tuple or
only in the in-memory version of the same? Also, what makes you think
that we would need hooks only for form and deform? Right now, in many
cases the tuple will directly point to a disk page, and we deal with it
by retaining the pin on the corresponding buffer; what if some kinds of
tuple don't follow that rule? For example, to support in-place updates,
we might always need a separate copy of the tuple rather than one
pointing to a disk page.

> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be got from the tupledesc.
wherever > the tuple is present, there exists a tupledesc in most of the cases. How > about adding some kind of information in tupledesc to find out the tuple > format and call the necessary functions > > 4. Design an API to return always the StorageTuple and it converts to > HeapTuple with a function hook if it gets registered (for heap storages > this is not required to register the hook, because it is already a HeapTuple > format). This function hook should be placed in the heap form/deform > functions. > I think some more information is required to comment on any of the approaches or suggest a new one. You might want to try quoting some specific examples from the code so that it is easy to understand what your proposal will change in that case. One idea could be that we start with some interfaces like structures TupleTableSlot, EState, HeapScanDescData, IndexScanDescData, etc. and interfaces like heap_insert, heap_update, heap_lock_tuple, SnapshotSatisfiesFunc, EvalPlanQualFetch, etc. Now, it is quite possible that we don't want to change some of these interfaces, but it can help to see how such a usage can be replaced with a new kind of tuple structure. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Jul 19, 2017 at 10:56 AM, Robert Haas wrote: >> I strongly agree. I simply don't understand how you can adopt UNDO for >> MVCC, and yet expect to get a benefit commensurate with the effort >> without also implementing "retail index tuple deletion" first. > > I agree that we need retail index tuple deletion. I liked Claudio's > idea at > http://postgr.es/m/cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com > -- that seems indispensable to making retail index tuple deletion > reasonably efficient. Is anybody going to work on getting that > committed? I will do review work on it. IMV the main problems are: * The way a "header" is added at the PageAddItemExtended() level, rather than making heap TID something much closer to a conventional attribute that perhaps only nbtree and indextuple.c have special knowledge of, strikes me as the wrong way to go. * It's simply not acceptable to add overhead to *all* internal items. That kills fan-in. We're going to need suffix truncation for the common case where the user-visible attributes for a split point/new high key at the leaf level sufficiently distinguish what belongs on either side. IOW, you should only see internal items with a heap TID in the uncommon case where you have so many duplicates at the leaf level that you have no choice but to use a split point that's right in the middle of many duplicates. Fortunately, if we confine ourselves to making heap TID part of the keyspace, the code can be far simpler than what would be needed to get my preferred, all-encompassing design for suffix truncation [1] to work. I think we could just stash the number of attributes participating in a comparison within internal pages' unused item pointer offset. I've talked about this before, in the context of Anastasia's INCLUDED columns patch. If we can have a variable number of attributes for heap tuples, we can do so for index tuples, too. 
* We might also have problems with changing the performance characteristics for the worse in some cases by some measures. This will probably technically increase the amount of bloat for some indexes with sparse deletion patterns. I think that that will be well worth it, but I don't expect a slam dunk. A nice benefit of this work is that it lets us kill the hack that adds randomness to the search for free space among duplicates, and may let us follow the Lehman & Yao algorithm more closely. [1] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 8:58 PM, Peter Geoghegan wrote: > To repeat myself, for emphasis: *Not all bloat is equal*. +1. > I strongly agree. I simply don't understand how you can adopt UNDO for > MVCC, and yet expect to get a benefit commensurate with the effort > without also implementing "retail index tuple deletion" first. I agree that we need retail index tuple deletion. I liked Claudio's idea at http://postgr.es/m/cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com -- that seems indispensable to making retail index tuple deletion reasonably efficient. Is anybody going to work on getting that committed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 6:36 PM, Alexander Korotkov wrote: > I think in general there are two ways dealing with our index AM API > limitation. One of them is to extend index AM API. That's pretty much what I have in mind. I think it's likely that if we end up with, say, 3 kinds of heap and 12 kinds of index, there will be some compatibility matrix. Some index AMs will be compatible with some heap AMs, and others won't be. For example, if somebody makes an IOT-type heap using the API proposed here or some other one, BRIN probably won't work at all. btree, on the other hand, could probably be made to work, perhaps with some greater or lesser degree of modification. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas wrote: > On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi > wrote: > > To replace tuple with slot, I took trigger and SPI calls as the first > step > > in modifying > > from tuple to slot, Here I attached a WIP patch. The notable changes are, > > > > 1. Replace most of the HeapTuple with Slot in SPI interface functions. > > 2. In SPITupleTable, Instead of HeapTuple, it is changed to > TupleTableSlot. > > But this change may not be a proper approach, because a duplicate copy of > > TupleTableSlot is generated and stored. > > 3. Changed all trigger interfaces to accept TupleTableSlot Instead of > > HeapTuple. > > 4. ItemPointerData is added as a member to the TupleTableSlot structure. > > 5. Modified the ExecInsert and others to work directly on TupleTableSlot > > instead > > of tuple(not completely). > > What problem are you trying to solve with these changes? I'm not > saying that it's a bad idea, but I think you should spell out the > motivation a little more explicitly. > Sorry for not providing complete details. I am trying these experiments to find out the best way to return the tuple from Storage methods by designing a proper API. The changes that I am making are to reduce the dependency on the HeapTuple format in favor of values/nulls arrays. So if there is no dependency on the HeapTuple format in the upper layers of the PostgreSQL storage, then we can directly define the StorageAPI to return the values/nulls array instead of one big chunk of tuple data like HeapTuple or StorageTuple(void *). I am finding out that eliminating the HeapTuple usage in the upper layers needs some major changes. How about not changing anything in the upper layers of storage currently and just supporting the pluggable tuple with one of the following approaches for the first version? 1. Design an API that returns values/nulls arrays and converts them into a HeapTuple whenever it is required in the upper layers. 
All the existing heap form/deform tuples are used for every tuple with some adjustments. 2. Design an API that returns StorageTuple(void *) whose first member represents the TYPE of the storage, so that corresponding registered function calls can be called to deform/form the tuple whenever there is a need for the tuple. 3. Design an API that returns StorageTuple(void *) but the necessary format information of that tuple can be obtained from the tupledesc. Wherever the tuple is present, a tupledesc exists in most cases. How about adding some kind of information in the tupledesc to find out the tuple format and call the necessary functions? 4. Design an API that always returns the StorageTuple and converts it to a HeapTuple with a function hook if one gets registered (for heap storages there is no need to register the hook, because it is already in HeapTuple format). This function hook should be placed in the heap form/deform functions. Any other better ideas for a first version? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 1:24 PM, Alexander Korotkov wrote: > It's probably depends on particular storage (once we have pluggable > storages). Some storages would have additional level of indirection while > others wouldn't. Agreed. Like kill_prior_tuple, it's an optional capability, and where implemented is implemented in a fairly consistent way. > But even if unique index contain no true duplicates, it's > still possible that true delete happen. Then we still have to delete tuple > even from unique index. I think I agree. I've been looking over the ARIES paper [1] again today. They say this: "For index updates, in the interest of increasing concurrency, we do not want to prevent the space released by one transaction from being consumed by another before the commit of the first transaction." You can literally reclaim space from an index tuple deletion *immediately* with their design, which matters because you want to reclaim space as early as possible, before a page split is needed. Obviously they understand how important this is. This might not work so well with an MVCC system, where there are no 2PL predicate locks. You need to keep a "ghost record", even for non-unique indexes, and the deletion can only happen when the xact commits. Remember, the block number cannot be used to see if there were changes against the page, unlike the heap, because you have to worry about page splits and page merges/deletion. UNDO is entirely logical for indexes for this reason. (This is why UNDO does not actually undo page splits, relation extension, etc. Only REDO/WAL always works at the level of individual pages in all cases. UNDO for MVCC is not as different from our design as I once thought.) The reason I want to at least start with unique indexes is because you need a TID to make non-unique/secondary indexes have unique keys (unique keys are always needed if retail index tuple insertion is always supported). 
For unique indexes, you really can do an update in the index (see my design below for one example of how that can work), but I think you need something more like a deletion followed by an insertion for non-unique indexes, because there the physical/heap TID changed, and that's part of the key, and that might belong on a different page. You therefore haven't really fixed the problem with secondary indexes sometimes needing new index tuples even though user visible attributes weren't updated. You haven't fixed the problem with secondary indexes, unless, of course, all secondary indexes have logical pointers to begin with, such as the PK value. Then you only need to "insert and delete, not update" when the PK value is updated or when a secondary index needs a new index tuple with user visible attribute values distinct from the previous version's -- you fix the secondary index problem. And, while your "version chain overflow indirection" structure is basically something that lives outside the heap, it is still only needed for one index, and not all of them. This new indirection structure is a really nice target for pruning, because you can prune physical TIDs that no possible snapshot could use, unlike with the heap, where EvalPlanQual() could make any heap tuple visible to snapshots at or after the minimal snapshot horizon implied by RecentGlobalXmin. And index scans on any index can prune for everyone. You could also do "true index deletes", as you suggest, but you'd need to have ghost records there too, and you'd need an asynchronous cleanup process to do the cleanup when the deleting xact committed. I'm not sure if it's worth doing that eagerly. It may or may not be better to hope for kill_prior_tuple to do the job for us. Not sure where this leaves index-only scans on secondary indexes..."true index deletes" might be justified by making index-only scans work more often in general, especially for secondary indexes with logical pointers. 
I'm starting to think that you were right all along about indirect indexes needing to store PK values. Perhaps we should just bite the bullet...it's not like places like the bufpage.c index routines actually know or care about whether or not the index tuple has a TID, what a TID is, etc. They care about stuff like the header values of index tuples, and the page ItemId array, but TID is, as you put it, merely payload. > It's possible to add indirection layer "on demand". Thus, initially index > tuples point directly to the heap tuple. If tuple gets updates and doesn't > fit to the page anymore, then it's moved to another place with redirect in > the old place. I think that if carefully designed, it's possible to > guarantee there is at most one redirect. This is actually what I was thinking. Here is a sketch: When you start out, index tuples in nbtree are the same as today -- one physical pointer (TID). But, on the first update to a PK index, they grow a new pointer, but this is not a physical/heap TID. It's a pointer to some kind of
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 7:51 PM, Peter Geoghegan wrote: > On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov > wrote: > > I think that "retail index tuple deletion" is the feature which could > give > > us some advantages even independently from pluggable storages. For > example, > > imagine very large table with only small amount of dead tuples. In this > > case, it would be cheaper to delete index links to those dead tuples one > by > > one using "retail index tuple deletion", rather than do full scan of > every > > index to perform "bulk delete" of index tuples. One may argue that you > > shouldn't do vacuum of large table when only small amount of tuples are > > dead. But in terms of index bloat mitigation, very aggressive vacuum > > strategy could be justified. > > Yes, definitely. Especially with the visibility map. Even still, I > tend to think that for unique indexes, true duplicates should be > disallowed, and dealt with with an additional layer of indirection. So > this would be for secondary indexes. > It probably depends on the particular storage (once we have pluggable storages). Some storages would have an additional level of indirection while others wouldn't. But even if a unique index contains no true duplicates, it's still possible that a true delete happens. Then we still have to delete the tuple even from a unique index. >> I agree with Robert that being able to store an arbitrary payload as a > >> TID is probably not going to ever work very well. > > > > > > Support of arbitrary payload as a TID doesn't sound easy. However, that > > doesn't mean it's unachievable. For me, it's more like long way which > could > > be traveled step by step. > > To be fair, it probably is achievable. Where there is a will, there is > a way. I just think that it will be easier to find a different way of > realizing similar benefits. 
I'm mostly talking about benefits around > making it cheap to have many secondary indexes by having logical > indirection instead of physical pointers (doesn't *have* to be > user-visible primary key values). It's possible to add an indirection layer "on demand". Thus, initially index tuples point directly to the heap tuple. If a tuple gets updated and doesn't fit on the page anymore, then it's moved to another place with a redirect in the old place. I think that if carefully designed, it's possible to guarantee there is at most one redirect. But I still think that avoiding an arbitrary payload for indexes is delaying the inevitable, if only we want pluggable storages and want them to reuse existing index AMs. So, for example, an arbitrary payload together with the ability to update this payload allows us to make indexes separately versioned (have a separate garbage collection process more or less unrelated to the heap). Despite the overhead caused by MVCC attributes, I think such indexes could give significant advantages in various workloads. > HOT simply isn't effective enough at > preventing UPDATE index tuple insertions for indexes on unchanged > attributes, often just because pruning can fail to happen in time, > which WARM will not fix. > Right. I think HOT and WARM depend on factors which are hard to control: distribution of UPDATEs between heap pages, oldest snapshot and so on. It's quite hard for a DBA to understand why a table starts getting bloated while it didn't before. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov wrote: > I think that "retail index tuple deletion" is the feature which could give > us some advantages even independently from pluggable storages. For example, > imagine very large table with only small amount of dead tuples. In this > case, it would be cheaper to delete index links to those dead tuples one by > one using "retail index tuple deletion", rather than do full scan of every > index to perform "bulk delete" of index tuples. One may argue that you > shouldn't do vacuum of large table when only small amount of tuples are > dead. But in terms of index bloat mitigation, very aggressive vacuum > strategy could be justified. Yes, definitely. Especially with the visibility map. Even still, I tend to think that for unique indexes, true duplicates should be disallowed, and dealt with via an additional layer of indirection. So this would be for secondary indexes. >> I agree with Robert that being able to store an arbitrary payload as a >> TID is probably not going to ever work very well. > > > Support of arbitrary payload as a TID doesn't sound easy. However, that > doesn't mean it's unachievable. For me, it's more like long way which could > be traveled step by step. To be fair, it probably is achievable. Where there is a will, there is a way. I just think that it will be easier to find a different way of realizing similar benefits. I'm mostly talking about benefits around making it cheap to have many secondary indexes by having logical indirection instead of physical pointers (doesn't *have* to be user-visible primary key values). HOT simply isn't effective enough at preventing UPDATE index tuple insertions for indexes on unchanged attributes, often just because pruning can fail to happen in time, which WARM will not fix. -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Sun, Jul 16, 2017 at 3:58 AM, Peter Geoghegan wrote: > I strongly agree. I simply don't understand how you can adopt UNDO for > MVCC, and yet expect to get a benefit commensurate with the effort > without also implementing "retail index tuple deletion" first. > Pursuing UNDO this way has the same problem that WARM likely has -- it > doesn't really help with the worst case, where users get big, > unpleasant surprises. Postgres is probably the only major database > system that doesn't support retail index tuple deletion. It's a basic > thing, that has nothing to do with MVCC. Really, what do we have to > lose? > I think that "retail index tuple deletion" is the feature which could give us some advantages even independently from pluggable storages. For example, imagine a very large table with only a small amount of dead tuples. In this case, it would be cheaper to delete index links to those dead tuples one by one using "retail index tuple deletion", rather than do a full scan of every index to perform a "bulk delete" of index tuples. One may argue that you shouldn't vacuum a large table when only a small number of tuples are dead. But in terms of index bloat mitigation, a very aggressive vacuum strategy could be justified. > I agree with Robert that being able to store an arbitrary payload as a > TID is probably not going to ever work very well. Support of an arbitrary payload as a TID doesn't sound easy. However, that doesn't mean it's unachievable. For me, it's more like a long way which could be traveled step by step. Some of our existing index access methods (B-tree, hash, GiST, SP-GiST) may support arbitrary payload relatively easily, because they do not rely on its internal structure. For others (GIN, BRIN) arbitrary payload is much harder to support, but I wouldn't say it's impossible. However, if we make arbitrary payload support an option of the index AM API and implement this support for the first group of index AMs, it would already be a great step forward. 
So, for example, it would be possible to use indirect indexes when the primary key is not 6 bytes, if the index AM supports arbitrary payload. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 3:36 PM, Alexander Korotkov wrote: > I think that pruning and vacuum are artifacts of not only current heap > formats, but they are also artifacts of current index AM API. And this is > more significant circumstance given that we're going to preserve > compatibility of new storage engines with current index AMs. Our current > index AM API assumes that we can delete from index only in bulk manner. Our > payload to index key is TID, not arbitrary piece of data. And that payload > can't be updated. I agree that this is a big set of problems. This is where I think we can get the most benefit. One nice thing about having a fully logical TID is that you don't have to bloat indexes just because a HOT update was not possible. You bloat something else instead, certainly, but that can be optimized for garbage collection. Not all bloat is equal. MVCC based on UNDO can be very fast because UNDO is very well optimized for garbage collection, and so can be bloated with no long term consequences, and minor short term consequences. Still, it isn't fair to blame the Postgres VACUUM design for the fact that Postgres bloats indexes so easily. Nothing stops us from adding a new layer of indirection, so that bloating an index degrades into something that is not even as bad as bloating the heap [1]. We may just have a data structure problem, which is not nearly the same thing as a fundamental design problem in the storage layer. This approach could pay off soon if we start with unique indexes, where there is a "queue" of row versions that can be pruned with simple logic, temporal locality helps a lot, only zero or one versions can be visible to your MVCC snapshot, etc. This might require only minimal revisions to the index AM API, to help nbtree. We could later improve this so that you bloat UNDO instead of bloating a heap-like structure, both for indexes and for the actual heap. That seems less urgent. To repeat myself, for emphasis: *Not all bloat is equal*. 
Index bloat makes the way a B-Tree's keyspace is split up far too granular, making pages sparsely packed, a problem that is more or less *irreversible* by VACUUM or any garbage collection process [2]. That's how B-Trees work -- they're optimized for concurrency, not for garbage collection. >> InnoDB isn't much like the PostgreSQL heap, and >> neither is SQL Server, IIUC. If we're creating a heap format that can >> only be different in trivial ways from what we have now, and anything >> that really changes the paradigm is off-limits, well, that's not >> really interesting enough to justify the work of creating a heap >> storage API. > > > My concern is that we probably can't do anything that really changes > paradigm while preserving compatibility with index AM API. If you don't > agree with that, it would be good to provide some examples. It seems > unlikely for me that we're going to have something like InnoDB or SQL Server > table with our current index AM API. InnoDB utilizes index-organized tables > where primary and secondary indexes are versioned independently. SQL Server > utilizes flat data structure similar to our heap, but MVCC implementation > also seems very different. I strongly agree. I simply don't understand how you can adopt UNDO for MVCC, and yet expect to get a benefit commensurate with the effort without also implementing "retail index tuple deletion" first. Pursuing UNDO this way has the same problem that WARM likely has -- it doesn't really help with the worst case, where users get big, unpleasant surprises. Postgres is probably the only major database system that doesn't support retail index tuple deletion. It's a basic thing, that has nothing to do with MVCC. Really, what do we have to lose? The biggest weakness of the current design is IMV how it fails to prevent index bloat in the first place, but avoiding bloating index leaf pages in the first place doesn't seem incompatible with how VACUUM works. Or at least, let's not assume that it is. 
We should avoid throwing the baby out with the bathwater. > I think in general there are two ways dealing with our index AM API > limitation. One of them is to extend index AM API. At least, we would need > a method for deletion of individual index tuple (for sure, we already have > kill_prior_tuple but it's just a hint for now). kill_prior_tuple can work well, but, like HOT, it works inconsistently, in a way that is hard to predict. > Also, it would be nice to > have arbitrary payload to index tuples instead of TID, and method to update > that payload. But that would be quite big amount of work. Alternatively, > we could allow pluggable storages to have their own index AMs, and that will > move this amount of work to the pluggable storage side. I agree with Robert that being able to store an arbitrary payload as a TID is probably not going to ever work very well. However, I don't think that that's a reason to give up on the underlying idea:
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 5:14 AM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov > wrote: > > If #1 is only about changing tuple and page formats, then could be much > > simpler than the patch upthread? We can implement "page format access > > methods" with routines for insertion, update, pruning and deletion of > tuples > > *in particular page*. There is no need to redefine high-level logic for > > scanning heap, doing updates and so on... > > That assumes that every tuple format does those things in the same > way, which I suspect is not likely to be the case. I think that > pruning and vacuum are artifacts of the current heap format, and they > may be nonexistent or take some altogether different form in some > other storage engine. I think that pruning and vacuum are artifacts of not only current heap formats, but they are also artifacts of the current index AM API. And this is a more significant circumstance given that we're going to preserve compatibility of new storage engines with current index AMs. Our current index AM API assumes that we can delete from an index only in a bulk manner. Our payload for an index key is a TID, not an arbitrary piece of data. And that payload can't be updated. > InnoDB isn't much like the PostgreSQL heap, and > neither is SQL Server, IIUC. If we're creating a heap format that can > only be different in trivial ways from what we have now, and anything > that really changes the paradigm is off-limits, well, that's not > really interesting enough to justify the work of creating a heap > storage API. My concern is that we probably can't do anything that really changes the paradigm while preserving compatibility with the index AM API. If you don't agree with that, it would be good to provide some examples. It seems unlikely to me that we're going to have something like an InnoDB or SQL Server table with our current index AM API. InnoDB utilizes index-organized tables where primary and secondary indexes are versioned independently. 
SQL Server utilizes a flat data structure similar to our heap, but the MVCC implementation also seems very different. I think in general there are two ways of dealing with our index AM API limitations. One of them is to extend the index AM API. At least, we would need a method for deletion of individual index tuples (for sure, we already have kill_prior_tuple, but it's just a hint for now). Also, it would be nice to have an arbitrary payload in index tuples instead of a TID, and a method to update that payload. But that would be quite a big amount of work. Alternatively, we could allow pluggable storages to have their own index AMs, and that will move this amount of work to the pluggable storage side. The thing we should avoid is a storage API that is as invasive as something that changes the paradigm, but actually allows just trivial changes in the heap format. Mechanical replacement of heap methods with storage API methods could lead us there. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi wrote: > To replace tuple with slot, I took trigger and SPI calls as the first step > in modifying > from tuple to slot, Here I attached a WIP patch. The notable changes are, > > 1. Replace most of the HeapTuple with Slot in SPI interface functions. > 2. In SPITupleTable, Instead of HeapTuple, it is changed to TupleTableSlot. > But this change may not be a proper approach, because a duplicate copy of > TupleTableSlot is generated and stored. > 3. Changed all trigger interfaces to accept TupleTableSlot Instead of > HeapTuple. > 4. ItemPointerData is added as a member to the TupleTableSlot structure. > 5. Modified the ExecInsert and others to work directly on TupleTableSlot > instead > of tuple(not completely). What problem are you trying to solve with these changes? I'm not saying that it's a bad idea, but I think you should spell out the motivation a little more explicitly. I think performance is likely to be a key concern here. Maybe these interfaces are Just Better with TupleTableSlots and the patch is a win regardless of pluggable storage -- but if the reverse turns out to be true, and this slows things down in cases that anyone cares about, I think that's going to be a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:27 PM, Tomas Vondra wrote: > Can you elaborate a bit more about this TID bit pattern issues? I do > remember that not all TIDs are valid due to safeguards on individual fields, > like for example > > Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits)) > > But perhaps there are some other issues? I think the other issues that I know about have largely already been mentioned by others, but since this question was addressed to me: 1. TID bitmaps assume that the page and offset parts of the TID contain just those things. As Tom pointed out, tidbitmap.c isn't cool with TIDs that have ip_posid out of range. More broadly, we assume lossification is a sensible way of keeping a TID bitmap down to a reasonable size without losing too much efficiency, and that sorting tuple references by the block ID is likely to produce a sequential I/O pattern. Those things won't necessarily be true if TIDs are treated as opaque tuple identifiers. 2. Apparently, GIN uses the structure of TIDs to compress posting lists. (I'm not personally familiar with this code.) In general, it's a fairly dangerous thing to suppose that you can repurpose a value as widely used as a TID and not break anything. I'm not saying it can't be done, but we use TIDs in an awful lot of places and rooting out all of the places somebody may have made an assumption about the structure of them may not be trivial. I tend to think it's an ugly kludge to shove some other kind of value into a TID, anyway. If we need to store something that's not a TID, I think we should have a purpose-built mechanism for that, not just hammer on the existing system until it sorta works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov wrote:
> If #1 is only about changing tuple and page formats, then it could be much simpler than the patch upthread? We can implement "page format access methods" with routines for insertion, update, pruning and deletion of tuples *in a particular page*. There is no need to redefine high-level logic for scanning the heap, doing updates and so on...

That assumes that every tuple format does those things in the same way, which I suspect is not likely to be the case. I think that pruning and vacuum are artifacts of the current heap format, and they may be nonexistent or take some altogether different form in some other storage engine. InnoDB isn't much like the PostgreSQL heap, and neither is SQL Server, IIUC. If we're creating a heap format that can only be different in trivial ways from what we have now, and anything that really changes the paradigm is off-limits, well, that's not really interesting enough to justify the work of creating a heap storage API.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Wed, Jun 28, 2017 at 7:43 AM, Haribabu Kommi wrote:
> On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov wrote:
>> On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila wrote:
>>> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov wrote:
>>>
>>> Probably some other algorithm for vacuum. I am not sure the current vacuum with its parameters can be used so easily. One thing that might need some thought is whether it is sufficient to tell a developer implementing some different strategy for vacuum to keep autovacuum off and call some different API in places where vacuum can be invoked manually (like the VACUUM command), or whether we need something more as well.
>>
>> What kind of other vacuum algorithm do you expect? It would be easier to understand if we had examples.
>>
>> For me, changing the vacuum algorithm is not needed for just a heap page format change. The existing vacuum algorithm could just call page format API functions for manipulating individual pages.
>>
>> Changing the vacuum algorithm might be needed for more invasive changes than just the heap page format. However, we should first understand what these changes could be and how they are consistent with the rest of the API design.
>
> Yes, I agree that we need some changes in vacuum to handle pluggable storage. I haven't fully checked the changes that are needed in vacuum yet, but I feel low-level changes to the functions are enough. There should also be some option for the storage handler to decide whether it needs vacuum or not; based on this flag, vacuum may be skipped on those tables, so those handlers would not need to register those APIs.

Something in that direction sounds reasonable to me. I am also not very clear on what kind of pluggability will be required for vacuum. I think for now we can park this problem and try to tackle the tuple format and page format related stuff.

-- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 7:23 PM, Alexander Korotkov wrote:
> On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila wrote:
>> Similarly, if the page format is changed you need to consider all page scan APIs like heapgettup_pagemode/heapgetpage.
>
> If the page format is changed, then heapgettup_pagemode/heapgetpage should use appropriate API functions for manipulating page items. I'm very afraid of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous functions like heap_update without understanding a clear use case. It's definitely not needed for changing the heap page format.

Yeah, we might not change them completely; there will always be some common parts. But as of now, this API assumes that we can return a pointer to a tuple on the page while holding just a pin on the buffer. That is based on the assumption that nobody can update the current tuple -- only a new tuple will be created on update. For some use cases, like in-place updates, we might want to do that differently.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 11:53 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila> wrote: > >> On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov >> wrote: >> > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas >> wrote: >> >> >> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi >> >> wrote: >> >> > Open Items: >> >> > >> >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> >> > HeapTuple and HeapScanDesc, So these scans are directly operating >> >> > on those structures and providing the result. >> >> > >> >> > These scan types may not be applicable to different storage formats. >> >> > So how to handle them? >> >> >> >> I think that BitmapHeapScan, at least, is applicable to any table AM >> >> that has TIDs. It seems to me that in general we can imagine three >> >> kinds of table AMs: >> >> >> >> 1. Table AMs where a tuple can be efficiently located by a real TID. >> >> By a real TID, I mean that the block number part is really a block >> >> number and the item ID is really a location within the block. These >> >> are necessarily quite similar to our current heap, but they can change >> >> the tuple format and page format to some degree, and it seems like in >> >> many cases it should be possible to plug them into our existing index >> >> AMs without too much heartache. Both index scans and bitmap index >> >> scans ought to work. >> > >> > >> > If #1 is only about changing tuple and page formats, then could be much >> > simpler than the patch upthread? We can implement "page format access >> > methods" with routines for insertion, update, pruning and deletion of >> tuples >> > *in particular page*. There is no need to redefine high-level logic for >> > scanning heap, doing updates and so on... >> >> If you are changing tuple format then you do need to worry about >> places wherever we are using HeapTuple like TupleTableSlots, >> Visibility routines, etc. 
(just think: if somebody changes the tuple >> header, then all such places are susceptible to change).
>
> Agree. I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storage. For example, I believe some FDWs could already benefit from a pluggable tuple format.

Accepting multiple tuple formats is possible with a complete replacement of HeapTuple by TupleTableSlot, or something like a values/nulls array instead of a single contiguous chunk of tuple data. Currently I am working on replacing HeapTuple with TupleTableSlot in the upper layers once the tuple is returned from the scan. In most of the places where HeapTuple is present, it is either replaced with TupleTableSlot or changed to StorageTuple (void *). I have yet to evaluate whether this can be supported as an independent feature without needing some heap format function to understand the tuple format in every place.

Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila> wrote: > >> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov >> wrote: >> > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi < >> kommi.harib...@gmail.com> >> > wrote: >> >> >> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera < >> alvhe...@2ndquadrant.com> >> >> wrote: >> >>> >> >>> I have sent the partial patch I have to Hari Babu Kommi. We expect >> that >> >>> he will be able to further this goal some more. >> >> >> >> >> >> Thanks Alvaro for sharing your development patch. >> >> >> >> Most of the patch design is same as described by Alvaro in the first >> mail >> >> [1]. >> >> I will detail the modifications, pending items and open items (needs >> >> discussion) >> >> to implement proper pluggable storage. >> >> >> >> Here I attached WIP patches to support pluggable storage. The patch >> series >> >> are may not work individually. Still so many things are under >> development. >> >> These patches are just to share the approach of the current >> development. >> >> >> >> Some notable changes that I did to make the patch work: >> >> >> >> 1. Added storageam handler to the slot, this is because not all places >> >> the relation is not available in handy. >> >> 2. Retained the minimal Tuple in the slot, as this is used in HASH >> join. >> >> As per the first version, I feel it is fine to allow creating HeapTuple >> >> format data. >> >> >> >> Thanks everyone for sharing their ideas in the developer's >> unconference at >> >> PGCon Ottawa. >> >> >> >> Pending items: >> >> >> >> 1. Replacement of Tuple with slot in Trigger functionality >> >> 2. Replacement of Tuple with Slot from storage handler functions. >> >> 3. Remove/minimize the use of HeapTuple as a Datum. >> >> 4. Replace all references of HeapScanDesc with StorageScanDesc >> >> 5. Planner changes to consider the relation storage during the >> planning. >> >> 6. 
Any planner changes based on the discussion of open items? >> >> 7. some Executor changes to consider the storage advantages? >> >> >> >> Open Items: >> >> >> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> >> HeapTuple and HeapScanDesc, So these scans are directly operating >> >> on those structures and providing the result. >> > >> > >> > What about vacuum? I see vacuum is untouched in the patchset and it is >> not >> > mentioned in this discussion. >> > Do you plan to override low-level function like heap_page_prune(), >> > lazy_vacuum_page() etc., but preserve high-level logic of vacuum? >> > Or do you plan to let pluggable storage implement its own high-level >> vacuum >> > algorithm? >> > >> >> Probably, some other algorithm for vacuum. I am not sure current >> vacuum with its parameters can be used so easily. One thing that >> might need some thoughts is that is it sufficient to say that keep >> autovacuum as off and call some different API for places where the >> vacuum can be invoked manually like Vacuum command to the developer >> implementing some different strategy for vacuum or we need something >> more as well. > > > What kind of other vacuum algorithm do you expect? It would be rather > easier to understand if we would have examples. > > For me, changing of vacuum algorithm is not needed for just heap page > format change. Existing vacuum algorithm could just call page format API > functions for manipulating individual pages. > > Changing of vacuum algorithm might be needed for more invasive changes > than just heap page format. However, we should first understand what these > changes could be and how are they consistent with rest of API design. > Yes, I agree that we need some changes in the vacuum to handle the pluggable storage. 
I haven't fully checked the changes that are needed in vacuum yet, but I feel low-level changes to the functions are enough. There should also be some option for the storage handler to decide whether it needs vacuum or not; based on this flag, vacuum may be skipped on those tables, so those handlers would not need to register those APIs.

Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:47 AM, Robert Haaswrote: > On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > wrote: > > Open Items: > > > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > > HeapTuple and HeapScanDesc, So these scans are directly operating > > on those structures and providing the result. > > > > These scan types may not be applicable to different storage formats. > > So how to handle them? > > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > > 2. Table AMs where a tuple has some other kind of locator. For > example, imagine an index-organized table where the locator is the > primary key, which is a bit like what Alvaro had in mind for indirect > indexes. If the locator is 6 bytes or less, it could potentially be > jammed into a TID, but I don't think that's a great idea. For things > like int8 or numeric, it won't work at all. Even for other things, > it's going to cause problems because the bit patterns won't be what > the code is expecting; e.g. bitmap scans care about the structure of > the TID, not just how many bits it is. (Due credit: Somebody, maybe > Alvaro, pointed out this problem before, at PGCon.) For these kinds > of tables, larger modifications to the index AMs are likely to be > necessary, at least if we want a really general solution, or maybe we > should have separate index AMs - e.g. 
btree for traditional TID-based > heaps, and generic_btree or indirect_btree or key_btree or whatever > for heaps with some other kind of locator. It's not too hard to see > how to make index scans work with this sort of structure but it's very > unclear to me whether, or how, bitmap scans can be made to work. > > 3. Table AMs where a tuple doesn't really have a locator at all. In > these cases, we can't support any sort of index AM at all. When the > table is queried, there's really nothing the core system can do except > ask the table AM for a full scan, supply the quals, and hope the table > AM has some sort of smarts that enable it to optimize somehow. For > example, you can imagine converting cstore_fdw into a table AM of this > sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole > chunks to be proven uninteresting and skipped. (You could use chunk > number + offset to turn this into a table AM of the previous type if > you wanted to support secondary indexes; not sure if that'd be useful, > but it'd certainly be harder.) > > I'm more interested in #1 than in #3, and more interested in #3 than > #2, but other people may have different priorities. Hi Robert, Thanks for the details and your opinion. I also agree that option#1 is better to do first. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapilawrote: > On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov > wrote: > > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi < > kommi.harib...@gmail.com> > > wrote: > >> > >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera < > alvhe...@2ndquadrant.com> > >> wrote: > >>> > >>> I have sent the partial patch I have to Hari Babu Kommi. We expect > that > >>> he will be able to further this goal some more. > >> > >> > >> Thanks Alvaro for sharing your development patch. > >> > >> Most of the patch design is same as described by Alvaro in the first > mail > >> [1]. > >> I will detail the modifications, pending items and open items (needs > >> discussion) > >> to implement proper pluggable storage. > >> > >> Here I attached WIP patches to support pluggable storage. The patch > series > >> are may not work individually. Still so many things are under > development. > >> These patches are just to share the approach of the current development. > >> > >> Some notable changes that I did to make the patch work: > >> > >> 1. Added storageam handler to the slot, this is because not all places > >> the relation is not available in handy. > >> 2. Retained the minimal Tuple in the slot, as this is used in HASH join. > >> As per the first version, I feel it is fine to allow creating HeapTuple > >> format data. > >> > >> Thanks everyone for sharing their ideas in the developer's unconference > at > >> PGCon Ottawa. > >> > >> Pending items: > >> > >> 1. Replacement of Tuple with slot in Trigger functionality > >> 2. Replacement of Tuple with Slot from storage handler functions. > >> 3. Remove/minimize the use of HeapTuple as a Datum. > >> 4. Replace all references of HeapScanDesc with StorageScanDesc > >> 5. Planner changes to consider the relation storage during the planning. > >> 6. Any planner changes based on the discussion of open items? > >> 7. some Executor changes to consider the storage advantages? > >> > >> Open Items: > >> > >> 1. 
The BitmapHeapScan and TableSampleScan are tightly coupled with > >> HeapTuple and HeapScanDesc, So these scans are directly operating > >> on those structures and providing the result. > > > > > > What about vacuum? I see vacuum is untouched in the patchset and it is > not > > mentioned in this discussion. > > Do you plan to override low-level function like heap_page_prune(), > > lazy_vacuum_page() etc., but preserve high-level logic of vacuum? > > Or do you plan to let pluggable storage implement its own high-level > vacuum > > algorithm? > > > > Probably, some other algorithm for vacuum. I am not sure current > vacuum with its parameters can be used so easily. One thing that > might need some thoughts is that is it sufficient to say that keep > autovacuum as off and call some different API for places where the > vacuum can be invoked manually like Vacuum command to the developer > implementing some different strategy for vacuum or we need something > more as well. What kind of other vacuum algorithm do you expect? It would be rather easier to understand if we would have examples. For me, changing of vacuum algorithm is not needed for just heap page format change. Existing vacuum algorithm could just call page format API functions for manipulating individual pages. Changing of vacuum algorithm might be needed for more invasive changes than just heap page format. However, we should first understand what these changes could be and how are they consistent with rest of API design. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapilawrote: > On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov > wrote: > > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas > wrote: > >> > >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > >> wrote: > >> > Open Items: > >> > > >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > >> > HeapTuple and HeapScanDesc, So these scans are directly operating > >> > on those structures and providing the result. > >> > > >> > These scan types may not be applicable to different storage formats. > >> > So how to handle them? > >> > >> I think that BitmapHeapScan, at least, is applicable to any table AM > >> that has TIDs. It seems to me that in general we can imagine three > >> kinds of table AMs: > >> > >> 1. Table AMs where a tuple can be efficiently located by a real TID. > >> By a real TID, I mean that the block number part is really a block > >> number and the item ID is really a location within the block. These > >> are necessarily quite similar to our current heap, but they can change > >> the tuple format and page format to some degree, and it seems like in > >> many cases it should be possible to plug them into our existing index > >> AMs without too much heartache. Both index scans and bitmap index > >> scans ought to work. > > > > > > If #1 is only about changing tuple and page formats, then could be much > > simpler than the patch upthread? We can implement "page format access > > methods" with routines for insertion, update, pruning and deletion of > tuples > > *in particular page*. There is no need to redefine high-level logic for > > scanning heap, doing updates and so on... > > If you are changing tuple format then you do need to worry about > places wherever we are using HeapTuple like TupleTableSlots, > Visibility routines, etc. (just think if somebody changes tuple > header, then all such places are susceptible to change). Agree. 
I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storage. For example, I believe some FDWs could already benefit from a pluggable tuple format.

> Similarly, if the page format is changed you need to consider all page scan APIs like heapgettup_pagemode/heapgetpage.

If the page format is changed, then heapgettup_pagemode/heapgetpage should use appropriate API functions for manipulating page items. I'm very afraid of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous functions like heap_update without understanding a clear use case. It's definitely not needed for changing the heap page format.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkovwrote: > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi > wrote: >> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera >> wrote: >>> >>> I have sent the partial patch I have to Hari Babu Kommi. We expect that >>> he will be able to further this goal some more. >> >> >> Thanks Alvaro for sharing your development patch. >> >> Most of the patch design is same as described by Alvaro in the first mail >> [1]. >> I will detail the modifications, pending items and open items (needs >> discussion) >> to implement proper pluggable storage. >> >> Here I attached WIP patches to support pluggable storage. The patch series >> are may not work individually. Still so many things are under development. >> These patches are just to share the approach of the current development. >> >> Some notable changes that I did to make the patch work: >> >> 1. Added storageam handler to the slot, this is because not all places >> the relation is not available in handy. >> 2. Retained the minimal Tuple in the slot, as this is used in HASH join. >> As per the first version, I feel it is fine to allow creating HeapTuple >> format data. >> >> Thanks everyone for sharing their ideas in the developer's unconference at >> PGCon Ottawa. >> >> Pending items: >> >> 1. Replacement of Tuple with slot in Trigger functionality >> 2. Replacement of Tuple with Slot from storage handler functions. >> 3. Remove/minimize the use of HeapTuple as a Datum. >> 4. Replace all references of HeapScanDesc with StorageScanDesc >> 5. Planner changes to consider the relation storage during the planning. >> 6. Any planner changes based on the discussion of open items? >> 7. some Executor changes to consider the storage advantages? >> >> Open Items: >> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> HeapTuple and HeapScanDesc, So these scans are directly operating >> on those structures and providing the result. > > > What about vacuum? 
I see vacuum is untouched in the patchset, and it is not > mentioned in this discussion. > Do you plan to override low-level functions like heap_page_prune(), > lazy_vacuum_page() etc., but preserve the high-level logic of vacuum? > Or do you plan to let pluggable storage implement its own high-level vacuum > algorithm?

Probably some other algorithm for vacuum. I am not sure the current vacuum with its parameters can be used so easily. One thing that might need some thought is whether it is sufficient to tell a developer implementing some different strategy for vacuum to keep autovacuum off and call some different API in places where vacuum can be invoked manually (like the VACUUM command), or whether we need something more as well.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkovwrote: > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas wrote: >> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi >> wrote: >> > Open Items: >> > >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> > HeapTuple and HeapScanDesc, So these scans are directly operating >> > on those structures and providing the result. >> > >> > These scan types may not be applicable to different storage formats. >> > So how to handle them? >> >> I think that BitmapHeapScan, at least, is applicable to any table AM >> that has TIDs. It seems to me that in general we can imagine three >> kinds of table AMs: >> >> 1. Table AMs where a tuple can be efficiently located by a real TID. >> By a real TID, I mean that the block number part is really a block >> number and the item ID is really a location within the block. These >> are necessarily quite similar to our current heap, but they can change >> the tuple format and page format to some degree, and it seems like in >> many cases it should be possible to plug them into our existing index >> AMs without too much heartache. Both index scans and bitmap index >> scans ought to work. > > > If #1 is only about changing tuple and page formats, then could be much > simpler than the patch upthread? We can implement "page format access > methods" with routines for insertion, update, pruning and deletion of tuples > *in particular page*. There is no need to redefine high-level logic for > scanning heap, doing updates and so on... > If you are changing tuple format then you do need to worry about places wherever we are using HeapTuple like TupleTableSlots, Visibility routines, etc. (just think if somebody changes tuple header, then all such places are susceptible to change). Similarly, if the page format is changed you need to consider all page scan API's like heapgettup_pagemode/heapgetpage. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Mon, Jun 26, 2017 at 10:55 PM, Tomas Vondrawrote: > On 06/26/2017 05:18 PM, Alexander Korotkov wrote: > >> I see that design question for PostgreSQL pluggable storages is very hard. >> > > IMHO it's mostly expected to be hard. > > Firstly, PostgreSQL is a mature product with many advanced features, and > reworking a low-level feature without breaking something on top of it is > hard by definition. > > Secondly, project policies and code quality requirements set the bar very > high too, I think. Sure. > BTW, I think it worth analyzing existing use-cases of pluggable > >> storages. I think that the most famous DBMS with pluggable storage API >> is MySQL. This why I decided to start with it. I've added >> MySQL/MariaDB section on wiki page. >> https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB >> It appears that significant part of MySQL storage engines are misuses. >> MySQL lacks of various features like FDWs or writable views and so on. >> This is why developers created a lot of pluggable storages for that >> purposes. We definitely don't want something like this in PostgreSQL now. >> I created special resume column where I expressed whether it >> would be nice to have something like this table engine in PostgreSQL. >> > > I don't want to discourage you, but I'm not sure how valuable this is. > > I agree it's valuable to have a an over-view of use cases for pluggable > storage, but I don't think we'll get that from looking at MySQL. As you > noticed, most of the storage engines are misuses, so it's difficult to > learn anything valuable from them. You can argue that using FDWs to > implement alternative storage engines is a misuse too, but at least that > gives us a valid use case (columnar storage implemented using FDW). 
> > If anything, the MySQL storage engines should serve as a cautionary tale of how not to do things - there's also plenty of references in the MySQL > "Restrictions and Limitations" section of the manual: > > https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

Just to clarify: I don't propose any adoption of the MySQL pluggable storage API into PostgreSQL, or anything like that. I just wrote this table for completeness of vision. It may turn out that somebody makes some valuable notes out of it, or it may not. "Yes" in the third column means only that there are positive user-visible effects which are *nice to have* in PostgreSQL.

Also, I remember there was a table comparing different designs of pluggable storage and their use cases at the PGCon 2017 unconference. Could somebody reproduce it?

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi,

On 06/26/2017 05:18 PM, Alexander Korotkov wrote:
> Hackers, I see that the design question for PostgreSQL pluggable storage is very hard.

IMHO it's mostly expected to be hard.

Firstly, PostgreSQL is a mature product with many advanced features, and reworking a low-level feature without breaking something on top of it is hard by definition.

Secondly, project policies and code quality requirements set the bar very high too, I think.

> BTW, I think it's worth analyzing existing use cases of pluggable storage. I think that the most famous DBMS with a pluggable storage API is MySQL. This is why I decided to start with it. I've added a MySQL/MariaDB section on the wiki page. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB It appears that a significant part of MySQL storage engines are misuses. MySQL lacks various features like FDWs, writable views and so on. This is why developers created a lot of pluggable storage engines for those purposes. We definitely don't want something like this in PostgreSQL now. I created a special resume column where I expressed whether it would be nice to have something like a given table engine in PostgreSQL.

I don't want to discourage you, but I'm not sure how valuable this is.

I agree it's valuable to have an overview of use cases for pluggable storage, but I don't think we'll get that from looking at MySQL. As you noticed, most of the storage engines are misuses, so it's difficult to learn anything valuable from them. You can argue that using FDWs to implement alternative storage engines is a misuse too, but at least that gives us a valid use case (columnar storage implemented using FDW).
If anything, the MySQL storage engines should serve as a cautionary tale of how not to do things - there's also plenty of references in the MySQL "Restrictions and Limitations" section of the manual: https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Hackers,

I see that the design question for PostgreSQL pluggable storage is very hard.

BTW, I think it's worth analyzing existing use cases of pluggable storage. I think that the most famous DBMS with a pluggable storage API is MySQL. This is why I decided to start with it. I've added a MySQL/MariaDB section on the wiki page. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB

It appears that a significant part of MySQL storage engines are misuses. MySQL lacks various features like FDWs, writable views and so on. This is why developers created a lot of pluggable storage engines for those purposes. We definitely don't want something like this in PostgreSQL now. I created a special resume column where I expressed whether it would be nice to have something like a given table engine in PostgreSQL.

Any comments and additions are welcome. I'm planning to write a similar table for MongoDB.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On 23/06/2017 16:07, Tomas Vondra wrote: > > I think you're probably right - GIN does compress the posting lists by > exploiting the TID redundancy (its page/offset structure), and I > don't think there are other index AMs doing that. > > But I'm not sure we can simply rely on that - it's possible people will > try to improve other index types (e.g. by adding similar compression to > other index types). That reminds me of https://www.postgresql.org/message-id/55e4051b.7020...@postgrespro.ru where Anastasia proposed something similar. -- Julien Rouhaud http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Tomas Vondra writes: > It would be really great if you could explain why BitmapScans are > dubious, instead of just labeling them as dubious. (I guess you mean > Bitmap Heap Scans, right?) The two things I'm aware of are (a) the assumption you noted, that fetching tuples in TID sort order is a reasonably efficient thing, and (b) the "offset" part of a TID can't exceed MaxHeapTuplesPerPage --- see data structure in tidbitmap.c. The latter issue means that you don't really have a full six bytes to play with in a TID, only about five. I don't think (b) would be terribly hard to fix if we had a motivation to, but I wonder whether there aren't other places that also know this about TIDs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/23/17 10:38 AM, Teodor Sigaev wrote: 1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Currently TID has its own type in system catalog. Seems, it's possible that storage claims type of TID which it uses. Then, AM could claim it too, so the core based on that information could solve the question about AM-storage compatibility. Storage could also claim that it hasn't TID type at all so it couldn't be used with any access method, use case: compressed column oriented storage. Isn't the fact that TID is an existing type defined in system catalogs a fairly insignificant detail? I mean, we could just as easily define a new 64-bit locator data type, and use that instead, for example. The main issue here is that we assume things about the TID contents, i.e. that it contains page/offset etc. And Bitmap nodes rely on that, to some extent - e.g. when prefetching data. > As I remember, only GIN depends on TID format, other indexes use it as opaque type. Except, at least, btree and GiST - they believe that internal pointers are the same as outer (to heap) I think you're probably right - GIN does compress the posting lists by exploiting the TID redundancy (its page/offset structure), and I don't think there are other index AMs doing that. But I'm not sure we can simply rely on that - it's possible people will try to improve other index types (e.g. by adding similar compression to other index types). Moreover we now support extensions defining custom index AMs, and we have no clue what people may do in those. So this would clearly require some sort of flag for each index AM. > Another dubious part - BitmapScan. It would be really great if you could explain why BitmapScans are dubious, instead of just labeling them as dubious. (I guess you mean Bitmap Heap Scans, right?) 
I see no conceptual issue with bitmap scans on arbitrary locator types, as long as there's sufficient locality information encoded in the value. What I mean by that is that for any two locator values A and B: (1) if (A < B) then (A is stored before B) (2) if (A is close to B) then (A is stored close to B) Without these properties it's probably futile to try to do bitmap scans, because the bitmap would not result in a mostly sequential access pattern and things like prefetch would not be very efficient, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Currently TID has its own type in system catalog. Seems, it's possible that storage claims type of TID which it uses. Then, AM could claim it too, so the core based on that information could solve the question about AM-storage compatibility. Storage could also claim that it hasn't TID type at all so it couldn't be used with any access method, use case: compressed column oriented storage. As I remember, only GIN depends on TID format, other indexes use it as opaque type. Except, at least, btree and GiST - they believe that internal pointers are the same as outer (to heap) Another dubious part - BitmapScan. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 11:27 PM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov > wrote: >> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier >> wrote: >>> Putting that in a couple of words. >>> 1. Table AM with a 6-byte TID. >>> 2. Table AM with a custom locator format, which could be TID-like. >>> 3. Table AM with no locators. >>> >>> Getting into having #1 first to work out would already be really >>> useful for users. >> >> What exactly would be useful for *users*? Any kind of API itself is >> completely useless for users, because they are users, not developers. >> Storage API could be useful for developers to implement storage AMs which in >> turn could be useful for users. > > What's your point? I assume that is what Michael meant. Sorry for the confusion. I implied here that users are the ones developing modules. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/22/17 4:36 PM, Alexander Korotkov wrote: On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas wrote: On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov wrote: > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier wrote: >> Putting that in a couple of words. >> 1. Table AM with a 6-byte TID. >> 2. Table AM with a custom locator format, which could be TID-like. >> 3. Table AM with no locators. >> >> Getting into having #1 first to work out would already be really >> useful for users. > > What exactly would be useful for *users*? Any kind of API itself is > completely useless for users, because they are users, not developers. > Storage API could be useful for developers to implement storage AMs which in > turn could be useful for users. What's your point? I assume that is what Michael meant. TBH, I don't understand what particular enhancements we expect from having #1. I'd say that's one of the things we're trying to figure out in this thread. Does it make sense to go with #1 in v1 of the patch, or do we have to implement #2 or #3 to get some real benefit for the users? > This is why it's hard for me to say if #1 is a good idea. It's also hard for me to say if the patch upthread is the right way of implementing #1. But, I have a gut feeling that even if #1 is a good idea itself, it's definitely not what users expect from "pluggable storages". The question is "why" do you think that. What features do you expect from the pluggable storage API that would be impossible to implement with option #1? I'm not trying to be annoying or anything like that - I don't know the answer and discussing those things is exactly why this thread exists. I do agree #1 has limitations, and that it'd be great to get an API that supports all kinds of pluggable storage implementations. But I guess that'll take some time. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/21/17 9:47 PM, Robert Haas wrote: ... > like int8 or numeric, it won't work at all. Even for other things, it's going to cause problems because the bit patterns won't be what the code is expecting; e.g. bitmap scans care about the structure of the TID, not just how many bits it is. (Due credit: Somebody, maybe Alvaro, pointed out this problem before, at PGCon.) Can you elaborate a bit more about these TID bit pattern issues? I do remember that not all TIDs are valid due to safeguards on individual fields, like for example Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits)) But perhaps there are some other issues? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov > wrote: > > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier < > michael.paqu...@gmail.com> > > wrote: > >> Putting that in a couple of words. > >> 1. Table AM with a 6-byte TID. > >> 2. Table AM with a custom locator format, which could be TID-like. > >> 3. Table AM with no locators. > >> > >> Getting into having #1 first to work out would already be really > >> useful for users. > > > > What exactly would be useful for *users*? Any kind of API itself is > > completely useless for users, because they are users, not developers. > > Storage API could be useful for developers to implement storage AMs > which in > > turn could be useful for users. > > What's your point? I assume that is what Michael meant. > TBH, I don't understand what particular enhancements we expect from having #1. This is why it's hard for me to say if #1 is a good idea. It's also hard for me to say if the patch upthread is the right way of implementing #1. But, I have a gut feeling that even if #1 is a good idea itself, it's definitely not what users expect from "pluggable storages". From the user side, it would be normal to expect that a "pluggable storage" could be an index-organized table where the index could be, say, an LSM... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas wrote: > On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > wrote: > > Open Items: > > > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > > HeapTuple and HeapScanDesc, So these scans are directly operating > > on those structures and providing the result. > > > > These scan types may not be applicable to different storage formats. > > So how to handle them? > > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > If #1 is only about changing tuple and page formats, then it could be much simpler than the patch upthread? We can implement "page format access methods" with routines for insertion, update, pruning and deletion of tuples *in a particular page*. There is no need to redefine the high-level logic for scanning the heap, doing updates and so on... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov wrote: > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier > wrote: >> Putting that in a couple of words. >> 1. Table AM with a 6-byte TID. >> 2. Table AM with a custom locator format, which could be TID-like. >> 3. Table AM with no locators. >> >> Getting into having #1 first to work out would already be really >> useful for users. > > What exactly would be useful for *users*? Any kind of API itself is > completely useless for users, because they are users, not developers. > Storage API could be useful for developers to implement storage AMs which in > turn could be useful for users. What's your point? I assume that is what Michael meant. > Then while saying that #1 is useful for > users, it would be nice to keep in mind particular storage AMs which can be > implemented using #1. I don't think anybody's arguing with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier wrote: > Putting that in a couple of words. > 1. Table AM with a 6-byte TID. > 2. Table AM with a custom locator format, which could be TID-like. > 3. Table AM with no locators. > > Getting into having #1 first to work out would already be really > useful for users. What exactly would be useful for *users*? Any kind of API itself is completely useless for users, because they are users, not developers. The storage API could be useful for developers to implement storage AMs which in turn could be useful for users. Then, while saying that #1 is useful for users, it would be nice to keep in mind particular storage AMs which can be implemented using #1. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi wrote: > On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera > wrote: > >> I have sent the partial patch I have to Hari Babu Kommi. We expect that >> he will be able to further this goal some more. > > > Thanks Alvaro for sharing your development patch. > > Most of the patch design is the same as described by Alvaro in the first mail > [1]. > I will detail the modifications, pending items and open items (needing > discussion) > to implement proper pluggable storage. > > Here I attached WIP patches to support pluggable storage. The patches in the > series may not work individually. Many things are still under development. > These patches are just to share the approach of the current development. > > Some notable changes that I did to make the patch work: > > 1. Added the storageam handler to the slot, because the relation is not > readily available in all places. > 2. Retained the minimal tuple in the slot, as this is used in hash joins. > As per the first version, I feel it is fine to allow creating HeapTuple > format data. > > Thanks everyone for sharing their ideas in the developer's unconference at > PGCon Ottawa. > > Pending items: > > 1. Replacement of Tuple with slot in Trigger functionality > 2. Replacement of Tuple with Slot from storage handler functions. > 3. Remove/minimize the use of HeapTuple as a Datum. > 4. Replace all references of HeapScanDesc with StorageScanDesc > 5. Planner changes to consider the relation storage during the planning. > 6. Any planner changes based on the discussion of open items? > 7. Some executor changes to consider the storage advantages? > > Open Items: > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > HeapTuple and HeapScanDesc, So these scans are directly operating > on those structures and providing the result. > What about vacuum? I see vacuum is untouched in the patchset and it is not mentioned in this discussion. 
Do you plan to override low-level functions like heap_page_prune(), lazy_vacuum_page() etc., but preserve the high-level logic of vacuum? Or do you plan to let the pluggable storage implement its own high-level vacuum algorithm? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 11:12 AM, Amit Langote wrote: > On 2017/06/22 10:01, Michael Paquier wrote: >> #3 implies that the index AM logic is implemented in the table >> AM. Not saying that it is not useful, but it does not feel natural to >> have the planner request a sequential scan, just to have the table >> AM secretly do some kind of index/skipping scan. > > I had read a relevant comment on a pluggable storage thread a while back > [1]. In short, the comment was that the planner should be able to get > some intelligence, via some API, from the heap storage implementation > about the latter's access cost characteristics. The storage should > provide accurate-enough cost information to the planner when such a > request is made by, say, cost_seqscan(), so that the planner can make > an appropriate choice. If two tables containing the same number of rows (and > the same size in bytes, perhaps) use different storage implementations, > then, planner's cost parameters remaining the same, cost_seqscan() will end up > calculating different costs for the two tables. Perhaps, SeqScan would be > chosen for one table but not the other based on that. Yeah, I agree that the costing part needs some clear attention and thoughts, and the gains are absolutely huge with the correct interface. That could be done in a later step though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On 2017/06/22 10:01, Michael Paquier wrote: > #3 implies that the index AM logic is implemented in the table > AM. Not saying that it is not useful, but it does not feel natural to > have the planner request a sequential scan, just to have the table > AM secretly do some kind of index/skipping scan. I had read a relevant comment on a pluggable storage thread a while back [1]. In short, the comment was that the planner should be able to get some intelligence, via some API, from the heap storage implementation about the latter's access cost characteristics. The storage should provide accurate-enough cost information to the planner when such a request is made by, say, cost_seqscan(), so that the planner can make an appropriate choice. If two tables containing the same number of rows (and the same size in bytes, perhaps) use different storage implementations, then, planner's cost parameters remaining the same, cost_seqscan() will end up calculating different costs for the two tables. Perhaps, SeqScan would be chosen for one table but not the other based on that. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoY3LXVUPQVdZW70XKp5PsXffO82pXXt%3DbeegcV%2B%3DRsQgg%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:47 AM, Robert Haas wrote: > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > > 2. Table AMs where a tuple has some other kind of locator. For > example, imagine an index-organized table where the locator is the > primary key, which is a bit like what Alvaro had in mind for indirect > indexes. If the locator is 6 bytes or less, it could potentially be > jammed into a TID, but I don't think that's a great idea. For things > like int8 or numeric, it won't work at all. Even for other things, > it's going to cause problems because the bit patterns won't be what > the code is expecting; e.g. bitmap scans care about the structure of > the TID, not just how many bits it is. (Due credit: Somebody, maybe > Alvaro, pointed out this problem before, at PGCon.) For these kinds > of tables, larger modifications to the index AMs are likely to be > necessary, at least if we want a really general solution, or maybe we > should have separate index AMs - e.g. btree for traditional TID-based > heaps, and generic_btree or indirect_btree or key_btree or whatever > for heaps with some other kind of locator. It's not too hard to see > how to make index scans work with this sort of structure but it's very > unclear to me whether, or how, bitmap scans can be made to work. > > 3. Table AMs where a tuple doesn't really have a locator at all. 
In > these cases, we can't support any sort of index AM at all. When the > table is queried, there's really nothing the core system can do except > ask the table AM for a full scan, supply the quals, and hope the table > AM has some sort of smarts that enable it to optimize somehow. For > example, you can imagine converting cstore_fdw into a table AM of this > sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole > chunks to be proven uninteresting and skipped. (You could use chunk > number + offset to turn this into a table AM of the previous type if > you wanted to support secondary indexes; not sure if that'd be useful, > but it'd certainly be harder.) > > I'm more interested in #1 than in #3, and more interested in #3 than > #2, but other people may have different priorities. Putting that in a couple of words. 1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Getting into having #1 first to work out would already be really useful for users. My take on the matter is that being able to plug in-core index AMs directly into a table AM of kind #1 is more useful in the long term, as it is possible for multiple table AMs to use the same kind of index AM which is designed nicely enough. So the index AM logic basically does not need to be duplicated across multiple table AMs. #3 implies that the index AM logic is implemented in the table AM. Not saying that it is not useful, but it does not feel natural to have the planner request a sequential scan, just to have the table AM secretly do some kind of index/skipping scan. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi wrote: > Open Items: > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > HeapTuple and HeapScanDesc, So these scans are directly operating > on those structures and providing the result. > > These scan types may not be applicable to different storage formats. > So how to handle them? I think that BitmapHeapScan, at least, is applicable to any table AM that has TIDs. It seems to me that in general we can imagine three kinds of table AMs: 1. Table AMs where a tuple can be efficiently located by a real TID. By a real TID, I mean that the block number part is really a block number and the item ID is really a location within the block. These are necessarily quite similar to our current heap, but they can change the tuple format and page format to some degree, and it seems like in many cases it should be possible to plug them into our existing index AMs without too much heartache. Both index scans and bitmap index scans ought to work. 2. Table AMs where a tuple has some other kind of locator. For example, imagine an index-organized table where the locator is the primary key, which is a bit like what Alvaro had in mind for indirect indexes. If the locator is 6 bytes or less, it could potentially be jammed into a TID, but I don't think that's a great idea. For things like int8 or numeric, it won't work at all. Even for other things, it's going to cause problems because the bit patterns won't be what the code is expecting; e.g. bitmap scans care about the structure of the TID, not just how many bits it is. (Due credit: Somebody, maybe Alvaro, pointed out this problem before, at PGCon.) For these kinds of tables, larger modifications to the index AMs are likely to be necessary, at least if we want a really general solution, or maybe we should have separate index AMs - e.g. 
btree for traditional TID-based heaps, and generic_btree or indirect_btree or key_btree or whatever for heaps with some other kind of locator. It's not too hard to see how to make index scans work with this sort of structure but it's very unclear to me whether, or how, bitmap scans can be made to work. 3. Table AMs where a tuple doesn't really have a locator at all. In these cases, we can't support any sort of index AM at all. When the table is queried, there's really nothing the core system can do except ask the table AM for a full scan, supply the quals, and hope the table AM has some sort of smarts that enable it to optimize somehow. For example, you can imagine converting cstore_fdw into a table AM of this sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole chunks to be proven uninteresting and skipped. (You could use chunk number + offset to turn this into a table AM of the previous type if you wanted to support secondary indexes; not sure if that'd be useful, but it'd certainly be harder.) I'm more interested in #1 than in #3, and more interested in #3 than #2, but other people may have different priorities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
I have sent the partial patch I have to Hari Babu Kommi. We expect that he will be able to further this goal some more. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. I regret to announce that I'll have to stay away from this topic for a little while, as I have another project taking priority. I expect to return to this shortly thereafter, hopefully in time to get it done for pg10. If anyone is interested in helping with the (currently not compilable) patch I have, please mail me privately and we can discuss. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On 2016-08-18 08:58:11 +0100, Simon Riggs wrote: > On 16 August 2016 at 19:46, Andres Freund wrote: > > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > >> Thanks for taking a stab at this. I'd like to throw out a few concerns. > >> > >> One, I'm worried that adding an additional layer of pointer-jumping is > >> going to slow things down and make Andres' work to speed up the > >> executor more difficult. I don't know that there is a problem there, > >> and if there is a problem I don't know what to do about it, but I > >> think it's something we need to consider. > > > > I'm quite concerned about that as well. > > This objection would apply to all other proposals as well, FDW etc.. Not really. The place you draw your boundary significantly influences where and how much of a price you pay. Having another indirection inside HeapTuple - which is accessed in many many places, is something different from having a seqscan equivalent, which returns you a batch of already deformed tuples in array form. In the latter case there's one additional indirection per batch of tuples, in the former there's many for each tuple. > Do you see some way to add flexibility yet without adding a branch > point in the code? I'm not even saying that the approach of doing the indirection inside the HeapTuple replacement is a no-go, just that it concerns me. I do think that working on only lugging around values/isnull arrays is something that I could see working better, if some problems are addressed beforehand. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On August 18, 2016 7:44:50 AM PDT, Ants Aasma wrote: >On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund >wrote: >> On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >>> I am somewhat inclined to >>> believe that we need to restructure the executor in a bigger way so >>> that it passes around datums instead of tuples; I'm inclined to >>> believe that the current tuple-centric model is probably not optimal >>> even for the existing storage format. >> >> I actually prototyped that, and it's not an easy win so far. Column >> extraction cost, even after significant optimization, is still often >a >> significant portion of the runtime. And e.g. projection only >extracting >> all columns, after evaluating a restrictive qual referring to an >"early" >> column, can be a significant win. We'd definitely have to give up on >> extracting columns 0..n when accessing later columns... Hm. > >What about going even further than [1] in converting the executor to >being opcode based and merging projection and qual evaluation to a >single pass? Optimizer would then have some leeway about how to order >column extraction and qual evaluation. Might even be worth it to >special case some functions as separate opcodes (e.g. int4eq, >timestamp_lt). > >Regards, >Ants Aasma > >[1] >https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de Good question. I think I have a reasonable answer, but let's discuss that in the other thread. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund wrote: > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >> I am somewhat inclined to >> believe that we need to restructure the executor in a bigger way so >> that it passes around datums instead of tuples; I'm inclined to >> believe that the current tuple-centric model is probably not optimal >> even for the existing storage format. > > I actually prototyped that, and it's not an easy win so far. Column > extraction cost, even after significant optimization, is still often a > significant portion of the runtime. And e.g. projection only extracting > all columns, after evaluating a restrictive qual referring to an "early" > column, can be a significant win. We'd definitely have to give up on > extracting columns 0..n when accessing later columns... Hm. What about going even further than [1] in converting the executor to being opcode based and merging projection and qual evaluation to a single pass? The optimizer would then have some leeway about how to order column extraction and qual evaluation. Might even be worth it to special-case some functions as separate opcodes (e.g. int4eq, timestamp_lt). Regards, Ants Aasma [1] https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Aug 17, 2016 at 10:33 PM, Alvaro Herrera wrote: > Anastasia Lubennikova wrote: >> >> Except these, there are some pretty strange and unrelated functions in >> src/backend/catalog. >> I'm willing to fix them, but I'd like to synchronize our efforts. > > I very much would like to stay away from touching src/backend/catalog, > which are the functions that deal with system catalogs. We can simply > say that system catalogs are hardcoded to use heapam.c storage for now. > Does this mean that if any storage needs to access system catalog information, it needs to be aware of HeapTuple and other required stuff like syscache? Again, if it needs to update some stats or something like that, it needs to be aware of the heap tuple format. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Thu, Aug 18, 2016 at 10:58 AM, Simon Riggs wrote: > On 16 August 2016 at 19:46, Andres Freund wrote: > > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > >> Thanks for taking a stab at this. I'd like to throw out a few concerns. > >> > >> One, I'm worried that adding an additional layer of pointer-jumping is > >> going to slow things down and make Andres' work to speed up the > >> executor more difficult. I don't know that there is a problem there, > >> and if there is a problem I don't know what to do about it, but I > >> think it's something we need to consider. > > > > I'm quite concerned about that as well. > > This objection would apply to all other proposals as well, FDW etc.. > > Do you see some way to add flexibility yet without adding a branch > point in the code? > It's impossible without a branch point in the code. The question is where this branch should be located. In particular, we can put this branch point into the planner by defining distinct executor nodes for each pluggable storage. In this case, each storage would have its own optimized executor nodes. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
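[Editorial note: the "branch point at plan time" idea can be sketched in miniature. All names below are invented for illustration: a "planner" binds a storage-specific getnext function once per scan, so the per-tuple executor loop pays one indirect call but contains no if/switch on the storage type.]

```c
#include <assert.h>
#include <stddef.h>

/* A scan whose storage-specific behavior is bound once, at "plan" time. */
typedef struct ScanState ScanState;
typedef int (*GetNextFunc)(ScanState *);    /* returns -1 at end of scan */

struct ScanState
{
    GetNextFunc getnext;    /* chosen at plan time */
    const int  *data;       /* stand-in for the relation */
    int         pos;
    int         len;
};

int
heap_getnext_sketch(ScanState *s)
{
    return s->pos < s->len ? s->data[s->pos++] : -1;
}

/* A second storage format would supply its own implementation; for
 * this sketch it behaves identically. */
int
columnar_getnext_sketch(ScanState *s)
{
    return s->pos < s->len ? s->data[s->pos++] : -1;
}

/* "Planner": pick the implementation once, per relation. */
void
plan_scan(ScanState *s, const int *data, int len, int use_columnar)
{
    s->getnext = use_columnar ? columnar_getnext_sketch : heap_getnext_sketch;
    s->data = data;
    s->pos = 0;
    s->len = len;
}

/* "Executor": tight per-tuple loop with no storage branch inside.
 * (Values must be non-negative, since -1 is the end-of-scan marker.) */
int
sum_scan(ScanState *s)
{
    int     v,
            sum = 0;

    while ((v = s->getnext(s)) != -1)
        sum += v;
    return sum;
}
```

Whether the indirect call is bound in a generic node (as here) or the planner emits a distinct node type per storage, the per-tuple path stays branch-free either way.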
Re: [HACKERS] Pluggable storage
On 16 August 2016 at 19:46, Andres Freund wrote: > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >> Thanks for taking a stab at this. I'd like to throw out a few concerns. >> >> One, I'm worried that adding an additional layer of pointer-jumping is >> going to slow things down and make Andres' work to speed up the >> executor more difficult. I don't know that there is a problem there, >> and if there is a problem I don't know what to do about it, but I >> think it's something we need to consider. > > I'm quite concerned about that as well. This objection would apply to all other proposals as well, FDW etc. Do you see some way to add flexibility yet without adding a branch point in the code? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Anastasia Lubennikova wrote: > 13.08.2016 02:15, Alvaro Herrera: > >To support this, we introduce StorageTuple and StorageScanDesc. > >StorageTuples represent a physical tuple coming from some storage AM. > >It is necessary to have a pointer to a StorageAmRoutine in order to > >manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. > > The StorageTuples concept looks really cool. I've got some questions on > details of implementation. > > Do StorageTuples have fields common to all implementations? > Or is StorageTuple a totally abstract structure that has nothing to do > with data, except pointing to it? > > I mean, we already have the HeapTupleData structure, which is a pretty > good candidate to replace with StorageTuple. I was planning to replace all uses of HeapTuple in the executor with StorageTuple, actually. But the main reason I would like to avoid HeapTupleData itself is that it contains an assumption that there is a single palloc chunk that contains the tuple (t_len and t_data). This might not be true in representations that split the tuple, for example in columnar storage where you have one column in page A and another column in page B, for the same tuple. I suppose there might be some point to keeping t_tableOid and t_self, though. > And maybe add a "t_handler" field that points to handler functions. > I'm not sure whether it will be the name of the StorageAm, or its OID, or maybe the > main function itself. Although, if I'm not mistaken, we always have > RelationData when we want to operate on the tuple, so having t_handler > in the StorageTuple is excessive. Yeah, I think the RelationData (or more precisely the StorageAmRoutine) is always going to be available, so I don't think we need a pointer in the tuple itself. > This approach allows us to minimize code changes and ensure that we > won't miss any function that handles tuples. > > Do you see any weak points of the suggestion? > What design do you use in your prototype? 
It's currently a "void *" pointer in my prototype. > >RelationData gains ->rd_stamroutine which is a pointer to the > >StorageAmRoutine for the relation in question. Similarly, > >TupleTableSlot is augmented with a link to the StorageAmRoutine to > >handle the StorageTuple it contains (probably in most cases it's set at > >the same time as the tupdesc). This implies that routines such as > >ExecAssignScanType need to pass down the StorageAmRoutine from the > >relation to the slot. > > If we already have this pointer in t_handler as described below, > we don't need to pass it between functions and slots. I think it's better to have it in slots, so you can install multiple tuples in the slot without having to change the routine pointers each time. > >The executor is modified so that instead of calling heap_insert etc > >directly, it uses rel->rd_stamroutine to call these methods. The > >executor is still in charge of dealing with indexes, constraints, and > >any other thing that's not the tuple storage itself (this is one major > >point in which this differs from FDWs). This all looks simple enough, > >with one exception and a few notes: > > That is exactly what I tried to describe in my proposal. > Chapter "Relation management". I'm sure, you've already noticed > that it will require huge source code cleaning. I've carefully read > the sources and found "violators" of abstraction in src/backend/commands. > The list is attached to the wiki page > https://wiki.postgresql.org/wiki/HeapamRefactoring. > > Except these, there are some pretty strange and unrelated functions in > src/backend/catalog. > I'm willing to fix them, but I'd like to synchronize our efforts. I very much would like to stay away from touching src/backend/catalog, which are the functions that deal with system catalogs. We can simply say that system catalogs are hardcoded to use heapam.c storage for now. 
If we later see a need to enable some particular catalog using a different storage implementation, we can change the code for that specific catalog in src/backend/catalog and everywhere else, to use the abstract API instead of hardcoding heap_insert etc. But that can be left for a second pass. (This is my point "iv" further below, to which you said "+1"). > Nothing to do, just substitute t_data with proper HeapTupleHeader > representation. I think it's a job for StorageAm. Let's say each StorageAm > must have stam_to_heaptuple() function and opposite function > stam_from_heaptuple(). Hmm, yeah, that also works. We'd have to check again whether it's more convenient to start as a slot rather than a StorageTuple. AFAICS the trigger.c code is all starting from a slot, so it makes sense to have the conversion use the slot code -- that way, there's no need for each storageAM to re-implement conversion to HeapTuple. > >note f) More widespread, MinimalTuples currently use a tweaked HeapTuple > >format. In the long run, it may be possible to replace them with a > >separate storage
Re: [HACKERS] Pluggable storage
13.08.2016 02:15, Alvaro Herrera: Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. The StorageTuples concept looks really cool. I've got some questions on details of implementation. Do StorageTuples have fields common to all implementations? Or is StorageTuple a totally abstract structure that has nothing to do with data, except pointing to it? I mean, we already have the HeapTupleData structure, which is a pretty good candidate to replace with StorageTuple. It's already widely used in the executor and, moreover, it's the only structure (except MinimalTuples and all those crazy optimizations) that works with tuples, whether extracted from a page or created on-the-fly in an executor node. 
typedef struct HeapTupleData { uint32 t_len; /* length of *t_data */ ItemPointerData t_self; /* SelfItemPointer */ Oid t_tableOid; /* table the tuple came from */ HeapTupleHeader t_data; /* -> tuple header and data */ } HeapTupleData; We can simply change the type of t_data from HeapTupleHeader to Pointer. And maybe add a "t_handler" field that points to handler functions. I'm not sure whether it will be the name of the StorageAm, or its OID, or maybe the main function itself. Although, if I'm not mistaken, we always have RelationData when we want to operate on the tuple, so having t_handler in the StorageTuple is excessive. typedef struct StorageTupleData { uint32 t_len; /* length of *t_data */ ItemPointerData t_self; /* SelfItemPointer */ Oid t_tableOid; /* table the tuple came from */ Pointer t_data; /* -> tuple header and data * This field should never be accessed directly, * only via StorageAm handler functions, * because we don't know the underlying data structure. */ ??? t_handler; /* StorageAm that knows what to do with the tuple */ } StorageTupleData; This approach allows us to minimize code changes and ensure that we won't miss any function that handles tuples. Do you see any weak points of the suggestion? What design do you use in your prototype? RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. If we already have this pointer in t_handler as described below, we don't need to pass it between functions and slots. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. 
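[Editorial note: a minimal sketch of that rd_stamroutine indirection. Every name here is invented, not the actual patch's API: the executor calls tuple_insert through a per-relation function table instead of calling heap_insert directly, and the resulting TID comes back via an out-parameter rather than being scribbled into the tuple.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy TID and opaque tuple types for the sketch. */
typedef struct { unsigned blkno, offnum; } ItemPointerSketch;
typedef void *StorageTuple;     /* opaque to the executor */

typedef struct StorageAmRoutineSketch
{
    void (*tuple_insert)(void *rel, StorageTuple tuple,
                         ItemPointerSketch *otid);
} StorageAmRoutineSketch;

typedef struct RelationSketch
{
    StorageAmRoutineSketch *rd_stamroutine;
    unsigned ntuples;           /* toy "storage": just a counter */
} RelationSketch;

/* One concrete storage AM ("heapam"): assigns the next line pointer
 * on block 0 and reports it via *otid, leaving the tuple untouched. */
void
heapam_tuple_insert_sketch(void *relp, StorageTuple tuple,
                           ItemPointerSketch *otid)
{
    RelationSketch *rel = (RelationSketch *) relp;

    (void) tuple;               /* the toy storage keeps no real data */
    otid->blkno = 0;
    otid->offnum = ++rel->ntuples;
}

StorageAmRoutineSketch heapam_routine_sketch = { heapam_tuple_insert_sketch };

/* Executor-side call site: no direct knowledge of heapam. */
ItemPointerSketch
executor_insert_sketch(RelationSketch *rel, StorageTuple tuple)
{
    ItemPointerSketch tid;

    rel->rd_stamroutine->tuple_insert(rel, tuple, &tid);
    return tid;
}
```

A columnar or other storage AM would supply its own routine struct; the executor-side call site stays unchanged.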
The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: That is exactly what I tried to describe in my proposal. Chapter "Relation management". I'm sure you've already noticed that it will require a huge source-code cleanup. I've carefully read the sources and found "violators" of abstraction in src/backend/commands. The list is attached to the wiki page https://wiki.postgresql.org/wiki/HeapamRefactoring. Except these, there are some pretty strange and unrelated functions in src/backend/catalog. I'm willing to fix them, but I'd like to synchronize our efforts. exception a) ExecMaterializeSlot needs special consideration. This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled
Re: [HACKERS] Pluggable storage
On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > Thanks for taking a stab at this. I'd like to throw out a few concerns. > > One, I'm worried that adding an additional layer of pointer-jumping is > going to slow things down and make Andres' work to speed up the > executor more difficult. I don't know that there is a problem there, > and if there is a problem I don't know what to do about it, but I > think it's something we need to consider. I'm quite concerned about that as well. > I am somewhat inclined to > believe that we need to restructure the executor in a bigger way so > that it passes around datums instead of tuples; I'm inclined to > believe that the current tuple-centric model is probably not optimal > even for the existing storage format. I actually prototyped that, and it's not an easy win so far. Column extraction cost, even after significant optimization, is still often a significant portion of the runtime. And e.g. projection only extracting all columns, after evaluating a restrictive qual referring to an "early" column, can be a significant win. We'd definitely have to give up on extracting columns 0..n when accessing later columns... Hm. Greetings, Andres Freund
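[Editorial note: the "give up on extracting columns 0..n" point above can be illustrated with a toy sketch of incremental column extraction. The structures are invented; the real slot_deform_tuple deals with variable-width and NULL fields and is far more involved. The idea: deform only up to the attribute actually requested and remember how far we got, so a restrictive qual on an early column never pays for later columns.]

```c
#include <assert.h>

/* A toy slot that caches how many leading columns are deformed. */
typedef struct
{
    const int  *raw;            /* stand-in for the on-disk tuple */
    int         natts;
    int         nvalid;         /* leading columns already deformed */
    int         values[8];      /* deformed copies */
    int         deform_calls;   /* instrumentation for the example */
} SlotSketch;

/* Fetch one attribute, deforming lazily and incrementally. */
int
slot_getattr_sketch(SlotSketch *slot, int attnum)   /* attnum is 0-based */
{
    while (slot->nvalid <= attnum)
    {
        /* in reality this walks variable-width fields; here it's a copy */
        slot->values[slot->nvalid] = slot->raw[slot->nvalid];
        slot->nvalid++;
        slot->deform_calls++;
    }
    return slot->values[attnum];
}
```

If a qual on attribute 1 fails, attributes 2..n are never extracted; re-fetching an already-deformed attribute costs nothing extra.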
Re: [HACKERS] Pluggable storage
13.08.2016 02:15, Alvaro Herrera: Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. Hello, thank you for starting this topic. I am working on a very similar proposal in another thread. https://commitfest.postgresql.org/10/700/ At first glance our drafts are very similar. I hope it means that we are moving in the right direction. It's great that your proposal is focused on interactions with the executor, while mine is mostly about the internals of the new StorageAm interface and its interactions with buffer and storage management. Please read and comment on https://wiki.postgresql.org/wiki/HeapamRefactoring I'll leave more concrete comments tomorrow. Thank you for the explicit description of issues. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. 
Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: exception a) ExecMaterializeSlot needs special consideration. This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled by ExecMaterializeSlot itself and calling a method from the storage AM to do any byte copying as needed. ExecMaterializeSlot no longer returns a HeapTuple, because there might not be any. The second usage pattern a2) is to create a HeapTuple that's passed to other modules which only deal with HT and not slots (triggers are the main case I noticed, but I think there are others such as the executor itself wanting tuples as Datum for some reason). For the moment I'm handling this by having a new ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of the original tuple format. note b) EvalPlanQual currently maintains an array of HeapTuple in EState->es_epqTuple. I think it works to replace that with an array of StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine methods in order to interact with it. Other than those changes, it seems okay. note c) nodeSubplan has curTuple as a HeapTuple. It seems simple to replace this with an independent slot-based tuple. note d) grp_firstTuple in nodeAgg / nodeSetOp. 
These are less simple than the above, but replacing the HeapTuple with a slot-based tuple seems doable too. note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual. TupleTableSlot also seems a good replacement. This has fallout in other users of EvalPlanQual, too. note f) More widespread, MinimalTuples currently use a tweaked HeapTuple format. In the long run, it may be possible to replace them with a separate storage module that's specifically designed to handle tuples meant for tuplestores etc. That may simplify TupleTableSlot and execTuples. For the moment we keep the tts_mintuple as it is. Whenever a tuple is not already in heap format, we heapify it in order to put in the store. The current heapam.c routines need some changes. Currently, practice is that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on their input tuples to set the resulting ItemPointer in tuple->t_self. This is messy if we want StorageTuples to be abstract. I'm changing this so that the resulting ItemPointer is returned in a separate output argument; the tuple itself is left alone. This is somewhat messy in the case of
Re: [HACKERS] Pluggable storage
On Fri, Aug 12, 2016 at 7:15 PM, Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. This will allow us to write permanent table storage that works > differently than heapam.c. At this stage, I haven't thought through > whether this is going to allow extensions to define new storage modules; > I am focusing on AMs that can coexist with heapam in core. Thanks for taking a stab at this. I'd like to throw out a few concerns. One, I'm worried that adding an additional layer of pointer-jumping is going to slow things down and make Andres' work to speed up the executor more difficult. I don't know that there is a problem there, and if there is a problem I don't know what to do about it, but I think it's something we need to consider. I am somewhat inclined to believe that we need to restructure the executor in a bigger way so that it passes around datums instead of tuples; I'm inclined to believe that the current tuple-centric model is probably not optimal even for the existing storage format. It seems even less likely to be right for a data format in which fetching columns is more expensive than currently, such as a columnar store. Two, I think that we really need to think very hard about how the query planner will interact with new heap storage formats. For example, suppose cstore_fdw were rewritten as a new heap storage format. Because ORC contains internal indexing structures with characteristics somewhat similar to BRIN, many scans can be executed much more efficiently than for our current heap storage format. If it can be seen that an entire chunk will fail to match the quals, we can skip the whole chunk. 
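[Editorial note: a toy sketch of that chunk-level pruning, in the spirit of ORC/BRIN min/max summaries. All structures are invented for illustration: for a qual "col > k", a chunk whose max <= k is skipped without reading any tuple, and a chunk whose min > k can be counted wholesale.]

```c
#include <assert.h>

/* Per-chunk metadata: min/max summary plus the tuples themselves. */
typedef struct
{
    int         minval;
    int         maxval;
    int         ntuples;
    const int  *tuples;
} ChunkSketch;

/* Count tuples with value > k, consulting chunk metadata first;
 * *chunks_scanned reports how many chunks needed a tuple-by-tuple
 * visit. */
int
count_greater_sketch(const ChunkSketch *chunks, int nchunks, int k,
                     int *chunks_scanned)
{
    int     count = 0;

    *chunks_scanned = 0;
    for (int c = 0; c < nchunks; c++)
    {
        if (chunks[c].maxval <= k)
            continue;                       /* no tuple can match */
        if (chunks[c].minval > k)
        {
            count += chunks[c].ntuples;     /* every tuple matches */
            continue;
        }
        (*chunks_scanned)++;                /* must look at the tuples */
        for (int i = 0; i < chunks[c].ntuples; i++)
            if (chunks[c].tuples[i] > k)
                count++;
    }
    return count;
}
```

Only chunks whose min/max range straddles the constant are actually scanned; the rest are resolved from metadata alone, which is what makes a pushed-down count potentially very fast.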
Some operations may permit additional optimizations: for example, given SELECT count(*) FROM thing WHERE quals, we may be able to push the COUNT(*) down into the heap access layer. If it can be verified that EVERY tuple in a chunk will match the quals, we can just increment the count by that number without visiting each tuple individually. This could be really fast. These kinds of query planner issues are generally why I have favored trying to do something like this through the FDW interface, which already has the right APIs for this kind of thing, even if we're not using them all yet. I don't say that's the only way to crack this problem, but I think we're going to find that a heap storage API that doesn't include adequate query planner integration is not a very exciting thing. Three, with respect to this limitation: > iii) All tuples need to be identifiable by ItemPointers. Storages that > have different requirements will need careful additional thought across > the board. I think it's a good idea for a first patch in this area to ignore (or mostly ignore) this problem - e.g. maybe allow such storage formats but refuse to create indexes on them. But eventually I think we're going to want/need to do something about it. There are an awful lot of interesting ideas that we can't pursue without addressing this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Aug 13, 2016 at 2:15 AM, Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. That's true. My design presented at PGCon was very sketchy, and I haven't delivered any prototype yet. > Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. Great! It's nice to see you're working in this direction! > This will allow us to write permanent table storage that works > differently than heapam.c. At this stage, I haven't thought through > whether this is going to allow extensions to define new storage modules; > I am focusing on AMs that can coexist with heapam in core. > So, as I get it, you're proposing extensibility to replace heap with something different but compatible with heap (with executor nodes, index access methods and so on). That's good, but what could the alternative storage access methods be? AFAICS, we can't fit here, for instance, another MVCC implementation (undo log), in-memory storage, or columnar storage. However, it seems that we would be able to make a compressed heap or an alternative HOT/WARM tuple mechanism. Correct me if I'm wrong. ISTM you're proposing something quite orthogonal to my view of pluggable storage engines. My idea, in general, was to extend the FDW mechanism to let it be something manageable inside the database (support for VACUUM, defining indexes and so on). But the imperative was that it should have its own executor nodes, and it doesn't have to be compatible with existing index access methods. Therefore, I think my design presented at PGCon and your current proposal are about orthogonal features which could coexist. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Pluggable storage
Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: exception a) ExecMaterializeSlot needs special consideration. 
This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled by ExecMaterializeSlot itself and calling a method from the storage AM to do any byte copying as needed. ExecMaterializeSlot no longer returns a HeapTuple, because there might not be any. The second usage pattern a2) is to create a HeapTuple that's passed to other modules which only deal with HT and not slots (triggers are the main case I noticed, but I think there are others such as the executor itself wanting tuples as Datum for some reason). For the moment I'm handling this by having a new ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of the original tuple format. note b) EvalPlanQual currently maintains an array of HeapTuple in EState->es_epqTuple. I think it works to replace that with an array of StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine methods in order to interact with it. Other than those changes, it seems okay. note c) nodeSubplan has curTuple as a HeapTuple. It seems simple to replace this with an independent slot-based tuple. note d) grp_firstTuple in nodeAgg / nodeSetOp. These are less simple than the above, but replacing the HeapTuple with a slot-based tuple seems doable too. note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual. TupleTableSlot also seems a good replacement. This has fallout in other users of EvalPlanQual, too. note f) More widespread, MinimalTuples currently use a tweaked HeapTuple format. In the long run, it may be possible to replace them with a separate storage module that's specifically designed to handle tuples meant for tuplestores etc. That may simplify TupleTableSlot and execTuples. For the moment we keep the tts_mintuple as it is. Whenever a tuple is not already in heap format, we heapify it in order to put in the store. The current heapam.c routines need some changes. 
Currently, practice is that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on their input tuples to set the resulting ItemPointer in tuple->t_self. This is messy if we want StorageTuples to be abstract. I'm changing this so that the resulting ItemPointer is returned in a separate output argument; the tuple itself is left alone. This is somewhat messy in the case of heap_multi_insert because it returns several items; I think it's acceptable to return an array of ItemPointers in the same order as the input tuples. This works fine for the only caller, which is COPY in batch mode. For the other routines, they don't really care where the TID is returned AFAICS. Additional noteworthy items: i) Speculative insertion: the speculative insertion token is no longer installed directly in the heap tuple by the executor (of course). Instead, the token becomes part of the slot. When the tuple_insert method is called, the insertion routine is in charge of setting the token from the slot into the storage tuple. Executor is