Re: [HACKERS] Pluggable storage
On Wed, Oct 25, 2017 at 1:59 PM, Amit Kapila wrote:
>> Another thing to consider is that, if we could replace satisfiesfunc,
>> it would probably break some existing code.  There are multiple places
>> in the code that compare snapshot->satisfies to
>> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>>
>> I think the storage API should just leave snapshots alone.  If a
>> storage engine wants to call HeapTupleSatisfiesVisibility() with that
>> snapshot, it can do so.  Otherwise it can switch on
>> snapshot->satisfies and handle each case however it likes.
>
> How will it switch satisfies at runtime?  There are places where we
> might know which visibility function (*MVCC, *Dirty, etc.) needs to be
> called, but I think there are other places (like heap_fetch) where it
> is not clear and we decide based on what is stored in
> snapshot->satisfies.

An alternative storage engine needs to provide its own implementation
of heap_fetch, and that replacement implementation can implement MVCC
and other snapshot behavior in any way it likes.

My point here is that I think it's better if the table access method
stuff doesn't end up modifying snapshots.  I think it's fine for a
table access method to get passed a standard snapshot.  Some code may
be needed to cater to the access method's specific needs, but that
code can live inside the table access method, without contaminating
the snapshot stuff.  We have to try to draw some boundary around table
access methods -- we don't want to end up teaching everything in the
system about them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Oct 25, 2017 at 11:37 AM, Robert Haas wrote:
> On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila wrote:
>> I think what we need here is a way to register satisfies function
>> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
>
> I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
> HeapTuple as an argument, and it cares in detail about that tuple's
> xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
> because an alternative storage engine is likely to use a different
> format than HeapTuple and to not have hint bits (or at least not in
> the same form we have them now).  Also, it doesn't necessarily have a
> Boolean answer to the question "can this snapshot see this tuple?".
> It may be more like "given this TID, what tuple if any can I see
> there?" or "given this tuple, what version of it would I see with
> this snapshot?".
>
> Another thing to consider is that, if we could replace satisfiesfunc,
> it would probably break some existing code.  There are multiple places
> in the code that compare snapshot->satisfies to
> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>
> I think the storage API should just leave snapshots alone.  If a
> storage engine wants to call HeapTupleSatisfiesVisibility() with that
> snapshot, it can do so.  Otherwise it can switch on
> snapshot->satisfies and handle each case however it likes.

How will it switch satisfies at runtime?  There are places where we
might know which visibility function (*MVCC, *Dirty, etc.) needs to be
called, but I think there are other places (like heap_fetch) where it
is not clear and we decide based on what is stored in
snapshot->satisfies.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
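[The runtime dispatch Amit asks about boils down to comparing the
snapshot's function pointer against the known visibility routines,
which is also what the existing checks against
HeapTupleSatisfiesHistoricMVCC/HeapTupleSatisfiesMVCC do.  A minimal,
self-contained sketch in C -- all names here (DemoTuple, DemoSnapshot,
my_am_tuple_visible) are hypothetical stand-ins, not the real
PostgreSQL declarations or code from the patch:]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for SnapshotData and SnapshotSatisfiesFunc. */
typedef struct DemoTuple
{
    bool    committed;      /* would be derived from xmin/xmax in reality */
    bool    own_change;     /* changed by the current "transaction" */
} DemoTuple;

typedef bool (*SatisfiesFunc) (DemoTuple *tup);

typedef struct DemoSnapshot
{
    SatisfiesFunc satisfies;
} DemoSnapshot;

static bool
DemoSatisfiesMVCC(DemoTuple *tup)
{
    return tup->committed;
}

static bool
DemoSatisfiesDirty(DemoTuple *tup)
{
    return tup->committed || tup->own_change;
}

/*
 * A storage engine's fetch path can classify an incoming snapshot by
 * comparing the function pointer, then run its own visibility logic
 * for that snapshot type instead of calling the heap routine.
 */
static bool
my_am_tuple_visible(DemoTuple *tup, DemoSnapshot *snapshot)
{
    if (snapshot->satisfies == DemoSatisfiesMVCC)
        return tup->committed;                      /* engine-specific MVCC */
    if (snapshot->satisfies == DemoSatisfiesDirty)
        return tup->committed || tup->own_change;   /* engine-specific dirty */
    return snapshot->satisfies(tup);                /* fall back */
}
```

[The pointer comparison leaves the snapshot itself untouched, which is
the boundary Robert argues for above; the cost is that the engine must
know the full set of satisfies routines it may be handed.]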
Re: [HACKERS] Pluggable storage
On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila wrote:
> I think what we need here is a way to register satisfies function
> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.

I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
HeapTuple as an argument, and it cares in detail about that tuple's
xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
because an alternative storage engine is likely to use a different
format than HeapTuple and to not have hint bits (or at least not in
the same form we have them now).  Also, it doesn't necessarily have a
Boolean answer to the question "can this snapshot see this tuple?".
It may be more like "given this TID, what tuple if any can I see
there?" or "given this tuple, what version of it would I see with this
snapshot?".

Another thing to consider is that, if we could replace satisfiesfunc,
it would probably break some existing code.  There are multiple places
in the code that compare snapshot->satisfies to
HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.

I think the storage API should just leave snapshots alone.  If a
storage engine wants to call HeapTupleSatisfiesVisibility() with that
snapshot, it can do so.  Otherwise it can switch on
snapshot->satisfies and handle each case however it likes.

I don't see how generalizing a Snapshot for other storage engines
really buys us anything except complexity and the danger of reducing
performance for the existing heap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Oct 14, 2017 at 1:09 AM, Alexander Korotkov wrote:
> On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas wrote:
>> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>>>> Fully agreed.
>>>
>>> If we implement that interface, where does that leave EvalPlanQual()?
>
> From the first glance, it seems that pluggable storage should override
> EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.

I think there is more to it.  Currently, EState->es_epqTuple is a
HeapTuple which is filled as part of the EvalPlanQual mechanism and
then later used during the scan.  We need to make it pluggable in some
way so that other heaps can work.  We also need some work for
EvalPlanQualFetchRowMarks as that also seems to be tightly coupled
with HeapTuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi wrote:
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>>> Currently I added a snapshot_satisfies API to find out whether the
>>> tuple satisfies the visibility or not with different types of
>>> visibility routines.  I feel these are somehow enough to develop a
>>> different storage method like UNDO.  The storage methods can decide
>>> internally how to provide the visibility.
>>>
>>> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
>>> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
>>> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>>> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
>>> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
>>> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
>>> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
>>> +
>>> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>>> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>>>
>>> Currently no changes are carried out in snapshot logic as that is
>>> kept separate from storage API.
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
> OK, understood.
> I will check along these lines and come up with storage APIs.

I think what we need here is a way to register satisfies function
(SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
That is the core API to decide visibility with respect to different
storage engines.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>>> Fully agreed.
>>
>> If we implement that interface, where does that leave EvalPlanQual()?

From the first glance, it seems that pluggable storage should override
EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.

>> Do those semantics have to be preserved?
>
> For a general-purpose heap storage format, I would say yes.

+1

> I mean, we don't really have control over how people use the API.  If
> somebody decides to implement a storage API that breaks EvalPlanQual
> semantics horribly, I can't stop them, and I don't want to stop them.
> Open source FTW.

Yeah.  We don't have any kind of "safe extensions".  Any extension can
break things really horribly.  For me that means the user should
absolutely trust the extension developer.

> But I don't really want that code in our tree, either.

We keep things in our tree as correct as we can.  And for sure, we
should follow this policy for pluggable storages too.

> I think a storage engine is and should be about the format in which
> data gets stored on disk, and that it should only affect the
> performance of queries not the answers that they give.

Pretty much the same idea as index access methods.  They also affect
performance, but not query answers.  When that's not true, the
situation is considered a bug, and it needs to be fixed.

> I am sure there will be cases where, for reasons of implementation
> complexity, that turns out not to be true, but I think in general we
> should try to avoid it as much as we can.

I think in some cases we can tolerate missing features (and document
them), but we shouldn't tolerate wrong features.  For instance, we may
have some pluggable storage which doesn't support transactions at all
(and that should be documented for sure), but we shouldn't have a
pluggable storage whose transaction support is wrong.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 9:37 PM, Robert Haas wrote:
> On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh wrote:
>> For some other storage engine, if we maintain the older version in
>> different storage, undo for example, and don't require a new index
>> entry, should we still call it HOT-chain?
>
> I would say, emphatically, no.  HOT is a creature of the existing
> heap.  If it's creeping into storage APIs they are not really
> abstracted from what we have currently.

+1, a different storage may need to insert entries into only *some* of
the indexes, and these new index entries may have either the same or
new TIDs.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan wrote:
>> Fully agreed.
>
> If we implement that interface, where does that leave EvalPlanQual()?
> Do those semantics have to be preserved?

For a general-purpose heap storage format, I would say yes.

I mean, we don't really have control over how people use the API.  If
somebody decides to implement a storage API that breaks EvalPlanQual
semantics horribly, I can't stop them, and I don't want to stop them.
Open source FTW.  But I don't really want that code in our tree,
either.

I think a storage engine is and should be about the format in which
data gets stored on disk, and that it should only affect the
performance of queries, not the answers that they give.  I am sure
there will be cases where, for reasons of implementation complexity,
that turns out not to be true, but I think in general we should try to
avoid it as much as we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh wrote:
> For some other storage engine, if we maintain the older version in
> different storage, undo for example, and don't require a new index
> entry, should we still call it HOT-chain?

I would say, emphatically, no.  HOT is a creature of the existing
heap.  If it's creeping into storage APIs they are not really
abstracted from what we have currently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 2:23 PM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
>> However I imply that alternative storage would share our "MVCC
>> model".  So, it should share our transactional model including
>> transactions, subtransactions, snapshots etc.  Therefore, if
>> alternative storage is transactional, then in particular it should
>> be able to fetch tuple with given TID according to given snapshot.
>> However, how it's implemented internally is a black box for us.
>> Thus, we don't insist that tuple should have different TID after
>> update; we don't insist there is any analogue of HOT; we don't
>> insist alternative storage needs vacuum (or if even it needs vacuum,
>> it might be performed in completely different way) and so on.
>
> Fully agreed.

If we implement that interface, where does that leave EvalPlanQual()?
Do those semantics have to be preserved?

-- 
Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi wrote:
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
> OK, understood.
> I will check along these lines and come up with storage APIs.

I've some doubts regarding the following function hook:

+typedef bool (*hot_search_buffer_hook) (ItemPointer tid, Relation relation,
+        Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
+        bool *all_dead, bool first_call);

As per my understanding, with the HOT feature a new tuple placed on
the same page and with all indexed columns the same as its parent row
version does not get new index entries (README.HOT).  For some other
storage engine, if we maintain the older version in different storage,
undo for example, and don't require a new index entry, should we still
call it a HOT-chain?  If not, IMHO, we may have something like
*search_buffer_hook(tid,,storageTuple,...).  Depending on the
underlying storage, one can traverse the HOT-chain or undo-chain or
some other multi-version strategy for non-key updates.

After a successful index search, most of the index AMs set
(HeapTupleData)xs_ctup->t_self of IndexScanDescData to the tupleid in
the storage.  IMHO, some changes are needed here to make it generic.

@@ -328,47 +376,27 @@ ExecStoreTuple(HeapTuple tuple,
 	Assert(tuple != NULL);
 	Assert(slot != NULL);
 	Assert(slot->tts_tupleDescriptor != NULL);
+	Assert(slot->tts_storageslotam != NULL);
 	/* passing shouldFree=true for a tuple on a disk page is not sane */
 	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);

For some storage engine, isn't it possible that the buffer is valid
and the tuple to be stored is formed in memory (for example, a tuple
formed from UNDO, an in-memory decrypted tuple, etc.)?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
>> Currently I added a snapshot_satisfies API to find out whether the
>> tuple satisfies the visibility or not with different types of
>> visibility routines.  I feel these are somehow enough to develop a
>> different storage method like UNDO.  The storage methods can decide
>> internally how to provide the visibility.
>>
>> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
>> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
>> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
>> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
>> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
>> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
>> +
>> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>>
>> Currently no changes are carried out in snapshot logic as that is
>> kept separate from storage API.
>
> That seems like a strange choice of API.  I think it should be more
> integrated with the scan logic.  For example, if I'm doing an index
> scan, and I get a TID, then I should be able to just say "here's a
> TID, give me any tuples associated with that TID that are visible to
> the scan snapshot".  Then for the current heap it will do
> heap_hot_search_buffer, and for zheap it will walk the undo chain and
> return the relevant tuple from the chain.

OK, understood.
I will check along these lines and come up with storage APIs.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi wrote:
> Currently I added a snapshot_satisfies API to find out whether the
> tuple satisfies the visibility or not with different types of
> visibility routines.  I feel these are somehow enough to develop a
> different storage method like UNDO.  The storage methods can decide
> internally how to provide the visibility.
>
> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
> +
> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>
> Currently no changes are carried out in snapshot logic as that is
> kept separate from storage API.

That seems like a strange choice of API.  I think it should be more
integrated with the scan logic.  For example, if I'm doing an index
scan, and I get a TID, then I should be able to just say "here's a
TID, give me any tuples associated with that TID that are visible to
the scan snapshot".  Then for the current heap it will do
heap_hot_search_buffer, and for zheap it will walk the undo chain and
return the relevant tuple from the chain.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
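[The interface Robert sketches -- "here's a TID, give me whatever
tuple at that TID is visible to the scan snapshot" -- can be expressed
as a single fetch callback that walks a version chain internally.
Everything below is a simplified illustration with hypothetical
DemoTid/DemoTuple types and a boolean standing in for the snapshot
test; it is not the real heap_hot_search_buffer or zheap code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for ItemPointer and a versioned tuple. */
typedef struct DemoTid { unsigned block; unsigned offset; } DemoTid;

typedef struct DemoTuple
{
    DemoTid     tid;
    int         version;        /* newer versions chain to older ones */
    bool        visible;        /* stands in for the snapshot test */
    struct DemoTuple *prior;    /* HOT chain / undo chain analogue */
} DemoTuple;

/*
 * "Given this TID, return the version visible to the snapshot."  For the
 * current heap this would walk the HOT chain via heap_hot_search_buffer;
 * an undo-based engine would walk its undo chain instead.  Either way,
 * the chain walk is private to the storage method.
 */
static DemoTuple *
fetch_visible(DemoTuple *newest, DemoTid tid)
{
    for (DemoTuple *t = newest; t != NULL; t = t->prior)
    {
        if (t->tid.block == tid.block && t->tid.offset == tid.offset &&
            t->visible)
            return t;
    }
    return NULL;
}
```

[The caller (an index scan, say) never learns whether the engine has
HOT, undo, or something else -- it only hands over a TID and a
snapshot, which is the abstraction boundary the thread converges on.]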
Re: [HACKERS] Pluggable storage
On Fri, Oct 13, 2017 at 8:23 AM, Robert Haas wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
>> It's probable that we imply different meanings to "MVCC
>> implementation".  While writing "MVCC implementation" I meant that,
>> for instance, alternative storage may implement UNDO chains to store
>> versions of the same row.  Correspondingly, it may not have any
>> analogue of our HOT.
>
> Yes, the zheap project on which EnterpriseDB is working has precisely
> this characteristic.
>
>> However I imply that alternative storage would share our "MVCC
>> model".  So, it should share our transactional model including
>> transactions, subtransactions, snapshots etc.  Therefore, if
>> alternative storage is transactional, then in particular it should
>> be able to fetch tuple with given TID according to given snapshot.
>> However, how it's implemented internally is a black box for us.
>> Thus, we don't insist that tuple should have different TID after
>> update; we don't insist there is any analogue of HOT; we don't
>> insist alternative storage needs vacuum (or if even it needs vacuum,
>> it might be performed in completely different way) and so on.
>
> Fully agreed.

Currently I added a snapshot_satisfies API to find out whether the
tuple satisfies the visibility or not with different types of
visibility routines.  I feel these are somehow enough to develop a
different storage method like UNDO.  The storage methods can decide
internally how to provide the visibility.

+ amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
+ amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
+ amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
+ amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
+ amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
+ amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
+ amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
+
+ amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
+ amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;

Currently no changes are carried out in snapshot logic as that is kept
separate from storage API.

Regards,
Hari Babu
Fujitsu Australia
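[For reference, the registration scheme quoted above amounts to an
enum-indexed table of visibility callbacks hung off the AM routine
struct.  A compilable toy version follows; the Demo* types and names
are simplified stand-ins (the real patch's functions take HeapTuple,
buffer, and snapshot arguments):]

```c
#include <assert.h>
#include <stdbool.h>

/* Visibility classes, indexing the callback table. */
typedef enum DemoVisibilityType
{
    DEMO_MVCC_VISIBILITY,
    DEMO_SELF_VISIBILITY,
    DEMO_ANY_VISIBILITY,
    DEMO_NUM_VISIBILITY         /* array size, not a real entry */
} DemoVisibilityType;

typedef struct DemoTuple { bool committed; } DemoTuple;
typedef bool (*DemoSatisfiesFunc) (DemoTuple *tup);

static bool DemoSatisfiesMVCC(DemoTuple *t) { return t->committed; }
static bool DemoSatisfiesSelf(DemoTuple *t) { (void) t; return true; }
static bool DemoSatisfiesAny(DemoTuple *t)  { (void) t; return true; }

typedef struct DemoAmRoutine
{
    DemoSatisfiesFunc snapshot_satisfies[DEMO_NUM_VISIBILITY];
} DemoAmRoutine;

/* Each storage method's handler fills the table with its own routines. */
static DemoAmRoutine *
demo_heap_handler(void)
{
    static DemoAmRoutine routine;

    routine.snapshot_satisfies[DEMO_MVCC_VISIBILITY] = DemoSatisfiesMVCC;
    routine.snapshot_satisfies[DEMO_SELF_VISIBILITY] = DemoSatisfiesSelf;
    routine.snapshot_satisfies[DEMO_ANY_VISIBILITY]  = DemoSatisfiesAny;
    return &routine;
}
```

[A different storage method would fill the same slots with its own
routines.  Robert's objection in the follow-up is that these callbacks
still traffic in heap-format tuples, which is what the TID-based fetch
proposal later in the thread avoids.]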
Re: [HACKERS] Pluggable storage
On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov wrote:
> It's probable that we imply different meanings to "MVCC
> implementation".  While writing "MVCC implementation" I meant that,
> for instance, alternative storage may implement UNDO chains to store
> versions of the same row.  Correspondingly, it may not have any
> analogue of our HOT.

Yes, the zheap project on which EnterpriseDB is working has precisely
this characteristic.

> However I imply that alternative storage would share our "MVCC
> model".  So, it should share our transactional model including
> transactions, subtransactions, snapshots etc.  Therefore, if
> alternative storage is transactional, then in particular it should
> be able to fetch tuple with given TID according to given snapshot.
> However, how it's implemented internally is a black box for us.
> Thus, we don't insist that tuple should have different TID after
> update; we don't insist there is any analogue of HOT; we don't
> insist alternative storage needs vacuum (or if even it needs vacuum,
> it might be performed in completely different way) and so on.

Fully agreed.

> During conversations with you at PGCon and other conferences I had
> the impression that you share this view on pluggable storages and
> MVCC.  Probably, we just express this view in different words.  Or
> alternatively I might understand you terribly wrong.

No, it sounds like we are on the same page.  I'm only hoping that we
don't end up with a bunch of storage engines that each use a different
XID space or something icky like that.  I don't think the API should
try to cater to that sort of development.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Wed, Oct 11, 2017 at 11:08 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
>> For me, it's crucial point that pluggable storages should be able to
>> have different MVCC implementation, and correspondingly have full
>> control over its interactions with indexes.
>> Thus, it would be good if we would get consensus on that point.  I'd
>> like other discussion participants to comment whether they
>> agree/disagree and why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I
> think it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.  I don't know
> that it would be a very good idea to have two different full-fledged
> MVCC implementations, though.  Like Tom says, that would be
> replicating a lot of the awfulness of the MySQL model.

It's probable that we imply different meanings to "MVCC
implementation".  While writing "MVCC implementation" I meant that,
for instance, alternative storage may implement UNDO chains to store
versions of the same row.  Correspondingly, it may not have any
analogue of our HOT.

However I imply that alternative storage would share our "MVCC model".
So, it should share our transactional model including transactions,
subtransactions, snapshots etc.  Therefore, if alternative storage is
transactional, then in particular it should be able to fetch a tuple
with a given TID according to a given snapshot.  However, how it's
implemented internally is a black box for us.  Thus, we don't insist
that a tuple should have a different TID after update; we don't insist
there is any analogue of HOT; we don't insist alternative storage
needs vacuum (or even if it needs vacuum, it might be performed in a
completely different way) and so on.

During conversations with you at PGCon and other conferences I had the
impression that you share this view on pluggable storages and MVCC.
Probably, we just express this view in different words.  Or
alternatively I might understand you terribly wrong.

-- 
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Oct 11, 2017 at 1:08 PM, Robert Haas wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
>> For me, it's crucial point that pluggable storages should be able to
>> have different MVCC implementation, and correspondingly have full
>> control over its interactions with indexes.
>> Thus, it would be good if we would get consensus on that point.  I'd
>> like other discussion participants to comment whether they
>> agree/disagree and why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I
> think it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.

I agree with Alexander -- if you're going to have a new MVCC
implementation, you have to do significant work within index access
methods.  Adding "retail index tuple deletion" is probably just the
beginning.  ISTM that you need something like InnoDB's purge thread
when index values change, since two versions of the same index tuple
(each with distinct attribute values) have to physically co-exist for
a time.

> I don't know that it would be a very good idea to have two different
> full-fledged MVCC implementations, though.  Like Tom says, that would
> be replicating a lot of the awfulness of the MySQL model.

It's not just the MySQL model, FWIW.  SQL-on-Hadoop systems like
Impala, certain NoSQL systems, and AFAIK any database system that
claims to have pluggable storage all do it this way.  That is, core
transaction management functions (e.g. MVCC snapshot acquisition) are
outsourced to the storage engine.  It *is* very cumbersome, but that's
what they do.

-- 
Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov wrote:
> For me, it's crucial point that pluggable storages should be able to
> have different MVCC implementation, and correspondingly have full
> control over its interactions with indexes.
> Thus, it would be good if we would get consensus on that point.  I'd
> like other discussion participants to comment whether they
> agree/disagree and why.
> Any comments?

I think it's good for new storage managers to have full control over
interactions with indexes.  I'm not sure about the MVCC part.  I think
it would be legitimate to want a storage manager to ignore MVCC
altogether - e.g. to build a non-transactional table.  I don't know
that it would be a very good idea to have two different full-fledged
MVCC implementations, though.  Like Tom says, that would be
replicating a lot of the awfulness of the MySQL model.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Mon, Oct 9, 2017 at 5:32 PM, Tom Lanewrote: > Alexander Korotkov writes: > > For me, it's crucial point that pluggable storages should be able to have > > different MVCC implementation, and correspondingly have full control over > > its interactions with indexes. > > Thus, it would be good if we would get consensus on that point. I'd like > > other discussion participants to comment whether they agree/disagree and > > why. > > Any comments? > > TBH, I think that's a good way of ensuring that nothing will ever get > committed. You're trying to draw the storage layer boundary at a point > that will take in most of the system. If we did build it like that, > what we'd end up with would be very reminiscent of mysql's storage > engines, complete with inconsistent behaviors and varying feature sets > across engines. I don't much want to go there. > However, if we insist that pluggable storage should have the same MVCC implementation, interacts with indexes the same way and also use TIDs as tuple identifiers, then what useful implementations might we have? Per-page heap compression and encryption? Or different heap page layout? Or tuple format? OK, but that doesn't justify such wide API as it's implemented in the current version of patch in this thread. If we really want to restrict applicability of pluggable storages that way, then we probably should give up with "pluggable storages" and make it "pluggable heap page format" at I proposed upthread. Implementation of alternative storage would be hard and challenging task. Yes, it would include reimplementation of significant part of the system. But that seems inevitable if we're going to implement alternative really storages (not just hacks over existing storage). And I don't think that our pluggable storages would be reminiscent of mysql's storage engines while we're keeping two properties: 1) All the storages use the same WAL stream, 2) All the storages use same transactions and snapshots. 
If we keep these two properties, we would need neither 2PC to run transactions across different storages nor a separate log for replication. These two are major drawbacks of the MySQL model. Varying feature sets across engines seem inevitable and natural. We have to invent alternative storages to get features that are hard to have in our current storage. So, no wonder that feature sets would vary... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Alexander Korotkov writes: > For me, it's crucial point that pluggable storages should be able to have > different MVCC implementation, and correspondingly have full control over > its interactions with indexes. > Thus, it would be good if we would get consensus on that point. I'd like > other discussion participants to comment whether they agree/disagree and > why. > Any comments? TBH, I think that's a good way of ensuring that nothing will ever get committed. You're trying to draw the storage layer boundary at a point that will take in most of the system. If we did build it like that, what we'd end up with would be very reminiscent of mysql's storage engines, complete with inconsistent behaviors and varying feature sets across engines. I don't much want to go there. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Sep 27, 2017 at 7:51 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > I took a look on this patch. I've following notes for now. > > tuple_insert_hook tuple_insert; /* heap_insert */ >> tuple_update_hook tuple_update; /* heap_update */ >> tuple_delete_hook tuple_delete; /* heap_delete */ > > > I don't think this set of functions provides good enough level of > abstraction for storage AM. This functions encapsulate only low-level work > of insert/update/delete tuples into heap itself. However, it still assumed > that indexes are managed outside of storage AM. I don't think this is > right, assuming that one of most demanded storage API usage would be > different MVCC implementation (for instance, UNDO log based as discussed > upthread). Different MVCC implementation is likely going to manage indexes > in a different way. For example, storage AM utilizing UNDO would implement > in-place update even when indexed columns are modified. Therefore this > piece of code in ExecUpdate() wouldn't be relevant anymore. > > /* >> * insert index entries for tuple >> * >> * Note: heap_update returns the tid (location) of the new tuple in >> * the t_self field. >> * >> * If it's a HOT update, we mustn't insert new index entries. >> */ >> if ((resultRelInfo->ri_NumIndices > 0) && >> !storage_tuple_is_heaponly(resultRelationDesc, >> tuple)) >> recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid), >> estate, false, NULL, NIL); > > > I'm firmly convinced that this logic should be encapsulated into storage > AM altogether with inserting new index tuples on storage insert. Also, HOT > should be completely encapsulated into heapam. It's quite evident for me > that storage utilizing UNDO wouldn't have anything like our current HOT. > Therefore, I think there shouldn't be hot_search_buffer() API function. > tuple_fetch() may cover hot_search_buffer(). That might require some > signature change of tuple_fetch() (probably, extra arguments). 
For me, it's a crucial point that pluggable storages should be able to have different MVCC implementations, and correspondingly have full control over their interactions with indexes. Thus, it would be good if we could get consensus on that point. I'd like the other discussion participants to comment on whether they agree/disagree and why. Any comments? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi! I took a look at this patch. I have the following notes for now. tuple_insert_hook tuple_insert; /* heap_insert */ > tuple_update_hook tuple_update; /* heap_update */ > tuple_delete_hook tuple_delete; /* heap_delete */ I don't think this set of functions provides a good enough level of abstraction for a storage AM. These functions encapsulate only the low-level work of inserting/updating/deleting tuples in the heap itself. However, it is still assumed that indexes are managed outside of the storage AM. I don't think this is right, given that one of the most demanded usages of the storage API would be a different MVCC implementation (for instance, UNDO log based, as discussed upthread). A different MVCC implementation is likely going to manage indexes in a different way. For example, a storage AM utilizing UNDO would implement in-place update even when indexed columns are modified. Therefore this piece of code in ExecUpdate() wouldn't be relevant anymore. /* > * insert index entries for tuple > * > * Note: heap_update returns the tid (location) of the new tuple in > * the t_self field. > * > * If it's a HOT update, we mustn't insert new index entries. > */ > if ((resultRelInfo->ri_NumIndices > 0) && > !storage_tuple_is_heaponly(resultRelationDesc, tuple)) > recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid), > estate, false, NULL, NIL); I'm firmly convinced that this logic should be encapsulated into the storage AM, together with inserting new index tuples on storage insert. Also, HOT should be completely encapsulated into heapam. It's quite evident to me that a storage utilizing UNDO wouldn't have anything like our current HOT. Therefore, I think there shouldn't be a hot_search_buffer() API function. tuple_fetch() may cover hot_search_buffer(). That might require some signature change of tuple_fetch() (probably, extra arguments). LockTupleMode and HeapUpdateFailureData shouldn't be private to heapam. 
Any full-weight OLTP storage AM should support our tuple lock modes and should be able to report update failures. HeapUpdateFailureData should be renamed to something like StorageUpdateFailureData. The contents of HeapUpdateFailureData seem general enough to be supported by any storage with an ItemPointer tuple locator. storage_setscanlimits() is used only during index build. I think that since a storage AM may have a different MVCC implementation, the storage AM should decide how to communicate with indexes, including index build. Therefore, instead of exposing storage_setscanlimits(), the whole IndexBuildHeapScan() should be encapsulated into the storage AM. Also, BulkInsertState should be private to heapam. Other storages may have a different state for bulk insert. At the API level we might have some abstract pointer instead of BulkInsertState, while having GetBulkInsertState and others as API methods. storage_freeze_tuple() is called only once, from rewrite_heap_tuple(). That makes me think that the tuple_freeze API method is wrong as an abstraction. We probably should make rewrite_heap_tuple() or even the whole rebuild_relation() an API method... Heap reloptions are untouched for now. A storage AM should be able to provide its own specific options, just like index AMs do. That's all I have for now. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 8:26 AM, Haribabu Kommi wrote: > - Minor: don't think the _function suffix for Storis necessary, just >> makes things long, and every member has it. Besides that, it's also >> easy to misunderstand - for a second I understood >> scan_getnext_function to be about getting the next function... >> > > OK. How about adding _hook? > I've answered Andres on why I think the _function suffix is OK for now. And I don't particularly like the _hook suffix for this purpose, because those functions are parts of an API implementation, not hooks. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi, Thank you for the review on this subject. I think it's extremely important for PostgreSQL to eventually get a pluggable storage API. In general I agree with all your points, but I'd like to make a couple of comments. On Tue, Aug 15, 2017 at 9:53 AM, Andres Freund wrote: > - I don't think we should introduce this without a user besides > heapam. The likelihood that API will be usable by anything else > without a testcase seems fairly remote. I think some src/test/modules > type implementation of a per-session, in-memory storage - relatively > easy to implement - or such is necessary. > +1 for having a user before committing the API. However, I'd like to note that the sample storage implementation should do something really different from our current heap. In particular, if a per-session, in-memory storage were just another way to keep the heap in local buffers, it wouldn't be OK for me, because that kind of storage could be achieved much more easily without such a complex API. But if the per-session, in-memory storage would, for instance, utilize a different MVCC implementation, that would be a very good sample of storage API usage. > - Minor: don't think the _function suffix for Storis necessary, just > makes things long, and every member has it. Besides that, it's also > easy to misunderstand - for a second I understood > scan_getnext_function to be about getting the next function... > The _function suffix looks long to me too. But we should look at this question from a uniformity point of view. FdwRoutine, TsmRoutine, and IndexAmRoutine use the _function suffix. This is why I think we should use the _function suffix for StorageAmRoutine unless we're going to change that for the other *Routines too. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Sep 14, 2017 at 8:17 AM, Haribabu Kommi wrote: > Instead of modifying the Bitmap Heap and Sample scans to avoid referring to the internal members of the HeapScanDesc, I divided the HeapScanDesc into two parts. > > 1. StorageScanDesc > 2. HeapPageScanDesc > > The StorageScanDesc contains the minimal information that is required outside the storage routine, and this must be provided by all storage routines. This structure contains minimal information such as relation, snapshot, buffer, etc. > > The HeapPageScanDesc contains the extra information that is required for Bitmap Heap and Sample scans to work. This structure contains the information on blocks, visible offsets, etc. Currently this structure is used only in the Bitmap Heap and Sample scans and their supporting contrib modules, except the pgstattuple module. pgstattuple needs some additional changes. > > I added an additional storage API to return the HeapPageScanDesc as required by the Bitmap Heap and Sample scans, and this API is called only in these two scans. Also, these scan methods are chosen by the planner only when the storage routine supports returning a HeapPageScanDesc. Currently I implemented the planner support only for Bitmap; it is yet to be done for the Sample scan. > > With the above approach, I removed all the references to HeapScanDesc outside the heap. The changes for this approach are available in 0008-Remove-HeapScanDesc-usage-outside-heap.patch. > > Suggestions/comments on the above approach? For me, that's an interesting idea. Naturally, the way BitmapHeapScan and SampleScan work, even at a very high level, is applicable only for some storage AMs (i.e. heap-like storage AMs). For example, an index-organized table wouldn't ever support BitmapHeapScan, because it refers to tuples by PK values, not TIDs. However, in this case, the storage AM might have some alternative to our BitmapHeapScan. 
So, an index-organized table might have some compressed representation of an ordered set of PK values and use it for bulk fetches from the PK index. Therefore, I think it would be nice to make BitmapHeapScan a heap-storage-AM-specific scan method, while other storage AMs could provide their own storage-AM-specific scan methods. Probably that would be too much for this patchset and should be done during one of the next work cycles on storage AMs (I'm sure that such a huge project as pluggable storage AMs will have multiple iterations). Similarly, SampleScans contain storage-AM-specific logic. For instance, our SYSTEM sampling method fetches random blocks from the heap, providing a high-performance way to sample the heap. Coming back to the example of an index-organized table, it could provide its own storage-AM-specific table sampling methods, including sophisticated PK tree traversal fetching random small ranges of PKs. Given that tablesample methods are already pluggable, making them storage-AM-specific would lead to user-visible changes; i.e., a tablesample method would be created for a particular storage AM or set of storage AMs. However, I haven't yet figured out what exactly the API should look like... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Sat, Sep 9, 2017 at 1:23 PM, Haribabu Kommi wrote: > > I rebased the patch to the latest master and also fixed the duplicate OID > and some slot fixes. Updated patches are attached. > While analyzing the removal of HeapScanDesc usage outside the heap modules, the most used member is "rs_cbuf", for the purposes of locking the buffer, visibility checks and marking it as dirty. The buffer is tightly integrated with visibility. A buffer may not be required for some storage routines where the data is always in memory, etc. I found references to its structure members in the following places. I feel that, other than the following 4 members, the rest of them need to be moved into their corresponding storage routines.

Relation rs_rd; /* heap relation descriptor */
Snapshot rs_snapshot; /* snapshot to see */
int rs_nkeys; /* number of scan keys */
ScanKey rs_key; /* array of scan key descriptors */

But currently I am treating "rs_cbuf" also as a needed member and am expecting all storage routines to provide it. Or we may need another approach to mark the buffer as dirty. Suggestions? The following are the rest of the members that are used outside the heap.

BlockNumber rs_nblocks; /* total number of blocks in rel */
  pgstattuple.c, tsm_system_rows.c, tsm_system_time.c, system.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- mostly for checking the number of blocks in a rel.
BufferAccessStrategy rs_strategy; /* access strategy for reads */
  pgstattuple.c
bool rs_pageatatime; /* verify visibility page-at-a-time? */
BlockNumber rs_startblock; /* block # to start at */
bool rs_syncscan; /* report location to syncscan logic? */
bool rs_inited; /* false = scan not init'd yet */
  nodeSamplescan.c
HeapTupleData rs_ctup; /* current tuple in scan, if any */
  genam.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- used for retrieving the last scanned tuple.
BlockNumber rs_cblock; /* current block # in scan, if any */
  index.c, nodeSamplescan.c
Buffer rs_cbuf; /* current buffer in scan, if any */
  pgrowlocks.c, pgstattuple.c, genam.c, index.c, cluster.c, tablecmds.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- mostly used for locking the buffer.
ParallelHeapScanDesc rs_parallel; /* parallel scan information */
  nodeSeqscan.c
int rs_cindex; /* current tuple's index in vistuples */
  nodeBitmapHeapscan.c
int rs_ntuples; /* number of visible tuples on page */
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
  tsm_system_rows.c, nodeBitmapHeapscan.c, nodeSamplescan.c -- used to retrieve the offsets, mainly in bitmap and sample scans.

I think the usage of the rest of the above members outside the heap can be changed once the Bitmap and Sample scans are modified to use the storage routines when returning tuples, instead of their own implementations. I feel these scans are the major users of the rest of the members. This approach may need some more APIs to get rid of the Bitmap and Sample scans' own implementations. Suggestions? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Fri, Sep 1, 2017 at 1:51 PM, Haribabu Kommi wrote: > Here I attached new set of patches that are rebased to the latest master. Hi Haribabu, Could you please post a new rebased version? 0007-Scan-functions-are-added-to-storage-AM.patch conflicts with recent changes. Thanks! -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 11:59 PM, Amit Kapila wrote: > On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi > wrote: > > > > > > On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila > > wrote: > >> > >> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi > >> wrote: > >> > > >> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > > >> > wrote: > >> >> > >> >> > >> >> Also, it is quite possible that some of the storage Am's don't even > >> >> want to return bool as a parameter from HeapTupleSatisfies* API's. I > >> >> guess what we need here is to provide a way so that different storage > >> >> am's can register their function pointer for an equivalent to > >> >> satisfies function. So, we need to change > >> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different > >> >> handlers can register their function instead of using that directly. > >> >> I think that should address the problem you are planning to solve by > >> >> omitting buffer parameter. > >> > > >> > > >> > Thanks for your suggestion. Yes, it is better to go in the direction > of > >> > SnapshotSatisfiesFunc. > >> > > >> > I verified the above idea of implementing the Tuple visibility > functions > >> > and assign them into the snapshotData structure based on the snapshot. > >> > > >> > The Tuple visibility functions that are specific to the relation are > >> > available > >> > with the RelationData structure and this structure may not be > available, > >> > > >> > >> Which functions are you referring here? I don't see anything in > >> tqual.h that uses RelationData. > > > > > > > > With storage API's, the tuple visibility functions are available with > > RelationData > > and those are needs used to update the SnapshotData structure > > SnapshotSatisfiesFunc member. > > > > But the RelationData is not available everywhere, where the snapshot is > > created, > > but it is available every place where the tuple visibility is checked. 
> So I > > just changed > > the way of checking the tuple visibility with the information of > snapshot by > > calling > > the corresponding tuple visibility function from RelationData. > > > > If SnapshotData provides MVCC, then the MVCC specific tuple visibility > > function from > > RelationData is called. The SnapshotSatisfiesFunc member is changed to a > > enum > > that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc. > > Whenever > > the visibility check is needed, the corresponding function is called. > > > > It will be easy to understand and see if there is some better > alternative once you have something in the form of a patch. Sorry for the delay. I will submit the new patch series, with all comments given upthread, to the upcoming commitfest. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi wrote: > > > On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila > wrote: >> >> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi >> wrote: >> > >> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila >> > wrote: >> >> >> >> >> >> Also, it is quite possible that some of the storage Am's don't even >> >> want to return bool as a parameter from HeapTupleSatisfies* API's. I >> >> guess what we need here is to provide a way so that different storage >> >> am's can register their function pointer for an equivalent to >> >> satisfies function. So, we need to change >> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different >> >> handlers can register their function instead of using that directly. >> >> I think that should address the problem you are planning to solve by >> >> omitting buffer parameter. >> > >> > >> > Thanks for your suggestion. Yes, it is better to go in the direction of >> > SnapshotSatisfiesFunc. >> > >> > I verified the above idea of implementing the Tuple visibility functions >> > and assign them into the snapshotData structure based on the snapshot. >> > >> > The Tuple visibility functions that are specific to the relation are >> > available >> > with the RelationData structure and this structure may not be available, >> > >> >> Which functions are you referring here? I don't see anything in >> tqual.h that uses RelationData. > > > > With storage API's, the tuple visibility functions are available with > RelationData > and those are needs used to update the SnapshotData structure > SnapshotSatisfiesFunc member. > > But the RelationData is not available everywhere, where the snapshot is > created, > but it is available every place where the tuple visibility is checked. So I > just changed > the way of checking the tuple visibility with the information of snapshot by > calling > the corresponding tuple visibility function from RelationData. 
> > If SnapshotData provides MVCC, then the MVCC specific tuple visibility > function from > RelationData is called. The SnapshotSatisfiesFunc member is changed to a > enum > that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc. > Whenever > the visibility check is needed, the corresponding function is called. > It will be easy to understand and see if there is some better alternative once you have something in the form of a patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila wrote: > On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi > wrote: > > > > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > > wrote: > >> > >> > >> Also, it is quite possible that some of the storage Am's don't even > >> want to return bool as a parameter from HeapTupleSatisfies* API's. I > >> guess what we need here is to provide a way so that different storage > >> am's can register their function pointer for an equivalent to > >> satisfies function. So, we need to change > >> SnapshotData.SnapshotSatisfiesFunc in some way so that different > >> handlers can register their function instead of using that directly. > >> I think that should address the problem you are planning to solve by > >> omitting buffer parameter. > > > > > > Thanks for your suggestion. Yes, it is better to go in the direction of > > SnapshotSatisfiesFunc. > > > > I verified the above idea of implementing the Tuple visibility functions > > and assign them into the snapshotData structure based on the snapshot. > > > > The Tuple visibility functions that are specific to the relation are > > available > > with the RelationData structure and this structure may not be available, > > > > Which functions are you referring here? I don't see anything in > tqual.h that uses RelationData. With the storage APIs, the tuple visibility functions are available with RelationData, and those need to be used to update the SnapshotData structure's SnapshotSatisfiesFunc member. But the RelationData is not available everywhere the snapshot is created; it is, however, available every place where tuple visibility is checked. So I just changed the way of checking tuple visibility: using the information in the snapshot, the corresponding tuple visibility function from RelationData is called. If SnapshotData specifies MVCC, then the MVCC-specific tuple visibility function from RelationData is called. 
The SnapshotSatisfiesFunc member is changed to an enum that holds the tuple visibility type, such as MVCC, DIRTY, SELF, etc. Whenever a visibility check is needed, the corresponding function is called. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Aug 15, 2017 at 4:53 PM, Andres Freund wrote: > Hi, > > > On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote: > > Here I attached WIP patches to support pluggable storage. The patch > series > > are may not work individually. Still so many things are under > development. > > These patches are just to share the approach of the current development. > > Making a pass through the patchset to get a feel where this at, and > where this is headed. I previously skimmed the thread to get a rough > sense on what's discused, but not in a very detailed manner. > Thanks for the review. General: > > - I think one important discussion we need to have is what kind of > performance impact we're going to accept introducing this. It seems > very likely that this'll cause some slowdown. We can kind of > alleviate that by doing some optimizations at the same time, but > nevertheless, this abstraction is going to cost. > OK. Maybe to take some decision we need performance figures; I will measure the performance once the APIs are stabilized. - I don't think we should introduce this without a user besides > heapam. The likelihood that API will be usable by anything else > without a testcase seems fairly remote. I think some src/test/modules > type implementation of a per-session, in-memory storage - relatively > easy to implement - or such is necessary. > Sure, I will add a test module once the APIs are stabilized. > - I think, and detailed some of that, we're going to need some cleanups > that go in before this, to decrease the size / increase the quality of > the new APIs. It's going to get more painful to change APIs > subsequently. > > - We'll have to document clearly that these APIs are going to change for > a while, even after the release introducing them. > Yes, that's correct. Because this is the first time we are developing storage APIs to support pluggable storage, they may need some refinement based on usage to support different storage methods. 
> StorageAm - Scan stuff: > > - I think API isn't quite right. There's a lot of granular callback > functionality like scan_begin_function / scan_begin_catalog / > scan_begin_bm - these largely are convenience wrappers around the same > function, and I don't think that would, or rather should, change in > any abstracted storage layer. So I think there needs to be some > unification first (pretty close w/ beginscan_internal already, but > perhaps we should get rid of a few of these wrappers). > OK. I will change the API to add a function for beginscan_internal and replace the rest of the functions' usages with beginscan_internal. There are also many bool flags that are passed to beginscan_internal; I will try to optimize those as well. > - Some of the exposed functionality, e.g. scan_getpage, > scan_update_snapshot, scan_rescan_set_params looks like it should just > be excised, i.e. there's no reason for it to exist. > Currently these APIs are used only in Bitmap and Sample scans. These scan methods fully depend on the heap format. I will check how to remove these APIs. - Minor: don't think the _function suffix for Storis necessary, just > makes things long, and every member has it. Besides that, it's also > easy to misunderstand - for a second I understood > scan_getnext_function to be about getting the next function... > OK. How about adding _hook? > - Scans are still represented as HeapScanDesc - I don't think that's > going to fly. Either this needs to be an opaque type (i.e. a struct > that's not defined, just forward declared), or it needs to be a base > struct that individual AMs embed in their own structs. Individual AMs > definitely are going to need different pieces of data. > Currently the internal members of the HeapScanDesc are directly used in many places, especially in Bitmap and Sample scans. I have yet to write the code in the best way to handle these scan methods; then removing their usage will be easy. 
> Storage AM - tuple stuff: > > - tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are > each individual functions, that seems pretty painful to maintain, and > v/ likely to just grow and grow. Not sure what the solution is, but > this seems like a hard sell. > OK. How about adding one API that takes some flags representing what type of data is needed from the tuple, and returns the corresponding data as void *? The caller must cast the data to the corresponding type before using it. - The three *speculative* functions don't quite seem right to me, nor do > I understand: > +* > +* Setting a tuple's speculative token is a slot-only operation, > so no need > +* for a storage AM method, but after inserting a tuple containing > a > +* speculative token, the insertion must be completed by these > routines: > +*/ > I don't see anything related to
Re: [HACKERS] Pluggable storage
On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi wrote: > > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila > wrote: >> >> >> Also, it is quite possible that some of the storage Am's don't even >> want to return bool as a parameter from HeapTupleSatisfies* API's. I >> guess what we need here is to provide a way so that different storage >> am's can register their function pointer for an equivalent to >> satisfies function. So, we need to change >> SnapshotData.SnapshotSatisfiesFunc in some way so that different >> handlers can register their function instead of using that directly. >> I think that should address the problem you are planning to solve by >> omitting buffer parameter. > > > Thanks for your suggestion. Yes, it is better to go in the direction of > SnapshotSatisfiesFunc. > > I verified the above idea of implementing the Tuple visibility functions > and assign them into the snapshotData structure based on the snapshot. > > The Tuple visibility functions that are specific to the relation are > available > with the RelationData structure and this structure may not be available, > Which functions are you referring here? I don't see anything in tqual.h that uses RelationData. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila wrote: > On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi > wrote: > >> > >> Why do we need to store handler function in TupleDesc? As of now, the > >> above patch series has it available in RelationData and > >> TupleTableSlot, I am not sure if instead of that keeping it in > >> TupleDesc is a good idea. Which all kind of places require TupleDesc > >> to contain handler? If those are few places, can we think of passing > >> it as a parameter? > > > > > > Till now I am to able to proceed without adding any storage handler > > functions to > > TupleDesc structure. Sure, I will try the way of passing as a parameter > when > > there is a need of it. > > > > Okay, I think it is better if you discuss such locations before > directly modifying those. > Sure. I will check with the community before making any such changes. > > During the progress of the patch, I am facing problems in designing the > storage API > regarding the Buffer. For example To replace all the > HeapTupleSatisfiesMVCC > and > related functions with function pointers, In HeapTuple format, the tuple > may > belongs > to one buffer, so the buffer is passed to the HeapTupleSatifisifes*** > functions along > with buffer, But in case of other storage formats, the single buffer may > not > contains > the actual data. > > Also, it is quite possible that some of the storage Am's don't even > want to return bool as a parameter from HeapTupleSatisfies* API's. I > guess what we need here is to provide a way so that different storage > am's can register their function pointer for an equivalent to > satisfies function. So, we need to change > SnapshotData.SnapshotSatisfiesFunc in some way so that different > handlers can register their function instead of using that directly. > I think that should address the problem you are planning to solve by > omitting buffer parameter. > Thanks for your suggestion. 
Yes, it is better to go in the direction of SnapshotSatisfiesFunc.

I verified the above idea of implementing the tuple visibility functions
and assigning them into the SnapshotData structure based on the snapshot.

The tuple visibility functions that are specific to the relation are
available with the RelationData structure, and this structure may not be
available, so I changed the SnapshotData structure to hold an enum that
represents what type of snapshot it is, instead of storing a pointer to
the tuple visibility function. Whenever there is a need to check tuple
visibility, the storage AM handler pointer corresponding to the snapshot
type is called and the result is obtained as before.

>> This buffer is used to set the hint bits and mark the buffer as dirty.
>> In case the buffer is not available, the performance of the following
>> queries may suffer if the hint bits are not set.
>
> I don't think it is advisable to change that for the current heap.

I didn't change the prototype of the existing functions. Currently the
tuple visibility functions assume that the Buffer is always valid, but
that may not be correct depending on the storage.

>> And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
>> related functions to check the tuple visibility, but currently
>> returning a buffer from the above heap_** functions is not possible
>> for other formats.
>
> Why not? I mean, if we consider that all the formats we are worried
> about at this stage have a TID (block number, tuple location), then we
> can get the buffer. We might want to consider passing the TID as a
> parameter to these APIs if required to make that possible. You also
> agreed above [1] that we can first design the API considering storage
> formats having TID.

The current approach is to support the storages that support TIDs.
But what I mean here is that in some storage methods (for example, column
storage), the tuple is not present in a single buffer; the tuple data may
be computed from many buffers before returning the slot/storageTuple
(unless we change everything to slots). Any code after the storage methods
that expects a valid Buffer may need changes to first check whether it is
valid and act accordingly.

>> And also for the HeapTuple data, the tuple data is copied into a
>> palloc'ed buffer instead of pointing directly to the page. So, is
>> returning a Buffer valid here or not?
>
> Yeah, but I think for the sake of compatibility and not changing too
> much in the current APIs' signatures, we should try to avoid it.

Currently I am trying to avoid changing the current APIs' signatures. Most
of the signature changes are something like HeapTuple -> StorageTuple and
so on.

>> Currently I am proceeding to remove the Buffer as a parameter in the
>> API and proceed further. In case it affects the performance, we need
>> to find out a
Re: [HACKERS] Pluggable storage
Hi,

On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote:
> Here I attached WIP patches to support pluggable storage. The patch
> series may not work individually. Still so many things are under
> development. These patches are just to share the approach of the
> current development.

Making a pass through the patchset to get a feel for where this is at,
and where this is headed. I previously skimmed the thread to get a rough
sense of what's discussed, but not in a very detailed manner.

General:

- I think one important discussion we need to have is what kind of
  performance impact we're going to accept introducing this. It seems
  very likely that this'll cause some slowdown. We can kind of alleviate
  that by doing some optimizations at the same time, but nevertheless,
  this abstraction is going to cost.

- I don't think we should introduce this without a user besides heapam.
  The likelihood that the API will be usable by anything else without a
  testcase seems fairly remote. I think some src/test/modules type
  implementation of a per-session, in-memory storage - relatively easy
  to implement - or such is necessary.

- I think, and detailed some of that, we're going to need some cleanups
  that go in before this, to decrease the size / increase the quality of
  the new APIs. It's going to get more painful to change APIs
  subsequently.

- We'll have to document clearly that these APIs are going to change for
  a while, even after the release introducing them.

StorageAm - Scan stuff:

- I think the API isn't quite right. There's a lot of granular callback
  functionality like scan_begin_function / scan_begin_catalog /
  scan_begin_bm - these largely are convenience wrappers around the same
  function, and I don't think that would, or rather should, change in
  any abstracted storage layer. So I think there needs to be some
  unification first (pretty close w/ beginscan_internal already, but
  perhaps we should get rid of a few of these wrappers).

- Some of the exposed functionality, e.g.
  scan_getpage, scan_update_snapshot, scan_rescan_set_params, looks like
  it should just be excised, i.e. there's no reason for it to exist.

- Minor: I don't think the _function suffix for the StorageAm members is
  necessary; it just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

- Scans are still represented as HeapScanDesc - I don't think that's
  going to fly. Either this needs to be an opaque type (i.e. a struct
  that's not defined, just forward declared), or it needs to be a base
  struct that individual AMs embed in their own structs. Individual AMs
  definitely are going to need different pieces of data.

Storage AM - tuple stuff:

- tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are
  each individual functions; that seems pretty painful to maintain, and
  very likely to just grow and grow. Not sure what the solution is, but
  this seems like a hard sell.

- The three *speculative* functions don't quite seem right to me, nor do
  I understand:
  +*
  +* Setting a tuple's speculative token is a slot-only operation, so no need
  +* for a storage AM method, but after inserting a tuple containing a
  +* speculative token, the insertion must be completed by these routines:
  +*/
  I don't see anything related to slots here?

Storage AM - slot stuff:

- I think there's a number of wrapper functions (slot_getattr,
  slot_getallattrs, getsomeattrs, attisnull) around the same
  functionality - that bloats the API and causes slowdowns. Instead we
  need something like slot_virtualize_tuple(int16 upto), and the rest
  should just be wrappers.

- I think it's wrong to have the slot functionality defined on the
  StorageAm level. That'll cause even more indirect function calls (=>
  slowness), and besides that the TupleTableSlot struct will need
  members for each and every AM.
  I think the solution is to instead have an AM-level "create slot"
  function, where the returned slot is allocated by the AM, with a base
  member of TupleTableSlot with basically just tts_nvalid, tts_values,
  tts_isnull as members. Those are the only members that can be accessed
  without functions. Access to the individual functions (say store_tuple)
  would then be direct members of the TupleTableSlot interface. While
  that costs a bit of memory, it removes one indirection from an already
  performance-critical path.

- MinimalTuples should be one type of slot for the above, except it's
  not created by a StorageAm but by a function returning a
  TupleTableSlot. This should remove the need for the
  slot_copy_min_tuple, slot_is_physical_tuple functions.

- Right now TupleTableSlots are an executor datastructure, but these
  patches (rightly!) make it much more widely used. So I think it needs
  to be moved outside of executor/, and probably renamed to something like
Re: [HACKERS] Pluggable storage
On Sun, Aug 13, 2017 at 5:21 PM, Amit Kapila wrote:
> On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi wrote:
>>
>> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
>>>
>>> +typedef struct StorageAmRoutine
>>> +{
>>>
>>> In this structure, you have already covered most of the APIs that a
>>> new storage module needs to provide, but I think there could be more.
>>> One such API could be heap_hot_search. This seems specific to the
>>> current heap where we have the provision of HOT. I think we can
>>> provide a new API tuple_search or something like that.
>>
>> Thanks for the review.
>>
>> Yes, the StorageAmRoutine needs more function pointers. Currently I am
>> adding all the functions that are present in heapam.h and some slot
>> related functions from tuptable.h.
>
> Hmm, this API is exposed via heapam.h. Am I missing something?

Sorry, I was not explaining it clearly in my earlier mail. Yes, you are
right that the heap_hot_search function exists in heapam.h, but I am yet
to add all the exposed functions that are present in the heapam.h file to
the StorageAmRoutine structure.

>> Once I stabilize the code and APIs that are currently added, then I
>> will further enhance it with the remaining functions that are
>> necessary to support the pluggable storage API.
>
> Sure, but I think if we find anything during development/review, then
> we should either add it immediately or at the very least add a FIXME
> in the patch to avoid forgetting the finding.

OK. I will add all the functions that are identified till now.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi wrote:
>
> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
>>
>> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>>>
>>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera wrote:
>>>>
>>>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>>>> that he will be able to further this goal some more.
>>>
>>> Thanks Alvaro for sharing your development patch.
>>>
>>> Most of the patch design is the same as described by Alvaro in the
>>> first mail [1]. I will detail the modifications, pending items and
>>> open items (needing discussion) to implement proper pluggable storage.
>>>
>>> Here I attached WIP patches to support pluggable storage. The patch
>>> series may not work individually. Still so many things are under
>>> development. These patches are just to share the approach of the
>>> current development.
>>
>> +typedef struct StorageAmRoutine
>> +{
>>
>> In this structure, you have already covered most of the APIs that a
>> new storage module needs to provide, but I think there could be more.
>> One such API could be heap_hot_search. This seems specific to the
>> current heap where we have the provision of HOT. I think we can
>> provide a new API tuple_search or something like that.
>
> Thanks for the review.
>
> Yes, the StorageAmRoutine needs more function pointers. Currently I am
> adding all the functions that are present in heapam.h and some slot
> related functions from tuptable.h.

Hmm, this API is exposed via heapam.h. Am I missing something?

> Once I stabilize the code and APIs that are currently added, then I
> will further enhance it with the remaining functions that are
> necessary to support the pluggable storage API.

Sure, but I think if we find anything during development/review, then we
should either add it immediately or at the very least add a FIXME in the
patch to avoid forgetting the finding.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi wrote:
>> Why do we need to store the handler function in TupleDesc? As of now,
>> the above patch series has it available in RelationData and
>> TupleTableSlot; I am not sure if keeping it in TupleDesc instead is a
>> good idea. Which kinds of places require TupleDesc to contain the
>> handler? If those are few places, can we think of passing it as a
>> parameter?
>
> Till now I am able to proceed without adding any storage handler
> functions to the TupleDesc structure. Sure, I will try the way of
> passing it as a parameter when there is a need for it.

Okay, I think it is better if you discuss such locations before directly
modifying those.

> During the progress of the patch, I am facing problems in designing the
> storage API regarding the Buffer. For example, to replace all the
> HeapTupleSatisfiesMVCC and related functions with function pointers: in
> the HeapTuple format, the tuple may belong to one buffer, so the buffer
> is passed to the HeapTupleSatisfies*** functions along with the tuple.
> But in the case of other storage formats, a single buffer may not
> contain the actual data.

Also, it is quite possible that some of the storage AMs don't even want
to return bool as a parameter from the HeapTupleSatisfies* APIs. I guess
what we need here is to provide a way so that different storage AMs can
register their function pointer for an equivalent to the satisfies
function. So, we need to change SnapshotData.SnapshotSatisfiesFunc in
some way so that different handlers can register their function instead
of using that directly. I think that should address the problem you are
planning to solve by omitting the buffer parameter.

> This buffer is used to set the hint bits and mark the buffer as dirty.
> In case the buffer is not available, the performance of the following
> queries may suffer if the hint bits are not set.

I don't think it is advisable to change that for the current heap.
> And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
> related functions to check the tuple visibility, but currently
> returning a buffer from the above heap_** functions is not possible
> for other formats.

Why not? I mean, if we consider that all the formats we are worried about
at this stage have a TID (block number, tuple location), then we can get
the buffer. We might want to consider passing the TID as a parameter to
these APIs if required to make that possible. You also agreed above [1]
that we can first design the API considering storage formats having TID.

> And also for the HeapTuple data, the tuple data is copied into a
> palloc'ed buffer instead of pointing directly to the page. So, is
> returning a Buffer valid here or not?

Yeah, but I think for the sake of compatibility and not changing too much
in the current APIs' signatures, we should try to avoid it.

> Currently I am proceeding to remove the Buffer as a parameter in the
> API and proceed further. In case it affects the performance, we need
> to find out a different approach in handling the hint bits.

Leaving aside the performance concern, I am not convinced that it is a
good idea to remove Buffer as a parameter from the APIs you have
mentioned above. Would you mind thinking once again, keeping the
suggestions provided above in this email, to see if we can avoid removing
Buffer as a parameter?

[1] - https://www.postgresql.org/message-id/CAJrrPGd8%2Bi8sqZCdhfvBhs2d1akEb_kEuBvgRHSPJ9z2Z7VBJw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila wrote:
> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>>> that he will be able to further this goal some more.
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is the same as described by Alvaro in the
>> first mail [1]. I will detail the modifications, pending items and
>> open items (needing discussion) to implement proper pluggable storage.
>>
>> Here I attached WIP patches to support pluggable storage. The patch
>> series may not work individually. Still so many things are under
>> development. These patches are just to share the approach of the
>> current development.
>
> +typedef struct StorageAmRoutine
> +{
>
> In this structure, you have already covered most of the APIs that a
> new storage module needs to provide, but I think there could be more.
> One such API could be heap_hot_search. This seems specific to the
> current heap where we have the provision of HOT. I think we can
> provide a new API tuple_search or something like that.

Thanks for the review.

Yes, the StorageAmRoutine needs more function pointers. Currently I am
adding all the functions that are present in heapam.h and some slot
related functions from tuptable.h. Once I stabilize the code and APIs
that are currently added, then I will further enhance it with the
remaining functions that are necessary to support the pluggable storage
API.

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Mon, Aug 7, 2017 at 11:12 PM, Amit Kapila wrote:
> On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi wrote:
>>
>> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
>>>
>>> > 1. Design an API that returns values/nulls array and convert that
>>> > into a HeapTuple whenever it is required in the upper layers. All
>>> > the existing heap form/deform tuples are used for every tuple with
>>> > some adjustments.
>>>
>>> So, this would have the additional cost of form/deform. Also, how
>>> would it have fewer changes as compared to what you have described
>>> earlier?
>>
>> Yes, it has the additional cost of form/deform. It is the same
>> approach that was described earlier. But I had an intention of
>> modifying everywhere the HeapTuple is accessed. With the other
>> prototype changes of removing HeapTuple usage from triggers, I
>> realized that it needs a clear design to proceed further, instead of
>> combining those changes with the pluggable storage API.
>>
>> - heap_getnext is kept as it is and is used only for system table
>>   access.
>> - heap_getnext_slot is introduced to return the slot whenever data is
>>   found, otherwise NULL. This function is used in all the places from
>>   the executor and so on.
>> - The TupleTableSlot structure is modified to contain a void* tuple
>>   instead of HeapTuple. It also contains the storage handler
>>   functions.
>> - heap_insert and similar functions can take a Slot as an argument
>>   and perform the insert operation.
>>
>> In the cases where a TupleTableSlot cannot be sent, form a HeapTuple
>> from the data and send it, and also note down that it is HeapTuple
>> data, not the tuple from the storage.
>>
>> ..
>>
>>> > 3. Design an API that returns StorageTuple(void *) but the
>>> > necessary format information of that tuple can be got from the
>>> > tupledesc.
>>> > Wherever the tuple is present, there exists a tupledesc in most of
>>> > the cases. How about adding some kind of information in tupledesc
>>> > to find out the tuple format and call the necessary functions?
>>
>> heap_getnext returns StorageTuple instead of HeapTuple. The tuple
>> type information is available in the TupleDesc structure.
>>
>> All heap_insert and similar functions accept TupleTableSlot as input
>> and perform the insert operation. This approach is almost the same as
>> the first approach, except the storage handler functions are stored
>> in TupleDesc.
>
> Why do we need to store the handler function in TupleDesc? As of now,
> the above patch series has it available in RelationData and
> TupleTableSlot; I am not sure if keeping it in TupleDesc instead is a
> good idea. Which kinds of places require TupleDesc to contain the
> handler? If those are few places, can we think of passing it as a
> parameter?

Till now I am able to proceed without adding any storage handler
functions to the TupleDesc structure. Sure, I will try the way of passing
it as a parameter when there is a need for it.

During the progress of the patch, I am facing problems in designing the
storage API regarding the Buffer. For example, to replace all the
HeapTupleSatisfiesMVCC and related functions with function pointers: in
the HeapTuple format, the tuple may belong to one buffer, so the buffer
is passed to the HeapTupleSatisfies*** functions along with the tuple.
But in the case of other storage formats, a single buffer may not
contain the actual data.

This buffer is used to set the hint bits and mark the buffer as dirty. In
case the buffer is not available, the performance of the following
queries may suffer if the hint bits are not set.
And also the Buffer is obtained from heap_fetch, heap_lock_tuple and
related functions to check the tuple visibility, but currently returning
a buffer from the above heap_** functions is not possible for other
formats. And also for the HeapTuple data, the tuple data is copied into a
palloc'ed buffer instead of pointing directly to the page. So, is
returning a Buffer valid here or not?

Currently I am proceeding to remove the Buffer as a parameter in the API
and proceed further. In case it affects the performance, we need to find
out a different approach in handling the hint bits.

Comments?

Regards,
Hari Babu
Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi wrote:
>
> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera wrote:
>>
>> I have sent the partial patch I have to Hari Babu Kommi. We expect
>> that he will be able to further this goal some more.
>
> Thanks Alvaro for sharing your development patch.
>
> Most of the patch design is the same as described by Alvaro in the
> first mail [1]. I will detail the modifications, pending items and open
> items (needing discussion) to implement proper pluggable storage.
>
> Here I attached WIP patches to support pluggable storage. The patch
> series may not work individually. Still so many things are under
> development. These patches are just to share the approach of the
> current development.

+typedef struct StorageAmRoutine
+{

In this structure, you have already covered most of the APIs that a new
storage module needs to provide, but I think there could be more. One
such API could be heap_hot_search. This seems specific to the current
heap where we have the provision of HOT. I think we can provide a new API
tuple_search or something like that.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi wrote:
>
> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
>>
>> > 1. Design an API that returns values/nulls array and convert that
>> > into a HeapTuple whenever it is required in the upper layers. All
>> > the existing heap form/deform tuples are used for every tuple with
>> > some adjustments.
>>
>> So, this would have the additional cost of form/deform. Also, how
>> would it have fewer changes as compared to what you have described
>> earlier?
>
> Yes, it has the additional cost of form/deform. It is the same approach
> that was described earlier. But I had an intention of modifying
> everywhere the HeapTuple is accessed. With the other prototype changes
> of removing HeapTuple usage from triggers, I realized that it needs a
> clear design to proceed further, instead of combining those changes
> with the pluggable storage API.
>
> - heap_getnext is kept as it is and is used only for system table
>   access.
> - heap_getnext_slot is introduced to return the slot whenever data is
>   found, otherwise NULL. This function is used in all the places from
>   the executor and so on.
> - The TupleTableSlot structure is modified to contain a void* tuple
>   instead of HeapTuple. It also contains the storage handler functions.
> - heap_insert and similar functions can take a Slot as an argument and
>   perform the insert operation.
>
> In the cases where a TupleTableSlot cannot be sent, form a HeapTuple
> from the data and send it, and also note down that it is HeapTuple
> data, not the tuple from the storage.
>
> ..
>
>> > 3. Design an API that returns StorageTuple(void *) but the necessary
>> > format information of that tuple can be got from the tupledesc.
>> > Wherever the tuple is present, there exists a tupledesc in most of
>> > the cases.
>> > How about adding some kind of information in tupledesc to find out
>> > the tuple format and call the necessary functions?
>
> heap_getnext returns StorageTuple instead of HeapTuple. The tuple type
> information is available in the TupleDesc structure.
>
> All heap_insert and similar functions accept TupleTableSlot as input
> and perform the insert operation. This approach is almost the same as
> the first approach, except the storage handler functions are stored in
> TupleDesc.

Why do we need to store the handler function in TupleDesc? As of now, the
above patch series has it available in RelationData and TupleTableSlot; I
am not sure if keeping it in TupleDesc instead is a good idea. Which
kinds of places require TupleDesc to contain the handler? If those are
few places, can we think of passing it as a parameter?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila wrote:
> On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi wrote:
>>
>> I am finding out that eliminating the HeapTuple usage in the upper
>> layers needs some major changes. How about not changing anything in
>> the upper layers of storage currently and just supporting the
>> pluggable tuple with one of the following approaches for the first
>> version?
>
> It is not very clear to me how any of the below alternatives are better
> or worse as compared to the approach you are already working on. Do you
> mean to say that we will get away without changing all the places which
> take HeapTuple as input or return it as output with some of the below
> approaches?

Yes. With the following approaches, my intention is to reduce the changes
of HeapTuple everywhere, to support the pluggable storage API with
minimal changes and improve further later. But unless we change the
HeapTuple usage everywhere properly, maybe we can't see the benefits of
pluggable storages.

>> 1. Design an API that returns values/nulls array and convert that
>> into a HeapTuple whenever it is required in the upper layers. All the
>> existing heap form/deform tuples are used for every tuple with some
>> adjustments.
>
> So, this would have the additional cost of form/deform. Also, how
> would it have fewer changes as compared to what you have described
> earlier?

Yes, it has the additional cost of form/deform. It is the same approach
that was described earlier. But I had an intention of modifying
everywhere the HeapTuple is accessed. With the other prototype changes of
removing HeapTuple usage from triggers, I realized that it needs a clear
design to proceed further, instead of combining those changes with the
pluggable storage API.

- heap_getnext is kept as it is and is used only for system table
  access.
- heap_getnext_slot is introduced to return the slot whenever data is
  found, otherwise NULL. This function is used in all the places from
  the executor and so on.
- The TupleTableSlot structure is modified to contain a void* tuple
  instead of HeapTuple. It also contains the storage handler functions.
- heap_insert and similar functions can take a Slot as an argument and
  perform the insert operation.

In the cases where a TupleTableSlot cannot be sent, form a HeapTuple from
the data and send it, and also note down that it is HeapTuple data, not
the tuple from the storage.

> 2. Design an API that returns StorageTuple(void *) with the first
> member representing the TYPE of the storage, so that corresponding
> registered function calls can be called to deform/form the tuple
> whenever there is a need for the tuple.
>
> Do you intend to say that we store such information in the disk tuple
> or only in the in-memory version of the same?

Only the in-memory version.

> Also, what makes you think that we would need hooks only for form and
> deform? Right now, in many cases the tuple will directly point to a
> disk page, and we deal with it by retaining the pin on the
> corresponding buffer; what if some kinds of tuple don't follow that
> rule? For example, to support in-place updates, we might always need a
> separate copy of the tuple rather than one pointing to a disk page.

In any of the approaches, except for system tables, we are going to
remove the direct disk access of the tuple, either by replacing the tuple
with a slot or something similar. Otherwise it will be difficult to
support, and it also doesn't provide much performance benefit. All the
storage handler functions need to be maintained separately and accessed
using the handler ID.

heap_getnext returns StorageTuple and not HeapTuple. The StorageTuple can
be mapped to HeapTuple or another tuple based on the first member's type.
heap_insert and similar functions take a StorageTuple as input and
internally decide, based on the type of the tuple, how to perform the
insert operation. Instead of storing the handler functions inside a
relation/slot, the function pointers can be accessed directly based on
the storage type data. This works in every case where the tuple is
accessed, but the problem is that it may need changes wherever the tuple
is accessed.

> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be got from the tupledesc.
> Wherever the tuple is present, there exists a tupledesc in most of the
> cases. How about adding some kind of information in tupledesc to find
> out the tuple format and call the necessary functions?

heap_getnext returns StorageTuple instead of HeapTuple. The tuple type
information is available in the TupleDesc structure.

All heap_insert and similar functions accept TupleTableSlot as input and
perform the insert operation. This approach is almost the same as the
first approach, except the storage handler functions are stored in
Re: [HACKERS] Pluggable storage
On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi wrote:
>
> On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas wrote:
>>
> I am finding out that eliminating the HeapTuple usage in the upper
> layers needs some major changes. How about not changing anything in the
> upper layers of storage currently and just supporting the pluggable
> tuple with one of the following approaches for the first version?

It is not very clear to me how any of the below alternatives are better
or worse as compared to the approach you are already working on. Do you
mean to say that we will get away without changing all the places which
take HeapTuple as input or return it as output with some of the below
approaches?

> 1. Design an API that returns values/nulls array and convert that into
> a HeapTuple whenever it is required in the upper layers. All the
> existing heap form/deform tuples are used for every tuple with some
> adjustments.

So, this would have the additional cost of form/deform. Also, how would
it have fewer changes as compared to what you have described earlier?

> 2. Design an API that returns StorageTuple(void *) with the first
> member representing the TYPE of the storage, so that corresponding
> registered function calls can be called to deform/form the tuple
> whenever there is a need for the tuple.

Do you intend to say that we store such information in the disk tuple or
only in the in-memory version of the same? Also, what makes you think
that we would need hooks only for form and deform? Right now, in many
cases the tuple will directly point to a disk page, and we deal with it
by retaining the pin on the corresponding buffer; what if some kinds of
tuple don't follow that rule? For example, to support in-place updates,
we might always need a separate copy of the tuple rather than one
pointing to a disk page.

> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be got from the tupledesc.
wherever > the tuple is present, there exists a tupledesc in most of the cases. How > about adding some kind of information in tupledesc to find out the tuple > format and call the necessary functions > > 4. Design an API to return always the StorageTuple and it converts to > HeapTuple with a function hook if it gets registered (for heap storages > this is not required to register the hook, because it is already a HeapTuple > format). This function hook should be placed in the heap form/deform > functions. > I think some more information is required to comment on any of the approaches or suggest a new one. You might want to try quoting some specific examples from the code so that it is easy to understand what your proposal will change in that case. One idea could be that we start with some interfaces like structures TupleTableSlot, EState, HeapScanDescData, IndexScanDescData, etc. and interfaces like heap_insert, heap_update, heap_lock_tuple, SnapshotSatisfiesFunc, EvalPlanQualFetch, etc. Now, it is quite possible that we don't want to change some of these interfaces, but it can help to see how such a usage can be replaced with a new kind of tuple structure. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Jul 19, 2017 at 10:56 AM, Robert Haas wrote: >> I strongly agree. I simply don't understand how you can adopt UNDO for >> MVCC, and yet expect to get a benefit commensurate with the effort >> without also implementing "retail index tuple deletion" first. > > I agree that we need retail index tuple deletion. I liked Claudio's > idea at > http://postgr.es/m/cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com > -- that seems indispensable to making retail index tuple deletion > reasonably efficient. Is anybody going to work on getting that > committed? I will do review work on it. IMV the main problems are: * The way a "header" is added at the PageAddItemExtended() level, rather than making heap TID something much closer to a conventional attribute that perhaps only nbtree and indextuple.c have special knowledge of, strikes me as the wrong way to go. * It's simply not acceptable to add overhead to *all* internal items. That kills fan-in. We're going to need suffix truncation for the common case where the user-visible attributes for a split point/new high key at the leaf level sufficiently distinguish what belongs on either side. IOW, you should only see internal items with a heap TID in the uncommon case where you have so many duplicates at the leaf level that you have no choice but to use a split point that's right in the middle of many duplicates. Fortunately, if we confine ourselves to making heap TID part of the keyspace, the code can be far simpler than what would be needed to get my preferred, all-encompassing design for suffix truncation [1] to work. I think we could just stash the number of attributes participating in a comparison within internal pages' unused item pointer offset. I've talked about this before, in the context of Anastasia's INCLUDED columns patch. If we can have a variable number of attributes for heap tuples, we can do so for index tuples, too. 
* We might also have problems with changing the performance characteristics for the worse in some cases by some measures. This will probably technically increase the amount of bloat for some indexes with sparse deletion patterns. I think that that will be well worth it, but I don't expect a slam dunk. A nice benefit of this work is that it lets us kill the hack that adds randomness to the search for free space among duplicates, and may let us follow the Lehman & Yao algorithm more closely. [1] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 8:58 PM, Peter Geoghegan wrote: > To repeat myself, for emphasis: *Not all bloat is equal*. +1. > I strongly agree. I simply don't understand how you can adopt UNDO for > MVCC, and yet expect to get a benefit commensurate with the effort > without also implementing "retail index tuple deletion" first. I agree that we need retail index tuple deletion. I liked Claudio's idea at http://postgr.es/m/cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com -- that seems indispensable to making retail index tuple deletion reasonably efficient. Is anybody going to work on getting that committed? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 6:36 PM, Alexander Korotkov wrote: > I think in general there are two ways dealing with our index AM API > limitation. One of them is to extend index AM API. That's pretty much what I have in mind. I think it's likely that if we end up with, say, 3 kinds of heap and 12 kinds of index, there will be some compatibility matrix. Some index AMs will be compatible with some heap AMs, and others won't be. For example, if somebody makes an IOT-type heap using the API proposed here or some other one, BRIN probably won't work at all. btree, on the other hand, could probably be made to work, perhaps with some greater or lesser degree of modification. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas wrote: > On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi > wrote: > > To replace tuple with slot, I took trigger and SPI calls as the first > step > > in modifying > > from tuple to slot, Here I attached a WIP patch. The notable changes are, > > > > 1. Replace most of the HeapTuple with Slot in SPI interface functions. > > 2. In SPITupleTable, Instead of HeapTuple, it is changed to > TupleTableSlot. > > But this change may not be a proper approach, because a duplicate copy of > > TupleTableSlot is generated and stored. > > 3. Changed all trigger interfaces to accept TupleTableSlot Instead of > > HeapTuple. > > 4. ItemPointerData is added as a member to the TupleTableSlot structure. > > 5. Modified the ExecInsert and others to work directly on TupleTableSlot > > instead > > of tuple(not completely). > > What problem are you trying to solve with these changes? I'm not > saying that it's a bad idea, but I think you should spell out the > motivation a little more explicitly. > Sorry for not providing complete details. I am trying these experiments to find out the best way to return the tuple from Storage methods by designing a proper API. The changes that I am making are to reduce the dependency on the HeapTuple format in favor of values/nulls arrays. So if there is no dependency on the HeapTuple format in the upper layers of the PostgreSQL storage, then we can directly define the StorageAPI to return the values/nulls array instead of one big chunk of tuple data like HeapTuple or StorageTuple(void *). I am finding out that eliminating the HeapTuple usage in the upper layers needs some major changes. How about not changing anything in the upper layers of storage currently and just supporting the pluggable tuple with one of the following approaches for the first version? 1. Design an API that returns values/nulls arrays and converts them into a HeapTuple whenever it is required in the upper layers. 
All the existing heap form/deform tuples are used for every tuple with some adjustments. 2. Design an API that returns StorageTuple(void *) whose first member represents the TYPE of the storage, so that corresponding registered function calls can be called to deform/form the tuple whenever there is a need for the tuple. 3. Design an API that returns StorageTuple(void *) but the necessary format information of that tuple can be obtained from the tupledesc. Wherever the tuple is present, a tupledesc exists in most cases. How about adding some kind of information in the tupledesc to find out the tuple format and call the necessary functions? 4. Design an API that always returns the StorageTuple and converts it to a HeapTuple with a function hook if one gets registered (for heap storages there is no need to register the hook, because it is already in HeapTuple format). This function hook should be placed in the heap form/deform functions. Any other better ideas for a first version? Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 1:24 PM, Alexander Korotkov wrote: > It's probably depends on particular storage (once we have pluggable > storages). Some storages would have additional level of indirection while > others wouldn't. Agreed. Like kill_prior_tuple, it's an optional capability, and where implemented is implemented in a fairly consistent way. > But even if unique index contain no true duplicates, it's > still possible that true delete happen. Then we still have to delete tuple > even from unique index. I think I agree. I've been looking over the ARIES paper [1] again today. They say this: "For index updates, in the interest of increasing concurrency, we do not want to prevent the space released by one transaction from being consumed by another before the commit of the first transaction." You can literally reclaim space from an index tuple deletion *immediately* with their design, which matters because you want to reclaim space as early as possible, before a page split is needed. Obviously they understand how important this is. This might not work so well with an MVCC system, where there are no 2PL predicate locks. You need to keep a "ghost record", even for non-unique indexes, and the deletion can only happen when the xact commits. Remember, the block number cannot be used to see if there were changes against the page, unlike the heap, because you have to worry about page splits and page merges/deletion. UNDO is entirely logical for indexes for this reason. (This is why UNDO does not actually undo page splits, relation extension, etc. Only REDO/WAL always works at the level of individual pages in all cases. UNDO for MVCC is not as different from our design as I once thought.) The reason I want to at least start with unique indexes is because you need a TID to make non-unique/secondary indexes have unique keys (unique keys are always needed if retail index tuple insertion is always supported). 
For unique indexes, you really can do an update in the index (see my design below for one example of how that can work), but I think you need something more like a deletion followed by an insertion for non-unique indexes, because there the physical/heap TID changed, and that's part of the key, and that might belong on a different page. You therefore haven't really fixed the problem with secondary indexes sometimes needing new index tuples even though user visible attributes weren't updated. You haven't fixed the problem with secondary indexes, unless, of course, all secondary indexes have logical pointers to begin with, such as the PK value. Then you only need to "insert and delete, not update" when the PK value is updated or when a secondary index needs a new index tuple with user visible attribute values distinct from the previous version's -- you fix the secondary index problem. And, while your "version chain overflow indirection" structure is basically something that lives outside the heap, it is still only needed for one index, and not all of them. This new indirection structure is a really nice target for pruning, because you can prune physical TIDs that no possible snapshot could use, unlike with the heap, where EvalPlanQual() could make any heap tuple visible to snapshots at or after the minimal snapshot horizon implied by RecentGlobalXmin. And index scans on any index can prune for everyone. You could also do "true index deletes", as you suggest, but you'd need to have ghost records there too, and you'd need an asynchronous cleanup process to do the cleanup when the deleting xact committed. I'm not sure if it's worth doing that eagerly. It may or may not be better to hope for kill_prior_tuple to do the job for us. Not sure where this leaves index-only scans on secondary indexes..."true index deletes" might be justified by making index-only scans work more often in general, especially for secondary indexes with logical pointers. 
I'm starting to think that you were right all along about indirect indexes needing to store PK values. Perhaps we should just bite the bullet...it's not like places like the bufpage.c index routines actually know or care about whether or not the index tuple has a TID, what a TID is, etc. They care about stuff like the header values of index tuples, and the page ItemId array, but TID is, as you put it, merely payload. > It's possible to add indirection layer "on demand". Thus, initially index > tuples point directly to the heap tuple. If tuple gets updates and doesn't > fit to the page anymore, then it's moved to another place with redirect in > the old place. I think that if carefully designed, it's possible to > guarantee there is at most one redirect. This is actually what I was thinking. Here is a sketch: When you start out, index tuples in nbtree are the same as today -- one physical pointer (TID). But, on the first update to a PK index, they grow a new pointer, but this is not a physical/heap TID. It's a pointer to some kind of
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 7:51 PM, Peter Geoghegan wrote: > On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov > wrote: > > I think that "retail index tuple deletion" is the feature which could > give > > us some advantages even independently from pluggable storages. For > example, > > imagine very large table with only small amount of dead tuples. In this > > case, it would be cheaper to delete index links to those dead tuples one > by > > one using "retail index tuple deletion", rather than do full scan of > every > > index to perform "bulk delete" of index tuples. One may argue that you > > shouldn't do vacuum of large table when only small amount of tuples are > > dead. But in terms of index bloat mitigation, very aggressive vacuum > > strategy could be justified. > > Yes, definitely. Especially with the visibility map. Even still, I > tend to think that for unique indexes, true duplicates should be > disallowed, and dealt with with an additional layer of indirection. So > this would be for secondary indexes. > It probably depends on the particular storage (once we have pluggable storages). Some storages would have an additional level of indirection while others wouldn't. But even if a unique index contains no true duplicates, it's still possible that a true delete happens. Then we still have to delete the tuple even from a unique index. >> I agree with Robert that being able to store an arbitrary payload as a > >> TID is probably not going to ever work very well. > > > > > > Support of arbitrary payload as a TID doesn't sound easy. However, that > > doesn't mean it's unachievable. For me, it's more like long way which > could > > be traveled step by step. > > To be fair, it probably is achievable. Where there is a will, there is > a way. I just think that it will be easier to find a different way of > realizing similar benefits. 
I'm mostly talking about benefits around > making it cheap to have many secondary indexes by having logical > indirection instead of physical pointers (doesn't *have* to be > user-visible primary key values). It's possible to add an indirection layer "on demand". Thus, initially index tuples point directly to the heap tuple. If a tuple gets updated and doesn't fit on the page anymore, then it's moved to another place with a redirect in the old place. I think that if carefully designed, it's possible to guarantee there is at most one redirect. But I still think that avoiding an arbitrary payload for indexes is delaying the inevitable, if only we want pluggable storages and want them to reuse existing index AMs. So, for example, an arbitrary payload together with the ability to update this payload allows us to make indexes separately versioned (have a separate garbage collection process more or less unrelated to the heap). Despite the overhead caused by MVCC attributes, I think such indexes could give significant advantages in various workloads. > HOT simply isn't effective enough at > preventing UPDATE index tuple insertions for indexes on unchanged > attributes, often just because pruning can fail to happen in time, > which WARM will not fix. > Right. I think HOT and WARM depend on factors which are hard to control: distribution of UPDATEs between heap pages, oldest snapshot and so on. It's quite hard for a DBA to understand why a table starts getting bloated while it didn't before. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov wrote: > I think that "retail index tuple deletion" is the feature which could give > us some advantages even independently from pluggable storages. For example, > imagine very large table with only small amount of dead tuples. In this > case, it would be cheaper to delete index links to those dead tuples one by > one using "retail index tuple deletion", rather than do full scan of every > index to perform "bulk delete" of index tuples. One may argue that you > shouldn't do vacuum of large table when only small amount of tuples are > dead. But in terms of index bloat mitigation, very aggressive vacuum > strategy could be justified. Yes, definitely. Especially with the visibility map. Even still, I tend to think that for unique indexes, true duplicates should be disallowed, and dealt with via an additional layer of indirection. So this would be for secondary indexes. >> I agree with Robert that being able to store an arbitrary payload as a >> TID is probably not going to ever work very well. > > > Support of arbitrary payload as a TID doesn't sound easy. However, that > doesn't mean it's unachievable. For me, it's more like long way which could > be traveled step by step. To be fair, it probably is achievable. Where there is a will, there is a way. I just think that it will be easier to find a different way of realizing similar benefits. I'm mostly talking about benefits around making it cheap to have many secondary indexes by having logical indirection instead of physical pointers (doesn't *have* to be user-visible primary key values). HOT simply isn't effective enough at preventing UPDATE index tuple insertions for indexes on unchanged attributes, often just because pruning can fail to happen in time, which WARM will not fix. -- Peter Geoghegan
Re: [HACKERS] Pluggable storage
On Sun, Jul 16, 2017 at 3:58 AM, Peter Geoghegan wrote: > I strongly agree. I simply don't understand how you can adopt UNDO for > MVCC, and yet expect to get a benefit commensurate with the effort > without also implementing "retail index tuple deletion" first. > Pursuing UNDO this way has the same problem that WARM likely has -- it > doesn't really help with the worst case, where users get big, > unpleasant surprises. Postgres is probably the only major database > system that doesn't support retail index tuple deletion. It's a basic > thing, that has nothing to do with MVCC. Really, what do we have to > lose? > I think that "retail index tuple deletion" is the feature which could give us some advantages even independently from pluggable storages. For example, imagine a very large table with only a small amount of dead tuples. In this case, it would be cheaper to delete index links to those dead tuples one by one using "retail index tuple deletion", rather than do a full scan of every index to perform a "bulk delete" of index tuples. One may argue that you shouldn't vacuum a large table when only a small number of tuples are dead. But in terms of index bloat mitigation, a very aggressive vacuum strategy could be justified. > I agree with Robert that being able to store an arbitrary payload as a > TID is probably not going to ever work very well. Support of an arbitrary payload as a TID doesn't sound easy. However, that doesn't mean it's unachievable. For me, it's more like a long way which could be traveled step by step. Some of our existing index access methods (B-tree, hash, GiST, SP-GiST) may support arbitrary payload relatively easily, because they do not rely on its internal structure. For others (GIN, BRIN) arbitrary payload is much harder to support, but I wouldn't say it's impossible. However, if we make arbitrary payload support an option of the index AM API and implement this support for the first group of index AMs, it would already be a great step forward. 
So, for example, it would be possible to use indirect indexes when the primary key is not 6 bytes, if the index AM supports arbitrary payload. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 3:36 PM, Alexander Korotkov wrote: > I think that pruning and vacuum are artifacts of not only current heap > formats, but they are also artifacts of current index AM API. And this is > more significant circumstance given that we're going to preserve > compatibility of new storage engines with current index AMs. Our current > index AM API assumes that we can delete from index only in bulk manner. Our > payload to index key is TID, not arbitrary piece of data. And that payload > can't be updated. I agree that this is a big set of problems. This is where I think we can get the most benefit. One nice thing about having a fully logical TID is that you don't have to bloat indexes just because a HOT update was not possible. You bloat something else instead, certainly, but that can be optimized for garbage collection. Not all bloat is equal. MVCC based on UNDO can be very fast because UNDO is very well optimized for garbage collection, and so can be bloated with no long term consequences, and minor short term consequences. Still, it isn't fair to blame the Postgres VACUUM design for the fact that Postgres bloats indexes so easily. Nothing stops us from adding a new layer of indirection, so that bloating an index degrades into something that is not even as bad as bloating the heap [1]. We may just have a data structure problem, which is not nearly the same thing as a fundamental design problem in the storage layer. This approach could pay off soon if we start with unique indexes, where there is a "queue" of row versions that can be pruned with simple logic, temporal locality helps a lot, only zero or one versions can be visible to your MVCC snapshot, etc. This might require only minimal revisions to the index AM API, to help nbtree. We could later improve this so that you bloat UNDO instead of bloating a heap-like structure, both for indexes and for the actual heap. That seems less urgent. To repeat myself, for emphasis: *Not all bloat is equal*. 
Index bloat makes the way a B-Tree's keyspace is split up far too granular, making pages sparsely packed, a problem that is more or less *irreversible* by VACUUM or any garbage collection process [2]. That's how B-Trees work -- they're optimized for concurrency, not for garbage collection. >> InnoDB isn't much like the PostgreSQL heap, and >> neither is SQL Server, IIUC. If we're creating a heap format that can >> only be different in trivial ways from what we have now, and anything >> that really changes the paradigm is off-limits, well, that's not >> really interesting enough to justify the work of creating a heap >> storage API. > > > My concern is that we probably can't do anything that really changes > paradigm while preserving compatibility with index AM API. If you don't > agree with that, it would be good to provide some examples. It seems > unlikely for me that we're going to have something like InnoDB or SQL Server > table with our current index AM API. InnoDB utilizes index-organized tables > where primary and secondary indexes are versioned independently. SQL Server > utilizes flat data structure similar to our heap, but MVCC implementation > also seems very different. I strongly agree. I simply don't understand how you can adopt UNDO for MVCC, and yet expect to get a benefit commensurate with the effort without also implementing "retail index tuple deletion" first. Pursuing UNDO this way has the same problem that WARM likely has -- it doesn't really help with the worst case, where users get big, unpleasant surprises. Postgres is probably the only major database system that doesn't support retail index tuple deletion. It's a basic thing, that has nothing to do with MVCC. Really, what do we have to lose? The biggest weakness of the current design is IMV how it fails to prevent index bloat in the first place, but avoiding bloating index leaf pages in the first place doesn't seem incompatible with how VACUUM works. Or at least, let's not assume that it is. 
We should avoid throwing the baby out with the bathwater. > I think in general there are two ways dealing with our index AM API > limitation. One of them is to extend index AM API. At least, we would need > a method for deletion of individual index tuple (for sure, we already have > kill_prior_tuple but it's just a hint for now). kill_prior_tuple can work well, but, like HOT, it works inconsistently, in a way that is hard to predict. > Also, it would be nice to > have arbitrary payload to index tuples instead of TID, and method to update > that payload. But that would be quite big amount of work. Alternatively, > we could allow pluggable storages to have their own index AMs, and that will > move this amount of work to the pluggable storage side. I agree with Robert that being able to store an arbitrary payload as a TID is probably not going to ever work very well. However, I don't think that that's a reason to give up on the underlying idea:
Re: [HACKERS] Pluggable storage
On Sat, Jul 15, 2017 at 5:14 AM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov > wrote: > > If #1 is only about changing tuple and page formats, then could be much > > simpler than the patch upthread? We can implement "page format access > > methods" with routines for insertion, update, pruning and deletion of > tuples > > *in particular page*. There is no need to redefine high-level logic for > > scanning heap, doing updates and so on... > > That assumes that every tuple format does those things in the same > way, which I suspect is not likely to be the case. I think that > pruning and vacuum are artifacts of the current heap format, and they > may be nonexistent or take some altogether different form in some > other storage engine. I think that pruning and vacuum are artifacts of not only current heap formats, but they are also artifacts of the current index AM API. And this is a more significant circumstance given that we're going to preserve compatibility of new storage engines with current index AMs. Our current index AM API assumes that we can delete from an index only in a bulk manner. Our payload for an index key is a TID, not an arbitrary piece of data. And that payload can't be updated. > InnoDB isn't much like the PostgreSQL heap, and > neither is SQL Server, IIUC. If we're creating a heap format that can > only be different in trivial ways from what we have now, and anything > that really changes the paradigm is off-limits, well, that's not > really interesting enough to justify the work of creating a heap > storage API. My concern is that we probably can't do anything that really changes the paradigm while preserving compatibility with the index AM API. If you don't agree with that, it would be good to provide some examples. It seems unlikely to me that we're going to have something like an InnoDB or SQL Server table with our current index AM API. InnoDB utilizes index-organized tables where primary and secondary indexes are versioned independently. 
SQL Server utilizes a flat data structure similar to our heap, but the MVCC implementation also seems very different. I think in general there are two ways of dealing with our index AM API limitations. One of them is to extend the index AM API. At least, we would need a method for deletion of individual index tuples (for sure, we already have kill_prior_tuple, but it's just a hint for now). Also, it would be nice to have an arbitrary payload in index tuples instead of a TID, and a method to update that payload. But that would be quite a big amount of work. Alternatively, we could allow pluggable storages to have their own index AMs, and that will move this amount of work to the pluggable storage side. The thing we should avoid is a storage API that is as invasive as something that changes the paradigm, but actually allows just trivial changes in the heap format. Mechanical replacement of heap methods with storage API methods could lead us there. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi wrote: > To replace tuple with slot, I took trigger and SPI calls as the first step > in modifying > from tuple to slot, Here I attached a WIP patch. The notable changes are, > > 1. Replace most of the HeapTuple with Slot in SPI interface functions. > 2. In SPITupleTable, Instead of HeapTuple, it is changed to TupleTableSlot. > But this change may not be a proper approach, because a duplicate copy of > TupleTableSlot is generated and stored. > 3. Changed all trigger interfaces to accept TupleTableSlot Instead of > HeapTuple. > 4. ItemPointerData is added as a member to the TupleTableSlot structure. > 5. Modified the ExecInsert and others to work directly on TupleTableSlot > instead > of tuple(not completely). What problem are you trying to solve with these changes? I'm not saying that it's a bad idea, but I think you should spell out the motivation a little more explicitly. I think performance is likely to be a key concern here. Maybe these interfaces are Just Better with TupleTableSlots and the patch is a win regardless of pluggable storage -- but if the reverse turns out to be true, and this slows things down in cases that anyone cares about, I think that's going to be a problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:27 PM, Tomas Vondra wrote: > Can you elaborate a bit more about this TID bit pattern issues? I do > remember that not all TIDs are valid due to safeguards on individual fields, > like for example > > Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits)) > > But perhaps there are some other issues? I think the other issues that I know about have largely already been mentioned by others, but since this question was addressed to me: 1. TID bitmaps assume that the page and offset parts of the TID contain just those things. As Tom pointed out, tidbitmap.c isn't cool with TIDs that have ip_posid out of range. More broadly, we assume lossification is a sensible way of keeping a TID bitmap down to a reasonable size without losing too much efficiency, and that sorting tuple references by the block ID is likely to produce a sequential I/O pattern. Those things won't necessarily be true if TIDs are treated as opaque tuple identifiers. 2. Apparently, GIN uses the structure of TIDs to compress posting lists. (I'm not personally familiar with this code.) In general, it's a fairly dangerous thing to suppose that you can repurpose a value as widely used as a TID and not break anything. I'm not saying it can't be done, but we use TIDs in an awful lot of places and rooting out all of the places somebody may have made an assumption about the structure of them may not be trivial. I tend to think it's an ugly kludge to shove some other kind of value into a TID, anyway. If we need to store something that's not a TID, I think we should have a purpose-built mechanism for that, not just hammer on the existing system until it sorta works. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov wrote:
> If #1 is only about changing tuple and page formats, then it could be much simpler than the patch upthread? We can implement "page format access methods" with routines for insertion, update, pruning and deletion of tuples *in a particular page*. There is no need to redefine high-level logic for scanning the heap, doing updates and so on...

That assumes that every tuple format does those things in the same way, which I suspect is not likely to be the case. I think that pruning and vacuum are artifacts of the current heap format, and they may be nonexistent or take some altogether different form in some other storage engine. InnoDB isn't much like the PostgreSQL heap, and neither is SQL Server, IIUC. If we're creating a heap format that can only be different in trivial ways from what we have now, and anything that really changes the paradigm is off-limits, well, that's not really interesting enough to justify the work of creating a heap storage API.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Wed, Jun 28, 2017 at 7:43 AM, Haribabu Kommi wrote:
> On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov wrote:
>> On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila wrote:
>>> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov wrote:
>>>
>>> Probably some other algorithm for vacuum. I am not sure the current vacuum with its parameters can be used so easily. One thing that might need some thought is whether it is sufficient to tell a developer implementing some different strategy for vacuum to keep autovacuum off and call some different API in places where vacuum can be invoked manually (like the VACUUM command), or whether we need something more as well.
>>
>> What kind of other vacuum algorithm do you expect? It would be easier to understand if we had examples.
>>
>> For me, changing the vacuum algorithm is not needed for just a heap page format change. The existing vacuum algorithm could just call page format API functions for manipulating individual pages.
>>
>> Changing the vacuum algorithm might be needed for more invasive changes than just the heap page format. However, we should first understand what these changes could be and how they are consistent with the rest of the API design.
>
> Yes, I agree that we need some changes in vacuum to handle pluggable storage. I haven't fully checked the changes that are needed in vacuum yet, but I feel low-level changes to the functions are enough. There should also be some option for the storage handler to decide whether it needs vacuum or not; based on this flag, vacuum may be skipped on those tables, so those handlers would not need to register those APIs.

Something in that direction sounds reasonable to me. I am also not very clear on what kind of pluggability will be required for vacuum. I think for now we can park this problem and try to tackle the tuple format and page format related stuff.

-- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 7:23 PM, Alexander Korotkov wrote:
> On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila wrote:
>> Similarly, if the page format is changed you need to consider all page scan APIs like heapgettup_pagemode/heapgetpage.
>
> If the page format is changed, then heapgettup_pagemode/heapgetpage should use appropriate API functions for manipulating page items. I'm very afraid of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous functions like heap_update without understanding a clear use case. It's definitely not needed for changing the heap page format.

Yeah, we might not change them completely; there will always be some common parts. But as of now, this API assumes that we can return a pointer to a tuple on the page while holding just a pin on the buffer. That is based on the assumption that nobody can update the current tuple -- only a new tuple will be created on update. For some use cases, like in-place updates, we might want to do that differently.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 11:53 PM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila> wrote: > >> On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov >> wrote: >> > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas >> wrote: >> >> >> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi >> >> wrote: >> >> > Open Items: >> >> > >> >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> >> > HeapTuple and HeapScanDesc, So these scans are directly operating >> >> > on those structures and providing the result. >> >> > >> >> > These scan types may not be applicable to different storage formats. >> >> > So how to handle them? >> >> >> >> I think that BitmapHeapScan, at least, is applicable to any table AM >> >> that has TIDs. It seems to me that in general we can imagine three >> >> kinds of table AMs: >> >> >> >> 1. Table AMs where a tuple can be efficiently located by a real TID. >> >> By a real TID, I mean that the block number part is really a block >> >> number and the item ID is really a location within the block. These >> >> are necessarily quite similar to our current heap, but they can change >> >> the tuple format and page format to some degree, and it seems like in >> >> many cases it should be possible to plug them into our existing index >> >> AMs without too much heartache. Both index scans and bitmap index >> >> scans ought to work. >> > >> > >> > If #1 is only about changing tuple and page formats, then could be much >> > simpler than the patch upthread? We can implement "page format access >> > methods" with routines for insertion, update, pruning and deletion of >> tuples >> > *in particular page*. There is no need to redefine high-level logic for >> > scanning heap, doing updates and so on... >> >> If you are changing tuple format then you do need to worry about >> places wherever we are using HeapTuple like TupleTableSlots, >> Visibility routines, etc. 
(just think: if somebody changes the tuple >> header, then all such places are susceptible to change).
>
> Agree. I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storage. For example, I believe some FDWs could already benefit from a pluggable tuple format.

Accepting multiple tuple formats is possible with a complete replacement of HeapTuple by TupleTableSlot, or something like a values/nulls array instead of a single contiguous chunk of tuple data. Currently I am working on replacing HeapTuple with TupleTableSlot in the upper layers once the tuple is returned from the scan. In most of the places where HeapTuple is present, it is either replaced with TupleTableSlot or changed to StorageTuple (void *). I have yet to evaluate whether this can be supported as an independent feature without needing some heap format function to understand the tuple format in every place.

Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila> wrote: > >> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov >> wrote: >> > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi < >> kommi.harib...@gmail.com> >> > wrote: >> >> >> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera < >> alvhe...@2ndquadrant.com> >> >> wrote: >> >>> >> >>> I have sent the partial patch I have to Hari Babu Kommi. We expect >> that >> >>> he will be able to further this goal some more. >> >> >> >> >> >> Thanks Alvaro for sharing your development patch. >> >> >> >> Most of the patch design is same as described by Alvaro in the first >> mail >> >> [1]. >> >> I will detail the modifications, pending items and open items (needs >> >> discussion) >> >> to implement proper pluggable storage. >> >> >> >> Here I attached WIP patches to support pluggable storage. The patch >> series >> >> are may not work individually. Still so many things are under >> development. >> >> These patches are just to share the approach of the current >> development. >> >> >> >> Some notable changes that I did to make the patch work: >> >> >> >> 1. Added storageam handler to the slot, this is because not all places >> >> the relation is not available in handy. >> >> 2. Retained the minimal Tuple in the slot, as this is used in HASH >> join. >> >> As per the first version, I feel it is fine to allow creating HeapTuple >> >> format data. >> >> >> >> Thanks everyone for sharing their ideas in the developer's >> unconference at >> >> PGCon Ottawa. >> >> >> >> Pending items: >> >> >> >> 1. Replacement of Tuple with slot in Trigger functionality >> >> 2. Replacement of Tuple with Slot from storage handler functions. >> >> 3. Remove/minimize the use of HeapTuple as a Datum. >> >> 4. Replace all references of HeapScanDesc with StorageScanDesc >> >> 5. Planner changes to consider the relation storage during the >> planning. >> >> 6. 
Any planner changes based on the discussion of open items? >> >> 7. some Executor changes to consider the storage advantages? >> >> >> >> Open Items: >> >> >> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> >> HeapTuple and HeapScanDesc, So these scans are directly operating >> >> on those structures and providing the result. >> > >> > >> > What about vacuum? I see vacuum is untouched in the patchset and it is >> not >> > mentioned in this discussion. >> > Do you plan to override low-level function like heap_page_prune(), >> > lazy_vacuum_page() etc., but preserve high-level logic of vacuum? >> > Or do you plan to let pluggable storage implement its own high-level >> vacuum >> > algorithm? >> > >> >> Probably, some other algorithm for vacuum. I am not sure current >> vacuum with its parameters can be used so easily. One thing that >> might need some thoughts is that is it sufficient to say that keep >> autovacuum as off and call some different API for places where the >> vacuum can be invoked manually like Vacuum command to the developer >> implementing some different strategy for vacuum or we need something >> more as well. > > > What kind of other vacuum algorithm do you expect? It would be rather > easier to understand if we would have examples. > > For me, changing of vacuum algorithm is not needed for just heap page > format change. Existing vacuum algorithm could just call page format API > functions for manipulating individual pages. > > Changing of vacuum algorithm might be needed for more invasive changes > than just heap page format. However, we should first understand what these > changes could be and how are they consistent with rest of API design. > Yes, I agree that we need some changes in the vacuum to handle the pluggable storage. 
I haven't fully checked the changes that are needed in vacuum yet, but I feel low-level changes to the functions are enough. There should also be some option for the storage handler to decide whether it needs vacuum or not; based on this flag, vacuum may be skipped on those tables, so those handlers would not need to register those APIs.

Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:47 AM, Robert Haaswrote: > On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > wrote: > > Open Items: > > > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > > HeapTuple and HeapScanDesc, So these scans are directly operating > > on those structures and providing the result. > > > > These scan types may not be applicable to different storage formats. > > So how to handle them? > > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > > 2. Table AMs where a tuple has some other kind of locator. For > example, imagine an index-organized table where the locator is the > primary key, which is a bit like what Alvaro had in mind for indirect > indexes. If the locator is 6 bytes or less, it could potentially be > jammed into a TID, but I don't think that's a great idea. For things > like int8 or numeric, it won't work at all. Even for other things, > it's going to cause problems because the bit patterns won't be what > the code is expecting; e.g. bitmap scans care about the structure of > the TID, not just how many bits it is. (Due credit: Somebody, maybe > Alvaro, pointed out this problem before, at PGCon.) For these kinds > of tables, larger modifications to the index AMs are likely to be > necessary, at least if we want a really general solution, or maybe we > should have separate index AMs - e.g. 
btree for traditional TID-based > heaps, and generic_btree or indirect_btree or key_btree or whatever > for heaps with some other kind of locator. It's not too hard to see > how to make index scans work with this sort of structure but it's very > unclear to me whether, or how, bitmap scans can be made to work. > > 3. Table AMs where a tuple doesn't really have a locator at all. In > these cases, we can't support any sort of index AM at all. When the > table is queried, there's really nothing the core system can do except > ask the table AM for a full scan, supply the quals, and hope the table > AM has some sort of smarts that enable it to optimize somehow. For > example, you can imagine converting cstore_fdw into a table AM of this > sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole > chunks to be proven uninteresting and skipped. (You could use chunk > number + offset to turn this into a table AM of the previous type if > you wanted to support secondary indexes; not sure if that'd be useful, > but it'd certainly be harder.) > > I'm more interested in #1 than in #3, and more interested in #3 than > #2, but other people may have different priorities. Hi Robert, Thanks for the details and your opinion. I also agree that option#1 is better to do first. Regards, Hari Babu Fujitsu Australia
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapilawrote: > On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov > wrote: > > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi < > kommi.harib...@gmail.com> > > wrote: > >> > >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera < > alvhe...@2ndquadrant.com> > >> wrote: > >>> > >>> I have sent the partial patch I have to Hari Babu Kommi. We expect > that > >>> he will be able to further this goal some more. > >> > >> > >> Thanks Alvaro for sharing your development patch. > >> > >> Most of the patch design is same as described by Alvaro in the first > mail > >> [1]. > >> I will detail the modifications, pending items and open items (needs > >> discussion) > >> to implement proper pluggable storage. > >> > >> Here I attached WIP patches to support pluggable storage. The patch > series > >> are may not work individually. Still so many things are under > development. > >> These patches are just to share the approach of the current development. > >> > >> Some notable changes that I did to make the patch work: > >> > >> 1. Added storageam handler to the slot, this is because not all places > >> the relation is not available in handy. > >> 2. Retained the minimal Tuple in the slot, as this is used in HASH join. > >> As per the first version, I feel it is fine to allow creating HeapTuple > >> format data. > >> > >> Thanks everyone for sharing their ideas in the developer's unconference > at > >> PGCon Ottawa. > >> > >> Pending items: > >> > >> 1. Replacement of Tuple with slot in Trigger functionality > >> 2. Replacement of Tuple with Slot from storage handler functions. > >> 3. Remove/minimize the use of HeapTuple as a Datum. > >> 4. Replace all references of HeapScanDesc with StorageScanDesc > >> 5. Planner changes to consider the relation storage during the planning. > >> 6. Any planner changes based on the discussion of open items? > >> 7. some Executor changes to consider the storage advantages? > >> > >> Open Items: > >> > >> 1. 
The BitmapHeapScan and TableSampleScan are tightly coupled with > >> HeapTuple and HeapScanDesc, So these scans are directly operating > >> on those structures and providing the result. > > > > > > What about vacuum? I see vacuum is untouched in the patchset and it is > not > > mentioned in this discussion. > > Do you plan to override low-level function like heap_page_prune(), > > lazy_vacuum_page() etc., but preserve high-level logic of vacuum? > > Or do you plan to let pluggable storage implement its own high-level > vacuum > > algorithm? > > > > Probably, some other algorithm for vacuum. I am not sure current > vacuum with its parameters can be used so easily. One thing that > might need some thoughts is that is it sufficient to say that keep > autovacuum as off and call some different API for places where the > vacuum can be invoked manually like Vacuum command to the developer > implementing some different strategy for vacuum or we need something > more as well. What kind of other vacuum algorithm do you expect? It would be rather easier to understand if we would have examples. For me, changing of vacuum algorithm is not needed for just heap page format change. Existing vacuum algorithm could just call page format API functions for manipulating individual pages. Changing of vacuum algorithm might be needed for more invasive changes than just heap page format. However, we should first understand what these changes could be and how are they consistent with rest of API design. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapilawrote: > On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov > wrote: > > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas > wrote: > >> > >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > >> wrote: > >> > Open Items: > >> > > >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > >> > HeapTuple and HeapScanDesc, So these scans are directly operating > >> > on those structures and providing the result. > >> > > >> > These scan types may not be applicable to different storage formats. > >> > So how to handle them? > >> > >> I think that BitmapHeapScan, at least, is applicable to any table AM > >> that has TIDs. It seems to me that in general we can imagine three > >> kinds of table AMs: > >> > >> 1. Table AMs where a tuple can be efficiently located by a real TID. > >> By a real TID, I mean that the block number part is really a block > >> number and the item ID is really a location within the block. These > >> are necessarily quite similar to our current heap, but they can change > >> the tuple format and page format to some degree, and it seems like in > >> many cases it should be possible to plug them into our existing index > >> AMs without too much heartache. Both index scans and bitmap index > >> scans ought to work. > > > > > > If #1 is only about changing tuple and page formats, then could be much > > simpler than the patch upthread? We can implement "page format access > > methods" with routines for insertion, update, pruning and deletion of > tuples > > *in particular page*. There is no need to redefine high-level logic for > > scanning heap, doing updates and so on... > > If you are changing tuple format then you do need to worry about > places wherever we are using HeapTuple like TupleTableSlots, > Visibility routines, etc. (just think if somebody changes tuple > header, then all such places are susceptible to change). Agree. 
I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storage. For example, I believe some FDWs could already benefit from a pluggable tuple format.

> Similarly, if the page format is changed you need to consider all page scan APIs like heapgettup_pagemode/heapgetpage.

If the page format is changed, then heapgettup_pagemode/heapgetpage should use appropriate API functions for manipulating page items. I'm very afraid of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous functions like heap_update without understanding a clear use case. It's definitely not needed for changing the heap page format.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkovwrote: > On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi > wrote: >> >> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera >> wrote: >>> >>> I have sent the partial patch I have to Hari Babu Kommi. We expect that >>> he will be able to further this goal some more. >> >> >> Thanks Alvaro for sharing your development patch. >> >> Most of the patch design is same as described by Alvaro in the first mail >> [1]. >> I will detail the modifications, pending items and open items (needs >> discussion) >> to implement proper pluggable storage. >> >> Here I attached WIP patches to support pluggable storage. The patch series >> are may not work individually. Still so many things are under development. >> These patches are just to share the approach of the current development. >> >> Some notable changes that I did to make the patch work: >> >> 1. Added storageam handler to the slot, this is because not all places >> the relation is not available in handy. >> 2. Retained the minimal Tuple in the slot, as this is used in HASH join. >> As per the first version, I feel it is fine to allow creating HeapTuple >> format data. >> >> Thanks everyone for sharing their ideas in the developer's unconference at >> PGCon Ottawa. >> >> Pending items: >> >> 1. Replacement of Tuple with slot in Trigger functionality >> 2. Replacement of Tuple with Slot from storage handler functions. >> 3. Remove/minimize the use of HeapTuple as a Datum. >> 4. Replace all references of HeapScanDesc with StorageScanDesc >> 5. Planner changes to consider the relation storage during the planning. >> 6. Any planner changes based on the discussion of open items? >> 7. some Executor changes to consider the storage advantages? >> >> Open Items: >> >> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> HeapTuple and HeapScanDesc, So these scans are directly operating >> on those structures and providing the result. > > > What about vacuum? 
I see vacuum is untouched in the patchset, and it is not > mentioned in this discussion. > Do you plan to override low-level functions like heap_page_prune(), > lazy_vacuum_page() etc., but preserve the high-level logic of vacuum? > Or do you plan to let pluggable storage implement its own high-level vacuum > algorithm?

Probably some other algorithm for vacuum. I am not sure the current vacuum with its parameters can be used so easily. One thing that might need some thought is whether it is sufficient to tell a developer implementing some different strategy for vacuum to keep autovacuum off and call some different API in places where vacuum can be invoked manually (like the VACUUM command), or whether we need something more as well.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkovwrote: > On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas wrote: >> >> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi >> wrote: >> > Open Items: >> > >> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with >> > HeapTuple and HeapScanDesc, So these scans are directly operating >> > on those structures and providing the result. >> > >> > These scan types may not be applicable to different storage formats. >> > So how to handle them? >> >> I think that BitmapHeapScan, at least, is applicable to any table AM >> that has TIDs. It seems to me that in general we can imagine three >> kinds of table AMs: >> >> 1. Table AMs where a tuple can be efficiently located by a real TID. >> By a real TID, I mean that the block number part is really a block >> number and the item ID is really a location within the block. These >> are necessarily quite similar to our current heap, but they can change >> the tuple format and page format to some degree, and it seems like in >> many cases it should be possible to plug them into our existing index >> AMs without too much heartache. Both index scans and bitmap index >> scans ought to work. > > > If #1 is only about changing tuple and page formats, then could be much > simpler than the patch upthread? We can implement "page format access > methods" with routines for insertion, update, pruning and deletion of tuples > *in particular page*. There is no need to redefine high-level logic for > scanning heap, doing updates and so on... > If you are changing tuple format then you do need to worry about places wherever we are using HeapTuple like TupleTableSlots, Visibility routines, etc. (just think if somebody changes tuple header, then all such places are susceptible to change). Similarly, if the page format is changed you need to consider all page scan API's like heapgettup_pagemode/heapgetpage. -- With Regards, Amit Kapila. 
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Mon, Jun 26, 2017 at 10:55 PM, Tomas Vondrawrote: > On 06/26/2017 05:18 PM, Alexander Korotkov wrote: > >> I see that design question for PostgreSQL pluggable storages is very hard. >> > > IMHO it's mostly expected to be hard. > > Firstly, PostgreSQL is a mature product with many advanced features, and > reworking a low-level feature without breaking something on top of it is > hard by definition. > > Secondly, project policies and code quality requirements set the bar very > high too, I think. Sure. > BTW, I think it worth analyzing existing use-cases of pluggable > >> storages. I think that the most famous DBMS with pluggable storage API >> is MySQL. This why I decided to start with it. I've added >> MySQL/MariaDB section on wiki page. >> https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB >> It appears that significant part of MySQL storage engines are misuses. >> MySQL lacks of various features like FDWs or writable views and so on. >> This is why developers created a lot of pluggable storages for that >> purposes. We definitely don't want something like this in PostgreSQL now. >> I created special resume column where I expressed whether it >> would be nice to have something like this table engine in PostgreSQL. >> > > I don't want to discourage you, but I'm not sure how valuable this is. > > I agree it's valuable to have a an over-view of use cases for pluggable > storage, but I don't think we'll get that from looking at MySQL. As you > noticed, most of the storage engines are misuses, so it's difficult to > learn anything valuable from them. You can argue that using FDWs to > implement alternative storage engines is a misuse too, but at least that > gives us a valid use case (columnar storage implemented using FDW). 
> > If anything, the MySQL storage engines should serve as a cautionary tale of how not to do things - there's also plenty of references in the MySQL > "Restrictions and Limitations" section of the manual: > > https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

Just to clarify: I don't propose any adoption of the MySQL pluggable storage API into PostgreSQL, or anything like that. I just wrote this table for completeness of vision. It may turn out that somebody makes some valuable notes out of it, or it may not. "Yes" in the third column means only that there are positive user-visible effects which are *nice to have* in PostgreSQL.

Also, I remember there was a table comparing different designs of pluggable storage and their use cases at the PGCon 2017 unconference. Could somebody reproduce it?

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
Hi,

On 06/26/2017 05:18 PM, Alexander Korotkov wrote:
> Hackers, I see that the design question for PostgreSQL pluggable storage is very hard.

IMHO it's mostly expected to be hard.

Firstly, PostgreSQL is a mature product with many advanced features, and reworking a low-level feature without breaking something on top of it is hard by definition.

Secondly, project policies and code quality requirements set the bar very high too, I think.

> BTW, I think it's worth analyzing existing use cases of pluggable storage. I think that the most famous DBMS with a pluggable storage API is MySQL. This is why I decided to start with it. I've added a MySQL/MariaDB section on the wiki page. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB It appears that a significant part of MySQL storage engines are misuses. MySQL lacks various features like FDWs, writable views and so on. This is why developers created a lot of pluggable storage engines for those purposes. We definitely don't want something like this in PostgreSQL now. I created a special resume column where I expressed whether it would be nice to have something like a given table engine in PostgreSQL.

I don't want to discourage you, but I'm not sure how valuable this is.

I agree it's valuable to have an overview of use cases for pluggable storage, but I don't think we'll get that from looking at MySQL. As you noticed, most of the storage engines are misuses, so it's difficult to learn anything valuable from them. You can argue that using FDWs to implement alternative storage engines is a misuse too, but at least that gives us a valid use case (columnar storage implemented using FDW).
If anything, the MySQL storage engines should serve as a cautionary tale of how not to do things - there's also plenty of references in the MySQL "Restrictions and Limitations" section of the manual: https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

regards

-- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Hackers,

I see that the design question for PostgreSQL pluggable storage is very hard.

BTW, I think it's worth analyzing existing use cases of pluggable storage. I think that the most famous DBMS with a pluggable storage API is MySQL. This is why I decided to start with it. I've added a MySQL/MariaDB section on the wiki page. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB

It appears that a significant part of MySQL storage engines are misuses. MySQL lacks various features like FDWs, writable views and so on. This is why developers created a lot of pluggable storage engines for those purposes. We definitely don't want something like this in PostgreSQL now. I created a special resume column where I expressed whether it would be nice to have something like a given table engine in PostgreSQL.

Any comments and additions are welcome. I'm planning to write a similar table for MongoDB.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On 23/06/2017 16:07, Tomas Vondra wrote: > > I think you're probably right - GIN does compress the posting lists by > exploiting the TID redundancy (its page/offset structure), and I > don't think there are other index AMs doing that. > > But I'm not sure we can simply rely on that - it's possible people will > try to improve other index types (e.g. by adding similar compression to > other index types). That reminds me of https://www.postgresql.org/message-id/55e4051b.7020...@postgrespro.ru where Anastasia proposed something similar. -- Julien Rouhaud http://dalibo.com - http://dalibo.org -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Tomas Vondra writes: > It would be really great if you could explain why BitmapScans are > dubious, instead of just labeling them as dubious. (I guess you mean > Bitmap Heap Scans, right?) The two things I'm aware of are (a) the assumption you noted, that fetching tuples in TID sort order is a reasonably efficient thing, and (b) the "offset" part of a TID can't exceed MaxHeapTuplesPerPage --- see data structure in tidbitmap.c. The latter issue means that you don't really have a full six bytes to play with in a TID, only about five. I don't think (b) would be terribly hard to fix if we had a motivation to, but I wonder whether there aren't other places that also know this about TIDs. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/23/17 10:38 AM, Teodor Sigaev wrote: 1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Currently TID has its own type in system catalog. Seems, it's possible that storage claims type of TID which it uses. Then, AM could claim it too, so the core based on that information could solve the question about AM-storage compatibility. Storage could also claim that it hasn't TID type at all so it couldn't be used with any access method, use case: compressed column oriented storage. Isn't the fact that TID is an existing type defined in system catalogs a fairly insignificant detail? I mean, we could just as easily define a new 64-bit locator data type, and use that instead, for example. The main issue here is that we assume things about the TID contents, i.e. that it contains page/offset etc. And Bitmap nodes rely on that, to some extent - e.g. when prefetching data. > As I remember, only GIN depends on TID format, other indexes use it as opaque type. Except, at least, btree and GiST - they believe that internal pointers are the same as outer (to heap) I think you're probably right - GIN does compress the posting lists by exploiting the TID redundancy (its page/offset structure), and I don't think there are other index AMs doing that. But I'm not sure we can simply rely on that - it's possible people will try to improve other index types (e.g. by adding similar compression to other index types). Moreover we now support extensions defining custom index AMs, and we have no clue what people may do in those. So this would clearly require some sort of flag for each index AM. > Another dubious part - BitmapScan. It would be really great if you could explain why BitmapScans are dubious, instead of just labeling them as dubious. (I guess you mean Bitmap Heap Scans, right?) 
I see no conceptual issue with bitmap scans on arbitrary locator types, as long as there's sufficient locality information encoded in the value. What I mean by that is that for any two locator values A and B: (1) if (A < B) then (A is stored before B) (2) if (A is close to B) then (A is stored close to B) Without these properties it's probably futile to try to do bitmap scans, because the bitmap would not result in a mostly sequential access pattern and things like prefetch would not be very efficient, I think. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Currently TID has its own type in system catalog. Seems, it's possible that storage claims type of TID which it uses. Then, AM could claim it too, so the core based on that information could solve the question about AM-storage compatibility. Storage could also claim that it hasn't TID type at all so it couldn't be used with any access method, use case: compressed column oriented storage. As I remember, only GIN depends on TID format, other indexes use it as opaque type. Except, at least, btree and GiST - they believe that internal pointers are the same as outer (to heap) Another dubious part - BitmapScan. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 11:27 PM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov > wrote: >> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier >> wrote: >>> Putting that in a couple of words. >>> 1. Table AM with a 6-byte TID. >>> 2. Table AM with a custom locator format, which could be TID-like. >>> 3. Table AM with no locators. >>> >>> Getting into having #1 first to work out would already be really >>> useful for users. >> >> What exactly would be useful for *users*? Any kind of API itself is >> completely useless for users, because they are users, not developers. >> Storage API could be useful for developers to implement storage AMs which in >> turn could be useful for users. > > What's your point? I assume that is what Michael meant. Sorry for the confusion. I implied here that users are the ones developing modules. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/22/17 4:36 PM, Alexander Korotkov wrote: On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas wrote: On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov wrote: > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier wrote: >> Putting that in a couple of words. >> 1. Table AM with a 6-byte TID. >> 2. Table AM with a custom locator format, which could be TID-like. >> 3. Table AM with no locators. >> >> Getting into having #1 first to work out would already be really >> useful for users. > > What exactly would be useful for *users*? Any kind of API itself is > completely useless for users, because they are users, not developers. > Storage API could be useful for developers to implement storage AMs which in > turn could be useful for users. What's your point? I assume that is what Michael meant. TBH, I don't understand what particular enhancements we expect from having #1. I'd say that's one of the things we're trying to figure out in this thread. Does it make sense to go with #1 in v1 of the patch, or do we have to implement #2 or #3 to get some real benefit for the users? > This is why it's hard for me to say if #1 is a good idea. It's also hard for me to say if the patch upthread is the right way of implementing #1. But, I have a gut feeling that even if #1 is a good idea itself, it's definitely not what users expect from "pluggable storages". The question is "why" do you think that. What features do you expect from the pluggable storage API that would be impossible to implement with option #1? I'm not trying to be annoying or anything like that - I don't know the answer and discussing those things is exactly why this thread exists. I do agree #1 has limitations, and that it'd be great to get an API that supports all kinds of pluggable storage implementations. But I guess that'll take some time. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Hi, On 6/21/17 9:47 PM, Robert Haas wrote: ... > like int8 or numeric, it won't work at all. Even for other things, it's going to cause problems because the bit patterns won't be what the code is expecting; e.g. bitmap scans care about the structure of the TID, not just how many bits it is. (Due credit: Somebody, maybe Alvaro, pointed out this problem before, at PGCon.) Can you elaborate a bit more about these TID bit pattern issues? I do remember that not all TIDs are valid due to safeguards on individual fields, like for example Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits)) But perhaps there are some other issues? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas wrote: > On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov > wrote: > > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier < > michael.paqu...@gmail.com> > > wrote: > >> Putting that in a couple of words. > >> 1. Table AM with a 6-byte TID. > >> 2. Table AM with a custom locator format, which could be TID-like. > >> 3. Table AM with no locators. > >> > >> Getting into having #1 first to work out would already be really > >> useful for users. > > > > What exactly would be useful for *users*? Any kind of API itself is > > completely useless for users, because they are users, not developers. > > Storage API could be useful for developers to implement storage AMs > which in > > turn could be useful for users. > > What's your point? I assume that is what Michael meant. > TBH, I don't understand what particular enhancements we expect from having #1. This is why it's hard for me to say if #1 is a good idea. It's also hard for me to say if the patch upthread is the right way of implementing #1. But, I have a gut feeling that even if #1 is a good idea itself, it's definitely not what users expect from "pluggable storages". From the user side, it would be normal to expect that a "pluggable storage" could be an index-organized table where the index could be, say, an LSM... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas wrote: > On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi > wrote: > > Open Items: > > > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > > HeapTuple and HeapScanDesc, So these scans are directly operating > > on those structures and providing the result. > > > > These scan types may not be applicable to different storage formats. > > So how to handle them? > > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > If #1 is only about changing tuple and page formats, then it could be much simpler than the patch upthread? We can implement "page format access methods" with routines for insertion, update, pruning and deletion of tuples *in a particular page*. There is no need to redefine the high-level logic for scanning the heap, doing updates and so on... -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov wrote: > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier > wrote: >> Putting that in a couple of words. >> 1. Table AM with a 6-byte TID. >> 2. Table AM with a custom locator format, which could be TID-like. >> 3. Table AM with no locators. >> >> Getting into having #1 first to work out would already be really >> useful for users. > > What exactly would be useful for *users*? Any kind of API itself is > completely useless for users, because they are users, not developers. > Storage API could be useful for developers to implement storage AMs which in > turn could be useful for users. What's your point? I assume that is what Michael meant. > Then while saying that #1 is useful for > users, it would be nice to keep in mind particular storage AMs which can be > implemented using #1. I don't think anybody's arguing with that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier wrote: > Putting that in a couple of words. > 1. Table AM with a 6-byte TID. > 2. Table AM with a custom locator format, which could be TID-like. > 3. Table AM with no locators. > > Getting into having #1 first to work out would already be really > useful for users. What exactly would be useful for *users*? Any kind of API itself is completely useless for users, because they are users, not developers. The storage API could be useful for developers to implement storage AMs which in turn could be useful for users. Then, while saying that #1 is useful for users, it would be nice to keep in mind particular storage AMs which can be implemented using #1. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi wrote: > On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera > wrote: > >> I have sent the partial patch I have to Hari Babu Kommi. We expect that >> he will be able to further this goal some more. > > > Thanks Alvaro for sharing your development patch. > > Most of the patch design is the same as described by Alvaro in the first mail > [1]. > I will detail the modifications, pending items and open items (needing > discussion) > to implement proper pluggable storage. > > Here I attached WIP patches to support pluggable storage. The patches in the > series may not work individually. Many things are still under development. > These patches are just to share the approach of the current development. > > Some notable changes that I did to make the patch work: > > 1. Added the storageam handler to the slot, because the relation is not > readily available in all places. > 2. Retained the minimal tuple in the slot, as this is used in hash joins. > As per the first version, I feel it is fine to allow creating HeapTuple > format data. > > Thanks everyone for sharing their ideas in the developer's unconference at > PGCon Ottawa. > > Pending items: > > 1. Replacement of Tuple with slot in Trigger functionality > 2. Replacement of Tuple with Slot from storage handler functions. > 3. Remove/minimize the use of HeapTuple as a Datum. > 4. Replace all references of HeapScanDesc with StorageScanDesc > 5. Planner changes to consider the relation storage during the planning. > 6. Any planner changes based on the discussion of open items? > 7. Some executor changes to consider the storage advantages? > > Open Items: > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > HeapTuple and HeapScanDesc, So these scans are directly operating > on those structures and providing the result. > What about vacuum? I see vacuum is untouched in the patchset and it is not mentioned in this discussion. 
Do you plan to override low-level functions like heap_page_prune(), lazy_vacuum_page() etc., but preserve the high-level logic of vacuum? Or do you plan to let the pluggable storage implement its own high-level vacuum algorithm? -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 11:12 AM, Amit Langote wrote: > On 2017/06/22 10:01, Michael Paquier wrote: >> #3 implies that the index AM logic is implemented in the table >> AM. Not saying that it is not useful, but it does not feel natural to >> have the planner request a sequential scan, just to have the table >> AM secretly do some kind of index/skipping scan. > > I had read a relevant comment on a pluggable storage thread a while back > [1]. In short, the comment was that the planner should be able to get > some intelligence, via some API, from the heap storage implementation > about the latter's access cost characteristics. The storage should > provide accurate-enough cost information to the planner when such a > request is made by, say, cost_seqscan(), so that the planner can make > an appropriate choice. If two tables containing the same number of rows (and > the same size in bytes, perhaps) use different storage implementations, > then, planner's cost parameters remaining the same, cost_seqscan() will end up > calculating different costs for the two tables. Perhaps, SeqScan would be > chosen for one table but not the other based on that. Yeah, I agree that the costing part needs some clear attention and thoughts, and the gains are absolutely huge with the correct interface. That could be done in a later step though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On 2017/06/22 10:01, Michael Paquier wrote: > #3 implies that the index AM logic is implemented in the table > AM. Not saying that it is not useful, but it does not feel natural to > have the planner request a sequential scan, just to have the table > AM secretly do some kind of index/skipping scan. I had read a relevant comment on a pluggable storage thread a while back [1]. In short, the comment was that the planner should be able to get some intelligence, via some API, from the heap storage implementation about the latter's access cost characteristics. The storage should provide accurate-enough cost information to the planner when such a request is made by, say, cost_seqscan(), so that the planner can make an appropriate choice. If two tables containing the same number of rows (and the same size in bytes, perhaps) use different storage implementations, then, planner's cost parameters remaining the same, cost_seqscan() will end up calculating different costs for the two tables. Perhaps, SeqScan would be chosen for one table but not the other based on that. Thanks, Amit [1] https://www.postgresql.org/message-id/CA%2BTgmoY3LXVUPQVdZW70XKp5PsXffO82pXXt%3DbeegcV%2B%3DRsQgg%40mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Thu, Jun 22, 2017 at 4:47 AM, Robert Haas wrote: > I think that BitmapHeapScan, at least, is applicable to any table AM > that has TIDs. It seems to me that in general we can imagine three > kinds of table AMs: > > 1. Table AMs where a tuple can be efficiently located by a real TID. > By a real TID, I mean that the block number part is really a block > number and the item ID is really a location within the block. These > are necessarily quite similar to our current heap, but they can change > the tuple format and page format to some degree, and it seems like in > many cases it should be possible to plug them into our existing index > AMs without too much heartache. Both index scans and bitmap index > scans ought to work. > > 2. Table AMs where a tuple has some other kind of locator. For > example, imagine an index-organized table where the locator is the > primary key, which is a bit like what Alvaro had in mind for indirect > indexes. If the locator is 6 bytes or less, it could potentially be > jammed into a TID, but I don't think that's a great idea. For things > like int8 or numeric, it won't work at all. Even for other things, > it's going to cause problems because the bit patterns won't be what > the code is expecting; e.g. bitmap scans care about the structure of > the TID, not just how many bits it is. (Due credit: Somebody, maybe > Alvaro, pointed out this problem before, at PGCon.) For these kinds > of tables, larger modifications to the index AMs are likely to be > necessary, at least if we want a really general solution, or maybe we > should have separate index AMs - e.g. btree for traditional TID-based > heaps, and generic_btree or indirect_btree or key_btree or whatever > for heaps with some other kind of locator. It's not too hard to see > how to make index scans work with this sort of structure but it's very > unclear to me whether, or how, bitmap scans can be made to work. > > 3. Table AMs where a tuple doesn't really have a locator at all. 
In > these cases, we can't support any sort of index AM at all. When the > table is queried, there's really nothing the core system can do except > ask the table AM for a full scan, supply the quals, and hope the table > AM has some sort of smarts that enable it to optimize somehow. For > example, you can imagine converting cstore_fdw into a table AM of this > sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole > chunks to be proven uninteresting and skipped. (You could use chunk > number + offset to turn this into a table AM of the previous type if > you wanted to support secondary indexes; not sure if that'd be useful, > but it'd certainly be harder.) > > I'm more interested in #1 than in #3, and more interested in #3 than > #2, but other people may have different priorities. Putting that in a couple of words. 1. Table AM with a 6-byte TID. 2. Table AM with a custom locator format, which could be TID-like. 3. Table AM with no locators. Getting into having #1 first to work out would already be really useful for users. My take on the matter is that being able to plug in-core index AMs directly into a table AM of kind #1 is more useful in the long term, as it is possible for multiple table AMs to use the same kind of index AM which is designed nicely enough. So the index AM logic basically does not need to be duplicated across multiple table AMs. #3 implies that the index AM logic is implemented in the table AM. Not saying that it is not useful, but it does not feel natural to have the planner request a sequential scan, just to have the table AM secretly do some kind of index/skipping scan. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi wrote: > Open Items: > > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with > HeapTuple and HeapScanDesc, So these scans are directly operating > on those structures and providing the result. > > These scan types may not be applicable to different storage formats. > So how to handle them? I think that BitmapHeapScan, at least, is applicable to any table AM that has TIDs. It seems to me that in general we can imagine three kinds of table AMs: 1. Table AMs where a tuple can be efficiently located by a real TID. By a real TID, I mean that the block number part is really a block number and the item ID is really a location within the block. These are necessarily quite similar to our current heap, but they can change the tuple format and page format to some degree, and it seems like in many cases it should be possible to plug them into our existing index AMs without too much heartache. Both index scans and bitmap index scans ought to work. 2. Table AMs where a tuple has some other kind of locator. For example, imagine an index-organized table where the locator is the primary key, which is a bit like what Alvaro had in mind for indirect indexes. If the locator is 6 bytes or less, it could potentially be jammed into a TID, but I don't think that's a great idea. For things like int8 or numeric, it won't work at all. Even for other things, it's going to cause problems because the bit patterns won't be what the code is expecting; e.g. bitmap scans care about the structure of the TID, not just how many bits it is. (Due credit: Somebody, maybe Alvaro, pointed out this problem before, at PGCon.) For these kinds of tables, larger modifications to the index AMs are likely to be necessary, at least if we want a really general solution, or maybe we should have separate index AMs - e.g. 
btree for traditional TID-based heaps, and generic_btree or indirect_btree or key_btree or whatever for heaps with some other kind of locator. It's not too hard to see how to make index scans work with this sort of structure but it's very unclear to me whether, or how, bitmap scans can be made to work. 3. Table AMs where a tuple doesn't really have a locator at all. In these cases, we can't support any sort of index AM at all. When the table is queried, there's really nothing the core system can do except ask the table AM for a full scan, supply the quals, and hope the table AM has some sort of smarts that enable it to optimize somehow. For example, you can imagine converting cstore_fdw into a table AM of this sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole chunks to be proven uninteresting and skipped. (You could use chunk number + offset to turn this into a table AM of the previous type if you wanted to support secondary indexes; not sure if that'd be useful, but it'd certainly be harder.) I'm more interested in #1 than in #3, and more interested in #3 than #2, but other people may have different priorities. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
I have sent the partial patch I have to Hari Babu Kommi. We expect that he will be able to further this goal some more. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. I regret to announce that I'll have to stay away from this topic for a little while, as I have another project taking priority. I expect to return to this shortly thereafter, hopefully in time to get it done for pg10. If anyone is interested in helping with the (currently not compilable) patch I have, please mail me privately and we can discuss. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On 2016-08-18 08:58:11 +0100, Simon Riggs wrote: > On 16 August 2016 at 19:46, Andres Freund wrote: > > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > >> Thanks for taking a stab at this. I'd like to throw out a few concerns. > >> > >> One, I'm worried that adding an additional layer of pointer-jumping is > >> going to slow things down and make Andres' work to speed up the > >> executor more difficult. I don't know that there is a problem there, > >> and if there is a problem I don't know what to do about it, but I > >> think it's something we need to consider. > > > > I'm quite concerned about that as well. > > This objection would apply to all other proposals as well, FDW etc.. Not really. The place you draw your boundary significantly influences where and how much of a price you pay. Having another indirection inside HeapTuple - which is accessed in many many places, is something different from having a seqscan equivalent, which returns you a batch of already deformed tuples in array form. In the latter case there's one additional indirection per batch of tuples, in the former there's many for each tuple. > Do you see some way to add flexibility yet without adding a branch > point in the code? I'm not even saying that the approach of doing the indirection inside the HeapTuple replacement is a no-go, just that it concerns me. I do think that working on only lugging around values/isnull arrays is something that I could see working better, if some problems are addressed beforehand. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On August 18, 2016 7:44:50 AM PDT, Ants Aasma wrote: >On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund >wrote: >> On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >>> I am somewhat inclined to >>> believe that we need to restructure the executor in a bigger way so >>> that it passes around datums instead of tuples; I'm inclined to >>> believe that the current tuple-centric model is probably not optimal >>> even for the existing storage format. >> >> I actually prototyped that, and it's not an easy win so far. Column >> extraction cost, even after significant optimization, is still often >a >> significant portion of the runtime. And e.g. projection only >extracting >> all columns, after evaluating a restrictive qual referring to an >"early" >> column, can be a significant win. We'd definitely have to give up on >> extracting columns 0..n when accessing later columns... Hm. > >What about going even further than [1] in converting the executor to >being opcode based and merging projection and qual evaluation to a >single pass? Optimizer would then have some leeway about how to order >column extraction and qual evaluation. Might even be worth it to >special case some functions as separate opcodes (e.g. int4eq, >timestamp_lt). > >Regards, >Ants Aasma > >[1] >https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de Good question. I think I have a reasonable answer, but let's discuss that in the other thread. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund wrote: > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >> I am somewhat inclined to >> believe that we need to restructure the executor in a bigger way so >> that it passes around datums instead of tuples; I'm inclined to >> believe that the current tuple-centric model is probably not optimal >> even for the existing storage format. > > I actually prototyped that, and it's not an easy win so far. Column > extraction cost, even after significant optimization, is still often a > significant portion of the runtime. And e.g. projection only extracting > all columns, after evaluating a restrictive qual referring to an "early" > column, can be a significant win. We'd definitely have to give up on > extracting columns 0..n when accessing later columns... Hm. What about going even further than [1] in converting the executor to being opcode based and merging projection and qual evaluation to a single pass? The optimizer would then have some leeway about how to order column extraction and qual evaluation. Might even be worth it to special-case some functions as separate opcodes (e.g. int4eq, timestamp_lt). Regards, Ants Aasma [1] https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3...@alap3.anarazel.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Pluggable storage
On Wed, Aug 17, 2016 at 10:33 PM, Alvaro Herrera wrote: > Anastasia Lubennikova wrote: >> >> Except these, there are some pretty strange and unrelated functions in >> src/backend/catalog. >> I'm willing to fix them, but I'd like to synchronize our efforts. > > I very much would like to stay away from touching src/backend/catalog, > which are the functions that deal with system catalogs. We can simply > say that system catalogs are hardcoded to use heapam.c storage for now. > Does this mean that if any storage needs to access system catalog information, it needs to be aware of HeapTuple and other required stuff like syscache? Again, if it needs to update some stats or something like that, it needs to be aware of the heap tuple format. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Pluggable storage
On Thu, Aug 18, 2016 at 10:58 AM, Simon Riggs wrote: > On 16 August 2016 at 19:46, Andres Freund wrote: > > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > >> Thanks for taking a stab at this. I'd like to throw out a few concerns. > >> > >> One, I'm worried that adding an additional layer of pointer-jumping is > >> going to slow things down and make Andres' work to speed up the > >> executor more difficult. I don't know that there is a problem there, > >> and if there is a problem I don't know what to do about it, but I > >> think it's something we need to consider. > > > > I'm quite concerned about that as well. > > This objection would apply to all other proposals as well, FDW etc.. > > Do you see some way to add flexibility yet without adding a branch > point in the code? > It's impossible without a branch point in the code. The question is where this branch should be located. In particular, we can put this branch point into the planner by defining distinct executor nodes for each pluggable storage. In this case, each storage would have its own optimized executor nodes. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
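[Editorial note: the "branch point at plan time" idea can be sketched in miniature. All names below are invented for illustration: a "planner" binds a storage-specific getnext function once per scan, so the per-tuple executor loop pays one indirect call but contains no if/switch on the storage type.]

```c
#include <assert.h>
#include <stddef.h>

/* A scan whose storage-specific behavior is bound once, at "plan" time. */
typedef struct ScanState ScanState;
typedef int (*GetNextFunc)(ScanState *);    /* returns -1 at end of scan */

struct ScanState
{
    GetNextFunc getnext;    /* chosen at plan time */
    const int  *data;       /* stand-in for the relation */
    int         pos;
    int         len;
};

int
heap_getnext_sketch(ScanState *s)
{
    return s->pos < s->len ? s->data[s->pos++] : -1;
}

/* A second storage format would supply its own implementation; for
 * this sketch it behaves identically. */
int
columnar_getnext_sketch(ScanState *s)
{
    return s->pos < s->len ? s->data[s->pos++] : -1;
}

/* "Planner": pick the implementation once, per relation. */
void
plan_scan(ScanState *s, const int *data, int len, int use_columnar)
{
    s->getnext = use_columnar ? columnar_getnext_sketch : heap_getnext_sketch;
    s->data = data;
    s->pos = 0;
    s->len = len;
}

/* "Executor": tight per-tuple loop with no storage branch inside.
 * (Values must be non-negative, since -1 is the end-of-scan marker.) */
int
sum_scan(ScanState *s)
{
    int     v,
            sum = 0;

    while ((v = s->getnext(s)) != -1)
        sum += v;
    return sum;
}
```

Whether the indirect call is bound in a generic node (as here) or the planner emits a distinct node type per storage, the per-tuple path stays branch-free either way.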
Re: [HACKERS] Pluggable storage
On 16 August 2016 at 19:46, Andres Freund wrote: > On 2016-08-15 12:02:18 -0400, Robert Haas wrote: >> Thanks for taking a stab at this. I'd like to throw out a few concerns. >> >> One, I'm worried that adding an additional layer of pointer-jumping is >> going to slow things down and make Andres' work to speed up the >> executor more difficult. I don't know that there is a problem there, >> and if there is a problem I don't know what to do about it, but I >> think it's something we need to consider. > > I'm quite concerned about that as well. This objection would apply to all other proposals as well, FDW etc. Do you see some way to add flexibility yet without adding a branch point in the code? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Pluggable storage
Anastasia Lubennikova wrote: > 13.08.2016 02:15, Alvaro Herrera: > >To support this, we introduce StorageTuple and StorageScanDesc. > >StorageTuples represent a physical tuple coming from some storage AM. > >It is necessary to have a pointer to a StorageAmRoutine in order to > >manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. > > The StorageTuples concept looks really cool. I've got some questions on > details of implementation. > > Do StorageTuples have fields common to all implementations? > Or is StorageTuple a totally abstract structure that has nothing to do > with data, except pointing to it? > > I mean, we already have the HeapTupleData structure, which is a pretty > good candidate to replace with StorageTuple. I was planning to replace all uses of HeapTuple in the executor with StorageTuple, actually. But the main reason I would like to avoid HeapTupleData itself is that it contains an assumption that there is a single palloc chunk that contains the tuple (t_len and t_data). This might not be true in representations that split the tuple, for example in columnar storage where you have one column in page A and another column in page B, for the same tuple. I suppose there might be some point to keeping t_tableOid and t_self, though. > And maybe add a "t_handler" field that points to handler functions. > I'm not sure whether it will be the name of the StorageAm, or its OID, or maybe the > main function itself. Although, if I'm not mistaken, we always have > RelationData when we want to operate on the tuple, so having t_handler > in the StorageTuple is excessive. Yeah, I think the RelationData (or more precisely the StorageAmRoutine) is always going to be available, so I don't think we need a pointer in the tuple itself. > This approach allows us to minimize code changes and ensure that we > won't miss any function that handles tuples. > > Do you see any weak points of the suggestion? > What design do you use in your prototype? 
It's currently a "void *" pointer in my prototype. > >RelationData gains ->rd_stamroutine which is a pointer to the > >StorageAmRoutine for the relation in question. Similarly, > >TupleTableSlot is augmented with a link to the StorageAmRoutine to > >handle the StorageTuple it contains (probably in most cases it's set at > >the same time as the tupdesc). This implies that routines such as > >ExecAssignScanType need to pass down the StorageAmRoutine from the > >relation to the slot. > > If we already have this pointer in t_handler as described below, > we don't need to pass it between functions and slots. I think it's better to have it in slots, so you can install multiple tuples in the slot without having to change the routine pointers each time. > >The executor is modified so that instead of calling heap_insert etc > >directly, it uses rel->rd_stamroutine to call these methods. The > >executor is still in charge of dealing with indexes, constraints, and > >any other thing that's not the tuple storage itself (this is one major > >point in which this differs from FDWs). This all looks simple enough, > >with one exception and a few notes: > > That is exactly what I tried to describe in my proposal. > Chapter "Relation management". I'm sure, you've already noticed > that it will require huge source code cleaning. I've carefully read > the sources and found "violators" of abstraction in src/backend/commands. > The list is attached to the wiki page > https://wiki.postgresql.org/wiki/HeapamRefactoring. > > Except these, there are some pretty strange and unrelated functions in > src/backend/catalog. > I'm willing to fix them, but I'd like to synchronize our efforts. I very much would like to stay away from touching src/backend/catalog, which are the functions that deal with system catalogs. We can simply say that system catalogs are hardcoded to use heapam.c storage for now. 
If we later see a need to enable some particular catalog using a different storage implementation, we can change the code for that specific catalog in src/backend/catalog and everywhere else, to use the abstract API instead of hardcoding heap_insert etc. But that can be left for a second pass. (This is my point "iv" further below, to which you said "+1"). > Nothing to do, just substitute t_data with proper HeapTupleHeader > representation. I think it's a job for StorageAm. Let's say each StorageAm > must have stam_to_heaptuple() function and opposite function > stam_from_heaptuple(). Hmm, yeah, that also works. We'd have to check again whether it's more convenient to start as a slot rather than a StorageTuple. AFAICS the trigger.c code is all starting from a slot, so it makes sense to have the conversion use the slot code -- that way, there's no need for each storageAM to re-implement conversion to HeapTuple. > >note f) More widespread, MinimalTuples currently use a tweaked HeapTuple > >format. In the long run, it may be possible to replace them with a > >separate storage
Re: [HACKERS] Pluggable storage
13.08.2016 02:15, Alvaro Herrera: Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. The StorageTuples concept looks really cool. I've got some questions on details of implementation. Do StorageTuples have fields common to all implementations? Or is StorageTuple a totally abstract structure that has nothing to do with data, except pointing to it? I mean, we already have the HeapTupleData structure, which is a pretty good candidate to replace with StorageTuple. It's already widely used in the executor and, moreover, it's the only structure (except MinimalTuples and all those crazy optimizations) that works with tuples, whether extracted from a page or created on-the-fly in an executor node. 
typedef struct HeapTupleData { uint32 t_len; /* length of *t_data */ ItemPointerData t_self; /* SelfItemPointer */ Oid t_tableOid; /* table the tuple came from */ HeapTupleHeader t_data; /* -> tuple header and data */ } HeapTupleData; We can simply change the type of t_data from HeapTupleHeader to Pointer. And maybe add a "t_handler" field that points to handler functions. I'm not sure whether it will be the name of the StorageAm, or its OID, or maybe the main function itself. Although, if I'm not mistaken, we always have RelationData when we want to operate on the tuple, so having t_handler in the StorageTuple is excessive. typedef struct StorageTupleData { uint32 t_len; /* length of *t_data */ ItemPointerData t_self; /* SelfItemPointer */ Oid t_tableOid; /* table the tuple came from */ Pointer t_data; /* -> tuple header and data * This field should never be accessed directly, * only via StorageAm handler functions, * because we don't know the underlying data structure. */ ??? t_handler; /* StorageAm that knows what to do with the tuple */ } StorageTupleData; This approach allows us to minimize code changes and ensure that we won't miss any function that handles tuples. Do you see any weak points of the suggestion? What design do you use in your prototype? RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. If we already have this pointer in t_handler as described below, we don't need to pass it between functions and slots. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. 
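[Editorial note: a minimal sketch of that rd_stamroutine indirection. Every name here is invented, not the actual patch's API: the executor calls tuple_insert through a per-relation function table instead of calling heap_insert directly, and the resulting TID comes back via an out-parameter rather than being scribbled into the tuple.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy TID and opaque tuple types for the sketch. */
typedef struct { unsigned blkno, offnum; } ItemPointerSketch;
typedef void *StorageTuple;     /* opaque to the executor */

typedef struct StorageAmRoutineSketch
{
    void (*tuple_insert)(void *rel, StorageTuple tuple,
                         ItemPointerSketch *otid);
} StorageAmRoutineSketch;

typedef struct RelationSketch
{
    StorageAmRoutineSketch *rd_stamroutine;
    unsigned ntuples;           /* toy "storage": just a counter */
} RelationSketch;

/* One concrete storage AM ("heapam"): assigns the next line pointer
 * on block 0 and reports it via *otid, leaving the tuple untouched. */
void
heapam_tuple_insert_sketch(void *relp, StorageTuple tuple,
                           ItemPointerSketch *otid)
{
    RelationSketch *rel = (RelationSketch *) relp;

    (void) tuple;               /* the toy storage keeps no real data */
    otid->blkno = 0;
    otid->offnum = ++rel->ntuples;
}

StorageAmRoutineSketch heapam_routine_sketch = { heapam_tuple_insert_sketch };

/* Executor-side call site: no direct knowledge of heapam. */
ItemPointerSketch
executor_insert_sketch(RelationSketch *rel, StorageTuple tuple)
{
    ItemPointerSketch tid;

    rel->rd_stamroutine->tuple_insert(rel, tuple, &tid);
    return tid;
}
```

A columnar or other storage AM would supply its own routine struct; the executor-side call site stays unchanged.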
The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: That is exactly what I tried to describe in my proposal. Chapter "Relation management". I'm sure you've already noticed that it will require a huge source-code cleanup. I've carefully read the sources and found "violators" of abstraction in src/backend/commands. The list is attached to the wiki page https://wiki.postgresql.org/wiki/HeapamRefactoring. Except these, there are some pretty strange and unrelated functions in src/backend/catalog. I'm willing to fix them, but I'd like to synchronize our efforts. exception a) ExecMaterializeSlot needs special consideration. This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled
Re: [HACKERS] Pluggable storage
On 2016-08-15 12:02:18 -0400, Robert Haas wrote: > Thanks for taking a stab at this. I'd like to throw out a few concerns. > > One, I'm worried that adding an additional layer of pointer-jumping is > going to slow things down and make Andres' work to speed up the > executor more difficult. I don't know that there is a problem there, > and if there is a problem I don't know what to do about it, but I > think it's something we need to consider. I'm quite concerned about that as well. > I am somewhat inclined to > believe that we need to restructure the executor in a bigger way so > that it passes around datums instead of tuples; I'm inclined to > believe that the current tuple-centric model is probably not optimal > even for the existing storage format. I actually prototyped that, and it's not an easy win so far. Column extraction cost, even after significant optimization, is still often a significant portion of the runtime. And e.g. projection only extracting all columns, after evaluating a restrictive qual referring to an "early" column, can be a significant win. We'd definitely have to give up on extracting columns 0..n when accessing later columns... Hm. Greetings, Andres Freund
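[Editorial note: the "give up on extracting columns 0..n" point above can be illustrated with a toy sketch of incremental column extraction. The structures are invented; the real slot_deform_tuple deals with variable-width and NULL fields and is far more involved. The idea: deform only up to the attribute actually requested and remember how far we got, so a restrictive qual on an early column never pays for later columns.]

```c
#include <assert.h>

/* A toy slot that caches how many leading columns are deformed. */
typedef struct
{
    const int  *raw;            /* stand-in for the on-disk tuple */
    int         natts;
    int         nvalid;         /* leading columns already deformed */
    int         values[8];      /* deformed copies */
    int         deform_calls;   /* instrumentation for the example */
} SlotSketch;

/* Fetch one attribute, deforming lazily and incrementally. */
int
slot_getattr_sketch(SlotSketch *slot, int attnum)   /* attnum is 0-based */
{
    while (slot->nvalid <= attnum)
    {
        /* in reality this walks variable-width fields; here it's a copy */
        slot->values[slot->nvalid] = slot->raw[slot->nvalid];
        slot->nvalid++;
        slot->deform_calls++;
    }
    return slot->values[attnum];
}
```

If a qual on attribute 1 fails, attributes 2..n are never extracted; re-fetching an already-deformed attribute costs nothing extra.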
Re: [HACKERS] Pluggable storage
13.08.2016 02:15, Alvaro Herrera: Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. Hello, thank you for starting this topic. I am working on a very similar proposal in another thread. https://commitfest.postgresql.org/10/700/ At first glance our drafts are very similar. I hope it means that we are moving in the right direction. It's great that your proposal is focused on interactions with the executor, while mine is mostly about the internals of the new StorageAm interface and its interactions with buffer and storage management. Please read and comment on https://wiki.postgresql.org/wiki/HeapamRefactoring I'll leave more concrete comments tomorrow. Thank you for the explicit description of issues. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. 
Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: exception a) ExecMaterializeSlot needs special consideration. This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled by ExecMaterializeSlot itself and calling a method from the storage AM to do any byte copying as needed. ExecMaterializeSlot no longer returns a HeapTuple, because there might not be any. The second usage pattern a2) is to create a HeapTuple that's passed to other modules which only deal with HT and not slots (triggers are the main case I noticed, but I think there are others such as the executor itself wanting tuples as Datum for some reason). For the moment I'm handling this by having a new ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of the original tuple format. note b) EvalPlanQual currently maintains an array of HeapTuple in EState->es_epqTuple. I think it works to replace that with an array of StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine methods in order to interact with it. Other than those changes, it seems okay. note c) nodeSubplan has curTuple as a HeapTuple. It seems simple to replace this with an independent slot-based tuple. note d) grp_firstTuple in nodeAgg / nodeSetOp. 
These are less simple than the above, but replacing the HeapTuple with a slot-based tuple seems doable too. note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual. TupleTableSlot also seems a good replacement. This has fallout in other users of EvalPlanQual, too. note f) More widespread, MinimalTuples currently use a tweaked HeapTuple format. In the long run, it may be possible to replace them with a separate storage module that's specifically designed to handle tuples meant for tuplestores etc. That may simplify TupleTableSlot and execTuples. For the moment we keep the tts_mintuple as it is. Whenever a tuple is not already in heap format, we heapify it in order to put in the store. The current heapam.c routines need some changes. Currently, practice is that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on their input tuples to set the resulting ItemPointer in tuple->t_self. This is messy if we want StorageTuples to be abstract. I'm changing this so that the resulting ItemPointer is returned in a separate output argument; the tuple itself is left alone. This is somewhat messy in the case of
Re: [HACKERS] Pluggable storage
On Fri, Aug 12, 2016 at 7:15 PM, Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. This will allow us to write permanent table storage that works > differently than heapam.c. At this stage, I haven't thought through > whether this is going to allow extensions to define new storage modules; > I am focusing on AMs that can coexist with heapam in core. Thanks for taking a stab at this. I'd like to throw out a few concerns. One, I'm worried that adding an additional layer of pointer-jumping is going to slow things down and make Andres' work to speed up the executor more difficult. I don't know that there is a problem there, and if there is a problem I don't know what to do about it, but I think it's something we need to consider. I am somewhat inclined to believe that we need to restructure the executor in a bigger way so that it passes around datums instead of tuples; I'm inclined to believe that the current tuple-centric model is probably not optimal even for the existing storage format. It seems even less likely to be right for a data format in which fetching columns is more expensive than currently, such as a columnar store. Two, I think that we really need to think very hard about how the query planner will interact with new heap storage formats. For example, suppose cstore_fdw were rewritten as a new heap storage format. Because ORC contains internal indexing structures with characteristics somewhat similar to BRIN, many scans can be executed much more efficiently than for our current heap storage format. If it can be seen that an entire chunk will fail to match the quals, we can skip the whole chunk. 
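[Editorial note: a toy sketch of that chunk-level pruning, in the spirit of ORC/BRIN min/max summaries. All structures are invented for illustration: for a qual "col > k", a chunk whose max <= k is skipped without reading any tuple, and a chunk whose min > k can be counted wholesale.]

```c
#include <assert.h>

/* Per-chunk metadata: min/max summary plus the tuples themselves. */
typedef struct
{
    int         minval;
    int         maxval;
    int         ntuples;
    const int  *tuples;
} ChunkSketch;

/* Count tuples with value > k, consulting chunk metadata first;
 * *chunks_scanned reports how many chunks needed a tuple-by-tuple
 * visit. */
int
count_greater_sketch(const ChunkSketch *chunks, int nchunks, int k,
                     int *chunks_scanned)
{
    int     count = 0;

    *chunks_scanned = 0;
    for (int c = 0; c < nchunks; c++)
    {
        if (chunks[c].maxval <= k)
            continue;                       /* no tuple can match */
        if (chunks[c].minval > k)
        {
            count += chunks[c].ntuples;     /* every tuple matches */
            continue;
        }
        (*chunks_scanned)++;                /* must look at the tuples */
        for (int i = 0; i < chunks[c].ntuples; i++)
            if (chunks[c].tuples[i] > k)
                count++;
    }
    return count;
}
```

Only chunks whose min/max range straddles the constant are actually scanned; the rest are resolved from metadata alone, which is what makes a pushed-down count potentially very fast.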
Some operations may permit additional optimizations: for example, given SELECT count(*) FROM thing WHERE quals, we may be able to push the COUNT(*) down into the heap access layer. If it can be verified that EVERY tuple in a chunk will match the quals, we can just increment the count by that number without visiting each tuple individually. This could be really fast. These kinds of query planner issues are generally why I have favored trying to do something like this through the FDW interface, which already has the right APIs for this kind of thing, even if we're not using them all yet. I don't say that's the only way to crack this problem, but I think we're going to find that a heap storage API that doesn't include adequate query planner integration is not a very exciting thing. Three, with respect to this limitation: > iii) All tuples need to be identifiable by ItemPointers. Storages that > have different requirements will need careful additional thought across > the board. I think it's a good idea for a first patch in this area to ignore (or mostly ignore) this problem - e.g. maybe allow such storage formats but refuse to create indexes on them. But eventually I think we're going to want/need to do something about it. There are an awful lot of interesting ideas that we can't pursue without addressing this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Pluggable storage
On Sat, Aug 13, 2016 at 2:15 AM, Alvaro Herrera wrote: > Many have expressed their interest in this topic, but I haven't seen any > design of how it should work. That's true. My design presented at PGCon was very sketchy, and I haven't delivered any prototype yet. > Here's my attempt; I've been playing with > this for some time now and I think what I propose here is a good initial > plan. Great! It's nice to see you're working in this direction! > This will allow us to write permanent table storage that works > differently than heapam.c. At this stage, I haven't thought through > whether this is going to allow extensions to define new storage modules; > I am focusing on AMs that can coexist with heapam in core. > So, as I get it, you're proposing extensibility to replace heap with something different but compatible with heap (with executor nodes, index access methods and so on). That's good, but what could the alternative storage access methods be? AFAICS, we can't fit here, for instance, another MVCC implementation (undo log), in-memory storage, or columnar storage. However, it seems that we would be able to make a compressed heap or an alternative HOT/WARM tuple mechanism. Correct me if I'm wrong. ISTM you're proposing something quite orthogonal to my view of pluggable storage engines. My idea, in general, was to extend the FDW mechanism to let it be something manageable inside the database (support for VACUUM, defining indexes and so on). But the imperative was that it should have its own executor nodes, and it doesn't have to be compatible with existing index access methods. Therefore, I think my design presented at PGCon and your current proposal are about orthogonal features which could coexist. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Pluggable storage
Many have expressed their interest in this topic, but I haven't seen any design of how it should work. Here's my attempt; I've been playing with this for some time now and I think what I propose here is a good initial plan. This will allow us to write permanent table storage that works differently than heapam.c. At this stage, I haven't thought through whether this is going to allow extensions to define new storage modules; I am focusing on AMs that can coexist with heapam in core. The design starts with a new row type in pg_am, of type "s" (for "storage"). The handler function returns a struct of node StorageAmRoutine. This contains functions for 1) scans (beginscan, getnext, endscan) 2) tuples (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the like), and operations on tuples that are part of slots (tuple_deform, materialize). To support this, we introduce StorageTuple and StorageScanDesc. StorageTuples represent a physical tuple coming from some storage AM. It is necessary to have a pointer to a StorageAmRoutine in order to manipulate the tuple. For heapam.c, a StorageTuple is just a HeapTuple. RelationData gains ->rd_stamroutine which is a pointer to the StorageAmRoutine for the relation in question. Similarly, TupleTableSlot is augmented with a link to the StorageAmRoutine to handle the StorageTuple it contains (probably in most cases it's set at the same time as the tupdesc). This implies that routines such as ExecAssignScanType need to pass down the StorageAmRoutine from the relation to the slot. The executor is modified so that instead of calling heap_insert etc directly, it uses rel->rd_stamroutine to call these methods. The executor is still in charge of dealing with indexes, constraints, and any other thing that's not the tuple storage itself (this is one major point in which this differs from FDWs). This all looks simple enough, with one exception and a few notes: exception a) ExecMaterializeSlot needs special consideration. 
This is used in two different ways: a1) is the stated "make tuple independent from any underlying storage" point, which is handled by ExecMaterializeSlot itself and calling a method from the storage AM to do any byte copying as needed. ExecMaterializeSlot no longer returns a HeapTuple, because there might not be any. The second usage pattern a2) is to create a HeapTuple that's passed to other modules which only deal with HT and not slots (triggers are the main case I noticed, but I think there are others such as the executor itself wanting tuples as Datum for some reason). For the moment I'm handling this by having a new ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of the original tuple format. note b) EvalPlanQual currently maintains an array of HeapTuple in EState->es_epqTuple. I think it works to replace that with an array of StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine methods in order to interact with it. Other than those changes, it seems okay. note c) nodeSubplan has curTuple as a HeapTuple. It seems simple to replace this with an independent slot-based tuple. note d) grp_firstTuple in nodeAgg / nodeSetOp. These are less simple than the above, but replacing the HeapTuple with a slot-based tuple seems doable too. note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual. TupleTableSlot also seems a good replacement. This has fallout in other users of EvalPlanQual, too. note f) More widespread, MinimalTuples currently use a tweaked HeapTuple format. In the long run, it may be possible to replace them with a separate storage module that's specifically designed to handle tuples meant for tuplestores etc. That may simplify TupleTableSlot and execTuples. For the moment we keep the tts_mintuple as it is. Whenever a tuple is not already in heap format, we heapify it in order to put in the store. The current heapam.c routines need some changes. 
Currently, practice is that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on their input tuples to set the resulting ItemPointer in tuple->t_self. This is messy if we want StorageTuples to be abstract. I'm changing this so that the resulting ItemPointer is returned in a separate output argument; the tuple itself is left alone. This is somewhat messy in the case of heap_multi_insert because it returns several items; I think it's acceptable to return an array of ItemPointers in the same order as the input tuples. This works fine for the only caller, which is COPY in batch mode. For the other routines, they don't really care where the TID is returned AFAICS. Additional noteworthy items: i) Speculative insertion: the speculative insertion token is no longer installed directly in the heap tuple by the executor (of course). Instead, the token becomes part of the slot. When the tuple_insert method is called, the insertion routine is in charge of setting the token from the slot into the storage tuple. Executor is