Re: [HACKERS] UNDO and in-place update

2017-01-13 Thread Robert Haas
On Fri, Jan 13, 2017 at 5:57 AM, Amit Kapila  wrote:
> Sure, we can do that way and I agree that it is worth considering.
> A few cases where it can be really costly are when the undo chain
> overflows to multiple pages and those pages don't exist in cache.  I
> think the reason this is probable is that the undo chain is not
> maintained per row update; rather, it is maintained for page-level
> changes by each transaction.  So what this means is that if we have
> to traverse the chain for record X, it is quite possible that during
> chain traversal we will encounter undo of record Y; of course we can
> identify and ignore it, but the chain can certainly be much longer.
> Another kind of cost could be construction of the actual record from
> the undo record: for example, if we undo-log only the changed parts
> of an Update, as we do for WAL, then we have to incur some
> record-construction cost as well.  I think what can make it really
> costly is repeating this work.  Even if we consider the above a
> worst case, I am not sure it will be cheap for short transactions
> like pgbench's, as there could be cases where concurrent
> updates/reads need to traverse undo chains.
>
> Having said that, if you don't feel strongly about caching the
> information in such a way that it can be reused by different
> sessions, then we can initially implement the system as outlined by
> you and extend it later based on the performance characteristics.

Sounds fine.  Also, one idea I've been considering is the idea of
allowing each shared buffer to have some associated "scratch space"
which would act as a cache that lasts as long as that page doesn't get
evicted.  I think that might have some applications for hash indexes
as well.  Of course deciding how much scratch space to allocate and
whether you wouldn't be better off using that memory some other way is
tricky, but it's a thought.
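To make the idea concrete, here is a minimal sketch of what per-buffer scratch space could look like, assuming each slot is tagged with the page it caches and invalidated on eviction.  The structure and function names are illustrative assumptions, not an actual PostgreSQL API:

```c
#include <stddef.h>
#include <string.h>

#define SCRATCH_BYTES 256

typedef struct BufferScratch
{
    unsigned int buf_tag_hash;      /* which page this scratch belongs to */
    int          valid;             /* cleared when the buffer is recycled */
    char         data[SCRATCH_BYTES];
} BufferScratch;

/* Invalidate scratch when its buffer is evicted. */
static void
scratch_on_evict(BufferScratch *s)
{
    s->valid = 0;
}

/* Return cached scratch if it still matches this page; else reset it. */
static char *
scratch_get(BufferScratch *s, unsigned int tag_hash)
{
    if (!s->valid || s->buf_tag_hash != tag_hash)
    {
        s->buf_tag_hash = tag_hash;
        s->valid = 1;
        memset(s->data, 0, SCRATCH_BYTES);
    }
    return s->data;
}
```

The cache lives exactly as long as the buffer holds the same page, which matches the "lasts as long as that page doesn't get evicted" behavior described above.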

>> I don't see a reason to do that.  I think one of the advantages of
>> storing the UNDO pages in the same pool as shared_buffers is that the
>> amount of UNDO that we store in there can float up and down according
>> to demand.  If the heap pages are being accessed more frequently than
>> the UNDO pages then let the UNDO pages get evicted.  If the reverse is
>> the case, that's OK too.
>
> Okay, not a problem.  I was thinking of prioritising UNDO buffers so
> that they can only be evicted during checkpoints.  However, I think
> there might not be much value in doing so.

Seems like we can leave the decision to later.  If that turns out to
be the thing that is hurting performance, we can fix it then.

>>> Vacuum can delete or make the undo file reusable when the
>>> corresponding transaction precedes RecentGlobalXmin.
>>
>> If the associated transaction commits, yes.  If it aborts, then we can
>> recycle it once undo is complete.  But I think this should be the job
>> of a dedicated process which also executes undo actions; I don't think
>> it should be associated with vacuum.
>
> Agreed that it will be good to keep a separate undo process.  However,
> I am not exactly clear about your point of executing undo actions in
> the background.  Do you mean to say that the main backend will mark
> the transaction as aborted in clog, but the undo actions will be
> performed by the background process, or do you have something else in
> mind?

I think it's not good to have the main backend perform the UNDO
actions because they might ERROR.  That's not something we can support
when we're already in the abort path.  If it happens in the background
process, it can catch the ERROR and return to the toplevel and retry
after a pause or do whatever needs to be done.  For example, imagine
an I/O error.
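The retry loop such a dedicated worker might run can be sketched as follows; the callback and retry policy are illustrative assumptions, not PostgreSQL APIs (a real worker would use PG_TRY/PG_CATCH and pg_usleep):

```c
#include <stdbool.h>

typedef bool (*UndoActionFn)(void *state);

/* Demo callback standing in for "perform undo actions": it fails twice
 * (e.g. an I/O error) and then succeeds. */
static int undo_attempts = 0;
static bool
flaky_undo(void *state)
{
    (void) state;
    return ++undo_attempts >= 3;
}

/* Try to execute a transaction's undo actions, retrying after failures.
 * Returns the attempt number that succeeded, or -1 after max_retries
 * failures (the transaction then stays undo-pending). */
static int
run_undo_with_retry(UndoActionFn fn, void *state, int max_retries)
{
    for (int attempt = 1; attempt <= max_retries; attempt++)
    {
        if (fn(state))          /* an error is caught inside fn */
            return attempt;
        /* a real worker would pause here before retrying */
    }
    return -1;
}
```

The key point is that the failure is contained in the worker's loop rather than in the backend's abort path, where an ERROR cannot be tolerated.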

> Okay, it seems we can deduce it from the transaction status.  If the
> transaction is aborted, then we know the undo log is invalid.  If it
> is in-progress, then there will be a valid undo log.  If it is
> committed and all-visible (precedes RecentGlobalXmin), then the undo
> log will be invalid.

For MVCC purposes, that might work, but for other purposes, maybe not.
We actually need to replay the undo log on abort before removing it.

> By create, it seems you are thinking of having undo files as
> something in-memory (space reserved in memory), with the file created
> on disk only if required?

Yes.

> If so, then the point we discussed above about giving undo pages
> equal priority with heap pages in shared_buffers is relevant, because
> a backend evicting an undo page could be costly compared to evicting
> a heap page.

It's possible, but I don't think that's a key part of the design so
I'd say skip it for now and we'll deal with it if it becomes an issue.

> Won't it be efficient if we can do the same mapping between undo
> byte-offset and undo file as we do for WAL (write offset to file)?
> We can keep the file size bigger than what we have for each WAL
> segment.  Do we need something more?

Well you need something more if you want to combine multiple undo logs
in a single 
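A WAL-style mapping from a global undo byte position to a (segment file, offset) pair, as Amit suggests above, could be as simple as the following sketch.  The 1 GB segment size is an assumption for illustration, not a settled choice:

```c
#include <stdint.h>

/* Assumed size of one undo segment file (analogous to a WAL segment). */
#define UNDO_SEG_SIZE ((uint64_t) 1024 * 1024 * 1024)

/* Which segment file a global undo byte position falls in. */
static uint64_t
undo_seg_no(uint64_t pos)
{
    return pos / UNDO_SEG_SIZE;
}

/* Byte offset within that segment file. */
static uint64_t
undo_seg_off(uint64_t pos)
{
    return pos % UNDO_SEG_SIZE;
}
```

Segment files could then be named from the segment number, just as pg_wal files are named from the WAL position.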

Re: [HACKERS] UNDO and in-place update

2017-01-13 Thread Amit Kapila
On Thu, Jan 12, 2017 at 8:56 PM, Robert Haas  wrote:
> On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila  wrote:
>> Moving further, I have thought about consistent reads and different
>> formats for storing undo with pros and cons of each format.
>>
>> Consistent Reads
>> 
>> If the page is modified by any transaction that is not visible to
>> the read transaction's snapshot, we have to reconstruct a copy of
>> the page.  Now there could be multiple ways to achieve it:
>>
>> 1. Apply the undo for the transactions that are not visible to the
>> read transaction and keep the copy of the page.
>> 2. Apply the undo for all the transactions in the TPD entry of a page.
>>
>> I think we can't do the second because it can create rows which are
>> not consistent, due to applying undo log for transactions which are
>> visible to the read transaction.  The first approach will work, but
>> it might turn out to be inefficient if not implemented carefully, as
>> for each read transaction we need to create separate copies of the
>> page where a different set of transactions is not visible to
>> different read transactions.  Even for read transactions to which
>> the same set of transactions is not visible, we need some way to
>> recognize the version of the page that can be used: probably try all
>> the existing versions of the page and see if any of them can serve
>> the purpose.  I think we also need some way to link the different
>> versions of the page in shared buffers so that before trying to
>> create a new version of the page we can look at the linked versions.
>>
>> Yet another consideration in this area could be to perform
>> consistent reads differently for index scans.  For an index scan, we
>> will receive a particular TID to read, so in this case we can try to
>> reconstruct the old image of that particular row.  Now, while this
>> sounds efficient, we might need to reconstruct the old image of rows
>> multiple times for different read transactions.  The effect will be
>> magnified when we have to perform this for multiple rows of a page.
>> I think it might be better to always construct a complete copy of
>> the page if one or more transactions that have changed the page are
>> not visible to the read snapshot.
>
> I was thinking about this whole area somewhat differently.  We don't
> really need a complete image of the page.  We just need to find the
> correct version of each tuple that is visible.  So I would suggest
> that we consider avoiding page reconstruction altogether.  A
> sequential scan iterates through each TID on the page and says "is
> this tuple visible?".  Right now the answer is either "yes" or "no",
> but maybe the answer could be "yes", "no", or "look at this other
> version in the UNDO segment instead".  The advantage of this approach
> is that we don't end up copying tuples into a new "reconstructed"
> page.  That copying will take CPU time and the resulting page will
> consume cache space.
>
> Now, the problem with this is that we want to avoid walking through
> the whole UNDO chain for the page multiple times.  But maybe there are
> ways to avoid that -- for starters, if the recently-modified bit on
> the page isn't set, we don't need to walk through the UNDO chain at
> all.  For a sequential scan, we could maintain a backend-local cache
> of information that gets updated as we walk the chain.  Perhaps
> there's no way to make these ideas work and we're going to end up
> doing some form of page reconstruction, but I think it's worth some
> thought anyway.
>

Sure, we can do that way and I agree that it is worth considering.
A few cases where it can be really costly are when the undo chain
overflows to multiple pages and those pages don't exist in cache.  I
think the reason this is probable is that the undo chain is not
maintained per row update; rather, it is maintained for page-level
changes by each transaction.  So what this means is that if we have to
traverse the chain for record X, it is quite possible that during
chain traversal we will encounter undo of record Y; of course we can
identify and ignore it, but the chain can certainly be much longer.
Another kind of cost could be construction of the actual record from
the undo record: for example, if we undo-log only the changed parts of
an Update, as we do for WAL, then we have to incur some
record-construction cost as well.  I think what can make it really
costly is repeating this work.  Even if we consider the above a worst
case, I am not sure it will be cheap for short transactions like
pgbench's, as there could be cases where concurrent updates/reads need
to traverse undo chains.

Having said that, if you don't feel strongly about caching the
information in such a way that it can be reused by different sessions,
then we can initially implement the system as outlined by you and
extend it later based on the performance characteristics.


>
> I 

Re: [HACKERS] UNDO and in-place update

2017-01-12 Thread Robert Haas
On Wed, Jan 11, 2017 at 5:18 AM, Amit Kapila  wrote:
> Moving further, I have thought about consistent reads and different
> formats for storing undo with pros and cons of each format.
>
> Consistent Reads
> 
> If the page is modified by any transaction that is not visible to
> the read transaction's snapshot, we have to reconstruct a copy of
> the page.  Now there could be multiple ways to achieve it:
>
> 1. Apply the undo for the transactions that are not visible to the
> read transaction and keep the copy of the page.
> 2. Apply the undo for all the transactions in the TPD entry of a page.
>
> I think we can't do the second because it can create rows which are
> not consistent, due to applying undo log for transactions which are
> visible to the read transaction.  The first approach will work, but
> it might turn out to be inefficient if not implemented carefully, as
> for each read transaction we need to create separate copies of the
> page where a different set of transactions is not visible to
> different read transactions.  Even for read transactions to which
> the same set of transactions is not visible, we need some way to
> recognize the version of the page that can be used: probably try all
> the existing versions of the page and see if any of them can serve
> the purpose.  I think we also need some way to link the different
> versions of the page in shared buffers so that before trying to
> create a new version of the page we can look at the linked versions.
>
> Yet another consideration in this area could be to perform
> consistent reads differently for index scans.  For an index scan, we
> will receive a particular TID to read, so in this case we can try to
> reconstruct the old image of that particular row.  Now, while this
> sounds efficient, we might need to reconstruct the old image of rows
> multiple times for different read transactions.  The effect will be
> magnified when we have to perform this for multiple rows of a page.
> I think it might be better to always construct a complete copy of
> the page if one or more transactions that have changed the page are
> not visible to the read snapshot.

I was thinking about this whole area somewhat differently.  We don't
really need a complete image of the page.  We just need to find the
correct version of each tuple that is visible.  So I would suggest
that we consider avoiding page reconstruction altogether.  A
sequential scan iterates through each TID on the page and says "is
this tuple visible?".  Right now the answer is either "yes" or "no",
but maybe the answer could be "yes", "no", or "look at this other
version in the UNDO segment instead".  The advantage of this approach
is that we don't end up copying tuples into a new "reconstructed"
page.  That copying will take CPU time and the resulting page will
consume cache space.

Now, the problem with this is that we want to avoid walking through
the whole UNDO chain for the page multiple times.  But maybe there are
ways to avoid that -- for starters, if the recently-modified bit on
the page isn't set, we don't need to walk through the UNDO chain at
all.  For a sequential scan, we could maintain a backend-local cache
of information that gets updated as we walk the chain.  Perhaps
there's no way to make these ideas work and we're going to end up
doing some form of page reconstruction, but I think it's worth some
thought anyway.
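The three-way visibility answer described above can be sketched as a small decision function.  The enum and the boolean inputs are illustrative assumptions standing in for the page's recently-modified bit and a snapshot check; they are not actual PostgreSQL identifiers (and the "no version visible" case is not modeled here):

```c
#include <stdbool.h>

typedef enum TupleVisibility
{
    TUPLE_VISIBLE,      /* current version is visible to the snapshot */
    TUPLE_INVISIBLE,    /* no version is visible (not modeled below) */
    TUPLE_LOOK_IN_UNDO  /* follow the undo chain for an older version */
} TupleVisibility;

static TupleVisibility
check_visibility(bool recently_modified, bool xact_visible_to_snapshot)
{
    /* Fast path: if the recently-modified bit isn't set, no undo walk
     * is needed at all. */
    if (!recently_modified)
        return TUPLE_VISIBLE;
    if (xact_visible_to_snapshot)
        return TUPLE_VISIBLE;
    return TUPLE_LOOK_IN_UNDO;
}
```

The point of the "look in undo" answer is that the scan consults the UNDO segment directly instead of copying tuples into a reconstructed page.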

> Undo Page Format
> ---
> I think we can organize undo in multiple ways
>
> (a) One undo file per transaction - We can name each undo file as
> epoch+xid or epoch+xid.undo. These files can be kept in pg_undo
> directory under PGDATA.  In this approach, we have to store
> latest_undo offset in file header to perform the rollback operation.
> Also, this offset can be used to allocate the location for new undo
> record.  Apart from that undo entries can be organized one after
> another.  Each undo entry will have two back pointers, one to reach
> the previous undo location which will be used for rollback and the
> second one to maintain the prior undo chain for changes to a
> particular page which will be used to reconstruct the copy of page.

Sounds good so far.
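As a rough illustration, an undo entry carrying the two back pointers described above might be laid out like this.  The field names and widths are assumptions for the sketch, not a settled on-disk format:

```c
#include <stdint.h>

typedef struct UndoRecordHeader
{
    uint64_t xact_prev;     /* previous undo location of the same
                             * transaction, followed during rollback */
    uint64_t page_prev;     /* prior undo for the same heap page, followed
                             * when reconstructing an old page version */
    uint16_t rec_len;       /* total length of this undo record */
    uint8_t  rec_type;      /* e.g. insert / delete / in-place update */
    /* changed-column payload follows */
} UndoRecordHeader;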

> During recovery, we need to read each undo file and check the
> transaction status in clog; if it is not committed, then apply all
> the undo starting from the latest_undo position.

I mostly agree, but let's make this a little more precise.  During
normal running, when a transaction aborts, we will perform all of the
undo actions associated with that transaction, and we will generate
WAL for those changes.  For example, if an insert adds a tuple to a
page, and the containing transaction aborts, then there will be an
undo entry to remove the tuple.  But removing that tuple has to be
WAL-logged, just as adding it was.  When all actions for the
transaction have been undone, we will remove the undo log for that
transaction and the 

Re: [HACKERS] UNDO and in-place update

2017-01-11 Thread Amit Kapila
On Tue, Jan 10, 2017 at 7:28 PM, Amit Kapila  wrote:
> On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas  wrote:
>>
>> Yes, something like this can be done.  You don't really need any new
>> page-level header data, because you can get the XIDs from the TPD
>> entry (or from the page itself if there's only one).  But you could
>> expand the single "is-modified" bit that I've proposed adding to each
>> tuple to multiple bits.  0 means not recently modified.  1 means
>> modified by the first or only transaction that has recently modified
>> the page.  2 means modified by the second transaction that has
>> recently modified the page.  Etc.
>>
>
> makes sense.
>
>> What I was thinking about doing instead is storing an array in the TPD
>> containing the same information.  There would be one byte or one half
>> a byte or whatever per TID and it would contain the index of the XID
>> in the TPD that had most recently modified or locked that TID.  Your
>> solution might be better, though, at least for cases where the number
>> of tuples that have modified the page is small.
>>
>
> I think we also need to prevent multiple backends trying to reserve a
> slot in this array which can be a point of contention.  Another point
> is during pruning, if due to row movement TIDs are changed, we need to
> keep this array in sync.
>

Moving further, I have thought about consistent reads and different
formats for storing undo with pros and cons of each format.

Consistent Reads

If the page is modified by any transaction that is not visible to the
read transaction's snapshot, we have to reconstruct a copy of the
page.  Now there could be multiple ways to achieve it:

1. Apply the undo for the transactions that are not visible to the
read transaction and keep the copy of the page.
2. Apply the undo for all the transactions in the TPD entry of a page.

I think we can't do the second because it can create rows which are
not consistent, due to applying undo log for transactions which are
visible to the read transaction.  The first approach will work, but it
might turn out to be inefficient if not implemented carefully, as for
each read transaction we need to create separate copies of the page
where a different set of transactions is not visible to different read
transactions.  Even for read transactions to which the same set of
transactions is not visible, we need some way to recognize the version
of the page that can be used: probably try all the existing versions
of the page and see if any of them can serve the purpose.  I think we
also need some way to link the different versions of the page in
shared buffers so that before trying to create a new version of the
page we can look at the linked versions.

Yet another consideration in this area could be to perform consistent
reads differently for index scans.  For an index scan, we will receive
a particular TID to read, so in this case we can try to reconstruct
the old image of that particular row.  Now, while this sounds
efficient, we might need to reconstruct the old image of rows multiple
times for different read transactions.  The effect will be magnified
when we have to perform this for multiple rows of a page.  I think it
might be better to always construct a complete copy of the page if one
or more transactions that have changed the page are not visible to the
read snapshot.

Undo Page Format
---
I think we can organize undo in multiple ways

(a) One undo file per transaction - We can name each undo file as
epoch+xid or epoch+xid.undo. These files can be kept in pg_undo
directory under PGDATA.  In this approach, we have to store
latest_undo offset in file header to perform the rollback operation.
Also, this offset can be used to allocate the location for new undo
record.  Apart from that undo entries can be organized one after
another.  Each undo entry will have two back pointers, one to reach
the previous undo location which will be used for rollback and the
second one to maintain the prior undo chain for changes to a
particular page which will be used to reconstruct the copy of page.
During recovery, we need to read each undo file and check the
transaction status in clog; if it is not committed, then apply all
the undo starting from the latest_undo position.  I think we should
organize undo in the form of pages with a size similar to heap pages.
This will be helpful both for writing WAL and doing checkpoints for
undo where we expect pages.  The access to UNDO can be via shared
buffers. I think we might want to keep the access to shared buffers
containing undo separate from heap/index pages so that they can't be
evicted based on access of heap pages.  Different TPD entries
containing the same XID will point to different locations in undo.
Undo location will be just a page offset for a particular page entry.
Vacuum can delete or make the undo file reusable when the
corresponding 

Re: [HACKERS] UNDO and in-place update

2017-01-10 Thread Amit Kapila
On Mon, Jan 9, 2017 at 11:47 PM, Robert Haas  wrote:
> On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila  wrote:
>> One idea could be that we have some fixed number of
>> slots (I think we can make it variable as well, but for simplicity,
>> let's consider it as fixed) in the page header which will store the
>> offset to the transaction id inside a TPD entry of the page.  Consider
>> a TPD entry of page contains four transactions, so we will just store
>> enough information in heap page header to reach the transaction id for
>> these four transactions. I think each such page header slot could be
>> three or four bits long depending upon how many concurrent
>> transactions we want to support on a page after which a new
>> transaction has to wait (I think in most workloads supporting
>> simultaneous eight transactions on a page should be sufficient).
>> Then we can have an additional byte (or less than byte) in the tuple
>> header to store lock info which is nothing but an offset to the slot
>> in the page header.   We might find some other locking technique as
>> well, but I think keeping it same as current has benefit.
>
> Yes, something like this can be done.  You don't really need any new
> page-level header data, because you can get the XIDs from the TPD
> entry (or from the page itself if there's only one).  But you could
> expand the single "is-modified" bit that I've proposed adding to each
> tuple to multiple bits.  0 means not recently modified.  1 means
> modified by the first or only transaction that has recently modified
> the page.  2 means modified by the second transaction that has
> recently modified the page.  Etc.
>

makes sense.

> What I was thinking about doing instead is storing an array in the TPD
> containing the same information.  There would be one byte or one half
> a byte or whatever per TID and it would contain the index of the XID
> in the TPD that had most recently modified or locked that TID.  Your
> solution might be better, though, at least for cases where the number
> of tuples that have modified the page is small.
>

I think we also need to prevent multiple backends trying to reserve a
slot in this array which can be a point of contention.  Another point
is during pruning, if due to row movement TIDs are changed, we need to
keep this array in sync.

>  However, I'm not
> totally sure.  I think it's important to keep the tuple headers VERY
> small, like 3 bytes.  Or 2 bytes.  Or maybe even variable size but
> only 1 byte in common cases.  So I expect bit space in those places to
> be fairly scarce and precious.
>

I agree that we should carefully choose the format so as to keep a
trade-off between performance and space savings.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNDO and in-place update

2017-01-09 Thread Robert Haas
On Mon, Jan 9, 2017 at 7:50 AM, Amit Kapila  wrote:
> Okay, I see the point.  I think here UNDO pointer can be marked
> invalid either during page-pruning or vacuum.

I explicitly want to avoid that, because it means re-dirtying the
page.  The UNDO pointer becomes invalid by removing the UNDO log to
which it points; the page itself does not need to be dirtied for the
UNDO pointer to become invalid.

> Another point which I think is not discussed till now is, how the
> locking will work.  As of now, we first lock the tuple and then xid on
> which we have to wait. In the new design, we don't know the xid which
> has updated the tuple and I am not sure there is any easy way to find
> that information.

Yes, there is: you can get it from the UNDO pointer.  Each UNDO
pointer is of the form . If the page has a regular
UNDO pointer, then you can get the XID for any modified tuple right
out of the UNDO pointer itself.  If the page has a TPD pointer, the
TPD contains a series of UNDO pointers and you can get all of the XIDs
that have touched the page by iterating through those UNDO pointers.
There is some work to do to make it cheap to find which XID pertains
to which updated tuple, but I think that problem can be solved by
choosing carefully what to store in the TPD.  I think, in fact, that
for performance it's absolutely essential to solve that problem well.
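As a purely hypothetical sketch of such a pointer, assuming an epoch/XID/byte-offset style encoding (the exact form in the message above did not survive the archive, so the widths here are illustrative only):

```c
#include <stdint.h>

/* Hypothetical UNDO pointer layout: the XID that last modified the page
 * is recoverable directly from the pointer, which is what makes lock
 * waits possible without a per-tuple XID. */
typedef struct UndoPointer
{
    uint32_t epoch;     /* XID epoch, so pointers survive wraparound */
    uint32_t xid;       /* transaction that wrote the undo */
    uint64_t offset;    /* byte offset within that transaction's undo log */
} UndoPointer;

/* Locking can find the XID to wait on straight from the pointer. */
static uint32_t
undo_ptr_xid(UndoPointer p)
{
    return p.xid;
}
```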

> One idea could be that we have some fixed number of
> slots (I think we can make it variable as well, but for simplicity,
> let's consider it as fixed) in the page header which will store the
> offset to the transaction id inside a TPD entry of the page.  Consider
> a TPD entry of page contains four transactions, so we will just store
> enough information in heap page header to reach the transaction id for
> these four transactions. I think each such page header slot could be
> three or four bits long depending upon how many concurrent
> transactions we want to support on a page after which a new
> transaction has to wait (I think in most workloads supporting
> simultaneous eight transactions on a page should be sufficient).
> Then we can have an additional byte (or less than byte) in the tuple
> header to store lock info which is nothing but an offset to the slot
> in the page header.   We might find some other locking technique as
> well, but I think keeping it same as current has benefit.

Yes, something like this can be done.  You don't really need any new
page-level header data, because you can get the XIDs from the TPD
entry (or from the page itself if there's only one).  But you could
expand the single "is-modified" bit that I've proposed adding to each
tuple to multiple bits.  0 means not recently modified.  1 means
modified by the first or only transaction that has recently modified
the page.  2 means modified by the second transaction that has
recently modified the page.  Etc.

What I was thinking about doing instead is storing an array in the TPD
containing the same information.  There would be one byte or one half
a byte or whatever per TID and it would contain the index of the XID
in the TPD that had most recently modified or locked that TID.  Your
solution might be better, though, at least for cases where the number
of tuples that have modified the page is small.  However, I'm not
totally sure.  I think it's important to keep the tuple headers VERY
small, like 3 bytes.  Or 2 bytes.  Or maybe even variable size but
only 1 byte in common cases.  So I expect bit space in those places to
be fairly scarce and precious.

Now that might be the wrong idea -- maybe it's worth expanding that
header in order to speed things up.  On the other hand, having to read
more database pages in order to process the same amount of data is
*really* expensive, especially when you start talking about data sets
that probably don't fit in memory, like a 10 TB or 100 TB table.  If
you've got 100 tuples per page (~81 bytes per tuple), increasing the
size of a tuple by 1 byte causes that 100 TB table to increase in size
by about 1.2 TB (modulo padding effects).  An extra byte of space (or
even an extra ten bytes) doesn't matter much for a table with a
million tuples in it because the whole table is going to fit in memory
either way, but when you have 10+ billion tuples those extra bytes
start to matter a LOT.  And the problem is not only or even primarily
the space consumption - the issue is that all of your queries run that
much slower because of the extra I/O.
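The arithmetic behind that 1.2 TB figure is straightforward; here is a small helper that reproduces it (a sketch only, ignoring the padding effects mentioned above):

```c
#include <stdint.h>

/* Extra on-disk bytes implied by adding 'extra' header bytes per tuple
 * to a table of 'table_bytes' whose tuples average 'tuple_bytes'. */
static uint64_t
extra_space(uint64_t table_bytes, uint64_t tuple_bytes, uint64_t extra)
{
    uint64_t ntuples = table_bytes / tuple_bytes;

    return ntuples * extra;
}
```

For a 100 TiB table of ~81-byte tuples, one extra byte per tuple works out to roughly 1.23 TiB, matching the estimate in the message.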

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2017-01-09 Thread Amit Kapila
On Fri, Jan 6, 2017 at 7:41 PM, Robert Haas  wrote:
> On Fri, Jan 6, 2017 at 6:28 AM, Amit Kapila  wrote:
>>> Also, I'm thinking the bit could be stored in the line pointer rather
>>> than the tuple, because with this design we don't need
>>> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
>>> bit to indicate dead or not-dead and the second bit to indicate
>>> recently-modified or not-recently-modified.  With that approach,
>>> clearing the bits only requires iterating over the line pointer array,
>>> not the tuples themselves.
>>
>> I think this can help in mitigating the overhead.  However, now
>> another question is: if we just unset it when there is no other
>> active transaction operating on the current page except for the
>> current transaction, then will that tuple be considered all-visible?
>> I think the absence of transactions operating on a page can't be
>> taken as a guarantee that the tuple can be marked all-visible.  If
>> that is true, then what is the advantage of clearing the bit?
>
> That's kind of a strange question.  The mission of this proposed bit
> is to tell you whether the transaction(s) referenced in the page's
> UNDO pointer have modified that tuple.  If you didn't clear it when
> you reset the UNDO pointer to a new transaction, then the bit would
> always be set for every tuple on the page and it would be completely
> useless.  (I mean, the tuples had to be inserted originally, right?
> So the bit was set then.  And if you never clear it, it will just stay
> set.)
>
> As to your other questions: If the page's UNDO pointer isn't valid,
> then the whole page is all-visible.  If the page's UNDO pointer is
> valid, then any tuple for which the bit is not set is all-visible.
>

Okay, I see the point.  I think here UNDO pointer can be marked
invalid either during page-pruning or vacuum.

Another point which I think is not discussed till now is, how the
locking will work.  As of now, we first lock the tuple and then xid on
which we have to wait. In the new design, we don't know the xid which
has updated the tuple and I am not sure there is any easy way to find
that information.  One idea could be that we have some fixed number of
slots (I think we can make it variable as well, but for simplicity,
let's consider it as fixed) in the page header which will store the
offset to the transaction id inside a TPD entry of the page.  Consider
a TPD entry of page contains four transactions, so we will just store
enough information in heap page header to reach the transaction id for
these four transactions. I think each such page header slot could be
three or four bits long depending upon how many concurrent
transactions we want to support on a page after which a new
transaction has to wait (I think in most workloads supporting
simultaneous eight transactions on a page should be sufficient).
Then we can have an additional byte (or less than byte) in the tuple
header to store lock info which is nothing but an offset to the slot
in the page header.   We might find some other locking technique as
well, but I think keeping it same as current has benefit.
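The slot scheme above, with three-bit slot numbers supporting up to eight concurrent transactions per page, can be sketched as a packed bit array.  The helper names and layout are illustrative assumptions:

```c
#include <stdint.h>

#define SLOT_BITS 3
#define SLOT_MASK ((1u << SLOT_BITS) - 1)

/* Store slot number 'slot' (0..7) for tuple 'idx' in a packed array.
 * Reads/writes two adjacent bytes because a 3-bit field may straddle
 * a byte boundary. */
static void
slot_set(uint8_t *arr, unsigned idx, unsigned slot)
{
    unsigned bitpos = idx * SLOT_BITS;
    unsigned byte = bitpos / 8, shift = bitpos % 8;
    uint16_t w = arr[byte] | ((uint16_t) arr[byte + 1] << 8);

    w = (w & ~((uint16_t) SLOT_MASK << shift))
        | ((uint16_t) (slot & SLOT_MASK) << shift);
    arr[byte] = (uint8_t) w;
    arr[byte + 1] = (uint8_t) (w >> 8);
}

static unsigned
slot_get(const uint8_t *arr, unsigned idx)
{
    unsigned bitpos = idx * SLOT_BITS;
    unsigned byte = bitpos / 8, shift = bitpos % 8;
    uint16_t w = arr[byte] | ((uint16_t) arr[byte + 1] << 8);

    return (w >> shift) & SLOT_MASK;
}
```

At three bits per tuple, a page with 100 line pointers would need only about 38 bytes of header space for the whole array.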

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2017-01-06 Thread Robert Haas
On Fri, Jan 6, 2017 at 6:28 AM, Amit Kapila  wrote:
>> Also, I'm thinking the bit could be stored in the line pointer rather
>> than the tuple, because with this design we don't need
>> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
>> bit to indicate dead or not-dead and the second bit to indicate
>> recently-modified or not-recently-modified.  With that approach,
>> clearing the bits only requires iterating over the line pointer array,
>> not the tuples themselves.
>
> I think this can help in mitigating the overhead.  However, now
> another question is: if we just unset it when there is no other
> active transaction operating on the current page except for the
> current transaction, then will that tuple be considered all-visible?
> I think the absence of transactions operating on a page can't be
> taken as a guarantee that the tuple can be marked all-visible.  If
> that is true, then what is the advantage of clearing the bit?

That's kind of a strange question.  The mission of this proposed bit
is to tell you whether the transaction(s) referenced in the page's
UNDO pointer have modified that tuple.  If you didn't clear it when
you reset the UNDO pointer to a new transaction, then the bit would
always be set for every tuple on the page and it would be completely
useless.  (I mean, the tuples had to be inserted originally, right?
So the bit was set then.  And if you never clear it, it will just stay
set.)

As to your other questions: If the page's UNDO pointer isn't valid,
then the whole page is all-visible.  If the page's UNDO pointer is
valid, then any tuple for which the bit is not set is all-visible.
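The two rules just stated can be captured in a few lines (a sketch; `UndoPtr`, `InvalidUndoPtr`, and the per-tuple recently-modified bit are illustrative names, not the actual on-disk format):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t UndoPtr;
#define InvalidUndoPtr ((UndoPtr) 0)

/* A tuple is all-visible if the page's UNDO pointer is invalid (then
 * the whole page is all-visible) or if its recently-modified bit is
 * clear (the xacts referenced by the UNDO pointer never touched it). */
static bool
tuple_all_visible(UndoPtr page_undo_ptr, bool recently_modified)
{
    if (page_undo_ptr == InvalidUndoPtr)
        return true;
    return !recently_modified;
}
```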

> Okay, I think we could clean undo and heap without caring for the
> index.  Basically, the idea works on the premise that we won't allow
> same-value rows in the index for the same TID, and using rechecks we
> can identify the deleted tuple.

Right, seems we are on the same page now.  Just to be clear, I'm not
saying there aren't other possible designs, and one of those designs
might be better.  But at least now we both seem to have the same
understanding of what I proposed, which seems like progress.

> On rethinking about the working of vacuum
> in this system, it seems we can clean the heap independently and then,
> for the index, cross-check each entry against the heap to see whether
> it is marked as deleted.

Right.

> Now this has the advantage that we don't need to do two
> passes of the heap, but for the index, we might need to re-fetch heap
> pages randomly to detect if the delete marked row is actually deleted.

Yes.  There are some performance trade-offs here no matter what
decision you make.  If you decide to try to remove delete-marked index
entries during UNDO, then (1) you have to somehow make sure that those
index entries aren't still needed by some previous transaction that
isn't yet all-visible (e.g. imagine key is updated 1->2->1, first
update commits but second one rolls back before first is all-visible)
which might be expensive and (2) if those index pages have been
evicted from cache you will incur additional I/O to bring them back
into cache, which might be more expensive than just cleaning them
later.  On the other hand, if you don't try to remove index pointers
during UNDO, then you're leaving bloat in your index.  We might want
to consider some combination of these things - e.g. if the page has
only one recent transaction (no TPD) and the index pages are still in
memory, have UNDO try to remove them, otherwise clean them later.  Or
some other rule.
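That combined rule can be sketched as a small decision function (the predicates are placeholders for checks a real implementation would need, including the 1->2->1 safety check from trade-off (1)):

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether UNDO should remove delete-marked index entries now,
 * or leave them for a later cleanup pass. */
static bool
clean_index_during_undo(bool entry_needed_by_older_xact, /* e.g. key 1->2->1 */
                        bool page_has_tpd,               /* >1 recent xact */
                        bool index_pages_cached)
{
    if (entry_needed_by_older_xact)
        return false;           /* trade-off (1): entry still required */
    /* trade-off (2): only clean now when it is cheap and unambiguous */
    return !page_has_tpd && index_pages_cached;
}
```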

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2017-01-06 Thread Amit Kapila
On Thu, Jan 5, 2017 at 9:25 PM, Robert Haas  wrote:
> On Wed, Jan 4, 2017 at 6:05 AM, Amit Kapila  wrote:
>> Okay, so this optimization can work only after all the active
>> transactions operating on a page are finished.  If that is true, in
>> some cases such a design can consume a lot of CPU traversing all the
>> tuples in a page to un-set the bit, especially when there are only a
>> few such tuples.
>
> I suppose.  I didn't think that cost was likely to be big enough to
> worry about, but I might be wrong.  The worst case would be when you
> modify one tuple on a page, let the transaction that did the
> modification become all-visible, modify one tuple on the page again,
> etc. and, at the same time, the page is entirely full of tuples.  So
> you keep having to loop over all the bits to clear them (they are all
> clear except one, but you don't know that) and then re-set just one of
> them.  That's not free, but keep in mind that the existing system
> would be forced to perform non-HOT updates in that situation, which
> isn't free either.
>
> Also, I'm thinking the bit could be stored in the line pointer rather
> than the tuple, because with this design we don't need
> LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
> bit to indicate dead or not-dead and the second bit to indicate
> recently-modified or not-recently-modified.  With that approach,
> clearing the bits only requires iterating over the line pointer array,
> not the tuples themselves.
>

I think this can help in mitigating the overhead.  However, another
question now is: if we just unset it when there is no other active
transaction operating on the current page except for the current
transaction, will that tuple then be considered all-visible?  I think
the absence of transactions operating on a page can't be taken as a
guarantee that a tuple can be marked all-visible.  If that is true,
then what is the advantage of clearing the bit?

>>> We don't necessarily need UNDO to clean up the indexes, although it
>>> might be a good idea.  It would *work* to just periodically scan the
>>> index for delete-marked items.  For each one, visit the heap and see
>>> if there's a version of that tuple present in the heap or current UNDO
>>> that matches the index entry.  If not, remove the index entry.
>>
>> I think this is somewhat similar to how we clean the index now and
>> seems to be a workable design.  However, don't you think that it will
>> have somewhat similar characteristics for index-bloat as we have now?
>
> Yes, it would have similar characteristics.  Thus we might want to do better.
>
>> OTOH for heap, it will help us to take away old versions away from the
>> main heap, but still, we can't get rid of that space or reuse that
>> space till we can clean corresponding index entries.
>
> I don't think that's true.  If in-place update is ever allowed in
> cases where indexed columns have been modified, then the index already
> has to cope with the possibility that the heap tuple it can see
> doesn't match the index.  And if it can cope with that, then why do we
> have to remove the index entry before reusing the heap TID?
>

No need. I think I can see what you are saying.

>> UNDO has to be kept till the heap page is marked as all-visible.  This
>> is required to check the visibility of index entries.  Now, I think
>> the page can be marked as all-visible once we have removed the
>> corresponding dead entries in the heap.  I think the main point of
>> discussion here is how vacuum will clean heap and index entries in
>> this new system.  Another idea could be that we allow undo entries to
>> be removed (we can allow the heap entry to be removed) based on
>> whether they are visible or not, and then while traversing the index
>> we cross-check in the heap whether the entry can be removed.
>> Basically, check the TID and traverse the undo contents, if any, to
>> verify whether the index entry can be removed.  I think this sounds
>> more complicated and less efficient than what I suggested earlier.
>
> In my proposal, when a tuple gets updated or deleted in the heap, we
> go find the corresponding index entries and delete-mark them.  When an
> index tuple is delete-marked, we have to cross-check it against the
> heap, because the index tuple might not match the version of the heap
> tuple that our snapshot can see.  Therefore, it's OK to discard the
> heap page's UNDO before cleaning the indexes, because the index tuples
> are already delete-marked and rechecks will therefore do the right
> thing.
>

Okay, I think we could clean undo and heap without caring for the
index.  Basically, the idea works on the premise that we won't allow
same-value rows in the index for the same TID, and using rechecks we
can identify the deleted tuple.  On rethinking about the working of
vacuum in this system, it seems we can clean the heap independently
and then, for the index, cross-check each entry against the heap to
see whether it is marked as deleted.  Now this has the advantage that
we don't need to do two passes of the heap, but for the index, we
might need to re-fetch heap pages randomly to detect if the
delete-marked row is actually deleted.

Re: [HACKERS] UNDO and in-place update

2017-01-05 Thread Robert Haas
On Thu, Jan 5, 2017 at 6:51 AM, Amit Kapila  wrote:
> UNDO has to be kept till the heap page is marked as all-visible.  This
> is required to check the visibility of index entries.  Now, I think
> the page can be marked as all-visible once we have removed the
> corresponding dead entries in the heap.  I think the main point of
> discussion here is how vacuum will clean heap and index entries in
> this new system.  Another idea could be that we allow undo entries to
> be removed (we can allow the heap entry to be removed) based on
> whether they are visible or not, and then while traversing the index
> we cross-check in the heap whether the entry can be removed.
> Basically, check the TID and traverse the undo contents, if any, to
> verify whether the index entry can be removed.  I think this sounds
> more complicated and less efficient than what I suggested earlier.

In my proposal, when a tuple gets updated or deleted in the heap, we
go find the corresponding index entries and delete-mark them.  When an
index tuple is delete-marked, we have to cross-check it against the
heap, because the index tuple might not match the version of the heap
tuple that our snapshot can see.  Therefore, it's OK to discard the
heap page's UNDO before cleaning the indexes, because the index tuples
are already delete-marked and rechecks will therefore do the right
thing.

In short, I agree with Dilip.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2017-01-05 Thread Robert Haas
On Wed, Jan 4, 2017 at 6:05 AM, Amit Kapila  wrote:
> Okay, so this optimization can work only after all the active
> transactions operating on a page are finished.  If that is true, in
> some cases such a design can consume a lot of CPU traversing all the
> tuples in a page to un-set the bit, especially when there are only a
> few such tuples.

I suppose.  I didn't think that cost was likely to be big enough to
worry about, but I might be wrong.  The worst case would be when you
modify one tuple on a page, let the transaction that did the
modification become all-visible, modify one tuple on the page again,
etc. and, at the same time, the page is entirely full of tuples.  So
you keep having to loop over all the bits to clear them (they are all
clear except one, but you don't know that) and then re-set just one of
them.  That's not free, but keep in mind that the existing system
would be forced to perform non-HOT updates in that situation, which
isn't free either.

Also, I'm thinking the bit could be stored in the line pointer rather
than the tuple, because with this design we don't need
LP_UNUSED/LP_NORMAL/LP_REDIRECT/LP_DEAD any more.  We could use one
bit to indicate dead or not-dead and the second bit to indicate
recently-modified or not-recently-modified.  With that approach,
clearing the bits only requires iterating over the line pointer array,
not the tuples themselves.
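Sketched concretely (this mirrors, but does not match, the real `ItemIdData` layout; the two flag assignments and names are hypothetical), the clearing loop really does touch only the line pointer array:

```c
#include <assert.h>
#include <stdint.h>

#define LP_BIT_DEAD   0x1       /* dead / not-dead */
#define LP_BIT_RECENT 0x2       /* modified by xacts in the page's UNDO/TPD */

typedef struct
{
    uint32_t lp_off:15,         /* offset of tuple on page */
             lp_flags:2,        /* the two bits above */
             lp_len:15;         /* length of tuple */
} LinePointer;

/* Clearing the recently-modified bits iterates over the line pointer
 * array only, never the tuples themselves. */
static void
clear_recent_bits(LinePointer *lp, int nitems)
{
    for (int i = 0; i < nitems; i++)
        lp[i].lp_flags &= ~LP_BIT_RECENT;
}
```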

>> We don't necessarily need UNDO to clean up the indexes, although it
>> might be a good idea.  It would *work* to just periodically scan the
>> index for delete-marked items.  For each one, visit the heap and see
>> if there's a version of that tuple present in the heap or current UNDO
>> that matches the index entry.  If not, remove the index entry.
>
> I think this is somewhat similar to how we clean the index now and
> seems to be a workable design.  However, don't you think that it will
> have somewhat similar characteristics for index-bloat as we have now?

Yes, it would have similar characteristics.  Thus we might want to do better.

> OTOH for heap, it will help us to take away old versions away from the
> main heap, but still, we can't get rid of that space or reuse that
> space till we can clean corresponding index entries.

I don't think that's true.  If in-place update is ever allowed in
cases where indexed columns have been modified, then the index already
has to cope with the possibility that the heap tuple it can see
doesn't match the index.  And if it can cope with that, then why do we
have to remove the index entry before reusing the heap TID?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2017-01-05 Thread Amit Kapila
On Wed, Jan 4, 2017 at 8:35 PM, Dilip Kumar  wrote:
> On Wed, Jan 4, 2017 at 4:35 PM, Amit Kapila  wrote:
>> In this new system, I
>> think we can't remove the undo entries of a heap page till we clear
>> the corresponding index entries.  I think we need to somehow collect
>> the old index-column values from undo, then scan the index and remove
>> the index entry, and after that the corresponding undo entry can be
>> removed.
>
> Do we really need to keep undo for the heap until the index entry is
> removed?
>

UNDO has to be kept till the heap page is marked as all-visible.  This
is required to check the visibility of index entries.  Now, I think
the page can be marked as all-visible once we have removed the
corresponding dead entries in the heap.  I think the main point of
discussion here is how vacuum will clean heap and index entries in
this new system.  Another idea could be that we allow undo entries to
be removed (we can allow the heap entry to be removed) based on
whether they are visible or not, and then while traversing the index
we cross-check in the heap whether the entry can be removed.
Basically, check the TID and traverse the undo contents, if any, to
verify whether the index entry can be removed.  I think this sounds
more complicated and less efficient than what I suggested earlier.

> IIUC, we anyway need to revalidate the index key with the heap
> tuple.  What I am trying to say is that if we no longer need UNDO
> for the heap page (e.g. because of rollback) then we can apply the
> UNDO and remove it.  I agree that there will be multiple index
> entries pointing to this tuple, but only one of them can pass the key
> revalidation with the heap, isn't it?
>

Yeah, but I think for that, you need to grovel through the entire undo
chain, if present, for a tuple.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2017-01-04 Thread Dilip Kumar
On Wed, Jan 4, 2017 at 4:35 PM, Amit Kapila  wrote:
> In this new system, I
> think we can't remove the undo entries of a heap page till we clear
> the corresponding index entries.  I think we need to somehow collect
> the old index-column values from undo, then scan the index and remove
> the index entry, and after that the corresponding undo entry can be
> removed.

Do we really need to keep undo for the heap until the index entry is
removed?  IIUC, we anyway need to revalidate the index key with the
heap tuple.  What I am trying to say is that if we no longer need UNDO
for the heap page (e.g. because of rollback) then we can apply the
UNDO and remove it.  I agree that there will be multiple index entries
pointing to this tuple, but only one of them can pass the key
revalidation with the heap, isn't it?


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2017-01-04 Thread Amit Kapila
On Tue, Dec 6, 2016 at 9:35 PM, Robert Haas  wrote:
> On Mon, Dec 5, 2016 at 4:49 AM, Amit Kapila  wrote:
>> I can very well understand the reason for not doing so (IIUC, it is
>> about complexity and the time to get it right when we are already
>> trying to solve one big and complex problem of the system), but
>> saying that most of the benefit can be realized by a simple
>> optimization seems less convincing.  I think having simple
>> optimizations won't solve the multi-pass vacuum problem.  However,
>> having the visibility information in the index will solve that and
>> avoid index bloat much more aggressively.
>
> I don't think that multi-pass vacuum would exist in this system,
> regardless of any of the details we're talking about here.  If you are
> doing in-place updates of tuples and changing the indexed columns,
> then you can't rely on the current system for removing index entries
> at all.

Sure.  The point I was trying to make was that index entries can't be
removed independently, apart from the case where some previous index
scan has marked them dead after consulting the heap.  In this new
system, I think we can't remove the undo entries of a heap page till
we clear the corresponding index entries.  I think we need to somehow
collect the old index-column values from undo, then scan the index and
remove the index entry, and after that the corresponding undo entry
can be removed.

>  The indexed entries don't point to a "dead" line pointer;
> they point to a tuple that no longer matches the index entry.
>
>>> 3. Otherwise, you need to look at the heap page.  But if the heap
>>> page's UNDO pointer is invalid, then the whole page is all-visible, so
>>> you're done.  (You can also set the visibility map bit and/or the
>>> index tuple's definitely-all-visible bit to avoid having to recheck
>>> the heap page in the future.)
>>>
>>> 4. Each tuple will have a bit indicating whether it's been recently
>>> modified; that is, whether the transaction whose UNDO log is
>>> referenced by the page's UNDO pointer did anything to that tuple; or
>>> if the page points to a TPD entry, whether any of the transactions in
>>> the TPD modified that tuple.  If that bit is not set for that
>>> particular tuple, it's all-visible and you're done.
>>
>> I think the 4th optimization is especially good and can cover many
>> cases, but I am not sure how we unset the bit once it is set.  For
>> example, once the effect of the transaction that touched that row is
>> all-visible, how will we unset it?  It might be better to store the
>> offset of the transaction that last modified the tuple; if the last
>> transaction that touched the row is visible to the snapshot, then we
>> don't need to process the UNDO.
>
> If the page has an invalid UNDO pointer and you change it to a valid
> UNDO pointer, you unset all the bits at the same time, except of
> course for the bit pertaining to the tuple you are modifying.  If the
> page's UNDO pointer is still valid when you update it, then you set
> the bits for any tuples you are modifying for which it is not already
> set.
>

Okay, so this optimization can work only after all the active
transactions operating on a page are finished.  If that is true, in
some cases such a design can consume a lot of CPU traversing all the
tuples in a page to un-set the bit, especially when there are only a
few such tuples.


>> One more thing that needs some thoughts is how will the rollback work
>> if keep just one bit for delete marking.  I mean for heap we can
>> directly apply from UNDO, but for the index, I think we need to also
>> take care of undoing it in some way.  For example, search the tree to
>> remove the deleting marking from old value and delete marking the new
>> value.  Now if that is what we need to do then first we need to
>> traverse the tree twice which is okay considering rollback is a seldom
>> operation, the second challenge would be how to make such an operation
>> atomic with respect to WAL.
>
> We don't necessarily need UNDO to clean up the indexes, although it
> might be a good idea.  It would *work* to just periodically scan the
> index for delete-marked items.  For each one, visit the heap and see
> if there's a version of that tuple present in the heap or current UNDO
> that matches the index entry.  If not, remove the index entry.
>

I think this is somewhat similar to how we clean the index now and
seems to be a workable design.  However, don't you think that it will
have somewhat similar characteristics for index-bloat as we have now?
OTOH for heap, it will help us to take away old versions away from the
main heap, but still, we can't get rid of that space or reuse that
space till we can clean corresponding index entries.

> However, we could try to be more aggressive: when you UNDO an UPDATE
> or DELETE in the heap, also go search the index and remove any
> delete-marked entries that are no longer needed.  The advantage of
> this is that we'd be more likely to visit the index entries while
> they are still in cache.  The disadvantage is that it is complex and
> might fail if the index is corrupted.  We can't have the whole system
> melt down in that case; UNDO has to always succeed.

Re: [HACKERS] UNDO and in-place update

2016-12-06 Thread Robert Haas
On Mon, Dec 5, 2016 at 4:49 AM, Amit Kapila  wrote:
> I can very well understand the reason for not doing so (IIUC, it is
> about complexity and the time to get it right when we are already
> trying to solve one big and complex problem of the system), but
> saying that most of the benefit can be realized by a simple
> optimization seems less convincing.  I think having simple
> optimizations won't solve the multi-pass vacuum problem.  However,
> having the visibility information in the index will solve that and
> avoid index bloat much more aggressively.

I don't think that multi-pass vacuum would exist in this system,
regardless of any of the details we're talking about here.  If you are
doing in-place updates of tuples and changing the indexed columns,
then you can't rely on the current system for removing index entries
at all.  The indexed entries don't point to a "dead" line pointer;
they point to a tuple that no longer matches the index entry.

>> 3. Otherwise, you need to look at the heap page.  But if the heap
>> page's UNDO pointer is invalid, then the whole page is all-visible, so
>> you're done.  (You can also set the visibility map bit and/or the
>> index tuple's definitely-all-visible bit to avoid having to recheck
>> the heap page in the future.)
>>
>> 4. Each tuple will have a bit indicating whether it's been recently
>> modified; that is, whether the transaction whose UNDO log is
>> referenced by the page's UNDO pointer did anything to that tuple; or
>> if the page points to a TPD entry, whether any of the transactions in
>> the TPD modified that tuple.  If that bit is not set for that
>> particular tuple, it's all-visible and you're done.
>
> I think the 4th optimization is especially good and can cover many
> cases, but I am not sure how we unset the bit once it is set.  For
> example, once the effect of the transaction that touched that row is
> all-visible, how will we unset it?  It might be better to store the
> offset of the transaction that last modified the tuple; if the last
> transaction that touched the row is visible to the snapshot, then we
> don't need to process the UNDO.

If the page has an invalid UNDO pointer and you change it to a valid
UNDO pointer, you unset all the bits at the same time, except of
course for the bit pertaining to the tuple you are modifying.  If the
page's UNDO pointer is still valid when you update it, then you set
the bits for any tuples you are modifying for which it is not already
set.
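A minimal sketch of that bit-maintenance rule (a plain per-tuple bit array stands in for wherever the bits actually live):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Maintain recently-modified bits when a tuple is modified.  On an
 * invalid->valid UNDO pointer transition all bits are reset, so only
 * the target tuple's bit survives; if the pointer was already valid,
 * the target's bit is merely added to whatever is already set. */
static void
mark_tuple_modified(uint8_t bits[], int nitems, int target,
                    bool undo_ptr_was_valid)
{
    if (!undo_ptr_was_valid)
        for (int i = 0; i < nitems; i++)
            bits[i] = 0;
    bits[target] = 1;
}
```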

> One more thing that needs some thoughts is how will the rollback work
> if keep just one bit for delete marking.  I mean for heap we can
> directly apply from UNDO, but for the index, I think we need to also
> take care of undoing it in some way.  For example, search the tree to
> remove the deleting marking from old value and delete marking the new
> value.  Now if that is what we need to do then first we need to
> traverse the tree twice which is okay considering rollback is a seldom
> operation, the second challenge would be how to make such an operation
> atomic with respect to WAL.

We don't necessarily need UNDO to clean up the indexes, although it
might be a good idea.  It would *work* to just periodically scan the
index for delete-marked items.  For each one, visit the heap and see
if there's a version of that tuple present in the heap or current UNDO
that matches the index entry.  If not, remove the index entry.
However, we could try to be more aggressive: when you UNDO an UPDATE
or DELETE in the heap, also go search the index and remove any
delete-marked entries that are no longer needed.  The advantage of
this is that we'd be more likely to visit the index entries while they
are still in cache.  The disadvantage is that it is complex and might
fail if the index is corrupted.  We can't have the whole system melt
down in that case; UNDO has to always succeed.

>> The UNDO chain has to remain until all transactions that modified the
>> page are all-visible, not just until the transactions are committed.
>
> Okay, I think this will work if we avoid the second insertion of the
> same value-TID pair in the index as you mentioned above.  Without
> that, I think we might not distinguish which of the two rows is
> visible for a transaction to which the latest (step-3) row is visible.

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-12-05 Thread Amit Kapila
On Thu, Dec 1, 2016 at 8:55 PM, Robert Haas  wrote:
> On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila  wrote:
>
> I see what you're going for, but I'm not sure it's worth it.  I mean,
> say you just have one bit per index tuple.  If it's set, the heap
> tuple is definitely all-visible; if it's clear, then the tuple may or
> may not be visible to your snapshot.  You insert tuples with the bit
> clear, and clear the bit again when the tuple is deleted or the
> indexed column is updated.  You set the bit when you follow the index
> pointer and discover that the tuple is all-visible, so that next time
> you don't need to follow it again.  Of the advantages you mention
> above, this still allows (a), but less efficiently.  It allows (b).
> It does not allow (c).  So, it's not as good.  But, it saves you the
> overhead of storing xmin and/or xmax for each tuple in the page.  Even
> if you have an optimization to list the XIDs once per page and then
> refer to them from the tuples, you are going to need a byte in each
> tuple for the xmin-index and a byte for the xmax-index, or something
> like that, plus the space for the XID list itself.  That's not a ton
> of space, maybe, but it's definitely more.  It's also a significantly
> bigger change to the tuple format, I think.  What I'm proposing could
> be done by repurposing the LP_* bits available in the line pointer.
> So I think the questions are (1) will the extra space consumption hurt
> us more than the benefit we will get? and (2) is it really worth the
> work to change the page format to that degree?
>
> As you can probably tell, I'm leaning toward "no".  I think a bit
> per-tuple -- whether we call it delete-mark or definitely-all-visible
> -- should be enough to get most of the benefit that is available here
> for substantially less work and no extra space.
>

I can very well understand the reason for not doing so (IIUC, it is
about complexity and the time to get it right when we are already
trying to solve one big and complex problem of the system), but saying
that most of the benefit can be realized by a simple optimization
seems less convincing.  I think having simple optimizations won't
solve the multi-pass vacuum problem.  However, having the visibility
information in the index will solve that and avoid index bloat much
more aggressively.

>> I think with the proposed design there is a good chance that many
>> index scans will need the extra hop to decide the visibility of an
>> index tuple.  As of today, during an index scan, if the corresponding
>> heap page is not all-visible, then we check the corresponding heap
>> tuple to decide the visibility of the index tuple, whereas with the
>> proposed design it first needs to check the heap page and then the
>> TPD, if there is more than one transaction operating on the page.
>
> There's a couple of possible designs here, but there is the
> possibility for extra hops in some cases.  But there are things we can
> do to mitigate that.
>
> 1. If each tuple has a definitely-all-visible bit, you can check that
> first; if it's set, the tuple is definitely all-visible you don't need
> to do anything further.
>
> 2. If we still maintain a visibility map, you can check that next.  If
> the bit is set for the target page, then the tuple is definitely
> all-visible and you don't need to do anything further.
>
> 3. Otherwise, you need to look at the heap page.  But if the heap
> page's UNDO pointer is invalid, then the whole page is all-visible, so
> you're done.  (You can also set the visibility map bit and/or the
> index tuple's definitely-all-visible bit to avoid having to recheck
> the heap page in the future.)
>
> 4. Each tuple will have a bit indicating whether it's been recently
> modified; that is, whether the transaction whose UNDO log is
> referenced by the page's UNDO pointer did anything to that tuple; or
> if the page points to a TPD entry, whether any of the transactions in
> the TPD modified that tuple.  If that bit is not set for that
> particular tuple, it's all-visible and you're done.
>

I think the 4th optimization is especially good and can cover many
cases, but I am not sure how we unset the bit once it is set.  For
example, once the effect of the transaction that touched that row is
all-visible, how will we unset it?  It might be better to store the
offset of the transaction that last modified the tuple; if the last
transaction that touched the row is visible to the snapshot, then we
don't need to process the UNDO.
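The alternative suggested here, sketched (all names are invented; the `xid <= xmax` test is a stand-in for whatever snapshot visibility check the system would really use):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in snapshot: everything at or below xmax counts as visible. */
typedef struct { uint32_t xmax; } Snapshot;

static bool
xid_visible_to_snapshot(uint32_t xid, const Snapshot *snap)
{
    return xid <= snap->xmax;
}

/* With a per-tuple last-modifier recorded, UNDO processing is needed
 * only when that transaction is not visible to our snapshot. */
static bool
need_undo_walk(uint32_t last_modifier_xid, const Snapshot *snap)
{
    return !xid_visible_to_snapshot(last_modifier_xid, snap);
}
```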

> 5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.
>
> If we put in the visibility information in the index as you are
> proposing, then we can always resolve the visibility of the index
> tuple at step 1; we just test xmin and xmax against the snapshot.  For
> index-only scans, that's great, because we never need to consult the
> heap at all.  For regular index scans, it's not nearly as great,
> because we're 

Re: [HACKERS] UNDO and in-place update

2016-12-02 Thread Robert Haas
On Fri, Dec 2, 2016 at 5:01 AM, Alexander Korotkov
 wrote:
> The idea of storing just one visibility bit in the index tuple is a
> subject of serious doubt for me.
>
> 1. When definitely-all-visible isn't set, then we have to recheck
> during heap scanning, right?  But our current recheck (i.e. rechecking
> the scan quals) is not enough.  Imagine you have a range index scan
> (val >= 100 AND val < 200), and the val field was updated from val =
> 50 to val = 60 in some row.  Then you might have this row duplicated
> in the output.  Removing index scans and sticking to only bitmap index
> scans doesn't seem to be fair.  Thus, we need to introduce another
> kind of recheck: that heapTuple.indexKey = indexTuple.indexKey.

Yes.
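The extra recheck being agreed to amounts to this (a sketch for a single integer key; when both the old and new keys of an updated row fall inside the scanned range, the qual recheck alone would pass both index entries, so the key-equality test is what prevents duplicates):

```c
#include <assert.h>
#include <stdbool.h>

/* An index tuple qualifies only if the scan quals pass AND the current
 * heap tuple's key still equals the index tuple's key. */
static bool
index_tuple_qualifies(int index_key, int heap_key, int lo, int hi)
{
    if (index_key < lo || index_key >= hi)
        return false;               /* ordinary qual recheck */
    return heap_key == index_key;   /* key-equality recheck */
}
```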

> 2. Another question is how it could work while the index key is being
> intensively updated.  There is a risk that we would have to traverse
> the same UNDO chains multiple times.  In the worst case, we would
> have to spend time quadratic in the number of row versions since our
> snapshot was taken.  We might try to mitigate this by caching a
> TID => heap tuple map for our snapshot, but I don't like this
> approach.  If we have a large enough index scan, then we could either
> run out of cache or consume too much memory for that cache.

I agree with the concern, but I don't think that's necessarily the
only mitigation strategy.  The details of what goes into UNDO and what
goes into the TPD aren't really defined yet, and if we structure those
so that you can efficiently find out what happened to a particular TID
with only a bounded number of accesses, then it might not be too bad.
If you imagine having to walk a chain of 1000 UNDO entries once per
TID, and there are 100 TIDs on the page, that sounds pretty bad.  But
maybe we can design the UNDO format in such a way that you never
actually need to walk the entire chain, or only in extremely rare
corner cases.
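One way such a bounded-access UNDO layout could work is to chain records per TID rather than only per page. The following is a toy Python model (all names are invented; this is not PostgreSQL code) contrasting that with the naive layout: reconstructing one tuple's history never visits records written for unrelated tuples on the same page.

```python
# Toy sketch: per-page UNDO records chained per-TID.  Finding the prior
# versions of one tuple walks only that tuple's chain, so the number of
# accesses is bounded by that tuple's version count, not by everything
# the transaction did on the page.  All names here are invented.

class UndoRecord:
    def __init__(self, tid, old_value, prev=None):
        self.tid = tid            # which tuple this record undoes
        self.old_value = old_value
        self.prev = prev          # previous record *for the same TID*

class PageUndo:
    def __init__(self):
        self.latest = {}          # TID -> newest UndoRecord for that TID

    def log_update(self, tid, old_value):
        self.latest[tid] = UndoRecord(tid, old_value, self.latest.get(tid))

    def versions(self, tid):
        """Walk only the chain for one TID (bounded accesses)."""
        rec, out = self.latest.get(tid), []
        while rec is not None:
            out.append(rec.old_value)
            rec = rec.prev
        return out

undo = PageUndo()
undo.log_update(tid=(1, 7), old_value="v1")
undo.log_update(tid=(1, 8), old_value="x1")   # unrelated tuple on same page
undo.log_update(tid=(1, 7), old_value="v2")
print(undo.versions((1, 7)))   # ['v2', 'v1'] -- never touches (1, 8)'s record
```

Whether the real format can afford a per-TID chain pointer in every UNDO record is exactly the kind of design detail that is still open here.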

It strikes me that there are three possibilities.  One is that we can
design the UNDO-based visibility reconstruction so that it is
blazingly fast.  In that case, the only benefit of putting the
visibility information in the index is that index-only scans will be
able to avoid touching the heap in some cases where they presently
cannot.  The second possibility is that despite our best efforts the
UNDO-based visibility reconstruction is doomed to be cripplingly slow.
In that case, even in "good" cases like sequential scans and bitmap
heap scans where we can process the entire page at once instead of
possibly having to make a separate visit per TID, we'll still be
painfully slow.  In that case, we might as well forget this project
altogether.  The third is that we're somewhere in the middle.  If
UNDO-based visibility reconstruction is only a minor annoyance in
sequential scan and bitmap-heap scan cases but, despite our best
efforts, becomes highly painful in index scan cases, then the case for
putting XIDs in the index suddenly becomes a lot more interesting in
my mind.

We may well end up in exactly that place, but I think it's a little
too early to decide yet.  I think we need to write a detailed design
for how UNDO-based visibility reconstruction would actually work, and
maybe even build a prototype to see how that performs, before we can
decide on this.  I kind of hope we don't need to do XIDs-in-the-index;
it sounds like a lot of work, and this project is bound to be
extremely difficult even without that additional complication.
However, if it becomes clear that it's the only way for a system like
this to perform acceptably, then it'll have to be done (unless we give
up on the whole thing).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UNDO and in-place update

2016-12-02 Thread Alexander Korotkov
On Thu, Dec 1, 2016 at 6:25 PM, Robert Haas  wrote:

> There's a couple of possible designs here, but there is the
> possibility for extra hops in some cases.  But there are things we can
> do to mitigate that.
>
> 1. If each tuple has a definitely-all-visible bit, you can check that
> first; if it's set, the tuple is definitely all-visible and you don't
> need to do anything further.
>
> 2. If we still maintain a visibility map, you can check that next.  If
> the bit is set for the target page, then the tuple is definitely
> all-visible and you don't need to do anything further.
>
> 3. Otherwise, you need to look at the heap page.  But if the heap
> page's UNDO pointer is invalid, then the whole page is all-visible, so
> you're done.  (You can also set the visibility map bit and/or the
> index tuple's definitely-all-visible bit to avoid having to recheck
> the heap page in the future.)
>
> 4. Each tuple will have a bit indicating whether it's been recently
> modified; that is, whether the transaction whose UNDO log is
> referenced by the page's UNDO pointer did anything to that tuple; or
> if the page points to a TPD entry, whether any of the transactions in
> the TPD modified that tuple.  If that bit is not set for that
> particular tuple, it's all-visible and you're done.
>
> 5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.
>
> If we put the visibility information in the index as you are
> proposing, then we can always resolve the visibility of the index
> tuple at step 1; we just test xmin and xmax against the snapshot.  For
> index-only scans, that's great, because we never need to consult the
> heap at all.  For regular index scans, it's not nearly as great,
> because we're still going to need to consult the TPD and UNDO logs to
> determine which version of the tuple is visible to that snapshot.  You
> can avoid looking at the heap in the case where no version of the
> tuple is visible to the scan, but that's probably mostly going to
> happen only in cases where the transaction is still in flight so in
> most cases the heap will be in shared_buffers anyway and you won't
> save very much.
>

The idea of storing just one visibility bit in the index tuple raises
serious doubts for me.

1. When definitely-all-visible isn't set, then we have to recheck while
scanning the heap, right?  But our current recheck (i.e. rechecking the
scan quals) is not enough.  Imagine you have a range index scan
(val >= 100 AND val < 200), and the val field was updated from val = 50
to val = 60 in some row.  Then you might have this row duplicated in the
output.  Dropping index scans and sticking to only bitmap index scans
doesn't seem fair.  Thus, we need to introduce another kind of recheck:
that heapTuple.indexKey = indexTuple.indexKey.

2. Another question is how this could work while the index key is being
intensively updated.  There is a risk that we would have to traverse the
same UNDO chains multiple times.  In the worst case, we would have to
spend time quadratic in the number of row versions since our snapshot
was taken.  We might try to mitigate this by caching a TID => heap tuple
map for our snapshot, but I don't like this approach: with a large
enough index scan, we could either run out of cache or consume too much
memory for that cache.
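The duplicate-visit hazard from point 1 can be shown concretely. Below is a toy Python simulation (invented names and values, not PostgreSQL code; the update here is 150 -> 160 so that both the old and the new index entries fall inside the scan range): after an in-place update, two index entries point at the same TID, and rechecking only the scan qual returns the row twice, while the additional heapTuple.indexKey = indexTuple.indexKey recheck returns it once.

```python
# Toy model: one heap tuple whose indexed column was updated in place
# from 150 to 160, leaving two index entries (key, tid) for the same TID.

heap = {(1, 5): {"val": 160}}                 # version visible to our snapshot
index = [(150, (1, 5)), (160, (1, 5))]        # stale entry + current entry

def scan(lo, hi, recheck_key):
    out = []
    for key, tid in index:
        if not (lo <= key < hi):
            continue
        tup = heap[tid]                       # visible version for this snapshot
        if not (lo <= tup["val"] < hi):       # ordinary scan-qual recheck
            continue
        if recheck_key and tup["val"] != key: # extra index-key recheck
            continue
        out.append(tid)
    return out

print(scan(100, 200, recheck_key=False))  # [(1, 5), (1, 5)] -- same row twice
print(scan(100, 200, recheck_key=True))   # [(1, 5)]
```

The extra comparison is cheap; the cost concern in point 2 is about resolving the visible heap version per index entry, which the recheck does not remove.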

So, I think storing visibility information in the index tuple is great
for regular index scans too.  At least, it allows us to avoid problem
#2, which is valuable by itself.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] UNDO and in-place update

2016-12-01 Thread Robert Haas
On Tue, Nov 29, 2016 at 12:21 AM, Amit Kapila  wrote:
>> I think we need to avoid putting the visibility information in the
>> index because that will make the index much bigger.
>
> I agree that there will be an increase in index size, but it shouldn't
> be much if we have transaction information (and that too only active
> transactions) at page level instead of at tuple level.

It depends on the workload.  If you have something where the number of
transactions per page is small, like a bulk load, that scheme may work
out great.  But if you have something like a pgbench workload with an
extra index on the abalance column, you could have a different
transaction for almost every tuple.  Of course the question of how
many of those will be recent is a good one; if it's not too many,
maybe it's not that bad.

> I think the
> number of concurrent writes on the index will be smaller than on
> the heap.  There are a couple of benefits of having visibility
> information in the index.
>
> a. Heap and index could be independently cleaned up and in most cases
> by foreground transactions only.  The work for vacuum will be quite
> less as compared to now.  I think this will help us in avoiding the
> bloat to a great degree.
>
> b. Improved index-only scans, because now we don't need visibility map
> of the heap to check tuple visibility.
>
> c. Some speed improvements for index scans can also be expected
> because with this we don't need to perform heap fetch for invisible
> index tuples.

I see what you're going for, but I'm not sure it's worth it.  I mean,
say you just have one bit per index tuple.  If it's set, the heap
tuple is definitely all-visible; if it's clear, then the tuple may or
may not be visible to your snapshot.  You insert tuples with the bit
clear, and clear the bit again when the tuple is deleted or the
indexed column is updated.  You set the bit when you follow the index
pointer and discover that the tuple is all-visible, so that next time
you don't need to follow it again.  Of the advantages you mention
above, this still allows (a), but less efficiently.  It allows (b).
It does not allow (c).  So, it's not as good.  But, it saves you the
overhead of storing xmin and/or xmax for each tuple in the page.  Even
if you have an optimization to list the XIDs once per page and then
refer to them from the tuples, you are going to need a byte in each
tuple for the xmin-index and a byte for the xmax-index, or something
like that, plus the space for the XID list itself.  That's not a ton
of space, maybe, but it's definitely more.  It's also a significantly
bigger change to the tuple format, I think.  What I'm proposing could
be done by repurposing the LP_* bits available in the line pointer.
So I think the questions are (1) will the extra space consumption hurt
us more than the benefit we will get? and (2) is it really worth the
work to change the page format to that degree?

As you can probably tell, I'm leaning toward "no".  I think a bit
per-tuple -- whether we call it delete-mark or definitely-all-visible
-- should be enough to get most of the benefit that is available here
for substantially less work and no extra space.
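The life cycle of that single bit can be written out as a toy state machine (Python, invented names; the real proposal would repurpose LP_* flags in the line pointer, not store a Python object): clear on insert, set lazily when a heap visit proves the tuple all-visible, cleared again on delete or indexed-column update.

```python
# Toy sketch of the one-bit-per-index-tuple scheme: the bit is a cache
# of "this heap tuple is all-visible", never a substitute for xmin/xmax.

class IndexTupleBit:
    def __init__(self):
        self.definitely_all_visible = False   # inserted with the bit clear

    def heap_visit_found_all_visible(self):
        self.definitely_all_visible = True    # cache the heap lookup's answer

    def tuple_deleted_or_key_updated(self):
        self.definitely_all_visible = False   # must re-verify via the heap

    def needs_heap_visit(self):
        return not self.definitely_all_visible

t = IndexTupleBit()
assert t.needs_heap_visit()            # fresh insert: must check the heap
t.heap_visit_found_all_visible()
assert not t.needs_heap_visit()        # later scans can skip the heap
t.tuple_deleted_or_key_updated()
assert t.needs_heap_visit()            # back to rechecking
```

This makes the trade-off above visible: the bit gives benefit (a) less efficiently and (b), but not (c), since an unset bit says nothing about invisibility.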

> I think with the proposed design there is a good chance that many
> index scans will need the extra hop to decide the visibility of an
> index tuple.  As of today, during an index scan, if the corresponding
> heap page is not all-visible, then we check the corresponding heap
> tuple to decide the visibility of the index tuple, whereas with the
> proposed design it first needs to check the heap page and then the
> TPD, if there is more than one transaction operating on the page.

There's a couple of possible designs here, but there is the
possibility for extra hops in some cases.  But there are things we can
do to mitigate that.

1. If each tuple has a definitely-all-visible bit, you can check that
first; if it's set, the tuple is definitely all-visible and you don't
need to do anything further.

2. If we still maintain a visibility map, you can check that next.  If
the bit is set for the target page, then the tuple is definitely
all-visible and you don't need to do anything further.

3. Otherwise, you need to look at the heap page.  But if the heap
page's UNDO pointer is invalid, then the whole page is all-visible, so
you're done.  (You can also set the visibility map bit and/or the
index tuple's definitely-all-visible bit to avoid having to recheck
the heap page in the future.)

4. Each tuple will have a bit indicating whether it's been recently
modified; that is, whether the transaction whose UNDO log is
referenced by the page's UNDO pointer did anything to that tuple; or
if the page points to a TPD entry, whether any of the transactions in
the TPD modified that tuple.  If that bit is not set for that
particular tuple, it's all-visible and you're done.

5. Otherwise you have to look at the TPD entry (if any) and UNDO logs.
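The five steps form a short-circuiting ladder, each step strictly cheaper than the next source of truth. A literal transcription as a toy Python predicate (invented parameter names; not PostgreSQL code):

```python
# Steps 1-4 each prove all-visibility and stop; only step 5 falls
# through to the expensive TPD/UNDO consultation.

def tuple_is_all_visible(idx_bit, vm_bit, page_has_undo, recently_modified):
    if idx_bit:                  # 1. index tuple's definitely-all-visible bit
        return True
    if vm_bit:                   # 2. visibility-map bit for the heap page
        return True
    if not page_has_undo:        # 3. page's UNDO pointer is invalid
        return True
    if not recently_modified:    # 4. page's UNDO doesn't involve this tuple
        return True
    return False                 # 5. must consult the TPD entry / UNDO logs

assert tuple_is_all_visible(True, False, True, True)       # stops at step 1
assert tuple_is_all_visible(False, False, False, True)     # stops at step 3
assert not tuple_is_all_visible(False, False, True, True)  # reaches step 5
```

Note the asymmetry this illustrates: a True answer is definitive, but a False answer only means "unknown, go look at the UNDO machinery".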

If we put the visibility information in the index as you are proposing,
then we can always resolve the visibility of the index tuple at step 1;
we just test xmin and xmax against the snapshot.  For index-only scans,
that's great, because we never need to consult the heap at all.  For
regular index scans, it's not nearly as great, because we're still
going to need to consult the TPD and UNDO logs to determine which
version of the tuple is visible to that snapshot.  You can avoid
looking at the heap in the case where no version of the tuple is
visible to the scan, but that's probably mostly going to happen only in
cases where the transaction is still in flight so in most cases the
heap will be in shared_buffers anyway and you won't save very much.

Re: [HACKERS] UNDO and in-place update

2016-11-29 Thread Alexander Korotkov
On Tue, Nov 29, 2016 at 8:21 AM, Amit Kapila 
wrote:

> On Mon, Nov 28, 2016 at 11:01 PM, Robert Haas 
> wrote:
> > On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila 
> wrote:
> >> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas 
> wrote:
> >>> Well, my original email did contain a discussion of the need for
> >>> delete-marking.  I said that you couldn't do in-place updates when
> >>> indexed columns were modified unless the index AMs had support for
> >>> delete-marking, which is the same point you are making here.
> >>
> >> Sorry, I had not read that part earlier, but now that I read it, I
> >> think there is a slight difference in what I am saying.   I thought
> >> along with delete-marking, we might need transaction visibility
> >> information in the index as well.
> >
> > I think we need to avoid putting the visibility information in the
> > index because that will make the index much bigger.
> >
>
> I agree that there will be an increase in index size, but it shouldn't
> be much if we keep transaction information (and that too only for
> active transactions) at the page level instead of at the tuple level.
> I think the number of concurrent writes on the index will be smaller
> than on the heap.  There are a couple of benefits of having visibility
> information in the index.
>
> a. Heap and index could be independently cleaned up and in most cases
> by foreground transactions only.  The work for vacuum will be quite
> less as compared to now.  I think this will help us in avoiding the
> bloat to a great degree.
>
> b. Improved index-only scans, because now we don't need visibility map
> of the heap to check tuple visibility.
>
> c. Some speed improvements for index scans can also be expected
> because with this we don't need to perform heap fetch for invisible
> index tuples.
>

+1
I think once we're considering marking deleted index tuples, we should
provide an option for visibility-aware indexes.  Probably it shouldn't
be the only option for an UNDO-based table engine, but it definitely
should be one of the options.

d. This is the way to eventually have index-organized tables.  Once an
index is visibility-aware, it becomes possible to store data there
without a heap but with snapshots and transactions.  Also, it would be
possible to achieve more unification between heap and index access
methods.  Imagine that the heap is a TID => tuple map and an index is
an index_key => tuple map.  I think the reason we distinguish between
the heap access method (which is hardcoded) and index access methods is
that the heap is visibility-aware and indexes are not.  Once they are
both visibility-aware, there is not much difference between them: they
are just different kinds of maps, and they could implement the same
interface.  Imagine you could select between a heap-organized table and
an index-organized table just by choosing its primary access method.
If you select heap as the primary access method, indexes would refer to
TIDs.  If you select a btree on column id as the primary access method,
indexes would refer to column id.  That would be great extensibility,
way better than what we have now.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] UNDO and in-place update

2016-11-28 Thread Amit Kapila
On Mon, Nov 28, 2016 at 11:01 PM, Robert Haas  wrote:
> On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila  wrote:
>> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas  wrote:
>>> Well, my original email did contain a discussion of the need for
>>> delete-marking.  I said that you couldn't do in-place updates when
>>> indexed columns were modified unless the index AMs had support for
>>> delete-marking, which is the same point you are making here.
>>
>> Sorry, I had not read that part earlier, but now that I read it, I
>> think there is a slight difference in what I am saying.   I thought
>> along with delete-marking, we might need transaction visibility
>> information in the index as well.
>
> I think we need to avoid putting the visibility information in the
> index because that will make the index much bigger.
>

I agree that there will be an increase in index size, but it shouldn't
be much if we keep transaction information (and that too only for
active transactions) at the page level instead of at the tuple level.
I think the number of concurrent writes on the index will be smaller
than on the heap.  There are a couple of benefits of having visibility
information in the index.

a. Heap and index could be independently cleaned up and in most cases
by foreground transactions only.  The work for vacuum will be quite
less as compared to now.  I think this will help us in avoiding the
bloat to a great degree.

b. Improved index-only scans, because now we don't need visibility map
of the heap to check tuple visibility.

c. Some speed improvements for index scans can also be expected
because with this we don't need to perform heap fetch for invisible
index tuples.

>  It would be a
> shame to get the visibility information out of the heap (as this design
> does) only to be forced to add it to the index.  Note that,
> percentage-wise, it's quite a bit worse to have visibility information
> in the index than in the heap, because the index tuples are going to
> be much narrower than the heap tuples, often just one column.  (The
> cost of having this information in the heap can be amortized across
> the whole width of the table.)
>

I think with the proposed design there is a good chance that many index
scans will need the extra hop to decide the visibility of an index
tuple.  As of today, during an index scan, if the corresponding heap
page is not all-visible, then we check the corresponding heap tuple to
decide the visibility of the index tuple, whereas with the proposed
design it first needs to check the heap page and then the TPD, if there
is more than one transaction operating on the page.

> I don't have the whole design for the delete-marking stuff worked out
> yet.  I'm thinking we could have a visibility map where the bit for
> each page is 1 if the page certainly has no pending UNDO and 0 if it
> might have some.  In addition, if a tuple is deleted or the indexed
> column value is changed, we delete-mark the associated index entries.
> If we later discover that the page has no current UNDO (i.e. is
> all-visible) and the tuple at a given TID matches our index entry, we
> can clear the delete-mark for that index entry.  So, an index-only
> scan rechecks the heap if the tuple is delete-marked OR the
> visibility-map bit for the page is not set; if neither is the case, it
> can assume the heap tuple is visible.
>

Such a design will work, but I think it requires more work to mark the
tuple as Dead as compared to the current system.

> Another option would be to get rid of the visibility map and rely only
> on the delete-marks.  If we did that, then tuples would have to be
> delete-marked when first inserted since they wouldn't be all-visible
> until sometime after the commit of the inserting transaction.
>

The previous idea sounds better.

>
>>>  That's OK, but we're in real trouble if
>>> step-3 inserted an additional index tuple pointing to that chain
>>> rather than simply noting that one already exists.  If it adds an
>>> additional one, then we'll visit two index tuples for that TID.  Each
>>> time, the visibility information in UNDO will cause us to return the
>>> correct tuple, but we've erred by returning it twice (once per index
>>> entry) rather than just once.
>>
>> Why will the scan traverse the UNDO chain if it starts after step-3's commit?
>
> Why wouldn't it?  I think that if an index scan hits a page with an
> UNDO chain, it always needs to traverse the whole thing.
>

If you notice, the scan has started after all the writes have
committed, so what is the guarantee that there will be an UNDO chain?
Yes, it could be there if there is some live transaction that started
before those steps; otherwise, I am not sure we can guarantee that an
UNDO chain will be present.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] UNDO and in-place update

2016-11-28 Thread Robert Haas
On Sun, Nov 27, 2016 at 10:44 PM, Amit Kapila  wrote:
> On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas  wrote:
>> Well, my original email did contain a discussion of the need for
>> delete-marking.  I said that you couldn't do in-place updates when
>> indexed columns were modified unless the index AMs had support for
>> delete-marking, which is the same point you are making here.
>
> Sorry, I had not read that part earlier, but now that I read it, I
> think there is a slight difference in what I am saying.   I thought
> along with delete-marking, we might need transaction visibility
> information in the index as well.

I think we need to avoid putting the visibility information in the
index because that will make the index much bigger.  It would be a
shame to get the visibility information out of the heap (as this design
does) only to be forced to add it to the index.  Note that,
percentage-wise, it's quite a bit worse to have visibility information
in the index than in the heap, because the index tuples are going to
be much narrower than the heap tuples, often just one column.  (The
cost of having this information in the heap can be amortized across
the whole width of the table.)

I don't have the whole design for the delete-marking stuff worked out
yet.  I'm thinking we could have a visibility map where the bit for
each page is 1 if the page certainly has no pending UNDO and 0 if it
might have some.  In addition, if a tuple is deleted or the indexed
column value is changed, we delete-mark the associated index entries.
If we later discover that the page has no current UNDO (i.e. is
all-visible) and the tuple at a given TID matches our index entry, we
can clear the delete-mark for that index entry.  So, an index-only
scan rechecks the heap if the tuple is delete-marked OR the
visibility-map bit for the page is not set; if neither is the case, it
can assume the heap tuple is visible.
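The two rules in that sketch reduce to a pair of small predicates. A toy Python transcription (invented names; this is a sketch of the proposal above, not an implementation):

```python
# Index-only scan: visit the heap iff the entry is delete-marked OR the
# page's visibility-map bit is clear.  A delete-mark may be cleared once
# the page has no pending UNDO and the heap tuple still matches the entry.

def index_only_needs_heap_check(delete_marked, vm_bit_set):
    return delete_marked or not vm_bit_set

def may_clear_delete_mark(page_has_pending_undo, heap_key_matches_entry):
    return (not page_has_pending_undo) and heap_key_matches_entry

assert not index_only_needs_heap_check(delete_marked=False, vm_bit_set=True)
assert index_only_needs_heap_check(delete_marked=True, vm_bit_set=True)
assert index_only_needs_heap_check(delete_marked=False, vm_bit_set=False)
assert may_clear_delete_mark(page_has_pending_undo=False,
                             heap_key_matches_entry=True)
```

The heap-key match in the second predicate is what keeps a stale entry (old key value, same TID) permanently delete-marked instead of being wrongly resurrected.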

Another option would be to get rid of the visibility map and rely only
on the delete-marks.  If we did that, then tuples would have to be
delete-marked when first inserted since they wouldn't be all-visible
until sometime after the commit of the inserting transaction.

> BTW, it is not completely clear
> whether you want a delete-marking system or you think we could do
> without that by avoiding in-place updates, it seems to me from what
> you have written that you are inclined towards having a delete-marking
> system.

Yes, that's my inclination.  I don't think it would be necessary to
have the delete-marking in order to produce a committable patch, but
the benefit of this approach would be much reduced without that.

>>  However,
>> I agree that the case where the indexed column gets set back to a
>> previous value while the old index tuple for that value still exists
>> is particularly tricky.  I think that what we need to do there is
>> avoid inserting an index entry for a given value/TID pair if an index
>> entry for that value/TID pair already exists.
>
> Are you saying this for indexes with a delete-marking system or for
> indexes without that or for both?

I'm saying that in any case in which you allow in-place update, you
have to make sure you don't get multiple entries pointing at the same
TID unless they have different values in the index tuple.
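As a toy model of that invariant (Python, invented names; a real btree would probe the leaf page rather than a set), insertion first checks for an existing entry with the same (key, TID) pair and becomes a no-op if one is found, which is exactly the "value set back to a previous value" case:

```python
# Toy index that enforces: at most one entry per (key, tid) pair.

entries = set()

def index_insert(key, tid):
    if (key, tid) in entries:
        return False              # chain already reachable via this entry
    entries.add((key, tid))
    return True

assert index_insert(2, (1, 5)) is True    # step-1: row inserted with val=2
assert index_insert(3, (1, 5)) is True    # step-2: in-place update to val=3
assert index_insert(2, (1, 5)) is False   # step-3: back to val=2 -- reuse
```

Without the step-3 no-op, the scan would find two entries with key 2 for the same TID and return the visible tuple twice.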

>> That's a little different from your analysis.  In your step-3
>> analysis, you say that an index scan for 2 will find the step-1 tuple,
>> but that's not really right.  The index scan will find the index tuple
>> which points to the whole chain of tuples, step-3 => step-2 => step-1,
>> and it will decide which heap tuple from that chain the user can see
>> based on the user's snapshot.
>
> I think the scan will not traverse the chain if it starts after
> step-3's commit and that's what I intend to say.

I don't see why that would be true.

>>  That's OK, but we're in real trouble if
>> step-3 inserted an additional index tuple pointing to that chain
>> rather than simply noting that one already exists.  If it adds an
>> additional one, then we'll visit two index tuples for that TID.  Each
>> time, the visibility information in UNDO will cause us to return the
>> correct tuple, but we've erred by returning it twice (once per index
>> entry) rather than just once.
>
> Why will the scan traverse the UNDO chain if it starts after step-3's commit?

Why wouldn't it?  I think that if an index scan hits a page with an
UNDO chain, it always needs to traverse the whole thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-28 Thread Bruce Momjian
On Sun, Nov 27, 2016 at 09:19:06AM +0530, Amit Kapila wrote:
> At this point, index scan for value 2 will find index tuple of step-1
> (2) and will conclude 2,def as a right tuple, but actually, that is
> wrong as the step-1 (2) index tuple should not be visible to the user.
> Do you also see this as a problem, or am I missing something?  If this is a
> problem, then I think we need some form of delete-marking system for
> the index, probably transaction visibility information as well.

Yes, very similar to the problems with WARM already discussed.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] UNDO and in-place update

2016-11-27 Thread Amit Kapila
On Mon, Nov 28, 2016 at 4:50 AM, Robert Haas  wrote:
>
> Well, my original email did contain a discussion of the need for
> delete-marking.  I said that you couldn't do in-place updates when
> indexed columns were modified unless the index AMs had support for
> delete-marking, which is the same point you are making here.
>

Sorry, I had not read that part earlier, but now that I read it, I
think there is a slight difference in what I am saying.   I thought
along with delete-marking, we might need transaction visibility
information in the index as well.  BTW, it is not completely clear
whether you want a delete-marking system or whether you think we could
do without one by avoiding in-place updates; it seems to me from what
you have written that you are inclined towards having a delete-marking
system.

>  However,
> I agree that the case where the indexed column gets set back to a
> previous value while the old index tuple for that value still exists
> is particularly tricky.  I think that what we need to do there is
> avoid inserting an index entry for a given value/TID pair if an index
> entry for that value/TID pair already exists.
>

Are you saying this for indexes with a delete-marking system or for
indexes without that or for both?

> That's a little different from your analysis.  In your step-3
> analysis, you say that an index scan for 2 will find the step-1 tuple,
> but that's not really right.  The index scan will find the index tuple
> which points to the whole chain of tuples, step-3 => step-2 => step-1,
> and it will decide which heap tuple from that chain the user can see
> based on the user's snapshot.
>

I think the scan will not traverse the chain if it starts after
step-3's commit, and that's what I intended to say.

>  That's OK, but we're in real trouble if
> step-3 inserted an additional index tuple pointing to that chain
> rather than simply noting that one already exists.  If it adds an
> additional one, then we'll visit two index tuples for that TID.  Each
> time, the visibility information in UNDO will cause us to return the
> correct tuple, but we've erred by returning it twice (once per index
> entry) rather than just once.
>

Why will the scan traverse the UNDO chain if it starts after step-3's commit?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-27 Thread Robert Haas
On Sun, Nov 27, 2016 at 8:26 PM, Tsunakawa, Takayuki
 wrote:
> I see.  autovacuum is certainly almost unpredictable, at least for those who 
> are not aware of its existence and tuning.  Recently, one of our customers 
> faced the inability to perform INSERTs because of xid wraparound.
> Their workload is INSERT-heavy and (inefficiently) used autocommit to
> insert each row, which resulted in xid consumption outpacing the slow
> xid-wraparound autovacuum.

Yes, that can happen.  The new freeze map stuff in 9.6 should help
with it, even without doing anything like what I am proposing here.
But with this, there is no deferred work at all: once the transaction
is all-visible, the UNDO can be discarded and there's nothing further
to do.

>> > Furthermore, it maybe the best to be able to switch the method for
>> > each table and/or tablespace.  For example, in pgbench, history table
>> uses the current method, and other tables use the UNDO method.  Is it time
>> to introduce a pluggable storage system?
>>
>> IMHO, it's past time for that.
>
> Do you mean by "past time" that the community decided not to introduce 
> pluggable storage manager?  If it's true, that's a pity.  But I remember that 
> there was a discussion about pluggable storage manager at PGConf or some 
> other event this year.  Or, do you mean that the current approach should be 
> abandoned and the UNDO approach replace it?

I mean that we should most definitely introduce a pluggable storage
manager and that I think we should even have done it sooner, before
now.

>> I agree up to a point.  I think we need to design our own system as well
>> as we can, not just copy what others have done.  For example, the design
>> I sketched will work with all of PostgreSQL's existing index types.  You
>> need to modify each AM in order to support in-place updates when a column
>> indexed by that AM has been modified, and that's probably highly desirable,
>> but it's not a hard requirement.  I believe that's a better approach for
>> us than insisting that we have to do it in exactly the same way as some
>> other system.  Now, that doesn't mean we shouldn't learn from what works
>> well and poorly in other systems, but I think our goal here should be to
>> chart the best way forward given PostgreSQL's existing architecture and
>> its existing strengths and weaknesses, rather than to make it exactly like
>> Oracle or MySQL or anything else.  Few people on this mailing list would
>> say that either of those systems are categorically better than PostgreSQL;
>> most, I suspect, would disagree somewhat vigorously.
>
> Yes, agreed.  I didn't intend to just imitate Oracle/MySQL design.  I meant 
> that it will be better to study in advance what trouble Oracle/MySQL design 
> has caused their users, and avoid pitfalls as much as possible.  For example, 
> when I ran TPC-B benchmark against Oracle and PostgreSQL, I was embarrassed 
> by frequent deadlocks in Oracle.  It took some time for me to find out that 
> INITRANS needs to be tuned with ALTER TABLE.  PostgreSQL ran smoothly without 
> any tuning.

I agree that we want to stick with our principal of requiring as
little tuning as possible -- and where possible reduce tuning that is
required today.  I have only limited experience with it, but some of
those experiences were frustrating.

> I find your UNDO approach attractive.  On the other hand, I sometimes wonder
> where PostgreSQL is headed.  I'm sometimes asked by database users "How
> different is PostgreSQL from MySQL?"  If the UNDO approach is taken,
> PostgreSQL would appear more similar to MySQL.  I don't say that's bad, but I
> wonder how we can promote the new feature in the big picture.  For example,
> the current (VACUUM) approach would prevent PostgreSQL from becoming a
> database for OLTP/analytics mixed workloads, because long-running analytics
> queries cause table and index bloat regardless of whether those queries access
> the same data as the OLTP workload, wouldn't it?  Can we present the future of
> PostgreSQL, and its difference from MySQL, as "PostgreSQL is pursuing the
> ability to handle multiple workloads in the same database to better utilize
> data and IT resources.  The new UNDO approach is one step toward that."

I think there are a lot of differences between PostgreSQL and MySQL
other than whether UNDO is used.  For example, just in the area of
storage, MySQL uses index-organized tables, while PostgreSQL's current
heap and this proposed new type of heap are both flat.  There are
differences in the types of indexing that are offered, the datatypes
available, the features available in every area of the system, and of
course the licensing.  If we get to a point where there are no real
differences in licensing or extensibility or features or any other
area between PostgreSQL and MySQL, then of course it makes no
difference which one people use, but I don't think there is any
prospect of such a thing.

Re: [HACKERS] UNDO and in-place update

2016-11-27 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> On Thu, Nov 24, 2016 at 2:32 AM, Tsunakawa, Takayuki
>  wrote:
> > IMHO, overall, there should be pros and cons of the current approach and
> the new UNDO one (like Oracle?), depending on the workload.  Under
> update-heavy workload, the UNDO method may be better.  OTOH, under the
> mostly-INSERT workload (like data warehouse?), the current method will be
> better because it writes no log for UNDO.
> 
> The foreground operation will complete more quickly, because it won't have
> to write UNDO.  On the other hand, you'll have to set hint bits later, as
> well as freeze, which may be more expensive than writing UNDO by the time
> all is said and done.  Whether it's better to pay a foreground tax
> immediately or to do deferred work at a later time depends on things like
> whether you have quiet times during which you can catch up on the deferred
> work ... but the number of users who have gotten unpleasant surprises due
> to autovacuum kicking in during a busy period is not small.

I see.  autovacuum is certainly almost unpredictable, at least for those who 
are not aware of its existence and tuning.  Recently, one of our customers 
faced the inability to perform INSERTs because of xid wraparound.  
Their workload is INSERT-heavy and (inefficiently) used autocommit to insert 
each row, which consumed xids faster than the slow anti-wraparound autovacuum 
could keep up with.
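As an aside, the arithmetic behind that failure mode is simple: each transaction that writes consumes one xid, so batching rows per transaction reduces xid consumption proportionally (subtransactions aside).  A minimal illustration, pure arithmetic rather than a claim about any particular driver:

```python
def xids_consumed(total_rows, rows_per_transaction):
    """Each writing transaction consumes one xid; a partial final
    batch still needs its own xid, hence the ceiling division."""
    return -(-total_rows // rows_per_transaction)

rows = 1_000_000
autocommit = xids_consumed(rows, 1)        # one INSERT per transaction
batched = xids_consumed(rows, 10_000)      # 10k rows per COMMIT

print(autocommit, batched)  # 1000000 100
```

Under autocommit, a million single-row INSERTs burn a million xids; batched into 10k-row transactions, the same load burns only a hundred.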


> > Furthermore, it may be best to be able to switch the method for
> > each table and/or tablespace.  For example, in pgbench, history table
> uses the current method, and other tables use the UNDO method.  Is it time
> to introduce a pluggable storage system?
> 
> IMHO, it's past time for that.

Do you mean by "past time" that the community decided not to introduce a 
pluggable storage manager?  If so, that's a pity.  But I remember that there 
was a discussion about a pluggable storage manager at PGConf or some other 
event this year.  Or do you mean that the current approach should be abandoned 
and the UNDO approach should replace it?


> > Because PostgreSQL is a follower in the UNDO approach, I think it will
> be better to study other DBMSs well (Oracle and MySQL?).  That includes
> not only their manuals, but also whitepapers and books.  Especially, I
> expect good books to give deep knowledge on performance tuning and
> troubleshooting, from which we will be able to know the cons that Oracle's
> materials don't state.
> 
> I agree up to a point.  I think we need to design our own system as well
> as we can, not just copy what others have done.  For example, the design
> I sketched will work with all of PostgreSQL's existing index types.  You
> need to modify each AM in order to support in-place updates when a column
> indexed by that AM has been modified, and that's probably highly desirable,
> but it's not a hard requirement.  I believe that's a better approach for
> us than insisting that we have to do it in exactly the same way as some
> other system.  Now, that doesn't mean we shouldn't learn from what works
> well and poorly in other systems, but I think our goal here should be to
> chart the best way forward given PostgreSQL's existing architecture and
> its existing strengths and weaknesses, rather than to make it exactly like
> Oracle or MySQL or anything else.  Few people on this mailing list would
> say that either of those systems are categorically better than PostgreSQL;
> most, I suspect, would disagree somewhat vigorously.

Yes, agreed.  I didn't intend to just imitate Oracle/MySQL design.  I meant 
that it will be better to study in advance what trouble Oracle/MySQL design has 
caused their users, and avoid pitfalls as much as possible.  For example, when 
I ran TPC-B benchmark against Oracle and PostgreSQL, I was embarrassed by 
frequent deadlocks in Oracle.  It took some time for me to find out that 
INITRANS needs to be tuned with ALTER TABLE.  PostgreSQL ran smoothly without 
any tuning.

I find your UNDO approach attractive.  On the other hand, I sometimes wonder 
where PostgreSQL is headed.  I'm sometimes asked by database users "How 
different is PostgreSQL from MySQL?"  If the UNDO approach is taken, PostgreSQL 
would appear more similar to MySQL.  I don't say that's bad, but I wonder 
whether we can promote the new feature as part of a bigger picture.  For 
example, the current (VACUUM) approach would prevent PostgreSQL from becoming a 
database for mixed OLTP/analytics workloads, because long-running analytics 
queries cause table and index bloat regardless of whether those queries access 
the same data as the OLTP workload, wouldn't it?  Can we present the future of 
PostgreSQL, and its difference from MySQL, as "PostgreSQL aims to handle 
multiple workloads in the same database to better utilize data and IT 
resources.  The new UNDO approach is one step toward that"?

Regards
Takayuki Tsunakawa



-- 
Sent via pgsql-hackers 

Re: [HACKERS] UNDO and in-place update

2016-11-27 Thread Robert Haas
On Sat, Nov 26, 2016 at 10:49 PM, Amit Kapila  wrote:
> On Fri, Nov 25, 2016 at 11:23 PM, Bruce Momjian  wrote:
>> On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
>>> I agree up to a point.  I think we need to design our own system as
>>> well as we can, not just copy what others have done.  For example, the
>>> design I sketched will work with all of PostgreSQL's existing index
>>> types.  You need to modify each AM in order to support in-place
>>> updates when a column indexed by that AM has been modified, and that's
>>> probably highly desirable, but it's not a hard requirement.
>>
>> I feel you are going to get into the problem of finding the index entry
>> for the old row --- the same thing that is holding back more aggressive
>> WARM updates.
>>
>
> I think I see a problem in that regard when the index column is
> updated twice with the same value.  For example,
>
> table -  test_undo(c1 int, c2 char(10));
> index -  idx_tu on test_undo(c1);
>
> Step-1
> Insert (2, 'abc')
> Heap - 2, abc
> Index - 2
> Commit;
>
> Step -2
> Update (3,def) where c1 = 2;
> Heap - 3, def
> Index - 3 (another value will be 2)
> Undo - 2, abc
> Commit;
>
> At this point a new query which scans the index for value 2 will reach
> the tuple (3,def).  As the scan started after all commits, tuple
> (3,def) appears okay, but it can compare the key value in the tuple to
> detect that this is not the right tuple.  So, all is well till here.
>
> Step -3
> Update (2) where c1 = 3;
> Heap - 2, def
> Index - 2 (another values will be 3,2)
> Undo - 3;
> Commit
>
> At this point, index scan for value 2 will find index tuple of step-1
> (2) and will conclude 2,def as a right tuple, but actually, that is
> wrong as the step-1 (2) index tuple should not be visible to the user.
> Do you also see this as a problem, or am I missing something?  If this is a
> problem, then I think we need some form of delete marking system for
> the index, probably transaction visibility information as well.

Well, my original email did contain a discussion of the need for
delete-marking.  I said that you couldn't do in-place updates when
indexed columns were modified unless the index AMs had support for
delete-marking, which is the same point you are making here.  However,
I agree that the case where the indexed column gets set back to a
previous value while the old index tuple for that value still exists
is particularly tricky.  I think that what we need to do there is
avoid inserting an index entry for a given value/TID pair if an index
entry for that value/TID pair already exists.

That's a little different from your analysis.  In your step-3
analysis, you say that an index scan for 2 will find the step-1 tuple,
but that's not really right.  The index scan will find the index tuple
which points to the whole chain of tuples, step-3 => step-2 => step-1,
and it will decide which heap tuple from that chain the user can see
based on the user's snapshot.  That's OK, but we're in real trouble if
step-3 inserted an additional index tuple pointing to that chain
rather than simply noting that one already exists.  If it adds an
additional one, then we'll visit two index tuples for that TID.  Each
time, the visibility information in UNDO will cause us to return the
correct tuple, but we've erred by returning it twice (once per index
entry) rather than just once.
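The rule proposed here -- skip the index insertion when an identical value/TID pair already exists -- can be sketched with a toy in-memory index.  All names below are illustrative, not PostgreSQL internals:

```python
class ToyIndex:
    def __init__(self):
        self.entries = set()        # set of (key, tid) pairs

    def insert(self, key, tid):
        # Before inserting, check whether this exact (key, tid) pair
        # already exists; if so, the existing entry already points at
        # the heap tuple chain, and a second entry would make scans
        # visit that chain twice.
        if (key, tid) in self.entries:
            return False            # pair already present: skip duplicate
        self.entries.add((key, tid))
        return True

    def scan(self, key):
        # every matching entry causes one visit to the heap chain at tid
        return [tid for (k, tid) in self.entries if k == key]

idx = ToyIndex()
tid = (1, 1)                        # heap (block, offset) of the tuple chain
idx.insert(2, tid)                  # step 1: insert with c1 = 2
idx.insert(3, tid)                  # step 2: update to c1 = 3
inserted = idx.insert(2, tid)       # step 3: update back to c1 = 2
print(inserted, idx.scan(2))        # False [(1, 1)] -- chain visited once
```

With the duplicate suppressed, a scan for 2 finds a single index tuple pointing at the chain, and visibility resolution in UNDO returns the right heap tuple exactly once.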

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-26 Thread Amit Kapila
On Fri, Nov 25, 2016 at 11:23 PM, Bruce Momjian  wrote:
> On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
>> I agree up to a point.  I think we need to design our own system as
>> well as we can, not just copy what others have done.  For example, the
>> design I sketched will work with all of PostgreSQL's existing index
>> types.  You need to modify each AM in order to support in-place
>> updates when a column indexed by that AM has been modified, and that's
>> probably highly desirable, but it's not a hard requirement.
>
> I feel you are going to get into the problem of finding the index entry
> for the old row --- the same thing that is holding back more aggressive
> WARM updates.
>

I think I see a problem in that regard when the index column is
updated twice with the same value.  For example,

table -  test_undo(c1 int, c2 char(10));
index -  idx_tu on test_undo(c1);

Step-1
Insert (2, 'abc')
Heap - 2, abc
Index - 2
Commit;

Step -2
Update (3,def) where c1 = 2;
Heap - 3, def
Index - 3 (another value will be 2)
Undo - 2, abc
Commit;

At this point a new query which scans the index for value 2 will reach
the tuple (3,def).  As the scan started after all commits, tuple
(3,def) appears okay, but it can compare the key value in the tuple to
detect that this is not the right tuple.  So, all is well till here.

Step -3
Update (2) where c1 = 3;
Heap - 2, def
Index - 2 (another values will be 3,2)
Undo - 3;
Commit

At this point, index scan for value 2 will find index tuple of step-1
(2) and will conclude 2,def as a right tuple, but actually, that is
wrong as the step-1 (2) index tuple should not be visible to the user.
Do you also see this as a problem, or am I missing something?  If this is a
problem, then I think we need some form of delete marking system for
the index, probably transaction visibility information as well.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-25 Thread Greg Stark
On 24 November 2016 at 23:03, Robert Haas  wrote:
>> For snapshot isolation Oracle has yet a *third* copy of the data in a
>> space called the "rollback segment(s)".
>
> My understanding is that this isn't correct.  I think the rollback
> segments are what they call the thing that stores UNDO.  See e.g.
> http://ss64.com/ora/syntax-redo.html

It looks like you're right. Rollback segments and Undo segments are
two different pieces of code but one is just the old way and the other
the new way of managing the same data. You can't have both active in
the same database at the same time. I'm a bit confused because I
distinctly remembered an UNDO log back in the 8i days as well but
apparently that's just me imagining things. UNDO segments were
introduced in 9i.

This explained a bunch
http://satya-dba.blogspot.ie/2009/09/undo-tablespace-undo-management.html


-- 
greg




Re: [HACKERS] UNDO and in-place update

2016-11-25 Thread Bruce Momjian
On Thu, Nov 24, 2016 at 12:23:28PM -0500, Robert Haas wrote:
> I agree up to a point.  I think we need to design our own system as
> well as we can, not just copy what others have done.  For example, the
> design I sketched will work with all of PostgreSQL's existing index
> types.  You need to modify each AM in order to support in-place
> updates when a column indexed by that AM has been modified, and that's
> probably highly desirable, but it's not a hard requirement.

I feel you are going to get into the problem of finding the index entry
for the old row --- the same thing that is holding back more aggressive
WARM updates.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] UNDO and in-place update

2016-11-25 Thread Amit Kapila
On Fri, Nov 25, 2016 at 8:03 PM, Amit Kapila  wrote:
>>
>
> Another way to do is to write UNDO log for split operation (with exact
> record locations on pages that are split) so that you don't need to
> perform this expensive search operation.  I can understand writing
> UNDO for split could be costly but so is writing WAL for it.  I think
> for UNDO, we should follow a generic rule as for heap
>

typo.
/heap/WAL

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-25 Thread Amit Kapila
On Thu, Nov 24, 2016 at 10:09 PM, Robert Haas  wrote:
>
> I don't really understand how a system of this type copes with page
> splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
> I now start two overlapping transactions, one inserting all the
> positive odd values < 1e6 and the other inserting all the positive
> even values < 1e6 that are not already present.  The insertions are
> interleaved with each other.  After they get done inserting, what does
> the 3D time traveling btree look like at this point?
>
> The problem I'm getting at here is that the heap, unlike an index, is
> unordered.  So when I've got all values 1..1e6, they've got to be in a
> certain order.  Necessarily, the values from the in-flight
> transactions are interleaved with each other and with the old data.
> If they're not, I can no longer support efficient searches by the
> in-flight transactions, plus I have an unbounded amount of cleanup
> work to do at commit time.  But if I *do* have all of those values
> interleaved in the proper order, then an abort leaves fragmented free
> space.
>

Yeah, that's right, but on the next insert into the same page, we can
defragment the page and reclaim all the space.  We might want to perform
such a cleanup only in certain cases like when the remaining space in
the page is below a certain threshold.
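That lazy strategy -- defragment only on the next insert, and only when free space falls below a threshold -- might look roughly like this toy model (page layout, sizes, and threshold are invented for illustration):

```python
PAGE_SIZE = 100
DEFRAG_THRESHOLD = 20    # defragment when free space drops below this

class ToyPage:
    def __init__(self):
        self.items = []              # (size, live) pairs, in physical order

    def free(self):
        return PAGE_SIZE - sum(size for size, _ in self.items)

    def defragment(self):
        # reclaim space held by dead items (e.g. from aborted transactions)
        self.items = [(s, live) for s, live in self.items if live]

    def insert(self, size):
        if self.free() < DEFRAG_THRESHOLD:
            self.defragment()        # pay the cost only when space is tight
        if self.free() < size:
            return False             # caller must split or pick another page
        self.items.append((size, True))
        return True

page = ToyPage()
for _ in range(9):
    page.insert(10)                  # page is now 90% full
# an aborting transaction leaves four of the nine items dead
page.items = [(s, i % 2 == 0) for i, (s, _) in enumerate(page.items)]
print(page.insert(30))               # True: defragmentation freed 40 bytes
```

The point of the threshold is that a mostly-empty page never pays the defragmentation cost; only inserts that would otherwise fail (or come close) trigger the cleanup.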

>  I can go and kill all of the inserted tuples during UNDO of an
> aborted transactions, but because page splits have happened, the index
> tuples I need to kill may not be on the same pages where I inserted
> them, so I might need to just search from the top of the tree, so it's
> expensive.
>

Another way to do is to write UNDO log for split operation (with exact
record locations on pages that are split) so that you don't need to
perform this expensive search operation.  I can understand writing
UNDO for split could be costly but so is writing WAL for it.  I think
for UNDO, we should follow a generic rule as for heap which is that
any change in the page should have it's corresponding UNDO.

>  And even if I do it anyway, the pages are left half-full.
> The point is that the splits are the joint result of the two
> concurrent transactions taken together - you can't blame individual
> splits on individual transactions.
>

Sure, but you are talking about an extreme case, so it might be okay
to leave pages half-full in such cases, next inserts will consume this
space.  Also aborts are less frequent operations, so the impact should
be minimal.

>
> Do you see some trick here that I'm missing?
>

I think we should be able to find some way to tackle the problems you
have listed above as there are databases (including Oracle) which
write UNDO for indexes (btree's) as well.  However, I think here the
crucial question is whether the only-heap-based-undo design can give
us consistent data for all possible cases.  If so, then it is better to
do that in the first phase and then later we can implement UNDO for
indexes as well.   As of now, nobody has reported any such problem, so
maybe we can stick with the only-heap-based-undo design.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Pavel Stehule
2016-11-25 1:44 GMT+01:00 Robert Haas :

> On Thu, Nov 24, 2016 at 6:20 PM, Pavel Stehule 
> wrote:
> >> I think that the whole emphasis on whether and to what degree this is
> >> like Oracle is somewhat misplaced.  I would look at it a different
> >> way.  We've talked many times over the years about how PostgreSQL is
> >> optimized for aborts.  Everybody that I've heard comment on that issue
> >> thinks that is a bad thing.
> >
> >
> > again this depends on usage - when you have a possibility to run VACUUM,
> > then this strategy is better.
> >
> > The fast aborts is one pretty good feature for stable usage.
> >
> > Isn't possible to use UNDO log (or something similar) for VACUUM?
> ROLLBACK
> > should be fast, but
> > VACUUM can do more work?
>
> I think that in this design we wouldn't use VACUUM at all.  However,
> if what you are saying is that we should try to make aborts
> near-instantaneous by pushing UNDO actions into the background, I
> agree entirely.  InnoDB already does that, IIUC.
>

OK, it can be another process, one that can be more aggressive and less
limited than VACUUM.

Regards

Pavel


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Robert Haas
On Thu, Nov 24, 2016 at 6:20 PM, Pavel Stehule  wrote:
>> I think that the whole emphasis on whether and to what degree this is
>> like Oracle is somewhat misplaced.  I would look at it a different
>> way.  We've talked many times over the years about how PostgreSQL is
>> optimized for aborts.  Everybody that I've heard comment on that issue
>> thinks that is a bad thing.
>
>
> again this depends on usage - when you have a possibility to run VACUUM,
> then this strategy is better.
>
> The fast aborts is one pretty good feature for stable usage.
>
> Isn't possible to use UNDO log (or something similar) for VACUUM? ROLLBACK
> should be fast, but
> VACUUM can do more work?

I think that in this design we wouldn't use VACUUM at all.  However,
if what you are saying is that we should try to make aborts
near-instantaneous by pushing UNDO actions into the background, I
agree entirely.  InnoDB already does that, IIUC.
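The hand-off described here -- ROLLBACK merely enqueues its UNDO actions and a background process applies them later -- can be sketched as follows.  This is a toy model; a real implementation would need crash-safe queuing, locking, and visibility handling:

```python
import queue
import threading

undo_queue = queue.Queue()
applied = []

def undo_worker():
    # background process: drains queued undo work at its own pace
    while True:
        actions = undo_queue.get()
        if actions is None:          # shutdown sentinel
            break
        for action in actions:
            applied.append(action)   # stand-in for "revert this change"
        undo_queue.task_done()

def rollback(pending_undo_actions):
    # near-instantaneous: hand the work off instead of doing it inline
    undo_queue.put(pending_undo_actions)

worker = threading.Thread(target=undo_worker, daemon=True)
worker.start()
rollback(["undo page 1 change", "undo page 2 change"])
undo_queue.join()                    # in reality, cleanup is asynchronous
undo_queue.put(None)
worker.join()
print(applied)  # ['undo page 1 change', 'undo page 2 change']
```

The aborting session returns as soon as the work is queued; the cost of actually reverting pages is paid off the critical path.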

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Pavel Stehule
> I think that the whole emphasis on whether and to what degree this is
> like Oracle is somewhat misplaced.  I would look at it a different
> way.  We've talked many times over the years about how PostgreSQL is
> optimized for aborts.  Everybody that I've heard comment on that issue
> thinks that is a bad thing.


again this depends on usage - when you have a possibility to run VACUUM,
then this strategy is better.

The fast aborts is one pretty good feature for stable usage.

Isn't possible to use UNDO log (or something similar) for VACUUM? ROLLBACK
should be fast, but
VACUUM can do more work?


> I am proposing a design that is optimized
> for commits; that is, if the transaction commits, none of the pages it
> modified need to be dirtied again afterwards at any point.  I think
> that's an extremely important property and it's one that we are very
> far from having today.  It necessarily implies that you cannot store
> the old row versions in the heap, because if you do, then you are
> going to have to dirty the pages again to get rid of them (unless you
> prefer to just leak the space forever).  Now there is plenty of room
> for argument about whether the specific design I proposed is going to
> be any good, and I think that would be quite an interesting discussion
> to have.  But I think if you say, well, you know, the fact that we may
> rewrite the same page 5 or 6 times after a modification to set hint
> bits (a few at a time) and HOT prune and set all-visible and freeze
> isn't really any big deal, you must live in a world where the write
> bandwidth of the I/O channel is a lot less of a problem than it is in
> mine.  And we've been around and around on all of that stuff and
> people have come up with various ideas to improve the situation - some
> of which have been implemented - but many of those ideas involve
> unpleasant trade-offs and so the core problems remain.  If we don't do
> something more fundamental, we're still going to be up against the
> same basic issues ten years from now.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>


Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Robert Haas
On Thu, Nov 24, 2016 at 11:06 AM, Greg Stark  wrote:
> Fwiw, Oracle does not use the undo log for snapshot fetches. It's used
> only for transaction rollback and recovery.
>
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". When you update a row in a
> block you save the whole block in the rollback segment. When you try
> to access a block you check if the CSN -- which is basically
> equivalent to our LSN -- is newer than your snapshot and if it is you
> fetch the old version of the block from the rollback.

My understanding is that this isn't correct.  I think the rollback
segments are what they call the thing that stores UNDO.  See e.g.
http://ss64.com/ora/syntax-redo.html

Having said that, I'm by no means an Oracle expert.
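The per-block scheme described above -- keep only the newest version of a block in the table, and fall back to a saved older block image when a reader's snapshot predates the block's CSN -- can be sketched as a toy model.  Names and structure here are illustrative, not Oracle's actual implementation:

```python
current_blocks = {}     # block_no -> (csn, contents)
rollback_area = {}      # block_no -> list of (csn, contents), oldest first

def update_block(block_no, new_contents, csn):
    if block_no in current_blocks:
        # save the entire old block before overwriting, per the scheme
        rollback_area.setdefault(block_no, []).append(current_blocks[block_no])
    current_blocks[block_no] = (csn, new_contents)

def read_block(block_no, snapshot_csn):
    csn, contents = current_blocks[block_no]
    if csn <= snapshot_csn:
        return contents              # current version is visible
    # block is too new for this snapshot: walk back through saved versions
    for old_csn, old_contents in reversed(rollback_area.get(block_no, [])):
        if old_csn <= snapshot_csn:
            return old_contents
    raise LookupError("snapshot too old")  # old version already recycled

update_block(7, "v1", csn=10)
update_block(7, "v2", csn=20)
print(read_block(7, snapshot_csn=15))  # v1
print(read_block(7, snapshot_csn=25))  # v2
```

Note how visibility is decided per block rather than per row, and how a long-lived snapshot can fail outright once the saved block versions it needs have been recycled.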

> Essentially their MVCC is done on a per-block level rather than a
> per-row level and they keep only the newest version of the block in
> the table, the rest are in the rollback segment.  For what it's worth
> I think our approach is cleaner and more flexible. They had a lot of
> trouble with their approach over the years and it works well only
> because they invested an enormous amount of development in it and also
> because people throw a lot of hardware at it too.

People do throw a lot of hardware at Oracle, but I don't think I'd
accept the contention that Oracle generally gets less performance out
of the same amount of hardware than PostgreSQL.  There are certainly
some things we often do faster -- including inserts, deletes, and
transaction aborts -- but their implementation of updates seems to be
very fast.  Obviously, a lot depends on configuration and workload, so
it's hard to make general statements, but I think it's more correct to
imagine that people throw a lot of hardware at Oracle because that's
where their big, critical databases are than to imagine that it's
because Oracle is a resource hog.

> I think the main use case we have trouble with is actually the "update
> every row in the table" type of update which requires we write to
> every block, plus a second copy of every block, plus write full pages
> of both copies, then later set hint bits dirtying pages again and
> generating more full pages writes, then later come along and vacuum
> which requires two more writes of every block, etc. If we had a
> solution for the special case of an update that replaces every row in
> a page, I think that would complement HOT nicely and go a long way
> towards fixing our issues.

This case is certainly a problem, but I don't believe it's anywhere
close to being our only problem.  My experience is that our system
works fairly well when there are only a few indexes.  However, with
heavily indexed tables, all of your updates become non-HOT and then
bad things happen.  So you can either leave out indexes that you need
for good query performance and then you're hosed, or you can add those
indexes and then you're hosed anyway because of the bloat and write
amplification associated with non-HOT updates.

Also, write-heavy workloads that perform OK as long as there are no
long-lived snapshots often enter a death spiral when you run reporting
queries on the same system.  Finer-grained snapshot tracking would
help with that problem, but it doesn't solve it: now a single
long-lived snapshot can "only" cause the database to double in size
instead of increasing without bound, a scant consolation.  Nor does it
change the fundamental calculus that once a table gets bloated, it's
very painful to de-bloat it.  The appeal of a design that supports
in-place update is that you don't bloat the table in the first place.
You still have the bloat, of course, but it's off in a separate data
structure that is engineered for efficient deletion.

I think that the whole emphasis on whether and to what degree this is
like Oracle is somewhat misplaced.  I would look at it a different
way.  We've talked many times over the years about how PostgreSQL is
optimized for aborts.  Everybody that I've heard comment on that issue
thinks that is a bad thing.  I am proposing a design that is optimized
for commits; that is, if the transaction commits, none of the pages it
modified need to be dirtied again afterwards at any point.  I think
that's an extremely important property and it's one that we are very
far from having today.  It necessarily implies that you cannot store
the old row versions in the heap, because if you do, then you are
going to have to dirty the pages again to get rid of them (unless you
prefer to just leak the space forever).  Now there is plenty of room
for argument about whether the specific design I proposed is going to
be any good, and I think that would be quite an interesting discussion
to have.  But I think if you say, well, you know, the fact that we may
rewrite the same page 5 or 6 times after a modification to set hint
bits (a few at a time) and HOT prune and set all-visible and freeze
isn't really any big 

Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Robert Haas
On Thu, Nov 24, 2016 at 2:32 AM, Tsunakawa, Takayuki
 wrote:
> IMHO, overall, there should be pros and cons of the current approach and the 
> new UNDO one (like Oracle?), depending on the workload.  Under update-heavy 
> workload, the UNDO method may be better.  OTOH, under the mostly-INSERT 
> workload (like data warehouse?), the current method will be better because it 
> writes no log for UNDO.

The foreground operation will complete more quickly, because it won't
have to write UNDO.  On the other hand, you'll have to set hint bits
later, as well as freeze, which may be more expensive than writing
UNDO by the time all is said and done.  Whether it's better to pay
a foreground tax immediately or to do deferred work at a later time
depends on things like whether you have quiet times during which you
can catch up on the deferred work ... but the number of users who have
gotten unpleasant surprises due to autovacuum kicking in during a busy
period is not small.

> Furthermore, it may be best to be able to switch the method for each table 
> and/or tablespace.  For example, in pgbench, history table uses the current 
> method,
> and other tables use the UNDO method.  Is it time to introduce a pluggable 
> storage system?

IMHO, it's past time for that.

> Because PostgreSQL is a follower in the UNDO approach, I think it will be 
> better to study other DBMSs well (Oracle and MySQL?).  That includes not only 
> their manuals, but also whitepapers and books.  Especially, I expect good 
> books to give deep knowledge on performance tuning and troubleshooting, from 
> which we will be able to know the cons that Oracle's materials don't state.

I agree up to a point.  I think we need to design our own system as
well as we can, not just copy what others have done.  For example, the
design I sketched will work with all of PostgreSQL's existing index
types.  You need to modify each AM in order to support in-place
updates when a column indexed by that AM has been modified, and that's
probably highly desirable, but it's not a hard requirement.  I believe
that's a better approach for us than insisting that we have to do it
in exactly the same way as some other system.  Now, that doesn't mean
we shouldn't learn from what works well and poorly in other systems,
but I think our goal here should be to chart the best way forward
given PostgreSQL's existing architecture and its existing strengths
and weaknesses, rather than to make it exactly like Oracle or MySQL or
anything else.  Few people on this mailing list would say that either
of those systems are categorically better than PostgreSQL; most, I
suspect, would disagree somewhat vigorously.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Thomas Kellerer
> FWIW, while this is basically true, the idea of repurposing UNDO to be
> usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
> nothing about MVCC.
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". 

UNDO and rollback segment are the same thing in Oracle. 

In older versions it was just called the "rollback segment"; I think starting
with Oracle 10g they called it UNDO.

> Fwiw, Oracle does not use the undo log for snapshot fetches. 
> It's used only for transaction rollback and recovery.

As UNDO and "rollback" are the same they do use the UNDO information for
MVCC:

http://docs.oracle.com/database/121/CNCPT/consist.htm#GUID-00A3688F-1219-423C-A5ED-4B8F25BEEAFB__BABFDBAJ








Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Bruce Momjian
On Thu, Nov 24, 2016 at 04:06:14PM +, Greg Stark wrote:
> For snapshot isolation Oracle has yet a *third* copy of the data in a
> space called the "rollback segment(s)". When you update a row in a
> block you save the whole block in the rollback segment. When you try
> to access a block you check if the CSN -- which is basically
> equivalent to our LSN -- is newer than your snapshot and if it is you
> fetch the old version of the block from the rollback.

I assume they can do this with good performance because, unlike UNDO,
they don't need to fsync the pages kept for snapshot visibility --- if
the server crashes, you don't need them.

> Essentially their MVCC is done on a per-block level rather than a
> per-row level and they keep only the newest version of the block in
> the table, the rest are in the rollback segment.  For what it's worth
> I think our approach is cleaner and more flexible. They had a lot of
> trouble with their approach over the years and it works well only
> because they invested an enormous amount of development in it and also
> because people throw a lot of hardware at it too.
> 
> I think the main use case we have trouble with is actually the "update
> every row in the table" type of update which requires we write to
> every block, plus a second copy of every block, plus write full pages
> of both copies, then later set hint bits dirtying pages again and
> generating more full pages writes, then later come along and vacuum
> which requires two more writes of every block, etc. If we had a
> solution for the special case of an update that replaces every row in
> a page, I think that would complement HOT nicely and go a long way
> towards fixing our issues.

Any ideas how we could optimize the update-all workload?  Create a new
heap file and indexes?

Our previous worst-case workload was a single row that was updated
repeatedly, but HOT fixed that for non-indexed rows, and WARM will
improve it for indexed rows.

One of the big takeaways when writing my MVCC talk is that no matter how
much we optimize UPDATE, we still need cleanup for deletes and aborted
transactions.

I am hopeful we can continue reducing the number of times we write a
page for maintenance purposes.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Robert Haas
On Wed, Nov 23, 2016 at 5:18 PM, Thomas Munro
 wrote:
> On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan  wrote:
>> * Our behavior with many duplicates in secondary indexes is pretty bad
>> in general, I suspect.
>
> From the pie-in-the-sky department:  I believe there are
> snapshot-based systems that don't ever have more than one entry for a
> logical row in a btree.  Instead, they do UNDO-log based snapshot
> reconstruction for btrees too, generalising what is being discussed
> for heap tables in this thread.  That means that whenever you descend
> a btree, at each page you have to be prepared to go and look in the
> UNDO log if necessary on a page-by-page basis.  Alternatively, some
> systems use a shadow table rather than an UNDO log for the old
> entries, but still privilege queries that need the latest version by
> keeping only those in the btree.  I don't know the details but suspect
> that both of those approaches might require CSN (= LSN?) based
> snapshots, so that you can make page-level visibility determinations
> while traversing the 'current' btree based on a page header.  That
> would reduce both kinds of btree bloat: the primary kind being
> entries for dead tuples that still exist in the heap, and the
> secondary kind being the resultant permanent btree fragmentation that
> remains even after vacuum removes them.  Presumably once you have all
> that, you can allow index-only-scans without a visibility map.  Also,
> I suspect that such a "3D time travelling" btree would be a
> requirement for index ordered tables in a snapshot-based RDBMS.

I don't really understand how a system of this type copes with page
splits.  Suppose there are entries 1000, 2000, ..., 1e6 in the btree.
I now start two overlapping transactions, one inserting all the
positive odd values < 1e6 and the other inserting all the positive
even values < 1e6 that are not already present.  The insertions are
interleaved with each other.  After they get done inserting, what does
the 3D time traveling btree look like at this point?

The problem I'm getting at here is that the heap, unlike an index, is
unordered.  So when I've got all values 1..1e6, they've got to be in a
certain order.  Necessarily, the values from the in-flight
transactions are interleaved with each other and with the old data.
If they're not, I can no longer support efficient searches by the
in-flight transactions, plus I have an unbounded amount of cleanup
work to do at commit time.  But if I *do* have all of those values
interleaved in the proper order, then an abort leaves fragmented free
space.  I can go and kill all of the inserted tuples during UNDO of an
aborted transactions, but because page splits have happened, the index
tuples I need to kill may not be on the same pages where I inserted
them, so I might need to just search from the top of the tree, so it's
expensive.  And even if I do it anyway, the pages are left half-full.
The point is that the splits are the joint result of the two
concurrent transactions taken together - you can't blame individual
splits on individual transactions.

Another way to put this is - if you could make all of the index
operations appear to happen instantaneously at commit time, then I see
how we could make what you're talking about work.  And if you never
needed to split pages, then you could do that in the same way that you
can do it for the heap.  But that's not realistic.

Even for an UNDO-based heap, we have a mild form of this problem.
Suppose we have a full page and delete a tuple, creating free space.
Now another transaction inserts a tuple, using up that free space.
Well, if the first transaction aborts and the second one commits,
we've got trouble.  So we can't allow that.  We have to keep track of
the amount of space that could potentially be gobbled up by UNDO
actions and leave enough space on the page to guarantee that such
actions will succeed.  If that means that some new tuple has to be
placed on a different page instead, so be it.  (This could create some
minor bloat, but I think it's not really that bad.)  For an index, we
can't manage so easily, because we don't have the same kind of
flexibility about where to put inserted tuples.
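The space-reservation rule sketched above can be illustrated roughly as follows. This is a hypothetical sketch in Python, not the proposed implementation; all structures and names (Page, reserved_for_undo, and so on) are invented for illustration:

```python
# Illustrative sketch: a page refuses a new insertion unless enough free
# space remains to guarantee that pending UNDO actions can still succeed.
# A deleted tuple's space stays reserved until the deleting transaction
# resolves, because an abort could put the tuple back.

PAGE_SIZE = 8192  # hypothetical page size in bytes

class Page:
    def __init__(self):
        self.used = 0
        self.reserved_for_undo = 0  # space an abort might reclaim

    def free_space(self):
        return PAGE_SIZE - self.used - self.reserved_for_undo

    def delete_tuple(self, size):
        # Space is moved to the reservation, not handed out, until the
        # deleting transaction commits or aborts.
        self.used -= size
        self.reserved_for_undo += size

    def try_insert(self, size):
        if size > self.free_space():
            return False  # caller must place the tuple on another page
        self.used += size
        return True

    def finish_delete(self, size, committed):
        # On commit the reservation becomes real free space; on abort
        # the tuple is reinstated into its reserved slot.
        self.reserved_for_undo -= size
        if not committed:
            self.used += size
```

With this accounting, a tuple deleted from a full page does not open the page up for insertions until the delete commits, which is exactly the guarantee described above.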

Do you see some trick here that I'm missing?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Greg Stark
On 23 November 2016 at 04:28, Peter Geoghegan  wrote:
> On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas  wrote:
>> This basic DO-UNDO-REDO protocol has been well-understood for
>> decades.
>
> FWIW, while this is basically true, the idea of repurposing UNDO to be
> usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
> nothing about MVCC.


Fwiw, Oracle does not use the undo log for snapshot fetches. It's used
only for transaction rollback and recovery.

For snapshot isolation Oracle has yet a *third* copy of the data in a
space called the "rollback segment(s)". When you update a row in a
block you save the whole block in the rollback segment. When you try
to access a block you check if the CSN -- which is basically
equivalent to our LSN -- is newer than your snapshot and if it is you
fetch the old version of the block from the rollback.
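The block-level snapshot read described here can be sketched roughly as below. This is only an illustration of the idea under stated assumptions (a single CSN per block version, a linked chain of pre-images), not Oracle's actual implementation; all names are hypothetical:

```python
# Hypothetical sketch of block-level snapshot reads: if the current
# block's CSN is newer than the snapshot, walk back through the saved
# pre-images (the "rollback segment") to an old-enough block version.

class BlockVersion:
    def __init__(self, csn, data, prior=None):
        self.csn = csn      # commit sequence number of last modification
        self.data = data
        self.prior = prior  # pre-image saved when the block was updated

def read_block_for_snapshot(current, snapshot_csn):
    """Return the newest block version visible to the snapshot."""
    block = current
    while block.csn > snapshot_csn:
        if block.prior is None:
            # Pre-image was recycled: the classic "snapshot too old".
            raise LookupError("snapshot too old")
        block = block.prior
    return block

# A block updated at CSN 10, then 20, then 30:
v1 = BlockVersion(10, "old")
v2 = BlockVersion(20, "mid", prior=v1)
v3 = BlockVersion(30, "new", prior=v2)

print(read_block_for_snapshot(v3, 25).data)  # -> mid
print(read_block_for_snapshot(v3, 30).data)  # -> new
```

Note that the visibility check is per block, not per row, which matches the point below about their MVCC being done at the block level.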

Essentially their MVCC is done on a per-block level rather than a
per-row level and they keep only the newest version of the block in
the table, the rest are in the rollback segment.  For what it's worth
I think our approach is cleaner and more flexible. They had a lot of
trouble with their approach over the years and it works well only
because they invested an enormous amount of development in it and also
because people throw a lot of hardware at it too.

I think the main use case we have trouble with is actually the "update
every row in the table" type of update which requires we write to
every block, plus a second copy of every block, plus write full pages
of both copies, then later set hint bits dirtying pages again and
generating more full pages writes, then later come along and vacuum
which requires two more writes of every block, etc. If we had a
solution for the special case of an update that replaces every row in
a page, I think that would complement HOT nicely and go a long way
towards fixing our issues.

Incidentally the "Interested transaction list" is for locking rows for
updates and it's basically similar to what we've discussed before of
having a "most frequent xmin" in the header and then a bit indicating
the xmin is missing from the row header. Except in their case they
don't need it for the actual xmin/xmax because their visibility is
done per-block, only the transient lock state.

-- 
greg




Re: [HACKERS] UNDO and in-place update

2016-11-24 Thread Bruce Momjian
On Wed, Nov 23, 2016 at 11:35:38PM -0800, Peter Geoghegan wrote:
> On Wed, Nov 23, 2016 at 11:32 PM, Tsunakawa, Takayuki
>  wrote:
> > IMHO, overall, there should be pros and cons of the current approach and 
> > the new UNDo one (like Oracle?), depending on the workload.  Under 
> > update-heavy workload, the UNDO method may be better.  OTOH, under the 
> > mostly-INSERT workload (like data warehouse?), the current method will be 
> > better because it writes no log for UNDO.
> 
> I believe that you are correct about that.

This is probably similar to our use of double-buffering, using
shared_buffers and the kernel cache --- in some cases, the
double-buffering is better (variable workloads), while in other cases a
single cache is best.

We have had trouble figuring out if we need to support both single and
double caching methods, and when to recommend one over the other.  Seems
UNDO would have the same complexity.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: [HACKERS] UNDO and in-place update

2016-11-23 Thread Peter Geoghegan
On Wed, Nov 23, 2016 at 11:32 PM, Tsunakawa, Takayuki
 wrote:
> IMHO, overall, there should be pros and cons of the current approach and the 
> new UNDo one (like Oracle?), depending on the workload.  Under update-heavy 
> workload, the UNDO method may be better.  OTOH, under the mostly-INSERT 
> workload (like data warehouse?), the current method will be better because it 
> writes no log for UNDO.

I believe that you are correct about that.


-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-23 Thread Tsunakawa, Takayuki
IMHO, overall, there should be pros and cons of the current approach and the 
new UNDo one (like Oracle?), depending on the workload.  Under update-heavy 
workload, the UNDO method may be better.  OTOH, under the mostly-INSERT 
workload (like data warehouse?), the current method will be better because it 
writes no log for UNDO.

Furthermore, it may be best to be able to switch the method for each table 
and/or tablespace.  For example, in pgbench, the history table could use the 
current method and the other tables the UNDO method.  Is it time to introduce 
a pluggable storage system?

Because PostgreSQL is a follower in the UNDO approach, I think it will be 
better to study other DBMSs well (Oracle and MySQL?).  That includes not only 
their manuals, but also whitepapers and books.  Especially, I expect good books 
to give deep knowledge on performance tuning and troubleshooting, from which we 
will be able to know the cons that Oracle's materials don't state.

Regards
Takayuki Tsunakawa








Re: [HACKERS] UNDO and in-place update

2016-11-23 Thread Thomas Munro
On Wed, Nov 23, 2016 at 6:01 PM, Peter Geoghegan  wrote:
> * Our behavior with many duplicates in secondary indexes is pretty bad
> in general, I suspect.

From the pie-in-the-sky department:  I believe there are
snapshot-based systems that don't ever have more than one entry for a
logical row in a btree.  Instead, they do UNDO-log based snapshot
reconstruction for btrees too, generalising what is being discussed
for heap tables in this thread.  That means that whenever you descend
a btree, at each page you have to be prepared to go and look in the
UNDO log if necessary on a page-by-page basis.  Alternatively, some
systems use a shadow table rather than an UNDO log for the old
entries, but still privilege queries that need the latest version by
keeping only those in the btree.  I don't know the details but suspect
that both of those approaches might require CSN (= LSN?) based
snapshots, so that you can make page-level visibility determinations
while traversing the 'current' btree based on a page header.  That
would reduce both kinds of btree bloat: the primary kind being
entries for dead tuples that still exist in the heap, and the
secondary kind being the resultant permanent btree fragmentation that
remains even after vacuum removes them.  Presumably once you have all
that, you can allow index-only-scans without a visibility map.  Also,
I suspect that such a "3D time travelling" btree would be a
requirement for index ordered tables in a snapshot-based RDBMS.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-23 Thread Robert Haas
On Tue, Nov 22, 2016 at 11:18 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.

In my design, there's one UNDO log per transaction, so if a page has
been recently modified only by a single transaction whose effects are
visible to your snapshot, you can very quickly decide that you don't
need to look at UNDO at all.  If it's been recently modified by
multiple transactions all of which are visible to your snapshot, you
can consult the TPD and then decide that none of those transactions
are interesting and therefore their UNDO doesn't need to be examined.

When there are transactions whose effects are not visible to your
snapshot, the cost is proportional to the number of times those
transactions have modified the page.  Therefore, the "bad" case for
this design is when the same page gets updated a large number of times
by one or more concurrent transactions.  If the transaction you can't
see has made 437 separate modifications to the page, you're going to
need to look at all 437 UNDO records to figure out what you can see.
That is likely to suck.
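The fast path described above can be sketched as follows. This is a rough illustration of the visibility short-circuit, not the actual design; the Snapshot and Page classes and the trivially simplified visibility rule (all xids below a cutoff are visible) are invented for the example:

```python
# Illustrative sketch of the page-level fast path: collect the recent
# modifiers of a page (from the page header, or from the TPD when there
# is more than one) and consult UNDO only for those not visible to the
# snapshot.  An empty result means the page image is usable as-is.

class Snapshot:
    def __init__(self, xmax):
        self.xmax = xmax  # simplification: every xid < xmax is visible

    def is_visible(self, xid):
        return xid < self.xmax

class Page:
    def __init__(self, recent_xids):
        # xids that recently modified this page (header or TPD entry)
        self.recent_xids = recent_xids

def undo_xids_to_examine(page, snapshot):
    """Return the recent modifiers whose UNDO must actually be read."""
    return [x for x in page.recent_xids if not snapshot.is_visible(x)]

snap = Snapshot(xmax=100)
print(undo_xids_to_examine(Page([40, 90]), snap))   # -> [] (no UNDO lookup)
print(undo_xids_to_examine(Page([90, 120]), snap))  # -> [120]
```

The expensive case is exactly the one described above: when the list is non-empty, the cost grows with the number of UNDO records those transactions wrote for the page.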

I think that one of the crucial questions for this design is how often
that will happen.  For people with mostly static data, it should be
rare.  In something like a pgbench workload, it boils down to how
often multiple transactions hit the same page before the previous
transactions that hit that page become all-visible.  Of course, if you
have too many backends relative to the scale factor, pgbench is going
to become lock-bound anyway and it probably won't matter much, but
it's about two orders of magnitude easier to hit the same page than to
hit the same tuple, so it could be that the point where you start
incurring significant costs for visibility checking and page
reconstruction happens well before the tuple lock contention becomes
noticeable.  It might be worth doing some mathematical modeling to try
to estimate how serious that effect is likely to be.

There are also some things that you can do to mitigate this, although
they add even more complexity.  For example:

- If the same transaction touches the same page repeatedly, you could
make a rule that every 5th or every 10th or every 20th modification to
the same page within the same transaction emits a larger summary
record into the UNDO that is a consolidated record of all changes made
to the page by that transaction thus far.  That penalizes writers, but
helps readers; when a reader reaches a summary record, it need look no
further.

- Peter alluded to the possibility - or so I thought - of caching
reconstructed page versions.  That would make a lot of sense for
sequential scans or cases where the same page is being accessed
repeatedly.

- If you're doing an index scan and you only care about one tuple, the
page could contain one bit per tuple which is set if the tuple has
been modified by any transaction the page views as recent.  If the bit
is clear for the one tuple you care about, you can ignore the UNDO
even if is for a transaction not visible to your snapshot, because you
know it didn't touch that tuple.

- If you create a TPD entry for the page because there are multiple
recent transactions that have modified it, you can put additional
information into the TPD entry to speed up access.  For example, you
could store an array indexed by itemid which indicates which of the
recent transactions have touched any given itemid.  This is mostly
useful for index scans, which can then ignore all the transactions
that don't pertain to the one TID they care about.
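The first mitigation above, consolidated summary records, implies a reader loop of roughly the following shape. This is a hypothetical sketch under invented names and record layouts, not a worked-out design:

```python
# Illustrative sketch of the summary-record mitigation: the writer emits
# a consolidated "summary" record every Nth change it makes to the same
# page, and a reader walking the page's UNDO chain (newest first) can
# stop at the first summary record, which subsumes all older records by
# that transaction for that page.

def record_kind(nth_change, every=10):
    """Writer side: which kind of UNDO record to emit for this change."""
    return "summary" if nth_change % every == 0 else "delta"

def collect_changes(undo_chain):
    """Reader side: gather records until a summary makes the rest moot.

    undo_chain is newest-first; each record is a dict with a 'kind' key.
    """
    needed = []
    for rec in undo_chain:
        needed.append(rec)
        if rec["kind"] == "summary":
            break  # everything older is already covered
    return needed

chain = [
    {"kind": "delta"},
    {"kind": "delta"},
    {"kind": "summary"},
    {"kind": "delta"},  # never read: the summary covers it
]
print(len(collect_changes(chain)))  # -> 3
```

This trades extra writer-side work (and UNDO volume) for a bound on reader-side chain traversal, which is the penalty/benefit split described above.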

Which of those ideas are best?  Are they enough, even if we do them
all?  Will there be other down sides?  I don't know.  But I think it's
clear that bloat is an ongoing and serious concern for our users (cf.
snapshot too old), and that vacuuming is too (cf. freeze map) and I
believe we are reaching a point where we can't make much further
improvement in those areas without some kind of major architectural
change.  WARM might qualify; this idea certainly does.  I also believe
that we NEED to improve further.  At least for EnterpriseDB, the
problems I'm trying to address through this design are big problems.
If this design isn't good, let's come up with something else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] UNDO and in-place update

2016-11-23 Thread Albe Laurenz
Robert Haas wrote:
> To implement this in PostgreSQL, I believe we would need support for
> UNDO.

> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.

As others have said, that is the main drawback.

Also, and related to this, ROLLBACK would become an expensive operation.

Yours,
Laurenz Albe



Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 8:45 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> The best thing by far about an alternative design like this is that it
>> performs *consistently*.
>
> Really?  I think it just moves the issues somewhere else.

Definitely, yes.

* HOT is great and all, but HOT-safety is a pretty onerous thing to
ask users to worry about. They don't know anything about it. Maybe
WARM eventually helps with this -- I don't know about that.

* It's bad that the effectiveness of index-only scans is strongly
influenced by the visibility map. And, autovacuum doesn't care about
index-only scans (nor should it, I suppose).

* The high watermark size of a table is much higher for an
update-mostly workload. When bloat is localized to certain values in
indexes, it's a lot worse.

* Our behavior with many duplicates in secondary indexes is pretty bad
in general, I suspect. I think we do badly at amortizing inserts into
indexes from updates (we dirty more leaf pages than we really need
to). I think we could do better at making LP_DEAD style IndexTuple
recycling more effective.

* Most obviously, autovacuum can create significant load at
inconvenient times. Paying that cost in a piecemeal fashion has a
certain appeal.

For what it's worth, I am not planning to work on this.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Pavan Deolasee
On Wed, Nov 23, 2016 at 10:07 AM, Peter Geoghegan  wrote:

> On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane  wrote:
> > Ultimately, I doubt that update-in-place buys much that we don't already
> > have with HOT updates (which postdate this old conversation, btw).
> > If you want MVCC semantics, you need to hold both versions of the tuple
> > *somewhere*.  Oracle's solution is different from ours, but I'm not
> > exactly convinced that it's better.  It makes some aspects better and
> > others worse.
>
> I agree that HOT is roughly as effective, at least when you get mostly
> HOT updates.
>
>
I agree. We'd tried update-in-place before writing HOT and it turned out to
be complex and error-prone [1]. I am not suggesting that we can't do a
better job this time, but as Tom said, it's going to be lot of work. If
WARM makes any progress, it will also help address some of the issues
around index bloat when HOT updates cannot be done.

One key issue that I see is that unrelated long-running transactions end up
holding up OldestXmin advances, and that leads to significant bloat in
update-heavy tables. If we could somehow address that situation, it may
still go a long way towards containing bloat. Of course, there may still be
cases where bloat will be unavoidable, but it seems far less work to me
than a complete overhaul of the MVCC machinery.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/1163092396.3634.461.ca...@silverbirch.site
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Mark Kirkwood

On 23/11/16 16:31, Tom Lane wrote:

> Robert Haas  writes:
>
> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.




Also, ROLLBACK becomes vastly more expensive than COMMIT (I can recall, 
many years ago when I used to be an Oracle DBA, reading whole chapters of 
novels while waiting for failed batch jobs to roll back).


However I'd like to add that I agree this is worth looking at, as 
ideally it would be great to be able to choose whether to have No-UNDO 
or UNDO on a table by table basis...


regards

Mark





Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Peter Geoghegan  writes:
> The best thing by far about an alternative design like this is that it
> performs *consistently*.

Really?  I think it just moves the issues somewhere else.

regards, tom lane




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 8:18 PM, Tom Lane  wrote:
> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.

I agree that HOT is roughly as effective, at least when you get mostly
HOT updates.

I think that there is an under-appreciated issue with index bloat and
VACUUM. Bloat in the sense of a B-Tree becoming unbalanced. I think
that short term bursts of non-HOT updates bloat the *keyspace* of
B-Trees, making them unbalanced. A short term burst (possibly
associated with one long running transaction that prevents pruning)
can cause long term damage to the key space.

The best thing by far about an alternative design like this is that it
performs *consistently*. I believe that we will lose on real-world
cases that play to the strengths of the current design. That doesn't
mean that it isn't worth it, but it does make it exceptionally
difficult.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Amit Kapila
On Wed, Nov 23, 2016 at 9:48 AM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>>> Oracle spends a lot of time on this, and it's really cache-inefficient
>>> because the data is spread all over.  This was what this guy felt in
>>> circa 2001; I'd have to think that the cache unfriendliness problem is
>>> much worse for modern hardware.
>
>> I imagine that temporal locality helps a lot. Most snapshots will be
>> interested in the same row version.
>
> Yeah, but the problem is that all those transactions have to scavenge
> through a big chunk of UNDO log to find out what they need to know.
>

I think that depends on how we design reading the old data.  If every
transaction has to go through the UNDO log, then it can impact
performance; however, if we design it such that a transaction can take
advantage of the work done by a previous transaction that has already
gone through the undo log, then it will be much better. For that to
happen, I think we need some kind of page-level MVCC, and that should be
possible in the design proposed above, where the latest transactions'
information is present at the page level.

> Ultimately, I doubt that update-in-place buys much that we don't already
> have with HOT updates (which postdate this old conversation, btw).
> If you want MVCC semantics, you need to hold both versions of the tuple
> *somewhere*.  Oracle's solution is different from ours, but I'm not
> exactly convinced that it's better.  It makes some aspects better and
> others worse.
>

I think not only Oracle but also some other database systems like
SQL Server (and IIRC MySQL) do a better job in terms of bloat
management (by keeping it separate from the main heap), which is why in
many cases they are preferred over PostgreSQL.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:01 PM, Robert Haas  wrote:
> This basic DO-UNDO-REDO protocol has been well-understood for
> decades.

FWIW, while this is basically true, the idea of repurposing UNDO to be
usable for MVCC is definitely an Oracleism. Mohan's ARIES paper says
nothing about MVCC.


-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Amit Kapila
On Wed, Nov 23, 2016 at 9:32 AM, Peter Geoghegan  wrote:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>>> - Reading a page that has been recently modified gets significantly
>>> more expensive; it is necessary to read the associated UNDO entries
>>> and do a bunch of calculation that is significantly more complex than
>>> what is required today.
>
> Someone told me that there is something called an interested
> transaction list stored in the page header, and from that I infer that
> isn't *really* true. I think that unless you're interested in an
> affected row, rather than just some row that happens to be on the same
> page, you don't really have to worry about it.
>

Yeah, so basically if there is an effect of any transaction which is
not visible to the snapshot of the transaction reading the page, you need
to do something to read the old row/rows present on that page.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>> Oracle spends a lot of time on this, and it's really cache-inefficient
>> because the data is spread all over.  This was what this guy felt in
>> circa 2001; I'd have to think that the cache unfriendliness problem is
>> much worse for modern hardware.

> I imagine that temporal locality helps a lot. Most snapshots will be
> interested in the same row version.

Yeah, but the problem is that all those transactions have to scavenge
through a big chunk of UNDO log to find out what they need to know.

Ultimately, I doubt that update-in-place buys much that we don't already
have with HOT updates (which postdate this old conversation, btw).
If you want MVCC semantics, you need to hold both versions of the tuple
*somewhere*.  Oracle's solution is different from ours, but I'm not
exactly convinced that it's better.  It makes some aspects better and
others worse.

regards, tom lane




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:31 PM, Tom Lane  wrote:
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.

Someone told me that Oracle stores something called an interested
transaction list in the page header, and from that I infer that this
isn't *really* true. I think that unless you're interested in an
affected row, rather than just some row that happens to be on the same
page, you don't really have to consult the UNDO entries at all.
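For what that aside is worth, the idea might look something like this. This is purely a guess at the shape of such a structure; the names, the fixed slot count, and the single `committed` flag are all invented for illustration and do not describe Oracle's actual on-disk format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ITL 4

/* One slot in a hypothetical interested-transaction list (ITL). */
typedef struct ItlEntry
{
    uint32_t xid;       /* transaction that recently touched the page */
    bool committed;     /* has it committed? */
} ItlEntry;

/* Hypothetical page header carrying the ITL. */
typedef struct PageHeader
{
    ItlEntry itl[MAX_ITL];
} PageHeader;

/* A reader consults UNDO only if the tuple it wants points at an ITL slot
 * whose transaction has not committed; tuples with no ITL reference
 * (slot < 0) can be read straight off the page. */
static bool
needs_undo_lookup(int tuple_itl_slot, const PageHeader *hdr)
{
    if (tuple_itl_slot < 0)
        return false;
    return !hdr->itl[tuple_itl_slot].committed;
}
```

If something like this holds, the per-page read penalty applies only to rows with an in-flight interested transaction, which is the inference being drawn above.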

> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

I imagine that temporal locality helps a lot. Most snapshots will be
interested in the same row version.

-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 10:41 PM, Peter Geoghegan  wrote:
> On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas  wrote:
>>> Heikki's been fooling with some ideas that I think have more promise.
>>> I wish he'd get to the point of presenting them publicly rather than
>>> just over beers at conferences.
>>
>> That would be good, too!
>
> I was told that TED was kind of formally proposed in the "MultiXact
> hindsight design" thread.

I read that, but:

1. Nothing's really come of that, AFAICS, and

2. It doesn't allow for in-place update, and

3. It doesn't eliminate the need for freezing.

Implementing TED would probably have been better than doing fkeylocks
the way we did, but now that we've mostly stabilized the multixact
work that was actually done, I have a hard time imagining that we
really want to revamp the whole thing without fixing any of the more
fundamental problems with our on-disk format.  In my mind, the bloat
created by storing both the old and new versions of an updated tuple
in the heap, and the need to rewrite pages multiple times to set hint
bits, set all-visible, and ultimately freeze are the three-ton
gorillas in the corner.  I am not arguing that what I've outlined here
is the only way to fix those problems or even the best way, just that
it isn't really worth the trouble of doing major surgery if we aren't
going to make a dent in those problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Peter Geoghegan
On Tue, Nov 22, 2016 at 7:39 PM, Robert Haas  wrote:
>> Heikki's been fooling with some ideas that I think have more promise.
>> I wish he'd get to the point of presenting them publicly rather than
>> just over beers at conferences.
>
> That would be good, too!

I was told that TED was kind of formally proposed in the "MultiXact
hindsight design" thread.


-- 
Peter Geoghegan




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Robert Haas
On Tue, Nov 22, 2016 at 10:31 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> [ Let's invent Oracle-style UNDO logs ]
>
> I dunno.  I remember being told years ago, by an ex-Oracle engineer,
> that he thought our approach was better.  I don't recall all the details
> of the conversation but I think his key point was basically this:
>
>> - Reading a page that has been recently modified gets significantly
>> more expensive; it is necessary to read the associated UNDO entries
>> and do a bunch of calculation that is significantly more complex than
>> what is required today.
>
> Oracle spends a lot of time on this, and it's really cache-inefficient
> because the data is spread all over.  This was what this guy felt in
> circa 2001; I'd have to think that the cache unfriendliness problem is
> much worse for modern hardware.

Yes, that's the main thing I'm worried about with this design.  Now,
bloat is pretty cache-inefficient too.  If our table is 2x the size of
Oracle's table, they can spend more cycles per page and still do
fairly well.  If that 2x increment pushes us out of RAM, they can
spend basically all the CPU time they want and still win handily.  But
that isn't to say that this concern isn't justified.
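The "pushes us out of RAM" argument can be made concrete with a toy cost model. All numbers here are illustrative, not from any benchmark, and the model (uniform per-page CPU cost plus a flat miss penalty) is deliberately crude.

```c
#include <assert.h>

/* Toy cost model: scanning `pages` pages costs a fixed CPU amount per page,
 * plus an I/O penalty for the fraction of pages that miss a cache holding
 * `cache_pages` pages.  Assumes uniform access; purely illustrative. */
static double
scan_cost(double pages, double cache_pages,
          double cpu_per_page, double miss_penalty)
{
    double hit_frac = (pages <= cache_pages) ? 1.0 : cache_pages / pages;

    return pages * (cpu_per_page + (1.0 - hit_frac) * miss_penalty);
}
```

With a 100-page cache, a 100-page compact table, and a miss penalty 100x the per-page CPU cost, doubling the table to 200 pages multiplies the scan cost by roughly 100x, not 2x: half the accesses now pay the miss penalty. That is the sense in which the bloated table loses "handily" once it no longer fits, even if the compact representation spends more CPU per page.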

> Which is not to say that it's not worth experimenting with.  But what you
> describe is going to be a huge amount of work even to get to the point
> where we could try to measure the actual costs and benefits :-(

Yeah.

> Heikki's been fooling with some ideas that I think have more promise.
> I wish he'd get to the point of presenting them publicly rather than
> just over beers at conferences.

That would be good, too!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UNDO and in-place update

2016-11-22 Thread Tom Lane
Robert Haas  writes:
> [ Let's invent Oracle-style UNDO logs ]

I dunno.  I remember being told years ago, by an ex-Oracle engineer,
that he thought our approach was better.  I don't recall all the details
of the conversation but I think his key point was basically this:

> - Reading a page that has been recently modified gets significantly
> more expensive; it is necessary to read the associated UNDO entries
> and do a bunch of calculation that is significantly more complex than
> what is required today.

Oracle spends a lot of time on this, and it's really cache-inefficient
because the data is spread all over.  This was what this guy felt in
circa 2001; I'd have to think that the cache unfriendliness problem is
much worse for modern hardware.

Which is not to say that it's not worth experimenting with.  But what you
describe is going to be a huge amount of work even to get to the point
where we could try to measure the actual costs and benefits :-(

Heikki's been fooling with some ideas that I think have more promise.
I wish he'd get to the point of presenting them publicly rather than
just over beers at conferences.

regards, tom lane

