On 03/03/18 05:03, Robert Haas wrote:
On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov
<a.korot...@postgrespro.ru> wrote:
I would propose "zero-bloat heap" disambiguation of zheap.  Seems like fair
enough explanation for me without need to rename :)
It will be possible to bloat a zheap table in certain usage patterns.
For example, if you bulk-load the table with a ton of data, commit the
transaction, delete every other row, and then never insert any more
rows ever again, the table is bloated: it's twice as large as it
really needs to be, and we have no provision for shrinking it.  In
general, I think it's very hard to keep bulk deletes from leaving
bloat in the table, and to the extent that it *is* possible, we're not
doing it.  One could imagine, for example, an index-organized table
that automatically combines adjacent pages when they're empty enough,
and that also relocates data to physically lower-numbered pages
whenever possible.  Such a storage engine might automatically shrink
the on-disk footprint after a large delete, but we have no plans to go
in that direction.
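
The arithmetic in that example can be sketched with a toy page model. This is only an illustration, not zheap code: the page capacity, the row count, and all function names are assumptions made up for the sketch.

```python
# Toy model only: a page is a list of slots, True = live row, False = deleted.
ROWS_PER_PAGE = 100  # hypothetical page capacity


def bulk_load(n_rows, rows_per_page=ROWS_PER_PAGE):
    """Fill pages sequentially, as a bulk load would."""
    pages = []
    for start in range(0, n_rows, rows_per_page):
        count = min(rows_per_page, n_rows - start)
        pages.append([True] * count)
    return pages


def delete_every_other_row(pages):
    """Mark every second row as deleted; no page is ever released."""
    row_number = 0
    for page in pages:
        for slot in range(len(page)):
            if row_number % 2 == 1:
                page[slot] = False
            row_number += 1


def pages_needed(pages, rows_per_page=ROWS_PER_PAGE):
    """Pages a perfectly packed table would need for the live rows."""
    live = sum(sum(page) for page in pages)
    return -(-live // rows_per_page)  # ceiling division


pages = bulk_load(10_000)
delete_every_other_row(pages)

allocated = len(pages)             # pages still on disk
needed = pages_needed(pages)       # pages actually required
bloat_factor = allocated / needed  # 2.0 in this scenario
```

With no later inserts to reuse the freed slots and no mechanism to move rows to lower-numbered pages, the allocated page count never drops, so the table stays at twice the size it needs.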

Rather, our assumption is that the bloat most people care about comes
from updates.  By performing updates in-place as often as possible, we
hope to avoid bloating both the heap (because we're not adding new row
versions to it which then have to be removed) and the indexes (because
if we don't add new row versions at some other TID, then we don't need
to add index pointers to that new TID either, or remove the old index
pointers to the old TID).  Without delete-marking, we can basically
optimize the case that is currently handled via HOT updates: no
indexed columns have changed.  However, the in-place update has a
major advantage that it still works even when the page is completely
full, provided that the row does not expand.  As Amit's results show,
that can hugely reduce bloat and increase performance in the face of
long-running concurrent transactions.  With delete-marking, we can
also optimize the case where indexed columns have been changed.  We
don't know exactly how well this will work yet because the code isn't
written and therefore can't be benchmarked, but I am hopeful that
in-place updates will be a big win here too.
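
A rough way to see the difference is to count row versions and index pointers under both schemes. This is a back-of-the-envelope sketch, not a model of the real storage code. It assumes that pages are full (so HOT cannot apply), that a long-running transaction keeps old versions from being removed, that no indexed column changes, and that rows do not expand. The function names and numbers are hypothetical.

```python
def heap_footprint(n_rows, updates_per_row, n_indexes):
    """Standard-heap toy: each update writes a new row version at a new TID,
    and every index needs a pointer to that new TID."""
    versions = n_rows * (1 + updates_per_row)
    index_pointers = versions * n_indexes
    return versions, index_pointers


def in_place_footprint(n_rows, updates_per_row, n_indexes):
    """In-place toy: each update overwrites the row at the same TID,
    so neither the heap nor the indexes gain entries."""
    versions = n_rows
    index_pointers = n_rows * n_indexes
    return versions, index_pointers


# 1,000 rows, 5 updates each, 2 indexes (made-up numbers).
heap = heap_footprint(1_000, 5, 2)          # (6000, 12000)
in_place = in_place_footprint(1_000, 5, 2)  # (1000, 2000)
```

Under these assumptions the in-place scheme keeps both counts constant no matter how many updates run, which is the effect described above.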

So, I would not describe a zheap table as zero-bloat, but it should
involve a lot less bloat than our standard heap.


For folk doing ETL-type data warehousing this should be great, as the typical workload tends to be: COPY (or similar) from a foreign data source, then several sets of UPDATEs to fix/check/scrub the data, which tends to result in huge bloat with the current heap design (despite telling people 'you can do it another way' to avoid bloat; I guess it just seems more intuitive to do it as described).

regards
Mark

