On Dec 1, 2009, at 4:13 PM, Greg Stark wrote:
On Tue, Dec 1, 2009 at 9:57 PM, Richard Huxton <d...@archonet.com> wrote:
Why are we writing out the hint bits to disk anyway? Is it really so
slow to calculate them on read + cache them that it's worth all this
trouble? Are they not also to blame for the "write my import data
twice" feature?
It would be interesting to experiment with different strategies. But
the results would depend a lot on workloads and I doubt one strategy
is best for everyone.
I agree that we'll always have the issue with freezing. But I also
think it's time to revisit the whole idea of hint bits. AFAIK we only
keep at most 2B transactions, and each one takes 2 bits in CLOG.
So worst-case scenario, we're looking at 4 Gbit (about 512MB) of clog.
On modern hardware, that's not a lot. And that's also assuming that we
don't do any kind of compression on that data (obviously we couldn't
use just any old compression algorithm, but there are certainly tricks
that could be used to reduce the size of this information).
I know this is something that folks at EnterpriseDB have looked at;
perhaps there's data they can share.
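As a quick check of the worst-case arithmetic, here is a minimal sketch. It assumes 2 status bits per transaction and takes "2B transactions" to mean 2^31, both as stated above; clog_bytes is a made-up helper name, not a PostgreSQL function.

```python
# Back-of-the-envelope clog sizing.
# clog_bytes is a hypothetical helper, not part of PostgreSQL.
BITS_PER_XACT = 2  # two status bits per transaction, per the thread


def clog_bytes(n_xacts: int, bits_per_xact: int = BITS_PER_XACT) -> int:
    """Return the bytes needed to hold status bits for n_xacts transactions."""
    total_bits = n_xacts * bits_per_xact
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    n = 2**31  # "2B transactions" read as 2^31 (assumption)
    size = clog_bytes(n)
    print(f"{n} xacts -> {size} bytes ({size / 2**20:.0f} MiB)")
```

Under those assumptions the status data comes to 2^29 bytes, and doubling the horizon to the full 32-bit xid space doubles that to 1 GiB.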
It has often been suggested that we could set the hint bits but not
dirty the page, so they would never be written out unless some other
update hit the page. In most use cases that would probably result in
the right thing happening where we avoid half the writes but still
stop doing transaction status lookups relatively promptly. The scary
thing is that there might be use cases, such as static data loaded
once, where the hint bits never get written out and every scan of the
page has to recheck those statuses until the tuples are frozen.
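The trade-off in the "set the hint bits but don't dirty the page" idea can be sketched with a toy model. Everything here (Page, scan, evict, the dict standing in for clog) is an illustrative assumption, not PostgreSQL's buffer manager; it only counts how many status lookups each scan has to make.

```python
# Toy model only: Page, scan, and evict are illustrative names,
# not PostgreSQL internals.
from dataclasses import dataclass, field


@dataclass
class Page:
    xids: list                                     # inserting xid of each tuple
    hinted: set = field(default_factory=set)       # tuples hinted in memory
    disk_hinted: set = field(default_factory=set)  # hints as last written to disk
    dirty: bool = False                            # set only by "real" updates


def scan(page: Page, clog: dict) -> int:
    """Scan a page and return the number of clog lookups it needed."""
    lookups = 0
    for i, xid in enumerate(page.xids):
        if i in page.hinted:
            continue            # hint bit answers the question directly
        _status = clog[xid]     # transaction status lookup
        lookups += 1
        page.hinted.add(i)      # set the hint, but do NOT dirty the page
    return lookups


def evict(page: Page) -> None:
    """Drop the page from cache; hints reach disk only if the page is dirty."""
    if page.dirty:
        page.disk_hinted = set(page.hinted)
        page.dirty = False
    page.hinted = set(page.disk_hinted)  # next read sees the on-disk copy
```

With a static page of 50 tuples, the first scan does 50 lookups and the second does none, but after an eviction the next scan is back to 50. Only once some other update sets page.dirty before an eviction do the hints stick.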
(Not dirtying the page almost gets us out of the CRC problems -- it
doesn't in our current setup because we don't take a lock when setting
the hint bits, so you could set it on a page someone is in the middle
of CRC checking and writing. There were other solutions proposed for
that, including just making hint bits require locking the page or
double buffering the write.)
There does need to be something like the hint bits which does
eventually have to be set because we can't keep transaction
information around forever. Even if you keep the transaction
information all the way back to the last freeze date (up to about 1GB
and change I think), the data still has to be written twice; the second
write is to freeze the transactions. In the worst case, reading a page
then requires a random page access (or two) from anywhere in that 1GB+
file for each tuple on the page (whether visible to us or not).
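To make the "random page access per tuple" point concrete, here is a rough sketch of how a transaction id could map to a position in a status file packed at 2 bits per transaction. The 8 kB page size and the names clog_location and pages_touched are assumptions for illustration.

```python
# Sketch of addressing a 2-bits-per-transaction status file.
# BLOCK_SIZE and both function names are assumptions for illustration.
BLOCK_SIZE = 8192                              # bytes per page (assumed)
BITS_PER_XACT = 2
XACTS_PER_BYTE = 8 // BITS_PER_XACT            # 4 transactions per byte
XACTS_PER_PAGE = BLOCK_SIZE * XACTS_PER_BYTE   # 32768 transactions per page


def clog_location(xid: int) -> tuple:
    """Return (page number, byte within page, bit offset within byte)."""
    page, index_in_page = divmod(xid, XACTS_PER_PAGE)
    byte, slot = divmod(index_in_page, XACTS_PER_BYTE)
    return page, byte, slot * BITS_PER_XACT


def pages_touched(xids) -> int:
    """Count the distinct status pages needed to check the given xids."""
    return len({clog_location(x)[0] for x in xids})
```

Tuples written by transactions close together in time share a status page, so pages_touched([1, 2, 3]) is 1. A heap page whose tuples come from widely separated transactions, for example xids 1, 40000, and 2000000000, needs three different status pages, which is the worst case described above.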
--
greg
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
--
Jim C. Nasby, Database Architect j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net