On 8/30/07, Tom Lane <[EMAIL PROTECTED]> wrote:
>
> "Pavan Deolasee" <[EMAIL PROTECTED]> writes:
> > Please see the version 14 of HOT patch attached.
>
> I expected to find either a large new README, or some pretty substantial
> additions to existing README files, to document how this all works.
> The comments included do not represent nearly enough documentation.
I shall take that up. There are a couple of documents posted by Heikki
and Greg, apart from several email threads and the original design doc
by Simon. I shall consolidate everything into a single README.

> One thing I was unable to discern from the comments is how CREATE INDEX
> can work at all.  A new index might mean that tuples that could formerly
> legally be part of the same hot-chain no longer can.  I can't find any
> code here that looks like it's addressing that.

You are right - a new index might mean that an existing HOT chain is
broken as far as the new index is concerned. The way we address that is
by indexing the root tuple of the chain, while the index key is
extracted from the last tuple in the chain. The index is marked
"unusable" for all those existing transactions which can potentially see
any intermediate tuples in the chain.

Please see this document written by Greg Stark. He has nicely summarized
how CREATE INDEX and CREATE INDEX CONCURRENTLY work with HOT:

http://archives.postgresql.org/pgsql-patches/2007-07/msg00360.php

> I also don't think I
> believe the reasoning for not indexing DELETE_IN_PROGRESS hot-updated
> tuples: what if the index completion commits, but the concurrent delete
> rolls back?  Then you've got a valid tuple that's not in the index.

Since CREATE INDEX works with ShareLock on the relation, we can safely
assume that we can't find any DELETE_IN_PROGRESS tuples except those
deleted by our own transaction. The only exception is system relations,
but we don't do HOT updates on system relations. Given that, it seems OK
to me to ignore these tuples, because if the transaction aborts, CREATE
INDEX is aborted as well. Am I overlooking something here? There is a
comment to this effect in the current code as well.

> > I also took this opportunity to remove the modularity invasion caused
> > by heap_check_idxupdate() since it was using resultRelInfo. We now
> > build the list of attributes that must be checked to satisfy HOT update.
> > This list includes all the index columns, columns in the partial index
> > predicates and expression index expressions and is built in the
> > executor.
>
> The executor is the wrong place for that: I'm not sure why you think
> that making heapam depend on the executor is a modularity improvement.

Earlier (in the previous version of the HOT patch) we were passing
ResultRelInfo to heap_update, which was even uglier. The comment is in
that context :-)

> Furthermore this approach requires recalculating the list during
> each query, which is wasteful when it could only change during schema
> updates.  I'd suggest making the relcache responsible for computing and
> saving this data, along the same lines as RelationGetIndexList().
> (That also means that heapam can get the data from the relcache, saving
> a lot of API-bloating from passing around Attrids explicitly.)
> Also, rather than inventing Attrids, I'd suggest just using one
> Bitmapset with the convention that its contents are offset by
> FirstLowInvalidHeapAttributeNumber.

I liked all these suggestions. I had thought about computing the
attribute list in the relcache, but probably avoided it to keep things
simple. I shall make these changes.
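Just to confirm I have understood the relcache suggestion, here is a
rough sketch of the kind of function I have in mind (the function name
and the exact details are illustrative only; this is not code from the
patch):

#include "postgres.h"
#include "access/genam.h"
#include "access/sysattr.h"
#include "catalog/index.h"
#include "nodes/bitmapset.h"
#include "utils/rel.h"
#include "utils/relcache.h"

/*
 * Sketch only: collect every column that any index on the relation
 * depends on into a single Bitmapset, offset by
 * FirstLowInvalidHeapAttributeNumber so that system attributes can be
 * represented too.  The result would be computed once and cached in
 * the relcache, along the lines of RelationGetIndexList(), and heapam
 * would fetch it from there when deciding whether an UPDATE can be HOT.
 */
static Bitmapset *
get_hot_index_attrs(Relation relation)      /* illustrative name */
{
    Bitmapset  *indexattrs = NULL;
    List       *indexoidlist = RelationGetIndexList(relation);
    ListCell   *l;

    foreach(l, indexoidlist)
    {
        Oid         indexOid = lfirst_oid(l);
        Relation    indexDesc = index_open(indexOid, AccessShareLock);
        IndexInfo  *indexInfo = BuildIndexInfo(indexDesc);
        int         i;

        /* plain index key columns */
        for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
        {
            AttrNumber  attrnum = indexInfo->ii_KeyAttrNumbers[i];

            if (attrnum != 0)
                indexattrs = bms_add_member(indexattrs,
                            attrnum - FirstLowInvalidHeapAttributeNumber);
        }

        /*
         * Columns referenced by expression indexes and by partial-index
         * predicates would be pulled out of indexInfo->ii_Expressions
         * and indexInfo->ii_Predicate and added here in the same way.
         */

        index_close(indexDesc, AccessShareLock);
    }
    list_free(indexoidlist);

    return indexattrs;
}

heap_update would then compare the old and new versions of just those
columns to decide between a HOT and a COLD update.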
> The redefinition of the value of MaxHeapTuplesPerPage seems pretty
> ugly and unprincipled.  I think it'd be better to leave it as-is,
> and just enforce that we don't allow more than that many line pointers
> on a heap page (as indeed you have to do anyway, so it's not clear
> what the change is buying).

The only reason to redefine MaxHeapTuplesPerPage upward is that HOT
allows us to accommodate more tuples per page because of redirect-dead
line pointers. For a table with sufficiently large tuples the original
bound works well, but for very small tuples we might hit the line
pointer limit even though there is free space available in the page.
Doubling the value was chosen as a balance between heap page
utilization, line pointer bloat and the overhead for bitmap scans. But I
agree that the factor choice is rather arbitrary.

> Is it really necessary to add hot_update to xl_heap_update?  Seems the
> information should be available from the tuple header fields.

I think it's not necessary. The reason I did it that way is that the new
block might be backed up in the WAL record, and hence we might not have
recorded the new tuple's infomasks. But we can surely handle that corner
case. I shall fix this.

> Have you demonstrated that the "prune_hard" logic is worth its weight?
> Considering that many of us want to drop VACUUM FULL, adding more
> complexity to it doesn't seem like a profitable use of time.

The prune_hard code is a lot simpler in this version (it was horrible
before Heikki and I reworked the pruning logic). We just follow the HOT
chain to the last definitely DEAD tuple and remove all tuples preceding
it. This simplifies the second phase of VACUUM FULL, since we don't need
to handle any broken HOT chains caused by intermediate DEAD tuples.
Also, prune_hard allows us to clean up the redirected line pointers.

Another reason for doing this is to avoid any significant changes to the
VACUUM FULL code, since that code is complex and we might just drop it
in the future. But until that time, we need to make sure that HOT works
fine with VACUUM FULL.

> Is it really safe, or productive, to run heap_page_prune when the buffer
> is not locked for cleanup (ie, there are other people with pins on it)?

It looks safe to me. We don't move tuples around during chain pruning;
we only fix the HOT chains by removing intermediate DEAD tuples. In the
other places where we follow HOT chains, we do so while holding at least
a SHARE lock on the buffer. So there are no race conditions here.

A place where this bit me is that a heap scan with SnapshotAny may
return a tuple which gets pruned (and marked ~LP_USED) before the caller
can see it. Fortunately there are only limited users of SnapshotAny. We
make sure that IndexBuildHeapScan confirms ItemIdIsValid() after
acquiring the lock on the buffer. CLUSTER works with an exclusive lock
on the relation and hence no concurrent pruning is possible.

> Even if it's safe, ISTM what you're mostly accomplishing there is to
> expend a lot of cycles while holding exclusive lock on the page, when
> there is good reason to think that you're blocking other people who are
> interested in using the page.  Eliminating the separation between that
> and cleanup would also allow eliminating the separate "PD_FRAGMENTED"
> page state.

The reason we did it that way is that repairing fragmentation seems much
more costly than pruning. Please note that we prune a single chain
during an index fetch; it is only for heap scans (and VACUUM) that we
try to prune all chains in the page. So unless we are doing a heap scan,
I am not sure we are spending too much time holding the exclusive lock.
I agree we don't have any specific numbers to prove that, though.

Another reason behind separating these two steps is that pruning
requires an exclusive lock, whereas repairing fragmentation requires a
cleanup lock. Also, we want to prune the chains aggressively because the
visible tuple is most likely at the end of the chain; the longer the
chain, the greater the cost to fetch the tuple.
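To spell out the difference between the two steps, this is roughly the
shape I have in mind (the two helper functions are placeholders, not the
actual patch code):

#include "postgres.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/* Placeholders for the two steps; not the actual patch functions. */
extern void prune_page_chains(Relation rel, Buffer buf);
extern void repair_page_fragmentation(Relation rel, Buffer buf);

/*
 * Sketch only: chain pruning needs just an exclusive content lock on
 * the buffer (no tuples move, so concurrent pin holders are safe),
 * whereas repairing fragmentation moves tuples around and therefore
 * needs a cleanup lock, i.e. nobody else may have the page pinned.
 * The caller is assumed to hold a pin on buf.
 */
static void
prune_and_maybe_defrag(Relation rel, Buffer buf)
{
    /* Step 1: prune HOT chains even if others hold pins on the page. */
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
    prune_page_chains(rel, buf);
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);

    /*
     * Step 2: defragment only if we can get the cleanup lock without
     * waiting; otherwise leave the page marked (PD_FRAGMENTED) and let
     * a later visitor repair it.
     */
    if (ConditionalLockBufferForCleanup(buf))
    {
        repair_page_fragmentation(rel, buf);
        LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    }
}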
> PlanSetValidForThisTransaction is completely bogus --- it's not
> re-entrant, and it needs to be.  I think you need some state in
> PlannerGlobal instead.

I was not happy with this - frankly, I don't understand the planner code
much. Thanks for pointing me to the appropriate code. I shall try to fix
this, unless you would like to do it yourself.

> rd_avgfsm seems fairly bogus ... when does it get updated?

Yeah, it's not the best solution. What I wanted to do is delay repairing
page fragmentation as far as possible. But I noticed that
fsmrel->avgRequest (rd_avgfsm is initialized to it) starts with a small
value (I guess 256). For a table with large tuples, we may not repair
fragmentation at all, thus causing unnecessary COLD updates.

rd_avgfsm is updated when the relcache is rebuilt. Since we don't send
out any relcache invalidation when fsmrel->avgRequest changes, rd_avgfsm
may never get updated in a session.

Should we just revert to checking whether the free space is less than a
certain percentage of BLCKSZ and trigger defragmentation? Or maybe do a
combination of both?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com
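PS: A minimal sketch of the percentage-of-BLCKSZ check I mention above
(the threshold value here is purely illustrative):

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Sketch only: trigger defragmentation once the page's free space has
 * dropped below a fixed fraction of BLCKSZ, instead of (or in addition
 * to) comparing against rd_avgfsm.  The 10% figure is illustrative.
 */
#define DEFRAG_FREESPACE_THRESHOLD  (BLCKSZ / 10)

static bool
page_needs_defrag(Page page)
{
    return PageGetFreeSpace(page) < DEFRAG_FREESPACE_THRESHOLD;
}

A combination of both could simply take the larger of this threshold and
rd_avgfsm.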