On Tue, Dec 18, 2018 at 8:04 PM Michael Paquier <mich...@paquier.xyz> wrote: > On Tue, Dec 18, 2018 at 01:41:06PM -0500, Robert Haas wrote: > > OK. I'll post what I have by the end of the week. > > Thanks, Robert.
OK, so I got slightly delayed here by utterly destroying my laptop, but I've mostly reconstructed what I did. I think there are some remaining problems, but this seems like a good time to share what I've got so far. I'm attaching three patches. 0001 is one which I posted before. It attempts to fix up RelationBuildPartitionDesc() so that this function will always return a partition descriptor based on a consistent snapshot of the catalogs. Without this, I think there's nothing to prevent a commit which happens while the function is running from causing the function to fail or produce nonsense answers. 0002 introduces the concept of a partition directory. The idea is that the planner will create a partition directory, and so will the executor, and all calls which occur in those places to RelationGetPartitionDesc() will instead call PartitionDirectoryLookup(). Every lookup for the same relation in the same partition directory is guaranteed to produce the same answer. I believe this patch still has a number of weaknesses. More on that below. 0003 actually lowers the lock level. The comment here might need some more work. Here is a list of possible or definite problems that are known to me: - I think we need a way to make partition directory lookups consistent across backends in the case of parallel query. I believe this can be done with a dshash and some serialization and deserialization logic, but I haven't attempted that yet. - I refactored expand_inherited_rtentry() to drive partition expansion entirely off of PartitionDescs. The reason why this is necessary is that it clearly will not work to have find_all_inheritors() use a current snapshot to decide what children we have and lock them, and then consult a different source of truth to decide which relations to open with NoLock. There's nothing to keep the lists of partitions from being different in the two cases, and that demonstrably causes assertion failures if you SELECT with an ATTACH/DETACH loop running in the background. However, it also changes the order in which tables get locked. Possibly that could be fixed by teaching expand_partitioned_rtentry() to qsort() the OIDs the way find_inheritance_children() does. It also loses the infinite-loop protection which find_all_inheritors() has. Not sure what to do about that. - In order for the new PartitionDirectory machinery to avoid use-after-free bugs, we have to either copy the PartitionDesc out of the relcache into the partition directory or avoid freeing it while it is still in use. Copying it seems unappealing for performance reasons, so I took the latter approach. However, what I did here in terms of reclaiming memory is just about the least aggressive strategy short of leaking it altogether - it just keeps it around until the next rebuild that occurs while the relcache entry is not in use. We might want to do better, e.g. freeing old copies immediately as soon as the relcache reference count drops to 0. I just did it this way because it was simple to code. - I tried this with Alvaro's isolation tests and it fails some tests. That's because Alvaro's tests expect that the list of accessible partitions is based on the query snapshot. For reasons I attempted to explain in the comments in 0003, I think the idea that we can choose the set of accessible partitions based on the query snapshot is very wrong. For example, suppose transaction 1 begins, reads an unrelated table to establish a snapshot, and then goes idle. Then transaction 2 comes along, detaches a partition from an important table, and then does crazy stuff with that table -- changes the column list, drops it, reattaches it with different bounds, whatever. Then it commits. If transaction 1 now comes along and uses the query snapshot to decide that the detached partition ought to still be seen as a partition of that partitioned table, disaster will ensue. - I don't have any tests here, but I think it would be good to add some, perhaps modeled on Alvaro's, and also some that involve multiple ATTACH and DETACH operations mixed with other interesting kinds of DDL. I also didn't make any attempt to update the documentation, which is another thing that will probably have to be done at some point. Not sure how much documentation we have of any of this now. - I am uncertain whether the logic that builds up the partition constraint is really safe in the face of concurrent DDL. I kinda suspect there are some problems there, but maybe not. Once you hold ANY lock on a partition, the partition constraint can't concurrently become stricter, because no ATTACH can be done without AccessExclusiveLock on the partition being attached; but it could concurrently become less strict, because you only need a lesser lock for a detach. Not sure if/how that could foul with this logic. - I have not done anything about the fact that index detach takes AccessExclusiveLock. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
0001-Ensure-that-RelationBuildPartitionDesc-sees-a-consis.patch
Description: Binary data
0002-Introduce-the-concept-of-a-partition-directory.patch
Description: Binary data
0003-Lower-the-lock-level-for-ALTER-TABLE-.-ATTACH-DETACH.patch
Description: Binary data