On 03/02/2020 20:29, Jeff Davis wrote:
1. Use a minheap for the freelist. The original design used an array
that had to be sorted between a read (which frees a block) and a write
(which needs to sort the array to consume the lowest block number). The
comments said:

   * sorted.  This is an efficient way to handle it because we expect
cycles
   * of releasing many blocks followed by re-using many blocks, due to
   * the larger read buffer.

But I didn't find a case where that actually wins over a simple
minheap. With that in mind, a minheap seems closer to what one might
expect for that purpose, and more robust when the assumptions don't
hold up as well. If someone knows of a case where the re-sorting
behavior is important, please let me know.

A minheap certainly seems more natural for that. I guess re-sorting the array would be faster in the extreme case that you free almost all of the blocks, and then consume almost all of the blocks, but I don't think the usage pattern is ever that extreme. Because if all the data fit in memory, we wouldn't be spilling in the first place.

I wonder if a more advanced heap like the pairing heap or fibonacci heap would perform better? Probably doesn't matter in practice, so better keep it simple...

Changing to a minheap effectively solves the problem for HashAgg,
though in theory the memory consumption of the freelist itself could
become significant (though it's only 0.1% of the free space being
tracked).

We could fairly easily spill parts of the freelist to disk, too, if necessary. But it's probably not worth the trouble.

2. Lazily-allocate the read buffer. The write buffer was always lazily-
allocated, so this patch creates better symmetry. More importantly, it
means freshly-rewound tapes don't have any buffer allocated, so it
greatly expands the number of tapes that can be managed efficiently as
long as only a limited number are active at once.

Makes sense.

3. Allow expanding the number of tapes for an existing tape set. This
is useful for HashAgg, which doesn't know how many tapes will be needed
in advance.

I'd love to change the LogicalTape API so that you could allocate and free tapes more freely. I wrote a patch to do that, as part of replacing tuplesort.c's polyphase algorithm with a simpler one (see [1]), but I never got around to committing it. Maybe the time is ripe to do that now?

[1] https://www.postgresql.org/message-id/420a0ec7-602c-d406-1e75-1ef7ddc58...@iki.fi

- Heikki


Reply via email to