On Tue, May 19, 2020 at 09:27:34AM -0700, Jeff Davis wrote:
> On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
>> I think there are two related problems - with costing and with
>> excessive I/O due to using logical tapes.

> Thank you for the detailed analysis. I am still digesting this
> information.

>> This kinda makes me question whether logical tapes are the right tool
>> for hashagg. I've read the explanation in logtape.c why it's about the
>> same amount of I/O as using separate files, but IMO that only really
>> works for I/O patterns similar to merge sort - the more I think about
>> this, the more I'm convinced we should just do what hashjoin is doing.

> Fundamentally, sort writes sequentially and reads randomly; while
> HashAgg writes randomly and reads sequentially.


Not sure. I think the charts and stats of iosnoop data show that an
awful lot of the reads during sort are actually pretty sequential.
Moreover, sort manages to read the data in much larger blocks - 128kB
instead of just 8kB (which is what hashagg seems to be doing).

I wonder why that is, and whether we could achieve that for hashagg too ...

> If the random writes of HashAgg end up fragmented too much on disk,
> then clearly the sequential reads are not so sequential anyway. The
> only way to avoid fragmentation on disk is to preallocate for the
> tape/file.


And is there a way to pre-allocate larger chunks? Presumably we could
assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
instead of just a single block. I haven't seen anything like that in
logtape.c, though ...

> BufFile (relying more on the OS) would probably do a better job of
> preallocating the disk space in a useful way; whereas logtape.c makes
> it easier to manage buffers and the overall number of files created
> (thereby allowing higher fanout of partitions).

> We have a number of possibilities here:
>
> 1. Improve costing to reflect that HashAgg is creating more random IOs
> than Sort.

I think we'll need to do something about this, but I think we should try
improving the behavior first and then model the costing based on that.

> 2. Reduce the partition fanout in the hopes that the OS does a better
> job with readahead.

I doubt this will make a significant difference. I think the problem is
that the partitions end up interleaved way too much in the temp file,
and I don't see how a lower fanout would fix that.

BTW what do you mean when you say "fanout"? Do you mean how fast we
increase the number of partitions, or some parameter in particular?

> 3. Switch back to BufFile, in which case we probably need to reduce the
> fanout for other reasons.

Maybe, although that seems pretty invasive post beta1.

> 4. Change logtape.c to allow preallocation or to write in larger
> blocks.

I think this is what I suggested above (allocating 16 blocks at a time,
or something). I wonder how wasteful this would be, but I think not very
much. Essentially, with 1024 partitions and pre-allocating space in
128kB chunks, that means up to 128MB may end up unused, which seems
ok-ish, and I guess we could further restrict that by starting with a
lower value and gradually increasing the chunk size. Or something like
that ...

> 5. Change BufFile to allow more control over buffer usage, and switch
> to that.


Maybe. I don't recall what exactly the issue with buffer usage is, but I
think it has the same invasiveness issue as (3). OTOH it's what hashjoin
does, and we've lived with that for ages ...

> #1 or #2 are the least invasive, and I think we can get a satisfactory
> solution by combining those.


OK. I think tweaking the costing (and essentially reverting to what 12
does for those queries) is perfectly reasonable. But if we can actually
get some speedup thanks to hashagg, even better.

> I saw good results with the high fanout and low work_mem when there is
> still a lot of system memory. That's a nice benefit, but perhaps it's
> safer to use a lower fanout (which will lead to recursion) until we get
> a better handle on the IO patterns.


I don't know how much we can rely on that - once some of the data gets
pushed out of the page cache, we run into the issues I described. The
trouble is people may not have enough memory to keep everything in
cache - otherwise they might just as well bump up work_mem and not spill
at all.

> Perhaps you can try recompiling with a lower max partitions and rerun
> the query? How much would we have to lower it for either the cost to
> approach reality or the OS readahead to become effective?


I can try that, of course. Which parameters should I tweak / how?

I can also try running it with BufFile, in case you prepare a WIP patch.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

