On Sun, Aug 17, 2014 at 1:17 PM, Tomas Vondra <t...@fuzzy.cz> wrote:
> Being able to batch inner and outer relations in a matching way is
> certainly one of the reasons why hashjoin uses that particular scheme.
> There are other reasons, though - for example being able to answer 'Does
> this group belong to this batch?' quickly, and automatically update
> number of batches.
>
> I'm not saying the lookup is extremely costly, but I'd be very surprised
> if it was as cheap as modulo on a 32-bit integer. Not saying it's the
> dominant cost here, but memory bandwidth is quickly becoming one of the
> main bottlenecks.
Well, I think you're certainly right that a hash table lookup is more expensive than modulo on a 32-bit integer; so much is obvious. But if the load factor is not too large, I think that it's not a *lot* more expensive, so it could be worth it if it gives us other advantages.

As I see it, the advantage of Jeff's approach is that it doesn't really matter whether our estimates are accurate or not. We don't have to decide at the beginning how many batches to do, and then possibly end up using too much or too little memory per batch if we're wrong; we can let the amount of memory actually used during execution determine the number of batches. That seems good. Of course, a hash join can increase the number of batches on the fly, but only by doubling it, so you might go from 4 batches to 8 when 5 would really have been enough. And a hash join also can't *reduce* the number of batches on the fly, which might matter a lot. Getting the number of batches right avoids I/O, which is a lot more expensive than CPU.

>> But the situation here isn't comparable, because there's only one
>> input stream. I'm pretty sure we'll want to keep track of which
>> transition states we've spilled due to lack of memory as opposed to
>> those which were never present in the table at all, so that we can
>> segregate the unprocessed tuples that pertain to spilled transition
>> states from the ones that pertain to a group we haven't begun yet.
>
> Why would that be necessary or useful? I don't see a reason for tracking
> that / segregating the tuples.

Suppose there are going to be three groups: A, B, C. Each is an array_agg(), and they're big, so only one of them will fit in work_mem at a time. However, we don't know that at the beginning, either because we don't write the code to try or because we do write that code but our cardinality estimates are way off; instead, we're under the impression that all three will fit in work_mem. So we start reading tuples.
We see values for A and B, but we don't see any values for C because those all occur later in the input. Eventually, we run short of memory and cut off creation of new groups. Any tuples for C are now going to get written to a tape from which we'll later reread them. After a while, even that proves insufficient and we spill the transition state for B to disk. Any further tuples that show up for B will need to be written to tape as well. We continue processing and finish group A.

Now it's time to do batch #2. Presumably, we begin by reloading the serialized transition state for group B. To finish group B, we must look at all the tuples that might possibly fall in that group. If all of the remaining tuples are on a single tape, we'll have to read all the tuples in group B *and* all the tuples in group C; we'll presumably rewrite the tuples that are not part of this batch onto a new tape, which we'll then process in batch #3. But if we took advantage of the first pass through the input to put the tuples for group B on one tape and the tuples for group C on another tape, we can be much more efficient - just read the remaining tuples for group B, not mixed with anything else, and then read a separate tape for group C.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
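[Editor's note: to make the I/O arithmetic in the B/C example concrete, here is a rough sketch, not from the thread itself; the group sizes, tuple representation, and function names are all made up for illustration. It counts how many spilled tuples each batch must read back when all spilled tuples share one tape versus when the first pass segregates them onto one tape per spilled group.]

```python
def single_tape_io(spilled):
    """All spilled tuples share one tape: each batch must scan the whole
    remaining tape and rewrite the non-matching tuples to a new tape."""
    tape = [t for ts in spilled.values() for t in ts]  # B and C interleaved
    reads = 0
    for group in spilled:                 # one batch per spilled group
        reads += len(tape)                # batch scans everything left over
        tape = [t for t in tape if t[0] != group]  # rewrite rest for next batch
    return reads

def per_group_tape_io(spilled):
    """Each spilled group got its own tape on the first pass: a batch
    reads only the tuples belonging to its own group."""
    return sum(len(ts) for ts in spilled.values())

# Groups B and C each spilled 1000 tuples; A finished in batch #1.
spilled = {"B": [("B", i) for i in range(1000)],
           "C": [("C", i) for i in range(1000)]}
print(single_tape_io(spilled))     # -> 3000
print(per_group_tape_io(spilled))  # -> 2000
```

With a shared tape, batch #2 reads B's and C's tuples (2000) and rewrites C's for batch #3, which rereads them (1000 more); with segregated tapes each batch reads exactly its own 1000 tuples.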