Hi,

On 09/23/2015 11:25 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> On 09/11/2015 07:16 PM, Robert Haas wrote:
>>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>>> <tomas.von...@2ndquadrant.com> wrote:
>>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>>> over-estimation separately, with proper analysis.
>>>
>>> Well, this is part of how we're looking at it differently.  I think the
>>> bug is "we're passing a value to palloc that is too large, so
>>> sometimes it fails" and the way to fix that is to properly limit the
>>> value.  You are clearly defining the bug a bit differently.

>> Yes, I see it differently.
>>
>> I don't quite understand why limiting the value is more "proper" than
>> using a function that can handle the actual value.
>>
>> The proposed bugfix addresses the issue in the most straightforward way,
>> without introducing additional considerations about possible
>> over-estimations (which the current code completely ignores, so this is
>> a new thing). I think bugfixes should not introduce such changes to
>> behavior (albeit internal), especially not without any numbers.

> This thread seems to have stalled out...
>
> After re-reading it, I'm going to agree with Robert that we should
> clamp the initial pointer-array size to work with palloc (ie, 512MB
> worth of pointers, which please recall is going to represent at
> least 10X that much in hashtable entries, probably more).

10x that much in entries? Do you mean NTUP_PER_BUCKET? Because that was reduced to 1 last year as part of the hashjoin improvements, so we now have more buckets than tuples (to keep the load factor below 1.0). It's still true that the entries need more space than the buckets (because of headers and such), but the ratio can easily end up well below 10x.

In the example reported by KaiGai-san, the entries are 8B wide (~24B with the header), while the buckets are 8B each, so the ratio is 1:3. It is a somewhat extreme example, because in other cases the tuples will be wider.
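To make the arithmetic concrete, here is a small standalone sketch (not PostgreSQL source; the tuple count and the ~16B per-tuple header overhead are assumptions matching the example above) that computes the bucket-to-entry memory ratio:

/* Hypothetical back-of-the-envelope calculation, not nodeHash.c code.
 * Assumes 8-byte bucket pointers, 8-byte tuple payloads and ~16 bytes
 * of per-tuple header, as in the example above; the tuple count is an
 * arbitrary assumption. */
#include <stdio.h>

int main(void)
{
    const double ntuples   = 100e6;      /* assumed number of inner tuples */
    const double bucket_sz = 8.0;        /* one pointer per bucket */
    const double entry_sz  = 8.0 + 16.0; /* payload + header, ~24 bytes */

    /* with NTUP_PER_BUCKET = 1 we want at least one bucket per tuple */
    double buckets_mem = ntuples * bucket_sz;
    double entries_mem = ntuples * entry_sz;

    printf("buckets: %.0f MB, entries: %.0f MB, ratio 1:%.0f\n",
           buckets_mem / 1e6, entries_mem / 1e6,
           entries_mem / buckets_mem);
    return 0;
}

With wider tuples the entry side grows while the bucket side stays at 8 bytes per tuple, which is the ratio effect discussed next.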

It also seems to me that the higher the ratio, the lower the need to actually impose such a limit, because a higher ratio increases the natural pressure keeping the number of buckets down (both the buckets and the entries need to share a work_mem of limited size).

> The argument that allowing it to be larger would be a performance win
> seems entirely unbased on any evidence, while the risks of allowing
> arbitrarily large allocations based on planner estimates seem pretty
> obvious to me.

Do we agree that resizing the hash table is not free? Because my argument was that we're forcing the well-estimated cases to do an additional resize, so maybe we should measure the impact first.
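To illustrate why a resize is not free, here is a minimal, self-contained sketch (illustrative only; the struct layout and function names are made up, this is not the actual ExecHashIncreaseNumBuckets code) of the work involved: every tuple already in the table has to be revisited and relinked into the new bucket array.

/* Illustrative sketch of a bucket-array resize for a chained hash table;
 * names and layout are hypothetical, not PostgreSQL's. */
#include <stdlib.h>

typedef struct HashEntry
{
    struct HashEntry *next;
    unsigned int      hashvalue;
} HashEntry;

/* Rebuild the bucket array: O(nbuckets_old + ntuples) work, which is why
 * forcing an extra resize on well-estimated hash tables has a cost. */
HashEntry **
resize_buckets(HashEntry **old_buckets, int old_nbuckets, int new_nbuckets)
{
    HashEntry **new_buckets = calloc(new_nbuckets, sizeof(HashEntry *));

    for (int i = 0; i < old_nbuckets; i++)
    {
        HashEntry *entry = old_buckets[i];

        while (entry != NULL)
        {
            HashEntry *next = entry->next;
            /* new_nbuckets is assumed to be a power of two */
            int         b = entry->hashvalue & (new_nbuckets - 1);

            /* relink the tuple into the new, larger bucket array */
            entry->next    = new_buckets[b];
            new_buckets[b] = entry;
            entry          = next;
        }
    }

    free(old_buckets);
    return new_buckets;
}

int main(void)
{
    int         nbuckets = 4;
    HashEntry **buckets  = calloc(nbuckets, sizeof(HashEntry *));
    HashEntry   e = { .next = NULL, .hashvalue = 123 };

    buckets[e.hashvalue & (nbuckets - 1)] = &e;
    buckets = resize_buckets(buckets, nbuckets, 8);
    free(buckets);
    return 0;
}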

Now, maybe it does not really matter much in this case - we probably get slightly inaccurate estimates all the time. Not terribly wrong, but enough to make the initial number of buckets too low, so we may end up doing the resize anyway. Also, if we're dealing with hash tables of this size, we're probably dealing with a much larger outer relation, and the additional resize is going to be just noise ...

I do, however, quite dislike the dismissal of the possible impact. It should be the responsibility of the person introducing the change to show that no such impact actually exists, not to just wave it off as "unbased on any evidence" when no evidence has been presented at all.

Had I been adding such a limit, I'd have done at least some testing and presented the results here. Perhaps I'm a bit over-protective of this piece of code, as I've spent quite a bit of time getting it faster, but I believe the principle that the person proposing a change should demonstrate its (lack of) performance impact is sound.

> And the argument that such overestimates are a bug that we can easily
> fix is based on even less evidence; in fact, I would dismiss that out
> of hand as hubris.

I don't think anyone was suggesting the overestimates are easy to fix (or even possible to fix in general). If I ever claimed that, I was clearly wrong.


> Now there is a separate issue about whether we should allow hashtable
> resizes to exceed that limit.  There I would vote yes, as long as the
> resize is based on arrival of actual data and not estimates (following
> Robert's point that the existing uses of repalloc_huge are driven by
> actual need).
>
> So I don't like any of the proposed patches exactly as-is.

> BTW, just looking at the code in question, it seems to be desperately
> in need of editorial review.  A couple of examples:
>
> * Some of the code goes to the trouble of maintaining a
> log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
> know about that and recomputes the value.

Yeah, that's a stupid bug.


> * ExecHashIncreaseNumBuckets starts out with a run-time test on
> something that its sole caller has just Assert()'d to not be the
> case, and which really ought to be impossible with or without that
> Assert.

Yeah, right. Excessively defensive coding on my side (I think I added the Assert later, or something like that).


> * ExecHashTableInsert believes it can only increase nbuckets_optimal
> if we're still in the first batch, but MultiExecHash() will call
> ExecHashIncreaseNumBuckets at the end of input-acquisition whether
> we've created more than one batch or not.  Meanwhile,
> ExecHashIncreaseNumBatches thinks it can change the number of buckets
> in any case.  Maybe that's all okay, but at the very least those
> tests ought to look like they'd heard of each other.  And of those
> three places, having the safety check where it is seems like the
> least reasonable place.

I don't quite understand where you see the problem. I believe the logic is sound, although adding a comment explaining it would certainly be appropriate. Let me explain.

The only place where we actually mess with the number of buckets is ExecHashTableInsert() around line 880. All the other places are resizes of the hash table, rebuilding it to the optimal number of buckets.

The rebuild can happen in two situations - either at the end of the input (which is what happens in MultiExecHash), or when we start batching (ExecHashIncreaseNumBatches).

Once we're batching, there's no point in further messing with buckets because we assume using more buckets would overflow work_mem.

The fact that we do the resize within ExecHashIncreaseNumBatches is merely a matter of efficiency - we are going to walk the hash table anyway (because we need to get rid of ~1/2 of the tuples), so we simply rebuild the buckets in the same pass.
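To summarize the flow described above, here is a small self-contained mock (the field and function names mirror the real ones, but this is not nodeHash.c; the starting bucket count, tuple count and growth rule are illustrative assumptions):

/* Hypothetical mock of the decision logic described above, not the
 * actual PostgreSQL executor code. */
#include <stdio.h>

#define NTUP_PER_BUCKET 1

typedef struct
{
    int  nbuckets;          /* buckets currently allocated */
    int  nbuckets_optimal;  /* buckets we would like to have */
    int  nbatch;            /* number of batches (1 = no batching yet) */
    long ntuples;           /* tuples inserted so far */
} MockHashTable;

/* mirrors the growth rule in ExecHashTableInsert(): only track a larger
 * optimal bucket count while everything still fits in a single batch */
static void
mock_insert(MockHashTable *ht)
{
    ht->ntuples++;
    if (ht->nbatch == 1 &&
        ht->ntuples > (long) ht->nbuckets_optimal * NTUP_PER_BUCKET)
        ht->nbuckets_optimal *= 2;
}

/* mirrors the rebuild at the end of input (MultiExecHash) or when
 * batching starts (ExecHashIncreaseNumBatches): resize only if needed */
static void
mock_rebuild_if_needed(MockHashTable *ht)
{
    if (ht->nbuckets != ht->nbuckets_optimal)
    {
        printf("rebuilding: %d -> %d buckets\n",
               ht->nbuckets, ht->nbuckets_optimal);
        ht->nbuckets = ht->nbuckets_optimal;
    }
}

int main(void)
{
    MockHashTable ht = { .nbuckets = 1024, .nbuckets_optimal = 1024,
                         .nbatch = 1, .ntuples = 0 };

    for (long i = 0; i < 100000; i++)
        mock_insert(&ht);
    mock_rebuild_if_needed(&ht);   /* end of input, as in MultiExecHash */
    return 0;
}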

Tracking an "optimal" number of buckets seems like an activity that
should not be constrained by whether we have any hope of being able
to apply the number.

Not sure I understand?


> So I'm not having a lot of faith that there aren't other bugs in the
> immediate area.

Well, I surely am not infallible, but so far I haven't seen proof of an actual bug, except for the needless recomputation of a value (which can only happen once) and an excessive check on an already asserted condition.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

