Hi,
On 09/23/2015 11:25 PM, Tom Lane wrote:
> Tomas Vondra <tomas.von...@2ndquadrant.com> writes:
>> On 09/11/2015 07:16 PM, Robert Haas wrote:
>>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>>> <tomas.von...@2ndquadrant.com> wrote:
>>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>>> over-estimation separately, with proper analysis.
>>> Well, this is part of how we're looking at it differently. I think the
>>> bug is "we're passing a value to palloc that is too large, so
>>> sometimes it fails" and the way to fix that is to properly limit the
>>> value. You are clearly defining the bug a bit differently.
>> Yes, I see it differently.
>> I don't quite understand why limiting the value is more "proper" than
>> using a function that can handle the actual value.
>> The proposed bugfix addresses the issue in the most straightforward way,
>> without introducing additional considerations about possible
>> over-estimations (which the current code completely ignores, so this is
>> a new thing). I think bugfixes should not introduce such changes to
>> behavior (albeit internal), especially not without any numbers.
>
> This thread seems to have stalled out...
>
> After re-reading it, I'm going to agree with Robert that we should
> clamp the initial pointer-array size to work with palloc (ie, 512MB
> worth of pointers, which please recall is going to represent at
> least 10X that much in hashtable entries, probably more).
10X that much in entries? Do you mean because of NTUP_PER_BUCKET? That was
reduced to 1 last year as part of the hashjoin improvements, so we now have
more buckets than tuples (to keep the load factor below 1.0). It's still
true that the entries need more space than the buckets (because of headers
and such), but the ratio may easily get well below 10X.

In the example reported by KaiGai-san, the entries are 8B wide (~24B with
header), while the buckets are 8B each. That's a 1:3 ratio. It's a somewhat
extreme example, because in other cases the tuples will be wider.

It also seems to me that the higher the ratio, the lower the need to
actually impose such a limit, because a higher ratio increases the natural
pressure keeping the number of buckets down (buckets and entries have to
share a work_mem of limited size).
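To make the arithmetic a bit more concrete, here is a small standalone
sketch (not the actual nodeHash.c code, and the 130M row estimate is
entirely made up) of how the initial bucket count follows from the
estimate, and what the proposed clamp does to it:

#include <stdio.h>
#include <stddef.h>

/* roughly MaxAllocSize from memutils.h (a bit under 1GB) */
#define MAX_ALLOC_SIZE  ((size_t) 0x3fffffff)

int
main(void)
{
    double  ntuples = 130e6;    /* hypothetical row estimate */
    size_t  entry_width = 24;   /* 8B of data + ~16B of overhead, as above */
    size_t  nbuckets = 1024;

    /* With NTUP_PER_BUCKET = 1 we aim for one bucket per estimated tuple,
     * rounded up to a power of two. */
    while ((double) nbuckets < ntuples)
        nbuckets *= 2;

    printf("unclamped bucket array: %zu buckets = %zu MB of pointers\n",
           nbuckets, nbuckets * sizeof(void *) / (1024 * 1024));
    printf("estimated entries:      ~%.0f MB\n",
           ntuples * entry_width / (1024.0 * 1024.0));

    /* The proposed clamp: the bucket array (a plain array of pointers)
     * has to fit into a regular palloc chunk, which with 8B pointers
     * means at most 2^26 buckets, i.e. 512MB worth of pointers. */
    while (nbuckets * sizeof(void *) > MAX_ALLOC_SIZE)
        nbuckets /= 2;

    printf("clamped bucket array:   %zu buckets = %zu MB of pointers\n",
           nbuckets, nbuckets * sizeof(void *) / (1024 * 1024));
    return 0;
}

With these made-up numbers the unclamped bucket array is 1024MB vs. ~2975MB
of entries (the 1:3 ratio from above, nowhere near 1:10), and the clamp then
halves the bucket array to 512MB.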
> The argument that allowing it to be larger would be a performance win
> seems entirely unbased on any evidence, while the risks of allowing
> arbitrarily large allocations based on planner estimates seem pretty
> obvious to me.
Do we agree that resizing the hash table is not free? My argument was that
this forces the well-estimated cases to do an additional resize, so maybe
we should measure that impact first.

Now, maybe it does not really matter much in this case - we probably get
slightly inaccurate estimates all the time. Not terribly wrong, but enough
to make the initial number of buckets a bit too low, so we may end up doing
the resize anyway. Also, if we're dealing with hash tables of this size,
we're probably dealing with a much larger outer relation, and the
additional resize is going to be just noise ...
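Just to illustrate what such a resize actually involves, here is a toy
standalone example (not the actual ExecHashIncreaseNumBuckets code, which
walks the dense-allocated chunks rather than the old bucket array, but the
cost is the same): every tuple in the in-memory batch gets touched again
and re-linked into the new, larger bucket array.

#include <stdio.h>
#include <stdlib.h>

/* Toy tuple: a hash value plus the "next in bucket" pointer, mimicking
 * the way hashjoin tuples are chained from the bucket array. */
typedef struct Tuple
{
    struct Tuple *next;
    unsigned int  hash;
} Tuple;

static Tuple  **buckets;
static size_t   nbuckets;

/* Growing the bucket array means allocating a new (larger) array and
 * re-linking every single tuple - that's the cost being discussed. */
static void
increase_num_buckets(size_t new_nbuckets)
{
    Tuple **new_buckets = calloc(new_nbuckets, sizeof(Tuple *));

    for (size_t i = 0; i < nbuckets; i++)
    {
        Tuple *tuple = buckets[i];

        while (tuple != NULL)
        {
            Tuple  *next = tuple->next;
            size_t  b = tuple->hash % new_nbuckets;

            tuple->next = new_buckets[b];
            new_buckets[b] = tuple;
            tuple = next;
        }
    }

    free(buckets);
    buckets = new_buckets;
    nbuckets = new_nbuckets;
}

int
main(void)
{
    size_t  ntuples = 1000000;

    nbuckets = 1024;            /* deliberately way too small */
    buckets = calloc(nbuckets, sizeof(Tuple *));

    for (size_t i = 0; i < ntuples; i++)
    {
        Tuple  *tuple = malloc(sizeof(Tuple));
        size_t  b;

        tuple->hash = (unsigned int) (i * 2654435761u);  /* cheap hash */
        b = tuple->hash % nbuckets;
        tuple->next = buckets[b];
        buckets[b] = tuple;
    }

    /* The resize walks all 1M tuples once more. */
    increase_num_buckets(1024 * 1024);
    printf("re-linked %zu tuples into %zu buckets\n", ntuples, nbuckets);
    return 0;
}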
I do however quite dislike the dismissal of the possible impact. It should
be the responsibility of the person introducing the change to show that no
such impact actually exists, not to just wave it off as "unbased on any
evidence" while presenting no evidence either.

Had I been adding such a limit, I'd have done at least some testing and
presented the results here. Perhaps I'm a bit over-protective of this piece
of code, as I've spent quite a bit of time making it faster, but I believe
the principle that the person proposing a change should demonstrate the
(lack of) performance impact is sound.
> And the argument that such overestimates are a bug that we can easily
> fix is based on even less evidence; in fact, I would dismiss that out
> of hand as hubris.
I don't think anyone was suggesting the overestimates are easy to fix
(or even possible to fix in general). If I ever claimed that, I was
clearly wrong.
> Now there is a separate issue about whether we should allow hashtable
> resizes to exceed that limit. There I would vote yes, as long as the
> resize is based on arrival of actual data and not estimates (following
> Robert's point that the existing uses of repalloc_huge are driven by
> actual need).
>
> So I don't like any of the proposed patches exactly as-is.
>
> BTW, just looking at the code in question, it seems to be desperately
> in need of editorial review. A couple of examples:
>
> * Some of the code goes to the trouble of maintaining a
> log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
> know about that and recomputes the value.
Yeah, that's a stupid bug.
> * ExecHashIncreaseNumBuckets starts out with a run-time test on
> something that its sole caller has just Assert()'d to not be the
> case, and which really ought to be impossible with or without that
> Assert.
Yeah, right. Excessively defensive coding on my part (I think I added the
Assert only later, or something like that).
> * ExecHashTableInsert believes it can only increase nbuckets_optimal
> if we're still in the first batch, but MultiExecHash() will call
> ExecHashIncreaseNumBuckets at the end of input-acquisition whether
> we've created more than one batch or not. Meanwhile,
> ExecHashIncreaseNumBatches thinks it can change the number of buckets
> in any case. Maybe that's all okay, but at the very least those
> tests ought to look like they'd heard of each other. And of those
> three places, having the safety check where it is seems like the
> least reasonable place.
I don't quite understand where you see a problem there. I believe the
logic is sound, although it could certainly use a comment explaining it.
Let me try.

The only place where we actually bump the number of buckets
(nbuckets_optimal) is ExecHashTableInsert(), around line 880. All the other
places are resizes of the hash table, rebuilding it to the optimal number
of buckets.

The rebuild can happen in two situations - either at the end of the input
(which is what happens in MultiExecHash), or when we start batching
(ExecHashIncreaseNumBatches).

Once we're batching, there's no point in further messing with the buckets,
because we assume using more buckets would overflow work_mem.

The fact that we do the resize within ExecHashIncreaseNumBatches is merely
a matter of efficiency - we're going to walk the hash table anyway (because
we need to get rid of ~1/2 of the tuples), so we simply rebuild the buckets
too.
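So the scheme is: while loading tuples we only track how many buckets we'd
like to have, and the bucket array actually gets rebuilt at most once -
either at the end of the input, or when we switch to batching. A standalone
sketch of that idea (grossly simplified, with made-up helper names, not the
actual nodeHash.c code):

#include <stdio.h>
#include <stdbool.h>

static long nbuckets = 1024;          /* size of the physical bucket array */
static long nbuckets_optimal = 1024;  /* what we'd currently like it to be */
static long ntuples = 0;
static bool batching = false;         /* true once we've started batching */

/* Called for every tuple while building the hash table. */
static void
insert_tuple(void)
{
    ntuples++;

    /*
     * Only track the desired number of buckets, don't touch the array yet.
     * Once we're batching, growing it further is pointless - more buckets
     * would not fit into work_mem anyway.
     */
    if (!batching && ntuples > nbuckets_optimal)    /* load factor > 1 */
        nbuckets_optimal *= 2;
}

/* Called when the input has been consumed, or when we start batching -
 * the only places where the bucket array is actually rebuilt. */
static void
increase_num_buckets(void)
{
    if (nbuckets >= nbuckets_optimal)
        return;                 /* nothing to do */

    /* ... allocate the bigger array and re-link all the tuples here ... */
    nbuckets = nbuckets_optimal;
}

int
main(void)
{
    for (long i = 0; i < 1000000; i++)
        insert_tuple();

    increase_num_buckets();     /* a single rebuild, at the end of input */
    printf("ntuples = %ld, nbuckets = %ld\n", ntuples, nbuckets);
    return 0;
}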
Tracking an "optimal" number of buckets seems like an activity that
should not be constrained by whether we have any hope of being able
to apply the number.
Not sure I understand?
> So I'm not having a lot of faith that there aren't other bugs in the
> immediate area.
Well, I'm surely not infallible, but so far I haven't seen proof of an
actual bug, except for needlessly recomputing a value (which can only
happen once) and an excessive check on an already asserted value.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services