Re: [HACKERS] DBT-3 with SF=20 got failed

2015-10-04 Thread Tom Lane
Tomas Vondra  writes:
> Anyway, I think you're right we're going in circles here. I think we 
> both presented all the arguments we had and we still disagree. I'm not 
> going to continue with this - I'm unlikely to win an argument against 
> two committers if it hasn't happened by now. Thanks for the 
> discussion though.

Since we're hard up against the 9.5beta1 deadline, I've made an executive
decision to commit just the minimal change, which I view as being to
constrain the array size to MaxAllocSize where it has been all along.

I found a second rather serious bug in the new hash code while doing that
--- it failed to ensure nbuckets was actually a power of 2 --- which did
not improve my opinion of it one bit.
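
For illustration, the kind of clamping described here could look roughly
like the standalone sketch below (the macro and function names are invented
for this example; this is not the committed code):

#include <stddef.h>

#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)    /* palloc's ~1GB limit */

/*
 * Clamp the requested number of hash buckets so the pointer array stays
 * under the allocation limit, and round down to a power of two so that
 * "hashvalue & (nbuckets - 1)" remains a valid bucket mask.
 */
size_t
clamp_nbuckets(size_t requested)
{
    size_t max_buckets = MAX_ALLOC_SIZE / sizeof(void *);  /* ~134M pointers */
    size_t nbuckets = (requested < max_buckets) ? requested : max_buckets;
    size_t pow2 = 1;

    while (pow2 * 2 <= nbuckets)
        pow2 *= 2;

    return pow2;    /* largest power of two <= the clamped value */
}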

It's clear from this discussion that there's room for further improvement
in the hashtable sizing behavior, but I think that's 9.6 material at
this point.

regards, tom lane




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-25 Thread Tomas Vondra



On 09/25/2015 02:54 AM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
 wrote:

Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the
over-estimate in one case and not the other? We're wasting the
same fraction of memory in both cases.


Well, I think we're going around in circles here. It doesn't seem
likely that either of us will convince the other.


Let's agree to disagree ;-) That's perfectly OK, no hard feelings.


But for the record, I agree with you that in the scenario you lay
out, it's about the same problem in both cases. I could argue
that it's slightly different because of [ tedious and somewhat
tenuous argument omitted ], but I'll spare you that.


OK, although that kinda prevents further discussion.


However, consider the alternative scenario where, on the same
machine, perhaps even in the same query, we perform two hash joins,
one of which involves hashing a small table (say, 2MB) and one of
which involves hashing a big table (say, 2GB). If the small query
uses twice the intended amount of memory, probably nothing bad will
happen. If the big query does the same thing, a bad outcome is much
more likely. Say the machine has 16GB of RAM. Well, a 2MB
over-allocation is not going to break the world. A 2GB
over-allocation very well might.


I've asked about case A. You've presented case B and shown that indeed, 
the limit seems to help there. I don't see how that makes any difference 
for case A, which is what I asked about.



I really don't see why this is a controversial proposition. It seems
clear as daylight from here.


I wouldn't say controversial, but I do see the proposed solution as 
misguided because we're fixing A and claiming to also fix B. Not only 
are we not really fixing B, we may actually make things needlessly 
slower for people who don't have problems with B at all.


We've run into a problem with allocating more than MaxAllocSize. The 
proposed fix (imposing an arbitrary limit) supposedly also fixes the 
over-estimation problems, but it actually does not (IMNSHO).


And I think my view is supported by the fact that solutions that seem to 
actually fix the over-estimation properly have emerged - I mean the 
"let's not build the buckets at all until the very end" and "let's start 
with nbatches=0" ideas discussed yesterday. (And I'm not saying that just 
because I proposed those two things.)


Anyway, I think you're right we're going in circles here. I think we 
both presented all the arguments we had and we still disagree. I'm not 
going to continue with this - I'm unlikely to win an argument against 
two committers if it hasn't happened by now. Thanks for the 
discussion though.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra

Hi,

On 09/23/2015 11:25 PM, Tom Lane wrote:

Tomas Vondra  writes:

On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
 wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.



Well, this is part of how we're looking at it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.



Yes, I see it differently.



I don't quite understand why limiting the value is more "proper" than
using a function that can handle the actual value.



The proposed bugfix addresses the issue in the most straightforward way,
without introducing additional considerations about possible
over-estimations (which the current code completely ignores, so this is
a new thing). I think bugfixes should not introduce such changes to
behavior (albeit internal), especially not without any numbers.


This thread seems to have stalled out...

After re-reading it, I'm going to agree with Robert that we should
clamp the initial pointer-array size to work with palloc (ie, 512MB
worth of pointers, which please recall is going to represent at
least 10X that much in hashtable entries, probably more).


10x that much in entries? Do you mean NTUP_PER_BUCKET? Because that was 
reduced to 1 last year as part of the hashjoin improvements. So we do 
have more buckets than tuples (to keep load factor below 1.0). It's 
still true the entries will need more space than buckets (because of 
headers and such), but it may easily get well below 10x.


In the example reported by KaiGai-san, the entries are 8B wide (~24B 
with header), while the buckets are 8B. That's a 1:3 ratio. It's a bit 
of an extreme example, because in other cases the tuples will be wider.


It also seems to me that the higher the ratio, the lower the need to 
actually impose such limit because it increases the natural pressure 
keeping buckets down (because both buckets and entries need to share 
work_mem of limited size).



The argument that allowing it to be larger would be a performance win
seems entirely unbased on any evidence, while the risks of allowing
arbitrarily large allocations based on planner estimates seem pretty
obvious to me.


Do we agree that resizing the hash table is not free? Because my
argument was that we're forcing the well-estimated cases to do an
additional resize, so maybe we should measure the impact first.

Now, maybe it does not really matter in this case - we probably get 
slightly inaccurate estimates all the time. Not terribly wrong, but 
enough to make the initial number of buckets too low, so we may end up 
doing the resize anyway. Also, if we're dealing with hash tables of this 
size, we're probably dealing with a much larger outer relation, and the 
additional resize is going to be just noise ...


I however quite dislike the dismissal of the possible impact. It should 
be the responsibility of the person introducing the change to show that 
no such impact actually exists, not just waving it off as "unbased on 
any evidence" when there's no evidence presented.


Had I been adding such a limit, I'd have done at least some testing and 
presented the results here. Perhaps I'm a bit over-protective of this 
piece of code, as I've spent quite a bit of time making it faster, but I 
believe the principle that the person proposing a change should 
demonstrate the (lack of) performance impact is sound.



And the argument that such overestimates are a bug that we can easily
fix is based on even less evidence; in fact, I would dismiss that out
of hand as hubris.


I don't think anyone was suggesting the overestimates are easy to fix 
(or even possible to fix in general). If I ever claimed that, I was 
clearly wrong.




Now there is a separate issue about whether we should allow hashtable
resizes to exceed that limit.  There I would vote yes, as long as the
resize is based on arrival of actual data and not estimates (following
Robert's point that the existing uses of repalloc_huge are driven by
actual need).

So I don't like any of the proposed patches exactly as-is.

BTW, just looking at the code in question, it seems to be desperately
in need of editorial review.  A couple of examples:

* Some of the code goes to the trouble of maintaining a
log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
know about that and recomputes the value.


Yeah, that's a stupid bug.



* ExecHashIncreaseNumBuckets starts out with a run-time test on
something that its sole caller has just Assert()'d to not be the
case, and which really ought to be impossible with or without that
Assert.


Yeah, right. Excessively defensive coding on my side (I think I've added 
the Assert later, or something).




* ExecHashTableInsert believes it can only increase 

Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Robert Haas
On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
 wrote:
> I however quite dislike the dismissal of the possible impact. It should be
> the responsibility of the person introducing the change to show that no such
> impact actually exists, not just waving it off as "unbased on any evidence"
> when there's no evidence presented.

So, we're talking about determining the behavior in a case that
currently fails.  Making it behave like a case that currently works
can't but be an improvement.  Making it do something that currently
never happens might be better still, or it might be equivalent, or it
might be worse.  I just don't buy the argument that somebody's got to
justify on performance grounds a decision not to allocate more memory
than we currently ever allocate.  That seems 100% backwards to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra



On 09/24/2015 01:51 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 5:50 AM, Tomas Vondra
 wrote:

I however quite dislike the dismissal of the possible impact. It should be
the responsibility of the person introducing the change to show that no such
impact actually exists, not just waving it off as "unbased on any evidence"
when there's no evidence presented.


So, we're talking about determining the behavior in a case that
currently fails. Making it behave like a case that currently works
can't but be an improvement. Making it do something that currently
never happens might be better still, or it might be equivalent, or
it might be worse. I just don't buy the argument that somebody's got
to justify on performance grounds a decision not to allocate more
memory than we currently ever allocate. That seems 100% backwards to
me.


Yes, it's true that if you hit the issue it fails, so I understand your 
view that it's a win to fix this by introducing the (arbitrary) limit. I 
disagree with this view because the behavior changes at the limit - if 
you get a good estimate just below the limit, you get no resize; if you 
get a slightly higher estimate, you get a resize.


So while it does not introduce behavior change in this particular case 
(because it fails, as you point out), it introduces a behavior change in 
general - it simply triggers behavior that does not happen below the 
limit. Would we accept the change if the proposed limit was 256MB, for 
example?


It also seems to me that we don't really need the hash table until after 
MultiExecHash(), so maybe building the hash table incrementally is just 
unnecessary and we could simply track the optimal number of buckets and 
build the buckets at the end of MultiExecHash (essentially at the place 
where we do the resize now). We'd have to walk the tuples and insert 
them into the buckets, but that seems more efficient than the 
incremental build (no data to support that at this point).
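
A toy standalone sketch of that idea (names invented, not the executor
code): collect the tuples in a list during the load phase, then size a
power-of-two bucket array from the real tuple count and fill it in one
pass at the end.

#include <stdint.h>
#include <stdlib.h>

typedef struct Tuple
{
    struct Tuple *next;
    uint32_t      hashvalue;
    int           payload;
} Tuple;

/* smallest power of two >= ntuples, i.e. load factor <= 1.0 */
static size_t
choose_nbuckets(size_t ntuples)
{
    size_t nbuckets = 1;

    while (nbuckets < ntuples)
        nbuckets <<= 1;
    return nbuckets;
}

int
main(void)
{
    Tuple  *loaded = NULL;
    size_t  ntuples = 0;

    /* load phase: just collect the tuples, no bucket array yet */
    for (int i = 0; i < 1000; i++)
    {
        Tuple *t = malloc(sizeof(Tuple));

        t->hashvalue = (uint32_t) i * 2654435761u;  /* toy hash */
        t->payload = i;
        t->next = loaded;
        loaded = t;
        ntuples++;
    }

    /* end of load: size the buckets from the real count, fill in one pass */
    size_t  nbuckets = choose_nbuckets(ntuples);
    Tuple **buckets = calloc(nbuckets, sizeof(Tuple *));

    for (Tuple *t = loaded, *next; t != NULL; t = next)
    {
        size_t b = t->hashvalue & (nbuckets - 1);

        next = t->next;
        t->next = buckets[b];
        buckets[b] = t;
    }
    return 0;
}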



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Robert Haas
On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
 wrote:
> So while it does not introduce behavior change in this particular case
> (because it fails, as you point out), it introduces a behavior change in
> general - it simply triggers behavior that does not happen below the limit.
> Would we accept the change if the proposed limit was 256MB, for example?

So, I'm a huge fan of arbitrary limits.

That's probably the single thing I'll say this year that sounds most
like a troll, but it isn't.  I really, honestly believe that.
Doubling things is very sensible when they are small, but at some
point it ceases to be sensible.  The fact that we can't set a
black-and-white threshold as to when we've crossed over that line
doesn't mean that there is no line.  We can't imagine that the
occasional 32GB allocation when 4GB would have been optimal is no more
problematic than the occasional 32MB allocation when 4MB would have
been optimal.  Where exactly to put the divider is subjective, but
"what palloc will take" is not an obviously unreasonable barometer.

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tom Lane
Robert Haas  writes:
> Of course, if we can postpone sizing the hash table until after the
> input size is known, as you suggest, then that would be better still
> (but not back-patch material).

AFAICS, it works that way today as long as the hash fits in memory
(ie, single-batch).  We load into a possibly seriously undersized hash
table, but that won't matter for performance until we start probing it.
At the conclusion of loading, MultiExecHash will call
ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
table.  I doubt this can be improved on much.

It would be good if we could adjust the numbuckets choice at the
conclusion of the input phase for the multi-batch case as well.
The code appears to believe that wouldn't work, but I'm not sure if
it's right about that, or how hard it'd be to fix if so.

regards, tom lane




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra



On 09/24/2015 05:09 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 9:49 AM, Tomas Vondra
 wrote:

So while it does not introduce behavior change in this particular
case (because it fails, as you point out), it introduces a behavior
change in general - it simply triggers behavior that does not
happen below the limit. Would we accept the change if the proposed
limit was 256MB, for example?


So, I'm a huge fan of arbitrary limits.

That's probably the single thing I'll say this year that sounds most
 like a troll, but it isn't. I really, honestly believe that.
Doubling things is very sensible when they are small, but at some
point it ceases to be sensible. The fact that we can't set a
black-and-white threshold as to when we've crossed over that line
doesn't mean that there is no line. We can't imagine that the
occasional 32GB allocation when 4GB would have been optimal is no
more problematic than the occasional 32MB allocation when 4MB would
have been optimal. Where exactly to put the divider is subjective,
but "what palloc will take" is not an obviously unreasonable
barometer.


There are two machines - one with 32GB of RAM and work_mem=2GB, the 
other one with 256GB of RAM and work_mem=16GB. The machines are hosting 
about the same data, just scaled accordingly (~8x more data on the large 
machine).


Let's assume there's a significant over-estimate - we expect to get 
about 10x the actual number of tuples, and the hash table is expected to 
almost exactly fill work_mem. Using the 1:3 ratio (as in the query at 
the beginning of this thread) we'll use ~512MB and ~4GB for the buckets, 
and the rest is for entries.


Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the 
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and 
~3.5GB (~1.3%) on the large machine.


How does it make any sense to address the 1.3% and not the 13%?



Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
 (but not back-patch material).


This dynamic resize is 9.5-only anyway.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra



On 09/24/2015 07:42 PM, Robert Haas wrote:

On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
 wrote:

There are two machines - one with 32GB of RAM and work_mem=2GB, the other
one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
same data, just scaled accordingly (~8x more data on the large machine).

Let's assume there's a significant over-estimate - we expect to get about
10x the actual number of tuples, and the hash table is expected to almost
exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
for entries.

Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
~3.5GB (~1.3%) on the large machine.

How does it make any sense to address the 1.3% and not the 13%?


One of us is confused, because from here it seems like 448MB is 1.3%
of 32GB, not 13%.


Meh, you're right - I got the math wrong. It's 1.3% in both cases.

However the question still stands - why should we handle the 
over-estimate in one case and not the other? We're wasting the same 
fraction of memory in both cases.
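
(For the record, the arithmetic: 448 MB / 32 GB = 448 / 32768 ~ 0.014, and 
3.5 GB / 256 GB = 3584 / 262144 ~ 0.014 - exactly the same fraction, 
because work_mem is 1/16 of RAM on both machines.)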


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra



On 09/24/2015 07:04 PM, Tom Lane wrote:

Tomas Vondra  writes:

But what about computing the number of expected batches, but always
start executing assuming no batching? And only if we actually fill
work_mem, we start batching and use the expected number of batches?


Hmm. You would likely be doing the initial data load with a "too
small" numbuckets for single-batch behavior, but if you successfully
loaded all the data then you could resize the table at little
penalty. So yeah, that sounds like a promising approach for cases
where the initial rowcount estimate is far above reality.


I don't understand the comment about "too small" numbuckets - isn't 
doing that the whole point of using the proposed limit? The batching is 
merely a consequence of how bad the over-estimate is.



But I kinda thought we did this already, actually.


I don't think so - I believe we haven't modified this aspect at all. It 
may not have been as pressing thanks to NTUP_PER_BUCKET=10 in the past.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tomas Vondra



On 09/24/2015 05:18 PM, Tom Lane wrote:

Robert Haas  writes:

Of course, if we can postpone sizing the hash table until after the
input size is known, as you suggest, then that would be better still
(but not back-patch material).


AFAICS, it works that way today as long as the hash fits in memory
(ie, single-batch).  We load into a possibly seriously undersized hash
table, but that won't matter for performance until we start probing it.
At the conclusion of loading, MultiExecHash will call
ExecHashIncreaseNumBuckets which will re-hash into a better-sized hash
table.  I doubt this can be improved on much.

It would be good if we could adjust the numbuckets choice at the
conclusion of the input phase for the multi-batch case as well.
The code appears to believe that wouldn't work, but I'm not sure if
it's right about that, or how hard it'd be to fix if so.


So you suggest using a small hash table even when we expect batching?

That would be rather difficult to do because of the way we derive 
buckets and batches from the hash value - they must not overlap. The 
current code simply assumes that once we start batching the number of 
bits needed for buckets does not change anymore.
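
Roughly, the derivation looks like this (a simplified paraphrase of what 
nodeHash.c does in ExecHashGetBucketAndBatch, not the exact source): the 
low log2(nbuckets) bits pick the bucket, the next bits pick the batch, so 
growing nbuckets later would steal bits already used for batch selection.

#include <stdint.h>

void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, int log2_nbuckets, uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    /* both nbuckets and nbatch are assumed to be powers of two */
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = (nbatch > 1)
        ? ((hashvalue >> log2_nbuckets) & (nbatch - 1))
        : 0;
}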


It's possible to rework of course - the initial version of the patch 
actually did just that (although it was broken in other ways).


But I think the real problem here is the batching itself - if we 
overestimate and start batching (while we could actually run with a 
single batch), we've already lost.


But what about computing the number of expected batches, but always 
start executing assuming no batching? And only if we actually fill 
work_mem, we start batching and use the expected number of batches?


I.e.

1) estimate nbatches, but use nbatches=1

2) run until exhausting work_mem

3) start batching, with the initially estimated number of batches
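
A toy standalone simulation of those three steps (all names and numbers 
invented, just to illustrate the control flow; this is not executor code):

#include <stdio.h>

#define WORK_MEM    (1024L * 1024L)  /* 1MB for the toy example */
#define TUPLE_SIZE  64L              /* pretend every tuple costs 64 bytes */

int
main(void)
{
    long estimated_tuples = 1000000; /* planner over-estimate */
    long actual_tuples = 10000;      /* what really arrives */
    int  nbatch_estimated = 1;
    int  nbatch = 1;                 /* step 1: start optimistic */
    long space_used = 0;

    /* planner-style estimate: enough batches for one batch to fit work_mem */
    while (estimated_tuples * TUPLE_SIZE / nbatch_estimated > WORK_MEM)
        nbatch_estimated *= 2;

    /* step 2: load tuples until work_mem is actually exhausted */
    for (long i = 0; i < actual_tuples; i++)
    {
        space_used += TUPLE_SIZE;
        if (space_used > WORK_MEM && nbatch == 1)
        {
            /* step 3: fall back to the originally estimated batch count */
            nbatch = nbatch_estimated;
            printf("switched to %d batches after %ld tuples\n", nbatch, i + 1);
        }
    }

    if (nbatch == 1)
        printf("stayed single-batch (the estimate suggested %d batches)\n",
               nbatch_estimated);
    return 0;
}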


regards


--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Tom Lane
Tomas Vondra  writes:
> But what about computing the number of expected batches, but always 
> start executing assuming no batching? And only if we actually fill 
> work_mem, we start batching and use the expected number of batches?

Hmm.  You would likely be doing the initial data load with a "too small"
numbuckets for single-batch behavior, but if you successfully loaded all
the data then you could resize the table at little penalty.  So yeah,
that sounds like a promising approach for cases where the initial rowcount
estimate is far above reality.

But I kinda thought we did this already, actually.

regards, tom lane




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Robert Haas
On Thu, Sep 24, 2015 at 1:58 PM, Tomas Vondra
 wrote:
> Meh, you're right - I got the math wrong. It's 1.3% in both cases.
>
> However the question still stands - why should we handle the over-estimate
> in one case and not the other? We're wasting the same fraction of memory in
> both cases.

Well, I think we're going around in circles here.  It doesn't seem
likely that either of us will convince the other.

But for the record, I agree with you that in the scenario you lay out,
it's about the same problem in both cases.  I could argue that
it's slightly different because of [ tedious and somewhat tenuous
argument omitted ], but I'll spare you that.  However, consider the
alternative scenario where, on the same machine, perhaps even in the
same query, we perform two hash joins, one of which involves hashing a
small table (say, 2MB) and one of which involves hashing a big table
(say, 2GB).  If the small query uses twice the intended amount of
memory, probably nothing bad will happen.  If the big query does the
same thing, a bad outcome is much more likely.  Say the machine has
16GB of RAM.  Well, a 2MB over-allocation is not going to break the
world.  A 2GB over-allocation very well might.

I really don't see why this is a controversial proposition.  It seems
clear as daylight from here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-24 Thread Robert Haas
On Thu, Sep 24, 2015 at 12:40 PM, Tomas Vondra
 wrote:
> There are two machines - one with 32GB of RAM and work_mem=2GB, the other
> one with 256GB of RAM and work_mem=16GB. The machines are hosting about the
> same data, just scaled accordingly (~8x more data on the large machine).
>
> Let's assume there's a significant over-estimate - we expect to get about
> 10x the actual number of tuples, and the hash table is expected to almost
> exactly fill work_mem. Using the 1:3 ratio (as in the query at the beginning
> of this thread) we'll use ~512MB and ~4GB for the buckets, and the rest is
> for entries.
>
> Thanks to the 10x over-estimate, ~64MB and 512MB would be enough for the
> buckets, so we're wasting ~448MB (13% of RAM) on the small machine and
> ~3.5GB (~1.3%) on the large machine.
>
> How does it make any sense to address the 1.3% and not the 13%?

One of us is confused, because from here it seems like 448MB is 1.3%
of 32GB, not 13%.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-23 Thread Tom Lane
Tomas Vondra  writes:
> On 09/11/2015 07:16 PM, Robert Haas wrote:
>> On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
>>  wrote:
>>> I'm arguing for fixing the existing bug, and then addressing the case of
>>> over-estimation separately, with proper analysis.

>> Well, this is part of how we're looking at it differently.  I think the
>> bug is "we're passing a value to palloc that is too large, so
>> sometimes it fails" and the way to fix that is to properly limit the
>> value.  You are clearly defining the bug a bit differently.

> Yes, I see it differently.

> I don't quite understand why limiting the value is more "proper" than 
> using a function that can handle the actual value.

> The proposed bugfix addresses the issue in the most straightforward way, 
> without introducing additional considerations about possible 
> over-estimations (which the current code completely ignores, so this is 
> a new thing). I think bugfixes should not introduce such changes to 
> behavior (albeit internal), especially not without any numbers.

This thread seems to have stalled out...

After re-reading it, I'm going to agree with Robert that we should clamp
the initial pointer-array size to work with palloc (ie, 512MB worth of
pointers, which please recall is going to represent at least 10X that much
in hashtable entries, probably more).  The argument that allowing it to be
larger would be a performance win seems entirely unbased on any evidence,
while the risks of allowing arbitrarily large allocations based on planner
estimates seem pretty obvious to me.  And the argument that such
overestimates are a bug that we can easily fix is based on even less
evidence; in fact, I would dismiss that out of hand as hubris.

Now there is a separate issue about whether we should allow hashtable
resizes to exceed that limit.  There I would vote yes, as long as the
resize is based on arrival of actual data and not estimates (following
Robert's point that the existing uses of repalloc_huge are driven by
actual need).

So I don't like any of the proposed patches exactly as-is.

BTW, just looking at the code in question, it seems to be desperately
in need of editorial review.  A couple of examples:

* Some of the code goes to the trouble of maintaining a
log2_nbuckets_optimal value, but ExecHashIncreaseNumBuckets doesn't
know about that and recomputes the value.

* ExecHashIncreaseNumBuckets starts out with a run-time test on something
that its sole caller has just Assert()'d to not be the case, and which
really ought to be impossible with or without that Assert.

* ExecHashTableInsert believes it can only increase nbuckets_optimal
if we're still in the first batch, but MultiExecHash() will call
ExecHashIncreaseNumBuckets at the end of input-acquisition whether we've
created more than one batch or not.  Meanwhile, ExecHashIncreaseNumBatches
thinks it can change the number of buckets in any case.  Maybe that's all
okay, but at the very least those tests ought to look like they'd heard of
each other.  And of those three places, having the safety check where it
is seems like the least reasonable place.  Tracking an "optimal" number
of buckets seems like an activity that should not be constrained by
whether we have any hope of being able to apply the number.

So I'm not having a lot of faith that there aren't other bugs in the
immediate area.

regards, tom lane




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Robert Haas
On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
 wrote:
> I'm arguing for fixing the existing bug, and then addressing the case of
> over-estimation separately, with proper analysis.

Well, this is part of how we're looking at it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Tomas Vondra



On 09/11/2015 07:16 PM, Robert Haas wrote:

On Fri, Sep 11, 2015 at 1:12 PM, Tomas Vondra
 wrote:

I'm arguing for fixing the existing bug, and then addressing the case of
over-estimation separately, with proper analysis.


Well, this is part of how we're looking at it differently.  I think the
bug is "we're passing a value to palloc that is too large, so
sometimes it fails" and the way to fix that is to properly limit the
value.  You are clearly defining the bug a bit differently.


Yes, I see it differently.

I don't quite understand why limiting the value is more "proper" than 
using a function that can handle the actual value.


The proposed bugfix addresses the issue in the most straightforward way, 
without introducing additional considerations about possible 
over-estimations (which the current code completely ignores, so this is 
a new thing). I think bugfixes should not introduce such changes to 
behavior (albeit internal), especially not without any numbers.


regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Robert Haas
On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
 wrote:
> Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
> buckets are just pointers). No matter how awful an estimate you get (or how
> insanely high you set work_mem) you can't exceed this.

OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit.  Right now, if we would have allocated more than 512MB,
we instead fail.  There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.

My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate.  Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.

I guess we'll need to wait for some other opinions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-11 Thread Tomas Vondra



On 09/11/2015 06:55 PM, Robert Haas wrote:

On Wed, Sep 9, 2015 at 11:54 AM, Tomas Vondra
 wrote:

Secondly, we limit the number of buckets to INT_MAX, so about 16GB (because
buckets are just pointers). No matter how awful an estimate you get (or how
insanely high you set work_mem) you can't exceed this.


OK, so this is an interesting point, and I think it clarifies things.
Essentially, we're arguing about whether a 16GB limit is as good as a
512MB limit.  Right now, if we would have allocated more than 512MB,
we instead fail.  There are two possible solutions:

1. I'm arguing for maintaining the 512MB limit, but by clamping the
allocation to 512MB (and the number of buckets accordingly) so that it
works with fewer buckets instead of failing.

2. You're arguing for removing the 512MB limit, allowing an initial
allocation of up to 16GB.


I'm arguing for fixing the existing bug, and then addressing the case of 
over-estimation separately, with proper analysis.




My judgement is that #2 could give some people a nasty surprise, in
that such a large initial allocation might cause problems, especially
if driven by a bad estimate.  Your judgement is that this is unlikely
to be a problem, and that the performance consequences of limiting a
hash join to an initial allocation of 64 million buckets rather than 2
billion buckets are the thing to worry about.


Not quite, my judgment is that

- We shouldn't address this in this particular bugfix, because it's a
  separate problem (even if we limit the initial allocation, we still
  have to fix the repalloc after we build the Hash).

- I assume the "might cause problems" refers to malloc() issues on some
  platforms. In that case we still have to apply it to both places, not
  just to the initial allocation. I don't know if this is a problem (I
  haven't heard any such reports until now), but if it is we better
  address this consistently everywhere, not just this one place.

- I'm not really sure about the impact of the additional resize. I
  surely don't want to significantly penalize the well-estimated cases,
  so I'd like to see some numbers first.



I guess we'll need to wait for some other opinions.



OK

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-09 Thread Robert Haas
On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
 wrote:
> Also, I'm not sure what other places you have in mind (could you list
> some examples?) but I'd bet we limit the allocation to 1GB because of the
> palloc() limit and not because of fear of over-estimates.

I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that  palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.

Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that
tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun.  But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.

>> More importantly, removing the cap on the allocation size makes the
>> problem a lot worse.  You might be sad if a bad planner estimate
>> causes the planner to allocate 1GB when 64MB would have been enough,
>> but on modern systems it is not likely to be an enormous problem.  If
>> a similar mis-estimation causes the planner to allocate 16GB rather
>> than 1GB, the opportunity for you to be sad is magnified pretty
>> considerably.  Therefore, I don't really see the over-estimation bug
>> fix as being separate from this one.
>
> Perhaps. But if you want to absolutely prevent such sadness then maybe you
> should not set work_mem that high?

I think that's a red herring for a number of reasons.  One, the
allocation for the hash buckets is only a small portion of the total
memory.  Two, the fact that you are OK with the hash table growing to
a certain size does not mean that you want it to start out that big on
the strength of a frequently-flawed planner estimate.

> Anyway, I do see this as a rather orthogonal problem - an independent
> improvement, mostly unrelated to the bugfix. Even if we decide to redesign
> it like this (and I'm not particularly opposed to that, assuming someone
> takes the time to measure how expensive the additional resize actually is),
> we'll still have to fix the repalloc().
>
> So I still fail to see why we shouldn't apply this fix.

In all seriousness, that is fine.  I respect your opinion; I'm just
telling you mine, which happens to be different.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-09 Thread Tomas Vondra

On 09/09/2015 03:55 PM, Robert Haas wrote:

On Tue, Sep 8, 2015 at 5:02 PM, Tomas Vondra
 wrote:

Also, I'm not sure what other places you have in mind (could you list
some examples?) but I'd bet we limit the allocation to 1GB because of the
palloc() limit and not because of fear of over-estimates.


I don't really think those two things are different from each other.
The palloc() limit is a means of enforcing a general policy of
limiting all allocations to 1GB except in places where we've made a
very conscious decision to allow a specific exception.  This limit
happens to dovetail nicely with the varlena size limit, so in many
cases it is the exactly correct limit just for that reason.  But even
when, as here, that's not at issue, it's still a useful limit, because
there are many ways that some garbage value can get passed to palloc
-- bad planner estimates, corrupted tuples, bugs in other parts of our
code.  And at least on my old MacBook Pro (I haven't tested the
current one), passing a sufficiently-large value to malloc() causes a
kernel panic.  That's probably a particularly bad bug, but there are
lots of systems where "accidentally" allocating an unreasonable amount
of space will have all kinds of unpleasant consequences.  So, I
believe that  palloc()'s limit improves the overall stability of the
system considerably even if it causes some occasional annoyance.


I'm not really buying this. The 1GB has nothing to do with platform 
limits, it's there exactly to make it varlena-like (which has exactly 
the same limit), and because it allows using 32-bit int to track all the 
bytes. Neither of these is relevant here.


It has nothing to do with malloc() limits on various platforms, and if 
there really are such limits that we think we should worry about, we 
should probably address those properly. Not by band-aiding all the 
various places independently.


And most importantly, these platform limits would apply to both the 
initial allocation and to the subsequent resize. It's utterly useless to 
just "fix" the initial allocation and then allow failure when we try to 
resize the hash table.



Most of the time, you can just palloc() and not worry too much about
whether you're going to blow up the machine: you won't, because you
aren't going to allocate more than 1GB.  Any place that wants to
allocate more than that needs to be someplace where we can be pretty
sure that we're not going to accidentally allocate some completely
unreasonable amount of memory, like say 1TB.  Nothing in this
discussion convinces me that this is such a place.  Note that


We're not going to allocate a completely unreasonable amount of memory, 
because there already are some checks in place.


Firstly, you can't really get buckets larger than ~25% of work_mem, 
because the pointer has only 8B, while the HJ tuple has 16B plus the 
data (IIRC). For wider tuples the size of buckets further decreases.


Secondly, we limit the number of buckets to INT_MAX, so about 16GB 
(because buckets are just pointers). No matter how awful an estimate you 
get (or how insanely high you set work_mem) you can't exceed this.



tuplesort.c and tuplestore.c, the only existing callers of
repalloc_huge, only allocate such large amounts of memory when they
actually have enough tuples to justify it - it is always based on the
actual number of tuples, never an estimate.  I think that would be a
sound principle here, too.  Resizing the hash table to such a large
size based on the actual load factor is very reasonable; starting with
such a large size seems less so.  Admittedly, 512MB is an arbitrary
point: and if it so happened that the limit was 256MB or 1GB or 128MB
or even 2GB I wouldn't advocate for changing it just for fun. But
you're saying we should just remove that limit altogether, and I think
that's clearly unreasonable.  Do you really want to start out with a
TB or even PB-sized hash table when the actual number of tuples is,
say, one?  That may sound crazy, but I've seen enough bad query plans
to know that, yes, we are sometimes off by nine orders of magnitude.
This is not a hypothetical problem.


No, I'm not saying anything like that - I actually explicitly stated 
that I'm not against such change (further restricting the initial hash 
table size), if someone takes the time to do a bit of testing and 
provide some numbers.


Moreover as I explained there already are limits in place (25% of 
work_mem or 16GB, whichever is lower), so I don't really see the bugfix 
as unreasonable.


Maybe if we decide to lift this restriction (using int64 to address the 
buckets, which removes the 16GB limit) this issue will get much more 
pressing. But I guess hash tables handling 2B buckets will be enough for 
the near future.



More importantly, removing the cap on the allocation size makes the
problem a lot worse.  You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been 

Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-08 Thread Kouhei Kaigai
> Hello KaiGai-san,
> 
> I've discovered a bug in the proposed patch - when resetting the hash
> table after the first batch, it does this:
> 
> memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));
> 
> The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
> the array (usually resulting in crashes due to invalid pointers).
> 
> I fixed it to
> 
>memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
> 
> Fixed patch attached (marked as v2).
>
Thanks, it was my bug - an oversight.

I want committer to push this fix.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-08 Thread Robert Haas
On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai  wrote:
>> Hello KaiGai-san,
>>
>> I've discovered a bug in the proposed patch - when resetting the hash
>> table after the first batch, it does this:
>>
>> memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));
>>
>> The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
>> the array (usually resulting in crashes due to invalid pointers).
>>
>> I fixed it to
>>
>>memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
>>
>> Fixed patch attached (marked as v2).
>>
> Thanks, it was my bug - an oversight.
>
> I want committer to push this fix.

I'm not in agreement with this fix, and would prefer to instead
proceed by limiting the initial allocation to 1GB.  Otherwise, as
KaiGai has mentioned, we might end up trying to allocate completely
unreasonable amounts of memory if the planner gives a bad estimate.
Of course, it's true (as Tomas points out) that this issue already
exists today to some degree, and it's also true (as he also points
out) that 1GB is an arbitrary limit.  But it's also true that we use
that same arbitrary 1GB limit in a lot of places, so it's hardly
without precedent.

More importantly, removing the cap on the allocation size makes the
problem a lot worse.  You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem.  If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably.  Therefore, I don't really see the over-estimation bug
fix as being separate from this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-08 Thread Tomas Vondra



On 09/08/2015 08:44 PM, Robert Haas wrote:

On Tue, Sep 8, 2015 at 8:28 AM, Kouhei Kaigai  wrote:

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash
table after the first batch, it does this:

memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of
the array (usually resulting in crashes due to invalid pointers).

I fixed it to

memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).


Thanks, it was my bug - an oversight.

I want committer to push this fix.


I'm not in agreement with this fix, and would prefer to instead
proceed by limiting the initial allocation to 1GB.  Otherwise, as
KaiGai has mentioned, we might end up trying to allocate completely
unreasonable amounts of memory if the planner gives a bad estimate.
Of course, it's true (as Tomas points out) that this issue already
exists today to some degree, and it's also true (as he also points
out) that 1GB is an arbitrary limit.  But it's also true that we use
that same arbitrary 1GB limit in a lot of places, so it's hardly
without precedent.


AFAIK the limit is not 1GB, but 512MB (because we use 2^N and the actual 
limit is 1GB-1B). But that's a minor detail.


Also, I'm not sure what other places you have in mind (could you list 
some examples?) but I'd bet we limit the allocation to 1GB because of 
the palloc() limit and not because of fear of over-estimates.



More importantly, removing the cap on the allocation size makes the
problem a lot worse.  You might be sad if a bad planner estimate
causes the planner to allocate 1GB when 64MB would have been enough,
but on modern systems it is not likely to be an enormous problem.  If
a similar mis-estimation causes the planner to allocate 16GB rather
than 1GB, the opportunity for you to be sad is magnified pretty
considerably.  Therefore, I don't really see the over-estimation bug
fix as being separate from this one.


Perhaps. But if you want to absolutely prevent such sadness then maybe 
you should not set work_mem that high?


Anyway, I do see this as a rather orthogonal problem - an independent 
improvement, mostly unrelated to the bugfix. Even if we decide to 
redesign it like this (and I'm not particularly opposed to that, 
assuming someone takes the time to measure how expensive the additional 
resize actually is), we'll still have to fix the repalloc().


So I still fail to see why we shouldn't apply this fix.

regards

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] DBT-3 with SF=20 got failed

2015-09-02 Thread Tomas Vondra

Hello KaiGai-san,

I've discovered a bug in the proposed patch - when resetting the hash 
table after the first batch, it does this:


memset(hashtable->buckets, 0, sizeof(nbuckets * sizeof(HashJoinTuple)));

The first 'sizeof' is bogus, so this only zeroes the first 8 bytes of 
the array (usually resulting in crashes due to invalid pointers).
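
For illustration, a tiny standalone program showing the difference (with 
HashJoinTuple stood in by a plain pointer type; sizes assume a typical 
64-bit build):

#include <stdio.h>

typedef void *HashJoinTuple;    /* stand-in: the real type is a pointer too */

int
main(void)
{
    int nbuckets = 1024;

    /* buggy: sizeof of the whole expression is just sizeof(size_t), i.e. 8 */
    printf("%zu\n", sizeof(nbuckets * sizeof(HashJoinTuple)));

    /* intended: the actual size of the bucket array in bytes, i.e. 8192 */
    printf("%zu\n", nbuckets * sizeof(HashJoinTuple));

    return 0;
}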


I fixed it to

  memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

Fixed patch attached (marked as v2).

kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
 src/backend/executor/nodeHash.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 906cb46..63d484c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -388,7 +388,9 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	MemoryContextSwitchTo(hashtable->batchCxt);
 
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+			   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	/*
 	 * Set up for skew optimization, if possible and there's a need for more
@@ -654,7 +656,7 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		hashtable->nbuckets = hashtable->nbuckets_optimal;
 		hashtable->log2_nbuckets = hashtable->log2_nbuckets_optimal;
 
-		hashtable->buckets = repalloc(hashtable->buckets,
+		hashtable->buckets = repalloc_huge(hashtable->buckets,
 sizeof(HashJoinTuple) * hashtable->nbuckets);
 	}
 
@@ -779,7 +781,7 @@ ExecHashIncreaseNumBuckets(HashJoinTable hashtable)
 	 * chunks)
 	 */
 	hashtable->buckets =
-		(HashJoinTuple *) repalloc(hashtable->buckets,
+		(HashJoinTuple *) repalloc_huge(hashtable->buckets,
 hashtable->nbuckets * sizeof(HashJoinTuple));
 
 	memset(hashtable->buckets, 0, sizeof(void *) * hashtable->nbuckets);
@@ -1217,7 +1219,9 @@ ExecHashTableReset(HashJoinTable hashtable)
 
 	/* Reallocate and reinitialize the hash bucket headers. */
 	hashtable->buckets = (HashJoinTuple *)
-		palloc0(nbuckets * sizeof(HashJoinTuple));
+		MemoryContextAllocHuge(hashtable->batchCxt,
+			   nbuckets * sizeof(HashJoinTuple));
+	memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));
 
 	hashtable->spaceUsed = 0;
 



Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-21 Thread Tomas Vondra

Hello KaiGai-san,

On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
...


But what is the impact on queries that actually need more than 1GB
of buckets? I assume we'd only limit the initial allocation and
still allow the resize based on the actual data (i.e. the 9.5
improvement), so the queries would start with 1GB and then resize
once finding out the optimal size (as done in 9.5). The resize is
not very expensive, but it's not free either, and with so many
tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
probably just noise in the total query runtime. But it'd be nice 
to see some proof of that ...


The problem here is we cannot know the exact size unless the Hash node 
reads the entire inner relation. All we can do is rely on the planner's 
estimation; however, it often computes a crazy number of rows. I think 
resizing of hash buckets is a reasonable compromise.


I understand the estimation problem. The question I think we need to 
answer is how to balance the behavior for well- and poorly-estimated 
cases. It'd be unfortunate if we lower the memory consumption in the 
over-estimated case while significantly slowing down the well-estimated 
ones.


I don't think we have a clear answer at this point - maybe it's not a 
problem at all and it'll be a win no matter what threshold we choose. 
But it's a separate problem from the bugfix.



I believe the patch proposed by KaiGai-san is the right one to fix
the bug discussed in this thread. My understanding is KaiGai-san
withdrew the patch as he wants to extend it to address the
over-estimation issue.

I don't think we should do that - IMHO that's an unrelated
improvement and should be addressed in a separate patch.


OK, it might not be a problem we should conclude within a few days,
just before the beta release.


I don't quite see a reason to wait for the over-estimation patch. We 
probably should backpatch the bugfix anyway (although it's much less 
likely to run into that before 9.5), and we can't really backpatch the 
behavior change there (as there's no hash resize).


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-21 Thread Kouhei Kaigai
 Hello KaiGai-san,
 
 On 08/21/2015 02:28 AM, Kouhei Kaigai wrote:
 ...
 
  But what is the impact on queries that actually need more than 1GB
  of buckets? I assume we'd only limit the initial allocation and
  still allow the resize based on the actual data (i.e. the 9.5
  improvement), so the queries would start with 1GB and then resize
  once finding out the optimal size (as done in 9.5). The resize is
  not very expensive, but it's not free either, and with so many
  tuples (requiring more than 1GB of buckets, i.e. ~130M tuples) it's
  probably just noise in the total query runtime. But it'd be nice
  to see some proof of that ...
 
  The problem here is that we cannot know the exact size unless the Hash node
  reads the entire inner relation. All we can do is rely on the
  planner's estimation; however, it often computes a crazy number of
  rows. I think resizing of hash buckets is a reasonable compromise.
 
 I understand the estimation problem. The question I think we need to
 answer is how to balance the behavior for well- and poorly-estimated
 cases. It'd be unfortunate if we lower the memory consumption in the
 over-estimated case while significantly slowing down the well-estimated
 ones.
 
 I don't think we have a clear answer at this point - maybe it's not a
 problem at all and it'll be a win no matter what threshold we choose.
 But it's a separate problem from the bugfix.

I agree this is a separate (and maybe not easy) problem.

If somebody knows of previous research in this area, please share it with us. 

  I believe the patch proposed by KaiGai-san is the right one to fix
  the bug discussed in this thread. My understanding is KaiGai-san
  withdrew the patch as he wants to extend it to address the
  over-estimation issue.
 
  I don't think we should do that - IMHO that's an unrelated
  improvement and should be addressed in a separate patch.
 
  OK, it might not be a problem we should conclude within a few days,
  just before the beta release.
 
 I don't quite see a reason to wait for the over-estimation patch. We
 probably should backpatch the bugfix anyway (although it's much less
 likely to run into that before 9.5), and we can't really backpatch the
 behavior change there (as there's no hash resize).

I don't argue this bugfix anymore.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-20 Thread Kouhei Kaigai
 On 08/19/2015 03:53 PM, Tom Lane wrote:
 
  I don't see anything very wrong with constraining the initial
  allocation to 1GB, or even less. That will prevent consuming insane
  amounts of work_mem when the planner's rows estimate is too high
  rather than too low. And we do have the ability to increase the hash
  table size on the fly.
 
 Perhaps. Limiting the initial allocation to 1GB might help with lowering
 memory consumed in case of over-estimation, and it's high enough so that
 regular queries don't run into that.
 
 My initial thought was if the memory consumption is a problem, lower
 the work_mem but after thinking about it for a while I don't see a
 reason to limit the memory consumption this way, if possible.
 
 But what is the impact on queries that actually need more than 1GB of
 buckets? I assume we'd only limit the initial allocation and still allow
 the resize based on the actual data (i.e. the 9.5 improvement), so the
 queries would start with 1GB and then resize once finding out the
 optimal size (as done in 9.5). The resize is not very expensive, but
 it's not free either, and with so many tuples (requiring more than 1GB
 of buckets, i.e. ~130M tuples) it's probably just noise in the total
 query runtime. But it'd be nice to see some proof of that ...

The problem here is that we cannot know the exact size unless the Hash node
reads the entire inner relation. All we can do is rely on the planner's
estimation; however, it often computes a crazy number of rows.
I think resizing of hash buckets is a reasonable compromise.

 Also, why 1GB and not 512MB (which is effectively the current limit on
 9.4, because we can't allocate 1GB there so we end up with 1/2 of that)?
 Why not some small percentage of work_mem, e.g. 5%?

No clear reason at this moment.

 Most importantly, this is mostly orthogonal to the original bug report.
 Even if we do this, we still have to fix the palloc after the resize.
 
 So I'd say let's do a minimal bugfix of the actual issue, and then
 perhaps improve the behavior in case of significant overestimates. The
 bugfix should happen ASAP so that it gets into 9.5.0 (and backported).
 We already have patches for that.
 
 Limiting the initial allocation deserves more thorough discussion and
 testing, and I'd argue it's 9.6 material at this point.

Indeed, the original bug was caused by the normal MemoryContextAlloc().
The issue around memory over-consumption is a new problem caused
by this change. I didn't notice it two months ago.

  The real problem here is the potential integer overflow in
  ExecChooseHashTableSize. Now that we know there is one, that should
  be fixed (and not only in HEAD/9.5).
 
 I don't think there's any integer overflow. The problem is that we end
 up with nbuckets so high that (nbuckets*8) overflows 1GB-1.
 
 There's a protection against integer overflow in place (it was there for
 a long time), but there never was any protection against the 1GB limit.
 Because we've been using much smaller work_mem values and
 NTUP_PER_BUCKET=10, so we could not actually reach it.
 
  But I believe Kaigai-san withdrew his initial proposed patch, and we
 don't have a replacement as far as I saw.
 
 I believe the patch proposed by KaiGai-san is the right one to fix the
 bug discussed in this thread. My understanding is KaiGai-san withdrew
 the patch as he wants to extend it to address the over-estimation issue.

 I don't think we should do that - IMHO that's an unrelated improvement
 and should be addressed in a separate patch.

OK, it might not be a problem we should conclude within a few days, just
before the beta release.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-20 Thread Tomas Vondra

Hi,

On 08/19/2015 03:53 PM, Tom Lane wrote:


I don't see anything very wrong with constraining the initial
allocation to 1GB, or even less. That will prevent consuming insane
amounts of work_mem when the planner's rows estimate is too high
rather than too low. And we do have the ability to increase the hash
table size on the fly.


Perhaps. Limiting the initial allocation to 1GB might help with lowering 
memory consumed in case of over-estimation, and it's high enough so that 
regular queries don't run into that.


My initial thought was if the memory consumption is a problem, lower 
the work_mem but after thinking about it for a while I don't see a 
reason to limit the memory consumption this way, if possible.


But what is the impact on queries that actually need more than 1GB of 
buckets? I assume we'd only limit the initial allocation and still allow 
the resize based on the actual data (i.e. the 9.5 improvement), so the 
queries would start with 1GB and then resize once finding out the 
optimal size (as done in 9.5). The resize is not very expensive, but 
it's not free either, and with so many tuples (requiring more than 1GB 
of buckets, i.e. ~130M tuples) it's probably just noise in the total
query runtime. But it'd be nice to see some proof of that ...


Also, why 1GB and not 512MB (which is effectively the current limit on 
9.4, because we can't allocate 1GB there so we end up with 1/2 of that)? 
Why not some small percentage of work_mem, e.g. 5%?


Most importantly, this is mostly orthogonal to the original bug report. 
Even if we do this, we still have to fix the palloc after the resize.


So I'd say let's do a minimal bugfix of the actual issue, and then 
perhaps improve the behavior in case of significant overestimates. The 
bugfix should happen ASAP so that it gets into 9.5.0 (and backported). 
We already have patches for that.


Limiting the initial allocation deserves more thorough discussion and 
testing, and I'd argue it's 9.6 material at this point.



The real problem here is the potential integer overflow in
ExecChooseHashTableSize. Now that we know there is one, that should
be fixed (and not only in HEAD/9.5).


I don't think there's any integer overflow. The problem is that we end 
up with nbuckets so high that (nbuckets*8) overflows 1GB-1.
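(Concretely: anything above 134,217,727 buckets does, and rounding the ~120M 
estimated tuples at NTUP_PER_BUCKET=1 up to a power of two lands exactly on 
2^27 = 134,217,728 buckets, i.e. the 1073741824-byte request from the original 
report.)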


There's a protection against integer overflow in place (it was there for 
a long time), but there never was any protection against the 1GB limit. 
Because we've been using much smaller work_mem values and 
NTUP_PER_BUCKET=10, so we could not actually reach it.



But I believe Kaigai-san withdrew his initial proposed patch, and we
don't have a replacement as far as I saw.


I believe the patch proposed by KaiGai-san is the right one to fix the 
bug discussed in this thread. My understanding is KaiGai-san withdrew 
the patch as he wants to extend it to address the over-estimation issue.


I don't think we should do that - IMHO that's an unrelated improvement 
and should be addressed in a separate patch.



kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 I see two ways to fix this:

 (1) enforce the 1GB limit (probably better for back-patching, if that's
 necessary)

 (2) make it work with hash tables over 1GB

 I'm in favor of (2) if there's a good way to do that. It seems a bit
 stupid not to be able to use fast hash table because there's some artificial
 limit. Are there any fundamental reasons not to use the
 MemoryContextAllocHuge fix, proposed by KaiGai-san?


 If there are no objections, I will apply the patch for 2) to HEAD and
 backpatch to 9.5.

Please don't be rush. :-)

It is not difficult to replace palloc() by palloc_huge(), however, it may lead
another problem once planner gives us a crazy estimation.
Below is my comment on the another thread.

==
Also, we may need to pay attention to reliability of scale estimation
by planner.
Even though the plan below says that Join generates 60521928028 rows,
it actually generates 776157676 rows (0.12%).


tpcds100=# EXPLAIN ANALYZE select
ws1.ws_order_number,ws1.ws_warehouse_sk wh1,ws2.ws_warehouse_sk wh2
 from web_sales ws1,web_sales ws2
 where ws1.ws_order_number = ws2.ws_order_number
   and ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk;
 QUERY PLAN


 Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
   Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
   Rows Removed by Join Filter: 127853313
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
 Sort Key: ws1.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
   ->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
 Sort Key: ws2.ws_order_number
 Sort Method: quicksort  Memory: 7083296kB
 ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
 Planning time: 0.232 ms
 Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably, 1GB is a good starting point even if it is expanded later.

-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
  On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:
 
 
  I see two ways to fix this:
 
  (1) enforce the 1GB limit (probably better for back-patching, if that's
  necessary)
 
  (2) make it work with hash tables over 1GB
 
  I'm in favor of (2) if there's a good way to do that. It seems a bit
  stupid not to be able to use fast hash table because there's some
 artificial
  limit. Are there any fundamental reasons not to use the
  MemoryContextAllocHuge fix, proposed by KaiGai-san?
 
 
  If there are no objections, I will apply the patch for 2) to HEAD and
  backpatch to 9.5.
 
 Please don't be rush. :-)


Please explain what rush you see?


 It is not difficult to replace palloc() by palloc_huge(), however, it may
 lead
 another problem once planner gives us a crazy estimation.
 Below is my comment on the another thread.


 Yes, I can read both threads and would apply patches for each problem.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 Please don't be rush. :-)

 Please explain what rush you see?

Yours.  You appear to be in a hurry to apply patches that there's no
consensus on.

 It is not difficult to replace palloc() by palloc_huge(), however, it
 may lead another problem once planner gives us a crazy estimation.
 Below is my comment on the another thread.

  Yes, I can read both threads and would apply patches for each problem.

I don't see anything very wrong with constraining the initial allocation
to 1GB, or even less.  That will prevent consuming insane amounts of
work_mem when the planner's rows estimate is too high rather than too low.
And we do have the ability to increase the hash table size on the fly.

The real problem here is the potential integer overflow in
ExecChooseHashTableSize.  Now that we know there is one, that should be
fixed (and not only in HEAD/9.5).  But I believe Kaigai-san withdrew his
initial proposed patch, and we don't have a replacement as far as I saw.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com wrote:


 I see two ways to fix this:

 (1) enforce the 1GB limit (probably better for back-patching, if that's
 necessary)

 (2) make it work with hash tables over 1GB

 I'm in favor of (2) if there's a good way to do that. It seems a bit
 stupid not to be able to use fast hash table because there's some
 artificial limit. Are there any fundamental reasons not to use the
 MemoryContextAllocHuge fix, proposed by KaiGai-san?


If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Kohei KaiGai
2015-08-19 21:29 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
 On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-08-19 20:12 GMT+09:00 Simon Riggs si...@2ndquadrant.com:
  On 12 June 2015 at 00:29, Tomas Vondra tomas.von...@2ndquadrant.com
  wrote:
 
 
  I see two ways to fix this:
 
  (1) enforce the 1GB limit (probably better for back-patching, if that's
  necessary)
 
  (2) make it work with hash tables over 1GB
 
  I'm in favor of (2) if there's a good way to do that. It seems a bit
  stupid not to be able to use fast hash table because there's some
  artificial
  limit. Are there any fundamental reasons not to use the
  MemoryContextAllocHuge fix, proposed by KaiGai-san?
 
 
  If there are no objections, I will apply the patch for 2) to HEAD and
  backpatch to 9.5.
 
 Please don't be rush. :-)


 Please explain what rush you see?

Unless we have a fail-safe mechanism for when the planner estimates a much
larger number of tuples than actually needed, a strange estimation will
consume a massive amount of RAM. It's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to
be revised.

Thanks,

 It is not difficult to replace palloc() by palloc_huge(), however, it may
 lead
 another problem once planner gives us a crazy estimation.
 Below is my comment on the another thread.


  Yes, I can read both threads and would apply patches for each problem.

 --
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Simon Riggs
On 19 August 2015 at 14:53, Tom Lane t...@sss.pgh.pa.us wrote:

 Simon Riggs si...@2ndquadrant.com writes:
  On 19 August 2015 at 12:55, Kohei KaiGai kai...@kaigai.gr.jp wrote:
  Please don't be rush. :-)

  Please explain what rush you see?

 Yours.  You appear to be in a hurry to apply patches that there's no
 consensus on.


I think that comment is unreasonable.

The problem was reported 2 months ago; following independent confirmation
of the proposed patch, I suggested committing it, with these words:

If there are no objections, I will apply the patch for 2) to HEAD and
backpatch to 9.5.

I was clearly waiting for objections before acting, to test whether there
was consensus or not.

Please explain what procedure you would like committers to follow?

-- 
Simon Riggs   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hi,

On 08/19/2015 01:55 PM, Kohei KaiGai wrote:

  Merge Join  (cost=25374644.08..1160509591.61 rows=60521928028
width=24) (actual time=138347.979..491889.343 rows=776157676 loops=1)
Merge Cond: (ws1.ws_order_number = ws2.ws_order_number)
   Join Filter: (ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
Rows Removed by Join Filter: 127853313
->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=73252.300..79017.420 rows=72001237 loops=1)
  Sort Key: ws1.ws_order_number
  Sort Method: quicksort  Memory: 7083296kB
  ->  Seq Scan on web_sales ws1  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.023..39951.201 rows=72001237
loops=1)
->  Sort  (cost=12687322.04..12867325.16 rows=72001248 width=16)
(actual time=65095.655..128885.811 rows=904010978 loops=1)
  Sort Key: ws2.ws_order_number
  Sort Method: quicksort  Memory: 7083296kB
  ->  Seq Scan on web_sales ws2  (cost=0.00..3290612.48
rows=72001248 width=16) (actual time=0.014..31046.888 rows=72001237
loops=1)
  Planning time: 0.232 ms
  Execution time: 530176.521 ms
(14 rows)


So, even if we allow nodeHash.c to allocate hash buckets larger than
1GB, the initial size should be determined carefully.
Probably, 1GB is a good starting point even if it is expanded later.


I'm not sure I understand what is the problem here? Could you elaborate?

The initial size of the hash table is determined using the estimate, and 
if we overestimate it will create more buckets (i.e. consuming more 
memory) and/or start batching (which might be unnecessary).
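
To make that concrete, here is a much-simplified standalone sketch of the 
sizing decision (the real logic is ExecChooseHashTableSize() in nodeHash.c, 
which also clamps nbuckets and rounds it to a power of two; the flat 40-byte 
per-tuple cost and the missing caps here are assumptions for illustration only):

    #include <stdio.h>
    #include <stdint.h>

    #define NTUP_PER_BUCKET 1

    static void
    choose_size(double ntuples, uint64_t tupsize, uint64_t work_mem_bytes,
                double *nbuckets, double *nbatch)
    {
        double inner_bytes = ntuples * tupsize;

        if (inner_bytes > (double) work_mem_bytes)
        {
            /* won't fit: size buckets for one work_mem worth, batch the rest */
            *nbuckets = (double) (work_mem_bytes / tupsize) / NTUP_PER_BUCKET;
            *nbatch = inner_bytes / work_mem_bytes;
        }
        else
        {
            /* fits: single batch, one bucket per estimated tuple */
            *nbuckets = ntuples / NTUP_PER_BUCKET;
            *nbatch = 1;
        }
    }

    int
    main(void)
    {
        double   nbuckets, nbatch;
        uint64_t work_mem = 48ULL * 1024 * 1024 * 1024;  /* 48GB, as in the report */

        choose_size(60521928028.0, 40, work_mem, &nbuckets, &nbatch);  /* estimated rows */
        printf("estimate: nbuckets ~ %.0f, nbatch ~ %.0f\n", nbuckets, nbatch);

        choose_size(776157676.0, 40, work_mem, &nbuckets, &nbatch);    /* actual rows */
        printf("actual:   nbuckets ~ %.0f, nbatch ~ %.0f\n", nbuckets, nbatch);
        return 0;
    }

Fed the estimated vs. actual row counts from the plan above, the sketch books 
over a billion buckets and dozens of batches for the estimate, while the actual 
rows would have fit in a single batch - which is exactly the wasted memory 
being discussed.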


But I don't really see any more careful way to do this, without 
penalizing the cases where the estimate is actually correct - e.g. by 
starting with much smaller buckets (and then resizing the hash table, 
which is not free). Or by starting without batching, betting that we 
won't actually need it.


I think it'll be very difficult to get those working without causing 
real trouble to cases where we actually do have good estimates (and 
those are the vast majority of queries).


But both of those are features, and we're dealing with a bug fix here.


kind regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hello KaiGai-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have a fail-safe mechanism for when the planner estimates a much
larger number of tuples than actually needed, a strange estimation will
consume a massive amount of RAM. It's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to
be revised.


I agree we need to put a few more safeguards there (e.g. make sure we 
don't overflow INT when counting the buckets, which may happen with the 
amounts of work_mem we'll see in the wild soon).


But I think we should not do any extensive changes to how we size the 
hashtable - that's not something we should do in a bugfix I think.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread David Rowley
On 12 June 2015 at 02:40, Kohei KaiGai kai...@kaigai.gr.jp wrote:

 2015-06-11 23:28 GMT+09:00 Robert Haas robertmh...@gmail.com:
  On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com
 wrote:
  The attached patch replaces this palloc0() by MemoryContextAllocHuge()
 + memset().
  Indeed, this hash table is constructed towards the relation with
 nrows=119994544,
  so, it is not strange even if hash-slot itself is larger than 1GB.
 
  You forgot to attach the patch, I think.
 
 Oops, I forgot to attach indeed.

   It looks to me like the size
  of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
  That's a lot of buckets, but maybe not unreasonably many if you've got
  enough memory.
 
 EXPLAIN says, this Hash node takes underlying SeqScan with
 119994544 (~119 million) rows, but it is much smaller than my
 work_mem setting.


I've just run into this problem while running a TPC-H benchmark of 100GB,
on a machine with 64GB of RAM.
When attempting to run Q21 with a work_mem of 10GB I'm getting:
 ERROR:  invalid memory alloc request size 1073741824

Setting work_mem to 1GB or less gets the query running.

I've patched the code with your patch Kohei, and it's now working.

Thought I'd better post this just in case this gets forgotten about.

Thanks

David

--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-08-19 Thread Tomas Vondra

Hi,

On 08/20/2015 04:15 AM, Tomas Vondra wrote:

Hello KaiGai-san,

On 08/19/2015 03:19 PM, Kohei KaiGai wrote:

Unless we have a fail-safe mechanism for when the planner estimates a much
larger number of tuples than actually needed, a strange estimation will
consume a massive amount of RAM. It's a bad side effect.
My previous patch didn't pay attention to that scenario, so it needs to
be revised.


I agree we need to put a few more safeguards there (e.g. make sure we
don't overflow INT when counting the buckets, which may happen with the
amounts of work_mem we'll see in the wild soon).

But I think we should not do any extensive changes to how we size the
hashtable - that's not something we should do in a bugfix I think.


Attached are two alternative versions of a backpatch. Both limit the 
nbuckets so that it never exceeds MaxAllocSize - effectively 512MB due 
to the doubling (so ~67M buckets on 64-bit architectures).
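(For the arithmetic: MaxAllocSize is 1GB - 1, which holds at most 134,217,727 
eight-byte pointers, and the largest power of two below that is 2^26 = 
67,108,864 buckets, i.e. a 512MB array.)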


The only difference is that the 'alternative' patch limits max_pointers

+   /* ensure we don't exceed the maximum allocation size */
+   max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));

so it affects both nbuckets and nbatches. That seems a bit more correct, 
but I guess whoever gets this many batches would be grateful even for 
the quick crash.



For master, I think the right patch is what KaiGai-san posted in June. I 
don't think we should really try making it smarter about handling 
overestimates at this point - that's 9.6 stuff IMNSHO.


I find it a bit awkward that we only have MemoryContextAllocHuge and 
repalloc_huge, especially as nodeHash.c needs MemoryContextAllocHuge + 
memset to zero the chunk.


So I think we should extend the memutils API by adding palloc0_huge (and 
possibly palloc_huge, although that's not needed by nodeHash.c).
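
Presumably such a wrapper would just mirror palloc0() on top of 
MemoryContextAllocHuge(), along these lines (a sketch of the proposal only, 
not a committed API):

    #include "postgres.h"
    #include "utils/memutils.h"

    void *
    palloc0_huge(Size size)
    {
        /* huge variant of palloc0(): allocate in CurrentMemoryContext, then zero */
        void   *ret = MemoryContextAllocHuge(CurrentMemoryContext, size);

        memset(ret, 0, size);
        return ret;
    }

That would let nodeHash.c drop the explicit memset it currently needs after 
the huge allocation.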



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..e72ac8b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -475,6 +475,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		lbuckets = (hash_table_bytes / tupsize) / NTUP_PER_BUCKET;
 		lbuckets = Min(lbuckets, max_pointers);
+		lbuckets = Min(lbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) lbuckets;
 
 		dbatch = ceil(inner_rel_bytes / hash_table_bytes);
@@ -491,6 +492,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 		dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
 		dbuckets = Min(dbuckets, max_pointers);
+		dbuckets = Min(dbuckets, MaxAllocSize / sizeof(void*));
 		nbuckets = (int) dbuckets;
 
 		nbatch = 1;
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6917d3f..ca90aed 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -465,6 +465,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	max_pointers = (work_mem * 1024L) / sizeof(void *);
 	/* also ensure we avoid integer overflow in nbatch and nbuckets */
 	max_pointers = Min(max_pointers, INT_MAX / 2);
+	/* ensure we don't exceed the maximum allocation size */
+	max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void*));
 
 	if (inner_rel_bytes > hash_table_bytes)
 	{
@@ -500,7 +502,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	 * Both nbuckets and nbatch must be powers of 2 to make
 	 * ExecHashGetBucketAndBatch fast.  We already fixed nbatch; now inflate
 	 * nbuckets to the next larger power of 2.  We also force nbuckets to not
-	 * be real small, by starting the search at 2^10.  (Note: above we made
+	 * be real small, by starting the search at 2^10, or too large - we allocate
+	 * them as a single chunk, so must fit in MaxAllocSize. (Note: above we made
 	 * sure that nbuckets is not more than INT_MAX / 2, so this loop cannot
 	 * overflow, nor can the final shift to recalculate nbuckets.)
 	 */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Jan Wieck

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?


work_mem=48GB

My machine mounts 256GB physical RAM.


work_mem can be allocated several times per backend. Nodes like sort and 
hash_aggregate may each allocate that much. You should set work_mem to a 
fraction of physical-RAM / concurrent-connections depending on the 
complexity of your queries. 48GB does not sound reasonable.
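(Purely as an illustration with assumed numbers: with 256GB of RAM, 100 
connections, and perhaps 4 sort/hash nodes per query, that rule of thumb gives 
about 256GB / (100 * 4), i.e. roughly 650MB per work_mem, well below 48GB.)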



Regards, Jan

--
Jan Wieck
Senior Software Engineer
http://slony.info


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Robert Haas
On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
 memset().
 Indeed, this hash table is constructed towards the relation with 
 nrows=119994544,
 so, it is not strange even if hash-slot itself is larger than 1GB.

You forgot to attach the patch, I think.  It looks to me like the size
of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
That's a lot of buckets, but maybe not unreasonably many if you've got
enough memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:20 GMT+09:00 Jan Wieck j...@wi3ck.info:
 On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

 curious: what was work_mem set to?

 work_mem=48GB

 My machine mounts 256GB physical RAM.


 work_mem can be allocated several times per backend. Nodes like sort and
 hash_aggregate may each allocate that much. You should set work_mem to a
 fraction of physical-RAM / concurrent-connections depending on the
 complexity of your queries. 48GB does not sound reasonable.

A smaller max_connections and a large work_mem are a usual configuration
for typical OLAP workloads.

Even if the configuration is not reasonable, this is not the right error
message. People cannot understand how to fix it.

psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Tom Lane
Tomas Vondra tomas.von...@2ndquadrant.com writes:
 Interestingly, the hash code checks for INT_MAX overflows on a number of 
 places, but does not check for this ...

Yeah, and at least at one time there were checks to prevent the hash table
request from exceeding MaxAllocSize.  Did those get removed by somebody?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Tomas Vondra

Hi,

On 06/11/15 16:20, Jan Wieck wrote:

On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

curious: what was work_mem set to?


work_mem=48GB

My machine mounts 256GB physical RAM.


work_mem can be allocated several times per backend. Nodes like sort
and hash_aggregate may each allocate that much. You should set
work_mem to a fraction of physical-RAM / concurrent-connections
depending on the complexity of your queries. 48GB does not sound
reasonable.


That's true, but there are cases where values like this may be useful 
(e.g. for a particular query). We do allow such work_mem values, so I 
consider this failure to be a bug.


It probably existed in the past, but was amplified by the hash join 
improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead 
of NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we 
also use much more memory than we had in the past.
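
The difference is easy to quantify with the row count from this report. Here is 
a standalone illustration (not PostgreSQL code; it assumes 8-byte bucket 
pointers, MaxAllocSize of 1GB - 1, and mimics the power-of-two rounding and the 
2^10 floor used by nodeHash.c):

    #include <stdio.h>
    #include <stdint.h>

    #define MAX_ALLOC_SIZE 0x3fffffffUL     /* 1GB - 1, as in memutils.h */

    static uint64_t
    next_pow2(uint64_t n)
    {
        uint64_t p = 1024;                  /* nodeHash.c never goes below 2^10 */

        while (p < n)
            p <<= 1;
        return p;
    }

    int
    main(void)
    {
        uint64_t ntuples = 119994544;       /* inner relation rows from the report */
        int      ntup_per_bucket[] = {10, 1};

        for (int i = 0; i < 2; i++)
        {
            uint64_t nbuckets = next_pow2(ntuples / ntup_per_bucket[i]);
            uint64_t bytes = nbuckets * sizeof(void *);

            printf("NTUP_PER_BUCKET=%-2d: nbuckets=%llu, bucket array=%llu bytes (%s MaxAllocSize)\n",
                   ntup_per_bucket[i],
                   (unsigned long long) nbuckets,
                   (unsigned long long) bytes,
                   bytes > MAX_ALLOC_SIZE ? "exceeds" : "fits in");
        }
        return 0;
    }

With NTUP_PER_BUCKET=10 the ~120M tuples round up to 16M buckets (128MB), while 
NTUP_PER_BUCKET=1 rounds up to 2^27 buckets, i.e. exactly the 1073741824-byte 
request that triggers the error.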


Interestingly, the hash code checks for INT_MAX overflows on a number of 
places, but does not check for this ...


regards
--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:33 GMT+09:00 Tomas Vondra tomas.von...@2ndquadrant.com:
 Hi,

 On 06/11/15 16:20, Jan Wieck wrote:

 On 06/11/2015 09:53 AM, Kouhei Kaigai wrote:

 curious: what was work_mem set to?

 work_mem=48GB

 My machine mounts 256GB physical RAM.


 work_mem can be allocated several times per backend. Nodes like sort
 and hash_aggregate may each allocate that much. You should set
 work_mem to a fraction of physical-RAM / concurrent-connections
 depending on the complexity of your queries. 48GB does not sound
 reasonable.


 That's true, but there are cases where values like this may be useful (e.g.
 for a particular query). We do allow such work_mem values, so I consider
 this failure to be a bug.

 It probably existed in the past, but was amplified by the hash join
 improvements I did for 9.5, because that uses NTUP_PER_BUCKET=1 instead of
 NTUP_PER_BUCKET=10. So the arrays of buckets are much larger, and we also
 use much more memory than we had in the past.

 Interestingly, the hash code checks for INT_MAX overflows on a number of
 places, but does not check for this ...

Which number should be changed in this case?

Indeed, nbuckets is declared as int, so INT_MAX is the hard limit on the
number of hash slots. However, some extreme usage can easily create a
situation where we hit this restriction.

Should we make nbuckets a long int instead?

-- 
KaiGai Kohei kai...@kaigai.gr.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kohei KaiGai
2015-06-11 23:28 GMT+09:00 Robert Haas robertmh...@gmail.com:
 On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
 memset().
 Indeed, this hash table is constructed towards the relation with 
 nrows=119994544,
 so, it is not strange even if hash-slot itself is larger than 1GB.

 You forgot to attach the patch, I think.

Oops, I forgot to attach indeed.

  It looks to me like the size
 of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
 That's a lot of buckets, but maybe not unreasonably many if you've got
 enough memory.

EXPLAIN says, this Hash node takes underlying SeqScan with
119994544 (~119 million) rows, but it is much smaller than my
work_mem setting.

-- 
KaiGai Kohei kai...@kaigai.gr.jp


hashslot-allocation-by-huge-alloc.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Merlin Moncure
On Wed, Jun 10, 2015 at 9:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 Hello,

 I got the following error during DBT-3 benchmark with SF=20.

   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
   psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824

 It looks to me like the Hash node tries to allocate a 1GB area using palloc0(),
 but it exceeds the limitation of the non-huge interface.

 (gdb) bt
 #0  0x7f669d29a989 in raise () from /lib64/libc.so.6
 #1  0x7f669d29c098 in abort () from /lib64/libc.so.6
 #2  0x0090ccfd in ExceptionalCondition (conditionName=0xb18130 
!(((Size) (size) <= ((Size) 0x3fffffff))),
 errorType=0xb17efd FailedAssertion, fileName=0xb17e40 mcxt.c, 
 lineNumber=856) at assert.c:54
 #3  0x0093ad53 in palloc0 (size=1073741824) at mcxt.c:856
 #4  0x00673045 in ExecHashTableCreate (node=0x7f669de951f0, 
 hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
 #5  0x006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
 #6  0x0065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
 #7  0x00681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
 #8  0x0065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
 #9  0x00681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
 #10 0x0065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
 #11 0x00681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
 #12 0x0065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
 #13 0x00681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
 #14 0x0065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469

 The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
 memset().
 Indeed, this hash table is constructed towards the relation with 
 nrows=119994544,
 so, it is not strange even if hash-slot itself is larger than 1GB.

 Another allocation request, which potentially resets or expands the
 hash-slot array, may also need to be the Huge version of memory
 allocation, I think.

 Thanks,

 Below is the query itself and EXPLAIN result.
 
 dbt3c=# EXPLAIN VERBOSE
 dbt3c-# select
 dbt3c-# s_name,
 dbt3c-# count(*) as numwait
 dbt3c-# from
 dbt3c-# supplier,
 dbt3c-# lineitem l1,
 dbt3c-# orders,
 dbt3c-# nation
 dbt3c-# where
 dbt3c-# s_suppkey = l1.l_suppkey
 dbt3c-# and o_orderkey = l1.l_orderkey
 dbt3c-# and o_orderstatus = 'F'
 dbt3c-# and l1.l_receiptdate > l1.l_commitdate
 dbt3c-# and exists (
 dbt3c(# select
 dbt3c(# *
 dbt3c(# from
 dbt3c(# lineitem l2
 dbt3c(# where
 dbt3c(# l2.l_orderkey = l1.l_orderkey
 dbt3c(# and l2.l_suppkey <> l1.l_suppkey
 dbt3c(# )
 dbt3c-# and not exists (
 dbt3c(# select
 dbt3c(# *
 dbt3c(# from
 dbt3c(# lineitem l3
 dbt3c(# where
 dbt3c(# l3.l_orderkey = l1.l_orderkey
 dbt3c(# and l3.l_suppkey <> l1.l_suppkey
 dbt3c(# and l3.l_receiptdate > l3.l_commitdate
 dbt3c(# )
 dbt3c-# and s_nationkey = n_nationkey
 dbt3c-# and n_name = 'UNITED KINGDOM'
 dbt3c-# group by
 dbt3c-# s_name
 dbt3c-# order by
 dbt3c-# numwait desc,
 dbt3c-# s_name
 dbt3c-# LIMIT 100;

 QUERY PLAN

 --
 --
 --
  Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
Output: supplier.s_name, (count(*))
->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
  Output: supplier.s_name, (count(*))
  Sort Key: (count(*)) DESC, supplier.s_name
  ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
Output: supplier.s_name, count(*)
Group Key: supplier.s_name
->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 
 width=26)
  Output: supplier.s_name
  ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 
 width=34)
Output: supplier.s_name, l1.l_suppkey, 
 l1.l_orderkey
Join Filter: (supplier.s_nationkey = 
 nation.n_nationkey)
->  Nested Loop  (cost=4831094.37..6792736.19 
 rows=1 width=38)
  Output: supplier.s_name, 
 

Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Kouhei Kaigai
 curious: what was work_mem set to?

work_mem=48GB

My machine mounts 256GB physical RAM.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: Merlin Moncure [mailto:mmonc...@gmail.com]
 Sent: Thursday, June 11, 2015 10:52 PM
 To: Kaigai Kouhei(海外 浩平)
 Cc: pgsql-hackers@postgreSQL.org
 Subject: Re: [HACKERS] DBT-3 with SF=20 got failed
 
 On Wed, Jun 10, 2015 at 9:57 PM, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
  Hello,
 
  I got the following error during DBT-3 benchmark with SF=20.
 
psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
 
  It looks to me like the Hash node tries to allocate a 1GB area using palloc0(),
  but it exceeds the limitation of the non-huge interface.
 
  (gdb) bt
  #0  0x7f669d29a989 in raise () from /lib64/libc.so.6
  #1  0x7f669d29c098 in abort () from /lib64/libc.so.6
  #2  0x0090ccfd in ExceptionalCondition (conditionName=0xb18130
 !(((Size) (size) <= ((Size) 0x3fffffff))),
  errorType=0xb17efd FailedAssertion, fileName=0xb17e40 mcxt.c,
 lineNumber=856) at assert.c:54
  #3  0x0093ad53 in palloc0 (size=1073741824) at mcxt.c:856
  #4  0x00673045 in ExecHashTableCreate (node=0x7f669de951f0,
 hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
  #5  0x006752e1 in ExecHashJoin (node=0x24d74e0) at 
  nodeHashjoin.c:169
  #6  0x0065abf4 in ExecProcNode (node=0x24d74e0) at 
  execProcnode.c:477
  #7  0x00681026 in ExecNestLoop (node=0x24d6668) at 
  nodeNestloop.c:123
  #8  0x0065abca in ExecProcNode (node=0x24d6668) at 
  execProcnode.c:469
  #9  0x00681026 in ExecNestLoop (node=0x24d61f8) at 
  nodeNestloop.c:123
  #10 0x0065abca in ExecProcNode (node=0x24d61f8) at 
  execProcnode.c:469
  #11 0x00681026 in ExecNestLoop (node=0x24d5478) at 
  nodeNestloop.c:123
  #12 0x0065abca in ExecProcNode (node=0x24d5478) at 
  execProcnode.c:469
  #13 0x00681026 in ExecNestLoop (node=0x24d51d0) at 
  nodeNestloop.c:123
  #14 0x0065abca in ExecProcNode (node=0x24d51d0) at 
  execProcnode.c:469
 
  The attached patch replaces this palloc0() by MemoryContextAllocHuge() +
 memset().
  Indeed, this hash table is constructed towards the relation with
 nrows=119994544,
  so, it is not strange even if hash-slot itself is larger than 1GB.
 
  Another allocation request, which potentially resets or expands the
  hash-slot array, may also need to be the Huge version of memory
  allocation, I think.
 
  Thanks,
 
  Below is the query itself and EXPLAIN result.
  
  dbt3c=# EXPLAIN VERBOSE
  dbt3c-# select
  dbt3c-# s_name,
  dbt3c-# count(*) as numwait
  dbt3c-# from
  dbt3c-# supplier,
  dbt3c-# lineitem l1,
  dbt3c-# orders,
  dbt3c-# nation
  dbt3c-# where
  dbt3c-# s_suppkey = l1.l_suppkey
  dbt3c-# and o_orderkey = l1.l_orderkey
  dbt3c-# and o_orderstatus = 'F'
  dbt3c-# and l1.l_receiptdate > l1.l_commitdate
  dbt3c-# and exists (
  dbt3c(# select
  dbt3c(# *
  dbt3c(# from
  dbt3c(# lineitem l2
  dbt3c(# where
  dbt3c(# l2.l_orderkey = l1.l_orderkey
  dbt3c(# and l2.l_suppkey <> l1.l_suppkey
  dbt3c(# )
  dbt3c-# and not exists (
  dbt3c(# select
  dbt3c(# *
  dbt3c(# from
  dbt3c(# lineitem l3
  dbt3c(# where
  dbt3c(# l3.l_orderkey = l1.l_orderkey
  dbt3c(# and l3.l_suppkey <> l1.l_suppkey
  dbt3c(# and l3.l_receiptdate > l3.l_commitdate
  dbt3c(# )
  dbt3c-# and s_nationkey = n_nationkey
  dbt3c-# and n_name = 'UNITED KINGDOM'
  dbt3c-# group by
  dbt3c-# s_name
  dbt3c-# order by
  dbt3c-# numwait desc,
  dbt3c-# s_name
  dbt3c-# LIMIT 100;
 
  QUERY PLAN
 
 
 
 --
 
 
 --
  --
   Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
 Output: supplier.s_name, (count(*))
  ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
   Output: supplier.s_name, (count(*))
   Sort Key: (count(*)) DESC, supplier.s_name
    ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
 Output: supplier.s_name, count(*)
 Group Key: supplier.s_name

Re: [HACKERS] DBT-3 with SF=20 got failed

2015-06-11 Thread Tomas Vondra



On 06/11/15 16:54, Tom Lane wrote:

Tomas Vondra tomas.von...@2ndquadrant.com writes:

Interestingly, the hash code checks for INT_MAX overflows on a number of
places, but does not check for this ...


Yeah, and at least at one time there were checks to prevent the hash
table request from exceeding MaxAllocSize. Did those get removed by
somebody?


I think the problem is in this piece of code:

  if ((hashtable->nbatch == 1) &&
      (hashtable->nbuckets_optimal <= INT_MAX / 2) &&   /* overflow protection */
      (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
  {
      hashtable->nbuckets_optimal *= 2;
      hashtable->log2_nbuckets_optimal += 1;
  }

ISTM it does not check against the max_pointers (that's only done in 
ExecChooseHashTableSize).
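
For what it's worth, a guard here mirroring the ExecChooseHashTableSize cap 
might look something like the following; this is only an untested sketch to 
illustrate the missing check, not a proposed patch:

  if ((hashtable->nbatch == 1) &&
      (hashtable->nbuckets_optimal <= INT_MAX / 2) &&   /* overflow protection */
      /* keep the doubled bucket array within MaxAllocSize */
      (hashtable->nbuckets_optimal <= MaxAllocSize / sizeof(HashJoinTuple) / 2) &&
      (ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
  {
      hashtable->nbuckets_optimal *= 2;
      hashtable->log2_nbuckets_optimal += 1;
  }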




--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] DBT-3 with SF=20 got failed

2015-06-10 Thread Kouhei Kaigai
Hello,

I got the following error during DBT-3 benchmark with SF=20.

  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824
  psql:query21.sql:50: ERROR:  invalid memory alloc request size 1073741824

It looks to me like the Hash node tries to allocate a 1GB area using palloc0(),
but it exceeds the limitation of the non-huge interface.

(gdb) bt
#0  0x7f669d29a989 in raise () from /lib64/libc.so.6
#1  0x7f669d29c098 in abort () from /lib64/libc.so.6
#2  0x0090ccfd in ExceptionalCondition (conditionName=0xb18130 
!(((Size) (size) <= ((Size) 0x3fffffff))),
errorType=0xb17efd FailedAssertion, fileName=0xb17e40 mcxt.c, 
lineNumber=856) at assert.c:54
#3  0x0093ad53 in palloc0 (size=1073741824) at mcxt.c:856
#4  0x00673045 in ExecHashTableCreate (node=0x7f669de951f0, 
hashOperators=0x24dbf90, keepNulls=0 '\000') at nodeHash.c:391
#5  0x006752e1 in ExecHashJoin (node=0x24d74e0) at nodeHashjoin.c:169
#6  0x0065abf4 in ExecProcNode (node=0x24d74e0) at execProcnode.c:477
#7  0x00681026 in ExecNestLoop (node=0x24d6668) at nodeNestloop.c:123
#8  0x0065abca in ExecProcNode (node=0x24d6668) at execProcnode.c:469
#9  0x00681026 in ExecNestLoop (node=0x24d61f8) at nodeNestloop.c:123
#10 0x0065abca in ExecProcNode (node=0x24d61f8) at execProcnode.c:469
#11 0x00681026 in ExecNestLoop (node=0x24d5478) at nodeNestloop.c:123
#12 0x0065abca in ExecProcNode (node=0x24d5478) at execProcnode.c:469
#13 0x00681026 in ExecNestLoop (node=0x24d51d0) at nodeNestloop.c:123
#14 0x0065abca in ExecProcNode (node=0x24d51d0) at execProcnode.c:469

The attached patch replaces this palloc0() by MemoryContextAllocHuge() + 
memset().
Indeed, this hash table is constructed towards the relation with 
nrows=119994544,
so, it is not strange even if hash-slot itself is larger than 1GB.

Another allocation request, which potentially resets or expands the hash-slot
array, may also need to be the Huge version of memory allocation, I think.

Thanks,

Below is the query itself and EXPLAIN result.

dbt3c=# EXPLAIN VERBOSE
dbt3c-# select
dbt3c-# s_name,
dbt3c-# count(*) as numwait
dbt3c-# from
dbt3c-# supplier,
dbt3c-# lineitem l1,
dbt3c-# orders,
dbt3c-# nation
dbt3c-# where
dbt3c-# s_suppkey = l1.l_suppkey
dbt3c-# and o_orderkey = l1.l_orderkey
dbt3c-# and o_orderstatus = 'F'
dbt3c-# and l1.l_receiptdate > l1.l_commitdate
dbt3c-# and exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l2
dbt3c(# where
dbt3c(# l2.l_orderkey = l1.l_orderkey
dbt3c(# and l2.l_suppkey <> l1.l_suppkey
dbt3c(# )
dbt3c-# and not exists (
dbt3c(# select
dbt3c(# *
dbt3c(# from
dbt3c(# lineitem l3
dbt3c(# where
dbt3c(# l3.l_orderkey = l1.l_orderkey
dbt3c(# and l3.l_suppkey <> l1.l_suppkey
dbt3c(# and l3.l_receiptdate > l3.l_commitdate
dbt3c(# )
dbt3c-# and s_nationkey = n_nationkey
dbt3c-# and n_name = 'UNITED KINGDOM'
dbt3c-# group by
dbt3c-# s_name
dbt3c-# order by
dbt3c-# numwait desc,
dbt3c-# s_name
dbt3c-# LIMIT 100;

QUERY PLAN

--
--
--
 Limit  (cost=6792765.24..6792765.24 rows=1 width=26)
   Output: supplier.s_name, (count(*))
   ->  Sort  (cost=6792765.24..6792765.24 rows=1 width=26)
 Output: supplier.s_name, (count(*))
 Sort Key: (count(*)) DESC, supplier.s_name
 ->  HashAggregate  (cost=6792765.22..6792765.23 rows=1 width=26)
   Output: supplier.s_name, count(*)
   Group Key: supplier.s_name
   ->  Nested Loop Anti Join  (cost=4831094.94..6792765.21 rows=1 
width=26)
 Output: supplier.s_name
 ->  Nested Loop  (cost=4831094.37..6792737.52 rows=1 
width=34)
   Output: supplier.s_name, l1.l_suppkey, l1.l_orderkey
   Join Filter: (supplier.s_nationkey = 
nation.n_nationkey)
   ->  Nested Loop  (cost=4831094.37..6792736.19 rows=1 
width=38)
 Output: supplier.s_name, supplier.s_nationkey, 
l1.l_suppkey, l1.l_orderkey
 ->  Nested Loop  (cost=4831093.81..6792728.20 
rows=1 width=42)
   Output: