Hi,

On 06/11/15 16:28, Robert Haas wrote:
> On Wed, Jun 10, 2015 at 10:57 PM, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
>> The attached patch replaces this palloc0() by MemoryContextAllocHuge() +
>> memset().
>> Indeed, this hash table is constructed towards the relation with
>> nrows=119994544,
>> so, it is not strange even if hash-slot itself is larger than 1GB.
>
> You forgot to attach the patch, I think.  It looks to me like the size
> of a HashJoinTuple is going to be 16 bytes, so 1GB/16 = ~64 million.
> That's a lot of buckets, but maybe not unreasonably many if you've got
> enough memory.

Actually, HashJoinTuple is just a pointer, so it's 8 bytes, which means a 1GB array holds 134217728 buckets (2^27), more than the 119994544 rows from the plan.
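
Just to spell out the arithmetic (a back-of-the-envelope check, assuming nbuckets gets rounded up to the next power of two, which matches the 1GB in the error message):

  Size    nbuckets  = 134217728;                 /* 2^27, next power of 2 >= 119994544 */
  Size    allocsize = nbuckets * sizeof(void *); /* 134217728 * 8 = 1073741824 = 1GB */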

Also, looking at the error message again:

    ERROR:  invalid memory alloc request size 1073741824

but this time with beer goggles, I noticed that the amount reported is exactly 1GB. The backtrace also shows the error happens right inside ExecHashTableCreate (and not in the resize, which may happen later), which means it gets nbuckets from ExecChooseHashTableSize directly.

The resize is probably still broken, as I mentioned before, but this crashes before reaching that code, because the estimates are high enough to trigger the issue already in the initial allocation. ExecChooseHashTableSize is supposed to keep all the checks from previous versions, and I think it actually does.
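
For illustration, this is roughly the path it takes, as I read it (simplified, not a verbatim quote):

  /* ExecHashTableCreate(), the initial bucket array */
  hashtable->buckets = (HashJoinTuple *)
      palloc0(nbuckets * sizeof(HashJoinTuple));

  /* palloc0() eventually runs this check, with MaxAllocSize = 1GB - 1 */
  if (!AllocSizeIsValid(size))
      elog(ERROR, "invalid memory alloc request size %zu", size);

With nbuckets=134217728 the request is exactly 1073741824 bytes, one byte over MaxAllocSize, hence the ERROR.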

But I don't see any checks regarding the 1GB boundary in there. What I see is this:

  max_pointers = (work_mem * 1024L) / sizeof(void *);
  max_pointers = Min(max_pointers, INT_MAX / 2);

  ...

  dbuckets = Min(dbuckets, max_pointers);

That has nothing to do with 1GB, and it's been in the code since the time work_mem was limited to 2GB, so perhaps the reasoning was that it's sufficient (because the tuples stored in the hash table will need more than half of the memory, or something like that).

But today this issue is more likely, because people have more RAM and use higher work_mem values, so the max_pointers value gets much higher. In the extreme it may reach INT_MAX/2, i.e. ~1 billion pointers, and a bucket array that large would be ~8GB on 64-bit machines. (On 32-bit machines we'd get twice the number of pointers for the same work_mem, compared to 64 bits, but that's mostly irrelevant because of the memory size limits there.)
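
To put a number on that (work_mem value picked purely to illustrate the arithmetic):

  /* hypothetical work_mem = 8GB, i.e. 8388608 kB */
  long    max_pointers = (8388608 * 1024L) / sizeof(void *);  /* = 1073741824 */

  max_pointers = Min(max_pointers, INT_MAX / 2);              /* = 1073741823 */
  /* a bucket array of that many 8-byte pointers is ~8GB */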

It's also true that the hash-join improvements in 9.5, namely the decrease of NTUP_PER_BUCKET from 10 to 1, made this error more likely. With 9.4 we'd use only 16777216 buckets (128MB), because that already gets us below 10 tuples per bucket. But now we're shooting for 1 tuple per bucket, so we end up with 134217728 buckets, and that bucket array is exactly 1GB.
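
For comparison, the 9.4 numbers for the same plan (same back-of-the-envelope style as above):

  /* 9.4, NTUP_PER_BUCKET = 10 */
  Size    nbuckets_94 = 16777216;                     /* 2^24, next power of 2 >= ceil(119994544 / 10) */
  Size    array_94    = nbuckets_94 * sizeof(void *); /* 128MB, safely under the 1GB limit */

versus the 2^27 buckets / 1GB bucket array computed above for 9.5.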

I see two ways to fix this:

(1) enforce the 1GB limit (probably better for back-patching, if that's
    necessary); see the sketch right after this list

(2) make it work with hash tables over 1GB
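
For (1), a minimal sketch (hypothetical, just mirroring the existing clamps in ExecChooseHashTableSize) would be one more Min() against MaxAllocSize:

  max_pointers = (work_mem * 1024L) / sizeof(void *);
  max_pointers = Min(max_pointers, MaxAllocSize / sizeof(void *));
  max_pointers = Min(max_pointers, INT_MAX / 2);

That keeps the bucket array under the palloc() limit, at the price of more tuples per bucket for very large inner relations.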

I'm in favor of (2), if there's a good way to do that. It seems a bit stupid not to be able to use a fast hash table just because of an artificial allocation limit. Are there any fundamental reasons not to use the MemoryContextAllocHuge fix proposed by KaiGai-san?
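
And for (2), presumably something along these lines in ExecHashTableCreate (a sketch of what I understand the patch does, since the patch itself didn't make it to the list):

  /* allow bucket arrays larger than 1GB */
  hashtable->buckets = (HashJoinTuple *)
      MemoryContextAllocHuge(hashtable->batchCxt,
                             nbuckets * sizeof(HashJoinTuple));
  memset(hashtable->buckets, 0, nbuckets * sizeof(HashJoinTuple));

The repalloc() in the resize path would presumably need the equivalent treatment (repalloc_huge), per the point about the resize above.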


--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

