Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-06 Thread Tomas Vondra
On 2.12.2014 02:52, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 01:33, Tom Lane wrote:
 What I suspect you're looking at here is the detritus of creation
 of a huge number of memory contexts. mcxt.c keeps its own state
 about existing contexts in TopMemoryContext. So, if we posit that
 those contexts weren't real small, there's certainly room to
 believe that there was some major memory bloat going on recently.
 
 Aha! MemoryContextCreate allocates the memory for the new context
 from TopMemoryContext explicitly, so that it survives resets of the
 parent context. Is that what you had in mind by keeping state about
 existing contexts?
 
 Right.
 
 That'd probably explain the TopMemoryContext size, because
 array_agg() creates a separate context for each group. So if you have
 1M groups, you have 1M contexts. Although I don't see how the size
 of those contexts would matter?
 
 Well, if they're each 6K, that's your 6GB right there.

FWIW, I tried to repeat the test with the array_agg() patch, submitted
to the next CF:

  http://www.postgresql.org/message-id/5479faea.3070...@fuzzy.cz

and the memory context stats after the first iteration look like this:

master (unpatched)
--
TopMemoryContext: 204878128 total in 25011 blocks; 204010192 free
(750034 chunks); 867936 used

patched with array_agg
--
TopMemoryContext: 78128 total in 11 blocks; 10144 free (34 chunks);
67984 used

Also, the RSS issue (which was the initial subject of this thread)
pretty much disappeared. So indeed, this seems to be caused by the
'islands' of context control blocks kept in TopMemoryContext.

I wonder whether it'd be worth looking for places doing something
similar to array_agg(), potentially creating many contexts in parallel,
and maybe modifying them as in the patch.

regards
Tomas




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra

On 2014-12-02 02:52, Tom Lane wrote:

Tomas Vondra t...@fuzzy.cz writes:

On 2.12.2014 01:33, Tom Lane wrote:

What I suspect you're looking at here is the detritus of creation of
a huge number of memory contexts. mcxt.c keeps its own state about
existing contexts in TopMemoryContext. So, if we posit that those
contexts weren't real small, there's certainly room to believe that
there was some major memory bloat going on recently.



Aha! MemoryContextCreate allocates the memory for the new context from
TopMemoryContext explicitly, so that it survives resets of the parent
context. Is that what you had in mind by keeping state about existing
contexts?


Right.


That'd probably explain the TopMemoryContext size, because array_agg()
creates a separate context for each group. So if you have 1M groups, you
have 1M contexts. Although I don't see how the size of those contexts
would matter?


Well, if they're each 6K, that's your 6GB right there.


Yeah, but this memory should be freed after the query finishes, no?


Maybe we could move this info (list of child contexts) to the parent
context somehow, so that it's freed when the context is destroyed?


We intentionally didn't do that, because in many cases it'd result in
parent contexts becoming nonempty when otherwise they'd never have
anything actually in them.  The idea was that such shell parent contexts
should be cheap, requiring only a control block in TopMemoryContext and
not an actual allocation arena.  This idea of a million separate child
contexts was never part of the design of course; we might need to rethink
whether that's a good idea.  Or maybe there need to be two different
policies about where to put child control blocks.


Maybe. For me, the 130MB is not really a big deal, because for this to
happen there really needs to be many child contexts at the same time,
consuming much more memory. With 6.5GB consumed in total, 130MB amounts
to ~2% which is negligible. Unless we can fix the RSS bloat.

Also, this explains the TopMemoryContext size, but not the RSS size (or
am I missing something)?


Very possibly you're left with islands that prevent reclaiming very
much of the peak RAM usage.  It'd be hard to be sure without some sort
of memory map, of course.


Yes, that's something I was thinking about too - I believe what happens
is that allocations of info in TopMemoryContext and the actual contexts
are interleaved, and at the end only the memory contexts are deleted. The
blocks allocated in TopMemoryContext are kept, creating the islands.

If that's the case, allocating the child context info within the parent
context would solve this, because these pieces would be reclaimed with
the rest of the parent memory.

But then again, there are probably other ways to create such islands
(e.g. allocating an additional block in a long-lived context while the
child contexts exist).
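
To illustrate the interleaving outside the backend, here's a minimal
standalone sketch (plain C against glibc, nothing PostgreSQL-specific;
the counts and sizes are made up, not measured from the backend). Small
long-lived allocations stand in for the context control blocks kept in
TopMemoryContext, larger short-lived ones for the per-group context
blocks, and only the large ones get freed:

  /*
   * islands.c - standalone sketch of the interleaving described above.
   * Build with: gcc -O2 islands.c -o islands      (Linux assumed)
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define NGROUPS 200000

  /* resident set size in kB, read from /proc/self/statm */
  static long rss_kb(void)
  {
      long pages_total, pages_resident;
      FILE *f = fopen("/proc/self/statm", "r");

      if (f == NULL)
          return -1;
      if (fscanf(f, "%ld %ld", &pages_total, &pages_resident) != 2)
          pages_resident = -1;
      fclose(f);
      if (pages_resident < 0)
          return -1;
      return pages_resident * (sysconf(_SC_PAGESIZE) / 1024);
  }

  int main(void)
  {
      char **headers = malloc(NGROUPS * sizeof(char *));
      char **arenas = malloc(NGROUPS * sizeof(char *));
      long i;

      if (headers == NULL || arenas == NULL)
          return 1;

      /* interleave one small "control block" and one "arena" per group */
      for (i = 0; i < NGROUPS; i++)
      {
          headers[i] = malloc(200);
          arenas[i] = malloc(6 * 1024);
          if (headers[i] == NULL || arenas[i] == NULL)
              return 1;
          memset(headers[i], 1, 200);
          memset(arenas[i], 1, 6 * 1024);
      }
      printf("RSS at peak:       %ld kB\n", rss_kb());

      /* "delete" all the contexts, but keep the control blocks */
      for (i = 0; i < NGROUPS; i++)
          free(arenas[i]);
      printf("RSS after freeing: %ld kB\n", rss_kb());

      return 0;
  }

As far as I understand glibc, the freed chunks in the middle of the heap
are not returned to the kernel (free() only trims the very top), so the
second number should stay close to the first even though only a small
fraction is still live - that's the 'island' effect.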

regards
Tomas




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-02 Thread Tomas Vondra
On 2 December 2014, 10:59, Tomas Vondra wrote:
 On 2014-12-02 02:52, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:

 Also, this explains the TopMemoryContext size, but not the RSS size
 (or am I missing something)?

 Very possibly you're left with islands that prevent reclaiming very
 much of the peak RAM usage.  It'd be hard to be sure without some sort
 of memory map, of course.

 Yes, that's something I was thinking about too - I believe what happens
 is that allocations of info in TopMemoryContext and the actual contexts
 are interleaved, and at the end only the memory contexts are deleted.
 The blocks allocated in TopMemoryContext are kept, creating the
 islands.

 If that's the case, allocating the child context info within the parent
 context would solve this, because these pieces would be reclaimed with
 the rest of the parent memory.

On second thought, I'm not sure this explains the continuous increase of
consumed memory. When the first iteration consumes 2,818g of memory, why
should the following iterations consume significantly more? The allocation
patterns should be (almost) exactly the same, reusing the already
allocated memory (either at the system or TopMemoryContext level).

regards
Tomas





[HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

  [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.
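
(By "explicit counters" I mean something along these lines - a standalone
illustration of the idea only, not the actual aset.c change; the names
are made up:)

  /*
   * count_blocks.c - standalone illustration of the block-counting idea
   * (names are made up; this is not the actual aset.c change).
   */
  #include <stdio.h>
  #include <stdlib.h>

  static long blocks_allocated = 0;   /* blocks obtained from malloc() */
  static long blocks_freed = 0;       /* blocks returned via free() */

  static void *block_alloc(size_t size)
  {
      void *block = malloc(size);

      if (block != NULL)
          blocks_allocated++;
      return block;
  }

  static void block_free(void *block)
  {
      if (block != NULL)
          blocks_freed++;
      free(block);
  }

  int main(void)
  {
      void *a = block_alloc(8192);
      void *b = block_alloc(8192);

      block_free(a);
      block_free(b);

      /* with no leak, "live" is back to zero once everything is freed */
      printf("allocated=%ld freed=%ld live=%ld\n",
             blocks_allocated, blocks_freed,
             blocks_allocated - blocks_freed);
      return 0;
  }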

The only thing I can think of is some interaction (fragmentation?) with
the system allocator (I'm using glibc-2.19 and kernel 3.16.1, btw),
making the RSS values even less useful than I thought. Sadly, it seems
to trigger the OOM killer :-(

It's entirely possible that this is a known behavior of the allocator,
and I've been unaware of it. It's also true that the work_mem values are
really excessive, and the actual values would be much lower, which would
make the issues much less serious.
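
For what it's worth, glibc can be asked directly how much of its heap is
actually in use versus merely retained. A diagnostic sketch of the idea
(glibc-specific - malloc_stats() and malloc_trim() are stock glibc calls,
but the program itself is illustrative, not part of the attached script):

  /*
   * glibc_heap.c - sketch of asking glibc about its heap directly
   * (glibc-specific; a stand-in workload, not the attached test case).
   */
  #include <malloc.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      enum { N = 100000 };
      static void *chunks[N];
      int i;

      /* allocate and free a batch of 8kB chunks as a stand-in workload */
      for (i = 0; i < N; i++)
          chunks[i] = malloc(8192);
      for (i = 0; i < N; i++)
          free(chunks[i]);

      malloc_stats();   /* prints "in use bytes" vs "system bytes" to stderr */
      malloc_trim(0);   /* ask glibc to give free pages back to the OS */
      malloc_stats();   /* compare the two reports */
      return 0;
  }

If "in use bytes" is small while "system bytes" (and RSS) stay large, the
memory is being held by the allocator rather than leaked by the backend.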

To demonstrate the problem I'll use the attached SQL script - it's a
rather simple script generating sample tables and then executing trivial
array_agg() queries.

The script sets work_mem to two different values:

  work_mem = '1GB'    == this limits the INSERT (generating data)

  work_mem = '1024GB' == bogus value, forcing a hash aggregate
                         (assuming the hash table fits into memory)

The size of the sample tables (and the amount of memory needed for the
hash aggregate) is determined by the first parameter set in the script.
With this value

  \set size 1

I get an OOM crash on the first execution of the SQL script (on a machine
with 8GB of RAM and 512MB shared buffers), but YMMV.

The problem is that even with a much smaller dataset (say, using size
7500) you'll get an OOM error after several executions of the script.
How many executions are needed seems to be inversely proportional to the
size of the data set.

The problem is that the RSS amount is increasing over time for some
reason. For example, with size = 5000, the memory stats for the
process look like this over the first few minutes:

     VIRT    RES    SHR  %CPU %MEM     TIME+ COMMAND
  5045508 2,818g 187220  51,9 36,6   0:15.39 postgres: INSERT
  5045508 3,600g 214868  62,8 46,8   3:11.58 postgres: INSERT
  5045508 3,771g 214868  50,9 49,0   3:40.03 postgres: INSERT
  5045508 3,840g 214868  48,5 49,9   4:00.71 postgres: INSERT
  5045508 3,978g 214880  51,5 51,7   4:40.73 postgres: INSERT
  5045508 4,107g 214892  53,2 53,4   5:22.04 postgres: INSERT
  5045508 4,297g 215276  53,9 55,9   6:22.63 postgres: INSERT
  ...

Those are rows for the backend process, captured from top over time.
How long the backend was running is in the TIME column. Each iteration
takes ~30 seconds, so those lines represent approximately iterations 1,
6, 7, 8, 11, etc.

Notice how the RSS value grows over time, and also notice that this is
the INSERT, restricted by work_mem=1GB. So the memory consumption should
be ~1.5GB, and MemoryContextStats(TopMemoryContext) collected at this
point is consistent with that (see the mem-ctx.log).

And then the value stabilizes at ~4,430g and stops growing. With size
7500, however, it takes only ~20 iterations to reach the OOM issue, with
a crash log like this:

[10560.843547] Killed process 15862 (postgres) total-vm:7198260kB,
anon-rss:6494136kB, file-rss:300436kB

So, any ideas what might be the culprit here?

As I said, this is clearly made worse by inappropriately high work_mem
values, but I'm not sure it's completely harmless. Imagine for example
long-running backends, executing complex queries with inaccurate
estimates. That may easily result in using much more memory than the
work_mem limit, and increasing the RSS value over time.

regards
Tomas


array-agg.sql
Description: application/sql
TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 chunks); 608256 used
  TopTransactionContext: 8192 total in 1 blocks; 7904 free (0 chunks); 288 used
  TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
  Attopt cache: 8192 total in 1 blocks; 2704 free (0 chunks); 5488 used
  Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
  Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
  MessageContext: 32768 total in 3 blocks; 9640 free (0 chunks); 23128 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  smgr relation table: 24576 total in 2 blocks; 9808 free (4 chunks); 14768 used
  TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 848 free (0 chunks); 176 used
  ExecutorState: 318758960 total in 40 blocks; 5879664 free (8 chunks); 

Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Andrew Dunstan


On 12/01/2014 05:39 PM, Tomas Vondra wrote:

Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

   [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.



Doesn't this line:

   TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 
chunks); 608256 used


look pretty suspicious?


cheers

andrew




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 00:31, Andrew Dunstan wrote:
 
 On 12/01/2014 05:39 PM, Tomas Vondra wrote:
 Hi all,

 while working on the patch decreasing the amount of memory consumed by
 array_agg [1], I've run into some strange OOM issues. Reproducing them
 using the attached SQL script is rather simple.

[1] https://commitfest.postgresql.org/action/patch_view?id=1652

 At first I thought there's some rare hidden memory leak, but I'm pretty
 sure that's not the case. I've even put some explicit counters into
 aset.c counting allocated/freed blocks, and it seems to be working just
 fine (and matching the context tree perfectly). So no memory leak.
 
 
 Doesn't this line:
 
TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 
 
 look pretty suspicious?

It certainly looks a bit large, especially considering this is a fresh
cluster with a single DB, containing a single table (created by user),
no doubt about that.

For comparison, this is a new backend:

TopMemoryContext: 78128 total in 11 blocks; 8528 free (16 chunks); 69600
used

OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.


For the record, I only tweaked these settings in the config file:

test=# select name, setting from pg_settings where source like
'%configuration file%';
            name             |      setting
-----------------------------+-------------------
 checkpoint_segments         | 32
 DateStyle                   | ISO, DMY
 default_text_search_config  | pg_catalog.simple
 dynamic_shared_memory_type  | posix
 lc_messages                 | cs_CZ.UTF-8
 lc_monetary                 | cs_CZ.UTF-8
 lc_numeric                  | cs_CZ.UTF-8
 lc_time                     | cs_CZ.UTF-8
 log_timezone                | Europe/Prague
 maintenance_work_mem        | 524288
 max_connections             | 100
 shared_buffers              | 65536
 TimeZone                    | Europe/Prague
 work_mem                    | 1048576
(14 rows)

Some of those are set by initdb, of course.

regards
Tomas




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 00:31, Andrew Dunstan wrote:
 Doesn't this line:
 TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 look pretty suspicious?

 It certainly looks a bit large, especially considering this is a fresh
 cluster with a single DB, containing a single table (created by user),
 no doubt about that.
 OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.

Yeah, but what was that 130MB being used for?  It's not generally
considered good practice to put large or random stuff in TopMemoryContext
--- arguably, any code doing so is broken already, because almost by
definition such allocations won't be reclaimed on error.

What I suspect you're looking at here is the detritus of creation of a
huge number of memory contexts.  mcxt.c keeps its own state about existing
contexts in TopMemoryContext.  So, if we posit that those contexts weren't
real small, there's certainly room to believe that there was some major
memory bloat going on recently.

regards, tom lane




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 01:33, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 00:31, Andrew Dunstan wrote:
 Doesn't this line:
 TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 look pretty suspicious?
 
 It certainly looks a bit large, especially considering this is a fresh
 cluster with a single DB, containing a single table (created by user),
 no doubt about that.
 OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.
 
 Yeah, but what was that 130MB being used for?  It's not generally
 considered good practice to put large or random stuff in TopMemoryContext
 --- arguably, any code doing so is broken already, because almost by
 definition such allocations won't be reclaimed on error.
 
 What I suspect you're looking at here is the detritus of creation of
 a huge number of memory contexts. mcxt.c keeps its own state about
 existing contexts in TopMemoryContext. So, if we posit that those
 contexts weren't real small, there's certainly room to believe that
 there was some major memory bloat going on recently.

Aha! MemoryContextCreate allocates the memory for the new context from
TopMemoryContext explicitly, so that it survives resets of the parent
context. Is that what you had in mind by keeping state about existing
contexts?

That'd probably explain the TopMemoryContext size, because array_agg()
creates a separate context for each group. So if you have 1M groups, you
have 1M contexts. Although I don't see how the size of those contexts
would matter?
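
(For the archives, the two bits of code involved, paraphrased from my
reading of the 9.4-era sources - treat the exact lines as approximate,
not verbatim excerpts:)

  /* mcxt.c, MemoryContextCreate(): the context header ("control block")
   * is allocated in TopMemoryContext, so that it survives resets and
   * deletes of the parent context */
  node = (MemoryContext) MemoryContextAlloc(TopMemoryContext, size);

  /* arrayfuncs.c, accumArrayResult(): array_agg's transition function
   * creates one such context per aggregation group (8kB initial size) */
  arr_context = AllocSetContextCreate(rcontext,
                                      "accumArrayResult",
                                      ALLOCSET_DEFAULT_MINSIZE,
                                      ALLOCSET_DEFAULT_INITSIZE,
                                      ALLOCSET_DEFAULT_MAXSIZE);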

Maybe we could move this info (list of child contexts) to the parent
context somehow, so that it's freed when the context is destroyed? Aside
from the regular blocks, of course. Then again, you can only have so many
memory contexts created concurrently: for AllocSet it's at most RAM/1kB
(because that's the minimum block size) before the OOM intervenes, and in
this case the contexts have 8kB initial size, so the number of contexts
is even lower. So maybe it's not worth the effort.

Also, this explains the TopMemoryContext size, but not the RSS size (or
am I missing something)?

regards
Tomas




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 01:33, Tom Lane wrote:
 What I suspect you're looking at here is the detritus of creation of
 a huge number of memory contexts. mcxt.c keeps its own state about
 existing contexts in TopMemoryContext. So, if we posit that those
 contexts weren't real small, there's certainly room to believe that
 there was some major memory bloat going on recently.

 Aha! MemoryContextCreate allocates the memory for the new context from
 TopMemoryContext explicitly, so that it survives resets of the parent
 context. Is that what you had in mind by keeping state about existing
 contexts?

Right.

 That'd probably explain the TopMemoryContext size, because array_agg()
 creates a separate context for each group. So if you have 1M groups, you
 have 1M contexts. Although I don't see how the size of those contexts
 would matter?

Well, if they're each 6K, that's your 6GB right there.

 Maybe we could move this info (list of child contexts) to the parent
 context somehow, so that it's freed when the context is destroyed?

We intentionally didn't do that, because in many cases it'd result in
parent contexts becoming nonempty when otherwise they'd never have
anything actually in them.  The idea was that such shell parent contexts
should be cheap, requiring only a control block in TopMemoryContext and
not an actual allocation arena.  This idea of a million separate child
contexts was never part of the design of course; we might need to rethink
whether that's a good idea.  Or maybe there need to be two different
policies about where to put child control blocks.

 Also, this explains the TopMemoryContext size, but not the RSS size (or
 am I missing something)?

Very possibly you're left with islands that prevent reclaiming very
much of the peak RAM usage.  It'd be hard to be sure without some sort
of memory map, of course.

regards, tom lane

