Docker + postgreSQL : OOM killing in a large Group by operation

Jorge Daniel Thu, 05 Apr 2018 10:03:44 -0700

Hi  Guys:
I have a problem with a query that grabs a bunch of rows and then does an 
aggreate operation, at that moment it gots killed by OOM-killer, I don't know 
why, the  engine starts using tmpfiles as expected , and then tries to work in  
memory and gots killed.
I've test it in an small enviroment for more dramatic/quick results


Env:
Running on a 1GB memory Docker-container .

PostgreSQL 9.4.8 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 
4.8.2-19ubuntu1) 4.8.2, 64-bit
Database Analyzed and the stats are up-to-date

postgresql.conf:
shared_buffers = 512MB
work_mem = 16MB
effective_cache_size = 256MB


Bad Query:

SELECT count(*)
     FROM "changelog_change_transaction"
     INNER JOIN "changelog_change_stats" ON ( 
changelog_change_stats.changelog_change_transactionid = 
changelog_change_transaction.changelog_change_transactionid )
     LEFT OUTER JOIN "changelog_change_group" ON ( 
changelog_change_transaction.changelog_change_groupid = 
changelog_change_group.changelog_change_groupid )

    WHERE ( changelog_change_group.companyid = 40 OR 
changelog_change_group.companyid = 1 OR changelog_change_group.companyid = 53 
OR changelog_change_group.companyid IS NULL )
    AND changelog_change_transaction.started_at > '2017-04-21'
GROUP BY  "changelog_change_transaction"."changelog_change_transactionid", 
"changelog_change_transaction"."epoch", "changelog_change_transaction"
."changelog_change_groupid", "changelog_change_transaction"."started_at", 
"changelog_change_transaction"."duration_microseconds", 
"changelog_change_transaction"."changed_items", 
"changelog_change_transaction"."xmin"
;

Explain :

 HashAggregate  (cost=7845766.73..8117654.17 rows=27188744 width=152)
   Group Key: changelog_change_transaction.changelog_change_transactionid, 
changelog_change_transaction.epoch, 
changelog_change_transaction.changelog_change_groupid, 
changelog_change_transaction.started_at, 
changelog_change_transaction.duration_microseconds, 
changelog_change_transaction.changed_items, changelog_change_transaction.xmin
   ->  Hash Left Join  (cost=2498235.67..7301991.85 rows=27188744 width=152)
         Hash Cond: (changelog_change_transaction.changelog_change_groupid = 
changelog_change_group.changelog_change_groupid)
         Filter: ((changelog_change_group.companyid = 40) OR 
(changelog_change_group.companyid = 1) OR (changelog_change_group.companyid = 
53) OR (changelog_change_group.companyid IS NULL))
         ->  Hash Join  (cost=2142692.83..5176273.34 rows=27726867 width=152)
               Hash Cond: 
(changelog_change_stats.changelog_change_transactionid = 
changelog_change_transaction.changelog_change_transactionid)
               ->  Seq Scan on changelog_change_stats  (cost=0.00..689345.48 
rows=33612148 width=6)
               ->  Hash  (cost=1215858.45..1215858.45 rows=27272350 width=152)
                     ->  Seq Scan on changelog_change_transaction  
(cost=0.00..1215858.45 rows=27272350 width=152)
                           Filter: (started_at > '2017-04-21 
00:00:00-07'::timestamp with time zone)
         ->  Hash  (cost=200291.82..200291.82 rows=8931282 width=10)
               ->  Seq Scan on changelog_change_group  (cost=0.00..200291.82 
rows=8931282 width=10)

Log execution:

2018-04-05 09:07:12.444 PDT rhost=[local] 
app=psql:user=postgres:db=telecom:LOG:  temporary file: path 
"base/pgsql_tmp/pgsql_tmp6455.19", size 9437168
......
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:LOG:  server process (PID 
6455) was terminated by signal 9: Killed
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:DETAIL:  Failed process was 
running: SELECT count(*)
     FROM "changelog_change_transaction"
     INNER JOIN "changelog_change_stats" ON ( 
changelog_change_stats.changelog_change_transactionid = 
changelog_change_transaction.changelog_change_transactionid )
     LEFT OUTER JOIN "changelog_change_group" ON ( 
changelog_change_transaction.changelog_change_groupid = 
changelog_change_group.changelog_change_groupid )

    WHERE ( changelog_change_group.companyid = 40 OR 
changelog_change_group.companyid = 1 OR changelog_change_group.companyid = 53 
OR changelog_change_group.companyid IS NULL )
    AND changelog_change_transaction.started_at > '2017-04-21'
GROUP BY  "changelog_change_transaction"."changelog_change_transactionid", 
"changelog_change_transaction"."epoch", "changelog_change_transaction"
."changelog_change_groupid", "changelog_change_transaction"."started_at", 
"changelog_change_transaction"."duration_microseconds", 
"changelog_change_transaction"."changed_items", 
"changelog_change_transaction"."xmin"
;
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:LOG:  terminating any other 
active server processes
2018-04-05 09:07:25.605 PDT rhost= app=:user=:db=:WARNING:  terminating 
connection because of crash of another server process


Monitoring the /proc/*/status of the running client process ,I've noticed this :

VmData:     7944 kB
VmData:     7944 kB
VmData:     7944 kB
VmData:    29788 kB
VmData:    30696 kB
VmData:    31724 kB
...
VmData:    33776 kB
...
VmData:    37876 kB
...
VmData:    46072 kB
...
VmData:    55272 kB
VmData:    67568 kB
VmData:    76032 kB
VmData:    76484 kB
VmData:    78156 kB
...
...
VmData:    80208 kB
...
VmData:    84308 kB
VmData:    92504 kB
VmData:   198972 kB
VmData:   354620 kB
VmData:   495208 kB
...
VmData:   682364 kB
VmData:   788988 kB
VmData:   821756 kB
VmData:   887292 kB
...
VmData:   961020 kB
...
VmData:   993788 kB
...
VmData: 1001980 kB


For sure if the GROUP BY the one that causes this OOM (when I removed it, the 
query finish ok ) , so I've change the query-plan to avoid the HashAggregate:

telecom=# show enable_hashagg ;
 enable_hashagg
 ----------------
  off

  But the explain still shows:

 HashAggregate  (cost=19768044.56..20039932.00 rows=27188744 width=152)
   Group Key: changelog_change_transaction.changelog_change_transactionid, 
changelog_change_transaction.epoch, 
changelog_change_transaction.changelog_change_groupid, 
changelog_change_transaction.started_at, 
changelog_change_transaction.duration_microseconds, 
changelog_change_transaction.changed_items, changelog_change_transaction.xmin

   Any help or light on this will be really appreciated!

Regards

Jorge Daniel Fernandez

Docker + postgreSQL : OOM killing in a large Group by operation

Reply via email to