Mesos replicated log fills disk with logging output

Stephan Erb Mon, 08 Jan 2018 14:29:15 -0800

Hi everyone,

a few days ago, we have bumped into an interesting issue that we had not seen 
before. Essentially, one of our toy clusters dissolved itself:



  *   3 masters, each running Mesos (1.2.1), Aurora (0.19.0), and ZooKeeper 
(3.4.5) for leader election
  *   Master 1 and master 2 had 100% disk usage, because 
/var/lib/mesos/replicated_log/LOG had grown to about 170 GB
  *   The replicated log of both Master 1 and 2 was corrupted. A process 
restart did not fix it.
  *   The ZooKeeper on Master 2 was corrupted as well. Logs indicated this was 
caused by the full disk.
  *   Master 3 was the leading Mesos master and healthy. Its disk usage was 
normal.


The content of /var/lib/mesos/replicated_log/LOG was an endless stream of:

2018/01/04-12:30:56.776466 7f65aae877c0 Recovering log #1753
2018/01/04-12:30:56.776577 7f65aae877c0 Level-0 table #1756: started
2018/01/04-12:30:56.778885 7f65aae877c0 Level-0 table #1756: 7526 bytes OK
2018/01/04-12:30:56.782433 7f65aae877c0 Delete type=0 #1753
2018/01/04-12:30:56.782484 7f65aae877c0 Delete type=3 #1751
2018/01/04-12:30:56.782642 7f6597fff700 Level-0 table #1759: started
2018/01/04-12:30:56.782686 7f6597fff700 Level-0 table #1759: 0 bytes OK
2018/01/04-12:30:56.783242 7f6597fff700 Delete type=0 #1757
2018/01/04-12:30:56.783312 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783499 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783538 7f6597fff700 Delete type=2 #1760
2018/01/04-12:30:56.783563 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783598 7f6597fff700 Manual compaction at level-0 from 
(begin) .. (end); will stop at '0000003060' @ 9423 : 1
2018/01/04-12:30:56.783607 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783698 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783728 7f6597fff700 Delete type=2 #1761
2018/01/04-12:30:56.783749 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783770 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.783900 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.783929 7f6597fff700 Delete type=2 #1762
2018/01/04-12:30:56.783950 7f6597fff700 Compaction error: IO error: 
/var/lib/mesos/replicated_log/001735.sst: No such file or directory
2018/01/04-12:30:56.783970 7f6597fff700 Compacting 4@0 + 1@1 files
2018/01/04-12:30:56.784312 7f6597fff700 compacted to: files[ 4 1 0 0 0 0 0 ]
2018/01/04-12:30:56.785547 7f6597fff700 Delete type=2 #1763

Content of the associated folder:

/var/lib/mesos/replicated_log.corrupted# ls -la
total 964480
drwxr-xr-x 2 mesos mesos      4096 Jan  5 10:12 .
drwxr-xr-x 4 mesos mesos      4096 Jan  5 10:27 ..
-rw-r--r-- 1 mesos mesos       724 Dec 14 16:22 001735.ldb
-rw-r--r-- 1 mesos mesos      7393 Dec 14 16:45 001737.sst
-rw-r--r-- 1 mesos mesos     22129 Jan  3 12:53 001742.sst
-rw-r--r-- 1 mesos mesos     14967 Jan  3 13:00 001747.sst
-rw-r--r-- 1 mesos mesos      7526 Jan  4 12:30 001756.sst
-rw-r--r-- 1 mesos mesos     15113 Jan  5 10:08 001765.sst
-rw-r--r-- 1 mesos mesos     65536 Jan  5 10:09 001767.log
-rw-r--r-- 1 mesos mesos        16 Jan  5 10:08 CURRENT
-rw-r--r-- 1 mesos mesos         0 Aug 25  2015 LOCK
-rw-r--r-- 1 mesos mesos 178303865220 Jan  5 10:12 LOG
-rw-r--r-- 1 mesos mesos 463093282 Jan  5 10:08 LOG.old
-rw-r--r-- 1 mesos mesos     65536 Jan  5 10:08 MANIFEST-001764

Monitoring indicates that the disk usage started to grow shortly after a badly 
coordinated configuration deployment change:


  *   Master 1 was leading and restarted after a few hours of uptime
  *   Master 2 was now leading. After a few seconds (30s-60s or so) it got 
restarted as well
  *   Master 3 was now leading (and continued to do so)

I have to admit I am a bit surprised that the restart scenario could lead to 
the issues described above. Has anyone seen similar issues as well?

Thanks and best regards,
Stephan

Mesos replicated log fills disk with logging output

Reply via email to