Hi,

we are running ceph version 12.2.2 (10 nodes, 240 OSDs, 3 mon). While
monitoring the WAL db used bytes we noticed that there are some OSDs
that use proportionally more WAL db bytes than others (800Mb vs 300Mb).
These OSDs eventually exceed the WAL db size (1GB in our case) and spill
over to the HDD data device. So it seems flushing the WAL db does not
free space.

We looked for some hints in the logs of the OSDs in question and spotted
the following entries:

[...]
2018-02-08 16:17:27.496695 7f0ffce55700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:684]
reusing log 152 from recycle list
2018-02-08 16:17:27.496768 7f0ffce55700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_write.cc:725]
[default] New memtable created with log file: #162. Immutable memtables: 0.
2018-02-08 16:17:27.496976 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.496841)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:1158]
Calling FlushMemTableToOutputFile with column family [default], flush
slots available 1, compaction slots allowed 1, compaction slots scheduled 1
2018-02-08 16:17:27.496983 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:264]
[default] [JOB 6] Flushing memtable with next log file: 162
2018-02-08 16:17:27.497001 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1518103047496990, "job": 6, "event": "flush_started",
"num_memtables": 1, "num_entries": 328542, "num_deletes": 66632,
"memory_usage": 260058032}
2018-02-08 16:17:27.497006 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:293]
[default] [JOB 6] Level-0 flush table #163: started
2018-02-08 16:17:27.627110 7f0fe7e2b700  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1518103047627094, "cf_name": "default", "job": 6,
"event": "table_file_creation", "file_number": 163, "file_size":
5502182, "table_properties": {"data_size": 5160167, "index_size": 81548,
"filter_size": 259478, "raw_key_size": 5138655, "raw_average_key_size":
51, "raw_value_size": 3606384, "raw_average_value_size": 36,
"num_data_blocks": 1287, "num_entries": 98984, "filter_policy_name":
"rocksdb.BuiltinBloomFilter", "kDeletedKeys": "66093", "kMergeOperands":
"192"}}
2018-02-08 16:17:27.627127 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/flush_job.cc:319]
[default] [JOB 6] Level-0 flush table #163: 5502182 bytes OK
2018-02-08 16:17:27.627449 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_files.cc:242]
adding log 155 to recycle list
2018-02-08 16:17:27.627457 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627136)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:360]
[default] Level-0 commit table #163 started
2018-02-08 16:17:27.627461 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627402)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:383]
[default] Level-0 commit table #163: memtable #1 done
2018-02-08 16:17:27.627474 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627415) EVENT_LOG_v1 {"time_micros":
1518103047627409, "job": 6, "event": "flush_finished", "lsm_state": [1,
2, 3, 0, 0, 0, 0], "immutable_memtables": 0}
2018-02-08 16:17:27.627476 7f0fe7e2b700  4 rocksdb: (Original Log Time
2018/02/08-16:17:27.627435)
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:132]
[default] Level summary: base level 1 max bytes base 268435456 files[1 2
3 0 0 0 0] max score 0.29
2018-02-08 16:17:27.627481 7f0fe7e2b700  4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_files.cc:388]
[JOB 6] Try to delete WAL files size 253479064, prev total WAL file size
253494097, number of live WAL files 2.

What does the log entry "Try to delete WAL files" tell? How can we
delete such files on a raw partition? Should this not be done automagically?

When I restart the OSD then the WAL db gets purged. Otherwise after some
time it just spills over. How can we make sure that


Dietmar

-- 
_________________________________________
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at


Attachment: signature.asc
Description: OpenPGP digital signature

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to