Hi,
On 5/29/19 5:23 AM, Frank Yu wrote:
> Hi Jake,
> I have the same question about the size of DB/WAL for OSDs. My situation: 12
> OSDs per node, 8 TB (maybe 12 TB later) per OSD, one Intel NVMe SSD
> (Optane P4800X, 375 GB) per node, which means the DB/WAL can use about
> 30 GB per 8 TB OSD. I mainly use CephFS to serve an HPC cluster for ML.
> (I plan to separate the CephFS metadata into a pool backed by NVMe SSD.
> BTW, does this improve performance a lot? Any comparisons?)
We have a similar setup, but with 24 disks and 2x P4800X. And the 375 GB
NVMe drives are _not_ large enough:
2019-05-29 07:00:00.000108 mon.bcf-03 [WRN] overall HEALTH_WARN BlueFS
spillover detected on 22 OSD(s)
root@bcf-10:~# parted /dev/nvme0n1 print
Model: NVMe Device (nvme)
Disk /dev/nvme0n1: 375GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number  Start   End     Size    File system  Name  Flags
 1      1049kB  31.1GB  31.1GB
 2      31.1GB  62.3GB  31.1GB
 3      62.3GB  93.4GB  31.1GB
 4      93.4GB  125GB   31.1GB
 5      125GB   156GB   31.1GB
 6      156GB   187GB   31.1GB
 7      187GB   218GB   31.1GB
 8      218GB   249GB   31.1GB
 9      249GB   280GB   31.1GB
10      280GB   311GB   31.1GB
11      311GB   343GB   31.1GB
12      343GB   375GB   32.6GB
The second NVMe drive has the same partition layout. The twelfth partition
is actually large enough to hold all the data, but the other 11 partitions
on this drive are a little bit too small. I'm still trying to calculate
the exact sweet spot...
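For reference, the arithmetic I'm working from. This sketch assumes the stock RocksDB level defaults (max_bytes_for_level_base = 256 MiB, level multiplier 10); the options Ceph actually passes to RocksDB may differ, so treat the numbers as illustrative only:

```python
# Sketch: why a "31.1 GB" parted partition can be too small to hold the
# RocksDB levels that the ~30 GB sizing advice is based on.
# Assumption: default RocksDB level sizing (256 MiB base, 10x multiplier).

GIB = 1024 ** 3

level_base = 256 * 2 ** 20                          # L1 target size (assumed default)
levels = [level_base * 10 ** i for i in range(3)]   # L1, L2, L3 target sizes
needed = sum(levels)                                # keep L1..L3 entirely on the DB device

partition = 31.1e9                                  # parted reports decimal gigabytes
print(f"L1..L3 sum: {needed / GIB:.1f} GiB")        # -> L1..L3 sum: 27.8 GiB
print(f"partition:  {partition / GIB:.1f} GiB")     # -> partition:  29.0 GiB
```

On paper the levels just barely fit, but BlueFS also has to hold the WAL and temporary SST copies during compaction, which presumably is why ~29 GiB usable still spills over while the 32.6 GB twelfth partition squeaks by.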
With 24 OSDs and only two of them having a just-large-enough DB partition,
I end up with 22 OSDs not fully using their DB partition and spilling over
into the slow disk... exactly as reported by Ceph.
Details for one of the affected OSDs:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 31138504704,
"db_used_bytes": 2782912512,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320062095360,
"slow_used_bytes": 5838471168,
"num_files": 135,
"log_bytes": 13295616,
"log_compactions": 9,
"logged_bytes": 338104320,
"files_written_wal": 2,
"files_written_sst": 5066,
"bytes_written_wal": 375879721287,
"bytes_written_sst": 227201938586,
"bytes_written_slow": 65162240000,
"max_bytes_wal": 0,
"max_bytes_db": 5265940480,
"max_bytes_slow": 7540310016
},
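(That section comes from `ceph daemon osd.<id> perf dump`. A quick sketch for flagging spillover from that JSON; the helper function is just illustrative, but the field names match the output above:)

```python
# Sketch: flag BlueFS spillover in the "bluefs" section of an OSD's
# `perf dump` output. Any non-zero slow_used_bytes means metadata has
# spilled onto the slow (data) device.

def check_spillover(perf_dump: dict) -> str:
    bluefs = perf_dump["bluefs"]
    db_pct = 100 * bluefs["db_used_bytes"] / bluefs["db_total_bytes"]
    slow = bluefs["slow_used_bytes"]
    status = "SPILLOVER" if slow > 0 else "ok"
    return f"db {db_pct:.0f}% used, {slow / 2**30:.1f} GiB on slow: {status}"

# Values taken from the affected OSD quoted above.
sample = {"bluefs": {"db_total_bytes": 31138504704,
                     "db_used_bytes": 2782912512,
                     "slow_used_bytes": 5838471168}}
print(check_spillover(sample))  # -> db 9% used, 5.4 GiB on slow: SPILLOVER
```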
Maybe it's just a matter of shifting some megabytes. We are about to
deploy more of these nodes, so I would be grateful if anyone could comment
on the correct size of the DB partitions. Otherwise I'll have to use a
RAID-0 across the two drives.
Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com