Re: [ceph-users] OSD fails to start (fsck error, unable to read osd superblock)
On 2/9/19 5:40 PM, Brad Hubbard wrote:
> On Sun, Feb 10, 2019 at 1:56 AM Ruben Rodriguez wrote:
>>
>> Hi there,
>>
>> Running 12.2.11-1xenial on a machine with 6 SSD OSDs with bluestore.
>>
>> Today we had two disks fail out of the controller, and after a reboot
>> they both seemed to come back fine, but ceph-osd was only able to start
>> on one of them. The other one gets this:
>>
>> 2019-02-08 18:53:00.703376 7f64f948ce00 -1
>> bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device
>> location [0x4000~1000], logical extent 0x0~1000, object
>> #-1:7b3f43c4:::osd_superblock:0#
>> 2019-02-08 18:53:00.703406 7f64f948ce00 -1 osd.3 0 OSD::init() : unable
>> to read osd superblock
>>
>> Note that there are no actual IO errors being shown by the controller in
>> dmesg, and that the disk is readable. The metadata FS is mounted and
>> looks normal.
>>
>> I tried running "ceph-bluestore-tool repair --path
>> /var/lib/ceph/osd/ceph-3 --deep 1" and that gets many instances of:
>
> Running this with debug_bluestore=30 might give more information on
> the nature of the IO error.

I had already collected the logs with debug info, and nothing significant was listed there.

I applied this patch https://github.com/ceph/ceph/pull/26247 and it allowed me to move forward. There was an osd map corruption issue that I had to handle by hand, but after that the osd started fine. Once it started and backfills finished, the bluestore_ignore_data_csum flag was no longer needed, so I reverted to the standard packages.

-- 
Ruben Rodriguez | Chief Technology Officer, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] OSD fails to start (fsck error, unable to read osd superblock)
Hi there,

Running 12.2.11-1xenial on a machine with 6 SSD OSDs with bluestore.

Today we had two disks fail out of the controller, and after a reboot they both seemed to come back fine, but ceph-osd was only able to start on one of them. The other one gets this:

2019-02-08 18:53:00.703376 7f64f948ce00 -1 bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-02-08 18:53:00.703406 7f64f948ce00 -1 osd.3 0 OSD::init() : unable to read osd superblock

Note that there are no actual IO errors being shown by the controller in dmesg, and that the disk is readable. The metadata FS is mounted and looks normal.

I tried running "ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3 --deep 1" and that gets many instances of:

2019-02-08 19:00:31.783815 7fa35bd0df80 -1 bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device location [0x4000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2019-02-08 19:00:31.783866 7fa35bd0df80 -1 bluestore(/var/lib/ceph/osd/ceph-3) fsck error: #-1:7b3f43c4:::osd_superblock:0# error during read: (5) Input/output error

...which is the same error.

Due to a host being down for unrelated reasons, this is preventing some PGs from activating, keeping one pool inaccessible. There is no critical data in it, but I'm more interested in solving the issue for reliability. Any advice? What does a bad crc indicate in this context? Should I send this to the bug tracker instead?

-- 
Ruben Rodriguez | Chief Technology Officer, Free Software Foundation
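For context on the question above: a "bad crc" here means the bytes read back from the device no longer match the checksum BlueStore recorded when the block was written, so BlueStore refuses to return the data and surfaces EIO (5), which is why fsck reports an input/output error even though the controller sees none. A rough sketch of the idea (BlueStore uses crc32c over each 0x1000-byte chunk; Python's standard library has no crc32c, so `zlib.crc32` is used here purely as a stand-in):

```python
import zlib

CHUNK = 0x1000  # the csum block size in the log lines above (4 KiB)

def verify_chunk(data: bytes, stored_crc: int) -> bool:
    """Recompute the per-chunk checksum and compare with the stored one,
    roughly what BlueStore's _verify_csum does on every read."""
    return zlib.crc32(data) == stored_crc

chunk = b"\x00" * CHUNK
stored = zlib.crc32(chunk)          # checksum recorded at write time

assert verify_chunk(chunk, stored)  # clean read: passes

silently_corrupted = b"\x01" + chunk[1:]   # one flipped byte on disk
assert not verify_chunk(silently_corrupted, stored)
# On a mismatch, BlueStore logs "bad crc32c/0x1000 checksum ... got X,
# expected Y" and returns (5) Input/output error instead of bad data.
```

The point is that the mismatch is detected at read time against metadata written earlier, so a disk that "looks readable" in dmesg can still fail checksum verification.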
Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB
On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...

I'll provide more concise details on how to test this behavior:

Ceph config:
[client]
rbd readahead max bytes = 0  # we don't want forced readahead to fool us
rbd cache = true

Start a qemu vm with an rbd image attached via virtio-scsi. Block device parameters inside the vm:

NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G

Collect performance statistics from librbd, using the command:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump

Note the values for:
- rd: number of read operations done by qemu
- rd_bytes: length of read requests done by qemu
- cache_ops_hit: read operations hitting the cache
- cache_ops_miss: read ops missing the cache
- data_read: data read from the cache
- op_r: number of objects sent by the OSD

Perform one small read, not at the beginning of the image (because udev may have read it already), at a 4MB boundary:

dd if=/dev/sda ibs=512 count=1 skip=41943040 iflag=skip_bytes

Run the perf dump command. Then do it again, advancing 5000 bytes (so as not to overlap with the previous read):

dd if=/dev/sda ibs=512 count=1 skip=41948040 iflag=skip_bytes

Run the perf dump command again.

If you compare the op_r values at each step, you should see a cache miss each time, and an object read each time: the same object fetched twice.

IMPACT: Let's look at how the op_r value increases with some common operations:

- Booting a vm: this operation needs (in my case) ~70MB to be read, which includes the kernel, initrd and all files read by systemd and daemons until a command prompt appears. Values read:

"rd": 2524,
"rd_bytes": 69685248,
"cache_ops_hit": 228,
"cache_ops_miss": 2268,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 63902720,
"data_read": 69186560,
"op": 2295,
"op_r": 2279,

That is 2279 objects being fetched from the OSDs to read 69MB.

- Grepping inside the linux source code (833MB) takes almost 3 minutes.
Values get increased to:

"rd": 65127,
"rd_bytes": 1081487360,
"cache_ops_hit": 228,
"cache_ops_miss": 64885,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 1075672064,
"data_read": 1080988672,
"op_r": 64896,

That is over 60,000 objects fetched to read less than 1GB, with *zero* cache hits. Optimized, this should take ~10 seconds and fetch ~700 objects.

Is my Qemu implementation completely broken? Or is this expected? Please help!

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
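To put numbers on the miss behavior, here is a small helper (hypothetical, just arithmetic over the counter names from the perf dump quoted above; the ideal-fetch figure assumes the default 4 MiB rbd object size and purely sequential access, so it is a lower bound that ignores metadata and seek patterns):

```python
def cache_stats(perf: dict) -> dict:
    """Summarize librbd cache behavior from perf-dump style counters."""
    ops = perf["cache_ops_hit"] + perf["cache_ops_miss"]
    obj = 4 * 1024 * 1024  # default rbd object size
    return {
        "hit_rate": perf["cache_ops_hit"] / ops,
        # object fetches needed if each 4 MiB object were fetched once (ceil)
        "ideal_fetches": -(-perf["rd_bytes"] // obj),
        "actual_fetches": perf["op_r"],
    }

# counters after the grep run, as quoted above
grep_run = {"rd_bytes": 1081487360, "cache_ops_hit": 228,
            "cache_ops_miss": 64885, "op_r": 64896}

stats = cache_stats(grep_run)
# hit rate works out to ~0.35%, and ~258 whole-object fetches would have
# covered the bytes read, versus the 64896 object fetches actually issued
assert stats["actual_fetches"] > 200 * stats["ideal_fetches"]
```

This is the shape of the complaint in the thread: the read amplification is not in bytes (rd_bytes roughly matches data_read) but in round trips to the OSD.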
Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB
On 15/07/17 09:43, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Gregory Farnum
>> Sent: 15 July 2017 00:09
>> To: Ruben Rodriguez
>> Cc: ceph-users
>> Subject: Re: [ceph-users] RBD cache being filled up in small increases
>> instead of 4MB
>>
>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez wrote:
>>>
>>> I'm having an issue with small sequential reads (such as searching
>>> through source code files, etc), and I found that multiple small reads
>>> within a 4MB boundary would fetch the same object from the OSD
>>> multiple times, as it gets inserted into the RBD cache partially.
>>>
>>> How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
>>> writethrough cache on. Monitor with perf dump on the rbd client. The
>>> image is filled up with zeroes in advance. Rbd readahead is off.
>>>
>>> 1 - Small read from a previously unread section of the disk:
>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary, to
>>> avoid the beginning of the disk, which would have been read at boot.
>>>
>>> Expected outcomes: perf dump should show a +1 increase on the values rd,
>>> cache_ops_miss and op_r. This happens correctly.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096 (not sure why 4096, btw).
>>>
>>> 2 - Small read from less than 4MB away (in the example, +5000 bytes):
>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
>>> Expected outcomes: perf dump should show a +1 increase on cache_ops_hit.
>>> Instead cache_ops_miss increases.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096.
>>> op_r should not increase. Instead it increases by one, indicating that
>>> the object was fetched again.
>>>
>>> My tests show that this could be causing a 6 to 20-fold performance
>>> loss in small sequential reads.
>>>
>>> Is it by design that the RBD cache only inserts the portion requested
>>> by the client instead of the whole last object fetched? Could it be a
>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>> preventing this?
>>
>> I don't know the exact readahead default values in that stack, but there's no
>> general reason to think RBD (or any Ceph component) will read a whole
>> object at a time. In this case, you're asking for 512 bytes and it appears to
>> have turned that into a 4KB read (probably the virtual block size in use?),
>> which seems pretty reasonable — if you were asking for 512 bytes out of
>> every 4MB and it was reading 4MB each time, you'd probably be wondering
>> why you were only getting 1/8192 the expected bandwidth. ;) -Greg
>
> I think the general readahead logic might be a bit more advanced in the Linux
> kernel vs using readahead from the librbd client.

Yes, the problems I'm having should be corrected by the vm kernel issuing larger reads, but I'm failing to get that to happen.

> The kernel will watch how successful each readahead is and scale as
> necessary. You might want to try upping the read_ahead_kb for the block device
> in the VM. Something between 4MB to 32MB works well for RBDs, but make sure
> you have a 4.x kernel, as some fixes to readahead max size were introduced and
> I'm not sure if they ever got backported.

I'm using kernels 4.4 and 4.8. I have readahead, min_io_size, optimum_io_size and max_sectors_kb set to 4MB. It helps in some use cases, like fio or dd tests, but not with real-world tests like cp, grep or tar on a large pool of small files. From all I can tell, optimal read performance would happen if the vm kernel read in 4MB increments _every_ _time_. I can force that with an ugly hack (putting the files inside a formatted big file, mounted as loop) and it gives a 20-fold performance gain.
But that is just silly... I documented that experiment in this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html

> Unless you tell the rbd client to not disable readahead after reading the 1st
> x number of bytes (rbd readahead disable after bytes=0), it will stop reading
> ahead and will only cache exactly what is requested by the client.

I realized that, so as a proof of concept I made some changes to the readahead mechanism. I force it on, make it trigger every time, an
Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB
On 15/07/17 15:33, Jason Dillaman wrote:
> On Sat, Jul 15, 2017 at 9:43 AM, Nick Fisk wrote:
>> Unless you tell the rbd client to not disable readahead after reading the
>> 1st x number of bytes (rbd readahead disable after bytes=0), it will stop
>> reading ahead and will only cache exactly what is requested by the client.
>
> The default is to disable librbd readahead caching after reading 50MB
> -- since we expect the OS to take over and do a much better job.

I understand the expectation that the client will do the right thing, but from all I can tell that is not the case here. I've run out of ways to try to make virtio-scsi (or any other driver) *always* read in 4MB increments; "minimum_io_size" seems to be ignored. BTW, I just sent this patch to Qemu (and I'm open to any suggestions on that side!):
https://bugs.launchpad.net/qemu/+bug/1600563

But the expectation you mention still has a problem: if you only put in the RBD cache what the OS specifically requested, the chances of that data being requested twice are pretty low, since the OS page cache would take care of it better than the RBD cache anyway. So why bother having a read cache if it doesn't fetch anything extra? Incidentally, if the RBD cache included the whole object instead of just the requested portion, RBD readahead would be unnecessary.

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
[ceph-users] RBD cache being filled up in small increases instead of 4MB
I'm having an issue with small sequential reads (such as searching through source code files, etc), and I found that multiple small reads within a 4MB boundary would fetch the same object from the OSD multiple times, as it gets inserted into the RBD cache only partially.

How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi, writethrough cache on. Monitor with perf dump on the rbd client. The image is filled up with zeroes in advance. Rbd readahead is off.

1 - Small read from a previously unread section of the disk:

dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes

Notes: dd cannot read less than 512 bytes. The skip is arbitrary, to avoid the beginning of the disk, which would have been read at boot.

Expected outcomes: perf dump should show a +1 increase on the values rd, cache_ops_miss and op_r. This happens correctly. It should show a 4194304 increase in data_read as a whole object is put into the cache. Instead it increases by 4096 (not sure why 4096, btw).

2 - Small read from less than 4MB away (in the example, +5000 bytes):

dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes

Expected outcomes: perf dump should show a +1 increase on cache_ops_hit. Instead cache_ops_miss increases. It should show a 4194304 increase in data_read as a whole object is put into the cache. Instead it increases by 4096. op_r should not increase. Instead it increases by one, indicating that the object was fetched again.

My tests show that this could be causing a 6 to 20-fold performance loss in small sequential reads.

Is it by design that the RBD cache only inserts the portion requested by the client instead of the whole last object fetched? Could it be a tunable in any of my layers (fs, block device, qemu, rbd...) that is preventing this?
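To spell out why the second read is expected to be a cache hit, here is a sketch of the offset arithmetic (assuming the default 4 MiB rbd object size, i.e. rbd order 22, which matches the boundaries used above):

```python
OBJ_SIZE = 4 * 1024 * 1024  # default rbd object size (order 22)

def obj_index(offset: int) -> int:
    """Index of the backing RADOS object that a byte offset falls in."""
    return offset // OBJ_SIZE

first = 41943040          # offset of read 1 (sits exactly on a 4 MB boundary)
second = first + 5000     # offset of read 2

assert first % OBJ_SIZE == 0
assert obj_index(first) == obj_index(second) == 10
# Both reads land inside the same 4 MiB object, so if the whole object
# were inserted into the RBD cache on the first miss, the second read
# would be served from cache without another op_r to the OSD.
```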
Regards,
-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
Re: [ceph-users] Performance issue with small files, and weird "workaround"
On 06/27/2017 07:08 PM, Jason Dillaman wrote:
> Have you tried blktrace to determine if there are differences in the
> IO patterns to the rbd-backed virtio-scsi block device (direct vs
> indirect through loop)?

I tried today with the kernel tracing features, and I'll give blktrace a go if necessary. But I did find some important differences between the two read modes already. The main one is that in loop mode there are far fewer scsi_dispatch calls (roughly 100x fewer), and they have txlen=8192 (in 512-byte blocks, that means 4MB reads, matching the rbd object size), while in the direct case txlen is often 8, i.e. 4KB.

direct mode:
cp-1167 [000] 4790.125637: scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=2 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=17540264 txlen=8 protect=0 raw=28 00 01 0b a4 a8 00 00 08 00)

vs loop mode:
loop0-1021 [000] 4645.976267: scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=2 data_sgl=67 prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=4705776 txlen=8192 protect=0 raw=28 00 00 47 cd f0 00 20 00 00)

I also see a number of calls like:

loop0-1021 [000] 3319.709354: block_bio_backmerge: 8,16 R 10499064 + 2048 [loop0]
loop0-1021 [000] 3319.709508: block_bio_backmerge: 8,16 R 10501112 + 2048 [loop0]
loop0-1021 [000] 3319.709639: block_bio_backmerge: 8,16 R 10503160 + 2048 [loop0]

but only in the loop case.

The key to the performance penalty seems to be that the kernel is reading in 4KB chunks most of the time instead of 4MB, and setting readahead or min_io_size is failing to fix that. Any idea how to achieve this?

> On Tue, Jun 27, 2017 at 3:17 PM, Ruben Rodriguez wrote:
>>
>> We are setting up a new set of servers to run the FSF/GNU infrastructure,
>> and we are seeing a strange behavior. From a Qemu host, reading small
>> files from a mounted rbd image is very slow. The "real world" test that I
>> use is to copy the linux source code from the filesystem to /dev/shm.
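For reference, the txlen field in those READ_10 commands counts 512-byte logical blocks, so the two dispatch sizes decode as follows (a trivial sketch of the conversion):

```python
SECTOR = 512  # SCSI READ(10) txlen is expressed in 512-byte logical blocks

def txlen_to_bytes(txlen: int) -> int:
    """Convert a SCSI READ(10) transfer length to bytes."""
    return txlen * SECTOR

assert txlen_to_bytes(8) == 4096                # direct mode: 4 KB per command
assert txlen_to_bytes(8192) == 4 * 1024 * 1024  # loop mode: 4 MB per command
```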
>> On the host server that takes ~10 seconds to copy from a mapped rbd image,
>> but on the vm it takes over a minute. The same test also takes <20
>> seconds when the vm storage is local LVM. Writing the files to the
>> rbd-mounted disk also takes ~10 seconds.
>>
>> I suspect a problem with readahead and caching, so as a test I copied
>> those same files into a loop device inside the vm (stored in the same
>> rbd); reading takes ~10 seconds. I drop the caches before each test.
>>
>> This is how I run that test:
>>
>> dd if=/dev/zero of=test bs=1G count=5
>> mkfs.xfs test
>> mount test /mnt
>> cp linux-src /mnt -a
>> echo 1 > /proc/sys/vm/drop_caches
>> time cp /mnt/linux-src /dev/shm -a
>>
>> I've tested many different parameters (readahead, partition alignment,
>> filesystem formatting, block queue settings, etc) with little change in
>> performance. Wrapping files in a loop device seems to change things in a
>> way that I cannot replicate on the upper layers otherwise.
>>
>> Is this expected or am I doing something wrong?
>>
>> Here are the specs:
>> Ceph 10.2.7 on an Ubuntu xenial derivative. Kernel 4.4, Qemu 2.5
>> 2 Ceph servers running 6x 1TB SSD OSDs each.
>> 2 Qemu/kvm servers managed with libvirt
>> All connected with 20GbE (bonding). Every server has 2x 16-core opteron
>> cpus, 2GB ram per OSD, and a bunch of ram on the KVM host servers.
>>
>> osd pool default size = 2
>> osd pool default min size = 2
>> osd pool default pg num = 512
>> osd pool default pgp num = 512
>>
>> lsblk -t
>> NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
>> sdb           0    512      0     512     512    0 noop      128  0    2G
>> loop0         0    512      0     512     512    0           128  0    0B
>>
>> Some numbers:
>> rados bench -p libvirt-pool 10 write: avg MB/s 339.508, avg lat 0.186789
>> rados bench -p libvirt-pool 100 rand: avg MB/s .42, avg lat 0.0534118
>> Random small file read:
>> fio read 4k rand inside the vm: avg=2246KB/s, 1708usec avg lat, 600 IOPS
>> Sequential small-file read with readahead:
>> fio read 4k seq inside the vm: avg=308351KB/s, 11usec avg lat, 55k IOPS
>>
>> The rbd images are attached with virtio-scsi (no difference using
>> virtio) and the guest block devices have 4M readahead set (no difference
>> if disabled). Rbd cache is enabled on server and client (no difference
>> if disabled). Forcing rbd readahead makes no difference.
>>
>> Please advise!
>> --
>> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
[ceph-users] Performance issue with small files, and weird "workaround"
We are setting up a new set of servers to run the FSF/GNU infrastructure, and we are seeing a strange behavior. From a Qemu host, reading small files from a mounted rbd image is very slow. The "real world" test that I use is to copy the linux source code from the filesystem to /dev/shm. On the host server that takes ~10 seconds to copy from a mapped rbd image, but on the vm it takes over a minute. The same test also takes <20 seconds when the vm storage is local LVM. Writing the files to the rbd-mounted disk also takes ~10 seconds.

I suspect a problem with readahead and caching, so as a test I copied those same files into a loop device inside the vm (stored in the same rbd); reading takes ~10 seconds. I drop the caches before each test.

This is how I run that test:

dd if=/dev/zero of=test bs=1G count=5
mkfs.xfs test
mount test /mnt
cp linux-src /mnt -a
echo 1 > /proc/sys/vm/drop_caches
time cp /mnt/linux-src /dev/shm -a

I've tested many different parameters (readahead, partition alignment, filesystem formatting, block queue settings, etc) with little change in performance. Wrapping files in a loop device seems to change things in a way that I cannot replicate on the upper layers otherwise.

Is this expected, or am I doing something wrong?

Here are the specs:
Ceph 10.2.7 on an Ubuntu xenial derivative. Kernel 4.4, Qemu 2.5
2 Ceph servers running 6x 1TB SSD OSDs each.
2 Qemu/kvm servers managed with libvirt
All connected with 20GbE (bonding). Every server has 2x 16-core opteron cpus, 2GB ram per OSD, and a bunch of ram on the KVM host servers.
osd pool default size = 2
osd pool default min size = 2
osd pool default pg num = 512
osd pool default pgp num = 512

lsblk -t
NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
sdb           0    512      0     512     512    0 noop      128  0    2G
loop0         0    512      0     512     512    0           128  0    0B

Some numbers:
rados bench -p libvirt-pool 10 write: avg MB/s 339.508, avg lat 0.186789
rados bench -p libvirt-pool 100 rand: avg MB/s .42, avg lat 0.0534118
Random small file read:
fio read 4k rand inside the vm: avg=2246KB/s, 1708usec avg lat, 600 IOPS
Sequential small-file read with readahead:
fio read 4k seq inside the vm: avg=308351KB/s, 11usec avg lat, 55k IOPS

The rbd images are attached with virtio-scsi (no difference using virtio) and the guest block devices have 4M readahead set (no difference if disabled). Rbd cache is enabled on server and client (no difference if disabled). Forcing rbd readahead makes no difference.

Please advise!
-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
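A back-of-envelope sketch of why the loop-device workaround above is so much faster: the loop device issues 4 MB reads, while the direct path mostly issues page-sized (4 KB) reads. Assuming a ~833 MB linux source tree (the size quoted elsewhere in this thread) read purely sequentially, and the default 4 MiB rbd object size (hypothetical arithmetic, not a measurement):

```python
MB = 1024 * 1024
OBJ = 4 * MB          # default rbd object size
PAGE = 4096           # request size observed on the direct path

src_bytes = 833 * MB  # approximate size of the linux source tree

# ceil division: requests needed to pull the data at each request size
reads_4m = -(-src_bytes // OBJ)
reads_4k = -(-src_bytes // PAGE)

assert reads_4m == 209
assert reads_4k == 213248
# When the RBD cache never hits, each request is a round trip to the
# OSD, so 4 KB dispatches mean ~1020x more round trips than 4 MB ones.
print(reads_4k // reads_4m)
```

Even with network latency of a fraction of a millisecond per round trip, that difference in request count dominates the total copy time.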