Re: [ceph-users] OSD fails to start (fsck error, unable to read osd superblock)

2019-02-12 Thread Ruben Rodriguez


On 2/9/19 5:40 PM, Brad Hubbard wrote:
> On Sun, Feb 10, 2019 at 1:56 AM Ruben Rodriguez  wrote:
>>
>> Hi there,
>>
>> Running 12.2.11-1xenial on a machine with 6 SSD OSDs on BlueStore.
>>
>> Today we had two disks drop off the controller, and after a reboot
>> they both seemed to come back fine, but ceph-osd was only able to start
>> on one of them. The other one gets this:
>>
>> 2019-02-08 18:53:00.703376 7f64f948ce00 -1
>> bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
>> checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device
>> location [0x4000~1000], logical extent 0x0~1000, object
>> #-1:7b3f43c4:::osd_superblock:0#
>> 2019-02-08 18:53:00.703406 7f64f948ce00 -1 osd.3 0 OSD::init() : unable
>> to read osd superblock
>>
>> Note that there are no actual IO errors being shown by the controller in
>> dmesg, and that the disk is readable. The metadata FS is mounted and
>> looks normal.
>>
>> I tried running "ceph-bluestore-tool repair --path
>> /var/lib/ceph/osd/ceph-3 --deep 1" and that gets many instances of:
> 
> Running this with debug_bluestore=30 might give more information on
> the nature of the IO error.

I had collected the logs with debug info already, and nothing
significant was listed there. I applied this patch
https://github.com/ceph/ceph/pull/26247 and it allowed me to move
forward. There was an OSD map corruption issue that I had to handle by
hand, but after that the OSD started fine. Once it started and
backfills finished, the bluestore_ignore_data_csum flag was no longer
needed, so I reverted to the standard packages.
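
For the archives, a rough sketch of what that looked like here (not an
exact transcript; the logging flags are the documented ceph-bluestore-tool
options and the config option name is the one introduced by that PR, so
adjust to whatever your build expects):

  # deep repair again, this time with a verbose log to inspect
  ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-3 --deep 1 \
      --log-file /tmp/bluestore-repair-osd.3.log --log-level 30

  # with the patched packages installed, let osd.3 start despite the bad
  # data checksum (removed from ceph.conf again once backfill finished)
  [osd.3]
  bluestore ignore data csum = true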

-- 
Ruben Rodriguez | Chief Technology Officer, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org





[ceph-users] OSD fails to start (fsck error, unable to read osd superblock)

2019-02-09 Thread Ruben Rodriguez
Hi there,

Running 12.2.11-1xenial on a machine with 6 SSD OSDs on BlueStore.

Today we had two disks drop off the controller, and after a reboot
they both seemed to come back fine, but ceph-osd was only able to start
on one of them. The other one gets this:

2019-02-08 18:53:00.703376 7f64f948ce00 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device
location [0x4000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#
2019-02-08 18:53:00.703406 7f64f948ce00 -1 osd.3 0 OSD::init() : unable
to read osd superblock

Note that there are no actual IO errors being shown by the controller in
dmesg, and that the disk is readable. The metadata FS is mounted and
looks normal.

I tried running "ceph-bluestore-tool repair --path
/var/lib/ceph/osd/ceph-3 --deep 1" and that gets many instances of:

2019-02-08 19:00:31.783815 7fa35bd0df80 -1
bluestore(/var/lib/ceph/osd/ceph-3) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0x95104dfc, expected 0xb9e3e26d, device
location [0x4000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#
2019-02-08 19:00:31.783866 7fa35bd0df80 -1
bluestore(/var/lib/ceph/osd/ceph-3) fsck error:
#-1:7b3f43c4:::osd_superblock:0# error during read: (5) Input/output error

...which is the same error. Due to a host being down for unrelated
reasons, this is preventing some PGs from activating, keeping one pool
inaccessible. There is no critical data in it, but I'm more interested
in solving the issue for reliability.

Any advice? What does bad crc indicate in this context? Should I send
this to the bug tracker instead?
-- 
Ruben Rodriguez | Chief Technology Officer, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org


Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 14/07/17 18:43, Ruben Rodriguez wrote:
> How to reproduce...

I'll provide more concise details on how to test this behavior:

Ceph config:

[client]
rbd readahead max bytes = 0 # we don't want forced readahead to fool us
rbd cache = true
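
(If the qemu client doesn't already expose an admin socket, it usually
needs to be enabled in [client] so the asok used below exists; path and
permissions are setup-specific, something along these lines:)

admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok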

Start a qemu vm, with an rbd image attached via virtio-scsi. (The libvirt
<disk>/<controller> XML was stripped by the list archive; a representative
sketch follows.)
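
A representative definition looks like this (pool, image, monitor and auth
names are placeholders; the cache mode and the virtio-scsi bus are the
parts that matter for this test):

  <controller type='scsi' index='0' model='virtio-scsi'/>
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='writethrough'/>
    <auth username='libvirt'>
      <secret type='ceph' usage='client.libvirt secret'/>
    </auth>
    <source protocol='rbd' name='libvirt-pool/guest-disk'>
      <host name='ceph-mon.example.org' port='6789'/>
    </source>
    <target dev='sdb' bus='scsi'/>
  </disk>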

Block device parameters, inside the vm:
NAME ALIGN  MIN-IO  OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE   RA WSAME
sdb      0 4194304 4194304     512     512    1 noop      128 4096    2G

Collect performance statistics from librbd, using command:

$ ceph --admin-daemon /var/run/ceph/ceph-client.[...].asok perf dump

Note the values for:
- rd: number of read operations done by qemu
- rd_bytes: length of read requests done by qemu
- cache_ops_hit: read operations hitting the cache
- cache_ops_miss: read ops missing the cache
- data_read: data read from the cache
- op_r: number of objects sent by the OSD

Perform one small read at a 4MB boundary, not at the beginning of the
image (because udev may have read it already), then run the perf dump
command:

dd if=/dev/sda ibs=512 count=1 skip=41943040 iflag=skip_bytes

Do it again, advancing 5000 bytes (so it does not overlap with the
previous read), and run the perf dump command again:

dd if=/dev/sda ibs=512 count=1 skip=41948040 iflag=skip_bytes

If you compare the op_r values at each step, you should see a cache miss
and an object read each time: the same object is fetched twice.
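
If it helps, this is a minimal host-side sketch for diffing the relevant
counters between steps (the asok name is whatever your qemu client
exposes; the dd reads above run inside the guest):

  ASOK=/var/run/ceph/ceph-client.XXXX.asok   # fill in the client id/pid
  ceph --admin-daemon $ASOK perf dump > before.json
  # ... run one of the dd reads inside the guest ...
  ceph --admin-daemon $ASOK perf dump > after.json
  diff <(grep -E '"(rd|rd_bytes|cache_ops_hit|cache_ops_miss|data_read|op_r)"' before.json) \
       <(grep -E '"(rd|rd_bytes|cache_ops_hit|cache_ops_miss|data_read|op_r)"' after.json)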

IMPACT:

Let's take a look at how the op_r value increases by doing some common
operations:

- Booting a vm: this operation needs (in my case) ~70MB to be read,
which includes the kernel, the initrd and all the files read by systemd
and its daemons until a command prompt appears. The values read:
"rd": 2524,
"rd_bytes": 69685248,
"cache_ops_hit": 228,
"cache_ops_miss": 2268,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 63902720,
"data_read": 69186560,
"op": 2295,
"op_r": 2279,
That is 2299 objects being fetched from the OSD to read 69MB.

- Grepping through the linux source code (833MB) takes almost 3 minutes.
  The values increase to:
"rd": 65127,
"rd_bytes": 1081487360,
"cache_ops_hit": 228,
"cache_ops_miss": 64885,
"cache_bytes_hit": 90353664,
"cache_bytes_miss": 1075672064,
"data_read": 1080988672,
"op_r": 64896,
That is over 60,000 objects fetched to read <1GB, with *0* cache hits.
Optimized, this should take ~10 seconds and fetch ~700 objects.

Is my Qemu implementation completely broken? Or is this expected? Please
help!

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org


Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 15/07/17 09:43, Nick Fisk wrote:
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Gregory Farnum
>> Sent: 15 July 2017 00:09
>> To: Ruben Rodriguez 
>> Cc: ceph-users 
>> Subject: Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB
>>
>> On Fri, Jul 14, 2017 at 3:43 PM, Ruben Rodriguez  wrote:
>>>
>>> I'm having an issue with small sequential reads (such as searching
>>> through source code files, etc), and I found that multiple small reads
>>> within a 4MB boundary would fetch the same object from the OSD
>>> multiple times, as it gets inserted into the RBD cache partially.
>>>
>>> How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
>>> writethrough cache on. Monitor with perf dump on the rbd client. The
>>> image is filled up with zeroes in advance. Rbd readahead is off.
>>>
>>> 1 - Small read from a previously unread section of the disk:
>>> dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
>>> Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
>>> avoid the beginning of the disk, which would have been read at boot.
>>>
>>> Expected outcomes: perf dump should show a +1 increase on values rd,
>>> cache_ops_miss and op_r. This happens correctly.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096. (not sure why 4096, btw).
>>>
>>> 2 - Small read from less than 4MB distance (in the example, +5000b).
>>> dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes Expected
>>> outcomes: perf dump should show a +1 increase on cache_ops_hit.
>>> Instead cache_ops_miss increases.
>>> It should show a 4194304 increase in data_read as a whole object is
>>> put into the cache. Instead it increases by 4096.
>>> op_r should not increase. Instead it increases by one, indicating that
>>> the object was fetched again.
>>>
>>> My tests show that this could be causing a 6 to 20-fold performance
>>> loss in small sequential reads.
>>>
>>> Is it by design that the RBD cache only inserts the portion requested
>>> by the client instead of the whole last object fetched? Could it be a
>>> tunable in any of my layers (fs, block device, qemu, rbd...) that is
>>> preventing this?
>>
>> I don't know the exact readahead default values in that stack, but there's no
>> general reason to think RBD (or any Ceph component) will read a whole
>> object at a time. In this case, you're asking for 512 bytes and it appears to
>> have turned that into a 4KB read (probably the virtual block size in use?),
>> which seems pretty reasonable — if you were asking for 512 bytes out of
>> every 4MB and it was reading 4MB each time, you'd probably be wondering
>> why you were only getting 1/8192 the expected bandwidth. ;) -Greg
> 
> I think the general readahead logic might be a bit more advanced in the Linux 
> kernel vs using readahead from the librbd client.

Yes, the problems I'm having should be corrected by the vm kernel
issuing larger reads, but I'm failing to get that to happen.

> The kernel will watch how successful each readahead is and scale as 
> necessary. You might want to try upping the read_ahead_kb for the block device 
> in the VM. Something between 4MB and 32MB works well for RBDs, but make sure 
> you have a 4.x kernel, as some fixes to the readahead max size were introduced 
> and I'm not sure if they ever got backported.

I'm using kernels 4.4 and 4.8. I have readahead, min_io_size,
optimum_io_size and max_sectors_kb set to 4MB. It helps in some use
cases, like fio or dd tests, but not with real-world tests like cp,
grep or tar on a large pool of small files.
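
For reference, this is roughly how those are set here. read_ahead_kb and
max_sectors_kb are plain sysfs knobs inside the guest; min_io_size and
opt_io_size are qemu scsi-hd device properties (in bytes), so they go on
the qemu command line (libvirt's <blockio> element doesn't cover them, as
far as I can tell):

  # inside the guest
  echo 4096 > /sys/block/sdb/queue/read_ahead_kb    # 4MB readahead
  echo 4096 > /sys/block/sdb/queue/max_sectors_kb   # allow 4MB requests

  # qemu side, schematically
  -device scsi-hd,drive=<drive-id>,min_io_size=4194304,opt_io_size=4194304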

From all I can tell, optimal read performance would happen when the vm
kernel reads in 4MB increments _every_ _time_. I can force that with an
ugly hack (putting the files inside a big formatted file mounted as a
loop device), which gives a 20-fold performance gain. But that is just silly...

I documented that experiment on this thread:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018924.html

> Unless you tell the rbd client to not disable readahead after reading the 1st 
> x number of bytes (rbd readahead disable after bytes=0), it will stop reading 
> ahead and will only cache exactly what is requested by the client.

I realized that, so as a proof of concept I made some changes to the
readahead mechanism. I force it on, make it trigger every time, an

Re: [ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-15 Thread Ruben Rodriguez


On 15/07/17 15:33, Jason Dillaman wrote:
> On Sat, Jul 15, 2017 at 9:43 AM, Nick Fisk  wrote:
>> Unless you tell the rbd client to not disable readahead after reading the 
>> 1st x number of bytes (rbd readahead disable after bytes=0), it will stop 
>> reading ahead and will only cache exactly what is requested by the client.
> 
> The default is to disable librbd readahead caching after reading 50MB
> -- since we expect the OS to take over and do a much better job.

I understand having the expectation that the client would do the right
thing, but from all I can tell it is not the case. I've run out of ways
to try to make virtio-scsi (or any other driver) *always* read in 4MB
increments. "minimum_io_size" seems to be ignored.
BTW I just sent this patch to Qemu (and I'm open to any suggestions on
that side!): https://bugs.launchpad.net/qemu/+bug/1600563

But this expectation still has a problem: if you only put in the RBD
cache what the OS specifically requested, the chances of that data being
requested twice are pretty low, since the OS page cache would take care
of it better than the RBD cache anyway. So why bother having a read
cache at all if it doesn't fetch anything extra?

Incidentally, if the RBD cache were to include the whole object instead
of just the requested portion, RBD readahead would be unnecessary.

-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org


[ceph-users] RBD cache being filled up in small increases instead of 4MB

2017-07-14 Thread Ruben Rodriguez

I'm having an issue with small sequential reads (such as searching
through source code files, etc.), and I found that multiple small reads
within a 4MB boundary will fetch the same object from the OSD multiple
times, as it only gets partially inserted into the RBD cache.

How to reproduce: rbd image accessed from a Qemu vm using virtio-scsi,
writethrough cache on. Monitor with perf dump on the rbd client. The
image is filled up with zeroes in advance. Rbd readahead is off.

1 - Small read from a previously unread section of the disk:
dd if=/dev/sdb ibs=512 count=1 skip=41943040 iflag=skip_bytes
Notes: dd cannot read less than 512 bytes. The skip is arbitrary to
avoid the beginning of the disk, which would have been read at boot.

Expected outcomes: perf dump should show a +1 increase on values rd,
cache_ops_miss and op_r. This happens correctly.
It should show a 4194304 increase in data_read as a whole object is put
into the cache. Instead it increases by 4096. (not sure why 4096, btw).

2 - Small read from less than 4MB distance (in the example, +5000b).
dd if=/dev/sdb ibs=512 count=1 skip=41948040 iflag=skip_bytes
Expected outcomes: perf dump should show a +1 increase on cache_ops_hit.
Instead cache_ops_miss increases.
It should show a 4194304 increase in data_read as a whole object is put
into the cache. Instead it increases by 4096.
op_r should not increase. Instead it increases by one, indicating that
the object was fetched again.

My tests show that this could be causing a 6 to 20-fold performance loss
in small sequential reads.

Is it by design that the RBD cache only inserts the portion requested by
the client instead of the whole last object fetched? Could it be a
tunable in any of my layers (fs, block device, qemu, rbd...) that is
preventing this?

Regards,
-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org





Re: [ceph-users] Performance issue with small files, and weird "workaround"

2017-06-28 Thread Ruben Rodriguez

On 06/27/2017 07:08 PM, Jason Dillaman wrote:
> Have you tried blktrace to determine if there are differences in the
> IO patterns to the rbd-backed virtio-scsi block device (direct vs
> indirect through loop)?

I tried today with the kernel tracing features, and I'll give blktrace a
go if necessary. But I already found some important differences between
the two read modes. The main one is that in loop mode there are very few
scsi_dispatch calls (roughly 100 times fewer), and they have txlen=8192
(in 512-byte blocks, that is 4MB reads, matching the rbd object size),
while in the direct case txlen is often 8, i.e. 4KB.

direct mode:  cp-1167  [000]   4790.125637:
scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=2 data_sgl=1
prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=17540264 txlen=8
protect=0 raw=28 00 01 0b a4 a8 00 00 08 00)

vs

loop mode:   loop0-1021  [000]   4645.976267:
scsi_dispatch_cmd_start: host_no=2 channel=0 id=0 lun=2 data_sgl=67
prot_sgl=0 prot_op=SCSI_PROT_NORMAL cmnd=(READ_10 lba=4705776 txlen=8192
protect=0 raw=28 00 00 47 cd f0 00 20 00 00)


I also see a number of calls like
   loop0-1021  [000]   3319.709354: block_bio_backmerge:
8,16 R 10499064 + 2048 [loop0]
   loop0-1021  [000]   3319.709508: block_bio_backmerge:
8,16 R 10501112 + 2048 [loop0]
   loop0-1021  [000]   3319.709639: block_bio_backmerge:
8,16 R 10503160 + 2048 [loop0]
but only in the loop case.
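
(In case anyone wants to reproduce the traces above, this is roughly the
ftrace setup, assuming debugfs is mounted in the usual place:)

  cd /sys/kernel/debug/tracing
  echo 1 > events/scsi/scsi_dispatch_cmd_start/enable
  echo 1 > events/block/block_bio_backmerge/enable
  echo 1 > tracing_on
  cat trace_pipe > /tmp/cp-trace.txt &    # run the cp test, then:
  echo 0 > tracing_on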

The key to the performance penalty seems to be that the kernel is
reading in 4KB chunks most of the time instead of 4MB, and setting
readahead or min_io_size is failing to fix that. Any idea how to get it
to issue 4MB reads consistently?


> On Tue, Jun 27, 2017 at 3:17 PM, Ruben Rodriguez  wrote:
>>
>> We are setting a new set of servers to run the FSF/GNU infrastructure,
>> and we are seeing a strange behavior. From a Qemu host, reading small
>> files from a mounted rbd image is very slow. The "realworld" test that I
>> use is to copy the linux source code from the filesystem to /dev/shm. On
>> the host server that takes ~10 seconds to copy from a mapped rbd image,
>> but on the vm it takes over a minute. The same test also takes <20
>> seconds when the vm storage is local LVM. Writing the files to the rbd
>> mounted disk also takes ~10 seconds.
>>
>> I suspect a problem with readahead and caching, so as a test I copied
>> those same files into a loop device inside the vm (stored in the same
>> rbd), reading takes ~10 seconds. I drop the caches before each test.
>>
>> This is how I run that test:
>>
>> dd if=/dev/zero of=test bs=1G count=5
>> mkfs.xfs test
>> mount test /mnt
>> cp linux-src /mnt -a
>> echo 1 > /proc/sys/vm/drop_caches
>> time cp /mnt/linux-src /dev/shm -a
>>
>> I've tested many different parameters (readahead, partition alignment,
>> filesystem formatting, block queue settings, etc) with little change in
>> performance. Wrapping files in a loop device seems to change things in a
>> way that I cannot replicate on the upper layers otherwise.
>>
>> Is this expected or am I doing something wrong?
>>
>> Here are the specs:
>> Ceph 10.2.7 on Ubuntu xenial derivative. Kernel 4.4, Qemu 2.5
>> 2 Ceph servers running 6x 1TB SSD OSDs each.
>> 2 Qemu/kvm servers managed with libvirt
>> All connected with 20GbE (bonding). Every server has 2x 16 core opteron
>> cpus, 2GB ram per OSD, and a bunch of ram on the KVM host servers.
>>
>> osd pool default size = 2
>> osd pool default min size = 2
>> osd pool default pg num = 512
>> osd pool default pgp num = 512
>>
>> lsblk -t
>> NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
>> sdb           0    512      0     512     512    0 noop      128  0    2G
>> loop0         0    512      0     512     512    0           128  0    0B
>>
>> Some numbers:
>> rados bench -p libvirt-pool 10 write: avg MB/s 339.508 avg lat 0.186789
>> rados bench -p libvirt-pool 100 rand: avg MB/s .42 avg lat 0.0534118
>> Random small file read:
>> fio read 4k rand inside the vm: avg=2246KB/s 1708usec avg lat, 600IOPS
>> Sequential, small files read with readahead:
>> fio read 4k  seq inside the vm: avg=308351KB/s 11usec avg lat, 55kIOPS
>>
>> The rbd images are attached with virtio-scsi (no difference using
>> virtio) and the guest block devices have 4M readahead set (no difference
>> if disabled). Rbd cache is enabled on server and client (no difference
>> if disabled). Forcing rbd readahead makes no difference.
>>
>> Please advise!
>> --
>> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
>> GPG Key: 05

[ceph-users] Performance issue with small files, and weird "workaround"

2017-06-27 Thread Ruben Rodriguez

We are setting up a new set of servers to run the FSF/GNU infrastructure,
and we are seeing some strange behavior. From inside a Qemu vm, reading
small files from a mounted rbd image is very slow. The "real-world" test
that I use is to copy the linux source code from the filesystem to
/dev/shm. On the host server that takes ~10 seconds to copy from a mapped
rbd image, but on the vm it takes over a minute. The same test also takes
<20 seconds when the vm storage is local LVM. Writing the files to the
rbd-mounted disk also takes ~10 seconds.

I suspect a problem with readahead and caching, so as a test I copied
those same files into a loop device inside the vm (backed by the same
rbd); reading them then takes ~10 seconds. I drop the caches before each
test.

This is how I run that test:

dd if=/dev/zero of=test bs=1G count=5    # 5GB backing file on the rbd-backed filesystem
mkfs.xfs test
mount test /mnt                          # mount attaches it through a loop device
cp linux-src /mnt -a                     # populate it with the linux source tree
echo 1 > /proc/sys/vm/drop_caches        # drop the page cache before timing
time cp /mnt/linux-src /dev/shm -a       # the timed read test

I've tested many different parameters (readahead, partition alignment,
filesystem formatting, block queue settings, etc.) with little change in
performance. Wrapping the files in a loop device seems to change things
in a way that I cannot otherwise replicate at the upper layers.

Is this expected or am I doing something wrong?

Here are the specs:
Ceph 10.2.7 on an Ubuntu xenial derivative. Kernel 4.4, Qemu 2.5.
2 Ceph servers running 6x 1TB SSD OSDs each.
2 Qemu/kvm servers managed with libvirt.
All connected with 20GbE (bonded). Every server has 2x 16-core Opteron
CPUs, 2GB of RAM per OSD, and plenty of RAM on the KVM host servers.

osd pool default size = 2
osd pool default min size = 2
osd pool default pg num = 512
osd pool default pgp num = 512

lsblk -t
NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
sdb           0    512      0     512     512    0 noop      128  0    2G
loop0         0    512      0     512     512    0           128  0    0B

Some numbers:
rados bench -p libvirt-pool 10 write: avg MB/s 339.508 avg lat 0.186789
rados bench -p libvirt-pool 100 rand: avg MB/s .42 avg lat 0.0534118
Random small file read:
fio read 4k rand inside the vm: avg=2246KB/s 1708usec avg lat, 600IOPS
Sequential, small files read with readahead:
fio read 4k  seq inside the vm: avg=308351KB/s 11usec avg lat, 55kIOPS

The rbd images are attached with virtio-scsi (no difference using
virtio) and the guest block devices have 4M readahead set (no difference
if disabled). Rbd cache is enabled on server and client (no difference
if disabled). Forcing rbd readahead makes no difference.
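
For completeness, this is how I check what the guest device actually ends
up with (blockdev reports readahead in 512-byte sectors, sysfs in KB):

  blockdev --getra /dev/sdb
  cat /sys/block/sdb/queue/read_ahead_kb
  cat /sys/block/sdb/queue/max_sectors_kb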

Please advise!
-- 
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8  27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org



