Re: [ceph-users] How to improve single thread sequential reads?
On 18-08-15 12:25, Benedikt Fraunhofer wrote:

> Hi Nick,
>
> did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is IMHO limited by the speed of a single disk. Data already present in the osd OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit. (20-node cluster, 12x 4TB 7.2k disks, 2 SSDs as journals for 6 OSDs each, LACP 2x10G bonds.)
>
> rados bench performs equally badly single-threaded, but with its default multithreaded settings it generates wonderful numbers, usually limited only by line rate and/or interrupts/s. I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to your wonderful numbers, but it stays below 30MB/s.
>
> I was thinking about using a software RAID0 like you did, but that's IMHO really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies.

It really depends on your situation, but you could also go for objects larger than 4MB for specific block devices. In a use case with a customer where they read large files single-threaded from RBD block devices, we went for 64MB objects. That improved our read performance in that case: we didn't have to create a new TCP connection and talk to a new OSD every 4MB. You could try that and see how it works out.

Wido

> Thx in advance
> Benedikt
>
> 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
>
> Thanks for the replies guys. The client is set to 4MB; I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this. If you could set the RBD readahead much higher than the object size, then this would probably give the desired effect, where the buffer could be populated by reading from several OSDs in advance to give much higher performance. That, or wait for striping to appear in the kernel client.
>
> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
> Sent: 17 August 2015 03:36
> To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd? In case of librbd, try the different config options for rbd cache..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev
> Sent: Sunday, August 16, 2015 7:07 PM
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> Hi Nick,
>
> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
> Sent: 13 August 2015 18:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] How to improve single thread sequential reads?
>
> Hi,
>
> I'm trying to use an RBD to act as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice. On a personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
>
> Some settings in our ceph.conf that may help:
>
> osd_op_threads = 20
> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> filestore_queue_max_ops = 9
> filestore_flusher = false
> filestore_max_sync_interval = 10
> filestore_sync_flush = false
>
> Regards,
> Alex
>
> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
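(For reference, Wido's larger-object suggestion is applied at image creation time; a minimal sketch, assuming the Hammer-era rbd CLI where --order gives the log2 of the object size, and with pool/image names made up:)

    # 100GB image with 64MB objects (2^26 bytes); the default order is 22 (4MB)
    rbd create --size 102400 --order 26 rbd/tape-staging
    rbd info rbd/tape-staging    # "order 26" confirms the object size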
[ceph-users] Re: Re: tcmalloc use a lot of CPU
Hi!

> How many nodes? How many SSDs/OSDs?

2 nodes, each:
- 1x E5-2670, 128G RAM
- 2x 146G SAS 10krpm - system + MON root
- 10x 600G SAS 10krpm + 7x 900G SAS 10krpm, single-drive RAID0 on LSI 2208
- 2x 400G SSD Intel DC S3700 on C602 - for a separate SSD pool
- 2x 200G SSD Intel DC S3700 on SATA3 - for ceph journals
- 10Gbit shared interconnect (Eth)

So: 2 MONs (I know about quorum ;) ) + 34 HDD OSDs + 4 SSD OSDs. Ceph 0.94.2 on Debian Jessie. Tuning: swappiness, low-latency TCP tuning, enlarged TCP buffers, disabled interrupt coalescing, noop on SSDs, deadline on HDDs.

> Are they random?

Yes: 4k random read, 8 processes, aio, qd=32, over 500G RBD volumes. There are 2 test volumes - one on the HDD pool and one on the SSD pool. The client runs on a separate host with a 10Gbit network. The volumes hold real Linux filesystems, created with rbd import, so they are fully allocated.

> What are you using to make the tests?

fio-rbd 2.2.7 - with native rbd support, built from source.

> How big are those OPS?

When I use the default ceph.conf (simple messenger, crc on, cephx on, all debug off):

1. ~12k iops from the HDD pool in a cold state (after dropping caches on all nodes):
- 8-10% user, 2-3% sys, ~70% iowait, 18% idle
- iostat shows 70% load on the OSD drives
- perf top shows:

  7,53% libtcmalloc.so.4.2.2 [.] tcmalloc::SLL_Next(void*)
  1,86% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
  1,51% libpthread-2.19.so [.] __pthread_mutex_unlock_usercnt
  1,49% libtcmalloc.so.4.2.2 [.] TCMalloc_PageMap3<35>::get(unsigned long) const
  1,29% libtcmalloc.so.4.2.2 [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long)
  1,25% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
  1,19% ceph-osd [.] crush_hash32_3
  1,00% libpthread-2.19.so [.] pthread_mutex_lock
  0,89% libtcmalloc.so.4.2.2 [.] tcmalloc::ThreadCache::Deallocate(void*, unsigned long)
  0,87% libtcmalloc.so.4.2.2 [.] base::subtle::NoBarrier_Load(long const volatile*)

2. ~30-40k iops from the HDD pool in a warm state (second pass):
- 40-60% user (!), 8-10% sys, 1% iowait, ~50% idle
- iostat shows 1% load on the OSD drives
- perf top shows the same - tcmalloc calls are at the top

It is quite an understandable situation: on the first run most io is read from the platters, and we got 12000 iops / 34 osds ~ 350 iops each, which is a good value for a 10krpm drive. On the second run we serve reads (mostly) from the page cache, so there is no IO on the platters. But both runs show us that there is some tcmalloc issue limiting the overall io of the cluster. Also, 40% CPU in the second run is an abnormal value, I think.

The next test is the same, except the volume is on the SSD pool.

3. ~43k iops from the SSD pool in a cold state (after dropping caches on all nodes):
- 25% user, 8-12% sys, ~6% iowait, ~55-60% idle
- iostat shows ~55-65% load on the SSDs with ~8 kiops each (4 ssds total in the pool)
- perf top shows two different things; I'll explain later (*)

4. Also the same ~43k iops from the SSD pool in a warm state.

This test shows that ceph limits performance by itself somewhere, because (a) there is almost no difference in iops between serving io from the ssds themselves and from the page cache - I think io from the page cache should be faster anyway - and (b) each SSD can do 30k iops of random read, while we got only ~8k per drive.
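(For reference, a sketch of the kind of fio invocation described above, using fio's rbd engine; the pool, image, and client names here are assumptions:)

    fio --ioengine=rbd --clientname=admin --pool=ssdpool --rbdname=testvol \
        --rw=randread --bs=4k --iodepth=32 --numjobs=8 --direct=1 \
        --runtime=60 --time_based --name=rbd-4k-randread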
(*) As for the perf top results: sometimes things change quickly, and instead of tcmalloc's calls at the top we get:

46,07% [kernel] [k] _raw_spin_lock
6,51% [kernel] [k] mb_cache_entry_alloc

As I can see from the function names, these are kernel calls for cache allocation. In a normal situation they are far behind the tcmalloc calls, but sometimes they go up in perf top. In those moments, performance from the SSD pool drops significantly - to 10k iops. And this does not happen while benchmarking the volume located on the HDD pool, only when testing the volume on the SSD pool. Pity, but I don't have any explanation. A kernel issue?

> Using atop on the OSD nodes, where is your bottleneck?

That is the main question! We built this test Hammer install to get the best performance from it, because our production Firefly cluster does not perform so well. And I can't see any bottleneck that limits performance to ~40k iops, except the tcmalloc issues.

PS: I tried the ms_async messenger, and it raises performance to over 60k! That is very good! But the bad thing is a core dump that always happens within two minutes of starting. As far as I can see, there is an assert on memory deallocation in the AsyncMessenger code. I hope the async messenger will work better in new Ceph versions, as it really helps to increase performance.

Megov Igor
CIO, Yuterra

From: Luis Periquito periqu...@gmail.com
Sent: 17 August 2015 17:15
To: Межов Игорь Александрович
Cc: YeYin; ceph-users
Subject: Re: [ceph-users] Re: tcmalloc use a lot of CPU

How
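(For anyone who wants to reproduce the async-messenger experiment - and note the crash caveat above - in Hammer it is selected via ms_type in ceph.conf; a minimal sketch:)

    [global]
    # experimental in 0.94: switch from SimpleMessenger to AsyncMessenger
    ms_type = async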
Re: [ceph-users] How to improve single thread sequential reads?
I'm not sure if I missed that, but are you testing in a VM backed by an RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM; it just passes the request to the underlying block device, and in the case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe, because I'm unable to tune the scheduler and it just makes no sense at all...?

Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) the cfq scheduler and set the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.

If you are running a single-threaded benchmark like rados bench, then what's limiting you is latency - it's not surprising it scales up with more threads. It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - I'm not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show up in /sys/block? Can you dump find /sys/block/$rbd in here?

Jan

On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net wrote:
> [snip - quoted message trimmed; see Benedikt's mail earlier in the thread]
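(To answer Jan's question: krbd devices do show up under /sys/block; a sketch of the knobs discussed in this thread, with the device name assumed:)

    # inspect the current limits on a mapped krbd device
    grep . /sys/block/rbd0/queue/read_ahead_kb \
           /sys/block/rbd0/queue/max_sectors_kb \
           /sys/block/rbd0/queue/max_hw_sectors_kb \
           /sys/block/rbd0/queue/rotational \
           /sys/block/rbd0/queue/scheduler
    # bump readahead to 4MB (the value is in KB)
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb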
Re: [ceph-users] How to improve single thread sequential reads?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
Sent: 18 August 2015 11:50
To: Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net
Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
Subject: Re: [ceph-users] How to improve single thread sequential reads?

> I'm not sure if I missed that, but are you testing in a VM backed by an RBD device, or using the device directly? [...] I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe, because I'm unable to tune the scheduler and it just makes no sense at all...?

Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why that testing kernel is essential for large-block/sequential workloads. I think the bug fix should make it into 4.2, hopefully.

> Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) the cfq scheduler and set the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object size, from what I can tell), and max_sectors_kb is already set at the hw max. It would sure be nice if max_hw_sectors_kb could be set higher, though, but I'm not sure if there is a reason for this limit.

> If you are running a single-threaded benchmark like rados bench, then what's limiting you is latency - it's not surprising it scales up with more threads.

Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation, as a single-threaded op effectively becomes multithreaded.

> It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show up in /sys/block? Can you dump find /sys/block/$rbd in here?
>
> Jan
>
> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net wrote:
> [snip - quoted message trimmed; see Benedikt's mail earlier in the thread]
Re: [ceph-users] Repair inconsistent pgs..
Voloshanenko Igor writes:

> Hi Irek. Please read carefully ))) Your proposal was the first thing I tried... That's why I asked for help... (
>
> 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com:
>
> Hi, Igor. You need to repair the PG:
>
> for i in `ceph pg dump | grep inconsistent | grep -v 'inconsistent+repair' | awk {'print $1'}`; do ceph pg repair $i; done
>
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
>
> 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com:
>
> Hi all! On our production cluster, due to heavy rebalancing ((( we have 2 pgs in an inconsistent state:
>
> root@temp:~# ceph health detail | grep inc
> HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
> pg 2.490 is active+clean+inconsistent, acting [56,15,29]
> pg 2.c4 is active+clean+inconsistent, acting [56,10,42]
>
> From the OSD logs, after a recovery attempt:
>
> root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i}; done
> dumped all in format plain
> instructing pg 2.490 on osd.56 to repair
> instructing pg 2.c4 on osd.56 to repair
>
> /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2
> /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2
> /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2
> /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2
> /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2
> /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2
> /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2
> /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2
> /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors
>
> So, how can I solve the "expected clone" situation by hand? Thanks in advance!

I've had an inconsistent pg once, but it was a different sort of error (some sort of digest mismatch, where the secondary object copies had later timestamps). This was fixed by moving the object away and restarting the osd, which got fixed when the osd peered, similar to what was mentioned in Sebastien Han's blog[1].
I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error could guide you better.

[1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/

--
Abhishek
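(For the manual-repair route referenced above, the usual first step is locating the object replicas on disk; a sketch with the OSD id and image prefix taken from the log lines above, and filestore paths assumed:)

    # on the node hosting osd.56: find the on-disk files of the object
    # named in the scrub error (underscores are escaped in filenames,
    # hence the loose wildcard)
    find /var/lib/ceph/osd/ceph-56/current/2.490_head/ \
         -name '*1631755377d7e*'
    # then: stop the osd, move the bad replica aside, restart the osd,
    # and re-run "ceph pg repair 2.490" as described in the blog post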
Re: [ceph-users] Memory-Usage
On Mon, Aug 17, 2015 at 8:21 PM, Patrik Plank pat...@plank.me wrote:

> Hi,
>
> I have a ceph cluster with three nodes and 32 osds. The three nodes have 16GB memory each, but only 5GB is in use. The nodes are Dell PowerEdge R510.
>
> My ceph.conf:
>
> [global]
> mon_initial_members = ceph01
> mon_host = 10.0.0.20,10.0.0.21,10.0.0.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> filestore_op_threads = 32
> public_network = 10.0.0.0/24
> cluster_network = 10.0.1.0/24
> osd_pool_default_size = 3
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 4096
> osd_pool_default_pgp_num = 4096
> osd_max_write_size = 200
> osd_map_cache_size = 1024
> osd_map_cache_bl_size = 128
> osd_recovery_op_priority = 1
> osd_max_recovery_max_active = 1
> osd_recovery_max_backfills = 1
> osd_op_threads = 32
> osd_disk_threads = 8
>
> Is that normal or a bottleneck?

Any memory not used by the OSD processes directly will be used by Linux for page caching. That's what we want to have happen! So it's not a problem that it's using only 5GB. Keep in mind that memory usage might spike dramatically if the OSDs need to deal with an outage, though - your normal-state usage ought to be lower than our recommended values for that reason.
-Greg

> best regards
> Patrik
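(A quick way to see this split on an OSD node; both numbers below are reclaimable caches, not daemon memory:)

    free -h              # "buffers/cache" is page cache, reclaimed under pressure
    slabtop -o | head    # kernel slab caches, also reclaimable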
Re: [ceph-users] any recommendation of using EnhanceIO?
I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt, if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable).

If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter, it will go straight to disk).

Flashcache doesn't respect barriers (or does it now?) - if that's ok for you then go for it; it should be stable, and I used it in the past in production without problems.

bcache seemed to work fine, but I needed to a) use it for root, b) disable and enable it on the fly (doh), c) make it non-persistent (flush it) before reboot - not sure if that was possible either, and d) do all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Jan

On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote:

> What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is that there is no need to change the device name, unlike bcache, flashcache, etc.
>
> Best regards,
> Alex
>
> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote:
>
> I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend.
> Daniel
>
> From: Dominik Zalewski dzalew...@optlink.net
> To: German Anders gand...@despegar.com
> Cc: ceph-users ceph-users@lists.ceph.com
> Sent: Wednesday, July 1, 2015 5:28:10 PM
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> Hi,
>
> I've asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST; even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time).
>
> Dominik
>
> On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote:
>
> Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share showing the difference between using it and not?
>
> Thanks in advance,
> German
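(For anyone evaluating bcache as suggested here, a minimal setup sketch; the device names are examples:)

    # format the backing device (HDD) and the cache device (SSD)
    make-bcache -B /dev/sdb
    make-bcache -C /dev/sdc
    # attach the cache set to the new bcache device by its UUID
    bcache-super-show /dev/sdc | grep cset.uuid
    echo <cset-uuid-from-above> > /sys/block/bcache0/bcache/attach
    echo writeback > /sys/block/bcache0/bcache/cache_mode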
Re: [ceph-users] any recommendation of using EnhanceIO?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
Sent: 18 August 2015 10:01
To: Alex Gorbachev a...@iss-integration.com
Cc: Dominik Zalewski dzalew...@optlink.net; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

> I already evaluated EnhanceIO in combination with CentOS 6 [...]
> [snip - quoted message trimmed; see Jan's mail above]
> bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Please note: it looks like the main (only?) dev of bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project.

> Jan
> [rest of quoted thread snipped]
Re: [ceph-users] How to improve single thread sequential reads?
Hi Nick,

did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is IMHO limited by the speed of a single disk. Data already present in the osd OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit. (20-node cluster, 12x 4TB 7.2k disks, 2 SSDs as journals for 6 OSDs each, LACP 2x10G bonds.)

rados bench performs equally badly single-threaded, but with its default multithreaded settings it generates wonderful numbers, usually limited only by line rate and/or interrupts/s. I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to your wonderful numbers, but it stays below 30MB/s.

I was thinking about using a software RAID0 like you did, but that's IMHO really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies.

Thx in advance
Benedikt

2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:

> Thanks for the replies guys. The client is set to 4MB; I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this. If you could set the RBD readahead much higher than the object size, then this would probably give the desired effect, where the buffer could be populated by reading from several OSDs in advance to give much higher performance. That, or wait for striping to appear in the kernel client.
>
> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
> Sent: 17 August 2015 03:36
> To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd? In case of librbd, try the different config options for rbd cache..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev
> Sent: Sunday, August 16, 2015 7:07 PM
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Hi Nick,
>
> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
> Sent: 13 August 2015 18:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] How to improve single thread sequential reads?
>
> Hi,
>
> I'm trying to use an RBD to act as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd.
> Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice. On a personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
>
> Some settings in our ceph.conf that may help:
>
> osd_op_threads = 20
> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> filestore_queue_max_ops = 9
> filestore_flusher = false
> filestore_max_sync_interval = 10
> filestore_sync_flush = false
>
> Regards,
> Alex
>
> Rbd-fuse seems to top out at 12MB/s, so there goes that option. I'm thinking mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy. Any suggestions?
>
> Thanks,
> Nick
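(A sketch of the multi-RBD RAID0 idea discussed in this thread; the image names and chunk size are assumptions - a 4096KB chunk matches the default 4MB object size:)

    # map several images and stripe across them
    for i in 0 1 2 3; do rbd map rbd/stage$i; done
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
          /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    mkfs.xfs /dev/md0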
Re: [ceph-users] Ceph File System ACL Support
On Mon, Aug 17, 2015 at 4:12 AM, Yan, Zheng uker...@gmail.com wrote:

> On Mon, Aug 17, 2015 at 9:38 AM, Eric Eastman eric.east...@keepertech.com wrote:
>
> Hi,
>
> I need to verify whether in Ceph v9.0.2 the kernel version of the Ceph file system supports ACLs while the libcephfs file system interface does not. I am trying to have Samba, version 4.3.0rc1, support Windows ACLs using "vfs objects = acl_xattr" together with the Samba VFS Ceph file system interface "vfs objects = ceph", and my tests are failing. If I use a kernel mount of the same Ceph file system, it works.
>
> Using the Samba Ceph VFS interface with logging set to 3 in my smb.conf shows the following error when, on my Windows AD server, I try to "Disable inheritance" on the Samba-exported directory uu/home:
>
> [2015/08/16 18:27:11.546307, 2] ../source3/smbd/posix_acls.c:3006(set_canon_ace_list)
> set_canon_ace_list: sys_acl_set_file type file failed for file uu/home (Operation not supported).
>
> This works using the same Ceph file system kernel mounted. It also works with an XFS file system. Doing some Googling, I found this entry on the Samba email list:
> https://lists.samba.org/archive/samba-technical/2015-March/106699.html
> It states: "libcephfs does not support ACL yet, so this patch adds ACL callbacks that do nothing."
>
> If ACL support is not in libcephfs, are there plans to add it? The Samba Ceph VFS interface without ACL support is severely limited in a multi-user Windows environment.
>
> libcephfs does not support ACL. I have an old patch that adds ACL support to samba's vfs ceph module, but I haven't tested it carefully.

Are these published somewhere? Even if you don't have time to work on it, somebody else might pick it up and finish things if it's available as a starting point. :)
-Greg
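(For context, a sketch of the kind of share definition being tested above; the share name and path are made up, and the ceph: options are assumptions based on the Samba vfs_ceph module:)

    [uu_home]
        path = /uu/home
        vfs objects = acl_xattr ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba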
Re: [ceph-users] Repair inconsistent pgs..
No, this does not help ((( I tried to find the data, but it looks like it either exists with the same timestamp on all osds, or is missing on all osds... So I need advice on what to do...

On Tuesday, 18 August 2015, Abhishek L wrote:

> Voloshanenko Igor writes:
> [snip - quoted thread trimmed; see Abhishek's reply and the full scrub-error listing earlier in this thread]
>
> I've had an inconsistent pg once, but it was a different sort of error (some sort of digest mismatch, where the secondary object copies had later timestamps).
Re: [ceph-users] tcmalloc use a lot of CPU
Hi Mark,

> Yep! At least from what I've seen so far, jemalloc is still a little faster for 4k random writes even compared to tcmalloc with the patch + 128MB thread cache. Should have some data soon (mostly just a reproduction of Sandisk and Intel's work).

I have definitively switched to jemalloc on my production ceph cluster; I was too tired of this tcmalloc problem (I have hit the bug once or twice, even with TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES).

On the client side, it could be great to run fio or rados bench with jemalloc too; I have seen around a 20% improvement vs glibc:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 fio ...

(For my production, I'm running qemu with jemalloc too now.)

Regards,
Alexandre

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 17 August 2015 16:24:16
Subject: Re: [ceph-users] tcmalloc use a lot of CPU

On 08/17/2015 07:03 AM, Alexandre DERUMIER wrote:

> Hi,
>
>> Is this phenomenon normal? Is there any idea about this problem?
>
> It's a known problem with tcmalloc (search the ceph mailing list). Starting the osd with the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M environment variable should help.

Note that this only works if you use a version of gperftools/tcmalloc newer than 2.1.

> Another way is to compile ceph with jemalloc instead of tcmalloc (./configure --with-jemalloc ...)

Yep! At least from what I've seen so far, jemalloc is still a little faster for 4k random writes even compared to tcmalloc with the patch + 128MB thread cache. Should have some data soon (mostly just a reproduction of Sandisk and Intel's work).

----- Original Message -----
From: YeYin ey...@qq.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 17 August 2015 11:58:26
Subject: [ceph-users] tcmalloc use a lot of CPU

Hi, all,

When I do a performance test with rados bench, I found tcmalloc consumed a lot of CPU:

Samples: 265K of event 'cycles', Event count (approx.): 104385445900
+ 27.58% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::FetchFromSpans()
+ 15.25% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long,
+ 12.20% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
+ 1.63% perf [.] append_chain
+ 1.39% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
+ 1.02% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
+ 0.85% libtcmalloc.so.4.1.0 [.] 0x00017e6f
+ 0.75% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
+ 0.67% libc-2.12.so [.] memcpy
+ 0.53% libtcmalloc.so.4.1.0 [.] operator delete(void*)

Ceph version:
# ceph --version
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)

Kernel version: 3.10.83

Is this phenomenon normal? Is there any idea about this problem?

Thanks.
Ye
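(For reference, a sketch of the workaround mentioned above; it only works with gperftools/tcmalloc newer than 2.1, and the daemon id is an example:)

    # start an OSD with a 128MB tcmalloc thread cache
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 \
        ceph-osd -i 0 --cluster ceph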
[ceph-users] Re: Question
Hi!

You can run mons on the same hosts, though it is not recommended. The MON daemon itself is not resource-hungry - 1-2 cores and 2-4GB RAM are enough in most small installs. But there are some pitfalls:

- MONs use LevelDB as a backing store and make wide use of direct writes to ensure DB consistency. So if a MON daemon coexists with OSDs not only on the same host, but on the same volume/disk/controller, it will severely reduce the disk io available to the OSDs, and thus greatly reduce overall performance. Moving the MONs' root to a separate spindle, or better, a separate SSD, will keep the MONs running fine alongside OSDs on the same host.

- When the cluster is in a healthy state, MONs are not resource-consuming, but when the cluster is in a changing state (adding/removing OSDs, backfilling, etc.) the CPU and memory usage of a MON can rise significantly.

And yes, in a small cluster it is not always possible to get 3 separate hosts for MONs only.

Megov Igor
CIO, Yuterra

From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Luis Periquito periqu...@gmail.com
Sent: 17 August 2015 17:09
To: Kris Vaes
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question

Yes. The issue is resource sharing, as usual: the MONs will use disk I/O, memory and CPU. If the cluster is small (test?) then there's no problem in using the same disks. If the cluster starts to get bigger you may want to dedicate resources (e.g. the disk for the MONs isn't used by an OSD). If the cluster is big enough you may want to dedicate a node to being a MON.

On Mon, Aug 17, 2015 at 2:56 PM, Kris Vaes k...@s3s.eu wrote:

> Hi,
>
> Maybe this seems like a strange question, but I could not find this info in the docs. For a ceph cluster you need osd daemons and monitor daemons. On a host you can run several osd daemons (best one per drive, as read in the docs). But now my question: can you run the monitor daemon on the same host where you already run some osd daemons? Is this possible, and what are the implications of doing this?
>
> Met Vriendelijke Groeten / Kind Regards
> Kris
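(A sketch of the separate-spindle advice above; the device and mount point are examples - the point is only that the MON's LevelDB store should not share a disk with OSD data or journals:)

    mkfs.xfs /dev/sdk1                   # small dedicated SSD partition
    mount /dev/sdk1 /var/lib/ceph/mon    # default mon data location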
[ceph-users] Fwd: Repair inconsistent pgs..
---------- Forwarded message ----------
From: Voloshanenko Igor igor.voloshane...@gmail.com
Date: Tuesday, 18 August 2015
Subject: Repair inconsistent pgs..
To: Irek Fasikhov malm...@gmail.com

Some additional information (Tnx Irek for the questions!)

Pool values:

root@test:~# ceph osd pool get cold-storage size
size: 3
root@test:~# ceph osd pool get cold-storage min_size
min_size: 2

Broken pgs dump, PG 1:

{
  state: active+clean+inconsistent,
  snap_trimq: [],
  epoch: 17541,
  up: [56, 10, 42],
  acting: [56, 10, 42],
  actingbackfill: [10, 42, 56],
  info: {
    pgid: 2.c4,
    last_update: 17541'29153,
    last_complete: 17541'29153,
    log_tail: 16746'26095,
    last_user_version: 401173,
    last_backfill: MAX,
    purged_snaps: [1~1,6~1,8~3,11~2,17~2,1f~2,25~1,28~1,2c~5,32~4,37~1,39~7,41~5,47~16,5e~19,cb~1,ce~2,d4~7,dc~1,de~1,e6~4,102~1,105~6,10d~1,119~1,150~1,15d~2,160~3,16d~1,16f~5,178~1,184~2,194~1,1a2~1,1a5~1,1ac~2,1c7~1,1cb~2,1ce~1],
    history: {
      epoch_created: 98,
      last_epoch_started: 17531,
      last_epoch_clean: 17541,
      last_epoch_split: 0,
      same_up_since: 17139,
      same_interval_since: 17530,
      same_primary_since: 17530,
      last_scrub: 17541'29114,
      last_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_deep_scrub: 17541'29114,
      last_deep_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_clean_scrub_stamp: 2015-08-05 17:23:45.251731
    },
    stats: {
      version: 17541'29153,
      reported_seq: 21552,
      reported_epoch: 17541,
      state: active+clean+inconsistent,
      last_fresh: 2015-08-18 07:48:37.667036,
      last_change: 2015-08-18 07:37:04.568541,
      last_active: 2015-08-18 07:48:37.667036,
      last_peered: 2015-08-18 07:48:37.667036,
      last_clean: 2015-08-18 07:48:37.667036,
      last_became_active: 0.00,
      last_became_peered: 0.00,
      last_unstale: 2015-08-18 07:48:37.667036,
      last_undegraded: 2015-08-18 07:48:37.667036,
      last_fullsized: 2015-08-18 07:48:37.667036,
      mapping_epoch: 17140,
      log_start: 16746'26095,
      ondisk_log_start: 16746'26095,
      created: 98,
      last_epoch_clean: 17541,
      parent: 0.0,
      parent_split_bits: 0,
      last_scrub: 17541'29114,
      last_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_deep_scrub: 17541'29114,
      last_deep_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_clean_scrub_stamp: 2015-08-05 17:23:45.251731,
      log_size: 3058,
      ondisk_log_size: 3058,
      stats_invalid: 0,
      stat_sum: {
        num_bytes: 2236608990,
        num_objects: 307,
        num_object_clones: 7,
        num_object_copies: 921,
        num_objects_missing_on_primary: 0,
        num_objects_degraded: 0,
        num_objects_misplaced: 0,
        num_objects_unfound: 0,
        num_objects_dirty: 307,
        num_whiteouts: 0,
        num_read: 15694,
        num_read_kb: 401354,
        num_write: 55720,
        num_write_kb: 2539827,
        num_scrub_errors: 1,
        num_shallow_scrub_errors: 1,
        num_deep_scrub_errors: 0,
        num_objects_recovered: 1842,
        num_bytes_recovered: 13419653940,
        num_keys_recovered: 36,
        num_objects_omap: 1,
        num_objects_hit_set_archive: 0,
        num_bytes_hit_set_archive: 0
      },
      up: [56, 10, 42],
      acting: [56, 10, 42],
      blocked_by: [],
      up_primary: 56,
      acting_primary: 56
    },
    empty: 0,
    dne: 0,
    incomplete: 0,
    last_epoch_started: 17531,
    hit_set_history: {
      current_last_update: 0'0,
      current_last_stamp: 0.00,
      current_info: { begin: 0.00, end: 0.00, version: 0'0 },
      history: []
    }
  },
  peer_info: [
    {
      peer: 10,
      pgid: 2.c4,
      last_update: 17541'29153,
      last_complete: 17541'29153,
      log_tail: 16746'25703,
      last_user_version: 400914,
      last_backfill: MAX,
      purged_snaps:
Re: [ceph-users] How repair 2 invalids pgs
Le 14/08/2015 15:48, Pierre BLONDEAU a écrit :

> Hi,
>
> Yesterday I removed 5 osds out of 15 from my cluster (machine migration). When I stopped the processes, I hadn't verified that all the pgs were in an active state. I removed the 5 osds from the cluster (ceph osd out osd.9; ceph osd crush rm osd.9; ceph auth del osd.9; ceph osd rm osd.9), and I checked after... and I had two inactive pgs. I have not formatted the filesystems of the osds.
>
> The health:
>
> pg 7.b is stuck inactive for 86083.236722, current state inactive, last acting [1,2]
> pg 7.136 is stuck inactive for 86098.214967, current state inactive, last acting [4,7]
>
> The recovery state:
>
> recovery_state: [
>   { name: Started\/Primary\/Peering\/WaitActingChange,
>     enter_time: 2015-08-13 15:19:49.559965,
>     comment: waiting for pg acting set to change },
>   { name: Started,
>     enter_time: 2015-08-13 15:19:46.492625 }],
>
> How can I solve my problem? Can I re-add the osds from their filesystems? My cluster is used for rbd images and a little cephfs share. I can read all files in cephfs, and I checked all the images to verify whether they use these pgs; I didn't find anything, but I'm not sure of my script. How do you know if a pg is used?
>
> Regards

Hello,

The names of the pgs start with "7.", so they are used by the pool with id 7? For me, that is cephfs_meta (the cephfs metadata pool). I get no response when I run "rados -p cephfs_meta ls". Since it's a small share, it's not serious; I can restore it easily.

So I added the new OSDs of the new machine, and it solved the problem, but I don't understand why. So, if someone has an idea?

Regards

PS: I use 0.80.10 on wheezy

--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique
tel : 02 31 56 75 42
bureau : Campus 2, Science 3, 406
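(On Pierre's question of how to tell which pool a pg belongs to: the number before the dot in a pgid is the pool id, so a quick check looks like this, with the pool name taken from the mail above:)

    ceph osd lspools           # lists pool ids and names, e.g. "7 cephfs_meta"
    ceph pg map 7.b            # shows the up/acting osd sets for that pg
    rados -p cephfs_meta ls    # lists the objects stored in the pool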
Re: [ceph-users] Repair inconsistent pgs..
No. This will no help ((( I try to found data, but it's look exist with same time stamp on all osd or missing on all osd ... So, need advice , what I need to do... вторник, 18 августа 2015 г. пользователь Abhishek L написал: Voloshanenko Igor writes: Hi Irek, Please read careful ))) You proposal was the first, i try to do... That's why i asked about help... ( 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com javascript:;: Hi, Igor. You need to repair the PG. for i in `ceph pg dump| grep inconsistent | grep -v 'inconsistent+repair' | awk {'print$1'}`;do ceph pg repair $i;done С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com javascript:;: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! I've had an inconsistent pg once, but it was a different sort of an error (some sort of digest mismatch, where the secondary object copies had later timestamps). 
This was fixed by moving the object away and restarting the OSD; it got fixed when the OSD peered, similar to what was mentioned in Sébastien Han's blog[1]. I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error could guide you better. [1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ -- Abhishek ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
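For the archive, the move-the-object-away procedure from [1] looks roughly like this. This is only a sketch: the pg/osd ids come from the log output above, the filestore path and init commands are assumptions for a default Firefly/Hammer install, and whether this is safe for the "expected clone" case is exactly the open question in this thread:

  # 1. locate the suspect object on each OSD in the acting set [56,15,29]
  find /var/lib/ceph/osd/ceph-56/current/2.490_head/ -name '*1631755377d7e*' -ls
  # 2. on the replica that disagrees, stop the OSD and flush its journal
  service ceph stop osd.56          # or: stop ceph-osd id=56, depending on init system
  ceph-osd -i 56 --flush-journal
  # 3. move the bad copy out of the pg directory (placeholder filename)
  mv /var/lib/ceph/osd/ceph-56/current/2.490_head/<bad-object-file> /root/backup/
  # 4. restart the OSD and trigger the repair
  service ceph start osd.56
  ceph pg repair 2.490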
Re: [ceph-users] Stuck creating pg
1) No errors at all. At log level 20 the OSD does not say anything about the missing placement group. 2) I tried that. Several times actually, also for the secondary OSDs, but it does not work. gr, Bart On Tue, Aug 18, 2015 at 4:28 AM minchen minche...@outlook.com wrote: osd.19 is blocked by pg creating and 19 client ops, 1. check osd.19's log to see if there are any errors 2. if not, out 19 from the osdmap to remap pg 5.6c7: ceph osd out 19 // this will cause data migration I am not sure whether this will help you! -- Original -- *From: * Bart Vanbrabant;b...@vanbrabant.eu; *Date: * Mon, Aug 17, 2015 10:14 PM *To: * minchenminche...@outlook.com; ceph-users ceph-users@lists.ceph.com; *Subject: * Re: [ceph-users] Stuck creating pg 1) ~# ceph pg 5.6c7 query Error ENOENT: i don't have pgid 5.6c7 In the osd log: 2015-08-17 16:11:45.185363 7f311be40700 0 osd.19 64706 do_command r=-2 i don't have pgid 5.6c7 2015-08-17 16:11:45.185380 7f311be40700 0 log_channel(cluster) log [INF] : i don't have pgid 5.6c7 2) I do not see anything wrong with this rule: { rule_id: 0, rule_name: data, ruleset: 0, type: 1, min_size: 1, max_size: 10, steps: [ { op: take, item: -1, item_name: default }, { op: chooseleaf_firstn, num: 0, type: host }, { op: emit } ] }, 3) I rebooted all machines in the cluster and increased the replication level of the affected pool to 3, to be more sure. After recovery from this reboot we are in the following state: HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; 103 requests are blocked 32 sec; 2 osds have slow requests; pool volumes pg_num 2048 pgp_num 1400 pg 5.6c7 is stuck inactive since forever, current state creating, last acting [19,25,17] pg 5.6c7 is stuck unclean since forever, current state creating, last acting [19,25,17] 103 ops are blocked 524.288 sec 19 ops are blocked 524.288 sec on osd.19 84 ops are blocked 524.288 sec on osd.25 2 osds have slow requests pool volumes pg_num 2048 pgp_num 1400 Thanks, Bart On 08/17/2015 03:44 PM, minchen wrote: It looks like the crush rule doesn't work properly after the osdmap changed; there are 3 unclean pgs: 5.6c7 5.2c7 15.2bd I think you can try the following to help locate the problem: 1st, run ceph pg <pgid> query to look up the details of the pg state, e.g. which osd is it blocked by? 2nd, check the crush rule with ceph osd crush rule dump and check the crush_ruleset for pools 5 and 15, e.g. chooseleaf may not be choosing the right osd? minchen -- Original -- *From: * Bart Vanbrabant;b...@vanbrabant.eu; *Date: * Sun, Aug 16, 2015 07:27 PM *To: * ceph-users ceph-users@lists.ceph.com; *Subject: * [ceph-users] Stuck creating pg Hi, I have a ceph cluster with 26 OSDs in 4 hosts, only used for RBD for an OpenStack cluster (started at 0.48 I think), currently running 0.94.2 on Ubuntu 14.04. A few days ago one of the OSDs was at 85% disk usage while only 30% of the raw disk space was used. I ran reweight-by-utilization with 150 as the cutoff level. This reshuffled the data. I also noticed that the number of PGs was still at the level from when there were fewer disks in the cluster (1300). Based on the current guidelines I increased pg_num to 2048. It created the placement groups except for the last one. To try to force the creation of the pg I removed the OSDs (ceph osd out) assigned to that pg, but that made no difference.
Currently all OSD's are back in and two pg's are also stuck in an unclean state: ceph health detail: HEALTH_WARN 2 pgs degraded; 2 pgs stale; 2 pgs stuck degraded; 1 pgs stuck inactive; 2 pgs stuck stale; 3 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; 59 requests are blocked 32 sec; 3 osds have slow requests; recovery 221/549658 objects degraded (0.040%); recovery 221/549658 objects misplaced (0.040%); pool volumes pg_num 2048 pgp_num 1400 pg 5.6c7 is stuck inactive since forever, current state creating, last acting [19,25] pg 5.6c7 is stuck unclean since forever, current state creating, last acting [19,25] pg 5.2c7 is stuck unclean for 313513.609864, current state stale+active+undersized+degraded+remapped, last acting [9] pg 15.2bd is stuck unclean for 313513.610368, current state stale+active+undersized+degraded+remapped, last acting [9] pg 5.2c7 is stuck undersized for 308381.750768, current state stale+active+undersized+degraded+remapped, last acting [9] pg 15.2bd is stuck undersized for 308381.751913, current state stale+active+undersized+degraded+remapped, last acting [9] pg 5.2c7 is stuck degraded for 308381.750876, current state
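One more hammer-era command that is sometimes suggested for a pg that never leaves the creating state; a hedged sketch only, to be used with care and ideally after advice from the list:

  # re-issue creation of the stuck pg
  ceph pg force_create_pg 5.6c7
  # then watch whether it peers instead of sitting in "creating"
  ceph -w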
[ceph-users] radosgw-agent keeps syncing most active bucket - ignoring others
Hi, from the radosgw-agent docs and some threads on this list, I understood that the max-entries argument was there to prevent a very active bucket from keeping the other buckets from being synced. In our agent logs, however, we saw a lot of bucket instance bla has 1000 entries after bla messages, and the agent kept on syncing that active bucket. Looking at the code, in class DataWorkerIncremental, it looks like the agent loops fetching log entries from the bucket until it receives fewer entries than max_entries. Is this intended behaviour? I would expect it to just pass the max_entries log entries for processing and advance the marker. Is there any other way to make sure less active buckets are synced frequently? We've tried increasing num-workers, but this only affects the first pass. Thanks, Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: Emmanuel Florac [mailto:eflo...@intellique.com] Sent: 18 August 2015 12:26 To: Nick Fisk n...@fisk.me.uk Cc: 'Jan Schermer' j...@schermer.cz; 'Alex Gorbachev' ag@iss- integration.com; 'Dominik Zalewski' dzalew...@optlink.net; ceph- us...@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Le Tue, 18 Aug 2015 10:12:59 +0100 Nick Fisk n...@fisk.me.uk écrivait: Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Please note, it looks like the main(only?) dev of Bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project. bcachefs is more than a new version of bcache, it's a complete POSIX filesystem with integrated caching. Looks like a silly idea if you ask me (because we already have several excellent filesystems; because developing a reliable filesystem is DAMN HARD; because building a feature-complete FS is CRAZY HARD; because FTL sucks anyway; etc). Agreed, it's such a shame that there isn't a simple, reliable and maintained caching solution out there for Linux. When I started seeing all these projects spring up 5-6 years ago I was full of optimism, but we still don't have anything I would call fully usable. -- Emmanuel Florac | Direction technique | Intellique | eflo...@intellique.com | +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a Battery backed raid controller caching all writes. It's amazing the 100x performance increase you get with RBD's when doing sync writes and give it something like just 1GB write back cache with flashcache. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-) Jan On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. 
Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list
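For reference, the 1GB flashcache writeback setup Nick describes above would be created roughly like this; the device names are assumptions, and remember the barrier caveat raised elsewhere in this thread:

  # 1GB writeback cache in front of an RBD device; appears as /dev/mapper/rbd_wb
  flashcache_create -p back -s 1g rbd_wb /dev/sdb1 /dev/rbd0
  mkfs.xfs /dev/mapper/rbd_wb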
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 10:12:59 +0100, Nick Fisk n...@fisk.me.uk wrote: Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Please note, it looks like the main (only?) dev of Bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project. bcachefs is more than a new version of bcache, it's a complete POSIX filesystem with integrated caching. Looks like a silly idea if you ask me (because we already have several excellent filesystems; because developing a reliable filesystem is DAMN HARD; because building a feature-complete FS is CRAZY HARD; because FTL sucks anyway; etc). -- Emmanuel Florac | Direction technique | Intellique | eflo...@intellique.com | +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Rename Ceph cluster
Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rename Ceph cluster
I think it's pretty clear: http://ceph.com/docs/master/install/manual-deployment/ For example, when you run multiple clusters in a federated architecture, the cluster name (e.g., us-west, us-east) identifies the cluster for the current CLI session. Note: To identify the cluster name on the command line interface, specify a Ceph configuration file with the cluster name (e.g., ceph.conf, us-west.conf, us-east.conf, etc.). Also see CLI usage (ceph --cluster {cluster-name}). But it could be tricky on the OSDs that are running, depending on the distribution initscripts - you could find out that you can't service ceph stop osd... anymore after the change, since it can't find its pidfile anymore. Looking at the CentOS initscript it looks like it accepts a -c conffile argument though. (So you should be managing OSDs with -c ceph-prod.conf now?) Jan On 18 Aug 2015, at 14:13, Erik McCormick emccorm...@cirrusseven.com wrote: I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
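In shell terms the rename boils down to the config (and keyring) file names, since those follow the cluster name; a sketch, with paths assumed from the usual $cluster naming conventions:

  # adopt the default name on a client: just rename the files
  cp /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
  cp /etc/ceph/ceph-prod.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
  # or keep the custom name and say so explicitly
  ceph --cluster ceph-prod status
  ceph -c /etc/ceph/ceph-prod.conf -k /etc/ceph/ceph-prod.client.admin.keyring status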
Re: [ceph-users] Repair inconsistent pgs..
From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rename Ceph cluster
On 18-08-15 14:13, Erik McCormick wrote: I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. To add to it: internally a Ceph cluster ONLY uses the fsid, which you can find in the OSDMap and on all the data dirs of the OSDs. The cluster name is indeed nothing more than a reference to a specific configuration file. Wido -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
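A quick way to verify that on a live cluster; the OSD data path below assumes the default layout, where the directory name embeds the cluster name:

  # the fsid is what identifies the cluster internally
  ceph --cluster ceph-prod fsid
  # the same fsid is stamped on every OSD data dir
  cat /var/lib/ceph/osd/ceph-prod-0/fsid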
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? 
Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw-agent keeps syncing most active bucket - ignoring others
Hmm, looks like intended behaviour: SNIP CommitDate: Mon Mar 3 06:08:42 2014 -0800 worker: process all bucket instance log entries at once Currently if there are more than max_entries in a single bucket instance's log, only max_entries of those will be processed, and the bucket instance will not be examined again until it is modified again. To keep it simple, get the entire log of entries to be updated and process them all at once. This means one busy shard may block others from syncing, but multiple instances of radosgw-agent can be run to circumvent that issue. With only one instance, users can be sure everything is synced when an incremental sync completes with no errors. /SNIP However, this brings us to a new issue. After starting a second agent, one of the agents was busy syncing the busy shard and the other agent correctly synced all of the other buckets. So far, so good. But, since a few of those buckets are almost static, it looks like it started syncing them in a second run from the beginning all over again. As versioning was enabled on those buckets after they were created, with existing and removed objects already in there, it seems like the agent is copying those unversioned objects to versioned ones, creating a lot of delete markers and multiple versions in the secondary zone. Does anyone have an idea how to handle this correctly? I already did a cleanup some weeks ago, but if the agent is going to keep restarting the sync from the beginning, I'll have to clean up every time. regards, Sam On 18-08-15 09:36, Sam Wouters wrote: Hi, from the radosgw-agent docs and some threads on this list, I understood that the max-entries argument was there to prevent a very active bucket from keeping the other buckets from being synced. In our agent logs, however, we saw a lot of bucket instance bla has 1000 entries after bla messages, and the agent kept on syncing that active bucket. Looking at the code, in class DataWorkerIncremental, it looks like the agent loops fetching log entries from the bucket until it receives fewer entries than max_entries. Is this intended behaviour? I would expect it to just pass the max_entries log entries for processing and advance the marker. Is there any other way to make sure less active buckets are synced frequently? We've tried increasing num-workers, but this only affects the first pass. Thanks, Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
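Running the second agent mentioned above is just launching another copy against the same configuration; the conf path and invocation here are assumptions (max-entries and num-workers are the options referred to in this thread):

  # two instances; shard locking lets them make progress on different buckets
  radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --max-entries 1000 &
  radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --max-entries 1000 &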
Re: [ceph-users] any recommendation of using EnhanceIO?
We've been using an extra caching layer for Ceph since the beginning for our older Ceph deployments. All new deployments go with full SSDs. So far I've tested: - EnhanceIO - Flashcache - Bcache - dm-cache - dm-writeboost The best working solution was and is bcache, except for its buggy code. The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g. discards result in a crashed FS after reboots, and so on. But it's still the fastest for Ceph. The 2nd best solution, which we already use in production, is dm-writeboost (https://github.com/akiradeveloper/dm-writeboost). Everything else is too slow. Stefan On 18.08.2015 at 13:33, Mark Nelson wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST.
Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
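For anyone wanting to reproduce the bcache numbers, the basic setup is sketched below; the device names are assumptions, the cache-set UUID comes from bcache-super-show or ls /sys/fs/bcache, and per the warning above test discards carefully on your kernel:

  # format cache and backing devices (bcache-tools)
  make-bcache -C /dev/nvme0n1
  make-bcache -B /dev/sdb
  # udev normally registers them automatically; manually:
  echo /dev/sdb > /sys/fs/bcache/register
  # attach the backing device to the cache set and enable writeback
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode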
Re: [ceph-users] How to improve single thread sequential reads?
Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq and I've never used the RBD device directly (the kernel client). Does it show in /sys/block ? Can you dump find /sys/block/$rbd in here? 
Jan On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net wrote: Hi Nick, did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. single-threaded-writes are quite speedy, around 600MB/s. radosgw for cold data is around the 90MB/s, which is imho limitted by the speed of a single disk. Data already present on the osd-os-buffers arrive with around 400-700MB/s so I don't think the network is the culprit. (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds each, lacp 2x10g bonds) rados bench single-threaded performs equally bad, but with its default multithreaded settings it generates wonderful numbers, usually only limiited by linerate and/or interrupts/s. I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to get to your wonderful numbers, but it's staying below 30 MB/s. I was thinking about using a software raid0 like you did but that's imho really ugly. When I know I needed something speedy, I usually just started dd-ing the file to /dev/null and wait for about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies. Thx in advance Benedikt 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk: Thanks for the replies guys. The client is set to 4MB, I haven't played with the OSD side yet as I wasn't sure if it would make much difference, but
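To answer the /sys/block question directly: yes, the krbd device shows up there, and the knobs discussed in this thread live under its queue directory. A sketch, using rbd0 and the 4MB readahead value mentioned above:

  ls /sys/block/rbd0/queue/
  # max_hw_sectors_kb equals the object size, per Nick's observation
  cat /sys/block/rbd0/queue/max_hw_sectors_kb
  # bump readahead to 4MB
  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb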
Re: [ceph-users] How to improve single thread sequential reads?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:41 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to improve single thread sequential reads? Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be a lot more about enhancing the overall block layer performance in Linux https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec hanism_(blk-mq) So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... 
I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? Yes striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find work arounds to boost performance. Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSD's. This is much the same way as sequential access performance varies with a RAID array. If your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDS. It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq and I've never used the RBD device directly (the kernel client). Does it show in /sys/block ? Can you dump find /sys/block/$rbd in here? Jan On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net wrote: Hi Nick, did you do anything fancy to get to ~90MB/s in the first place? I'm stuck
Re: [ceph-users] How to improve single thread sequential reads?
On 18 Aug 2015, at 13:58, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:41 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to improve single thread sequential reads? Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be a lot more about enhancing the overall block layer performance in Linux https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec hanism_(blk-mq) So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. 
Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? Yes striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find work arounds to boost performance. Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSD's. This is much the same way as sequential access performance varies with a RAID array. If your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDS. Hmmm... RBD - PG - objects: stripe_unit (more commonly called stride) bytes are put into stripe_count objects - not OSDs, but it's possible you'll hit all OSDs with a small enough stride and a large enough stripe_count... I have no idea how well that works in practice on current Ceph releases, my Dumpling experience is probably useless here. So we're back at striping with mdraid I guess ... :) It should run nicely with a real workload
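For completeness, a striped image can already be created with librbd even though the kernel client cannot map it yet; the values below are illustrative only:

  # format 2 image: 4MB objects (order 22), 1MB stripe unit, spread over 16 objects
  rbd create rbd/striped-vol --image-format 2 --size 102400 --order 22 --stripe-unit 1048576 --stripe-count 16
  rbd info rbd/striped-vol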
Re: [ceph-users] Rename Ceph cluster
I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
Yes, writeback mode. I didn't try anything else. Jan On 18 Aug 2015, at 18:30, Alex Gorbachev a...@iss-integration.com wrote: HI Jan, On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer j...@schermer.cz wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). Out of curiosity, were you using EnhanceIO in writeback mode? I assume so, as a read cache should not hurt anything. Thanks, Alex If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? 
Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 17:13 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? On 18 Aug 2015, at 16:44, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing what a 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of writeback cache with flashcache. For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail. Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies < 1ms. I'm still testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using openstack and other direct RBD interfaces it may not be such an attractive option. I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSDs, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with Flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high iops/low latency stuff. Additional daemon that is persistent how? Isn't that what the journal does already, just too slowly? The journal is part of an OSD, and its speed is restricted by a lot of the functionality that Ceph has to provide. I was more thinking of a very lightweight service that acts as an interface between an SSD and librbd and is focussed on speed. For something like a standalone SQL server it might run on the SQL server with a local SSD, but in other scenarios you might have this service remote, where the SSDs are installed. HA for the SSD could be provided by RAID+dual port SAS, or maybe some sort of lightweight replication could be built into the service. This was just a random thought rather than something I have planned out, though. I think the best (and easiest!)
approach is to mimic what a monolithic SAN does. Currently: 1) client issues blocking/atomic/sync IO 2) rbd client sends this IO to all OSDs 3) after all OSDs process the IO, the IO is finished and considered persistent. That has serious implications: * every IO is processed separately, not much coalescing * OSD processes add latency when processing this IO * one OSD can be slow momentarily, IO backs up and the cluster stalls. Let me just select what processing the IO means with respect to my architecture and I can likely get a 100x improvement. Let me choose: 1) WHERE the IO is persisted. Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient? Not waiting for one slow OSD gives me at least some SLA for planned tasks like backfilling, scrubbing, deep-scrubbing. Hands up who can afford to leave deep-scrub enabled in production... In my testing the difference between 2 and 3 replicas wasn't that much, as once the primary OSD sends out the replicas they happen more or less in parallel. 2) WHEN the IO is persisted. Do I really need all OSDs to flush the data to disk? If all the nodes are in the same cabinet and on the same UPS then this makes sense. But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... When nukes start
Re: [ceph-users] any recommendation of using EnhanceIO?
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
Mark, my take is definitely write latency. Based on this discussion, there is no real safe solution for write caching outside Ceph.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
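For anyone wanting to put a number on that write latency before and after a change, a minimal fio sketch (the pool and image names here are made up, and this assumes a fio build with the rbd engine):

  fio --name=write-latency --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=test --rw=write --bs=4k \
      --iodepth=1 --direct=1 --runtime=60 --time_based

At queue depth 1 the completion-latency (clat) percentiles fio reports are effectively the per-write round trip to the cluster, which is the figure being discussed in this thread.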
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
Mark
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
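For option 4, a rough sketch of carving out a separate SSD pool on a hammer-era cluster (bucket, pool and rule names are invented here, and the CRUSH rule itself still has to be written by hand in the decompiled map):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt: add an "ssd" root containing only the SSD OSDs,
  # plus a rule that takes from that root
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  ceph osd pool create ssd-pool 128 128
  ceph osd pool set ssd-pool crush_ruleset 1   # rule id from crush.txt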
Re: [ceph-users] any recommendation of using EnhanceIO?
1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client-side cache fails)?
2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an EC base tier though (but it would be a good idea!) -Sam
On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :) Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost-effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs. I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
1. Improve small IO write latency. As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of some sort of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
2. Cache Tiering. I know a lot of work is going into this currently, but I will cover my
[ceph-users] ceph-osd suddenly dies and no longer can be started
Hello. I have a small Ceph cluster running 9 OSDs, using XFS on separate disks and dedicated partitions on the system disk for journals. After creation it worked fine for a while. Then suddenly one of the OSDs stopped and wouldn't start. I had to recreate it. Recovery started. After a few days of recovery an OSD on another machine also stopped. I try to start it, it runs for a few minutes and dies; it looks like it is not able to replay the journal. According to strace, it tries to allocate too much memory and stops with ENOMEM. Sometimes it is killed by the kernel's OOM killer. I tried flushing the journal manually with `ceph-osd -i 3 --flush-journal`, but it didn't work either. The error log is as follows:
[root@assets-2 ~]# ceph-osd -i 3 --flush-journal
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2015-08-18 23:00:37.956714 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 225eff8c/default.4323.18_22783306dc51892b40b49e3e26f79baf_55c38b33172600566c46_s.jpeg/head//8 in index: (2) No such file or directory
2015-08-18 23:00:37.956741 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 235eff8c/default.4323.16_3018ff7c6066bddc0c867b293724d7b1_dolar7_106_m.jpg/head//8 in index: (2) No such file or directory
skipped
2015-08-18 23:00:37.958424 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find c//head//8 in index: (2) No such file or directory
tcmalloc: large alloc 1073741824 bytes == 0x66b1 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 2147483648 bytes == 0xbf49 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 4294967296 bytes == 0x16e32 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 8589934592 bytes == (nil) @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
*** Caught signal (Aborted) ** in thread 7ff102040880
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: ceph-osd() [0xac5642]
2: (()+0xf130) [0x7ff1009d4130]
3: (gsignal()+0x37) [0x7ff0ff3ee5d7]
4: (abort()+0x148) [0x7ff0ff3efcc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ff0ffcf29b5]
6: (()+0x5e926) [0x7ff0ffcf0926]
7: (()+0x5e953) [0x7ff0ffcf0953]
8: (()+0x5eb73) [0x7ff0ffcf0b73]
9: (()+0x15d3e) [0x7ff10115ad3e]
10: (tc_new()+0x1e0) [0x7ff10117ade0]
11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7ff0ffd4fc29]
12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7ff0ffd5086b]
13: (std::string::reserve(unsigned long)+0x44) [0x7ff0ffd50914]
14: (std::string::append(char const*, unsigned long)+0x4f) [0x7ff0ffd50b7f]
15: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0xdf) [0x968a0f]
16: (DBObjectMap::clear_header(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xd3) [0xa572b3]
17: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0xa5c6b1]
18: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x202) [0xa5f762]
19: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x16a) [0x9018ba]
20: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x9e) [0x90238e]
21: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x252c) [0x911b2c]
22: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x915064]
23: (JournalingObjectStore::journal_replay(unsigned long)+0x5db) [0x92d7cb]
24: (FileStore::mount()+0x3730) [0x8ff890]
25: (main()+0xec9) [0x642239]
26: (__libc_start_main()+0xf5) [0x7ff0ff3daaf5]
27: ceph-osd() [0x65cdc9]
2015-08-18 23:02:38.167194 7ff102040880 -1 *** Caught signal (Aborted) ** in thread 7ff102040880
I can recreate the filesystem on this OSD's disk and recreate the OSD, but I'm not sure that this won't happen with another OSD on this or another machine, and that eventually I won't lose all my data because it doesn't
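The trace shows journal replay (frame 23) building a single huge LevelDB transaction while clearing an object's omap, which is where the memory goes. A few things worth checking before recreating the OSD - a diagnostic sketch only, assuming the default filestore layout:

  # how big is the journal that replay has to chew through?
  ls -lh /var/lib/ceph/osd/ceph-3/journal
  # how big is the omap leveldb the replayed transactions rewrite?
  du -sh /var/lib/ceph/osd/ceph-3/current/omap
  # watch peak memory during the replay attempt
  /usr/bin/time -v ceph-osd -i 3 --flush-journal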
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :) Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost-effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs. I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
1. Improve small IO write latency. As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of some sort of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
2. Cache Tiering. I know a lot of work is going into this currently, but I will cover my experience.
2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor, as it has a negative effect on the cache tier contents as well.
2B) Erasure coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure-coded pools if the IO size equals the object size. A lot of backup applications can be configured to write out in statically sized blocks and would be an ideal candidate for this sort of
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote: [mega snip]
4. Disk-based OSD with SSD journal performance. As I touched on earlier, I would expect a disk-based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync and potentially other things slow this down.
Has anybody tried symlinking the omap directory to an SSD and tested if that makes a (significant) difference?
Christian
-- Christian Balzer / Network/Systems Engineer / ch...@gol.com / Global OnLine Japan/Fusion Communications http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
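Mechanically the experiment is simple - something along these lines (an untested sketch: default filestore paths, a made-up SSD mount point, and the OSD must be stopped first):

  service ceph stop osd.3
  mv /var/lib/ceph/osd/ceph-3/current/omap /mnt/ssd/osd-3-omap
  ln -s /mnt/ssd/osd-3-omap /var/lib/ceph/osd/ceph-3/current/omap
  service ceph start osd.3

The obvious caveat is that the OSD's omap then lives outside the OSD disk, so losing the SSD loses the OSD.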
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 12:50:38 -0500 Mark Nelson wrote: [snap]
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
Well, the "everything" answer really is the one everybody who runs VMs backed by RBD for internal or external customers will give. I.e. no idea what is installed and no control over how it accesses the Ceph cluster. And even when you think you have a predictable use case it might not be true. As in, one of our Ceph installs backs a ganeti cluster with hundreds of VMs running 2 types of applications, and from past experience I know their I/O patterns (nearly 100% write-only; any reads can usually be satisfied from the local or storage node pagecache). Thus the Ceph cluster was configured in a way that was optimized for this, and it worked beautifully until: a) scrubs became too heavy (generating too many read IOPS while also invalidating page caches) and b) somebody thought a 3rd type of VM, using Windows, with IOPS equal to dozens of the other types, would be a good idea.
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
All of these except S3 would have a positive impact on my various use cases. However, at the risk of sounding like a broken record, any time spent on these improvements before Ceph can recover from a scrub error fully autonomously (read: checksums) would be a waste in my book. All the speed in the world is pretty insignificant when a simple `ceph pg repair` (which is still in the Ceph docs w/o any qualification of what it actually does) has a good chance of wiping out good data by imposing "the primary OSD's view of the world on the replicas", to quote Greg.
Regards, Christian
-- Christian Balzer / Network/Systems Engineer / ch...@gol.com / Global OnLine Japan/Fusion Communications http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
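For context on that repair point: the concern is that `ceph pg repair` copies the primary's version over the replicas without knowing which copy is actually good. A cautious sequence might look like this (a sketch - the PG id is invented, the paths assume filestore, and the object name is a placeholder):

  ceph health detail | grep inconsistent      # find the PG, e.g. 2.1a
  # compare the object's copies on each replica by hand first
  md5sum /var/lib/ceph/osd/ceph-*/current/2.1a_head/<object>*
  # only once you're satisfied the primary's copy is the good one:
  ceph pg repair 2.1a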
[ceph-users] [Cache-tier] librbd: error finding source object: (2) No such file or directory
Hi everyone, I have been using a cache tier on a data pool. After a long time, a lot of RBD images are no longer displayed by `rbd -p data ls`, although those images still show through the `rbd info` and `rados ls` commands:
rbd -p data info volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
rbd image 'volume-008ae4f7-3464-40c0-80b0-51140d8b95a8': size 128 GB in 32768 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.10c1c102eb141f2 format: 2 features: layering flags:
And:
rados -p data ls | grep 10c1c102eb141f2 # grep through block_name_prefix
=> shows: rbd_header.10c1c102eb141f2
Or:
rados -p data ls | grep volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
=> shows: rbd_id.volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
Everything seems normal. But I tried to move (rename) the above image, and received the following error:
#rbd mv data/volume-008ae4f7-3464-40c0-80b0-51140d8b95a8 data/volume-008ae4f7-3464-40c0-80b0-51140d8b95a8_new
rbd: rename error: (2) No such file or directory
2015-08-19 10:46:07.175525 7fb8b0985840 -1 librbd: error finding source object: (2) No such file or directory
=> the rename spawned a new RBD and didn't delete the original.
And when deleting the image:
deleting data/volume-32e1fa85-2e03-4cbe-be36-09358aa6e7f4
Removing all snapshots: 100% complete...done.
Removing image: 99% complete...failed.
rbd: delete error: (2) No such file or directory
2015-08-19 11:27:17.904695 7f9c32217840 -1 librbd: error removing img from new-style directory: (2) No such file or directory
What happened to those RBDs, and how can I fix this error? Thanks so much!
-- Tuan - HaNoi, VietNam
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
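Worth noting for anyone hitting the same errors: `rbd ls` and renames for format-2 images go through the omap of the pool's rbd_directory object, so the symptoms above point at a missing or stale rbd_directory entry rather than at the image data itself. A hedged way to inspect it (the cache pool name is a placeholder):

  # does the directory object still know about the image?
  rados -p data listomapvals rbd_directory | grep -a volume-008ae4f7
  # with a cache tier, also look at the cached copy of the same object
  rados -p <cache-pool> listomapvals rbd_directory | grep -a volume-008ae4f7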
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Sam,
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel Just Sent: 18 August 2015 21:38 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client-side cache fails)?
I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs) or for the cache to allow itself to be configured to mirror across two SSDs. Of course single SSDs can be used if the user accepts the risk. If the cache did the mirroring then you could do fancy stuff like mirror the writes but leave the read cache blocks as single copies, to increase the cache capacity. In either case, although an outage is undesirable, it's only data loss that would be unacceptable, which would hopefully be avoided by the mirroring. As part of this, there would need to be a way to make sure a dirty RBD can't be accessed unless the corresponding cache is also attached. I guess as it is caching the RBD and not the pool or entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that is on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers...etc. I hope I've answered your question?
2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an EC base tier though (but it would be a good idea!)
Excellent news, I shall look forward to testing it in the future. I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.
-Sam On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is
Re: [ceph-users] any recommendation of using EnhanceIO?
Hey Stefan, Are you using your Ceph cluster for virtualization storage? Is dm-writeboost configured on the OSD nodes themselves?
- Original Message - From: Stefan Priebe - Profihost AG s.pri...@profihost.ag To: Mark Nelson mnel...@redhat.com, ceph-users@lists.ceph.com Sent: Tuesday, August 18, 2015 7:36:10 AM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
We've been using an extra caching layer for Ceph since the beginning for our older Ceph deployments. All new deployments go with full SSDs. I've tested so far: - EnhanceIO - Flashcache - Bcache - dm-cache - dm-writeboost
The best working solution was and is bcache, except for its buggy code. The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g. discards result in crashed filesystems after reboots and so on. But it's still the fastest for Ceph. The 2nd best solution, which we already use in production, is dm-writeboost (https://github.com/akiradeveloper/dm-writeboost). Everything else is too slow.
Stefan
Am 18.08.2015 um 13:33 schrieb Mark Nelson: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
Hi, I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time). Dominik
On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not? Thanks in advance, German
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
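Since bcache keeps coming up as the fastest option, a minimal setup sketch for anyone who wants to try it (device names are made up, bcache-tools is required, and make-bcache wipes the named devices):

  make-bcache -B /dev/sdb          # backing device (e.g. the OSD disk)
  make-bcache -C /dev/nvme0n1      # caching device (the SSD)
  echo /dev/sdb > /sys/fs/bcache/register
  echo /dev/nvme0n1 > /sys/fs/bcache/register
  # attach the backing device to the cache set by its UUID, then
  # switch to writeback - the mode discussed in this thread
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode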
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
Hi, I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time). Dominik
On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not? Thanks in advance, German
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
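For reference, the "1GB of write-back cache with flashcache" setup mentioned above would be created roughly like this (an untested sketch - the device and cache names are invented, and -p back selects write-back mode):

  flashcache_create -p back -s 1g rbd_wb_cache /dev/ssd1 /dev/rbd0
  # the cached device then appears as /dev/mapper/rbd_wb_cache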
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 09:24 AM, Jan Schermer wrote:
On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (the sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders
Re: [ceph-users] ceph cluster_network with linklocal ipv6
Should Ceph care about what scope the address is in? We don't specify it for IPv4 anyway - or is link scope special in some way? And isn't this the correct syntax, actually? cluster_network = fe80::/64%cephnet
On 18 Aug 2015, at 16:17, Wido den Hollander w...@42on.com wrote:
On 18-08-15 16:02, Jan Schermer wrote: Shouldn't this: cluster_network = fe80::%cephnet/64 be this: cluster_network = fe80::/64 ?
That won't work since the kernel doesn't know the scope. So %devname is right, but Ceph can't parse it. Although it sounds cool to run Ceph over link-local, I don't think it currently works. Wido
On 18 Aug 2015, at 15:39, Björn Lässig b.laes...@pengutronix.de wrote: Hi, I just set up my first Ceph cluster, and after breaking things for a while and letting Ceph repair itself, I want to set up the cluster network. Unfortunately I am doing something wrong :-) To avoid any dependencies in my cluster network, I want to use only IPv6 link-local addresses on interface 'cephnet'.
/var/log/ceph/ceph-osd.4.log: 2015-08-18 15:10:38.954592 7f24c0ac2880 -1 unable to parse network: fe80::%cephnet/64
--- /etc/ceph/ceph.conf
[global] ... cluster_network = fe80::%cephnet/64
--- /etc/network/interfaces
auto cephnet iface cephnet inet6 auto sysctl -wq net.ipv6.conf.$IFACE.accept_ra_defrtr=0
What could I do? thanks, Björn
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
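Until Ceph can parse scoped addresses, the usual workaround is to give the cluster interfaces a unique-local (ULA) prefix so no zone identifier is needed. A sketch, with the fd00::/8 prefix entirely made up:

  # /etc/ceph/ceph.conf
  [global]
      cluster_network = fd0d:b8ca:fe00::/64

The prefix is still self-contained (nothing outside the cluster needs to route it), which keeps most of the "no external dependencies" property that link-local was meant to provide.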
Re: [ceph-users] ceph distributed osd
Hi Luis, What I mean is: we have three OSDs, each with a 1TB hard disk, and two pools (poolA and poolB) with replica 2. The write behavior is the confusing part for us. Our assumption is:
PoolA -- may write to OSD1 and OSD2 (is this correct?)
PoolB -- may write to OSD3 and OSD1 (is this correct?)
Suppose the hard disks get full - how many OSDs need to be added, and what will the write behavior to the new OSDs be? After adding a few OSDs:
PoolA -- may write to OSD4 and OSD5 (is this correct?)
PoolB -- may write to OSD5 and OSD6 (is this correct?)
Regards, Prabu
On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito periqu...@gmail.com wrote: I don't understand your question. You created a 1G RBD/disk and it's full. You are able to grow it though - but that's a Linux management issue, not Ceph. As everything is thin-provisioned you can create an RBD with an arbitrary size - I've created one with 1PB when the cluster only had 600G raw available.
On Mon, Aug 17, 2015 at 1:18 PM, gjprabu gjpr...@zohocorp.com wrote: Hi All, can anybody help with this issue? Regards, Prabu
On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu gjpr...@zohocorp.com wrote: Hi All, also please find the OSD information:
ceph osd dump | grep 'replicated size'
pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool stripe_width 0
Regards, Prabu
On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu gjpr...@zohocorp.com wrote: Hi All, we need to test three OSDs and one image with replica 2 (size 1GB). While testing, data is not written above 1GB. Is there any option to write to the third OSD?
ceph osd pool get repo pg_num
pg_num: 126
# rbd showmapped
id pool image          snap device
0  rbd  integdownloads -    /dev/rbd0  -- already there
2  repo integrepotest  -    /dev/rbd2  -- newly created
[root@hm2 repository]# df -Th
Filesystem           Type      Size  Used Avail Use% Mounted on
/dev/sda5            ext4      289G   18G  257G   7% /
devtmpfs             devtmpfs  252G     0  252G   0% /dev
tmpfs                tmpfs     252G     0  252G   0% /dev/shm
tmpfs                tmpfs     252G  538M  252G   1% /run
tmpfs                tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/sda2            ext4      488M  212M  241M  47% /boot
/dev/sda4            ext4      1.9T   20G  1.8T   2% /var
/dev/mapper/vg0-zoho ext4      8.6T  1.7T  6.5T  21% /zoho
/dev/rbd0            ocfs2     977G  101G  877G  11% /zoho/build/downloads
/dev/rbd2            ocfs2    1000M 1000M     0 100% /zoho/build/repository
@:~$ scp -r sample.txt root@integ-hm2:/zoho/build/repository/
root@integ-hm2's password:
sample.txt 100% 1024MB 4.5MB/s 03:48
scp: /zoho/build/repository//sample.txt: No space left on device
Regards, Prabu
On Thu, 13 Aug 2015 19:42:11 +0530 gjprabu gjpr...@zohocorp.com wrote: Dear Team, we are using two Ceph OSDs with replica 2 and it is working properly. My doubt is this: pool A's image size will be 10GB and it is replicated to two OSDs. What will happen if the size reaches that limit - is there any chance to make the data continue writing to another two OSDs? Regards, Prabu
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
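Two points worth illustrating here. First, placement is chosen per object by CRUSH, not per pool, so you can ask the cluster which OSDs any given object maps to. Second, the "No space left on device" above is simply the 1GB image being full, which `rbd resize` fixes (the filesystem on top then has to be grown with its own tools). A sketch, with the object name made up:

  # show the acting OSD set for one object in the pool
  ceph osd map repo sample.txt
  # grow the image from 1000MB to 2GB (--size is in MB in this release)
  rbd resize repo/integrepotest --size 2048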
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies below 1ms. I'm still in testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual-port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using OpenStack and other direct RBD interfaces it may not be such an attractive option.
I've been thinking that maybe Ceph needs an additional daemon with very low overheads, which is run on SSDs to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as using flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-iops/low-latency stuff.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On
Re: [ceph-users] any recommendation of using EnhanceIO?
snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs. To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:
Traditional array, 10K disks = 300-600 IOPs
Ceph 7.2K + SSD journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph pure SSD pool = 500 IOPs (Intel S3700 SSDs)
I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help here. Sandisk and Intel have both done some very useful investigations; I've got some additional tests replicating some of their findings coming shortly.
OK, it will be interesting to see. I will see if I can change it in my environment and whether it gives any improvement. I think I came to the conclusion that Ceph takes a certain amount of time to do a write, and by the time you add in a replica copy I was struggling to get much below 2ms per IO with my 2.1GHz CPUs. 2ms = ~500 IOPs.
Ceph cache tiering = 10-500 IOPs (as we know, misses can be very painful)
Indeed. There's some work going on in this area too. Hopefully we'll know how some of our ideas pan out later this week. Assuming excessive promotions aren't a problem, I suspect the jemalloc/tcmalloc improvements will generally make cache tiering more interesting (though buffer cache will still be the primary source of really hot cached reads).
Ceph + RBD caching with flashcache = 200-1000 IOPs (readahead can give high bursts if snapshot blocks are sequential)
Good to know!
And when copying VMs to the datastore (ESXi does this in sequential 64k IOs... yes, silly I know):
Traditional array, 10K disks = ~100MB/s (limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD journal = ~20MB/s (again the LevelDB sync seems to be the limit here for sequential writes)
This is pretty bad. Is RBD cache enabled?
Tell me about it, moving a 2TB VM is a painful experience. Yes, the librbd cache is on, but iSCSI effectively turns all writes into sync writes, so this bypasses the cache and you are dependent on the time it takes for each OSD to ACK the write. In this case, waiting each time for 64kb IOs to complete due to the levelDB sync, you end up with transfer speeds somewhere in the region of 15-20MB/s. You can do the same thing with something like IOmeter (64k, sequential write, directio, QD=1). NFS is even worse, as every ESX write also requires an FS journal sync on the FS being used for NFS. So you have to wait for two ACKs from Ceph, normally meaning
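A fio equivalent of that IOmeter test, for anyone who wants to reproduce the 64k QD=1 sync-write behaviour against a mapped RBD (the device name is made up, and this writes to the device, so only run it against a scratch image):

  fio --name=esx-style-copy --filename=/dev/rbd0 --rw=write \
      --bs=64k --iodepth=1 --direct=1 --sync=1 \
      --runtime=60 --time_based

The reported bandwidth should land close to the 15-20MB/s figures quoted above if the same LevelDB sync bottleneck is in play.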
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 18 Aug 2015, at 18:15, Jan Schermer j...@schermer.cz wrote:

On 18 Aug 2015, at 17:57, Björn Lässig b.laes...@pengutronix.de wrote:

On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

I'm not a network guru, but you can have same/overlapping subnets with IPv4 as well; that's why you have scopes, metrics, routing tables, policy routing tables etc. That's why I'm wondering what's different here.

It's IPv6, that's different. Search for "link local", it will tell you what it is.

Wido

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn
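For reference, a link-local address is only meaningful together with an interface (zone) ID, which is what the %-suffix in the proposed syntax denotes; whether ceph's cluster_network option accepts that notation is exactly the open question here. Standard tooling shows the idea (interface name and address are placeholders):

    # Every IPv6-enabled interface carries its own fe80::/64 address:
    ip -6 addr show dev eth0 | grep fe80

    # Reaching a link-local peer requires naming the outgoing interface:
    ping6 fe80::21b:21ff:fe22:e865%eth0

Without the zone ID the kernel cannot tell which interface's fe80::/64 is meant, which is why a plain "cluster_network = fe80::/64" is ambiguous.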
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Jan,

On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer j...@schermer.cz wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt, if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable).

Out of curiosity, were you using EnhanceIO in writeback mode? I assume so, as a read cache should not hurt anything.

Thanks, Alex

If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root, b) disable and enable it on the fly (doh), c) make it non-persistent (flush it) before reboot - not sure if that was possible either - d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily.

bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Jan

On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc.

Best regards, Alex

On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend.

Daniel

From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

Hi,

I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time).

Dominik

On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, has anyone out there implemented enhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not?

Thanks in advance, German
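Since bcache keeps coming up as the safest of these options, here is a minimal setup sketch for a writeback bcache device (untested here; device paths are placeholders and the cache-set UUID comes from the make-bcache output):

    # Format the SSD as a cache device and the slow disk as backing:
    make-bcache -C /dev/nvme0n1p1        # prints the cache set UUID
    make-bcache -B /dev/sdb              # creates /dev/bcache0

    # Attach the backing device to the cache set:
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # Switch from the default writethrough to writeback:
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # To drain dirty data before maintenance (Jan's "flush before
    # reboot" case, point c above), force aggressive writeback:
    echo 0 > /sys/block/bcache0/bcache/writeback_percent

Detaching the cache set afterwards is the usual way to make the device fully non-persistent again.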
Re: [ceph-users] any recommendation of using EnhanceIO?
On 18 Aug 2015, at 16:44, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies below 1ms. I'm still testing at the moment with the testing kernel that includes the blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual-port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using openstack and other direct RBD interfaces it may not be such an attractive option.

I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSDs, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with Flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-iops/low-latency stuff.

Additional daemon that is persistent how? Isn't that what the journal does already, just too slowly? I think the best (and easiest!) approach is to mimic what a monolithic SAN does.

Currently:
1) client issues blocking/atomic/sync IO
2) rbd client sends this IO to all OSDs
3) after all OSDs process the IO, the IO is finished and considered persistent

That has serious implications:
* every IO is processed separately, not much coalescing
* OSD processes add latency when processing this IO
* one OSD can be slow momentarily, IO backs up and the cluster stalls

Let me just select what "processing the IO" means with respect to my architecture and I can likely get a 100x improvement. Let me choose:

1) WHERE the IO is persisted. Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient? Not waiting for one slow OSD gives me at least some SLA for planned tasks like backfilling, scrubbing, deep-scrubbing. Hands up who can afford to leave deep-scrub enabled in production...

2) WHEN the IO is persisted. Do I really need all OSDs to flush the data to disk? If all the nodes are in the same cabinet and on the same UPS then this makes sense. But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... When nukes start falling and this happens, then I'll start looking for backups. Even if your nodes are in one datacentre, there are likely redundant (2+) circuits. And even if you have just one cabinet, you can add 3x UPS in there and gain a nice speed boost. So the IO could actually be pretty safe and happy once it gets to remote buffers on enough (quorum) nodes and waits for processing. It can be batched, it can be coalesced, it can be rewritten with subsequent updates...

3) WHAT amount of IO is stored. Do I need to have the last transaction, or can I tolerate 1 minute of missing data? Checkpoints, checksums on the last transaction, rollback (the journal already does this AFAIK)...

4) I DON'T CARE mode :-) A qemu cache=unsafe equivalent, but set on an RBD volume/pool. Because sometimes you just need to crunch data without really storing it persistently - how are the CERN/Hadoop/Big Data guys approaching this? And you can't always disable flushing. Filesystems have nobarrier options (usually), but if you need a block device for a raw database tablespace, you're pretty much SOL without lots of trickery.

1) is doable eventually. 2) is
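For what it's worth, the per-volume "I don't care" mode in 4) can already be approximated at the hypervisor layer today: qemu's cache=unsafe ignores all guest flushes. A sketch (pool/image name and auth id are placeholders):

    # Attach an RBD image with all flushes ignored - data loss on a
    # crash is expected and accepted, as described above:
    qemu-system-x86_64 -m 1024 \
        -drive file=rbd:rbd/scratch-vm:id=admin,format=raw,cache=unsafe

This only covers librbd guests, though; it doesn't help the raw-block-device case, and it is per-VM rather than per-pool.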
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 18 Aug 2015, at 17:57, Björn Lässig b.laes...@pengutronix.de wrote:

On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

I'm not a network guru, but you can have same/overlapping subnets with IPv4 as well; that's why you have scopes, metrics, routing tables, policy routing tables etc. That's why I'm wondering what's different here.

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:08 AM, Nick Fisk wrote:

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 15:55 To: Jan Schermer j...@schermer.cz Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 09:24 AM, Jan Schermer wrote:

On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest, but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.

Agreed.

2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.

This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using Flashcache for this too, but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.

3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design, but care must be taken to not over-promote.

4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.

Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions, but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.

To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:

Traditional Array 10K disks = 300-600 IOPs
Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph Pure SSD Pool = 500 IOPs (Intel S3700 SSDs)

I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help here. Sandisk and Intel have both done some very useful investigations; I've got some additional tests replicating some of their findings coming shortly.

Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)

Indeed. There's some work going on in this area too. Hopefully we'll know how some of our ideas pan out later this week. Assuming excessive promotions aren't a problem, I suspect the jemalloc/tcmalloc improvements will generally make cache tiering more interesting (though buffer cache will still be the primary source of really hot cached reads).

Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give high bursts if snapshot blocks are sequential)

Good to know!

And when copying VMs to a datastore
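On the RocksDB point above, a hedged sketch of what this could look like in ceph.conf; the option names below are my best recollection and untested, so verify them against your release before relying on any of this:

    # ceph.conf sketch - assumed option names, verify before use
    [osd]
    filestore_omap_backend = rocksdb    # omap data via RocksDB instead of leveldb

    [mon]
    mon_keyvaluedb = rocksdb            # monitor store on RocksDB

Placing the RocksDB WAL itself on an SSD would be a further step on top of this, via the backend's own options.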
Re: [ceph-users] Repair inconsistent pgs..
Also, what command are you using to take snapshots?
-Sam

On Tue, Aug 18, 2015 at 8:48 AM, Samuel Just sj...@redhat.com wrote: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)?
-Sam

On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2. It happened during rebalancing. I thought too that some OSDs were missing copies, but it looks like all of them are... So, any advice on which direction I need to go?

2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen, and I'd expect pg repair to handle that, but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue?
-Greg

On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, on our production cluster, due to heavy rebalancing ((( we have 2 pgs in an inconsistent state...

root@temp:~# ceph health detail | grep inc
HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
pg 2.490 is active+clean+inconsistent, acting [56,15,29]
pg 2.c4 is active+clean+inconsistent, acting [56,10,42]

From the OSD logs, after a repair attempt:

root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done
dumped all in format plain
instructing pg 2.490 on osd.56 to repair
instructing pg 2.c4 on osd.56 to repair

/var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2
/var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2
/var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2
/var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2
/var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2
/var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2
/var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2
/var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2
/var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors

So, how can I solve the "expected clone" situation by hand? Thanks in advance!
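A hedged sketch of how one might start inspecting this by hand on filestore OSDs (the PG id, OSD id and rbd_data prefix are taken from the log excerpt above; paths assume the default /var/lib/ceph layout):

    # Confirm which OSDs hold the PG and re-run a deep scrub:
    ceph pg map 2.490
    ceph pg deep-scrub 2.490

    # On each acting OSD, locate the on-disk copies of the object
    # named in the "expected clone" error, to compare heads and
    # clones across replicas:
    find /var/lib/ceph/osd/ceph-56/current/2.490_head/ \
        -name '*rbd_data.1631755377d7e*' -ls

Comparing which clone files exist on each replica should show whether a clone is genuinely missing or merely unexpected, which determines whether copying or removing an object is the right manual fix.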
Re: [ceph-users] any recommendation of using EnhanceIO?
-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 15:55 To: Jan Schermer j...@schermer.cz Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 09:24 AM, Jan Schermer wrote:

On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest, but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.

2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.

This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using Flashcache for this too, but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design, but care must be taken to not over-promote.

4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.

To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:

Traditional Array 10K disks = 300-600 IOPs
Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph Pure SSD Pool = 500 IOPs (Intel S3700 SSDs)
Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)
Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give high bursts if snapshot blocks are sequential)

And when copying VMs to a datastore (ESXi does this in sequential 64k IOs... yes, silly I know):

Traditional Array 10K disks = ~100MB/s (Limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD Journal = ~20MB/s (Again the LevelDB sync seems to limit here for sequential writes)
Ceph Pure SSD Pool = ~50MB/s (a Ceph CPU bottleneck is occurring)
Ceph Cache Tiering = ~50MB/s when writing to a new block, 10MB/s on promote+overwrite
Ceph + RBD Caching with Flashcache = As fast as the SSD will go

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)

Jan

On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't
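As a concrete illustration of the flashcache setup Nick describes above, creating a small writeback cache in front of an RBD might look like this (a sketch; device names are placeholders, and the barrier caveats discussed earlier in the thread apply in full):

    # Create a writeback (-p back) cache named "rbd_wb" from an SSD
    # partition, fronting the RBD device:
    flashcache_create -p back rbd_wb /dev/ssd1 /dev/rbd0

    # IO then goes through the cached mapping instead of the raw RBD:
    ls /dev/mapper/rbd_wb

The ~1GB write-back window is what absorbs the sync-write latency; if that SSD dies before flushing, the data on the RBD is stale, which is exactly the trade-off being debated here.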
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn