Re: [ceph-users] How to improve single thread sequential reads?
On 18-08-15 12:25, Benedikt Fraunhofer wrote:

> Hi Nick,
>
> did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is IMHO limited by the speed of a single disk. Data already present in the osd OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit. (20-node cluster, 12x 4TB 7.2k disks, 2 SSDs as journals for 6 OSDs each, LACP 2x10G bonds.)
>
> rados bench performs equally badly single-threaded, but with its default multithreaded settings it generates wonderful numbers, usually limited only by line rate and/or interrupts/s. I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to your wonderful numbers, but it stays below 30MB/s.
>
> I was thinking about using a software RAID0 like you did, but that's IMHO really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies.

It really depends on your situation, but you could also go for objects larger than 4MB for specific block devices. In a use case with a customer where they read large files single-threaded from RBD block devices, we went for 64MB objects. That improved our read performance in that case: we didn't have to create a new TCP connection and talk to a new OSD every 4MB. You could try that and see how it works out.

Wido

> Thx in advance
> Benedikt
>
> 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:
>
> Thanks for the replies guys. The client is set to 4MB; I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this. If you could set the RBD readahead much higher than the object size, then this would probably give the desired effect, where the buffer could be populated by reading from several OSDs in advance to give much higher performance. That, or wait for striping to appear in the kernel client.
>
> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
> Sent: 17 August 2015 03:36
> To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd? In case of librbd, try the different config options for rbd cache..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev
> Sent: Sunday, August 16, 2015 7:07 PM
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
> Hi Nick,
>
> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
> Sent: 13 August 2015 18:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] How to improve single thread sequential reads?
>
> Hi,
>
> I'm trying to use an RBD to act as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd. Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice. On a personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
>
> Some settings in our ceph.conf that may help:
>
> osd_op_threads = 20
> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> filestore_queue_max_ops = 9
> filestore_flusher = false
> filestore_max_sync_interval = 10
> filestore_sync_flush = false
>
> Regards,
> Alex
>
> Rbd-fuse seems to top out at 12MB/s, so there goes that option.
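(For reference, Wido's larger-object suggestion is applied at image creation time; a minimal sketch, assuming the Hammer-era rbd CLI where --order gives the log2 of the object size, and with pool/image names made up:)

    # 100GB image with 64MB objects (2^26 bytes); the default order is 22 (4MB)
    rbd create --size 102400 --order 26 rbd/tape-staging
    rbd info rbd/tape-staging    # "order 26" confirms the object size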
[ceph-users] Re: Re: tcmalloc use a lot of CPU
Hi!

> How many nodes? How many SSDs/OSDs?

2 nodes, each:
- 1x E5-2670, 128G RAM
- 2x 146G SAS 10krpm - system + MON root
- 10x 600G SAS 10krpm + 7x 900G SAS 10krpm, single-drive RAID0 on LSI 2208
- 2x 400G SSD Intel DC S3700 on C602 - for a separate SSD pool
- 2x 200G SSD Intel DC S3700 on SATA3 - for ceph journals
- 10Gbit shared interconnect (Eth)

So: 2 MONs (I know about quorum ;) ) + 34 HDD OSDs + 4 SSD OSDs. Ceph 0.94.2 on Debian Jessie. Tuning: swappiness, low-latency TCP tuning, enlarged TCP buffers, disabled interrupt coalescing, noop on SSDs, deadline on HDDs.

> Are they random?

Yes: 4k random read, 8 processes, aio, qd=32, over 500G RBD volumes. There are 2 test volumes - one on the HDD pool and one on the SSD pool. The client runs on a separate host with a 10Gbit network. The volumes hold real Linux filesystems, created with rbd import, so they are fully allocated.

> What are you using to make the tests?

fio-rbd 2.2.7 - with native rbd support, built from source.

> How big are those OPS?

When I use the default ceph.conf (simple messenger, crc on, cephx on, all debug off):

1. ~12k iops from the HDD pool in a cold state (after dropping caches on all nodes):
- 8-10% user, 2-3% sys, ~70% iowait, 18% idle
- iostat shows 70% load on the OSD drives
- perf top shows:

  7,53% libtcmalloc.so.4.2.2 [.] tcmalloc::SLL_Next(void*)
  1,86% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)
  1,51% libpthread-2.19.so [.] __pthread_mutex_unlock_usercnt
  1,49% libtcmalloc.so.4.2.2 [.] TCMalloc_PageMap3<35>::get(unsigned long) const
  1,29% libtcmalloc.so.4.2.2 [.] PackedCache<35, unsigned long>::GetOrDefault(unsigned long, unsigned long)
  1,25% libtcmalloc.so.4.2.2 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
  1,19% ceph-osd [.] crush_hash32_3
  1,00% libpthread-2.19.so [.] pthread_mutex_lock
  0,89% libtcmalloc.so.4.2.2 [.] tcmalloc::ThreadCache::Deallocate(void*, unsigned long)
  0,87% libtcmalloc.so.4.2.2 [.] base::subtle::NoBarrier_Load(long const volatile*)

2. ~30-40k iops from the HDD pool in a warm state (second pass):
- 40-60% user (!), 8-10% sys, 1% iowait, ~50% idle
- iostat shows 1% load on the OSD drives
- perf top shows the same - tcmalloc calls are at the top

It is quite an understandable situation: on the first run most io is read from the platters, and we got 12000 iops / 34 osds ~ 350 iops each, which is a good value for a 10krpm drive. On the second run we serve reads (mostly) from the page cache, so there is no IO on the platters. But both runs show us that there is some tcmalloc issue limiting the overall io of the cluster. Also, 40% CPU in the second run is an abnormal value, I think.

The next test is the same, except the volume is on the SSD pool.

3. ~43k iops from the SSD pool in a cold state (after dropping caches on all nodes):
- 25% user, 8-12% sys, ~6% iowait, ~55-60% idle
- iostat shows ~55-65% load on the SSDs with ~8 kiops each (4 ssds total in the pool)
- perf top shows two different things; I'll explain later (*)

4. Also the same ~43k iops from the SSD pool in a warm state.

This test shows that ceph limits performance by itself somewhere, because (a) there is almost no difference in iops between serving io from the ssds themselves and from the page cache - I think io from the page cache should be faster anyway - and (b) each SSD can do 30k iops of random read, while we got only ~8k per drive.
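(For reference, a sketch of the kind of fio invocation described above, using fio's rbd engine; the pool, image, and client names here are assumptions:)

    fio --ioengine=rbd --clientname=admin --pool=ssdpool --rbdname=testvol \
        --rw=randread --bs=4k --iodepth=32 --numjobs=8 --direct=1 \
        --runtime=60 --time_based --name=rbd-4k-randread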
(*) As for the perf top results: sometimes things change quickly, and instead of tcmalloc's calls at the top we get:

46,07% [kernel] [k] _raw_spin_lock
6,51% [kernel] [k] mb_cache_entry_alloc

As I can see from the function names, these are kernel calls for cache allocation. In a normal situation they are far behind the tcmalloc calls, but sometimes they go up in perf top. In those moments, performance from the SSD pool drops significantly - to 10k iops. And this does not happen while benchmarking the volume located on the HDD pool, only when testing the volume on the SSD pool. Pity, but I don't have any explanation. A kernel issue?

> Using atop on the OSD nodes, where is your bottleneck?

That is the main question! We built this test Hammer install to get the best performance from it, because our production Firefly cluster does not perform so well. And I can't see any bottleneck that limits performance to ~40k iops, except the tcmalloc issues.

PS: I tried the ms_async messenger, and it raises performance to over 60k! That is very good! But the bad thing is a core dump that always happens within two minutes of starting. As far as I can see, there is an assert on memory deallocation in the AsyncMessenger code. I hope the async messenger will work better in new Ceph versions, as it really helps to increase performance.

Megov Igor
CIO, Yuterra

From: Luis Periquito periqu...@gmail.com
Sent: 17 August 2015 17:15
To: Межов Игорь Александрович
Cc: YeYin; ceph-users
Subject: Re: [ceph-users] Re: tcmalloc use a lot of CPU

How
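(For anyone who wants to reproduce the async-messenger experiment - and note the crash caveat above - in Hammer it is selected via ms_type in ceph.conf; a minimal sketch:)

    [global]
    # experimental in 0.94: switch from SimpleMessenger to AsyncMessenger
    ms_type = async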
Re: [ceph-users] How to improve single thread sequential reads?
I'm not sure if I missed that, but are you testing in a VM backed by an RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM; it just passes the request to the underlying block device, and in the case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong, please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe, because I'm unable to tune the scheduler and it just makes no sense at all...?

Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) the cfq scheduler and set the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.

If you are running a single-threaded benchmark like rados bench, then what's limiting you is latency - it's not surprising it scales up with more threads. It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - I'm not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show up in /sys/block? Can you dump find /sys/block/$rbd in here?

Jan

On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net wrote:
> [snip - quoted message trimmed; see Benedikt's mail earlier in the thread]
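(To answer Jan's question: krbd devices do show up under /sys/block; a sketch of the knobs discussed in this thread, with the device name assumed:)

    # inspect the current limits on a mapped krbd device
    grep . /sys/block/rbd0/queue/read_ahead_kb \
           /sys/block/rbd0/queue/max_sectors_kb \
           /sys/block/rbd0/queue/max_hw_sectors_kb \
           /sys/block/rbd0/queue/rotational \
           /sys/block/rbd0/queue/scheduler
    # bump readahead to 4MB (the value is in KB)
    echo 4096 > /sys/block/rbd0/queue/read_ahead_kb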
Re: [ceph-users] How to improve single thread sequential reads?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
Sent: 18 August 2015 11:50
To: Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net
Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk
Subject: Re: [ceph-users] How to improve single thread sequential reads?

> I'm not sure if I missed that, but are you testing in a VM backed by an RBD device, or using the device directly? [...] I have some Ubuntu VMs that use blk-mq for virtio-blk devices and it makes me cringe, because I'm unable to tune the scheduler and it just makes no sense at all...?

Since 4.0 (I think) the kernel RBD client uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why that testing kernel is essential for large-block/sequential workloads. I think the bug fix should make it into 4.2, hopefully.

> Anyway, I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead); also try (if you're not using blk-mq) the cfq scheduler and set the device to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now.

I'm pretty sure you can't adjust max_hw_sectors_kb (which equals the object size, from what I can tell), and max_sectors_kb is already set at the hw max. It would sure be nice if max_hw_sectors_kb could be set higher, though, but I'm not sure if there is a reason for this limit.

> If you are running a single-threaded benchmark like rados bench, then what's limiting you is latency - it's not surprising it scales up with more threads.

Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation, as a single-threaded op effectively becomes multithreaded.

> It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq, and I've never used the RBD device directly (the kernel client). Does it show up in /sys/block? Can you dump find /sys/block/$rbd in here?
>
> Jan
>
> On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph-users.ceph.com.toasta@traced.net wrote:
> [snip - quoted message trimmed; see Benedikt's mail earlier in the thread]
Re: [ceph-users] Repair inconsistent pgs..
Voloshanenko Igor writes:

> Hi Irek. Please read carefully ))) Your proposal was the first thing I tried... That's why I asked for help... (
>
> 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com:
>
> Hi, Igor. You need to repair the PG:
>
> for i in `ceph pg dump | grep inconsistent | grep -v 'inconsistent+repair' | awk {'print $1'}`; do ceph pg repair $i; done
>
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
>
> 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com:
>
> Hi all! On our production cluster, due to heavy rebalancing ((( we have 2 pgs in an inconsistent state:
>
> root@temp:~# ceph health detail | grep inc
> HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
> pg 2.490 is active+clean+inconsistent, acting [56,15,29]
> pg 2.c4 is active+clean+inconsistent, acting [56,10,42]
>
> From the OSD logs, after a recovery attempt:
>
> root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i}; done
> dumped all in format plain
> instructing pg 2.490 on osd.56 to repair
> instructing pg 2.c4 on osd.56 to repair
>
> /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2
> /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2
> /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2
> /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2
> /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2
> /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2
> /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2
> /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2
> /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors
>
> So, how can I solve the "expected clone" situation by hand? Thanks in advance!

I've had an inconsistent pg once, but it was a different sort of error (some sort of digest mismatch, where the secondary object copies had later timestamps). This was fixed by moving the object away and restarting the osd, which got fixed when the osd peered, similar to what was mentioned in Sebastien Han's blog[1].
I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error could guide you better.

[1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/

--
Abhishek
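(For the manual-repair route referenced above, the usual first step is locating the object replicas on disk; a sketch with the OSD id and image prefix taken from the log lines above, and filestore paths assumed:)

    # on the node hosting osd.56: find the on-disk files of the object
    # named in the scrub error (underscores are escaped in filenames,
    # hence the loose wildcard)
    find /var/lib/ceph/osd/ceph-56/current/2.490_head/ \
         -name '*1631755377d7e*'
    # then: stop the osd, move the bad replica aside, restart the osd,
    # and re-run "ceph pg repair 2.490" as described in the blog post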
Re: [ceph-users] Memory-Usage
On Mon, Aug 17, 2015 at 8:21 PM, Patrik Plank pat...@plank.me wrote:

> Hi,
>
> I have a ceph cluster with three nodes and 32 osds. The three nodes have 16GB memory each, but only 5GB is in use. The nodes are Dell PowerEdge R510.
>
> My ceph.conf:
>
> [global]
> mon_initial_members = ceph01
> mon_host = 10.0.0.20,10.0.0.21,10.0.0.22
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> filestore_op_threads = 32
> public_network = 10.0.0.0/24
> cluster_network = 10.0.1.0/24
> osd_pool_default_size = 3
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 4096
> osd_pool_default_pgp_num = 4096
> osd_max_write_size = 200
> osd_map_cache_size = 1024
> osd_map_cache_bl_size = 128
> osd_recovery_op_priority = 1
> osd_max_recovery_max_active = 1
> osd_recovery_max_backfills = 1
> osd_op_threads = 32
> osd_disk_threads = 8
>
> Is that normal or a bottleneck?

Any memory not used by the OSD processes directly will be used by Linux for page caching. That's what we want to have happen! So it's not a problem that it's using only 5GB. Keep in mind that memory usage might spike dramatically if the OSDs need to deal with an outage, though - your normal-state usage ought to be lower than our recommended values for that reason.
-Greg

> best regards
> Patrik
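(A quick way to see this split on an OSD node; both numbers below are reclaimable caches, not daemon memory:)

    free -h              # "buffers/cache" is page cache, reclaimed under pressure
    slabtop -o | head    # kernel slab caches, also reclaimable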
Re: [ceph-users] any recommendation of using EnhanceIO?
I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt, if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable).

If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter, it will go straight to disk).

Flashcache doesn't respect barriers (or does it now?) - if that's ok for you then go for it; it should be stable, and I used it in the past in production without problems.

bcache seemed to work fine, but I needed to a) use it for root, b) disable and enable it on the fly (doh), c) make it non-persistent (flush it) before reboot - not sure if that was possible either, and d) do all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Jan

On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote:

> What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is that there is no need to change the device name, unlike bcache, flashcache, etc.
>
> Best regards,
> Alex
>
> On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote:
>
> I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend.
> Daniel
>
> From: Dominik Zalewski dzalew...@optlink.net
> To: German Anders gand...@despegar.com
> Cc: ceph-users ceph-users@lists.ceph.com
> Sent: Wednesday, July 1, 2015 5:28:10 PM
> Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
>
> Hi,
>
> I've asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST; even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time).
>
> Dominik
>
> On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote:
>
> Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share showing the difference between using it and not?
>
> Thanks in advance,
> German
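(For anyone evaluating bcache as suggested here, a minimal setup sketch; the device names are examples:)

    # format the backing device (HDD) and the cache device (SSD)
    make-bcache -B /dev/sdb
    make-bcache -C /dev/sdc
    # attach the cache set to the new bcache device by its UUID
    bcache-super-show /dev/sdc | grep cset.uuid
    echo <cset-uuid-from-above> > /sys/block/bcache0/bcache/attach
    echo writeback > /sys/block/bcache0/bcache/cache_mode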
Re: [ceph-users] any recommendation of using EnhanceIO?
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer
Sent: 18 August 2015 10:01
To: Alex Gorbachev a...@iss-integration.com
Cc: Dominik Zalewski dzalew...@optlink.net; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

> I already evaluated EnhanceIO in combination with CentOS 6 [...]
> [snip - quoted message trimmed; see Jan's mail above]
> bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Please note: it looks like the main (only?) dev of bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project.

> Jan
> [rest of quoted thread snipped]
Re: [ceph-users] How to improve single thread sequential reads?
Hi Nick,

did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. Single-threaded writes are quite speedy, around 600MB/s. radosgw for cold data is around 90MB/s, which is IMHO limited by the speed of a single disk. Data already present in the osd OS buffers arrives at around 400-700MB/s, so I don't think the network is the culprit. (20-node cluster, 12x 4TB 7.2k disks, 2 SSDs as journals for 6 OSDs each, LACP 2x10G bonds.)

rados bench performs equally badly single-threaded, but with its default multithreaded settings it generates wonderful numbers, usually limited only by line rate and/or interrupts/s. I just gave kernel 4.0 with its rbd blk-mq feature a shot, hoping to get to your wonderful numbers, but it stays below 30MB/s.

I was thinking about using a software RAID0 like you did, but that's IMHO really ugly. When I knew I needed something speedy, I usually just started dd-ing the file to /dev/null and waited about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies.

Thx in advance
Benedikt

2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk:

> Thanks for the replies guys. The client is set to 4MB; I haven't played with the OSD side yet, as I wasn't sure if it would make much difference, but I will give it a go. If the client is already passing a 4MB request down to the OSD, will it be able to read ahead any further? The next 4MB object will in theory be on another OSD, so I'm not sure if reading ahead any further on the OSD side would help.
>
> How I see the problem is that the RBD client will only read 1 OSD at a time, as the RBD readahead can't be set any higher than max_hw_sectors_kb, which is the object size of the RBD. Please correct me if I'm wrong on this. If you could set the RBD readahead much higher than the object size, then this would probably give the desired effect, where the buffer could be populated by reading from several OSDs in advance to give much higher performance. That, or wait for striping to appear in the kernel client.
>
> I've also found that BareOS (a fork of Bacula) seems to have a direct RADOS feature that supports radosstriper. I might try this and see how it performs as well.
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Somnath Roy
> Sent: 17 August 2015 03:36
> To: Alex Gorbachev a...@iss-integration.com; Nick Fisk n...@fisk.me.uk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Have you tried setting read_ahead_kb to a bigger number on both the client and OSD side if you are using krbd? In case of librbd, try the different config options for rbd cache..
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alex Gorbachev
> Sent: Sunday, August 16, 2015 7:07 PM
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] How to improve single thread sequential reads?
>
> Hi Nick,
>
> On Thu, Aug 13, 2015 at 4:37 PM, Nick Fisk n...@fisk.me.uk wrote:
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Nick Fisk
> Sent: 13 August 2015 18:04
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] How to improve single thread sequential reads?
>
> Hi,
>
> I'm trying to use an RBD to act as a staging area for some data before pushing it down to some LTO6 tapes. As I cannot use striping with the kernel client, I tend to max out at around 80MB/s reads when testing with dd.
> Has anyone got any clever suggestions for giving this a bit of a boost? I think I need to get it up to around 200MB/s to make sure there is always a steady flow of data to the tape drive.
>
> I've just tried the testing kernel with the blk-mq fixes in it for full-size IOs; this, combined with bumping readahead up to 4MB, is now getting me on average 150MB/s to 200MB/s, so this might suffice. On a personal interest, I would still like to know if anyone has ideas on how to really push much higher bandwidth through an RBD.
>
> Some settings in our ceph.conf that may help:
>
> osd_op_threads = 20
> osd_mount_options_xfs = rw,noatime,inode64,logbsize=256k
> filestore_queue_max_ops = 9
> filestore_flusher = false
> filestore_max_sync_interval = 10
> filestore_sync_flush = false
>
> Regards,
> Alex
>
> Rbd-fuse seems to top out at 12MB/s, so there goes that option. I'm thinking mapping multiple RBDs and then combining them into an mdadm RAID0 stripe might work, but it seems a bit messy. Any suggestions?
>
> Thanks,
> Nick
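(A sketch of the multi-RBD RAID0 idea discussed in this thread; the image names and chunk size are assumptions - a 4096KB chunk matches the default 4MB object size:)

    # map several images and stripe across them
    for i in 0 1 2 3; do rbd map rbd/stage$i; done
    mdadm --create /dev/md0 --level=0 --raid-devices=4 --chunk=4096 \
          /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3
    mkfs.xfs /dev/md0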
Re: [ceph-users] Ceph File System ACL Support
On Mon, Aug 17, 2015 at 4:12 AM, Yan, Zheng uker...@gmail.com wrote:

> On Mon, Aug 17, 2015 at 9:38 AM, Eric Eastman eric.east...@keepertech.com wrote:
>
> Hi,
>
> I need to verify whether in Ceph v9.0.2 the kernel version of the Ceph file system supports ACLs while the libcephfs file system interface does not. I am trying to have Samba, version 4.3.0rc1, support Windows ACLs using "vfs objects = acl_xattr" together with the Samba VFS Ceph file system interface "vfs objects = ceph", and my tests are failing. If I use a kernel mount of the same Ceph file system, it works.
>
> Using the Samba Ceph VFS interface with logging set to 3 in my smb.conf shows the following error when, on my Windows AD server, I try to "Disable inheritance" on the Samba-exported directory uu/home:
>
> [2015/08/16 18:27:11.546307, 2] ../source3/smbd/posix_acls.c:3006(set_canon_ace_list)
> set_canon_ace_list: sys_acl_set_file type file failed for file uu/home (Operation not supported).
>
> This works using the same Ceph file system kernel mounted. It also works with an XFS file system. Doing some Googling, I found this entry on the Samba email list:
> https://lists.samba.org/archive/samba-technical/2015-March/106699.html
> It states: "libcephfs does not support ACL yet, so this patch adds ACL callbacks that do nothing."
>
> If ACL support is not in libcephfs, are there plans to add it? The Samba Ceph VFS interface without ACL support is severely limited in a multi-user Windows environment.
>
> libcephfs does not support ACL. I have an old patch that adds ACL support to samba's vfs ceph module, but I haven't tested it carefully.

Are these published somewhere? Even if you don't have time to work on it, somebody else might pick it up and finish things if it's available as a starting point. :)
-Greg
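(For context, a sketch of the kind of share definition being tested above; the share name and path are made up, and the ceph: options are assumptions based on the Samba vfs_ceph module:)

    [uu_home]
        path = /uu/home
        vfs objects = acl_xattr ceph
        ceph:config_file = /etc/ceph/ceph.conf
        ceph:user_id = samba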
Re: [ceph-users] Repair inconsistent pgs..
No, this does not help ((( I tried to find the data, but it looks like it either exists with the same timestamp on all osds, or is missing on all osds... So I need advice on what to do...

On Tuesday, 18 August 2015, Abhishek L wrote:

> Voloshanenko Igor writes:
> [snip - quoted thread trimmed; see Abhishek's reply and the full scrub-error listing earlier in this thread]
>
> I've had an inconsistent pg once, but it was a different sort of error (some sort of digest mismatch, where the secondary object copies had later timestamps).
Re: [ceph-users] tcmalloc use a lot of CPU
Hi Mark,

> Yep! At least from what I've seen so far, jemalloc is still a little faster for 4k random writes even compared to tcmalloc with the patch + 128MB thread cache. Should have some data soon (mostly just a reproduction of Sandisk and Intel's work).

I have definitively switched to jemalloc on my production ceph cluster; I was too tired of this tcmalloc problem (I have hit the bug once or twice, even with TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES).

On the client side, it could be great to run fio or rados bench with jemalloc too; I have seen around a 20% improvement vs glibc:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 fio ...

(For my production, I'm running qemu with jemalloc too now.)

Regards,
Alexandre

----- Original Message -----
From: Mark Nelson mnel...@redhat.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 17 August 2015 16:24:16
Subject: Re: [ceph-users] tcmalloc use a lot of CPU

On 08/17/2015 07:03 AM, Alexandre DERUMIER wrote:

> Hi,
>
>> Is this phenomenon normal? Is there any idea about this problem?
>
> It's a known problem with tcmalloc (search the ceph mailing list). Starting the osd with the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128M environment variable should help.

Note that this only works if you use a version of gperftools/tcmalloc newer than 2.1.

> Another way is to compile ceph with jemalloc instead of tcmalloc (./configure --with-jemalloc ...)

Yep! At least from what I've seen so far, jemalloc is still a little faster for 4k random writes even compared to tcmalloc with the patch + 128MB thread cache. Should have some data soon (mostly just a reproduction of Sandisk and Intel's work).

----- Original Message -----
From: YeYin ey...@qq.com
To: ceph-users ceph-users@lists.ceph.com
Sent: Monday, 17 August 2015 11:58:26
Subject: [ceph-users] tcmalloc use a lot of CPU

Hi, all,

When I do a performance test with rados bench, I found tcmalloc consumed a lot of CPU:

Samples: 265K of event 'cycles', Event count (approx.): 104385445900
+ 27.58% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::FetchFromSpans()
+ 15.25% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long,
+ 12.20% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
+ 1.63% perf [.] append_chain
+ 1.39% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
+ 1.02% libtcmalloc.so.4.1.0 [.] tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)
+ 0.85% libtcmalloc.so.4.1.0 [.] 0x00017e6f
+ 0.75% libtcmalloc.so.4.1.0 [.] tcmalloc::ThreadCache::IncreaseCacheLimitLocked()
+ 0.67% libc-2.12.so [.] memcpy
+ 0.53% libtcmalloc.so.4.1.0 [.] operator delete(void*)

Ceph version:
# ceph --version
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)

Kernel version: 3.10.83

Is this phenomenon normal? Is there any idea about this problem?

Thanks.
Ye
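(For reference, a sketch of the workaround mentioned above; it only works with gperftools/tcmalloc newer than 2.1, and the daemon id is an example:)

    # start an OSD with a 128MB tcmalloc thread cache
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728 \
        ceph-osd -i 0 --cluster ceph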
[ceph-users] Re: Question
Hi!

You can run mons on the same hosts, though it is not recommended. The MON daemon itself is not resource-hungry - 1-2 cores and 2-4GB RAM are enough in most small installs. But there are some pitfalls:

- MONs use LevelDB as a backing store and make wide use of direct writes to ensure DB consistency. So if a MON daemon coexists with OSDs not only on the same host, but on the same volume/disk/controller, it will severely reduce the disk io available to the OSDs, and thus greatly reduce overall performance. Moving the MONs' root to a separate spindle, or better, a separate SSD, will keep the MONs running fine alongside OSDs on the same host.

- When the cluster is in a healthy state, MONs are not resource-consuming, but when the cluster is in a changing state (adding/removing OSDs, backfilling, etc.) the CPU and memory usage of a MON can rise significantly.

And yes, in a small cluster it is not always possible to get 3 separate hosts for MONs only.

Megov Igor
CIO, Yuterra

From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of Luis Periquito periqu...@gmail.com
Sent: 17 August 2015 17:09
To: Kris Vaes
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Question

Yes. The issue is resource sharing, as usual: the MONs will use disk I/O, memory and CPU. If the cluster is small (test?) then there's no problem in using the same disks. If the cluster starts to get bigger you may want to dedicate resources (e.g. the disk for the MONs isn't used by an OSD). If the cluster is big enough you may want to dedicate a node to being a MON.

On Mon, Aug 17, 2015 at 2:56 PM, Kris Vaes k...@s3s.eu wrote:

> Hi,
>
> Maybe this seems like a strange question, but I could not find this info in the docs. For a ceph cluster you need osd daemons and monitor daemons. On a host you can run several osd daemons (best one per drive, as read in the docs). But now my question: can you run the monitor daemon on the same host where you already run some osd daemons? Is this possible, and what are the implications of doing this?
>
> Met Vriendelijke Groeten / Kind Regards
> Kris
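(A sketch of the separate-spindle advice above; the device and mount point are examples - the point is only that the MON's LevelDB store should not share a disk with OSD data or journals:)

    mkfs.xfs /dev/sdk1                   # small dedicated SSD partition
    mount /dev/sdk1 /var/lib/ceph/mon    # default mon data location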
[ceph-users] Fwd: Repair inconsistent pgs..
---------- Forwarded message ----------
From: Voloshanenko Igor igor.voloshane...@gmail.com
Date: Tuesday, 18 August 2015
Subject: Repair inconsistent pgs..
To: Irek Fasikhov malm...@gmail.com

Some additional information (Tnx Irek for the questions!)

Pool values:

root@test:~# ceph osd pool get cold-storage size
size: 3
root@test:~# ceph osd pool get cold-storage min_size
min_size: 2

Broken pgs dump, PG 1:

{
  state: active+clean+inconsistent,
  snap_trimq: [],
  epoch: 17541,
  up: [56, 10, 42],
  acting: [56, 10, 42],
  actingbackfill: [10, 42, 56],
  info: {
    pgid: 2.c4,
    last_update: 17541'29153,
    last_complete: 17541'29153,
    log_tail: 16746'26095,
    last_user_version: 401173,
    last_backfill: MAX,
    purged_snaps: [1~1,6~1,8~3,11~2,17~2,1f~2,25~1,28~1,2c~5,32~4,37~1,39~7,41~5,47~16,5e~19,cb~1,ce~2,d4~7,dc~1,de~1,e6~4,102~1,105~6,10d~1,119~1,150~1,15d~2,160~3,16d~1,16f~5,178~1,184~2,194~1,1a2~1,1a5~1,1ac~2,1c7~1,1cb~2,1ce~1],
    history: {
      epoch_created: 98,
      last_epoch_started: 17531,
      last_epoch_clean: 17541,
      last_epoch_split: 0,
      same_up_since: 17139,
      same_interval_since: 17530,
      same_primary_since: 17530,
      last_scrub: 17541'29114,
      last_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_deep_scrub: 17541'29114,
      last_deep_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_clean_scrub_stamp: 2015-08-05 17:23:45.251731
    },
    stats: {
      version: 17541'29153,
      reported_seq: 21552,
      reported_epoch: 17541,
      state: active+clean+inconsistent,
      last_fresh: 2015-08-18 07:48:37.667036,
      last_change: 2015-08-18 07:37:04.568541,
      last_active: 2015-08-18 07:48:37.667036,
      last_peered: 2015-08-18 07:48:37.667036,
      last_clean: 2015-08-18 07:48:37.667036,
      last_became_active: 0.00,
      last_became_peered: 0.00,
      last_unstale: 2015-08-18 07:48:37.667036,
      last_undegraded: 2015-08-18 07:48:37.667036,
      last_fullsized: 2015-08-18 07:48:37.667036,
      mapping_epoch: 17140,
      log_start: 16746'26095,
      ondisk_log_start: 16746'26095,
      created: 98,
      last_epoch_clean: 17541,
      parent: 0.0,
      parent_split_bits: 0,
      last_scrub: 17541'29114,
      last_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_deep_scrub: 17541'29114,
      last_deep_scrub_stamp: 2015-08-18 07:37:04.567973,
      last_clean_scrub_stamp: 2015-08-05 17:23:45.251731,
      log_size: 3058,
      ondisk_log_size: 3058,
      stats_invalid: 0,
      stat_sum: {
        num_bytes: 2236608990,
        num_objects: 307,
        num_object_clones: 7,
        num_object_copies: 921,
        num_objects_missing_on_primary: 0,
        num_objects_degraded: 0,
        num_objects_misplaced: 0,
        num_objects_unfound: 0,
        num_objects_dirty: 307,
        num_whiteouts: 0,
        num_read: 15694,
        num_read_kb: 401354,
        num_write: 55720,
        num_write_kb: 2539827,
        num_scrub_errors: 1,
        num_shallow_scrub_errors: 1,
        num_deep_scrub_errors: 0,
        num_objects_recovered: 1842,
        num_bytes_recovered: 13419653940,
        num_keys_recovered: 36,
        num_objects_omap: 1,
        num_objects_hit_set_archive: 0,
        num_bytes_hit_set_archive: 0
      },
      up: [56, 10, 42],
      acting: [56, 10, 42],
      blocked_by: [],
      up_primary: 56,
      acting_primary: 56
    },
    empty: 0,
    dne: 0,
    incomplete: 0,
    last_epoch_started: 17531,
    hit_set_history: {
      current_last_update: 0'0,
      current_last_stamp: 0.00,
      current_info: { begin: 0.00, end: 0.00, version: 0'0 },
      history: []
    }
  },
  peer_info: [
    {
      peer: 10,
      pgid: 2.c4,
      last_update: 17541'29153,
      last_complete: 17541'29153,
      log_tail: 16746'25703,
      last_user_version: 400914,
      last_backfill: MAX,
      purged_snaps:
Re: [ceph-users] How repair 2 invalids pgs
Le 14/08/2015 15:48, Pierre BLONDEAU a écrit :

> Hi,
>
> Yesterday I removed 5 osds out of 15 from my cluster (machine migration). When I stopped the processes, I hadn't verified that all the pgs were in an active state. I removed the 5 osds from the cluster (ceph osd out osd.9; ceph osd crush rm osd.9; ceph auth del osd.9; ceph osd rm osd.9), and I checked after... and I had two inactive pgs. I have not formatted the filesystems of the osds.
>
> The health:
>
> pg 7.b is stuck inactive for 86083.236722, current state inactive, last acting [1,2]
> pg 7.136 is stuck inactive for 86098.214967, current state inactive, last acting [4,7]
>
> The recovery state:
>
> recovery_state: [
>   { name: Started\/Primary\/Peering\/WaitActingChange,
>     enter_time: 2015-08-13 15:19:49.559965,
>     comment: waiting for pg acting set to change },
>   { name: Started,
>     enter_time: 2015-08-13 15:19:46.492625 }],
>
> How can I solve my problem? Can I re-add the osds from their filesystems? My cluster is used for rbd images and a little cephfs share. I can read all files in cephfs, and I checked all the images to verify whether they use these pgs; I didn't find anything, but I'm not sure of my script. How do you know if a pg is used?
>
> Regards

Hello,

The names of the pgs start with "7.", so they are used by the pool with id 7? For me, that is cephfs_meta (the cephfs metadata pool). I get no response when I run "rados -p cephfs_meta ls". Since it's a small share, it's not serious; I can restore it easily.

So I added the new OSDs of the new machine, and it solved the problem, but I don't understand why. So, if someone has an idea?

Regards

PS: I use 0.80.10 on wheezy

--
Pierre BLONDEAU
Administrateur Systèmes & réseaux
Université de Caen
Laboratoire GREYC, Département d'informatique
tel : 02 31 56 75 42
bureau : Campus 2, Science 3, 406
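(On Pierre's question of how to tell which pool a pg belongs to: the number before the dot in a pgid is the pool id, so a quick check looks like this, with the pool name taken from the mail above:)

    ceph osd lspools           # lists pool ids and names, e.g. "7 cephfs_meta"
    ceph pg map 7.b            # shows the up/acting osd sets for that pg
    rados -p cephfs_meta ls    # lists the objects stored in the pool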
Re: [ceph-users] Repair inconsistent pgs..
No. This will no help ((( I try to found data, but it's look exist with same time stamp on all osd or missing on all osd ... So, need advice , what I need to do... вторник, 18 августа 2015 г. пользователь Abhishek L написал: Voloshanenko Igor writes: Hi Irek, Please read careful ))) You proposal was the first, i try to do... That's why i asked about help... ( 2015-08-18 8:34 GMT+03:00 Irek Fasikhov malm...@gmail.com javascript:;: Hi, Igor. You need to repair the PG. for i in `ceph pg dump| grep inconsistent | grep -v 'inconsistent+repair' | awk {'print$1'}`;do ceph pg repair $i;done С уважением, Фасихов Ирек Нургаязович Моб.: +79229045757 2015-08-18 8:27 GMT+03:00 Voloshanenko Igor igor.voloshane...@gmail.com javascript:;: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! I've had an inconsistent pg once, but it was a different sort of an error (some sort of digest mismatch, where the secondary object copies had later timestamps). 
This was fixed by moving the object away and restarting the OSD; it got fixed when the OSD peered, similar to what was mentioned in Sébastien Han's blog[1]. I'm guessing the same method will solve this error as well, but I'm not completely sure; maybe someone else who has seen this particular error could guide you better. [1]: http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/ -- Abhishek ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
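For the archive, the move-the-object-away procedure from [1] looks roughly like this. This is only a sketch: the pg/osd ids come from the log output above, the filestore path and init commands are assumptions for a default Firefly/Hammer install, and whether this is safe for the "expected clone" case is exactly the open question in this thread:

  # 1. locate the suspect object on each OSD in the acting set [56,15,29]
  find /var/lib/ceph/osd/ceph-56/current/2.490_head/ -name '*1631755377d7e*' -ls
  # 2. on the replica that disagrees, stop the OSD and flush its journal
  service ceph stop osd.56          # or: stop ceph-osd id=56, depending on init system
  ceph-osd -i 56 --flush-journal
  # 3. move the bad copy out of the pg directory (placeholder filename)
  mv /var/lib/ceph/osd/ceph-56/current/2.490_head/<bad-object-file> /root/backup/
  # 4. restart the OSD and trigger the repair
  service ceph start osd.56
  ceph pg repair 2.490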
Re: [ceph-users] Stuck creating pg
1) No errors at all. At log level 20 the OSD does not say anything about the missing placement group. 2) I tried that. Several times actually, also for the secondary OSDs, but it does not work. gr, Bart On Tue, Aug 18, 2015 at 4:28 AM minchen minche...@outlook.com wrote: osd.19 is blocked by pg creating and 19 client ops, 1. check osd.19's log to see if there are any errors 2. if not, out 19 from the osdmap to remap pg 5.6c7: ceph osd out 19 // this will cause data migration I am not sure whether this will help you! -- Original -- *From: * Bart Vanbrabant;b...@vanbrabant.eu; *Date: * Mon, Aug 17, 2015 10:14 PM *To: * minchenminche...@outlook.com; ceph-users ceph-users@lists.ceph.com; *Subject: * Re: [ceph-users] Stuck creating pg 1) ~# ceph pg 5.6c7 query Error ENOENT: i don't have pgid 5.6c7 In the osd log: 2015-08-17 16:11:45.185363 7f311be40700 0 osd.19 64706 do_command r=-2 i don't have pgid 5.6c7 2015-08-17 16:11:45.185380 7f311be40700 0 log_channel(cluster) log [INF] : i don't have pgid 5.6c7 2) I do not see anything wrong with this rule: { rule_id: 0, rule_name: data, ruleset: 0, type: 1, min_size: 1, max_size: 10, steps: [ { op: take, item: -1, item_name: default }, { op: chooseleaf_firstn, num: 0, type: host }, { op: emit } ] }, 3) I rebooted all machines in the cluster and increased the replication level of the affected pool to 3, to be more sure. After recovery from this reboot we are in the following state: HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; 103 requests are blocked 32 sec; 2 osds have slow requests; pool volumes pg_num 2048 pgp_num 1400 pg 5.6c7 is stuck inactive since forever, current state creating, last acting [19,25,17] pg 5.6c7 is stuck unclean since forever, current state creating, last acting [19,25,17] 103 ops are blocked 524.288 sec 19 ops are blocked 524.288 sec on osd.19 84 ops are blocked 524.288 sec on osd.25 2 osds have slow requests pool volumes pg_num 2048 pgp_num 1400 Thanks, Bart On 08/17/2015 03:44 PM, minchen wrote: It looks like the crush rule doesn't work properly after the osdmap changed; there are 3 unclean pgs: 5.6c7 5.2c7 15.2bd I think you can try the following to help locate the problem: 1st, run ceph pg <pgid> query to look up the details of the pg state, e.g. which osd is it blocked by? 2nd, check the crush rule with ceph osd crush rule dump and check the crush_ruleset for pools 5 and 15, e.g. chooseleaf may not be choosing the right osd? minchen -- Original -- *From: * Bart Vanbrabant;b...@vanbrabant.eu; *Date: * Sun, Aug 16, 2015 07:27 PM *To: * ceph-users ceph-users@lists.ceph.com; *Subject: * [ceph-users] Stuck creating pg Hi, I have a ceph cluster with 26 OSDs in 4 hosts, only used for RBD for an OpenStack cluster (started at 0.48 I think), currently running 0.94.2 on Ubuntu 14.04. A few days ago one of the OSDs was at 85% disk usage while only 30% of the raw disk space was used. I ran reweight-by-utilization with 150 as the cutoff level. This reshuffled the data. I also noticed that the number of PGs was still at the level from when there were fewer disks in the cluster (1300). Based on the current guidelines I increased pg_num to 2048. It created the placement groups except for the last one. To try to force the creation of the pg I removed the OSDs (ceph osd out) assigned to that pg, but that made no difference.
Currently all OSD's are back in and two pg's are also stuck in an unclean state: ceph health detail: HEALTH_WARN 2 pgs degraded; 2 pgs stale; 2 pgs stuck degraded; 1 pgs stuck inactive; 2 pgs stuck stale; 3 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; 59 requests are blocked 32 sec; 3 osds have slow requests; recovery 221/549658 objects degraded (0.040%); recovery 221/549658 objects misplaced (0.040%); pool volumes pg_num 2048 pgp_num 1400 pg 5.6c7 is stuck inactive since forever, current state creating, last acting [19,25] pg 5.6c7 is stuck unclean since forever, current state creating, last acting [19,25] pg 5.2c7 is stuck unclean for 313513.609864, current state stale+active+undersized+degraded+remapped, last acting [9] pg 15.2bd is stuck unclean for 313513.610368, current state stale+active+undersized+degraded+remapped, last acting [9] pg 5.2c7 is stuck undersized for 308381.750768, current state stale+active+undersized+degraded+remapped, last acting [9] pg 15.2bd is stuck undersized for 308381.751913, current state stale+active+undersized+degraded+remapped, last acting [9] pg 5.2c7 is stuck degraded for 308381.750876, current state
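One more hammer-era command that is sometimes suggested for a pg that never leaves the creating state; a hedged sketch only, to be used with care and ideally after advice from the list:

  # re-issue creation of the stuck pg
  ceph pg force_create_pg 5.6c7
  # then watch whether it peers instead of sitting in "creating"
  ceph -w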
[ceph-users] radosgw-agent keeps syncing most active bucket - ignoring others
Hi, from the radosgw-agent docs and some threads on this list, I understood that the max-entries argument was there to prevent a very active bucket from keeping the other buckets from being synced. In our agent logs, however, we saw a lot of bucket instance bla has 1000 entries after bla messages, and the agent kept on syncing that active bucket. Looking at the code, in class DataWorkerIncremental, it looks like the agent loops fetching log entries from the bucket until it receives fewer entries than max_entries. Is this intended behaviour? I would expect it to just pass the max_entries log entries for processing and advance the marker. Is there any other way to make sure less active buckets are synced frequently? We've tried increasing num-workers, but this only affects the first pass. Thanks, Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: Emmanuel Florac [mailto:eflo...@intellique.com] Sent: 18 August 2015 12:26 To: Nick Fisk n...@fisk.me.uk Cc: 'Jan Schermer' j...@schermer.cz; 'Alex Gorbachev' ag@iss- integration.com; 'Dominik Zalewski' dzalew...@optlink.net; ceph- us...@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Le Tue, 18 Aug 2015 10:12:59 +0100 Nick Fisk n...@fisk.me.uk écrivait: Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Please note, it looks like the main(only?) dev of Bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project. bcachefs is more than a new version of bcache, it's a complete POSIX filesystem with integrated caching. Looks like a silly idea if you ask me (because we already have several excellent filesystems; because developing a reliable filesystem is DAMN HARD; because building a feature-complete FS is CRAZY HARD; because FTL sucks anyway; etc). Agreed, it's such a shame that there isn't a simple, reliable and maintained caching solution out there for Linux. When I started seeing all these projects spring up 5-6 years ago I was full of optimism, but we still don't have anything I would call fully usable. -- Emmanuel Florac | Direction technique | Intellique | eflo...@intellique.com | +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a Battery backed raid controller caching all writes. It's amazing the 100x performance increase you get with RBD's when doing sync writes and give it something like just 1GB write back cache with flashcache. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-) Jan On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. 
Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list
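For reference, the 1GB flashcache writeback setup Nick describes above would be created roughly like this; the device names are assumptions, and remember the barrier caveat raised elsewhere in this thread:

  # 1GB writeback cache in front of an RBD device; appears as /dev/mapper/rbd_wb
  flashcache_create -p back -s 1g rbd_wb /dev/sdb1 /dev/rbd0
  mkfs.xfs /dev/mapper/rbd_wb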
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 10:12:59 +0100, Nick Fisk n...@fisk.me.uk wrote: Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Please note, it looks like the main (only?) dev of Bcache has started making a new version of bcache, bcachefs. At this stage I'm not sure what this means for the ongoing support of the existing bcache project. bcachefs is more than a new version of bcache, it's a complete POSIX filesystem with integrated caching. Looks like a silly idea if you ask me (because we already have several excellent filesystems; because developing a reliable filesystem is DAMN HARD; because building a feature-complete FS is CRAZY HARD; because FTL sucks anyway; etc). -- Emmanuel Florac | Direction technique | Intellique | eflo...@intellique.com | +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Rename Ceph cluster
Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rename Ceph cluster
I think it's pretty clear: http://ceph.com/docs/master/install/manual-deployment/ For example, when you run multiple clusters in a federated architecture, the cluster name (e.g., us-west, us-east) identifies the cluster for the current CLI session. Note: To identify the cluster name on the command line interface, specify a Ceph configuration file with the cluster name (e.g., ceph.conf, us-west.conf, us-east.conf, etc.). Also see CLI usage (ceph --cluster {cluster-name}). But it could be tricky on the OSDs that are running, depending on the distribution initscripts - you could find out that you can't service ceph stop osd... anymore after the change, since it can't find its pidfile anymore. Looking at the CentOS initscript it looks like it accepts a -c conffile argument though. (So you should be managing OSDs with -c ceph-prod.conf now?) Jan On 18 Aug 2015, at 14:13, Erik McCormick emccorm...@cirrusseven.com wrote: I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
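In shell terms the rename boils down to the config (and keyring) file names, since those follow the cluster name; a sketch, with paths assumed from the usual $cluster naming conventions:

  # adopt the default name on a client: just rename the files
  cp /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf
  cp /etc/ceph/ceph-prod.client.admin.keyring /etc/ceph/ceph.client.admin.keyring
  # or keep the custom name and say so explicitly
  ceph --cluster ceph-prod status
  ceph -c /etc/ceph/ceph-prod.conf -k /etc/ceph/ceph-prod.client.admin.keyring status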
Re: [ceph-users] Repair inconsistent pgs..
From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen and I'd expect the pg repair to handle that but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue? -Greg On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, at our production cluster, due high rebalancing ((( we have 2 pgs in inconsistent state... root@temp:~# ceph health detail | grep inc HEALTH_ERR 2 pgs inconsistent; 18 scrub errors pg 2.490 is active+clean+inconsistent, acting [56,15,29] pg 2.c4 is active+clean+inconsistent, acting [56,10,42] From OSD logs, after recovery attempt: root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done dumped all in format plain instructing pg 2.490 on osd.56 to repair instructing pg 2.c4 on osd.56 to repair /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2 /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2 /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2 /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2 /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2 /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2 /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2 /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2 /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors So, how i can solve expected clone situation by hand? Thank in advance! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rename Ceph cluster
On 18-08-15 14:13, Erik McCormick wrote: I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. To add to it: internally a Ceph cluster ONLY uses the fsid, which you can find in the OSDMap and on all the data dirs of the OSDs. The cluster name is indeed nothing more than a reference to a specific configuration file. Wido -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
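A quick way to verify that on a live cluster; the OSD data path below assumes the default layout, where the directory name embeds the cluster name:

  # the fsid is what identifies the cluster internally
  ceph --cluster ceph-prod fsid
  # the same fsid is stamped on every OSD data dir
  cat /var/lib/ceph/osd/ceph-prod-0/fsid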
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? 
Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] radosgw-agent keeps syncing most active bucket - ignoring others
Hmm, looks like intended behaviour: SNIP CommitDate: Mon Mar 3 06:08:42 2014 -0800 worker: process all bucket instance log entries at once Currently if there are more than max_entries in a single bucket instance's log, only max_entries of those will be processed, and the bucket instance will not be examined again until it is modified again. To keep it simple, get the entire log of entries to be updated and process them all at once. This means one busy shard may block others from syncing, but multiple instances of radosgw-agent can be run to circumvent that issue. With only one instance, users can be sure everything is synced when an incremental sync completes with no errors. /SNIP However, this brings us to a new issue. After starting a second agent, one of the agents was busy syncing the busy shard and the other agent correctly synced all of the other buckets. So far, so good. But, since a few of those buckets are almost static, it looks like it started syncing them in a second run from the beginning all over again. As versioning was enabled on those buckets after they were created, with existing and removed objects already in there, it seems like the agent is copying those unversioned objects to versioned ones, creating a lot of delete markers and multiple versions in the secondary zone. Does anyone have an idea how to handle this correctly? I already did a cleanup some weeks ago, but if the agent is going to keep restarting the sync from the beginning, I'll have to clean up every time. regards, Sam On 18-08-15 09:36, Sam Wouters wrote: Hi, from the radosgw-agent docs and some threads on this list, I understood that the max-entries argument was there to prevent a very active bucket from keeping the other buckets from being synced. In our agent logs, however, we saw a lot of bucket instance bla has 1000 entries after bla messages, and the agent kept on syncing that active bucket. Looking at the code, in class DataWorkerIncremental, it looks like the agent loops fetching log entries from the bucket until it receives fewer entries than max_entries. Is this intended behaviour? I would expect it to just pass the max_entries log entries for processing and advance the marker. Is there any other way to make sure less active buckets are synced frequently? We've tried increasing num-workers, but this only affects the first pass. Thanks, Sam ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
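Running the second agent mentioned above is just launching another copy against the same configuration; the conf path and invocation here are assumptions (max-entries and num-workers are the options referred to in this thread):

  # two instances; shard locking lets them make progress on different buckets
  radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --max-entries 1000 &
  radosgw-agent -c /etc/ceph/radosgw-agent/default.conf --max-entries 1000 &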
Re: [ceph-users] any recommendation of using EnhanceIO?
We've been using an extra caching layer for Ceph since the beginning for our older Ceph deployments. All new deployments go with full SSDs. So far I've tested: - EnhanceIO - Flashcache - Bcache - dm-cache - dm-writeboost The best working solution was and is bcache, except for its buggy code. The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g. discards result in a crashed FS after reboots, and so on. But it's still the fastest for Ceph. The 2nd best solution, which we already use in production, is dm-writeboost (https://github.com/akiradeveloper/dm-writeboost). Everything else is too slow. Stefan On 18.08.2015 at 13:33, Mark Nelson wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST.
Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
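For anyone wanting to reproduce the bcache numbers, the basic setup is sketched below; the device names are assumptions, the cache-set UUID comes from bcache-super-show or ls /sys/fs/bcache, and per the warning above test discards carefully on your kernel:

  # format cache and backing devices (bcache-tools)
  make-bcache -C /dev/nvme0n1
  make-bcache -B /dev/sdb
  # udev normally registers them automatically; manually:
  echo /dev/sdb > /sys/fs/bcache/register
  # attach the backing device to the cache set and enable writeback
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode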
Re: [ceph-users] How to improve single thread sequential reads?
Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq and I've never used the RBD device directly (the kernel client). Does it show in /sys/block ? Can you dump find /sys/block/$rbd in here? 
Jan On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net wrote: Hi Nick, did you do anything fancy to get to ~90MB/s in the first place? I'm stuck at ~30MB/s reading cold data. single-threaded-writes are quite speedy, around 600MB/s. radosgw for cold data is around the 90MB/s, which is imho limitted by the speed of a single disk. Data already present on the osd-os-buffers arrive with around 400-700MB/s so I don't think the network is the culprit. (20 node cluster, 12x4TB 7.2k disks, 2 ssds for journals for 6 osds each, lacp 2x10g bonds) rados bench single-threaded performs equally bad, but with its default multithreaded settings it generates wonderful numbers, usually only limiited by linerate and/or interrupts/s. I just gave kernel 4.0 with its rbd-blk-mq feature a shot, hoping to get to your wonderful numbers, but it's staying below 30 MB/s. I was thinking about using a software raid0 like you did but that's imho really ugly. When I know I needed something speedy, I usually just started dd-ing the file to /dev/null and wait for about three minutes before starting the actual job; some sort of hand-made read-ahead for dummies. Thx in advance Benedikt 2015-08-17 13:29 GMT+02:00 Nick Fisk n...@fisk.me.uk: Thanks for the replies guys. The client is set to 4MB, I haven't played with the OSD side yet as I wasn't sure if it would make much difference, but
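To answer the /sys/block question directly: yes, the krbd device shows up there, and the knobs discussed in this thread live under its queue directory. A sketch, using rbd0 and the 4MB readahead value mentioned above:

  ls /sys/block/rbd0/queue/
  # max_hw_sectors_kb equals the object size, per Nick's observation
  cat /sys/block/rbd0/queue/max_hw_sectors_kb
  # bump readahead to 4MB
  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb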
Re: [ceph-users] How to improve single thread sequential reads?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:41 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to improve single thread sequential reads? Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be a lot more about enhancing the overall block layer performance in Linux https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec hanism_(blk-mq) So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... 
I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? Yes striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find work arounds to boost performance. Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSD's. This is much the same way as sequential access performance varies with a RAID array. If your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDS. It should run nicely with a real workload once readahead kicks in and the queue fills up. But again - not sure how that works with blk-mq and I've never used the RBD device directly (the kernel client). Does it show in /sys/block ? Can you dump find /sys/block/$rbd in here? Jan On 18 Aug 2015, at 12:25, Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net wrote: Hi Nick, did you do anything fancy to get to ~90MB/s in the first place? I'm stuck
Re: [ceph-users] How to improve single thread sequential reads?
On 18 Aug 2015, at 13:58, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:41 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] How to improve single thread sequential reads? Reply in text On 18 Aug 2015, at 12:59, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 11:50 To: Benedikt Fraunhofer given.to.lists.ceph- users.ceph.com.toasta@traced.net Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] How to improve single thread sequential reads? I'm not sure if I missed that but are you testing in a VM backed by RBD device, or using the device directly? I don't see how blk-mq would help if it's not a VM, it just passes the request to the underlying block device, and in case of RBD there is no real block device from the host perspective...? Enlighten me if I'm wrong please. I have some Ubuntu VMs that use blk-mq for virtio-blk devices and makes me cringe because I'm unable to tune the scheduler and it just makes no sense at all...? Since 4.0 (I think) the Kernel RBD client now uses the blk-mq infrastructure, but there is a bug which limits max IO sizes to 128kb, which is why for large block/sequential that testing kernel is essential. I think this bug fix should make it to 4.2 hopefully. blk-mq is supposed to remove redundancy of having IO scheduler in VM - VM block device - host IO scheduler - block device it's a paravirtualized driver that just moves requests from inside the VM to the host queue (and this is why inside the VM you have no IO scheduler options - it effectively becomes noop). But this just doesn't make sense if you're using qemu with librbd - there's no host queue. It would make sense if the qemu drive was krbd device with a queue. If there's no VM there should be no blk-mq? I think you might be thinking about the virtio-blk driver for blk-mq. Blk-mq itself seems to be a lot more about enhancing the overall block layer performance in Linux https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mec hanism_(blk-mq) So what was added to the kernel was probably the host-side infrastructure to handle blk-mq in guest passthrough to the krdb device, but that's probably not your case, is it? Anyway I'd try to bump up read_ahead_kb first, and max_hw_sectors_kb (to make sure it gets into readahead), also try (if you're not using blk-mq) to a cfq scheduler and set it to rotational=1. I see you've also tried this, but I think blk-mq is the limiting factor here now. I'm pretty sure you can't adjust the max_hw_sectors_kb (which equals object size, from what I can tell) and the max_sectors_kb is already set at the hw_max. But it would sure be nice if the max_hw_sectors_kb could be set higher though, but I'm not sure if there is a reason for this limit. If you are running a single-threaded benchmark like rados bench then what's limiting you is latency - it's not surprising it scales up with more threads. Agreed, but with sequential workloads, if you can get readahead working properly then you can easily remove this limitation as a single threaded op effectively becomes multithreaded. 
Thinking on this more - I don't know if this will help after all, it will still be a single thread, just trying to get ahead of the client IO - and that's not likely to happen unless you can read the data in userspace slower than what Ceph can read... I think striping multiple device could be the answer after all. But have you tried creating the RBD volume as striped in Ceph? Yes striping would probably give amazing performance, but the kernel client currently doesn't support it, which leaves us in the position of trying to find work arounds to boost performance. Although the client read is single threaded, the RBD/RADOS layer would split these larger readahead IOs into 4MB requests that would then be processed in parallel by the OSD's. This is much the same way as sequential access performance varies with a RAID array. If your IO size matches the stripe size of the array then you get nearly the bandwidth of all disks involved. I think in Ceph the effective stripe size is the object size * #OSDS. Hmmm... RBD - PG - objects: stripe_unit (more commonly called stride) bytes are put into stripe_count objects - not OSDs, but it's possible you'll hit all OSDs with a small enough stride and a large enough stripe_count... I have no idea how well that works in practice on current Ceph releases, my Dumpling experience is probably useless here. So we're back at striping with mdraid I guess ... :) It should run nicely with a real workload
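For completeness, a striped image can already be created with librbd even though the kernel client cannot map it yet; the values below are illustrative only:

  # format 2 image: 4MB objects (order 22), 1MB stripe unit, spread over 16 objects
  rbd create rbd/striped-vol --image-format 2 --size 102400 --order 22 --stripe-unit 1048576 --stripe-count 16
  rbd info rbd/striped-vol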
Re: [ceph-users] Rename Ceph cluster
I've got a custom named cluster integrated with Openstack (Juno) and didn't run into any hard-coded name issues that I can recall. Where are you seeing that? As to the name change itself, I think it's really just a label applying to a configuration set. The name doesn't actually appear *in* the configuration files. It stands to reason you should be able to rename the configuration files on the client side and leave the cluster alone. It'd be worth trying in a test environment anyway. -Erik On Aug 18, 2015 7:59 AM, Jan Schermer j...@schermer.cz wrote: This should be simple enough mv /etc/ceph/ceph-prod.conf /etc/ceph/ceph.conf No? :-) Or you could set this in nova.conf: images_rbd_ceph_conf=/etc/ceph/ceph-prod.conf Obviously since different parts of openstack have their own configs, you'd have to do something similar for cinder/glance... so not worth the hassle. Jan On 18 Aug 2015, at 13:50, Vasiliy Angapov anga...@gmail.com wrote: Hi, Does anyone know what steps should be taken to rename a Ceph cluster? Btw, is it ever possible without data loss? Background: I have a cluster named ceph-prod integrated with OpenStack, however I found out that the default cluster name ceph is very much hardcoded into OpenStack so I decided to change it to the default value. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
Yes, writeback mode. I didn't try anything else. Jan On 18 Aug 2015, at 18:30, Alex Gorbachev a...@iss-integration.com wrote: HI Jan, On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer j...@schermer.cz wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we run DB2 on it it panicked within minutes and took all the data with it (almost literally - files that werent touched, like OS binaries were b0rked and the filesystem was unsalvageable). Out of curiosity, were you using EnhanceIO in writeback mode? I assume so, as a read cache should not hurt anything. Thanks, Alex If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching dirty watermark and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you than go for it, it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persisent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-) Jan On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change device name, unlike bcache, flashcache etc. Best regards, Alex On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, to take it with a grain of salt, but that's what I would recommend. Daniel From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO? Hi, I’ve asked same question last weeks or so (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO I’m keen to try flashcache or bcache (its been in the mainline kernel for some time) Dominik On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, Is anyone out there that implement enhanceIO in a production environment? any recommendation? any perf output to share with the diff between using it and not? 
Thanks in advance, German ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 17:13 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? On 18 Aug 2015, at 16:44, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO? On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of proper writeback cache ruled it out for me. It only performs write back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing what a 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of writeback cache with flashcache. For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail. Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies < 1ms. I'm still testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using openstack and other direct RBD interfaces it may not be such an attractive option. I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSDs, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with Flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high iops/low latency stuff. Additional daemon that is persistent how? Isn't that what the journal does already, just too slowly? The journal is part of an OSD, and its speed is restricted by a lot of the functionality that Ceph has to provide. I was more thinking of a very lightweight service that acts as an interface between an SSD and librbd and is focussed on speed. For something like a standalone SQL server it might run on the SQL server with a local SSD, but in other scenarios you might have this service remote, where the SSDs are installed. HA for the SSD could be provided by RAID+dual port SAS, or maybe some sort of lightweight replication could be built into the service. This was just a random thought rather than something I have planned out, though. I think the best (and easiest!)
approach is to mimic what a monolithic SAN does. Currently: 1) client issues blocking/atomic/sync IO 2) rbd client sends this IO to all OSDs 3) after all OSDs process the IO, the IO is finished and considered persistent. That has serious implications: * every IO is processed separately, not much coalescing * OSD processes add latency when processing this IO * one OSD can be slow momentarily, IO backs up and the cluster stalls. Let me just select what processing the IO means with respect to my architecture and I can likely get a 100x improvement. Let me choose: 1) WHERE the IO is persisted. Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient? Not waiting for one slow OSD gives me at least some SLA for planned tasks like backfilling, scrubbing, deep-scrubbing. Hands up who can afford to leave deep-scrub enabled in production... In my testing the difference between 2 and 3 replicas wasn't that much, as once the primary OSD sends out the replicas they happen more or less in parallel. 2) WHEN the IO is persisted. Do I really need all OSDs to flush the data to disk? If all the nodes are in the same cabinet and on the same UPS then this makes sense. But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... When nukes start
Re: [ceph-users] any recommendation of using EnhanceIO?
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
Mark, my take is definitely write latency. Based on this discussion, there is no real safe solution for write caching outside Ceph.
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
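For anyone wanting to put a number on that write latency before and after a change, a minimal fio sketch (the pool and image names here are made up, and this assumes a fio build with the rbd engine):

  fio --name=write-latency --ioengine=rbd --clientname=admin \
      --pool=rbd --rbdname=test --rw=write --bs=4k \
      --iodepth=1 --direct=1 --runtime=60 --time_based

At queue depth 1 the completion-latency (clat) percentiles fio reports are effectively the per-write round trip to the cluster, which is the figure being discussed in this thread.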
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
Mark
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
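For option 4, a rough sketch of carving out a separate SSD pool on a hammer-era cluster (bucket, pool and rule names are invented here, and the CRUSH rule itself still has to be written by hand in the decompiled map):

  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit crush.txt: add an "ssd" root containing only the SSD OSDs,
  # plus a rule that takes from that root
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  ceph osd pool create ssd-pool 128 128
  ceph osd pool set ssd-pool crush_ruleset 1   # rule id from crush.txt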
Re: [ceph-users] any recommendation of using EnhanceIO?
1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client-side cache fails)?
2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an EC base tier though (but it would be a good idea!) -Sam
On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :) Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost-effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs. I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
1. Improve small IO write latency. As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of some sort of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
2. Cache Tiering. I know a lot of work is going into this currently, but I will cover my
[ceph-users] ceph-osd suddenly dies and no longer can be started
Hello. I have a small Ceph cluster running 9 OSDs, using XFS on separate disks and dedicated partitions on the system disk for journals. After creation it worked fine for a while. Then suddenly one of the OSDs stopped and wouldn't start. I had to recreate it. Recovery started. After a few days of recovery an OSD on another machine also stopped. I try to start it, it runs for a few minutes and dies; it looks like it is not able to replay the journal. According to strace, it tries to allocate too much memory and stops with ENOMEM. Sometimes it is killed by the kernel's OOM killer. I tried flushing the journal manually with `ceph-osd -i 3 --flush-journal`, but it didn't work either. The error log is as follows:
[root@assets-2 ~]# ceph-osd -i 3 --flush-journal
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0d 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2015-08-18 23:00:37.956714 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 225eff8c/default.4323.18_22783306dc51892b40b49e3e26f79baf_55c38b33172600566c46_s.jpeg/head//8 in index: (2) No such file or directory
2015-08-18 23:00:37.956741 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find 235eff8c/default.4323.16_3018ff7c6066bddc0c867b293724d7b1_dolar7_106_m.jpg/head//8 in index: (2) No such file or directory
skipped
2015-08-18 23:00:37.958424 7ff102040880 -1 filestore(/var/lib/ceph/osd/ceph-3) could not find c//head//8 in index: (2) No such file or directory
tcmalloc: large alloc 1073741824 bytes == 0x66b1 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 2147483648 bytes == 0xbf49 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 4294967296 bytes == 0x16e32 @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
tcmalloc: large alloc 8589934592 bytes == (nil) @ 0x7ff10115ae6a 0x7ff10117ad64 0x7ff0ffd4fc29 0x7ff0ffd5086b 0x7ff0ffd50914 0x7ff0ffd50b7f 0x968a0f 0xa572b3 0xa5c6b1 0xa5f762 0x9018ba 0x90238e 0x911b2c 0x915064 0x92d7cb 0x8ff890 0x642239 0x7ff0ff3daaf5 0x65cdc9 (nil)
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
*** Caught signal (Aborted) ** in thread 7ff102040880
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: ceph-osd() [0xac5642]
2: (()+0xf130) [0x7ff1009d4130]
3: (gsignal()+0x37) [0x7ff0ff3ee5d7]
4: (abort()+0x148) [0x7ff0ff3efcc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7ff0ffcf29b5]
6: (()+0x5e926) [0x7ff0ffcf0926]
7: (()+0x5e953) [0x7ff0ffcf0953]
8: (()+0x5eb73) [0x7ff0ffcf0b73]
9: (()+0x15d3e) [0x7ff10115ad3e]
10: (tc_new()+0x1e0) [0x7ff10117ade0]
11: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7ff0ffd4fc29]
12: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7ff0ffd5086b]
13: (std::string::reserve(unsigned long)+0x44) [0x7ff0ffd50914]
14: (std::string::append(char const*, unsigned long)+0x4f) [0x7ff0ffd50b7f]
15: (LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix(std::string const&)+0xdf) [0x968a0f]
16: (DBObjectMap::clear_header(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xd3) [0xa572b3]
17: (DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0xa1) [0xa5c6b1]
18: (DBObjectMap::clear(ghobject_t const&, SequencerPosition const*)+0x202) [0xa5f762]
19: (FileStore::lfn_unlink(coll_t, ghobject_t const&, SequencerPosition const&, bool)+0x16a) [0x9018ba]
20: (FileStore::_remove(coll_t, ghobject_t const&, SequencerPosition const&)+0x9e) [0x90238e]
21: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x252c) [0x911b2c]
22: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x64) [0x915064]
23: (JournalingObjectStore::journal_replay(unsigned long)+0x5db) [0x92d7cb]
24: (FileStore::mount()+0x3730) [0x8ff890]
25: (main()+0xec9) [0x642239]
26: (__libc_start_main()+0xf5) [0x7ff0ff3daaf5]
27: ceph-osd() [0x65cdc9]
2015-08-18 23:02:38.167194 7ff102040880 -1 *** Caught signal (Aborted) ** in thread 7ff102040880
I can recreate the filesystem on this OSD's disk and recreate the OSD, but I'm not sure that this won't happen with another OSD on this or another machine, and that eventually I won't lose all my data because it doesn't
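The trace shows journal replay (frame 23) building a single huge LevelDB transaction while clearing an object's omap, which is where the memory goes. A few things worth checking before recreating the OSD - a diagnostic sketch only, assuming the default filestore layout:

  # how big is the journal that replay has to chew through?
  ls -lh /var/lib/ceph/osd/ceph-3/journal
  # how big is the omap leveldb the replayed transactions rewrite?
  du -sh /var/lib/ceph/osd/ceph-3/current/omap
  # watch peak memory during the replay attempt
  /usr/bin/time -v ceph-osd -i 3 --flush-journal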
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs.
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :) Sort of like someone telling you their PC is broken and, when asked for details, getting "It's not working" in return.
In general I think a lot of it comes down to people not appreciating the differences between Ceph and, say, a RAID array. For most things like larger block IO, performance tends to scale with cluster size, and the cost-effectiveness of Ceph makes it a no-brainer to just add a handful of extra OSDs. I will try and be more precise. Here is my list of pain points / wishes that I have come across in the last 12 months of running Ceph.
1. Improve small IO write latency. As discussed in depth in this thread. If it's possible just to make Ceph a lot faster then great, but I fear even a doubling in performance will still fall short compared to caching writes at the client. Most things in Ceph tend to improve with scale, but write latency is the same with 2 OSDs as it is with 2000. I would urge some sort of investigation into the possibility of some sort of persistent librbd caching. This will probably help across a large number of scenarios, as in the end most things are affected by latency, and I think it will provide across-the-board improvements.
2. Cache Tiering. I know a lot of work is going into this currently, but I will cover my experience.
2A) Deletion of large RBDs takes forever. It seems to have to promote all objects, even non-existent ones, to the cache tier before it can delete them. Operationally this is really poor, as it has a negative effect on the cache tier contents as well.
2B) Erasure coding requires all writes to be promoted first. I think it should be pretty easy to allow proxy writes for erasure-coded pools if the IO size equals the object size. A lot of backup applications can be configured to write out in statically sized blocks and would be an ideal candidate for this sort of
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 20:48:26 +0100 Nick Fisk wrote: [mega snip]
4. Disk-based OSD with SSD journal performance. As I touched on earlier, I would expect a disk-based OSD with an SSD journal to have similar performance to a pure SSD OSD when dealing with sequential small IOs. Currently the levelDB sync and potentially other things slow this down.
Has anybody tried symlinking the omap directory to an SSD and tested if that makes a (significant) difference?
Christian
-- Christian Balzer / Network/Systems Engineer / ch...@gol.com / Global OnLine Japan/Fusion Communications http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
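Mechanically the experiment is simple - something along these lines (an untested sketch: default filestore paths, a made-up SSD mount point, and the OSD must be stopped first):

  service ceph stop osd.3
  mv /var/lib/ceph/osd/ceph-3/current/omap /mnt/ssd/osd-3-omap
  ln -s /mnt/ssd/osd-3-omap /var/lib/ceph/osd/ceph-3/current/omap
  service ceph start osd.3

The obvious caveat is that the OSD's omap then lives outside the OSD disk, so losing the SSD loses the OSD.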
Re: [ceph-users] any recommendation of using EnhanceIO?
On Tue, 18 Aug 2015 12:50:38 -0500 Mark Nelson wrote: [snap]
Probably the big question is what are the pain points? The most common answer we get when asking folks what applications they run on top of Ceph is "everything!". This is wonderful, but not helpful when trying to figure out what performance issues matter most! :)
Well, the "everything" answer really is the one everybody who runs VMs backed by RBD for internal or external customers will give. I.e. no idea what is installed and no control over how it accesses the Ceph cluster. And even when you think you have a predictable use case it might not be true. As in, one of our Ceph installs backs a ganeti cluster with hundreds of VMs running 2 types of applications, and from past experience I know their I/O patterns (nearly 100% write-only; any reads can usually be satisfied from the local or storage node pagecache). Thus the Ceph cluster was configured in a way that was optimized for this, and it worked beautifully until: a) scrubs became too heavy (generating too many read IOPS while also invalidating page caches) and b) somebody thought a 3rd type of VM, using Windows, with IOPS equal to dozens of the other types, would be a good idea.
IE, should we be focusing on IOPS? Latency? Finding a way to avoid journal overhead for large writes? Are there specific use cases where we should specifically be focusing attention? General iSCSI? S3? Databases directly on RBD? etc. There's tons of different areas that we can work on (general OSD threading improvements, different messenger implementations, newstore, client-side bottlenecks, etc) but all of those things tackle different kinds of problems.
All of these except S3 would have a positive impact on my various use cases. However, at the risk of sounding like a broken record, any time spent on these improvements before Ceph can recover from a scrub error fully autonomously (read: checksums) would be a waste in my book. All the speed in the world is pretty insignificant when a simple `ceph pg repair` (which is still in the Ceph docs w/o any qualification of what it actually does) has a good chance of wiping out good data by imposing "the primary OSD's view of the world on the replicas", to quote Greg.
Regards, Christian
-- Christian Balzer / Network/Systems Engineer / ch...@gol.com / Global OnLine Japan/Fusion Communications http://www.gol.com/
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
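For context on that repair point: the concern is that `ceph pg repair` copies the primary's version over the replicas without knowing which copy is actually good. A cautious sequence might look like this (a sketch - the PG id is invented, the paths assume filestore, and the object name is a placeholder):

  ceph health detail | grep inconsistent      # find the PG, e.g. 2.1a
  # compare the object's copies on each replica by hand first
  md5sum /var/lib/ceph/osd/ceph-*/current/2.1a_head/<object>*
  # only once you're satisfied the primary's copy is the good one:
  ceph pg repair 2.1a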
[ceph-users] [Cache-tier] librbd: error finding source object: (2) No such file or directory
Hi everyone, I have been using a cache tier on a data pool. After a long time, a lot of RBD images are no longer displayed by `rbd -p data ls`, although those images still show through the `rbd info` and `rados ls` commands:
rbd -p data info volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
rbd image 'volume-008ae4f7-3464-40c0-80b0-51140d8b95a8': size 128 GB in 32768 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.10c1c102eb141f2 format: 2 features: layering flags:
And:
rados -p data ls | grep 10c1c102eb141f2 # grep through block_name_prefix
=> shows: rbd_header.10c1c102eb141f2
Or:
rados -p data ls | grep volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
=> shows: rbd_id.volume-008ae4f7-3464-40c0-80b0-51140d8b95a8
Everything seems normal. But I tried to move (rename) the above image, and received the following error:
#rbd mv data/volume-008ae4f7-3464-40c0-80b0-51140d8b95a8 data/volume-008ae4f7-3464-40c0-80b0-51140d8b95a8_new
rbd: rename error: (2) No such file or directory
2015-08-19 10:46:07.175525 7fb8b0985840 -1 librbd: error finding source object: (2) No such file or directory
=> the rename spawned a new RBD and didn't delete the original.
And when deleting the image:
deleting data/volume-32e1fa85-2e03-4cbe-be36-09358aa6e7f4
Removing all snapshots: 100% complete...done.
Removing image: 99% complete...failed.
rbd: delete error: (2) No such file or directory
2015-08-19 11:27:17.904695 7f9c32217840 -1 librbd: error removing img from new-style directory: (2) No such file or directory
What happened to those RBDs, and how can I fix this error? Thanks so much!
-- Tuan - HaNoi, VietNam
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
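Worth noting for anyone hitting the same errors: `rbd ls` and renames for format-2 images go through the omap of the pool's rbd_directory object, so the symptoms above point at a missing or stale rbd_directory entry rather than at the image data itself. A hedged way to inspect it (the cache pool name is a placeholder):

  # does the directory object still know about the image?
  rados -p data listomapvals rbd_directory | grep -a volume-008ae4f7
  # with a cache tier, also look at the cached copy of the same object
  rados -p <cache-pool> listomapvals rbd_directory | grep -a volume-008ae4f7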
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Sam,
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel Just Sent: 18 August 2015 21:38 To: Nick Fisk n...@fisk.me.uk Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
1. We've kicked this around a bit. What kind of failure semantics would you be comfortable with here (that is, what would be reasonable behavior if the client-side cache fails)?
I would either expect to provide the cache with a redundant block device (i.e. RAID1 SSDs) or for the cache to allow itself to be configured to mirror across two SSDs. Of course single SSDs can be used if the user accepts the risk. If the cache did the mirroring then you could do fancy stuff like mirror the writes but leave the read cache blocks as single copies, to increase the cache capacity. In either case, although an outage is undesirable, it's only data loss that would be unacceptable, which would hopefully be avoided by the mirroring. As part of this, there would need to be a way to make sure a dirty RBD can't be accessed unless the corresponding cache is also attached. I guess as it is caching the RBD and not the pool or entire cluster, the cache only needs to match the failure requirements of the application it's caching. If I need to cache an RBD that is on a single server, there is no requirement to make the cache redundant across racks/PDUs/servers...etc. I hope I've answered your question?
2. We've got a branch which should merge soon (tomorrow probably) which actually does allow writes to be proxied, so that should alleviate some of these pain points somewhat. I'm not sure it is clever enough to allow through writefulls for an EC base tier though (but it would be a good idea!)
Excellent news, I shall look forward to testing it in the future. I did mention the proxy write for writefulls to someone who was working on the proxy write code, but I'm not sure if it ever got followed up.
-Sam On Tue, Aug 18, 2015 at 12:48 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 18:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:52 AM, Nick Fisk wrote: snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is
Re: [ceph-users] any recommendation of using EnhanceIO?
Hey Stefan, Are you using your Ceph cluster for virtualization storage? Is dm-writeboost configured on the OSD nodes themselves?
- Original Message - From: Stefan Priebe - Profihost AG s.pri...@profihost.ag To: Mark Nelson mnel...@redhat.com, ceph-users@lists.ceph.com Sent: Tuesday, August 18, 2015 7:36:10 AM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
We've been using an extra caching layer for Ceph since the beginning for our older Ceph deployments. All new deployments go with full SSDs. I've tested so far: - EnhanceIO - Flashcache - Bcache - dm-cache - dm-writeboost
The best working solution was and is bcache, except for its buggy code. The current code in the 4.2-rc7 vanilla kernel still contains bugs, e.g. discards result in crashed filesystems after reboots and so on. But it's still the fastest for Ceph. The 2nd best solution, which we already use in production, is dm-writeboost (https://github.com/akiradeveloper/dm-writeboost). Everything else is too slow.
Stefan
Am 18.08.2015 um 13:33 schrieb Mark Nelson: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
Hi, I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time). Dominik
On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not? Thanks in advance, German
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
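Since bcache keeps coming up as the fastest option, a minimal setup sketch for anyone who wants to try it (device names are made up, bcache-tools is required, and make-bcache wipes the named devices):

  make-bcache -B /dev/sdb          # backing device (e.g. the OSD disk)
  make-bcache -C /dev/nvme0n1      # caching device (the SSD)
  echo /dev/sdb > /sys/fs/bcache/register
  echo /dev/nvme0n1 > /sys/fs/bcache/register
  # attach the backing device to the cache set by its UUID, then
  # switch to writeback - the mode discussed in this thread
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo writeback > /sys/block/bcache0/bcache/cache_mode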
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
Hi, I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. It looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time). Dominik
On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, is anyone out there running EnhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not? Thanks in advance, German
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
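For reference, the "1GB of write-back cache with flashcache" setup mentioned above would be created roughly like this (an untested sketch - the device and cache names are invented, and -p back selects write-back mode):

  flashcache_create -p back -s 1g rbd_wb_cache /dev/ssd1 /dev/rbd0
  # the cached device then appears as /dev/mapper/rbd_wb_cache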
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 09:24 AM, Jan Schermer wrote:
On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (the sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend. Daniel
From: Dominik Zalewski dzalew...@optlink.net To: German Anders
Re: [ceph-users] ceph cluster_network with linklocal ipv6
Should Ceph care about what scope the address is in? We don't specify it for IPv4 anyway - or is link scope special in some way? And isn't this the correct syntax, actually? cluster_network = fe80::/64%cephnet
On 18 Aug 2015, at 16:17, Wido den Hollander w...@42on.com wrote:
On 18-08-15 16:02, Jan Schermer wrote: Shouldn't this: cluster_network = fe80::%cephnet/64 be this: cluster_network = fe80::/64 ?
That won't work since the kernel doesn't know the scope. So %devname is right, but Ceph can't parse it. Although it sounds cool to run Ceph over link-local, I don't think it currently works. Wido
On 18 Aug 2015, at 15:39, Björn Lässig b.laes...@pengutronix.de wrote: Hi, I just set up my first Ceph cluster, and after breaking things for a while and letting Ceph repair itself, I want to set up the cluster network. Unfortunately I am doing something wrong :-) To avoid any dependencies in my cluster network, I want to use only IPv6 link-local addresses on interface 'cephnet'.
/var/log/ceph/ceph-osd.4.log: 2015-08-18 15:10:38.954592 7f24c0ac2880 -1 unable to parse network: fe80::%cephnet/64
--- /etc/ceph/ceph.conf
[global] ... cluster_network = fe80::%cephnet/64
--- /etc/network/interfaces
auto cephnet iface cephnet inet6 auto sysctl -wq net.ipv6.conf.$IFACE.accept_ra_defrtr=0
What could I do? thanks, Björn
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
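Until Ceph can parse scoped addresses, the usual workaround is to give the cluster interfaces a unique-local (ULA) prefix so no zone identifier is needed. A sketch, with the fd00::/8 prefix entirely made up:

  # /etc/ceph/ceph.conf
  [global]
      cluster_network = fd0d:b8ca:fe00::/64

The prefix is still self-contained (nothing outside the cluster needs to route it), which keeps most of the "no external dependencies" property that link-local was meant to provide.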
Re: [ceph-users] ceph distributed osd
Hi Luis, What I mean is: we have three OSDs, each with a 1TB hard disk, and two pools (poolA and poolB) with replica 2. The write behavior is the confusing part for us. Our assumption is:
PoolA -- may write to OSD1 and OSD2 (is this correct?)
PoolB -- may write to OSD3 and OSD1 (is this correct?)
Suppose the hard disks get full - how many OSDs need to be added, and what will the write behavior to the new OSDs be? After adding a few OSDs:
PoolA -- may write to OSD4 and OSD5 (is this correct?)
PoolB -- may write to OSD5 and OSD6 (is this correct?)
Regards, Prabu
On Mon, 17 Aug 2015 19:41:53 +0530 Luis Periquito periqu...@gmail.com wrote: I don't understand your question. You created a 1G RBD/disk and it's full. You are able to grow it though - but that's a Linux management issue, not Ceph. As everything is thin-provisioned you can create an RBD with an arbitrary size - I've created one with 1PB when the cluster only had 600G raw available.
On Mon, Aug 17, 2015 at 1:18 PM, gjprabu gjpr...@zohocorp.com wrote: Hi All, can anybody help with this issue? Regards, Prabu
On Mon, 17 Aug 2015 12:08:28 +0530 gjprabu gjpr...@zohocorp.com wrote: Hi All, also please find the OSD information:
ceph osd dump | grep 'replicated size'
pool 2 'repo' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 126 pgp_num 126 last_change 21573 flags hashpspool stripe_width 0
Regards, Prabu
On Mon, 17 Aug 2015 11:58:55 +0530 gjprabu gjpr...@zohocorp.com wrote: Hi All, we need to test three OSDs and one image with replica 2 (size 1GB). While testing, data is not written above 1GB. Is there any option to write to the third OSD?
ceph osd pool get repo pg_num
pg_num: 126
# rbd showmapped
id pool image          snap device
0  rbd  integdownloads -    /dev/rbd0  -- already there
2  repo integrepotest  -    /dev/rbd2  -- newly created
[root@hm2 repository]# df -Th
Filesystem           Type      Size  Used Avail Use% Mounted on
/dev/sda5            ext4      289G   18G  257G   7% /
devtmpfs             devtmpfs  252G     0  252G   0% /dev
tmpfs                tmpfs     252G     0  252G   0% /dev/shm
tmpfs                tmpfs     252G  538M  252G   1% /run
tmpfs                tmpfs     252G     0  252G   0% /sys/fs/cgroup
/dev/sda2            ext4      488M  212M  241M  47% /boot
/dev/sda4            ext4      1.9T   20G  1.8T   2% /var
/dev/mapper/vg0-zoho ext4      8.6T  1.7T  6.5T  21% /zoho
/dev/rbd0            ocfs2     977G  101G  877G  11% /zoho/build/downloads
/dev/rbd2            ocfs2    1000M 1000M     0 100% /zoho/build/repository
@:~$ scp -r sample.txt root@integ-hm2:/zoho/build/repository/
root@integ-hm2's password:
sample.txt 100% 1024MB 4.5MB/s 03:48
scp: /zoho/build/repository//sample.txt: No space left on device
Regards, Prabu
On Thu, 13 Aug 2015 19:42:11 +0530 gjprabu gjpr...@zohocorp.com wrote: Dear Team, we are using two Ceph OSDs with replica 2 and it is working properly. My doubt is this: pool A's image size will be 10GB and it is replicated to two OSDs. What will happen if the size reaches that limit - is there any chance to make the data continue writing to another two OSDs? Regards, Prabu
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
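Two points worth illustrating here. First, placement is chosen per object by CRUSH, not per pool, so you can ask the cluster which OSDs any given object maps to. Second, the "No space left on device" above is simply the 1GB image being full, which `rbd resize` fixes (the filesystem on top then has to be grown with its own tools). A sketch, with the object name made up:

  # show the acting OSD set for one object in the pool
  ceph osd map repo sample.txt
  # grow the image from 1000MB to 2GB (--size is in MB in this release)
  rbd resize repo/integrepotest --size 2048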
Re: [ceph-users] any recommendation of using EnhanceIO?
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing, the 100x performance increase you get with RBDs doing sync writes when you give them something like just 1GB of write-back cache with flashcache.
For your use case, is it ok that data may live on the flashcache for some amount of time before making it to Ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not should the flashcache SSD fail.
Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies below 1ms. I'm still in testing at the moment with the testing kernel that includes blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual-port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using OpenStack and other direct RBD interfaces it may not be such an attractive option.
I've been thinking that maybe Ceph needs an additional daemon with very low overheads, which is run on SSDs to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as using flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-iops/low-latency stuff.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?
I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)
Jan
On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, Out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't had the spare cycles. Mark
On 08/18/2015 04:00 AM, Jan Schermer wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable). If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's OK for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root b) disable and enable it on the fly (doh) c) make it non-persistent (flush it) before reboot - not sure if that was possible either. d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily. Bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)
Jan
On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc. Best regards, Alex
On
Re: [ceph-users] any recommendation of using EnhanceIO?
snip
Here's kind of how I see the field right now:
1) Cache at the client level. Likely fastest but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.
Agreed.
2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.
This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using flashcache for this too but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?
I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.
Interesting, I might have a look into this.
3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design but care must be taken to not over-promote.
4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.
I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.
Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions but not a ton of actual data regarding what folks are seeing in practice and under what scenarios. If you have any specific questions that you think I might be able to answer, please let me know.
The only other main app that I can really think of where this sort of write latency is critical is SQL, particularly the transaction logs. To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:
Traditional array, 10K disks = 300-600 IOPs
Ceph 7.2K + SSD journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph pure SSD pool = 500 IOPs (Intel S3700 SSDs)
I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help here. Sandisk and Intel have both done some very useful investigations; I've got some additional tests replicating some of their findings coming shortly.
OK, it will be interesting to see. I will see if I can change it in my environment and whether it gives any improvement. I think I came to the conclusion that Ceph takes a certain amount of time to do a write, and by the time you add in a replica copy I was struggling to get much below 2ms per IO with my 2.1GHz CPUs. 2ms = ~500 IOPs.
Ceph cache tiering = 10-500 IOPs (as we know, misses can be very painful)
Indeed. There's some work going on in this area too. Hopefully we'll know how some of our ideas pan out later this week. Assuming excessive promotions aren't a problem, I suspect the jemalloc/tcmalloc improvements will generally make cache tiering more interesting (though buffer cache will still be the primary source of really hot cached reads).
Ceph + RBD caching with flashcache = 200-1000 IOPs (readahead can give high bursts if snapshot blocks are sequential)
Good to know!
And when copying VMs to the datastore (ESXi does this in sequential 64k IOs... yes, silly I know):
Traditional array, 10K disks = ~100MB/s (limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD journal = ~20MB/s (again the LevelDB sync seems to be the limit here for sequential writes)
This is pretty bad. Is RBD cache enabled?
Tell me about it, moving a 2TB VM is a painful experience. Yes, the librbd cache is on, but iSCSI effectively turns all writes into sync writes, so this bypasses the cache and you are dependent on the time it takes for each OSD to ACK the write. In this case, waiting each time for 64kb IOs to complete due to the levelDB sync, you end up with transfer speeds somewhere in the region of 15-20MB/s. You can do the same thing with something like IOmeter (64k, sequential write, directio, QD=1). NFS is even worse, as every ESX write also requires an FS journal sync on the FS being used for NFS. So you have to wait for two ACKs from Ceph, normally meaning
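A fio equivalent of that IOmeter test, for anyone who wants to reproduce the 64k QD=1 sync-write behaviour against a mapped RBD (the device name is made up, and this writes to the device, so only run it against a scratch image):

  fio --name=esx-style-copy --filename=/dev/rbd0 --rw=write \
      --bs=64k --iodepth=1 --direct=1 --sync=1 \
      --runtime=60 --time_based

The reported bandwidth should land close to the 15-20MB/s figures quoted above if the same LevelDB sync bottleneck is in play.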
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 18 Aug 2015, at 18:15, Jan Schermer j...@schermer.cz wrote:

On 18 Aug 2015, at 17:57, Björn Lässig b.laes...@pengutronix.de wrote:

On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

I'm not a network guru, but you can have same/overlapping subnets with IPv4 as well; that's why you have scopes, metrics, routing tables, policy routing tables etc. That's why I'm wondering what's different here.

It's IPv6, that's different. Search for "link local", it will tell you what it is.

Wido

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn
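For reference, a link-local address is only meaningful together with an interface (zone) ID, which is what the %-suffix in the proposed syntax denotes; whether ceph's cluster_network option accepts that notation is exactly the open question here. Standard tooling shows the idea (interface name and address are placeholders):

    # Every IPv6-enabled interface carries its own fe80::/64 address:
    ip -6 addr show dev eth0 | grep fe80

    # Reaching a link-local peer requires naming the outgoing interface:
    ping6 fe80::21b:21ff:fe22:e865%eth0

Without the zone ID the kernel cannot tell which interface's fe80::/64 is meant, which is why a plain "cluster_network = fe80::/64" is ambiguous.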
Re: [ceph-users] any recommendation of using EnhanceIO?
Hi Jan,

On Tue, Aug 18, 2015 at 5:00 AM, Jan Schermer j...@schermer.cz wrote: I already evaluated EnhanceIO in combination with CentOS 6 (and backported 3.10 and 4.0 kernel-lt, if I remember correctly). It worked fine during benchmarks and stress tests, but once we ran DB2 on it, it panicked within minutes and took all the data with it (almost literally - files that weren't touched, like OS binaries, were b0rked and the filesystem was unsalvageable).

Out of curiosity, were you using EnhanceIO in writeback mode? I assume so, as a read cache should not hurt anything.

Thanks, Alex

If you disregard this warning - the performance gains weren't that great either, at least in a VM. It had problems when flushing to disk after reaching the dirty watermark, and the block size has some not-well-documented implications (not sure now, but I think it only cached IO _larger_ than the block size, so if your database keeps incrementing an XX-byte counter it will go straight to disk). Flashcache doesn't respect barriers (or does it now?) - if that's ok for you then go for it; it should be stable and I used it in the past in production without problems. bcache seemed to work fine, but I needed to a) use it for root, b) disable and enable it on the fly (doh), c) make it non-persistent (flush it) before reboot - not sure if that was possible either - d) all that in a customer's VM, and that customer didn't have a strong technical background to be able to fiddle with it... So I haven't tested it heavily.

bcache should be the obvious choice if you are in control of the environment. At least you can cry on LKML's shoulder when you lose data :-)

Jan

On 18 Aug 2015, at 01:49, Alex Gorbachev a...@iss-integration.com wrote: What about https://github.com/Frontier314/EnhanceIO? Last commit 2 months ago, but no external contributors :( The nice thing about EnhanceIO is there is no need to change the device name, unlike bcache, flashcache etc.

Best regards, Alex

On Thu, Jul 23, 2015 at 11:02 AM, Daniel Gryniewicz d...@redhat.com wrote: I did some (non-ceph) work on these, and concluded that bcache was the best supported, most stable, and fastest. This was ~1 year ago, so take it with a grain of salt, but that's what I would recommend.

Daniel

From: Dominik Zalewski dzalew...@optlink.net To: German Anders gand...@despegar.com Cc: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, July 1, 2015 5:28:10 PM Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

Hi,

I asked the same question a week or so ago (just search the mailing list archives for EnhanceIO :) and got some interesting answers. Looks like the project is pretty much dead since it was bought out by HGST. Even their website has some broken links in regards to EnhanceIO. I'm keen to try flashcache or bcache (it's been in the mainline kernel for some time).

Dominik

On 1 Jul 2015, at 21:13, German Anders gand...@despegar.com wrote: Hi cephers, has anyone out there implemented enhanceIO in a production environment? Any recommendation? Any perf output to share with the diff between using it and not?

Thanks in advance, German
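Since bcache keeps coming up as the safest of these options, here is a minimal setup sketch for a writeback bcache device (untested here; device paths are placeholders and the cache-set UUID comes from the make-bcache output):

    # Format the SSD as a cache device and the slow disk as backing:
    make-bcache -C /dev/nvme0n1p1        # prints the cache set UUID
    make-bcache -B /dev/sdb              # creates /dev/bcache0

    # Attach the backing device to the cache set:
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # Switch from the default writethrough to writeback:
    echo writeback > /sys/block/bcache0/bcache/cache_mode

    # To drain dirty data before maintenance (Jan's "flush before
    # reboot" case, point c above), force aggressive writeback:
    echo 0 > /sys/block/bcache0/bcache/writeback_percent

Detaching the cache set afterwards is the usual way to make the device fully non-persistent again.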
Re: [ceph-users] any recommendation of using EnhanceIO?
On 18 Aug 2015, at 16:44, Nick Fisk n...@fisk.me.uk wrote:

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 14:51 To: Nick Fisk n...@fisk.me.uk; 'Jan Schermer' j...@schermer.cz Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Yes, I agree, it's not ideal. But I believe it's the only way to get the performance required for some workloads that need write latencies below 1ms. I'm still testing at the moment with the testing kernel that includes the blk-mq fixes for large queue depths and max IO sizes. But if we decide to put it into production, it would be using 2x SAS dual-port SSDs in RAID1 across two servers for HA. As we are currently using iSCSI from these two servers, there is no real loss of availability by doing this. Generally I think as long as you build this around the fault domains of the application you are caching, it shouldn't impact too much. I guess for people using openstack and other direct RBD interfaces it may not be such an attractive option.

I've been thinking that maybe Ceph needs an additional daemon with very low overheads, run on SSDs, to provide shared persistent cache devices for librbd. There's still a trade-off, maybe not as much as with Flashcache, but for some workloads like databases, many people may decide that it's worth it. Of course I realise this would be a lot of work and everyone is really busy, but in terms of performance gained it would most likely have a dramatic effect in making Ceph look comparable to other solutions like VSAN or ScaleIO when it comes to high-iops/low-latency stuff.

Additional daemon that is persistent how? Isn't that what the journal does already, just too slowly? I think the best (and easiest!) approach is to mimic what a monolithic SAN does.

Currently:
1) client issues blocking/atomic/sync IO
2) rbd client sends this IO to all OSDs
3) after all OSDs process the IO, the IO is finished and considered persistent

That has serious implications:
* every IO is processed separately, not much coalescing
* OSD processes add latency when processing this IO
* one OSD can be slow momentarily, IO backs up and the cluster stalls

Let me just select what "processing the IO" means with respect to my architecture and I can likely get a 100x improvement. Let me choose:

1) WHERE the IO is persisted. Do I really need all (e.g. 3) OSDs to persist the data, or is a quorum (2) sufficient? Not waiting for one slow OSD gives me at least some SLA for planned tasks like backfilling, scrubbing, deep-scrubbing. Hands up who can afford to leave deep-scrub enabled in production...

2) WHEN the IO is persisted. Do I really need all OSDs to flush the data to disk? If all the nodes are in the same cabinet and on the same UPS then this makes sense. But my nodes are actually in different buildings ~10km apart. The chances of power failing simultaneously, N+1 UPSes failing simultaneously, diesels failing simultaneously... When nukes start falling and this happens, then I'll start looking for backups. Even if your nodes are in one datacentre, there are likely redundant (2+) circuits. And even if you have just one cabinet, you can add 3x UPS in there and gain a nice speed boost. So the IO could actually be pretty safe and happy once it gets to remote buffers on enough (quorum) nodes and waits for processing. It can be batched, it can be coalesced, it can be rewritten with subsequent updates...

3) WHAT amount of IO is stored. Do I need to have the last transaction, or can I tolerate 1 minute of missing data? Checkpoints, checksums on the last transaction, rollback (the journal already does this AFAIK)...

4) I DON'T CARE mode :-) A qemu cache=unsafe equivalent, but set on an RBD volume/pool. Because sometimes you just need to crunch data without really storing it persistently - how are the CERN/Hadoop/Big Data guys approaching this? And you can't always disable flushing. Filesystems have nobarrier options (usually), but if you need a block device for a raw database tablespace, you're pretty much SOL without lots of trickery.

1) is doable eventually. 2) is
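For what it's worth, the per-volume "I don't care" mode in 4) can already be approximated at the hypervisor layer today: qemu's cache=unsafe ignores all guest flushes. A sketch (pool/image name and auth id are placeholders):

    # Attach an RBD image with all flushes ignored - data loss on a
    # crash is expected and accepted, as described above:
    qemu-system-x86_64 -m 1024 \
        -drive file=rbd:rbd/scratch-vm:id=admin,format=raw,cache=unsafe

This only covers librbd guests, though; it doesn't help the raw-block-device case, and it is per-VM rather than per-pool.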
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 18 Aug 2015, at 17:57, Björn Lässig b.laes...@pengutronix.de wrote:

On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

I'm not a network guru, but you can have same/overlapping subnets with IPv4 as well; that's why you have scopes, metrics, routing tables, policy routing tables etc. That's why I'm wondering what's different here.

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn
Re: [ceph-users] any recommendation of using EnhanceIO?
On 08/18/2015 11:08 AM, Nick Fisk wrote:

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 15:55 To: Jan Schermer j...@schermer.cz Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 09:24 AM, Jan Schermer wrote:

On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest, but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.

Agreed.

2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.

This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using Flashcache for this too, but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

I believe you can already do this, though I haven't tested it. You can certainly move the monitors to rocksdb (tested) and newstore uses rocksdb as well.

3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design, but care must be taken to not over-promote.

4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.

Agreed. This is definitely the crux of the problem. The example below is a great start! It would be fantastic if we could get more feedback from the list on the relative importance of low-latency operations vs high IOPS through concurrency. We have general suspicions, but not a ton of actual data regarding what folks are seeing in practice and under what scenarios.

To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:

Traditional Array 10K disks = 300-600 IOPs
Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph Pure SSD Pool = 500 IOPs (Intel S3700 SSDs)

I'd be curious to see how much jemalloc or tcmalloc 2.4 + 128MB TC help here. Sandisk and Intel have both done some very useful investigations; I've got some additional tests replicating some of their findings coming shortly.

Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)

Indeed. There's some work going on in this area too. Hopefully we'll know how some of our ideas pan out later this week. Assuming excessive promotions aren't a problem, I suspect the jemalloc/tcmalloc improvements will generally make cache tiering more interesting (though buffer cache will still be the primary source of really hot cached reads).

Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give high bursts if snapshot blocks are sequential)

Good to know!

And when copying VMs to a datastore
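On the RocksDB point above, a hedged sketch of what this could look like in ceph.conf; the option names below are my best recollection and untested, so verify them against your release before relying on any of this:

    # ceph.conf sketch - assumed option names, verify before use
    [osd]
    filestore_omap_backend = rocksdb    # omap data via RocksDB instead of leveldb

    [mon]
    mon_keyvaluedb = rocksdb            # monitor store on RocksDB

Placing the RocksDB WAL itself on an SSD would be a further step on top of this, via the backend's own options.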
Re: [ceph-users] Repair inconsistent pgs..
Also, what command are you using to take snapshots?
-Sam

On Tue, Aug 18, 2015 at 8:48 AM, Samuel Just sj...@redhat.com wrote: Is the number of inconsistent objects growing? Can you attach the whole ceph.log from the 6 hours before and after the snippet you linked above? Are you using cache/tiering? Can you attach the osdmap (ceph osd getmap -o outfile)?
-Sam

On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: ceph - 0.94.2. It happened during rebalancing. I thought too that some OSDs were missing copies, but it looks like all of them are... So, any advice on which direction I need to go?

2015-08-18 14:14 GMT+03:00 Gregory Farnum gfar...@redhat.com: From a quick peek it looks like some of the OSDs are missing clones of objects. I'm not sure how that could happen, and I'd expect pg repair to handle that, but if it's not there's probably something wrong; what version of Ceph are you running? Sam, is this something you've seen, a new bug, or some kind of config issue?
-Greg

On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor igor.voloshane...@gmail.com wrote: Hi all, on our production cluster, due to heavy rebalancing ((( we have 2 pgs in an inconsistent state...

root@temp:~# ceph health detail | grep inc
HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
pg 2.490 is active+clean+inconsistent, acting [56,15,29]
pg 2.c4 is active+clean+inconsistent, acting [56,10,42]

From the OSD logs, after a repair attempt:

root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i} ; done
dumped all in format plain
instructing pg 2.490 on osd.56 to repair
instructing pg 2.c4 on osd.56 to repair

/var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.04da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.7a65/141//2
/var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.04da/141//2
/var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.37b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.522f/141//2
/var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.37b3/141//2
/var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.032e/141//2
/var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0807/141//2
/var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.01d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0c2b/141//2
/var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.09a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.01d9/141//2
/var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors

So, how can I solve the "expected clone" situation by hand? Thanks in advance!
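A hedged sketch of how one might start inspecting this by hand on filestore OSDs (the PG id, OSD id and rbd_data prefix are taken from the log excerpt above; paths assume the default /var/lib/ceph layout):

    # Confirm which OSDs hold the PG and re-run a deep scrub:
    ceph pg map 2.490
    ceph pg deep-scrub 2.490

    # On each acting OSD, locate the on-disk copies of the object
    # named in the "expected clone" error, to compare heads and
    # clones across replicas:
    find /var/lib/ceph/osd/ceph-56/current/2.490_head/ \
        -name '*rbd_data.1631755377d7e*' -ls

Comparing which clone files exist on each replica should show whether a clone is genuinely missing or merely unexpected, which determines whether copying or removing an object is the right manual fix.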
Re: [ceph-users] any recommendation of using EnhanceIO?
-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 18 August 2015 15:55 To: Jan Schermer j...@schermer.cz Cc: ceph-users@lists.ceph.com; Nick Fisk n...@fisk.me.uk Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

On 08/18/2015 09:24 AM, Jan Schermer wrote:

On 18 Aug 2015, at 15:50, Mark Nelson mnel...@redhat.com wrote:

On 08/18/2015 06:47 AM, Nick Fisk wrote: Just to chime in, I gave dmcache a limited test but its lack of a proper writeback cache ruled it out for me. It only performs write-back caching on blocks already on the SSD, whereas I need something that works like a battery-backed RAID controller, caching all writes. It's amazing the 100x performance increase you get with RBDs when doing sync writes if you give it something like just 1GB of write-back cache with flashcache.

For your use case, is it ok that data may live on the flashcache for some amount of time before making it to ceph to be replicated? We've wondered internally if this kind of trade-off is acceptable to customers or not, should the flashcache SSD fail.

Was it me pestering you about it? :-) All my customers need this desperately - people don't care about having RPO=0 seconds when all hell breaks loose. People care about their apps being slow all the time, which is effectively an outage. I (sysadmin) care about having consistent data where all I have to do is start up the VMs. Any ideas how to approach this? I think even checkpoints (like reverting to a known point in the past) would be great and sufficient for most people...

Here's kind of how I see the field right now:

1) Cache at the client level. Likely fastest, but obvious issues like above. RAID1 might be an option at increased cost. Lack of barriers in some implementations scary.

2) Cache below the OSD. Not much recent data on this. Not likely as fast as client-side cache, but likely cheaper (fewer OSD nodes than client nodes?). Lack of barriers in some implementations scary.

This also has the benefit of caching the leveldb on the OSD, so you get a big performance gain there too for small sequential writes. I looked at using Flashcache for this too, but decided it was adding too much complexity and risk. I thought I read somewhere that RocksDB allows you to move its WAL to SSD; is there anything in the pipeline for something like moving the filestore to use RocksDB?

3) Ceph Cache Tiering. Network overhead and write amplification on promotion makes this primarily useful when workloads fit mostly into the cache tier. Overall safe design, but care must be taken to not over-promote.

4) Separate SSD pool. Manual and not particularly flexible, but perhaps best for applications that need consistently high performance.

I think it depends on the definition of performance. Currently even very fast CPUs and SSDs in their own pool will still struggle to get less than 1ms of write latency. If your performance requirements are for large queue depths then you will probably be alright. If you require something that mirrors the performance of a traditional write-back cache, then even pure SSD pools can start to struggle.

To give a real-world example of what I see when doing various tests, here is a rough guide to IOPs when removing a snapshot on an ESX server:

Traditional Array 10K disks = 300-600 IOPs
Ceph 7.2K + SSD Journal = 100-200 IOPs (LevelDB syncing on the OSD seems to be the main limitation)
Ceph Pure SSD Pool = 500 IOPs (Intel S3700 SSDs)
Ceph Cache Tiering = 10-500 IOPs (As we know, misses can be very painful)
Ceph + RBD Caching with Flashcache = 200-1000 IOPs (Readahead can give high bursts if snapshot blocks are sequential)

And when copying VMs to a datastore (ESXi does this in sequential 64k IOs... yes, silly I know):

Traditional Array 10K disks = ~100MB/s (Limited by the 1Gb interface; on other arrays I guess this scales)
Ceph 7.2K + SSD Journal = ~20MB/s (Again the LevelDB sync seems to limit here for sequential writes)
Ceph Pure SSD Pool = ~50MB/s (a Ceph CPU bottleneck is occurring)
Ceph Cache Tiering = ~50MB/s when writing to a new block, 10MB/s on promote+overwrite
Ceph + RBD Caching with Flashcache = As fast as the SSD will go

-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: 18 August 2015 12:44 To: Mark Nelson mnel...@redhat.com Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] any recommendation of using EnhanceIO?

I did not. Not sure why now - probably for the same reason I didn't extensively test bcache. I'm not a real fan of device mapper though, so if I had to choose I'd still go for bcache :-)

Jan

On 18 Aug 2015, at 13:33, Mark Nelson mnel...@redhat.com wrote: Hi Jan, out of curiosity did you ever try dm-cache? I've been meaning to give it a spin but haven't
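As a concrete illustration of the flashcache setup Nick describes above, creating a small writeback cache in front of an RBD might look like this (a sketch; device names are placeholders, and the barrier caveats discussed earlier in the thread apply in full):

    # Create a writeback (-p back) cache named "rbd_wb" from an SSD
    # partition, fronting the RBD device:
    flashcache_create -p back rbd_wb /dev/ssd1 /dev/rbd0

    # IO then goes through the cached mapping instead of the raw RBD:
    ls /dev/mapper/rbd_wb

The ~1GB write-back window is what absorbs the sync-write latency; if that SSD dies before flushing, the data on the RBD is stale, which is exactly the trade-off being debated here.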
Re: [ceph-users] ceph cluster_network with linklocal ipv6
On 08/18/2015 04:32 PM, Jan Schermer wrote: Should ceph care about what scope the address is in? We don't specify it for ipv4 anyway, or is link scope special in some way?

fe80::/64 is on every ipv6-enabled interface... that's different from legacy IP.

And isn't this the correct syntax actually? cluster_network = fe80::/64%cephnet

That is a very good question! I will look into it.

Björn