Re: [ceph-users] Instrument librbd+qemu IO from hypervisor

2018-03-15 Thread Martin Millnert
Self-follow-up:

The Ceph version in the cluster I'm working on is 0.80.11, so quite old.

Adding:
  admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  log file = /var/log/ceph/

to /etc/ceph.conf, and then in my case tweaking AppArmor (disabling it
for testing):
  service apparmor teardown
  service apparmor stop

Then stopping a qemu VM:
  virsh stop $instance

Then restarting libvirt-bin:
  service libvirt-bin restart

Then starting the VM again:
  virsh start $instance

That allowed me to get at the perf dump data, which seems to contain
basically what I need for the moment:
{ "librbd--compute/a43efe1b-461a-4b54-923e-09c2e95da1ba_disk": { "rd": 0,
  "rd_bytes": 0,
  "rd_latency": { "avgcount": 0,
  "sum": 0.0},
  "wr": 0,
  "wr_bytes": 0,
  "wr_latency": { "avgcount": 0,
  "sum": 0.0},
  "discard": 0,
  "discard_bytes": 0,
  "discard_latency": { "avgcount": 0,
  "sum": 0.0},
  "flush": 9,
  "aio_rd": 4596,
  "aio_rd_bytes": 88915968,
  "aio_rd_latency": { "avgcount": 4596,
  "sum": 7.335787000},
  "aio_wr": 114,
  "aio_wr_bytes": 1438720,
  "aio_wr_latency": { "avgcount": 114,
  "sum": 0.011218000},
  "aio_discard": 0,
  "aio_discard_bytes": 0,
  "aio_discard_latency": { "avgcount": 0,
  "sum": 0.0},
  "aio_flush": 0,
  "aio_flush_latency": { "avgcount": 0,
  "sum": 0.0},
  "snap_create": 0,
  "snap_remove": 0,
  "snap_rollback": 0,
  "notify": 0,
  "resize": 0},
  "objectcacher-librbd--compute/a43efe1b-461a-4b54-923e-09c2e95da1ba_disk": { 
"cache_ops_hit": 114,
  "cache_ops_miss": 4458,
  "cache_bytes_hit": 24985600,
  "cache_bytes_miss": 88279552,
  "data_read": 88764416,
  "data_written": 1438720,
  "data_flushed": 1438720,
  "data_overwritten_while_flushing": 0,
  "write_ops_blocked": 0,
  "write_bytes_blocked": 0,
  "write_time_blocked": 0.0},
  "objecter": { "op_active": 0,
  "op_laggy": 0,
  "op_send": 4553,
  "op_send_bytes": 0,
  "op_resend": 0,
  "op_ack": 4552,
  "op_commit": 89,
  "op": 4553,
  "op_r": 4464,
  "op_w": 88,
  "op_rmw": 1,
  "op_pg": 0,
  "osdop_stat": 2,
  "osdop_create": 0,
  "osdop_read": 4458,
  "osdop_write": 88,
  "osdop_writefull": 0,
  "osdop_append": 0,
  "osdop_zero": 0,
  "osdop_truncate": 0,
  "osdop_delete": 0,
  "osdop_mapext": 0,
  "osdop_sparse_read": 0,
  "osdop_clonerange": 0,
  "osdop_getxattr": 0,
  "osdop_setxattr": 0,
  "osdop_cmpxattr": 0,
  "osdop_rmxattr": 0,
  "osdop_resetxattrs": 0,
  "osdop_tmap_up": 0,
  "osdop_tmap_put": 0,
  "osdop_tmap_get": 0,
  "osdop_call": 9,
  "osdop_watch": 1,
  "osdop_notify": 0,
  "osdop_src_cmpxattr": 0,
  "osdop_pgls": 0,
  "osdop_pgls_filter": 0,
  "osdop_other": 88,
  "linger_active": 1,
  "linger_send": 1,
  "linger_resend": 0,
  "poolop_active": 0,
  "poolop_send": 0,
  "poolop_resend": 0,
  "poolstat_active": 0,
  "poolstat_send": 0,
  "poolstat_resend": 0,
  "statfs_active": 0,
  "statfs_send": 0,
  "statfs_resend": 0,
  "command_active": 0,
  "command_send": 0,
  "command_resend": 0,
  "map_epoch": 0,
  "map_full": 0,
  "map_inc": 0,
  "osd_sessions": 7140,
  "osd_session_open": 119,
  "osd_session_close": 0,
  "osd_laggy": 1},
  "throttle-msgr_dispatch_throttler-radosclient": { "val": 0,
  "max": 104857600,
  "get": 4643,
  "get_sum": 89851514,
  "get_or_fail_fail": 0,
  "get_or_fail_success": 0,
  "

[ceph-users] Instrument librbd+qemu IO from hypervisor

2018-03-15 Thread Martin Millnert
Dear fellow cephalopods,

does anyone have any pointers on how to instrument librbd IO performance,
as driven by qemu, from the hypervisor?

Are there less intrusive ways than perf or equivalent? Can librbd be
told to dump statistics somewhere (per volume) - clientside?

This would come in real handy whilst debugging potential performance
issues troubling me.

Ideally I'd like to get per-volume metrics out that I can submit to
InfluxDB for presentation in Grafana. But I'll take anything.

Best,
Martin




Re: [ceph-users] Luminous CephFS on EC - how?

2017-08-30 Thread Martin Millnert
On Wed, Aug 30, 2017 at 02:06:29PM +0100, John Spray wrote:
> > As I wrote in my ticket there is room for improvement in docs on how to
> > do it and with cli/api rejecting "ceph fs new  " with
> > pool1 or pool2 being EC.
> 
> The CLI will indeed reject attempts to use an EC pool for metadata,
> and when an EC pool is used for data it verifies that the EC
> overwrites are enabled.  This is meant to work, you're just ("just"
> being my understatement of the day) hitting an OSD crash as soon as
> you try and use it!
> 
> re. the docs: https://github.com/ceph/ceph/pull/17372 - voila.
> 

Oh, OK, so it *is* supposed to work the way I did it, with the CephFS
base data pool being EC natively. Interesting!
Then I'll just hang around for a patch for the crash. :)
Thanks for the clarification & see you in the bug report. ;)

/M




Re: [ceph-users] Luminous CephFS on EC - how?

2017-08-30 Thread Martin Millnert
On Wed, Aug 30, 2017 at 11:06:02AM +0200, Peter Maloney wrote:
> What kind of terrible mail client is this that sends a multipart message where
> one part is blank and that's the one Thunderbird chooses to show? (see
> blankness below)

It's a real email client (mutt) sending text to the mailing list, with an
attached PGP signature. The list server does various violence to the
emails sent, and your client renders that as it sees fit, I suppose. WFM.

> Yes you're on the right track. As long as the main fs is on a replicated pool
> (the one with omap), the ones below it (using file layouts) can be EC without
> needing a cache pool.

Thanks!

> a quote from your first url: 
> http://docs.ceph.com/docs/master/rados/operations/
> erasure-code/#erasure-coding-with-overwrites
> 
> 
> For Cephfs, using an erasure coded pool means setting that pool in a file
> layout.

Yeah, that's not very descriptive at all to me without clear examples for
EC at the link target.

/M




Re: [ceph-users] Luminous CephFS on EC - how?

2017-08-30 Thread Martin Millnert
Hi,

On Wed, Aug 30, 2017 at 12:28:12PM +0100, John Spray wrote:
> On Wed, Aug 30, 2017 at 7:21 AM, Martin Millnert <mar...@millnert.se> wrote:
> > Hi,
> >
> > what is the proper method to not only setup but also successfully use
> > CephFS on erasure coded data pool?
> > The docs[1] very vaguely state that erasure coded pools do not support omap
> > operations hence, "For Cephfs, using an erasure coded pool means setting
> > that pool in a file layout.". The file layout docs says nothing further
> > about this [2].  (I filed a bug[3].)
> >
> > I'm guessing this translates to something along the lines of:
> >
> >   ceph fs new cephfs cephfs_metadata cephfs_replicated_data
> >   ceph fs add_data_pool cephfs cephfs_ec_data
> >
> > And then,
> >
> >   setfattr -n ceph.dir.layout.SOMETHING -v cephfs_ec_data  $cephfs_dir
> 
> Yep.  The SOMETHING is just "pool".

Ok, thanks!
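
For reference, a minimal sketch of the full sequence as I understand it from
this thread (pool names, PG counts and the EC profile are just placeholders;
on Luminous the EC data pool also needs overwrites enabled, which in turn
requires BlueStore OSDs):

  ceph osd pool create cephfs_metadata 64
  ceph osd pool create cephfs_replicated_data 64
  ceph osd pool create cephfs_ec_data 64 64 erasure myprofile
  # EC pools need overwrites enabled before CephFS can write to them:
  ceph osd pool set cephfs_ec_data allow_ec_overwrites true
  ceph fs new cephfs cephfs_metadata cephfs_replicated_data
  ceph fs add_data_pool cephfs cephfs_ec_data
  # Direct everything created under $cephfs_dir to the EC pool:
  setfattr -n ceph.dir.layout.pool -v cephfs_ec_data $cephfs_dir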

> I see from your ticket that you're getting an OSD crash, which is
> pretty bad news!

> For what it's worth, I have a home cephfs-on-EC configuration that has
> run happily for quite a while, so this can be done -- we just need to
> work out what's making the OSDs crash in this particular case.

Well, my base pool is EC and I guessed from the log output that that is
the root cause of the error. I.e. the list of pending omap operations is
too large.

As I wrote in my ticket there is room for improvement in docs on how to
do it and with cli/api rejecting "ceph fs new  " with
pool1 or pool2 being EC.

/M




[ceph-users] Luminous CephFS on EC - how?

2017-08-30 Thread Martin Millnert
Hi,

what is the proper method to not only setup but also successfully use
CephFS on erasure coded data pool?
The docs[1] very vaguely state that erasure coded pools do not support omap
operations hence, "For Cephfs, using an erasure coded pool means setting
that pool in a file layout.". The file layout docs says nothing further
about this [2].  (I filed a bug[3].)

I'm guessing this translates to something along the lines of:

  ceph fs new cephfs cephfs_metadata cephfs_replicated_data
  ceph fs add_data_pool cephfs cephfs_ec_data

And then,

  setfattr -n ceph.dir.layout.SOMETHING -v cephfs_ec_data  $cephfs_dir

This is to have all files created under $cephfs_dir inherit the erasure
coded pool afterwards.

Am I on the right track here?

/M

1. http://docs.ceph.com/docs/master/rados/operations/erasure-code/
2. http://docs.ceph.com/docs/master/cephfs/file-layouts/
3. http://tracker.ceph.com/issues/21174




Re: [ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?

2017-04-25 Thread Martin Millnert
On Tue, Apr 25, 2017 at 03:39:42PM -0400, Gregory Farnum wrote:
> > I'd like to understand if "prio" in Jewel is as explained, i.e.
> > something similar to the following pseudo code:
> >
> >   if len(subqueue) > 0:
> > dequeue(subqueue)
> >   if tokens(global) > some_cost:
> > for queue in queues_high_to_low:
> >   if len(queue) > 0:
> > dequeue(queue)
> > tokens = tokens - some_other_cost
> >   else:
> > for queue in queues_low_to_high:
> >   if len(queue) > 0:
> > dequeue(queue)
> > tokens = tokens - some_other_cost
> >   tokens = min(tokens + some_refill_rate, max_tokens)
> 
> That looks about right.

OK, thanks for the validation. That indeed has an impact on the entire
priority queue under stress, then. (The WPQ motivation seems clear. :) )

> > The objective is to increase servicing time of client IO, especially
> > read, while deep scrub is occuring. It doesn't matter for us if a
> > deep-scrub takes x or 3x time, essentially. More consistent latency
> > to clients is more important.
> 
> I don't have any experience with SMR drives so it wouldn't surprise me
> if there are some exciting emergent effects with them.

Basically, a very large chunk of the disk area needs to be rewritten on each
write, so the write amplification factor of an inode update is just silly.
They have a PMR buffer area of approx. 500 GB, but that area can run out
pretty fast under sustained IO over time (the exact buffer management
logic is not known).

> But it sounds
> to me like you want to start by adjusting the osd_scrub_priority
> (default 5) and osd_scrub_cost (default 50 << 20, ie 50MB). That will
> directly impact how they move through the queue in relation to client
> ops. (There are also the family of scrub scheduling options, which
> might make sense if you are more tolerant of slow IO at certain times
> of the day/week, but I'm not familiar with them).
> -Greg

Thanks for those pointers!  It seems from a distance that it's necessary
to use WPQ if it can be suspected that the IO scheduler is running
without available tokens (not sure how to verify *that*).
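
For anyone else reading along, a sketch of trying those knobs at runtime with
injectargs before persisting them in ceph.conf (the values are examples, not
recommendations):

  # Lower scrub priority and raise its cost so client ops win more often:
  ceph tell osd.* injectargs '--osd_scrub_priority 1 --osd_scrub_cost 104857600'
  # Switching to WPQ is a ceph.conf change plus an OSD restart:
  #   [osd]
  #   osd op queue = wpq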


#ceph also helped point out that I'm indeed missing noatime,nodiratime
in the mount options. So every read causes an inode update, which is
extremely expensive on SMR compared with a regular HDD (e.g. PMR).
(Not sure how I missed this when I set it up, because I've been aware of
noatime earlier :) )
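
Concretely, something along these lines (device and mount point are examples
for one OSD):

  # Remount the OSD filestore without atime updates:
  mount -o remount,noatime,nodiratime /var/lib/ceph/osd/ceph-0
  # ...and persist it in /etc/fstab, e.g.:
  # /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  rw,noatime,nodiratime,attr2,inode64,noquota  0 0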

I think that's the first fix we'll want to do - and probably the biggest
source of trouble - and then we'll look back in a week or so to see how it's
doing. After that, we'll look into the various scrub-vs-client op scheduling
artefacts.

Thanks!

/M




[ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?

2017-04-25 Thread Martin Millnert
Hi,

We are experiencing significant impact from deep scrubs on Jewel, and have
started investigating op priorities. We use default values for the
related/relevant OSD priority settings.

"osd op queue" on
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
states:  "The normal queue is different between implementations."

So... in Jewel, where other than the code can I learn what the queue
behavior is? Is there anyone who's familiar with it?


I'd like to understand if "prio" in Jewel is as explained, i.e.
something similar to the following pseudo code:

  if len(subqueue) > 0:
dequeue(subqueue)
  if tokens(global) > some_cost:
for queue in queues_high_to_low:
  if len(queue) > 0:
dequeue(queue)
tokens = tokens - some_other_cost
  else:
for queue in queues_low_to_high:
  if len(queue) > 0:
dequeue(queue)
tokens = tokens - some_other_cost
  tokens = min(tokens + some_refill_rate, max_tokens)



The background, for anyone interested, is:

If it is similar to the above, this would explain the extreme OSD commit
latencies / client latency. My current theory is that the deep scrub
quite possibly is consuming all available tokens, such that when a
client op arrives with priority(client_io) > priority([deep_]scrub), the
prio queue essentially inverts and low priority ops get priority over
high priority ops.

The OSDs are SMR, but the question here is specifically not how they
perform (we're quite intimately aware of their performance profiles),
but how to tame Ceph to make the cluster behave as well as possible in
the normal case.

I put up some graphs on https://martin.millnert.se/ceph/jewel_prio/ :
 - OSD Journal/Commit/Apply latencies show very strong correlation with
ongoing deep scrubs.
 - When latencies are low and noisy there's essentially no client IO
   happening.
 - There is some evidence the write latency shoots through the roof --
   but there isn't much client write occurring... Possibly deep scrub
   causes disk write IO?
   * mount opts used are:
[...] type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

The objective is to increase servicing time of client IO, especially
read, while deep scrub is occuring. It doesn't matter for us if a
deep-scrub takes x or 3x time, essentially. More consistent latency
to clients is more important.

Best,
Martin Millnert




Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-24 Thread Martin Millnert
On Fri, 2016-07-22 at 08:28 -0400, Jason Dillaman wrote:
> You aren't, by chance, sharing the same RBD image between multiple
> VMs, are you? An order-of-magnitude performance degradation would not
> be unexpected if you have multiple clients concurrently accessing the
> same image with the "exclusive-lock" feature enabled on the image.

No, though I did perform a live migration of the VM between the tests as
well. But there is only one client of it.

> 4000 IOPS for 4K random writes also sounds suspiciously high to me.

Are the replica writes of the primary OSD async/parallel?

/M

> On Thu, Jul 21, 2016 at 7:32 PM, Martin Millnert <mar...@millnert.se> wrote:
> > Hi,
> >
> > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > latency increase.
> >
> > Quick facts:
> >  - 3x replicated pool
> >  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610 SSDs,
> >  - LSI3008 controller with up-to-date firmware and upstream driver, and
> > up-to-date firmware on SSDs.
> >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> >  - CentOS 7.2
> >
> > Physical checks out, both iperf3 for network and e.g. fio over all the
> > SSDs. Not done much of Linux tuning yet; but irqbalanced does a pretty
> > good job with pairing both NIC and HBA with their respective CPUs.
> >
> > In performance hunting mode, and today took the next logical step of
> > upgrading from Infernalis to Jewel.
> >
> > Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image with
> > fio. The test scenario is 4K randomwrite, libaio, directIO, QD=1,
> > runtime=900s, test-file-size=40GiB.
> >
> > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of the
> > I/O complete within maximum 250 µsec (~4000 IOPS). This, [2], sees
> > 98.95% of the IO at ~4 msec (actually ~300 IOPs).
> >
> > Between [1] and [2] (simple plots of FIO's E2E-latency metrics), the
> > entire cluster including compute nodes code went from Infernalis to
> > 10.2.2
> >
> > What's going on here?
> >
> > I haven't tuned Ceph OSDs either in config or via Linux kernel at all
> > yet; upgrade to Jewel came first. I haven't changed any OSD configs
> > between [1] and [2] myself (only minimally before [1], 0 effort on
> > performance tuning) , other than updated to Jewel tunables. But the
> > difference is very drastic, wouldn't you say?
> >
> > Best,
> > Martin
> > [1] 
> > http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08/ceph-fio-bench_lat.1.png
> > [2] 
> > http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10/ceph-fio-bench_lat.1.png


Re: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency increase

2016-07-22 Thread Martin Millnert
On Fri, 2016-07-22 at 08:56 +0100, Nick Fisk wrote:
> > 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > Behalf Of Martin Millnert
> > Sent: 22 July 2016 00:33
> > To: Ceph Users <ceph-users@lists.ceph.com>
> > Subject: [ceph-users] Infernalis -> Jewel, 10x+ RBD latency
> > increase
> > 
> > Hi,
> > 
> > I just upgraded from Infernalis to Jewel and see an approximate 10x
> > latency increase.
> > 
> > Quick facts:
> >  - 3x replicated pool
> >  - 4x 2x-"E5-2690 v3 @ 2.60GHz", 128GB RAM, 6x 1.6 TB Intel S3610
> > SSDs,
> >  - LSI3008 controller with up-to-date firmware and upstream driver,
> > and up-to-date firmware on SSDs.
> >  - 40GbE (Mellanox, with up-to-date drivers & firmware)
> >  - CentOS 7.2
> > 
> > Physical checks out, both iperf3 for network and e.g. fio over all
> > the SSDs. Not done much of Linux tuning yet; but irqbalanced does a
> > pretty good job with pairing both NIC and HBA with their respective
> > CPUs.
> > 
> > In performance hunting mode, and today took the next logical step
> > of upgrading from Infernalis to Jewel.
> > 
> > Tester is remote KVM/Qemu/libvirt guest (openstack) CentOS 7 image
> > with fio. The test scenario is 4K randomwrite, libaio, directIO,
> > QD=1, runtime=900s, test-file-size=40GiB.
> > 
> > Went from a picture of [1] to [2]. In [1], the guest saw 98.25% of
> > the I/O complete within maximum 250 µsec (~4000 IOPS). This, [2],
> > sees 98.95% of the IO at ~4 msec (actually ~300 IOPs).
> 
> I would be suspicious that somehow somewhere you had some sort of
> caching going on, in the 1st example. 

It wouldn't surprise me either, though to the best of my knowledge I
haven't actively configured any such write caching anywhere.

I did forget one brief detail regarding the setup: we run 4x OSDs per
SSD drive, i.e. roughly 400 GB each.
Consistent 4k random-write performance onto
/var/lib/ceph/osd-$num/fiotestfile, with a similar test config as above,
is 13k IOPS *per partition*.

> 250us is pretty much unachievable for directio writes with Ceph.

Thanks for the feedback, though it's disappointing to hear.

>  I've just built some new nodes with the pure goal of crushing
> (excuse the pun) write latency and after extensive tuning can't get
> it much below 600-700us. 

Which of the below, or things beyond the below, have you done, considering
the directIO baseline?
 - SSD only hosts
 - NIC <-> CPU/NUMA mapping
 - HBA <-> CPU/NUMA mapping
 - ceph-osd process <-> CPU/NUMA mapping
 - Partition SSDs into multiple partitions
 - Ceph OSD tunings for concurrency (many-clients)
 - Ceph OSD tunings for latency (many-clients)
 - async messenger, new in Jewel (not sure what the impact is), or
   change/tuning of the memory allocator
 - RDMA (e.g. Mellanox) messenger

I have yet to iron out precisely what those two OSD tunings would be.

> The 4ms sounds more likely for an untuned cluster. I wonder if any of
> the RBD or qemu cache settings would have changed between versions?

I'm curious about this too.  What are the relevant OSD-side configs here?
And how do I check what the librbd clients experience? Which parameters
from e.g. /etc/ceph/$clustername.conf apply to them?

I'll have to make another pass over the rbd PRs between Infernalis and
10.2.2 I suppose.
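
One way to peek at the client side is the librbd admin socket: give the qemu
client a socket path in the [client] section of ceph.conf on the hypervisor,
restart the guest so it picks it up, and then query the live process (the
socket filename below is only an example):

  # /etc/ceph/ceph.conf on the hypervisor:
  #   [client]
  #   admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  # Then, against the running qemu client:
  ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.140093659744.asok config show | grep rbd_cache
  ceph --admin-daemon /var/run/ceph/ceph-client.cinder.12345.140093659744.asok perf dump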


> > Between [1] and [2] (simple plots of FIO's E2E-latency metrics),
> > the entire cluster including compute nodes code went from
> > Infernalis
> > to
> > 10.2.2
> > 
> > What's going on here?
> > 
> > I haven't tuned Ceph OSDs either in config or via Linux kernel at
> > all yet; upgrade to Jewel came first. I haven't changed any OSD
> > configs
> > between [1] and [2] myself (only minimally before [1], 0 effort on
> > performance tuning) , other than updated to Jewel tunables. But
> > the difference is very drastic, wouldn't you say?
> > 
> > Best,
> > Martin
> > [1] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test08
> > /ceph-fio-bench_lat.1.png
> > [2] http://martin.millnert.se/ceph/pngs/guest-ceph-fio-bench/test10
> > /ceph-fio-bench_lat.1.png


Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Martin Millnert
The thing that worries me with your next-gen design (actually your current
design as well) is SSD wear. If you use Intel SSD at 10 DWPD, that's
12TB/day per 64TB total.  I guess use case dependent, and perhaps a 1:4
write:read ratio is quite high in terms of writes as-is.

You're also throughput-limiting yourself to the PCIe bandwidth of the NVMe
device (regardless of NVRAM/SSD). Compared to a traditional interface, that
may be OK of course in relative terms. NVRAM vs SSD here is simply a choice
between wear (NVRAM as journal minimum) and cache hit probability (size).

Interesting thought experiment anyway for me, thanks for sharing Wido.

/M

 Original message 
From: Wido den Hollander  
Date: 20/10/2015  16:00  (GMT+01:00) 
To: ceph-users  
Subject: [ceph-users] Ceph OSDs with bcache experience 

Hi,

In the "newstore direction" thread on ceph-devel I wrote that I'm using
bcache in production and Mark Nelson asked me to share some details.

Bcache is running in two clusters now that I manage, but I'll keep this
information to one of them (the one at PCextreme behind CloudStack).

In this cluster has been running for over 2 years now:

epoch 284353
fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
created 2013-09-23 11:06:11.819520
modified 2015-10-20 15:27:48.734213

The system consists out of 39 hosts:

2U SuperMicro chassis:
* 80GB Intel SSD for OS
* 240GB Intel S3700 SSD for Journaling + Bcache
* 6x 3TB disk

This isn't the newest hardware. The next batch of hardware will be more
disks per chassis, but this is it for now.

All systems were installed with Ubuntu 12.04, but they are all running
14.04 now with bcache.

The Intel S3700 SSD is partitioned with a GPT label:
- 5GB Journal for each OSD
- 200GB Partition for bcache

root@ceph11:~# df -h|grep osd
/dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
/dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
/dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
/dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
/dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
/dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
root@ceph11:~#

root@ceph11:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:Ubuntu 14.04.3 LTS
Release:14.04
Codename:   trusty
root@ceph11:~# uname -r
3.19.0-30-generic
root@ceph11:~#

"apply_latency": {
    "avgcount": 2985023,
    "sum": 226219.891559000
}

What did we notice?
- Less spikes on the disk
- Lower commit latencies on the OSDs
- Almost no 'slow requests' during backfills
- Cache-hit ratio of about 60%

Max backfills and recovery active are both set to 1 on all OSDs.

For the next generation hardware we are looking into using 3U chassis
with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
tested those yet, so nothing to say about it.

The current setup is 200GB of cache for 18TB of disks. The new setup
will be 1200GB for 64TB, curious to see what that does.

Our main conclusion however is that it does smoothen the I/O-pattern
towards the disks and that gives a overall better response of the disks.

Wido



Re: [ceph-users] Ceph OSDs with bcache experience

2015-10-20 Thread Martin Millnert
OK - it seems my Android email client (native Samsung) messed up
"In-Reply-To", which confuses some MUAs. Apologies for that.
/M

On Tue, Oct 20, 2015 at 09:45:25PM +0200, Martin Millnert wrote:
> The thing that worries me with your next-gen design (actually your current
> design aswell) is SSD wear. If you use Intel SSD at 10 DWPD, that's 12TB/day
> per 64TB total.  I guess use case dependant,  and perhaps 1:4 write read ratio
> is quite high in terms of writes as-is.
> 
> You're also throughput-limiting yourself to the pci-e bw of the NVME device
> (regardless of NVRAM/SSD). Compared to traditonal interface, that may be ok of
> course in relative terms. NVRAM vs SSD here is simply a choice between wear
> (NVRAM as journal minimum), and cache hit probability (size).  
> 
> Interesting thought experiment anyway for me, thanks for sharing Wido.
> 
> /M
> 
> 
>  Original message 
> From: Wido den Hollander <w...@42on.com>
> Date: 20/10/2015 16:00 (GMT+01:00)
> To: ceph-users <ceph-us...@ceph.com>
> Subject: [ceph-users] Ceph OSDs with bcache experience
> 
> Hi,
> 
> In the "newstore direction" thread on ceph-devel I wrote that I'm using
> bcache in production and Mark Nelson asked me to share some details.
> 
> Bcache is running in two clusters now that I manage, but I'll keep this
> information to one of them (the one at PCextreme behind CloudStack).
> 
> In this cluster has been running for over 2 years now:
> 
> epoch 284353
> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
> created 2013-09-23 11:06:11.819520
> modified 2015-10-20 15:27:48.734213
> 
> The system consists out of 39 hosts:
> 
> 2U SuperMicro chassis:
> * 80GB Intel SSD for OS
> * 240GB Intel S3700 SSD for Journaling + Bcache
> * 6x 3TB disk
> 
> This isn't the newest hardware. The next batch of hardware will be more
> disks per chassis, but this is it for now.
> 
> All systems were installed with Ubuntu 12.04, but they are all running
> 14.04 now with bcache.
> 
> The Intel S3700 SSD is partitioned with a GPT label:
> - 5GB Journal for each OSD
> - 200GB Partition for bcache
> 
> root@ceph11:~# df -h|grep osd
> /dev/bcache02.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
> /dev/bcache12.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
> /dev/bcache22.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
> /dev/bcache32.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
> /dev/bcache42.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
> /dev/bcache52.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
> root@ceph11:~#
> 
> root@ceph11:~# lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 14.04.3 LTS
> Release: 14.04
> Codename: trusty
> root@ceph11:~# uname -r
> 3.19.0-30-generic
> root@ceph11:~#
> 
> "apply_latency": {
> "avgcount": 2985023,
> "sum": 226219.891559000
> }
> 
> What did we notice?
> - Less spikes on the disk
> - Lower commit latencies on the OSDs
> - Almost no 'slow requests' during backfills
> - Cache-hit ratio of about 60%
> 
> Max backfills and recovery active are both set to 1 on all OSDs.
> 
> For the next generation hardware we are looking into using 3U chassis
> with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, but we haven't
> tested those yet, so nothing to say about it.
> 
> The current setup is 200GB of cache for 18TB of disks. The new setup
> will be 1200GB for 64TB, curious to see what that does.
> 
> Our main conclusion however is that it does smoothen the I/O-pattern
> towards the disks and that gives a overall better response of the disks.
> 
> Wido
> 





Re: [ceph-users] Force an OSD to try to peer

2015-04-14 Thread Martin Millnert
On Tue, Mar 31, 2015 at 10:44:51PM +0300, koukou73gr wrote:
> On 03/31/2015 09:23 PM, Sage Weil wrote:
>>
>> It's nothing specific to peering (or ceph).  The symptom we've seen is
>> just that byte stop passing across a TCP connection, usually when there is
>> some largish messages being sent.  The ping/heartbeat messages get through
>> because they are small and we disable nagle so they never end up in large
>> frames.
>
> Is there any special route one should take in order to transition a
> live cluster to use jumbo frames and avoid such pitfalls with OSD
> peering?

1. Configure entire switch infrastructure for jumbo frames.
2. Enable config versioning of switch infrastructure configurations
3. Bonus points: Monitor config changes of switch infrastructure
4. Run a ping test using e.g. fping from each node to every other node,
with large frames (see the sketch after this list).
5. Bonus points: Set up such a test in some monitoring infrastructure.
6. Once you trust the config (and monitoring), up all the nodes' MTU
to jumbo size, simultaneously.  This is the critical step and perhaps
it could be further perfected. Ideally you would like an atomic
MTU-upgrade command on the entire cluster.
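
A sketch of steps 4 and 6 (node names and interface are placeholders;
8972 bytes = 9000 MTU minus 28 bytes of IP+ICMP headers, and -M do sets
the don't-fragment bit):

  for host in node1 node2 node3; do
      ping -M do -s 8972 -c 3 "$host" > /dev/null \
          && echo "$host: jumbo OK" \
          || echo "$host: jumbo FAIL"
  done
  # Only once every node-to-node pair passes, raise the MTU everywhere, e.g.:
  ip link set dev eth0 mtu 9000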

/M




Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes

2015-03-29 Thread Martin Millnert
On Thu, Mar 26, 2015 at 12:36:53PM -0500, Mark Nelson wrote:
> Having said that, small nodes are
> absolutely more expensive per OSD as far as raw hardware and
> power/cooling goes.

The smaller the volume manufacturers have on the units, the worse the margin
typically is (from the buyer's side).  Also, CPUs typically run up a premium
the higher you go.  I've found a lot of local maxima, optimization-wise,
over the past years, both in 12 OSD/U and 18 OSD/U dedicated storage node
setups, for instance.

There may be local maxima along colocated low-scale storage/compute
nodes, but the one major problem with colocating storage with compute is
that you can't scale compute independently from storage efficiently
using that building block alone.  There may be temporal optimizations in
doing so, however (e.g. before you have reached sufficient scale).

There's no single optimal answer when you're dealing with 20+ variables
to consider... :)

BR,
Martin




[ceph-users] cold-storage tuning Ceph

2015-01-16 Thread Martin Millnert
Hello list,

I'm currently trying to understand what I can do with Ceph to optimize
it for a cold-storage-like scenario (write-once, read-very-rarely),
trying to compare cost against LTO-6 tape.

There is a single main objective:
 - minimal cost/GB/month of operations (including power, DC)

To achieve this, I can break it down to:
 - Use the best cost/GB HDD
   * SMR today
 - Minimal cost per 3.5" drive slot
 - Minimal power utilization per drive

While staying within what is available today, I don't imagine powering
down individual disk slots using IPMI etc., as some vendors allow.

Now, putting Ceph on top of this, drives will be on, but it would be very
useful to be able to spin down drives that aren't being used.

It then seems to me that I want to do a few things with Ceph:
 - Have only a subset of the cluster 'active' for writes at any point in
   time
 - Yet, still have the entire cluster online and available for reads
 - Minimize concurrent OSD operations in a node that uses RAM, e.g.
   - Scrubbing, minimal number of OSDs active (ideally max 1)
   - In general, minimize concurrent active OSDs as per above
 - Minimize risk that any type of re-balancing of data occurs at all
   - E.g. use a high number of EC parity chunks


Assuming e.g. 16 drives/host and 10TB drives, at ~100MB/s read and a
nearly full cluster, deep scrubbing the host would take 18.5 days.
This means roughly 2 deep scrubs per month.
Using an EC pool, I wouldn't be very worried about errors, so perhaps
that's OK (calculable), but they obviously need to be repaired.
Mathematically, I can use an increase in parity chunks to lengthen the
interval between deep scrubs.
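
For the record, the arithmetic behind the 18.5 days, assuming the OSDs in a
host are scrubbed one at a time as per the goal above:

  # 16 drives x 10 TB each, read sequentially at ~100 MB/s:
  awk 'BEGIN { print 16 * 10e12 / 100e6 / 86400, "days" }'    # ~18.5 days per pass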


Is there anyone on the list who can provide some thoughts on the
higher-order goal of minimizing concurrently active OSDs in a node?

I imagine I need to steer writes towards a subset of the system - but I
have no idea how to implement it. Using multiple separate clusters, e.g.
having each OSD on a node participate in a unique cluster, could perhaps help.

Any feedback appreciated.  It does appear to be a hot topic (pun intended).

Best,
Martin




[ceph-users] rados -p pool cache-flush-evict-all surprisingly slow

2014-11-12 Thread Martin Millnert
Dear Cephers,

I have a lab setup with 6x dual-socket hosts, 48GB RAM and 2x 10Gbps per host,
each equipped with 2x S3700 100GB SSDs and 4x 500GB HDDs, where the HDDs
are mapped in a tree under a 'platter' root, similar to the guidance from
Seb at
http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ ,
and the SSDs similarly under an 'ssd' root.  Replication is set to 3.
Journals are on tmpfs (simulating NVRAM).

I have put an SSD pool as a cache tier in front of an HDD pool (rbd), and run
fio-rbd against rbd.  In the benchmarks, at bs=32kb, QD=128 from a
single separate client machine, I reached a peak throughput of around
1.2 GB/s.  So there is some capability.  IOPS-wise I see a max of around
15k IOPS currently.

After having filled the SSD cache tier, I ran rados -p rbd
cache-flush-evict-all - and I was expecting to see the 6 SSD OSDs start
to evict all the cache-tier PGs to the underlying pool, rbd, which maps
to the HDDs.  I would have expected parallelism and high throughput,
but what I now observe is an average flush speed of ~80 MB/s.

Which leads me to the question: is rados -p pool
cache-flush-evict-all supposed to work in a parallel manner?

Cursory inspection with tcpdump suggests to me that the eviction operation
is serial, in which case the performance makes a little bit of sense,
since it is basically limited by the write speed of a single HDD.

What should I see?

If it is indeed a serial operation, is this different from the regular
cache tier eviction routines that are triggered by full_ratios, max
objects or max storage volume?

Regards,
Martin

