Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net):

> Do you think this could help? OSD does not even start, I'm getting a little
> lost how flushing caches could help.

I might have misunderstood. I thought the OSDs crashed when you set the
config option.

> According to trace I suspect something around processing config values.

I've just set the same config option on a test cluster and restarted an
OSD without a problem, so I'm not sure what is going on there.
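
Roughly, something like the following (4 GiB is just an example value and
osd.0 is picked arbitrarily):

ceph config set osd osd_memory_target 4294967296
systemctl restart ceph-osd@0
# verify what the running daemon actually picked up:
ceph daemon osd.0 config get osd_memory_target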

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crash after change of osd_memory_target

2020-01-21 Thread Stefan Kooman
Quoting Martin Mlynář (nexus+c...@smoula.net):

> 
> When I remove this option:
> # ceph config rm osd osd_memory_target
> 
> OSD starts without any trouble. I've seen same behaviour when I wrote
> this parameter into /etc/ceph/ceph.conf
> 
> Is this a known bug? Am I doing something wrong?

I wonder if they would still crash if the OSDs dropped their caches
beforehand. There is support for this in master, but it doesn't look
like it has been backported to Nautilus: https://tracker.ceph.com/issues/24176
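
Untested sketch: once that lands I would expect the cache drop to be exposed
via the admin socket / "ceph tell". The exact command name below is an
assumption based on the tracker discussion, so check what the help lists on
your version first:

ceph daemon osd.0 help | grep -i cache
ceph tell osd.0 cache drop        # assumed command name, per the tracker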

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph nautilus cluster name

2020-01-16 Thread Stefan Kooman
Quoting Ignazio Cassano (ignaziocass...@gmail.com):
> Hello, I just deployed nautilus with ceph-deploy.
> I did not find any option to give a cluster name to my ceph so its name is
> "ceph".
> Please, how can I change my cluster name without reinstalling?
> 
> Please, how can I set the cluster name in installation phase ?

TL;DR: You don't want to name it anything else. This feature was hardly used
and therefore deprecated. You can find some historic info here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-November/022202.html

I'm not sure whether naming support has already been removed from the code,
but in any case don't try to name it anything else.

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
Quoting Robert LeBlanc (rob...@leblancnet.us):
> 
> req_create
> req_getattr
> req_readdir
> req_lookupino
> req_open
> req_unlink
> 
> We were graphing these as ops, but using the new avgcount, we are getting
> very different values, so I'm wondering if we are choosing the wrong new
> value, or we misunderstood what the old value really was and have been
> plotting it wrong all this time.

I think it's the latter: you were not plotting what you thought you were. We
are using the telegraf module of the manager and use "mds.request" from
"ceph_daemon_stats" to plot the number of requests.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2020-01-14 Thread Stefan Kooman
Quoting Robert LeBlanc (rob...@leblancnet.us):
> The link that you referenced above is no longer available, do you have a
> new link?. We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
> changed, so I'm trying to may the old values to the new values. Might just
> have to look in the code. :(

I cannot recall the metrics ever changing between 12.2.8 and
12.2.12. Anyway, it depends on which module you use to collect the
metrics whether the right ones are even there. See this issue:
https://tracker.ceph.com/issues/41881

...

The "(avg)count" metric is needed to perform calculations to obtain
"avgtime" (sum/avgcount).

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):
> 
> Hmm, I meant you can use large block size for the large files and small
> block size for the small files.
> 
> Sure, but how to do that.  As far as I know block size is a property of the 
> pool, not a single file.

recordsize: https://blog.programster.org/zfs-record-size,
https://blogs.oracle.com/roch/tuning-zfs-recordsize
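
recordsize is a per-dataset property, so a rough sketch would be (dataset
names are examples; values above 128K need the large_blocks pool feature, and
it only affects newly written files):

zfs set recordsize=1M  tank/bigfiles
zfs set recordsize=16K tank/smallfiles
zfs get recordsize tank/bigfiles tank/smallfiles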

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):
> 
> 
> > On Jan 9, 2020, at 8:00 AM, Stefan Kooman  wrote:
> > 
> > Quoting Kyriazis, George (george.kyria...@intel.com):
> > 
> >> The source pool has mainly big files, but there are quite a few
> >> smaller (<4KB) files that I’m afraid will create waste if I create the
> >> destination zpool with ashift > 12 (>4K blocks).  I am not sure,
> >> though, if ZFS will actually write big files in consecutive blocks
> >> (through a send/receive), so maybe the blocking factor is not the
> >> actual file size, but rather the zfs block size.  I am planning on
> >> using zfs gzip-9 compression on the destination pool, if it matters.
> > 
> > You might want to consider Zstandard for compression:
> > https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/
> > 
> Thanks for the pointer.  Sorry, I am not sure how you are suggesting
> to using zstd, since it’s not part of the standard zfs compression
> algorithms.

It's in FreeBSD ... and should be in ZOL soon:
https://github.com/zfsonlinux/zfs/pull/9735

> > You can optimize a ZFS fs to use larger blocks for those files that are
> > small ... and use large block sizes for other fs ... if it's easy to
> > split them.
> > 
> From what I understand, zfs uses a single block per file, if files are
> <4K, ie. It does not put 2 small files in a single block.  How would
> larger blocks help small files?  Also, as far as I know ashift is a
> pool property, set only at pool creation.

Hmm, I meant you can use large block size for the large files and small
block size for the small files.

> 
> I don’t have control over the original files and how they are stored
> in the source server.  These are user’s files.

Then you somehow need to find a middle ground.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD EC images for a ZFS pool

2020-01-09 Thread Stefan Kooman
Quoting Kyriazis, George (george.kyria...@intel.com):

> The source pool has mainly big files, but there are quite a few
> smaller (<4KB) files that I’m afraid will create waste if I create the
> destination zpool with ashift > 12 (>4K blocks).  I am not sure,
> though, if ZFS will actually write big files in consecutive blocks
> (through a send/receive), so maybe the blocking factor is not the
> actual file size, but rather the zfs block size.  I am planning on
> using zfs gzip-9 compression on the destination pool, if it matters.

You might want to consider Zstandard for compression:
https://engineering.fb.com/core-data/smaller-and-faster-data-compression-with-zstandard/

You can optimize a ZFS fs to use larger blocks for those files that are
small ... and use large block sizes for other fs ... if it's easy to
split them.

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rebalance all at once or host-by-host?

2020-01-09 Thread Stefan Kooman
Quoting Sean Matheny (s.math...@auckland.ac.nz):
> I tested this out by setting norebalance and norecover, moving the host 
> buckets under the rack buckets (all of them), and then unsetting. Ceph starts 
> melting down with escalating slow requests, even with backfill and recovery 
> parameters set to throttle. I moved the host buckets back to the default root 
> bucket, and things mostly came right, but I still had some inactive / unknown 
> pgs that I had to restart some OSDs to get back to health_ok.
> 
> I’m sure there’s a way you can tune things or fade in crush weights or 
> something, but I’m happy just moving one at a time.

For big changes like this you can use Dan's UPMAP trick:
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

Python script:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

This way you can pause the process or get into a "HEALTH_OK" state whenever
you want to.
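
A rough outline of how I understand the procedure (read the slides and the
script's README before relying on this):

ceph osd set norebalance
ceph balancer off
# apply the big CRUSH change (e.g. move the host buckets under the racks)
./upmap-remapped.py | sh      # maps the now "misplaced" PGs back to where they are
ceph osd unset norebalance
ceph balancer on              # the balancer then removes the upmaps gradually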

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log format in Ceph

2020-01-08 Thread Stefan Kooman
Quoting Sinan Polat (si...@turka.nl):
> Hi,
> 
> 
> I couldn't find any documentation or information regarding the log format in
> Ceph. For example, I have 2 log lines (see below). For each 'word' I would 
> like
> to know what it is/means.
> 
> As far as I know, I can break the log lines into:
> [date] [timestamp] [unknown] [unknown] [unknown] [pthread] [colon char]
> [unknown] [PRIORITY] [message]
> 
> Can anyone fill in the [unknown] fields, or redirect me to some
> documentation/information?

Issue "ceph daemon osd.3 dump_historic_slow_ops" on the storage node
hosting this OSD and you will get JSON output with the reason
(flag_point) of the slow op and the series of events.
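
Something along these lines (the exact JSON layout may differ per release, so
treat the jq paths as an assumption):

ceph daemon osd.3 dump_historic_slow_ops | \
  jq '.ops[] | {description, duration, flag_point: .type_data.flag_point}'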

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Paul Emmerich (paul.emmer...@croit.io):
> We've also seen some problems with FileStore on newer kernels; 4.9 is the
> last kernel that worked reliably with FileStore in my experience.
> 
> But I haven't seen problems with BlueStore related to the kernel version
> (well, except for that scrub bug, but my work-around for that is in all
> release versions).

What scrub bug are you talking about?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow request and unresponsive kvm guests after upgrading ceph cluster and os, please help debugging

2020-01-07 Thread Stefan Kooman
Quoting Jelle de Jong (jelledej...@powercraft.nl):

> question 2: what systemd target i can use to run a service after all
> ceph-osds are loaded? I tried ceph.target ceph-osd.target both do not work
> reliable.

ceph-osd.target works for us (every time). Have you enabled all the
individual OSD services, i.e. ceph-osd@0.service?
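
If you want your own service to start after the OSDs, a sketch would be a
systemd drop-in like this (the unit name is hypothetical):

# /etc/systemd/system/my-app.service.d/after-ceph.conf
[Unit]
After=ceph-osd.target
Wants=ceph-osd.target

Note that reaching ceph-osd.target only means the OSD services were started,
not that the OSDs are already "up" and "in".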

> question 3: should I still try to upgrade to bluestore or pray to the system
> ods that my performance is back after many many hours of troubleshooting?

I would suggest the first; the second is optional ;-). Especially because
you have a separate NVMe device you can use for the WAL / DB. It has
advantages over FileStore ...

> I made a few changes I am going to just list them for other people that are
> suffering from slow performance after upgrading there Ceph and/or OS.
> 
> Disk utilization is back around 10% no more 80-100%... and rados bench is
> stable again.
> 
> apt-get install irqbalance nftables

^^ Are these some of those changes? Do you need those packages in order
to unload / blacklist them?

I don't get what your fixes are, or what the problem was. Firewall
issues?

What Ceph version did you upgrade to?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Architecture - Recommendations

2020-01-06 Thread Stefan Kooman
Quoting Radhakrishnan2 S (radhakrishnan...@tcs.com):
> Where hypervisor would be your Ceph nodes. I.e. you can connect your
> Ceph nodes on L2 or make them part of the L3 setup (more modern way of
> doing it). You can use "ECMP" to add more network capacity when you need
> it. Setting up a BGP EVPN VXLAN network is not trivial ... I advise on
> getting networking expertise in your team.
> 
> Radha: Thanks for the reference. We are planning to have a dedicated
> set of nodes, for our ceph cluster and not make it hyperconverged. Do
> you see that as a recommended option ? Since we might also have
> baremetal servers for workloads, we want to make the storage as a
> separate dedicated one. 

I would definitely recommend that, especially for larger deployments,
even though it might look old-fashioned now that the rest of the world is
moving everything into containers. It makes (performance) debugging *a
lot* easier, as you can actually isolate things, something which is way
more difficult to achieve on servers that run a complex mixed workload ...

I guess (no proof of that) that performance will be more consistent as
well.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous bluestore poor random write performances

2020-01-02 Thread Stefan Kooman
Quoting Ignazio Cassano (ignaziocass...@gmail.com):
> Hello All,
> I installed ceph luminous with openstack, an using fio in a virtual machine
> I got slow random writes:
> 
> fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test
> --filename=random_read_write.fio --bs=4k --iodepth=64 --size=4G
> --readwrite=randrw --rwmixread=75

Do you use virtio-scsi with a SCSI queue per virtual CPU core? How many
cores do you have? I suspect that the queue depth is hampering
throughput here ... but is throughput performance really interesting
anyway for your use case? Low latency generally matters most.
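
For reference, a libvirt sketch of virtio-scsi with multiple queues (queue
count and device names are examples, not a recommendation):

<controller type='scsi' model='virtio-scsi'>
  <driver queues='4'/>   <!-- roughly one queue per vCPU -->
</controller>
<disk type='network' device='disk'>
  ...
  <target dev='sda' bus='scsi'/>
</disk>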

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Architecture - Recommendations

2019-12-31 Thread Stefan Kooman
Quoting Radhakrishnan2 S (radhakrishnan...@tcs.com):
> In addition, about putting all kinds of disks, putting all drives in
> one box was done for two reasons, 
> 
> 1. Avoid CPU choking
This depends only on what kind of hardware you select and how you
configure it. You can (if need be) restrict the number of CPUs the Ceph
daemons get, for example with cgroups ... (or use containers).
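
For example with systemd (the values are illustrative only):

systemctl set-property ceph-osd@0.service CPUQuota=400%   # at most 4 cores
systemctl show ceph-osd@0.service | grep CPUQuota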

>2. Example: If my cluster has 20 nodes in total,
> then all 20 nodes will have NVMe SSD and NL-SAS, this way I'll get
> more capacity and performance when compared to homogeneous nodes. If I
> have to break the 20 nodes into 5 NVMe based, 5 SSD based and
> remaining 10 as spindle based with NVMe acting as bcache, then I'm
> restricting the count of drives there by lesser IO density /
> performance. Please advice in detail based on your production
> deployments. 

The drawback of having all types of disk in one box is that all pools in your
cluster are affected when one node goes down.

If your storage needs change in the future, then it does not make sense
to buy similar boxes. I.e. it's cheaper to buy dedicated boxes for,
say, spinners only, if you end up needing that (lower CPU requirements,
cheaper boxes). You need to decide if you want max performance or max
capacity.

More, smaller nodes mean the overall impact when one node fails is much
smaller. Just check what your budget allows you to buy with
"all-in-one" boxes versus "dedicated" boxes.

Are you planning on dedicated monitor nodes (I would definitely do
that)?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Architecture - Recommendations

2019-12-31 Thread Stefan Kooman
Hi,
> 
> Radha: I'm sure we are using BGP EVPN over VXLAN, but all deployments
> are through the infrastructure management network. We are a CSP and
> overlay means tenant network, if ceph nodes are in overlay, then
> multiple tenants will need to be able to communicate to the ceph
> nodes. If LB is out of the ceph network, lets say XaaS, will routing
> across networks not create a bottleneck ? I'm novice in network, so if
> you can help with a reference architecture, it would be of help. 

https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn

And

https://vincent.bernat.ch/en/blog/2018-l3-routing-hypervisor

Where "hypervisor" would be your Ceph nodes, i.e. you can connect your
Ceph nodes at L2 or make them part of the L3 setup (the more modern way of
doing it). You can use ECMP to add more network capacity when you need
it. Setting up a BGP EVPN VXLAN network is not trivial ... I advise
getting networking expertise on your team.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR, size and min_size

2019-12-29 Thread Stefan Kooman
Quoting Ml Ml (mliebher...@googlemail.com):
> Hello Stefan,
> 
> The status was "HEALTH_OK" before i ran those commands.

\o/

> root@ceph01:~# ceph osd crush rule dump
> [
> {
> "rule_id": 0,
> "rule_name": "replicated_ruleset",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"


^^ This is the important part ... host as failure domain (not osd), but
that's fine in your case.

Make sure you only remove OSDs within one failure domain at a time and
you're safe.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_ERR, size and min_size

2019-12-29 Thread Stefan Kooman
Quoting Ml Ml (mliebher...@googlemail.com):
> Hello List,
> i have size = 3 and min_size = 2 with 3 Nodes.

That's good.

> 
> 
> I replaced two osds on node ceph01 and ran into "HEALTH_ERR".
> My problem: it waits for the backfilling process?
> Why did i run into HEALTH_ERR? I thought all data will be available on
> at least one more node. or even two:

How did you replace them? Did you first set them "out" and wait for
the data to be replicated elsewhere before you removed them?

It *might* be because your CRUSH rule set is replicating over "osd" and
not "host". What does "ceph osd crush rule dump" show?
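
For reference, the way I would replace an OSD (osd.12 is just an example id):

ceph osd out osd.12
# wait until all PGs are active+clean again:
ceph -s
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it
# then add the new disk and let it backfill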

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Consumer-grade SSD in Ceph

2019-12-27 Thread Stefan Kooman
Quoting Sinan Polat (si...@turka.nl):
> Thanks for all the replies. In summary; consumer grade SSD is a no go.
> 
> What is an alternative to the SM863a? It is quite hard to get these since
> they are non-stock.

PM863a ... lower endurance ... but still "enterprise", and as you're
not concerned about lifetime this is just fine. We use quite a lot of
them and even after ~ 2 years the most-used SSD is at 4.4% of its write
endurance.
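
You can keep an eye on the wear with smartctl; the attribute name differs per
vendor (Wear_Leveling_Count on the Samsungs, if I remember correctly):

smartctl -A /dev/sdX | grep -i wear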

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs kernel client io performance decreases extremely

2019-12-27 Thread Stefan Kooman
Quoting renjianxinlover (renjianxinlo...@163.com):
> HI, Nathan, thanks for your quick reply!
> comand 'ceph status' outputs warning including about ten clients failing to 
> respond to cache pressure;
> in addition, in mds node, 'iostat -x 1' shows drive io usage of mds within 
> five seconds as follow,

You should run "iostat -x 1" on the OSD nodes ... the MDS is not doing
any I/O in and of itself as far as Ceph is concerned.

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Architecture - Recommendations

2019-12-27 Thread Stefan Kooman
Quoting Radhakrishnan2 S (radhakrishnan...@tcs.com):
> Hello Everyone, 
> 
> We have a pre-prod Ceph cluster and working towards a production cluster 
> deployment. I have the following queries and request all your expert tips, 
> 
> 
> 1. Network architecture - We are looking for a private and public network, 
> plan is to have L2 at both the networks. I understand that Object / S3 
> needs L3 for tenants / users to access outside the network / overlay. What 
> would be your recommendations to avoid any network related latencies, like 
> should we have a tiered network ? We are intending to go with the standard 
> Spine leaf model, with dedicated TOR for Storage and dedicated leafs for 
> Clients/ Hypervisors / Compute nodes. 

Leaf-spine is fine. Are you planning on a big setup? How many nodes?
Leaf-spine scales well, so this shouldn't be a problem. Network
latency won't be the bottleneck of your Ceph cluster; Ceph will be.

I would advise against a separate public / private (cluster) network. It
makes things more complicated than needed (and some issues can be hard to
debug when the network is partially up).

> 2. Node Design - We are planning to host nodes with mixed set of drives 
> like NVMe, SSD and NL-SAS all in one node in a specific ratio. This is 
> only to avoid any choking of CPU due to the high performance nodes. Please 
> suggest your opinion. 

Don't mix if you don't need to. You can optimize hardware according to
your needs: a lighter CPU for spinners, a beefier CPU for NVMe. Why would
you want to put it all in one box? If you are planning to use NVMe ...
why bother with SSD? NVMe drives are sometimes even cheaper than SSDs nowadays.
You might use one NVMe device for a couple of spinners to put their WAL / DB on.
It's generally better to have more, smaller nodes than a few big ones.
Ideally you don't want to lose more than 10% of your cluster when a node
goes down (12 nodes and up; more is better).

> 3, S3 Traffic - What is the secured way to provide object storage in a 
> multi tenant environment since LB/ RGW-HA'd, is going to be in an underlay 
> that cant be exposed to clients/ users in the tenant network. Is there a 
> way to add an external IP as VIP to LB/RGW that could be commonly used by 
> all tenants ? 

Underlay / overlay ... are you going to use BGP EVPN (over VXLAN)? In
that case you would have the Ceph nodes in the overlay ... You can put an
LB / proxy up front (Varnish, haproxy, nginx, relayd, etc.), outside of
the Ceph network, and connect over HTTP to the RGW nodes ... which can
reach the Ceph network (or are even part of it) on the backend.
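
A minimal haproxy sketch, assuming the default civetweb port 7480 and example
addresses:

frontend s3_in
    bind 203.0.113.10:443 ssl crt /etc/haproxy/s3.pem
    default_backend rgw
backend rgw
    balance roundrobin
    server rgw1 10.0.0.11:7480 check
    server rgw2 10.0.0.12:7480 check

The VIP (203.0.113.10 here) is then the only address the tenants ever see.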

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help! pg inactive and slow requests after filestore to bluestore migration, version 12.2.12

2019-12-16 Thread Stefan Kooman
Quoting Jelle de Jong (jelledej...@powercraft.nl):
> 
> It took three days to recover and during this time clients were not
> responsive.
> 
> How can I migrate to bluestore without inactive pgs or slow request. I got
> several more filestore clusters and I would like to know how to migrate
> without inactive pgs and slow reguests?

Several users reported that setting the following parameters:

osd op queue = wpq
osd op queue cut off = high

helped in cases like this.

Your mileage may vary ...
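
As far as I know these settings are only picked up at OSD start, so set them
in ceph.conf (or the config store) and restart the OSDs one by one. Afterwards
you can verify with:

ceph daemon osd.0 config show | egrep 'osd_op_queue|osd_op_queue_cut_off'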

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Use telegraf/influx to detect problems is very difficult

2019-12-12 Thread Stefan Kooman
Quoting Miroslav Kalina (miroslav.kal...@livesport.eu):

> Monitor down is also easy as pie, because it's just "num_mon -
> mon_quorum". But there is also metric mon_outside_quorum which I have
> always zero and don't really know how it works.

See this issue if you want to know what it is used for:
https://tracker.ceph.com/issues/35947

TL;DR: it's not what you think it is.

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon is blocked after shutting down and ip address changed

2019-12-11 Thread Stefan Kooman
Quoting Cc君 (o...@qq.com):
> Hi,daemon  is running when using admin socket
> [root@ceph-node1 ceph]#  ceph --admin-daemon 
> /var/run/ceph/ceph-mon.ceph-node1.asok mon_status
> {
>     "name": "ceph-node1",
>     "rank": 0,
>     "state": "leader",
>     "election_epoch": 63,
>     "quorum": [
>         0
>     ],
>     "quorum_age": 40839,

Your ceph.conf shows that the messenger should listen on 3300 (v2) and 6789
(v1). If only 6789 is actually listening ... and the client tries to
connect to 3300 ... you might get a timeout as well. I'm not sure whether
the messenger falls back to v1.
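
A quick way to check is to compare what is actually listening with what the
monmap advertises (example commands, run on the monitor node):

ss -tlnp | egrep ':3300|:6789'
ceph daemon mon.$(hostname) mon_status | grep -A4 addr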

What happens when you change ceph.conf (first without restarting the
mon) and try a "ceph -s" again with a ceph client on the monitor node?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph assimilated configuration - unable to remove item

2019-12-11 Thread Stefan Kooman
Quoting David Herselman (d...@syrex.co):
> Hi,
> 
> We assimilated our Ceph configuration to store attributes within Ceph
> itself and subsequently have a minimal configuration file. Whilst this
> works perfectly we are unable to remove configuration entries
> populated by the assimilate-conf command.

I forgot about this issue, but I encountered this when we upgraded to
mimic. I can confirm this bug. It's possible to have the same key
present with different values. For our production cluster we decided to
stick to ceph.conf for the time being. That's also the workaround for
now if you want to override the config store: just put that in your
config file and reboot the daemon(s).

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 回复: ceph-mon is blocked after shutting down and ip address changed

2019-12-10 Thread Stefan Kooman
Quoting Cc君 (o...@qq.com):
> 4.[root@ceph-node1 ceph]# ceph -s
> just blocked ...
> error 111 after a few hours

Is the daemon running? You can check whether the process is alive via its
admin socket: /var/run/ceph/ceph-mon.$hostname.asok

If so ... then query the monitor for its status:

ceph daemon mon.$hostname quorum_status

If there is no monitor in quorum ... then that's your problem. See [1]
for more info on debugging the monitor.

Gr. Stefan

[1]:
https://docs.ceph.com/docs/nautilus/rados/troubleshooting/troubleshooting-mon/

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-mgr :: Grafana + Telegraf / influxdb metrics format

2019-12-10 Thread Stefan Kooman
Quoting Miroslav Kalina (miroslav.kal...@livesport.eu):
> Hello guys,
> 
> is there anyone using Telegraf / InfluxDB metrics exporter with Grafana
> dashboards? I am asking like that because I was unable to find any
> existing Grafana dashboards based on InfluxDB.

\o (telegraf)

> I am having hard times with creating graphs I want to see. Metrics are
> exported in way that every single one is stored in separated series in
> Influx like:
> 
> > ceph_pool_stats,cluster=ceph1,metric=read value=1234 15506589110
> > ceph_pool_stats,cluster=ceph1,metric=write value=1234 15506589110
> > ceph_pool_stats,cluster=ceph1,metric=total value=1234 15506589110
> 
> instead of single series like:
> 
> > ceph_pool_stats,cluster=ceph1 read=1234,write=1234,total=1234
> 15506589110
 
> I didn't see any possibility how to modify metrics format exported to
> Telegraf. I feel like I am missing something pretty obvious here.

You are not missing anything here; this is how it is. The InfluxDB
module was designed (and coded, AFAIK) like the Prometheus module (which
came first). The telegraf module was designed as a "drop-in" replacement
for the InfluxDB module. But unlike Prometheus, you can't do math in InfluxDB,
which complicates things. Besides that, some essential metrics are completely
missing in all but the Prometheus module [1]. And then there are metrics not
exported at all by the manager; they opened an "etherpad" for that [2].
We are using the "local" ceph plugin of telegraf to export extra
metrics. You can then use those metrics with this dashboard [3]. You can
also take the Prometheus metrics and push them to InfluxDB with the Prometheus
plugin in telegraf [4]. You might be able to take a Prometheus dashboard
and convert it to an InfluxDB-compatible dashboard in Grafana. I think I
would do that if I were to do it all over again. And / or use Prometheus
with InfluxDB as the backend for long(er)-term storage. With the new
InfluxDB query language "flux" [5], this whole thing might become a thing
of the past. It can already be tested in Grafana as a BETA plugin [6].

Gr. Stefan

[1]: https://tracker.ceph.com/issues/41881
[2]: https://pad.ceph.com/p/perf-counters-to-expose
[3]: https://grafana.com/grafana/dashboards/7995
[4]: 
https://github.com/influxdata/telegraf/tree/master/plugins/inputs/Prometheus
[5]: https://www.influxdata.com/products/flux/
[6]: https://grafana.com/grafana/plugins/grafana-influxdb-flux-datasource

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon is blocked after shutting down and ip address changed

2019-12-10 Thread Stefan Kooman
Quoting Cc君 (o...@qq.com):
> ceph version 14.2.4 (75f4de193b3ea58512f204623e6c5a16e6c1e1ba) nautilus 
> (stable)
> 
> os: CentOS Linux release 7.7.1908 (Core)
> single-node ceph cluster with 1 mon, 1 mgr, 1 mds, 1 rgw and 12 osds, but
> only cephfs is used.
> "ceph -s" is blocked after shutting down the machine
> (192.168.0.104), then the ip address changed to 192.168.1.6
> 
> I created the monmap with the monmap tool and updated ceph.conf and the
> hosts file, and then started ceph-mon.
> and the ceph-mon  log:
> ...
> 2019-12-11 08:57:45.170 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1285.14s
> 2019-12-11 08:57:50.170 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1290.14s
> 2019-12-11 08:57:55.171 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1295.14s
> 2019-12-11 08:58:00.171 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1300.14s
> 2019-12-11 08:58:05.172 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1305.14s
> 2019-12-11 08:58:10.171 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1310.14s
> 2019-12-11 08:58:15.173 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1315.14s
> 2019-12-11 08:58:20.173 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1320.14s
> 2019-12-11 08:58:25.174 7f952cdac700  1 mon.ceph-node1@0(leader).mds e34 
> no beacon from mds.0.10 (gid: 4384 addr: 
> [v2:192.168.0.104:6898/4084823750,v1:192.168.0.104:6899/4084823750] state: 
> up:active) since 1325.14s
> 
> ...
> 
> 
> I changed IP back to 192.168.0.104 yeasterday, but all the same.

Just checking here: do you run a firewall? Is port 3300 open (besides
6789)?

What do you see in the logs of the MDS and the OSDs? There are timers
configured in the MON / OSD in case they cannot reach each other in
time; OSDs might get marked out. But I'm unsure what the status of
your cluster is. Could you paste a "ceph -s"?

Gr. Stefan

P.s. BTW: is this running production?

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing Ceph perf-counters in Ceph-Dashboard or Prometheus/InfluxDB...?

2019-12-09 Thread Stefan Kooman
Quoting Ernesto Puerta (epuer...@redhat.com):

> The default behaviour is that only perf-counters with priority
> PRIO_USEFUL (5) or higher are exposed (via `get_all_perf_counters` API
> call) to ceph-mgr modules (including Dashboard, DiskPrediction or
> Prometheus/InfluxDB/Telegraf exporters).
> 
> While changing that is rather trivial, it could make sense to get
> users' feedback and come up with a list of missing perf-counters to be
> exposed.

I made https://tracker.ceph.com/issues/41881 a while ago: missing metrics
in all modules but the Prometheus one.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-06 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):

> 13.2.6 with this patch is running production now. We will continue the
> cleanup process that *might* have triggered this tomorrow morning.

For what it's worth ... that process completed successfully ... Time will
tell if it's really fixed, but it looks promising ...

FYI

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-05 Thread Stefan Kooman
Hi,

Quoting Yan, Zheng (uker...@gmail.com):

> Please check if https://github.com/ceph/ceph/pull/32020 works

Thanks!

13.2.6 with this patch is running production now. We will continue the
cleanup process that *might* have triggered this tomorrow morning.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-04 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
>  and it crashed again (and again) ... until we stopped the mds and
> deleted the mds0_openfiles.0 from the metadata pool.
> 
> Here is the (debug) output:
> 
> A specific workload that *might* have triggered this: recursively deleting a 
> long
> list of files and directories (~ 7 million in total) with 5 "rm" processes
> in parallel ...

It crashed two times ... this is the other info of the crash:

-10001> 2019-12-04 20:28:34.929 7fd43ce9b700  5 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/3833566625 >> 
[2001:7b8:80:1:0:1:2:10]:6803/727090 conn(0x55e93ca96300 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=141866652 cs=1 l=1). rx 
osd.90 seq 32255 0x55e9416cb0c0 osd_op_reply(4104640 10001afe266. 
[stat,omap-set-header,omap-set-vals] v63840'10049940 uv10049940 ondisk = 0) v8
-10001> 2019-12-04 20:28:34.929 7fd43ce9b700  1 -- 
[2001:7b8:80:3:0:2c:3:2]:6800/3833566625 <== osd.90 
[2001:7b8:80:1:0:1:2:10]:6803/727090 32255  osd_op_reply(4104640 
10001afe266. [stat,omap-set-header,omap-set-vals] v63840'10049940 
uv10049940 ondisk = 0) v8  248+0+0 (969216453 0 0) 0x55e9416cb0c0 con 
0x55e93ca96300
-10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles omap_num_objs 
1025
-10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles anchor_map size 
19678
-10001> 2019-12-04 20:28:34.937 7fd436ca7700 -1 
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void 
OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 
7fd436ca7700 time 2019-12-04 20:28:34.939048
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs 
<= MAX_OBJECTS)

mds.0.openfiles omap_num_objs 1025 <- ... just 1 higher than 1024? Coincidence?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-12-04 Thread Stefan Kooman
Hi,

Quoting Stefan Kooman (ste...@bit.nl):

> > please apply following patch, thanks.
> > 
> > diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
> > index c0f72d581d..2ca737470d 100644
> > --- a/src/mds/OpenFileTable.cc
> > +++ b/src/mds/OpenFileTable.cc
> > @@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c,
> > uint64_t log_seq, int op_p
> >   }
> >   if (omap_idx < 0) {
> > ++omap_num_objs;
> > -   assert(omap_num_objs <= MAX_OBJECTS);
> > +   if (omap_num_objs > MAX_OBJECTS) {
> > + dout(1) << "omap_num_objs " << omap_num_objs << dendl;
> > + dout(1) << "anchor_map size " << anchor_map.size() << dendl;
> > + assert(omap_num_objs <= MAX_OBJECTS);
> > +   }
> > omap_num_items.resize(omap_num_objs);
> > omap_updates.resize(omap_num_objs);
> > omap_updates.back().clear = true;
> 
> It took a while but an MDS server with this debug patch is now live (and
> up:active).

 and it crashed again (and again) ... until we stopped the mds and
deleted the mds0_openfiles.0 from the metadata pool.

Here is the (debug) output:

2019-12-04 06:25:01.578 7f6200248700 -1 received  signal: Hangup from pkill -1 
-x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw  (PID: 3491) UID: 0
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles omap_num_objs 1025
2019-12-04 20:19:58.043 7f61fc859700  0 mds.0.openfiles anchor_map size 4417650
2019-12-04 20:19:58.043 7f61fc859700 -1 
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void 
OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 
7f61fc859700 time 2019-12-04 20:19:58.045875
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs 
<= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7f6207d01b5e]
 2: (()+0x2c4cb7) [0x7f6207d01cb7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) 
[0x55e38662566f]
 4: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 5: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 6: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 7: (Context::complete(int)+0x9) [0x55e3863894b9]
 8: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 9: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 10: (()+0x76db) [0x7f62075b56db]
 11: (clone()+0x3f) [0x7f620679b88f]

2019-12-04 20:19:58.043 7f61fc859700 -1 *** Caught signal (Aborted) **
 in thread 7f61fc859700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x12890) [0x7f62075c0890]
 2: (gsignal()+0xc7) [0x7f62066b8e97]
 3: (abort()+0x141) [0x7f62066ba801]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x25f) [0x7f6207d01c6f]
 5: (()+0x2c4cb7) [0x7f6207d01cb7]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1c5f) 
[0x55e38662566f]
 7: (MDLog::trim(int)+0x5a6) [0x55e386614666]
 8: (MDSRankDispatcher::tick()+0x24b) [0x55e3863a637b]
 9: (FunctionContext::finish(int)+0x2c) [0x55e38638b51c]
 10: (Context::complete(int)+0x9) [0x55e3863894b9]
 11: (SafeTimer::timer_thread()+0xf9) [0x7f6207cfe329]
 12: (SafeTimerThread::entry()+0xd) [0x7f6207cffa3d]
 13: (()+0x76db) [0x7f62075b56db]
 14: (clone()+0x3f) [0x7f620679b88f]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

A specific workload that *might* have triggered this: recursively deleting a 
long
list of files and directories (~ 7 million in total) with 5 "rm" processes
in parallel ...

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to encode map errors

2019-12-04 Thread Stefan Kooman
Quoting John Hearns (j...@kheironmed.com):
> And me again for the second time in one day.
> 
> ceph -w is now showing messages like this:
> 
> 2019-12-03 15:17:22.426988 osd.6 [WRN] failed to encode map e28961 with
> expected crc

I have seen messages like this when there are daemons running with
different ceph versions. What does "ceph versions" give you?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-11-24 Thread Stefan Kooman
Hi,

Quoting Yan, Zheng (uker...@gmail.com):

> > > I double checked the code, but didn't find any clue. Can you compile
> > > mds with a debug patch?
> >
> > Sure, I'll try to do my best to get a properly packaged Ceph Mimic
> > 13.2.6 with the debug patch in it (and / or get help to get it build).
> > Do you already have the patch (on github) somewhere?
> >
> 
> please apply following patch, thanks.
> 
> diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
> index c0f72d581d..2ca737470d 100644
> --- a/src/mds/OpenFileTable.cc
> +++ b/src/mds/OpenFileTable.cc
> @@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c,
> uint64_t log_seq, int op_p
>   }
>   if (omap_idx < 0) {
> ++omap_num_objs;
> -   assert(omap_num_objs <= MAX_OBJECTS);
> +   if (omap_num_objs > MAX_OBJECTS) {
> + dout(1) << "omap_num_objs " << omap_num_objs << dendl;
> + dout(1) << "anchor_map size " << anchor_map.size() << dendl;
> + assert(omap_num_objs <= MAX_OBJECTS);
> +   }
> omap_num_items.resize(omap_num_objs);
> omap_updates.resize(omap_num_objs);
> omap_updates.back().clear = true;

It took a while but an MDS server with this debug patch is now live (and
up:active).

FYI,

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> I double checked the code, but didn't find any clue. Can you compile
> mds with a debug patch?

Sure, I'll do my best to get a properly packaged Ceph Mimic
13.2.6 with the debug patch in it (and / or get help to get it built).
Do you already have the patch (on GitHub) somewhere?

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
> of the crashed mds)

OK, the MDS crashed again and restarted. I stopped it, deleted the object and
restarted the MDS. It became active right away.
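
For anyone else hitting this: the deletion boils down to something like the
following (the metadata pool name is an assumption, adjust to yours):

# with the MDS stopped:
rados -p cephfs_metadata rm mds0_openfiles.0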

Any idea why the openfiles list (object) becomes corrupted? As in: do you
have a bugfix in place?

Thanks!

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-21 Thread Stefan Kooman
Quoting Yan, Zheng (uker...@gmail.com):

> delete 'mdsX_openfiles.0' object from cephfs metadata pool. (X is rank
> of the crashed mds)

Just to make sure I understand correctly: the current status is that the MDS
is active (no standby for now) and not in a "crashed" state (although it
has crashed at least 10 times now).

Is the following what you want me to do, and safe to do in this
situation?

1) Stop running (active) MDS
2) delete object 'mdsX_openfiles.0' from cephfs metadata pool

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
Dear list,

Quoting Stefan Kooman (ste...@bit.nl):

> I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
> on any other system.
> 
> Any hints / help to prevent this from happening?

We have had this happen another two times now. In both cases the MDS
recovers, becomes active (for a few seconds), and crashes again. It won't
come out of this loop by itself. When put in debug mode (debug_mds =
10/10) we don't hit the bug and it stays active. After a few minutes we
disable debugging live (ceph tell mds.* config set debug_mds 0/0) and it
keeps running (Heisenbug) ... until hours later, when it crashes again and
the story repeats itself.

So unfortunately no more debug information available, but at least a
workaround to get it active again.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS crash - FAILED assert(omap_num_objs <= MAX_OBJECTS)

2019-10-19 Thread Stefan Kooman
Dear list,

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void 
OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 
7f7906cb7700 time 2019-10-19 08:14:50.648559
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs 
<= MAX_OBJECTS)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7f7911b2897e]
 2: (()+0x2fab07) [0x7f7911b28b07]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) 
[0x7703f7]
 4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 7: (Context::complete(int)+0x9) [0x4d31d9]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 10: (()+0x76ba) [0x7f79113a76ba]
 11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
 in thread 7f7906cb7700 thread_name:safe_timer

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7f79113b1390]
 2: (gsignal()+0x38) [0x7f7910afe428]
 3: (abort()+0x16a) [0x7f7910b0002a]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x256) [0x7f7911b28a86]
 5: (()+0x2fab07) [0x7f7911b28b07]
 6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) 
[0x7703f7]
 7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
 8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
 9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
 10: (Context::complete(int)+0x9) [0x4d31d9]
 11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
 12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
 13: (()+0x76ba) [0x7f79113a76ba]
 14: (clone()+0x6d) [0x7f7910bd041d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

Apparently this is bug 36094 (https://tracker.ceph.com/issues/36094).

Our active MDS had mds_cache_memory_limit=150G and ~ 27 M caps handed
out to 78 clients, a few of them holding many millions of caps. This
resulted in a laggy MDS ... another failover ... until the MDS was finally
able to cope with the load.

We adjusted mds_cache_memory_limit to 32G right after that and activated
the new limit: ceph tell mds.* config set mds_cache_memory_limit
34359738368

We double-checked it was set correctly and monitored memory usage. That all
went fine: around ~ 6 M caps in use (2 clients used 5/6 of those). After
~ 5 hours the same assert was hit. Fortunately the failover was way
faster now ... but then the now-active MDS hit the same assert again,
triggering another failover ... the other MDS took over and failed again ...
the first took over and CephFS was healthy again ...

The bug report does not hint at how to prevent this situation. Recently
Zoë O'Connell hit the same issue on a Mimic 13.2.6 system:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I wonder if this situation is more likely to be hit on Mimic 13.2.6 than
on any other system.

Any hints / help to prevent this from happening?

Thanks,

Stefan



-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS Stability with lots of CAPS

2019-10-02 Thread Stefan Kooman
Hi,

According to [1] there are new parameters in place to make the MDS
behave more stably. Quoting that blog post: "One of the more recent
issues we've discovered is that an MDS with a very large cache (64+GB)
will hang during certain recovery events."

For all of us who are not (yet) running Nautilus, I wonder what the best
course of action is to prevent an unstable MDS during recovery situations.

Artificially limit "mds_cache_memory_limit" to, say, 32 GB?

I wonder if the number of clients has an influence on an MDS being
overwhelmed by release messages. Or are a handful of clients (with
millions of caps) able to overload an MDS?
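
To get an idea of the current distribution, something like this shows the
caps per client session (field names may differ per release, so treat the jq
path as an assumption):

ceph daemon mds.$(hostname) session ls | jq '.[] | {id, num_caps}'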

Is there a way, other than unmounting CephFS on the clients, to decrease the
number of caps the MDS has handed out, before an upgrade to a newer Ceph
release is undertaken when running Luminous / Mimic?

I'm assuming you need to restart the MDS to make the
"mds_cache_memory_limit" effective, is that correct?

Gr. Stefan

[1]: https://ceph.com/community/nautilus-cephfs/


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-02 Thread Stefan Kooman
> 
> I created this issue: https://tracker.ceph.com/issues/42116
> 
> Seems to be related to the 'crash' module not enabled.
> 
> If you enable the module the problem should be gone. Now I need to check
> why this message is popping up.

Yup, with the crash module enabled the error message is gone. It makes
sense to enable the crash module anyway.
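
For anyone else running into this, enabling it is a one-liner, and you can
then also list any recorded crashes:

ceph mgr module enable crash
ceph crash ls     # if your release already has the crash CLI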

Thanks,

Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-10-01 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Hi List,
> 
> We are planning to move a filesystem workload (currently nfs) to CephFS.
> It's around 29 TB. The unusual thing here is the amount of directories
> in use to host the files. In order to combat a "too many files in one
> directory" scenario a "let's make use of recursive directories" approach.
> Not ideal either. This workload is supposed to be moved to (Ceph) S3
> sometime in the future, but until then, it has to go to a shared
> filesystem ...
> 
> So what is unusual about this? The directory layout looks like this
> 
> /data/files/00/00/[0-8][0-9]/[0-9]/ from this point on there will be 7
> directories created to store 1 file.
> 
> Total amount of directories in a file path is 14. There are around 150 M
> files in 400 M directories.
> 
> The working set won't be big. Most files will just sit around and will
> not be touched. The active amount of files wil be a few thousand.
> 
> We are wondering if this kind of directory structure is suitable for
> CephFS. Might the MDS get difficulties with keeping up that many inodes
> / dentries or doesn't it care at all?
> 
> The amount of metadata overhead might be horrible, but we will test that
> out.

This awkward dataset is "live" ... and the MDS has been happily
crunching away so far, peaking at 42.5 M caps. Multiple parallel rsyncs
(20+) to fill CephFS were no issue whatsoever.

Thanks Nathan Fish and Burkhard Linke for sharing helpful MDS insight!

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Have you enabled the telemetry module yet?

2019-10-01 Thread Stefan Kooman
Quoting Wido den Hollander (w...@42on.com):
> Hi,
> 
> The Telemetry [0] module has been in Ceph since the Mimic release and
> when enabled it sends back a anonymized JSON back to
> https://telemetry.ceph.com/ every 72 hours with information about the
> cluster.
> 
> For example:
> 
> - Version(s)
> - Number of MONs, OSDs, FS, RGW
> - Operating System used
> - CPUs used by MON and OSD
> 
> Enabling the module is very simple:
> 
> $ ceph mgr module enable telemetry

This worked.

ceph mgr module ls
{
"enabled_modules": [
...
...
"telemetry"
],

> Before enabling the module you can also view the JSON document it will
> send back:
> 
> $ ceph telemetry show

This gives me:

ceph telemetry show
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/telemetry/module.py", line 325, in handle_command
report = self.compile_report()
  File "/usr/lib/ceph/mgr/telemetry/module.py", line 291, in compile_report
report['crashes'] = self.gather_crashinfo()
  File "/usr/lib/ceph/mgr/telemetry/module.py", line 214, in gather_crashinfo
errno, crashids, err = self.remote('crash', 'do_ls', '', '')
  File "/usr/lib/ceph/mgr/mgr_module.py", line 845, in remote
args, kwargs)
ImportError: Module not found

Running 13.2.6 on Ubuntu Xenial 16.04.6 LTS

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs: apache locks up after parallel reloads on multiple nodes

2019-09-12 Thread Stefan Kooman
Dear list,

We recently switched the shared storage for our linux shared hosting
platforms from "nfs" to "cephfs". Performance improvement are
noticeable. It all works fine, however, there is one peculiar thing:
when Apache reloads after a logrotate of the "error" logs all but one
node will hang for ~ 15 minutes. The log rotates are scheduled with a
cron, the nodes themselves synced with ntp. The first node that reloads
apache will keep on working, all the others will hang, and after a
period of ~ 15 minutes they will all recover almost simultaneously.

Our setup looks like this: 10 webservers all sharing the same cephfs
filesystem. Each webserver with around 100 apache threads has around
10.000 open file handles to "error" logs on cephfs. To be clear, all
webservers have a file handle on _the same_ "error" logs. The logrotate
takes around two seconds on the "surviving" node.

What could be the reason for this? Does it have something to do with
file locking, i.e. that it behaves differently on cephfs compared to nfs
(more strict)? What would be a good way to find out what is the root
cause? We have sysdig traces of different nodes, but on the nodes where
apache hangs not a lot is going on ... until it all recovers.

We remediated this by delaying the Apache reloads on all but one node.
Then there is no issue at all, even though all the other web servers still
reload almost at the same time.
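
For anyone running into the same thing: the workaround boils down to a
logrotate postrotate stanza along these lines (the designated host name
and the delay are made up here, adjust to your own setup):

postrotate
    # one designated node reloads right away, the others wait a bit
    if [ "$(hostname -s)" != "web01" ]; then
        sleep 120
    fi
    systemctl reload apache2 > /dev/null 2>&1 || true
endscript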

Any info / hints on how to investigate this issue further are highly
appreciated.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] units of metrics

2019-09-12 Thread Stefan Kooman
Hi Paul,

Quoting Paul Emmerich (paul.emmer...@croit.io):
> https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf

Thanks for the link. So, what tool do you use to gather the metrics? We
are using the telegraf module of the Ceph manager. However, this module only
provides "sum" and not "avgtime", so I can't do the calculations. The
influx and zabbix mgr modules also only provide "sum". The only metrics
module that *does* send "avgtime" is the prometheus module:

ceph_mds_reply_latency_sum
ceph_mds_reply_latency_count

All modules use "self.get_all_perf_counters()" though:

~/git/ceph/src/pybind/mgr/ > grep -Ri get_all_perf_counters *
dashboard/controllers/perf_counters.py:return mgr.get_all_perf_counters()
diskprediction_cloud/agent/metrics/ceph_mon_osd.py:perf_data = obj_api.module.get_all_perf_counters(services=('mon', 'osd'))
influx/module.py:for daemon, counters in six.iteritems(self.get_all_perf_counters()):
mgr_module.py:def get_all_perf_counters(self, prio_limit=PRIO_USEFUL,
prometheus/module.py:for daemon, counters in self.get_all_perf_counters().items():
restful/api/perf.py:counters = context.instance.get_all_perf_counters()
telegraf/module.py:for daemon, counters in six.iteritems(self.get_all_perf_counters())

Besides the *ceph* telegraf module we also use the ceph plugin for
telegraf ... but that plugin does not (yet?) provide mds metrics though.
Ideally we would *only* use the ceph mgr telegraf module to collect *all
the things*.

I'm not sure what difference in the Python code between the modules could
explain this.
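
For reference, the average reply latency can be derived from the
prometheus counters by taking deltas between scrapes (a sketch; metric
names as exported by the prometheus module):

  avg reply latency = delta(ceph_mds_reply_latency_sum)
                      / delta(ceph_mds_reply_latency_count)

(in PromQL terms: the rate() of the _sum divided by the rate() of the
_count over the same window).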

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-09-11 Thread Stefan Kooman
Quoting Massimo Sgaravatto (massimo.sgarava...@gmail.com):
> Thank you
> 
> But the algorithms used during backfilling and during rebalancing (to
> decide where data have to be placed) are different ?

Yes, the balancer takes more factors into account: it considers all of the
pools and can make smarter decisions. We noticed way less data movement
when using the balancer than expected.

> I.e. assuming that no new data are written and no data are deleted, if you
> rely on the standard way (i.e. backfilling), when the data movement process
> finishes (and therefore the status is HEALTH_OK), can the automatic
> balancer (in upmap mode) decide that  some data have to be re-moved ?

Yes, for sure. Ceph's balancing is not perfect (because the number of PGs is
less than you need for ideal placement). You can look at "ceph osd df" and
check the standard deviation. If that is quite high it makes sense to
use the balancer to equalize and obtain higher utilization. Either PG
optimized or capacity optimized (or a mix of both, the default balancer
settings).
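
A quick way to check this (assuming the balancer mgr module is enabled):

ceph osd df         # check the %USE spread and the STDDEV summary
ceph balancer eval  # prints a cluster score, lower is better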

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-09-11 Thread Stefan Kooman
Quoting Massimo Sgaravatto (massimo.sgarava...@gmail.com):
> Just for my education, why letting the balancer moving the PGs to the new
> OSDs (CERN approach) is better than  a throttled backfilling ?

1) Because you can pause the process at any given moment and obtain
HEALTH_OK again. 2) The balancer moves the data more efficiently. 3) The
balancer will avoid putting PGs on OSDs that are already full ... you
might avoid "too full" PG situations.

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] regurlary 'no space left on device' when deleting on cephfs

2019-09-06 Thread Stefan Kooman
Quoting Kenneth Waegeman (kenneth.waege...@ugent.be):
> The cluster is healthy at this moment, and we have certainly enough space
> (see also osd df below)

It's not well balanced though ... do you use the ceph balancer (in
upmap mode)?

Gr. Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 14.2.2 -> 14.2.3 upgrade [WRN] failed to encode map e905 with expected crc

2019-09-06 Thread Stefan Kooman
Hi,

While upgrading the monitors on a Nautilus test cluster, warning messages
appear:

[WRN] failed to encode map e905 with expected crc

Is this expected?

I have only seen this in the past when mixing different releases (major
versions), not when upgrading within a release.

What is the impact of this?

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] units of metrics

2019-09-04 Thread Stefan Kooman
Hi,

Just wondering, what are the units of the metrics logged by "perf dump"?
For example mds perf dump:

"reply_latency": {
"avgcount": 560013520,
"sum": 184229.305658729,
"avgtime": 0.000328972


Is the 'avgtime' in seconds, with "avgtime": 0.000328972 representing
0.328972 ms?
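
At least the numbers above are internally consistent, assuming "sum" is
in seconds and "avgcount" is the number of sampled requests:

  184229.305658729 / 560013520 =~ 0.000328972 s, i.e. ~ 0.33 ms per request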

As far as I can see, the metrics collected by the telegraf manager plugin
only include "sum". So how would I calculate the average reply latency for
MDS requests?

Thanks,

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
Quoting Nathan Fish (lordci...@gmail.com):
> MDS CPU load is proportional to metadata ops/second. MDS RAM cache is
> proportional to # of files (including directories) in the working set.
> Metadata pool size is proportional to total # of files, plus
> everything in the RAM cache. I have seen that the metadata pool can
> balloon 8x between being idle, and having every inode open by a
> client.
> The main thing I'd recommend is getting SSD OSDs to dedicate to the
> metadata pools, and SSDs for the HDD OSD's DB/WAL. NVMe if you can. If
> you put that much metadata on only HDDs, it's going to be slow.

Only SSDs for the OSD data pool and NVMe for the metadata pool, so that
should be fine. Apart from the initial loading of that many files /
directories, this workload shouldn't be a problem.

Thanks for your feedback.

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-07-26 Thread Stefan Kooman
Quoting Peter Sabaini (pe...@sabaini.at):
> What kind of commit/apply latency increases have you seen when adding a
> large numbers of OSDs? I'm nervous how sensitive workloads might react
> here, esp. with spinners.

You mean when there is backfilling going on? Instead of doing "a big
bang" you can also use Dan van der Ster's trick with upmap balancer:
https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

See
https://www.slideshare.net/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

So you would still have norebalance / nobackfill / norecover set and the ceph
balancer off. Then you run the script as many times as necessary to get
"HEALTH_OK" again (on clusters other than Nautilus) and until there are no
more PGs remapped. Unset the flags and enable the ceph balancer ... now the
balancer will slowly move PGs to the new OSDs.
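
Roughly, the order of operations looks like this (just a sketch, not a
copy-paste recipe; the script is the one from the repository linked above):

ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
ceph balancer off
# ... add the new OSDs / increase pg_num ...
./upmap-remapped.py | sh    # repeat until no PGs are remapped anymore
ceph osd unset norebalance
ceph osd unset nobackfill
ceph osd unset norecover
ceph balancer mode upmap
ceph balancer on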

We've used this trick to increase the number of PGs on a pool, and will
use this to expand the cluster in the near future.

This only works if you can use the balancer in "upmap" mode. Note that
using upmap requires that all clients be Luminous or newer. If you are
using the cephfs kernel client it might report as not compatible (jewel),
but recent Linux distributions work well (Ubuntu 18.04 / CentOS 7).

Gr. Stefan

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS / CephFS behaviour with unusual directory layout

2019-07-26 Thread Stefan Kooman
Hi List,

We are planning to move a filesystem workload (currently nfs) to CephFS.
It's around 29 TB. The unusual thing here is the number of directories
used to host the files. In order to combat a "too many files in one
directory" scenario, a "let's make use of recursive directories" approach
was taken. Not ideal either. This workload is supposed to be moved to (Ceph) S3
sometime in the future, but until then, it has to go to a shared
filesystem ...

So what is unusual about this? The directory layout looks like this

/data/files/00/00/[0-8][0-9]/[0-9]/ from this point on there will be 7
directories created to store 1 file.

The total number of directories in a file path is 14. There are around 150 M
files in 400 M directories.

The working set won't be big. Most files will just sit around and will
not be touched. The active number of files will be a few thousand.

We are wondering if this kind of directory structure is suitable for
CephFS. Might the MDS get difficulties with keeping up that many inodes
/ dentries or doesn't it care at all?

The amount of metadata overhead might be horrible, but we will test that
out.

Thanks,

Stefan


-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] To backport or not to backport

2019-07-04 Thread Stefan Kooman
Hi,

Now that the release cadence has been set, it's time for another discussion
:-).

During Ceph day NL we had a panel q/a [1]. One of the things that was
discussed were backports. Occasionally users will ask for backports of
functionality in newer releases to older releases (that are still in
support).

Ceph is quite a unique project in the sense that new functionality gets
backported to older releases. Sometimes functionality even gets changed
during the lifetime of a release; I can recall the "ceph-volume" change to
LVM at the beginning of the Luminous release. While backports can enrich
the user experience of a Ceph operator, they are not without risk. There
have been several issues with "incomplete" backports and/or unforeseen
circumstances that had the reverse effect: downtime of (part of) Ceph
services. The ones that come to mind are:

- MDS (cephfs damaged)  mimic backport (13.2.2)
- RADOS (pg log hard limit) luminous / mimic backport (12.2.8 / 13.2.2)

I would like to define a simple rule of when to backport:

- Only backport fixes that do not introduce new functionality, but addresses
  (impaired) functionality already present in the release.

An example of a backport that, IMHO, matches these criteria is the
"bitmap_allocator" fix. It fixed a real problem, not some corner case.
Don't get me wrong here, it is important to catch corner cases, but a
backport should not put the majority of clusters at risk.

The time and effort that might be saved with this approach can indeed be
spent on one of the new focus areas Sage mentioned during his keynote
talk at Cephalocon Barcelona: quality. Quality of the backports that are
needed, and improved testing, especially for upgrades to newer releases. If
upgrades are seamless, people are more willing to upgrade, because hey,
it just works(tm). Upgrades should be boring.

How many clusters (not nautilus ;-)) are running with "bitmap_allocator" or
with the pglog_hardlimit enabled? If a new feature is not enabled by
default and it's unclear how "stable" it is to use, operators tend to not
enable it, defeating the purpose of the backport.

Backporting fixes to older releases can be considered a "business
opportunity" for the likes of Red Hat, SUSE, Fujitsu, etc. Especially
for users that want a system that "keeps on running forever" and never
needs "dangerous" updates.

This is my view on the matter, please let me know what you think of
this.

Gr. Stefan

P.s. Just to make things clear: this thread is in _no way_ intended to pick on
anybody. 


[1]: https://pad.ceph.com/p/ceph-day-nl-2019-panel

-- 
| BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Upgrades - sanity check - MDS steps

2019-06-19 Thread Stefan Kooman
Quoting James Wilkins (james.wilk...@fasthosts.com):
> Hi all,
> 
> Just want to (double) check something – we’re in the process of
> luminous -> mimic upgrades for all of our clusters – particularly this
> section regarding MDS steps
> 
>   •  Confirm that only one MDS is online and is rank 0 for your FS: #
>   ceph status •  Upgrade the last remaining MDS daemon by installing
>   the new packages and restarting the daemon:
>
> Namely – is it required to upgrade the live single MDS in place (and
> thus have downtime whilst the MDS restarts – on our first cluster was
> typically 10 minutes of downtime ) – or can we upgrade the
> standby-replays/standbys first and flip once they are back?

You should upgrade in place (the last remaining MDS) and yes that causes
a bit of downtime. In our case it takes ~ 5s. Make sure to _only_
upgrade the ceph packages (no apt upgrade of whole system) as apt will
happily disable services, start updating initramfs ... for all installed
kernels, etc. Doing the full upgrade and reboot can be done later. This
is how we do it:

On (Active) Standby:

mds2: systemctl stop ceph-mds.target

On Active:

apt update
apt policy ceph-base <- check that the version that is available is
indeed the version you want to upgrade to!
apt install ceph-base ceph-common ceph-fuse ceph-mds ceph-mds-dbg
libcephfs2 python-cephfs

If mds doesn't get restarted with the upgrade, do it manually:

systemctl restart ceph-mds.target

^^ a bit of downtime

ceph daemon mds.$id version <- to make sure you are running the upgraded
version

(or run ceph versions to check)

On Standby:

apt install ceph-base ceph-common ceph-fuse ceph-mds ceph-mds-dbg
libcephfs2 python-cephfs

systemctl restart ceph-mds.target

ceph daemon mds.$id version <- to make sure you are running the upgraded
version

On Active:

apt upgrade && reboot

(Standby becomes active)

wait for HEALTH_OK

On (now) Active (previously standby):
apt upgrade && reboot

If you follow this procedure you end up with the same active and standby
as before the upgrades, both up to date with as little downtime as
possible.

That said ... I've accidentally updated a standby MDS to a newer version
than the Active one ... and this didn't cause any issues (12.2.8 ->
12.2.11) ... but I would not recommend it.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-06-11 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> Hi Stefan,
> 
> Sorry I couldn't get back to you sooner.

NP.

> Looks like you hit the infinite loop bug in OpTracker. It was fixed in
> 12.2.11: https://tracker.ceph.com/issues/37977
> 
> The problem was introduced in 12.2.8.

We've been on 12.2.8 for quite a long time (because of issues with 12.2.9 /
uncertainty around 12.2.10) ... We upgraded to 12.2.11 at the end of February,
after which we stopped seeing crashes ... so it does correlate with the
upgrade, so yeah, probably this bug then.

Thanks,

Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-10 Thread Stefan Kooman
Quoting solarflow99 (solarflo...@gmail.com):
> can the bitmap allocator be set in ceph-ansible?  I wonder why is it not
> default in 12.2.12

We don't use ceph-ansible. But if ceph-ansible allows you to set specific
([osd]) settings in ceph.conf, I guess you can do it.

I don't know what the policy is for changing default settings in Ceph; not
sure if they ever do that. The feature has only been available since 12.2.12
and is not battle-tested in Luminous. It's not the default in Mimic either,
IIRC. It might be the default in Nautilus?

Behaviour changes can be tricky without people knowing about it.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-07 Thread Stefan Kooman
Quoting Max Vernimmen (vernim...@textkernel.nl):
> Thank you for the suggestion to use the bitmap allocator. I looked at the
> ceph documentation and could find no mention of this setting. This makes me
> wonder how safe and production ready this setting really is. I'm hesitant
> to apply that to our production environment.
> If the allocator setting helps to resolve the problem then it looks to me
> like there is a bug in the 'stupid' allocator that is causing this
> behavior. Would this qualify for creating a bug report or is some more
> debugging needed before I can do that?

It's safe to use in production. We have test clusters running it, and
recently put it in production as well. As Igor noted this might not help
in your situation, but it might prevent you from running into decreased
performance (increased latency) over time.

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD hanging on 12.2.12 by message worker

2019-06-06 Thread Stefan Kooman
Quoting Max Vernimmen (vernim...@textkernel.nl):
> 
> This is happening several times per day after we made several changes at
> the same time:
> 
>- add physical ram to the ceph nodes
>- move from fixed 'bluestore cache size hdd|sdd' and 'bluestore cache kv
>max' to 'bluestore cache autotune = 1' and 'osd memory target =
>20401094656'.
>- update ceph from 12.2.8 to 12.2.11
>- update clients from 12.2.8 to 12.2.11
> 
> We have since upgraded the ceph nodes to 12.2.12 but it did not help to fix
> this problem.

Have you tried the new bitmap allocator for the OSDs already (available
since 12.2.12):

[osd]

# MEMORY ALLOCATOR
bluestore_allocator = bitmap
bluefs_allocator = bitmap

The issues you are reporting sound like an issue many of us have seen on
Luminous and Mimic clusters, which has been identified as being caused by the
"stupid allocator" memory allocator.

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-05-27 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Hi Patrick,
> 
> Quoting Stefan Kooman (ste...@bit.nl):
> > Quoting Stefan Kooman (ste...@bit.nl):
> > > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > > somewhere it's not even outputting any log messages. If possible, it'd
> > > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > > if you're comfortable with gdb, a backtrace of any threads that look
> > > > suspicious (e.g. not waiting on a futex) including `info threads`.
> > 
> > Today the issue reappeared (after being absent for ~ 3 weeks). This time
> > the standby MDS could take over and would not get into a deadlock
> > itself. We made gdb traces again, which you can find over here:
> > 
> > https://8n1.org/14011/d444
> 
> We are still seeing these crashes occur ~ every 3 weeks or so. Have you
> find the time to look into the backtraces / gdb dumps?

We have not seen this issue anymore for the past three months. We have
updated the cluster to 12.2.11 in the meantime, but not sure if that is
related. Hopefully it stays away.

FYI,

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-27 Thread Stefan Kooman
Quoting Robert Sander (r.san...@heinlein-support.de):
> Hi,
> 
> we have a small cluster at a customer's site with three nodes and 4 SSD-OSDs
> each.
> Connected with 10G the system is supposed to perform well.
> 
> rados bench shows ~450MB/s write and ~950MB/s read speeds with 4MB objects
> but only 20MB/s write and 95MB/s read with 4KB objects.
> 
> This is a little bit disappointing as the 4K performance is also seen in KVM
> VMs using RBD.
> 
> Is there anything we can do to improve performance with small objects /
> block sizes?

Josh gave a talk about this:
https://static.sched.com/hosted_files/cephalocon2019/10/Optimizing%20Small%20Ceph%20Clusters.pdf

TL;DR: 
- For small clusters use relatively more PGs than for large clusters
- Make sure your cluster is well balanced, and this script might
be useful:
https://github.com/JoshSalomon/Cephalocon-2019/blob/master/pool_pgs_osd.sh

Josh is also tuning the objecter_* attributes (if you have plenty of
CPU/Memory):

objecter_inflight_ops = 5120
objecter_inflight_op_bytes = 524288000 (512 * 1,024,000)
## You can multiply / divide both by the same factor

Some more tuning tips in the presentation by Wido/Piotr that might be
useful:
https://static.sched.com/hosted_files/cephalocon2019/d6/ceph%20on%20nvme%20barcelona%202019.pdf

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs free space vs ceph df free space disparity

2019-05-27 Thread Stefan Kooman
Quoting Robert Ruge (robert.r...@deakin.edu.au):
> Ceph newbie question.
> 
> I have a disparity between the free space that my cephfs file system
> is showing and what ceph df is showing.  As you can see below my
> cephfs file system says there is 9.5TB free however ceph df says there
> is 186TB which with replication size 3 should equate to 62TB free
> space.  I guess the basic question is how can I get cephfs to see and
> use all of the available space?  I recently changed my number of pg's
> on the cephfs_data pool from 2048 to 4096 and this gave me another 8TB
> so do I keep increasing the number of pg's or is there something else
> that I am missing? I have only been running ceph for ~6 months so I'm
> relatively new to it all and not being able to use all of the space is
> just plain bugging me.

My guess here is you have a lot of small files in your cephfs, is that
right? Do you have HDD or SSD/NVMe?

Mohamad Gebai gave a talk about this at Cephalocon 2019:
https://static.sched.com/hosted_files/cephalocon2019/d2/cephalocon-2019-mohamad-gebai.pdf
for the slides and the recording:
https://www.youtube.com/watch?v=26FbUEbiUrw&list=PLrBUGiINAakNCnQUosh63LpHbf84vegNu&index=29&t=0s

TL;DR: there are bluestore_min_alloc_size_ssd / bluestore_min_alloc_size_hdd
settings, which default to 16K for SSD and 64K for HDD. With lots of small
objects this might add up to *a lot* of overhead. You can change them to 4k:

bluestore min alloc size ssd = 4096
bluestore min alloc size hdd = 4096

You will have to rebuild _all_ of your OSDs though.

Here is another thread about this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/thread.html#24801

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):
> 
> [root@ceph-01 ~]# ceph status # before the MDS failed over
>   cluster:
> id: ###
> health: HEALTH_WARN
> 1 MDSs report slow requests
>  
>   services:
> mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
> mgr: ceph-01(active), standbys: ceph-02, ceph-03
> mds: con-fs-1/1/1 up  {0=ceph-08=up:active}, 1 up:standby
> osd: 192 osds: 192 up, 192 in
>  
>   data:
> pools:   5 pools, 750 pgs
> objects: 6.35 M objects, 5.2 TiB
> usage:   5.1 TiB used, 1.3 PiB / 1.3 PiB avail
> pgs: 750 active+clean

How many pools do you plan to use? You have 5 pools and only 750 PGs
total? What hardware do you have for OSDs? If cephfs is your biggest
user I would add up to 6150 (!) PGs to your pool(s). Having around ~ 100
PGs per OSD is healthy. The cluster will also be able to balance way
better. Math: ((100 PG/OSD * 192 OSDs) - 750) / 3 = 6150 for 3-replica
pools. You might have a lot of contention going on on your OSDs; they
are probably underperforming.
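
For example (replace the pool name with your cephfs data pool; on Mimic
you need to bump pgp_num as well, preferably in steps):

ceph osd pool set <cephfs-data-pool> pg_num 4096
ceph osd pool set <cephfs-data-pool> pgp_num 4096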

Gr. Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-18 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):
> Dear Yan and Stefan,
> 
> it happened again and there were only very few ops in the queue. I
> pulled the ops list and the cache. Please find a zip file here:
> "https://files.dtu.dk/u/w6nnVOsp51nRqedU/mds-stuck-dirfrag.zip?l"; .
> Its a bit more than 100MB.
> 
> The active MDS failed over to the standby after or during the dump
> cache operation. Is this expected? As a result, the cluster is healthy
> and I can't do further diagnostics. In case you need more information,
> we have to wait until next time.


> 
> Some further observations:
> 
> There was no load on the system. I start suspecting that this is not a 
> load-induced event. It is also not cause by excessive atime updates, the FS 
> is mounted with relatime. Could it have to do with the large level-2 network 
> (ca. 550 client servers in the same broadcast domain)? I include our kernel 
> tuning profile below, just in case. The cluster networks (back and front) are 
> isolated VLANs, no gateways, no routing.

I am pretty sure you hit bug #26982: https://tracker.ceph.com/issues/26982

"mds: crash when dumping ops in flight".

So, if you need a reason to update to 13.2.5, there you have it. Sorry
that I did not realize beforehand that you could hit this bug, as you're
running 13.2.2.

So I would update to 13.2.5 and try again.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Stefan Kooman
Quoting Jan Kasprzak (k...@fi.muni.cz):

>   OK, many responses (thanks for them!) suggest chrony, so I tried it:
> With all three mons running chrony and being in sync with my NTP server
> with offsets under 0.0001 second, I rebooted one of the mons:
> 
>   There still was the HEALTH_WARN clock_skew message as soon as
> the rebooted mon starts responding to ping. The cluster returns to
> HEALTH_OK about 95 seconds later.
> 
>   According to "ntpdate -q my.ntp.server", the initial offset
> after reboot is about 0.6 s (which is the reason of HEALTH_WARN, I think),
> but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
> of HEALTH_WARN is inside Ceph, with mons being already synchronized.
> 
>   So the result is that chrony indeed synchronizes faster,
> but nevertheless I still have about 95 seconds of HEALTH_WARN "clock skew
> detected".
> 
>   I guess now the workaround now is to ignore the warning, and wait
> for two minutes before rebooting another mon.

You can tune the "mon_timecheck_skew_interval" which by default is set
to 30 seconds. See [1] and look for "timecheck" to find the different
options.
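
E.g. something along these lines in ceph.conf on the mons (the value is
just an illustration, there is no one-size-fits-all here):

[mon]
mon timecheck skew interval = 15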

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):
> Dear Stefan,
> 
> thanks for the fast reply. We encountered the problem again, this time in a 
> much simpler situation; please see below. However, let me start with your 
> questions first:
> 
> What bug? -- In a single-active MDS set-up, should there ever occur an 
> operation with "op_name": "fragmentdir"?

Yes, see http://docs.ceph.com/docs/mimic/cephfs/dirfrags/. If you had
multiple active MDSes, the load could be shared among them.

There are some parameters that might need to be tuned for your
environment. But Zheng Yan is an expert in this matter, so an analysis of
the mds cache dump might reveal the culprit.

> Upgrading: The problem described here is the only issue we observe.
> Unless the problem is fixed upstream, upgrading won't help us and
> would be a bit of a waste of time. If someone can confirm that this
> problem is fixed in a newer version, we will do it. Otherwise, we
> might prefer to wait until it is.

Keeping your systems up to date generally improves stability. You might
prevent hitting issues when your workload changes in the future. First
testing new releases on a test system is recommended though.

> 
> News on the problem. We encountered it again when one of our users executed a 
> command in parallel with pdsh on all our ~500 client nodes. This command 
> accesses the same file from all these nodes pretty much simultaneously. We 
> did this quite often in the past, but this time, the command got stuck and we 
> started observing the MDS health problem again. Symptoms:

This command, does it incur writes, reads, or a combination of both on
files in this directory? I wonder if you might prevent this from
happening by tuning the "Activity thresholds", especially since you say it
is load (# clients) dependent.
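
For reference, these are the kind of knobs I mean; the values shown are
just the documented defaults (from memory), not a recommendation:

[mds]
mds bal split size = 10000
mds bal split rd = 25000
mds bal split wr = 10000
mds bal fragment size max = 100000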

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-14 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):

If at all possible I would:

Upgrade to 13.2.5 (there have been quite a few MDS fixes since 13.2.2).
Use more recent kernels on the clients.

Below settings for [mds] might help with trimming (you might already
have changed mds_log_max_segments to 128 according to logs):

[mds]
mds_log_max_expiring = 80  # default 20
# trim max $value segments in parallel
# Defaults are too conservative.
mds_log_max_segments = 120  # default 30


> 1) Is there a bug with having MDS daemons acting as standby-replay?
I can't tell what bug you are referring to based on the info below. It does
seem to work as designed.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests from bluestore osds

2019-05-14 Thread Stefan Kooman
Quoting Marc Schöchlin (m...@256bit.org):

> Out new setup is now:
> (12.2.10 on Ubuntu 16.04)
> 
> [osd]
> osd deep scrub interval = 2592000
> osd scrub begin hour = 19
> osd scrub end hour = 6
> osd scrub load threshold = 6
> osd scrub sleep = 0.3
> osd snap trim sleep = 0.4
> pg max concurrent snap trims = 1
> 
> [osd.51]
> osd memory target = 8589934592

I would upgrade to 12.2.12 and set the following:

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Just to make sure you're not hit by the "stupid allocator" behaviour,
which (also) might result in slow ops after $period of OSD uptime.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is recommended ceph docker image for use

2019-05-09 Thread Stefan Kooman
Quoting Patrick Hein (bagba...@googlemail.com):
> It should be as recent as possible. I think would use the HWE Kernel.

^^ This. Use linux-image-generic-hwe-16.04 (4.15 kernel). But ideally you go for
Ubuntu Bionic and use linux-image-generic-hwe-18.04 (4.18 kernel).

An added benefit of the 4.18 kernel (4.17 and up) is that cephfs quotas work
with the ceph kernel client for cephfs on Mimic (and up) clusters [1].

Gr. Stefan

[1]: http://docs.ceph.com/docs/nautilus/cephfs/quota/#limitations

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What is recommended ceph docker image for use

2019-05-08 Thread Stefan Kooman
Quoting Ignat Zapolsky (ignat.zapol...@ammeon.com):
> Hi,
> 
> Just a question what is recommended docker container image to use for
> ceph ?
> 
> CEPH website us saying that 12.2.x is LTR but there are at least 2
> more releases in dockerhub – 13 and 14.
> 
> Would there be any advise on selection between 3 releases ?

There isn't a "LTR" concept anymore in Ceph. There are three releases
that are supported at any given time. As soon as Octopus (15) will be
released Luminous (12) won't be (officially) supported anymore.

I would go for the latest release, Nautilus (14), when setting up a new
cluster. But if you want to go for a release that has been more battle
tested in production go for Mimic (13). It might also depend on your use
case: cephfs is proabably best served from Nautilus. Nautilus release
has improvements in all major interfaces (RGW, RBD, cephfs) though, but
might have some (undiscovered) issues.

Gr. Stefan



-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule device classes mystery

2019-05-07 Thread Stefan Kooman
Quoting Konstantin Shalygin (k0...@k0ste.ru):
> Because you set new crush rule only for `cephfs_metadata` pool and look for
> pg at `cephfs_data` pool.

ZOMG :-O

Yup, that was it. cephfs_metadata pool looks good.

Thanks!

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule device classes mystery

2019-05-06 Thread Stefan Kooman
Quoting Gregory Farnum (gfar...@redhat.com):
> What's the output of "ceph -s" and "ceph osd tree"?

ceph -s
  cluster:
id: 40003df8-c64c-5ddb-9fb6-d62b94b47ecf
health: HEALTH_OK
 
  services:
mon: 3 daemons, quorum michelmon1,michelmon2,michelmon3 (age 2d)
mgr: michelmon2(active, since 3d), standbys: michelmon1, michelmon3
mds: cephfs:1 {0=michelmds1=up:active} 1 up:standby
osd: 9 osds: 9 up (since 3d), 9 in (since 7d)
rgw: 2 daemons active (michelrgw1, michelrgw2)
 
  data:
pools:   17 pools, 472 pgs
objects: 244 objects, 4.2 KiB
usage:   9.2 GiB used, 891 GiB / 900 GiB avail
pgs: 472 active+clean

ceph osd tree
ID  CLASSWEIGHT  TYPE NAME   STATUS REWEIGHT PRI-AFF 
 -1  0.89992 root default
 -9  0.89992 region BIT-Ede  
-11  0.29997 datacenter BIT-1
-14  0.29997 rack rack1  
 -2  0.29997 host michelosd1 
  1 cheaphdd 0.0 osd.1   up  1.0 1.0 
  2 cheaphdd 0.0 osd.2   up  1.0 1.0 
  0  fasthdd 0.0 osd.0   up  1.0 1.0 
-12  0.29997 datacenter BIT-2A   
-15  0.29997 rack rack2  
 -3  0.29997 host michelosd2 
  4 cheaphdd 0.0 osd.4   up  1.0 1.0 
  5 cheaphdd 0.0 osd.5   up  1.0 1.0 
  3  fasthdd 0.0 osd.3   up  1.0 1.0 
-10  0.29997 datacenter BIT-2C   
-13  0.29997 rack rack3  
 -4  0.29997 host michelosd3 
  7 cheaphdd 0.0 osd.7   up  1.0 1.0 
  8 cheaphdd 0.0 osd.8   up  1.0 1.0 
  6  fasthdd 0.0 osd.6   up  1.0 1.0

And the shadow tree:

ceph osd crush tree --show-shadow
ID  CLASSWEIGHT  TYPE NAME
-33  fasthdd 0.29997 root default~fasthdd 
-32  fasthdd 0.29997 region BIT-Ede~fasthdd   
-28  fasthdd 0.0 datacenter BIT-1~fasthdd 
-27  fasthdd 0.0 rack rack1~fasthdd   
-26  fasthdd 0.0 host michelosd1~fasthdd  
  0  fasthdd 0.0 osd.0
-31  fasthdd 0.0 datacenter BIT-2A~fasthdd
-30  fasthdd 0.0 rack rack2~fasthdd   
-29  fasthdd 0.0 host michelosd2~fasthdd  
  3  fasthdd 0.0 osd.3
-25  fasthdd 0.0 datacenter BIT-2C~fasthdd
-24  fasthdd 0.0 rack rack3~fasthdd   
-23  fasthdd 0.0 host michelosd3~fasthdd  
  6  fasthdd 0.0 osd.6
-22 cheaphdd 0.59995 root default~cheaphdd
-21 cheaphdd 0.59995 region BIT-Ede~cheaphdd  
-17 cheaphdd 0.19998 datacenter BIT-1~cheaphdd
-16 cheaphdd 0.19998 rack rack1~cheaphdd  
 -8 cheaphdd 0.19998 host michelosd1~cheaphdd 
  1 cheaphdd 0.0 osd.1
  2 cheaphdd 0.0 osd.2
-20 cheaphdd 0.19998 datacenter BIT-2A~cheaphdd   
-19 cheaphdd 0.19998 rack rack2~cheaphdd  
-18 cheaphdd 0.19998 host michelosd2~cheaphdd 
  4 cheaphdd 0.0 osd.4
  5 cheaphdd 0.0 osd.5
 -7 cheaphdd 0.19998 datacenter BIT-2C~cheaphdd   
 -6 cheaphdd 0.19998 rack rack3~cheaphdd  
 -5 cheaphdd 0.19998 host michelosd3~cheaphdd 
  7 cheaphdd 0.0 osd.7
  8 cheaphdd 0.0 osd.8
 -1  0.89992 root default 
 -9  0.89992 region BIT-Ede   
-11  0.29997 datacenter BIT-1 
-14  0.29997 rack rack1   
 -2  0.29997 host michelosd1  
  1 cheaphdd 0.0 osd.1
  2 cheaphdd 0.0 osd.2
  0  fasthdd 0.0 osd.0
-12  0.29997 datacenter BIT-2A
-15  0.29997 rack rack2  

[ceph-users] CRUSH rule device classes mystery

2019-05-03 Thread Stefan Kooman
Hi List,

I'm playing around with CRUSH rules and device classes and I'm puzzled
if it's working correctly. Platform specifics: Ubuntu Bionic with Ceph 14.2.1

I created two new device classes "cheaphdd" and "fasthdd". I made
sure these device classes are applied to the right OSDs and that the
(shadow) crush rule is correctly filtering the right classes for the
OSDs (ceph osd crush tree --show-shadow).

I then created two new crush rules:

ceph osd crush rule create-replicated fastdisks default host fasthdd
ceph osd crush rule create-replicated cheapdisks default host cheaphdd

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule fastdisks {
id 1
type replicated
min_size 1
max_size 10
step take default class fasthdd
step chooseleaf firstn 0 type host
step emit
}
rule cheapdisks {
id 2
type replicated
min_size 1
max_size 10
step take default class cheaphdd
step chooseleaf firstn 0 type host
step emit
}

After that I put the cephfs_metadata on the fastdisks CRUSH rule:

ceph osd pool set cephfs_metadata crush_rule fastdisks

Some data is moved to the new OSDs, but strangely enough there is still data
on PGs residing on OSDs in the "cheaphdd" class. I confirmed this with:

ceph pg ls-by-pool cephfs_data

Testing CRUSH rule nr. 1 gives me:

crushtool -i /tmp/crush_raw --test --show-mappings --rule 1 --min-x 1 --max-x 4 
 --num-rep 3
CRUSH rule 1 x 1 [0,3,6]
CRUSH rule 1 x 2 [3,6,0]
CRUSH rule 1 x 3 [0,6,3]
CRUSH rule 1 x 4 [0,6,3]

Which are indeed the OSDs in the fasthdd class.

Why isn't all data moved to OSDs 0, 3 and 6, instead of still being spread
over OSDs in the cheaphdd class as well?

Thanks,

Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] VM management setup

2019-04-30 Thread Stefan Kooman
Hi,

> Any recommendations? 
> 
> .. found a lot of names allready .. 
> OpenStack 
> CloudStack 
> Proxmox 
> .. 
> 
> But recommendations are truely welcome. 
I would recommend OpenNebula. Adopters of the KISS methodology.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Intel SSD D3-S4510 and Intel SSD D3-S4610 firmware advisory notice

2019-04-19 Thread Stefan Kooman
Hi List,

TL;DR:

For those of you who are running a Ceph cluster with Intel SSD D3-S4510
and/or Intel SSD D3-S4610 drives with firmware version XCV10100: please
upgrade to firmware XCV10110 ASAP, at least before ~ 1700 power-up hours.

More information here:

https://support.microsoft.com/en-us/help/4499612/intel-ssd-drives-unresponsive-after-1700-idle-hours

https://downloadcenter.intel.com/download/28673/SSD-S4510-S4610-2-5-non-searchable-firmware-links/

Gr. Stefan

P.s. Thanks to Frank Dennis (@jedisct1) for retweeting @NerdPyle:
https://twitter.com/jedisct1/status/1118623635072258049


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Stefan Kooman
Quoting Lars Täuber (taeu...@bbaw.de):
> > > This is something i was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.  
> > 
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult to debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running etc.
> 
> This I was not aware of.

It's real. I've been bitten by this several times in a PoC cluster while
playing around with networking ... make sure you have proper monitoring
checks on all network interfaces when running this setup.

> > In your case with only 6 spinners max per server there is no way you
> > will every fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwith, enough for
> > both OSD replication traffic and client IO.
> 
> The reason for the choice for the 25GBit network was because a remark
> of someone, that the latency in this ethernet is way below that of
> 10GBit. I never double checked this.

This is probably true. 25 Gb/s is a single lane (SerDes) which is used in
50 / 100 / 200 Gb/s connections. It operates at ~ 2.5 times the clock
rate of 10 Gb/s / 40 Gb/s. But for clients to fully benefit from this lower
latency, they should be on 25 Gb/s as well. If you can afford to
redesign your cluster (and low latency is important) ... Then again ...
the latency your spinners introduce is a few orders of magnitude higher
than the network latency ... I would then (also) invest in NVMe drives
for (at least) metadata ... and switch to 3x replication ... but that
might be asking too much.

TL;DR: when designing clusters, try to think about the "weakest" link
(bottleneck) ... most probably this will be disk speed / Ceph overhead.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Stefan Kooman
Quoting Lars Täuber (taeu...@bbaw.de):
> > I'd probably only use the 25G network for both networks instead of
> > using both. Splitting the network usually doesn't help.
> 
> This is something i was told to do, because a reconstruction of failed
> OSDs/disks would have a heavy impact on the backend network.

Opinions vary on running "public" only versus "public" / "backend".
Having a separate "backend" network might lead to difficult-to-debug
issues when the "public" network is working fine, but the "backend" is
having issues and OSDs can't peer with each other, while the clients can
talk to all OSDs. You will get slow requests and OSDs marking each other
down while they are still running, etc.

There might also be pros for running a separate "backend" network;
anyone?

In your case, with only 6 spinners max per server, there is no way you
will ever fill the capacity of a 25 Gb/s network: 6 * 250 MB/s
(for large spinners) is just about enough to fill a 10 Gb/s link. A
redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
both OSD replication traffic and client IO.

My 2 cents,

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Stefan Kooman
Quoting Paul Emmerich (paul.emmer...@croit.io):
> This also happened sometimes during a Luminous -> Mimic upgrade due to
> a bug in Luminous; however I thought it was fixed on the ceph-mgr
> side.
> Maybe the fix was (also) required in the OSDs and you are seeing this
> because the running OSDs have that bug?
> 
> Anyways, it's harmless and you can ignore it.

Ah, so it's merely "cosmetic" rather than those PGs really being inactive.
Because that would *freak me out* if I were doing an upgrade.

Thanks for the clarification.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph nautilus upgrade problem

2019-04-02 Thread Stefan Kooman
Quoting Stadsnet (jwil...@stads.net):
> On 26-3-2019 16:39, Ashley Merrick wrote:
> >Have you upgraded any OSD's?
> 
> 
> No didn't go through with the osd's

Just checking here: are you sure all PGs have been scrubbed while
running Luminous? The release notes [1] mention this:

"If you are unsure whether or not your Luminous cluster has completed a
full scrub of all PGs, you can check your clusters state by running:

# ceph osd dump | grep ^flags

In order to be able to proceed to Nautilus, your OSD map must include
the recovery_deletes and purged_snapdirs flags."

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/releases/nautilus/#upgrading-from-mimic-or-luminous

P.s. I expect most users upgrade to Mimic first, then go to Nautilus.
It might be a better tested upgrade path ... 


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Moving pools between cluster

2019-04-02 Thread Stefan Kooman
Quoting Burkhard Linke (burkhard.li...@computational.bio.uni-giessen.de):
> Hi,
> Images:
> 
> Straight-forward attempt would be exporting all images with qemu-img from
> one cluster, and uploading them again on the second cluster. But this will
> break snapshots, protections etc.

You can use rbd-mirror [1] (RBD mirroring requires the Ceph Jewel release or
later). You do need to be able to set the "journaling" and
"exclusive-lock" features on the rbd images (rbd feature enable
{pool-name}/{image-name} exclusive-lock journaling).
This will preserve snapshots, etc. When everything is mirrored you can
shut down the VMs (all at once or one by one) and promote the image(s) on
the new cluster, and have the VM(s) use the new cluster for their storage.
Note: you can also mirror a whole pool instead of mirroring on the image level.
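
A rough sketch of the image-level flow (pool / image names are made up,
and you need an rbd-mirror daemon running against the destination
cluster):

rbd feature enable rbd/vm-disk1 exclusive-lock journaling
rbd mirror pool enable rbd image        # on both clusters
rbd mirror image enable rbd/vm-disk1
rbd mirror image status rbd/vm-disk1    # wait for state up+replaying
rbd mirror image demote rbd/vm-disk1    # on the old (primary) cluster
rbd mirror image promote rbd/vm-disk1   # on the new cluster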

Gr. Stefan

[1]: http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-14 Thread Stefan Kooman
Quoting Zack Brenton (z...@imposium.com):
> On Tue, Mar 12, 2019 at 6:10 AM Stefan Kooman  wrote:
> 
> > Hmm, 6 GiB of RAM is not a whole lot. Especially if you are going to
> > increase the amount of OSDs (partitions) like Patrick suggested. By
> > default it will take 4 GiB per OSD ... Make sure you set the
> > "osd_memory_target" parameter accordingly [1].
> >
> 
> @Stefan: Not sure I follow you here - each OSD pod has 6GiB RAB allocated
> to it, which accounts for the default 4GiB + 20% mentioned in the docs for
> `osd_memory_target` plus a little extra. The pods are running on AWS
> i3.2xlarge instances, which have 61GiB total RAM available, leaving plenty
> of room for an additional OSD pod to manage the additional partition
> created on each node. Why would I need to increase the RAM allocated to
> each OSD pod and adjust `osd_memory_target`? Does using the default values
> leave me at risk of running into some other kind of priority inversion
> issue / deadlock / etc.?

Somehow I understood that the server hosting all OSDs had 6 GiB of RAM
available, not 6 GiB per OSD. I'm not up to speed with Ceph hosted in
Kubernetes / pods, so that might explain it. Sorry for the noise.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How To Scale Ceph for Large Numbers of Clients?

2019-03-12 Thread Stefan Kooman
Quoting Zack Brenton (z...@imposium.com):
> Types of devices:
> We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3
> Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs
> with the following resources:
> - osds: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
> - mds:  3 CPU, 24Gi RAM
> - mons: 500m (.5) CPU, 1Gi RAM

Hmm, 6 GiB of RAM is not a whole lot, especially if you are going to
increase the number of OSDs (partitions) like Patrick suggested. By
default each OSD will take 4 GiB ... Make sure you set the
"osd_memory_target" parameter accordingly [1].

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/?highlight=osd%20memory%20target


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS_SLOW_METADATA_IO

2019-03-03 Thread Stefan Kooman
Quoting Patrick Donnelly (pdonn...@redhat.com):
> On Thu, Feb 28, 2019 at 12:49 PM Stefan Kooman  wrote:
> >
> > Dear list,
> >
> > After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
> > (MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
> > more that 5 seconds. We have one active, and one active standby MDS. All
> > storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
> > reported. The MDSes are underutilized, only a handful of active clients
> > and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
> > cluster is also far from busy.
> >
> > I've dumped ops in flight on the MDSes but all ops that are printed are
> > finished in a split second (duration: 0.000152), flag_point": "acquired
> > locks".
> 
> I believe you're looking at the wrong "ops" dump. You want to check
> "objector_requests".

I've done that, but not much timing info in there, i.e.:

"ops": [
{
"tid": 416231,
"pg": "6.791ce4c5",
"osd": 31,
"object_id": "200.00024d0f",
"object_locator": "@6",
"target_object_id": "200.00024d0f",
"target_object_locator": "@6",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "79790s",
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2019-03-01 16:39:29.0.41847s",
"osd_ops": [
"write 798204~6249 [fadvise_dontneed]"
]
}

The MDS_SLOW_METADATA_IO warning has been added in commit
https://github.com/ceph/ceph/commit/0f735f40315448560fde049ed3ea019a7d30d868#diff-74784f821b1aae68f768965680914268

Tracker issue: http://tracker.ceph.com/issues/24879
PR: 23022

It uses the "osd_op_complaint_time" as the "complaint_time". We have set
the osd_op_complaint_time to 5 seconds. 

The tracker issue hints that slow metadata would be caused by a slow OSD
(journal writes of the MDS that take too long). However, we do not see
any slow MDS (data) ops, nor any slow OSD ops. We've also changed the
parameter "mon_osd_warn_op_age" to 5 seconds so slow OSD ops would be
reported as well.

I do not think our OSDs are that slow, that often (~ at least every 15
minutes we would see that warning) ... so I really wonder what's going
on.

For now we've set the osd_op_complaint_time to "30" on the MDSes and do
not see any SLOW METADATA warnings anymore.

Anyone willing to set the osd_op_complaint_time to 5 seconds to see if
those messages start popping up? "ceph tell mds.* config set
osd_op_complaint_time 5" should do the trick ... you would need to be
running luminous 12.2.11 and/or mimic 13.2.2.

Thanks,

Stefan

P.s. kudos to Wido den Hollander for helping us debug this.

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-28 Thread Stefan Kooman
Quoting Wido den Hollander (w...@42on.com):
 
> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe
> OSDs as well. Over time their latency increased until we started to
> notice I/O-wait inside VMs.

On a Luminous 12.2.8 cluster with only SSDs we seem to have hit this
issue as well. After restarting the OSD servers the latency would drop
to normal values again. See https://owncloud.kooman.org/s/BpkUc7YM79vhcDj

Reboots were finished at ~ 19:00.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS_SLOW_METADATA_IO

2019-02-28 Thread Stefan Kooman
Dear list,

After upgrading to 12.2.11 the MDSes are reporting slow metadata IOs
(MDS_SLOW_METADATA_IO). The metadata IOs would have been blocked for
more than 5 seconds. We have one active, and one active standby MDS. All
storage on SSD (Samsung PM863a / Intel DC4500). No other (OSD) slow ops
reported. The MDSes are underutilized, only a handful of active clients
and almost no load (fast hexacore CPU, 256 GB RAM, 20 Gb/s network). The
cluster is also far from busy.

I've dumped ops in flight on the MDSes but all ops that are printed are
finished in a split second (duration: 0.000152), "flag_point": "acquired
locks".

I've googled for "MDS_SLOW_METADATA_IO" but no useful info whatsoever.
Are we the only ones getting these slow metadata IOs?

Any hints on how to proceed to debug this are welcome.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] logging of cluster status (Jewel vs Luminous and later)

2019-01-24 Thread Stefan Kooman
Quoting Matthew Vernon (m...@sanger.ac.uk):
> Hi,
> 
> On our Jewel clusters, the mons keep a log of the cluster status e.g.
> 
> 2019-01-24 14:00:00.028457 7f7a17bef700  0 log_channel(cluster) log [INF] :
> HEALTH_OK
> 2019-01-24 14:00:00.646719 7f7a46423700  0 log_channel(cluster) log [INF] :
> pgmap v66631404: 173696 pgs: 10 active+clean+scrubbing+deep, 173686
> active+clean; 2271 TB data, 6819 TB used, 9875 TB / 16695 TB avail; 1313
> MB/s rd, 236 MB/s wr, 12921 op/s
> 
> This is sometimes useful after a problem, to see when thing started going
> wrong (which can be helpful for incident response and analysis) and so on.
> There doesn't appear to be any such logging in Luminous, either by mons or
> mgrs. What am I missing?

Our mons keep a log in /var/log/ceph/ceph.log (running luminous 12.2.8).
Is that log present on your systems?
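
If it is not, the cluster log is controlled by the mon cluster log
options; assuming defaults, something along these lines in ceph.conf
should (re-)enable it:

[mon]
mon cluster log to file = true
mon cluster log file = /var/log/ceph/$cluster.log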

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2019-01-16 Thread Stefan Kooman
Hi Patrick,

Quoting Stefan Kooman (ste...@bit.nl):
> Quoting Stefan Kooman (ste...@bit.nl):
> > Quoting Patrick Donnelly (pdonn...@redhat.com):
> > > Thanks for the detailed notes. It looks like the MDS is stuck
> > > somewhere it's not even outputting any log messages. If possible, it'd
> > > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > > if you're comfortable with gdb, a backtrace of any threads that look
> > > suspicious (e.g. not waiting on a futex) including `info threads`.
> 
> Today the issue reappeared (after being absent for ~ 3 weeks). This time
> the standby MDS could take over and would not get into a deadlock
> itself. We made gdb traces again, which you can find over here:
> 
> https://8n1.org/14011/d444

We are still seeing these crashes occur ~ every 3 weeks or so. Have you
found the time to look into the backtraces / gdb dumps?

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephday berlin slides

2018-12-10 Thread Stefan Kooman
Quoting Mike Perez (mipe...@redhat.com):
> Hi Serkan,
> 
> I'm currently working on collecting the slides to have them posted to
> the Ceph Day Berlin page as Lenz mentioned they would show up. I will
> notify once the slides are available on mailing list/twitter. Thanks!

FYI: The Ceph Day Berlin slides are available online:
https://ceph.com/cephdays/ceph-day-berlin/

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance Problems

2018-12-10 Thread Stefan Kooman
Quoting Robert Sander (r.san...@heinlein-support.de):
> On 07.12.18 18:33, Scharfenberg, Buddy wrote:
> 
> > We have 3 nodes set up, 1 with several large drives, 1 with a handful of
> > small ssds, and 1 with several nvme drives.
> 
> This is a very unusual setup. Do you really have all your HDDs in one
> node, the SSDs in another and NVMe in the third?
> 
> How do you guarantee redundancy?

Disk type != redundancy.
> 
> You should evenly distribute your storage devices across your nodes,
> this may already be a performance boost as it distributes the requests.

If performance is indeed important, it makes sense to do what Robert
suggests. It can also make sense if you want to reduce the chance of
the drives in the three different hosts dying at the same time,
assuming you have 3 replicas and host as the failure domain.

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool Available Capacity Question

2018-12-08 Thread Stefan Kooman
Jay Munsterman  wrote on 7 December 2018 21:55:25 CET:
>Hey all,
>I hope this is a simple question, but I haven't been able to figure it
>out.
>On one of our clusters there seems to be a disparity between the global
>available space and the space available to pools.
>
>$ ceph df
>GLOBAL:
>SIZE  AVAIL RAW USED %RAW USED
>1528T  505T1022T 66.94
>POOLS:
>NAME ID USED   %USED MAX AVAIL OBJECTS
>   fs_data  7678T 85.79  112T 194937779
>   fs_metadata  8  62247k 057495G 92973
>   libvirt_pool 14   495G  0.5786243G127313
>
>The global available space is 505T, the primary pool (fs_data, erasure
>code
>k=2, m=1) lists 112T available. With 2,1 I would expect there to be
>~338T
>available (505 x .67). Seems we have a few hundred TB missing.
>Thoughts?
>Thanks,
>jay

Your OSDs are imbalanced. Ceph calculates the pool MAX AVAIL based on the
fullest OSD. I suggest you check this presentation by Dan van der Ster:
https://www.slideshare.net/mobile/Inktank_Ceph/ceph-day-berlin-mastering-ceph-operations-upmap-and-the-mgr-balancer

If you are running Ceph Luminous with Luminous-only clients: enable upmap
for balancing and enable the balancer module.
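
A rough sketch of the commands involved (check the slides for the
caveats first):

# ceph osd set-require-min-compat-client luminous
# ceph mgr module enable balancer
# ceph balancer mode upmap
# ceph balancer on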

Gr. Stefan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor ceph cluster performance

2018-11-26 Thread Stefan Kooman
Quoting Cody (codeology@gmail.com):
> The Ceph OSD part of the cluster uses 3 identical servers with the
> following specifications:
> 
> CPU: 2 x E5-2603 @1.8GHz
> RAM: 16GB
> Network: 1G port shared for Ceph public and cluster traffics

This will hamper throughput a lot. 

> Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
> OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)

OK, let's stop here first: consumer-grade SSDs. Percona did a nice
writeup about "fsync" speed on consumer-grade SSDs [1]. As I don't know
which drives you use, this might or might not be the issue.

> 
> This is not beefy enough in any way, but I am running for PoC only,
> with minimum utilization.
> 
> Ceph-mon and ceph-mgr daemons are hosted on the OpenStack Controller
> nodes. Ceph-ansible version is 3.1 and is using Filestore with
> non-colocated scenario (1 SSD for every 2 OSDs). Connection speed
> among Controllers, Computes, and OSD nodes can reach ~900Mbps tested
> using iperf.

Why filestore, if I may ask? I guess BlueStore with its WAL/DB on the
SSD and data on SATA should give you better performance, if the SSDs
are suitable for the job at all.

What version of Ceph are you using? Metrics can give you a lot of
insight. Did you take a look at those? For example the Ceph mgr dashboard?

> 
> I followed the Red Hat Ceph 3 benchmarking procedure [1] and received
> following results:
> 
> Write Test:
> 
> Total time run: 80.313004
> Total writes made:  17
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 0.846687
> Stddev Bandwidth:   0.320051
> Max bandwidth (MB/sec): 2
> Min bandwidth (MB/sec): 0
> Average IOPS:   0
> Stddev IOPS:0
> Max IOPS:   0
> Min IOPS:   0
> Average Latency(s): 66.6582
> Stddev Latency(s):  15.5529
> Max latency(s): 80.3122
> Min latency(s): 29.7059
> 
> Sequencial Read Test:
> 
> Total time run:   25.951049
> Total reads made: 17
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   2.62032
> Average IOPS: 0
> Stddev IOPS:  0
> Max IOPS: 1
> Min IOPS: 0
> Average Latency(s):   24.4129
> Max latency(s):   25.9492
> Min latency(s):   0.117732
> 
> Random Read Test:
> 
> Total time run:   66.355433
> Total reads made: 46
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   2.77295
> Average IOPS: 0
> Stddev IOPS:  3
> Max IOPS: 27
> Min IOPS: 0
> Average Latency(s):   21.4531
> Max latency(s):   66.1885
> Min latency(s):   0.0395266
> 
> Apparently, the results are pathetic...
> 
> As I moved on to test block devices, I got a following error message:
> 
> # rbd map image01 --pool testbench --name client.admin
> rbd: failed to add secret 'client.admin' to kernel

What replication factor are you using?

Make sure you have the client.admin keyring on the node you are issuing
this command from. If the keyring is present where Ceph expects it to
be, you can omit the --name client.admin. On a monitor node you can
extract the admin keyring with: ceph auth export client.admin. Put the
output of that in /etc/ceph/ceph.client.admin.keyring and this should
work.
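
Roughly, on a monitor node:

# ceph auth export client.admin

Put that output in /etc/ceph/ceph.client.admin.keyring on the client
node, then retry:

# rbd map image01 --pool testbench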

> Any suggestions on the above error and/or debugging would be greatly
> appreciated!

Gr. Stefan

[1]:
https://www.percona.com/blog/2018/07/18/why-consumer-ssd-reviews-are-useless-for-database-performance-use-case/
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/#benchmarking_performance
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No recovery when "norebalance" flag set

2018-11-26 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> Haven't seen that exact issue.
> 
> One thing to note though is that if osd_max_backfills is set to 1,
> then it can happen that PGs get into backfill state, taking that
> single reservation on a given OSD, and therefore the recovery_wait PGs
> can't get a slot.
> I suppose that backfill prioritization is supposed to prevent this,
> but in my experience luminous v12.2.8 doesn't always get it right.

That's also our experience. Even if the degraded PGs with backfill /
recovery state are given a higher (forced) priority ... normal
backfilling still takes place.

> So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
> the recovering PGs.

It was still on "1" indeed. We tend to crank that (and max recovery) up
while keeping an eye on max read and write apply latency. In our setup
we can do 16 backfills concurrently, and/or 2 recoveries / 4 backfills.
Recovery speeds of ~ 4 - 5 GB/s ... pushing it beyond that tends to
crash OSDs.
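
I assume that would be something along these lines, cluster wide:

# ceph tell osd.* injectargs '--osd_max_backfills 2'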

We'll try your suggestion next time.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects afte: ceph osd in $osd

2018-11-26 Thread Stefan Kooman
Quoting Janne Johansson (icepic...@gmail.com):
> Yes, when you add a drive (or 10), some PGs decide they should have one or 
> more
> replicas on the new drives, a new empty PG is created there, and
> _then_ that replica
> will make that PG get into the "degraded" mode, meaning if it had 3
> fine active+clean
> replicas before, it now has 2 active+clean and one needing backfill to
> get into shape.
> 
> It is a slight mistake in reporting it in the same way as an error,
> even if it looks to the
> cluster just as if it was in error and needs fixing. This gives the
> new ceph admins a
> sense of urgency or danger whereas it should be perfectly normal to add space 
> to
> a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> PG and fill from
> the one going out into the new empty PG and somehow keep itself with 3 working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty
> one, leading to this kind of "error" report.

Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new PG instead of just assuming the new OSD will
be fine and discarding a perfectly healthy PG. We do have max_size 3 in
the CRUSH ruleset ... I wonder if Ceph would behave differently if we
had max_size 4 ... to actually allow a fourth copy in the first
place ...

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Full L3 Ceph

2018-11-25 Thread Stefan Kooman
Quoting Robin H. Johnson (robb...@gentoo.org):
> On Fri, Nov 23, 2018 at 04:03:25AM +0700, Lazuardi Nasution wrote:
> > I'm looking example Ceph configuration and topology on full layer 3
> > networking deployment. Maybe all daemons can use loopback alias address in
> > this case. But how to set cluster network and public network configuration,
> > using supernet? I think using loopback alias address can prevent the
> > daemons down due to physical interfaces disconnection and can load balance
> > traffic between physical interfaces without interfaces bonding, but with
> > ECMP.
> I can say I've done something similar**, but I don't have access to that
> environment or most*** of the configuration anymore.
> 
> One of the parts I do recall, was explicitly setting cluster_network
> and public_network to empty strings, AND using public_addr+cluster_addr
> instead, with routable addressing on dummy interfaces (NOT loopback).

You can do this with MP-BGP (VXLAN) EVPN. We are running it like that.
IPv6 overlay network only. ECMP to make use of all the links. We don't
use a separate cluster network. That only complicates things, and
there's no real use for it (trademark by Wido den Hollander). If you
want to use BGP on the hosts themselves, have a look at this post by
Vincent Bernat (great writeups of complex networking stuff) [1]. You can
use "MC-LAG" on the host to get redundant connectivity, or use "Type 4"
EVPN to get endpoint redundancy (Ethernet Segment Route). FRR 6.0 has
support for most of this (not yet "Type 4" EVPN support IIRC) [2].
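
For reference, a minimal sketch of what that could look like in
ceph.conf (addresses made up, no separate cluster network):

[global]
public_network =
cluster_network =

[osd.0]
public_addr = 2001:db8:100::10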

We use a network namespace to separate (IPv6) management traffic
from production traffic. This complicates Ceph deployment a lot, but in
the end it's worth it.

Gr. Stefan

[1]: https://vincent.bernat.ch/en/blog/2017-vxlan-bgp-evpn
[2]: https://frrouting.org/


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Degraded objects afte: ceph osd in $osd

2018-11-25 Thread Stefan Kooman
Hi List,

Another interesting and unexpected thing we observed during cluster
expansion is the following. After we added extra disks to the cluster,
while "norebalance" flag was set, we put the new OSDs "IN". As soon as
we did that a couple of hundered objects would become degraded. During
that time no OSD crashed or restarted. Every "ceph osd crush add $osd
weight host=$storage-node" would cause extra degraded objects.

I don't expect objects to become degraded when extra OSDs are added.
Misplaced, yes. Degraded, no.

Someone got an explanation for this?

Gr. Stefan



-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] No recovery when "norebalance" flag set

2018-11-25 Thread Stefan Kooman
Hi list,

During cluster expansion (adding extra disks to existing hosts) some
OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
error (39) Directory not empty not handled on operation 21 (op 1,
counting from 0), full details: https://8n1.org/14078/c534). We had
"norebalance", "nobackfill", and "norecover" flags set. After we unset
nobackfill and norecover (to let Ceph fix the degraded PGs) it would
recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
were supposed to have a copy of them, and they were already "probed".  A
day later (~24 hours) it would still not have recovered the degraded
objects.  After we unset the "norebalance" flag it would start
rebalancing, backfilling and recovering. The 12 degraded objects were
recovered.

Is this expected behaviour? I would expect Ceph to always try to fix
degraded things first and foremost. Even "pg force-recover" and "pg
force-backfill" could not force recovery.

Gr. Stefan




-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS hangs in "heartbeat_map" deadlock

2018-11-15 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> Quoting Patrick Donnelly (pdonn...@redhat.com):
> > Thanks for the detailed notes. It looks like the MDS is stuck
> > somewhere it's not even outputting any log messages. If possible, it'd
> > be helpful to get a coredump (e.g. by sending SIGQUIT to the MDS) or,
> > if you're comfortable with gdb, a backtrace of any threads that look
> > suspicious (e.g. not waiting on a futex) including `info threads`.

Today the issue reappeared (after being absent for ~ 3 weeks). This time
the standby MDS could take over and would not get into a deadlock
itself. We made gdb traces again, which you can find over here:

https://8n1.org/14011/d444
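
For anyone wanting to capture similar traces, something along these
lines should do it on the host running the stuck MDS:

# gdb -p $(pidof ceph-mds)
(gdb) info threads
(gdb) thread apply all bt full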

Would be great if someone could figure out what's causing this issue.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-08 Thread Stefan Kooman
Quoting Stefan Kooman (ste...@bit.nl):
> I'm pretty sure it isn't. I'm trying to do the same (force luminous
> clients only) but ran into the same issue. Even when running 4.19 kernel
> it's interpreted as a jewel client. Here is the list I made so far:
> 
> Kernel 4.13 / 4.15:
> "features": "0x7010fb86aa42ada",
> "release": "jewel"
> 
> kernel 4.18 / 4.19
>  "features": "0x27018fb86aa42ada",
>  "release": "jewel"

On a test cluster with kernel clients 4.13, 4.15, 4.19 I have set the
"ceph osd set-require-min-compat-client luminous --yes-i-really-mean-it"
while doing active IO ... no issues. Remounting also works ... it makes
me wonder how strict this "require-min-compat-client" setting really is ...

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS kernel client versions - pg-upmap

2018-11-08 Thread Stefan Kooman
Quoting Ilya Dryomov (idryo...@gmail.com):
> On Sat, Nov 3, 2018 at 10:41 AM  wrote:
> >
> > Hi.
> >
> > I tried to enable the "new smart balancing" - backend are on RH luminous
> > clients are Ubuntu 4.15 kernel.
[cut]
> > ok, so 4.15 kernel connects as a "hammer" (<1.0) client?  Is there a
> > huge gap in upstreaming kernel clients to kernel.org or what am I
> > misreading here?
> >
> > Hammer is 2015'ish - 4.15 is January 2018'ish?
> >
> > Is kernel client development lacking behind ?
> 
> Hi Jesper,
> 
> There are four different groups of clients in that output.  Which one
> of those four is the kernel client?  Are you sure it's just the hammer
> one?

I'm pretty sure it isn't. I'm trying to do the same (force luminous
clients only) but ran into the same issue. Even when running 4.19 kernel
it's interpreted as a jewel client. Here is the list I made so far:

Kernel 4.13 / 4.15:
"features": "0x7010fb86aa42ada",
"release": "jewel"

kernel 4.18 / 4.19
 "features": "0x27018fb86aa42ada",
 "release": "jewel"

I have tested both Ubuntu and CentOS mainline kernels. I came across
this issue filed by Sage [1], which is resolved, but which looks similar
to this one.

Gr. Stefan

[1]: https://tracker.ceph.com/issues/20475

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  1   2   >