Re: [ceph-users] units of metrics

2020-01-14 Thread Robert LeBlanc
On Tue, Jan 14, 2020 at 12:30 AM Stefan Kooman  wrote:

> Quoting Robert LeBlanc (rob...@leblancnet.us):
> > The link that you referenced above is no longer available, do you have a
> > new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
> > changed, so I'm trying to map the old values to the new values. Might
> just
> > have to look in the code. :(
>
> I cannot recall that the metrics have ever changed between 12.2.8 and
> 12.2.12. Anyways, it depends on what module you use to collect the
> metrics if the right metrics are even there. See this issue:
> https://tracker.ceph.com/issues/41881


Yes, I agree that the metrics should not change within a major version, but
here is the difference. We are using diamond with the CephCollector, but I
verified it by dumping the perf counters manually from the admin socket as well.
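For reference, this is roughly how I dumped them on the MDS host (the daemon
name is from this cluster; adjust to taste):

    # via the admin socket of the running MDS
    ceph daemon mds.mds01 perf dump mds_server

    # or, equivalently, pointing at the socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-mds.mds01.asok perf dump mds_server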

Metrics collected with 12.2.8:
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request
0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_link 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookup 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookuphash 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupino 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupname 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupparent 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lookupsnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_lssnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mkdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mknod 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_mksnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_open 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_readdir 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rename 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_renamesnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmdir 0 1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmsnap 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_rmxattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setdirlayout 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setfilelock 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setlayout 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_setxattr 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_symlink 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_unlink 0
1578955818
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.cap_revoke_eviction 0
1578955878

Metrics collected with 12.2.12: (much clearer and more descriptive, which is
good)
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_client_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.dispatch_server_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_client_session
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.handle_slave_request
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.avgtime
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_create_latency.sum
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.avgtime
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getattr_latency.sum
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgcount
0 1578955878
servers.mds01.CephCollector.ceph.mds.mds01.mds_server.req_getfilelock_latency.avgtime
0 157
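For what it's worth, this is how I read the mapping between the old and new
counters (a sketch based on the counter names, not something documented in
this thread): the plain req_* counters appear to have been replaced by
req_*_latency triplets, so

    requests served  = req_create_latency.avgcount
    total time spent = req_create_latency.sum       (seconds)
    average latency  = req_create_latency.sum / req_create_latency.avgcount
                     ~= req_create_latency.avgtime

i.e. a rate of .avgcount over time gives the old-style requests/sec, and
sum/avgcount (or avgtime) gives the per-request latency.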

Re: [ceph-users] units of metrics

2020-01-13 Thread Robert LeBlanc
The link that you referenced above is no longer available, do you have a
new link? We upgraded from 12.2.8 to 12.2.12 and the MDS metrics all
changed, so I'm trying to map the old values to the new values. Might just
have to look in the code. :(

Thanks!

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 12, 2019 at 8:02 AM Paul Emmerich 
wrote:

> We use a custom script to collect these metrics in croit
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Sep 12, 2019 at 5:00 PM Stefan Kooman  wrote:
> >
> > Hi Paul,
> >
> > Quoting Paul Emmerich (paul.emmer...@croit.io):
> > >
> https://static.croit.io/ceph-training-examples/ceph-training-example-admin-socket.pdf
> >
> > Thanks for the link. So, what tool do you use to gather the metrics? We
> > are using telegraf module of the Ceph manager. However, this module only
> > provides "sum" and not "avgtime" so I can't do the calculations. The
> > influx and zabbix mgr modules also only provide "sum". The only metrics
> > module that *does* send "avgtime" is the prometheus module:
> >
> > ceph_mds_reply_latency_sum
> > ceph_mds_reply_latency_count
> >
> > All modules use "self.get_all_perf_counters()" though:
> >
> > ~/git/ceph/src/pybind/mgr/ > grep -Ri get_all_perf_counters *
> > dashboard/controllers/perf_counters.py:return
> mgr.get_all_perf_counters()
> > diskprediction_cloud/agent/metrics/ceph_mon_osd.py:perf_data =
> obj_api.module.get_all_perf_counters(services=('mon', 'osd'))
> > influx/module.py:for daemon, counters in
> six.iteritems(self.get_all_perf_counters()):
> > mgr_module.py:def get_all_perf_counters(self, prio_limit=PRIO_USEFUL,
> > prometheus/module.py:for daemon, counters in
> self.get_all_perf_counters().items():
> > restful/api/perf.py:counters =
> context.instance.get_all_perf_counters()
> > telegraf/module.py:for daemon, counters in
> six.iteritems(self.get_all_perf_counters())
> >
> > Besides the *ceph* telegraf module we also use the ceph plugin for
> > telegraf ... but that plugin does not (yet?) provide mds metrics though.
> > Ideally we would *only* use the ceph mgr telegraf module to collect *all
> > the things*.
> >
> > Not sure what's the difference in python code between the modules that
> could explain this.
> >
> > Gr. Stefan
> >
> > --
> > | BIT BV  https://www.bit.nl/Kamer van Koophandel 09090351
> > | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.

2019-12-09 Thread Robert LeBlanc
On Mon, Dec 9, 2019 at 11:58 AM Paul Emmerich 
wrote:

> solved it: the warning is of course generated by ceph-mgr and not ceph-mon.
>
> So for my problem that means: should have injected the option in ceph-mgr.
> That's why it obviously worked when setting it on the pool...
>
> The solution for you is to simply put the option under global and restart
> ceph-mgr (or use daemon config set; it doesn't support changing config via
> ceph tell for some reason)
>
>
> Paul
>
> On Mon, Dec 9, 2019 at 8:32 PM Paul Emmerich 
> wrote:
>
>>
>>
>> On Mon, Dec 9, 2019 at 5:17 PM Robert LeBlanc 
>> wrote:
>>
>>> I've increased the deep_scrub interval on the OSDs on our Nautilus
>>> cluster with the following added to the [osd] section:
>>>
>>
>> should have read the beginning of your email; you'll need to set the
>> option on the mons as well because they generate the warning. So your
>> problem might be completely different from what I'm seeing here
>>
>
>
>>
>>
>> Paul
>>
>>
>>>
>>> osd_deep_scrub_interval = 260
>>>
>>> And I started seeing
>>>
>>> 1518 pgs not deep-scrubbed in time
>>>
>>> in ceph -s. So I added
>>>
>>> mon_warn_pg_not_deep_scrubbed_ratio = 1
>>>
>>> since the default would start warning with a whole week left to scrub.
>>> But the messages persist. The cluster has been running for a month with
>>> these settings. Here is an example of the output. As you can see, some of
>>> these are not even two weeks old, nowhere close to the 75% of 4 weeks.
>>>
>>> pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
>>>pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
>>>pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
>>>pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
>>>pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
>>>pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
>>>pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
>>>pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
>>>pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
>>>pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
>>>pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
>>>1468 more pgs...
>>> Mon Dec  9 08:12:01 PST 2019
>>>
>>> There is very little data on the cluster, so it's not a problem of
>>> deep-scrubs taking too long:
>>>
>>> $ ceph df
>>> RAW STORAGE:
>>>CLASS SIZEAVAIL   USEDRAW USED %RAW USED
>>>hdd   6.3 PiB 6.1 PiB 153 TiB  154 TiB  2.39
>>>nvme  5.8 TiB 5.6 TiB 138 GiB  197 GiB  3.33
>>>TOTAL 6.3 PiB 6.2 PiB 154 TiB  154 TiB  2.39
>>>
>>> POOLS:
>>>POOL   ID STORED  OBJECTS USED
>>>%USED MAX AVAIL
>>>.rgw.root   1 3.0 KiB   7 3.0 KiB
>>> 0   1.8 PiB
>>>default.rgw.control 2 0 B   8 0 B
>>> 0   1.8 PiB
>>>default.rgw.meta3 7.4 KiB  24 7.4 KiB
>>> 0   1.8 PiB
>>>default.rgw.log 4  11 GiB 341  11 GiB
>>> 0   1.8 PiB
>>>default.rgw.buckets.data6 100 TiB  41.84M 100 TiB
>>>  1.82   4.2 PiB
>>>default.rgw.buckets.index   7  33 GiB 574  33 GiB
>>> 0   1.8 PiB
>>>default.rgw.buckets.non-ec  8 8.1 MiB  22 8.1 MiB
>>> 0   1.8 PiB
>>>
>>> Please help me figure out what I'm doing wrong with these settings.
>>>
>>
Paul,

Thanks, I did set both options in the global section on the mons and restarted
them, but that didn't help. Having the scrub interval set in the global
section and restarting the mgr is what fixed it.
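For anyone hitting the same warning, this is roughly what that comes down to
(the interval value is whatever you chose above; the unit name assumes the mgr
id is the short hostname):

    [global]
    osd_deep_scrub_interval = <interval in seconds>
    mon_warn_pg_not_deep_scrubbed_ratio = 1

    # the warning is generated by ceph-mgr, so restart it after the change
    systemctl restart ceph-mgr@$(hostname -s)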


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[ceph-users] Annoying PGs not deep-scrubbed in time messages in Nautilus.

2019-12-09 Thread Robert LeBlanc
I've increased the deep_scrub interval on the OSDs on our Nautilus cluster
with the following added to the [osd] section:

osd_deep_scrub_interval = 260

And I started seeing

1518 pgs not deep-scrubbed in time

in ceph -s. So I added

mon_warn_pg_not_deep_scrubbed_ratio = 1

since the default would start warning with a whole week left to scrub. But
the messages persist. The cluster has been running for a month with these
settings. Here is an example of the output. As you can see, some of these
are not even two weeks old, nowhere close to the 75% of 4 weeks.

pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
   pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
   pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
   pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
   pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
   pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
   pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
   pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
   pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
   pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.51
   pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
   1468 more pgs...
Mon Dec  9 08:12:01 PST 2019

There is very little data on the cluster, so it's not a problem of
deep-scrubs taking too long:

$ ceph df
RAW STORAGE:
   CLASS SIZEAVAIL   USEDRAW USED %RAW USED
   hdd   6.3 PiB 6.1 PiB 153 TiB  154 TiB  2.39
   nvme  5.8 TiB 5.6 TiB 138 GiB  197 GiB  3.33
   TOTAL 6.3 PiB 6.2 PiB 154 TiB  154 TiB  2.39

POOLS:
   POOL   ID STORED  OBJECTS USED
   %USED MAX AVAIL
   .rgw.root   1 3.0 KiB   7 3.0 KiB
0   1.8 PiB
   default.rgw.control 2 0 B   8 0 B
0   1.8 PiB
   default.rgw.meta3 7.4 KiB  24 7.4 KiB
0   1.8 PiB
   default.rgw.log 4  11 GiB 341  11 GiB
0   1.8 PiB
   default.rgw.buckets.data6 100 TiB  41.84M 100 TiB
 1.82   4.2 PiB
   default.rgw.buckets.index   7  33 GiB 574  33 GiB
0   1.8 PiB
   default.rgw.buckets.non-ec  8 8.1 MiB  22 8.1 MiB
0   1.8 PiB

Please help me figure out what I'm doing wrong with these settings.

Thanks,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


[ceph-users] Cephfs metadata fix tool

2019-12-07 Thread Robert LeBlanc
Our Jewel cluster is exhibiting some similar issues to the one in this
thread [0] and it was indicated that a tool would need to be written to fix
that kind of corruption. Has the tool been written? How would I go about
repairing these 16EB directories that won't delete?

Thank you,
Robert LeBlanc

[0] https://www.spinics.net/lists/ceph-users/msg31598.html

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Robert LeBlanc
On Tue, Dec 3, 2019 at 9:11 AM Ed Fisher  wrote:

>
>
> On Dec 3, 2019, at 10:28 AM, Robert LeBlanc  wrote:
>
> Did you make progress on this? We have a ton of < 64K objects as well and
> are struggling to get good performance out of our RGW. Sometimes we have
> RGW instances that are just gobbling up CPU even when there are no requests
> to them, so it seems like things are getting hung up somewhere. There is
> nothing in the logs and I haven't had time to do more troubleshooting.
>
>
> There's a bug in the current stable Nautilus release that causes a loop
> and/or crash in get_obj_data::flush (you should be able to see it gobbling
> up CPU in perf top). This is the related issue:
> https://tracker.ceph.com/issues/39660 -- it should be fixed as soon as
> 14.2.5 is released (any day now, supposedly).
>

We will try out the new version when it's released and see if it improves
things for us.
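In the meantime, something like this should show whether a gateway is spinning
in that loop (assuming the process is named radosgw on your hosts):

    perf top -g -p $(pgrep -d, radosgw)

and then look for get_obj_data::flush near the top of the output.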

Thanks,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] RGW performance with low object sizes

2019-12-03 Thread Robert LeBlanc
90   p99
> max |  avg   min   p25   p50   p75   p90   p99   max |
>
> +-++++
> |   8 | 196.3 MB/s |2 1 2 2 3 3 5
> 5 |2 1 2 2 3 3 5 5 |
>
> +-++++
> [...section CLEANUP was deleted...]
>
>
Did you make progress on this? We have a ton of < 64K objects as well and
are struggling to get good performance out of our RGW. Sometimes we have
RGW instances that are just gobbling up CPU even when there are no requests
to them, so it seems like things are getting hung up somewhere. There is
nothing in the logs and I haven't had time to do more troubleshooting.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

>


Re: [ceph-users] Revert a CephFS snapshot?

2019-12-03 Thread Robert LeBlanc
On Thu, Nov 14, 2019 at 11:48 AM Sage Weil  wrote:

> On Thu, 14 Nov 2019, Patrick Donnelly wrote:
> > On Wed, Nov 13, 2019 at 6:36 PM Jerry Lee 
> wrote:
> > >
> > > On Thu, 14 Nov 2019 at 07:07, Patrick Donnelly 
> wrote:
> > > >
> > > > On Wed, Nov 13, 2019 at 2:30 AM Jerry Lee 
> wrote:
> > > > > Recently, I'm evaluating the snpahsot feature of CephFS from kernel
> > > > > client and everthing works like a charm.  But, it seems that
> reverting
> > > > > a snapshot is not available currently.  Is there some reason or
> > > > > technical limitation that the feature is not provided?  Any
> insights
> > > > > or ideas are appreciated.
> > > >
> > > > Please provide more information about what you tried to do (commands
> > > > run) and how it surprised you.
> > >
> > > The thing I would like to do is to rollback a snapped directory to a
> > > previous version of snapshot.  It looks like the operation can be done
> > > by over-writting all the current version of files/directories from a
> > > previous snapshot via cp.  But cp may take lots of time when there are
> > > many files and directories in the target directory.  Is there any
> > > possibility to achieve the goal much faster from the CephFS internal
> > > via command like "ceph fs   snap rollback
> > > " (just a example)?  Thank you!
> >
> > RADOS doesn't support rollback of snapshots so it needs to be done
> > manually. The best tool to do this would probably be rsync of the
> > .snap directory with appropriate options including deletion of files
> > that do not exist in the source (snapshot).
>
> rsync is the best bet now, yeah.
>
> RADOS does have a rollback operation that uses clone where it can, but
> it's a per-object operation, so something still needs to walk the
> hierarchy and roll back each file's content.  The MDS could do this more
> efficiently than rsync give what it knows about the snapped inodes
> (skipping untouched inodes or, eventually, entire subtrees) but it's a
> non-trivial amount of work to implement.
>

Would it make sense to extend CephFS to leverage reflinks for cases like
this? That could be faster than rsync and more space efficient. It would
require some development time though.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Decreasing the impact of reweighting osds

2019-10-25 Thread Robert LeBlanc
You can try adding

osd op queue = wpq
osd op queue cut off = high

to all the OSD ceph configs and restarting. That has made reweighting
pretty painless for us.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Oct 22, 2019 at 8:36 PM David Turner  wrote:
>
> Most times you are better served with simpler settings like 
> osd_recovery_sleep, which has 3 variants if you have multiple types of OSDs 
> in your cluster (osd_recovery_sleep_hdd, osd_recovery_sleep_sdd, 
> osd_recovery_sleep_hybrid). Using those you can tweak a specific type of OSD 
> that might be having problems during recovery/backfill while allowing the 
> others to continue to backfill at regular speeds.
>
> Additionally you mentioned reweighting OSDs, but it sounded like you do this 
> manually. The balancer module, especially in upmap mode, can be configured 
> quite well to minimize client IO impact while balancing. You can specify 
> times of day that it can move data (only in UTC, it ignores local timezones), 
> a threshold of misplaced data that it will stop moving PGs at, the increment 
> size it will change weights with per operation, how many weights it will 
> adjust with each pass, etc.
>
> On Tue, Oct 22, 2019, 6:07 PM Mark Kirkwood  
> wrote:
>>
>> Thanks - that's a good suggestion!
>>
>> However I'd still like to know the answers to my 2 questions.
>>
>> regards
>>
>> Mark
>>
>> On 22/10/19 11:22 pm, Paul Emmerich wrote:
>> > getting rid of filestore solves most latency spike issues during
>> > recovery because they are often caused by random XFS hangs (splitting
>> > dirs or just xfs having a bad day)
>> >
>> >
>> > Paul
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 12:35 PM huxia...@horebdata.cn
 wrote:
>
> hello, Robert
>
> thanks for the quick reply. I did test with  osd op queue = wpq , and osd 
> op queue cut off = high
> and
> osd_recovery_op_priority = 1
> osd recovery delay start = 20
> osd recovery max active = 1
> osd recovery max chunk = 1048576
> osd recovery sleep = 1
> osd recovery sleep hdd = 1
> osd recovery sleep ssd = 1
> osd recovery sleep hybrid = 1
> osd recovery priority = 1
> osd max backfills = 1
> osd backfill scan max = 16
> osd backfill scan min = 4
> osd_op_thread_suicide_timeout = 300
>
> But still the ceph cluster showed extremely hug recovery activities during 
> the beginning of the recovery, and after ca. 5-10 minutes, the recovery 
> gradually get under the control. I guess this is quite similar to what you 
> encountered in Nov. 2015.
>
> It is really annoying, and what else can i do to mitigate this weird 
> inital-recovery issue? any suggestions are much appreciated.

Hmm, on our Luminous cluster we have the defaults other than the op
queue and cut off, and bringing in a node has nearly zero impact on
client traffic. Those would need to be set on all OSDs to be
completely effective. Maybe go back to the defaults?
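One way to narrow it down is to compare what an OSD is actually running with
against what you think you set (osd.0 is just an example id):

    # the values the daemon is currently using
    ceph daemon osd.0 config show | grep -E 'osd_recovery|osd_max_backfills|osd_op_queue'

    # newer builds can also show only the options that differ from the defaults
    ceph daemon osd.0 config diff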


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-17 Thread Robert LeBlanc
On Thu, Oct 17, 2019 at 12:08 PM huxia...@horebdata.cn
 wrote:
>
> I happened to find a note that you wrote in Nov 2015: 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-November/006173.html
> and I believe this is what i just hit exactly the same behavior : a host down 
> will badly take the client performance down 1/10 (with 200MB/s recovery 
> workload) and then took ten minutes  to get good control of OSD recovery.
>
> Could you please share how did you eventally solve that issue? by seting a 
> fair large OSD recovery delay start or any other parameter?

Wow! Dusting off the cobwebs here. I think this is what led me to dig
into the code and write the WPQ scheduler. I can't remember doing
anything specific. I'm sorry I'm not much help in this regard.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-16 Thread Robert LeBlanc
On Wed, Oct 16, 2019 at 11:53 AM huxia...@horebdata.cn
 wrote:
>
> My Ceph version is Luminuous 12.2.12. Do you think should i upgrade to 
> Nautilus, or will Nautilus have a better control of recovery/backfilling?

We have a Jewel cluster and a Luminous cluster that we have changed
these settings on, and it really helped both of them.

----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Openstack VM IOPS drops dramatically during Ceph recovery

2019-10-14 Thread Robert LeBlanc
On Thu, Oct 10, 2019 at 2:23 PM huxia...@horebdata.cn
 wrote:
>
> Hi, folks,
>
> I have a middle-size Ceph cluster as cinder backup for openstack (queens). 
> Duing testing, one Ceph node went down unexpected and powered up again ca 10 
> minutes later, Ceph cluster starts PG recovery. To my surprise,  VM IOPS 
> drops dramatically during Ceph recovery, from ca. 13K IOPS to about 400, a 
> factor of 1/30, and I did put a stringent throttling on backfill and 
> recovery, with the following ceph parameters
>
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_client_op_priority=63
> osd_recovery_op_priority=1
> osd_recovery_sleep = 0.5
>
> The most weird thing is,
> 1) when there is no IO activity from any VM (ALL VMs are quiet except the 
> recovery IO), the recovery bandwidth is ca. 10MiB/s, 2 objects/s. Seems like 
> recovery throttle setting is working properly
> 2) when using FIO testing inside a VM, the recovery bandwith is going up 
> quickly, reaching above 200MiB/s, 60 objects/s. FIO IOPS performance inside 
> VM, however, is only at 400 IOPS/s (8KiB block size), around 3MiB/s. Obvious 
> recovery throttling DOES NOT work properly
> 3) If i stop the FIO testing in VM, the recovery bandwith then goes down to  
> 10MiB/s, 2 objects/s again, strange enough.
>
> How can this weird behavior happen? I just wonder, is there a method to 
> configure recovery bandwith to a specific value, or the number of recovery 
> objects per second? this may give better control of bakcfilling/recovery, 
> instead of the faulty logic or relative osd_client_op_priority vs 
> osd_recovery_op_priority.
>
> any ideas or suggests to make the recovery under control?
>
> best regards,
>
> Samuel

Not sure which version of Ceph you are on, but add these to your
/etc/ceph/ceph.conf on all your OSDs and restart them.

osd op queue = wpq
osd op queue cut off = high

That should really help and make backfills and recovery be
non-impactful. This will be the default in Octopus.
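As far as I know, the op queue options only take effect at startup, so after
the restart you can confirm they took with something like this (osd.0 is just
an example):

    ceph daemon osd.0 config get osd_op_queue
    ceph daemon osd.0 config get osd_op_queue_cut_off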

--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Robert LeBlanc
On Tue, Oct 1, 2019 at 7:54 AM Robert LeBlanc  wrote:
>
> On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak
>  wrote:
> >
> > At this point, I ran out of ideas.  I changed nr_requests and readahead 
> > parameters to 128->1024 and 128->4096, tuned nodes to 
> > performance-throughput.  However, I still get high latency during benchmark 
> > testing.  I attempted to disable cache on ssd
> >
> > for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done
> >
> > and I think it make things not better at all.  I have H740 and H730 
> > controllers with drives in HBA mode.
> >
> > Other them converting them one by one to RAID0 I am not sure what else I 
> > can try.
> >
> > Any suggestions?
>
> If you haven't already tried this, add this to your ceph.conf and
> restart your OSDs, this should help bring down the variance in latency
> (It will be the default in Octopus):
>
> osd op queue = wpq
> osd op queue cut off = high

I should clarify. This will reduce the variance in latency for client
OPs. If this counter also includes recovery/backfill/deep_scrub OPs,
then the latency can still be high, as these settings make
recovery/backfill/deep_scrub less impactful to client I/O at the cost
of them possibly being delayed a bit.
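A quick way to watch the commit/apply numbers move while testing (this just
reads what the OSDs report to the monitors):

    watch -n 5 'ceph osd perf'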

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Commit and Apply latency on nautilus

2019-10-01 Thread Robert LeBlanc
On Mon, Sep 30, 2019 at 5:12 PM Sasha Litvak
 wrote:
>
> At this point, I ran out of ideas.  I changed nr_requests and readahead 
> parameters to 128->1024 and 128->4096, tuned nodes to performance-throughput. 
>  However, I still get high latency during benchmark testing.  I attempted to 
> disable cache on ssd
>
> for i in {a..f}; do hdparm -W 0 -A 0 /dev/sd$i; done
>
> and I think it make things not better at all.  I have H740 and H730 
> controllers with drives in HBA mode.
>
> Other them converting them one by one to RAID0 I am not sure what else I can 
> try.
>
> Any suggestions?

If you haven't already tried this, add this to your ceph.conf and
restart your OSDs; this should help bring down the variance in latency
(it will be the default in Octopus):

osd op queue = wpq
osd op queue cut off = high


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-24 Thread Robert LeBlanc
On Tue, Sep 24, 2019 at 4:33 AM Thomas <74cmo...@gmail.com> wrote:
>
> Hi,
>
> I'm experiencing the same issue with this setting in ceph.conf:
> osd op queue = wpq
> osd op queue cut off = high
>
> Furthermore I cannot read any old data in the relevant pool that is
> serving CephFS.
> However, I can write new data and read this new data.

If you restarted all the OSDs with this setting, it won't necessarily
prevent any blocked IO; it just really helps prevent the very long
blocked IO and makes sure that IO is eventually completed in a fairer
manner.

It sounds like you may have some MDS issues that are deeper than my
understanding. The first thing I'd try is to bounce the MDS service.

> > If I want to add this my ceph-ansible playbook parameters, in which files I 
> > should add it and what is the best way to do it ?
> >
> > Add those 3 lines in all.yml or osds.yml ?
> >
> > ceph_conf_overrides:
> >   global:
> > osd_op_queue_cut_off: high
> >
> > Is there another (better?) way to do that?

I can't speak to either of those approaches. I wanted all my config in
a single file, so I put it in my inventory file, but it looks like you
have the right idea.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] hanging slow requests: failed to authpin, subtree is being exported

2019-09-23 Thread Robert LeBlanc
84:92684 lookup
> > #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
> > caller_gid=0{0,}) currently failed to authpin, subtree is being exported
> > 2019-09-23 12:07:40.621 7f4f401e8700  0 log_channel(cluster) log [WRN]
> > : slow request 1923.409501 seconds old, received at 2019-09-23
> > 11:35:37.217113: client_request(client.38347357:111963 lookup
> > #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
> > caller_gid=0{0,}) currently failed to authpin, subtree is being exported
> > 2019-09-23 12:29:20.639 7f4f401e8700  0 log_channel(cluster) log [WRN]
> > : slow request 3843.057602 seconds old, received at 2019-09-23
> > 11:25:17.598152: client_request(client.38352684:92684 lookup
> > #0x100152383ce/vsc42531 2019-09-23 11:25:17.598077 caller_uid=0,
> > caller_gid=0{0,}) currently failed to authpin, subtree is being exported
> > 2019-09-23 12:39:40.872 7f4f401e8700  0 log_channel(cluster) log [WRN]
> > : slow request 3843.664914 seconds old, received at 2019-09-23
> > 11:35:37.217113: client_request(client.38347357:111963 lookup
> > #0x20005b0130c/testing 2019-09-23 11:35:37.217015 caller_uid=0,
> > caller_gid=0{0,}) currently failed to authpin, subtree is being exported
> If I try to ls this paths, the client will hang.
>
> I tried this using ceph kernel client of centos7.6 and now also with the
> ceph-fuse of 14.2.3, I see the issue with both. I tried remounting, but
> that did not solve the issue, if I restart the mds, the issue goes away
> - for some time
>
>
> > [root@mds02 ~]# ceph -s
> >   cluster:
> > id: 92bfcf0a-1d39-43b3-b60f-44f01b630e47
> > health: HEALTH_WARN
> > 1 MDSs report slow requests
> > 2 MDSs behind on trimming
> >
> >   services:
> > mon: 3 daemons, quorum mds01,mds02,mds03 (age 4d)
> > mgr: mds02(active, since 4d), standbys: mds01, mds03
> > mds: ceph_fs:2 {0=mds03=up:active,1=mds02=up:active} 1 up:standby
> > osd: 535 osds: 535 up, 535 in
> >
> >   data:
> > pools:   3 pools, 3328 pgs
> > objects: 375.14M objects, 672 TiB
> > usage:   1.0 PiB used, 2.2 PiB / 3.2 PiB avail
> > pgs: 3319 active+clean
> >  9active+clean+scrubbing+deep
> >
> >   io:
> > client:   141 KiB/s rd, 54 MiB/s wr, 62 op/s rd, 577 op/s wr
> >
>
> > [root@mds02 ~]# ceph health detail
> > HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
> > MDS_SLOW_REQUEST 1 MDSs report slow requests
> > mdsmds02(mds.1): 2 slow requests are blocked > 30 secs
> > MDS_TRIM 2 MDSs behind on trimming
> > mdsmds02(mds.1): Behind on trimming (3407/200) max_segments: 200,
> > num_segments: 3407
> > mdsmds03(mds.0): Behind on trimming (4240/200) max_segments: 200,
> > num_segments: 4240
>
> Can someone help me to debug this further?

What is the makeup of your cluster? It sounds like it may be all HDD.

If so, try adding this to /etc/ceph/ceph.conf on your OSDs and restart
the processes.

osd op queue cut off = high

Depending on your version (default in newer versions), adding

osd op queue = wpq

can also help.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] ceph; pg scrub errors

2019-09-23 Thread Robert LeBlanc
On Thu, Sep 19, 2019 at 4:34 AM M Ranga Swami Reddy
 wrote:
>
> Hi-Iam using ceph 12.2.11. here I am getting a few scrub errors. To fix these 
> scrub error I ran the "ceph pg repair ".
> But scrub error not going and the repair is talking long time like 8-12 hours.

Depending on the size of the PGs and how active the cluster is, it
could take a long time, since another deep scrub has to run to
clear the error status after a repair. Since it is not going away,
either the problem is too complicated to repair automatically and
needs to be done by hand, or the repair succeeded but the problem
reappeared (or another one was found) on the verifying deep scrub,
in which case the disk needs to be replaced.

Try running:
rados list-inconsistent-obj ${PG} --format=json

and see what the exact problems are.
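For example (the PG id and the jq filter here are just illustrative):

    rados list-inconsistent-obj 6.1f49 --format=json-pretty

    # or pull out just the per-object error flags
    rados list-inconsistent-obj 6.1f49 --format=json | \
        jq '.inconsistents[] | {object: .object.name, errors: .errors}'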
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] cephfs performance issue MDSs report slow requests and osd memory usage

2019-09-23 Thread Robert LeBlanc
 blocked for > 62442.674748 secs
> > 2019-09-19 08:52:53.960684 mds.icadmin006 [WRN] 10 slow requests, 2 
> > included below; oldest blocked for > 62447.674825 secs
> > 2019-09-19 08:52:53.960692 mds.icadmin006 [WRN] slow request 61441.895507 
> > seconds old, received at 2019-09-18 17:48:52.065114: rejoin:mds.1:13 
> > currently dispatched
> > 2019-09-19 08:52:53.960697 mds.icadmin006 [WRN] slow request 61441.895489 
> > seconds old, received at 2019-09-18 17:48:52.065131: rejoin:mds.1:14 
> > currently dispatched
> > 2019-09-19 08:52:57.527852 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62451.242174 secs
> > 2019-09-19 08:53:02.527972 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62456.242289 secs
> > 2019-09-19 08:52:58.960777 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62452.674936 secs
> > 2019-09-19 08:53:03.960853 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62457.675011 secs
> > 2019-09-19 08:53:07.528033 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62461.242354 secs
> > 2019-09-19 08:53:12.528177 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62466.242487 secs
> > 2019-09-19 08:53:08.960965 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62462.675123 secs
> > 2019-09-19 08:53:13.961034 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62467.675195 secs
> > 2019-09-19 08:53:17.528276 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62471.242592 secs
> > 2019-09-19 08:53:22.528407 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62476.242729 secs
> > 2019-09-19 08:53:18.961149 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62472.675310 secs
> > 2019-09-19 08:53:23.961234 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62477.675392 secs
> > 2019-09-19 08:53:27.528509 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62481.242832 secs
> > 2019-09-19 08:53:32.528651 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62486.242961 secs
> > 2019-09-19 08:53:28.961314 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62482.675471 secs
> > 2019-09-19 08:53:33.961393 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62487.675549 secs
> > 2019-09-19 08:53:37.528706 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62491.243031 secs
> > 2019-09-19 08:53:42.528790 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62496.243105 secs
> > 2019-09-19 08:53:38.961476 mds.icadmin006 [WRN] 10 slow requests, 1 
> > included below; oldest blocked for > 62492.675617 secs
> > 2019-09-19 08:53:38.961485 mds.icadmin006 [WRN] slow request 61441.151061 
> > seconds old, received at 2019-09-18 17:49:37.810351: 
> > client_request(client.21441:176429 getattr pAsLsXsFs #0x1f2b1b3 
> > 2019-09-18 17:49:37.806002 caller_uid=204878, caller_gid=11233{}) currently 
> > failed to rdlock, waiting
> > 2019-09-19 08:53:43.961569 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62497.675728 secs
> > 2019-09-19 08:53:47.528891 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62501.243214 secs
> > 2019-09-19 08:53:52.529021 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62506.243337 secs
> > 2019-09-19 08:53:48.961685 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62502.675839 secs
> > 2019-09-19 08:53:53.961792 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62507.675948 secs
> > 2019-09-19 08:53:57.529113 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62511.243437 secs
> > 2019-09-19 08:54:02.529224 mds.icadmin007 [WRN] 3 slow requests, 0 included 
> > below; oldest blocked for > 62516.243546 secs
> > 2019-09-19 08:53:58.961866 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62512.676025 secs
> > 2019-09-19 08:54:03.961939 mds.icadmin006 [WRN] 10 slow requests, 0 
> > included below; oldest blocked for > 62517.676099 secs
>
> Thanks for your help.

If you haven't set:

osd op queue cut off = high

in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should
help quite a bit with pure HDD clusters.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Failure to start ceph-mon in docker

2019-08-29 Thread Robert LeBlanc
Frank,

Thank you for the explanation; these are freshly installed machines and did
not have Ceph on them. I checked one of the other OSD nodes and there is no
ceph user in /etc/passwd, nor is UID 167 allocated to any user. I did
install ceph-common from the 18.04 repos before realizing that deploying
Ceph in containers did not update the host's /etc/apt/sources.list (or add
an entry in /etc/apt/sources.list.d/). I manually added the repo for
Nautilus and upgraded the packages, so I don't know if that had anything
to do with it. Maybe Ubuntu packages ceph under UID 64045 and upgrading to
the Ceph-distributed packages didn't change the UID.
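A quick way to spot the mismatch, in case anyone else runs into this (the
container name follows the ceph-ansible naming used on our hosts):

    # on the host
    grep '^ceph:' /etc/passwd

    # inside the mon container
    docker exec ceph-mon-$(hostname -s) grep '^ceph:' /etc/passwd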

Thanks,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Aug 29, 2019 at 12:33 AM Frank Schilder  wrote:

> Hi Robert,
>
> this is a bit less trivial than it might look right now. The ceph user is
> usually created by installing the package ceph-common. By default it will
> use id 167. If the ceph user already exists, I would assume it will use the
> existing user to allow an operator to avoid UID collisions (if 167 is used
> already).
>
> If you use docker, the ceph UID on the host and inside the container
> should match (or need to be translated). If they don't, you will have a lot
> of fun re-owning stuff all the time, because deployments will use the
> symbolic name ceph, which has different UIDs on the host and inside the
> container in your case.
>
> I would recommend removing this discrepancy as soon as possible:
>
> 1) Find out why there was a ceph user with UID different from 167 before
> installation of ceph-common.
>Did you create it by hand? Was UID 167 allocated already?
> 2) If you can safely change the GID and UID of ceph to 167, just do
> groupmod+usermod with new GID and UID.
> 3) If 167 is used already by another service, you will have to map the
> UIDs between host and container.
>
> To prevent ansible from deploying dockerized ceph with mismatching user ID
> for ceph, add these tasks to an appropriate part of your deployment
> (general host preparation or so):
>
> - name: "Create group 'ceph'."
>   group:
> name: ceph
> gid: 167
> local: yes
> state: present
> system: yes
>
> - name: "Create user 'ceph'."
>   user:
> name: ceph
> password: "!"
> comment: "ceph-container daemons"
> uid: 167
> group: ceph
> shell: "/sbin/nologin"
> home: "/var/lib/ceph"
> create_home: no
> local: yes
> state: present
> system: yes
>
> This should err if a group and user ceph already exist with IDs different
> from 167.
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: ceph-users  on behalf of Robert
> LeBlanc 
> Sent: 28 August 2019 23:23:06
> To: ceph-users
> Subject: Re: [ceph-users] Failure to start ceph-mon in docker
>
> Turns out /var/lib/ceph was ceph.ceph and not 167.167, chowning it made
> things work. I guess only monitor needs that permission, rgw,mgr,osd are
> all happy without needing it to be 167.167.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Wed, Aug 28, 2019 at 1:45 PM Robert LeBlanc  <mailto:rob...@leblancnet.us>> wrote:
> We are trying to set up a new Nautilus cluster using ceph-ansible with
> containers. We got things deployed, but I couldn't run `ceph -s` on the host
> so decided to `apt install ceph-common` and installed the Luminous version
> from Ubuntu 18.04. For some reason the docker container that was running
> the monitor restarted and won't restart. I added the repo for Nautilus and
> upgraded ceph-common, but the problem persists. The Manager and OSD docker
> containers don't seem to be affected at all. I see this in the journal:
>
> Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Starting Ceph Monitor...
> Aug 28 20:40:55 sun-gcs02-osd01 docker[2926]: Error: No such container:
> ceph-mon-sun-gcs02-osd01
> Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Started Ceph Monitor.
> Aug 28 20:40:55 sun-gcs02-osd01 docker[2949]: WARNING: Your kernel does
> not support swap limit capabilities or the cgroup is not mounted. Memory
> limited without swap.
> Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:40:56
> /opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin
> cluster...
> Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: warning: line 41:
> 'osd_memory_target' in section 'osd' redefined
> 

[ceph-users] Specify OSD size and OSD journal size with ceph-ansible

2019-08-28 Thread Robert LeBlanc
I have a new cluster and I'd like to put the DB on the NVMe device, but
only make it 30GB, then use 100GB of the rest of the NVMe as an OSD for the
RGW metadata pool.

I set up the disks like the conf below without the block_db_size and it
created all the LVs on the HDDs and one LV on the NVMe that took up all the
space.

I've tried using block_db_size in vars, and also as a property in the list
for each OSD disk, but neither works.

With block_db_size in the vars I get:
failed: [sun-gcs02-osd01] (item={u'db': u'/dev/nvme0n1', u'data':
u'/dev/sda', u'crush_device_class': u'hdd'}) => changed=true
 ansible_loop_var: item
 cmd:


 - docker

   - run
 - --rm
 - --privileged
 - --net=host
 - --ipc=host
 - --ulimit
 - nofile=1024:1024
 - -v
 - /run/lock/lvm:/run/lock/lvm:z
 - -v
 - /var/run/udev/:/var/run/udev/:z
 - -v
 - /dev:/dev
 - -v
 - /etc/ceph:/etc/ceph:z
 - -v
 - /run/lvm/:/run/lvm/
 - -v
 - /var/lib/ceph/:/var/lib/ceph/:z
 - -v
 - /var/log/ceph/:/var/log/ceph/:z
 - --entrypoint=ceph-volume
 - docker.io/ceph/daemon:latest
 - --cluster
 - ceph
 - lvm
 - prepare
 - --bluestore
 - --data
 - /dev/sda
 - --block.db
 - /dev/nvme0n1
 - --crush-device-class
 - hdd
 delta: '0:00:05.004777'
 end: '2019-08-28 23:26:39.074850'
 item:
   crush_device_class: hdd
   data: /dev/sda
   db: /dev/nvme0n1
 msg: non-zero return code
 rc: 1
 start: '2019-08-28 23:26:34.070073'
 stderr: '-->  RuntimeError: unable to use device'
 stderr_lines: 
 stdout: |-
   Running command: /bin/ceph-authtool --gen-print-key
   Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
bcc7b3c3-6203-47c7-9f34-7b2e2060bf59
   Running command: /usr/sbin/vgcreate -s 1G --force --yes
ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4 /dev/sda
stdout: Physical volume "/dev/sda" successfully created.
stdout: Volume group "ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4"
successfully created
   Running command: /usr/sbin/lvcreate --yes -l 100%FREE -n
osd-block-bcc7b3c3-6203-47c7-9f34-7b2e2060bf59
ceph-76cd6a80-17dd-4a89-a35b-0844026bc9d4
stdout: Logical volume "osd-block-bcc7b3c3-6203-47c7-9f34-7b2e2060bf59"
created.
   --> blkid could not detect a PARTUUID for device: /dev/nvme0n1
   --> Was unable to complete a new OSD, will rollback changes
   Running command: /bin/ceph --cluster ceph --name client.bootstrap-osd
--keyring /var/lib/ceph/bootstrap-osd/ceph.keyring osd purge-new osd.21
--yes-i-really-mean-it
stderr: purged osd.21
  stdout_lines: 

... (One for each device) ...

And the LVs are created for all the HDD OSDs and none on the NVMe.

Looking through the code I don't see a way to set a size for the OSD, but
maybe I'm just missing it as I'm really new to Ansible.

osds:
 hosts:
   sun-gcs02-osd[01:43]:
   sun-gcs02-osd[45:60]:
 vars:
   block_db_size: 32212254720
   lvm_volumes:
 - data: '/dev/sda'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdb'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdc'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdd'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sde'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdf'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdg'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdh'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdi'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdj'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdk'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/sdl'
   db: '/dev/nvme0n1'
   crush_device_class: 'hdd'
 - data: '/dev/nvme0n1'  # Use the rest for metadata
   crush_device_class: 'nvme'

With block_db_size set for each disk, I got an error during the parameter
checking phase in Ansible and no LVs were created.

Please help me understand how to configure what I would like to do.
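One thing I'm considering trying, in case it helps frame an answer: pre-create
the LVs on the NVMe and point lvm_volumes at those instead of at the raw
device. This assumes ceph-ansible accepts db/db_vg and data/data_vg keys; the
names and sizes below are made up:

    # on each OSD host, before the playbook run
    vgcreate vg-nvme /dev/nvme0n1
    for i in a b c d e f g h i j k l; do lvcreate -L 30G -n db-sd$i vg-nvme; done
    lvcreate -L 100G -n osd-meta vg-nvme

    # then in the inventory
    lvm_volumes:
      - data: '/dev/sda'
        db: 'db-sda'
        db_vg: 'vg-nvme'
        crush_device_class: 'hdd'
      # ... one entry per HDD ...
      - data: 'osd-meta'
        data_vg: 'vg-nvme'
        crush_device_class: 'nvme'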

Thank you,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Failure to start ceph-mon in docker

2019-08-28 Thread Robert LeBlanc
Turns out /var/lib/ceph was ceph.ceph and not 167.167; chowning it made
things work. I guess only the monitor needs that permission; rgw, mgr, and osd
are all happy without needing it to be 167.167.
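Roughly, the fix came down to this on the affected host (the unit name assumes
the mon id is the short hostname, which is how ceph-ansible set it up here):

    chown -R 167:167 /var/lib/ceph
    systemctl restart ceph-mon@$(hostname -s)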

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Aug 28, 2019 at 1:45 PM Robert LeBlanc  wrote:

> We are trying to set up a new Nautilus cluster using ceph-ansible with
> containers. We got things deployed, but I couldn't run `ceph -s` on the host
> so decided to `apt install ceph-common` and installed the Luminous version
> from Ubuntu 18.04. For some reason the docker container that was running
> the monitor restarted and won't restart. I added the repo for Nautilus and
> upgraded ceph-common, but the problem persists. The Manager and OSD docker
> containers don't seem to be affected at all. I see this in the journal:
>
> Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Starting Ceph Monitor...
> Aug 28 20:40:55 sun-gcs02-osd01 docker[2926]: Error: No such container:
> ceph-mon-sun-gcs02-osd01
> Aug 28 20:40:55 sun-gcs02-osd01 systemd[1]: Started Ceph Monitor.
> Aug 28 20:40:55 sun-gcs02-osd01 docker[2949]: WARNING: Your kernel does
> not support swap limit capabilities or the cgroup is not mounted. Memory
> limited without swap.
> Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:40:56
>  /opt/ceph-container/bin/entrypoint.sh: Existing mon, trying to rejoin
> cluster...
> Aug 28 20:40:56 sun-gcs02-osd01 docker[2949]: warning: line 41:
> 'osd_memory_target' in section 'osd' redefined
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03
>  /opt/ceph-container/bin/entrypoint.sh: /etc/ceph/ceph.conf is already
> memory tuned
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: 2019-08-28 20:41:03
>  /opt/ceph-container/bin/entrypoint.sh: SUCCESS
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: PID 368: spawning
> /usr/bin/ceph-mon --cluster ceph --default-log-to-file=false
> --default-mon-cluster-log-to-file=false --setuser ceph --setgroup ceph -d
> --mon-cluster-log-to-stderr --log-stderr-prefix=debug  -i sun-gcs02-osd01
> --mon-data /var/lib/ceph/mon/ceph-sun-gcs02-osd01 --public-addr
> 10.65.101.21
>
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: exec: Waiting 368 to quit
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: warning: line 41:
> 'osd_memory_target' in section 'osd' redefined
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28
> 20:41:03.835 7f401283c180  0 set uid:gid to 167:167 (ceph:ceph)
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28
> 20:41:03.835 7f401283c180  0 ceph version 14.2.2
> (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable), process
> ceph-mon, pid 368
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28
> 20:41:03.835 7f401283c180 -1 stat(/var/lib/ceph/mon/ceph-sun-gcs02-osd01)
> (13) Permission denied
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: debug 2019-08-28
> 20:41:03.835 7f401283c180 -1 error accessing monitor data directory at
> '/var/lib/ceph/mon/ceph-sun-gcs02-osd01': (13) Permission denied
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: managing teardown
> after SIGCHLD
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Waiting PID 368 to
> terminate
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Process 368 is
> terminated
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: Bye Bye, container
> will die with return code -1
> Aug 28 20:41:03 sun-gcs02-osd01 docker[2949]: teardown: if you don't want
> me to die and have access to a shell to debug this situation, next time run
> me with '-e DEBUG=stayalive'
> Aug 28 20:41:04 sun-gcs02-osd01 systemd[1]:
> ceph-mon@sun-gcs02-osd01.service: Main process exited, code=exited,
> status=255/n/a
> Aug 28 20:41:04 sun-gcs02-osd01 systemd[1]:
> ceph-mon@sun-gcs02-osd01.service: Failed with result 'exit-code'.
>
> The directories for the monitor are owned by 167.167 and matches the
> UID.GID that the container reports.
>
> oot@sun-gcs02-osd01:~# ls -lhd /var/lib/ceph/
> drwxr-x--- 14 ceph ceph 4.0K Jul 30 22:15 /var/lib/ceph/
> root@sun-gcs02-osd01:~# ls -lh /var/lib/ceph/
> total 56K
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-mds
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-mgr
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-osd
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-rbd
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-rbd-mirror
> drwxr-xr-x   2 167 167 4.0K Jul 30 22:16 bootstrap-rgw
> drwxr-xr-x   3 167 167 4.0K Jul 30 22:15 mds
> drwxr-xr-x   3 167 167 4.0K Jul 30 22:15 mgr
> drwxr-xr-x   3 167 167 4.0K Jul 3

[ceph-users] Failure to start ceph-mon in docker

2019-08-28 Thread Robert LeBlanc
t
-rw-r--r-- 1 167 167  45M Aug 28 19:16 050228.sst
-rw-r--r-- 1 167 167   16 Aug 16 07:40 CURRENT
-rw-r--r-- 1 167 167   37 Jul 30 22:15 IDENTITY
-rw-r--r-- 1 167 1670 Jul 30 22:15 LOCK
-rw-r--r-- 1 167 167 1.3M Aug 28 19:16 MANIFEST-027846
-rw-r--r-- 1 167 167 4.7K Aug  1 23:38 OPTIONS-002825
-rw-r--r-- 1 167 167 4.7K Aug 16 07:40 OPTIONS-027849


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] MDSs report damaged metadata

2019-08-22 Thread Robert LeBlanc
We just had metadata damage show up on our Jewel cluster. I tried a few
things like renaming directories and scanning, but the damage would just
show up again in less than 24 hours. I finally just copied the directories
with the damage to a tmp location on CephFS, then swapped them with the
damaged ones. When I deleted the directories with the damage, the active MDS
crashed, but the replay took over just fine. I haven't had the messages now
for almost a week.
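
In rough commands, the workaround looked like this (the paths here are made
up for illustration; substitute your damaged directory):

  cp -a /cephfs/projects/foo /cephfs/tmp/foo.copy    # copy the contents out
  mv /cephfs/projects/foo /cephfs/tmp/foo.damaged    # park the damaged tree
  mv /cephfs/tmp/foo.copy /cephfs/projects/foo       # swap the clean copy in
  rm -rf /cephfs/tmp/foo.damaged                     # this delete is what crashed our active MDS

The copy has to read every file, so plan for the extra I/O, and be ready for
an MDS failover when the damaged tree is finally deleted.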

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Aug 19, 2019 at 10:30 PM Lars Täuber  wrote:

> Hi there!
>
> Does anyone else have an idea what I could do to get rid of this error?
>
> BTW: it is the third time that the pg 20.0 is gone inconsistent.
> This is a pg from the metadata pool (cephfs).
> May this be related anyhow?
>
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data
> damage: 1 pg inconsistent
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 20.0 is active+clean+inconsistent, acting [9,27,15]
>
>
> Best regards,
> Lars
>
>
> Mon, 19 Aug 2019 13:51:59 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> >
> > thanks for the hint.
> >
> > I did a recursive scrub from "/". The log says there were some inodes
> with bad backtraces repaired. But the error remains.
> > May this have something to do with a deleted file? Or a file within a
> snapshot?
> >
> > The path told by
> >
> > # ceph tell mds.mds3 damage ls
> > 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on
> v2:192.168.16.23:6800/176704036
> > 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on
> v2:192.168.16.23:6800/176704036
> > [
> > {
> > "damage_type": "backtrace",
> > "id": 3760765989,
> > "ino": 1099518115802,
> > "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> > }
> > ]
> >
> > starts a bit strange to me.
> >
> > Are the snapshots also repaired with a recursive repair operation?
> >
> > Thanks
> > Lars
> >
> >
> > Mon, 19 Aug 2019 13:30:53 +0200
> > Paul Emmerich  ==> Lars Täuber 
> :
> > > Hi,
> > >
> > > that error just says that the path is wrong. I unfortunately don't
> > > know the correct way to instruct it to scrub a stray path off the top
> > > of my head; you can always run a recursive scrub on / to go over
> > > everything, though
> > >
> > >
> > > Paul
> > >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does CephFS find a file?

2019-08-19 Thread Robert LeBlanc
I'm fairly new to CephFS, but in my poking around with it, this is what I
understand.

The MDS manages dentries as omap (simple key/value database) entries in the
metadata pool. Each directory's omap keeps a list of filenames and some
metadata about each file, such as the inode number and, I presume, some other
info such as size (I couldn't find documentation outlining the binary format
of the omap; I just did enough digging to find the inode location). The MDS
returns the inode and size to the client, and the client looks up the OSDs
for the inode using the CRUSH map, dividing the size by the stripe size to
know how many objects to fetch for the whole file. Each object is named by
the inode (in hex) followed by the object offset. The inode corresponds to
the same value shown by `ls -li` in CephFS, converted to hex.
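
As a concrete, hypothetical example of tracing this by hand (the inode, file
size, and pool name below are invented; 'cephfs_data' stands in for your data
pool):

  $ ls -li /mnt/cephfs/bigfile                         # note the inode number
  1099511627776 -rw-r--r-- 1 root root 8388608 Aug 19 10:00 /mnt/cephfs/bigfile
  $ printf '%x\n' 1099511627776                        # inode in hex
  10000000000
  $ rados -p cephfs_data ls | grep '^10000000000\.'    # backing objects (inode.offset)
  10000000000.00000000
  10000000000.00000001
  $ ceph osd map cephfs_data 10000000000.00000000      # PG and OSDs for the first object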

I hope that is correct and useful as a starting point for you.
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Aug 19, 2019 at 2:37 AM aot...@outlook.com 
wrote:

> I am a student new to cephfs. I think there are 2 steps to finding a file:
>
> 1.Find out which objects belong to this file.
>
> 2.Use CRUSH to find out OSDs.
>
>
>
> What I don't know is how CephFS gets the object list of a file. Does the
> MDS save the object list of every file? Or can CRUSH use some information
> (what information?) to calculate the list of objects? In other words, where
> is the object list of the file saved?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New CRUSH device class questions

2019-08-12 Thread Robert LeBlanc
On Wed, Aug 7, 2019 at 7:05 AM Paul Emmerich  wrote:

> ~ is the internal implementation of device classes. Internally it's
> still using separate roots, that's how it stays compatible with older
> clients that don't know about device classes.
>

That makes sense.


> And since it wasn't mentioned here yet: consider upgrading to Nautilus
> to benefit from the new and improved accounting for metadata space.
> You'll be able to see how much space is used for metadata and quotas
> should work properly for metadata usage.
>

I think I'm not explaining this well and it is confusing people. I don't want
to limit the size of the metadata pool, and I also don't want to limit the
size of the data pool, as the cluster's flexibility could make such a quota
out of date at any time and probably useless (since we want to use as much
space as possible for data). I would like to reserve space for the metadata
pool so that no other pool can touch it, much like when you thick provision a
VM disk file: it is guaranteed for that entity and no one else can use it,
even if it is mostly empty. So far people have only told me how to limit the
space of a pool, which is not what I'm looking for.

Thank you,
Robert LeBlanc


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Replay MDS server stuck

2019-08-09 Thread Robert LeBlanc
We had a outage of our Jewel 10.2.11 CephFS last night. Our primary MDS hit
an assert in ceph try_remove_dentries_for_stray(), but the replay MDS never
came up. The logs for MDS02 show:

---like clockwork these first two lines appear every second---
2019-08-02 16:27:24.664508 7f6f47f5c700  1 mds.0.0 standby_replay_restart
(as standby)
2019-08-02 16:27:24.689079 7f6f43341700  1 mds.0.0 replay_done (as standby)
2019-08-02 16:27:25.689210 7f6f47f5c700  1 mds.0.0 standby_replay_restart
(as standby)
2019-08-09 00:01:25.298131 7f6f4a862700  1 mds.0.10 handle_mds_map i am
now mds.0.10
2019-08-09 00:01:25.298135 7f6f4a862700  1 mds.0.10 handle_mds_map
state change up:standby-replay --> up:replay
2019-08-09 00:43:35.382921 7f6f46f5a700 -1 mds.sun-gcs01-mds02 *** got
signal Terminated ***
2019-08-09 00:43:35.382952 7f6f46f5a700  1 mds.sun-gcs01-mds02 suicide.
 wanted state up:replay
2019-08-09 00:43:35.409663 7f6f46f5a700  1 mds.0.10 shutdown: shutting
down rank 0
2019-08-09 00:43:35.414729 7f6f43341700  0 mds.0.log _replay journaler got
error -11, aborting
2019-08-09 00:48:36.819539 7fe6e89e8200  0 set uid:gid to 1001:1002
(ceph:ceph)
2019-08-09 00:48:36.819555 7fe6e89e8200  0 ceph version 10.2.11
(e4b061b47f07f583c92a050d9e84b1813a35671e), process ceph-mds, pid 39603
2019-08-09 00:48:36.820045 7fe6e89e8200  0 pidfile_write: ignore empty
--pid-file
2019-08-09 00:48:37.813857 7fe6e2088700  1 mds.sun-gcs01-mds02
handle_mds_map standby
2019-08-09 00:48:37.833089 7fe6e2088700  1 mds.0.0 handle_mds_map i am now
mds.19317235.0 replaying mds.0.0
2019-08-09 00:48:37.833097 7fe6e2088700  1 mds.0.0 handle_mds_map state
change up:boot --> up:standby-replay
2019-08-09 00:48:37.833106 7fe6e2088700  1 mds.0.0 replay_start
2019-08-09 00:48:37.833111 7fe6e2088700  1 mds.0.0  recovery set is
2019-08-09 00:48:37.849332 7fe6dd77e700  0 mds.0.cache creating system
inode with ino:100
2019-08-09 00:48:37.849627 7fe6dd77e700  0 mds.0.cache creating system
inode with ino:1
2019-08-09 00:48:40.548094 7fe6dab67700  0 log_channel(cluster) log [WRN] :
 replayed op client.10012302:8321663,8321660 used ino 10052d9c287 but
session next is 10052d57512
2019-08-09 00:48:40.844534 7fe6dab67700  1 mds.0.0 replay_done (as standby)
2019-08-09 00:48:41.844648 7fe6df782700  1 mds.0.0 standby_replay_restart
(as standby)
2019-08-09 00:48:41.868242 7fe6dab67700  1 mds.0.0 replay_done (as standby)
---last two lines repeat again every second---

I was thinking of a couple of option to improve recovery in this situation.

1. Monitor the replay status of the replay MDS and alert if it is too far
behind the active MDS, or if it isn't incrementing. I'm looking at `ceph
daemon mds. perf dump` as mds_log:rdpos was mentioned in another
thread, but the replay MDS has a higher number than the primary. Maybe even
some sort of timestamp or entry that increments with time or something we
can check.

2. A third MDS that isn't replaying the journal and is more of a cold
standby. It would take longer to start up, but seems if the replay failed
to take over, then it would read the journal and start up.

It seems that the issue that caused the primary to fail is hard to catch
and debug, so we just need to be sure that we can recover with failover.
Any hints on how to improve that would be appreciated.

Thank you,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New CRUSH device class questions

2019-08-07 Thread Robert LeBlanc
On Wed, Aug 7, 2019 at 12:08 AM Konstantin Shalygin  wrote:

> On 8/7/19 1:40 PM, Robert LeBlanc wrote:
>
> > Maybe it's the lateness of the day, but I'm not sure how to do that.
> > Do you have an example where all the OSDs are of class ssd?
> Can't parse what you mean. You always should paste your `ceph osd tree`
> first.
>

Our 'ceph osd tree' is like this:
ID  CLASS WEIGHT    TYPE NAME            STATUS REWEIGHT PRI-AFF
 -1       892.21326 root default
 -3        69.16382 host sun-pcs01-osd01
  0   ssd   3.49309 osd.0    up  1.0 1.0
  1   ssd   3.42329 osd.1    up  0.87482 1.0
  2   ssd   3.49309 osd.2    up  0.88989 1.0
  3   ssd   3.42329 osd.3    up  0.94989 1.0
  4   ssd   3.49309 osd.4    up  0.93993 1.0
  5   ssd   3.42329 osd.5    up  1.0 1.0
  6   ssd   3.49309 osd.6    up  0.89490 1.0
  7   ssd   3.42329 osd.7    up  1.0 1.0
  8   ssd   3.49309 osd.8    up  0.89482 1.0
  9   ssd   3.42329 osd.9    up  1.0 1.0
100   ssd   3.49309 osd.100  up  1.0 1.0
101   ssd   3.42329 osd.101  up  1.0 1.0
102   ssd   3.49309 osd.102  up  1.0 1.0
103   ssd   3.42329 osd.103  up  0.81482 1.0
104   ssd   3.49309 osd.104  up  0.87973 1.0
105   ssd   3.42329 osd.105  up  0.86485 1.0
106   ssd   3.49309 osd.106  up  0.79965 1.0
107   ssd   3.42329 osd.107  up  1.0 1.0
108   ssd   3.49309 osd.108  up  1.0 1.0
109   ssd   3.42329 osd.109  up  1.0 1.0
 -5        62.24744 host sun-pcs01-osd02
 10   ssd   3.49309 osd.10   up  1.0 1.0
 11   ssd   3.42329 osd.11   up  0.72473 1.0
 12   ssd   3.49309 osd.12   up  1.0 1.0
 13   ssd   3.42329 osd.13   up  0.78979 1.0
 14   ssd   3.49309 osd.14   up  0.98961 1.0
 15   ssd   3.42329 osd.15   up  1.0 1.0
 16   ssd   3.49309 osd.16   up  0.96495 1.0
 17   ssd   3.42329 osd.17   up  0.94994 1.0
 18   ssd   3.49309 osd.18   up  1.0 1.0
 19   ssd   3.42329 osd.19   up  0.80481 1.0
110   ssd   3.49309 osd.110  up  0.97998 1.0
111   ssd   3.42329 osd.111  up  1.0 1.0
112   ssd   3.49309 osd.112  up  1.0 1.0
113   ssd   3.42329 osd.113  up  0.72974 1.0
116   ssd   3.49309 osd.116  up  0.91992 1.0
117   ssd   3.42329 osd.117  up  0.96997 1.0
118   ssd   3.49309 osd.118  up  0.93959 1.0
119   ssd   3.42329 osd.119  up  0.94481 1.0
... plus 11 more hosts just like this

How do you single out one OSD from each host for metadata only and keep data
off that OSD when all the device classes are the same? It seems that you
would need that OSD to be a different class to do that. In a previous email
the conversation was:

Is it possible to add a new device class like 'metadata'?

Yes, but you don't need this. Just use your existing class with another
crush ruleset.


So, I'm trying to figure out how you use the existing class of 'ssd' with
another CRUSH ruleset to accomplish the above.
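
The only way I can see to do it with classes alone is to give those OSDs a
class of their own, something like the sketch below (the OSD IDs, rule name,
and pool name are placeholders):

  ceph osd crush rm-device-class osd.0 osd.10 osd.20
  ceph osd crush set-device-class metadata osd.0 osd.10 osd.20
  ceph osd crush rule create-replicated meta-only default host metadata
  ceph osd pool set cephfs_metadata crush_rule meta-only

But that is exactly the "new device class" approach, so I'm still not seeing
how to do it with only the existing 'ssd' class.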


> > Yes, we can set quotas to limit space usage (or number objects), but
> > you can not reserve some space that other pools can't use. The problem
> > is if we set a quota for the CephFS data pool to the equivalent of 95%
> > there are at least two scenario that make that quota useless.
>
> Of course. 95% of CephFS deployments is where meta_pool on flash drives
> with enough space for this.
>
>
> ```
>
> pool 21 'fs_data' replicated size 3 min_size 2 crush_rule 4 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 56870 flags hashpspool
> stripe_width 0 application cephfs
> pool 22 'fs_meta' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins pg_num 16 pgp_num 16 last_change 56870 flags hashpspool
> stripe_width 0 application cephfs
>
> ```
>
> ```
>
> # ceph osd crush rule dump replicated_racks_nvme
> {
>  "rule_id": 0,
>  "rule_name": "replicated_racks_nvme",
>  "ruleset": 0,
>  "type": 1,
>  "min_size": 1,
>  "max_size&qu

Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Robert LeBlanc
On Tue, Aug 6, 2019 at 7:56 PM Konstantin Shalygin  wrote:

> Is it possible to add a new device class like 'metadata'?
>
>
> Yes, but you don't need this. Just use your existing class with another
> crush ruleset.
>

Maybe it's the lateness of the day, but I'm not sure how to do that. Do you
have an example where all the OSDs are of class ssd?

> If I set the device class manually, will it be overwritten when the OSD
> boots up?
>
>
> Nope. Classes assigned automatically when OSD is created, not boot'ed.
>

That's good to know.

> I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
> mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
> the question.
>
> But it's not a magic. Sometimes drive can be sata ssd, but in kernel is
> 'rotational'...
>
I see, so it's not looking to see if the device is in /sys/class/pci or
something.

> We will still have 13 OSDs, it will be overkill for space for metadata, but
> since Ceph lacks a reserve space feature, we don't have  many options. This
> cluster is so fast that it can fill up in the blink of an eye.
>
>
> Not true. You always can set per-pool quota in bytes, for example:
>
> * your meta is 1G;
>
> * your raw space is 300G;
>
> * your data is 90G;
>
> Set quota to your data pool: `ceph osd pool set-quota 
> max_bytes 96636762000`
>
Yes, we can set quotas to limit space usage (or the number of objects), but
you cannot reserve space that other pools can't use. The problem is that if
we set a quota for the CephFS data pool to the equivalent of 95%, there are
at least two scenarios that make that quota useless.

1. A host fails and the cluster recovers. The quota is now past the
capacity of the cluster, so if the data pool fills up, no pool can write.
2. The CephFS data pool is an erasure-coded pool, and it shares the cluster
with an RGW data pool that is 3x replicated. If more writes happen to the RGW
data pool, then the quota will be past the capacity of the cluster.

Both of these cause metadata operations to not be committed and cause lots
of problems with CephFS (you can't list a directory with a broken inode in
it). We would prefer to get a truncated file than a broken file system.

I wrote a script that calculates 95% of the pool capacity and sets the
quota if the current quota is 1% out of balance. This is run by cron every
5 minutes.
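
For reference, a stripped-down sketch of that cron job (the pool name, the
95% target, and the 'ceph df' JSON field names are from my environment and
may need adjusting; it also needs jq installed):

  #!/bin/bash
  POOL=fs_data
  json=$(ceph df --format=json)
  used=$(echo "$json"  | jq -r ".pools[] | select(.name==\"$POOL\") | .stats.bytes_used")
  avail=$(echo "$json" | jq -r ".pools[] | select(.name==\"$POOL\") | .stats.max_avail")
  quota=$(( (used + avail) * 95 / 100 ))
  # the real script only resets the quota when it has drifted more than ~1%
  ceph osd pool set-quota "$POOL" max_bytes "$quota"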

If there is a way to reserve some capacity for a pool that no other pool
can use, please provide an example. Think of reserved inode space in
ext4/XFS/etc.

Thank you.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New CRUSH device class questions

2019-08-06 Thread Robert LeBlanc
On Tue, Aug 6, 2019 at 11:11 AM Paul Emmerich 
wrote:

> On Tue, Aug 6, 2019 at 7:45 PM Robert LeBlanc 
> wrote:
> > We have a 12.2.8 luminous cluster with all NVMe and we want to take some
> of the NVMe OSDs and allocate them strictly to metadata pools (we have a
> problem with filling up this cluster and causing lingering metadata
> problems, and this will guarantee space for metadata operations).
>
> Depending on the workload and metadata size: this might be a bad idea
> as it reduces parallelism.
>

We will still have 13 OSDs, which will be overkill for metadata space, but
since Ceph lacks a reserved-space feature, we don't have many options. This
cluster is so fast that it can fill up in the blink of an eye.

> In the past, we have done this the old-school way of creating a separate
> root, but I wanted to see if we could leverage the device class function
> instead.
> >
> > Right now all our devices show as ssd rather than nvme, but that is the
> only class in this cluster. None of the device classes were manually set,
> so is there a reason they were not detected as nvme?
>
> Ceph only distinguishes rotational vs. non-rotational for device
> classes and device-specific configuration options
>

I read https://ceph.com/community/new-luminous-crush-device-classes/ and it
mentions that Ceph automatically classifies into hdd, ssd, and nvme. Hence
the question.

> Is it possible to add a new device class like 'metadata'?
>
> yes, it's just a string, you can use whatever you want. For example,
> we run a few setups that distinguish 2.5" and 3.5" HDDs.
>
>
> >
> > If I set the device class manually, will it be overwritten when the OSD
> boots up?
>
> not sure about this, we run all of our setups with auto-updating on
> start disabled
>

Do you know if 'osd crush location hook' can specify the device class?

> Is what I'm trying to accomplish better done by the old-school separate
> root and the osd_crush_location_hook (potentially using a file with a list
> of partition UUIDs that should be in the metadata pool).?
>
> device classes are the way to go here


Thanks!

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] New CRUSH device class questions

2019-08-06 Thread Robert LeBlanc
We have a 12.2.8 luminous cluster with all NVMe and we want to take some of
the NVMe OSDs and allocate them strictly to metadata pools (we have a
problem with filling up this cluster and causing lingering metadata
problems, and this will guarantee space for metadata operations). In the
past, we have done this the old-school way of creating a separate root, but
I wanted to see if we could leverage the device class function instead.

Right now all our devices show as ssd rather than nvme, but that is the
only class in this cluster. None of the device classes were manually set,
so is there a reason they were not detected as nvme?

Is it possible to add a new device class like 'metadata'?

If I set the device class manually, will it be overwritten when the OSD
boots up?

Is what I'm trying to accomplish better done by the old-school separate
root and the osd_crush_location_hook (potentially using a file with a list
of partition UUIDs that should be in the metadata pool).?

Any other options I may not be considering?

Thank you,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Built-in HA?

2019-08-05 Thread Robert LeBlanc
Another option: if both RDMA ports are on the same card, then you can do
RDMA with a bond. This does not work if you have two separate cards.

As far as your questions go, my guess would be that you would want to have
the different NICs in different broadcast domains, or set up Source Based
Routing and bind the source port on the connection (not the easiest, but
allows you to have multiple NICs in the same broadcast domain). I don't
have experience with Ceph in this type of configuration.
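
A rough sketch of the source-based-routing idea (the addresses, device names,
and table numbers are made up):

  # replies sourced from 10.0.0.11 always leave via eth0
  ip route add 10.0.0.0/24 dev eth0 src 10.0.0.11 table 101
  ip route add default via 10.0.0.1 dev eth0 table 101
  ip rule add from 10.0.0.11/32 table 101
  # repeat with eth1 / 10.0.0.12 and a second table for the other NIC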

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Aug 2, 2019 at 9:41 AM Volodymyr Litovka  wrote:

> Dear colleagues,
>
> at the moment, we use Ceph in routed environment (OSPF, ECMP) and
> everything is ok, reliability is high and there is nothing to complain
> about. But for hardware reasons (to be more precise - RDMA offload), we are
> faced with the need to operate Ceph directly on physical interfaces.
>
> According to documentation, "We generally recommend that dual-NIC systems
> either be configured with two IPs on the same network, or bonded."
>
> Q1: Did anybody test and can explain, how Ceph will behave in first
> scenario (two IPs on the same network)? I think this configuration require
> just one statement in 'public network' (where both interfaces reside)? How
> it will distribute traffic between links, how it will detect link failures
> and how it will switchover?
>
> Q2: Did anybody test a bit another scenario - both NICs have addresses in
> different networks and Ceph configuration contain two 'public networks'?
> Questions are same - how Ceph distributes traffic between links and how it
> recovers from link failures?
>
> Thank you.
>
> --
> Volodymyr Litovka
>   "Vision without Execution is Hallucination." -- Thomas Edison
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-08-03 Thread Robert LeBlanc
It does better because it is a fair-share queue and doesn't let recovery ops
take priority over client ops at any point or for any length of time. It
gives clients much more predictable latency to the storage.
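
The settings I'm referring to go in ceph.conf on the OSD hosts (as far as I
know they need an OSD restart to take effect):

  [osd]
  osd op queue = wpq
  osd op queue cut off = high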

Sent from a mobile device, please excuse any typos.

On Sat, Aug 3, 2019, 1:10 PM Alex Gorbachev  wrote:

> On Fri, Aug 2, 2019 at 6:57 PM Robert LeBlanc 
> wrote:
> >
> > On Fri, Jul 26, 2019 at 1:02 PM Peter Sabaini  wrote:
> >>
> >> On 26.07.19 15:03, Stefan Kooman wrote:
> >> > Quoting Peter Sabaini (pe...@sabaini.at):
> >> >> What kind of commit/apply latency increases have you seen when
> adding a
> >> >> large numbers of OSDs? I'm nervous how sensitive workloads might
> react
> >> >> here, esp. with spinners.
> >> >
> >> > You mean when there is backfilling going on? Instead of doing "a big
> >>
> >> Yes exactly. I usually tune down max rebalance and max recovery active
> >> knobs to lessen impact but still I found the additional write load can
> >> substantially increase i/o latencies. Not all workloads like this.
> >
> >
> > We have been using:
> >
> > osd op queue = wpq
> > osd op queue cut off = high
> >
> > It virtually eliminates the impact of backfills on our clusters. Our
> backfill and recovery times have increased when the cluster has lots of
> client I/O, but the clients haven't noticed that huge backfills have been
> going on.
> >
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Would this be superior to setting:
>
> osd_recovery_sleep = 0.5 (or some high value)
>
>
> --
> Alex Gorbachev
> Intelligent Systems Services Inc.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems understanding 'ceph-features' output

2019-08-02 Thread Robert LeBlanc
On Tue, Jul 30, 2019 at 2:06 AM Janne Johansson  wrote:

> Someone should make a webpage where you can enter that hex-string and get
> a list back.
>

Providing a minimal bit-to-feature mapping would allow someone to do so, and
would let someone like me decode it manually until then.
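
In the meantime, a manual check can be done against the bit numbers in
src/include/ceph_features.h, something like this (the mask below is just an
example value):

  mask=0x3ffddff8ffacffff   # features value reported by 'ceph features'
  bit=59                    # bit number of the CEPH_FEATURE_* you care about
  (( (mask >> bit) & 1 )) && echo "bit $bit set" || echo "bit $bit clear"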
----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to add 100 new OSDs...

2019-08-02 Thread Robert LeBlanc
On Fri, Jul 26, 2019 at 1:02 PM Peter Sabaini  wrote:

> On 26.07.19 15:03, Stefan Kooman wrote:
> > Quoting Peter Sabaini (pe...@sabaini.at):
> >> What kind of commit/apply latency increases have you seen when adding a
> >> large numbers of OSDs? I'm nervous how sensitive workloads might react
> >> here, esp. with spinners.
> >
> > You mean when there is backfilling going on? Instead of doing "a big
>
> Yes exactly. I usually tune down max rebalance and max recovery active
> knobs to lessen impact but still I found the additional write load can
> substantially increase i/o latencies. Not all workloads like this.
>

We have been using:

osd op queue = wpq
osd op queue cut off = high

It virtually eliminates the impact of backfills on our clusters. Our
backfill and recovery times have increased when the cluster has lots of
client I/O, but the clients haven't noticed that huge backfills have been
going on.


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mark CephFS inode as lost

2019-07-23 Thread Robert LeBlanc
Thanks, I created a ticket. http://tracker.ceph.com/issues/40906

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jul 22, 2019 at 11:45 PM Yan, Zheng  wrote:

> please create a ticket at http://tracker.ceph.com/projects/cephfs and
> upload mds log with debug_mds =10
>
> On Tue, Jul 23, 2019 at 6:00 AM Robert LeBlanc 
> wrote:
> >
> > We have a Luminous cluster which has filled up to 100% multiple times
> and this causes an inode to be left in a bad state. Doing anything to these
> files causes the client to hang which requires evicting the client and
> failing over the MDS. Usually we move the parent directory out of the way
> and things mostly are okay. However in this last fill up, we have a
> significant amount of storage that we have moved out of the way and really
> need to reclaim that space. I can't delete the files around it as listing
> the directory causes a hang.
> >
> > We can get the inode that is bad from the logs/blocked_ops, how can we
> tell MDS that the inode is lost and to forget about it without trying to do
> any checks on it (checking the RADOS objects may be part of the problem)?
> Once the inode is out of CephFS, we can clean up the RADOS objects manually
> or leave them there to rot.
> >
> > Thanks,
> > Robert LeBlanc
> > 
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mark CephFS inode as lost

2019-07-22 Thread Robert LeBlanc
We have a Luminous cluster which has filled up to 100% multiple times and
this causes an inode to be left in a bad state. Doing anything to these
files causes the client to hang which requires evicting the client and
failing over the MDS. Usually we move the parent directory out of the way
and things mostly are okay. However in this last fill up, we have a
significant amount of storage that we have moved out of the way and really
need to reclaim that space. I can't delete the files around it as listing
the directory causes a hang.

We can get the inode that is bad from the logs/blocked_ops, how can we tell
MDS that the inode is lost and to forget about it without trying to do any
checks on it (checking the RADOS objects may be part of the problem)? Once
the inode is out of CephFS, we can clean up the RADOS objects manually or
leave them there to rot.
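
For context, this is how we find the bad inode on the active MDS (admin
socket commands; substitute your MDS name), and we then pull the inode number
out of the description of the stuck request:

  ceph daemon mds.<active-mds> dump_blocked_ops
  ceph daemon mds.<active-mds> dump_ops_in_flight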

Thanks,
Robert LeBlanc

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Investigating Config Error, 300x reduction in IOPs performance on RGW layer

2019-07-17 Thread Robert LeBlanc
I'm pretty new to RGW, but I need to get maximum performance as well. Have
you tried moving your RGW metadata pools to NVMe? Carve out a bit of NVMe
space and pin those pools to the SSD class in CRUSH, so that the small
metadata ops aren't on slow media.
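
A rough sketch of what I mean (the rule name is arbitrary and the pool names
assume the default RGW zone; adjust for yours):

  ceph osd crush rule create-replicated rgw-meta-ssd default host ssd
  for pool in default.rgw.meta default.rgw.log default.rgw.buckets.index; do
      ceph osd pool set "$pool" crush_rule rgw-meta-ssd
  done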
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jul 17, 2019 at 5:59 PM Ravi Patel  wrote:

> Hello,
>
> We have deployed ceph cluster and we are trying to debug a massive drop in
> performance between the RADOS layer vs the RGW layer
>
> ## Cluster config
> 4 OSD nodes (12 Drives each, NVME Journals, 1 SSD drive) 40GbE NIC
> 2 RGW nodes ( DNS RR load balancing) 40GbE NIC
> 3 MON nodes 1 GbE NIC
>
> ## Pool configuration
> RGW data pool  - replicated 3x 4M stripe (HDD)
> RGW metadata pool - replicated 3x (SSD) pool
>
> ## Benchmarks
> 4K Read IOP/s performance using RADOS Bench 48,000~ IOP/s
> 4K Read RGW performance via s3 interface ~ 130 IOP/s
>
> Really trying to understand how to debug this issue. all the nodes never
> break 15% CPU utilization and there is plenty of RAM. The one pathological
> issue in our cluster is that the MON nodes are currently on VMs that are
> sitting behind a single 1 GbE NIC. (We are in the process of moving them,
> but are unsure if that will fix the issue.
>
> What metrics should we be looking at to debug the RGW layer. Where do we
> need to look?
>
> ---
>
> Ravi Patel, PhD
> Machine Learning Systems Lead
> Email: r...@kheironmed.com
>
>
> *Kheiron Medical Technologies*
>
> kheironmed.com | supporting radiologists with deep learning
>
> Kheiron Medical Technologies Ltd. is a registered company in England and
> Wales. This e-mail and its attachment(s) are intended for the above named
> only and are confidential. If they have come to you in error then you must
> take no action based upon them but contact us immediately. Any disclosure,
> copying, distribution or any action taken or omitted to be taken in
> reliance on it is prohibited and may be unlawful. Although this e-mail and
> its attachments are believed to be free of any virus, it is the
> responsibility of the recipient to ensure that they are virus free. If you
> contact us by e-mail then we will store your name and address to facilitate
> communications. Any statements contained herein are those of the individual
> and not the organisation.
>
> Registered number: 10184103. Registered office: RocketSpace, 40 Islington
> High Street, London, N1 8EQ
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Allocation recommendations for separate blocks.db and WAL

2019-07-17 Thread Robert LeBlanc
So, I see the recommendation for 4% of OSD space for blocks.db/WAL and the
corresponding discussion regarding the 3/30/300GB vs 6/60/600GB allocation.

How does this change when the WAL is separate from blocks.db?

Reading [0], it seems that 6/60/600 is not correct. To compact a 300GB DB
level, you take values from the layer above (which is only 10% of the lower
layer, and only the portion that exceeds the compaction trigger gets merged
down), so at worst case you would need 333GB (300+30+3) plus some headroom.

[0]  https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] enterprise support

2019-07-15 Thread Robert LeBlanc
We recently used Croit (https://croit.io/) and they were really good.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jul 15, 2019 at 12:53 PM Void Star Nill 
wrote:

> Hello,
>
> Other than Redhat and SUSE, are there other companies that provide
> enterprise support for Ceph?
>
> Thanks,
> Shridhar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] To backport or not to backport

2019-07-05 Thread Robert LeBlanc
On Thu, Jul 4, 2019 at 8:00 AM Stefan Kooman  wrote:

> Hi,
>
> Now the release cadence has been set, it's time for another discussion
> :-).
>
> During Ceph day NL we had a panel q/a [1]. One of the things that was
> discussed were backports. Occasionally users will ask for backports of
> functionality in newer releases to older releases (that are still in
> support).
>
> Ceph is quite a unique project in the sense that new functionality gets
> backported to older releases. Sometimes even functionality gets changed
> in the lifetime of a release. I can recall "ceph-volume" change to LVM
> in the beginning of the Luminous release. While backports can enrich the
> user experience of a ceph operator, it's not without risks. There have
> been several issues with "incomplete" backports and or unforeseen
> circumstances that had the reverse effect: downtime of (part of) ceph
> services. The ones that come to my mind are:
>
> - MDS (cephfs damaged)  mimic backport (13.2.2)
> - RADOS (pg log hard limit) luminous / mimic backport (12.2.8 / 13.2.2)
>
> I would like to define a simple rule of when to backport:
>
> - Only backport fixes that do not introduce new functionality, but
> addresses
>   (impaired) functionality already present in the release.
>
> Example of, IMHO, a backport that matches the backport criteria was the
> "bitmap_allocator" fix. It fixed a real problem, not some corner case.
> Don't get me wrong here, it is important to catch corner cases, but it
> should not put the majority of clusters at risk.
>
> The time and effort that might be saved with this approach can indeed be
> spend in one of the new focus areas Sage mentioned during his keynote
> talk at Cephalocon Barcelona: quality. Quality of the backports that are
> needed, improved testing, especially for upgrades to newer releases. If
> upgrades are seemless, people are more willing to upgrade, because hey,
> it just works(tm). Upgrades should be boring.
>
> How many clusters (not nautilus ;-)) are running with "bitmap_allocator" or
> with the pglog_hardlimit enabled? If a new feature is not enabled by
> default and it's unclear how "stable" it is to use, operators tend to not
> enable it, defeating the purpose of the backport.
>
> Backporting fixes to older releases can be considered a "business
> opportunity" for the likes of Red Hat, SUSE, Fujitsu, etc. Especially
> for users that want a system that "keeps on running forever" and never
> needs "dangerous" updates.
>
> This is my view on the matter, please let me know what you think of
> this.
>
> Gr. Stefan
>
> P.s. Just to make things clear: this thread is in _no way_ intended to
> pick on
> anybody.
>
>
> [1]: https://pad.ceph.com/p/ceph-day-nl-2019-panel


I prefer a released version to be fairly static and not have new features
introduced, only bug fixes. For one, I'd prefer not to have to read the
release notes to figure out how dangerous a "bug-fix" release should be.
The fixes in a released version should be tested extremely well so it "Just
Works".

By not back porting new features, I think it gives more time to bake the
features into the new version and frees up the developers to focus on the
forward direction of the product. If I want a new feature, then the burden
is on me to test a new version and verify that it works in my environment
(or vendors), not the developers.

I wholeheartedly support only bug fixes and security fixes going into
released versions.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cannot add fuse options to ceph-fuse command

2019-07-05 Thread Robert LeBlanc
Is this a Ceph specific option? If so, you may need to prefix it with
"ceph.", at least I had to for FUSE to pass it to the Ceph module/code
portion.
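
In other words, something like the following; I haven't tested it for
entry_timeout specifically, so treat it as a guess:

  ceph-fuse -m 10.128.5.1,10.128.5.2,10.128.5.3 -r /test1 /cephfs/test1 -o ceph.entry_timeout=5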
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jul 4, 2019 at 7:35 AM songz.gucas 
wrote:

> Hi,
>
>
> I tried to add some fuse options when mounting cephfs using the ceph-fuse
> tool, but it errored:
>
>
> ceph-fuse -m 10.128.5.1,10.128.5.2,10.128.5.3 -r /test1 /cephfs/test1 -o
> entry_timeout=5
>
> ceph-fuse[3857515]: starting ceph client2019-07-04 21:55:37.767
> 7fc1d9cbdbc0 -1 init, newargv = 0x555d6f847490 newargc=9
>
>
> fuse: unknown option `entry_timeout=5'
>
> ceph-fuse[3857515]: fuse failed to start
>
> 2019-07-04 21:55:37.796 7fc1d9cbdbc0 -1 fuse_lowlevel_new failed
>
>
>
> How can I pass options to fuse?
>
>
> Thank you for your precious help !
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-ansible with docker

2019-07-01 Thread Robert LeBlanc
I need some help getting up the learning curve and hope someone can get me
on the right track.

I need to set up a new cluster, but I want the mon, mgr, and rgw services as
containers on the non-containerized OSD nodes. It seems that doing no
containers or all containers is fairly easy, but I'm trying to understand if
I can do what I want. I'd like to use containers to set cgroups for resource
management.

With the OSDs having a public network and cluster network, is Ansible smart
enough to connect to the right network based on mon IP address for
instance? How do you tell Ansible to place the mon container on a specific
host? I watched the video from Sébastien Han that he made in 2015, but it
seems that the config has changed quite a bit since then. I'm quite new to
Ansible (used Puppet and Salt in the past) and Docker (used LXC and LXD),
so any help would be appreciated. Do Docker containers have their own IP
addresses, with bridges created like LXD, or do they share the host IP?

Thank you,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] increase pg_num error

2019-07-01 Thread Robert LeBlanc
On Mon, Jul 1, 2019 at 11:57 AM Brett Chancellor 
wrote:

> In Nautilus just pg_num is sufficient for both increases and decreases.
>
>
Good to know, I haven't gotten to Nautilus yet.
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] increase pg_num error

2019-07-01 Thread Robert LeBlanc
I believe he needs to increase the pgp_num first, then pg_num.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jul 1, 2019 at 7:21 AM Nathan Fish  wrote:

> I ran into this recently. Try running "ceph osd require-osd-release
> nautilus". This drops backwards compat with pre-nautilus and allows
> changing settings.
>
> On Mon, Jul 1, 2019 at 4:24 AM Sylvain PORTIER  wrote:
> >
> > Hi all,
> >
> > I am using ceph 14.2.1 (Nautilus)
> >
> > I am unable to increase the pg_num of a pool.
> >
> > I have a pool named Backup, the current pg_num is 64 : ceph osd pool get
> > Backup pg_num => result pg_num: 64
> >
> > And when I try to increase it using the command
> >
> > ceph osd pool set Backup pg_num 512 => result "set pool 6 pg_num to 512"
> >
> > And then I check with the command : ceph osd pool get Backup pg_num =>
> > result pg_num: 64
> >
> > I don't how to increase the pg_num of a pool, I also tried the autoscale
> > module, but it doesn't work (unable to activate the autoscale, always
> > warn mode).
> >
> > Thank you for your help,
> >
> >
> > Cabeur.
> >
> >
> > ---
> > L'absence de virus dans ce courrier électronique a été vérifiée par le
> logiciel antivirus Avast.
> > https://www.avast.com/antivirus
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Robert LeBlanc
On Sat, Jun 29, 2019 at 8:12 PM Bryan Henderson 
wrote:

> > I'm not sure why the monitor did not mark it _out_ after 600 seconds
> > (default)
>
> Well, that part I understand.  The monitor didn't mark the OSD out because
> the
> monitor still considered the OSD up.  No reason to mark an up OSD out.
>
> I think the monitor should have marked the OSD down upon not hearing from
> it
> for 15 minutes ("mon osd report interval"), then out 10 minutes after that
> ("mon osd down out interval").
>
> And that's worst case.  Though details of how OSDs watch each other are
> vague,
> I suspect an existing OSD was supposed to detect the dead OSDs and report
> that
> to the monitor, which would believe it within about a minute and mark the
> OSDs
> down.  ("osd heartbeat interval", "mon osd min down reports", "mon osd min
> down
> reporters", "osd reporter subtree level").
>
> --
> Bryan Henderson   San Jose, California
>

So, if an OSD (osd.1) misses three heartbeats (6 seconds each) from another
OSD (osd.2), then osd.1 reports to the monitor that osd.2 is down. It takes
reports from OSDs in two different CRUSH subtrees (host by default) for the
monitor to mark the OSD down. The OSD is also supposed to report to the
monitor on every change, or at least every 120 seconds; if 600 seconds pass
with the monitor not hearing from the OSD, it will mark it down. It 'should'
only take about 20 seconds to detect a downed OSD.

Usually, the problem is that an OSD gets too busy and misses heartbeats so
other OSDs wrongly mark them down.

If 'nodown' is set, then the monitor will not mark OSDs down.
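
If you want to check what your cluster is actually running with (option names
as of Luminous; the admin socket paths assume the defaults):

  ceph daemon osd.0 config show | egrep 'osd_heartbeat_interval|osd_heartbeat_grace'
  ceph daemon mon.$(hostname -s) config show | egrep 'mon_osd_min_down_reporters|mon_osd_down_out_interval'
  ceph osd dump | grep ^flags    # 'nodown' will appear here if it is set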

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Robert LeBlanc
On Sat, Jun 29, 2019 at 6:51 PM Bryan Henderson 
wrote:

> > The reason it is so long is that you don't want to move data
> > around unnecessarily if the osd is just being rebooted/restarted.
>
> I think you're confusing down with out.  When an OSD is out, Ceph
> backfills.  While it is merely down, Ceph hopes that it will come back.
> But it will direct I/O to other redundant OSDs instead of a down one.
>
> Going down leads to going out, and I believe that is the 600 seconds you
> mention - the time between when the OSD is marked down and when Ceph marks
> it
> out (if all other conditions permit).
>
> There is a pretty good explanation of how OSDs get marked down, which is
> pretty complicated, at
>
>
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> It just doesn't seem to match the implementation.
>
> --
> Bryan Henderson   San Jose, California
>

I mixed up my terminology, the first line should have read:
" I'm not sure why the monitor did not mark it _out_ after 600 seconds
(default) "

The "down timeout" I mention is the "mon osd down out interval".

The rest of what I wrote is correct. Just to make sure I don't confuse
anyone else.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating a cephfs data pool

2019-06-28 Thread Robert LeBlanc
Yes, 'mv' on the client is just a metadata operation and not what I'm
talking about. The idea is to bring the old pool in as a cache layer, then
bring the new pool in as a lower layer, then flush/evict the data from the
cache and Ceph will move the data to the new pool, but still be able to
access it by the old pool name. You then add an overlay so that the new
pool name acts the same, then the idea is that you can remove the old pool
from the cache and remove the overlay. The only problem is updating cephfs
to look at the new pool name for data that it knows is at the old pool name.
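
For the record, the sequence I was experimenting with looks roughly like this
(the pool names are placeholders, and I have not gotten it to work end-to-end
for CephFS, which is the point of this mail):

  ceph osd tier add new_data_pool old_data_pool       # old pool becomes the cache tier
  ceph osd tier cache-mode old_data_pool forward --yes-i-really-mean-it
  ceph osd tier set-overlay new_data_pool old_data_pool
  rados -p old_data_pool cache-flush-evict-all        # OSDs push the objects down to the new pool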

The other option is to add a data mover to cephfs so you can do something
like `ceph fs mv old_pool new_pool` and it would move all the objects and
update the metadata as it performs the data moving. The question is how to
do the data movement since the MDS is not in the data path.

Since both pool names act the same with the overlay, the best option sounds
like this: configure the tiering, add the overlay, then do a `ceph fs migrate
old_pool new_pool` which causes the MDS to scan through all the metadata and
update all references to 'old_pool' so they point at 'new_pool'. Once that
and the eviction are done, you can remove the old pool from cephfs and remove
the overlay. That way the OSDs are the ones doing the data movement.

I don't know that part of the code, so I can't quickly propose any patches.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jun 28, 2019 at 9:37 AM Marc Roos  wrote:

>
> Afaik is the mv now fast because it is not moving any real data, just
> some meta data. Thus a real mv will be slow (only in the case between
> different pools) because it copies the data to the new pool and when
> successful deletes the old one. This will of course take a lot more
> time, but you at least are able to access the cephfs on both locations
> during this time and can fix things in your client access.
>
> My problem with mv now is that if you accidentally use it between data
> pools, it does not really move data.
>
>
>
> -Original Message-
> From: Robert LeBlanc [mailto:rob...@leblancnet.us]
> Sent: vrijdag 28 juni 2019 18:30
> To: Marc Roos
> Cc: ceph-users; jgarcia
> Subject: Re: [ceph-users] Migrating a cephfs data pool
>
> Given that the MDS knows everything, it seems trivial to add a ceph 'mv'
> command to do this. I looked at using tiering to try and do the move,
> but I don't know how to tell cephfs that the data is now on the new pool
> instead of the old pool name. Since we can't take a long enough downtime
> to move hundreds of Terabytes, we need something that can be done
> online, and if it has a minute or two of downtime would be okay.
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Jun 28, 2019 at 9:02 AM Marc Roos 
> wrote:
>
>
>
>
> 1.
> change data pool for a folder on the file system:
> setfattr -n ceph.dir.layout.pool -v fs_data.ec21 foldername
>
> 2.
> cp /oldlocation /foldername
> Remember that you preferably want to use mv, but this leaves
> (meta)
> data
> on the old pool, that is not what you want when you want to delete
> that
> pool.
>
> 3. When everything is copied-removed, you should end up with an
> empty
> datapool with zero objects.
>
> 4. Verify here with others, if you can just remove this one.
>
> I think this is a reliable technique to switch, because you use
> the
>
> basic cephfs functionality that supposed to work. I prefer that
> the
> ceph
> guys implement a mv that does what you expect from it. Now it acts
> more
> or less like a linking.
>
>
>
>
> -Original Message-
> From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu]
> Sent: vrijdag 28 juni 2019 17:52
> To: Marc Roos; ceph-users
> Subject: Re: [ceph-users] Migrating a cephfs data pool
>
> Are you talking about adding the new data pool to the current
> filesystem? Like:
>
>$ ceph fs add_data_pool my_ceph_fs new_ec_pool
>
> I have done that, and now the filesystem shows up as having two
> data
> pools:
>
>$ ceph fs ls
>name: my_ceph_fs, metadata pool: cephfs_meta, data pools:
> [cephfs_data new_ec_pool ]
>
> but then I run into two issues:
>
> 1. How do I actually copy/move/migrate the data from the old pool
> to the
> new pool?
> 2. When I'm done moving the data, how do I get rid of the old da

Re: [ceph-users] Migrating a cephfs data pool

2019-06-28 Thread Robert LeBlanc
Given that the MDS knows everything, it seems trivial to add a ceph 'mv'
command to do this. I looked at using tiering to try and do the move, but I
don't know how to tell cephfs that the data is now on the new pool instead
of the old pool name. Since we can't take a long enough downtime to move
hundreds of Terabytes, we need something that can be done online, and if it
has a minute or two of downtime would be okay.
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Jun 28, 2019 at 9:02 AM Marc Roos  wrote:

>
>
> 1.
> change data pool for a folder on the file system:
> setfattr -n ceph.dir.layout.pool -v fs_data.ec21 foldername
>
> 2.
> cp /oldlocation /foldername
> Remember that you preferably want to use mv, but this leaves (meta) data
> on the old pool, that is not what you want when you want to delete that
> pool.
>
> 3. When everything is copied-removed, you should end up with an empty
> datapool with zero objects.
>
> 4. Verify here with others, if you can just remove this one.
>
> I think this is a reliable technique to switch, because you use the
> basic cephfs functionality that supposed to work. I prefer that the ceph
> guys implement a mv that does what you expect from it. Now it acts more
> or less like a linking.
>
>
>
>
> -Original Message-
> From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu]
> Sent: vrijdag 28 juni 2019 17:52
> To: Marc Roos; ceph-users
> Subject: Re: [ceph-users] Migrating a cephfs data pool
>
> Are you talking about adding the new data pool to the current
> filesystem? Like:
>
>$ ceph fs add_data_pool my_ceph_fs new_ec_pool
>
> I have done that, and now the filesystem shows up as having two data
> pools:
>
>$ ceph fs ls
>name: my_ceph_fs, metadata pool: cephfs_meta, data pools:
> [cephfs_data new_ec_pool ]
>
> but then I run into two issues:
>
> 1. How do I actually copy/move/migrate the data from the old pool to the
> new pool?
> 2. When I'm done moving the data, how do I get rid of the old data pool?
>
> I know there's a rm_data_pool option, but I have read on the mailing
> list that you can't remove the original data pool from a cephfs
> filesystem.
>
> The other option is to create a whole new cephfs with a new metadata
> pool and the new data pool, but creating multiple filesystems is still
> experimental and not allowed by default...
>
> On 6/28/19 8:28 AM, Marc Roos wrote:
> >
> > What about adding the new data pool, mounting it and then moving the
> > files? (read copy because move between data pools does not what you
> > expect it do)
> >
> >
> > -Original Message-
> > From: Jorge Garcia [mailto:jgar...@soe.ucsc.edu]
> > Sent: vrijdag 28 juni 2019 17:26
> > To: ceph-users
> > Subject: *SPAM* [ceph-users] Migrating a cephfs data pool
> >
> > This seems to be an issue that gets brought up repeatedly, but I
> > haven't seen a definitive answer yet. So, at the risk of repeating a
> > question that has already been asked:
> >
> > How do you migrate a cephfs data pool to a new data pool? The obvious
> > case would be somebody that has set up a replicated pool for their
> > cephfs data and then wants to convert it to an erasure code pool. Is
> > there a simple way to do this, other than creating a whole new ceph
> > cluster and copying the data using rsync?
> >
> > Thanks for any clues
> >
> > Jorge
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-28 Thread Robert LeBlanc
I'm not sure why the monitor did not mark it down after 600 seconds
(default). The reason it is so long is that you don't want to move data
around unnecessarily if the osd is just being rebooted/restarted. Usually,
you will still have min_size OSDs available for all PGs that will allow IO
to continue. Then when the down timeout expires it will start backfilling
and recovering the PGs that were affected. Double check that size !=
min_size for your pools.
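
A quick way to check (every pool should show min_size strictly less than
size):

  ceph osd pool ls detail | grep min_size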
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jun 27, 2019 at 5:26 PM Bryan Henderson 
wrote:

> What does it take for a monitor to consider an OSD down which has been
> dead as
> a doornail since the cluster started?
>
> A couple of times, I have seen 'ceph status' report an OSD was up, when it
> was
> quite dead.  Recently, a couple of OSDs were on machines that failed to
> boot
> up after a power failure.  The rest of the Ceph cluster came up, though,
> and
> reported all OSDs up and in.  I/Os stalled, probably because they were
> waiting
> for the dead OSDs to come back.
>
> I waited 15 minutes, because the manual says if the monitor doesn't hear a
> heartbeat from an OSD in that long (default value of
> mon_osd_report_timeout),
> it marks it down.  But it didn't.  I did "osd down" commands for the dead
> OSDs
> and the status changed to down and I/O started working.
>
> And wouldn't even 15 minutes of grace be unacceptable if it means I/Os
> have to
> wait that long before falling back to a redundant OSD?
>
> --
> Bryan Henderson   San Jose, California
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS : Kernel/Fuse technical differences

2019-06-25 Thread Robert LeBlanc
There may also be more memory copying involved, instead of just passing
pointers around, but I'm not 100% sure.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jun 24, 2019 at 10:28 AM Jeff Layton 
wrote:

> On Mon, 2019-06-24 at 15:51 +0200, Hervé Ballans wrote:
> > Hi everyone,
> >
> > We successfully use Ceph here for several years now, and since recently,
> > CephFS.
> >
> >  From the same CephFS server, I notice a big difference between a fuse
> > mount and a kernel mount (10 times faster for kernel mount). It makes
> > sense to me (an additional fuse library versus a direct access to a
> > device...), but recently, one of our users asked me to explain him in
> > more detail the reason for this big difference...Hum...
> >
> > I then realized that I didn't really know how to explain the reasons to
> > him !!
> >
> > As well, does anyone have a more detailed explanation in a few words or
> > know a good web resource on this subject (I guess it's not specific to
> > Ceph but it's generic to all filesystems ?..)
> >
> > Thanks in advance,
> > Hervé
> >
>
> A lot of it is the context switching.
>
> Every time you make a system call (or other activity) that accesses a
> FUSE mount, it has to dispatch that request to the fuse device, the
> userland ceph-fuse daemon then has to wake up and do its thing (at least
> once) and then send the result back down to the kernel which then wakes
> up the original task so it can get the result.
>
> FUSE is a wonderful thing, but it's not really built for speed.
>
> --
> Jeff Layton 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rebalancing ceph cluster

2019-06-25 Thread Robert LeBlanc
The placement of PGs is random in the cluster and takes into account any
CRUSH rules which may also skew the distribution. Having more PGs will help
give more options for placing PGs, but it still may not be adequate. It is
recommended to have between 100-150 PGs per OSD, and you are pretty close.
If you aren't planning to add any more pools, then splitting the PGs for
pools that have a lot of data can help.

To get things more balanced, you can reweight the high-utilization OSDs down
to cause CRUSH to migrate some PGs off of them. That doesn't mean the PGs
will move to the lowest-utilized OSDs (they might wind up on another one that
is pretty full), so it may take several iterations to get things balanced.
Just be sure that if you reweighted an OSD down and its usage is now much
lower than the others, you reweight it back up to attract some PGs back to
it.

```ceph osd reweight {osd-num} {weight}```
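
A minimal sketch of one iteration (pick the OSD id from your own `ceph osd
df` output; the 0.95 value is just an example starting point):

# check per-OSD utilization
ceph osd df tree
# nudge the most-full OSD down a little
ceph osd reweight <osd-id> 0.95
# wait for backfill to finish, re-check utilization, and repeat as needed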
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Jun 24, 2019 at 2:25 AM jinguk.k...@ungleich.ch <
jinguk.k...@ungleich.ch> wrote:

> Hello everyone,
>
> We have some OSDs in our ceph cluster.
> One osd's usage is more than 77% and another osd's usage is 39% on the
> same host.
>
> I wonder why the osd usage is so different (the difference is large) and
> how I can fix it?
>
> ID  CLASS   WEIGHTREWEIGHT SIZEUSE AVAIL   %USE  VAR  PGS TYPE
> NAME
>  -2  93.26010- 93.3TiB 52.3TiB 41.0TiB 56.04 0.98   -
> host serverA
> …...
>  33 HDD  9.09511  1.0 9.10TiB 3.55TiB 5.54TiB 39.08 0.68  66
> osd.4
>  45 HDD   7.27675  1.0 7.28TiB 5.64TiB 1.64TiB 77.53 1.36  81
> osd.7
> …...
>
> -5  79.99017- 80.0TiB 47.7TiB 32.3TiB 59.62 1.04   -
> host serverB
>   1 HDD   9.09511  1.0 9.10TiB 4.79TiB 4.31TiB 52.63 0.92  87
> osd.1
>   6 HDD   9.09511  1.0 9.10TiB 6.62TiB 2.48TiB 72.75 1.27  99
> osd.6
>  …...
>
> Thank you
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-10 Thread Robert LeBlanc
I'm glad it's working. To be clear, did you use wpq, or is it still the prio
queue?

Sent from a mobile device, please excuse any typos.

On Mon, Jun 10, 2019, 4:45 AM BASSAGET Cédric 
wrote:

> an update from 12.2.9 to 12.2.12 seems to have fixed the problem !
>
> Le lun. 10 juin 2019 à 12:25, BASSAGET Cédric <
> cedric.bassaget...@gmail.com> a écrit :
>
>> Hi Robert,
>> Before doing anything on my prod env, I generate r/w load on the ceph
>> cluster using fio.
>> On my newest cluster, release 12.2.12, I did not manage to get
>> the (REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio
>> ran from 4 different hosts)
>>
>> On my prod cluster, release 12.2.9, as soon as I run fio on a single
>> host, I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1"
>> does not show me usage of more than 5-10% on the disks...
>>
>> Le lun. 10 juin 2019 à 10:12, Robert LeBlanc  a
>> écrit :
>>
>>> On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
>>> cedric.bassaget...@gmail.com> wrote:
>>>
>>>> Hello Robert,
>>>> My disks did not reach 100% on the last warning, they climb to 70-80%
>>>> usage. But I see rrqm / wrqm counters increasing...
>>>>
>>>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
>>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>
>>>> sda   0.00 4.000.00   16.00 0.00   104.00
>>>>  13.00 0.000.000.000.00   0.00   0.00
>>>> sdb   0.00 2.001.00 3456.00 8.00 25996.00
>>>>  15.04 5.761.670.001.67   0.03   9.20
>>>> sdd   4.00 0.00 41462.00 1119.00 331272.00  7996.00
>>>>  15.9419.890.470.480.21   0.02  66.00
>>>>
>>>> dm-0  0.00 0.00 6825.00  503.00 330856.00  7996.00
>>>>  92.48 4.000.550.560.30   0.09  66.80
>>>> dm-1  0.00 0.001.00 1129.00 8.00 25996.00
>>>>  46.02 1.030.910.000.91   0.09  10.00
>>>>
>>>>
>>>> sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd
>>>> are my OSDs
>>>>
>>>> would "osd op queue = wpq" help in this case ?
>>>> Regards
>>>>
>>>
>>> Your disk times look okay, just a lot more unbalanced than I would
>>> expect. I'd give wpq a try; I use it all the time. Just be sure to also
>>> include the op_cutoff setting, or it doesn't have much effect. Let me
>>> know how it goes.
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-10 Thread Robert LeBlanc
On Mon, Jun 10, 2019 at 1:00 AM BASSAGET Cédric <
cedric.bassaget...@gmail.com> wrote:

> Hello Robert,
> My disks did not reach 100% on the last warning, they climb to 70-80%
> usage. But I see rrqm / wrqm counters increasing...
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
>
> sda   0.00 4.000.00   16.00 0.00   104.0013.00
> 0.000.000.000.00   0.00   0.00
> sdb   0.00 2.001.00 3456.00 8.00 25996.0015.04
> 5.761.670.001.67   0.03   9.20
> sdd   4.00 0.00 41462.00 1119.00 331272.00  7996.00
>  15.9419.890.470.480.21   0.02  66.00
>
> dm-0  0.00 0.00 6825.00  503.00 330856.00  7996.00
>  92.48 4.000.550.560.30   0.09  66.80
> dm-1  0.00 0.001.00 1129.00 8.00 25996.0046.02
> 1.030.910.000.91   0.09  10.00
>
>
> sda is my system disk (SAMSUNG   MZILS480HEGR/007  GXL0), sdb and sdd are
> my OSDs
>
> would "osd op queue = wpq" help in this case ?
> Regards
>

Your disk times look okay, just a lot more unbalanced than I would expect.
I'd give wpq a try; I use it all the time. Just be sure to also include the
op_cutoff setting, or it doesn't have much effect. Let me know how it
goes.
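
For reference, these are the settings I mean (they go in the [osd] section
of ceph.conf, followed by an OSD restart):

osd op queue = wpq
osd op queue cut off = high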

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

2019-06-07 Thread Robert LeBlanc
With the low number of OSDs, you are probably saturating the disks. Check
with `iostat -xd 2` and see how utilized your disks are. A lot
of SSDs don't perform well with Ceph's heavy sync writes, and performance is
terrible.

If some of your drives are at 100% while others are at lower utilization, you
can possibly get more performance and greatly reduce the blocked I/O with the
WPQ scheduler. In ceph.conf, add this to the [osd] section and restart
the OSD processes:

osd op queue = wpq
osd op queue cut off = high

This has helped our clusters with fairness between OSDs and making
backfills not so disruptive.
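
If it helps, one way to confirm the settings took effect after the restart
(run on the OSD host; assumes you have access to the admin socket and that
osd.0 lives there):

ceph daemon osd.0 config show | grep osd_op_queue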
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jun 6, 2019 at 1:43 AM BASSAGET Cédric 
wrote:

> Hello,
>
> I see messages related to REQUEST_SLOW a few times per day.
>
> here's my ceph -s  :
>
> root@ceph-pa2-1:/etc/ceph# ceph -s
>   cluster:
> id: 72d94815-f057-4127-8914-448dfd25f5bc
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
> mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
> osd: 6 osds: 6 up, 6 in
>
>   data:
> pools:   1 pools, 256 pgs
> objects: 408.79k objects, 1.49TiB
> usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
> pgs: 256 active+clean
>
>   io:
> client:   8.00KiB/s rd, 17.2MiB/s wr, 1op/s rd, 546op/s wr
>
>
> Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
> luminous (stable)
>
> I've check :
> - all my network stack : OK ( 2*10G LAG )
> - memory usage : ok (256G on each host, about 2% used per osd)
> - cpu usage : OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
> - disk status : OK (SAMSUNG   AREA7680S5xnNTRI  3P04 => samsung DC series)
>
> I heard on IRC that it can be related to samsung PM / SM series.
>
> Do anybody here is facing the same problem ? What can I do to solve that ?
> Regards,
> Cédric
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance in a small cluster

2019-05-24 Thread Robert LeBlanc
On Fri, May 24, 2019 at 6:26 AM Robert Sander 
wrote:

> Am 24.05.19 um 14:43 schrieb Paul Emmerich:
> > 20 MB/s at 4K blocks is ~5000 iops, that's 1250 IOPS per SSD (assuming
> > replica 3).
> >
> > What we usually check in scenarios like these:
> >
> > * SSD model? Lots of cheap SSDs simply can't handle more than that
>
> The system has been newly created and is not busy at all.
>
> We tested a single SSD without OSD on top with fio: it can do 50K IOPS
> read and 16K IOPS write.
>

You probably tested with async writes; try passing sync to fio. That is
much closer to what Ceph will do, as it syncs every write to make sure it
is written to disk before acknowledging back to the client that the write
is done. When I did these tests, I also filled the entire drive and ran the
test for an hour. Most drives looked fine with short tests or small
amounts of data, but once the drive started getting full, the performance
dropped off a cliff. Considering that Ceph is really hard on drives, it's
good to test the extreme.
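
A sketch of the kind of test I mean (this writes directly to the device and
destroys its contents; /dev/sdX is a placeholder):

fio --name=sync-write-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=3600 --time_based

The --sync=1 and --direct=1 flags are what make it resemble Ceph's
synchronous writes; without them most SSDs look far faster than they will
behave under Ceph.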

Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-24 Thread Robert LeBlanc
On Fri, May 24, 2019 at 2:14 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
> On 5/22/19 5:53 PM, Robert LeBlanc wrote:
>
> When you say 'some', is it a fixed offset at which the file data starts? Is
> the first stripe just metadata?
>
> No, the first stripe contains the first 4 MB of a file by default. The
> xattr and omap data are stored separately.
>
Ahh, so it must be in the XFS xattrs; that makes sense.

For future posterity, I combined a couple of your commands to remove the
temporary intermediate file for others who may run across this.

rados -p  getxattr . parent
| ceph-dencoder type inode_backtrace_t import - decode dump_json
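
In full, with placeholder names (substitute your CephFS data pool and the
file's inode in hex; the first stripe's object name ends in .00000000):

rados -p <cephfs data pool> getxattr <inode hex>.00000000 parent \
    | ceph-dencoder type inode_backtrace_t import - decode dump_json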


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Robert LeBlanc
I'd say that if you can't find that object in Rados, then your assumption
may be good. I haven't run into this problem before. Try doing a Rados get
for that object and see if you get anything. I've done a Rados list
grepping for the hex inode, but it took almost two days on our cluster that
had half a billion objects. Your cluster may be faster.
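
Roughly what I mean (the object name should come from your list_missing
output; /tmp/obj is just an example destination):

# very slow on a big pool, it walks every object
rados -p ec31 ls | grep 10004dfce92
# try to fetch the missing object directly
rados -p ec31 get <object name> /tmp/obj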

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 8:21 AM Kevin Flöh  wrote:

> ok this just gives me:
>
> error getting xattr ec31/10004dfce92./parent: (2) No such file or
> directory
>
> Does this mean that the lost object isn't even a file that appears in the
> ceph directory? Maybe it is a leftover of a file that has not been deleted
> properly? It wouldn't be an issue to mark the object as lost in that case.
> On 24.05.19 5:08 nachm., Robert LeBlanc wrote:
>
> You need to use the first stripe of the object as that is the only one
> with the metadata.
>
> Try "rados -p ec31 getxattr 10004dfce92. parent" instead.
>
> Robert LeBlanc
>
> Sent from a mobile device, please excuse any typos.
>
> On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:
>
>> Hi,
>>
>> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but
>> this is just hanging forever if we are looking for unfound objects. It
>> works fine for all other objects.
>>
>> We also tried scanning the ceph directory with find -inum 1099593404050
>> (decimal of 10004dfce92) and found nothing. This is also working for non
>> unfound objects.
>>
>> Is there another way to find the corresponding file?
>> On 24.05.19 11:12 vorm., Burkhard Linke wrote:
>>
>> Hi,
>> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>>
>> We got the object ids of the missing objects with ceph pg 1.24c
>> list_missing:
>>
>> {
>> "offset": {
>> "oid": "",
>> "key": "",
>> "snapid": 0,
>> "hash": 0,
>> "max": 0,
>> "pool": -9223372036854775808,
>> "namespace": ""
>> },
>> "num_missing": 1,
>> "num_unfound": 1,
>> "objects": [
>> {
>> "oid": {
>> "oid": "10004dfce92.003d",
>> "key": "",
>> "snapid": -2,
>> "hash": 90219084,
>> "max": 0,
>> "pool": 1,
>> "namespace": ""
>> },
>> "need": "46950'195355",
>> "have": "0'0",
>> "flags": "none",
>> "locations": [
>> "36(3)",
>> "61(2)"
>> ]
>> }
>> ],
>> "more": false
>> }
>>
>> we want to give up those objects with:
>>
>> ceph pg 1.24c mark_unfound_lost revert
>>
>> But first we would like to know which file(s) is affected. Is there a way to 
>> map the object id to the corresponding file?
>>
>>
>> The object name is composed of the file inode id and the chunk within the
>> file. The first chunk has some metadata you can use to retrieve the
>> filename. See the 'CephFS object mapping' thread on the mailing list for
>> more information.
>>
>>
>> Regards,
>>
>> Burkhard
>>
>>
>> ___
>> ceph-users mailing 
>> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-24 Thread Robert LeBlanc
You need to use the first stripe of the object as that is the only one with
the metadata.

Try "rados -p ec31 getxattr 10004dfce92. parent" instead.

Robert LeBlanc

Sent from a mobile device, please excuse any typos.

On Fri, May 24, 2019, 4:42 AM Kevin Flöh  wrote:

> Hi,
>
> we already tried "rados -p ec31 getxattr 10004dfce92.003d parent" but
> this is just hanging forever if we are looking for unfound objects. It
> works fine for all other objects.
>
> We also tried scanning the ceph directory with find -inum 1099593404050
> (decimal of 10004dfce92) and found nothing. This is also working for non
> unfound objects.
>
> Is there another way to find the corresponding file?
> On 24.05.19 11:12 vorm., Burkhard Linke wrote:
>
> Hi,
> On 5/24/19 9:48 AM, Kevin Flöh wrote:
>
> We got the object ids of the missing objects with ceph pg 1.24c
> list_missing:
>
> {
> "offset": {
> "oid": "",
> "key": "",
> "snapid": 0,
> "hash": 0,
> "max": 0,
> "pool": -9223372036854775808,
> "namespace": ""
> },
> "num_missing": 1,
> "num_unfound": 1,
> "objects": [
> {
> "oid": {
> "oid": "10004dfce92.003d",
> "key": "",
> "snapid": -2,
> "hash": 90219084,
> "max": 0,
> "pool": 1,
> "namespace": ""
> },
> "need": "46950'195355",
> "have": "0'0",
> "flags": "none",
> "locations": [
> "36(3)",
> "61(2)"
> ]
> }
> ],
> "more": false
> }
>
> we want to give up those objects with:
>
> ceph pg 1.24c mark_unfound_lost revert
>
> But first we would like to know which file(s) is affected. Is there a way to 
> map the object id to the corresponding file?
>
>
> The object name is composed of the file inode id and the chunk within the
> file. The first chunk has some metadata you can use to retrieve the
> filename. See the 'CephFS object mapping' thread on the mailing list for
> more information.
>
>
> Regards,
>
> Burkhard
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Major ceph disaster

2019-05-22 Thread Robert LeBlanc
On Wed, May 22, 2019 at 4:31 AM Kevin Flöh  wrote:

> Hi,
>
> thank you, it worked. The PGs are not incomplete anymore. Still we have
> another problem, there are 7 PGs inconsistent and a ceph pg repair is
> not doing anything. I just get "instructing pg 1.5dd on osd.24 to
> repair" and nothing happens. Does somebody know how we can get the PGs
> to repair?
>
> Regards,
>
> Kevin
>

Kevin,

I just fixed an inconsistent PG yesterday. You will need to figure out why
they are inconsistent. Do these steps and then we can figure out how to
proceed.
1. Do a deep-scrub on each PG that is inconsistent. (This may fix some of
them)
2. Print out the inconsistent report for each inconsistent PG. `rados
list-inconsistent-obj  --format=json-pretty`
3. You will want to look at the error messages and see if all the shards
have the same data.
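
Using the 1.5dd PG from your output as an example (repeat for each
inconsistent PG reported by `ceph health detail`):

ceph pg deep-scrub 1.5dd
rados list-inconsistent-obj 1.5dd --format=json-pretty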

Robert LeBlanc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS object mapping.

2019-05-22 Thread Robert LeBlanc
On Wed, May 22, 2019 at 12:22 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 5/21/19 9:46 PM, Robert LeBlanc wrote:
> > I'm at a new job working with Ceph again and am excited to be back in the
> > community!
> >
> > I can't find any documentation to support this, so please help me
> > understand if I got this right.
> >
> > I've got a Jewel cluster with CephFS and we have an inconsistent PG.
> > All copies of the object are zero size, but the digest says that it
> > should be a non-zero size, so it seems that my two options are to delete
> > the file that the object is part of, or to rewrite the object with RADOS
> > to update the digest. So, this leads to my question: how do I tell
> > which file the object belongs to?
> >
> > From what I found, the object is prefixed with the hex value of the
> > inode and suffixed by the stripe number:
> > 1000d2ba15c.0005
> > .
> >
> > I then ran `find . -xdev -inum 1099732590940` and found a file on the
> > CephFS file system. I just want to make sure that I found the right
> > file before I start trying recovery options.
> >
>
> The first stripe XYZ. has some metadata stored as xattr (rados
> xattr, not cephfs xattr). One of the entries has the key 'parent':
>

When you say 'some', is it a fixed offset at which the file data starts? Is
the first stripe just metadata?


> # ls Ubuntu16.04-WS2016-17.ova
> Ubuntu16.04-WS2016-17.ova
>
> # ls -i Ubuntu16.04-WS2016-17.ova
> 1099751898435 Ubuntu16.04-WS2016-17.ova
>
> # rados -p cephfs_test_data stat 1000e523d43.
> cephfs_test_data/1000e523d43. mtime 2016-10-13 16:20:10.00,
> size 4194304
>
> # rados -p cephfs_test_data listxattr 1000e523d43.
> layout
> parent
>
> # rados -p cephfs_test_data getxattr 1000e523d43. parent | strings
> Ubuntu16.04-WS2016-17.ova5:
> adm2
> volumes
>
>
> The complete path of the file is
> /volumes/adm/Ubuntu16.04-WS2016-17.ova5. For a complete check you can
> store the content of the parent key and use ceph-dencoder to print its
> content:
>
> # rados -p cephfs_test_data getxattr 1000e523d43. parent >
> parent.bin
>
> # ceph-dencoder type inode_backtrace_t import parent.bin decode dump_json
> {
>  "ino": 1099751898435,
>  "ancestors": [
>  {
>  "dirino": 1099527190071,
>  "dname": "Ubuntu16.04-WS2016-17.ova",
>  "version": 14901
>  },
>  {
>  "dirino": 1099521974514,
>  "dname": "adm",
>  "version": 61190706
>  },
>  {
>  "dirino": 1,
>  "dname": "volumes",
>  "version": 48394885
>  }
>  ],
>  "pool": 7,
>  "old_pools": []
> }
>
>
> One important thing to note: ls -i prints the inode id in decimal, while
> cephfs uses hexadecimal for the rados object names. Thus the different
> values in the above commands.
>

Thank you for this; it is much faster than doing a find for the inode
(that took many hours: I let it run overnight and it found it after some
time, about 21 hours to search the whole filesystem).
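
For anyone else following along, converting the decimal inode from `ls -i`
to the hex prefix used in the object names is just a base conversion, e.g.
for the example above:

printf '%x\n' 1099751898435
# -> 1000e523d43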


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS object mapping.

2019-05-21 Thread Robert LeBlanc
I'm at a new job working with Ceph again and am excited to be back in the
community!

I can't find any documentation to support this, so please help me
understand if I got this right.

I've got a Jewel cluster with CephFS and we have an inconsistent PG. All
copies of the object are zero size, but the digest says that it should be a
non-zero size, so it seems that my two options are to delete the file that
the object is part of, or to rewrite the object with RADOS to update the
digest. So, this leads to my question: how do I tell which file the object
belongs to?

From what I found, the object is prefixed with the hex value of the inode
and suffixed by the stripe number:
1000d2ba15c.0005
.

I then ran `find . -xdev -inum 1099732590940` and found a file on the
CephFS file system. I just want to make sure that I found the right file
before I start trying recovery options.
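
(For reference, the decimal inode for find is just the hex prefix of the
object name converted to base 10, e.g.:

printf '%d\n' 0x1000d2ba15c
# -> 1099732590940
)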

Thank you,
Robert LeBlanc
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] deep-scrubbing has large impact on performance

2016-11-22 Thread Robert LeBlanc
If you use wpq, I recommend also setting "osd_op_queue_cut_off = high";
otherwise replication OPs are not weighted, which really reduces
the benefit of wpq.
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Nov 22, 2016 at 5:34 AM, Eugen Block  wrote:
> Thank you!
>
>
> Zitat von Nick Fisk :
>
>>> -Original Message-
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>>> Eugen Block
>>> Sent: 22 November 2016 10:11
>>> To: Nick Fisk 
>>> Cc: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] deep-scrubbing has large impact on performance
>>>
>>> Thanks for the very quick answer!
>>>
>>> > If you are using Jewel
>>>
>>> We are still using Hammer (0.94.7), we wanted to upgrade to Jewel in a
>>> couple of weeks, would you recommend to do it now?
>>
>>
>> It's been fairly solid for me, but you might want to wait for the
>> scrubbing hang bug to be fixed before upgrading. I think this
>> might be fixed in the upcoming 10.2.4 release.
>>
>>>
>>>
>>> Zitat von Nick Fisk :
>>>
>>> >> -Original Message-
>>> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>> >> Of Eugen Block
>>> >> Sent: 22 November 2016 09:55
>>> >> To: ceph-users@lists.ceph.com
>>> >> Subject: [ceph-users] deep-scrubbing has large impact on performance
>>> >>
>>> >> Hi list,
>>> >>
>>> >> I've been searching the mail archive and the web for some help. I
>>> >> tried the things I found, but I can't see the effects. We use
>>> >> Ceph for
>>> >> our Openstack environment.
>>> >>
>>> >> When our cluster (2 pools, each 4092 PGs, in 20 OSDs on 4 nodes, 3
>>> >> MONs) starts deep-scrubbing, it's impossible to work with the VMs.
>>> >> Currently, the deep-scrubs happen to start on Monday, which is
>>> >> unfortunate. I already plan to start the next deep-scrub on
>>> >> Saturday,
>>> >> so it has no impact on our work days. But if I imagine we had a large
>>> >> multi-datacenter, such performance breaks are not
>>> >> reasonable. So
>>> >> I'm wondering how do you guys manage that?
>>> >>
>>> >> What I've tried so far:
>>> >>
>>> >> ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
>>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_priority 7'
>>> >> ceph tell osd.* injectargs '--osd_disk_thread_ioprio_class idle'
>>> >> ceph tell osd.* injectargs '--osd_scrub_begin_hour 0'
>>> >> ceph tell osd.* injectargs '--osd_scrub_end_hour 7'
>>> >>
>>> >> And I also added these options to the ceph.conf.
>>> >> To be able to work again, I had to set the nodeep-scrub option and
>>> >> unset it when I left the office. Today, I see the cluster deep-
>>> >> scrubbing again, but only one PG at a time, it seems that now the
>>> >> default for osd_max_scrubs is working now and I don't see major
>>> >> impacts yet.
>>> >>
>>> >> But is there something else I can do to reduce the performance impact?
>>> >
>>> > If you are using Jewel, the scrubbing is now done in the client IO
>>> > thread, so those disk thread options won't do anything. Instead there
>>> > is a new priority setting, which seems to work for me, along with a
>>> > few other settings.
>>> >
>>> > osd_scrub_priority = 1
>>> > osd_scrub_sleep = .1
>>> > osd_scrub_chunk_min = 1
>>> > osd_scrub_chunk_max = 5
>>> > osd_scrub_load_threshold = 5
>>> >
>>> > Also enabling the weighted priority queue can assist the new priority
>>> > options
>>> >
>>> > osd_op_queue = wpq
>>> >
>>> >
>>> >> I just found [1] and will have a look into it.
>>> >>
>>> >> [1] http://prob6.com/en/ceph-pg-deep-scrub-cron/
>>> >>
>>> >> Thanks!
>>> >> Eugen
>>> >>
>>> >> --
>>> >> Eugen Block voice   : +49-40-559 51 75
>>> 

Re: [ceph-users] Blocked ops, OSD consuming memory, hammer

2016-05-26 Thread Robert LeBlanc
I've seen something similar to this when bringing an OSD back into a
cluster that has a lot of I/O that is "close" to the max performance
of the drives. For Jewel, there is a "mon osd prime pg temp" option [0]
which really helped reduce the huge memory usage when an OSD starts up
and helped a bit with the slow/blocked I/O too. I created a backport
for Hammer that I didn't have problems with, but it was rejected to
prevent adding new features to Hammer. You could patch Hammer, and you
only have to run the new code on the monitors to get the benefit. [1]

[0] 
http://docs.ceph.com/docs/jewel/rados/configuration/mon-config-ref/#miscellaneous
[1] https://github.com/ceph/ceph/pull/7848
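
A sketch of how it would be enabled once the mons run code that supports it
(section and option name per the doc link above; check your release's
default before relying on it):

[mon]
    mon osd prime pg temp = true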
----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, May 24, 2016 at 3:16 PM, Heath Albritton  wrote:
> Having some problems with my cluster.  Wondering if I could get some
> troubleshooting tips:
>
> Running hammer 0.94.5.  Small cluster with cache tiering.  3 spinning
> nodes and 3 SSD nodes.
>
> Lots of blocked ops.  OSDs are consuming the entirety of the system
> memory (128GB) and then falling over.  Lots of blocked ops, slow
> requests.  Seeing logs like this:
>
> 2016-05-24 19:30:09.288941 7f63c126b700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60
> 2016-05-24 19:30:09.503712 7f63c5273700  0 log_channel(cluster) log
> [WRN] : map e7779 wrongly marked me down
> 2016-05-24 19:30:11.190178 7f63cabcc700  0 --
> 10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762
> [PushReplyOp(3110010d/rbd_data.9647882ae8944a.26e7/head//9)])
> v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping
> message 0xfc21e00
> 2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780
> lsb_release_parse - failed to call lsb_release binary with error: (12)
> Cannot allocate memory
>
> Eventually the OSD fails.  Cluster is in an unhealthy state.
>
> I can set noup, restart the OSDs and get them on the current map, but
> once I put them back into the cluster, they eventually fail.
>
>
> -H
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mark out vs crush weight 0

2016-05-23 Thread Robert LeBlanc
Check out the Weighted Priority Queue option in Jewel; in my testing it
really helped reduce the impact of recovery and backfill on client traffic.
I think it really addresses a lot of the pain points
you mention.



Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, May 19, 2016 at 5:26 AM, Oliver Dzombic 
wrote:

> Hi,
>
> a sparedisk is a nice idea.
>
> But i think thats something you can also do with a shellscript.
>
> Checking if an osd is down or out and just using your spare disk.
>
> Maybe the programming resources should not be used for something most
> of us can do with a simple shell script checking every 5 seconds the
> situation.
>
> 
>
> Maybe better idea ( in my humble opinion ) is to solve this stuff by
> optimizing the code in recovery situations.
>
> Currently we have things like
>
> client-op-priority,
> recovery-op-priority,
> max-backfills,
> recovery-max-active and so on
>
> to limit the performance impact in a recovery situation.
>
> And still in a situation of recovery the performance go downhill ( a lot
> )  when all OSD's start to refill the to_be_recovered OSD.
>
> In my case, i was removing old HDD's from a cluster.
>
> If i down/out them ( 6 TB drives 40-50% full ) the cluster's performance
> will go down very dramatically. So i had to reduce the weight by 0.1
> steps to ease this pain, but could not remove it completely.
>
>
> So i think the tools / code to protect the cluster's performance ( even
> in recovery situation ) can be improved.
>
> Of course, on one hand, we want to make sure, that asap the configured
> amount of replica's and this way, datasecurity is restored.
>
> But on the other hand, it does not help too much if the recovery
> proceedure will impact the cluster's performance on a level where the
> useability is too much reduced.
>
> So maybe introcude another config option to controle this ratio ?
>
> To control more effectively how much IOPS/Bandwidth is used ( maybe
> streight in numbers in form of an IO ratelimit ) so that administrator's
> have the chance to config, according to the hardware environment, the
> "perfect" settings for their individual usecase.
>
>
> Because, right now, when i reduce the weight of a 6 TB HDD, while having
> ~ 30 OSD's in the cluster, from 1.0 to 0.9, around 3-5% of data will be
> moved around the cluster ( replication 2 ).
>
> While its moving, there is a true performance hit on the virtual servers.
>
> So if this could be solved, by a IOPS/HDD Bandwidth rate limit, that i
> can simply tell the cluster to use max. 10 IOPS and/or 10 MB/s for the
> recovery, then i think it would be a great help for any usecase and
> administrator.
>
> Thanks !
>
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 19.05.2016 um 04:57 schrieb Christian Balzer:
> >
> > Hello Sage,
> >
> > On Wed, 18 May 2016 17:23:00 -0400 (EDT) Sage Weil wrote:
> >
> >> Currently, after an OSD has been down for 5 minutes, we mark the OSD
> >> "out", whic redistributes the data to other OSDs in the cluster.  If the
> >> OSD comes back up, it marks the OSD back in (with the same reweight
> >> value, usually 1.0).
> >>
> >> The good thing about marking OSDs out is that exactly the amount of data
> >> on the OSD moves.  (Well, pretty close.)  It is uniformly distributed
> >> across al

Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread Robert LeBlanc
Ceph is not able to use native Infiniband protocols yet, so it is
only leveraging IPoIB at the moment. The most likely reason you are
only getting ~10 Gb performance is that IPoIB heavily leverages
multicast in Infiniband (if you do some research in this area you will
understand why unicast IP still uses multicast on an Infiniband
network). To be extremely compatible with all adapters, the subnet
manager will set the speed of multicast to 10 Gb/s so that SDR
adapters can be used and not drop packets. If you know that you will
never have adapters under a certain speed, you can configure the
subnet manager to use a higher speed. This does not change IPoIB
networks that are already configured (I had to down all the IPoIB
adapters at the same time and bring them back up to upgrade the speed).
Even after that, there still wasn't similar performance to native
Infiniband, but I got at least a 2x improvement (along with setting
the MTU to 64K) on the FDR adapters. There is still a ton of overhead
for doing IPoIB, so it is not an ideal transport to get performance on
Infiniband; I think of it as a compatibility feature. Hopefully, that
will give you enough information to perform the research. If you
search the OFED mailing list, you will see some posts from me 2-3
years ago regarding this very topic.
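
For the MTU piece, a minimal sketch of what that looks like on the IPoIB
interface (ib0 is an assumption; the multicast rate change itself is a
separate, fabric-wide setting in the subnet manager's configuration):

echo connected > /sys/class/net/ib0/mode
ip link set dev ib0 mtu 65520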

Good luck and keep holding out for Ceph with XIO.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Apr 7, 2016 at 1:43 PM, German Anders  wrote:
> Hi Cephers,
>
> I've setup a production environment Ceph cluster with the Jewel release
> (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
> Servers and 6 OSD Servers:
>
> 3x MON Servers:
> 2x Intel Xeon E5-2630v3@2.40Ghz
> 384GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 1x InfiniBand ConnectX-3 ADPT DP
>
> 6x OSD Servers:
> 2x Intel Xeon E5-2650v2@2.60Ghz
> 128GB RAM
> 2x 200G Intel DC3700 in RAID-1 for OS
> 12x 800G Intel DC3510 (osd & journal) on same device
> 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on
> the CLUS network)
>
> ceph.conf file is:
>
> [global]
> fsid = xxx
> mon_initial_members = cibm01, cibm02, cibm03
> mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = xx.xx.16.0/20
> cluster_network = xx.xx.32.0/20
>
> [mon]
>
> [mon.cibm01]
> host = cibm01
> mon_addr = xx.xx.xx.1:6789
>
> [mon.cibm02]
> host = cibm02
> mon_addr = xx.xx.xx.2:6789
>
> [mon.cibm03]
> host = cibm03
> mon_addr = xx.xx.xx.3:6789
>
> [osd]
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
>
> ## OSD Configuration ##
> [osd.0]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> [osd.1]
> host = cibn01
> public_addr = xx.xx.17.1
> cluster_addr = xx.xx.32.1
>
> ...
>
>
>
> They are all running Ubuntu 14.04.4 LTS. Journals are 5GB partitions on each
> disk, since all the OSD daemons are SSD disks (Intel DC3510 800G). For
> example:
>
> sdc  8:32   0 745.2G  0 disk
> |-sdc1   8:33   0 740.2G  0 part
> /var/lib/ceph/osd/ceph-0
> `-sdc2   8:34   0 5G  0 part
>
> The purpose of this cluster will be to serve as a backend storage for Cinder
> volumes (RBD) and Glance images in an OpenStack cloud, most of the clusters
> on OpenStack will be non-relational databases like Cassandra with many
> instances each.
>
> All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
> Mellanox Technologies MT27500 Family [ConnectX-3] adapters.
>
>
> So I assume that performance will be really nice, right?...but.. I'm getting
> some numbers that I 

Re: [ceph-users] data corruption with hammer

2016-03-20 Thread Robert LeBlanc
I'm having trouble finding documentation about using ceph_test_rados.
Can I run this on the existing cluster and will that provide useful
info? It seems running it in the build will not have the caching set
up (vstart.sh).

I have accepted a job with another company and only have until
Wednesday to help with getting information about this bug. My new job
will not be using Ceph, so I won't be able to provide any additional
info after Tuesday. I want to leave the company on a good trajectory
for upgrading, so any input you can provide will be helpful.

I've found:

./ceph_test_rados --op read 100 --op write 100 --op delete 50
--max-ops 40 --objects 1024 --max-in-flight 64 --size 400
--min-stride-size 40 --max-stride-size 80 --max-seconds 600
--op copy_from 50 --op snap_create 50 --op snap_remove 50 --op
rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0

Is that enough if I change --pool to the cached pool and do the
toggling while ceph_test_rados is running? I think this will run for
10 minutes.

Thanks,


--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Thu, Mar 17, 2016 at 8:19 AM, Sage Weil  wrote:

> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
> > We are trying to figure out how to use rados bench to reproduce. Ceph
> > itself doesn't seem to think there is any corruption, but when you do a
> > verify inside the RBD, there is. Can rados bench verify the objects after
> > they are written? It also seems to be primarily the filesystem metadata
> > that is corrupted. If we fsck the volume, there is missing data (put into
> > lost+found), but if it is there it is primarily OK. There only seems to
> be
> > a few cases where a file's contents are corrupted. I would suspect on an
> > object boundary. We would have to look at blockinfo to map that out and
> see
> > if that is what is happening.
>
> 'rados bench' doesn't do validation.  ceph_test_rados does, though--if you
> can reproduce with that workload then it should be pretty easy to track
> down.
>
> Thanks!
> sage
>
>
> > We stopped all the IO and did put the tier in writeback mode with recency
> > 1,  set the recency to 2 and started the test and there was corruption,
> so
> > it doesn't seem to be limited to changing the mode. I don't know how that
> > patch could cause the issue either. Unless there is a bug that reads from
> > the back tier, but writes to cache tier, then the object gets promoted
> > wiping that last write, but then it seems like it should not be as much
> > corruption since the metadata should be in the cache pretty quick. We
> > usually evited the cache before each try so we should not be evicting on
> > writeback.
> >
> > Sent from a mobile device, please excuse any typos.
> > On Mar 17, 2016 6:26 AM, "Sage Weil"  wrote:
> >
> > > On Thu, 17 Mar 2016, Nick Fisk wrote:
> > > > There is got to be something else going on here. All that PR does is
> to
> > > > potentially delay the promotion to hit_set_period*recency instead of
> > > > just doing it on the 2nd read regardless, it's got to be uncovering
> > > > another bug.
> > > >
> > > > Do you see the same problem if the cache is in writeback mode before
> you
> > > > start the unpacking. Ie is it the switching mid operation which
> causes
> > > > the problem? If it only happens mid operation, does it still occur if
> > > > you pause IO when you make the switch?
> > > >
> > > > Do you also see this if you perform on a RBD mount, to rule out any
> > > > librbd/qemu weirdness?
> > > >
> > > > Do you know if it’s the actual data that is getting corrupted or if
> it's
> > &g

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Yep, let me pull and build that branch. I tried installing the dbg
packages and running it in gdb, but it didn't load the symbols.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil  wrote:
> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>> the issue is with rewriting objects so if we can tailor the
>> ceph_test_rados to do that, it might be easier to reproduce.
>
> It's doing lots of overwrites, yeah.
>
> I was albe to reproduce--thanks!  It looks like it's specific to
> hammer.  The code was rewritten for jewel so it doesn't affect the
> latest.  The problem is that maybe_handle_cache may proxy the read and
> also still try to handle the same request locally (if it doesn't trigger a
> promote).
>
> Here's my proposed fix:
>
> https://github.com/ceph/ceph/pull/8187
>
> Do you mind testing this branch?
>
> It doesn't appear to be directly related to flipping between writeback and
> forward, although it may be that we are seeing two unrelated issues.  I
> seemed to be able to trigger it more easily when I flipped modes, but the
> bug itself was a simple issue in the writeback mode logic.  :/
>
> Anyway, please see if this fixes it for you (esp with the RBD workload).
>
> Thanks!
> sage
>
>
>
>
>> --------
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  
>> wrote:
>> > I'll  miss the Ceph community as well. There was a few things I really
>> > wanted to work in with Ceph.
>> >
>> > I got this:
>> >
>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
>> > dirty exists
>> > 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
>> > 1040:  finishing write tid 1 to nodez23350-256
>> > 1040:  finishing write tid 2 to nodez23350-256
>> > 1040:  finishing write tid 3 to nodez23350-256
>> > 1040:  finishing write tid 4 to nodez23350-256
>> > 1040:  finishing write tid 6 to nodez23350-256
>> > 1035: done (4 left)
>> > 1037: done (3 left)
>> > 1038: done (2 left)
>> > 1043: read oid 430 snap -1
>> > 1043:  expect (ObjNum 429 snap 0 seq_num 429)
>> > 1040:  finishing write tid 7 to nodez23350-256
>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
>> > dirty exists
>> > 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
>> > 1042:  expect (ObjNum 664 snap 0 seq_num 664)
>> > 1043: Error: oid 430 read returned error code -2
>> > ./test/osd/RadosModel.h: In function 'virtual void
>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
>> > 2016-03-17 10:47:19.085414
>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0)
>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> > const*)+0x76) [0x4db956]
>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
>> > 3: (()+0x9791d) [0x7fa1d472191d]
>> > 4: (()+0x72519) [0x7fa1d46fc519]
>> > 5: (()+0x13c178) [0x7fa1d47c6178]
>> > 6: (()+0x80a4) [0x7fa1d425a0a4]
>> > 7: (clone()+0x6d) [0x7fa1d2bd504d]
>> > NOTE: a copy of the executable, or `objdump -rdS ` is
>> > needed to interpret this.
>> > terminate called after throwing an instance of 'ceph::FailedAssertion'
>> > Aborted
>> >
>> > I had to toggle writeback/forward and min_read_recency_for_promote a
>> > few times to get it, but I don't know if it is because I only have one
>> > job running. Even with six jobs running, it is not easy to trigger
>> > with ceph_test_rados, but it is very instant in the RBD VMs.
>> >
>> > Here are the six run crashes (I have about the last 2000 lines of each
>> > if needed):
>> >
>> > nodev:
>> > update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
>> > 1014) dirty exists
>> > 1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
>> > 1016:  finishing write tid 1 to nodev21799-1016
>> > 1016:  finishing write tid 2 to nodev21799-1016
>> > 1016:  finishing write tid 3 to nodev21799-1016
>> > 1016:  finishing write tid 4 to nodev21799-1016
>> > 1016:  finishing write tid 6 to nodev21799-1016
>> > 1016:  finish

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Possible; it looks like all the messages come from a test suite. Is
there some logging that would expose this or an assert that could be
added? We are about ready to do some testing in our lab to see if we
can replicate it and work around the issue. I also can't tell which
version introduced this in Hammer; it doesn't look like it has been
resolved.

Thanks,


--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Wed, Mar 16, 2016 at 1:40 PM, Gregory Farnum  wrote:

> This tracker ticket happened to go by my eyes today:
> http://tracker.ceph.com/issues/12814 . There isn't a lot of detail
> there but the headline matches.
> -Greg
>
> On Wed, Mar 16, 2016 at 2:02 AM, Nick Fisk  wrote:
> >
> >
> >> -Original Message-
> >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> >> Christian Balzer
> >> Sent: 16 March 2016 07:08
> >> To: Robert LeBlanc 
> >> Cc: Robert LeBlanc ; ceph-users  >> us...@lists.ceph.com>; William Perkins 
> >> Subject: Re: [ceph-users] data corruption with hammer
> >>
> >>
> >> Hello Robert,
> >>
> >> On Tue, 15 Mar 2016 10:54:20 -0600 Robert LeBlanc wrote:
> >>
> >> > -BEGIN PGP SIGNED MESSAGE-
> >> > Hash: SHA256
> >> >
> >> > There are no monitors on the new node.
> >> >
> >> So one less possible source of confusion.
> >>
> >> > It doesn't look like there has been any new corruption since we
> >> > stopped changing the cache modes. Upon closer inspection, some files
> >> > have been changed such that binary files are now ASCII files and visa
> >> > versa. These are readable ASCII files and are things like PHP or
> >> > script files. Or C files where ASCII files should be.
> >> >
> >> What would be most interesting is if the objects containing those
> > corrupted
> >> files did reside on the new OSDs (primary PG) or the old ones, or both.
> >>
> >> Also, what cache mode was the cluster in before the first switch
> > (writeback I
> >> presume from the timeline) and which one is it in now?
> >>
> >> > I've seen this type of corruption before when a SAN node misbehaved
> >> > and both controllers were writing concurrently to the backend disks.
> >> > The volume was only mounted by one host, but the writes were split
> >> > between the controllers when it should have been active/passive.
> >> >
> >> > We have killed off the OSDs on the new node as a precaution and will
> >> > try to replicate this in our lab.
> >> >
> >> > I suspicion is that is has to do with the cache promotion code update,
> >> > but I'm not sure how it would have caused this.
> >> >
> >> While blissfully unaware of the code, I have a hard time imagining how
> it
> >> would cause that as well.
> >> Potentially a regression in the code that only triggers in one cache
> mode
> > and
> >> when wanting to promote something?
> >>
> >> Or if it is actually the switching action, not correctly promoting
> things
> > as it
> >> happens?
> >> And thus referencing a stale object?
> >
> > I can't think of any other reason why the recency would break things in
> any
> > other way. Can the OP confirm what recency setting is being used?
> >
> > When you switch to writeback, if you haven't reached the required recency
> > yet, all reads will be proxied, previous behaviour would have pretty much
> > promoted all the time regardless. So unless something is happening where
> &g

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
We are trying to figure out how to use rados bench to reproduce. Ceph
itself doesn't seem to think there is any corruption, but when you do a
verify inside the RBD, there is. Can rados bench verify the objects after
they are written? It also seems to be primarily the filesystem metadata
that is corrupted. If we fsck the volume, there is missing data (put into
lost+found), but if it is there it is mostly OK. There only seem to be
a few cases where a file's contents are corrupted; I would suspect it
happens on an object boundary. We would have to look at blockinfo to map
that out and see if that is what is happening.

We stopped all the IO, put the tier in writeback mode with recency 1,
then set the recency to 2 and started the test, and there was corruption,
so it doesn't seem to be limited to changing the mode. I don't know how that
patch could cause the issue either. Unless there is a bug that reads from
the back tier, but writes to cache tier, then the object gets promoted
wiping that last write, but then it seems like it should not be as much
corruption since the metadata should be in the cache pretty quickly. We
usually evicted the cache before each try, so we should not be evicting on
writeback.

Sent from a mobile device, please excuse any typos.
On Mar 17, 2016 6:26 AM, "Sage Weil"  wrote:

> On Thu, 17 Mar 2016, Nick Fisk wrote:
> > There is got to be something else going on here. All that PR does is to
> > potentially delay the promotion to hit_set_period*recency instead of
> > just doing it on the 2nd read regardless, it's got to be uncovering
> > another bug.
> >
> > Do you see the same problem if the cache is in writeback mode before you
> > start the unpacking. Ie is it the switching mid operation which causes
> > the problem? If it only happens mid operation, does it still occur if
> > you pause IO when you make the switch?
> >
> > Do you also see this if you perform on a RBD mount, to rule out any
> > librbd/qemu weirdness?
> >
> > Do you know if it’s the actual data that is getting corrupted or if it's
> > the FS metadata? I'm only wondering as unpacking should really only be
> > writing to each object a couple of times, whereas FS metadata could
> > potentially be being updated+read back lots of times for the same group
> > of objects and ordering is very important.
> >
> > Thinking through it logically the only difference is that with recency=1
> > the object will be copied up to the cache tier, where recency=6 it will
> > be proxy read for a long time. If I had to guess I would say the issue
> > would lie somewhere in the proxy read + writeback<->forward logic.
>
> That seems reasonable.  Was switching from writeback -> forward always
> part of the sequence that resulted in corruption?  Not that there is a
> known ordering issue when switching to forward mode.  I wouldn't really
> expect it to bite real users but it's possible..
>
> http://tracker.ceph.com/issues/12814
>
> I've opened a ticket to track this:
>
> http://tracker.ceph.com/issues/15171
>
> What would be *really* great is if you could reproduce this with a
> ceph_test_rados workload (from ceph-tests).  I.e., get ceph_test_rados
> running, and then find the sequence of operations that are sufficient to
> trigger a failure.
>
> sage
>
>
>
>  >
> >
> >
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of
> > > Mike Lovell
> > > Sent: 16 March 2016 23:23
> > > To: ceph-users ; sw...@redhat.com
> > > Cc: Robert LeBlanc ; William Perkins
> > > 
> > > Subject: Re: [ceph-users] data corruption with hammer
> > >
> > > just got done with a test against a build of 0.94.6 minus the two
> commits that
> > > were backported in PR 7207. everything worked as it should with the
> cache-
> > > mode set to writeback and the min_read_recency_for_promote set to 2.
> > > assuming it works properly on master, there must be a commit that we're
> > > missing on the backport to support this properly.
> > >
> > > sage,
> > > i'm adding you to the recipients on this so hopefully you see it. the
> tl;dr
> > > version is that the backport of the cache recency fix to hammer
> doesn't work
> > > right and potentially corrupts data when
> > > the min_read_recency_for_promote is set to greater than 1.
> > >
> > > mike
> > >
> > > On Wed, Mar 16, 2016 at 4:41 PM, Mike Lovell
> > >  wrote:
> > > robert and i have done some further investigation the

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Cherry-picking that commit onto v0.94.6 wasn't clean, so I'm just
building your branch. I'm not sure what the difference between your
branch and 0.94.6 is; I don't see any commits against
osd/ReplicatedPG.cc in the last 5 months other than the one you did
today.
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc  wrote:
> Yep, let me pull and build that branch. I tried installing the dbg
> packages and running it in gdb, but it didn't load the symbols.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil  wrote:
>> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>>> the issue is with rewriting objects so if we can tailor the
>>> ceph_test_rados to do that, it might be easier to reproduce.
>>
>> It's doing lots of overwrites, yeah.
>>
>> I was albe to reproduce--thanks!  It looks like it's specific to
>> hammer.  The code was rewritten for jewel so it doesn't affect the
>> latest.  The problem is that maybe_handle_cache may proxy the read and
>> also still try to handle the same request locally (if it doesn't trigger a
>> promote).
>>
>> Here's my proposed fix:
>>
>> https://github.com/ceph/ceph/pull/8187
>>
>> Do you mind testing this branch?
>>
>> It doesn't appear to be directly related to flipping between writeback and
>> forward, although it may be that we are seeing two unrelated issues.  I
>> seemed to be able to trigger it more easily when I flipped modes, but the
>> bug itself was a simple issue in the writeback mode logic.  :/
>>
>> Anyway, please see if this fixes it for you (esp with the RBD workload).
>>
>> Thanks!
>> sage
>>
>>
>>
>>
>>> 
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>
>>>
>>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  
>>> wrote:
>>> > I'll miss the Ceph community as well. There were a few things I really
>>> > wanted to work on with Ceph.
>>> >
>>> > I got this:
>>> >
>>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
>>> > dirty exists
>>> > 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
>>> > 1040:  finishing write tid 1 to nodez23350-256
>>> > 1040:  finishing write tid 2 to nodez23350-256
>>> > 1040:  finishing write tid 3 to nodez23350-256
>>> > 1040:  finishing write tid 4 to nodez23350-256
>>> > 1040:  finishing write tid 6 to nodez23350-256
>>> > 1035: done (4 left)
>>> > 1037: done (3 left)
>>> > 1038: done (2 left)
>>> > 1043: read oid 430 snap -1
>>> > 1043:  expect (ObjNum 429 snap 0 seq_num 429)
>>> > 1040:  finishing write tid 7 to nodez23350-256
>>> > update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
>>> > dirty exists
>>> > 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
>>> > 1042:  expect (ObjNum 664 snap 0 seq_num 664)
>>> > 1043: Error: oid 430 read returned error code -2
>>> > ./test/osd/RadosModel.h: In function 'virtual void
>>> > ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
>>> > 2016-03-17 10:47:19.085414
>>> > ./test/osd/RadosModel.h: 1109: FAILED assert(0)
>>> > ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> > const*)+0x76) [0x4db956]
>>> > 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
>>> > 3: (()+0x9791d) [0x7fa1d472191d]
>>> > 4: (()+0x72519) [0x7fa1d46fc519]
>>> > 5: (()+0x13c178) [0x7fa1d47c6178]
>>> > 6: (()+0x80a4) [0x7fa1d425a0a4]
>>> > 7: (clone()+0x6d) [0x7fa1d2bd504d]
>>> > NOTE: a copy of the executable, or `objdump -rdS ` is
>>> > needed to interpret this.
>>> > terminate called after throwing an instance of 'ceph::FailedAssertion'
>>> > Aborted
>>> >
>>> > I had to toggle writeback/forward and min_read_recency_for_promote a
>>> > few times to get it, but I don't know if it is because I only have one
>>> > job running. Ev

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
1017:  finishing write tid 3 to nodezz25161-1017
1017:  finishing write tid 5 to nodezz25161-1017
1017:  finishing write tid 6 to nodezz25161-1017
update_object_version oid 1017 v 3011 (ObjNum 1016 snap 0 seq_num
1016) dirty exists
1017:  left oid 1017 (ObjNum 1016 snap 0 seq_num 1016)
1018:  finishing write tid 1 to nodezz25161-1018
1018:  finishing write tid 2 to nodezz25161-1018
1018:  finishing write tid 3 to nodezz25161-1018
1018:  finishing write tid 4 to nodezz25161-1018
1018:  finishing write tid 6 to nodezz25161-1018
1018:  finishing write tid 7 to nodezz25161-1018
update_object_version oid 1018 v 1099 (ObjNum 1017 snap 0 seq_num
1017) dirty exists
1018:  left oid 1018 (ObjNum 1017 snap 0 seq_num 1017)
1019:  finishing write tid 1 to nodezz25161-1019
1019:  finishing write tid 2 to nodezz25161-1019
1019:  finishing write tid 3 to nodezz25161-1019
1019:  finishing write tid 5 to nodezz25161-1019
1019:  finishing write tid 6 to nodezz25161-1019
update_object_version oid 1019 v 1300 (ObjNum 1018 snap 0 seq_num
1018) dirty exists
1019:  left oid 1019 (ObjNum 1018 snap 0 seq_num 1018)
1020:  finishing write tid 1 to nodezz25161-1020
1020:  finishing write tid 2 to nodezz25161-1020
1020:  finishing write tid 3 to nodezz25161-1020
1020:  finishing write tid 5 to nodezz25161-1020
1020:  finishing write tid 6 to nodezz25161-1020
update_object_version oid 1020 v 1324 (ObjNum 1019 snap 0 seq_num
1019) dirty exists
1020:  left oid 1020 (ObjNum 1019 snap 0 seq_num 1019)
1021:  finishing write tid 1 to nodezz25161-1021
1021:  finishing write tid 2 to nodezz25161-1021
1021:  finishing write tid 3 to nodezz25161-1021
1021:  finishing write tid 5 to nodezz25161-1021
1021:  finishing write tid 6 to nodezz25161-1021
update_object_version oid 1021 v 890 (ObjNum 1020 snap 0 seq_num 1020)
dirty exists
1021:  left oid 1021 (ObjNum 1020 snap 0 seq_num 1020)
1022:  finishing write tid 1 to nodezz25161-1022
1022:  finishing write tid 2 to nodezz25161-1022
1022:  finishing write tid 3 to nodezz25161-1022
1022:  finishing write tid 5 to nodezz25161-1022
1022:  finishing write tid 6 to nodezz25161-1022
update_object_version oid 1022 v 464 (ObjNum 1021 snap 0 seq_num 1021)
dirty exists
1022:  left oid 1022 (ObjNum 1021 snap 0 seq_num 1021)
1023:  finishing write tid 1 to nodezz25161-1023
1023:  finishing write tid 2 to nodezz25161-1023
1023:  finishing write tid 3 to nodezz25161-1023
1023:  finishing write tid 5 to nodezz25161-1023
1023:  finishing write tid 6 to nodezz25161-1023
update_object_version oid 1023 v 1516 (ObjNum 1022 snap 0 seq_num
1022) dirty exists
1023:  left oid 1023 (ObjNum 1022 snap 0 seq_num 1022)
1024:  finishing write tid 1 to nodezz25161-1024
1024:  finishing write tid 2 to nodezz25161-1024
1025: Error: oid 219 read returned error code -2
./test/osd/RadosModel.h: In function 'virtual void
ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fbb1bfff700 time
2016-03-17 10:53:53.071338
./test/osd/RadosModel.h: 1109: FAILED assert(0)
ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x76) [0x4db956]
2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
3: (()+0x9791d) [0x7fbb30ff191d]
4: (()+0x72519) [0x7fbb30fcc519]
5: (()+0x13c178) [0x7fbb31096178]
6: (()+0x80a4) [0x7fbb30b2a0a4]
7: (clone()+0x6d) [0x7fbb2f4a504d]
NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
Aborted

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 10:39 AM, Sage Weil  wrote:
> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I'm having trouble finding documentation about using ceph_test_rados. Can I
>> run this on the existing cluster and will that provide useful info? It seems
>> running it in the build will not have the caching set up (vstart.sh).
>>
>> I have accepted a job with another company and only have until Wednesday to
>> help with getting information about this bug. My new job will not be using
>> Ceph, so I won't be able to provide any additional info after Tuesday. I want
>> to leave the company on a good trajectory for upgrading, so any input you
>> can provide will be helpful.
>
> I'm sorry to hear it!  You'll be missed.  :)
>
>> I've found:
>>
>> ./ceph_test_rados --op read 100 --op write 100 --op delete 50
>> --max-ops 40 --objects 1024 --max-in-flight 64 --size 400
>> --min-stride-size 40 --max-stride-size 80 --max-seconds 600
>> --op copy_from 50 --op snap_create 50 --op snap_remove 50 --op
>> rollback 50 --op setattr 25 --op rmattr 25 --pool unique_pool_0
>>
>> Is that enough if I change --pool

Re: [ceph-users] data corruption with hammer

2016-03-19 Thread Robert LeBlanc
Also, is this ceph_test_rados rewriting objects quickly? I think that
the issue is with rewriting objects so if we can tailor the
ceph_test_rados to do that, it might be easier to reproduce.
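
(As an untested sketch, reusing the options from the ceph_test_rados
invocation shown above: raising the write weight and shrinking the object
set should make the same objects get rewritten much more often; the
weights and sizes here are only illustrative.)

./ceph_test_rados --op read 25 --op write 100 --op delete 10 \
  --max-ops 4000 --objects 128 --max-in-flight 64 --size 400 \
  --min-stride-size 40 --max-stride-size 80 --max-seconds 600 \
  --pool unique_pool_0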

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  wrote:
> I'll miss the Ceph community as well. There were a few things I really
> wanted to work on with Ceph.
>
> I got this:
>
> update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
> dirty exists
> 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
> 1040:  finishing write tid 1 to nodez23350-256
> 1040:  finishing write tid 2 to nodez23350-256
> 1040:  finishing write tid 3 to nodez23350-256
> 1040:  finishing write tid 4 to nodez23350-256
> 1040:  finishing write tid 6 to nodez23350-256
> 1035: done (4 left)
> 1037: done (3 left)
> 1038: done (2 left)
> 1043: read oid 430 snap -1
> 1043:  expect (ObjNum 429 snap 0 seq_num 429)
> 1040:  finishing write tid 7 to nodez23350-256
> update_object_version oid 256 v 661 (ObjNum 1029 snap 0 seq_num 1029)
> dirty exists
> 1040:  left oid 256 (ObjNum 1029 snap 0 seq_num 1029)
> 1042:  expect (ObjNum 664 snap 0 seq_num 664)
> 1043: Error: oid 430 read returned error code -2
> ./test/osd/RadosModel.h: In function 'virtual void
> ReadOp::_finish(TestOp::CallbackInfo*)' thread 7fa1bf7fe700 time
> 2016-03-17 10:47:19.085414
> ./test/osd/RadosModel.h: 1109: FAILED assert(0)
> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x76) [0x4db956]
> 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4c959c]
> 3: (()+0x9791d) [0x7fa1d472191d]
> 4: (()+0x72519) [0x7fa1d46fc519]
> 5: (()+0x13c178) [0x7fa1d47c6178]
> 6: (()+0x80a4) [0x7fa1d425a0a4]
> 7: (clone()+0x6d) [0x7fa1d2bd504d]
> NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> Aborted
>
> I had to toggle writeback/forward and min_read_recency_for_promote a
> few times to get it, but I don't know if it is because I only have one
> job running. Even with six jobs running, it is not easy to trigger
> with ceph_test_rados, but it is very instant in the RBD VMs.
>
> Here are the six run crashes (I have about the last 2000 lines of each
> if needed):
>
> nodev:
> update_object_version oid 1015 v 1255 (ObjNum 1014 snap 0 seq_num
> 1014) dirty exists
> 1015:  left oid 1015 (ObjNum 1014 snap 0 seq_num 1014)
> 1016:  finishing write tid 1 to nodev21799-1016
> 1016:  finishing write tid 2 to nodev21799-1016
> 1016:  finishing write tid 3 to nodev21799-1016
> 1016:  finishing write tid 4 to nodev21799-1016
> 1016:  finishing write tid 6 to nodev21799-1016
> 1016:  finishing write tid 7 to nodev21799-1016
> update_object_version oid 1016 v 1957 (ObjNum 1015 snap 0 seq_num
> 1015) dirty exists
> 1016:  left oid 1016 (ObjNum 1015 snap 0 seq_num 1015)
> 1017:  finishing write tid 1 to nodev21799-1017
> 1017:  finishing write tid 2 to nodev21799-1017
> 1017:  finishing write tid 3 to nodev21799-1017
> 1017:  finishing write tid 5 to nodev21799-1017
> 1017:  finishing write tid 6 to nodev21799-1017
> update_object_version oid 1017 v 1010 (ObjNum 1016 snap 0 seq_num
> 1016) dirty exists
> 1017:  left oid 1017 (ObjNum 1016 snap 0 seq_num 1016)
> 1018:  finishing write tid 1 to nodev21799-1018
> 1018:  finishing write tid 2 to nodev21799-1018
> 1018:  finishing write tid 3 to nodev21799-1018
> 1018:  finishing write tid 4 to nodev21799-1018
> 1018:  finishing write tid 6 to nodev21799-1018
> 1018:  finishing write tid 7 to nodev21799-1018
> update_object_version oid 1018 v 1093 (ObjNum 1017 snap 0 seq_num
> 1017) dirty exists
> 1018:  left oid 1018 (ObjNum 1017 snap 0 seq_num 1017)
> 1019:  finishing write tid 1 to nodev21799-1019
> 1019:  finishing write tid 2 to nodev21799-1019
> 1019:  finishing write tid 3 to nodev21799-1019
> 1019:  finishing write tid 5 to nodev21799-1019
> 1019:  finishing write tid 6 to nodev21799-1019
> update_object_version oid 1019 v 462 (ObjNum 1018 snap 0 seq_num 1018)
> dirty exists
> 1019:  left oid 1019 (ObjNum 1018 snap 0 seq_num 1018)
> 1021:  finishing write tid 1 to nodev21799-1021
> 1020:  finishing write tid 1 to nodev21799-1020
> 1020:  finishing write tid 2 to nodev21799-1020
> 1020:  finishing write tid 3 to nodev21799-1020
> 1020:  finishing write tid 5 to nodev21799-1020
> 1020:  finishing write tid 6 to nodev21799-1020
> update_object_version oid 1020 v 1287 (ObjNum 1019 snap 0 seq_num
> 1019) dirty exists
> 1020:  left oid 1020 (

Re: [ceph-users] data corruption with hammer

2016-03-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Sage,

Your patch seems to have resolved the issue for us. We can't reproduce
the problem with ceph_test_rados or our VM test. I also figured out
that those are all backports that were cherry-picked so it was showing
the original commit date. There was quite a bit of work on
ReplicatedPG.cc since 0.94.6 so it probably only makes sense to wait
for 0.94.7 for this fix.

Thanks for looking into this so quick!

As a workaround for 0.94.6, our testing shows that setting
min_read_recency_for_promote to 1 does not hit the corruption, as it
keeps the original behavior. Something for people to be aware of with
0.94.6 when using cache tiers.

Hopefully there is a way to detect this in a unittest.
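
(For anyone applying the workaround, it amounts to something like this,
with "ssd-cache" standing in for the real cache pool name:)

ceph osd pool get ssd-cache min_read_recency_for_promote
ceph osd pool set ssd-cache min_read_recency_for_promote 1
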
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW6wILCRDmVDuy+mK58QAAcVQP/0t8jGZuwmwg2RIwkgjQ
Kb3mIxvsmnA9BQ4dICJB3Wu6FPT1/V34t0ThASehWyVSJyiUkdf+pxhXbDaQ
vOr4OOyTwCB2Ly6jaLEgAiyGTL45uOnMYcSttXPG95lilTb+oGUcqBdQzRbw
yJHG18UiEgMvKnttFjTLbd1FjICIY7xkkP7lrdHvaqe200aqQmb+g8CHTVj/
HqzYm/gTs84c2vK+x/nV8OFxY9Yf5WAV+O7uozeWC3SAc2VMlQgi8rdng51N
B+andt/SXgGq9VCDqdmEzcEpBN+2wK6usZQCZJmMXRmW4BXYVK4yAdfgKJOB
MEUN2cDA1i7bMIUcDrh1hnqwEfizkbqOWXpgrgAkQYhtlbp/gvEucl5nYMUy
kv9jNYg/KFQn9tzZqKWmvHj3sjl6DmOlN+A9XA2fGppOiiKk0s4dVKRDFwSJ
LNxUIZm4CtAekaQ4KymE/hK6RhRU2REQl7qSMF+wtw73nhA9gzqP32Ag46yd
WoeGpOngWRnMaejQfkuTSjiDSLvbCd7X5LM/WXH4dJHtHNSSA2qK3c4Nvvqp
yDhvFLdvybtJvWj0+hHczpcP0VlFZH9s7uGWz0+cNabkRnm41EC2+XD6sJ5+
kinZO+CgjbC2AQPdoEKMuvRwBgnftH0YuZJFl0sQPkgBg23r+eCfIxfW/9v/
iLgk
=6It+
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Mar 17, 2016 at 11:55 AM, Robert LeBlanc  wrote:
> Cherry-picking that commit onto v0.94.6 wasn't clean so I'm just
> building your branch. I'm not sure what the difference between your
> branch and 0.94.6 is, I don't see any commits against
> osd/ReplicatedPG.cc in the last 5 months other than the one you did
> today.
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Mar 17, 2016 at 11:38 AM, Robert LeBlanc  wrote:
>> Yep, let me pull and build that branch. I tried installing the dbg
>> packages and running it in gdb, but it didn't load the symbols.
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Thu, Mar 17, 2016 at 11:36 AM, Sage Weil  wrote:
>>> On Thu, 17 Mar 2016, Robert LeBlanc wrote:
>>>> Also, is this ceph_test_rados rewriting objects quickly? I think that
>>>> the issue is with rewriting objects so if we can tailor the
>>>> ceph_test_rados to do that, it might be easier to reproduce.
>>>
>>> It's doing lots of overwrites, yeah.
>>>
>>> I was able to reproduce--thanks!  It looks like it's specific to
>>> hammer.  The code was rewritten for jewel so it doesn't affect the
>>> latest.  The problem is that maybe_handle_cache may proxy the read and
>>> also still try to handle the same request locally (if it doesn't trigger a
>>> promote).
>>>
>>> Here's my proposed fix:
>>>
>>> https://github.com/ceph/ceph/pull/8187
>>>
>>> Do you mind testing this branch?
>>>
>>> It doesn't appear to be directly related to flipping between writeback and
>>> forward, although it may be that we are seeing two unrelated issues.  I
>>> seemed to be able to trigger it more easily when I flipped modes, but the
>>> bug itself was a simple issue in the writeback mode logic.  :/
>>>
>>> Anyway, please see if this fixes it for you (esp with the RBD workload).
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>>
>>>
>>>> 
>>>> Robert LeBlanc
>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>>>
>>>>
>>>> On Thu, Mar 17, 2016 at 11:05 AM, Robert LeBlanc  
>>>> wrote:
>>>> > I'll miss the Ceph community as well. There were a few things I really
>>>> > wanted to work on with Ceph.
>>>> >
>>>> > I got this:
>>>> >
>>>> > update_object_version oid 13 v 1166 (ObjNum 1028 snap 0 seq_num 1028)
>>>> > dirty exists
>>>> > 1038:  left oid 13 (ObjNum 1028 snap 0 seq_num 1028)
>>>> > 1040:  finishing write tid 1 to nodez23350-256
>>>> > 1040:  finishing write tid 2 to nodez23350-256
>>>> > 1040:  finishing write 

Re: [ceph-users] data corruption with hammer

2016-03-15 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

There are no monitors on the new node.

It doesn't look like there has been any new corruption since we
stopped changing the cache modes. Upon closer inspection, some files
have been changed such that binary files are now ASCII files and vice
versa. These are readable ASCII files, things like PHP or script files,
or C files where ASCII files should be.

I've seen this type of corruption before when a SAN node misbehaved
and both controllers were writing concurrently to the backend disks.
The volume was only mounted by one host, but the writes were split
between the controllers when it should have been active/passive.

We have killed off the OSDs on the new node as a precaution and will
try to replicate this in our lab.

My suspicion is that it has to do with the cache promotion code update,
but I'm not sure how it would have caused this.
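
(A crude way to audit for this kind of damage is to compare what file(1)
reports against what the name implies; the paths and extensions below are
only examples.)

# scripts that no longer look like text, and binaries that suddenly do
find /srv/www -name '*.php' -exec file {} + | grep -v -i text
find /usr/local/bin -type f -exec file {} + | grep -i 'ASCII text'
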
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW6D4zCRDmVDuy+mK58QAAoW0QAKmaNnN78m/3/YLIIlAB
U+q9PKXgB4ptds1prEJrB/HJqtxIi021M2urk6iO2XRUgR4qSWZyVJWMmeE9
6EhM6IvLbweOePr2LJ5nAVEkL5Fns+ya/aOAvilqo2WJGr8jt9J1ABjQgodp
SAGwDywo3GbGUmdxWWy5CrhLsdc9WNhiXdBxREh/uqWFvw2D8/1Uq4/u8tEv
fohrGD+SZfYLQwP9O/v8Rc1C3A0h7N4ytSMiN7Xg2CC9bJDynn0FTrP2LAr/
edEYx+SWF2VtKuG7wVHrQqImTfDUoTLJXP5Q6B+Oxy852qvWzglfoRhaKwGf
fodaxFlTDQaeMnyhMlODRMMXadmiTmyM/WK44YBuMjM8tnlaxf7yKgh09ADz
ay5oviRWnn7peXmq65TvaZzUfz6Mx5ZWYtqIevaXb0ieFgrxCTdVbdpnMNRt
bMwQ+yVQ8WB5AQmEqN6p6enBCxpvr42p8Eu484dO0xqjIiEOfsMANT/8V63y
RzjPMOaFKFnl3JoYNm61RGAUYszNBeX/Plm/3mP0qiiGBAeHYoxh7DNYlrs/
gUb/O9V0yNuHQIRTs8ZRyrzZKpmh9YMYo8hCsfIqWZjMwEyQaRFuysQB3NaR
lQCO/o12Khv2cygmTCQxS2L7vp2zrkPaS/KietqQ0gwkV1XbynK0XyLkAVDw
zTLa
=Wk/a
-END PGP SIGNATURE-
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Mar 14, 2016 at 9:35 PM, Christian Balzer  wrote:
>
> Hello,
>
> On Mon, 14 Mar 2016 20:51:04 -0600 Mike Lovell wrote:
>
>> something weird happened on one of the ceph clusters that i administer
>> tonight which resulted in virtual machines using rbd volumes seeing
>> corruption in multiple forms.
>>
>> when everything was fine earlier in the day, the cluster was a number of
>> storage nodes spread across 3 different roots in the crush map. the first
>> bunch of storage nodes have both hard drives and ssds in them with the
>> hard drives in one root and the ssds in another. there is a pool for
>> each and the pool for the ssds is a cache tier for the hard drives. the
>> last set of storage nodes were in a separate root with their own pool
>> that is being used for burn in testing.
>>
>> these nodes had run for a while with test traffic and we decided to move
>> them to the main root and pools. the main cluster is running 0.94.5 and
>> the new nodes got 0.94.6 due to them getting configured after that was
>> released. i removed the test pool and did a ceph osd crush move to move
>> the first node into the main cluster, the hard drives into the root for
>> that tier of storage and the ssds into the root and pool for the cache
>> tier. each set was done about 45 minutes apart and they ran for a couple
>> hours while performing backfill without any issue other than high load
>> on the cluster.
>>
> Since I glanced what your setup looks like from Robert's posts and yours I
> won't say the obvious thing, as you aren't using EC pools.
>
>> we normally run the ssd tier in the forward cache-mode due to the ssds we
>> have not being able to keep up with the io of writeback. this results in
>> io on the hard drives slowing going up and performance of the cluster
>> starting to suffer. about once a week, i change the cache-mode between
>> writeback and forward for short periods of time to promote actively used
>> data to the cache tier. this moves io load from the hard drive tier to
>> the ssd tier and has been done multiple times without issue. i normally
>> don't do this while there are backfills or recoveries happening on the
>> cluster but decided to go ahead while backfill was happening due to the
>> high load.
>>
> As you might recall, I managed to have "rados bench" break (I/O error) when
> doing these switches with Firefly on my crappy test cluster, but not with
> Hammer.
> However I haven't done any such switches on my production cluster with a
> cache tier, both because the cache pool hasn't even reached 50% capacity
> after 2 weeks of pounding and because I'm sure that everything will hold
> up when it comes to the first flushing.
>
> Maybe the extreme load (as opposed to normal VM ops) of your cluster
> during the backfilling triggered the same or a simi

Re: [ceph-users] how to downgrade when upgrade from firefly to hammer fail

2016-03-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

There is no downgrade path. You are best off trying to fix the issue
preventing the upgrade. Post some of the logs from the upgraded OSD
and people can try to help you out.
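
(One way to capture a useful log is to run the failing OSD in the
foreground with debug logging turned up; the OSD id 0 below is just an
example.)

# -d runs the daemon in the foreground and logs to stderr
ceph-osd -d -i 0 --debug-osd 20 --debug-filestore 20 --debug-journal 20 \
  2>&1 | tee osd.0-startup.log
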
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW3fYBCRDmVDuy+mK58QAAma8P/iqr+OaE/qzI9DModRp6
jk16h6k5rH/VSuIyFk74yvfRN28vtywJc62dx/sK4ke3oGpzo9Fn8ay0YMWM
7WEK6E0qZydMYGsNMKw5uBN1VfhiThwKLiip+U/t0ZUV963d+djH0yOFK4bF
9jKX/p8ZIWfAgIRyFgz2QhFT2zPL9UIasVo7eg8nsAE9YNSE2CkBEvxTrxWC
C26BrJ24+4TY6qliruhniQNOkWAstYMtPeiFqh5IlH/5/oOqQ6wK+dgyhEmP
4A/RAS7bom0cWP4eu0b+St+IvefC7kzoJM38yTEHku5YALAAFYitLh1Fbzp8
99lS/piObnAgjNTPW6h1KteweIYZJJ3ki9yhq8cBpQ4O5PqHc/64SBq/NY4o
69dpUUqp6L7HudMwDs5z8Q76BjuCu4NhMCieKgks+CuF7mwmCPTEN2A+enaD
MTHkQeM5MNRZ4xigucIrYhiT18SMvaI4aKkLCq7GGHkaInk5+91WLcYF+KDa
L+9n4M0jW14n2BXejMZjpKXxNa86N5cF7yO/hILCtz1CVJgNcqT2z+kIDZ3z
50aZva/SHsvxmdwK+UxrB3jnFldhzPUB6nU/xJCQWN+BBTSQByFmAg+JkEuX
13qV0h4yWRfH4uaKYdKuzTVSX0zY8HkAA4ZHTatxiPXiVET+NwNE+4aqdbTz
hw+f
=nLNP
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Mar 6, 2016 at 7:59 PM, Dong Wu  wrote:
> hi, cephers
> I want to upgrade my Ceph cluster from Firefly (0.80.11) to Hammer.
> I successfully installed the Hammer deb packages on all my hosts, then
> updated the monitors first, and that succeeded.
> But when I restarted the OSDs on one host to upgrade them, it failed: the
> OSDs cannot start up. I then wanted to downgrade to Firefly again to keep
> my cluster going, but after reinstalling the Firefly deb packages I failed
> to start the OSDs on that host. Here is the log:
>
> 2016-03-07 09:47:14.704242 7f2f11ba87c0  0 ceph version 0.80.11
> (8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid
> 37459
> 2016-03-07 09:47:14.709159 7f2f11ba87c0 -1
> filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version
> stamp 4. Please run the FileStore update script before starting the
> OSD, or set filestore_update_to to 3
> 2016-03-07 09:47:14.709176 7f2f11ba87c0 -1  ** ERROR: error converting
> store /var/lib/ceph/osd/ceph-0: (22) Invalid argument
> 2016-03-07 09:47:18.385399 7f98478187c0  0 ceph version 0.80.11
> (8424145d49264624a3b0a204aedb127835161070), process ceph-osd, pid
> 39041
> 2016-03-07 09:47:18.390320 7f98478187c0 -1
> filestore(/var/lib/ceph/osd/ceph-0) FileStore::mount : stale version
> stamp 4. Please run the FileStore update script before starting the
> OSD, or set filestore_update_to to 3
> 2016-03-07 09:47:18.390337 7f98478187c0 -1  ** ERROR: error converting
> store /var/lib/ceph/osd/ceph-0: (22) Invalid argument
>
>   How can I downgrade to Firefly successfully?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Pool and EC: objects didn't flush to a cold EC storage

2016-03-07 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Did you also set "target_max_bytes" to the size of the pool? That bit
us when we didn't have it set. The ratio then uses the
target_max_bytes to know when to flush.
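
(In other words, something along these lines; the byte value is a
placeholder for the cache pool's usable capacity, and "hotec" is the pool
from the post below.)

ceph osd pool set hotec target_max_bytes 1099511627776   # ~1 TiB, example
ceph osd pool set hotec cache_target_dirty_ratio 0.4
ceph osd pool set hotec cache_target_full_ratio 0.8
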
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW3fWJCRDmVDuy+mK58QAAWg0P/131uHJVnkZ7jFCRi8yY
iD2//WQJl5VJtcsYwqR0PxqKdMHGTDs263BpVSyUj5tMF+dgOpfvkPSlZ2Uf
SGYBXMvmUOu3WOMWfgcu6Tkt6Sai7vJtn6m8P1B+jKPEiTRqk+Apkft87JAE
rOsVM1lEwGZNH6+C8XUz13xcZeM15MTHn/QyRRhjt0cNLHxcG0/oWBBX753j
BIhde7XtORq6U0T79E6N6kd8KRE0XgOiwWa3bk9mKHWxkrc+1W53RfefwexU
rA9VkJKI+7YCh307TXF2cFEw8JPglOJdMcn5G96tb//jMBGh+kBfoT3FbM4F
Pb9LASt+DRIptZsF4DJJHLCOs6HseLmAiDp6z+wntjMITkeRGdxcA92llXz+
+/nnGKJtOZj76agXhYmkEZeEVSCiKaKC2xFqUy+p+B1UVGff+cSRt5Fz3NfB
NOSlYXbYCdahXaoKcaxa6oupep3TtjI6TBQ7JS4kHHfBMj8JHpSga4WkKqlz
e3Oz9PsDU9Tw2UVyo4zLEqgpcWcbY8E1VAAoirKAGcCqnwzwjvhGM2e1h66L
yYjepiUQ9oLbIct9MXJOSAMwctsrAYgvR1veG+vqND5ZLr+OIR7at9Vpeg8m
+oBVG+4PgxlIEfxVGf+8OjLK9sJUTm+AtLMzsbDqMFX9VQtpoTlsqYGd5gTW
9t/H
=7sfH
-END PGP SIGNATURE-
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Mar 6, 2016 at 2:17 AM, Mike Almateia  wrote:
> Hello Cephers!
>
> When my cluster hit the "full ratio" setting, objects from the cache pool didn't
> flush to cold storage.
>
> 1. Hit the 'full ratio':
>
> 2016-03-06 11:35:23.838401 osd.64 10.22.11.21:6824/31423 4327 : cluster
> [WRN] OSD near full (90%)
> 2016-03-06 11:35:55.447205 osd.64 10.22.11.21:6824/31423 4329 : cluster
> [WRN] OSD near full (90%)
> 2016-03-06 11:36:29.255815 osd.64 10.22.11.21:6824/31423 4332 : cluster
> [WRN] OSD near full (90%)
> 2016-03-06 11:37:04.769765 osd.64 10.22.11.21:6824/31423 4333 : cluster
> [WRN] OSD near full (90%)
> ...
>
> 2. Well, ok. Set the option 'ceph osd pool set hotec cache_target_full_ratio
> 0.8'.
> But none of the objects flushed at all.
>
> 3. Ok. Tried to flush all objects manually:
> [root@c1 ~]# rados -p hotec cache-flush-evict-all
> rbd_data.34d1f5746d773.00016ba9
>
> 4. After a full day the objects were still in the cache pool; they didn't flush at all:
> [root@c1 ~]# rados df
> pool name KB  objects   clones degraded  unfound
> rdrd KB   wrwr KB
> data   00000
> 64   158212215700473
> hotec  797656118 25030755000
> 370599163045649 69947951  17786794779
> rbd00000
> 0000
>   total used  2080570792 25030755
>
> Is this a bug or expected behavior?
>
> --
> Mike. runs!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing OSD drive without rempaping pg's

2016-03-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

With a fresh disk, you will need to remove the old key in ceph (ceph
auth del osd.X) and the old osd (ceph osd rm X), but I think you can
leave the CRUSH map alone (don't do ceph osd crush rm osd.X) so that
there isn't any additional data movement. If there aren't any
available OSD numbers lower than the OSD being replaced, it will get
the same ID (there may also be a way to specify an ID, but I haven't
used it). Then when you add the new disk in, it only backfills what
the previous disk had, unless the size is different, in which case it
will take on more or less data and shuffle some things around the cluster.
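
(A rough sketch of that sequence with hammer-era sysvinit commands; osd.X
and the device names are placeholders.)

ceph osd set noout
service ceph stop osd.X
ceph auth del osd.X
ceph osd rm X
# deliberately NOT running "ceph osd crush rm osd.X"
# swap the physical drive, then recreate the OSD, e.g. with ceph-disk:
ceph-disk prepare /dev/sdX
ceph-disk activate /dev/sdX1
ceph osd unset noout
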
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW1cZBCRDmVDuy+mK58QAAm9EQAMVOnCOrBkqGaczy5+ds
yplotd9kKt/eyhp1nSgPJD+4RdOVQjoL4VVLtCfApXcMfxHkW/vBjpOWD1Bh
l14NDjCzpkXM5HpHqQkiel/7thcN45u/Z7wSX8T+x9ontYn1Bv0CfI/6qaFb
DmIYdAGjdLgWKpORyeN1WjgrU5DzUbCMHw/3sLfieVpoYsh91dMuxt33366z
mcMQ6RYIE/5xpm8LkTsjYkmnl7Xes5fGsIAlx6kJDHpAoBBWEfstjgtCXIBt
PgDnBJ/SwisAQKXuQOZg87/3OE+qFQUyILwFE3USD3ugx8xvo1aUGnerY/mT
8rUNfFLCPLhdiAp1fr2kkQW/SfV7spkNkZ/v99J/9dEwSj2pgJ7iHMGNr/Em
K3oLezrm7NO2RHsMrn/pz82bO1CSzHrRQ5Aq7Re2r48zYeFxSgvcbMk6Ogzh
rDPb2q+QEw/UbIuotl09ab3OGCjzXxhfDIQ44iEUEj0l2Cl5MQQcakdYakoC
WCPaqIN7ocqiWnQPY/RnSXuhUgsd8uTBtxcXtHp+y0feAf/80nxc3dFWDfiK
8sKmt+rHoBQKQz0yhc0A0YqM8vnWYatVrVh1+SZe7iJE3/qyglNFmbJQ0O54
au/AJ7OqEy1MnJ06fIaLbSIQMXXMWdEqcib2gIKeunhLDkwoUbi+JRJLBY5X
ITts
=yIjM
-END PGP SIGNATURE-----
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Feb 29, 2016 at 10:29 PM, Lindsay Mathieson
 wrote:
> I was looking at replacing an osd drive in place as per the procedure here:
>
> http://www.spinics.net/lists/ceph-users/msg05959.html
>
> "If you are going to replace the drive immediately, set the “noout” flag.
> Take the OSD “down” and replace drive.  Assuming it is mounted in the same
> place as the bad drive, bring the OSD back up.  This will replicate exactly
> the same PGs the bad drive held back to the replacement drive."
>
>
>
> But the new drive mount will be blank - what happens with the journal,
> keyring etc? does starting the OSD process recreate them automatically?
>
>
> thanks,
>
> --
> Lindsay Mathieson
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-26 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Honestly, we are scared to try the same tests with the m600s. When we
first put them in, we had them more full, but we backed them off to
reduce the load on them. Based on that I don't expect them to fare any
better. We'd love to get more IOPs out of our clusters considering the
ability of the s3610s. We constantly tune the cluster and try to
provide code back to Ceph that helps under high loads and congestion.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW0PHnCRDmVDuy+mK58QAAxh4QAIJG5blyxFMRJ9DdF3U+
J1U47Yd1jGMNUrzSsipA6TCm7FeoKs9/y2PRTIIFdnanmKj/J3X+F2T4M+b4
oHr1V8HzbxRk6dB6q2+Z6DkXMG9I48qhTbxnksAsn/vbEDrCq1t0ctvL3Zxu
h97FRsnW5DBH/HAfble6cLZ9vbBbo394yqtR8wu/1gE0E/zpLAgAw5ZnJ4t8
n8RIWyOIwQgapUspQ3KtzGdFl1HP/MLjA/QQiexn8CEhtluwxJTZZB6Fy4q6
b60gLj3HEszo76bcExCrGETCqlT5kiy1qGJCUrHO6sQ7YHpSDVGt9o1muoGJ
FXvbGkdhSbqGYB0P5xx83ab3ZQ9Eyg2tf0hreZo9q1kyP5rXTfylr6IGgaCF
qNj0QTvcE0TYeUVIUxKkHfG0Ys06kFqdxJAEF3A4tJJp0KyBKwK7eJrj4P2H
xclQWUDMTDJk+JSufBNxo5AY94TOLhUsWieuEFGyZeW8gji+oOrIWHHilxz7
De0Xi2Y+9O/OKcKkbBE/g+Pys0S/L9ZwAId5EEMzNRXEoQwlbPVclvukpEQJ
xFiLdEJLQzwXP7hRT9lMQkHs3IKKL/0TgsfN2bszoXbHk1rN1NqMVt9BDqHr
ZGb++dyfjUFaMOM/S8WXfkxV3dtYi7LKGEn4pSQ2IyZ92REwcTWej2TPV5r9
Nq0g
=LM6/
-END PGP SIGNATURE-
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Feb 26, 2016 at 5:41 PM, Shinobu Kinjo  wrote:
> Thank you for your very precious output.
> "s3610s write iops high-load" is very interesting to me.
> Have you ever run the same set of tests on the m600s as on the s3610s?
>
>> These clusters normally service 12K IOPs with bursts up to 22K IOPs all RBD. 
>> I've seen a peak of 64K IOPs from client traffic.
>
> That's pretty good result, isn't it?
> I guess you've been tuning your cluster??
>
> Rgds,
> Shinobu
>
> - Original Message -
> From: "Robert LeBlanc" 
> To: "Shinobu Kinjo" 
> Cc: "Christian Balzer" , ceph-users@lists.ceph.com
> Sent: Saturday, February 27, 2016 8:52:34 AM
> Subject: Re: [ceph-users] List of SSDs
>
> A picture is worth a thousand words:
>
>
> The red lines are the m600s IO time (dotted) and IOPs (solid) and our
> baseline s3610s in green and our test set of s3610s in blue.
>
> We used weighting to manipulate how many PGs each SSD took. The m600s are
> 1TB while the s3610s are 800GBs and we only have the m600s about half
> filled. So we weighted the s3610s individually until they were within about
> 40 GB of the m600s. We did the same weighting to achieve similar percentage
> usage and 80% usage. This graph is stepping from 50% to 70% and finally
> very close to 80%.
>
> We have two production clusters currently, third one will be built in the
> next month all about the same size.
>
> 16 nodes, 3 - 1TB m600 drives and 9 - 4TB HGST HDDs, single E5-2640v2 and
> 64 GB RAM dual 40 Gigabit Ethernet ports, direct attached SATA. These
> clusters normally service 12K IOPs with bursts up to 22K IOPs all RBD. I've
> seen a peak of 64K IOPs from client traffic.
>
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Fri, Feb 26, 2016 at 4:05 PM, Shinobu Kinjo  wrote:
>
>> Hello,
>>
>> > We started having high wait times on the M600s so we got 6 S3610s, 6
>> M500dcs, and 6 500 GB M600s (they have the SLC to MLC conversion that we
>> thought might work better).
>>
>> Is it working better as you were expecting?
>>
>> > We have graphite gathering stats on the admin sockets for Ceph and the
>> standard system stats.
>>
>> Very cool!
>>
>> > We weighted the drives so they had the same byte usage and let them run
>> for a week or so, then made them the same percentage of used space, let
>> them run a couple of weeks, then set them to 80% full and let them run a
>> couple of weeks.
>>
>> Almost exactly the same *byte* usage? I'm pretty interested in how you
>> managed that.
>>
>> > We compared IOPS and IO time of the drives to get our comparison.
>>
>> What is your feeling about the comparison?
>>
>> > This was done on live production clusters and not synthetic benchmarks.
>>
>> How large is your production Ceph cluster?
>>
>> Rgds,
>> Shinobu
>>
>> >
>> > Hello,
>> >
>> > On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote:
>> >
>> > > We are moving to the Intel S3610, from our testing it is a good balance
>> > > between price, performance and longevity. But as with all things, do
>>

Re: [ceph-users] List of SSDs

2016-02-26 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Resending sans attachment...

A picture is worth a thousand words:

http://robert.leblancnet.us/files/s3610-load-test-20160224.png

The red lines are the m600s IO time (dotted) and IOPs (solid) and our
baseline s3610s in green and our test set of s3610s in blue.

We used weighting to manipulate how many PGs each SSD took. The m600s
are 1TB while the s3610s are 800GBs and we only have the m600s about
half filled. So we weighted the s3610s individually until they were
within about 40 GB of the m600s. We did the same weighting to achieve
similar percentage usage and 80% usage. This graph is stepping from
50% to 70% and finally very close to 80%.
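
(The weighting itself was nothing exotic; a sketch, with the osd id and
weight made up:)

ceph osd df                          # compare per-OSD utilization
ceph osd crush reweight osd.42 0.73  # nudge one SSD until usage lines up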

We have two production clusters currently; a third one, about the same
size, will be built in the next month.

16 nodes, 3 - 1TB m600 drives and 9 - 4TB HGST HDDs, single E5-2640v2
and 64 GB RAM dual 40 Gigabit Ethernet ports, direct attached SATA.
These clusters normally service 12K IOPs with bursts up to 22K IOPs
all RBD. I've seen a peak of 64K IOPs from client traffic.
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW0OgCCRDmVDuy+mK58QAAp+4P/0HJ+UU3gaAdRXyELCg5
mLifFliWYDFuabP+K5aI6mBn4qlF/1BAe6d9K8Zrcz+nZvXP+BcSEd1puUAW
GIy+5O3xJkDUM5O9lAN+jIqw0X7ple2xni3Q5/fKwgGpD1TuEjGEnZlFfRJC
8HWfw6rnL+J7WEirhhXrk+NmOvLJRaozROuzKmKcbBVS2oVtrhOPA7eiNrUz
NhN/YbvArGrQFneBO39Tp3YPn8cJ2nVgwv6eru9nnrvkEUD9nwJXlgyNf/NC
IjX+LnKET0q0ouCFbjJGaUm4+tvNWWtXypYpcdC78RF+XMdsYHMKAikQ0aG7
7UbYlvf+DhFPqskXhpaB1+lEj+qyhYNwvaxt5QtYsuPK7zDfbV23ed/aiw7c
58q3ROMmIZGsVyBh3fR7EAvKcp3W8KQr9JUq3K3vLcWplNZsuvg4QZIx0ia2
YfGzBsJKugxMVGbmqnXCAcjUyEI/haoovIdMOVBWw8Uv8R9m2IpoNXgqsqi1
xJjIJ5pmiwMZliq2YLwcUy/6e3uPpPRYhgRkkHr167DDB0A5ijI7Y8Q5GX28
AeraQSHLBtOtyrXBcFCtZv2YVbl2juwwC2lNXHJZBd0b/iUDnrBA358U0crm
+TqyYR7LoZiUjUMI0HZzjeyVIsST201R6uQ1Tv9b6DFAOxDMPWD7ViJLcSIO
yAiI
=vXUO
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Feb 26, 2016 at 4:05 PM, Shinobu Kinjo  wrote:
> Hello,
>
>> We started having high wait times on the M600s so we got 6 S3610s, 6 
>> M500dcs, and 6 500 GB M600s (they have the SLC to MLC conversion that we 
>> thought might work better).
>
> Is it working better as you were expecting?
>
>> We have graphite gathering stats on the admin sockets for Ceph and the 
>> standard system stats.
>
> Very cool!
>
>> We weighted the drives so they had the same byte usage and let them run for 
>> a week or so, then made them the same percentage of used space, let them run 
>> a couple of weeks, then set them to 80% full and let them run a couple of 
>> weeks.
>
> Almost exactly the same *byte* usage? I'm pretty interested in how you managed
> that.
>
>> We compared IOPS and IO time of the drives to get our comparison.
>
> What is your feeling about the comparison?
>
>> This was done on live production clusters and not synthetic benchmarks.
>
> How large is your production Ceph cluster?
>
> Rgds,
> Shinobu
>
>>
>> Hello,
>>
>> On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote:
>>
>> > We are moving to the Intel S3610, from our testing it is a good balance
>> > between price, performance and longevity. But as with all things, do your
>> > testing ahead of time. This will be our third model of SSDs for our
>> > cluster. The S3500s didn't have enough life and performance tapers off
>> > as they get full. The Micron M600s looked good with the Sebastian journal
>> > tests, but once in use for a while they go downhill pretty badly. We also tested
>> > Micron M500dc drives and they were on par with the S3610s and are more
>> > expensive and are closer to EoL. The S3700s didn't have quite the same
>> > performance as the S3610s, but they will last forever and are very stable
>> > in terms of performance and have the best power loss protection.
>> >
>> That's interesting, how did you come to that conclusion and how did you
>> test it?
>> Also which models did you compare?
>>
>>
>> > Short answer is test them for yourself to make sure they will work. You
>> > are pretty safe with the Intel S3xxx drives. The Micron M500dc is also
>> > pretty safe based on my experience. It had also been mentioned that
>> > someone has had good experience with a Samsung DC Pro (has to have both
>> > DC and Pro in the name), but we weren't able to get any quick enough to
>> > test so I can't vouch for them.
>> >
>> I have some Samsung DC Pro EVOs in production (non-Ceph, see that
>> non-barrier thread).
>> They do have issues with LSI o

Re: [ceph-users] List of SSDs

2016-02-25 Thread Robert LeBlanc
We replaced 32 S3500s with 48 Micron M600s in our production cluster. The
S3500s were only doing journals because they were too small and we still
ate 3-4% of their life in a couple of months. We started having high wait
times on the M600s so we got 6 S3610s, 6 M500dcs, and 6 500 GB M600s (they
have the SLC to MLC conversion that we thought might work better). And we
swapped out 18 of the M600s throughout our cluster with these test drives.
We have graphite gathering stats on the admin sockets for Ceph and the
standard system stats. We weighted the drives so they had the same byte
usage and let them run for a week or so, then made them the same percentage
of used space, let them run a couple of weeks, then set them to 80% full
and let them run a couple of weeks. We compared IOPS and IO time of the
drives to get our comparison. This was done on live production clusters and
not synthetic benchmarks. Some of the data about the S3500s is from my test
cluster that has them.
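
(The admin-socket collection is just the standard perf counter dump; the
osd id and socket path below are illustrative.)

ceph daemon osd.12 perf dump
# or, pointing at the socket directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok perf dump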

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:20 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Wed, 24 Feb 2016 22:56:15 -0700 Robert LeBlanc wrote:
>
> > We are moving to the Intel S3610, from our testing it is a good balance
> > between price, performance and longevity. But as with all things, do your
> > testing ahead of time. This will be our third model of SSDs for our
> > cluster. The S3500s didn't have enough life and performance tapers off
> > as they get full. The Micron M600s looked good with the Sebastian journal
> > tests, but once in use for a while they go downhill pretty badly. We also tested
> > Micron M500dc drives and they were on par with the S3610s and are more
> > expensive and are closer to EoL. The S3700s didn't have quite the same
> > performance as the S3610s, but they will last forever and are very stable
> > in terms of performance and have the best power loss protection.
> >
> That's interesting, how did you come to that conclusion and how did you
> test it?
> Also which models did you compare?
>
>
> > Short answer is test them for yourself to make sure they will work. You
> > are pretty safe with the Intel S3xxx drives. The Micron M500dc is also
> > pretty safe based on my experience. It had also been mentioned that
> > someone has had good experience with a Samsung DC Pro (has to have both
> > DC and Pro in the name), but we weren't able to get any quick enough to
> > test so I can't vouch for them.
> >
> I have some Samsung DC Pro EVOs in production (non-Ceph, see that
> non-barrier thread).
> They do have issues with LSI occasionally, haven't gotten around to make
> that FS non-barrier to see if it fixes things.
>
> The EVOs are also similar to the Intel DC S3500s, meaning that they are
> not really suitable for Ceph due to their endurance.
>
> Never tested the "real" DC Pro ones, but they are likely to be OK.
>
> Christian
>
> > Sent from a mobile device, please excuse any typos.
> > On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:
> >
> > > Hello,
> > >
> > > There has been a bunch of discussion about using SSD.
> > > Does anyone have any list of SSDs describing which SSD is highly
> > > recommended, which SSD is not.
> > >
> > > Rgds,
> > > Shinobu
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-25 Thread Robert LeBlanc
I was only testing one SSD per node and it used 3.5-4.5 cores on my 8 core
Atom boxes. I've also set these boxes to only 4 GB of RAM to reduce the
effects of page cache. So no, I still had some headroom, but I was also
running fio on my nodes too. I don't remember how much idle I had overall,
but there was some.

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:15 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> On Wed, 24 Feb 2016 23:01:43 -0700 Robert LeBlanc wrote:
>
> > With my S3500 drives in my test cluster, the latest master branch gave me
> > an almost 2x increase in performance compared to just a month or two ago.
> > There looks to be some really nice things coming in Jewel around SSD
> > performance. My drives are now 80-85% busy doing about 10-12K IOPS when
> > doing 4K fio to libRBD.
> >
> That's good news, but then again the future is always bright. ^o^
> Before that (or even now with the SSDs still 15% idle), were you
> exhausting your CPUs or are they also still not fully utilized as I am
> seeing below?
>
> Christian
>
> > Sent from a mobile device, please excuse any typos.
> > On Feb 24, 2016 8:10 PM, "Christian Balzer"  wrote:
> >
> > >
> > > Hello,
> > >
> > > For posterity and of course to ask some questions, here are my
> > > experiences with a pure SSD pool.
> > >
> > > SW: Debian Jessie, Ceph Hammer 0.94.5.
> > >
> > > HW:
> > > 2 nodes (thus replication of 2) with each:
> > > 2x E5-2623 CPUs
> > > 64GB RAM
> > > 4x DC S3610 800GB SSDs
> > > Infiniband (IPoIB) network
> > >
> > > Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> > > Ceph journal is inline (journal file).
> > >
> > > Performance:
> > > A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks)
> > > gives me about 620MB/s, the storage nodes are I/O bound (all SSDs are
> > > 100% busy according to atop) and this meshes nicely with the speeds I
> > > saw when testing the individual SSDs with fio before involving Ceph.
> > >
> > > To elaborate on that, an individual SSD of that type can do about
> > > 500MB/s sequential writes, so ideally you would see 1GB/s writes with
> > > Ceph (500*8/2(replication)/2(journal on same disk).
> > > However my experience tells me that other activities (FS journals,
> > > leveldb PG updates, etc) impact things as well.
> > >
> > > A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
> > > blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.
> > > All OSD processes are using about 2 cores and the OS another 2, but
> > > that leaves about 6 cores unused (MHz on all cores scales to max
> > > during the test run).
> > > Closer inspection with all CPUs being displayed in atop shows that no
> > > single core is fully used, they all average around 40% and even the
> > > busiest ones (handling IRQs) still have ample capacity available.
> > > I'm wondering if this is an indication of insufficient parallelism or if
> > > it's latency of sorts.
> > > I'm aware of the many tuning settings for SSD based OSDs, however I was
> > > expecting to run into a CPU wall first and foremost.
> > >
> > >
> > > Write amplification:
> > > 10 second rados bench with 4MB blocks, 6348MB written in total.
> > > nand-writes per SSD:118*32MB=3776MB.
> > > 30208MB total written to all SSDs.
> > > Amplification:4.75
> > >
> > > Very close to what you would expect with a replication of 2 and
> > > journal on same disk.
> > >
> > >
> > > 10 second rados bench with 4KB blocks, 219MB written in total.
> > > nand-writes per SSD:41*32MB=1312MB.
> > > 10496MB total written to all SSDs.
> > > Amplification:48!!!
> > >
> > > Le ouch.
> > > In my use case with rbd cache on all VMs I expect writes to be rather
> > > large for the most part and not like this extreme example.
> > > But as I wrote the last time I did this kind of testing, this is an
> > > area where caveat emptor most definitely applies when planning and
> > > buying SSDs. And where the Ceph code could probably do with some
> > > attention.
> > >
> > > Regards,
> > >
> > > Christian
> > > --
> > > Christian BalzerNetwork/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can not disable rbd cache

2016-02-25 Thread Robert LeBlanc
My guess would be that if you are already running hammer on the client it
is already using the new watcher API. This would be a fix on the OSDs to
allow the object to be moved because the current client is smart enough to
try again. It would be watchers per object.

Sent from a mobile device, please excuse any typos.
On Feb 25, 2016 9:10 PM, "Christian Balzer"  wrote:

> On Thu, 25 Feb 2016 10:07:37 -0500 (EST) Jason Dillaman wrote:
>
> > > > Let's start from the top. Where are you stuck with [1]? I have
> > > > noticed that after evicting all the objects with RBD that one object
> > > > for each active RBD is still left, I think this is the head object.
> > > Precisely.
> > > That came up in my extensive tests as well.
> >
> > Is this in reference to the RBD image header object (i.e. XYZ.rbd or
> > rbd_header.XYZ)?
> Yes.
>
> > The cache tier doesn't currently support evicting
> > objects that are being watched.  This guard was added to the OSD because
> > it wasn't previously possible to alert clients that a watched object had
> > encountered an error (such as it no longer exists in the cache tier).
> > Now that Hammer (and later) librbd releases will reconnect the watch on
> > error (eviction), perhaps this guard can be loosened [1].
> >
> > [1] http://tracker.ceph.com/issues/14865
> >
>
> How do I interpret "all watchers" in the issue above?
> As in, all watchers of an object, or all watchers in general.
>
> If it is per object (which I guess/hope), then this fix would mean that
> after an upgrade to Hammer or later on the client side a restart of the VM
> would allow the header object to be evicted, while the header objects for
> VMs that have been running since the dawn of time can not.
>
> Correct?
>
> This would definitely be better than having to stop the VM, flush things
> and then start it up again.
>
> Christian
>
> > --
> >
> > Jason
> >
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Observations with a SSD based pool under Hammer

2016-02-24 Thread Robert LeBlanc
With my S3500 drives in my test cluster, the latest master branch gave me
an almost 2x increase in performance compared to just a month or two ago.
There looks to be some really nice things coming in Jewel around SSD
performance. My drives are now 80-85% busy doing about 10-12K IOPS when
doing 4K fio to libRBD.
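
(For reference, a 4K librbd fio run along these lines; the pool/image
names are placeholders and the image must already exist.)

fio --name=rbd-4k-randwrite --ioengine=rbd --clientname=admin \
  --pool=rbd --rbdname=fio-test --rw=randwrite --bs=4k \
  --iodepth=32 --numjobs=1 --time_based --runtime=60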

Sent from a mobile device, please excuse any typos.
On Feb 24, 2016 8:10 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> For posterity and of course to ask some questions, here are my experiences
> with a pure SSD pool.
>
> SW: Debian Jessie, Ceph Hammer 0.94.5.
>
> HW:
> 2 nodes (thus replication of 2) with each:
> 2x E5-2623 CPUs
> 64GB RAM
> 4x DC S3610 800GB SSDs
> Infiniband (IPoIB) network
>
> Ceph: no tuning or significant/relevant config changes, OSD FS is Ext4,
> Ceph journal is inline (journal file).
>
> Performance:
> A test run with "rados -p cache  bench 30 write -t 32" (4MB blocks) gives
> me about 620MB/s, the storage nodes are I/O bound (all SSDs are 100% busy
> according to atop) and this meshes nicely with the speeds I saw when
> testing the individual SSDs with fio before involving Ceph.
>
> To elaborate on that, an individual SSD of that type can do about 500MB/s
> sequential writes, so ideally you would see 1GB/s writes with Ceph
> (500*8/2(replication)/2(journal on same disk).
> However my experience tells me that other activities (FS journals, leveldb
> PG updates, etc) impact things as well.
>
> A test run with "rados -p cache  bench 30 write -t 32 -b 4096" (4KB
> blocks) gives me about 7200 IOPS, the SSDs are about 40% busy.
> All OSD processes are using about 2 cores and the OS another 2, but that
> leaves about 6 cores unused (MHz on all cores scales to max during the
> test run).
> Closer inspection with all CPUs being displayed in atop shows that no
> single core is fully used, they all average around 40% and even the
> busiest ones (handling IRQs) still have ample capacity available.
> I'm wondering if this is an indication of insufficient parallelism or if it's
> latency of sorts.
> I'm aware of the many tuning settings for SSD based OSDs, however I was
> expecting to run into a CPU wall first and foremost.
>
>
> Write amplification:
> 10 second rados bench with 4MB blocks, 6348MB written in total.
> nand-writes per SSD:118*32MB=3776MB.
> 30208MB total written to all SSDs.
> Amplification:4.75
>
> Very close to what you would expect with a replication of 2 and journal on
> same disk.
>
>
> 10 second rados bench with 4KB blocks, 219MB written in total.
> nand-writes per SSD:41*32MB=1312MB.
> 10496MB total written to all SSDs.
> Amplification:48!!!
>
> Le ouch.
> In my use case with rbd cache on all VMs I expect writes to be rather
> large for the most part and not like this extreme example.
> But as I wrote the last time I did this kind of testing, this is an area
> where caveat emptor most definitely applies when planning and buying SSDs.
> And where the Ceph code could probably do with some attention.
>
> Regards,
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List of SSDs

2016-02-24 Thread Robert LeBlanc
We are moving to the Intel S3610, from our testing it is a good balance
between price, performance and longevity. But as with all things, do your
testing ahead of time. This will be our third model of SSDs for our
cluster. The S3500s didn't have enough life and performance tapers off as
they get full. The Micron M600s looked good with the Sebastian journal
tests, but once in use for a while they go downhill pretty badly. We also tested
Micron M500dc drives and they were on par with the S3610s and are more
expensive and are closer to EoL. The S3700s didn't have quite the same
performance as the S3610s, but they will last forever and are very stable
in terms of performance and have the best power loss protection.

Short answer is test them for yourself to make sure they will work. You are
pretty safe with the Intel S3xxx drives. The Micron M500dc is also pretty
safe based on my experience. It had also been mentioned that someone has
had good experience with a Samsung DC Pro (has to have both DC and Pro in
the name), but we weren't able to get any quick enough to test so I can't
vouch for them.
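
(The usual quick screen for journal suitability is a single-job O_DSYNC 4K
write test along these lines; it is destructive to whatever is on the
device, and /dev/sdX is a placeholder.)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
  --numjobs=1 --iodepth=1 --runtime=60 --time_based \
  --group_reporting --name=ssd-journal-test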

Sent from a mobile device, please excuse any typos.
On Feb 24, 2016 6:37 PM, "Shinobu Kinjo"  wrote:

> Hello,
>
> There has been a bunch of discussion about using SSD.
> Does anyone have any list of SSDs describing which SSD is highly
> recommended, which SSD is not.
>
> Rgds,
> Shinobu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-24 Thread Robert LeBlanc
We have not seen this issue, but we don't run EC pools yet (we are waiting
for multiple layers to be available). We are not running 0.94.6 in
production yet either. We have adopted the policy to only run released
versions in production unless there is a really pressing need to have a
patch. We are running 0.94.6 through our alpha and staging clusters and
hoping to do the upgrade in the next couple of weeks. We won't know how
much the recency fix will help until then because we have not been able to
replicate our workload with fio accurately enough to get good test results.
Unfortunately we will probably be swapping out our M600s with S3610s. We've
burned through 30% of the life in 2 months and they have 8x the op latency.
Due to the 10 Minutes of Terror, we are going to have to do both at the
same time to reduce the impact. Luckily, when you have weighted out OSDs or
empty ones, it is much less impactful. If you get your upgrade done before
ours, I'd like to know how it went. I'll be posting the results from ours
when it is done.
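
(The wear figures come straight from SMART; a quick sketch, keeping in
mind that attribute names vary by vendor and firmware:)

smartctl -A /dev/sdX | egrep -i 'wear|percent|written'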

Sent from a mobile device, please excuse any typos.
On Feb 24, 2016 5:43 PM, "Christian Balzer"  wrote:

>
> Hello Jason (Ceph devs et al),
>
> On Wed, 24 Feb 2016 13:15:34 -0500 (EST) Jason Dillaman wrote:
>
> > If you run "rados -p  ls | grep "rbd_id." and
> > don't see that object, you are experiencing that issue [1].
> >
> > You can attempt to work around this issue by running "rados -p irfu-virt
> > setomapval rbd_id. dummy value" to force-promote the object
> > to the cache pool.  I haven't tested / verified that will alleviate the
> > issue, though.
> >
> > [1] http://tracker.ceph.com/issues/14762
> >
>
> This concerns me greatly, as I'm about to phase in a cache tier this
> weekend into a very busy, VERY mission critical Ceph cluster.
> That is on top of a replicated pool, Hammer.
>
> That issue and the related git blurb are less than crystal clear, so for
> my and everybody else's benefit could you elaborate a bit more on this?
>
> 1. Does this only affect EC base pools?
> 2. Is this a regression of sorts, and when did it come about?
>I have a hard time imagining people not running into this earlier,
>unless that problem is very hard to trigger.
> 3. One assumes that this isn't fixed in any released version of Ceph,
>correct?
>
> Robert, sorry for CC'ing you, but AFAICT your cluster is about the closest
> approximation in terms of busyness to mine here.
> And I a assume that you're neither using EC pools (since you need
> performance, not space) and haven't experienced this bug all?
>
> Also, would you consider the benefits of the recency fix (thanks for
> that) to be worth the risk of being an early adopter of 0.94.6?
> In other words, are you eating your own dog food already and 0.94.6 hasn't
> eaten your data babies yet? ^o^
>
> Regards,
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>


Re: [ceph-users] Can not disable rbd cache

2016-02-24 Thread Robert LeBlanc

Let's start from the top: where are you stuck with [1]? I have noticed
that after evicting all the objects, one object for each active RBD
image is still left; I think this is the head object. We haven't
tried this, but our planned procedure for finishing the deactivation
of a cache tier is to shut down the active VM, then flush again, and
then start the VM again. Once all VMs have been stopped, flushed and
restarted, we should be able to remove the cache tier. That way we
don't have to stop all the VMs at once or for long periods of time. I
hope at some point the last object can be flushed without shutting
down the VM.
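
(For reference, the command sequence from [1] below looks roughly like this; a
sketch only, with pool names as placeholders, and as said above we have not run
it end to end yet:)

  # Stop new objects from being promoted into the cache pool, then flush and
  # evict everything that can be evicted.
  ceph osd tier cache-mode <cache-pool> forward
  rados -p <cache-pool> cache-flush-evict-all

  # Per the procedure above: stop the VM, flush/evict again, start the VM,
  # and repeat per VM until the cache pool is empty.

  # Once the cache pool is empty, detach it from the base pool.
  ceph osd tier remove-overlay <base-pool>
  ceph osd tier remove <base-pool> <cache-pool>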

If you are experiencing something different, please provide some more
info, especially more detailed steps of what you tried.

[1] 
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/?highlight=cache#removing-a-cache-tier
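
(As an aside on the thread's original question about the client-side RBD cache:
a common way to check whether a running client actually picked up "rbd cache =
false" is to ask it over its admin socket; a sketch, assuming an admin socket
is configured for the client:)

  # In ceph.conf, [client] section, something like:
  #   admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
  # Then, while the client (e.g. the VM's librbd) is running:
  ceph --admin-daemon /var/run/ceph/<client-socket>.asok config show | grep rbd_cache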
--------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Feb 24, 2016 at 4:29 AM, Oliver Dzombic  wrote:
> Hi Esta,
>
> How do you know that it's still active?
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 24.02.2016 um 12:27 schrieb wikison:
>> Hi,
>> I want to disable the rbd cache in my ceph cluster. I've set *rbd cache*
>> to false in the [client] section of ceph.conf and rebooted the
>> cluster, but the caching system was still working. How can I disable the
>> RBD caching system? Any help?
>>
>> best regards.
>>
>> 2016-02-24
>> 
>> Esta Wang
>>
>>


Re: [ceph-users] Crush map customization for production use

2016-02-24 Thread Robert LeBlanc

I think I saw someone say that they had issues with "step take" when
it was not a "root" node. Otherwise it looks good to me. The "step
chooseleaf firstn 0 type chassis" says to pick one OSD from a different
chassis for each copy, where 0 means to take as many as the replication
factor. Since an OSD can only be in one server, you are accomplishing
what you want.
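
(One way to sanity-check a rule like this before injecting the map is to run it
through crushtool; a sketch, with file names as placeholders:)

  # Compile the edited text map and simulate placements for ruleset 1.
  crushtool -c crushmap.txt -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings
  # --show-bad-mappings flags any input where CRUSH could not find enough
  # distinct chassis.
  crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-bad-mappings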
----
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Feb 24, 2016 at 4:09 AM, Vickey Singh
 wrote:
> Hello Geeks
>
> Can someone please review and comment on my custom CRUSH map? I would
> really appreciate your help.
>
>
> My setup: 1 rack, 4 chassis, 3 storage nodes per chassis (so 12 storage
> nodes in total), pool size = 3
>
> What I want to achieve is:
> - Survive chassis failures: even if I lose 2 complete chassis (containing
> 3 nodes each), data should not be lost
> - The CRUSH ruleset should store each copy on a unique chassis and host
>
> For example :
> copy 1 ---> c1-node1
> copy 2 ---> c2-node3
> copy 3 ---> c4-node2
>
>
>
> Here is my crushmap
> =
>
> chassis block_storage_chassis_4 {
> id -17 # do not change unnecessarily
> # weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c4-node1 weight 54.450
> item c4-node2 weight 54.450
> item c4-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_3 {
> id -16 # do not change unnecessarily
> # weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c3-node1 weight 54.450
> item c3-node2 weight 54.450
> item c3-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_2 {
> id -15 # do not change unnecessarily
> # weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c2-node1 weight 54.450
> item c2-node2 weight 54.450
> item c3-node3 weight 54.450
>
> }
>
> chassis block_storage_chassis_1 {
> id -14 # do not change unnecessarily
> # weight 163.350
> alg straw
> hash 0 # rjenkins1
> item c1-node1 weight 54.450
> item c1-node2 weight 54.450
> item c1-node3 weight 54.450
>
> }
>
> rack block_storage_rack_1 {
> id -10 # do not change unnecessarily
> # weight 174.240
> alg straw
> hash 0 # rjenkins1
> item block_storage_chassis_1 weight 163.350
> item block_storage_chassis_2 weight 163.350
> item block_storage_chassis_3 weight 163.350
> item block_storage_chassis_4 weight 163.350
>
> }
>
> class block_storage {
> id -6 # do not change unnecessarily
> # weight 210.540
> alg straw
> hash 0 # rjenkins1
> item block_storage_rack_1 weight 656.400
> }
>
> rule ruleset_block_storage {
> ruleset 1
> type replicated
> min_size 1
> max_size 10
> step take block_storage
> step chooseleaf firstn 0 type chassis
> step emit
> }
>


Re: [ceph-users] Incorrect output from ceph osd map command

2016-02-23 Thread Robert LeBlanc

ceph pg dump

Since all objects map to a PG, as long as you can verify that no PG has
two of its copies on the same host/chassis/rack, you are good.
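
(For example, using the object from the question further down; a rough sketch,
and the exact fields printed by "ceph osd find" vary a bit between releases:)

  # Map the object to its PG and acting set, then look up where each of those
  # OSDs sits in the CRUSH tree (host/chassis/rack).
  ceph osd map gold rb.0.10f61.238e1f29.2ac5
  for osd in 38 0 20; do      # the acting set printed by the command above
      ceph osd find $osd      # shows the OSD's address and crush location
  done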

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Feb 23, 2016 at 3:33 PM, Vickey Singh
 wrote:
> Adding community for further help on this.
>
> On Tue, Feb 23, 2016 at 10:57 PM, Vickey Singh 
> wrote:
>>
>>
>>
>> On Tue, Feb 23, 2016 at 9:53 PM, Gregory Farnum 
>> wrote:
>>>
>>>
>>>
>>> On Tuesday, February 23, 2016, Vickey Singh 
>>> wrote:
>>>>
>>>> Thanks Greg,
>>>>
>>>> Do you mean the ceph osd map command is not displaying accurate
>>>> information?
>>>>
>>>> I guess one of these things is happening with my cluster:
>>>> - ceph osd map is not printing true information
>>>> - object-to-PG mapping is not correct (one object is mapped to multiple
>>>> PGs)
>>>>
>>>> This is happening for several objects, but the cluster is healthy.
>>>
>>>
>>> No, you're looking for the map command to do something it was not
>>> designed for. If you want to see if an object exists, you will need to use a
>>> RADOS client to fetch the object and see if it's there. "map" is a mapping
>>> command: given an object name, which PG/OSD does CRUSH map that name to?
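
(In other words, something like this; a sketch using the object from this
thread. "ceph osd map" only computes the mapping, while "rados stat" actually
asks the OSD whether the object exists:)

  ceph osd map gold rb.0.10f61.238e1f29.2ac5    # always prints a PG/OSD mapping
  rados -p gold stat rb.0.10f61.238e1f29.2ac5   # returns ENOENT if the object is absent
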
>>
>>
>> Well, your 6th sense is amazing :)
>>
>> This is exactly what I want to achieve: I want to see my PG/OSD mapping for
>> objects. (Basically, I have changed my CRUSH hierarchy, and now I want to
>> verify that no 2 copies of an object end up on a single host / chassis /
>> rack.) To verify that, I was using the ceph osd map command.
>>
>> Is there a smarter way to achieve this?
>>
>>
>>
>>
>>>
>>>
>>>>
>>>>
>>>> Need expert suggestion.
>>>>
>>>>
>>>> On Tue, Feb 23, 2016 at 7:20 PM, Gregory Farnum 
>>>> wrote:
>>>>>
>>>>> This is not a bug. The map command just says which PG/OSD an object
>>>>> maps to; it does not go out and query the osd to see if there actually is
>>>>> such an object.
>>>>> -Greg
>>>>>
>>>>>
>>>>> On Tuesday, February 23, 2016, Vickey Singh
>>>>>  wrote:
>>>>>>
>>>>>> Hello Guys
>>>>>>
>>>>>> I am getting weird output from osd map. The object does not exist in
>>>>>> the pool, but osd map still shows the PG and OSD on which it is stored.
>>>>>>
>>>>>> So I have an RBD device coming from pool 'gold'; this image has an
>>>>>> object 'rb.0.10f61.238e1f29.2ac5'
>>>>>>
>>>>>> The below commands verifies this
>>>>>>
>>>>>> [root@ceph-node1 ~]# rados -p gold ls | grep -i
>>>>>> rb.0.10f61.238e1f29.2ac5
>>>>>> rb.0.10f61.238e1f29.2ac5
>>>>>> [root@ceph-node1 ~]#
>>>>>>
>>>>>> This object lives in pool gold on OSDs 38,0,20, which is correct
>>>>>>
>>>>>> [root@ceph-node1 ~]# ceph osd map gold
>>>>>> rb.0.10f61.238e1f29.2ac5
>>>>>> osdmap e1357 pool 'gold' (1) object 'rb.0.10f61.238e1f29.2ac5'
>>>>>> -> pg 1.11692600 (1.0) -> up ([38,0,20], p38) acting ([38,0,20], p38)
>>>>>> [root@ceph-node1 ~]#
>>>>>>
>>>>>>
>>>>>>

Re: [ceph-users] Ceph and its failures

2016-02-23 Thread Robert LeBlanc

You probably haven't written to any objects after fixing the problem.
Do some client I/O on the cluster and the PG will show fixed again. I
had this happen to me as well.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Tue, Feb 23, 2016 at 2:08 PM, Nmz  wrote:
>>> ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
>>>
>>> Ceph contains
>>>  MON: 3
>>>  OSD: 3
>>>
>> For completeness sake, the OSDs are on 3 different hosts, right?
>
> It is a single machine. I'm doing tests only.
>
>>> File system: ZFS
>> That is the odd one out, very few people I'm aware of use it, support for
>> it is marginal at best.
>> And some of its features may of course obscure things.
>
> I've been using ZFS on Linux for a long time and I'm happy with it.
>
>
>> Exact specification please, as in how is ZFS configured (single disk,
>> raid-z, etc)?
>
> 2 disks in mirror mode.
>
>>> Kernel: 4.2.6
>>>
>> While probably not related, I vaguely remember 4.3 being recommended for
>> use with Ceph.
>
> At this time I can run only this kernel. But IF I decide to use Ceph (only if
> Ceph satisfies the requirements) I can use any other kernel.
>
>>> 3. Does Ceph have auto heal option?
>> No.
>> And neither is the repair function a good idea w/o checking the data on
>> disk first.
>> This is my biggest pet peeve with Ceph and you will find it mentioned
>> frequently in this ML, just a few days ago this thread for example:
>> "pg repair behavior? (Was: Re: getting rid of misplaced objects)"
>
> It is very strange to recover data manually without knowing which data is good.
> If I have 3 copies of the data and 2 of them are corrupted, then I could end up
> recovering the bad one.
>
>
> --
>
> I did a new test. Now the 3 new OSDs are on different systems. The FS is ext3.
>
> Same start as before.
>
> # grep "a" * -R
> Binary file 
> osd/nmz-5/current/17.17_head/rbd\udata.1bef77ac761fb.0001__head_FB98F317__11
>  matches
> Binary file osd/nmz-5-journal/journal matches
>
> # ceph pg dump | grep 17.17
> dumped all in format plain
> 17.17   1   0   0   0   0   40961   1   
> active+clean2016-02-23 16:14:32.234638  291'1   309:44  [5,4,3] 5 
>   [5,4,3] 5   0'0 2016-02-22 20:30:04.255301  0'0 2016-02-22 
> 20:30:04.255301
>
> # md5sum rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11
> \c2642965410d118c7fe40589a34d2463  
> rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11
>
> # sed -i -r 's/aa/ab/g' 
> rbd\\udata.1bef77ac761fb.0001__head_FB98F317__11
>
>
> # ceph pg deep-scrub 17.17
>
> 7fbd99e6c700  0 log_channel(cluster) log [INF] : 17.17 deep-scrub starts
> 7fbd97667700  0 log_channel(cluster) log [INF] : 17.17 deep-scrub ok
>
> -- restarting OSD.5
>
> # ceph pg deep-scrub 17.17
>
> 7f00f40b8700  0 log_channel(cluster) log [INF] : 17.17 deep-scrub starts
> 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 shard 5: soid 
> 17/fb98f317/rbd_data.1bef77ac761fb.0001/head data_digest 
> 0x389d90f6 != known data_digest 0x4f18a4a5 from auth shard 3, missing attr _, 
> missing attr snapset
> 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 deep-scrub 0 missing, 
> 1 inconsistent objects
> 7f00f68bd700 -1 log_channel(cluster) log [ERR] : 17.17 deep-scrub 1 errors
>
>
> Ceph 9.2.0 bug ?
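
(For reference, the manual cross-check mentioned earlier in the thread --
comparing the on-disk replicas before trusting any repair -- looks roughly like
the following. A sketch only: the PG and object name are taken from the test
above, and the paths assume the default FileStore layout under /var/lib/ceph.)

  # On each host holding a replica of PG 17.17 (OSDs 5, 4 and 3 in this test),
  # checksum the object's on-disk copy and compare the results by hand.
  md5sum /var/lib/ceph/osd/ceph-*/current/17.17_head/rbd*data.1bef77ac761fb*

  # Only after deciding which replica is good, re-run the scrub (and, if you
  # accept the risk discussed above, "ceph pg repair 17.17").
  ceph pg deep-scrub 17.17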
>
>


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Robert LeBlanc

I'm still going to see if I can get Ceph clients to hardly notice that
an OSD comes back in. Our setup is EXT4, and our SSDs have the hardest
time with the longest recovery impact. It should be painless no matter
how slow the drives/CPU/etc. are. If it means waiting to service client
I/O until all the peering and related work (not including
backfilling/recovery, because that can already be done in the background
without much impact) is completed before sending the client
I/O to the OSD, then that is what I'm going to target. That way, if it
takes 5 minutes for the OSD to get its bearings because it is swapping
due to low memory or whatever, the clients happily ignore the OSD
until it says it is ready, and the client I/O isn't all fighting
to get a piece of scarce resources.

I appreciate all the suggestions that have been mentioned and believe
that there is a fundamental issue here that causes a problem when you
run your hardware into the red zone (like we have to do out of
necessity). You may be happy with how things are set-up in your
environment, but I'm not ready to give up on it and I think we can
make it better. That way it "Just Works" (TM) with more hardware and
configurations and doesn't need tons of effort to get it tuned just
right. Oh, and be careful not to touch it, or the balance of the force
might get thrown off and the whole thing will tank. That does not make
me feel confident. Ceph is so resilient in so many ways already, why
should this be an Achilles heel for some?

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen  wrote:
>> > Next this :
>> > ---
>> > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
>> > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
>> > ---
>> > Another minute to load the PGs.
>> Same OSD reboot as above : 8 seconds for this.
>
> Do you really have 564 pgs on a single OSD?  I've never had anything like
> decent performance on an OSD with greater than about 150pgs.  In our
> production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd total
> (with size set to 3).  When we initially deployed our large cluster with
> 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no end of
> trouble getting pgs to peer.  The OSDs ate RAM like nobody's business, took
> forever to do anything, and in general caused problems.  If you're running
> 564 pgs/osd in this 4 OSD cluster, I'd look at that first as the potential
> culprit.  That is a lot of threads inside the OSD process that all need to
> get CPU/network/disk time in order to peer as they come up.  Especially on
> firefly I would point to this.  We've moved to Hammer and that did improve a
> number of our performance bottlenecks, though we've also grown our cluster
> without adding pgs, so we are now down in the 25-30 primary pgs/osd range,
> and restarting osds, or whole nodes (24-32 OSDs for us) no longer causes us
> pain.  In the past restarting a node could cause 5-10 minutes of peering and
> pain/slow requests/unhappiness of various sorts (RAM exhaustion, OOM Killer,
> Flapping OSDs).  This all improved greatly once we got our pg/osd count
> under 100 even before we upgraded to hammer.
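
(As a back-of-the-envelope check, not from the original mail: PGs per OSD is
roughly sum(pg_num x size) / number of OSDs, so 564 PGs on each of 4 OSDs with
size 3 would imply about 4 x 564 / 3 = 752 PGs in total. The per-pool numbers
that feed that calculation can be read straight from the OSD map:)

  # Lists every pool with its replicated size and pg_num.
  ceph osd dump | grep '^pool'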
>
>
>
>
>
> On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton
>  wrote:
>>
>> Hi,
>>
>> Le 13/02/2016 15:52, Christian Balzer a écrit :
>> > [..]
>> >
>> > Hum that's surprisingly long. How much data (size and nb of files) do
>> > you have on this OSD, which FS do you use, what are the mount options,
>> > what is the hardware and the kind of access ?
>> >
>> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and
>> > a
>> > 7 disk RAID6 per OSD.
>> > Not

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-12 Thread Robert LeBlanc
Christian,

Yep, that describes what I see too. The good news is that I made a lot of
progress on optimizing the queue today: a 10-50% performance increase in my
microbenchmarks (that is only the improvement in enqueueing and dequeueing
ops, which is a small part of the whole IO path, but every little bit
helps). I have some more knobs to turn in my microbenchmarks, then I'll run it
through some real tests, document the results, and submit the pull
request, hopefully mid next week. Then the next thing to look into is what
I've affectionately called "The 10 Minutes of Terror", which can be anywhere
from 5 minutes to 20 minutes in our cluster. It is our biggest pain point after
the starved IO situation, which the new queue I wrote goes a long way toward
mitigating, and the cache promotion/demotion issues, which Nick and Sage
have been working on (thanks for all the work on that).

I hope in a few weeks time I can have a report on what I find. Hopefully we
can have it fixed for Jewel and Hammer. Fingers crossed.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Feb 12, 2016 10:32 PM, "Christian Balzer"  wrote:

>
> Hello,
>
> for the record what Robert is writing below matches my experience the best.
>
> On Fri, 12 Feb 2016 22:17:01 + Steve Taylor wrote:
>
> > I could be wrong, but I didn't think a PG would have to peer when an OSD
> > is restarted with noout set. If I'm wrong, then this peering would
> > definitely block I/O. I just did a quick test on a non-busy cluster and
> > didn't see any peering when my OSD went down or up, but I'm not sure how
> > good a test that is. The OSD should also stay "in" throughout the
> > restart with noout set, so it wouldn't have been "out" before to cause
> > peering when it came "in."
> >
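
(For concreteness, the procedure being debated is essentially the following; a
sketch, and the restart command differs between sysvinit/upstart/systemd
installs:)

  ceph osd set noout
  # watch "ceph -w" or "ceph -s" in another terminal for peering / slow requests
  service ceph restart osd.2      # or: systemctl restart ceph-osd@2
  ceph osd unset noout            # once the OSD is back up and PGs are active+clean
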
> It stays in as far as the ceph -s output is concerned, but clearly for
> things to work as desired/imagined/expected some redirection state has to
> be enabled for the duration.
>
> > I do know that OSDs don’t mark themselves "up" until they're caught up
> > on OSD maps. They won't accept any op requests until they're "up," so
> > they shouldn't have any catching up to do by the time they start taking
> > op requests. In theory they're ready to handle I/O by the time they
> > start handling I/O. At least that's my understanding.
> >
>
> Well, here is the 3-minute ordeal of a restart. Things went downhill from
> the moment the shutdown was initiated to the time it was fully back up,
> with some additional fun during the recovery after that, but compared to the 3
> minutes of near silence, that was a minor hiccup.
>
> ---
> 2016-02-12 01:33:45.408348 7f7f0c786700 -1 osd.2 1788 *** Got signal
> Terminated ***
> 2016-02-12 01:33:45.408414 7f7f0c786700  0 osd.2 1788 prepare_to_stop
> telling mon we are shutting down
> 2016-02-12 01:33:45.408807 7f7f1cfa7700  0 osd.2 1788 got_stop_ack
> starting shutdown
> 2016-02-12 01:33:45.408841 7f7f0c786700  0 osd.2 1788 prepare_to_stop
> starting shutdown
> 2016-02-12 01:33:45.408852 7f7f0c786700 -1 osd.2 1788 shutdown
> 2016-02-12 01:33:45.409541 7f7f0c786700 20 osd.2 1788  kicking pg 1.10
> 2016-02-12 01:33:45.409547 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[1.10(
> empty local-les=1788 n=0 ec=1 les/c 1788/1788
>  1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active] lock
> 2016-02-12 01:33:45.409562 7f7f0c786700 10 osd.2 pg_epoch: 1788 pg[1.10(
> empty local-les=1788 n=0 ec=1 les/c 1788/1788
>  1787/1787/1787) [1,2] r=1 lpr=1787 pi=1656-1786/12 crt=0'0 active]
> on_shutdown
> ---
> This goes on for quite a while with other PGs and more shutdown fun, until
> we get here:
> ---
> 2016-02-12 01:33:46.413966 7f7f0c786700 20 osd.2 1788  kicking pg 2.17c
> 2016-02-12 01:33:46.413974 7f7f0c786700 30 osd.2 pg_epoch: 1788 pg[2.17c(
> v 1788'10083124 (1782'10080123,1788'10083124
> ] local-les=1785 n=1522 ec=1 les/c 1785/1785 1784/1784/1784) [0,2] r=1
> lpr=1784 pi=1652-1783/16 luod=0'0 crt=1692'3885
> 111 lcod 1788'10083123 active] lock
> 2016-02-12 01:33:48.690760 7f75be4d57c0  0 ceph version 0.80.10
> (ea6c958c38df1216bf95c927f143d8b13c4a9e70), process ce
> ph-osd, pid 24967
> ---
> So from shutdown to startup about 2 seconds, not that bad.
> However here is where the cookie crumbles massively:
> ---
> 2016-02-12 01:33:50.263152 7f75be4d57c0  0
> filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> 2016-02-12 01:35:31.809897 7f75be4d57c0  0
> filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode
> : checkpoint is not enabled
> ---
> Nearly 2 minutes to mount things, it probably had to go

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-12 Thread Robert LeBlanc

What I've seen is that when an OSD starts up in a busy cluster, as
soon as it is "in" (it could have been "out" before) it starts getting client
traffic. However, it has to be "in" to start catching up and peering with
the other OSDs in the cluster. The OSD is not ready to service
requests for a given PG yet, but it keeps the op queued until it is ready.
On a busy cluster it can take an OSD a long time to become ready,
especially if it is servicing client requests at the same time.

If someone isn't able to look into the code to resolve this by the
time I'm finished with the queue optimizations I'm doing (hopefully in
a week or two), I plan on looking into this to see if there is
something that can be done to prevent the OPs from being accepted
until the OSD is ready for them.
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk  wrote:
> I wonder if Christian is hitting some performance issue when one OSD or a
> number of OSDs all start up at once? Or maybe the OSD is still doing some
> internal startup procedure, and when the IO hits it on a very busy cluster,
> it causes it to become overloaded for a few seconds?
>
> I've seen similar things in the past where, if I did not have enough min free
> KBs configured, PGs would take a long time to peer/activate and cause slow
> ops.
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Steve Taylor
>> Sent: 12 February 2016 16:32
>> To: Nick Fisk ; 'Christian Balzer' ; ceph-
>> us...@lists.ceph.com
>> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't
>> uptosnuff)
>>
>> Nick is right. Setting noout is the right move in this scenario.
> Restarting an
>> OSD shouldn't block I/O unless nodown is also set, however. The exception
>> to this would be a case where min_size can't be achieved because of the
>> down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would
> certainly
>> block writes. Otherwise the cluster will recognize down OSDs as down
>> (without nodown set), redirect I/O requests to OSDs that are up, and
> backfill
>> as necessary when things are back to normal.
>>
>> You can set min_size to something lower if you don't have enough OSDs to
>> allow you to restart one without blocking writes. If this isn't the case,
>> something deeper is going on with your cluster. You shouldn't get slow
>> requests due to restarting a single OSD with only noout set and idle disks
> on
>> the remaining OSDs. I've done this many, many times.
>>
>> Steve Taylor | Senior Software Engineer | StorageCraft Technology
>> Corporation
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2799 | Fax: 801.545.4705
>>
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Nick Fisk
>> Sent: Friday, February 12, 2016 9:07 AM
>> To: 'Christian Balzer' ; ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't
>> uptosnuff)
>>
>>
>>
>> > -Original Message-
>> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> > Of Christian Balzer
>> > Sent: 12 February 2016 15:38
>> > To: ceph-users@lists.ceph.com
>> > Subject: Re: [ceph-users] Reducing the impact of OSD restarts (noout
>> > ain't
>> > uptosnuff)
>> >
>> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
>> >
>> > > Hi,
>> > >
>> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
>> > > > Hello,
>> > > >
>> > > > yesterday I upgraded our most busy (in other words lethally
>> > > > overloaded) production cluster to the latest Firefly in
>> > > > preparation for a Hammer upgrade and then phasing in of a cache
> tier.
>> > > >
>> > > > When restarting the ODSs it took 3 minutes (1 minute in a
>> > > > consecutive repeat to test the impact of primed caches) during
>> > > > which the cluster crawled to a near stand-still and 

Re: [ceph-users] cls_rbd ops on rbd_id.$name objects in EC pool

2016-02-11 Thread Robert LeBlanc

Is this only a problem with EC base tiers or would replicated base
tiers see this too?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Feb 11, 2016 at 6:09 PM, Sage Weil  wrote:
> On Thu, 11 Feb 2016, Nick Fisk wrote:
>> That's a relief, I was sensing a major case of face palm occurring when I
>> read Jason's email!!!
>
> https://github.com/ceph/ceph/pull/7617
>
> The tangled logic in maybe_handle_cache wasn't respecting the force
> promotion bool.
>
> sage



Re: [ceph-users] K is for Kraken

2016-02-08 Thread Robert LeBlanc

Too bad K isn't an LTS. It would be fun to release the Kraken many times.

I like liliput

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Feb 8, 2016 at 11:36 AM, Sage Weil  wrote:
> I didn't find any other good K names, but I'm not sure anything would top
> kraken anyway, so I didn't look too hard.  :)
>
> For L, the options I found were
>
> luminous (flying squid)
> longfin (squid)
> long barrel (squid)
> liliput (octopus)
>
> Any other suggestions?
>
> sage



Re: [ceph-users] Unified queue in Infernalis

2016-02-05 Thread Robert LeBlanc
I believe this is referring to combining the previously separate queues
into a single queue (PrioritizedQueue and soon to be WeightedPriorityQueue)
in ceph. That way client IO and recovery IO can be better prioritized in
the Ceph code. This is all before the disk queue.
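
(The knobs feeding that prioritization are ordinary OSD options rather than
ionice settings; shown here only as a sketch of where to look, with the
defaults, not as recommended values:)

  # osd_client_op_priority defaults to 63 and osd_recovery_op_priority to 10;
  # they can be set in ceph.conf or injected at runtime.
  ceph tell osd.* injectargs '--osd-client-op-priority 63 --osd-recovery-op-priority 10'
  # The separate osd_disk_thread_ioprio_class/priority options are the ones
  # that interact with the CFQ scheduler.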

Robert LeBlanc

Sent from a mobile device please excuse any typos.
On Feb 5, 2016 4:28 PM, "Stillwell, Bryan" 
wrote:

> I saw the following in the release notes for Infernalis, and I'm wondering
> where I can find more information about it?
>
> * There is now a unified queue (and thus prioritization) of client IO,
> recovery, scrubbing, and snapshot trimming.
>
> I've tried checking the docs for more details, but didn't have much luck.
> Does this mean we can adjust the ionice priority of each of these
> operations if we're using the CFQ scheduler?
>
> Thanks,
> Bryan
>
>
> 
>


Re: [ceph-users] Set cache tier pool forward state automatically!

2016-02-04 Thread Robert LeBlanc

On Thu, Feb 4, 2016 at 8:32 PM, Christian Balzer  wrote:
> On Wed, 3 Feb 2016 22:42:32 -0700 Robert LeBlanc wrote:

> I just finished downgrading my test cluster from testing to Jessie and
> then upgrading Ceph from Firefly to Hammer (that was a fun few hours).
>
> And I can confirm that I don't see that issue with Hammer, wonder if it's
> worth prodding the devs about.
> I sorta dread the time the PG upgrade process will take when going to
> Hammer on the overloaded production server.
> But then again, a fix for Firefly is both unlikely and going to take too
> long for my case anyway.

There isn't any PG upgrading as part of the Firefly to Hammer process
that I can think of. If there is, it wasn't long. Setting noout would
prevent backfilling so only recovery of delta changes would be needed.
You don't have to reboot the node, just restart the process so it can
limit the amount of changes to only a minute or two. It's been a long
time since I did that upgrade, so I just may be plain forgetting
something.

I think you are right about the fix not going into Firefly since it
will EoL this year. Glad to know that it is fixed in Hammer.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


Re: [ceph-users] Upgrading with mon & osd on same host

2016-02-04 Thread Robert LeBlanc

Just make sure that your monitors and OSDs are on the very latest of
Hammer or else your Infernalis OSDs won't activate.
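
(A rough per-host outline of the Hammer -> Infernalis sequence discussed below;
a sketch only, since service commands vary by init system and the release notes
remain authoritative:)

  ceph osd set noout
  # 1. upgrade the ceph packages on this host
  # 2. restart the monitor on this host and wait for it to rejoin quorum
  # 3. then, one OSD at a time:
  service ceph stop osd.2
  chown -R ceph:ceph /var/lib/ceph/osd/ceph-2   # Infernalis daemons run as user 'ceph'
  service ceph start osd.2
  # 4. when every host is done:
  ceph osd unset noout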
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Feb 4, 2016 at 12:23 AM, Mika c  wrote:
> Hi,
> >* Do the packages (Debian) restart the services upon upgrade?
> No need; restart them yourself.
>
>>Do I need to actually stop all OSDs, or can I upgrade them one by one?
> No need to stop them all. Just upgrade the OSD servers one by one and
> restart each OSD daemon.
>
>
>
> Best wishes,
> Mika
>
>
> 2016-02-03 18:55 GMT+08:00 Udo Waechter :
>>
>> Hi,
>>
>> I would like to upgrade my ceph cluster from hammer to infernalis.
>>
>> I'm reading in the upgrade notes that I need to upgrade & restart the
>> monitors first, then the OSDs.
>>
>> Now, my cluster has OSDs and Mons on the same hosts (I know that should
>> not be the case, but it is :( ).
>>
>> I'm just wondering:
>> * Do the packages (Debian) restart the services upon upgrade?
>>
>>
>> In theory it should work this way:
>>
>> * install / upgrade the new packages
>> * restart all mons
>> * stop the OSDs one by one and change the user accordingly.
>>
>> Another question then:
>>
>> Do I need to actually stop all OSDs, or can I upgrade them one by one?
>> I don't want to take the whole cluster down :(
>>
>> Thanks very much,
>> udo.
>>
>>



Re: [ceph-users] Set cache tier pool forward state automatically!

2016-02-03 Thread Robert LeBlanc


On Wed, Feb 3, 2016 at 9:00 PM, Christian Balzer  wrote:
> On Wed, 3 Feb 2016 16:57:09 -0700 Robert LeBlanc wrote:

> That's an interesting strategy; I suppose you haven't run into the issue I
> wrote about 2 days ago when switching to forward while running rbd bench?

We haven't, but we are running 0.94.5. If you are running Firefly,
that could be why.

> In my case I venture that the number of really hot objects is small enough
> to not overwhelm things and that 5K IOPS would be all that cluster ever
> needs to provide.

We have 48x Micron M600 1TB drives. They do not perform as well as the
Intel S3610 800GB, which seems to have the right balance of performance
and durability for our needs. Since we had to under-provision the
M600s, we should be able to get by with the 800GB just fine. Once we
get the drives swapped out, we may do better than the 10K IOPS, and the
recency fix going into the next version of Hammer should also help
with writeback. With our new cluster, we did fine in writeback mode
until we hit that 10K IOPS limit and started getting slow I/O
messages; turning it to forward mode sped things up a lot.
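
(The mode flip described here is just the following; a sketch, with the pool
name as a placeholder:)

  ceph osd tier cache-mode <cache-pool> forward     # stop promotions, redirect to the base pool
  # ...and back again once the cache pool can keep up:
  ceph osd tier cache-mode <cache-pool> writeback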


- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

