Re: [ceph-users] Ceph luminous repo not working on Ubuntu xenial

2017-09-28 Thread Stefan Kooman
Quoting Kashif Mumtaz (kashif.mum...@yahoo.com):
> 
> Dear User,
> I am striving hard to install the Ceph Luminous version on Ubuntu 16.04.3
> (xenial).
> Its repo is available at https://download.ceph.com/debian-luminous/ 
> I added it like sudo apt-add-repository 'deb 
> https://download.ceph.com/debian-luminous/ xenial main'
> # more  sources.list
> deb https://download.ceph.com/debian-luminous/ xenial main

^^ That looks good. 

> It says no package is available. Has anybody been able to install Luminous on
> Xenial using this repo?

Just checking: did you run "apt update" after adding the repo?

The repo works fine for me. Is the Ceph gpg key installed?

apt-key list |grep Ceph
uid  Ceph.com (release key) 

Make sure you have "apt-transport-https" installed (as the repo uses
TLS).
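
For reference, a minimal sketch of the full setup on Xenial (the key URL below is
the standard download.ceph.com release key; adjust if you use a mirror):

sudo apt-get install apt-transport-https
wget -q -O- https://download.ceph.com/keys/release.asc | sudo apt-key add -
sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main'
sudo apt update
sudo apt install ceph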

Gr. Stefan


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd max scrubs not honored?

2017-09-28 Thread Christian Balzer

Hello,

On Thu, 28 Sep 2017 22:36:22 + Gregory Farnum wrote:

> Also, realize the deep scrub interval is a per-PG thing and (unfortunately)
> the OSD doesn't use a global view of its PG deep scrub ages to try and
> schedule them intelligently across that time. If you really want to try and
> force this out, I believe a few sites have written scripts to do it by
> turning off deep scrubs, forcing individual PGs to deep scrub at intervals,
> and then enabling deep scrubs again.
> -Greg
> 
This approach works best and without surprises down the road if
osd_scrub_interval_randomize_ratio is disabled and osd_scrub_begin_hour /
osd_scrub_end_hour are set to your needs.

I basically kick the deep scrubs off on a per OSD basis (one at a
time and staggered of course) and if your cluster is small/fast enough
that pattern will be retained indefinitely, with only one PG doing a deep
scrub at any given time (with the default max scrub of 1 of course). 
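
A minimal sketch of that per-OSD kick-off (the interval is illustrative; "ceph osd
deep-scrub" asks an OSD to deep scrub all PGs it is primary for):

for osd in $(ceph osd ls); do
    ceph osd deep-scrub "$osd"
    sleep 1800   # stagger the OSDs; tune to how long your deep scrubs take
done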

Christian

> On Wed, Sep 27, 2017 at 6:34 AM David Turner  wrote:
> 
> > This isn't an answer, but a suggestion to try and help track it down as
> > I'm not sure what the problem is. Try querying the admin socket for your
> > osds and look through all of their config options and settings for
> > something that might explain why you have multiple deep scrubs happening on
> > a single osd at the same time.
> >
> > However, if you misspoke and only have 1 deep scrub per osd but multiple
> > per node, then what you are seeing is expected behavior.  I believe that
> > luminous added a sleep setting for scrub io that also might help.  Looking
> > through the admin socket dump of settings looking for scrub should give you
> > some ideas of things to try.
> >
> > On Tue, Sep 26, 2017, 2:04 PM J David  wrote:
> >  
> >> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also
> >> the default, at almost all times, there are 2-3 deep scrubs running.
> >>
> >> 3 simultaneous deep scrubs is enough to cause a constant stream of:
> >>
> >> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32
> >> sec (REQUEST_SLOW)
> >>
> >> This seems to correspond with all three deep scrubs hitting the same
> >> OSD at the same time, starving out all other I/O requests for that
> >> OSD.  But it can happen less frequently and less severely with two or
> >> even one deep scrub running.  Nonetheless, consumers of the cluster
> >> are not thrilled with regular instances of 30-60 second disk I/Os.
> >>
> >> The cluster is five nodes, 15 OSDs, and there is one pool with 512
> >> placement groups.  The cluster is running:
> >>
> >> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
> >> (rc)
> >>
> >> All of the OSDs are bluestore, with HDD storage and SSD block.db.
> >>
> >> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this
> >> issue, though it seems to get the number down from 3 to 2, which at
> >> least cuts down on the frequency of requests stalling out.  With 512
> >> pgs, that should mean that one pg gets deep-scrubbed per hour, and it
> >> seems like a deep-scrub takes about 20 minutes.  So what should be
> >> happening is that 1/3rd of the time there should be one deep scrub,
> >> and 2/3rds of the time there shouldn’t be any.  Yet instead we have
> >> 2-3 deep scrubs running at all times.
> >>
> >> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per
> >> hour:
> >>
> >> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' |
> >> fgrep 2017-09-26 | sort -rn | head -22
> >> dumped all
> >> 2017-09-26 16:42:46.781761 0.181
> >> 2017-09-26 16:41:40.056816 0.59
> >> 2017-09-26 16:39:26.216566 0.9e
> >> 2017-09-26 16:26:43.379806 0.19f
> >> 2017-09-26 16:24:16.321075 0.60
> >> 2017-09-26 16:08:36.095040 0.134
> >> 2017-09-26 16:03:33.478330 0.b5
> >> 2017-09-26 15:55:14.205885 0.1e2
> >> 2017-09-26 15:54:31.413481 0.98
> >> 2017-09-26 15:45:58.329782 0.71
> >> 2017-09-26 15:34:51.777681 0.1e5
> >> 2017-09-26 15:32:49.669298 0.c7
> >> 2017-09-26 15:01:48.590645 0.1f
> >> 2017-09-26 15:01:00.082014 0.199
> >> 2017-09-26 14:45:52.893951 0.d9
> >> 2017-09-26 14:43:39.870689 0.140
> >> 2017-09-26 14:28:56.217892 0.fc
> >> 2017-09-26 14:28:49.665678 0.e3
> >> 2017-09-26 14:11:04.718698 0.1d6
> >> 2017-09-26 14:09:44.975028 0.72
> >> 2017-09-26 14:06:17.945012 0.8a
> >> 2017-09-26 13:54:44.199792 0.ec
> >>
> >> What’s going on here?
> >>
> >> Why isn’t the limit on scrubs being honored?
> >>
> >> It would also be great if scrub I/O were surfaced in “ceph status” the
> >> way recovery I/O is, especially since it can have such a significant
> >> impact on client operations.
> >>
> >> Thanks!
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>  
> > ___
> > ceph-users mailing list
> > 

Re: [ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

2017-09-28 Thread Brad Hubbard
This looks similar to
https://bugzilla.redhat.com/show_bug.cgi?id=1458007 or one of the
bugs/trackers attached to that.

On Thu, Sep 28, 2017 at 11:14 PM, Sean Purdy  wrote:
> On Thu, 28 Sep 2017, Matthew Vernon said:
>> Hi,
>>
>> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
>> needs increasing and/or removing entirely. Should I copy this to ceph-devel?
>
> Just a note.  Looks like debian stretch luminous packages have a 10_000 
> second timeout:
>
> from /lib/systemd/system/ceph-disk@.service
>
> Environment=CEPH_DISK_TIMEOUT=10000
> ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock 
> /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout 
> trigger --sync %f'
>
>
> Sean
>
>> On 15/09/17 16:48, Matthew Vernon wrote:
>> >On 14/09/17 16:26, Götz Reinicke wrote:
>> >>After that, 10 OSDs did not come up like the others. The disk did not get
>> >>mounted and the OSD processes did nothing … even after a couple of
>> >>minutes no more disks/OSDs showed up.
>> >
>> >I'm still digging, but AFAICT it's a race condition in startup - in our
>> >case, we're only seeing it if some of the filesystems aren't clean. This
>> >may be related to the thread "Very slow start of osds after reboot" from
>> >August, but I don't think any conclusion was reached there.
>>
>> This annoyed me enough that I went off to find the problem :-)
>>
>> On systemd-enabled machines[0] ceph disks are activated by systemd's
>> ceph-disk@.service, which calls:
>>
>> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
>> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
>>
>> ceph-disk trigger --sync calls ceph-disk activate which (among other things)
>> mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/
>> once it's extracted the osd number from the fs). If the fs is unclean, XFS
>> auto-recovers before mounting (which takes time - range 2-25s for our 6TB
>> disks). Importantly, there is a single global lock file[1] so only one
>> ceph-disk activate can be doing this at once.
>>
>> So, each fs is auto-recovering one at at time (rather than in parallel), and
>> once the elapsed time gets past 120s, timeout kills the flock, systemd kills
>> the cgroup, and no more OSDs start up - we typically find a few fs mounted
>> in /var/lib/ceph/tmp/mnt.. systemd keeps trying to start the remaining
>> osds (via ceph-osd@.service), but their fs isn't in the correct place, so
>> this never works.
>>
>> The fix/workaround is to adjust the timeout value (edit the service file
>> directly, or for style points write an override in /etc/systemd/system
>> remembering you need a blank ExecStart line before your revised one).
>>
>> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
>> start all its osds when started up with all fss dirty. So the current 120s
>> is far too small (it's just about OK when all the osd fss are clean).
>>
>> I think, though, that having the timeout at all is a bug - if something
>> needs to time out under some circumstances, should it be at a lower layer,
>> perhaps?
>>
>> A couple of final points/asides, if I may:
>>
>> ceph-disk trigger uses subprocess.communicate (via the command() function),
>> which means it swallows the log output from ceph-disk activate (and only
>> outputs it after that process finishes) - as well as producing confusing
>> timestamps, this means that when systemd kills the cgroup, all the output
>> from the ceph-disk activate command vanishes into the void. That made
>> debugging needlessly hard. Better to let called processes like that output
>> immediately?
>>
>> Does each fs need mounting twice? could the osd be encoded in the partition
>> label or similar instead?
>>
>> Is a single global activation lock necessary? It slows startup down quite a
>> bit; I see no reason why (at least in the one-osd-per-disk case) you
>> couldn't be activating all the osds at once...
>>
>> Regards,
>>
>> Matthew
>>
>> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
>> timeout, so presumably upstart systems aren't affected
>> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
>>
>>
>> --
>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
>> a charity registered in England with number 1021457 and a company registered
>> in England with number 2742969, whose registered office is 215 Euston Road,
>> London, NW1 2BE. ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 12.2.0 on 32bit?

2017-09-28 Thread Gregory Farnum
Have you tried running a Luminous OSD with filestore instead of BlueStore?

As BlueStore is all new code and uses a lot of optimizations and tricks for
fast and efficient use of memory, some 64-bit assumptions may have snuck in
there. I'm not sure how much interest there is in making sure that works on
32-bit systems at this point, but narrowing it down to a specific component
would certainly help.
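
If you want to test that, a rough sketch with ceph-disk (device names are
placeholders; ceph-volume's "lvm create --filestore" is the other option on 12.2.x):

ceph-disk prepare --filestore --fs-type xfs /dev/sdX /dev/sdY   # data dev, journal dev
ceph-disk activate /dev/sdX1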

On Fri, Sep 22, 2017 at 8:57 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> It crashes with SimpleMessenger as well  (ms_type = simple)
>
>
> I've also tried with and without these two settings, but still crashes.
> bluestore cache size = 536870912
> bluestore cache kv max = 268435456
>
>
> When using SimpleMessenger, it tells me it is crashing (Segmentation
> Fault) in 'thread_name:ms_pipe_write'.  This is common in all crashes under
> SimpleMessenger, just like 'msgr-worker-' was common
> under AsyncMessenger.
>
>
> The node I'm testing this on is running a 32bit kernel (4.12.5) and has
> 8GB ram (free -m).
>
>
> Per 'ps aux', VSZ and RSS never get much above 1196392 and 544024
> respectively.  (One time they didn't get past 999536 and 329712
> respectively.)
>
>
> Also, under SimpleMessenger, gdb is reporting stack corruption in the back
> traces.
>
>
> What other memory tuning options should I try?
>
>
>
>
>
> On 2017-09-11 08:05, Gregory Farnum wrote:
>
> You could try setting it to run with SimpleMessenger instead of
> AsyncMessenger -- the default changed across those releases.
> I imagine the root of the problem though is that with BlueStore the OSD is
> using a lot more memory than it used to and so we're overflowing the 32-bit
> address space...which means a more permanent solution might require turning
> down the memory tuning options. Sage has discussed those in various places.
> On Sun, Sep 10, 2017 at 11:52 PM Dyweni - Ceph-Users <
> 6exbab4fy...@dyweni.com> wrote:
>
>> Hi,
>>
>> Is anyone running Ceph Luminous (12.2.0) on 32bit Linux?  Have you seen
>> any problems?
>>
>>
>>
>> My setup has been 1 MON and 7 OSDs (no MDS, RGW, etc), all running Jewel
>> (10.2.1), on 32bit, with no issues at all.
>>
>> I've upgraded everything to latest version of Jewel (10.2.9) and still
>> no issues.
>>
>> Next I upgraded my MON to Luminous (12.2.0) and added MGR to it.  Still
>> no issues.
>>
>> Next I removed one node from the cluster, wiped it clean, upgraded it to
> Luminous (12.2.0), and created a new BlueStore data area.  Now this node
>> crashes with segmentation fault usually within a few minutes of starting
>> up.  I've loaded symbols and used GDB to examine back traces.  From what
>> I can tell, the seg faults are happening randomly, and the stack is
>> corrupted, so traces from GDB are unusable (even with all symbols
>> installed for all packages on the system). However, in all cases, the
>> seg fault is occuring in the 'msgr-worker-' thread.
>>
>>
>>
>>
>> My data is fine, just would like to get Ceph 12.2.0 running stably on
>> this node, so I can upgrade the remaining nodes and switch everything
>> over to BlueStore.
>>
>>
>>
>> Thanks,
>> Dyweni
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd max scrubs not honored?

2017-09-28 Thread David Turner
I often schedule the deep scrubs for a cluster so that none of them will
happen on their own and will always be run using my cron/scripts.  For
instance, set the deep scrub interval to 2 months and schedule a cron that
will take care of all of the deep scrubs within a month.  If for any reason
the script stops working, the PGs will still be scrubbed at least every 2
months.  The script, meanwhile, ensures they normally happen every month, and
only during the times of day the cron runs.  That way I can ease up on
deep scrubbing when the cluster needs a little more performance or is going
through a big recovery, but also catch it back up.
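
A rough sketch of such a cron-driven script (the deep-scrub stamp columns $23/$24
are the ones used elsewhere in this thread and may shift between releases; verify
against your own "ceph pg dump" output first):

#!/bin/sh
# /usr/local/bin/deep-scrub-oldest.sh -- deep-scrub the N PGs with the oldest stamps
# cron entry (example): 0 1 * * * root /usr/local/bin/deep-scrub-oldest.sh 20
N=${1:-20}
ceph pg dump 2>/dev/null | awk '/active/ {print $23" "$24" "$1}' | sort | head -n "$N" |
while read day time pg; do
    ceph pg deep-scrub "$pg"
    sleep 300   # spread them out; tune to your cluster
done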

There are also config settings to ensure that scrubs only happen during
hours of the day you want them to so you can avoid major client IO
regardless of how you scrub your cluster.
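
For example (values are illustrative):

ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 7'
# and persist the same in ceph.conf under [osd]:
#   osd scrub begin hour = 1
#   osd scrub end hour = 7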

On Thu, Sep 28, 2017 at 6:36 PM Gregory Farnum  wrote:

> Also, realize the deep scrub interval is a per-PG thing and
> (unfortunately) the OSD doesn't use a global view of its PG deep scrub ages
> to try and schedule them intelligently across that time. If you really want
> to try and force this out, I believe a few sites have written scripts to do
> it by turning off deep scrubs, forcing individual PGs to deep scrub at
> intervals, and then enabling deep scrubs again.
> -Greg
>
>
> On Wed, Sep 27, 2017 at 6:34 AM David Turner 
> wrote:
>
>> This isn't an answer, but a suggestion to try and help track it down as
>> I'm not sure what the problem is. Try querying the admin socket for your
>> osds and look through all of their config options and settings for
>> something that might explain why you have multiple deep scrubs happening on
>> a single osd at the same time.
>>
>> However, if you misspoke and only have 1 deep scrub per osd but multiple
>> per node, then what you are seeing is expected behavior.  I believe that
>> luminous added a sleep setting for scrub io that also might help.  Looking
>> through the admin socket dump of settings looking for scrub should give you
>> some ideas of things to try.
>>
>> On Tue, Sep 26, 2017, 2:04 PM J David  wrote:
>>
>>> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also
>>> the default, at almost all times, there are 2-3 deep scrubs running.
>>>
>>> 3 simultaneous deep scrubs is enough to cause a constant stream of:
>>>
>>> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32
>>> sec (REQUEST_SLOW)
>>>
>>> This seems to correspond with all three deep scrubs hitting the same
>>> OSD at the same time, starving out all other I/O requests for that
>>> OSD.  But it can happen less frequently and less severely with two or
>>> even one deep scrub running.  Nonetheless, consumers of the cluster
>>> are not thrilled with regular instances of 30-60 second disk I/Os.
>>>
>>> The cluster is five nodes, 15 OSDs, and there is one pool with 512
>>> placement groups.  The cluster is running:
>>>
>>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
>>> (rc)
>>>
>>> All of the OSDs are bluestore, with HDD storage and SSD block.db.
>>>
>>> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this
>>> issue, though it seems to get the number down from 3 to 2, which at
>>> least cuts down on the frequency of requests stalling out.  With 512
>>> pgs, that should mean that one pg gets deep-scrubbed per hour, and it
>>> seems like a deep-scrub takes about 20 minutes.  So what should be
>>> happening is that 1/3rd of the time there should be one deep scrub,
>>> and 2/3rds of the time there shouldn’t be any.  Yet instead we have
>>> 2-3 deep scrubs running at all times.
>>>
>>> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched
>>> per hour:
>>>
>>> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' |
>>> fgrep 2017-09-26 | sort -rn | head -22
>>> dumped all
>>> 2017-09-26 16:42:46.781761 0.181
>>> 2017-09-26 16:41:40.056816 0.59
>>> 2017-09-26 16:39:26.216566 0.9e
>>> 2017-09-26 16:26:43.379806 0.19f
>>> 2017-09-26 16:24:16.321075 0.60
>>> 2017-09-26 16:08:36.095040 0.134
>>> 2017-09-26 16:03:33.478330 0.b5
>>> 2017-09-26 15:55:14.205885 0.1e2
>>> 2017-09-26 15:54:31.413481 0.98
>>> 2017-09-26 15:45:58.329782 0.71
>>> 2017-09-26 15:34:51.777681 0.1e5
>>> 2017-09-26 15:32:49.669298 0.c7
>>> 2017-09-26 15:01:48.590645 0.1f
>>> 2017-09-26 15:01:00.082014 0.199
>>> 2017-09-26 14:45:52.893951 0.d9
>>> 2017-09-26 14:43:39.870689 0.140
>>> 2017-09-26 14:28:56.217892 0.fc
>>> 2017-09-26 14:28:49.665678 0.e3
>>> 2017-09-26 14:11:04.718698 0.1d6
>>> 2017-09-26 14:09:44.975028 0.72
>>> 2017-09-26 14:06:17.945012 0.8a
>>> 2017-09-26 13:54:44.199792 0.ec
>>>
>>> What’s going on here?
>>>
>>> Why isn’t the limit on scrubs being honored?
>>>
>>> It would also be great if scrub I/O were surfaced in “ceph status” the
>>> way recovery I/O is, especially since it can have such a significant
>>> impact on client operations.

Re: [ceph-users] osd max scrubs not honored?

2017-09-28 Thread Gregory Farnum
Also, realize the deep scrub interval is a per-PG thing and (unfortunately)
the OSD doesn't use a global view of its PG deep scrub ages to try and
schedule them intelligently across that time. If you really want to try and
force this out, I believe a few sites have written scripts to do it by
turning off deep scrubs, forcing individual PGs to deep scrub at intervals,
and then enabling deep scrubs again.
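
A bare-bones sketch of that sequence (PG selection and the sleep are illustrative;
note that on some versions a manually requested deep scrub may not start while the
nodeep-scrub flag is set, so test on one PG first):

ceph osd set nodeep-scrub
ceph pg dump pgs_brief 2>/dev/null | awk '$2 ~ /active/ {print $1}' |
while read pg; do
    ceph pg deep-scrub "$pg"
    sleep 1200   # roughly one ~20-minute deep scrub at a time
done
ceph osd unset nodeep-scrub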
-Greg

On Wed, Sep 27, 2017 at 6:34 AM David Turner  wrote:

> This isn't an answer, but a suggestion to try and help track it down as
> I'm not sure what the problem is. Try querying the admin socket for your
> osds and look through all of their config options and settings for
> something that might explain why you have multiple deep scrubs happening on
> a single osd at the same time.
>
> However, if you misspoke and only have 1 deep scrub per osd but multiple
> per node, then what you are seeing is expected behavior.  I believe that
> luminous added a sleep setting for scrub io that also might help.  Looking
> through the admin socket dump of settings looking for scrub should give you
> some ideas of things to try.
>
> On Tue, Sep 26, 2017, 2:04 PM J David  wrote:
>
>> With “osd max scrubs” set to 1 in ceph.conf, which I believe is also
>> the default, at almost all times, there are 2-3 deep scrubs running.
>>
>> 3 simultaneous deep scrubs is enough to cause a constant stream of:
>>
>> mon.ceph1 [WRN] Health check update: 69 slow requests are blocked > 32
>> sec (REQUEST_SLOW)
>>
>> This seems to correspond with all three deep scrubs hitting the same
>> OSD at the same time, starving out all other I/O requests for that
>> OSD.  But it can happen less frequently and less severely with two or
>> even one deep scrub running.  Nonetheless, consumers of the cluster
>> are not thrilled with regular instances of 30-60 second disk I/Os.
>>
>> The cluster is five nodes, 15 OSDs, and there is one pool with 512
>> placement groups.  The cluster is running:
>>
>> ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous
>> (rc)
>>
>> All of the OSDs are bluestore, with HDD storage and SSD block.db.
>>
>> Even setting “osd deep scrub interval = 1843200” hasn’t resolved this
>> issue, though it seems to get the number down from 3 to 2, which at
>> least cuts down on the frequency of requests stalling out.  With 512
>> pgs, that should mean that one pg gets deep-scrubbed per hour, and it
>> seems like a deep-scrub takes about 20 minutes.  So what should be
>> happening is that 1/3rd of the time there should be one deep scrub,
>> and 2/3rds of the time there shouldn’t be any.  Yet instead we have
>> 2-3 deep scrubs running at all times.
>>
>> Looking at “ceph pg dump” shows that about 7 deep scrubs get launched per
>> hour:
>>
>> $ sudo ceph pg dump | fgrep active | awk '{print $23" "$24" "$1}' |
>> fgrep 2017-09-26 | sort -rn | head -22
>> dumped all
>> 2017-09-26 16:42:46.781761 0.181
>> 2017-09-26 16:41:40.056816 0.59
>> 2017-09-26 16:39:26.216566 0.9e
>> 2017-09-26 16:26:43.379806 0.19f
>> 2017-09-26 16:24:16.321075 0.60
>> 2017-09-26 16:08:36.095040 0.134
>> 2017-09-26 16:03:33.478330 0.b5
>> 2017-09-26 15:55:14.205885 0.1e2
>> 2017-09-26 15:54:31.413481 0.98
>> 2017-09-26 15:45:58.329782 0.71
>> 2017-09-26 15:34:51.777681 0.1e5
>> 2017-09-26 15:32:49.669298 0.c7
>> 2017-09-26 15:01:48.590645 0.1f
>> 2017-09-26 15:01:00.082014 0.199
>> 2017-09-26 14:45:52.893951 0.d9
>> 2017-09-26 14:43:39.870689 0.140
>> 2017-09-26 14:28:56.217892 0.fc
>> 2017-09-26 14:28:49.665678 0.e3
>> 2017-09-26 14:11:04.718698 0.1d6
>> 2017-09-26 14:09:44.975028 0.72
>> 2017-09-26 14:06:17.945012 0.8a
>> 2017-09-26 13:54:44.199792 0.ec
>>
>> What’s going on here?
>>
>> Why isn’t the limit on scrubs being honored?
>>
>> It would also be great if scrub I/O were surfaced in “ceph status” the
>> way recovery I/O is, especially since it can have such a significant
>> impact on client operations.
>>
>> Thanks!
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Christian Wuerdig
I'm pretty sure the orphan find command does exactly that -
finding orphans. I remember some emails on the dev list where Yehuda
said he wasn't 100% comfortable with automating the delete just yet.
So the purpose is to run the orphan find tool and then delete the
orphaned objects yourself once you're happy that they all are actually
orphaned.
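
A rough sketch of that flow (the pool name is an example, and the "leaked:" output
format should be double-checked against your radosgw-admin version before deleting
anything):

radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphan-scan1 > orphans.txt
grep '^leaked:' orphans.txt | awk '{print $2}' |
while read obj; do
    rados -p default.rgw.buckets.data rm "$obj"   # verify a sample by hand first
done
radosgw-admin orphans finish --job-id=orphan-scan1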

On Fri, Sep 29, 2017 at 7:46 AM, Webert de Souza Lima
 wrote:
> When I had to use that I just took for granted that it worked, so I can't
> really tell you if that's just it.
>
> :|
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
>
> On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder
>  wrote:
>>
>> Hi,
>> Yes I'm able to run these commands, however it is unclear both in man file
>> and the docs what's supposed to happen with the orphans, will they be
>> deleted once I run finish? Or will that just throw away the job? What will
>> orphans find actually produce? At the moment it just outputs a lot of text
>> saying something like putting $num in orphans.$jobid.$shardnum and listing
>> objects that are not orphans?
>>
>> Regards,
>> Andreas
>>
>> On 28 Sep 2017 15:10, "Webert de Souza Lima" 
>> wrote:
>>
>> Hello,
>>
>> not an expert here but I think the answer is something like:
>>
>> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
>> radosgw-admin orphans finish --job-id=_JOB_ID_
>>
>> _JOB_ID_ being anything.
>>
>>
>>
>> Regards,
>>
>> Webert Lima
>> DevOps Engineer at MAV Tecnologia
>> Belo Horizonte - Brasil
>>
>> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder
>>  wrote:
>>>
>>> Hello,
>>> running Jewel on some nodes with rados gateway I've managed to get a
>>> lot of leaked multipart objects, most of them belonging to buckets
>>> that do not even exist anymore. We estimated these objects to occupy
>>> somewhere around 60TB, which would be great to reclaim. Question is
>>> how, since trying to find them one by one and perform some kind of
>>> sanity check if they're in use or not will take forever.
>>>
>>> The radosgw-admin orphans find command sounds like something I could
>>> use, but it's not clear if the command also removes the orphans? If
>>> not, what does it do? Can I use it to help me removing my orphan
>>> objects?
>>>
>>> Best regards,
>>> Andreas
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-28 Thread Gregory Farnum
On Thu, Sep 28, 2017 at 5:16 AM Micha Krause  wrote:

> Hi,
>
> I had a chance to catch John Spray at the Ceph Day, and he suggested that
> I try to reproduce this bug in luminos.
>
> To fix my immediate problem we discussed 2 ideas:
>
> 1. Manually edit the Meta-data, unfortunately I was not able to find any
> Information on how the meta-data is structured :-(
>
> 2. Edit the code to set the link count to 0 if it is negative:
>
>
> diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
> index 9e53907..2ca1449 100644
> --- a/src/mds/StrayManager.cc
> +++ b/src/mds/StrayManager.cc
> @@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool
> delay)
>   logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
> }
>
> +  if (in->inode.nlink < 0) {
> +in->inode.nlink=0;
> +  }
> +
> // purge?
> if (in->inode.nlink == 0) {
>   // past snaprealm parents imply snapped dentry remote links.
> diff --git a/src/xxHash b/src/xxHash
> --- a/src/xxHash
> +++ b/src/xxHash
> @@ -1 +1 @@
>
>
> I'm not sure if this works; the patched mds no longer crashes, however I
> expected that this value:
>
> root@mds02:~ # ceph daemonperf mds.1
> -mds-- --mds_server-- ---objecter--- -mds_cache-
> ---mds_log
> rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
>    0  100k   0 |  0    0    0 |  0    0    0 |  0    0  625k    0 | 30   25k    0
> 
>
> Should go down, but it stays at 625k, unfortunately I don't have another
> System to compare.
>
> After I started the patched mds once, I reverted back to an unpatched mds,
> and it also stopped crashing, so I guess it did "fix" something.
>
>
> A question just out of curiosity, I tried to log these events with
> something like:
>
>   dout(10) << "Fixed negative inode count";
>
> or
>
>   derr << "Fixed negative inode count";
>
> But my compiler yelled at me for trying this.
>

dout and derr are big macros. You need to end the line with " << dendl;" to
close it off.


>
> Micha Krause
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph luminous repo not working on Ubuntu xenial

2017-09-28 Thread Kashif Mumtaz

Dear User,
I am striving hard to install the Ceph Luminous version on Ubuntu 16.04.3
(xenial).
Its repo is available at https://download.ceph.com/debian-luminous/ 
I added it like sudo apt-add-repository 'deb 
https://download.ceph.com/debian-luminous/ xenial main'
# more  sources.list
deb https://download.ceph.com/debian-luminous/ xenial main
It says no package is available. Has anybody been able to install Luminous on Xenial
using this repo?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Webert de Souza Lima
When I had to use that I just took for granted that it worked, so I can't
really tell you if that's just it.

:|


Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 1:31 PM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hi,
> Yes I'm able to run these commands, however it is unclear both in man file
> and the docs what's supposed to happen with the orphans, will they be
> deleted once I run finish? Or will that just throw away the job? What will
> orphans find actually produce? At the moment it just outputs a lot of text
> saying something like putting $num in orphans.$jobid.$shardnum and listing
> objects that are not orphans?
>
> Regards,
> Andreas
>
> On 28 Sep 2017 15:10, "Webert de Souza Lima" 
> wrote:
>
> Hello,
>
> not an expert here but I think the answer is something like:
>
> radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
> radosgw-admin orphans finish --job-id=_JOB_ID_
>
> _JOB_ID_ being anything.
>
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <
> andreas.calmin...@klarna.com> wrote:
>
>> Hello,
>> running Jewel on some nodes with rados gateway I've managed to get a
>> lot of leaked multipart objects, most of them belonging to buckets
>> that do not even exist anymore. We estimated these objects to occupy
>> somewhere around 60TB, which would be great to reclaim. Question is
>> how, since trying to find them one by one and perform some kind of
>> sanity check if they're in use or not will take forever.
>>
>> The radosgw-admin orphans find command sounds like something I could
>> use, but it's not clear if the command also removes the orphans? If
>> not, what does it do? Can I use it to help me removing my orphan
>> objects?
>>
>> Best regards,
>> Andreas
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-09-28 Thread Ronny Aasen

On 28. sep. 2017 18:53, hjcho616 wrote:
Yay! Finally after about exactly one month I finally am able to mount 
the drive!  Now is time to see how my data is doing. =P  Doesn't look 
too bad though.


Got to love the open source. =)  I downloaded ceph source code.  Built them.  Then tried
to run ceph-objectstore-export on that osd.4.   Then started debugging it.  Obviously
don't have any idea of what everything do... but was able to trace to the error message.
The corruption appears to be at the mount region.  When it tries to decode a buffer,
most buffers had very periodic (looking at the printfs I put in) access to data but then
few of them had huge number.  Oh that "1" that didn't make sense was from the corruption
happened, and that struct_v portion of the data changed to ASCII value of 1, which
happily printed 1. =P  Since it was a mount
portion... and hoping it doesn't impact the data much... went ahead and allowed those corrupted values.  I was able to export osd.4 with journal!


congratulations and well done :)

just imagine trying to do this on $vendor's proprietary black box...

Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Power outages!!! help!

2017-09-28 Thread hjcho616
Yay! Finally, after almost exactly one month, I am able to mount the
drive!  Now it's time to see how my data is doing. =P  Doesn't look too bad
though.
Got to love open source. =)  I downloaded the ceph source code, built it, then
tried to run ceph-objectstore-tool export on that osd.4 and started debugging it.
Obviously I don't know what everything does... but I was able to trace it to the
error message.  The corruption appears to be in the mount region.  When it tries
to decode a buffer, most buffers showed very regular (looking at the printfs I
put in) access to data, but a few of them had huge numbers.  That "1" that didn't
make sense came from the corruption: the struct_v portion of the data had changed
to the ASCII value of 1, which happily printed 1. =P  Since it was in the mount
portion... and hoping it doesn't impact the data much... I went ahead and allowed
those corrupted values.  I was able to export osd.4 with its journal!
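
For anyone following along, a sketch of the export/import step with
ceph-objectstore-tool (the OSD paths and PG id are illustrative; stop the OSD first):

ceph-objectstore-tool --op export --pgid 1.28 \
    --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --file /root/pg1.28.export
ceph-objectstore-tool --op import \
    --data-path /var/lib/ceph/osd/ceph-7 --journal-path /var/lib/ceph/osd/ceph-7/journal \
    --file /root/pg1.28.export
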
Then I imported that pg.  But the OSDs wouldn't take it, as the cluster had
decided to create an empty pg 1.28 and mark it active.  So, just as the
"Incomplete PGs, Oh My!" page suggested, I pulled those osds down, removed those
empty heads, and started back up.  At that point, no more incomplete data!
Working on that inconsistent data: it looks like this is somewhat new in the
10.2.x series.  I was able to get it working with rados get and put and a
deep-scrub: https://www.spinics.net/lists/ceph-users/msg39063.html
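
A sketch of that get/put repair (the pool name is assumed; the object name is as it
appears in the log excerpts below; double-check the digests before overwriting):

rados -p rbd get rb.0.145d.2ae8944a.00bb /tmp/obj     # grab a copy of the object
rados -p rbd put rb.0.145d.2ae8944a.00bb /tmp/obj     # rewrite it so the digest is recalculated
ceph pg deep-scrub 2.7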

At this point everything was active+clean, but the MDS wasn't happy.  It seems to
suggest the journal is broken:

HEALTH_ERR mds rank 0 is damaged; mds cluster is degraded; no legacy OSD present
but 'sortbitwise' flag is not set

Found this and did everything down to "cephfs-table-tool all reset session":
http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/

Restarted the MDS:

HEALTH_WARN no legacy OSD present but 'sortbitwise' flag is not set

Mounted!  Thank you everyone for the help!  Learned a lot!

Regards,
Hong
 

On Friday, September 22, 2017 1:01 AM, hjcho616  wrote:
 

 Ronny,
Could you help me with this log?  I got this with debug osd=20 filestore=20
ms=20.  This one is running "ceph pg repair 2.7".  This is one of the smaller
pgs, so the log was smaller.  Others have similar errors.  I can see the lines
with ERR, but other than that, is there something I should be paying attention
to?
https://drive.google.com/file/d/0By7YztAJNGUWNkpCV090dHBmOWc/view?usp=sharing
Error messages look like this:

2017-09-21 23:53:31.545510 7f51682df700 -1 
log_channel(cluster) log [ERR] : 2.7 shard 2: soid 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != 
data_digest 0x43d61c5d from auth oi 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 
dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  
alloc_hint [0 0])2017-09-21 23:53:31.545520 7f51682df700 -1 
log_channel(cluster) log [ERR] : 2.7 shard 7: soid 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != 
data_digest 0x43d61c5d from auth oi 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 
dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  
alloc_hint [0 0])2017-09-21 23:53:31.545531 7f51682df700 -1 
log_channel(cluster) log [ERR] : 2.7 soid 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth 
object
I did try to move that object to a different location as suggested by this
page: http://ceph.com/geen-categorie/ceph-manually-repair-object/

This is what I ran:

systemctl stop ceph-osd@7
ceph-osd -i 7 --flush-journal
cd /var/lib/ceph/osd/ceph-7
cd current/2.7_head/
mv rb.0.145d.2ae8944a.00bb__head_6F5DBE87__2 ~/
ceph osd tree
systemctl start ceph-osd@7
ceph pg repair 2.7
Then I just get this:

2017-09-22 00:41:06.495399 7f22ac3bd700 -1 
log_channel(cluster) log [ERR] : 2.7 shard 2: soid 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head data_digest 0x62b74a1f != 
data_digest 0x43d61c5d from auth oi 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head(12962'694 osd.2.0:90545 
dirty|data_digest|omap_digest s 4194304 uv 484 dd 43d61c5d od  
alloc_hint [0 0])2017-09-22 00:41:06.495417 7f22ac3bd700 -1 
log_channel(cluster) log [ERR] : 2.7 shard 7 missing 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head2017-09-22 00:41:06.495424 
7f22ac3bd700 -1 log_channel(cluster) log [ERR] : 2.7 soid 
2:e17dbaf6:::rb.0.145d.2ae8944a.00bb:head: failed to pick suitable auth 
object
Moving the object from osd.2 instead results in a similar error message; it just
says missing on the top one. =P

I was hoping this time would give me a different result, as I let one more osd
copy the object from OSD1 by taking osd.7 down with noout set.  But it doesn't
appear to care about that extra data.  Maybe that only works when size is 3?
Basically, since I had most osds alive on OSD1, I was trying to favor data from
OSD1. =P
What can I do in this case? According to 

Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Andreas Calminder
Hi,
Yes I'm able to run these commands, however it is unclear both in man file
and the docs what's supposed to happen with the orphans, will they be
deleted once I run finish? Or will that just throw away the job? What will
orphans find actually produce? At the moment it just outputs a lot of text
saying something like putting $num in orphans.$jobid.$shardnum and listing
objects that are not orphans?

Regards,
Andreas

On 28 Sep 2017 15:10, "Webert de Souza Lima"  wrote:

Hello,

not an expert here but I think the answer is something like:

radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
radosgw-admin orphans finish --job-id=_JOB_ID_

_JOB_ID_ being anything.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hello,
> running Jewel on some nodes with rados gateway I've managed to get a
> lot of leaked multipart objects, most of them belonging to buckets
> that do not even exist anymore. We estimated these objects to occupy
> somewhere around 60TB, which would be great to reclaim. Question is
> how, since trying to find them one by one and perform some kind of
> sanity check if they're in use or not will take forever.
>
> The radosgw-admin orphans find command sounds like something I could
> use, but it's not clear if the command also removes the orphans? If
> not, what does it do? Can I use it to help me removing my orphan
> objects?
>
> Best regards,
> Andreas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Luminous v12.2.1 released

2017-09-28 Thread Abhishek

This is the first bugfix release of the Luminous v12.2.x long term stable
release series. It contains a range of bug fixes and a few features
across CephFS, RBD & RGW. We recommend that all users of the 12.2.x series
update.

For more details, refer to the release notes entry at the official
blog[1] and the complete changelog[2]

Notable Changes
---

* Dynamic resharding is now enabled by default for RGW; RGW will now
   automatically reshard the bucket index once the index grows beyond
   `rgw_max_objs_per_shard`.

* Limiting MDS cache via a memory limit is now supported using the new
   mds_cache_memory_limit config option (1GB by default).  A cache reservation
   can also be specified using mds_cache_reservation as a percentage of the
   limit (5% by default). Limits by inode count are still supported using
   mds_cache_size. Setting mds_cache_size to 0 (the default) disables the
   inode limit.

* The maximum number of PGs per OSD before the monitor issues a
   warning has been reduced from 300 to 200 PGs.  200 is still twice
   the generally recommended target of 100 PGs per OSD.  This limit can
   be adjusted via the ``mon_max_pg_per_osd`` option on the
   monitors.  The older ``mon_pg_warn_max_per_osd`` option has been
removed.

* Creating pools or adjusting pg_num will now fail if the change would
   make the number of PGs per OSD exceed the configured
   ``mon_max_pg_per_osd`` limit.  The option can be adjusted if it
   is really necessary to create a pool with more PGs.

* There was a bug in the PG mapping behavior of the new *upmap*
   feature. If you made use of this feature (e.g., via the `ceph osd
   pg-upmap-items` command), we recommend that all mappings be removed (via
   the `ceph osd rm-pg-upmap-items` command) before upgrading to this
   point release; a short sketch of that cleanup follows this list.

* A stall in BlueStore IO submission that was affecting many users has
   been resolved.
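
A rough sketch of that upmap cleanup (assuming pg_upmap_items entries appear in the
"ceph osd dump" output; verify on your cluster before running):

ceph osd dump | awk '/^pg_upmap_items/ {print $2}' |
while read pgid; do
    ceph osd rm-pg-upmap-items "$pgid"
done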

Other Notable Changes
-
* bluestore: async deferred_try_submit deadlock (issue#21207, pr#17494,
Sage Weil)
* bluestore: fix deferred write deadlock, aio short return handling
(issue#21171, pr#17601, Sage Weil)
* bluestore: osd crash when change option bluestore_csum_type from none
to CRC32 (issue#21175, pr#17497, xie xingguo)
* bluestore: os/bluestore/BlueFS.cc: 1255: FAILED
assert(!log_file->fnode.extents.empty()) (issue#21250, pr#17562, Sage
Weil)
* build/ops: ceph-fuse RPM should require fusermount (issue#21057,
pr#17470, Ken Dreyer)
* build/ops: RHEL 7.3 Selinux denials at OSD start (issue#19200,
pr#17468, Boris Ranto)
* build/ops: rocksdb,cmake:  build portable binaries (issue#20529,
pr#17745, Kefu Chai)
* cephfs: client/mds has wrong check to clear S_ISGID on chown
(issue#21004, pr#17471, Patrick Donnelly)
* cephfs: get_quota_root sends lookupname op for every buffered write
(issue#20945, pr#17473, Dan van der Ster)
* cephfs: MDCache::try_subtree_merge() may print N^2 lines of debug
message (issue#21221, pr#17712, Patrick Donnelly)
* cephfs: MDS rank add/remove log messages say wrong number of ranks
(issue#21421, pr#17887, John Spray)
* cephfs: MDS: standby-replay mds should avoid initiating subtree export
(issue#21378, issue#21222, pr#17714, "Yan, Zheng", Jianyu Li)
* cephfs: the standbys are not updated via ceph tell mds.\* command
(issue#21230, pr#17565, Kefu Chai)
* common: adding line break at end of some cli results (issue#21019,
pr#17467, songweibin)
* core: [cls] metadata_list API function does not honor `max_return`
parameter (issue#21247, pr#17558, Jason Dillaman)
* core: incorrect erasure-code space in command ceph df (issue#21243,
pr#17724, liuchang0812)
* core: interval_set: optimize intersect_of insert operations
(issue#21229, pr#17487, Zac Medico)
* core: osd crush rule rename not idempotent (issue#21162, pr#17481, xie
xingguo)
* core: osd/PGLog: write only changed dup entries (issue#21026,
pr#17378, Josh Durgin)
* doc: doc/rbd: iSCSI Gateway Documentation (issue#20437, pr#17381, Aron
Gunn, Jason Dillaman)
* mds: fix 'dirfrag end' check in Server::handle_client_readdir
(issue#21070, pr#17686, "Yan, Zheng")
* mds: support limiting cache by memory (issue#20594, pr#17711, "Yan,
Zheng", Patrick Donnelly)
* mgr: 500 error when attempting to view filesystem data (issue#20692,
pr#17477, John Spray)
* mgr: ceph mgr versions shows active mgr as Unknown (issue#21260,
pr#17635, John Spray)
* mgr: Crash in MonCommandCompletion (issue#21157, pr#17483, John Spray)
* mon: mon/OSDMonitor: deleting pool while pgs are being created leads
to assert(p != pools.end) in update_creating_pgs() (issue#21309,
pr#17634, Joao Eduardo Luis)
* mon: OSDMonitor: osd pool application get support (issue#20976,
pr#17472, xie xingguo)
* mon: rate limit on health check update logging (issue#20888, pr#17500,
John Spray)
* osd: build_initial_pg_history doesn't update up/acting/etc
(issue#21203, pr#17496, w11979, Sage Weil)
* osd: osd/PG: discard msgs from down peers (issue#19605, pr#17501, Kefu
Chai)
* osd/PrimaryLogPG: request osdmap 

[ceph-users] Openstack (pike) Ceilometer-API deprecated. RadosGW stats?

2017-09-28 Thread magicb...@gmail.com

Hi

it looks like OpenStack (Pike) has deprecated Ceilometer-API.

Is this a problem for RadosGW when pushes stats to Openstack Telemetry 
service?


Thanks,
J.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

2017-09-28 Thread Sean Purdy
On Thu, 28 Sep 2017, Matthew Vernon said:
> Hi,
> 
> TL;DR - the timeout setting in ceph-disk@.service is (far) too small - it
> needs increasing and/or removing entirely. Should I copy this to ceph-devel?

Just a note.  Looks like debian stretch luminous packages have a 10_000 second 
timeout:

from /lib/systemd/system/ceph-disk@.service

Environment=CEPH_DISK_TIMEOUT=10000
ExecStart=/bin/sh -c 'timeout $CEPH_DISK_TIMEOUT flock 
/var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout 
trigger --sync %f'
 

Sean

> On 15/09/17 16:48, Matthew Vernon wrote:
> >On 14/09/17 16:26, Götz Reinicke wrote:
> >>After that, 10 OSDs did not come up like the others. The disk did not get
> >>mounted and the OSD processes did nothing … even after a couple of
> >>minutes no more disks/OSDs showed up.
> >
> >I'm still digging, but AFAICT it's a race condition in startup - in our
> >case, we're only seeing it if some of the filesystems aren't clean. This
> >may be related to the thread "Very slow start of osds after reboot" from
> >August, but I don't think any conclusion was reached there.
> 
> This annoyed me enough that I went off to find the problem :-)
> 
> On systemd-enabled machines[0] ceph disks are activated by systemd's
> ceph-disk@.service, which calls:
> 
> /bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f)
> /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
> 
> ceph-disk trigger --sync calls ceph-disk activate which (among other things)
> mounts the osd fs (first in a temporary location, then in /var/lib/ceph/osd/
> once it's extracted the osd number from the fs). If the fs is unclean, XFS
> auto-recovers before mounting (which takes time - range 2-25s for our 6TB
> disks). Importantly, there is a single global lock file[1] so only one
> ceph-disk activate can be doing this at once.
> 
> So, each fs is auto-recovering one at at time (rather than in parallel), and
> once the elapsed time gets past 120s, timeout kills the flock, systemd kills
> the cgroup, and no more OSDs start up - we typically find a few fs mounted
> in /var/lib/ceph/tmp/mnt.. systemd keeps trying to start the remaining
> osds (via ceph-osd@.service), but their fs isn't in the correct place, so
> this never works.
> 
> The fix/workaround is to adjust the timeout value (edit the service file
> directly, or for style points write an override in /etc/systemd/system
> remembering you need a blank ExecStart line before your revised one).
> 
> Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to
> start all its osds when started up with all fss dirty. So the current 120s
> is far too small (it's just about OK when all the osd fss are clean).
> 
> I think, though, that having the timeout at all is a bug - if something
> needs to time out under some circumstances, should it be at a lower layer,
> perhaps?
> 
> A couple of final points/asides, if I may:
> 
> ceph-disk trigger uses subprocess.communicate (via the command() function),
> which means it swallows the log output from ceph-disk activate (and only
> outputs it after that process finishes) - as well as producing confusing
> timestamps, this means that when systemd kills the cgroup, all the output
> from the ceph-disk activate command vanishes into the void. That made
> debugging needlessly hard. Better to let called processes like that output
> immediately?
> 
> Does each fs need mounting twice? could the osd be encoded in the partition
> label or similar instead?
> 
> Is a single global activation lock necessary? It slows startup down quite a
> bit; I see no reason why (at least in the one-osd-per-disk case) you
> couldn't be activating all the osds at once...
> 
> Regards,
> 
> Matthew
> 
> [0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the
> timeout, so presumably upstart systems aren't affected
> [1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu
> 
> 
> -- 
> The Wellcome Trust Sanger Institute is operated by Genome Research Limited,
> a charity registered in England with number 1021457 and a company registered
> in England with number 2742969, whose registered office is 215 Euston Road,
> London, NW1 2BE. ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW how to delete orphans

2017-09-28 Thread Webert de Souza Lima
Hello,

not an expert here but I think the answer is something like:

radosgw-admin orphans find --pool=_DATA_POOL_ --job-id=_JOB_ID_
radosgw-admin orphans finish --job-id=_JOB_ID_

_JOB_ID_ being anything.



Regards,

Webert Lima
DevOps Engineer at MAV Tecnologia
*Belo Horizonte - Brasil*

On Thu, Sep 28, 2017 at 9:38 AM, Andreas Calminder <
andreas.calmin...@klarna.com> wrote:

> Hello,
> running Jewel on some nodes with rados gateway I've managed to get a
> lot of leaked multipart objects, most of them belonging to buckets
> that do not even exist anymore. We estimated these objects to occupy
> somewhere around 60TB, which would be great to reclaim. Question is
> how, since trying to find them one by one and perform some kind of
> sanity check if they're in use or not will take forever.
>
> The radosgw-admin orphans find command sounds like something I could
> use, but it's not clear if the command also removes the orphans? If
> not, what does it do? Can I use it to help me removing my orphan
> objects?
>
> Best regards,
> Andreas
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW how to delete orphans

2017-09-28 Thread Andreas Calminder
Hello,
running Jewel on some nodes with rados gateway I've managed to get a
lot of leaked multipart objects, most of them belonging to buckets
that do not even exist anymore. We estimated these objects to occupy
somewhere around 60TB, which would be great to reclaim. Question is
how, since trying to find them one by one and perform some kind of
sanity check if they're in use or not will take forever.

The radosgw-admin orphans find command sounds like something I could
use, but it's not clear if the command also removes the orphans? If
not, what does it do? Can I use it to help me removing my orphan
objects?

Best regards,
Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS crashes shortly after startup while trying to purge stray files.

2017-09-28 Thread Micha Krause

Hi,

I had a chance to catch John Spray at the Ceph Day, and he suggested that I try 
to reproduce this bug in Luminous.

To fix my immediate problem we discussed 2 ideas:

1. Manually edit the Meta-data, unfortunately I was not able to find any 
Information on how the meta-data is structured :-(

2. Edit the code to set the link count to 0 if it is negative:


diff --git a/src/mds/StrayManager.cc b/src/mds/StrayManager.cc
index 9e53907..2ca1449 100644
--- a/src/mds/StrayManager.cc
+++ b/src/mds/StrayManager.cc
@@ -553,6 +553,10 @@ bool StrayManager::__eval_stray(CDentry *dn, bool delay)
 logger->set(l_mdc_num_strays_delayed, num_strays_delayed);
   }

+  if (in->inode.nlink < 0) {
+in->inode.nlink=0;
+  }
+
   // purge?
   if (in->inode.nlink == 0) {
 // past snaprealm parents imply snapped dentry remote links.
diff --git a/src/xxHash b/src/xxHash
--- a/src/xxHash
+++ b/src/xxHash
@@ -1 +1 @@


I'm not sure if this works; the patched mds no longer crashes, however I 
expected that this value:

root@mds02:~ # ceph daemonperf mds.1
-mds-- --mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
   0  100k   0 |  0    0    0 |  0    0    0 |  0    0  625k    0 | 30   25k    0
   

Should go down, but it stays at 625k, unfortunately I don't have another System 
to compare.

After I started the patched mds once, I reverted back to an unpatched mds, and it also 
stopped crashing, so I guess it did "fix" something.


A question just out of curiosity, I tried to log these events with something 
like:

 dout(10) << "Fixed negative inode count";

or

 derr << "Fixed negative inode count";

But my compiler yelled at me for trying this.


Micha Krause
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph/systemd startup bug (was Re: Some OSDs are down after Server reboot)

2017-09-28 Thread Matthew Vernon

Hi,

TL;DR - the timeout setting in ceph-disk@.service is (far) too small - 
it needs increasing and/or removing entirely. Should I copy this to 
ceph-devel?


On 15/09/17 16:48, Matthew Vernon wrote:

On 14/09/17 16:26, Götz Reinicke wrote:

After that, 10 OSDs did not come up like the others. The disk did not get
mounted and the OSD processes did nothing … even after a couple of
minutes no more disks/OSDs showed up.


I'm still digging, but AFAICT it's a race condition in startup - in our
case, we're only seeing it if some of the filesystems aren't clean. This
may be related to the thread "Very slow start of osds after reboot" from
August, but I don't think any conclusion was reached there.


This annoyed me enough that I went off to find the problem :-)

On systemd-enabled machines[0] ceph disks are activated by systemd's 
ceph-disk@.service, which calls:


/bin/sh -c 'timeout 120 flock /var/lock/ceph-disk-$(basename %f) 
/usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'


ceph-disk trigger --sync calls ceph-disk activate which (among other 
things) mounts the osd fs (first in a temporary location, then in 
/var/lib/ceph/osd/ once it's extracted the osd number from the fs). If 
the fs is unclean, XFS auto-recovers before mounting (which takes time - 
range 2-25s for our 6TB disks). Importantly, there is a single global 
lock file[1] so only one ceph-disk activate can be doing this at once.


So, each fs is auto-recovering one at at time (rather than in parallel), 
and once the elapsed time gets past 120s, timeout kills the flock, 
systemd kills the cgroup, and no more OSDs start up - we typically find 
a few fs mounted in /var/lib/ceph/tmp/mnt.. systemd keeps trying to 
start the remaining osds (via ceph-osd@.service), but their fs isn't in 
the correct place, so this never works.


The fix/workaround is to adjust the timeout value (edit the service file 
directly, or for style points write an override in /etc/systemd/system 
remembering you need a blank ExecStart line before your revised one).
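
A minimal sketch of such an override (the 10000s value mirrors what the Debian
stretch packages reportedly ship; pick whatever suits your disk count):

mkdir -p /etc/systemd/system/ceph-disk@.service.d
cat > /etc/systemd/system/ceph-disk@.service.d/timeout.conf <<'EOF'
[Service]
ExecStart=
ExecStart=/bin/sh -c 'timeout 10000 flock /var/lock/ceph-disk-$(basename %f) /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %f'
EOF
systemctl daemon-reload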


Experimenting, one of our storage nodes with 60 6TB disks took 17m35s to 
start all its osds when started up with all fss dirty. So the current 
120s is far too small (it's just about OK when all the osd fss are clean).


I think, though, that having the timeout at all is a bug - if something 
needs to time out under some circumstances, should it be at a lower 
layer, perhaps?


A couple of final points/asides, if I may:

ceph-disk trigger uses subprocess.communicate (via the command() 
function), which means it swallows the log output from ceph-disk 
activate (and only outputs it after that process finishes) - as well as 
producing confusing timestamps, this means that when systemd kills the 
cgroup, all the output from the ceph-disk activate command vanishes into 
the void. That made debugging needlessly hard. Better to let called 
processes like that output immediately?


Does each fs need mounting twice? Could the osd be encoded in the 
partition label or similar instead?


Is a single global activation lock necessary? It slows startup down 
quite a bit; I see no reason why (at least in the one-osd-per-disk case) 
you couldn't be activating all the osds at once...


Regards,

Matthew

[0] I note, for instance, that /etc/init/ceph-disk.conf doesn't have the 
timeout, so presumably upstart systems aren't affected

[1] /var/lib/ceph/tmp/ceph-disk.activate.lock at least on Ubuntu


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-28 Thread John Spray
On Thu, Sep 28, 2017 at 11:51 AM, Richard Hesketh
 wrote:
> On 27/09/17 19:35, John Spray wrote:
>> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh
>>  wrote:
>>> On 27/09/17 12:32, John Spray wrote:
 On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
  wrote:
> As the subject says... any ceph fs administrative command I try to run 
> hangs forever and kills monitors in the background - sometimes they come 
> back, on a couple of occasions I had to manually stop/restart a suffering 
> mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an 
> error and can also kill a monitor. However, clients can mount the 
> filesystem and read/write data without issue.
>
> Relevant excerpt from logs on an affected monitor, just trying to run 
> 'ceph fs ls':
>
> 2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
> handle_command mon_command({"prefix": "fs ls"} v 0) v1
> 2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
> from='client.? 10.10.10.1:0/2771553898' entity='client.admin' 
> cmd=[{"prefix": "fs ls"}]: dispatch
> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 
> /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& 
> OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 
> 13:20:50.727676
> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != 
> pool_name.end())
>
>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous 
> (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x102) [0x55a8ca0bb642]
>  2: (()+0x48165f) [0x55a8c9f4165f]
>  3: 
> (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18)
>  [0x55a8ca047688]
>  4: 
> (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
> [0x55a8ca048008]
>  5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
> [0x55a8c9f9d1b0]
>  6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
> [0x55a8c9e63193]
>  7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
> [0x55a8c9e6a52e]
>  8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
>  9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
>  10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
>  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
>  12: (()+0x76ba) [0x7fc86b3ac6ba]
>  13: (clone()+0x6d) [0x7fc869bd63dd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
> to interpret this.
>
> I'm running Luminous. The cluster and FS have been in service since 
> Hammer and have default data/metadata pool names. I discovered the issue 
> after attempting to enable directory sharding.

 Well that's not good...

 The assertion is because your FSMap is referring to a pool that
 apparently no longer exists in the OSDMap.  This should be impossible
 in current Ceph (we forbid removing pools if they're in use), but
 could perhaps have been caused in an earlier version of Ceph when it
 was possible to remove a pool even if CephFS was referring to it?

 Alternatively, perhaps something more severe is going on that's
 causing your mons to see a wrong/inconsistent view of the world.  Has
 the cluster ever been through any traumatic disaster recovery type
 activity involving hand-editing any of the cluster maps?  What
 intermediate versions has it passed through on the way from Hammer to
 Luminous?

 Opened a ticket here: http://tracker.ceph.com/issues/21568

 John
>>>
>>> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually 
>>> inherited this cluster from a colleague who left shortly after I joined, so 
>>> unfortunately there is some of its history I cannot fill in.
>>>
>>> Turns out the cluster actually predates Firefly. Looking at dates my 
>>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I 
>>> inherited it at Hammer, and took it Hammer -> Infernalis -> Jewel -> 
>>> Luminous myself. I know I did make sure to do the tmap_upgrade step on 
>>> cephfs but can't remember if I did it at Infernalis or Jewel.
>>>
>>> Infernalis was a tricky upgrade; the attempt was aborted once after the 
>>> first set of OSDs didn't come back up after upgrade (had to 
>>> remove/downgrade and readd), and setting sortbitwise as the documentation 
>>> suggested after a successful second attempt caused everything to break and 
>>> degrade slowly until it was unset and recovered. Never had disaster 
>>> recovery involve mucking around with the pools while I was administrating 
>>> it, but unfortunately I cannot speak for the cluster's pre-Hammer history. 
>>> The only pools I have removed were ones I created temporarily for testing 
>>> 

Re: [ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-28 Thread Shinobu Kinjo
So the problem you faced has been completely solved?

On Thu, Sep 28, 2017 at 7:51 PM, Richard Hesketh
 wrote:
> On 27/09/17 19:35, John Spray wrote:
>> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh
>>  wrote:
>>> On 27/09/17 12:32, John Spray wrote:
 On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
  wrote:
> As the subject says... any ceph fs administrative command I try to run 
> hangs forever and kills monitors in the background - sometimes they come 
> back, on a couple of occasions I had to manually stop/restart a suffering 
> mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an 
> error and can also kill a monitor. However, clients can mount the 
> filesystem and read/write data without issue.
>
> Relevant excerpt from logs on an affected monitor, just trying to run 
> 'ceph fs ls':
>
> 2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
> handle_command mon_command({"prefix": "fs ls"} v 0) v1
> 2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
> from='client.? 10.10.10.1:0/2771553898' entity='client.admin' 
> cmd=[{"prefix": "fs ls"}]: dispatch
> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 
> /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& 
> OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 
> 13:20:50.727676
> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != 
> pool_name.end())
>
>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous 
> (rc)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x102) [0x55a8ca0bb642]
>  2: (()+0x48165f) [0x55a8c9f4165f]
>  3: 
> (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18)
>  [0x55a8ca047688]
>  4: 
> (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
> [0x55a8ca048008]
>  5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
> [0x55a8c9f9d1b0]
>  6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
> [0x55a8c9e63193]
>  7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
> [0x55a8c9e6a52e]
>  8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
>  9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
>  10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
>  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
>  12: (()+0x76ba) [0x7fc86b3ac6ba]
>  13: (clone()+0x6d) [0x7fc869bd63dd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
> to interpret this.
>
> I'm running Luminous. The cluster and FS have been in service since 
> Hammer and have default data/metadata pool names. I discovered the issue 
> after attempting to enable directory sharding.

 Well that's not good...

 The assertion is because your FSMap is referring to a pool that
 apparently no longer exists in the OSDMap.  This should be impossible
 in current Ceph (we forbid removing pools if they're in use), but
 could perhaps have been caused in an earlier version of Ceph when it
 was possible to remove a pool even if CephFS was referring to it?

 Alternatively, perhaps something more severe is going on that's
 causing your mons to see a wrong/inconsistent view of the world.  Has
 the cluster ever been through any traumatic disaster recovery type
 activity involving hand-editing any of the cluster maps?  What
 intermediate versions has it passed through on the way from Hammer to
 Luminous?

 Opened a ticket here: http://tracker.ceph.com/issues/21568

 John
>>>
>>> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually 
>>> inherited this cluster from a colleague who left shortly after I joined, so 
>>> unfortunately there is some of its history I cannot fill in.
>>>
>>> Turns out the cluster actually predates Firefly. Looking at dates my 
>>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I 
>>> inherited it at Hammer, and took it Hammer -> Infernalis -> Jewel -> 
>>> Luminous myself. I know I did make sure to do the tmap_upgrade step on 
>>> cephfs but can't remember if I did it at Infernalis or Jewel.
>>>
>>> Infernalis was a tricky upgrade; the attempt was aborted once after the 
>>> first set of OSDs didn't come back up after upgrade (had to 
>>> remove/downgrade and readd), and setting sortbitwise as the documentation 
>>> suggested after a successful second attempt caused everything to break and 
>>> degrade slowly until it was unset and recovered. Never had disaster 
>>> recovery involve mucking around with the pools while I was administrating 
>>> it, but unfortunately I cannot speak for the cluster's pre-Hammer history. 
>>> The only pools I have 

Re: [ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-28 Thread Richard Hesketh
On 27/09/17 19:35, John Spray wrote:
> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh
>  wrote:
>> On 27/09/17 12:32, John Spray wrote:
>>> On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
>>>  wrote:
 As the subject says... any ceph fs administrative command I try to run 
 hangs forever and kills monitors in the background - sometimes they come 
 back, on a couple of occasions I had to manually stop/restart a suffering 
 mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an 
 error and can also kill a monitor. However, clients can mount the 
 filesystem and read/write data without issue.

 Relevant excerpt from logs on an affected monitor, just trying to run 
 'ceph fs ls':

 2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
 handle_command mon_command({"prefix": "fs ls"} v 0) v1
 2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
 from='client.? 10.10.10.1:0/2771553898' entity='client.admin' 
 cmd=[{"prefix": "fs ls"}]: dispatch
 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 
 /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& 
 OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 
 13:20:50.727676
 /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != 
 pool_name.end())

  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous 
 (rc)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
 const*)+0x102) [0x55a8ca0bb642]
  2: (()+0x48165f) [0x55a8c9f4165f]
  3: 
 (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18)
  [0x55a8ca047688]
  4: 
 (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
 [0x55a8ca048008]
  5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
 [0x55a8c9f9d1b0]
  6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
 [0x55a8c9e63193]
  7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
 [0x55a8c9e6a52e]
  8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
  9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
  10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
  12: (()+0x76ba) [0x7fc86b3ac6ba]
  13: (clone()+0x6d) [0x7fc869bd63dd]
  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
 to interpret this.

 I'm running Luminous. The cluster and FS have been in service since Hammer 
 and have default data/metadata pool names. I discovered the issue after 
 attempting to enable directory sharding.
>>>
>>> Well that's not good...
>>>
>>> The assertion is because your FSMap is referring to a pool that
>>> apparently no longer exists in the OSDMap.  This should be impossible
>>> in current Ceph (we forbid removing pools if they're in use), but
>>> could perhaps have been caused in an earlier version of Ceph when it
>>> was possible to remove a pool even if CephFS was referring to it?
>>>
>>> Alternatively, perhaps something more severe is going on that's
>>> causing your mons to see a wrong/inconsistent view of the world.  Has
>>> the cluster ever been through any traumatic disaster recovery type
>>> activity involving hand-editing any of the cluster maps?  What
>>> intermediate versions has it passed through on the way from Hammer to
>>> Luminous?
>>>
>>> Opened a ticket here: http://tracker.ceph.com/issues/21568
>>>
>>> John
>>
>> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually inherited 
>> this cluster from a colleague who left shortly after I joined, so 
>> unfortunately there is some of its history I cannot fill in.
>>
>> Turns out the cluster actually predates Firefly. Looking at dates my 
>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I inherited 
>> it at Hammer, and took it Hammer -> Infernalis -> Jewel -> Luminous myself. 
>> I know I did make sure to do the tmap_upgrade step on cephfs but can't 
>> remember if I did it at Infernalis or Jewel.
>>
>> Infernalis was a tricky upgrade; the attempt was aborted once after the 
>> first set of OSDs didn't come back up after upgrade (had to remove/downgrade 
>> and readd), and setting sortbitwise as the documentation suggested after a 
>> successful second attempt caused everything to break and degrade slowly 
>> until it was unset and recovered. Never had disaster recovery involve 
>> mucking around with the pools while I was administrating it, but 
>> unfortunately I cannot speak for the cluster's pre-Hammer history. The only 
>> pools I have removed were ones I created temporarily for testing crush 
>> rules/benchmarking.
> 
> OK, so it sounds like a cluster with an interesting history and some
> stories to tell :-)
> 
>> I have hand-edited the crush map (extract, decompile, 

Re: [ceph-users] RDMA with mellanox connect x3pro on debian stretch and proxmox v5.0 kernel 4.10.17-3

2017-09-28 Thread Gerhard W. Recher
Hi Haomai,

can you please guide me to a running cluster with RDMA ?

regards

Gerhard W. Recher

net4sec UG (haftungsbeschränkt)
Leitenweg 6
86929 Penzing

+49 171 4802507
On 28.09.2017 at 04:21, Haomai Wang wrote:
> previously we have a infiniband cluster, recently we deploy a roce
> cluster. they are both test purpose for users.
>
> On Wed, Sep 27, 2017 at 11:38 PM, Gerhard W. Recher
>  wrote:
>> Haomai,
>>
>> I looked at your presentation, so i guess you already have a running
>> cluster with RDMA & mellanox
>> (https://www.youtube.com/watch?v=Qb2SUWLdDCw)
>>
>> Is nobody out there having a running cluster with RDMA ?
>> any help is appreciated !
>>
>> Gerhard W. Recher
>>
>> net4sec UG (haftungsbeschränkt)
>> Leitenweg 6
>> 86929 Penzing
>>
>> +49 171 4802507
>>> On 27.09.2017 at 16:09, Haomai Wang wrote:
>>> https://community.mellanox.com/docs/DOC-2415
>>>
>>> On Wed, Sep 27, 2017 at 10:01 PM, Gerhard W. Recher
>>>  wrote:
 How to set local gid option ?

 I have no clue :)

 Gerhard W. Recher

 net4sec UG (haftungsbeschränkt)
 Leitenweg 6
 86929 Penzing

 +49 171 4802507
 On 27.09.2017 at 15:59, Haomai Wang wrote:
> do you set local gid option?
>
> On Wed, Sep 27, 2017 at 9:52 PM, Gerhard W. Recher
>  wrote:
>> Yep RoCE
>>
>> i followed up all recommendations in mellanox papers ...
>>
>> /etc/security/limits.conf
>>
>> * soft memlock unlimited
>> * hard memlock unlimited
>> root soft memlock unlimited
>> root hard memlock unlimited
>>
>>
>> also set properties on daemons (chapter 11) in
>> https://community.mellanox.com/docs/DOC-2721
>>
>>
>> the only catch: setting a gid parameter in ceph.conf is not an option with Proxmox, because
>> ceph.conf is the same file for all storage nodes
>> root@pve01:/etc/ceph# ls -latr
>> total 16
>> lrwxrwxrwx   1 root root   18 Jun 21 19:35 ceph.conf -> 
>> /etc/pve/ceph.conf
>>
>> and each node has unique GIDs.
>>
>>
>> ./showgids
>> DEV PORTINDEX   GID
>> IPv4VER DEV
>> --- -   ---
>> --- ---
>> mlx4_0  1   0
>> fe80::::268a:07ff:fee2:6070 v1  ens1
>> mlx4_0  1   1
>> fe80::::268a:07ff:fee2:6070 v2  ens1
>> mlx4_0  1   2   ::::::c0a8:dd8d
>> 192.168.221.141 v1  vmbr0
>> mlx4_0  1   3   ::::::c0a8:dd8d
>> 192.168.221.141 v2  vmbr0
>> mlx4_0  2   0
>> fe80::::268a:07ff:fee2:6071 v1  ens1d1
>> mlx4_0  2   1
>> fe80::::268a:07ff:fee2:6071 v2  ens1d1
>> mlx4_0  2   2   ::::::c0a8:648d
>> 192.168.100.141 v1  ens1d1
>> mlx4_0  2   3   ::::::c0a8:648d
>> 192.168.100.141 v2  ens1d1
>> n_gids_found=8
>>
>> next node ... showgids
>> ./showgids
>> DEV PORTINDEX   GID
>> IPv4VER DEV
>> --- -   ---
>> --- ---
>> mlx4_0  1   0
>> fe80::::268a:07ff:fef9:8730 v1  ens1
>> mlx4_0  1   1
>> fe80::::268a:07ff:fef9:8730 v2  ens1
>> mlx4_0  1   2   ::::::c0a8:dd8e
>> 192.168.221.142 v1  vmbr0
>> mlx4_0  1   3   ::::::c0a8:dd8e
>> 192.168.221.142 v2  vmbr0
>> mlx4_0  2   0
>> fe80::::268a:07ff:fef9:8731 v1  ens1d1
>> mlx4_0  2   1
>> fe80::::268a:07ff:fef9:8731 v2  ens1d1
>> mlx4_0  2   2   ::::::c0a8:648e
>> 192.168.100.142 v1  ens1d1
>> mlx4_0  2   3   ::::::c0a8:648e
>> 192.168.100.142 v2  ens1d1
>> n_gids_found=8
>>
>>
>>
>> ifconfig ens1d1
>> ens1d1: flags=4163  mtu 9000
>> inet 192.168.100.141  netmask 255.255.255.0  broadcast
>> 192.168.100.255
>> inet6 fe80::268a:7ff:fee2:6071  prefixlen 64  scopeid 0x20
>> ether 24:8a:07:e2:60:71  txqueuelen 1000  (Ethernet)
>> RX packets 25450717  bytes 39981352146 (37.2 GiB)
>> RX errors 0  dropped 77  overruns 77  frame 0
>> TX packets 26554236  bytes 53419159091 (49.7 GiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
>>
>>
>>
>> Gerhard W. Recher
>>
>> net4sec UG (haftungsbeschränkt)
>> Leitenweg 6

Re: [ceph-users] tunable question

2017-09-28 Thread mj

Hi Dan, list,

Our cluster is small: three nodes, 24 4TB platter OSDs in total, SSD 
journals. Using rbd for VMs. That's it. Runs nicely though :-)


The fact that "tunable optimal" for jewel would result in "significantly 
fewer mappings change when an OSD is marked out of the cluster" is what 
attracts us.


Reasoning behind it: upgrading to "optimal" NOW should result in faster 
rebuild times when disaster strikes, and we're all stressed out. :-)


After the jewel upgrade, we also upgraded the tunables from "(require 
bobtail, min is firefly)" to "hammer". This resulted in approximately 24 hours 
of rebuilding, but actually without significant impact on the hosted VMs.


Is it safe to assume that setting it to "optimal" would have a similar 
impact, or are the implications bigger?
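
(Note: a rough sketch of how such a change is often staged, purely as an
illustration - the throttle values here are examples, not a recommendation for
any particular cluster:

ceph osd crush show-tunables                  # confirm the current profile first
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
ceph osd crush tunables optimal               # this step triggers the large remap

The injectargs values only last until the OSDs restart.)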


MJ


On 09/28/2017 10:29 AM, Dan van der Ster wrote:

Hi,

How big is your cluster and what is your use case?

For us, we'll likely never enable the recent tunables that need to
remap *all* PGs -- it would simply be too disruptive for marginal
benefit.

Cheers, Dan


On Thu, Sep 28, 2017 at 9:21 AM, mj  wrote:

Hi,

We have completed the upgrade to jewel, and we set tunables to hammer.
Cluster again HEALTH_OK. :-)

But now, we would like to proceed in the direction of luminous and bluestore
OSDs, and we would like to ask for some feedback first.

 From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an
existing cluster will result in a very large amount of data movement as
almost every PG mapping is likely to change."

Given the above, and the fact that we would like to proceed to
luminous/bluestore in the not too far away future: What is cleverer:

1 - keep the cluster at tunable hammer now, upgrade to luminous in a little
while, change OSDs to bluestore, and then set tunables to optimal

or

2 - set tunable to optimal now, take the impact of "almost all PG
remapping", and when that is finished, upgrade to luminous, bluestore etc.

Which route is the preferred one?

Or is there a third (or fourth?) option..? :-)

MJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Developers Monthly - October

2017-09-28 Thread Joao Eduardo Luis

On 09/28/2017 04:08 AM, Leonardo Vaz wrote:

Hey Cephers,

This is just a friendly reminder that the next Ceph Developer Montly
meeting is coming up:

  http://wiki.ceph.com/Planning

If you have work that you're doing that is feature work, significant
backports, or anything you would like to discuss with the core team,
please add it to the following page:

  http://wiki.ceph.com/CDM_04-OCT-2017


Will we, at some point, have a European-friendly time? :)

From the planning page, I see that we've been alternating between 21h00 
EDT and 12h30 EDT for over a year, but this time around we're having two 
straight 21h00 EDT sessions instead.


Was this a copy-paste mistake, or are we actually having another APAC 
friendly session again?


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Developers Monthly - October

2017-09-28 Thread Shinobu Kinjo
Are we going to have the next CDM in an APAC-friendly time slot again?



On Thu, Sep 28, 2017 at 12:08 PM, Leonardo Vaz  wrote:
> Hey Cephers,
>
> This is just a friendly reminder that the next Ceph Developer Montly
> meeting is coming up:
>
>  http://wiki.ceph.com/Planning
>
> If you have work that you're doing that is feature work, significant
> backports, or anything you would like to discuss with the core team,
> please add it to the following page:
>
>  http://wiki.ceph.com/CDM_04-OCT-2017
>
> If you have questions or comments, please let us know.
>
> Kindest regards,
>
> Leo
>
> --
> Leonardo Vaz
> Ceph Community Manager
> Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need some help/advice upgrading Hammer to Jewel - HEALTH_ERR shutting down OSD

2017-09-28 Thread Eric van Blokland
David,

Thank you so much for your reply. I'm not entirely satisfied though. I'm
expecting the PG states "degraded" and "undersized". Those should result in
a HEALTH_WARN. I'm particularly worried about the "stuck inactive" part.
Please correct me if I'm wrong, but I was under the impression that a PG
would only get into that state if all OSDs that have that PG mapped are down.

Even if the cluster would recover immediately after updating and bringing
the OSDs back up, I really wouldn't feel comfortable doing this while the
cluster is online and being used.
I think I'll schedule downtime and do an offline upgrade instead, just to
be safe. Nonetheless I would really like to know what is wrong with either
this cluster or my understanding of Ceph.
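
(Note: the usual rolling, per-node pattern looks roughly like the sketch below;
the osd id is a placeholder, and the exact service commands differ between the
Hammer and Jewel packaging:

ceph osd set noout                   # keep CRUSH from marking stopped OSDs out and rebalancing
/etc/init.d/ceph stop osd.3          # stop each OSD on the node (Hammer's sysvinit script)
# ... upgrade the ceph packages on this node ...
systemctl start ceph-osd@3           # Jewel's systemd unit takes over after the upgrade
ceph osd unset noout                 # once the node's OSDs are back up and in

The Jewel packages also switch the daemons to running as the "ceph" user, which
may require a chown -R ceph:ceph of the OSD data directories - check the
release notes.)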

Below is the ceph -s output for my test environment. I would expect the production
cluster to act the same. I also find it odd that the test setup didn't get the
tunables warning. Both clusters were running a Hammer release when
initialized, probably not the exact same versions though.

health HEALTH_WARN
53 pgs degraded
53 pgs stuck degraded
67 pgs stuck unclean
53 pgs stuck undersized
53 pgs undersized
recovery 28/423 objects degraded (6.619%)
 monmap e3: 3 mons at {mgm1=
10.10.100.21:6789/0,mgm2=10.10.100.22:6789/0,mgm3=10.10.100.23:6789/0}
election epoch 40, quorum 0,1,2 mgm1,mgm2,mgm3
 osdmap e163: 6 osds: 4 up, 4 in; 14 remapped pgs
  pgmap v2320: 96 pgs, 2 pools, 514 MB data, 141 objects
1638 MB used, 100707 MB / 102345 MB avail
28/423 objects degraded (6.619%)
  53 active+undersized+degraded
  29 active+clean
  14 active+remapped

Kind regards,

Eric van Blokland

On Thu, Sep 28, 2017 at 3:02 AM, David Turner  wrote:

> There are new PG states that cause health_err. In this case it is
> undersized that is causing this state.
>
> While I decided to upgrade my tunables before upgrading the rest of my
> cluster, it does not seem to be a requirement. However I would recommend
> upgrading them sooner than later. It will cause a fair amount of
> backfilling when you do it. If you are using krbd, don't upgrade your
> tunables past Hammer.
>
> In any case, you should feel safe continuing with your upgrade. You will
> definitely be safe to finish this first node as you have 2 copies of your
> data if anything goes awry. I would say that this first node will finish
> and get back to a state where all backfilling is done and you can continue
> with the other nodes.
>
> On Wed, Sep 27, 2017, 6:32 PM Eric van Blokland 
> wrote:
>
>> Hello,
>>
>> I have run into an issue while upgrading a Ceph cluster from Hammer to
>> Jewel on CentOS. It's a small cluster with 3 monitoring servers and a
>> humble 6 OSDs distributed over 3 servers.
>>
>> I've upgraded the 3 monitors successfully to 10.2.7. They appear to be
>> running fine except for this health warning: "crush map has legacy tunables
>> (require bobtail, min is firefly)". While I might completely underestimate
>> the significance of this warning, it seemed pretty harmless to me and I
>> decided to upgrade my OSDs (running 0.94.10) before touching the tunables.
>>
>> However, as soon as I brought down the OSDs on the first storage server
>> to start upgrading them, the cluster immediately got a HEALTH_ERR status
>> (see ceph -s output below) which made me abort to update process and just
>> start the OSDs again.
>>
>> Now considering that my crushmap forces distribution of 3 copies over 3
>> servers, the cluster can't heal itself when I take those OSDs down, which
>> would justify an error status. I'm worried however because my memory and my
>> lab environment tell me that this situation should only give a health
>> warning and only degraded PGs, not stuck/inactive (or did my lab
>> environment not get the stuck pgs because they were not being addressed?).
>>
>>  health HEALTH_ERR
>> 199 pgs are stuck inactive for more than 300 seconds
>> 576 pgs degraded
>> 199 pgs stuck inactive
>> 238 pgs stuck unclean
>> 576 pgs undersized
>> recovery 1415496/4246488 objects degraded (33.333%)
>> 2/6 in osds are down
>> crush map has legacy tunables (require bobtail, min is
>> firefly)
>>  monmap e1: 3 mons at {mgm1=10.10.3.11:6789/0,mgm2=
>> 10.10.3.12:6789/0,mgm3=10.10.3.13:6789/0}
>> election epoch 1650, quorum 0,1,2 mgm1,mgm2,mgm3
>>  osdmap e808: 6 osds: 4 up, 6 in; 576 remapped pgs
>>   pgmap v4309615: 576 pgs, 5 pools, 1483 GB data, 1382 kobjects
>> 4445 GB used, 7836 GB / 12281 GB avail
>> 1415496/4246488 objects degraded (33.333%)
>>  512 undersized+degraded+peered
>>   64 active+undersized+degraded
>>
>> How should I proceed from here? Am I seeing 

Re: [ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it

2017-09-28 Thread Ronny Aasen

On 28. sep. 2017 09:27, Olivier Migeot wrote:

Greetings,

we're in the process of recovering a cluster after an electrical 
disaster. It hasn't gone badly so far; we managed to clear most of the errors. 
All that prevents a return to HEALTH_OK now is a bunch (6) of scrub 
errors, apparently from a PG that's marked as active+clean+inconsistent.


Thing is, rados list-inconsistent-obj doesn't return anything but an 
empty list (plus, in the most recent attempts: error 2: (2) No such 
file or directory)


We're on Jewel (waiting for this to be fixed before planning upgrade), 
and the pool our PG belongs to has a replica of 2.


No success with ceph pg repair, and I already tried to remove and import 
the most recent version of said PG in both its acting OSDs : it doesn't 
change a thing.


Is there anything else I could try?

Thanks,



size=2 is of course horrible, and I assume you know that...  But even 
more important: I hope you have min_size=2 so you avoid generating more 
problems in the future, or while troubleshooting.



first of all, read this link a few times:
http://ceph.com/geen-categorie/ceph-manually-repair-object/

You need to locate the bad objects to fix them. Since
rados list-inconsistent-obj does not work, you need to manually check the 
logs of the OSDs that are participating in the pg in question. Grep for 
ERR.


Once you find the name of the problem object, you need to locate 
the object using find /path/of/pg -name 'objectname'


Once you have the object path you need to compare the 2 copies and work 
out which object is the bad one; this is where 3x replication would have 
helped, since when one is bad, how do you tell the bad from the good...


The error message in the log may give hints about the problem. Read and 
understand what the error message says, since it is critical to 
understanding what is wrong with the object.


The object type also helps when determining the wrong one: is it a rados 
object, an rbd block, or a cephfs metadata or data object? Knowing what it 
should be helps in determining the wrong one.


Things to try:
ls -lh $path ; compare the metadata - are there obvious problems? Refer to 
the error in the log.

- does one have size 0 where there should have been a size?
- does one have a size greater than 0 where it should have been size 0?
- is one significantly larger than the other? Perhaps one is truncated, 
or perhaps one has garbage appended.


md5sum $path
- perhaps a block has a read error; it would show up with this command and be 
a dead giveaway for the problem object.

- compare checksums. Do you know what sum the object should have?

Actually look at the object: use strings or hexdump to try to determine 
the contents, vs what the object should contain.


If you can locate the bad object, then stop the osd, flush its 
journal, and move away the bad object (I just mv it somewhere else).

Restart the osd.

Run repair on the pg, tail the logs, and wait for the repair and scrub 
to finish.
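
Condensed, that sequence looks roughly like this (osd id 21, pg 17.1c, the
object name and paths are placeholders for your own values, and the service
commands depend on your init system):

grep ERR /var/log/ceph/ceph-osd.21.log            # find the object named in the scrub errors
find /var/lib/ceph/osd/ceph-21/current/17.1c_head -name 'objectname*'
systemctl stop ceph-osd@21
ceph-osd -i 21 --flush-journal
mv /path/to/bad/object /root/quarantine/
systemctl start ceph-osd@21
ceph pg repair 17.1c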



--

If you are unable to tell the good object from the bad, you can try 
to determine what file it refers to in cephfs, or what block it refers 
to in rbd, and by overwriting that file or block in cephfs or rbd you 
can indirectly overwrite both objects with new data.


If this is an rbd you should run a filesystem check on the fs on that rbd 
after all the ceph problems are repaired.


good luck
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tunable question

2017-09-28 Thread Dan van der Ster
Hi,

How big is your cluster and what is your use case?

For us, we'll likely never enable the recent tunables that need to
remap *all* PGs -- it would simply be too disruptive for marginal
benefit.

Cheers, Dan


On Thu, Sep 28, 2017 at 9:21 AM, mj  wrote:
> Hi,
>
> We have completed the upgrade to jewel, and we set tunables to hammer.
> Cluster again HEALTH_OK. :-)
>
> But now, we would like to proceed in the direction of luminous and bluestore
> OSDs, and we would like to ask for some feedback first.
>
> From the jewel ceph docs on tunables: "Changing tunable to "optimal" on an
> existing cluster will result in a very large amount of data movement as
> almost every PG mapping is likely to change."
>
> Given the above, and the fact that we would like to proceed to
> luminous/bluestore in the not too far away future: What is cleverer:
>
> 1 - keep the cluster at tunable hammer now, upgrade to luminous in a little
> while, change OSDs to bluestore, and then set tunables to optimal
>
> or
>
> 2 - set tunable to optimal now, take the impact of "almost all PG
> remapping", and when that is finished, upgrade to luminous, bluestore etc.
>
> Which route is the preferred one?
>
> Or is there a third (or fourth?) option..? :-)
>
> MJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it

2017-09-28 Thread Olivier Migeot

Greetings,

we're in the process of recovering a cluster after an electrical 
disaster. It hasn't gone badly so far; we managed to clear most of the errors. 
All that prevents a return to HEALTH_OK now is a bunch (6) of scrub 
errors, apparently from a PG that's marked as active+clean+inconsistent.


Thing is, rados list-inconsistent-obj doesn't return anything but an 
empty list (plus, in the most recent attempts: error 2: (2) No such 
file or directory)


We're on Jewel (waiting for this to be fixed before planning upgrade), 
and the pool our PG belongs to has a replica of 2.


No success with ceph pg repair, and I already tried to remove and import 
the most recent version of said PG in both its acting OSDs : it doesn't 
change a thing.


Is there anything else I could try?

Thanks,

--

Olivier Migeot

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tunable question

2017-09-28 Thread mj

Hi,

We have completed the upgrade to jewel, and we set tunables to hammer. 
Cluster again HEALTH_OK. :-)


But now, we would like to proceed in the direction of luminous and 
bluestore OSDs, and we would like to ask for some feedback first.


From the jewel ceph docs on tunables: "Changing tunable to "optimal" on 
an existing cluster will result in a very large amount of data movement 
as almost every PG mapping is likely to change."


Given the above, and the fact that we would like to proceed to 
luminous/bluestore in the not too far away future: What is cleverer:


1 - keep the cluster at tunable hammer now, upgrade to luminous in a 
little while, change OSDs to bluestore, and then set tunables to optimal


or

2 - set tunable to optimal now, take the impact of "almost all PG 
remapping", and when that is finished, upgrade to luminous, bluestore etc.


Which route is the preferred one?

Or is there a third (or fourth?) option..? :-)

MJ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Large amount of files - cephfs?

2017-09-28 Thread Henrik Korkuc

On 17-09-27 14:57, Josef Zelenka wrote:

Hi,

we are currently working on a ceph solution for one of our customers. 
They run a file hosting service and need to store approximately 100 
million pictures (thumbnails). Their current code works with FTP, 
which they use as the storage backend. We thought we could use cephfs for 
this, but I am not sure how it would behave with that many files, how 
the performance would be affected, etc. Is cephfs usable in this 
scenario, or would radosgw+swift be better (they'd likely have to 
rewrite some of the code, so we'd prefer not to do this)? We already 
have some experience with cephfs for storing bigger files, streaming, 
etc., so I'm not completely new to this, but I thought it'd be better to 
ask more experienced users. Some advice on this would be greatly 
appreciated, thanks,


Josef

Depending on your OSD count, you should be able to put 100 million files 
there. As others mentioned, depending on your workload, metadata may be 
a bottleneck.


If metadata is not a concern, then you just need to have enough OSDs to 
distribute the RADOS objects. You should be fine with a few million objects 
per OSD; going with tens of millions per OSD may be more problematic, as 
memory usage grows, OSDs get slower, and backfill/recovery is slow.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com