[ceph-users] ceph.conf tuning ... please comment

2017-12-05 Thread Stefan Kooman
Dear list,

In a ceph blog post about the new Luminous release there is a paragraph
on the need for ceph tuning [1]:

"If you are a Ceph power user and believe there is some setting that you
need to change for your environment to get the best performance, please
tell uswed like to either adjust our defaults so that your change isnt
necessary or have a go at convincing you that you shouldnt be tuning
that option."

We have been tuning several ceph.conf parameters in order to allow for
"fast failure" when an entire datacenter goes offline. We now have
continued operation (no pending IO) after ~ 7 seconds. We have changed
the following parameters:

[global]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd heartbeat grace = 4  # default 20
# Do _NOT_  scale based on laggy estimations
mon osd adjust heartbeat grace = false

^^ Without this setting it could take up to two minutes before Ceph
flagged a whole datacenter as down (after we cut connectivity to the DC).
We are not sure how the estimation is done, but it was not good enough for us.
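
For anyone reproducing this, a quick sanity check is to read the effective
values back from the daemons' admin sockets (a rough sketch; osd.0 and the
local mon name are placeholders for your own daemon IDs):

# on an OSD host
ceph daemon osd.0 config get osd_heartbeat_grace
# on a monitor host
ceph daemon mon.$(hostname -s) config get mon_osd_adjust_heartbeat_grace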

[mon]
# http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/
# TUNING #
mon lease = 1.0                        # default 5
mon election timeout = 2               # default 5
mon lease renew interval factor = 0.4  # default 0.6
mon lease ack timeout factor = 1.5     # default 2.0
mon timecheck interval = 60            # default 300

Above checks are there to make the whole process faster. After a DC
failure the monitors will need a re-election (depending on what DC and
who was a leader and who were peon). While going through mon
debug logging we have observed that this whole process is really fast
(things happen to be done in milliseconds). We have a quite low latency
network, so I guess we can cut some slack here. Ceph won't make any
decisions while there is no consensus, so better get that consensus as
soon as possible.

# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/#monitor-settings
mon osd reporter subtree level = datacenter

^^ We want to make sure that at least two datacenters (rather than
individual hosts) report a datacenter as down before it gets flagged.
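
This of course only works if the CRUSH map actually contains datacenter
buckets with the hosts placed underneath them. A minimal sketch of building
such a hierarchy (bucket and host names below are made up; adjust to your
own topology):

ceph osd crush add-bucket dc1 datacenter
ceph osd crush add-bucket dc2 datacenter
ceph osd crush move dc1 root=default
ceph osd crush move dc2 root=default
ceph osd crush move node24 datacenter=dc1
ceph osd crush move node25 datacenter=dc2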

[osd]
# http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
osd crush update on start = false
osd heartbeat interval = 1         # default 6
osd mon heartbeat interval = 10    # default 30
osd mon report interval min = 1    # default 5
osd mon report interval max = 15   # default 120

The OSDs would almost immediately see a "cut off" to their partner OSDs
in their placement groups. By default they wait 6 seconds before sending
their report to the monitors; during our analysis this was exactly the
time the monitors spent holding an election. By tuning all of the above
we could get them to send their reports faster, so by the time the
election process was finished the monitors would handle the reports from
the OSDs, conclude that a DC is down, flag it down and allow normal
client IO again.
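
To put a number on "fast failure" we simply timestamped the monitors' view
of the OSDs while cutting connectivity to one DC. A rough sketch of such a
measurement loop (nothing cluster-specific assumed; run it from any client
with a ceph.conf and keyring):

# print a timestamp plus the OSD up/in counts twice a second; the gap between
# cutting the link and the first drop in "up" is the detection time
while true; do
    echo "$(date +%s.%N) $(ceph osd stat 2>/dev/null)"
    sleep 0.5
done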

Of course, stability and data safety are most important to us, so if any
of these settings worries you, please let us know.

Gr. Stefan

[1]: http://ceph.com/community/new-luminous-rados-improvements/


-- 
| BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

2017-12-05 Thread Brad Hubbard
On Wed, Dec 6, 2017 at 4:09 AM, Marcus Priesch  wrote:
> Dear Ceph Users,
>
> first of all, big thanks to all the devs and people who made all this
> possible, ceph is amazing !!!
>
> ok, so let me get to the point where i need your help:
>
> i have a cluster of 6 hosts, mixed with ssd's and hdd's.
>
> on 4 of the 6 hosts are 21 vm's running in total with less to no
> workload (web, mail, elasticsearch) for a couple of users.
>
> 4 nodes are running ubuntu server and 2 of them are running proxmox
> (because we are now in the process of migrating towards proxmox).
>
> i am running ceph luminous (have upgraded two weeks ago)
>
> ceph communication is carried out on a separate 1Gbit network which we
> plan to upgrade to bonded 2x10Gbit during the next couple of weeks.
>
> i have two pools defined where i only use disk images via libvirt/rbd.
>
> the hdd pool has two replicas and is for large (~4TB) backup images and
> the ssd pool has three replicas (two on ssd osd's and one on hdd osd's)
> for improved fail safety and faster access for "live data" and OS
> images.
>
> in the crush map i have two different rules for the two pools so that
> replicas always are stored on different hosts - i have verified this and
> it works. it is coded via the "host" attribute (host node1-hdd and host
> node1 are both actually on the same host)
>
> so, now comes the interesting part:
>
> when i turn off one of the hosts (let's say node7) that only runs ceph,
> after some time the vm's stall and hang until the host comes up again.
>
> when i don't turn the host back on, after some time the cluster starts
> rebalancing ...
>
> yesterday i experienced that after a couple of hours of rebalancing the
> vm's continued working again - i think that's when the cluster had
> finished rebalancing ? haven't really dug into this.
>
> well, today we turned off the same host (node7) again and i got stuck
> pg's again.
>
> this time i did some investigation and to my surprise i found the
> following in the output of ceph health detail:
>
> REQUEST_SLOW 17 slow requests are blocked > 32 sec
> 3 ops are blocked > 2097.15 sec
> 14 ops are blocked > 1048.58 sec
> osds 9,10 have blocked requests > 1048.58 sec
> osd.5 has blocked requests > 2097.15 sec
>
> i think the blocked requests are my problem, aren't they ?
>
> but none of osd's 9, 10 or 5 are located on node7 - so can any of you
> tell me why the requests to these nodes got stuck ?
>
> i have one pg in state "stuck unclean" which has its replicas on osd's
> 2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
> thought the "write op" should have gone there ... so why unclean ? the
> manual states "For stuck unclean placement groups, there is usually
> something preventing recovery from completing, like unfound objects" but
> there aren't any ...
>
> do i have a configuration issue here (amount of replicas?) or is this
> behavior simply just because my cluster network is too slow ?
>
> you can find detailed outputs here :
>
> https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY
>
> i hope any of you can help me shed any light on this ...
>
> at least the point of it all is that a single host should be allowed to
> fail and the vm's keep running ... ;)

You don't really have six MONs do you (although I know the answer to
this question)? I think you need to take another look at some of the
docs about monitors.


>
> regards and thanks in advance,
> marcus.
>
> --
> Marcus Priesch
> open source consultant - solution provider
> www.priesch.co.at / off...@priesch.co.at
> A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-05 Thread Rafael Lopez
>
> Yes, you can run luminous on Trusty; one of my clusters is currently
> Luminous/Bluestore/Trusty as I've not had time to sort out doing OS
> upgrades on it. I second the suggestion that it would be better to do the
> luminous upgrade first, retaining existing filestore OSDs, and then do the
> OS upgrade/OSD recreation on each node in sequence. I don't think there
> should realistically be any problems with running a mixed cluster for a
> while but doing the jewel->luminous upgrade on the existing installs first
> shouldn't be significant extra effort/time as you're already predicting at
> least two months to upgrade everything, and it does minimise the amount of
> change at any one time in case things do start going horribly wrong.
>
> Also, at 48 nodes, I would've thought you could get away with cycling more
> than one of them at once. Assuming they're homogenous taking out even 4 at
> a time should only raise utilisation on the rest of the cluster to a little
> over 65%, which still seems safe to me, and you'd waste way less time
> waiting for recovery. (I recognise that depending on the nature of your
> employment situation this may not actually be desirable...)
>
> Rich
>
>
I also agree with this approach. We actually did the reverse: we updated the
OS on all nodes from precise/trusty to xenial while the cluster was still
running hammer. The only thing we had to fiddle with was init (i.e. no
systemd unit files provided with hammer), but you can write basic script(s)
to start/stop all OSDs manually. This was OK for us, particularly since we
didn't intend to run in that state for a long period, and we eventually
upgraded to jewel and will soon go to luminous. In your case, since trusty is
supported in luminous, I don't think you would have any trouble with this.
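
For reference, such a script does not need to be anything fancy. A rough
sketch of a manual start script (assuming the default
/var/lib/ceph/osd/ceph-* data directories; adjust paths and cluster name if
yours differ):

#!/bin/sh
# start every OSD whose data directory exists on this host
for dir in /var/lib/ceph/osd/ceph-*; do
    id="${dir##*-}"
    ceph-osd --cluster ceph -i "$id" && echo "started osd.$id"
done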


-- 
*Rafael Lopez*
Research Devops Engineer
Monash University eResearch Centre
E: rafael.lo...@monash.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD down with Ceph version of Kraken

2017-12-05 Thread Brad Hubbard
On Tue, Dec 5, 2017 at 8:14 PM,   wrote:
> Hi,
>
>
>
> Our Ceph version is Kraken and our storage nodes have up to 90 hard
> disks that can be used for OSDs. We configured the messenger type as
> “simple”; I noticed that the “simple” type may create lots of threads and
> hence occupy a lot of resources. We observed that this configuration causes
> many OSD failures, and it happens frequently. Is there any configuration that
> could help to work around the issue of OSD failures?

You probably need something like:

kernel.pid_max = 4194303

and the nproc ulimit set to unlimited, but it's hard to know without
knowing what specific error(s) you're hitting.
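
If it does turn out to be thread exhaustion, a sketch of making both changes
persistent (the file names below are just conventional examples; adjust for
your distro and the user your OSDs run as):

# /etc/sysctl.d/90-ceph.conf
kernel.pid_max = 4194303

# /etc/security/limits.d/90-ceph.conf
ceph  soft  nproc  unlimited
ceph  hard  nproc  unlimited

# apply the sysctl change without a reboot
sysctl --system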

HTH.

>
>
>
> Thanks in advance!
>
>
>
> Best Regards,
>
> Dave Chen
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Patrick Donnelly
On Tue, Dec 5, 2017 at 8:07 AM, Reed Dier  wrote:
> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD
> backed CephFS pool.
>
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running
> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and
> clients.

You should try a newer kernel client if possible since the MDS is
having trouble trimming its cache.

> HEALTH_ERR 1 MDSs report oversized cache; 1 MDSs have many clients failing
> to respond to cache pressure; 1 MDSs behind on trimming;
> noout,nodeep-scrub flag(s) set; application not enabled on 1
> pool(s); 242 slow requests are blocked > 32 sec
> ; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by
> clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
> mdsdb(mds.0): Many clients (37) failing to respond to cache
> pressureclient_count: 37
> MDS_TRIM 1 MDSs behind on trimming
> mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30,
> num_segments: 36252

See also: http://tracker.ceph.com/issues/21975

You can try doubling (several times if necessary) the MDS configs
`mds_log_max_segments` and `mds_log_max_expiring` to make it trim its
journal more aggressively. (That may not help since your OSD
requests are slow.)
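
A sketch of bumping those at runtime without an MDS restart (mds.mdsdb is
the daemon name from your health output, and the values are only an example
starting point):

ceph tell mds.mdsdb injectargs '--mds_log_max_segments=240 --mds_log_max_expiring=240'
# confirm on the MDS host via the admin socket
ceph daemon mds.mdsdb config get mds_log_max_segments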

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Ronny Aasen
Just as long as you are aware that size=3, min_size=2 is the right
config for everyone except those who really know what they are doing.
And if you ever run min_size=1 you had better expect to corrupt your
cluster sooner or later.
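
For anyone wanting to check where they stand, a minimal sketch of inspecting
and raising the values ("rbd" is just a placeholder pool name):

ceph osd pool get rbd size
ceph osd pool get rbd min_size
# raise the replica count first, let the cluster backfill, then raise min_size
ceph osd pool set rbd size 3
ceph osd pool set rbd min_size 2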


Ronny

On 05.12.2017 21:22, Denes Dolhay wrote:

Hi,

So for this to happen you have to lose another osd before backfilling 
is done.



Thank You! This clarifies it!

Denes



On 12/05/2017 03:32 PM, Ronny Aasen wrote:

On 05. des. 2017 10:26, Denes Dolhay wrote:

Hi,

This question popped up a few times already under filestore and 
bluestore too, but please help me understand, why this is?


"when you have 2 different objects, both with correct digests, in 
your cluster, the cluster can not know witch of the 2 objects are 
the correct one."


Doesn't it use an epoch, or an omap epoch when storing new data? If 
so why can it not use the recent one?






This has been discussed a few times on the list. Generally you have
2 disks.

The first disk fails, and writes happen to the other disk.

The first disk recovers, and the second disk fails before recovery is
done. Writes happen to the second disk.

All objects have correct checksums, and both OSDs think they are the
correct one, so your cluster is inconsistent. Bluestore checksums
do not solve this problem; both objects are objectively "correct" :)

With min_size=2 the cluster would not accept a write unless 2 disks
accepted the write.


kind regards
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Denes Dolhay

Hi,

So for this to happen you have to lose another osd before backfilling is 
done.



Thank You! This clarifies it!

Denes



On 12/05/2017 03:32 PM, Ronny Aasen wrote:

On 05. des. 2017 10:26, Denes Dolhay wrote:

Hi,

This question popped up a few times already under filestore and 
bluestore too, but please help me understand, why this is?


"when you have 2 different objects, both with correct digests, in 
your cluster, the cluster can not know witch of the 2 objects are the 
correct one."


Doesn't it use an epoch, or an omap epoch when storing new data? If 
so why can it not use the recent one?






This has been discussed a few times on the list. Generally you have
2 disks.

The first disk fails, and writes happen to the other disk.

The first disk recovers, and the second disk fails before recovery is
done. Writes happen to the second disk.

All objects have correct checksums, and both OSDs think they are the
correct one, so your cluster is inconsistent. Bluestore checksums
do not solve this problem; both objects are objectively "correct" :)

With min_size=2 the cluster would not accept a write unless 2 disks
accepted the write.


kind regards
Ronny Aasen


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-05 Thread Denes Dolhay

Hello!

I can only answer some of your questions:

- The backfill process obeys a "nearfull_ratio" limit (I think it defaults
to 85%); above it the system will stop repairing itself, so it won't go up
to 100%.

- Normal write ops obey a full_ratio too (I think the default is 95%); above
that no write IO will be accepted to the pool.

- You have min_size=1 (as I recall). So if you lose a disk, the
other OSDs on the same host would fill up to 85%, and then the cluster
would stop repairing and would remain in a degraded (some PGs
undersized) state until you solve the problem, or until it reaches 95%, at
which point the cluster would stop accepting write IO.


Calculations:

Sum pool used : 995+14986+1318 = 17299 ... 17299 * 2 (size) = 34598 
(+journal ?) ~ 35349 (global raw used)


Size: 52806G = 35349 (raw used) + 17457 (raw avail) => 66.94% OK.

The documentation says that a pool's MAX AVAIL is an estimate and is
calculated against the OSD which will run out of space first, so in your
case this is the relevant info.



I think you can access the per osd statistics with the ceph pg dump command.
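
A sketch of the commands I would use for that, next to ceph pg dump (nothing
cluster-specific assumed):

# per-OSD utilisation, weight and variance
ceph osd df tree
# the full / backfillfull / nearfull ratios currently in effect
# (Luminous keeps them in the OSDMap)
ceph osd dump | grep ratio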


However, I think you are quite correct:
Spinning usage: 14986+995 = 15981G
Sum spinning capacity: 15981+3232 = 19213G -> 83% full
(I used the values calculated by your ceph df, as it uses the most full
OSD, so it is a good estimate for a worst case.)
Since the cluster will stop self-healing at 85% full, you cannot
lose any spinning disk in a way that lets the cluster auto-recover to a
healthy state (no undersized PGs). I would consider adding at least 2
new disks to the host which only has SSDs in your setup, of course
considering slots, memory, etc. This would give you some breathing space
to restructure your cluster too.


Denes.

On 12/05/2017 03:07 PM, tim taler wrote:

okay another day another nightmare ;-)

So far we discussed pools as bundles of:
- pool 1) 15 HDD-OSDs (consisting of a total of 25 HDDs actual, 5
single HDDs and five raid0 pairs as mentioned before)
- pool 2) 6 SSD-OSDs
unfortunately (well) on the "physical" pool 1 there are two "logical"
pools (my wording here is maybe not very ceph-ish?)

now I wonder about the real free space on "the pool"...

ceph df tells me:

GLOBAL:
 SIZE   AVAIL  RAW USED %RAW USED
 52806G 17457G   35349G 66.94
POOLS:
 NAME        ID USED    %USED MAX AVAIL OBJECTS
 pool-1-HDD   9    995G  13.34     3232G  262134
 pool-2-HDD  10  14986G  69.86     3232G 3892481
 pool-3-SDD  12   1318G  55.94      519G  372618

Now how do I read this?
the sum of "MAX AVAIL" in the "POOLS" section is 7387
okay 7387*2 (since all three pools have a size of 2) is 14774

The GLOBAL section on the other hand tells me I still got 17457G available
17457-14774=2683
where are the missing 2683 GB?
or am I missing something (other than space and a sane setup, I mean :-)

AND (!)
if in the "physical" HDD pool the reported two times 3232G available
space is true,
than in this setup (two hosts) there would be only 3232G free on each host.
Given that the HDD-OSDs are 4TB in size - if one dies and the host
tries to restore the data
(as I learned yesterday the data in this setup will ONLY be restored
on that host on which the OSD died)
than ...
it doesn't work, right?
Except I could hope that - due to too few placement groups and the resulting
miss-balance of space usage on the OSDs - the dead OSD was only filled
by 60% and not 85%
and only the real data will rewritten(restored).
But even that seems not possible - given the miss-balanced OSDs - the
fuller ones will hit total saturation
and - at least as I understand it now - after that (again after the
first OSD is filled 100%) I can't use the left
space on the other OSDs.
right?

If all that is true (and PLEASE point out any mistake in my thinking)
then what I have here at the moment is
25 hard disks of which NONE must fail or the pool will at least stop
accepting writes.

Am I right? (feels like reciprocal Russian roulette ... ONE chamber
WITHOUT a bullet ;-)

Now - sorry we are not finished yet (and yes this is true, I'm not
trying to make fun of you)

On top of all this I see a rapid decrease in the available space which
is not consistent
with the growing data inside the rbds living in this cluster nor the growing
number of rbds (we ONLY use rbds).
BUT someone is running snapshots.
How do I sum up the amount of space each snapshot is using?

Is it the sum of the USED column in the output of "rbd du --snapp" ?

And what is the philosophy of snapshots in ceph?
An object is 4MB in size; if a bit in that object changes, is the whole
object replicated?
(the cluster is luminous upgraded from jewel so we use filestore on
xfs not bluestore)

TIA

On Tue, Dec 5, 2017 at 11:10 AM, Stefan Kooman  wrote:

Quoting tim taler (robur...@gmail.com):

And I'm still puzzled about the implication of the cluster size on the

[ceph-users] Hangs with qemu/libvirt/rbd when one host disappears

2017-12-05 Thread Marcus Priesch
Dear Ceph Users,

first of all, big thanks to all the devs and people who made all this
possible, ceph is amazing !!!

ok, so let me get to the point where i need your help:

i have a cluster of 6 hosts, mixed with ssd's and hdd's.

on 4 of the 6 hosts are 21 vm's running in total with less to no
workload (web, mail, elasticsearch) for a couple of users.

4 nodes are running ubuntu server and 2 of them are running proxmox
(because we are now in the process of migrating towards proxmox).

i am running ceph luminous (have upgraded two weeks ago)

ceph communication is carried out on a separate 1Gbit network which we
plan to upgrade to bonded 2x10Gbit during the next couple of weeks.

i have two pools defined where i only use disk images via libvirt/rbd.

the hdd pool has two replicas and is for large (~4TB) backup images and
the ssd pool has three replicas (two on ssd osd's and one on hdd osd's)
for improved fail safety and faster access for "live data" and OS
images.

in the crush map i have two different rules for the two pools so that
replicas always are stored on different hosts - i have verified this and
it works. it is coded via the "host" attribute (host node1-hdd and host
node1 are both actually on the same host)
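
(for reference, rules like that boil down to a "step chooseleaf firstn 0
type host" step; on luminous they can also be created straight from the CLI,
roughly as sketched below - rule and root names are examples, not taken from
my actual map:)

# one rule per CRUSH root, each spreading replicas across buckets of type "host"
ceph osd crush rule create-replicated hdd-rule hdd-root host
ceph osd crush rule create-replicated ssd-rule ssd-root host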

so, now comes the interesting part:

when i turn off one of the hosts (let's say node7) that only runs ceph,
after some time the vm's stall and hang until the host comes up again.

when i don't turn the host back on, after some time the cluster starts
rebalancing ...

yesterday i experienced that after a couple of hours of rebalancing the
vm's continued working again - i think that's when the cluster had
finished rebalancing ? haven't really dug into this.

well, today we turned off the same host (node7) again and i got stuck
pg's again.

this time i did some investigation and to my surprise i found the
following in the output of ceph health detail:

REQUEST_SLOW 17 slow requests are blocked > 32 sec
3 ops are blocked > 2097.15 sec
14 ops are blocked > 1048.58 sec
osds 9,10 have blocked requests > 1048.58 sec
osd.5 has blocked requests > 2097.15 sec

i think the blocked requests are my problem, aren't they ?

but none of osd's 9, 10 or 5 are located on node7 - so can any of you
tell me why the requests to these nodes got stuck ?

i have one pg in state "stuck unclean" which has its replicas on osd's
2, 3 and 15. 3 is on node7, but the first in the active set is 2 - i
thought the "write op" should have gone there ... so why unclean ? the
manual states "For stuck unclean placement groups, there is usually
something preventing recovery from completing, like unfound objects" but
there aren't any ...

do i have a configuration issue here (amount of replicas?) or is this
behavior simply just because my cluster network is too slow ?

you can find detailed outputs here :

https://owncloud.priesch.co.at/index.php/s/toYdGekchqpbydY

i hope any of you can help me shed any light on this ...

at least the point of it all is that a single host should be allowed to
fail and the vm's keep running ... ;)

regards and thanks in advance,
marcus.

-- 
Marcus Priesch
open source consultant - solution provider
www.priesch.co.at / off...@priesch.co.at
A-2122 Riedenthal, In Prandnern 31 / +43 650 62 72 870
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Daniel Baumann
Hi,

On 12/05/17 17:58, Dan Jakubiec wrote:
> Is this a configuration problem or a bug?

We had massive problems with both kraken (feb-sept 2017) and luminous
(12.2.0), seeing the same behaviour as you. ceph.conf contained
defaults only, except that we had to crank up mds_cache_size and
mds_bal_fragment_size_max.

Using dirfrag and multi-mds did not change anything. Even with luminous
(12.2.0) basically a single rsync over a large directory tree could kill
cephfs for all clients within seconds, where even a waiting period of >8
hours did not help.

Since the cluster was semi-production, we couldn't take the downtime, so
we switched to unmounting all CephFS clients, flushing the journal, and
re-mounting.
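
(roughly the following sequence - mount point, MDS name and mount options
are placeholders here, not our real ones:)

umount /mnt/cephfs                  # on every client
ceph daemon mds.a flush journal     # on the active MDS host
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret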

Interestingly, with 12.2.1 on kernel 4.13 this doesn't occur
anymore (the 'mds lagging behind' still happens, but it recovers quickly
within minutes, and the rsync does not need to be aborted).

I'm not sure if 12.2.1 fixed it itself, or if it was our config changes
happening at the same time:

mds_session_autoclose = 10
mds_reconnect_timeout = 10

mds_blacklist_interval = 10
mds_session_blacklist_on_timeout = false
mds_session_blacklist_on_evict = false

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Richard Hesketh
You are safe to upgrade packages just by doing an apt-get update; apt-get 
upgrade, and you will then want to restart your ceph daemons to bring them to 
the new version - though you should of course stagger your restarts of each 
type to ensure your mons remain quorate (don't restart more than half at once, 
ideally one at a time), and your OSDs to keep at least min_size for your pools 
- if you have kept the default failure domain of host for your pools, 
restarting all the OSDs on one node and waiting for them to come back up before 
moving on to the next should be fine. Personally I tend to just reboot the 
entire node and wait for it to come back when I'm doing upgrades as there are 
usually also new kernels waiting to be live by the time I get around to it.
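
A sketch of that sequence using the stock systemd targets (assuming the
default unit names; wait for the cluster to settle between hosts):

# on each monitor host, one at a time
sudo systemctl restart ceph-mon.target
ceph quorum_status | grep quorum_names   # wait until all mons are back in quorum

# on each OSD host, one at a time
sudo systemctl restart ceph-osd.target
ceph osd stat                            # wait until all OSDs report up before moving on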

This is a minor version upgrade so you shouldn't need to restart daemon types 
in any particular order - I think that's only a concern when you're doing major 
version upgrades.

Rich

On 05/12/17 17:32, Rudi Ahlers wrote:
> Hi, 
> 
> Can you please tell me how to upgrade these? Would a simple apt-get update be 
> sufficient, or is there a better / safer way?
> 
> On Tue, Dec 5, 2017 at 5:41 PM, Sean Redmond wrote:
> 
> Hi Florent,
> 
> I have always done mons ,osds, rgw, mds, clients
> 
> Packages that don't auto restart services on update IMO is a good thing.
> 
> Thanks
> 
> On Tue, Dec 5, 2017 at 3:26 PM, Florent B wrote:
> 
> On Debian systems, upgrading packages does not restart services !
> 
> 
> On 05/12/2017 16:22, Oscar Segarra wrote:
>> I have executed:
>>
>> yum upgrade -y ceph 
>>
>> On each node and everything has worked fine...
>>
>> 2017-12-05 16:19 GMT+01:00 Florent B:
>>
>> Upgrade procedure is OSD or MON first ?
>>
>> There was a change on Luminous upgrade about it.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-05 Thread Richard Hesketh
On 05/12/17 17:10, Graham Allan wrote:
> On 12/05/2017 07:20 AM, Wido den Hollander wrote:
>> Hi,
>>
>> I haven't tried this before but I expect it to work, but I wanted to
>> check before proceeding.
>>
>> I have a Ceph cluster which is running with manually formatted
>> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
>>
>> I would like to upgrade this system to Luminous, but since I have to
>> re-install all servers and re-format all disks I'd like to move it to
>> BlueStore at the same time.
> 
> You don't *have* to update the OS in order to update to Luminous, do you? 
> Luminous is still supported on Ubuntu 14.04 AFAIK.
> 
> Though obviously I understand your desire to upgrade; I only ask because I am 
> in the same position (Ubuntu 14.04, xfs, sysvinit), though happily with a 
> smaller cluster. Personally I was planning to upgrade ours entirely to 
> Luminous while still on Ubuntu 14.04, before later going through the same 
> process of decommissioning one machine at a time to reinstall with CentOS 7 
> and Bluestore. I too don't see any reason the mixed Jewel/Luminous cluster 
> wouldn't work, but still felt less comfortable with extending the upgrade 
> duration.
> 
> Graham

Yes, you can run luminous on Trusty; one of my clusters is currently 
Luminous/Bluestore/Trusty as I've not had time to sort out doing OS upgrades on 
it. I second the suggestion that it would be better to do the luminous upgrade 
first, retaining existing filestore OSDs, and then do the OS upgrade/OSD 
recreation on each node in sequence. I don't think there should realistically 
be any problems with running a mixed cluster for a while but doing the 
jewel->luminous upgrade on the existing installs first shouldn't be significant 
extra effort/time as you're already predicting at least two months to upgrade 
everything, and it does minimise the amount of change at any one time in case 
things do start going horribly wrong.

Also, at 48 nodes, I would've thought you could get away with cycling more than 
one of them at once. Assuming they're homogenous taking out even 4 at a time 
should only raise utilisation on the rest of the cluster to a little over 65%, 
which still seems safe to me, and you'd waste way less time waiting for 
recovery. (I recognise that depending on the nature of your employment 
situation this may not actually be desirable...)

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-12-05 Thread Subhachandra Chandra
That is what I will try today. I tried "filestore" with 12.2.1 and did not
see any issues. Will repeat the experiment with "bluestore" and 12.2.2.

Thanks
Subhachandra

On Tue, Dec 5, 2017 at 5:14 AM, Konstantin Shalygin  wrote:

>   We are trying out Ceph on a small cluster and are observing memory
>> leakage in the OSD processes.
>>
> Try new 12.2.2 - this release should fix memory issues with Bluestore.
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Rudi Ahlers
Hi,

Can you please tell me how to upgrade these? Would a simple apt-get update
be sufficient, or is there a better / safer way?

On Tue, Dec 5, 2017 at 5:41 PM, Sean Redmond 
wrote:

> Hi Florent,
>
> I have always done mons ,osds, rgw, mds, clients
>
> Packages that don't auto restart services on update IMO is a good thing.
>
> Thanks
>
> On Tue, Dec 5, 2017 at 3:26 PM, Florent B  wrote:
>
>> On Debian systems, upgrading packages does not restart services !
>>
>> On 05/12/2017 16:22, Oscar Segarra wrote:
>>
>> I have executed:
>>
>> yum upgrade -y ceph
>>
>> On each node and everything has worked fine...
>>
>> 2017-12-05 16:19 GMT+01:00 Florent B :
>>
>>> Upgrade procedure is OSD or MON first ?
>>>
>>> There was a change on Luminous upgrade about it.
>>>
>>>
>>> On 01/12/2017 18:34, Abhishek Lekshmanan wrote:
>>> > We're glad to announce the second bugfix release of Luminous v12.2.x
>>> > stable release series. It contains a range of bug fixes and a few
>>> > features across Bluestore, CephFS, RBD & RGW. We recommend all the
>>> users
>>> > of 12.2.x series update.
>>> >
>>> > For more detailed information, see the blog[1] and the complete
>>> > changelog[2]
>>> >
>>> > A big thank you to everyone for the continual feedback & bug
>>> > reports we've received over this release cycle
>>> >
>>> > Notable Changes
>>> > ---
>>> > * Standby ceph-mgr daemons now redirect requests to the active
>>> messenger, easing
>>> >   configuration for tools & users accessing the web dashboard, restful
>>> API, or
>>> >   other ceph-mgr module services.
>>> > * The prometheus module has several significant updates and
>>> improvements.
>>> > * The new balancer module enables automatic optimization of CRUSH
>>> weights to
>>> >   balance data across the cluster.
>>> > * The ceph-volume tool has been updated to include support for
>>> BlueStore as well
>>> >   as FileStore. The only major missing ceph-volume feature is dm-crypt
>>> support.
>>> > * RGW's dynamic bucket index resharding is disabled in multisite
>>> environments,
>>> >   as it can cause inconsistencies in replication of bucket indexes to
>>> remote
>>> >   sites
>>> >
>>> > Other Notable Changes
>>> > -
>>> > * build/ops: bump sphinx to 1.6 (issue#21717, pr#18167, Kefu Chai,
>>> Alfredo Deza)
>>> > * build/ops: macros expanding in spec file comment (issue#22250,
>>> pr#19173, Ken Dreyer)
>>> > * build/ops: python-numpy-devel build dependency for SUSE
>>> (issue#21176, pr#17692, Nathan Cutler)
>>> > * build/ops: selinux: Allow getattr on lnk sysfs files (issue#21492,
>>> pr#18650, Boris Ranto)
>>> > * build/ops: Ubuntu amd64 client can not discover the ubuntu arm64
>>> ceph cluster (issue#19705, pr#18293, Kefu Chai)
>>> > * core: buffer: fix ABI breakage by removing list _mempool member
>>> (issue#21573, pr#18491, Sage Weil)
>>> > * core: Daemons(OSD, Mon…) exit abnormally at injectargs command
>>> (issue#21365, pr#17864, Yan Jun)
>>> > * core: Disable messenger logging (debug ms = 0/0) for clients unless
>>> overridden (issue#21860, pr#18529, Jason Dillaman)
>>> > * core: Improve OSD startup time by only scanning for omap corruption
>>> once (issue#21328, pr#17889, Luo Kexue, David Zafman)
>>> > * core: upmap does not respect osd reweights (issue#21538, pr#18699,
>>> Theofilos Mouratidis)
>>> > * dashboard: barfs on nulls where it expects numbers (issue#21570,
>>> pr#18728, John Spray)
>>> > * dashboard: OSD list has servers and osds in arbitrary order
>>> (issue#21572, pr#18736, John Spray)
>>> > * dashboard: the dashboard uses absolute links for filesystems and
>>> clients (issue#20568, pr#18737, Nick Erdmann)
>>> > * filestore: set default readahead and compaction threads for rocksdb
>>> (issue#21505, pr#18234, Josh Durgin, Mark Nelson)
>>> > * librbd: object map batch update might cause OSD suicide timeout
>>> (issue#21797, pr#18416, Jason Dillaman)
>>> > * librbd: snapshots should be created/removed against data pool
>>> (issue#21567, pr#18336, Jason Dillaman)
>>> > * mds: make sure snap inode’s last matches its parent dentry’s last
>>> (issue#21337, pr#17994, “Yan, Zheng”)
>>> > * mds: sanitize mdsmap of removed pools (issue#21945, issue#21568,
>>> pr#18628, Patrick Donnelly)
>>> > * mgr: bulk backport of ceph-mgr improvements (issue#21594,
>>> issue#17460,
>>> >   issue#21197, issue#21158, issue#21593, pr#18675, Benjeman Meekhof,
>>> >   Sage Weil, Jan Fajerski, John Spray, Kefu Chai, My Do, Spandan Kumar
>>> Sahu)
>>> > * mgr: ceph-mgr gets process called “exe” after respawn (issue#21404,
>>> pr#18738, John Spray)
>>> > * mgr: fix crashable DaemonStateIndex::get calls (issue#17737,
>>> pr#18412, John Spray)
>>> > * mgr: key mismatch for mgr after upgrade from jewel to luminous(dev)
>>> (issue#20950, pr#18727, John Spray)
>>> > * mgr: mgr status module uses base 10 units (issue#21189, issue#21752,
>>> pr#18257, John Spray, Yanhu Cao)
>>> > * mgr: 

Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-05 Thread Graham Allan



On 12/05/2017 07:20 AM, Wido den Hollander wrote:

Hi,

I haven't tried this before but I expect it to work, but I wanted to
check before proceeding.

I have a Ceph cluster which is running with manually formatted
FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.

I would like to upgrade this system to Luminous, but since I have to
re-install all servers and re-format all disks I'd like to move it to
BlueStore at the same time.


You don't *have* to update the OS in order to update to Luminous, do 
you? Luminous is still supported on Ubuntu 14.04 AFAIK.


Though obviously I understand your desire to upgrade; I only ask because 
I am in the same position (Ubuntu 14.04, xfs, sysvinit), though happily 
with a smaller cluster. Personally I was planning to upgrade ours 
entirely to Luminous while still on Ubuntu 14.04, before later going 
through the same process of decommissioning one machine at a time to 
reinstall with CentOS 7 and Bluestore. I too don't see any reason the 
mixed Jewel/Luminous cluster wouldn't work, but still felt less 
comfortable with extending the upgrade duration.


Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS log jam prevention

2017-12-05 Thread Dan Jakubiec
To add a little color here... we started an rsync last night to copy about 4TB 
worth of files to CephFS.  Paused it this morning because CephFS was 
unresponsive on the machine (e.g. can't cat a file from the filesystem).

Been waiting about 3 hours for the log jam to clear.  Slow requests have 
steadily decreased but still can't cat a file.

Seems like there should be something throttling the rsync operation to prevent 
the queues from backing up so far.  Is this a configuration problem or a bug?

From reading the Ceph docs, this seems to be the most telling:

mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by clients, 
1 stray files

[Ref: http://docs.ceph.com/docs/master/cephfs/cache-size-limits/]

"Be aware that the cache limit is not a hard limit. Potential bugs in the 
CephFS client or MDS or misbehaving applications might cause the MDS to exceed 
its cache size. The  mds_health_cache_threshold configures the cluster health 
warning message so that operators can investigate why the MDS cannot shrink its 
cache."

Any suggestions?

Thanks,

-- Dan



> On Dec 5, 2017, at 10:07, Reed Dier  wrote:
> 
> Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD 
> backed CephFS pool.
> 
> Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running 
> mix of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and 
> clients.
> 
>> $ ceph versions
>> {
>> "mon": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 3
>> },
>> "mgr": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 3
>> },
>> "osd": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 74
>> },
>> "mds": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 2
>> },
>> "overall": {
>> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
>> luminous (stable)": 82
>> }
>> }
> 
>>  
>> HEALTH_ERR
>>  1 MDSs report oversized cache; 1 MDSs have many clients failing to respond 
>> to cache pressure; 1 MDSs behind on trimming;
>> noout,nodeep-scrub flag(s) set; application not enabled on 1
>> pool(s); 242 slow requests are blocked > 32 sec
>> ; 769378 stuck requests are blocked > 4096 sec
>> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
>> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by 
>> clients, 1 stray files
>> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache 
>> pressure
>> mdsdb(mds.0): Many clients (37) failing to respond to cache 
>> pressureclient_count: 37
>> MDS_TRIM 1 MDSs behind on trimming
>> mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30, 
>> num_segments: 36252
>> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
>> REQUEST_SLOW 242 slow requests are blocked > 32 sec
>> 236 ops are blocked > 2097.15 sec
>> 3 ops are blocked > 1048.58 sec
>> 2 ops are blocked > 524.288 sec
>> 1 ops are blocked > 32.768 sec
>> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
>> 91 ops are blocked > 67108.9 sec
>> 121258 ops are blocked > 33554.4 sec
>> 308189 ops are blocked > 16777.2 sec
>> 251586 ops are blocked > 8388.61 sec
>> 88254 ops are blocked > 4194.3 sec
>> osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
>> osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
>> osd.13 has stuck requests > 67108.9 sec
> 
> This is across 8 nodes, holding 3x 8TB HDD’s each, all backed by Intel P3600 
> NVMe drives for journaling.
> Removed SSD OSD’s for brevity.
> 
>> $ ceph osd tree
>> ID  CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF
>> -1387.28799 root ssd
>>  -1   174.51500 root default
>> -10   174.51500 rack default.rack2
>> -5543.62000 chassis node2425
>>  -221.81000 host node24
>>   0   hdd   7.26999 osd.0 up  1.0 1.0
>>   8   hdd   7.26999 osd.8 up  1.0 1.0
>>  16   hdd   7.26999 osd.16up  1.0 1.0
>>  -321.81000 host node25
>>   1   hdd   7.26999 osd.1 up  1.0 1.0
>>   9   hdd   7.26999 osd.9 up  1.0 1.0
>>  17   hdd   7.26999 osd.17up  1.0 1.0
>> -5643.63499 chassis node2627
>>  -421.81999 host node26
>>   2   hdd   7.27499 osd.2 up  1.0 1.0
>>  10   hdd   7.26999 osd.10up  1.0 1.0
>>  18   hdd   7.27499  

[ceph-users] CephFS log jam prevention

2017-12-05 Thread Reed Dier
Been trying to do a fairly large rsync onto a 3x replicated, filestore HDD 
backed CephFS pool.

Luminous 12.2.1 for all daemons, kernel CephFS driver, Ubuntu 16.04 running mix 
of 4.8 and 4.10 kernels, 2x10GbE networking between all daemons and clients.

> $ ceph versions
> {
> "mon": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 3
> },
> "mgr": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 3
> },
> "osd": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 74
> },
> "mds": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 2
> },
> "overall": {
> "ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) 
> luminous (stable)": 82
> }
> }

>  
> HEALTH_ERR
>  1 MDSs report oversized cache; 1 MDSs have many clients failing to respond 
> to cache pressure; 1 MDSs behind on trimming;
> noout,nodeep-scrub flag(s) set; application not enabled on 1 pool(s);
> 242 slow requests are blocked > 32 sec
> ; 769378 stuck requests are blocked > 4096 sec
> MDS_CACHE_OVERSIZED 1 MDSs report oversized cache
> mdsdb(mds.0): MDS cache is too large (23GB/8GB); 1018 inodes in use by 
> clients, 1 stray files
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache 
> pressure
> mdsdb(mds.0): Many clients (37) failing to respond to cache 
> pressureclient_count: 37
> MDS_TRIM 1 MDSs behind on trimming
> mdsdb(mds.0): Behind on trimming (36252/30)max_segments: 30, 
> num_segments: 36252
> OSDMAP_FLAGS noout,nodeep-scrub flag(s) set
> REQUEST_SLOW 242 slow requests are blocked > 32 sec
> 236 ops are blocked > 2097.15 sec
> 3 ops are blocked > 1048.58 sec
> 2 ops are blocked > 524.288 sec
> 1 ops are blocked > 32.768 sec
> REQUEST_STUCK 769378 stuck requests are blocked > 4096 sec
> 91 ops are blocked > 67108.9 sec
> 121258 ops are blocked > 33554.4 sec
> 308189 ops are blocked > 16777.2 sec
> 251586 ops are blocked > 8388.61 sec
> 88254 ops are blocked > 4194.3 sec
> osds 0,1,3,6,8,12,15,16,17,21,22,23 have stuck requests > 16777.2 sec
> osds 4,7,9,10,11,14,18,20 have stuck requests > 33554.4 sec
> osd.13 has stuck requests > 67108.9 sec

This is across 8 nodes, holding 3x 8TB HDD’s each, all backed by Intel P3600 
NVMe drives for journaling.
Removed SSD OSD’s for brevity.

> $ ceph osd tree
> ID  CLASS WEIGHTTYPE NAME STATUS REWEIGHT PRI-AFF
> -1387.28799 root ssd
>  -1   174.51500 root default
> -10   174.51500 rack default.rack2
> -5543.62000 chassis node2425
>  -221.81000 host node24
>   0   hdd   7.26999 osd.0 up  1.0 1.0
>   8   hdd   7.26999 osd.8 up  1.0 1.0
>  16   hdd   7.26999 osd.16up  1.0 1.0
>  -321.81000 host node25
>   1   hdd   7.26999 osd.1 up  1.0 1.0
>   9   hdd   7.26999 osd.9 up  1.0 1.0
>  17   hdd   7.26999 osd.17up  1.0 1.0
> -5643.63499 chassis node2627
>  -421.81999 host node26
>   2   hdd   7.27499 osd.2 up  1.0 1.0
>  10   hdd   7.26999 osd.10up  1.0 1.0
>  18   hdd   7.27499 osd.18up  1.0 1.0
>  -521.81499 host node27
>   3   hdd   7.26999 osd.3 up  1.0 1.0
>  11   hdd   7.26999 osd.11up  1.0 1.0
>  19   hdd   7.27499 osd.19up  1.0 1.0
> -5743.62999 chassis node2829
>  -621.81499 host node28
>   4   hdd   7.26999 osd.4 up  1.0 1.0
>  12   hdd   7.26999 osd.12up  1.0 1.0
>  20   hdd   7.27499 osd.20up  1.0 1.0
>  -721.81499 host node29
>   5   hdd   7.26999 osd.5 up  1.0 1.0
>  13   hdd   7.26999 osd.13up  1.0 1.0
>  21   hdd   7.27499 osd.21up  1.0 1.0
> -5843.62999 chassis node3031
>  -821.81499 host node30
>   6   hdd   7.26999 osd.6 up  1.0 1.0
>  14   hdd   7.26999 osd.14up  1.0 1.0
>  22   hdd   7.27499 osd.22

Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Erik McCormick
On Dec 5, 2017 10:26 AM, "Florent B"  wrote:

On Debian systems, upgrading packages does not restart services !

You really don't want it to restart services. Many small clusters run mons
and osds on the same nodes, and auto restart makes it impossible to order
restarts.

-Erik

On 05/12/2017 16:22, Oscar Segarra wrote:

I have executed:

yum upgrade -y ceph

On each node and everything has worked fine...

2017-12-05 16:19 GMT+01:00 Florent B :

> Upgrade procedure is OSD or MON first ?
>
> There was a change on Luminous upgrade about it.
>
>
> On 01/12/2017 18:34, Abhishek Lekshmanan wrote:
> > We're glad to announce the second bugfix release of Luminous v12.2.x
> > stable release series. It contains a range of bug fixes and a few
> > features across Bluestore, CephFS, RBD & RGW. We recommend all the users
> > of 12.2.x series update.
> >
> > For more detailed information, see the blog[1] and the complete
> > changelog[2]
> >
> > A big thank you to everyone for the continual feedback & bug
> > reports we've received over this release cycle
> >
> > Notable Changes
> > ---
> > * Standby ceph-mgr daemons now redirect requests to the active
> messenger, easing
> >   configuration for tools & users accessing the web dashboard, restful
> API, or
> >   other ceph-mgr module services.
> > * The prometheus module has several significant updates and improvements.
> > * The new balancer module enables automatic optimization of CRUSH
> weights to
> >   balance data across the cluster.
> > * The ceph-volume tool has been updated to include support for BlueStore
> as well
> >   as FileStore. The only major missing ceph-volume feature is dm-crypt
> support.
> > * RGW's dynamic bucket index resharding is disabled in multisite
> environments,
> >   as it can cause inconsistencies in replication of bucket indexes to
> remote
> >   sites
> >
> > Other Notable Changes
> > -
> > * build/ops: bump sphinx to 1.6 (issue#21717, pr#18167, Kefu Chai,
> Alfredo Deza)
> > * build/ops: macros expanding in spec file comment (issue#22250,
> pr#19173, Ken Dreyer)
> > * build/ops: python-numpy-devel build dependency for SUSE (issue#21176,
> pr#17692, Nathan Cutler)
> > * build/ops: selinux: Allow getattr on lnk sysfs files (issue#21492,
> pr#18650, Boris Ranto)
> > * build/ops: Ubuntu amd64 client can not discover the ubuntu arm64 ceph
> cluster (issue#19705, pr#18293, Kefu Chai)
> > * core: buffer: fix ABI breakage by removing list _mempool member
> (issue#21573, pr#18491, Sage Weil)
> > * core: Daemons(OSD, Mon…) exit abnormally at injectargs command
> (issue#21365, pr#17864, Yan Jun)
> > * core: Disable messenger logging (debug ms = 0/0) for clients unless
> overridden (issue#21860, pr#18529, Jason Dillaman)
> > * core: Improve OSD startup time by only scanning for omap corruption
> once (issue#21328, pr#17889, Luo Kexue, David Zafman)
> > * core: upmap does not respect osd reweights (issue#21538, pr#18699,
> Theofilos Mouratidis)
> > * dashboard: barfs on nulls where it expects numbers (issue#21570,
> pr#18728, John Spray)
> > * dashboard: OSD list has servers and osds in arbitrary order
> (issue#21572, pr#18736, John Spray)
> > * dashboard: the dashboard uses absolute links for filesystems and
> clients (issue#20568, pr#18737, Nick Erdmann)
> > * filestore: set default readahead and compaction threads for rocksdb
> (issue#21505, pr#18234, Josh Durgin, Mark Nelson)
> > * librbd: object map batch update might cause OSD suicide timeout
> (issue#21797, pr#18416, Jason Dillaman)
> > * librbd: snapshots should be created/removed against data pool
> (issue#21567, pr#18336, Jason Dillaman)
> > * mds: make sure snap inode’s last matches its parent dentry’s last
> (issue#21337, pr#17994, “Yan, Zheng”)
> > * mds: sanitize mdsmap of removed pools (issue#21945, issue#21568,
> pr#18628, Patrick Donnelly)
> > * mgr: bulk backport of ceph-mgr improvements (issue#21594, issue#17460,
> >   issue#21197, issue#21158, issue#21593, pr#18675, Benjeman Meekhof,
> >   Sage Weil, Jan Fajerski, John Spray, Kefu Chai, My Do, Spandan Kumar
> Sahu)
> > * mgr: ceph-mgr gets process called “exe” after respawn (issue#21404,
> pr#18738, John Spray)
> > * mgr: fix crashable DaemonStateIndex::get calls (issue#17737, pr#18412,
> John Spray)
> > * mgr: key mismatch for mgr after upgrade from jewel to luminous(dev)
> (issue#20950, pr#18727, John Spray)
> > * mgr: mgr status module uses base 10 units (issue#21189, issue#21752,
> pr#18257, John Spray, Yanhu Cao)
> > * mgr: mgr[zabbix] float division by zero (issue#21518, pr#18734, John
> Spray)
> > * mgr: Prometheus crash when update (issue#21253, pr#17867, John Spray)
> > * mgr: prometheus module generates invalid output when counter names
> contain non-alphanum characters (issue#20899, pr#17868, John Spray, Jeremy
> H Austin)
> > * mgr: Quieten scary RuntimeError from restful module on startup
> (issue#21292, pr#17866, John Spray)
> > * mgr: 

Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Sean Redmond
Hi Florent,

I have always done mons ,osds, rgw, mds, clients

Packages that don't auto restart services on update IMO is a good thing.

Thanks

On Tue, Dec 5, 2017 at 3:26 PM, Florent B  wrote:

> On Debian systems, upgrading packages does not restart services !
>
> On 05/12/2017 16:22, Oscar Segarra wrote:
>
> I have executed:
>
> yum upgrade -y ceph
>
> On each node and everything has worked fine...
>
> 2017-12-05 16:19 GMT+01:00 Florent B :
>
>> Upgrade procedure is OSD or MON first ?
>>
>> There was a change on Luminous upgrade about it.
>>
>>
>> On 01/12/2017 18:34, Abhishek Lekshmanan wrote:
>> > We're glad to announce the second bugfix release of Luminous v12.2.x
>> > stable release series. It contains a range of bug fixes and a few
>> > features across Bluestore, CephFS, RBD & RGW. We recommend all the users
>> > of 12.2.x series update.
>> >
>> > For more detailed information, see the blog[1] and the complete
>> > changelog[2]
>> >
>> > A big thank you to everyone for the continual feedback & bug
>> > reports we've received over this release cycle
>> >
>> > Notable Changes
>> > ---
>> > * Standby ceph-mgr daemons now redirect requests to the active
>> messenger, easing
>> >   configuration for tools & users accessing the web dashboard, restful
>> API, or
>> >   other ceph-mgr module services.
>> > * The prometheus module has several significant updates and
>> improvements.
>> > * The new balancer module enables automatic optimization of CRUSH
>> weights to
>> >   balance data across the cluster.
>> > * The ceph-volume tool has been updated to include support for
>> BlueStore as well
>> >   as FileStore. The only major missing ceph-volume feature is dm-crypt
>> support.
>> > * RGW's dynamic bucket index resharding is disabled in multisite
>> environments,
>> >   as it can cause inconsistencies in replication of bucket indexes to
>> remote
>> >   sites
>> >
>> > Other Notable Changes
>> > -
>> > * build/ops: bump sphinx to 1.6 (issue#21717, pr#18167, Kefu Chai,
>> Alfredo Deza)
>> > * build/ops: macros expanding in spec file comment (issue#22250,
>> pr#19173, Ken Dreyer)
>> > * build/ops: python-numpy-devel build dependency for SUSE (issue#21176,
>> pr#17692, Nathan Cutler)
>> > * build/ops: selinux: Allow getattr on lnk sysfs files (issue#21492,
>> pr#18650, Boris Ranto)
>> > * build/ops: Ubuntu amd64 client can not discover the ubuntu arm64 ceph
>> cluster (issue#19705, pr#18293, Kefu Chai)
>> > * core: buffer: fix ABI breakage by removing list _mempool member
>> (issue#21573, pr#18491, Sage Weil)
>> > * core: Daemons(OSD, Mon…) exit abnormally at injectargs command
>> (issue#21365, pr#17864, Yan Jun)
>> > * core: Disable messenger logging (debug ms = 0/0) for clients unless
>> overridden (issue#21860, pr#18529, Jason Dillaman)
>> > * core: Improve OSD startup time by only scanning for omap corruption
>> once (issue#21328, pr#17889, Luo Kexue, David Zafman)
>> > * core: upmap does not respect osd reweights (issue#21538, pr#18699,
>> Theofilos Mouratidis)
>> > * dashboard: barfs on nulls where it expects numbers (issue#21570,
>> pr#18728, John Spray)
>> > * dashboard: OSD list has servers and osds in arbitrary order
>> (issue#21572, pr#18736, John Spray)
>> > * dashboard: the dashboard uses absolute links for filesystems and
>> clients (issue#20568, pr#18737, Nick Erdmann)
>> > * filestore: set default readahead and compaction threads for rocksdb
>> (issue#21505, pr#18234, Josh Durgin, Mark Nelson)
>> > * librbd: object map batch update might cause OSD suicide timeout
>> (issue#21797, pr#18416, Jason Dillaman)
>> > * librbd: snapshots should be created/removed against data pool
>> (issue#21567, pr#18336, Jason Dillaman)
>> > * mds: make sure snap inode’s last matches its parent dentry’s last
>> (issue#21337, pr#17994, “Yan, Zheng”)
>> > * mds: sanitize mdsmap of removed pools (issue#21945, issue#21568,
>> pr#18628, Patrick Donnelly)
>> > * mgr: bulk backport of ceph-mgr improvements (issue#21594, issue#17460,
>> >   issue#21197, issue#21158, issue#21593, pr#18675, Benjeman Meekhof,
>> >   Sage Weil, Jan Fajerski, John Spray, Kefu Chai, My Do, Spandan Kumar
>> Sahu)
>> > * mgr: ceph-mgr gets process called “exe” after respawn (issue#21404,
>> pr#18738, John Spray)
>> > * mgr: fix crashable DaemonStateIndex::get calls (issue#17737,
>> pr#18412, John Spray)
>> > * mgr: key mismatch for mgr after upgrade from jewel to luminous(dev)
>> (issue#20950, pr#18727, John Spray)
>> > * mgr: mgr status module uses base 10 units (issue#21189, issue#21752,
>> pr#18257, John Spray, Yanhu Cao)
>> > * mgr: mgr[zabbix] float division by zero (issue#21518, pr#18734, John
>> Spray)
>> > * mgr: Prometheus crash when update (issue#21253, pr#17867, John Spray)
>> > * mgr: prometheus module generates invalid output when counter names
>> contain non-alphanum characters (issue#20899, pr#17868, John Spray, Jeremy
>> H Austin)
>> > * mgr: Quieten scary 

Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Oscar Segarra
I have executed:

yum upgrade -y ceph

On each node and everything has worked fine...

2017-12-05 16:19 GMT+01:00 Florent B :

> Upgrade procedure is OSD or MON first ?
>
> There was a change on Luminous upgrade about it.
>
>
> On 01/12/2017 18:34, Abhishek Lekshmanan wrote:
> > We're glad to announce the second bugfix release of Luminous v12.2.x
> > stable release series. It contains a range of bug fixes and a few
> > features across Bluestore, CephFS, RBD & RGW. We recommend all the users
> > of 12.2.x series update.
> >
> > For more detailed information, see the blog[1] and the complete
> > changelog[2]
> >
> > A big thank you to everyone for the continual feedback & bug
> > reports we've received over this release cycle
> >
> > Notable Changes
> > ---
> > * Standby ceph-mgr daemons now redirect requests to the active
> messenger, easing
> >   configuration for tools & users accessing the web dashboard, restful
> API, or
> >   other ceph-mgr module services.
> > * The prometheus module has several significant updates and improvements.
> > * The new balancer module enables automatic optimization of CRUSH
> weights to
> >   balance data across the cluster.
> > * The ceph-volume tool has been updated to include support for BlueStore
> as well
> >   as FileStore. The only major missing ceph-volume feature is dm-crypt
> support.
> > * RGW's dynamic bucket index resharding is disabled in multisite
> environments,
> >   as it can cause inconsistencies in replication of bucket indexes to
> remote
> >   sites
> >
> > Other Notable Changes
> > -
> > * build/ops: bump sphinx to 1.6 (issue#21717, pr#18167, Kefu Chai,
> Alfredo Deza)
> > * build/ops: macros expanding in spec file comment (issue#22250,
> pr#19173, Ken Dreyer)
> > * build/ops: python-numpy-devel build dependency for SUSE (issue#21176,
> pr#17692, Nathan Cutler)
> > * build/ops: selinux: Allow getattr on lnk sysfs files (issue#21492,
> pr#18650, Boris Ranto)
> > * build/ops: Ubuntu amd64 client can not discover the ubuntu arm64 ceph
> cluster (issue#19705, pr#18293, Kefu Chai)
> > * core: buffer: fix ABI breakage by removing list _mempool member
> (issue#21573, pr#18491, Sage Weil)
> > * core: Daemons(OSD, Mon…) exit abnormally at injectargs command
> (issue#21365, pr#17864, Yan Jun)
> > * core: Disable messenger logging (debug ms = 0/0) for clients unless
> overridden (issue#21860, pr#18529, Jason Dillaman)
> > * core: Improve OSD startup time by only scanning for omap corruption
> once (issue#21328, pr#17889, Luo Kexue, David Zafman)
> > * core: upmap does not respect osd reweights (issue#21538, pr#18699,
> Theofilos Mouratidis)
> > * dashboard: barfs on nulls where it expects numbers (issue#21570,
> pr#18728, John Spray)
> > * dashboard: OSD list has servers and osds in arbitrary order
> (issue#21572, pr#18736, John Spray)
> > * dashboard: the dashboard uses absolute links for filesystems and
> clients (issue#20568, pr#18737, Nick Erdmann)
> > * filestore: set default readahead and compaction threads for rocksdb
> (issue#21505, pr#18234, Josh Durgin, Mark Nelson)
> > * librbd: object map batch update might cause OSD suicide timeout
> (issue#21797, pr#18416, Jason Dillaman)
> > * librbd: snapshots should be created/removed against data pool
> (issue#21567, pr#18336, Jason Dillaman)
> > * mds: make sure snap inode’s last matches its parent dentry’s last
> (issue#21337, pr#17994, “Yan, Zheng”)
> > * mds: sanitize mdsmap of removed pools (issue#21945, issue#21568,
> pr#18628, Patrick Donnelly)
> > * mgr: bulk backport of ceph-mgr improvements (issue#21594, issue#17460,
> >   issue#21197, issue#21158, issue#21593, pr#18675, Benjeman Meekhof,
> >   Sage Weil, Jan Fajerski, John Spray, Kefu Chai, My Do, Spandan Kumar
> Sahu)
> > * mgr: ceph-mgr gets process called “exe” after respawn (issue#21404,
> pr#18738, John Spray)
> > * mgr: fix crashable DaemonStateIndex::get calls (issue#17737, pr#18412,
> John Spray)
> > * mgr: key mismatch for mgr after upgrade from jewel to luminous(dev)
> (issue#20950, pr#18727, John Spray)
> > * mgr: mgr status module uses base 10 units (issue#21189, issue#21752,
> pr#18257, John Spray, Yanhu Cao)
> > * mgr: mgr[zabbix] float division by zero (issue#21518, pr#18734, John
> Spray)
> > * mgr: Prometheus crash when update (issue#21253, pr#17867, John Spray)
> > * mgr: prometheus module generates invalid output when counter names
> contain non-alphanum characters (issue#20899, pr#17868, John Spray, Jeremy
> H Austin)
> > * mgr: Quieten scary RuntimeError from restful module on startup
> (issue#21292, pr#17866, John Spray)
> > * mgr: Spurious ceph-mgr failovers during mon elections (issue#20629,
> pr#18726, John Spray)
> > * mon: Client client.admin marked osd.2 out, after it was down for
> 1504627577 seconds (issue#21249, pr#17862, John Spray)
> > * mon: DNS SRV default service name not used anymore (issue#21204,
> pr#17863, Kefu Chai)
> > * mon/MgrMonitor: handle cmd descs 

Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Wido den Hollander

> On 5 December 2017 at 15:27, Jason Dillaman wrote:
> 
> 
> On Tue, Dec 5, 2017 at 9:13 AM, Wido den Hollander  wrote:
> >
> >> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
> >>
> >
> > I asked again for the gcore, but they can't release it as it contains 
> > confidential information about the Instance and the Ceph cluster. I 
> > understand their reasoning and they also understand that it makes it 
> > difficult to debug this.
> >
> > I am allowed to look at the gcore dump when on location (next week), but 
> > I'm not allowed to share it.
> 
> Indeed -- best chance would be if you could reproduce on a VM that you
> are permitted to share.
> 

We are looking into that.

> >> As for the VM going R/O, that is the expected behavior when a client
> >> breaks the exclusive lock held by a (dead) client.
> >>
> >
> > We noticed another VM going into RO when a snapshot was created. When 
> > checking last week this Instance had a watcher, but after the snapshot/RO 
> > we found out it no longer has a watcher registered.
> >
> > Any suggestions or ideas?
> 
> If you have the admin socket enabled, you could run "ceph
> --admin-daemon /path/to/asok objecter_requests" to dump the ops. That
> probably won't be useful unless there is a smoking gun. Did you have
> any OSDs go out/down? Network issues?
> 

The admin socket is currently not enabled, but I will ask them to do that. We 
will then have to wait for this to happen again.

We didn't have any network issues there. A few OSDs went down and came up again 
in the last few weeks, but not very recently afaik.

I'll look into the admin socket!

Wido
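
For reference, a minimal sketch of what enabling the client admin socket for
the qemu/librbd processes could look like; the socket and log paths below are
assumptions and must be writable by the qemu user:

[client]
admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
log file = /var/log/ceph/qemu-guest-$pid.log

$ ceph --admin-daemon /var/run/ceph/<client-socket>.asok help
$ ceph --admin-daemon /var/run/ceph/<client-socket>.asok objecter_requests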

> > Wido
> >
> >> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
> >> > Hi,
> >> >
> >> > On a OpenStack environment I encountered a VM which went into R/O mode 
> >> > after a RBD snapshot was created.
> >> >
> >> > Digging into this I found 10s (out of thousands) RBD images which DO 
> >> > have a running VM, but do NOT have a watcher on the RBD image.
> >> >
> >> > For example:
> >> >
> >> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >> >
> >> > 'Watchers: none'
> >> >
> >> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on 
> >> > the client.
> >> >
> >> > In the meantime the cluster was already upgraded to 10.2.10
> >> >
> >> > Looking further I also found a Compute node with 10.2.10 installed which 
> >> > also has RBD images without watchers.
> >> >
> >> > Restarting or live migrating the VM to a different host resolves this 
> >> > issue.
> >> >
> >> > The internet is full of posts where RBD images still have Watchers when 
> >> > people don't expect them, but in this case I'm expecting a watcher which 
> >> > isn't there.
> >> >
> >> > The main problem right now is that creating a snapshot potentially puts 
> >> > a VM in Read-Only state because of the lack of notification.
> >> >
> >> > Has anybody seen this as well?
> >> >
> >> > Thanks,
> >> >
> >> > Wido
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Ronny Aasen

On 05. des. 2017 10:26, Denes Dolhay wrote:

Hi,

This question popped up a few times already under filestore and 
bluestore too, but please help me understand, why this is?


"when you have 2 different objects, both with correct digests, in your 
cluster, the cluster can not know which of the 2 objects is the correct 
one."


Doesn't it use an epoch, or an omap epoch when storing new data? If so 
why can it not use the recent one?






this has been discussed a few times on the list. generally you have 2 
disks.


first disk fail. and writes happen to the other disk..

first disk recovers, and second disk fail before recovery is done. 
writes happen to second disk..


all objects have a correct checksum, and both OSDs think they are the 
correct one, so your cluster is inconsistent. so bluestore checksums

do not solve this problem, both objects are objectively "correct" :)


with min_size =2 the cluster would not accept a write unless 2 disks 
accepted the write.


kind regards
Ronny Aasen
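
A minimal sketch of applying that rule of thumb to an existing replicated
pool (the pool name is a placeholder; raising size from 2 to 3 will trigger
backfill to create the third copy):

$ ceph osd pool set <pool> size 3
$ ceph osd pool set <pool> min_size 2
$ ceph osd pool get <pool> min_size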


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Jason Dillaman
On Tue, Dec 5, 2017 at 9:13 AM, Wido den Hollander  wrote:
>
>> On 29 November 2017 at 14:56, Jason Dillaman wrote:
>>
>>
>> We experienced this problem in the past on older (pre-Jewel) releases
>> where a PG split that affected the RBD header object would result in
>> the watch getting lost by librados. Any chance you know if the
>> affected RBD header objects were involved in a PG split? Can you
>> generate a gcore dump of one of the affected VMs and ceph-post-file it
>> for analysis?
>>
>
> I asked again for the gcore, but they can't release it as it contains 
> confidential information about the Instance and the Ceph cluster. I 
> understand their reasoning and they also understand that it makes it 
> difficult to debug this.
>
> I am allowed to look at the gcore dump when on location (next week), but I'm 
> not allowed to share it.

Indeed -- best chance would be if you could reproduce on a VM that you
are permitted to share.

>> As for the VM going R/O, that is the expected behavior when a client
>> breaks the exclusive lock held by a (dead) client.
>>
>
> We noticed another VM going into RO when a snapshot was created. When 
> checking last week this Instance had a watcher, but after the snapshot/RO we 
> found out it no longer has a watcher registered.
>
> Any suggestions or ideas?

If you have the admin socket enabled, you could run "ceph
--admin-daemon /path/to/asok objecter_requests" to dump the ops. That
probably won't be useful unless there is a smoking gun. Did you have
any OSDs go out/down? Network issues?

> Wido
>
>> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
>> > Hi,
>> >
>> > On a OpenStack environment I encountered a VM which went into R/O mode 
>> > after a RBD snapshot was created.
>> >
>> > Digging into this I found 10s (out of thousands) RBD images which DO have 
>> > a running VM, but do NOT have a watcher on the RBD image.
>> >
>> > For example:
>> >
>> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
>> >
>> > 'Watchers: none'
>> >
>> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on 
>> > the client.
>> >
>> > In the meantime the cluster was already upgraded to 10.2.10
>> >
>> > Looking further I also found a Compute node with 10.2.10 installed which 
>> > also has RBD images without watchers.
>> >
>> > Restarting or live migrating the VM to a different host resolves this 
>> > issue.
>> >
>> > The internet is full of posts where RBD images still have Watchers when 
>> > people don't expect them, but in this case I'm expecting a watcher which 
>> > isn't there.
>> >
>> > The main problem right now is that creating a snapshot potentially puts a 
>> > VM in Read-Only state because of the lack of notification.
>> >
>> > Has anybody seen this as well?
>> >
>> > Thanks,
>> >
>> > Wido
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-12-05 Thread Wido den Hollander

> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> 
> 
> We experienced this problem in the past on older (pre-Jewel) releases
> where a PG split that affected the RBD header object would result in
> the watch getting lost by librados. Any chance you know if the
> affected RBD header objects were involved in a PG split? Can you
> generate a gcore dump of one of the affected VMs and ceph-post-file it
> for analysis?
> 

I asked again for the gcore, but they can't release it as it contains 
confidential information about the Instance and the Ceph cluster. I understand 
their reasoning and they also understand that it makes it difficult to debug 
this.

I am allowed to look at the gcore dump when on location (next week), but I'm 
not allowed to share it.

> As for the VM going R/O, that is the expected behavior when a client
> breaks the exclusive lock held by a (dead) client.
> 

We noticed another VM going into RO when a snapshot was created. When checking 
last week this Instance had a watcher, but after the snapshot/RO we found out 
it no longer has a watcher registered.

Any suggestions or ideas?

Wido

> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
> > Hi,
> >
> > On a OpenStack environment I encountered a VM which went into R/O mode 
> > after a RBD snapshot was created.
> >
> > Digging into this I found 10s (out of thousands) RBD images which DO have a 
> > running VM, but do NOT have a watcher on the RBD image.
> >
> > For example:
> >
> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >
> > 'Watchers: none'
> >
> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on the 
> > client.
> >
> > In the meantime the cluster was already upgraded to 10.2.10
> >
> > Looking further I also found a Compute node with 10.2.10 installed which 
> > also has RBD images without watchers.
> >
> > Restarting or live migrating the VM to a different host resolves this issue.
> >
> > The internet is full of posts where RBD images still have Watchers when 
> > people don't expect them, but in this case I'm expecting a watcher which 
> > isn't there.
> >
> > The main problem right now is that creating a snapshot potentially puts a 
> > VM in Read-Only state because of the lack of notification.
> >
> > Has anybody seen this as well?
> >
> > Thanks,
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-05 Thread tim taler
okay another day another nightmare ;-)

So far we discussed pools as bundles of:
- pool 1) 15 HDD-OSDs (consisting of a total of 25 actual HDDs, 5
single HDDs and five raid0 pairs, as mentioned before)
- pool 2) 6 SSD-OSDs
unfortunately (well) on the "physical" pool 1 there are two "logical"
pools (my wording is here maybe not cephish?)

now I wonder about the real free space on "the pool"...

ceph df tells me:

GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
52806G 17457G   35349G 66.94
POOLS:
NAME ID USED   %USED MAX AVAIL OBJECTS
pool-1-HDD  9 995G 13.34 3232G   262134
pool-2-HDD10 14986G  69.86 3232G 3892481
pool-3-SDD12   1318G  55.94   519G  372618

Now how do I read this?
the sum of "MAX AVAIL" in the "POOLS" section is 7387
okay 7387*2 (since all three pools have a size of 2) is 14774

The GLOBAL section on the other hand tells me I still got 17457G available
17457-14774=2683
where are the missing 2683 GB?
or am I missing something (other than space and a sane setup, I mean :-)

AND (!)
if in the "physical" HDD pool the reported two times 3232G available
space is true,
than in this setup (two hosts) there would be only 3232G free on each host.
Given that the HDD-OSDs are 4TB in size - if one dies and the host
tries to restore the data
(as I learned yesterday the data in this setup will ONLY be restored
on that host on which the OSD died)
than ...
it doesn't work, right?
Except I could hope that - due to too few placement groups and the resulting
miss-balance of space usage on the OSDs - the dead OSD was only filled
by 60% and not 85%
and only the real data will rewritten(restored).
But even that seems not possible - given the miss-balanced OSDs - the
fuller ones will hit total saturation
and - at least as I understand it now - after that (again after the
first OSD is filled 100%) I can't use the left
space on the other OSDs.
right?

If all that is true (and PLEASE point out any mistake in my thinking)
then what I have got here at the moment is
25 hard disks of which NONE must fail or the pool will at least stop
accepting writes.

Am I right? (feels like a reciprocal Russian roulette ... ONE chamber
WITHOUT a bullet ;-)

Now - sorry we are not finished yet (and yes this is true, I'm not
trying to make fun of you)

On top of all this I see a rapid decrease in the available space which
is not consistent
with growing data inside the rbds living in this cluster nor growing
numbers of rbds (we ONLY use rbds).
BUT someone is running snapshots.
How do I sum up the amount of space each snapshot is using?

is it the sum of the USED column in the output of "rbd du --snapp" ?

And what is the philosophy of snapshots in ceph?
An object is 4MB in size; if a bit in that object changes, is the whole
object replicated?
(the cluster is luminous upgraded from jewel so we use filestore on
xfs not bluestore)

TIA
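
A minimal sketch of how per-snapshot usage can be inspected (the image name
is a placeholder, the pool name is taken from the ceph df output above). Note
that with filestore a snapshotted 4MB object is cloned in full on the first
write after the snapshot, so even a one-bit change costs a whole object:

$ rbd du pool-2-HDD/<image>                        # one USED row per snapshot plus the head
$ rbd diff --from-snap <snap> pool-2-HDD/<image>   # extents changed since <snap>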

On Tue, Dec 5, 2017 at 11:10 AM, Stefan Kooman  wrote:
> Quoting tim taler (robur...@gmail.com):
>> And I'm still puzzled about the implication of the cluster size on the
>> amount of OSD failures.
>> With size=2 min_size=1 one host could die and (if by chance there is
>> NO read error on any bit on the living host) I could (theoretically)
>> recover, is that right?
> True.
>> OR is it that if any two disks in the cluster fail at the same time
>> (or while one is still being rebuild) all my data would be gone?
> Only the objects that are located on those disks. So for example obj1
> disk1,host1 and obj 1 on disk2,host2 ... you will lose data, yes.
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] tcmu-runner failing during image creation

2017-12-05 Thread Jason Dillaman
We are planning to push a 4.14-based kernel with the necessary LIO
fixes to here [1] sometime soon. I'll make an announcement and update
the documentation once it's available.

[1] https://shaman.ceph.com/repos/kernel/ceph-iscsi-stable

On Mon, Dec 4, 2017 at 6:55 PM, Brady Deetz  wrote:
> I thought I was good to go with tcmu-runner on Kernel 4.14, but I guess not?
> Any thoughts on the output below?
>
> 2017-12-04 17:44:09,631ERROR [rbd-target-api:665:_disk()] - LUN alloc
> problem - Could not set LIO device attribute cmd_time_out/qfull_time_out for
> device: iscsi-primary.primary00. Kernel not supported. - error(Cannot find
> attribute: qfull_time_out)
>
>
> [root@dc1srviscsi01 ~]# uname -a
> Linux dc1srviscsi01.ceph.xxx.xxx 4.14.3-1.el7.elrepo.x86_64 #1 SMP Thu Nov
> 30 09:35:20 EST 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> ceph-iscsi-cli/
> [root@dc1srviscsi01 ceph-iscsi-cli]# git branch
> * (detached from 2.5)
>   master
>
> ceph-iscsi-cli/
> [root@dc1srviscsi01 ceph-iscsi-cli]# git branch
> * (detached from 2.5)
>   master
>
> ceph-iscsi-config/
> [root@dc1srviscsi01 ceph-iscsi-config]# git branch
> * (detached from 2.3)
>   master
>
> rtslib-fb/
> [root@dc1srviscsi01 rtslib-fb]# git branch
> * (detached from v2.1.fb64)
>   master
>
> targetcli-fb/
> [root@dc1srviscsi01 targetcli-fb]# git branch
> * (detached from v2.1.fb47)
>   master
>
> tcmu-runner/
> [root@dc1srviscsi01 tcmu-runner]# git branch
> * (detached from v1.3.0-rc4)
>   master
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-05 Thread Wido den Hollander
Hi,

I haven't tried this before but I expect it to work; still, I wanted to check 
before proceeding.

I have a Ceph cluster which is running with manually formatted FileStore XFS 
disks, Jewel, sysvinit and Ubuntu 14.04.

I would like to upgrade this system to Luminous, but since I have to re-install 
all servers and re-format all disks I'd like to move it to BlueStore at the 
same time.

This system however has 768 3TB disks and has a utilization of about 60%. You 
can guess, it will take a long time before all the backfills complete.

The idea is to take a machine down, wipe all disks, re-install it with Ubuntu 
16.04 and Luminous and re-format the disks with BlueStore.

The OSDs get back, start to backfill and we wait.

My estimation is that we can do one machine per day, but we have 48 machines to 
do. Realistically this will take ~60 days to complete.

Afaik running Jewel (10.2.10) mixed with Luminous (12.2.2) should work just 
fine, but I wanted to check if there are any caveats I don't know about.

I'll upgrade the MONs to Luminous first before starting to upgrade the OSDs. 
Between each machine I'll wait for a HEALTH_OK before proceeding allowing the 
MONs to trim their datastore.

The question is: Does it hurt to run Jewel and Luminous mixed for ~60 days?

I think it won't, but I wanted to double-check.

Wido
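
A minimal per-iteration check sketch for such a rolling rebuild, assuming the
mons are already on Luminous before the first OSD host is touched; the final
require-osd-release step should only be run once the last OSD is on 12.2.x:

$ ceph health       # wait for HEALTH_OK before taking down the next host
$ ceph versions     # per-daemon version counts (needs luminous mons)
$ ceph osd require-osd-release luminous   # once, after the very last OSD host
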
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Memory leak in OSDs running 12.2.1 beyond the buffer_anon mempool leak

2017-12-05 Thread Konstantin Shalygin

> We are trying out Ceph on a small cluster and are observing memory
> leakage in the OSD processes.

Try new 12.2.2 - this release should fix memory issues with Bluestore.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] List directory in cephfs blocking very long time

2017-12-05 Thread David Turner
The 3.10 kernel is very old compared to 12.2.2. I would recommend trying a
newer kernel or using ceph-fuse. I personally use ceph-fuse. It is updated
with each release of Ceph and will match the new features released more
closely than the kernel driver.
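
A minimal ceph-fuse mount sketch, using a monitor address from the ceph.conf
quoted below; it assumes the client keyring is already in place under
/etc/ceph (without -m it falls back to mon_host from ceph.conf):

$ yum install -y ceph-fuse
$ ceph-fuse -m 10.0.30.11:6789 /mnt/cephfs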

On Tue, Dec 5, 2017, 6:59 AM 张建  wrote:

> Hello,
>
> My problem description:
> On cephfs client 1:
> Copy a directory "base" which contains hundreds of rpm files to the
> mounted cephfs directory.
> then , change to cephfs directory and execute "ls -l base", it returns
> quickly. This is no problem.
>
> On the other cephfs client,
> Change to the mounted cephfs directory and execute "ls -l base", it
> blocks for long time.
> Execute "strace ls -l base", we can find it blocks in lstat files (about
> 5 seconds blocking for each file).
>
> Is there anybody met this problem?
>
> ==
> cephfs client information:
> OS: CentOS Linux release 7.2.1511 (Core)
> kernel: Linux 3.10.0-327.el7.x86_64
> mount type: with ceph kernel driver
>
> ==
> ceph cluster information:
> OS: CentOS Linux release 7.2.1511 (Core)
> kernel: Linux 3.10.0-327.el7.x86_64
> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
> ceph cluster node1:  1 mon, 1 mgr, 1 mds, 4 osd
> ceph cluster node2:  1 mon, 1 mgr, 1 mds, 4 osd
> ceph cluster node3:  1 mon, 1 mgr, 1 mds, 4 osd
> Each osd is deployed with bluestore type (4TB HDD block device + 240GB
> SSD block.db device).
> public network:  port 1 of 56Gb/s Dual Port Infiniband, IPoIB
> cluster network: port 2 of 56Gb/s Dual Port Infiniband, IPoIB
>
> ceph.conf:
> -
> [global]
> fsid = 86887b08-6ff4-4ec1-a622-64bc688d1f2f
> mon_initial_members = storage-0, storage-1, storage-2
> mon_host = 10.0.30.11,10.0.30.12,10.0.30.13
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> public_network = 10.0.30.0/24
> cluster_network = 10.0.40.0/24
>
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
>
> osd_objectstore = bluestore
> bluestore_block_db_size = 236223201280
>
> ==
> strace log:
> -
> [root@computing-1 cephfs]# strace ls -l base
> execve("/usr/bin/ls", ["ls", "-l", "base"], [/* 24 vars */]) = 0
> brk(0)  = 0x155f000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
> = 0x7f8c65b91000
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or
> directory)
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=53575, ...}) = 0
> mmap(NULL, 53575, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8c65b83000
> close(3)= 0
> open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
> read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=155744, ...}) = 0
> mmap(NULL, 2255216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f8c6574a000
> mprotect(0x7f8c6576e000, 2093056, PROT_NONE) = 0
> mmap(0x7f8c6596d000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000) = 0x7f8c6596d000
> mmap(0x7f8c6596f000, 6512, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8c6596f000
> close(3)= 0
> open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
> \26\0\0\0\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=20024, ...}) = 0
> mmap(NULL, 2114112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f8c65545000
> mprotect(0x7f8c65549000, 2093056, PROT_NONE) = 0
> mmap(0x7f8c65748000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f8c65748000
> close(3)= 0
> open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
> read(3,
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\37\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=37056, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
> = 0x7f8c65b82000
> mmap(NULL, 2130560, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f8c6533c000
> mprotect(0x7f8c65343000, 2097152, PROT_NONE) = 0
> mmap(0x7f8c65543000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f8c65543000
> close(3)= 0
> open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
> \34\2\0\0\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
> mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
> 0) = 0x7f8c64f7b000
> mprotect(0x7f8c65131000, 2097152, PROT_NONE) = 0
> 

Re: [ceph-users] List directory in cephfs blocking very long time

2017-12-05 Thread David C
Not seen this myself but you should update to at least CentOS 7.3, ideally
7.4. I believe a lot of cephfs fixes went into those kernels. If you still
have the issue with the CentOS kernels, test with the latest upstream
kernel. And/or test with latest Fuse client.

On Tue, Dec 5, 2017 at 12:01 PM, 张建  wrote:

> Hello,
>
> My problem description:
> On cephfs client 1:
> Copy a directory "base" which contains hundreds of rpm files to the
> mounted cephfs directory.
> then , change to cephfs directory and execute "ls -l base", it returns
> quickly. This is no problem.
>
> On the other cephfs client,
> Change to the mounted cephfs directory and execute "ls -l base", it blocks
> for long time.
> Execute "strace ls -l base", we can find it blocks in lstat files (about 5
> seconds blocking for each file).
>
> Is there anybody met this problem?
>
> ==
> cephfs client information:
> OS: CentOS Linux release 7.2.1511 (Core)
> kernel: Linux 3.10.0-327.el7.x86_64
> mount type: with ceph kernel driver
>
> ==
> ceph cluster information:
> OS: CentOS Linux release 7.2.1511 (Core)
> kernel: Linux 3.10.0-327.el7.x86_64
> ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
> ceph cluster node1:  1 mon, 1 mgr, 1 mds, 4 osd
> ceph cluster node2:  1 mon, 1 mgr, 1 mds, 4 osd
> ceph cluster node3:  1 mon, 1 mgr, 1 mds, 4 osd
> Each osd is deployed with bluestore type (4TB HDD block device + 240GB SSD
> block.db device).
> public network:  port 1 of 56Gb/s Dual Port Infiniband, IPoIB
> cluster network: port 2 of 56Gb/s Dual Port Infiniband, IPoIB
>
> ceph.conf:
> -
> [global]
> fsid = 86887b08-6ff4-4ec1-a622-64bc688d1f2f
> mon_initial_members = storage-0, storage-1, storage-2
> mon_host = 10.0.30.11,10.0.30.12,10.0.30.13
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
>
> public_network = 10.0.30.0/24
> cluster_network = 10.0.40.0/24
>
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
>
> osd_objectstore = bluestore
> bluestore_block_db_size = 236223201280
>
> ==
> strace log:
> -
> [root@computing-1 cephfs]# strace ls -l base
> execve("/usr/bin/ls", ["ls", "-l", "base"], [/* 24 vars */]) = 0
> brk(0)  = 0x155f000
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7f8c65b91000
> access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or
> directory)
> open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
> fstat(3, {st_mode=S_IFREG|0644, st_size=53575, ...}) = 0
> mmap(NULL, 53575, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8c65b83000
> close(3)= 0
> open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=155744, ...}) = 0
> mmap(NULL, 2255216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
> = 0x7f8c6574a000
> mprotect(0x7f8c6576e000, 2093056, PROT_NONE) = 0
> mmap(0x7f8c6596d000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000) = 0x7f8c6596d000
> mmap(0x7f8c6596f000, 6512, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8c6596f000
> close(3)= 0
> open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
> \26\0\0\0\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=20024, ...}) = 0
> mmap(NULL, 2114112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
> = 0x7f8c65545000
> mprotect(0x7f8c65549000, 2093056, PROT_NONE) = 0
> mmap(0x7f8c65748000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f8c65748000
> close(3)= 0
> open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
> read(3, 
> "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\37\0\0\0\0\0\0"...,
> 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=37056, ...}) = 0
> mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) =
> 0x7f8c65b82000
> mmap(NULL, 2130560, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
> = 0x7f8c6533c000
> mprotect(0x7f8c65343000, 2097152, PROT_NONE) = 0
> mmap(0x7f8c65543000, 8192, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f8c65543000
> close(3)= 0
> open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
> read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0
> \34\2\0\0\0\0\0"..., 832) = 832
> fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
> mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0)
> = 0x7f8c64f7b000
> mprotect(0x7f8c65131000, 2097152, PROT_NONE) 

[ceph-users] List directory in cephfs blocking very long time

2017-12-05 Thread 张建

Hello,

My problem description:
On cephfs client 1:
Copy a directory "base" which contains hundreds of rpm files to the 
mounted cephfs directory.
then , change to cephfs directory and execute "ls -l base", it returns 
quickly. This is no problem.


On the other cephfs client,
Change to the mounted cephfs directory and execute "ls -l base", it 
blocks for long time.
Execute "strace ls -l base", we can find it blocks in lstat files (about 
5 seconds blocking for each file).


Is there anybody met this problem?

==
cephfs client information:
OS: CentOS Linux release 7.2.1511 (Core)
kernel: Linux 3.10.0-327.el7.x86_64
mount type: with ceph kernel driver

==
ceph cluster information:
OS: CentOS Linux release 7.2.1511 (Core)
kernel: Linux 3.10.0-327.el7.x86_64
ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous 
(stable)

ceph cluster node1:  1 mon, 1 mgr, 1 mds, 4 osd
ceph cluster node2:  1 mon, 1 mgr, 1 mds, 4 osd
ceph cluster node3:  1 mon, 1 mgr, 1 mds, 4 osd
Each osd is deployed with bluestore type (4TB HDD block device + 240GB 
SSD block.db device).

public network:  port 1 of 56Gb/s Dual Port Infiniband, IPoIB
cluster network: port 2 of 56Gb/s Dual Port Infiniband, IPoIB

ceph.conf:
-
[global]
fsid = 86887b08-6ff4-4ec1-a622-64bc688d1f2f
mon_initial_members = storage-0, storage-1, storage-2
mon_host = 10.0.30.11,10.0.30.12,10.0.30.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

public_network = 10.0.30.0/24
cluster_network = 10.0.40.0/24

osd_pool_default_size = 2
osd_pool_default_min_size = 1

osd_objectstore = bluestore
bluestore_block_db_size = 236223201280

==
strace log:
-
[root@computing-1 cephfs]# strace ls -l base
execve("/usr/bin/ls", ["ls", "-l", "base"], [/* 24 vars */]) = 0
brk(0)  = 0x155f000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f8c65b91000
access("/etc/ld.so.preload", R_OK)  = -1 ENOENT (No such file or 
directory)

open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=53575, ...}) = 0
mmap(NULL, 53575, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f8c65b83000
close(3)    = 0
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\300j\0\0\0\0\0\0"..., 
832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=155744, ...}) = 0
mmap(NULL, 2255216, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 
0) = 0x7f8c6574a000

mprotect(0x7f8c6576e000, 2093056, PROT_NONE) = 0
mmap(0x7f8c6596d000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23000) = 0x7f8c6596d000
mmap(0x7f8c6596f000, 6512, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8c6596f000

close(3)    = 0
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 
\26\0\0\0\0\0\0"..., 832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=20024, ...}) = 0
mmap(NULL, 2114112, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 
0) = 0x7f8c65545000

mprotect(0x7f8c65549000, 2093056, PROT_NONE) = 0
mmap(0x7f8c65748000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f8c65748000

close(3)    = 0
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\200\37\0\0\0\0\0\0"..., 
832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=37056, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= 0x7f8c65b82000
mmap(NULL, 2130560, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 
0) = 0x7f8c6533c000

mprotect(0x7f8c65343000, 2097152, PROT_NONE) = 0
mmap(0x7f8c65543000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f8c65543000

close(3)    = 0
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 
\34\2\0\0\0\0\0"..., 832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=2107816, ...}) = 0
mmap(NULL, 3932736, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 
0) = 0x7f8c64f7b000

mprotect(0x7f8c65131000, 2097152, PROT_NONE) = 0
mmap(0x7f8c65331000, 24576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b6000) = 0x7f8c65331000
mmap(0x7f8c65337000, 16960, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f8c65337000

close(3)    = 0
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, 
"\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\360\25\0\0\0\0\0\0"..., 
832) = 832

fstat(3, {st_mode=S_IFREG|0755, st_size=398272, ...}) = 0
mmap(NULL, 2490888, 

Re: [ceph-users] Adding multiple OSD

2017-12-05 Thread Richard Hesketh
On 05/12/17 09:20, Ronny Aasen wrote:
> On 05. des. 2017 00:14, Karun Josy wrote:
>> Thank you for detailed explanation!
>>
>> Got one another doubt,
>>
>> This is the total space available in the cluster :
>>
>> TOTAL : 23490G
>> Use  : 10170G
>> Avail : 13320G
>>
>>
>> But ecpool shows max avail as just 3 TB. What am I missing ?
>>
>> Karun Josy
> 
> without knowing details of your cluster, this is just guesswork, 
> but...
> 
> perhaps one of your hosts has less free space than the others; replicated 
> can pick 3 of the hosts that have plenty of space, but erasure perhaps 
> requires more hosts, so the host with the least space is the limiting factor.
> 
> check
> ceph osd df tree
> 
> to see how it looks.
> 
> 
> kind regards
> Ronny Aasen

From previous emails the erasure code profile is k=5,m=3, with a host failure 
domain, so the EC pool does use all eight hosts for every object. I agree it's 
very likely that the problem is that your hosts currently have heterogeneous 
capacity and the maximum data in the EC pool will be limited by the size of the 
smallest host.

Also remember that with this profile, you have a 3/5 overhead on your data, so 
1GB of real data stored in the pool translates to 1.6GB of raw data on disk. 
The pool usage and max available stats are given in terms of real data, but the 
cluster TOTAL usage/availability is expressed in terms of the raw space (since 
real usable data will vary depending on pool settings). If you check, you will 
probably find that your lowest-capacity host has near 6TB of space free, which 
would let you store a little over 3.5TB of real data in your EC pool.

Rich
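
A quick way to double-check the numbers behind that overhead (the profile
name is a placeholder):

$ ceph osd erasure-code-profile get <profile-name>   # should show k=5, m=3 here
# raw space written per GB of user data: (k+m)/k = (5+3)/5 = 1.6
# usable share of raw space:             k/(k+m) = 5/8     = 0.625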



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD down with Ceph version of Kraken

2017-12-05 Thread Dave.Chen
Hi,

Our Ceph version is Kraken, and each storage node has up to 90 hard disks 
that can be used for OSDs. We configured the messenger type as "simple". I 
noticed that the "simple" type can create lots of threads and hence consume 
a lot of resources, and we have observed that this configuration causes 
frequent OSD failures. Is there any configuration that could help to work 
around the issue of OSD failures?

Thanks in advance!

Best Regards,
Dave Chen
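
A hedged sketch of the two knobs usually looked at in this situation; whether
they fit this exact failure mode is an assumption, not a given. The async
messenger uses a fixed worker-thread pool instead of two threads per
connection, and with ~90 OSDs per node the OS thread/pid limits are often the
first thing hit:

[global]
ms type = async+posix

$ sysctl -w kernel.pid_max=4194303   # on the OSD hosts, raise the pid/thread limit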

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP with some basics please

2017-12-05 Thread Stefan Kooman
Quoting tim taler (robur...@gmail.com):
> And I'm still puzzled about the implication of the cluster size on the
> amount of OSD failures.
> With size=2 min_size=1 one host could die and (if by chance there is
> NO read error on any bit on the living host) I could (theoretically)
> recover, is that right?
True.
> OR is it that if any two disks in the cluster fail at the same time
> (or while one is still being rebuild) all my data would be gone?
Only the objects that are located on those disks. So for example obj1
disk1,host1 and obj 1 on disk2,host2 ... you will lose data, yes.

Gr. Stefan
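
A small way to see which OSDs a given object actually maps to (pool and
object names are placeholders):

$ ceph osd map <pool> <object>   # prints the PG and its up/acting OSD set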

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Denes Dolhay

Hi,

This question popped up a few times already under filestore and 
bluestore too, but please help me understand, why this is?


"when you have 2 different objects, both with correct digests, in your 
cluster, the cluster can not know which of the 2 objects is the correct 
one."


Doesn't it use an epoch, or an omap epoch when storing new data? If so 
why can it not use the recent one?



Thanks,

Denes.


On 12/05/2017 10:14 AM, Ronny Aasen wrote:

On 05. des. 2017 09:18, Gonzalo Aguilar Delgado wrote:

Hi,

I created this. http://paste.debian.net/999172/ But the expiration 
date is too short. So I did this too https://pastebin.com/QfrE71Dg.


What I want to mention is that there's no known cause for what's 
happening. It's true that time desynch happens on reboot because few 
millis skew. But ntp corrects it fast. There are no network issues 
and the log of the osd is in the output.


I only see in other osd the errors that are becoming more and more 
usual:


2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a shard 2: soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head 
data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a shard 6: soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head 
data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.63 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head: failed 
to pick suitable auth object


Digests not matching basically. Someone told me that this can be 
caused by a faulty disk. So I replaced the offending drive, and now I 
found the new disk is happening the same. Ok. But this thread is not 
for checking the source of the problem. This will be done later.


This thread is to try recover an OSD that seems ok to the object 
store tool. This is:



Why it breaks here?



if i get errors on a disk that i suspect are from reasons other than 
the disk being faulty, i remove the disk from the cluster. run it 
thru smart disk tests + long test. then run it thru the vendors 
diagnostic tools (i have a separate 1u machine for this)

if the disk clears as OK i wipe it and reinsert it as a new OSD

the reason you are getting corrupt digests are probably the very 
common way most people get corruptions.. you have size=2 , min_size=1



when you have 2 different objects, both with correct digests, in your 
cluster, the cluster can not know which of the 2 objects is the correct 
correct one.  just search this list for all the users that end up in 
your situation for the same reason, also read this : 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html



simple rule of thumb
size=2, min_size=1 :: i do not care about my data, the data is 
volatile but i want the cluster to accept writes _all the time_


size=2, min_size=2 :: i can not afford real redundancy, but i do care 
a little about my data, i accept that the cluster will block writes in 
error situations until the problem is fixed.


size=3, min_size=2 :: i want safe and available data, and i understand 
that the ceph defaults are there for a reason.




basically: size=3, min_size=2 if you want to avoid corruptions.

remove-wipe-reinstall disks that have developed 
corruptions/inconsistencies with the cluster


kind regards
Ronny Aasen




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSD

2017-12-05 Thread Ronny Aasen

On 05. des. 2017 00:14, Karun Josy wrote:

Thank you for detailed explanation!

Got one another doubt,

This is the total space available in the cluster :

TOTAL : 23490G
Use  : 10170G
Avail : 13320G


But ecpool shows max avail as just 3 TB. What am I missing ?

==


$ ceph df
GLOBAL:
     SIZE       AVAIL      RAW USED     %RAW USED
     23490G     13338G       10151G         43.22
POOLS:
     NAME            ID     USED      %USED     MAX AVAIL     OBJECTS
     ostemplates     1       162G      2.79         1134G       42084
     imagepool       34      122G      2.11         1891G       34196
     cvm1            54      8058         0         1891G         950
     ecpool1         55     4246G     42.77         3546G     1232590


$ ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
  0   ssd 1.86469  1.0  1909G   625G  1284G 32.76 0.76 201
  1   ssd 1.86469  1.0  1909G   691G  1217G 36.23 0.84 208
  2   ssd 0.87320  1.0   894G   587G   306G 65.67 1.52 156
11   ssd 0.87320  1.0   894G   631G   262G 70.68 1.63 186
  3   ssd 0.87320  1.0   894G   605G   288G 67.73 1.56 165
14   ssd 0.87320  1.0   894G   635G   258G 71.07 1.64 177
  4   ssd 0.87320  1.0   894G   419G   474G 46.93 1.08 127
15   ssd 0.87320  1.0   894G   373G   521G 41.73 0.96 114
16   ssd 0.87320  1.0   894G   492G   401G 55.10 1.27 149
  5   ssd 0.87320  1.0   894G   288G   605G 32.25 0.74  87
  6   ssd 0.87320  1.0   894G   342G   551G 38.28 0.88 102
  7   ssd 0.87320  1.0   894G   300G   593G 33.61 0.78  93
22   ssd 0.87320  1.0   894G   343G   550G 38.43 0.89 104
  8   ssd 0.87320  1.0   894G   267G   626G 29.90 0.69  77
  9   ssd 0.87320  1.0   894G   376G   518G 42.06 0.97 118
10   ssd 0.87320  1.0   894G   322G   571G 36.12 0.83 102
19   ssd 0.87320  1.0   894G   339G   554G 37.95 0.88 109
12   ssd 0.87320  1.0   894G   360G   534G 40.26 0.93 112
13   ssd 0.87320  1.0   894G   404G   489G 45.21 1.04 120
20   ssd 0.87320  1.0   894G   342G   551G 38.29 0.88 103
23   ssd 0.87320  1.0   894G   148G   745G 16.65 0.38  61
17   ssd 0.87320  1.0   894G   423G   470G 47.34 1.09 117
18   ssd 0.87320  1.0   894G   403G   490G 45.18 1.04 120
21   ssd 0.87320  1.0   894G   444G   450G 49.67 1.15 130
                     TOTAL 23490G 10170G 13320G 43.30



Karun Josy

On Tue, Dec 5, 2017 at 4:42 AM, Karun Josy > wrote:


Thank you for detailed explanation!

Got one another doubt,

This is the total space available in the cluster :

TOTAL 23490G
Use 10170G
Avail : 13320G


But ecpool shows max avail as just 3 TB.




without knowing details of your cluster, this is just guesswork, 
but...


perhaps one of your hosts has less free space than the others; 
replicated can pick 3 of the hosts that have plenty of space, but 
erasure perhaps requires more hosts, so the host with the least space is the 
limiting factor.


check
ceph osd df tree

to see how it looks.


kind regards
Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] eu.ceph.com now has SSL/HTTPS

2017-12-05 Thread Wido den Hollander
Hi,

I just didn't think about it earlier, so I did it this morning: 
https://eu.ceph.com/

Using Let's Encrypt, eu.ceph.com is now available over HTTPS as well.

If you are using this mirror you might want to use SSL to download your 
packages and keys.

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Ronny Aasen

On 05. des. 2017 09:18, Gonzalo Aguilar Delgado wrote:

Hi,

I created this. http://paste.debian.net/999172/ But the expiration date 
is too short. So I did this too https://pastebin.com/QfrE71Dg.


What I want to mention is that there's no known cause for what's 
happening. It's true that time desynch happens on reboot because few 
millis skew. But ntp corrects it fast. There are no network issues and 
the log of the osd is in the output.


I only see in other osd the errors that are becoming more and more usual:

2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a shard 2: soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 
0xfae07534 != data_digest 0xe2de2a76 from auth oi 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a shard 6: soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 
0xfae07534 != data_digest 0xe2de2a76 from auth oi 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.63 7f0feff7f700 -1 log_channel(cluster) log 
[ERR] : 10.7a soid 
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head: failed to 
pick suitable auth object


Digests not matching basically. Someone told me that this can be caused 
by a faulty disk. So I replaced the offending drive, and now I found the 
new disk is happening the same. Ok. But this thread is not for checking 
the source of the problem. This will be done later.


This thread is to try recover an OSD that seems ok to the object store 
tool. This is:



Why it breaks here?



if i get errors on a disk that i suspect are from reasons other than the 
disk being faulty, i remove the disk from the cluster. run it thru 
smart disk tests + long test. then run it thru the vendors diagnostic 
tools (i have a separate 1u machine for this)

if the disk clears as OK i wipe it and reinsert it as a new OSD

the reason you are getting corrupt digests are probably the very common 
way most people get corruptions.. you have size=2 , min_size=1



when you have 2 different objects, both with correct digests, in your 
cluster, the cluster can not know which of the 2 objects is the correct 
one.  just search this list for all the users that end up in your 
situation for the same reason, also read this : 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html



simple rule of thumb
size=2, min_size=1 :: i do not care about my data, the data is volatile 
but i want the cluster to accept writes _all the time_


size=2, min_size=2 :: i can not afford real redundancy, but i do care a 
little about my data, i accept that the cluster will block writes in 
error situations until the problem is fixed.


size=3, min_size=2 :: i want safe and available data, and i understand 
that the ceph defaults are there for a reason.




basically: size=3, min_size=2 if you want to avoid corruptions.

remove-wipe-reinstall disks that have developed 
corruptions/inconsistencies with the cluster


kind regards
Ronny Aasen
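
A minimal sketch of that remove-wipe-reinsert cycle on a Jewel cluster;
osd.N is a placeholder, and the diagnostics/wipe in the middle is whatever
your vendor tooling provides:

$ ceph osd out osd.N              # let the data drain off the OSD first
$ systemctl stop ceph-osd@N       # or the sysvinit equivalent
$ ceph osd crush remove osd.N
$ ceph auth del osd.N
$ ceph osd rm osd.N
# ... run smart/vendor diagnostics, wipe, then redeploy the disk as a new OSD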




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Another OSD broken today. How can I recover it?

2017-12-05 Thread Gonzalo Aguilar Delgado
Hi,

I created this. http://paste.debian.net/999172/ But the expiration date
is too short. So I did this too https://pastebin.com/QfrE71Dg.

What I want to mention is that there's no known cause for what's
happening. It's true that time desynch happens on reboot because few
millis skew. But ntp corrects it fast. There are no network issues and
the log of the osd is in the output.

I only see in other osd the errors that are becoming more and more usual:

2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log
[ERR] : 10.7a shard 2: soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest
0xfae07534 != data_digest 0xe2de2a76 from auth oi
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log
[ERR] : 10.7a shard 6: soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest
0xfae07534 != data_digest 0xe2de2a76 from auth oi
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781
client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv
5250781 dd e2de2a76 od  alloc_hint [0 0])
2017-12-05 08:58:56.63 7f0feff7f700 -1 log_channel(cluster) log
[ERR] : 10.7a soid
10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head: failed to
pick suitable auth object

Digests not matching basically. Someone told me that this can be caused
by a faulty disk. So I replaced the offending drive, and now I found the
new disk is happening the same. Ok. But this thread is not for checking
the source of the problem. This will be done later.

This thread is to try recover an OSD that seems ok to the object store
tool. This is:


Why it breaks here?



starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4
/var/lib/ceph/osd/ceph-4/journal
osd/PG.cc: In function 'static int PG::peek_map_epoch(ObjectStore*,
spg_t, epoch_t*, ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03
13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)
 ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x80) [0x5556eab28790]
<- HERE
 2: (PG::peek_map_epoch(ObjectStore*, spg_t, unsigned int*,
ceph::buffer::list*)+0x661) [0x5556ea4e6601]
 3: (OSD::load_pgs()+0x75a) [0x5556ea43a8aa]
 4: (OSD::init()+0x2026) [0x5556ea445ca6]
 5: (main()+0x2ef1) [0x5556ea3b7301]
 6: (__libc_start_main()+0xf0) [0x7f467886b830]
 7: (_start()+0x29) [0x5556ea3f8b09]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.
2017-12-03 13:39:29.497091 7f467ba0b8c0 -1 osd/PG.cc: In function
'static int PG::peek_map_epoch(ObjectStore*, spg_t, epoch_t*,
ceph::bufferlist*)' thread 7f467ba0b8c0 time 2017-12-03 13:39:29.495311
osd/PG.cc: 3025: FAILED assert(values.size() == 2)


So it looks like the offending code is this one:

  int r = store->omap_get_values(coll, pgmeta_oid, keys, );
  if (r == 0) {
    assert(values.size() == 2); <-- Here

    // sanity check version


While the object store tool can run it without any problem. As you can
see here:


ceph-objectstore-tool --debug --op list-pgs --data-path
/var/lib/ceph/osd/ceph-4 --journal-path /dev/sdf3
2017-12-05 09:18:25.885258 7f5dd8b94a40  0
filestore(/var/lib/ceph/osd/ceph-4) backend xfs (magic 0x58465342)
2017-12-05 09:18:25.885715 7f5dd8b94a40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2017-12-05 09:18:25.885734 7f5dd8b94a40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-12-05 09:18:25.885755 7f5dd8b94a40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
splice is supported
2017-12-05 09:18:25.910484 7f5dd8b94a40  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2017-12-05 09:18:25.910545 7f5dd8b94a40  0
xfsfilestorebackend(/var/lib/ceph/osd/ceph-4) detect_feature: extsize is
disabled by conf
2017-12-05 09:18:26.639796 7f5dd8b94a40  0
filestore(/var/lib/ceph/osd/ceph-4) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2017-12-05 09:18:26.650560 7f5dd8b94a40  1 journal _open /dev/sdf3 fd
11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.662606 7f5dd8b94a40  1 journal _open /dev/sdf3 fd
11: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2017-12-05 09:18:26.664869 7f5dd8b94a40  1
filestore(/var/lib/ceph/osd/ceph-4) upgrade
Cluster fsid=9028f4da-0d77-462b-be9b-dbdf7fa57771
Supported features: compat={},rocompat={},incompat={1=initial feature
set(~v.18),2=pginfo object,3=object