Re: [ceph-users] Experience with 5k RPM/archive HDDs

2017-02-03 Thread Stillwell, Bryan J
On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander"
 wrote:

>
>> On 3 February 2017 at 11:03, Maxime Guyot
>> wrote:
>> 
>> 
>> Hi,
>> 
>> Interesting feedback!
>> 
>>  > In my opinion the SMR can be used exclusively for the RGW.
>>  > Unless it's something like a backup/archive cluster or pool with
>>little to no concurrent R/W access, you're likely to run out of IOPS
>>(again) long before filling these monsters up.
>> 
>> That's exactly the use case I am considering those archive HDDs for:
>>something like AWS Glacier, a form of offsite backup, probably via
>>radosgw. The classic Seagate enterprise-class HDDs provide "too much"
>>performance for this use case; I could live with 1/4 of the performance
>>at that price point.
>> 
>
>If you go down that route I suggest that you make a mixed cluster for RGW.
>
>A (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863,
>PM863, or an Intel DC series.
>
>All pools by default should go to those OSDs.
>
>Only the RGW buckets data pool should go to the big SMR drives. However,
>again, expect very, very low performance of those disks.
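
(In CRUSH terms, Wido's layout boils down to something like the commands
below.  The rule and root names are only examples, and the bucket data
pool name assumes the default jewel-era RGW zone, so adjust everything to
your own crushmap.  Every other pool stays on the SSD-backed rule:)

ceph osd crush rule create-simple ssd-rule ssd-root host
ceph osd crush rule create-simple smr-rule smr-root host
ceph osd pool set default.rgw.buckets.data crush_ruleset <smr-rule-id>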

One of the other concerns you should think about is recovery time when one
of these drives fails.  The more OSDs you have, the less of an issue this
becomes, but on a small cluster it might take over a day to fully recover
from an OSD failure, which is a decent amount of time to have degraded
PGs.
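
(Some rough numbers to illustrate: re-replicating a mostly full 8 TB
archive drive at an effective 30-40 MB/s of recovery traffic works out to
roughly 8,000,000 MB / 35 MB/s, which is about 230,000 seconds, or a bit
over 2.5 days.  The drive size and recovery throughput are only
illustrative assumptions; plug in your own numbers.)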

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD create with SSD journal

2017-01-11 Thread Stillwell, Bryan J
On 1/11/17, 10:31 AM, "ceph-users on behalf of Reed Dier"

wrote:

>>2017-01-03 12:10:23.514577 7f1d821f2800  0 ceph version 10.2.5
>>(c461ee19ecbc0c5c330aca20f7392c9a00730367), process ceph-osd, pid 19754
>> 2017-01-03 12:10:23.517465 7f1d821f2800  1
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkfs in
>>/var/lib/ceph/tmp/mnt.WaQmjK
>> 2017-01-03 12:10:23.517494 7f1d821f2800  1
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkfs fsid is already set to
>>644058d7-e1b0-4abe-92e2-43b17d75148e
>> 2017-01-03 12:10:23.517499 7f1d821f2800  1
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) write_version_stamp 4
>> 2017-01-03 12:10:23.517678 7f1d821f2800  0
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) backend xfs (magic 0x58465342)
>> 2017-01-03 12:10:23.519898 7f1d821f2800  1
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) leveldb db exists/created
>> 2017-01-03 12:10:23.520035 7f1d821f2800 -1
>>filestore(/var/lib/ceph/tmp/mnt.WaQmjK) mkjournal error creating journal
>>on /var/lib/ceph/tmp/mnt.WaQmjK/journal: (13) Permission denied
>> 2017-01-03 12:10:23.520049 7f1d821f2800 -1 OSD::mkfs: ObjectStore::mkfs
>>failed with error -13
>> 2017-01-03 12:10:23.520100 7f1d821f2800 -1 ESC[0;31m ** ERROR: error
>>creating empty object store in /var/lib/ceph/tmp/mnt.WaQmjK: (13)
>>Permission deniedESC[0m
>
>I ended up creating the OSDs with on-disk journals, then going back and
>moving the journals to the NVMe partition as intended, but I was hoping to
>do this all in one fell swoop, so I'm hoping there may be some pointers on
>something I may be doing incorrectly with ceph-deploy for the external
>journal location. I'm adding a handful of OSDs soon, and would like to do
>it correctly from the start.

What's the ownership of the journal device (/dev/nvme0n1p5)?

It should be owned by ceph:ceph or you'll get the permission denied error
message.
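
Something along these lines should show and fix it (the chown is a one-off
fix and won't survive a reboot unless the partition also carries the
ceph-disk journal type GUID that the udev rules key off, so treat the
sgdisk typecode below as an assumption and double-check it first):

ls -l /dev/nvme0n1p5
chown ceph:ceph /dev/nvme0n1p5
sgdisk --typecode=5:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/nvme0n1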

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-11 Thread Stillwell, Bryan J
John,

This morning I compared the logs from yesterday, and I see a noticeable
increase in messages like these:

2017-01-11 09:00:03.032521 7f70f15c1700 10 mgr handle_mgr_digest 575
2017-01-11 09:00:03.032523 7f70f15c1700 10 mgr handle_mgr_digest 441
2017-01-11 09:00:03.032529 7f70f15c1700 10 mgr notify_all notify_all:
notify_all mon_status
2017-01-11 09:00:03.032532 7f70f15c1700 10 mgr notify_all notify_all:
notify_all health
2017-01-11 09:00:03.032534 7f70f15c1700 10 mgr notify_all notify_all:
notify_all pg_summary
2017-01-11 09:00:03.033613 7f70f15c1700  4 mgr ms_dispatch active
mgrdigest v1
2017-01-11 09:00:03.033618 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1
2017-01-11 09:00:03.033620 7f70f15c1700 10 mgr handle_mgr_digest 575
2017-01-11 09:00:03.033622 7f70f15c1700 10 mgr handle_mgr_digest 441
2017-01-11 09:00:03.033628 7f70f15c1700 10 mgr notify_all notify_all:
notify_all mon_status
2017-01-11 09:00:03.033631 7f70f15c1700 10 mgr notify_all notify_all:
notify_all health
2017-01-11 09:00:03.033633 7f70f15c1700 10 mgr notify_all notify_all:
notify_all pg_summary
2017-01-11 09:00:03.532898 7f70f15c1700  4 mgr ms_dispatch active
mgrdigest v1
2017-01-11 09:00:03.532945 7f70f15c1700 -1 mgr ms_dispatch mgrdigest v1


In a one-minute period yesterday this group of messages showed up 84
times.  Today that same group of messages showed up 156 times.
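
(A count like that can be reproduced with a grep along these lines -- the
log path is an assumption, so adjust it to your mgr's log file:)

grep '2017-01-11 09:00' /var/log/ceph/ceph-mgr.*.log | grep -c 'notify_all pg_summary'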

Other than that, I did see an increase in this message from 9 times a
minute to 14 times a minute:

2017-01-11 09:00:00.402000 7f70f3d61700  0 -- 172.24.88.207:6800/4104 >> -
conn(0x563c9ee89000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0
l=0).fault with nothing to send and in the half  accept state just closed

Let me know if you need anything else.

Bryan


On 1/10/17, 10:00 AM, "ceph-users on behalf of Stillwell, Bryan J"
 wrote:

>On 1/10/17, 5:35 AM, "John Spray"  wrote:
>
>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>> wrote:
>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>> single node, two OSD cluster, and after a while I noticed that the new
>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>
>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>> ceph-mgr
>>>
>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>>> anyone else seen this?
>>
>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>daemon to see if it's obviously spinning in a particular place?
>
>I've injected that option into the ceph-mgr process, and now I'm just
>waiting for it to go out of control again.
>
>However, I've noticed quite a few messages like this in the logs already:
>
>2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>cs=1 l=0).fault initiating reconnect
>2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
>accept peer reset, then tried to connect to us, replacing
>2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>send and in the half  accept state just closed
>
>
>What's weird about that is that this is a single node cluster with
>ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>host.  So none of the communication should be leaving the node.
>
>Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
That's strange, I installed that version using packages from here:

http://download.ceph.com/debian-kraken/pool/main/c/ceph/


Bryan

On 1/10/17, 10:51 AM, "Samuel Just"  wrote:

>Can you push that branch somewhere?  I don't have a v11.1.1 or that sha1.
>-Sam
>
>On Tue, Jan 10, 2017 at 9:41 AM, Stillwell, Bryan J
> wrote:
>> This is from:
>>
>> ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)
>>
>> On 1/10/17, 10:23 AM, "Samuel Just"  wrote:
>>
>>>What ceph sha1 is that?  Does it include
>>>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>>>spin?
>>>-Sam
>>>
>>>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
>>> wrote:
>>>> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>>>>
>>>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>>>> wrote:
>>>>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>>>>> single node, two OSD cluster, and after a while I noticed that the
>>>>>>new
>>>>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>>>>
>>>>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>>>>> ceph-mgr
>>>>>>
>>>>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its
>>>>>>CPU
>>>>>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>>>>>Has
>>>>>> anyone else seen this?
>>>>>
>>>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>>>daemon to see if it's obviously spinning in a particular place?
>>>>
>>>> I've injected that option into the ceph-mgr process, and now I'm just
>>>> waiting for it to go out of control again.
>>>>
>>>> However, I've noticed quite a few messages like this in the logs
>>>>already:
>>>>
>>>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>
>>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN
>>>>pgs=2
>>>> cs=1 l=0).fault initiating reconnect
>>>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>
>>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>>l=0).handle_connect_msg
>>>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>>>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>
>>>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>>>l=0).handle_connect_msg
>>>> accept peer reset, then tried to connect to us, replacing
>>>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104
>>>>>>
>>>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>>>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing
>>>>to
>>>> send and in the half  accept state just closed
>>>>
>>>>
>>>> What's weird about that is that this is a single node cluster with
>>>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>>>> host.  So none of the communication should be leaving the node.
>>>>
>>>> Bryan
>>>>
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
This is from:

ceph version 11.1.1 (87597971b371d7f497d7eabad3545d72d18dd755)

On 1/10/17, 10:23 AM, "Samuel Just"  wrote:

>What ceph sha1 is that?  Does it include
>6c3d015c6854a12cda40673848813d968ff6afae which fixed the messenger
>spin?
>-Sam
>
>On Tue, Jan 10, 2017 at 9:00 AM, Stillwell, Bryan J
> wrote:
>> On 1/10/17, 5:35 AM, "John Spray"  wrote:
>>
>>>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
>>> wrote:
>>>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>>>> single node, two OSD cluster, and after a while I noticed that the new
>>>> ceph-mgr daemon is frequently using a lot of the CPU:
>>>>
>>>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>>>> ceph-mgr
>>>>
>>>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>>>> usage down to < 1%, but after a while it climbs back up to > 100%.
>>>>Has
>>>> anyone else seen this?
>>>
>>>Definitely worth investigating, could you set "debug mgr = 20" on the
>>>daemon to see if it's obviously spinning in a particular place?
>>
>> I've injected that option into the ceph-mgr process, and now I'm just
>> waiting for it to go out of control again.
>>
>> However, I've noticed quite a few messages like this in the logs
>>already:
>>
>> 2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
>> cs=1 l=0).fault initiating reconnect
>> 2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
>> 2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
>>l=0).handle_connect_msg
>> accept peer reset, then tried to connect to us, replacing
>> 2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
>> 172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
>> s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
>> send and in the half  accept state just closed
>>
>>
>> What's weird about that is that this is a single node cluster with
>> ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
>> host.  So none of the communication should be leaving the node.
>>
>> Bryan
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crushmap (tunables) flapping on cluster

2017-01-10 Thread Stillwell, Bryan J
On 1/10/17, 2:56 AM, "ceph-users on behalf of Breunig, Steve (KASRL)"
 wrote:

>Hi list,
>
>
>I'm running a cluster which is currently in migration from hammer to
>jewel.
>
>
>Actually i have the problem, that the tunables are flapping and a map of
>an rbd image is not working.
>
>
>It is flapping between:
>
>
>{
>"choose_local_tries": 0,
>"choose_local_fallback_tries": 0,
>"choose_total_tries": 50,
>"chooseleaf_descend_once": 1,
>"chooseleaf_vary_r": 1,
>"chooseleaf_stable": 0,
>"straw_calc_version": 1,
>"allowed_bucket_algs": 54,
>"profile": "hammer",
>"optimal_tunables": 0,
>"legacy_tunables": 0,
>"minimum_required_version": "hammer",
>"require_feature_tunables": 1,
>"require_feature_tunables2": 1,
>"has_v2_rules": 0,
>"require_feature_tunables3": 1,
>"has_v3_rules": 0,
>"has_v4_buckets": 1,
>"require_feature_tunables5": 0,
>"has_v5_rules": 0
>}
>
>
>and
>
>
>{
>"choose_local_tries": 0,
>"choose_local_fallback_tries": 0,
>"choose_total_tries": 50,
>"chooseleaf_descend_once": 1,
>"chooseleaf_vary_r": 1,
>"straw_calc_version": 1,
>"allowed_bucket_algs": 54,
>"profile": "hammer",
>"optimal_tunables": 0,
>"legacy_tunables": 0,
>"require_feature_tunables": 1,
>"require_feature_tunables2": 1,
>"require_feature_tunables3": 1,
>"has_v2_rules": 0,
>"has_v3_rules": 0,
>"has_v4_buckets": 1
>}
>
>
>Did someone have that problem too?
>How can it be solved?

Have you upgraded all the mon nodes?  My guess is that when you're running
'ceph osd crush show-tunables' it's sometimes being reported from a hammer
mon node and sometimes from a jewel mon node.

You can run 'ceph tell mon.* version' to verify they're all running the
same version.

When you say the map is failing, are you using the kernel rbd driver?  If
so you might need to upgrade your kernel to support the new features in
jewel.
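
If upgrading the kernel isn't an option right away, a common workaround is
to disable the newer image features on the affected image (this is only a
sketch; check which features the image actually has with 'rbd info'
first):

rbd feature disable <pool>/<image> deep-flatten fast-diff object-map exclusive-lock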

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-10 Thread Stillwell, Bryan J
On 1/10/17, 5:35 AM, "John Spray"  wrote:

>On Mon, Jan 9, 2017 at 11:46 PM, Stillwell, Bryan J
> wrote:
>> Last week I decided to play around with Kraken (11.1.1-1xenial) on a
>> single node, two OSD cluster, and after a while I noticed that the new
>> ceph-mgr daemon is frequently using a lot of the CPU:
>>
>> 17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
>> ceph-mgr
>>
>> Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
>> usage down to < 1%, but after a while it climbs back up to > 100%.  Has
>> anyone else seen this?
>
>Definitely worth investigating, could you set "debug mgr = 20" on the
>daemon to see if it's obviously spinning in a particular place?

I've injected that option into the ceph-mgr process, and now I'm just
waiting for it to go out of control again.
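
(For anyone wanting to do the same, one way to inject it at runtime is
through the mgr admin socket -- the socket path below is an assumption,
so adjust it to the actual .asok name on your node:)

ceph daemon /var/run/ceph/ceph-mgr.$(hostname).asok config set debug_mgr 20/20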

However, I've noticed quite a few messages like this in the logs already:

2017-01-10 09:56:07.441678 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800 s=STATE_OPEN pgs=2
cs=1 l=0).fault initiating reconnect
2017-01-10 09:56:07.442044 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept connect_seq 0 vs existing csq=2 existing_state=STATE_CONNECTING
2017-01-10 09:56:07.442067 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7dfea800 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg
accept peer reset, then tried to connect to us, replacing
2017-01-10 09:56:07.443026 7f70f4562700  0 -- 172.24.88.207:6800/4104 >>
172.24.88.207:0/4168225878 conn(0x563c7e0bc000 :6800
s=STATE_ACCEPTING_WAIT_CONNECT_MSG pgs=2 cs=0 l=0).fault with nothing to
send and in the half  accept state just closed


What's weird about that is that this is a single node cluster with
ceph-mgr, ceph-mon, and the ceph-osd processes all running on the same
host.  So none of the communication should be leaving the node.

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] High CPU usage by ceph-mgr on idle Ceph cluster

2017-01-09 Thread Stillwell, Bryan J
Last week I decided to play around with Kraken (11.1.1-1xenial) on a
single node, two OSD cluster, and after a while I noticed that the new
ceph-mgr daemon is frequently using a lot of the CPU:

17519 ceph  20   0  850044 168104208 S 102.7  4.3   1278:27
ceph-mgr

Restarting it with 'systemctl restart ceph-mgr*' seems to get its CPU
usage down to < 1%, but after a while it climbs back up to > 100%.  Has
anyone else seen this?

Bryan


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Total free space in addition to MAX AVAIL

2016-11-01 Thread Stillwell, Bryan J
On 11/1/16, 1:45 PM, "Sage Weil"  wrote:

>On Tue, 1 Nov 2016, Stillwell, Bryan J wrote:
>> I recently learned that 'MAX AVAIL' in the 'ceph df' output doesn't
>> represent what I thought it did.  It actually represents the amount of
>> data that can be used before the first OSD becomes full, and not the sum
>> of all free space across a set of OSDs.  This means that balancing the
>> data with 'ceph osd reweight' will actually increase the value of 'MAX
>> AVAIL'.
>> 
>> Knowing this I would like to graph both 'MAX AVAIL' and the total free
>> space across two different sets of OSDs so I can get an idea how out of
>> balance the cluster is.
>> 
>> This is where I'm running into trouble.  I have two different types of
>> Ceph nodes in my cluster.  One with all HDDs+SSD journals, and the other
>> with all SSDs using co-located journals.  There isn't any cache tiering
>> going on, so a pool either uses the all-HDD root, or the all-SSD root,
>>but
>> not both.
>> 
>> The only method I can think of to get this information is to walk the
>> CRUSH tree to figure out which OSDs are under a given root, and then use
>> the output of 'ceph osd df -f json' to sum up the free space of each
>>OSD.
>> Is there a better way?
>
>Try
>
>   ceph osd df tree -f json-pretty
>
>I think that'll give you all the right fields you need to sum.
>
>I wonder if this is something we should be reporting elsewhere, though?
>Summing up all free space is one thing.  Doing it per CRUSH hierarchy is
>something else.  Maybe the 'ceph osd df tree' output could have a field
>summing freespace for self + children in the json dump only...

That's just what I was looking for!  It also looks like the regular 'ceph
osd df tree' output has this information too:

# ceph osd df tree

ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  TYPE NAME
-8 0.52199        -   521G 61835M   461G 11.57 0.57 root ceph-ssd
-5 0.17400        -   173G 20615M   153G 11.58 0.57 host dev-ceph-ssd-001
 9 0.05800  1.0 59361M  5374M 53987M  9.05 0.45 osd.9
10 0.05800  1.0 59361M  6837M 52524M 11.52 0.57 osd.10
11 0.05800  1.0 59361M  8404M 50957M 14.16 0.70 osd.11
-6 0.17400        -   173G 20615M   153G 11.58 0.57 host dev-ceph-ssd-002
12 0.05800  1.0 59361M  7165M 52196M 12.07 0.60 osd.12
13 0.05800  1.0 59361M  6762M 52599M 11.39 0.56 osd.13
14 0.05800  1.0 59361M  6688M 52673M 11.27 0.56 osd.14
-7 0.17400        -   173G 20604M   153G 11.57 0.57 host dev-ceph-ssd-003
15 0.05800  1.0 59361M  8189M 51172M 13.80 0.68 osd.15
16 0.05800  1.0 59361M  4835M 54526M  8.15 0.40 osd.16
17 0.05800  1.0 59361M  7579M 51782M 12.77 0.63 osd.17
-1 0.57596        -   575G   161G   414G 27.97 1.39 root ceph-hdd
-2 0.19199        -   191G 49990M   143G 25.44 1.26 host dev-ceph-hdd-001
 0 0.06400  0.75000 65502M 15785M 49717M 24.10 1.19 osd.0
 1 0.06400  0.64999 65502M 17127M 48375M 26.15 1.30 osd.1
 2 0.06400  0.5 65502M 17077M 48425M 26.07 1.29 osd.2
-3 0.19199        -   191G 63885M   129G 32.51 1.61 host dev-ceph-hdd-002
 3 0.06400  1.0 65502M 28681M 36821M 43.79 2.17 osd.3
 4 0.06400  0.5 65502M 17246M 48256M 26.33 1.30 osd.4
 5 0.06400  0.84999 65502M 17958M 47544M 27.42 1.36 osd.5
-4 0.19199        -   191G 51038M   142G 25.97 1.29 host dev-ceph-hdd-003
 6 0.06400  0.64999 65502M 16617M 48885M 25.37 1.26 osd.6
 7 0.06400  0.7 65502M 16391M 49111M 25.02 1.24 osd.7
 8 0.06400  0.64999 65502M 18029M 47473M 27.52 1.36 osd.8
  TOTAL  1097G   221G   876G 20.18
MIN/MAX VAR: 0.40/2.17  STDDEV: 9.68



As you can tell I set the weights so that osd.3 would make the MAX AVAIL
difference more pronounced.  Also it appears like VAR is calculated on the
whole cluster instead of each root.
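
For graphing, something along these lines should pull the per-root free
space straight out of the json dump (the field names are assumptions based
on what the json output appears to contain, so verify them against your
cluster first):

ceph osd df tree -f json | jq -r '.nodes[] | select(.type=="root") | "\(.name) \(.kb_avail)"'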

Thanks!
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Total free space in addition to MAX AVAIL

2016-11-01 Thread Stillwell, Bryan J
I recently learned that 'MAX AVAIL' in the 'ceph df' output doesn't
represent what I thought it did.  It actually represents the amount of
data that can be used before the first OSD becomes full, and not the sum
of all free space across a set of OSDs.  This means that balancing the
data with 'ceph osd reweight' will actually increase the value of 'MAX
AVAIL'.

Knowing this I would like to graph both 'MAX AVAIL' and the total free
space across two different sets of OSDs so I can get an idea how out of
balance the cluster is.

This is where I'm running into trouble.  I have two different types of
Ceph nodes in my cluster.  One with all HDDs+SSD journals, and the other
with all SSDs using co-located journals.  There isn't any cache tiering
going on, so a pool either uses the all-HDD root, or the all-SSD root, but
not both.

The only method I can think of to get this information is to walk the
CRUSH tree to figure out which OSDs are under a given root, and then use
the output of 'ceph osd df -f json' to sum up the free space of each OSD.
Is there a better way?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Announcing the ceph-large mailing list

2016-10-20 Thread Stillwell, Bryan J
Do you run a large Ceph cluster?  Do you find that you run into issues
that you didn't have when your cluster was smaller?  If so we have a new
mailing list for you!

Announcing the new ceph-large mailing list.  This list is targeted at
experienced Ceph operators with cluster(s) over 500 OSDs to discuss
issues and experiences with going big.  If you're one of these people,
please join the list here:

http://lists.ceph.com/listinfo.cgi/ceph-large-ceph.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-14 Thread Stillwell, Bryan J
On 10/14/16, 2:29 PM, "Alfredo Deza"  wrote:

>On Thu, Oct 13, 2016 at 5:19 PM, Stillwell, Bryan J
> wrote:
>> On 10/13/16, 2:32 PM, "Alfredo Deza"  wrote:
>>
>>>On Thu, Oct 13, 2016 at 11:33 AM, Stillwell, Bryan J
>>> wrote:
>>>> I have a basement cluster that is partially built with Odroid-C2
>>>>boards
>>>>and
>>>> when I attempted to upgrade to the 10.2.3 release I noticed that this
>>>> release doesn't have an arm64 build.  Are there any plans on
>>>>continuing
>>>>to
>>>> make arm64 builds?
>>>
>>>We have a couple of machines for building ceph releases on ARM64 but
>>>unfortunately they sometimes have issues and since Arm64 is
>>>considered a "nice to have" at the moment we usually skip them if
>>>anything comes up.
>>>
>>>So it is an on-and-off kind of situation (I don't recall what happened
>>>for 10.2.3)
>>>
>>>But since you've asked, I can try to get them built and see if we can
>>>get 10.2.3 out.
>>
>> Sounds good, thanks Alfredo!
>
>10.2.3 arm64 for xenial (and centos7) is out. We only have xenial
>available for arm64, hopefully that will work for you.

Thanks Alfredo, but I'm only seeing xenial arm64 dbg packages here:

http://download.ceph.com/debian-jewel/pool/main/c/ceph/


There's also a report on IRC that the Packages file no longer contains the
10.2.3 amd64 packages for xenial.

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-13 Thread Stillwell, Bryan J
On 10/13/16, 2:32 PM, "Alfredo Deza"  wrote:

>On Thu, Oct 13, 2016 at 11:33 AM, Stillwell, Bryan J
> wrote:
>> I have a basement cluster that is partially built with Odroid-C2 boards
>>and
>> when I attempted to upgrade to the 10.2.3 release I noticed that this
>> release doesn't have an arm64 build.  Are there any plans on continuing
>>to
>> make arm64 builds?
>
>We have a couple of machines for building ceph releases on ARM64 but
>unfortunately they sometimes have issues and since Arm64 is
>considered a "nice to have" at the moment we usually skip them if
>anything comes up.
>
>So it is an on-and-off kind of situation (I don't recall what happened
>for 10.2.3)
>
>But since you've asked, I can try to get them built and see if we can
>get 10.2.3 out.

Sounds good, thanks Alfredo!

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Missing arm64 Ubuntu packages for 10.2.3

2016-10-13 Thread Stillwell, Bryan J
I have a basement cluster that is partially built with Odroid-C2 boards and 
when I attempted to upgrade to the 10.2.3 release I noticed that this release 
doesn't have an arm64 build.  Are there any plans on continuing to make arm64 
builds?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] upgrade from v0.94.6 or lower and 'failed to encode map X with expected crc'

2016-10-06 Thread Stillwell, Bryan J
Thanks Kefu!

Downgrading the mons to 0.94.6 got us out of this situation.  I appreciate
you tracking this down!

Bryan

On 10/4/16, 1:18 AM, "ceph-users on behalf of kefu chai"
 wrote:

>hi ceph users,
>
>If user upgrades the cluster from a prior release to v0.94.7 or up by
>following the steps:
>
>1. upgrade the monitors first,
>2. and then the OSDs.
>
>It is expected that the cluster log will be flooded with messages like:
>
>2016-07-12 08:42:42.1234567 osd.1234 [WRN] failed to encode map e4321
>with expected crc
>
>Because we changed[1] the encoding of OSDMap in v0.94.7. And the
>monitors start sending the incremental OSDMaps with the new encoding
>to the OSDs once the quorum members are all at the new version. But
>the OSDs at the old version still re-encode the osdmaps with the old
>encoding, then compare the resulting CRC with the one carried by the
>received incremental maps. And, they don't match! So the OSDs will ask
>the monitors for the full map in this case.
>
>For a large Ceph cluster, there are several consequences of the CRC
>mismatch:
>1. monitor being flooded by this clog
>2. monitor burdened by sending the fullmaps.
>3. the network saturated by the osdmap messages carrying the requested
>fullmaps
>4. slow requests observed if the updated osdmaps are delayed by the
>saturated network.
>
>as reported[2,3,4,5] by our users.
>
>The interim solution for those who are stuck in the middle of an upgrade
>is:
>
>1. revert all the monitors back to the previous version,
>2. upgrade the OSDs to the version you want to upgrade.
>3. upgrade the monitors to the version you want to upgrade.
>
>And for users who plan to upgrade from a version prior to v0.94.7 to
>v0.94.7 or up, please
>1. upgrade the OSDs to the version you want to upgrade
>2. upgrade the monitors to the version you want to upgrade.
>
>For users preferring upgrading from a version prior to v0.94.7 to
>jewel, it is suggested to upgrade to the latest hammer first by
>following the steps above, if the scale of your cluster is relatively
>large.
>
>And in the short term, we are preparing a fix[6] for hammer, so the
>monitors will send osdmap encoded with lower version encoding.
>
>In the long term, we won't use the new release feature bit in the
>cluster unless allowed explicitly[7].
>
>
>@ceph developers,
>
>so if we want to bump up the encoding version of OSDMap or its
>(sub)fields, I think it would be desirable to match the encoder with
>the new major release feature bit. For instance, if a new field named
>"foo" is added to `pg_pool_t` in kraken, and `map
>pools` is in turn a field of `OSDMap`, then we need to be careful when
>updating `pg_pool_t::encode()`, like
>
>void pg_pool_t::encode(bufferlist& bl, uint64_t features) const {
>  // ...
>  if ((features & CEPH_FEATURE_SERVER_KRAKEN) == 0) {
>// encode in the jewel way
>return;
>  }
>  // encode in the kraken way
>}
>
>Because,
>
>- it would be difficult for the monitor to send understandable osdmaps
>for all osds.
>- we disable/enable the new encoder by excluding/including the major
>release feature bit in [7].
>
>--
>[1] sha1 039240418060c9a49298dacc0478772334526dce
>[2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg30783.html
>[3] http://www.spinics.net/lists/ceph-users/msg28296.html
>[4] http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released
>[5] 
>http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013189.
>html
>[6] http://tracker.ceph.com/issues/17386
>[7] https://github.com/ceph/ceph/pull/11284
>
>-- 
>Regards
>Kefu Chai
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Upgrading 0.94.6 -> 0.94.9 saturating mon node networking

2016-09-23 Thread Stillwell, Bryan J
Will,

This issue in the tracker has an explanation of what is going on:

http://tracker.ceph.com/issues/17386


So the encoding change caused the old OSDs to start requesting full OSDMap
updates instead of incremental ones.

I would still like to know the purpose of changing the encoding so late in
the stable release series...
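
(For anyone pushing through a similar upgrade, something like this makes
it easy to watch how many OSDs are still reporting the old version:)

ceph tell osd.* version | grep -c '0.94.6'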

Bryan

On 9/22/16, 7:32 AM, "Will.Boege"  wrote:

>Just went through this upgrading a ~400 OSD cluster. I was in the EXACT
>spot you were in. The faster you can get all OSDs to the same version as
>the MONs the better. We decided to power forward and the performance got
>better for every OSD node we patched.
>
>Additionally I also discovered your LevelDBs will start growing
>exponentially if you leave your cluster in that state for too long.
>
>Pretty sure the downrev OSDs are aggressively getting osdmaps from the
>MONs causing some kind of spinlock condition.
>
>> On Sep 21, 2016, at 4:21 PM, Stillwell, Bryan J
>> wrote:
>> 
>> While attempting to upgrade a 1200+ OSD cluster from 0.94.6 to 0.94.9
>>I've
>> run into serious performance issues every time I restart an OSD.
>> 
>> At first I thought the problem I was running into was caused by the
>>osdmap
>> encoding bug that Dan and Wido ran into when upgrading to 0.94.7,
>>because
>> I was seeing a ton (millions) of these messages in the logs:
>> 
>> 2016-09-21 20:48:32.831040 osd.504 24.161.248.128:6810/96488 24 :
>>cluster
>> [WRN] failed to encode map e727985 with expected crc
>> 
>> Here are the links to their descriptions of the problem:
>> 
>> http://www.spinics.net/lists/ceph-devel/msg30450.html
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg30783.html
>> 
>> I tried the solution of using the following command to stop those errors
>> from occurring:
>> 
>> ceph tell osd.* injectargs '--clog_to_monitors false'
>> 
>> Which did get the messages to stop spamming the log files, however, it
>> didn't fix the performance issue for me.
>> 
>> Using dstat on the mon nodes I was able to determine that every time the
>> osdmap is updated (by running 'ceph osd pool set data size 2' in this
>> example) it causes the outgoing network on all mon nodes to be saturated
>> for multiple seconds at a time:
>> 
>> system total-cpu-usage --memory-usage- -net/total- -dsk/total- --io/total-
>> time |usr sys idl wai hiq siq| used  buff  cach  free| recv  send| read  writ| read  writ
>> 21-09 21:06:53|  1   0  99   0   0   0|11.8G  273M 18.7G  221G|2326k 9015k|   0  1348k|   0  16.0
>> 21-09 21:06:54|  1   1  98   0   0   0|11.9G  273M 18.7G  221G|  15M   10M|   0  1312k|   0  16.0
>> 21-09 21:06:55|  2   2  94   0   0   1|12.3G  273M 18.7G  220G|  14M  311M|   0    48M|   0   309
>> 21-09 21:06:56|  2   3  93   0   0   3|12.2G  273M 18.7G  220G|7745k 1190M|   0    16M|   0  93.0
>> 21-09 21:06:57|  1   2  96   0   0   1|12.0G  273M 18.7G  220G|8269k 1189M|   0  1956k|   0  10.0
>> 21-09 21:06:58|  3   1  95   0   0   1|11.8G  273M 18.7G  221G|4854k  752M|   0  4960k|   0  21.0
>> 21-09 21:06:59|  3   0  97   0   0   0|11.8G  273M 18.7G  221G|3098k   25M|   0  5036k|   0  26.0
>> 21-09 21:07:00|  1   0  98   0   0   0|11.8G  273M 18.7G  221G|2247k   25M|   0  9980k|   0  45.0
>> 21-09 21:07:01|  2   1  97   0   0   0|11.8G  273M 18.7G  221G|4149k   17M|   0    76M|   0   427
>> 
>> That would be 1190 MiB/s (or 9.982 Gbps).
>> 
>> Restarting every OSD on a node at once as part of the upgrade causes a
>> couple minutes worth of network saturation on all three mon nodes.  This
>> causes thousands of slow requests and many unhappy OpenStack users.
>> 
>> I'm now stuck about 15% into the upgrade and haven't been able to
>> determine how to move forward (or even backward) without causing another
>> outage.
>> 
>> I've attempted to run the same test on another cluster with 1300+ OSDs
>>and
>> the outgoing network on the mon nodes didn't exceed 15 MiB/s (0.126
>>Gbps).
>> 
>> Any suggestions on how I can proceed?
>> 
>> Thanks,
>> Bryan
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrading 0.94.6 -> 0.94.9 saturating mon node networking

2016-09-21 Thread Stillwell, Bryan J
While attempting to upgrade a 1200+ OSD cluster from 0.94.6 to 0.94.9 I've
run into serious performance issues every time I restart an OSD.

At first I thought the problem I was running into was caused by the osdmap
encoding bug that Dan and Wido ran into when upgrading to 0.94.7, because
I was seeing a ton (millions) of these messages in the logs:

2016-09-21 20:48:32.831040 osd.504 24.161.248.128:6810/96488 24 : cluster
[WRN] failed to encode map e727985 with expected crc

Here are the links to their descriptions of the problem:

http://www.spinics.net/lists/ceph-devel/msg30450.html
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg30783.html

I tried the solution of using the following command to stop those errors
from occurring:

ceph tell osd.* injectargs '--clog_to_monitors false'

Which did get the messages to stop spamming the log files, however, it
didn't fix the performance issue for me.

Using dstat on the mon nodes I was able to determine that every time the
osdmap is updated (by running 'ceph osd pool set data size 2' in this
example) it causes the outgoing network on all mon nodes to be saturated
for multiple seconds at a time:

system total-cpu-usage --memory-usage- -net/total- -dsk/total- --io/total-
 time |usr sys idl wai hiq siq| used  buff  cach  free| recv  send| read  writ| read  writ
21-09 21:06:53|  1   0  99   0   0   0|11.8G  273M 18.7G  221G|2326k 9015k|   0  1348k|   0  16.0
21-09 21:06:54|  1   1  98   0   0   0|11.9G  273M 18.7G  221G|  15M   10M|   0  1312k|   0  16.0
21-09 21:06:55|  2   2  94   0   0   1|12.3G  273M 18.7G  220G|  14M  311M|   0    48M|   0   309
21-09 21:06:56|  2   3  93   0   0   3|12.2G  273M 18.7G  220G|7745k 1190M|   0    16M|   0  93.0
21-09 21:06:57|  1   2  96   0   0   1|12.0G  273M 18.7G  220G|8269k 1189M|   0  1956k|   0  10.0
21-09 21:06:58|  3   1  95   0   0   1|11.8G  273M 18.7G  221G|4854k  752M|   0  4960k|   0  21.0
21-09 21:06:59|  3   0  97   0   0   0|11.8G  273M 18.7G  221G|3098k   25M|   0  5036k|   0  26.0
21-09 21:07:00|  1   0  98   0   0   0|11.8G  273M 18.7G  221G|2247k   25M|   0  9980k|   0  45.0
21-09 21:07:01|  2   1  97   0   0   0|11.8G  273M 18.7G  221G|4149k   17M|   0    76M|   0   427

That would be 1190 MiB/s (or 9.982 Gbps).

Restarting every OSD on a node at once as part of the upgrade causes a
couple minutes worth of network saturation on all three mon nodes.  This
causes thousands of slow requests and many unhappy OpenStack users.

I'm now stuck about 15% into the upgrade and haven't been able to
determine how to move forward (or even backward) without causing another
outage.

I've attempted to run the same test on another cluster with 1300+ OSDs and
the outgoing network on the mon nodes didn't exceed 15 MiB/s (0.126 Gbps).

Any suggestions on how I can proceed?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Stillwell, Bryan J
Thanks Somnath,

I'll try moving my testing to master tomorrow to see if that improves the
stability at all.

Bryan

On 8/3/16, 4:50 PM, "Somnath Roy"  wrote:

>Probably, it is better to move to the latest master and reproduce this
>defect. A lot of stuff has changed in between.
>This is a good test case, and I doubt any of us are testing by enabling
>fsck() on mount/unmount.
>
>Thanks & Regards
>Somnath
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Stillwell, Bryan J
>Sent: Wednesday, August 03, 2016 3:41 PM
>To: ceph-users@lists.ceph.com
>Subject: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures
>
>I've been doing some benchmarking of BlueStore in 10.2.2 the last few
>days and have come across a failure that keeps happening after stressing
>the cluster fairly heavily.  Some of the OSDs started failing and
>attempts to restart them fail to log anything in /var/log/ceph/, so I
>tried starting them manually and ran into these error messages:
>
># /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
>2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature
>'bluestore' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/
>/var/lib/ceph/osd/ceph-4/journal
>2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature
>'rocksdb' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>2016-08-02 22:52:01.501405 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
>67134 already in use
>2016-08-02 22:52:01.629900 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
>78351 already in use
>2016-08-02 22:52:01.967599 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent
>256983760896~1245184 intersects allocated blocks
>2016-08-02 22:52:01.967605 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
>2016-08-02 22:52:01.978635 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608
>intersects allocated blocks
>2016-08-02 22:52:01.978640 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
>2016-08-02 22:52:01.978647 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
>[0~252138684416,252138815488~4844945408,256984940544~1470103552,2584551751
>6
>8~5732719067136] != expected 0~5991174242304
>2016-08-02 22:52:02.987479 7f97d75e1800 -1
>bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
>2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to
>mount object store
>2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed:
>(5) Input/output error
>
>
>Here's another one:
>
># /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
>2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature
>'bluestore' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>important data.
>
>starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/
>/var/lib/ceph/osd/ceph-11/journal
>2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following
>dangerous and experimental features are enabled: *
>2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature
>'rocksdb' is enabled Please be aware that this feature is experimental,
>untested, unsupported, and may result in data corruption, data loss,
>and/or irreparable damage to your cluster.  Do not use feature with
>

[ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Stillwell, Bryan J
I've been doing some benchmarking of BlueStore in 10.2.2 the last few days
and
have come across a failure that keeps happening after stressing the cluster
fairly heavily.  Some of the OSDs started failing and attempts to restart
them
fail to log anything in /var/log/ceph/, so I tried starting them manually
and
ran into these error messages:

# /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature
'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/
/var/lib/ceph/osd/ceph-4/journal
2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature
'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

2016-08-02 22:52:01.501405 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
67134 already in use
2016-08-02 22:52:01.629900 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
78351 already in use
2016-08-02 22:52:01.967599 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 256983760896~1245184
intersects allocated blocks
2016-08-02 22:52:01.967605 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
2016-08-02 22:52:01.978635 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608
intersects allocated blocks
2016-08-02 22:52:01.978640 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
2016-08-02 22:52:01.978647 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
[0~252138684416,252138815488~4844945408,256984940544~1470103552,25845517516
8~5732719067136] != expected 0~5991174242304
2016-08-02 22:52:02.987479 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to
mount object store
2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed: (5)
Input/output error


Here's another one:

# /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature
'bluestore' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/
/var/lib/ceph/osd/ceph-11/journal
2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following
dangerous and experimental features are enabled: *
2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature
'rocksdb' is enabled
Please be aware that this feature is experimental, untested,
unsupported, and may result in data corruption, data loss,
and/or irreparable damage to your cluster.  Do not use
feature with important data.

2016-08-03 22:16:49.821451 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/)  6#20:2eed99bf:::4_257:head# nid
72869 already in use
2016-08-03 22:16:49.961943 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck free extent 257123155968~65536
intersects allocated blocks
2016-08-03 22:16:49.961950 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck overlap: [257123155968~65536]
2016-08-03 22:16:49.962012 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck leaked some space; free+used =
[0~241963433984,241963499520~5749210742784] != expected 0~5991174242304
2016-08-03 22:16:50.855099 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) mount fsck found 3 errors
2016-08-03 22:16:50.855109 7f0e4d949800 -1 osd.11 0 OSD:init: unable to
mount object store
2016-08-03 22:16:50.855118 7f0e4d949800 -1  ** ERROR: osd init failed: (5)
Input/output error


I currently have a total of 12 OSDs down (out of 46) which all appear to be
experiencing this problem.

Here are more details of the cluster (currently just a single node):

2x Xeon E5-2699 v4 @ 2.20GHz

[ceph-users] Multi-device BlueStore testing

2016-07-19 Thread Stillwell, Bryan J
I would like to do some BlueStore testing using multiple devices like mentioned 
here:

https://www.sebastien-han.fr/blog/2016/05/04/Ceph-Jewel-configure-BlueStore-with-multiple-devices/

However, simply creating the block.db and block.wal symlinks and pointing them 
at empty partitions doesn't appear to be enough:

2016-07-19 21:30:15.717827 7f48ec4d9800  1 bluestore(/var/lib/ceph/osd/ceph-0) 
mount path /var/lib/ceph/osd/ceph-0
2016-07-19 21:30:15.717855 7f48ec4d9800  1 bluestore(/var/lib/ceph/osd/ceph-0) 
fsck
2016-07-19 21:30:15.717869 7f48ec4d9800  1 bdev create path 
/var/lib/ceph/osd/ceph-0/block type kernel
2016-07-19 21:30:15.718367 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
open path /var/lib/ceph/osd/ceph-0/block
2016-07-19 21:30:15.718462 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
open size 6001069202944 (5588 GB) block_size 4096 (4096 B)
2016-07-19 21:30:15.718786 7f48ec4d9800  1 bdev create path 
/var/lib/ceph/osd/ceph-0/block.db type kernel
2016-07-19 21:30:15.719305 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) open path 
/var/lib/ceph/osd/ceph-0/block.db
2016-07-19 21:30:15.719388 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) open size 1023410176 (976 MB) 
block_size 4096 (4096 B)
2016-07-19 21:30:15.719394 7f48ec4d9800  1 bluefs add_block_device bdev 1 path 
/var/lib/ceph/osd/ceph-0/block.db size 976 MB
2016-07-19 21:30:15.719586 7f48ec4d9800 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block.db) _read_bdev_label unable to decode 
label at offset 66: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
2016-07-19 21:30:15.719597 7f48ec4d9800 -1 bluestore(/var/lib/ceph/osd/ceph-0) 
_open_db check block device(/var/lib/ceph/osd/ceph-0/block.db) label returned: 
(22) Invalid argument
2016-07-19 21:30:15.719602 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) close
2016-07-19 21:30:15.999311 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
close
2016-07-19 21:30:16.243312 7f48ec4d9800 -1 osd.0 0 OSD:init: unable to mount 
object store

I originally used 'ceph-disk prepare --bluestore' to create the OSD, but I feel 
like there is some kind of initialization step I need to do when moving the db 
and wal over to an NVMe device.  My google searches just aren't turning up 
much.  Could someone point me in the right direction?

Thanks,
Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg has invalid (post-split) stats; must scrub before tier agent can activate

2016-06-16 Thread Stillwell, Bryan J
I wanted to report back what the solution was to this problem.  It appears
like I was running into this bug:

http://tracker.ceph.com/issues/16113


After running 'ceph osd unset sortbitwise' all the unfound objects were
found!  Which makes me happy again.  :)

Bryan

On 5/24/16, 4:27 PM, "Stillwell, Bryan J" 
wrote:

>On one of my test clusters that I've upgraded from Infernalis to Jewel
>(10.2.1), I'm having a problem where reads are resulting in unfound
>objects.
>
>I'm using CephFS on top of an erasure-coded pool with cache tiering, which I
>believe is related.
>
>From what I can piece together, here is what the sequence of events looks
>like:
>
>Try doing an md5sum on all files in a directory:
>
>$ date
>Tue May 24 16:06:01 MDT 2016
>$ md5sum *
>
>
>Shortly afterward I see this in the logs:
>
>2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster
>[INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different
>host after 21.000162 >= grace 20.00)
>2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster
>[INF] osd.24 172.24.88.54:6813/21502 boot
>
>2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster
>[INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different
>host after 21.000314 >= grace 20.00)
>2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster
>[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster
>[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster
>[INF] osd.21 172.24.88.56:6800/27171 boot
>2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster
>[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster
>[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster
>[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster
>[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster
>[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster
>[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster
>[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster
>[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster
>[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
>can activate
>2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830

Re: [ceph-users] Rebuilding/recreating CephFS journal?

2016-05-27 Thread Stillwell, Bryan J
On 5/27/16, 3:23 PM, "Gregory Farnum"  wrote:

>On Fri, May 27, 2016 at 2:22 PM, Stillwell, Bryan J
> wrote:
>> Here's the full 'ceph -s' output:
>>
>> # ceph -s
>> cluster c7ba6111-e0d6-40e8-b0af-8428e8702df9
>>  health HEALTH_ERR
>> mds rank 0 is damaged
>> mds cluster is degraded
>>  monmap e5: 3 mons at
>> {b3=172.24.88.53:6789/0,b4=172.24.88.54:6789/0,lira=172.24.88.20:6789/0}
>> election epoch 320, quorum 0,1,2 lira,b3,b4
>>   fsmap e287: 0/1/1 up, 1 up:standby, 1 damaged
>>  osdmap e35262: 21 osds: 21 up, 21 in
>> flags sortbitwise
>>   pgmap v10096597: 480 pgs, 4 pools, 23718 GB data, 5951 kobjects
>> 35758 GB used, 11358 GB / 47116 GB avail
>>  479 active+clean
>>1 active+clean+scrubbing+deep
>
>Yeah, you should just need to mark mds 0 as repaired at this point.

Thanks Greg!  I ran 'ceph mds repaired 0' and it's working again!

Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebuilding/recreating CephFS journal?

2016-05-27 Thread Stillwell, Bryan J
Here's the full 'ceph -s' output:

# ceph -s
cluster c7ba6111-e0d6-40e8-b0af-8428e8702df9
 health HEALTH_ERR
mds rank 0 is damaged
mds cluster is degraded
 monmap e5: 3 mons at
{b3=172.24.88.53:6789/0,b4=172.24.88.54:6789/0,lira=172.24.88.20:6789/0}
election epoch 320, quorum 0,1,2 lira,b3,b4
  fsmap e287: 0/1/1 up, 1 up:standby, 1 damaged
 osdmap e35262: 21 osds: 21 up, 21 in
flags sortbitwise
  pgmap v10096597: 480 pgs, 4 pools, 23718 GB data, 5951 kobjects
35758 GB used, 11358 GB / 47116 GB avail
 479 active+clean
   1 active+clean+scrubbing+deep






On 5/27/16, 3:17 PM, "Gregory Farnum"  wrote:

>What's the current full output of "ceph -s"?
>
>If you already had your MDS in damaged state, you might just need to
>mark it as repaired. That's a monitor command.
>
>On Fri, May 27, 2016 at 2:09 PM, Stillwell, Bryan J
> wrote:
>> On 5/27/16, 3:01 PM, "Gregory Farnum"  wrote:
>>
>>>>
>>>> So would the next steps be to run the following commands?:
>>>>
>>>> cephfs-table-tool 0 reset session
>>>> cephfs-table-tool 0 reset snap
>>>> cephfs-table-tool 0 reset inode
>>>> cephfs-journal-tool --rank=0 journal reset
>>>> cephfs-data-scan init
>>>>
>>>> cephfs-data-scan scan_extents data
>>>> cephfs-data-scan scan_inodes data
>>>
>>>No, definitely not. I think you just need to reset the journal again,
>>>since you wiped out a bunch of its data with that fs reset command.
>>>Since your backing data should already be consistent you don't need to
>>>do any data scans. Your snap and inode tables might be corrupt,
>>>but...hopefully not. If they are busted...actually, I don't remember;
>>>maybe you will need to run the data scan tooling to repair those. I'd
>>>try to avoid it if possible just because of the time involved. (It'll
>>>become obvious pretty quickly if the inode tables are no good.)
>>
>> So when I attempt to reset the journal again I get this:
>>
>> # cephfs-journal-tool journal reset
>> journal does not exist on-disk. Did you set a bad rank?2016-05-27
>> 15:03:30.016326 7f63f987e700  0 client.20626476.journaler(ro) error
>> getting journal off disk
>>
>> Error loading journal: (2) No such file or directory, pass --force to
>> forcibly reset this journal
>> Error ((2) No such file or directory)
>>
>>
>>
>> And then I tried to force it which seemed to succeed:
>>
>> # cephfs-journal-tool journal reset --force
>> writing EResetJournal entry
>>
>>
>>
>> However, when I restart the mds it gets stuck in standby mode:
>>
>> 2016-05-27 15:05:57.080672 7fe0cccd8700 -1 mds.b4 *** got signal
>> Terminated ***
>> 2016-05-27 15:05:57.080703 7fe0cccd8700  1 mds.b4 suicide.  wanted state
>> up:standby
>> 2016-05-27 15:06:04.527203 7f500f28a180  0 set uid:gid to 64045:64045
>> (ceph:ceph)
>> 2016-05-27 15:06:04.527259 7f500f28a180  0 ceph version 10.2.0
>> (3a9fba20ec743699b69bd0181dd6c54dc01c64b9), process ceph-mds, pid 19163
>> 2016-05-27 15:06:04.527569 7f500f28a180  0 pidfile_write: ignore empty
>> --pid-file
>> 2016-05-27 15:06:04.637842 7f5008a04700  1 mds.b4 handle_mds_map standby
>>
>>
>>
>> The relevant output from 'ceph -s' looks like this:
>>
>>   fsmap e287: 0/1/1 up, 1 up:standby, 1 damaged
>>
>>
>> What am I missing?
>>
>> Thanks,
>> Bryan
>>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebuilding/recreating CephFS journal?

2016-05-27 Thread Stillwell, Bryan J
On 5/27/16, 3:01 PM, "Gregory Farnum"  wrote:

>>
>> So would the next steps be to run the following commands?:
>>
>> cephfs-table-tool 0 reset session
>> cephfs-table-tool 0 reset snap
>> cephfs-table-tool 0 reset inode
>> cephfs-journal-tool --rank=0 journal reset
>> cephfs-data-scan init
>>
>> cephfs-data-scan scan_extents data
>> cephfs-data-scan scan_inodes data
>
>No, definitely not. I think you just need to reset the journal again,
>since you wiped out a bunch of its data with that fs reset command.
>Since your backing data should already be consistent you don't need to
>do any data scans. Your snap and inode tables might be corrupt,
>but...hopefully not. If they are busted...actually, I don't remember;
>maybe you will need to run the data scan tooling to repair those. I'd
>try to avoid it if possible just because of the time involved. (It'll
>become obvious pretty quickly if the inode tables are no good.)

So when I attempt to reset the journal again I get this:

# cephfs-journal-tool journal reset
journal does not exist on-disk. Did you set a bad rank?2016-05-27
15:03:30.016326 7f63f987e700  0 client.20626476.journaler(ro) error
getting journal off disk

Error loading journal: (2) No such file or directory, pass --force to
forcibly reset this journal
Error ((2) No such file or directory)



And then I tried to force it which seemed to succeed:

# cephfs-journal-tool journal reset --force
writing EResetJournal entry



However, when I restart the mds it gets stuck in standby mode:

2016-05-27 15:05:57.080672 7fe0cccd8700 -1 mds.b4 *** got signal
Terminated ***
2016-05-27 15:05:57.080703 7fe0cccd8700  1 mds.b4 suicide.  wanted state
up:standby
2016-05-27 15:06:04.527203 7f500f28a180  0 set uid:gid to 64045:64045
(ceph:ceph)
2016-05-27 15:06:04.527259 7f500f28a180  0 ceph version 10.2.0
(3a9fba20ec743699b69bd0181dd6c54dc01c64b9), process ceph-mds, pid 19163
2016-05-27 15:06:04.527569 7f500f28a180  0 pidfile_write: ignore empty
--pid-file
2016-05-27 15:06:04.637842 7f5008a04700  1 mds.b4 handle_mds_map standby



The relevant output from 'ceph -s' looks like this:

  fsmap e287: 0/1/1 up, 1 up:standby, 1 damaged


What am I missing?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rebuilding/recreating CephFS journal?

2016-05-27 Thread Stillwell, Bryan J
On 5/27/16, 11:27 AM, "Gregory Farnum"  wrote:

>On Fri, May 27, 2016 at 9:44 AM, Stillwell, Bryan J
> wrote:
>> I have a Ceph cluster at home that I've been running CephFS on for the
>> last few years.  Recently my MDS server became damaged and while
>> attempting to fix it I believe I've destroyed my CephFS journal based
>> off this:
>>
>> 2016-05-25 16:48:23.882095 7f8d2fac2700 -1 log_channel(cluster) log
>>[ERR]
>> : Error recovering journal 200: (2) No such file or directory
>>
>> As far as I can tell the data and metadata are still intact, so I'm
>> wondering if there's a way to rebuild the CephFS journal or, if that's
>> not possible, a way to start extracting the data?
>
>Check out http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
>
>You'll want to make sure you've actually lost the whole journal (how
>did you manage that?!?!), reset it, and quite possibly run the data
>scan tools. Be careful!

So I actually got into this mess by following that page and not being as
careful as I should have been.

I started off by trying to backup the journal, but it failed for this
reason:

# cephfs-journal-tool journal export backup.bin
2016-05-25 15:25:26.541767 7f2932ee5bc0 -1 Missing object 200.0197
2016-05-25 15:25:26.543896 7f2932ee5bc0 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`
Error ((5) Input/output error)



I took a look at http://tracker.ceph.com/issues/9902, but scanning that
page I didn't see a way to do an object-by-object dump.

Now if I attempt to export the journal I get:

# cephfs-journal-tool journal export backup.bin
Error ((5) Input/output error)2016-05-27 14:19:49.807482 7f06fa378bc0 -1
Header 200. is unreadable

2016-05-27 14:19:49.807491 7f06fa378bc0 -1 journal_export: Journal not
readable, attempt object-by-object dump with `rados`




I believe the 'Missing object 200.0197' error had something to do with
this problem that I was trying to deal with:

http://comments.gmane.org/gmane.comp.file-systems.ceph.user/29844



The missing object was probably caused by being a little too aggressive
with running mark_unfound_lost.


Anyways, I continued on with the disaster recovery steps without making a
backup first.  The next step identified the missing object again:

# cephfs-journal-tool event recover_dentries summary
2016-05-25 15:36:35.455989 7fa37b8b1bc0 -1 Missing object 200.0197
Events by type:
  OPEN: 12548
  SESSION: 24
  SUBTREEMAP: 29
  UPDATE: 12254
Errors: 0



I then tried truncating the journal:

# cephfs-journal-tool journal reset
old journal was 1666720764~48749572
new journal start will be 1719664640 (4194304 bytes past old end)
writing journal head
writing EResetJournal entry
done



Reset the session map:

# cephfs-table-tool all reset session
{
"0": {
"data": {},
"result": 0
}
}



And then because I was still having problems starting the MDS I ran:

# ceph fs reset cephfs --yes-i-really-mean-it


That's when I believe Header 200. went missing (I could be wrong,
I don't have good notes around this part).

So would the next steps be to run the following commands?:

cephfs-table-tool 0 reset session
cephfs-table-tool 0 reset snap
cephfs-table-tool 0 reset inode
cephfs-journal-tool --rank=0 journal reset
cephfs-data-scan init

cephfs-data-scan scan_extents data
cephfs-data-scan scan_inodes data



Thanks,
Bryan 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Rebuilding/recreating CephFS journal?

2016-05-27 Thread Stillwell, Bryan J
I have a Ceph cluster at home that I've been running CephFS on for the
last few years.  Recently my MDS server became damaged and while
attempting to fix it I believe I've destroyed my CephFS journal based off
this:

2016-05-25 16:48:23.882095 7f8d2fac2700 -1 log_channel(cluster) log [ERR]
: Error recovering journal 200: (2) No such file or directory

As far as I can tell the data and metadata are still intact, so I'm
wondering if there's a way to rebuild the CephFS journal or, if that's not
possible, a way to start extracting the data?

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg has invalid (post-split) stats; must scrub before tier agent can activate

2016-05-24 Thread Stillwell, Bryan J
On one of my test clusters that I've upgraded from Infernalis to Jewel
(10.2.1), I'm having a problem where reads are resulting in unfound
objects.

I'm using CephFS on top of an erasure-coded pool with cache tiering, which I
believe is related.

From what I can piece together, here is what the sequence of events looks
like:

Try doing an md5sum on all files in a directory:

$ date
Tue May 24 16:06:01 MDT 2016
$ md5sum *


Shortly afterward I see this in the logs:

2016-05-24 16:06:20.406701 mon.0 172.24.88.20:6789/0 222796 : cluster
[INF] osd.24 172.24.88.54:6814/26253 failed (2 reporters from different
host after 21.000162 >= grace 20.00)
2016-05-24 16:06:22.626169 mon.0 172.24.88.20:6789/0 222805 : cluster
[INF] osd.24 172.24.88.54:6813/21502 boot

2016-05-24 16:06:22.760512 mon.0 172.24.88.20:6789/0 222809 : cluster
[INF] osd.21 172.24.88.56:6828/26011 failed (2 reporters from different
host after 21.000314 >= grace 20.00)
2016-05-24 16:06:24.980100 osd.23 172.24.88.54:6803/15322 120 : cluster
[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:24.935090 osd.16 172.24.88.56:6824/25830 8 : cluster
[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.023810 osd.16 172.24.88.56:6824/25830 9 : cluster
[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.063645 osd.16 172.24.88.56:6824/25830 10 : cluster
[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.326786 osd.16 172.24.88.56:6824/25830 11 : cluster
[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.887230 osd.26 172.24.88.56:6808/10047 56 : cluster
[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:31.413173 osd.12 172.24.88.56:6820/3496 509 : cluster
[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:24.758508 osd.25 172.24.88.54:6801/25977 34 : cluster
[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.307177 osd.24 172.24.88.54:6813/21502 1 : cluster
[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.061032 osd.18 172.24.88.20:6806/23166 65 : cluster
[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:06:25.216812 osd.22 172.24.88.20:6816/32656 24 : cluster
[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:15.393004 mon.0 172.24.88.20:6789/0 222885 : cluster
[INF] osd.21 172.24.88.56:6800/27171 boot
2016-05-24 16:07:30.986037 osd.12 172.24.88.56:6820/3496 510 : cluster
[WRN] pg 4.a has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.606189 osd.24 172.24.88.54:6813/21502 2 : cluster
[WRN] pg 4.13 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.011658 osd.22 172.24.88.20:6816/32656 27 : cluster
[WRN] pg 4.12 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.744757 osd.18 172.24.88.20:6806/23166 66 : cluster
[WRN] pg 4.3 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.160872 osd.23 172.24.88.54:6803/15322 121 : cluster
[WRN] pg 4.3d has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.945012 osd.21 172.24.88.56:6800/27171 2 : cluster
[WRN] pg 4.11 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.974664 osd.21 172.24.88.56:6800/27171 3 : cluster
[WRN] pg 4.21 has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.978548 osd.21 172.24.88.56:6800/27171 4 : cluster
[WRN] pg 4.2e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:32.394111 osd.21 172.24.88.56:6800/27171 5 : cluster
[WRN] pg 4.f has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:29.828650 osd.16 172.24.88.56:6824/25830 12 : cluster
[WRN] pg 4.3e has invalid (post-split) stats; must scrub before tier agent
can activate
2016-05-24 16:07:30.024493 osd.16 172.24.88.56:6824/25830 13 : cluster
[WRN] pg 4.15 has invalid (post-split) stats; must scrub before tier agent
can activate




Then I see the following in ceph health detail:

pg 4.2e is active+degraded, acting [21,16,13], 1 unfound
pg 4.13 is active+degraded, acting [24,17,21], 1 unfound
pg 3.6 is active+degraded, acting [18,14,21], 1 unfound



The relevant logs on osd.24 appear to be:

 0> 2016-05-24 16:06:01.365111 7fa1753e1700 -1 osd/ReplicatedPG.cc: In
function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::OpContextUPtr&,
unsigned int)' thread 7fa1753e170

Re: [ceph-users] Using bluestore in Jewel 10.0.4

2016-03-14 Thread Stillwell, Bryan
Mark,

Since most of us already have existing clusters that use SSDs for
journals, has there been any testing of converting that hardware over to
using BlueStore and re-purposing the SSDs as a block cache (like using
bcache)?

To me this seems like it would be a good combination for a typical RBD
cluster.
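
The setup I have in mind is roughly the following per OSD -- untested on my
side, with placeholder device names, so just a sketch of the idea:

make-bcache -C /dev/nvme0n1p1 -B /dev/sdb    # old journal SSD partition as cache, HDD as backing
ceph-disk prepare --bluestore /dev/bcache0   # build the BlueStore OSD on the cached device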

Thanks,
Bryan

On 3/14/16, 10:52 AM, "ceph-users on behalf of Mark Nelson"
 wrote:

>Hi Folks,
>
>We are actually in the middle of doing some bluestore testing/tuning for
>the upstream jewel release as we speak. :)  These are (so far) pure HDD
>tests using 4 nodes with 4 spinning disks and no SSDs.
>
>Basically on the write side it's looking fantastic and that's an area we
>really wanted to improve so that's great.  On the read side, we are
>working on getting sequential read performance up for certain IO sizes.
>  We are more dependent on client-side readahead with bluestore since
>there is no underlying filesystem below the OSDs helping us out. This
>usually isn't a problem in practice since there should be readahead on
>the VM, but when testing with fio using the RBD engine you should
>probably enable client side RBD readahead:
>
>rbd readahead disable after bytes = 0
>rbd readahead max bytes = 4194304
>
>Again, this probably only matters when directly using librbd.
>
>The other question is using default buffered reads in bluestore, ie
>setting:
>
>"bluestore default buffered read = true"
>
>That's what we are working on testing now.  I've included the ceph.conf
>used for these tests and also a link for some of our recent results.
>Please download it and open it up in libreoffice as google's preview
>isn't showing the graphs.
>
>Here's how the legend is setup:
>
>Hammer-FS: Hammer + Filestore
>6dba7fd-BS (No RBD RA): Master + Fixes + Bluestore
>6dba7fd-BS: (4M RBD RA): Master + Fixes + Bluestore + 4M RBD Read Ahead
>c1e41afb-FS: Master + Filestore + new journal throttling + Sam's tuning
>
>https://drive.google.com/file/d/0B2gTBZrkrnpZMl9OZ18yS3NuZEU/view?usp=shar
>ing
>
>Mark
>
>On 03/14/2016 11:04 AM, Kenneth Waegeman wrote:
>> Hi Stefan,
>>
>> We are also interested in the bluestore, but did not yet look into it.
>>
>> We tried keyvaluestore before and that could be enabled by setting the
>> osd objectstore value.
>> And in this ticket http://tracker.ceph.com/issues/13942 I see:
>>
>> [global]
>>  enable experimental unrecoverable data corrupting features = *
>>  bluestore fsck on mount = true
>>  bluestore block db size = 67108864
>>  bluestore block wal size = 134217728
>>  bluestore block size = 5368709120
>>  osd objectstore = bluestore
>>
>> So I guess this could work for bluestore too.
>>
>> Very curious to hear what you see stability and performance wise :)
>>
>> Cheers,
>> Kenneth
>>
>> On 14/03/16 16:03, Stefan Lissmats wrote:
>>> Hello everyone!
>>>
>>> I think that the new bluestore sounds great and would like to try it
>>> out in my test environment but I didn't find anything how to use it
>>> but I finally managed to test it and it really looks promising
>>> performancewise.
>>> If anyone has more information or guides for bluestore please tell me
>>> where.
>>>
>>> I thought I would share how I managed to get a new Jewel cluster with
>>> bluestore based osd:s to work.
>>>
>>>
>>> What i found so far is that ceph-disk can create new bluestore osd:s
>>> (but not ceph-deploy, plase correct me if i'm wrong) and I need to
>>> have "enable experimental unrecoverable data corrupting features =
>>> bluestore rocksdb" in global section in ceph.conf.
>>> After that I can create new osd:s with ceph-disk prepare --bluestore
>>> /dev/sdg
>>>
>>> So i created a cluster with ceph-deploy without any osd:s and then
>>> used ceph-disk on hosts to create the osd:s.
>>>
>>> Pretty simple in the end but it took me a while to figure that out.
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>





Re: [ceph-users] osd not removed from crush map after ceph osd crush remove

2016-02-25 Thread Stillwell, Bryan
Have you tried restarting each OSD one-by-one to see if that clears up the
problem?

Also, what does the output of this command look like:

ceph osd dump | grep 'replicated size'


As for whether or not 'ceph pg repair' will work, I doubt it.  It uses the
copy on the primary OSD to fix the other OSDs in the PG.  From the
information you've provided, it looks to me like the PGs that are on osd.4
only have one copy.  This seems like it would make 'ceph pg repair' fail
to do anything, since the only copy is on an OSD that's out of the cluster.

Bryan

On 2/24/16, 2:30 AM, "Dimitar Boichev" 
wrote:

>I think this happened because of the wrongly removed OSD...
>A bug maybe ?
>
>Do you think that "ceph pg repair" will force the remove of the PG from
>the missing osd ?
>I am concerned about executing "pg repair" or "osd lost" because maybe it
>will decide that the stuck one is the right data and try to do stuff with
>it and discard the active running copy ..
>
>
>Regards.
>
>Dimitar Boichev
>SysAdmin Team Lead
>AXSMarine Sofia
>Phone: +359 889 22 55 42
>Skype: dimitar.boichev.axsmarine
>E-mail: dimitar.boic...@axsmarine.com
>
>
>-Original Message-
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>Stillwell, Bryan
>Sent: Tuesday, February 23, 2016 7:31 PM
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>Dimitar,
>
>I would agree with you that getting the cluster into a healthy state
>first is probably the better idea.  Based on your pg query, it appears
>like you're using only 1 replica.  Any ideas why that would be?
>
>The output should look like this (with 3 replicas):
>
>osdmap e133481 pg 11.1b8 (11.1b8) -> up [13,58,37] acting [13,58,37]
>
>Bryan
>
>From:  Dimitar Boichev 
>Date:  Tuesday, February 23, 2016 at 1:08 AM
>To:  CTG User , "ceph-users@lists.ceph.com"
>
>Subject:  RE: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>
>>Hello,
>>Thank you Bryan.
>>
>>I was just trying to upgrade to hammer or upper but before that I was
>>wanting to get the cluster in Healthy state.
>>Do you think it is safe to upgrade now first to latest firefly then to
>>Hammer ?
>>
>>
>>Regards.
>>
>>Dimitar Boichev
>>SysAdmin Team Lead
>>AXSMarine Sofia
>>Phone: +359 889 22 55 42
>>Skype: dimitar.boichev.axsmarine
>>E-mail:
>>dimitar.boic...@axsmarine.com
>>
>>
>>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>>On Behalf Of Stillwell, Bryan
>>Sent: Tuesday, February 23, 2016 1:51 AM
>>To: ceph-users@lists.ceph.com
>>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>>crush remove
>>
>>
>>
>>Dimitar,
>>
>>
>>
>>I'm not sure why those PGs would be stuck in the stale+active+clean
>>state.  Maybe try upgrading to the 0.80.11 release to see if it's a bug
>>that was fixed already?  You can use the 'ceph tell osd.*  version'
>>command after the upgrade to make sure all OSDs are running the new
>>version.  Also since firefly (0.80.x) is near its EOL, you should
>>consider upgrading to hammer (0.94.x).
>>
>>
>>
>>As for why osd.4 didn't get fully removed, the last command you ran
>>isn't correct.  It should be 'ceph osd rm 4'.  Trying to remember when
>>to use the CRUSH name (osd.4) versus the OSD number (4)  can be a pain.
>>
>>
>>
>>Bryan
>>
>>
>>
>>From: ceph-users  on behalf of
>>Dimitar Boichev 
>>Date: Monday, February 22, 2016 at 1:10 AM
>>To: Dimitar Boichev ,
>>"ceph-users@lists.ceph.com" 
>>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>>crush remove
>>
>>
>>
>>>Anyone ?
>>>
>>>Regards.
>>>
>>>
>>>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>>>On Behalf Of Dimitar Boichev
>>>Sent: Thursday, February 18, 2016 5:06 PM
>>>To: ceph-users@lists.ceph.com
>>>Subject: [ceph-users] osd not removed from crush map after ceph osd
>>>crush remove
>>>
>>>
>>>
>>>Hello,
>>>I am running a tiny cluster of 2 nodes.
>>>ceph -v
>>>ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>>
>>>One osd died and I added a new osd (not replacing the old one).
>>>After that I wanted to remove the failed osd completely fro

Re: [ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Stillwell, Bryan
It's good to hear that I'm not the only one affected by this!  After the
node was brought back into the cluster (I weighted it out for hardware
repairs) it appears to have removed some of the old maps, as I'm down to
8,000 now.  Although I did find another OSD in the cluster which has
95,000 osd maps (46GB)!  That's substantial considering each OSD in our
cluster is only 1.2TB.

Bryan

From:  ceph-users  on behalf of Tom
Christensen 
Date:  Thursday, February 25, 2016 at 10:36 AM
To:  "ceph-users@lists.ceph.com" 
Subject:  Re: [ceph-users] Over 13,000 osdmaps in current/meta


>We've seen this as well as early as 0.94.3 and have a bug,
>http://tracker.ceph.com/issues/13990 which we're working through
>currently.  Nothing fixed yet, still trying to nail down exactly why the
>osd maps aren't being trimmed as they should.
>
>
>
>On Thu, Feb 25, 2016 at 10:16 AM, Stillwell, Bryan
> wrote:
>
>After evacuating all the PGs from a node in hammer 0.94.5, I noticed that
>each of the OSDs was still using ~8GB of storage.  After investigating, it
>appears that all the data is coming from around 13,000 files in
>/var/lib/ceph/osd/ceph-*/current/meta/ with names like:
>
>DIR_4/DIR_0/DIR_0/osdmap.303231__0_C23E4004__none
>DIR_4/DIR_2/DIR_F/osdmap.314431__0_C24ADF24__none
>DIR_4/DIR_0/DIR_A/osdmap.312688__0_C2510A04__none
>
>They're all around 500KB in size.  I'm guessing these are all old OSD
>maps, but I'm wondering why there are so many of them?
>
>Thanks,
>Bryan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Over 13,000 osdmaps in current/meta

2016-02-25 Thread Stillwell, Bryan
After evacuating all the PGs from a node in hammer 0.94.5, I noticed that
each of the OSDs was still using ~8GB of storage.  After investigating, it
appears that all the data is coming from around 13,000 files in
/var/lib/ceph/osd/ceph-*/current/meta/ with names like:

DIR_4/DIR_0/DIR_0/osdmap.303231__0_C23E4004__none
DIR_4/DIR_2/DIR_F/osdmap.314431__0_C24ADF24__none
DIR_4/DIR_0/DIR_A/osdmap.312688__0_C2510A04__none

They're all around 500KB in size.  I'm guessing these are all old OSD
maps, but I'm wondering why there are so many of them?
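
For anyone wanting to compare numbers, a quick way to count and size these
files (paths per OSD will differ):

find /var/lib/ceph/osd/ceph-*/current/meta -name 'osdmap*' | wc -l
du -sh /var/lib/ceph/osd/ceph-*/current/meta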

Thanks,
Bryan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd not removed from crush map after ceph osd crush remove

2016-02-23 Thread Stillwell, Bryan
Dimitar,

I would agree with you that getting the cluster into a healthy state first
is probably the better idea.  Based on your pg query, it appears like
you're using only 1 replica.  Any ideas why that would be?

The output should look like this (with 3 replicas):

osdmap e133481 pg 11.1b8 (11.1b8) -> up [13,58,37] acting [13,58,37]

Bryan

From:  Dimitar Boichev 
Date:  Tuesday, February 23, 2016 at 1:08 AM
To:  CTG User , "ceph-users@lists.ceph.com"

Subject:  RE: [ceph-users] osd not removed from crush map after ceph osd
crush remove


>Hello,
>Thank you Bryan.
>
>I was just trying to upgrade to hammer or upper but before that I was
>wanting to get the cluster in Healthy state.
>Do you think it is safe to upgrade now first to latest firefly then to
>Hammer ?
>
>
>Regards.
>
>Dimitar Boichev
>SysAdmin Team Lead
>AXSMarine Sofia
>Phone: +359 889 22 55 42
>Skype: dimitar.boichev.axsmarine
>E-mail:
>dimitar.boic...@axsmarine.com
>
>
>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>On Behalf Of Stillwell, Bryan
>Sent: Tuesday, February 23, 2016 1:51 AM
>To: ceph-users@lists.ceph.com
>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>
>
>Dimitar,
>
>
>
>I'm not sure why those PGs would be stuck in the stale+active+clean
>state.  Maybe try upgrading to the 0.80.11 release to see if it's a bug
>that was fixed already?  You can use the 'ceph tell osd.*
> version' command after the upgrade to make sure all OSDs are running the
>new version.  Also since firefly (0.80.x) is near its EOL, you should
>consider upgrading to hammer (0.94.x).
>
>
>
>As for why osd.4 didn't get fully removed, the last command you ran isn't
>correct.  It should be 'ceph osd rm 4'.  Trying to remember when to use
>the CRUSH name (osd.4) versus the OSD number (4)
> can be a pain.
>
>
>
>Bryan
>
>
>
>From: ceph-users  on behalf of Dimitar
>Boichev 
>Date: Monday, February 22, 2016 at 1:10 AM
>To: Dimitar Boichev ,
>"ceph-users@lists.ceph.com" 
>Subject: Re: [ceph-users] osd not removed from crush map after ceph osd
>crush remove
>
>
>
>>Anyone ?
>>
>>Regards.
>>
>>
>>From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
>>On Behalf Of Dimitar Boichev
>>Sent: Thursday, February 18, 2016 5:06 PM
>>To: ceph-users@lists.ceph.com
>>Subject: [ceph-users] osd not removed from crush map after ceph osd
>>crush remove
>>
>>
>>
>>Hello,
>>I am running a tiny cluster of 2 nodes.
>>ceph -v
>>ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
>>
>>One osd died and I added a new osd (not replacing the old one).
>>After that I wanted to remove the failed osd completely from the cluster.
>>Here is what I did:
>>ceph osd reweight osd.4 0.0
>>ceph osd crush reweight osd.4 0.0
>>ceph osd out osd.4
>>ceph osd crush remove osd.4
>>ceph auth del osd.4
>>ceph osd rm osd.4
>>
>>
>>But after the rebalancing I ended up with 155 PGs in stale+active+clean
>>state.
>>
>>@storage1:/tmp# ceph -s
>>cluster 7a9120b9-df42-4308-b7b1-e1f3d0f1e7b3
>> health HEALTH_WARN 155 pgs stale; 155 pgs stuck stale; 1 requests
>>are blocked > 32 sec; nodeep-scrub flag(s) set
>> monmap e1: 1 mons at {storage1=192.168.10.3:6789/0}, election epoch
>>1, quorum 0 storage1
>> osdmap e1064: 6 osds: 6 up, 6 in
>>flags nodeep-scrub
>>  pgmap v26760322: 712 pgs, 8 pools, 532 GB data, 155 kobjects
>>1209 GB used, 14210 GB / 15419 GB avail
>> 155 stale+active+clean
>> 557 active+clean
>>  client io 91925 B/s wr, 5 op/s
>>
>>I know about the 1 monitor problem I just want to fix the cluster to
>>healthy state then I will add the third storage node and go up to 3
>>monitors.
>>
>>The problem is as follows:
>>@storage1:/tmp# ceph pg map 2.3a
>>osdmap e1064 pg 2.3a (2.3a) -> up [6] acting [6]
>>@storage1:/tmp# ceph pg 2.3a query
>>Error ENOENT: i don't have pgid 2.3a
>>
>>
>>@storage1:/tmp# ceph health detail
>>HEALTH_WARN 155 pgs stale; 155 pgs stuck stale; 1 requests are blocked >
>>32 sec; 1 osds have slow requests; nodeep-scrub flag(s) set
>>pg 7.2a is stuck stale for 8887559.656879, current state
>>stale+active+clean, last acting [4]
>>pg 5.28 is stuck stale for 8887559.656886, current state
>>stale+active+clean, last acting [4]
>>pg 7.2b is stuck stale for 8887559.656889, current state
>>sta

Re: [ceph-users] osd not removed from crush map after ceph osd crush remove

2016-02-22 Thread Stillwell, Bryan
Dimitar,

I'm not sure why those PGs would be stuck in the stale+active+clean state.  
Maybe try upgrading to the 0.80.11 release to see if it's a bug that was fixed 
already?  You can use the 'ceph tell osd.* version' command after the upgrade 
to make sure all OSDs are running the new version.  Also since firefly (0.80.x) 
is near its EOL, you should consider upgrading to hammer (0.94.x).

As for why osd.4 didn't get fully removed, the last command you ran isn't 
correct.  It should be 'ceph osd rm 4'.  Trying to remember when to use the 
CRUSH name (osd.4) versus the OSD number (4) can be a pain.
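
For reference, the full sequence I'd use to remove an OSD looks something like
this (substitute the right id, and stop the ceph-osd daemon for that id before
the crush/auth/rm steps):

ceph osd out 4
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm 4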

Bryan

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Dimitar Boichev <dimitar.boic...@axsmarine.com>
Date: Monday, February 22, 2016 at 1:10 AM
To: Dimitar Boichev <dimitar.boic...@axsmarine.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] osd not removed from crush map after ceph osd crush remove

Anyone ?

Regards.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Dimitar Boichev
Sent: Thursday, February 18, 2016 5:06 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] osd not removed from crush map after ceph osd crush remove

Hello,
I am running a tiny cluster of 2 nodes.
ceph -v
ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)

One osd died and I added a new osd (not replacing the old one).
After that I wanted to remove the failed osd completely from the cluster.
Here is what I did:
ceph osd reweight osd.4 0.0
ceph osd crush reweight osd.4 0.0
ceph osd out osd.4
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm osd.4


But after the rebalancing I ended up with 155 PGs in stale+active+clean  state.

@storage1:/tmp# ceph -s
cluster 7a9120b9-df42-4308-b7b1-e1f3d0f1e7b3
 health HEALTH_WARN 155 pgs stale; 155 pgs stuck stale; 1 requests are 
blocked > 32 sec; nodeep-scrub flag(s) set
 monmap e1: 1 mons at {storage1=192.168.10.3:6789/0}, election epoch 1, 
quorum 0 storage1
 osdmap e1064: 6 osds: 6 up, 6 in
flags nodeep-scrub
  pgmap v26760322: 712 pgs, 8 pools, 532 GB data, 155 kobjects
1209 GB used, 14210 GB / 15419 GB avail
 155 stale+active+clean
 557 active+clean
  client io 91925 B/s wr, 5 op/s

I know about the 1 monitor problem I just want to fix the cluster to healthy 
state then I will add the third storage node and go up to 3 monitors.

The problem is as follows:
@storage1:/tmp# ceph pg map 2.3a
osdmap e1064 pg 2.3a (2.3a) -> up [6] acting [6]
@storage1:/tmp# ceph pg 2.3a query
Error ENOENT: i don't have pgid 2.3a


@storage1:/tmp# ceph health detail
HEALTH_WARN 155 pgs stale; 155 pgs stuck stale; 1 requests are blocked > 32 
sec; 1 osds have slow requests; nodeep-scrub flag(s) set
pg 7.2a is stuck stale for 8887559.656879, current state stale+active+clean, 
last acting [4]
pg 5.28 is stuck stale for 8887559.656886, current state stale+active+clean, 
last acting [4]
pg 7.2b is stuck stale for 8887559.656889, current state stale+active+clean, 
last acting [4]
pg 7.2c is stuck stale for 8887559.656892, current state stale+active+clean, 
last acting [4]
pg 0.2b is stuck stale for 8887559.656893, current state stale+active+clean, 
last acting [4]
pg 6.2c is stuck stale for 8887559.656894, current state stale+active+clean, 
last acting [4]
pg 6.2f is stuck stale for 8887559.656893, current state stale+active+clean, 
last acting [4]
pg 2.2b is stuck stale for 8887559.656896, current state stale+active+clean, 
last acting [4]
pg 2.25 is stuck stale for 8887559.656896, current state stale+active+clean, 
last acting [4]
pg 6.20 is stuck stale for 8887559.656898, current state stale+active+clean, 
last acting [4]
pg 5.21 is stuck stale for 8887559.656898, current state stale+active+clean, 
last acting [4]
pg 0.24 is stuck stale for 8887559.656904, current state stale+active+clean, 
last acting [4]
pg 2.21 is stuck stale for 8887559.656904, current state stale+active+clean, 
last acting [4]
pg 5.27 is stuck stale for 8887559.656906, current state stale+active+clean, 
last acting [4]
pg 2.23 is stuck stale for 8887559.656908, current state stale+active+clean, 
last acting [4]
pg 6.26 is stuck stale for 8887559.656909, current state stale+active+clean, 
last acting [4]
pg 7.27 is stuck stale for 8887559.656913, current state stale+active+clean, 
last acting [4]
pg 7.18 is stuck stale for 8887559.656914, current state stale+active+clean, 
last acting [4]
pg 0.1e is stuck stale for 8887559.656914, current state stale+active+clean, 
last acting [4]
pg 6.18 is stuck stale for 8887559.656919, current state stale+active+clean, 
last acting [4]
pg 2.1f is stuck stale for 8887559.656919, current state stale+active+clean, 
last acting [4]
pg 7.1b is stuck stale for 8887559.656922, current state stale+active+clean, 
last acting [4]
pg 0.1

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-18 Thread Stillwell, Bryan
When I've run into this situation I look for PGs that are on the full
drives, but are in an active+clean state in the cluster.  That way I can
safely remove the PGs from the full drives and not have to risk data loss.

It usually doesn't take much before you can restart the OSDs and let ceph
take care of the rest.
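
A rough way to spot candidates, assuming osd.5 is one of the full ones -- the
grep only catches the osd id in the first position of the up/acting sets, so
it's just a starting point, and each PG should be double-checked with
'ceph pg map' before removing anything:

ceph pg dump pgs_brief | grep active+clean | grep '\[5,'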

Bryan

From:  ceph-users  on behalf of Lukáš
Kubín 
Date:  Thursday, February 18, 2016 at 2:39 PM
To:  "ceph-users@lists.ceph.com" 
Subject:  Re: [ceph-users] How to recover from OSDs full in small cluster


>Hi,
>we've managed to release some space from our cluster. Now I would like to
>restart those 2 full OSDs. As they're completely full I probably need to
>delete some data from them.
>
>I would like to ask: Is it OK to delete all pg directories (eg. all
>subdirectories in /var/lib/ceph/osd/ceph-5/current/) and start the
>stopped OSD daemon then? This process seems most simple I'm just not sure
>if it is correct - if ceph can handle such
> situation. (I've noticed similar advice here:
>http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd
>/ )
>
>Another option as suggested by Jan is to remove OSD from cluster, and
>recreate them back. That presents more steps though and perhaps some more
>safety prerequirements (nobackfill?) to prevent more block
>movements/disks full while removing/readding.
>
>Thanks!
>
>Lukas
>
>
>Current status:
>
>[root@ceph1 ~]# ceph osd stat
> osdmap e1107: 12 osds: 10 up, 10 in; 29 remapped pgs
>
>[root@ceph1 ~]# ceph pg stat
>
>v21691144: 640 pgs: 503 active+clean, 29 active+remapped, 108
>active+undersized+degraded; 1892 GB data, 3476 GB used, 1780 GB / 5256 GB
>avail; 0 B/s rd, 323 kB/s wr, 49 op/s; 42998/504482 objects degraded
>(8.523%); 10304/504482 objects misplaced (2.042%)
>[root@ceph1 ~]# df -h|grep osd
>/dev/sdg1554G  383G  172G  70% /var/lib/ceph/osd/ceph-3
>/dev/sdf1554G  401G  154G  73% /var/lib/ceph/osd/ceph-2
>/dev/sde1554G  381G  174G  69% /var/lib/ceph/osd/ceph-0
>/dev/sdb1275G  275G   20K 100% /var/lib/ceph/osd/ceph-5
>/dev/sdd1554G  554G   20K 100% /var/lib/ceph/osd/ceph-4
>/dev/sdc1554G  359G  196G  65% /var/lib/ceph/osd/ceph-1
>
>[root@ceph1 ~]# ceph osd tree
>ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>-1 5.93991 root default
>-2 2.96996 host ceph1
> 0 0.53999 osd.0   up  1.0  1.0
> 1 0.53999 osd.1   up  1.0  1.0
> 2 0.53999 osd.2   up  1.0  1.0
> 3 0.53999 osd.3   up  1.0  1.0
> 4 0.53999 osd.4 down0  1.0
> 5 0.26999 osd.5 down0  1.0
>-3 2.96996 host ceph2
> 6 0.53999 osd.6   up  1.0  1.0
> 7 0.53999 osd.7   up  1.0  1.0
> 8 0.53999 osd.8   up  1.0  1.0
> 9 0.53999 osd.9   up  1.0  1.0
>10 0.53999 osd.10  up  1.0  1.0
>11 0.26999 osd.11  up  1.0  1.0
>
>
>
>On Wed, Feb 17, 2016 at 9:43 PM Lukáš Kubín  wrote:
>
>
>Hi,
>I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
>pools, each of size=2. Today, one of our OSDs got full, another 2 near
>full. Cluster turned into ERR state. I have noticed uneven space
>distribution among OSD drives between 70 and
> 100 perce. I have realized there's a low amount of pgs in those 2 pools
>(128 each) and increased one of them to 512, expecting a magic to happen
>and redistribute the space evenly.
>
>Well, something happened - another OSD became full during the
>redistribution and cluster stopped both OSDs and marked them down. After
>some hours the remaining drives partially rebalanced and cluster get to
>WARN state.
>
>I've deleted 3 placement group directories from one of the full OSD's
>filesystem which allowed me to start it up again. Soon, however this
>drive became full again.
>
>So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
>drives to add.
>
>Is there a way how to get out of this situation without adding OSDs? I
>will attempt to release some space, just waiting for colleague to
>identify RBD volumes (openstack images and volumes) which can be deleted.
>
>Thank you.
>
>Lukas
>
>
>This is my cluster state now:
>
>[root@compute1 ~]# ceph -w
>cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
> health HEALTH_WARN
>10 pgs backfill_toofull
>114 pgs degraded
>114 pgs stuck degraded
>147 pgs stuck unclean
>114 pgs stuck undersized
>114 pgs undersized
>1 requests are blocked > 32 sec
>recovery 56923/640724 objects degraded (8.884%)
>recovery 29122/640724 objects misplaced (4.545%)
>3 near full osd(s)
> monmap e3: 3 mons at
>{compute1=10.255.242.14:6789/0,compute2=1

Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Stillwell, Bryan
Vlad,

First off your cluster is rather full (80.31%).  Hopefully you have
hardware ordered for an expansion in the near future.

Based on your 'ceph osd tree' output, it doesn't look like the
reweight-by-utilization did anything for you.  That last number for each
OSD is set to 1, which means it didn't reweight any of the OSDs.  This is
a different weight than the CRUSH weight, and something you can manually
modify as well.

For example you could manually tweak the weights of the fullest OSDs with:

ceph osd reweight osd.23 0.95
ceph osd reweight osd.7 0.95
ceph osd reweight osd.8 0.95

Then just keep tweaking those numbers until the cluster gets an even
distribution of PGs across the OSDs.  The reweight-by-utilization option
can help make this quicker.

Your volumes pool also doesn't have a power of two for pg_num, so your PGs
will have uneven sizes.  Since you can't go back down to 256 PGs, you
should look at gradually increasing them up to 512 PGs.
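
The increase itself would look something like this, done in steps and letting
the cluster settle in between (pgp_num needs to follow pg_num each time):

ceph osd pool set volumes pg_num 384
ceph osd pool set volumes pgp_num 384
# ...wait for backfilling to finish, then repeat with 512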

There are also inconsistent PGs that you should look at repairing.  It
won't help you with the data distribution, but it's good cluster
maintenance.

Bryan

From:  ceph-users  on behalf of Vlad
Blando 
Date:  Wednesday, February 17, 2016 at 5:11 PM
To:  ceph-users 
Subject:  [ceph-users] How to properly deal with NEAR FULL OSD


>Hi, this has been bugging me for some time now: the distribution of data on
>the OSDs is not balanced, so some OSDs are near full.  I did ceph
>osd reweight-by-utilization but it's not helping much.
>
>
>[root@controller-node ~]# ceph osd tree
># idweight  type name   up/down reweight
>-1  98.28   root default
>-2  32.76   host ceph-node-1
>0   3.64osd.0   up  1
>1   3.64osd.1   up  1
>2   3.64osd.2   up  1
>3   3.64osd.3   up  1
>4   3.64osd.4   up  1
>5   3.64osd.5   up  1
>6   3.64osd.6   up  1
>7   3.64osd.7   up  1
>8   3.64osd.8   up  1
>-3  32.76   host ceph-node-2
>9   3.64osd.9   up  1
>10  3.64osd.10  up  1
>11  3.64osd.11  up  1
>12  3.64osd.12  up  1
>13  3.64osd.13  up  1
>14  3.64osd.14  up  1
>15  3.64osd.15  up  1
>16  3.64osd.16  up  1
>17  3.64osd.17  up  1
>-4  32.76   host ceph-node-3
>18  3.64osd.18  up  1
>19  3.64osd.19  up  1
>20  3.64osd.20  up  1
>21  3.64osd.21  up  1
>22  3.64osd.22  up  1
>23  3.64osd.23  up  1
>24  3.64osd.24  up  1
>25  3.64osd.25  up  1
>26  3.64osd.26  up  1
>[root@controller-node ~]#
>
>
>[root@controller-node ~]# /opt/df-osd.sh
>ceph-node-1
>===
>/dev/sdb1  3.7T  2.0T  1.7T  54% /var/lib/ceph/osd/ceph-0
>/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-1
>/dev/sdd1  3.7T  3.3T  431G  89% /var/lib/ceph/osd/ceph-2
>/dev/sde1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-3
>/dev/sdf1  3.7T  3.3T  379G  90% /var/lib/ceph/osd/ceph-4
>/dev/sdg1  3.7T  2.9T  762G  80% /var/lib/ceph/osd/ceph-5
>/dev/sdh1  3.7T  3.0T  733G  81% /var/lib/ceph/osd/ceph-6
>/dev/sdi1  3.7T  3.4T  284G  93% /var/lib/ceph/osd/ceph-7
>/dev/sdj1  3.7T  3.4T  342G  91% /var/lib/ceph/osd/ceph-8
>===
>ceph-node-2
>===
>/dev/sdb1  3.7T  3.1T  622G  84% /var/lib/ceph/osd/ceph-9
>/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-10
>/dev/sdd1  3.7T  3.1T  557G  86% /var/lib/ceph/osd/ceph-11
>/dev/sde1  3.7T  3.3T  392G  90% /var/lib/ceph/osd/ceph-12
>/dev/sdf1  3.7T  2.6T  1.1T  72% /var/lib/ceph/osd/ceph-13
>/dev/sdg1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-14
>/dev/sdh1  3.7T  2.7T  984G  74% /var/lib/ceph/osd/ceph-15
>/dev/sdi1  3.7T  3.2T  463G  88% /var/lib/ceph/osd/ceph-16
>/dev/sdj1  3.7T  3.1T  594G  85% /var/lib/ceph/osd/ceph-17
>===
>ceph-node-3
>===
>/dev/sdb1  3.7T  2.8T  910G  76% /var/lib/ceph/osd/ceph-18
>/dev/sdc1  3.7T  2.7T 1012G  73% /

Re: [ceph-users] pg repair behavior? (Was: Re: getting rid of misplaced objects)

2016-02-16 Thread Stillwell, Bryan
Zoltan,

It's good to hear that you were able to get the PGs stuck in 'remapped'
back into a 'clean' state.  Based on your response I'm guessing that your
failure domains (node, rack, or maybe row) are too close (or equal) to
your replica size.

For example if your cluster looks like this:

3 replicas
3 racks (CRUSH set to use racks as the failure domain)
  rack 1: 3 nodes
  rack 2: 5 nodes
  rack 3: 4 nodes

Then CRUSH will sometimes have problems making sure each rack has one of
the copies (especially if you are doing reweights on OSDs in the first
rack).  Does that come close to describing your cluster?
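
A quick way to check is to compare the rule's failure domain with the replica
count and the number of buckets of that type, e.g.:

ceph osd crush rule dump                 # look at the type used in the chooseleaf step
ceph osd dump | grep 'replicated size'
ceph osd tree                            # count the racks/hosts at that level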


I believe you're right about how 'ceph pg repair' works.  I've run into this
before, and one way I went about fixing it was to run md5sum on all the
objects in the PG for each OSD and compare the results.  My thinking was
that I could track down the inconsistent objects by finding ones where
only 2 of the 3 md5s match.

ceph-01:
  cd /var/lib/ceph/osd/ceph-14/current/3.1b0_head
  find . -type f -exec md5sum '{}' \; | sort -k2 >/tmp/pg_3.1b0-osd.14-md5s.txt
ceph-02:
  cd /var/lib/ceph/osd/ceph-47/current/3.1b0_head
  find . -type f -exec md5sum '{}' \; | sort -k2 >/tmp/pg_3.1b0-osd.47-md5s.txt
ceph-04:
  cd /var/lib/ceph/osd/ceph-29/current/3.1b0_head
  find . -type f -exec md5sum '{}' \; | sort -k2 >/tmp/pg_3.1b0-osd.29-md5s.txt

Then, using vimdiff to do a 3-way diff, I was able to find the objects which
were different between the OSDs.  Based on that, I was able to
determine whether the repair would cause a problem.
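
The 3-way diff itself is then something like:

vimdiff /tmp/pg_3.1b0-osd.14-md5s.txt /tmp/pg_3.1b0-osd.47-md5s.txt /tmp/pg_3.1b0-osd.29-md5s.txt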


I believe if you use btrfs instead of xfs for your filestore backend
you'll get proper checksumming, but I don't know if Ceph utilizes that
information yet.  Plus I've heard btrfs slows down quite a bit over time
when used as an OSD.

As for Jewel I think the new bluestore backend includes checksums, but
someone that's actually using it would have to confirm.  Switching to
bluestore will involve a lot of rebuilding too.

Bryan

On 2/15/16, 8:36 AM, "Zoltan Arnold Nagy" 
wrote:

>Hi Bryan,
>
>You were right: we've modified our PG weights a little (from 1 to around
>0.85 on some OSDs) and once I've changed them back to 1, the remapped PGs
>and misplaced objects were gone.
>So thank you for the tip.
>
>For the inconsistent ones and scrub errors, I'm a little wary to use pg
>repair as that - if I understand correctly - only copies the primary PG's
>data to the other PGs thus can easily corrupt the whole object if the
>primary is corrupted.
>
>I haven't seen an update on this since last May where this was brought up
>as a concern from several people and there were mentions of adding
>checksumming to the metadata and doing a checksum-comparison on repair.
>
>Can anybody update on the current status on how exactly pg repair works in
>Hammer or will work in Jewel?




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] getting rid of misplaced objects

2016-02-11 Thread Stillwell, Bryan
What does 'ceph osd tree' look like for this cluster?  Also have you done
anything special to your CRUSH rules?

I've usually found this to be caused by modifying OSD weights a little too
much.

As for the inconsistent PG, you should be able to run 'ceph pg repair' on
it:

http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent
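
In short: find the inconsistent PG and repair it (the pg id is a placeholder):

ceph health detail | grep inconsistent
ceph pg repair <pgid>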


Bryan

On 2/11/16, 11:21 AM, "ceph-users on behalf of Zoltan Arnold Nagy"

wrote:

>Hi,
>
>Are there any tips and tricks around getting rid of misplaced objects? I
>did check the archive but didn't find anything.
>
>Right now my cluster looks like this:
>
>  pgmap v43288593: 16384 pgs, 4 pools, 45439 GB data, 10383 kobjects
>109 TB used, 349 TB / 458 TB avail
>330/25160461 objects degraded (0.001%)
>31280/25160461 objects misplaced (0.124%)
>   16343 active+clean
>  40 active+remapped
>   1 active+clean+inconsistent
>
>This is how it has been for a while and I thought for sure that the
>misplaced would converge down to 0, but nevertheless, it didn¹t.
>
>Any pointers on how I could get it back to all active+clean?
>
>Cheers,
>Zoltan
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unified queue in Infernalis

2016-02-05 Thread Stillwell, Bryan
I saw the following in the release notes for Infernalis, and I'm wondering
where I can find more information about it?

* There is now a unified queue (and thus prioritization) of client IO,
recovery, scrubbing, and snapshot trimming.

I've tried checking the docs for more details, but didn't have much luck.
Does this mean we can adjust the ionice priority of each of these
operations if we're using the CFQ scheduler?

Thanks,
Bryan






Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition

2016-01-11 Thread Stillwell, Bryan
On 1/10/16, 2:26 PM, "ceph-users on behalf of Stuart Longland"
 wrote:

>On 05/01/16 07:52, Stuart Longland wrote:
>>> I ran into this same issue, and found that a reboot ended up setting
>>>the
>>> > ownership correctly.  If you look at
>>>/lib/udev/rules.d/95-ceph-osd.rules
>>> > you'll see the magic that makes it happen
>> Ahh okay, good-o, so a reboot should be fine.  I guess adding chown-ing
>> of journal files would be a good idea (maybe it's version specific, but
>> chown -R did not follow the symlink and change ownership for me).
>
>Well, it seems I spoke to soon.  Not sure what logic the udev rules use
>to identify ceph journals, but it doesn't seem to pick up on the
>journals in our case as after a reboot, those partitions are owned by
>root:disk with permissions 0660.

This is handled by the UUIDs of the GPT partitions, and since you're using
MS-DOS partition tables it won't work correctly.  I would recommend
switching to GPT partition tables if you can.
Bryan






Re: [ceph-users] Infernalis upgrade breaks when journal on separate partition

2016-01-04 Thread Stillwell, Bryan
I ran into this same issue, and found that a reboot ended up setting the
ownership correctly.  If you look at /lib/udev/rules.d/95-ceph-osd.rules
you'll see the magic that makes it happen:

# JOURNAL_UUID
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER="ceph", GROUP="ceph", MODE="660"



Bryan

On 1/4/16, 2:39 PM, "ceph-users on behalf of Stuart Longland"
 wrote:

>Hi all,
>
>I just did an update of a storage cluster here, or rather, I've done one
>node out of three updating to Infernalis from Hammer.
>
>I shut down the daemons, as per the guide, then did a recursive chown of
>the /var/lib/ceph directory, then struck the following when re-starting:
>
>> 2016-01-05 07:32:09.114197 7f5b41d0f940  0 ceph version 9.2.0
>>(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 2899
>> 2016-01-05 07:32:09.123740 7f5b41d0f940  0
>>filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342)
>> 2016-01-05 07:32:09.124047 7f5b41d0f940  0
>>genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
>>FIEMAP ioctl is disabl
>> ed via 'filestore fiemap' config option
>> 2016-01-05 07:32:09.124053 7f5b41d0f940  0
>>genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
>>SEEK_DATA/SEEK_HOLE is
>>  disabled via 'filestore seek data hole' config option
>> 2016-01-05 07:32:09.124066 7f5b41d0f940  0
>>genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
>>splice is supported
>> 2016-01-05 07:32:09.156182 7f5b41d0f940  0
>>genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features:
>>syncfs(2) syscall full
>> y supported (by glibc and kernel)
>> 2016-01-05 07:32:09.156301 7f5b41d0f940  0
>>xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: extsize
>>is supported and y
>> our kernel >= 3.5
>> 2016-01-05 07:32:09.232801 7f5b41d0f940  0
>>filestore(/var/lib/ceph/osd/ceph-0) mount: enabling WRITEAHEAD journal
>>mode: checkpoint i
>> s not enabled
>> 2016-01-05 07:32:09.253440 7f5b41d0f940 -1
>>filestore(/var/lib/ceph/osd/ceph-0) mount failed to open journal
>>/dev/sdc5: (13) Permissi
>> on denied
>> 2016-01-05 07:32:09.263646 7f5b41d0f940 -1 osd.0 0 OSD:init: unable to
>>mount object store
>> 2016-01-05 07:32:09.263656 7f5b41d0f940 -1 ESC[0;31m ** ERROR: osd init
>>failed: (13) Permission deniedESC[0m
>
>Things did not co-operate until I chown'ed /dev/sdc5 (and /dev/sdc6) to
>ceph:ceph.  (-R in /var/lib/ceph was not sufficient).  Even adding ceph
>to the 'disk' group (who owns /dev/sdc5) oddly enough, was not sufficient.
>
>I have that node running, and will do the others, but I am concerned
>about what happens after a reboot.  Is it necessary to configure udev to
>chown /dev/sdc[56] at boot or is there some way to fix ceph's permissions?
>--
> _ ___ Stuart Longland - Systems Engineer
>\  /|_) |   T: +61 7 3535 9619
> \/ | \ | 38b Douglas StreetF: +61 7 3535 9699
>   SYSTEMSMilton QLD 4064   http://www.vrt.com.au
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] OSDs stuck in booting state on CentOS 7.2.1511 and ceph infernalis 9.2.0

2015-12-18 Thread Stillwell, Bryan
I ran into a similar problem while in the middle of upgrading from Hammer 
(0.94.5) to Infernalis (9.2.0).  I decided to try rebuilding one of the OSDs by 
using 'ceph-disk prepare /dev/sdb' and it never comes up:

root@b3:~# ceph daemon osd.10 status
{
"cluster_fsid": "----",
"osd_fsid": "----",
"whoami": 10,
"state": "booting",
"oldest_map": 25804,
"newest_map": 25904,
"num_pgs": 0
}

Here's what is written to /var/log/ceph/osd/ceph-osd.10.log:

2015-12-18 16:09:48.928462 7fd5e2bec940  0 ceph version 9.2.0 
(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 6866
2015-12-18 16:09:48.931387 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs in /var/lib/ceph/tmp/mnt.IOnlxY
2015-12-18 16:09:48.931417 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs fsid is already set to 
----
2015-12-18 16:09:48.931422 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) write_version_stamp 4
2015-12-18 16:09:48.932671 7fd5e2bec940  0 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) backend xfs (magic 0x58465342)
2015-12-18 16:09:48.934953 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) leveldb db exists/created
2015-12-18 16:09:48.935082 7fd5e2bec940  1 journal _open 
/var/lib/ceph/tmp/mnt.IOnlxY/journal fd 11: 1072693248 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-18 16:09:48.935218 7fd5e2bec940 -1 journal check: ondisk fsid 
---- doesn't match expected 
----, invalid (someone else's?) journal
2015-12-18 16:09:48.935227 7fd5e2bec940  1 journal close 
/var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.935452 7fd5e2bec940  1 journal _open 
/var/lib/ceph/tmp/mnt.IOnlxY/journal fd 11: 1072693248 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-18 16:09:48.935771 7fd5e2bec940  0 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkjournal created journal on 
/var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.935803 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mkfs done in 
/var/lib/ceph/tmp/mnt.IOnlxY
2015-12-18 16:09:48.935919 7fd5e2bec940  0 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) backend xfs (magic 0x58465342)
2015-12-18 16:09:48.936548 7fd5e2bec940  0 
genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: FIEMAP 
ioctl is disabled via 'filestore fiemap' config option
2015-12-18 16:09:48.936559 7fd5e2bec940  0 
genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: 
SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2015-12-18 16:09:48.936588 7fd5e2bec940  0 
genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: splice 
is supported
2015-12-18 16:09:48.938319 7fd5e2bec940  0 
genericfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: 
syncfs(2) syscall fully supported (by glibc and kernel)
2015-12-18 16:09:48.938394 7fd5e2bec940  0 
xfsfilestorebackend(/var/lib/ceph/tmp/mnt.IOnlxY) detect_features: extsize is 
supported and your kernel >= 3.5
2015-12-18 16:09:48.940420 7fd5e2bec940  0 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) mount: enabling WRITEAHEAD journal 
mode: checkpoint is not enabled
2015-12-18 16:09:48.940646 7fd5e2bec940  1 journal _open 
/var/lib/ceph/tmp/mnt.IOnlxY/journal fd 17: 1072693248 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-18 16:09:48.940865 7fd5e2bec940  1 journal _open 
/var/lib/ceph/tmp/mnt.IOnlxY/journal fd 17: 1072693248 bytes, block size 4096 
bytes, directio = 1, aio = 1
2015-12-18 16:09:48.941270 7fd5e2bec940  1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) upgrade
2015-12-18 16:09:48.941389 7fd5e2bec940 -1 
filestore(/var/lib/ceph/tmp/mnt.IOnlxY) could not find 
-1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
2015-12-18 16:09:48.945392 7fd5e2bec940  1 journal close 
/var/lib/ceph/tmp/mnt.IOnlxY/journal
2015-12-18 16:09:48.946175 7fd5e2bec940 -1 created object store 
/var/lib/ceph/tmp/mnt.IOnlxY journal /var/lib/ceph/tmp/mnt.IOnlxY/journal for 
osd.10 fsid ----
2015-12-18 16:09:48.946269 7fd5e2bec940 -1 auth: error reading file: 
/var/lib/ceph/tmp/mnt.IOnlxY/keyring: can't open 
/var/lib/ceph/tmp/mnt.IOnlxY/keyring: (2) No such file or directory
2015-12-18 16:09:48.946623 7fd5e2bec940 -1 created new key in keyring 
/var/lib/ceph/tmp/mnt.IOnlxY/keyring
2015-12-18 16:09:50.698753 7fb5db130940  0 ceph version 9.2.0 
(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process ceph-osd, pid 7045
2015-12-18 16:09:50.745427 7fb5db130940  0 filestore(/var/lib/ceph/osd/ceph-10) 
backend xfs (magic 0x58465342)
2015-12-18 16:09:50.745978 7fb5db130940  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features: FIEMAP 
ioctl is disabled via 'filestore fiemap' config option
2015-12-18 16:09:50.745987 7fb5db130940  0 
genericfilestorebackend(/var/lib/ceph/osd/ceph-10) detect_features

Re: [ceph-users] Storage node refurbishing, a "freeze" OSD feature would be nice

2015-08-31 Thread Stillwell, Bryan
We have the following in our ceph.conf to bring in new OSDs with a weight
of 0:

[osd]
osd_crush_initial_weight = 0


We then set 'nobackfill' and bring in each OSD at full weight one at a
time (letting things settle down before bringing in the next OSD).  Once all
the OSDs are brought in we unset 'nobackfill' and let ceph take care of
the rest.  This seems to work pretty well for us.
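
Concretely, the bring-in step looks roughly like this (OSD IDs and the target
weight are just examples):

ceph osd set nobackfill
for osd in 24 25 26; do
    ceph osd crush reweight osd.${osd} 1.09
    sleep 60    # let peering settle before the next one
done
ceph osd unset nobackfill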

Bryan

On 8/31/15, 4:08 PM, "ceph-users on behalf of Wang, Warren"
 wrote:

>When we know we need to off a node, we weight it down over time. Depending
>on your cluster, you may need to do this over days or hours.
>
>In theory, you could do the same when putting OSDs in, by setting noin,
>and then setting weight to something very low, and going up over time. I
>haven't tried this though.
>
>--
>Warren Wang
>Comcast Cloud (OpenStack)
>
>
>
>On 8/31/15, 2:57 AM, "ceph-users on behalf of Udo Lembke"
>
>wrote:
>
>>Hi Christian,
>>for my setup "b" takes too long - too much data movement and stress to
>>all nodes.
>>I have simply (with replica 3) "set noout", reinstall one node (with new
>>filesystem on the OSDs, but leave them in the
>>crushmap) and start all OSDs (at friday night) - takes app. less than one
>>day for rebuild (11*4TB 1*8TB).
>>Do also stress the other nodes, but less than with weighting to zero.
>>
>>Udo
>>
>>On 31.08.2015 06:07, Christian Balzer wrote:
>>>
>>> Hello,
>>>
>>> I'm about to add another storage node to small firefly cluster here and
>>> refurbish 2 existing nodes (more RAM, different OSD disks).
>>>
>>> Insert rant about not going to start using ceph-deploy as I would have
>>>to
>>> set the cluster to no-in since "prepare" also activates things due to
>>>the
>>> udev magic...
>>>
>>> This cluster is quite at the limits of its IOPS capacity (the HW was
>>> requested ages ago, but the mills here grind slowly and not particular
>>> fine either), so the plan is to:
>>>
>>> a) phase in the new node (lets call it C), one OSD at a time (in the
>>>dead
>>> of night)
>>> b) empty out old node A (weight 0), one OSD at a time. When
>>> done, refurbish and bring it back in, like above.
>>> c) repeat with 2nd old node B.
>>>
>>> Looking at this it's obvious where the big optimization in this
>>>procedure
>>> would be, having the ability to "freeze" the OSDs on node B.
>>> That is making them ineligible for any new PGs while preserving their
>>> current status.
>>> So that data moves from A to C (which is significantly faster than A or
>>>B)
>>> and then back to A when it is refurbished, avoiding any heavy lifting
>>>by B.
>>>
>>> Does that sound like something other people might find useful as well
>>>and
>>> is it feasible w/o upsetting the CRUSH applecart?
>>>
>>> Christian
>>>
>>
>>___
>>ceph-users mailing list
>>ceph-users@lists.ceph.com
>>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






[ceph-users] Inconsistency in 'ceph df' stats

2015-08-31 Thread Stillwell, Bryan
On one of our staging ceph clusters (firefly 0.80.10) I've noticed that
some
of the statistics in the 'ceph df' output don't seem to match up.  For
example
in the output below the amount of raw used is 8,402G, which with triple
replication would be 2,800.7G used (all the pools are triple replication).
However, if you add up the numbers used by all the pools (424G + 2538G +
103G)
you get 3,065G used (a difference of +264.3G).

GLOBAL:
    SIZE    AVAIL   RAW USED  %RAW USED
    50275G  41873G  8402G     16.71
POOLS:
    NAME       ID  USED   %USED  MAX AVAIL  OBJECTS
    data       0   0      0      13559G     0
    metadata   1   0      0      13559G     0
    rbd        2   0      0      13559G     0
    volumes    3   424G   0.84   13559G     159651
    images     4   2538G  5.05   13559G     325198
    backups    5   0      0      13559G     0
    instances  6   103G   0.21   13559G     25310
The max avail amount doesn't line up either.  If you take 3 * 13,559G you
get
40,677G available, but the global stat is 41,873G (a difference of 1,196G).


On another staging cluster the numbers are closer to what I would expect.
The
amount of raw used is 7,037G, which with triple replication should be
2,345.7G.  However, adding up the amounts used by all the pools (102G +
1749G
+ 478G + 14G) is 2,343G (a difference of just -2.7G).

GLOBAL:
    SIZE    AVAIL   RAW USED  %RAW USED
    50275G  43238G  7037G     14.00
POOLS:
    NAME       ID  USED    %USED  MAX AVAIL  OBJECTS
    data       0   0       0      13657G     0
    metadata   1   0       0      13657G     0
    rbd        2   0       0      13657G     0
    volumes    3   102G    0.20   13657G     27215
    images     4   1749G   3.48   13657G     224259
    instances  5   478G    0.95   13657G     79221
    backups    6   0       0      13657G     0
    scbench    8   14704M  0.03   13657G     3677

The max avail is a little further off.  Taking 3 * 13,657G you get 40,971G,
but the global stat is 43,238G (a difference of 2,267G).

My guess would have been that the global numbers would include some of the
overhead involved which lines up with the second cluster, but the first
cluster would have -264.3G of overhead which just doesn't make sense.  Any
ideas where these stats might be getting off?
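
For reference, the arithmetic above is simply:

echo '8402 / 3' | bc -l        # ~2800.67G expected from raw used / replicas
echo '424 + 2538 + 103' | bc   # 3065G total of the per-pool USED columns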

Thanks,
Bryan






Re: [ceph-users] Expanding a ceph cluster with ansible

2015-06-23 Thread Stillwell, Bryan
Sébastien,

Nothing has gone wrong with using it in this way, it just has to do with
my lack
of experience with ansible/ceph-ansible.  I'm learning both now, but would
love
if there were more documentation around using them.  For example this
documentation around using ceph-deploy is pretty good, and I was hoping for
something equivalent for ceph-ansible:

http://ceph.com/docs/master/rados/deployment/


With that said, I'm wondering what tweaks do you think would be needed to
get
ceph-ansible working on an existing cluster?

Also to answer your other questions, I haven't tried expanding the cluster
with
ceph-ansible yet.  I'm playing around with it in vagrant/virtualbox, and
it looks
pretty awesome so far!  If everything goes well, I'm not against
revisiting the
choice of puppet-ceph and replacing it with ceph-ansible.

One other question, how well does ceph-ansible handle replacing a failed
HDD
(/dev/sdo) that has the journal at the beginning or middle of an SSD
(/dev/sdd2)?
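
For context, the manual sequence I'd expect to need (reusing the existing
journal partition so no hole is left in the partition map) is roughly:

ceph-deploy disk zap dnvrco01-cephosd-025:sdo
ceph-deploy osd prepare dnvrco01-cephosd-025:sdo:/dev/sdd2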

Thanks,
Bryan

On 6/22/15, 7:09 AM, "Sebastien Han"  wrote:

>Hi Bryan,
>
>It shouldn't be a problem for ceph-ansible to expand a cluster even if it
>wasn't deployed with it.
>I believe this requires a bit of tweaking on the ceph-ansible, but it's
>not much.
>Can you elaborate on what went wrong and perhaps how you configured
>ceph-ansible?
>
>As far as I understood, you haven't been able to grow the size of your
>cluster by adding new disks/nodes?
>Is this statement correct?
>
>One more thing, why don't you use ceph-ansible entirely to do the
>provisioning and life cycle management of your cluster? :)
>
>> On 18 Jun 2015, at 00:14, Stillwell, Bryan
>> wrote:
>>
>> I've been working on automating a lot of our ceph admin tasks lately
>>and am
>> pretty pleased with how the puppet-ceph module has worked for installing
>> packages, managing ceph.conf, and creating the mon nodes.  However, I
>>don't
>> like the idea of puppet managing the OSDs.  Since we also use ansible
>>in my
>> group, I took a look at ceph-ansible to see how it might be used to
>> complete
>> this task.  I see examples for doing a rolling update and for doing an
>>os
>> migration, but nothing for adding a node or multiple nodes at once.  I
>> don't
>> have a problem doing this work, but wanted to check with the community
>>if
>> any one has experience using ceph-ansible for this?
>>
>> After a lot of trial and error I found the following process works well
>> when
>> using ceph-deploy, but it's a lot of steps and can be error prone
>> (especially if you have old cephx keys that haven't been removed yet):
>>
>> # Disable backfilling and scrubbing to prevent too many performance
>> # impacting tasks from happening at the same time.  Maybe adding
>>norecover
>> # to this list might be a good idea so only peering happens at first.
>> ceph osd set nobackfill
>> ceph osd set noscrub
>> ceph osd set nodeep-scrub
>>
>> # Zap the disks to start from a clean slate
>> ceph-deploy disk zap dnvrco01-cephosd-025:sd{b..y}
>>
>> # Prepare the disks.  I found sleeping between adding each disk can help
>> # prevent performance problems.
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdh:/dev/sdb; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdi:/dev/sdb; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdj:/dev/sdb; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdk:/dev/sdc; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdl:/dev/sdc; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdm:/dev/sdc; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdn:/dev/sdd; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdo:/dev/sdd; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdp:/dev/sdd; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdq:/dev/sde; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdr:/dev/sde; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sds:/dev/sde; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdt:/dev/sdf; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdu:/dev/sdf; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdv:/dev/sdf; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdw:/dev/sdg; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdx:/dev/sdg; sleep 15
>> ceph-deploy osd prepare dnvrco01-cephosd-025:sdy:/dev/sdg; sleep 15
>>
>> # Weight in the new OSDs.  We set 'osd_crush_initial_weight = 0' to
>>prevent
>> # them from being added in during the 

[ceph-users] Expanding a ceph cluster with ansible

2015-06-17 Thread Stillwell, Bryan
I've been working on automating a lot of our ceph admin tasks lately and am
pretty pleased with how the puppet-ceph module has worked for installing
packages, managing ceph.conf, and creating the mon nodes.  However, I don't
like the idea of puppet managing the OSDs.  Since we also use ansible in my
group, I took a look at ceph-ansible to see how it might be used to
complete
this task.  I see examples for doing a rolling update and for doing an os
migration, but nothing for adding a node or multiple nodes at once.  I
don't
have a problem doing this work, but wanted to check with the community if
anyone has experience using ceph-ansible for this?

After a lot of trial and error I found the following process works well
when
using ceph-deploy, but it's a lot of steps and can be error prone
(especially if you have old cephx keys that haven't been removed yet):

# Disable backfilling and scrubbing to prevent too many performance
# impacting tasks from happening at the same time.  Maybe adding norecover
# to this list might be a good idea so only peering happens at first.
ceph osd set nobackfill
ceph osd set noscrub
ceph osd set nodeep-scrub

# Zap the disks to start from a clean slate
ceph-deploy disk zap dnvrco01-cephosd-025:sd{b..y}

# Prepare the disks.  I found sleeping between adding each disk can help
# prevent performance problems.
ceph-deploy osd prepare dnvrco01-cephosd-025:sdh:/dev/sdb; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdi:/dev/sdb; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdj:/dev/sdb; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdk:/dev/sdc; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdl:/dev/sdc; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdm:/dev/sdc; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdn:/dev/sdd; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdo:/dev/sdd; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdp:/dev/sdd; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdq:/dev/sde; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdr:/dev/sde; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sds:/dev/sde; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdt:/dev/sdf; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdu:/dev/sdf; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdv:/dev/sdf; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdw:/dev/sdg; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdx:/dev/sdg; sleep 15
ceph-deploy osd prepare dnvrco01-cephosd-025:sdy:/dev/sdg; sleep 15

# Weight in the new OSDs.  We set 'osd_crush_initial_weight = 0' to prevent
# them from being added in during the prepare step.  Maybe a longer weight
# in the last step would make this step unnecessary.
ceph osd crush reweight osd.450 1.09; sleep 60
ceph osd crush reweight osd.451 1.09; sleep 60
ceph osd crush reweight osd.452 1.09; sleep 60
ceph osd crush reweight osd.453 1.09; sleep 60
ceph osd crush reweight osd.454 1.09; sleep 60
ceph osd crush reweight osd.455 1.09; sleep 60
ceph osd crush reweight osd.456 1.09; sleep 60
ceph osd crush reweight osd.457 1.09; sleep 60
ceph osd crush reweight osd.458 1.09; sleep 60
ceph osd crush reweight osd.459 1.09; sleep 60
ceph osd crush reweight osd.460 1.09; sleep 60
ceph osd crush reweight osd.461 1.09; sleep 60
ceph osd crush reweight osd.462 1.09; sleep 60
ceph osd crush reweight osd.463 1.09; sleep 60
ceph osd crush reweight osd.464 1.09; sleep 60
ceph osd crush reweight osd.465 1.09; sleep 60
ceph osd crush reweight osd.466 1.09; sleep 60
ceph osd crush reweight osd.467 1.09; sleep 60

# Once all the OSDs are added to the cluster, allow the backfill process to
# begin.
ceph osd unset nobackfill

# Then once cluster is healthy again, re-enable scrubbing
ceph osd unset noscrub
ceph osd unset nodeep-scrub




Re: [ceph-users] Discuss: New default recovery config settings

2015-05-29 Thread Stillwell, Bryan
I like the idea of turning the defaults down.  During the ceph operators 
session at the OpenStack conference last week Warren described the behavior 
pretty accurately as "Ceph basically DOSes itself unless you reduce those 
settings."  Maybe this is more of a problem when the clusters are small?

Another idea would be to have a better way of pushing recovery traffic to an
even lower priority level, for example by setting the ionice class to 'Idle'
in the CFQ scheduler.
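
In the meantime, if anyone wants to try the proposed values on a running
cluster without restarting OSDs, something like this should do it (some
options may warn that they only take full effect after a restart):

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3 --osd-recovery-op-priority 1 --osd-recovery-max-single-start 1'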

Bryan

From: Josef Johansson (jose...@gmail.com)
Date: Friday, May 29, 2015 at 4:16 PM
To: Samuel Just (sj...@redhat.com), ceph-devel (ceph-de...@vger.kernel.org),
"ceph-users@lists.ceph.com" (ceph-users@lists.ceph.com)
Subject: Re: [ceph-users] Discuss: New default recovery config settings


Hi,

We did it the other way around instead, defining a period where the load is 
lighter and turning backfill/recovery off and on. Then you want the backfill 
values to be what is the default right now.

Also, someone said (I think it was Greg?) that if you have problems with 
backfill, your cluster's backing store is not fast enough or is too heavily loaded.
If 10 OSDs go down at the same time you want those values to be high to 
minimize the downtime.

/Josef

Fri 29 May 2015 23:47, Samuel Just (sj...@redhat.com) wrote:
Many people have reported that they need to lower the osd recovery config 
options to minimize the impact of recovery on client io.  We are talking about 
changing the defaults as follows:

osd_max_backfills to 1 (from 10)
osd_recovery_max_active to 3 (from 15)
osd_recovery_op_priority to 1 (from 10)
osd_recovery_max_single_start to 1 (from 5)

We'd like a bit of feedback first though.  Is anyone happy with the current 
configs?  Is anyone using something between these values and the current 
defaults?  What kind of workload?  I'd guess that lowering osd_max_backfills to 
1 is probably a good idea, but I wonder whether lowering 
osd_recovery_max_active and osd_recovery_max_single_start will cause small 
objects to recover unacceptably slowly.

Thoughts?
-Sam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] cephfs map command deprecated

2015-04-22 Thread Stillwell, Bryan
On 4/22/15, 2:08 PM, "Gregory Farnum"  wrote:

>On Wed, Apr 22, 2015 at 12:35 PM, Stillwell, Bryan
> wrote:
>> I have a PG that is in the active+inconsistent state and found the
>> following objects to have differing md5sums:
>>
>> -fa8298048c1958de3c04c71b2f225987
>> ./DIR_5/DIR_0/DIR_D/DIR_9/1008a75.017c__head_502F9D05__0
>> +b089c2dcd4f1d8b4419ba34fe468f784
>> ./DIR_5/DIR_0/DIR_D/DIR_9/1008a75.017c__head_502F9D05__0
>> -d3e18fd503ee8ba42792fc41eadfc328
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aea5.1535__head_3C0C9D05__0
>> -85a1cf9fae40333ce1145b88bbcf0278
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aeaa.0c1e__head_BA1F9D05__0
>> -4da9b1c2e81a72e2aecbfd10dc0e217a
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aec4.0014__head_736F9D05__0
>> +0a0c530b54799988863c2364b077
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aea5.1535__head_3C0C9D05__0
>> +86eea720d66d0792e8a1da8ea57e74f7
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aeaa.0c1e__head_BA1F9D05__0
>> +75c9085ffb66879326a6f1021586e60f
>> ./DIR_5/DIR_0/DIR_D/DIR_9/100aec4.0014__head_736F9D05__0
>>
>>
>>
>> I would like to know which files are associated with these objects, but
>> one of the commands I used to use doesn't appear to work any more:
>>
>> # cephfs MVI_2553.MOV map
>> WARNING: This tool is deprecated.  Use the layout.* xattrs to query and
>> modify layouts.
>> FILE OFFSET    OBJECT        OFFSET    LENGTH    OSD
>> Error getting location: (22) Invalid argument
>>
>>
>> I tried looking at the layout information on this page, but it doesn't
>> contain the name of the object from what I can tell:
>
>Hmm, yeah. The object names are just <inode number in hex>.<part number>. So
>you don't even need the layout info to know what objects to look at. :)

Thanks!  In case anyone else wants to know what the solution was in the
future, this worked well for me:

for inode in 1008a75 100aea5 100aeaa 100aec4; do
ls -li | grep $((0x${inode}))
done


Bryan




[ceph-users] cephfs map command deprecated

2015-04-22 Thread Stillwell, Bryan
I have a PG that is in the active+inconsistent state and found the
following objects to have differing md5sums:

-fa8298048c1958de3c04c71b2f225987
./DIR_5/DIR_0/DIR_D/DIR_9/1008a75.017c__head_502F9D05__0
+b089c2dcd4f1d8b4419ba34fe468f784
./DIR_5/DIR_0/DIR_D/DIR_9/1008a75.017c__head_502F9D05__0
-d3e18fd503ee8ba42792fc41eadfc328
./DIR_5/DIR_0/DIR_D/DIR_9/100aea5.1535__head_3C0C9D05__0
-85a1cf9fae40333ce1145b88bbcf0278
./DIR_5/DIR_0/DIR_D/DIR_9/100aeaa.0c1e__head_BA1F9D05__0
-4da9b1c2e81a72e2aecbfd10dc0e217a
./DIR_5/DIR_0/DIR_D/DIR_9/100aec4.0014__head_736F9D05__0
+0a0c530b54799988863c2364b077
./DIR_5/DIR_0/DIR_D/DIR_9/100aea5.1535__head_3C0C9D05__0
+86eea720d66d0792e8a1da8ea57e74f7
./DIR_5/DIR_0/DIR_D/DIR_9/100aeaa.0c1e__head_BA1F9D05__0
+75c9085ffb66879326a6f1021586e60f
./DIR_5/DIR_0/DIR_D/DIR_9/100aec4.0014__head_736F9D05__0



I would like to know which files are associated with these objects, but
one of the commands I used to use doesn't appear to work any more:

# cephfs MVI_2553.MOV map
WARNING: This tool is deprecated.  Use the layout.* xattrs to query and
modify layouts.
FILE OFFSET    OBJECT        OFFSET    LENGTH    OSD
Error getting location: (22) Invalid argument


I tried looking at the layout information on this page, but it doesn't
contain the name of the object from what I can tell:

http://ceph.com/docs/master/cephfs/file-layouts/
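
(The layout.* xattr route suggested by the deprecation warning would be
something like the command below, but it only reports striping parameters
such as stripe_unit/stripe_count/object_size/pool, not object names:)

getfattr -n ceph.file.layout MVI_2553.MOV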



Is there another command that can be used which is similar to the cephfs
map command?

BTW, I'm using hammer (0.94.1).

Thanks,
Bryan




[ceph-users] Managing larger ceph clusters

2015-04-15 Thread Stillwell, Bryan
I'm curious what people managing larger ceph clusters are doing with
configuration management and orchestration to simplify their lives?

We've been using ceph-deploy to manage our ceph clusters so far, but
feel that moving the management of our clusters to standard tools would
provide a little more consistency and help prevent some mistakes that
have happened while using ceph-deploy.

We're looking at using the same tools we use in our OpenStack
environment (puppet/ansible), but I'm interested in hearing from people
using chef/salt/juju as well.

Some of the cluster operation tasks that I can think of along with
ideas/concerns I have are:

Keyring management
  Seems like hiera-eyaml is a natural fit for storing the keyrings.

ceph.conf
  I believe the puppet ceph module can be used to manage this file, but
  I'm wondering if using a template (erb?) might be better method to
  keeping it organized and properly documented.

Pool configuration
  The puppet module seems to be able to handle managing replicas and the
  number of placement groups, but I don't see support for erasure coded
  pools yet.  This is probably something we would want the initial
  configuration to be set up by puppet, but not something we would want
  puppet changing on a production cluster.

CRUSH maps
  Describing the infrastructure in yaml makes sense.  Things like which
  servers are in which rows/racks/chassis.  Also describing the type of
  server (model, number of HDDs, number of SSDs) makes sense.

CRUSH rules
  I could see puppet managing the various rules based on the backend
  storage (HDD, SSD, primary affinity, erasure coding, etc).

Replacing a failed HDD disk
  Do you automatically identify the new drive and start using it right
  away?  I've seen people talk about using a combination of udev and
  special GPT partition IDs to automate this.  If you have a cluster
  with thousands of drives I think automating the replacement makes
  sense.  How do you handle the journal partition on the SSD?  Does
  removing the old journal partition and creating a new one create a
  hole in the partition map (because the old partition is removed and
  the new one is created at the end of the drive)?

Replacing a failed SSD journal
  Has anyone automated recreating the journal drive using Sebastien
  Han's instructions, or do you have to rebuild all the OSDs as well?


http://www.sebastien-han.fr/blog/2014/11/27/ceph-recover-osds-after-ssd-jou
rnal-failure/

Adding new OSD servers
  How are you adding multiple new OSD servers to the cluster?  I could
  see an ansible playbook which disables nobackfill, noscrub, and
  nodeep-scrub followed by adding all the OSDs to the cluster being
  useful.

Upgrading releases
  I've found an ansible playbook for doing a rolling upgrade which looks
  like it would work well, but are there other methods people are using?


http://www.sebastien-han.fr/blog/2015/03/30/ceph-rolling-upgrades-with-ansi
ble/

Decommissioning hardware
  Seems like another ansible playbook for reducing the OSDs weights to
  zero, marking the OSDs out, stopping the service, removing the OSD ID,
  removing the CRUSH entry, unmounting the drives, and finally removing
  the server would be the best method here.  Any other ideas on how to
  approach this?
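
For that last one, the manual steps I have in mind per OSD are roughly as
follows (hostname and OSD id are examples; repeat for each OSD on the box,
then remove the host bucket):

ceph osd crush reweight osd.42 0    # drain the OSD, wait for HEALTH_OK
ceph osd out 42
ssh dnvrco01-cephosd-025 'service ceph stop osd.42'   # or the upstart/systemd equivalent
ceph osd crush remove osd.42
ceph auth del osd.42
ceph osd rm 42
ssh dnvrco01-cephosd-025 'umount /var/lib/ceph/osd/ceph-42'
ceph osd crush remove dnvrco01-cephosd-025            # once all its OSDs are gone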


That's all I can think of right now.  Are there any other tasks that
people have run into that are missing from this list?

Thanks,
Bryan




Re: [ceph-users] low power single disk nodes

2015-04-09 Thread Stillwell, Bryan
These are really interesting to me, but how can you buy them?  What's the
performance like in ceph?  Are they using the keyvaluestore backend, or
something specific to these drives?  Also what kind of chassis do they go
into (some kind of ethernet JBOD)?

Bryan

On 4/9/15, 9:43 AM, "Mark Nelson"  wrote:

>How about drives that run Linux with an ARM processor, RAM, and an
>ethernet port right on the drive?  Notice the Ceph logo. :)
>
>https://www.hgst.com/science-of-storage/emerging-technologies/open-etherne
>t-drive-architecture
>
>Mark
>
>On 04/09/2015 10:37 AM, Scott Laird wrote:
>> Minnowboard Max?  2 atom cores, 1 SATA port, and a real (non-USB)
>> Ethernet port.
>>
>>
>> On Thu, Apr 9, 2015, 8:03 AM p...@philw.com wrote:
>>
>> Rather expensive option:
>>
>> Applied Micro X-Gene, overkill for a single disk, and only really
>> available in a
>> development kit format right now.
>>
>>
>>
>> Better Option:
>>
>> Ambedded CY7 - 7 nodes in 1U half Depth, 6 positions for SATA disks,
>> and one
>> node with mSATA SSD
>>
>>
>> --phil
>>
>>  > On 09 April 2015 at 15:57 Quentin Hartman
>>  > (qhart...@direwolfdigital.com) wrote:
>>  >
>>  >  I'm skeptical about how well this would work, but a Banana Pi
>>  > might be a place to start. Like a raspberry pi, but it has a SATA
>>  > connector: http://www.bananapi.org/
>>  >
>>  >  On Thu, Apr 9, 2015 at 3:18 AM, Jerker Nyberg (jer...@update.uu.se)
>>  > wrote:
>>  > > Hello ceph users,
>>  > >
>>  > > Is anyone running any low powered single disk nodes with Ceph now?
>>  > > Calxeda seems to be no more according to Wikipedia. I do not think HP
>>  > > moonshot is what I am looking for - I want stand-alone nodes, not
>>  > > server cartridges integrated into server chassis. And I do not want
>>  > > to be locked to a single vendor.
>>  > >
>>  > > I was playing with Raspberry Pi 2 for signage when I thought of my
>>  > > old experiments with Ceph.
>>  > >
>>  > > I am thinking of for example Odroid-C1 or Odroid-XU3 Lite or maybe
>>  > > something with a low-power Intel x64/x86 processor. Together with
>>  > > one SSD or one low power HDD the node could get all power via PoE
>>  > > (via splitter or integrated into board if such boards exist). PoE
>>  > > provide remote power-on power-off even for consumer grade nodes.
>>  > >
>>  > > The cost for a single low power node should be able to compete with
>>  > > traditional PC-servers price per disk. Ceph take care of redundancy.
>>  > >
>>  > > I think simple custom casing should be good enough - maybe just
>>  > > strap or velcro everything on trays in the rack, at least for the
>>  > > nodes with SSD.
>>  > >
>>  > > Kind regards,
>>  > > --
>>  > > Jerker Nyberg, Uppsala, Sweden.
>>  > > ___
>>  > > ceph-users mailing list
>>  > > ceph-users@lists.ceph.com
>>  > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>  >
>>  > ___
>>  > ceph-users mailing list
>>  > ceph-users@lists.ceph.com
>>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Slow performance during recovery operations

2015-04-02 Thread Stillwell, Bryan
>Recovery creates I/O performance drops in our VM too but it's manageable.
>What really hurts us are deep scrubs.
>Our current situation is Firefly 0.80.9 with a total of 24 identical OSDs
>evenly distributed on 4 servers with the following relevant configuration:
>
>osd recovery max active  = 2
>osd scrub load threshold  = 3
>osd deep scrub interval   = 1209600 # 14 days
>osd max backfills = 4
>osd disk thread ioprio class  = idle
>osd disk thread ioprio priority = 7
>
>we managed to add several OSDs at once while deep scrubs were in practice
>disabled: we just increased deep scrub interval from 1 to 2 weeks which
>if I understand correctly had the effect of disabling them for 1 week
>(and indeed there were none while the backfilling
> went on for several hours).
>
>With these settings and no deep-scrubs the load increased a bit in the
>VMs doing non negligible I/Os but this was manageable. Even disk thread
>ioprio settings (which is what you want to get the ionice behaviour for
>deep scrubs) didn't seem to make much of a
> difference.

From what I can tell, the 'osd disk thread' settings only apply to
scrubbing and 'snap trimming' operations.  I guess what I'm looking for is
a couple of settings that may not exist yet:

  osd recovery thread ioprio class = idle
  osd recovery thread ioprio priority = 7

Or am I going down the wrong path here?


>Note : I don't believe Ceph will try to scatter the scrubs on the whole
>period you set with deep scrub interval, it seems its algorithm is much
>simpler than that and may lead to temporary salves of successive deep
>scrubs and it might generate some temporary
> I/O load which is hard to diagnose (by default scrubs and deep scrubs
>are logged by the OSD so you can correlate them with whatever you use to
>supervise your cluster).

It would definitely be nice to not have scrubs affect performance as much
either, so I'll probably add those to our config as well.


>I actually considered monitoring Ceph for backfills and using ceph set
>nodeep-scrub automatically when there are some and unset it when they
>disappear.

I'm pretty sure setting 'nodeep-scrub' doesn't cancel any current
deep-scrubs that are happening, but something like this would help prevent
the problem from getting worse.
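
Something as simple as this, cron'd every minute or so, would probably cover
the prevention side (untested sketch):

if ceph health | grep -q backfill; then
    ceph osd set nodeep-scrub
else
    ceph osd unset nodeep-scrub
fi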


Bryan




[ceph-users] Slow performance during recovery operations

2015-04-02 Thread Stillwell, Bryan
All,

Whenever we're doing some kind of recovery operation on our ceph
clusters (cluster expansion or dealing with a drive failure), there
seems to be a fairly noticable performance drop while it does the
backfills (last time I measured it the performance during recovery was
something like 20% of a healthy cluster).  I'm wondering if there are
any settings that we might be missing which would improve this
situation?

Before doing any kind of expansion operation I make sure both 'noscrub'
and 'nodeep-scrub' are set to make sure scrubing isn't making things
worse.

Also we have the following options set in our ceph.conf:

[osd]
osd_journal_size = 16384
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_recovery_max_single_start = 1
osd_op_threads = 12
osd_crush_initial_weight = 0


I'm wondering if there might be a way to use ionice in the CFQ scheduler
to delegate the recovery traffic to be of the Idle type so customer
traffic has a higher priority?
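
The crude version of what I'm imagining would be something like this, though
it would affect all I/O from the OSD processes (not just recovery) and only
has an effect with the CFQ scheduler:

for pid in $(pgrep -x ceph-osd); do
    ionice -c 3 -p ${pid}    # class 3 = idle
done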

Thanks,
Bryan

