Re: [ceph-users] Migration of a Ceph cluster to a new datacenter and new IPs

2018-12-27 Thread Marcus Müller
Hi all,

Just wanted to share my experience of how to stop the whole cluster and
change the IPs.

First, we shut down the cluster with this procedure:

1. Stop the clients from using the RBD images/RADOS Gateway on this
cluster, as well as any other clients.
2. Make sure the cluster is in a healthy state before proceeding.
3. Set the noout, norecover, norebalance, nobackfill, nodown and pause flags:
#ceph osd set noout
#ceph osd set norecover
#ceph osd set norebalance
#ceph osd set nobackfill
#ceph osd set nodown
#ceph osd set pause
4. Stop all Ceph services:
4.1. First the OSD nodes, one by one
4.2. Lastly the monitor nodes, one by one

Now we extracted the monmap with 'ceph-mon -i {mon-id} --extract-monmap /tmp/monmap'
and followed this manual:
http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
Then we imported the new monmap into each monitor (while they were all stopped)
and changed ceph.conf on all nodes to the new IPs (don't forget the clients).
The last step was to change the IP configuration and the hosts files (in our case,
again, don't forget the clients) and shut down the nodes.
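
For reference, the per-monitor editing described in the "messy way" guide looked
roughly like this (monitor name and address below are placeholders, so double-check
the options against your Ceph version):

# remove the monitor's old address from the extracted map
monmaptool --rm mon1 /tmp/monmap
# add it back with its new address
monmaptool --add mon1 10.0.0.1:6789 /tmp/monmap
# sanity-check the result
monmaptool --print /tmp/monmap
# inject the edited map into the (still stopped) monitor
ceph-mon -i mon1 --inject-monmap /tmp/monmap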

In the new datacenter we started the nodes and everything came up as usual.

(1. Power on the admin node)
2. Power on the monitor nodes
3. Power on the OSD nodes
4. Wait for all the nodes to come up, and verify that all the services are
up and that the connectivity between the nodes is fine.
5. Unset all the noout, norecover, norebalance, nobackfill, nodown and
pause flags:
#ceph osd unset noout
#ceph osd unset norecover
#ceph osd unset norebalance
#ceph osd unset nobackfill
#ceph osd unset nodown
#ceph osd unset pause
6. Check and verify that the cluster is in a healthy state, and verify that all
the clients are able to access the cluster.
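
A quick verification pass might look like the following (just the usual status
commands, nothing specific to the migration):
#ceph -s
#ceph health detail
#ceph osd tree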

I hope this helps someone in the future!


> Am 20.12.2018 um 18:18 schrieb Paul Emmerich :
> 
> I'd do it like this:
> 
> * create 2 new mons with the new IPs
> * update all clients to the 3 new mon IPs
> * delete two old mons
> * create 1 new mon
> * delete the last old mon
> 
> I think it's easier to create/delete mons than to change the IP of an
> existing mon. This doesn't even incur a downtime for the clients
> because they get notified about the new mons.
> 
> For the OSDs: stop OSDs, change IP, start OSDs
> 
> Don't change the IP of a running OSD, they don't like that
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Wed, Dec 19, 2018 at 8:55 PM Marcus Müller  
> wrote:
>> 
>> Hi all,
>> 
>> we’re running a ceph hammer cluster with 3 mons and 24 osds (3 same nodes) 
>> and need to migrate all servers to a new datacenter and change the IPs of 
>> the nodes.
>> 
>> I found this tutorial: 
>> http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
>>  regarding the mons (should be easy) but nothing about the osds and which 
>> steps to do if you need to shutdown and migrate the cluster to a new 
>> datacenter.
>> 
>> Has anyone some ideas, how to and which steps I need?
>> 
>> Regards,
>> Marcus



[ceph-users] Migration of a Ceph cluster to a new datacenter and new IPs

2018-12-19 Thread Marcus Müller
Hi all,

we’re running a ceph hammer cluster with 3 mons and 24 osds (3 same nodes) and 
need to migrate all servers to a new datacenter and change the IPs of the 
nodes. 

I found this tutorial: 
http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
 

 regarding the mons (should be easy), but nothing about the OSDs and which steps
to take if you need to shut down and migrate the cluster to a new datacenter.

Does anyone have ideas on how to do this and which steps are needed?

Regards,
Marcus


[ceph-users] Purge Ceph Node and reuse it for another cluster

2018-09-26 Thread Marcus Müller
Hi all,

Is it safe to purge a Ceph OSD/MON node as described here:
http://docs.ceph.com/docs/giant/rados/deployment/ceph-deploy-purge/
and later reuse this node, with the same OS, for another production Ceph cluster?

Has anyone done this already?
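
(For context, the sequence from that page is roughly the following; 'ceph5' is just
an example hostname from my setup, and as far as I understand it purgedata only
wipes the Ceph data and configuration on that host:)

# remove the Ceph packages from the node
ceph-deploy purge ceph5
# remove remaining Ceph data and configuration from the node
ceph-deploy purgedata ceph5
# optionally, drop the keyrings cached in the local working directory
ceph-deploy forgetkeys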

Regards
Marcus


Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-11 Thread Marcus Müller
Yes, but all I want to know is whether my way of changing the tunables is
right or not.



> Am 11.01.2017 um 13:11 schrieb Shinobu Kinjo <ski...@redhat.com>:
> 
> Please refer to Jens's message.
> 
> Regards,
> 
>> On Wed, Jan 11, 2017 at 8:53 PM, Marcus Müller <mueller.mar...@posteo.de> 
>> wrote:
>> Ok, thank you. I thought I have to set ceph to a tunables profile. If I’m 
>> right, then I just have to export the current crush map, edit it and import 
>> it again, like:
>> 
>> ceph osd getcrushmap -o /tmp/crush
>> crushtool -i /tmp/crush --set-choose-total-tries 100 -o /tmp/crush.new
>> ceph osd setcrushmap -i /tmp/crush.new
>> 
>> Is this right or not?
>> 
>> I started this cluster with these 3 nodes and each 3 osds. They are vms. I 
>> knew that this cluster would expand very big, that’s the reason for my 
>> choice for ceph. Now I can’t add more HDDs to the vm hypervisor and I want 
>> to separate the nodes physically too. I bought a new node with these 4 
>> drives and now another node with only 2 drives. As I hear now from several 
>> people this was not a good idea. For this reason, I bought now additional 
>> HDDs for the new node, so I have two with the same amount of HDDs and size. 
>> In the next 1-2 months I will get the third physical node and then 
>> everything should be fine. But at this time I have no other option.
>> 
>> May it help to solve this problem by adding the 2 new HDDs to the new ceph 
>> node?
>> 
>> 
>> 
>>> Am 11.01.2017 um 12:00 schrieb Brad Hubbard <bhubb...@redhat.com>:
>>> 
>>> Your current problem has nothing to do with clients and neither does
>>> choose_total_tries.
>>> 
>>> Try setting just this value to 100 and see if your situation improves.
>>> 
>>> Ultimately you need to take a good look at your cluster configuration
>>> and how your crush map is configured to deal with that configuration
>>> but start with choose_total_tries as it has the highest probability of
>>> helping your situation. Your clients should not be affected.
>>> 
>>> Could you explain the reasoning behind having three hosts with one ods
>>> each, one host with two osds and one with four?
>>> 
>>> You likely need to tweak your crushmap to handle this configuration
>>> better or, preferably, move to a more uniform configuration.
>>> 
>>> 
>>>> On Wed, Jan 11, 2017 at 5:38 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>>> wrote:
>>>> I have to thank you all. You give free support and this already helps me.
>>>> I’m not the one who knows ceph that good, but everyday it’s getting better
>>>> and better ;-)
>>>> 
>>>> According to the article Brad posted I have to change the ceph osd crush
>>>> tunables. But there are two questions left as I already wrote:
>>>> 
>>>> - According to
>>>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables there
>>>> are a few profiles. My needed profile would be BOBTAIL (CRUSH_TUNABLES2)
>>>> wich would set choose_total_tries to 50. For the beginning better than 19.
>>>> There I also see: "You can select a profile on a running cluster with the
>>>> command: ceph osd crush tunables {PROFILE}“. My question on this is: Even 
>>>> if
>>>> I run hammer, is it good and possible to set it to bobtail?
>>>> 
>>>> - We can also read:
>>>> WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES2
>>>> - v0.55 or later, including bobtail series (v0.56.x)
>>>> - Linux kernel version v3.9 or later (for the file system and RBD kernel
>>>> clients)
>>>> 
>>>> And here my question is: If my clients use librados (version hammer), do I
>>>> need to have this required kernel version on the clients or the ceph nodes?
>>>> 
>>>> I don’t want to have troubles at the end with my clients. Can someone 
>>>> answer
>>>> me this, before I change the settings?
>>>> 
>>>> 
>>>> Am 11.01.2017 um 06:47 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>>> 
>>>> 
>>>> Yeah, Sam is correct. I've not looked at crushmap. But I should have
>>>> noticed what troublesome is with looking at `ceph osd tree`. That's my
>>>> bad, sorry for that.
>>>> 
>>>> Again please refer to:
>>>> 
>>>> http://www.anchor.com.au/blog/2013/02/pulling-apart-cephs-crush-algorithm/

Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-11 Thread Marcus Müller
Ok, thank you. I thought I had to set Ceph to a tunables profile. If I'm
right, then I just have to export the current crush map, edit it and import it
again, like:

ceph osd getcrushmap -o /tmp/crush
crushtool -i /tmp/crush --set-choose-total-tries 100 -o /tmp/crush.new
ceph osd setcrushmap -i /tmp/crush.new

Is this right or not?
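
If setting the value directly with crushtool does not work, I assume the longer way
would be to decompile the map to text, edit choose_total_tries there and compile it
back, roughly:

ceph osd getcrushmap -o /tmp/crush
# decompile to an editable text file
crushtool -d /tmp/crush -o /tmp/crush.txt
# edit the line "tunable choose_total_tries ..." in /tmp/crush.txt
# compile it back to a binary map
crushtool -c /tmp/crush.txt -o /tmp/crush.new
ceph osd setcrushmap -i /tmp/crush.new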

I started this cluster with these 3 nodes, each with 3 OSDs. They are VMs. I knew
that this cluster would grow very large; that's the reason I chose Ceph. Now I
can't add more HDDs to the VM hypervisor, and I want to separate the nodes
physically too. I bought a new node with these 4 drives and now another node with
only 2 drives. As I now hear from several people, this was not a good idea. For
this reason, I have bought additional HDDs for the new node, so I have two nodes
with the same number and size of HDDs. In the next 1-2 months I will get the third
physical node and then everything should be fine. But at the moment I have no
other option.

Would it help with this problem if I added the 2 new HDDs to the new Ceph node?



> Am 11.01.2017 um 12:00 schrieb Brad Hubbard <bhubb...@redhat.com>:
> 
> Your current problem has nothing to do with clients and neither does
> choose_total_tries.
> 
> Try setting just this value to 100 and see if your situation improves.
> 
> Ultimately you need to take a good look at your cluster configuration
> and how your crush map is configured to deal with that configuration
> but start with choose_total_tries as it has the highest probability of
> helping your situation. Your clients should not be affected.
> 
> Could you explain the reasoning behind having three hosts with one ods
> each, one host with two osds and one with four?
> 
> You likely need to tweak your crushmap to handle this configuration
> better or, preferably, move to a more uniform configuration.
> 
> 
> On Wed, Jan 11, 2017 at 5:38 PM, Marcus Müller <mueller.mar...@posteo.de> 
> wrote:
>> I have to thank you all. You give free support and this already helps me.
>> I’m not the one who knows ceph that good, but everyday it’s getting better
>> and better ;-)
>> 
>> According to the article Brad posted I have to change the ceph osd crush
>> tunables. But there are two questions left as I already wrote:
>> 
>> - According to
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#tunables there
>> are a few profiles. My needed profile would be BOBTAIL (CRUSH_TUNABLES2)
>> wich would set choose_total_tries to 50. For the beginning better than 19.
>> There I also see: "You can select a profile on a running cluster with the
>> command: ceph osd crush tunables {PROFILE}“. My question on this is: Even if
>> I run hammer, is it good and possible to set it to bobtail?
>> 
>> - We can also read:
>>  WHICH CLIENT VERSIONS SUPPORT CRUSH_TUNABLES2
>>  - v0.55 or later, including bobtail series (v0.56.x)
>>  - Linux kernel version v3.9 or later (for the file system and RBD kernel
>> clients)
>> 
>> And here my question is: If my clients use librados (version hammer), do I
>> need to have this required kernel version on the clients or the ceph nodes?
>> 
>> I don’t want to have troubles at the end with my clients. Can someone answer
>> me this, before I change the settings?
>> 
>> 
>> Am 11.01.2017 um 06:47 schrieb Shinobu Kinjo <ski...@redhat.com>:
>> 
>> 
>> Yeah, Sam is correct. I've not looked at crushmap. But I should have
>> noticed what troublesome is with looking at `ceph osd tree`. That's my
>> bad, sorry for that.
>> 
>> Again please refer to:
>> 
>> http://www.anchor.com.au/blog/2013/02/pulling-apart-cephs-crush-algorithm/
>> 
>> Regards,
>> 
>> 
>> On Wed, Jan 11, 2017 at 1:50 AM, Samuel Just <sj...@redhat.com> wrote:
>> 
>> Shinobu isn't correct, you have 9/9 osds up and running.  up does not
>> equal acting because crush is having trouble fulfilling the weights in
>> your crushmap and the acting set is being padded out with an extra osd
>> which happens to have the data to keep you up to the right number of
>> replicas.  Please refer back to Brad's post.
>> -Sam
>> 
>> On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller <mueller.mar...@posteo.de>
>> wrote:
>> 
>> Ok, i understand but how can I debug why they are not running as they
>> should? For me I thought everything is fine because ceph -s said they are up
>> and running.
>> 
>> I would think of a problem with the crush map.
>> 
>> Am 10.01.2017 um 08:06 schrieb Shinobu Kinjo <ski...@redhat.com>:
>> 
>> e.g.,
>> OSD7 / 3 / 0 

Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-10 Thread Marcus Müller
Hi Sam,

another idea: I have two HDDs here that I already wanted to add to ceph5, which
would require a new crush map anyway. Could this problem be solved by doing that?


> Am 10.01.2017 um 17:50 schrieb Samuel Just <sj...@redhat.com>:
> 
> Shinobu isn't correct, you have 9/9 osds up and running.  up does not
> equal acting because crush is having trouble fulfilling the weights in
> your crushmap and the acting set is being padded out with an extra osd
> which happens to have the data to keep you up to the right number of
> replicas.  Please refer back to Brad's post.
> -Sam
> 
> On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller <mueller.mar...@posteo.de> 
> wrote:
>> Ok, i understand but how can I debug why they are not running as they 
>> should? For me I thought everything is fine because ceph -s said they are up 
>> and running.
>> 
>> I would think of a problem with the crush map.
>> 
>>> Am 10.01.2017 um 08:06 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>> 
>>> e.g.,
>>> OSD7 / 3 / 0 are in the same acting set. They should be up, if they
>>> are properly running.
>>> 
>>> # 9.7
>>> 
>>>>  "up": [
>>>>  7,
>>>>  3
>>>>  ],
>>>>  "acting": [
>>>>  7,
>>>>  3,
>>>>  0
>>>>  ],
>>> 
>>> 
>>> Here is an example:
>>> 
>>> "up": [
>>>   1,
>>>   0,
>>>   2
>>> ],
>>> "acting": [
>>>   1,
>>>   0,
>>>   2
>>>  ],
>>> 
>>> Regards,
>>> 
>>> 
>>> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>> wrote:
>>>>> 
>>>>> That's not perfectly correct.
>>>>> 
>>>>> OSD.0/1/2 seem to be down.
>>>> 
>>>> 
>>>> Sorry but where do you see this? I think this indicates that they are up:  
>>>>  osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
>>>> 
>>>> 
>>>>> Am 10.01.2017 um 07:50 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>>>> 
>>>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>>>> wrote:
>>>>>> All osds are currently up:
>>>>>> 
>>>>>>   health HEALTH_WARN
>>>>>>  4 pgs stuck unclean
>>>>>>  recovery 4482/58798254 objects degraded (0.008%)
>>>>>>  recovery 420522/58798254 objects misplaced (0.715%)
>>>>>>  noscrub,nodeep-scrub flag(s) set
>>>>>>   monmap e9: 5 mons at
>>>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>>>  election epoch 478, quorum 0,1,2,3,4
>>>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>>>>   osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>>>  flags noscrub,nodeep-scrub
>>>>>>pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>>>  15070 GB used, 40801 GB / 55872 GB avail
>>>>>>  4482/58798254 objects degraded (0.008%)
>>>>>>  420522/58798254 objects misplaced (0.715%)
>>>>>>   316 active+clean
>>>>>> 4 active+remapped
>>>>>> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>>>> 
>>>>>> This did not chance for two days or so.
>>>>>> 
>>>>>> 
>>>>>> By the way, my ceph osd df now looks like this:
>>>>>> 
>>>>>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>>>>>> 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69
>>>>>> 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70
>>>>>> 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69
>>>>>> 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62
>>>>>> 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61
>>>>>> 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62
>>>>>> 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63
>>>>>> 7 9.75499  1.0  7450G  2497G  4952G 33.52 1.24
>>>>>> 8 9.32999  1.0  7450G  2495G  4954G 33.

Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-10 Thread Marcus Müller
Ok, thanks. Then I will change the tunables.

As far as I can see, this would already help me: ceph osd crush tunables bobtail

Even though we run Ceph Hammer, this should work according to the documentation,
am I right?

And: I'm using librados for our clients (Hammer too). Could this change create
problems (even on older kernels)?
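
Before and after the change I would probably compare the output of the following,
to see which tunables are actually in effect (as far as I know this also reports
the minimum required client version):

ceph osd crush show-tunables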


> Am 10.01.2017 um 17:50 schrieb Samuel Just <sj...@redhat.com>:
> 
> Shinobu isn't correct, you have 9/9 osds up and running.  up does not
> equal acting because crush is having trouble fulfilling the weights in
> your crushmap and the acting set is being padded out with an extra osd
> which happens to have the data to keep you up to the right number of
> replicas.  Please refer back to Brad's post.
> -Sam
> 
>> On Mon, Jan 9, 2017 at 11:08 PM, Marcus Müller <mueller.mar...@posteo.de> 
>> wrote:
>> Ok, i understand but how can I debug why they are not running as they 
>> should? For me I thought everything is fine because ceph -s said they are up 
>> and running.
>> 
>> I would think of a problem with the crush map.
>> 
>>> Am 10.01.2017 um 08:06 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>> 
>>> e.g.,
>>> OSD7 / 3 / 0 are in the same acting set. They should be up, if they
>>> are properly running.
>>> 
>>> # 9.7
>>> 
>>>> "up": [
>>>> 7,
>>>> 3
>>>> ],
>>>> "acting": [
>>>> 7,
>>>> 3,
>>>> 0
>>>> ],
>>> 
>>> 
>>> Here is an example:
>>> 
>>> "up": [
>>>  1,
>>>  0,
>>>  2
>>> ],
>>> "acting": [
>>>  1,
>>>  0,
>>>  2
>>> ],
>>> 
>>> Regards,
>>> 
>>> 
>>> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>> wrote:
>>>>> 
>>>>> That's not perfectly correct.
>>>>> 
>>>>> OSD.0/1/2 seem to be down.
>>>> 
>>>> 
>>>> Sorry but where do you see this? I think this indicates that they are up:  
>>>>  osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
>>>> 
>>>> 
>>>>> Am 10.01.2017 um 07:50 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>>>> 
>>>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>>>> wrote:
>>>>>> All osds are currently up:
>>>>>> 
>>>>>>  health HEALTH_WARN
>>>>>> 4 pgs stuck unclean
>>>>>> recovery 4482/58798254 objects degraded (0.008%)
>>>>>> recovery 420522/58798254 objects misplaced (0.715%)
>>>>>> noscrub,nodeep-scrub flag(s) set
>>>>>>  monmap e9: 5 mons at
>>>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>>> election epoch 478, quorum 0,1,2,3,4
>>>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>>>>  osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>>> flags noscrub,nodeep-scrub
>>>>>>   pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>>> 15070 GB used, 40801 GB / 55872 GB avail
>>>>>> 4482/58798254 objects degraded (0.008%)
>>>>>> 420522/58798254 objects misplaced (0.715%)
>>>>>>  316 active+clean
>>>>>>4 active+remapped
>>>>>> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>>>> 
>>>>>> This did not chance for two days or so.
>>>>>> 
>>>>>> 
>>>>>> By the way, my ceph osd df now looks like this:
>>>>>> 
>>>>>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>>>>>> 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69
>>>>>> 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70
>>>>>> 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69
>>>>>> 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62
>>>>>> 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61
>>>>>> 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62
>>>>>> 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63
>>>>>> 7 9.75499  

Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Marcus Müller
Ok, I understand, but how can I debug why they are not running as they should?
I thought everything was fine because ceph -s said they are up and running.

I would suspect a problem with the crush map.
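
As a starting point I would look at the stuck PGs and at what CRUSH computes for
them, something like this (pg 9.7 is taken from the pg query output quoted below;
the crushtool test run is my assumption of how to replay the map offline):

# list the PGs that are stuck unclean
ceph pg dump_stuck unclean
# look at one of them in detail (up vs. acting set, recovery state)
ceph pg 9.7 query
# replay the current crush map offline and show the computed mappings
ceph osd getcrushmap -o /tmp/crush
crushtool -i /tmp/crush --test --show-mappings --rule 0 --num-rep 3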

> Am 10.01.2017 um 08:06 schrieb Shinobu Kinjo <ski...@redhat.com>:
> 
> e.g.,
> OSD7 / 3 / 0 are in the same acting set. They should be up, if they
> are properly running.
> 
> # 9.7
> 
>>   "up": [
>>   7,
>>   3
>>   ],
>>   "acting": [
>>   7,
>>   3,
>>   0
>>   ],
> 
> 
> Here is an example:
> 
>  "up": [
>    1,
>0,
>2
>  ],
>  "acting": [
>1,
>0,
>2
>   ],
> 
> Regards,
> 
> 
> On Tue, Jan 10, 2017 at 3:52 PM, Marcus Müller <mueller.mar...@posteo.de> 
> wrote:
>>> 
>>> That's not perfectly correct.
>>> 
>>> OSD.0/1/2 seem to be down.
>> 
>> 
>> Sorry but where do you see this? I think this indicates that they are up:   
>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
>> 
>> 
>>> Am 10.01.2017 um 07:50 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>> 
>>> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.mar...@posteo.de> 
>>> wrote:
>>>> All osds are currently up:
>>>> 
>>>>health HEALTH_WARN
>>>>   4 pgs stuck unclean
>>>>   recovery 4482/58798254 objects degraded (0.008%)
>>>>   recovery 420522/58798254 objects misplaced (0.715%)
>>>>   noscrub,nodeep-scrub flag(s) set
>>>>monmap e9: 5 mons at
>>>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>>>   election epoch 478, quorum 0,1,2,3,4
>>>> ceph1,ceph2,ceph3,ceph4,ceph5
>>>>osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>>   flags noscrub,nodeep-scrub
>>>> pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>>>   15070 GB used, 40801 GB / 55872 GB avail
>>>>   4482/58798254 objects degraded (0.008%)
>>>>   420522/58798254 objects misplaced (0.715%)
>>>>316 active+clean
>>>>  4 active+remapped
>>>> client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>>>> 
>>>> This did not chance for two days or so.
>>>> 
>>>> 
>>>> By the way, my ceph osd df now looks like this:
>>>> 
>>>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>>>> 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69
>>>> 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70
>>>> 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69
>>>> 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62
>>>> 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61
>>>> 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62
>>>> 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63
>>>> 7 9.75499  1.0  7450G  2497G  4952G 33.52 1.24
>>>> 8 9.32999  1.0  7450G  2495G  4954G 33.49 1.24
>>>> TOTAL 55872G 15071G 40801G 26.97
>>>> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>>>> 
>>>> As you can see, now osd2 also went down to 45% Use and „lost“ data. But I
>>>> also think this is no problem and ceph just clears everything up after
>>>> backfilling.
>>>> 
>>>> 
>>>> Am 10.01.2017 um 07:29 schrieb Shinobu Kinjo <ski...@redhat.com>:
>>>> 
>>>> Looking at ``ceph -s`` you originally provided, all OSDs are up.
>>>> 
>>>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>>> 
>>>> 
>>>> But looking at ``pg query``, OSD.0 / 1 are not up. Are they something
>>> 
>>> That's not perfectly correct.
>>> 
>>> OSD.0/1/2 seem to be down.
>>> 
>>>> like related to ?:
>>>> 
>>>> Ceph1, ceph2 and ceph3 are vms on one physical host
>>>> 
>>>> 
>>>> Are those OSDs running on vm instances?
>>>> 
>>>> # 9.7
>>>> 
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>> 7,
>>>> 3
>>>> ],
>>>> "acting": [
>>>> 7,
>>>> 3,
>>>> 0
>>>> ],
>>>> 
>>>> 
>>>> 
>>>> # 7.84
>>>> 
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>> 4,
>>>> 8
>>>> ],
>>>> "acting": [
>>>> 4,
>>>> 8,
>>>> 1
>>>> ],
>>>> 
>>>> 
>>>> 
>>>> # 8.1b
>>>> 
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>> 4,
>>>> 7
>>>> ],
>>>> "acting": [
>>>> 4,
>>>> 7,
>>>> 2
>>>> ],
>>>> 
>>>> 
>>>> 
>>>> # 7.7a
>>>> 
>>>> 
>>>> "state": "active+remapped",
>>>> "snap_trimq": "[]",
>>>> "epoch": 3114,
>>>> "up": [
>>>> 7,
>>>> 4
>>>> ],
>>>> "acting": [
>>>> 7,
>>>> 4,
>>>> 2
>>>> ],
>>>> 
>>>> 
>>>> 
>>>> 
>> 



Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Marcus Müller
> 
> That's not perfectly correct.
> 
> OSD.0/1/2 seem to be down.


Sorry, but where do you see this? I think this indicates that they are up:
osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs?
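
For a single PG the up vs. acting sets can also be checked directly, e.g. for
pg 9.7 from the query output quoted below:

# prints something like: osdmap e3114 pg 9.7 -> up [7,3] acting [7,3,0]
ceph pg map 9.7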


> Am 10.01.2017 um 07:50 schrieb Shinobu Kinjo <ski...@redhat.com>:
> 
> On Tue, Jan 10, 2017 at 3:44 PM, Marcus Müller <mueller.mar...@posteo.de> 
> wrote:
>> All osds are currently up:
>> 
>> health HEALTH_WARN
>>4 pgs stuck unclean
>>recovery 4482/58798254 objects degraded (0.008%)
>>recovery 420522/58798254 objects misplaced (0.715%)
>>noscrub,nodeep-scrub flag(s) set
>> monmap e9: 5 mons at
>> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
>>election epoch 478, quorum 0,1,2,3,4
>> ceph1,ceph2,ceph3,ceph4,ceph5
>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>>flags noscrub,nodeep-scrub
>>  pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
>>15070 GB used, 40801 GB / 55872 GB avail
>>4482/58798254 objects degraded (0.008%)
>>420522/58798254 objects misplaced (0.715%)
>> 316 active+clean
>>   4 active+remapped
>>  client io 56601 B/s rd, 45619 B/s wr, 0 op/s
>> 
>> This did not chance for two days or so.
>> 
>> 
>> By the way, my ceph osd df now looks like this:
>> 
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>> 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69
>> 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70
>> 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69
>> 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62
>> 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61
>> 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62
>> 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63
>> 7 9.75499  1.0  7450G  2497G  4952G 33.52 1.24
>> 8 9.32999  1.0  7450G  2495G  4954G 33.49 1.24
>>  TOTAL 55872G 15071G 40801G 26.97
>> MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16
>> 
>> As you can see, now osd2 also went down to 45% Use and „lost“ data. But I
>> also think this is no problem and ceph just clears everything up after
>> backfilling.
>> 
>> 
>> Am 10.01.2017 um 07:29 schrieb Shinobu Kinjo <ski...@redhat.com>:
>> 
>> Looking at ``ceph -s`` you originally provided, all OSDs are up.
>> 
>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
>> 
>> 
>> But looking at ``pg query``, OSD.0 / 1 are not up. Are they something
> 
> That's not perfectly correct.
> 
> OSD.0/1/2 seem to be down.
> 
>> like related to ?:
>> 
>> Ceph1, ceph2 and ceph3 are vms on one physical host
>> 
>> 
>> Are those OSDs running on vm instances?
>> 
>> # 9.7
>> 
>> 
>>  "state": "active+remapped",
>>  "snap_trimq": "[]",
>>  "epoch": 3114,
>>  "up": [
>>  7,
>>  3
>>  ],
>>  "acting": [
>>  7,
>>  3,
>>  0
>>  ],
>> 
>> 
>> 
>> # 7.84
>> 
>> 
>>  "state": "active+remapped",
>>  "snap_trimq": "[]",
>>  "epoch": 3114,
>> "up": [
>>  4,
>>  8
>>  ],
>>  "acting": [
>>  4,
>>  8,
>>  1
>>  ],
>> 
>> 
>> 
>> # 8.1b
>> 
>> 
>>  "state": "active+remapped",
>>  "snap_trimq": "[]",
>>  "epoch": 3114,
>>  "up": [
>>  4,
>>  7
>>  ],
>>  "acting": [
>>  4,
>>  7,
>>  2
>>  ],
>> 
>> 
>> 
>> # 7.7a
>> 
>> 
>>  "state": "active+remapped",
>>  "snap_trimq": "[]",
>>  "epoch": 3114,
>>  "up": [
>>  7,
>>  4
>>  ],
>>  "acting": [
>>  7,
>>  4,
>>  2
>>  ],
>> 
>> 
>> 
>> 



Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Marcus Müller
All osds are currently up:

 health HEALTH_WARN
4 pgs stuck unclean
recovery 4482/58798254 objects degraded (0.008%)
recovery 420522/58798254 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set
 monmap e9: 5 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
 osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v9981077: 320 pgs, 3 pools, 4837 GB data, 19140 kobjects
15070 GB used, 40801 GB / 55872 GB avail
4482/58798254 objects degraded (0.008%)
420522/58798254 objects misplaced (0.715%)
 316 active+clean
   4 active+remapped
  client io 56601 B/s rd, 45619 B/s wr, 0 op/s

This did not change for two days or so.


By the way, my ceph osd df now looks like this:

ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 1.28899  1.0  3724G  1699G  2024G 45.63 1.69 
 1 1.57899  1.0  3724G  1708G  2015G 45.87 1.70 
 2 1.68900  1.0  3724G  1695G  2028G 45.54 1.69 
 3 6.78499  1.0  7450G  1241G  6208G 16.67 0.62 
 4 8.3  1.0  7450G  1228G  6221G 16.49 0.61 
 5 9.51500  1.0  7450G  1239G  6210G 16.64 0.62 
 6 7.66499  1.0  7450G  1265G  6184G 16.99 0.63 
 7 9.75499  1.0  7450G  2497G  4952G 33.52 1.24 
 8 9.32999  1.0  7450G  2495G  4954G 33.49 1.24 
  TOTAL 55872G 15071G 40801G 26.97  
MIN/MAX VAR: 0.61/1.70  STDDEV: 13.16

As you can see, osd2 has now also gone down to 45% use and "lost" data. But I also
think this is no problem and Ceph just clears everything up after backfilling.


> Am 10.01.2017 um 07:29 schrieb Shinobu Kinjo :
> 
> Looking at ``ceph -s`` you originally provided, all OSDs are up.
> 
>> osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
> 
> But looking at ``pg query``, OSD.0 / 1 are not up. Are they something
> like related to ?:
> 
>> Ceph1, ceph2 and ceph3 are vms on one physical host
> 
> Are those OSDs running on vm instances?
> 
> # 9.7
> 
>>   "state": "active+remapped",
>>   "snap_trimq": "[]",
>>   "epoch": 3114,
>>   "up": [
>>   7,
>>   3
>>   ],
>>   "acting": [
>>   7,
>>   3,
>>   0
>>   ],
> 
> 
> # 7.84
> 
>>   "state": "active+remapped",
>>   "snap_trimq": "[]",
>>   "epoch": 3114,
>>  "up": [
>>   4,
>>   8
>>   ],
>>   "acting": [
>>   4,
>>   8,
>>   1
>>   ],
> 
> 
> # 8.1b
> 
>>   "state": "active+remapped",
>>   "snap_trimq": "[]",
>>   "epoch": 3114,
>>   "up": [
>>   4,
>>   7
>>   ],
>>   "acting": [
>>   4,
>>   7,
>>   2
>>   ],
> 
> 
> # 7.7a
> 
>>   "state": "active+remapped",
>>   "snap_trimq": "[]",
>>   "epoch": 3114,
>>   "up": [
>>   7,
>>   4
>>   ],
>>   "acting": [
>>   7,
>>   4,
>>   2
>>   ],
> 



Re: [ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Marcus Müller
> Trying google with "ceph pg stuck in active and remapped" points to a couple 
> of post on this ML typically indicating that it's a problem with the CRUSH 
> map and ceph being unable to satisfy the mapping rules. Your ceph -s output 
> indicates that your using replication of size 3 in your pools. You also said 
> you had a custom CRUSH map - can you post it?

I've sent the file to you, since I'm not sure whether it contains sensitive data.
Yes, I have replication of 3, and the map was not customized by me.


> I might be missing something here but I don't quite see how you come to this 
> statement. ceph osd df and ceph -s both show 16093 GB used and 39779 GB out 
> of 55872 GB available. The sum of the first 3 OSDs used space is, as you 
> stated, 6181 GB which is approx 38.4% so quite close to your target of 33%

Maybe I have to explain it another way:

Directly after finishing the backfill I received this output:

 health HEALTH_WARN
4 pgs stuck unclean
recovery 1698/58476648 objects degraded (0.003%)
recovery 418137/58476648 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set
 monmap e9: 5 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
election epoch 464, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
 osdmap e3086: 9 osds: 9 up, 9 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v9928160: 320 pgs, 3 pools, 4809 GB data, 19035 kobjects
16093 GB used, 39779 GB / 55872 GB avail
1698/58476648 objects degraded (0.003%)
418137/58476648 objects misplaced (0.715%)
 316 active+clean
   4 active+remapped
  client io 757 kB/s rd, 1 op/s

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 1.28899  1.0  3724G  1924G  1799G 51.67 1.79 
 1 1.57899  1.0  3724G  2143G  1580G 57.57 2.00 
 2 1.68900  1.0  3724G  2114G  1609G 56.78 1.97 
 3 6.78499  1.0  7450G  1234G  6215G 16.57 0.58 
 4 8.3  1.0  7450G  1221G  6228G 16.40 0.57 
 5 9.51500  1.0  7450G  1232G  6217G 16.54 0.57 
 6 7.66499  1.0  7450G  1258G  6191G 16.89 0.59 
 7 9.75499  1.0  7450G  2482G  4967G 33.33 1.16 
 8 9.32999  1.0  7450G  2480G  4969G 33.30 1.16 
  TOTAL 55872G 16093G 39779G 28.80  
MIN/MAX VAR: 0.57/2.00  STDDEV: 17.54

Here we can see that the cluster is using 4809 GB of data and has 16093 GB raw
used, or, put the other way, only 39779 GB available.

Two days later I saw:

 health HEALTH_WARN
4 pgs stuck unclean
recovery 3486/58726035 objects degraded (0.006%)
recovery 420024/58726035 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set
 monmap e9: 5 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
 osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v9969059: 320 pgs, 3 pools, 4830 GB data, 19116 kobjects
15150 GB used, 40722 GB / 55872 GB avail
3486/58726035 objects degraded (0.006%)
420024/58726035 objects misplaced (0.715%)
 316 active+clean
   4 active+remapped

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 1.28899  1.0  3724G  1696G  2027G 45.56 1.68 
 1 1.57899  1.0  3724G  1705G  2018G 45.80 1.69 
 2 1.68900  1.0  3724G  1794G  1929G 48.19 1.78 
 3 6.78499  1.0  7450G  1239G  6210G 16.64 0.61 
 4 8.3  1.0  7450G  1226G  6223G 16.46 0.61 
 5 9.51500  1.0  7450G  1237G  6212G 16.61 0.61 
 6 7.66499  1.0  7450G  1263G  6186G 16.96 0.63 
 7 9.75499  1.0  7450G  2493G  4956G 33.47 1.23 
 8 9.32999  1.0  7450G  2491G  4958G 33.44 1.23 
  TOTAL 55872G 15150G 40722G 27.12  
MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54


As you can see, we are now using 4830 GB of data BUT raw used is only 15150 GB,
or, put the other way, we now have 40722 GB free. You can see the change in the
%USE of the OSDs. To me this looks like some data has been lost, since Ceph did
not do any backfill or other operation. That's the problem...
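
As a rough sanity check (my own back-of-the-envelope numbers, assuming size 3 on
all pools):

before: 3 x 4809 GB = 14427 GB expected raw  vs. 16093 GB actually used
after:  3 x 4830 GB = 14490 GB expected raw  vs. 15150 GB actually used

So in both snapshots the raw usage is still above the pure three-replica footprint
of the pool data.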


> Am 09.01.2017 um 21:55 schrieb Christian Wuerdig 
> <christian.wuer...@gmail.com>:
> 
> 
> 
> On Tue, Jan 10, 2017 at 8:23 AM, Marcus Müller <mueller.mar...@posteo.de 
> <mailto:mueller.mar...@posteo.de>> wrote:
> Hi all,
> 
> Recently I added a new node with new osds to my cluster, which, of course 
> resulted in backfilling. At the end, there are 4 pgs left in the state 4 
> active+remapped and I don’t know what to do. 
> 
> Here is how my cluster looks like currently: 
> 
> c

[ceph-users] PGs stuck active+remapped and osds lose data?!

2017-01-09 Thread Marcus Müller
Hi all,

Recently I added a new node with new OSDs to my cluster, which of course
resulted in backfilling. At the end, there are 4 PGs left in the state
active+remapped and I don't know what to do.

Here is how my cluster looks like currently: 

ceph -s
 health HEALTH_WARN
4 pgs stuck unclean
recovery 3586/58734009 objects degraded (0.006%)
recovery 420074/58734009 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set
 monmap e9: 5 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0,ceph5=192.168.60.11:6789/0}
election epoch 478, quorum 0,1,2,3,4 ceph1,ceph2,ceph3,ceph4,ceph5
 osdmap e3114: 9 osds: 9 up, 9 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v9970276: 320 pgs, 3 pools, 4831 GB data, 19119 kobjects
15152 GB used, 40719 GB / 55872 GB avail
3586/58734009 objects degraded (0.006%)
420074/58734009 objects misplaced (0.715%)
 316 active+clean
   4 active+remapped
  client io 643 kB/s rd, 7 op/s

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 1.28899  1.0  3724G  1697G  2027G 45.57 1.68 
 1 1.57899  1.0  3724G  1706G  2018G 45.81 1.69 
 2 1.68900  1.0  3724G  1794G  1929G 48.19 1.78 
 3 6.78499  1.0  7450G  1240G  6209G 16.65 0.61 
 4 8.3  1.0  7450G  1226G  6223G 16.47 0.61 
 5 9.51500  1.0  7450G  1237G  6212G 16.62 0.61 
 6 7.66499  1.0  7450G  1264G  6186G 16.97 0.63 
 7 9.75499  1.0  7450G  2494G  4955G 33.48 1.23 
 8 9.32999  1.0  7450G  2491G  4958G 33.45 1.23 
  TOTAL 55872G 15152G 40719G 27.12  
MIN/MAX VAR: 0.61/1.78  STDDEV: 13.54

# ceph health detail
HEALTH_WARN 4 pgs stuck unclean; recovery 3586/58734015 objects degraded 
(0.006%); recovery 420074/58734015 objects misplaced (0.715%); 
noscrub,nodeep-scrub flag(s) set
pg 9.7 is stuck unclean for 512936.160212, current state active+remapped, last 
acting [7,3,0]
pg 7.84 is stuck unclean for 512623.894574, current state active+remapped, last 
acting [4,8,1]
pg 8.1b is stuck unclean for 513164.616377, current state active+remapped, last 
acting [4,7,2]
pg 7.7a is stuck unclean for 513162.316328, current state active+remapped, last 
acting [7,4,2]
recovery 3586/58734015 objects degraded (0.006%)
recovery 420074/58734015 objects misplaced (0.715%)
noscrub,nodeep-scrub flag(s) set

# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 56.00693 root default 
-2  1.28899 host ceph1   
 0  1.28899 osd.0   up  1.0  1.0 
-3  1.57899 host ceph2   
 1  1.57899 osd.1   up  1.0  1.0 
-4  1.68900 host ceph3   
 2  1.68900 osd.2   up  1.0  1.0 
-5 32.36497 host ceph4   
 3  6.78499 osd.3   up  1.0  1.0 
 4  8.3 osd.4   up  1.0  1.0 
 5  9.51500 osd.5   up  1.0  1.0 
 6  7.66499 osd.6   up  1.0  1.0 
-6 19.08498 host ceph5   
 7  9.75499 osd.7   up  1.0  1.0 
 8  9.32999 osd.8   up  1.0  1.0 

I'm using a customized crushmap because, as you can see, this cluster is not very
optimal. ceph1, ceph2 and ceph3 are VMs on one physical host; ceph4 and ceph5 are
both separate physical hosts. So the idea is to spread 33% of the data across
ceph1, ceph2 and ceph3, and the other 66% between ceph4 and ceph5.

The backfilling itself went fine, but those 4 PGs have now been stuck
active+remapped for 2 days while the number of degraded objects increases.

I restarted all OSDs one after another, but this did not really help. At first it
showed no degraded objects, and then the number increased again.

What can I do to get those PGs back into the active+clean state? My idea was to
increase the weight of an OSD a little bit, to make Ceph recalculate the mapping.
Is this a good idea?
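
Just to make sure I mean the right command: such a small bump would look roughly
like this (the OSD id and the new weight are only examples, slightly above the
current 9.75499 of osd.7):

# nudge the CRUSH weight of one OSD to force a recalculation of the mapping
ceph osd crush reweight osd.7 9.76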

---

On the other hand, I saw something very strange too: after the backfill was done
(2 days ago), my ceph osd df looked like this:

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 1.28899  1.0  3724G  1924G  1799G 51.67 1.79 
 1 1.57899  1.0  3724G  2143G  1580G 57.57 2.00 
 2 1.68900  1.0  3724G  2114G  1609G 56.78 1.97 
 3 6.78499  1.0  7450G  1234G  6215G 16.57 0.58 
 4 8.3  1.0  7450G  1221G  6228G 16.40 0.57 
 5 9.51500  1.0  7450G  1232G  6217G 16.54 0.57 
 6 7.66499  1.0  7450G  1258G  6191G 16.89 0.59 
 7 9.75499  1.0  7450G  2482G  4967G 33.33 1.16 
 8 9.32999  1.0  7450G  2480G  4969G 33.30 1.16 
  TOTAL 55872G 

[ceph-users] Failed to install ceph via ceph-deploy on Ubuntu 14.04 trusty

2017-01-02 Thread Marcus Müller
Hi all,

I tried to install ceph on a new node with ceph-deploy 1.5.35 but it fails. 
Here is the output: 

# ceph-deploy install --release hammer ceph5
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.35): /usr/bin/ceph-deploy install 
--release hammer ceph5
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  testing   : None
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  dev_commit: None
[ceph_deploy.cli][INFO  ]  install_mds   : False
[ceph_deploy.cli][INFO  ]  stable: None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  adjust_repos  : True
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  install_all   : False
[ceph_deploy.cli][INFO  ]  repo  : False
[ceph_deploy.cli][INFO  ]  host  : ['ceph5']
[ceph_deploy.cli][INFO  ]  install_rgw   : False
[ceph_deploy.cli][INFO  ]  install_tests : False
[ceph_deploy.cli][INFO  ]  repo_url  : None
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  install_osd   : False
[ceph_deploy.cli][INFO  ]  version_kind  : stable
[ceph_deploy.cli][INFO  ]  install_common: False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  dev   : master
[ceph_deploy.cli][INFO  ]  nogpgcheck: False
[ceph_deploy.cli][INFO  ]  local_mirror  : None
[ceph_deploy.cli][INFO  ]  release   : hammer
[ceph_deploy.cli][INFO  ]  install_mon   : False
[ceph_deploy.cli][INFO  ]  gpg_url   : None
[ceph_deploy.install][DEBUG ] Installing stable version hammer on cluster ceph 
hosts ceph5
[ceph_deploy.install][DEBUG ] Detecting platform for host ceph5 ...
[ceph5][DEBUG ] connected to host: ceph5 
[ceph5][DEBUG ] detect platform information from remote host
[ceph5][DEBUG ] detect machine type
[ceph5][DEBUG ] find the location of an executable
[ceph5][INFO  ] Running command: /sbin/initctl version
[ceph_deploy.install][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph5][INFO  ] installing Ceph on ceph5
[ceph5][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive 
DEBIAN_PRIORITY=critical apt-get --assume-yes -q --no-install-recommends 
install ca-certificates apt-transport-https
[ceph5][DEBUG ] Paketlisten werden gelesen...
[ceph5][DEBUG ] Abhängigkeitsbaum wird aufgebaut
[ceph5][DEBUG ] Statusinformationen werden eingelesen
[ceph5][DEBUG ] apt-transport-https ist schon die neueste Version.
[ceph5][DEBUG ] ca-certificates ist schon die neueste Version.
[ceph5][DEBUG ] 0 aktualisiert, 0 neu installiert, 0 zu entfernen und 0 nicht 
aktualisiert.
[ceph5][INFO  ] Running command: wget -O release.asc 
https://download.ceph.com/keys/release.asc
[ceph5][WARNIN] --2017-01-03 01:04:56--  
https://download.ceph.com/keys/release.asc
[ceph5][WARNIN] Auflösen des Hostnamen »download.ceph.com 
(download.ceph.com)«... 208.113.164.182, 2607:f298:5:101d:f816:3eff:fe3e:f774
[ceph5][WARNIN] Verbindungsaufbau zu download.ceph.com 
(download.ceph.com)|208.113.164.182|:443... verbunden.
[ceph5][WARNIN] HTTP-Anforderung gesendet, warte auf Antwort... 200 OK
[ceph5][WARNIN] Länge: 1645 (1,6K) [application/octet-stream]
[ceph5][WARNIN] In »»release.asc«« speichern.
[ceph5][WARNIN] 
[ceph5][WARNIN]  0K . 
100%  421M=0s
[ceph5][WARNIN] 
[ceph5][WARNIN] 2017-01-03 01:04:57 (421 MB/s) - »release.asc« gespeichert 
[1645/1645]
[ceph5][WARNIN] 
[ceph5][INFO  ] Running command: apt-key add release.asc
[ceph5][DEBUG ] OK
[ceph5][DEBUG ] add deb repo to /etc/apt/sources.list.d/
[ceph5][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive 
DEBIAN_PRIORITY=critical apt-get --assume-yes -q update
[ceph5][DEBUG ] OK   http://archive.thomas-krenn.com trusty InRelease
[ceph5][DEBUG ] Ign http://de.archive.ubuntu.com trusty InRelease
[ceph5][DEBUG ] OK   http://security.ubuntu.com trusty-security InRelease
[ceph5][DEBUG ] OK   http://de.archive.ubuntu.com trusty-updates InRelease
[ceph5][DEBUG ] OK   http://archive.thomas-krenn.com trusty/optional amd64 
Packages
[ceph5][DEBUG ] OK   http://de.archive.ubuntu.com trusty-backports InRelease
[ceph5][DEBUG ] OK   http://security.ubuntu.com trusty-security/main Sources
[ceph5][DEBUG ] OK   http://de.archive.ubuntu.com 

Re: [ceph-users] docs.ceph.com down?

2017-01-02 Thread Marcus Müller
Yes, that's a possible workaround. You can also use the Google cache.

Thanks for verifying, and for the help.

> Am 02.01.2017 um 20:47 schrieb Sean Redmond <sean.redmo...@gmail.com>:
> 
> If you need the docs you can try reading them here
> 
> https://github.com/ceph/ceph/tree/master/doc
> 
> On Mon, Jan 2, 2017 at 7:45 PM, Andre Forigato <andre.forig...@rnp.br> wrote:
> Hello Marcus,
> 
> Yes, it´s down. :-(
> 
> 
> André
> 
> - Mensagem original -
> > De: "Marcus Müller" <mueller.mar...@posteo.de>
> > Para: ceph-users@lists.ceph.com
> > Enviadas: Segunda-feira, 2 de janeiro de 2017 16:55:13
> > Assunto: [ceph-users] docs.ceph.com down?
> 
> > Hi all,
> 
> > I can not reach docs.ceph.com for some days. Is the site really down or do I
> > have a problem here?
> 
> > Regards,
> > Marcus
> 



[ceph-users] docs.ceph.com down?

2017-01-02 Thread Marcus Müller
Hi all,

I have not been able to reach docs.ceph.com for some days. Is the
site really down, or do I have a problem here?

Regards,
Marcus


[ceph-users] docs.ceph.com down?

2017-01-02 Thread Marcus Müller
Hi all,

I have not been able to reach docs.ceph.com for some days. Is the
site really down, or do I have a problem?

Regards,
Marcus


[ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Marcus Müller
Hi all,

I have a big problem and I really hope someone can help me!

We have been running a Ceph cluster for a year now. The version is 0.94.7 (Hammer).
Here is some info:

Our osd map is:

ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 26.67998 root default 
-2  3.64000 host ceph1   
 0  3.64000 osd.0   up  1.0  1.0 
-3  3.5 host ceph2   
 1  3.5 osd.1   up  1.0  1.0 
-4  3.64000 host ceph3   
 2  3.64000 osd.2   up  1.0  1.0 
-5 15.89998 host ceph4   
 3  4.0 osd.3   up  1.0  1.0 
 4  3.5 osd.4   up  1.0  1.0 
 5  3.2 osd.5   up  1.0  1.0 
 6  5.0 osd.6   up  1.0  1.0 

ceph df:

GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED 
40972G 26821G   14151G 34.54 
POOLS:
NAMEID USED  %USED MAX AVAIL OBJECTS 
blocks  7  4490G 10.96 1237G 7037004 
commits 8   473M 0 1237G  802353 
fs  9  9666M  0.02 1237G 7863422 

ceph osd df:

ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 3.64000  1.0  3724G  3128G   595G 84.01 2.43 
 1 3.5  1.0  3724G  3237G   487G 86.92 2.52 
 2 3.64000  1.0  3724G  3180G   543G 85.41 2.47 
 3 4.0  1.0  7450G  1616G  5833G 21.70 0.63 
 4 3.5  1.0  7450G  1246G  6203G 16.74 0.48 
 5 3.2  1.0  7450G  1181G  6268G 15.86 0.46 
 6 5.0  1.0  7450G   560G  6889G  7.52 0.22 
  TOTAL 40972G 14151G 26820G 34.54  
MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53


Our current cluster state is: 

 health HEALTH_WARN
63 pgs backfill
8 pgs backfill_toofull
9 pgs backfilling
11 pgs degraded
1 pgs recovering
10 pgs recovery_wait
11 pgs stuck degraded
89 pgs stuck unclean
recovery 8237/52179437 objects degraded (0.016%)
recovery 9620295/52179437 objects misplaced (18.437%)
2 near full osd(s)
noout,noscrub,nodeep-scrub flag(s) set
 monmap e8: 4 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
 osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
flags noout,noscrub,nodeep-scrub
  pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
14152 GB used, 26820 GB / 40972 GB avail
8237/52179437 objects degraded (0.016%)
9620295/52179437 objects misplaced (18.437%)
 231 active+clean
  61 active+remapped+wait_backfill
   9 active+remapped+backfilling
   6 active+recovery_wait+degraded+remapped
   6 active+remapped+backfill_toofull
   4 active+recovery_wait+degraded
   2 active+remapped+wait_backfill+backfill_toofull
   1 active+recovering+degraded
recovery io 11754 kB/s, 35 objects/s
  client io 1748 kB/s rd, 249 kB/s wr, 44 op/s
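
(Side note on the backfill_toofull PGs above: if I understand the docs correctly,
this state is governed by osd_backfill_full_ratio, which should default to 0.85 on
Hammer. As a temporary measure while rebalancing, something like the following is
often suggested; 0.92 is just an example value:)

ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.92'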


My main problems are: 

- As you can see from the osd tree, we have three separate hosts with only one
OSD each; another host has four OSDs. Ceph does not let me move the data off
these three single-HDD nodes, which are all near full. I tried to set the
weight of the OSDs in the bigger node higher, but this just does not work.
So I added a new OSD yesterday, which did not make things better, as you can
see now. What do I have to do to empty these three nodes again and put more
data on the other node with the four HDDs?

- I added the "ceph4" node later; this resulted in a strange IP change, as you
can see in the mon list. The public network and the cluster network were
swapped or not assigned correctly. See ceph.conf:

[global]
fsid = xxx
mon_initial_members = ceph1
mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 192.168.60.0/24
cluster_network = 192.168.10.0/24
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128
osd recovery max active = 50
osd recovery threads = 3
mon_pg_warn_max_per_osd = 0

  What can I do in this case (it's not a big problem, since the network is 2x 10
GbE and everything works)?

- One other thing: even if I just prepare the OSD, it is automatically added to
the cluster. I cannot activate it. Has someone else already had such