[ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Hector Martin

Hi list,

I'm slightly expanding the underlying LV for two OSDs and figured I 
could use ceph-bluestore-tool to avoid having to re-create them from 
scratch.


I first shut down the OSD, expanded the LV, and then ran:
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0

I forgot I was using encryption, so the overlying dm-crypt mapping stayed the 
same size when I resized the underlying LV. I was surprised by the output of 
ceph-bluestore-tool, which suggested a size change by a significant amount (I 
was changing the LV size by only a few percent). I then checked the underlying 
`block` device and realized its size had not changed, so the command should 
have been a no-op. I then tried to restart the OSD, and it failed with an I/O 
error. I ended up re-creating that OSD and letting it recover.
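
(For anyone who hits the same situation: with dm-crypt on top of the LV, the 
mapping itself has to be resized before BlueStore can see any new space; 
growing only the LV leaves the `block` device at its old size. A minimal 
sketch of the order I should have used, with made-up VG/LV and mapping names:)

# grow the LV
lvextend -L +4G /dev/<vg>/<osd-lv>
# grow the open dm-crypt mapping on top of it (may prompt for the key)
cryptsetup resize <dmcrypt-mapping-name>
# only then let BlueFS pick up the new size
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-0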


I have another OSD (osd.1) in the original state where I could run this 
test again if needed. Unfortunately I don't have the output of the first 
test any more.


Is `ceph-bluestore-tool bluefs-bdev-expand` supposed to work? I get the 
feeling it gets the size wrong and corrupts OSDs by expanding them too much. 
If this is indeed supposed to work, I'd be happy to test it again with osd.1 
and see if I can get it fixed. Otherwise I'll just re-create it and move on.


# ceph --version 

ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
(stable)


--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov

Hi Hector,

I've never tried bluefs-bdev-expand over encrypted volumes, but it works 
absolutely fine for me in other cases.


So it would be nice to troubleshoot this a bit.

I suggest doing the following (a rough command sketch follows the list):

1) Back up the first 8K of all osd.1 devices (block, db and wal) using dd. 
This will probably allow you to recover from a failed expansion and repeat 
the test multiple times.

2) Collect the current volume sizes with the bluefs-bdev-sizes command and 
the actual device sizes using 'lsblk --bytes'.

3) Do the LVM volume expansion and then collect the device sizes with 
'lsblk --bytes' once again.

4) Invoke bluefs-bdev-expand for osd.1 with CEPH_ARGS="--debug-bluestore 
20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log"
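
A rough sketch of those four steps (device paths, backup file names, VG/LV 
names and the sizes below are placeholders, adjust them to the real layout; 
the osd data dir must be mounted so the block symlink resolves):

# 1) back up the first 8K of each device (block shown, repeat for db/wal if present)
dd if=/var/lib/ceph/osd/ceph-1/block of=/root/osd1-block-first8k.bak bs=4096 count=2
# 2) sizes as BlueFS sees them vs. sizes as the kernel sees them
ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-1
lsblk --bytes
# 3) expand the LV, then re-check
lvextend -L +10G /dev/<vg>/<osd1-lv>
lsblk --bytes
# 4) run the expansion with verbose logging
CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file bluefs-bdev-expand.log" \
  ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-1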


Perhaps it makes sense to open a ticket at the Ceph bug tracker to proceed...


Thanks,

Igor






Re: [ceph-users] `ceph-bluestore-tool bluefs-bdev-expand` corrupts OSDs

2018-12-27 Thread Igor Fedotov

Hector,

One more thing to mention: after the expansion, please run fsck using 
ceph-bluestore-tool prior to starting the OSD daemon, and collect another 
log using the CEPH_ARGS variable.
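
Something along these lines (just a sketch, assuming osd.1's data dir):

CEPH_ARGS="--debug-bluestore 20 --debug-bluefs 20 --log-file bluefs-fsck.log" \
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1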



Thanks,

Igor



[ceph-users] Rgw bucket policy for multi tenant

2018-12-27 Thread Marc Roos


I have seen several posts on the list about bucket policies. How do you 
change this for a multi-tenant user (Tenant$tenuser)?

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::usfolks:user/fred"]},
    "Action": "s3:PutObjectAcl",
    "Resource": [
      "arn:aws:s3:::happybucket/*"
    ]
  }]
}
http://docs.ceph.com/docs/mimic/radosgw/bucketpolicy/
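
My guess from that page would be that the tenant goes into the IAM ARN, i.e. 
arn:aws:iam::<tenant>:user/<user>, but I have not verified it. Something like 
this (bucket/tenant/user names are just the example ones), applied with s3cmd:

cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam::Tenant:user/tenuser"]},
    "Action": "s3:PutObjectAcl",
    "Resource": ["arn:aws:s3:::happybucket/*"]
  }]
}
EOF
s3cmd setpolicy policy.json s3://happybucket

Is that the right way to reference a Tenant$tenuser principal?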






Re: [ceph-users] Migration of a Ceph cluster to a new datacenter and new IPs

2018-12-27 Thread Marcus Müller
Hi all,

Just wanted to explain my experience on how to stop the whole cluster and 
change the IPs.

First, we shut down the cluster with this procedure:

1. Stop the clients from using the RBD images/RADOS Gateway on this cluster, 
as well as any other clients.
2. The cluster must be in a healthy state before proceeding.
3. Set the noout, norecover, norebalance, nobackfill, nodown and pause flags:
#ceph osd set noout
#ceph osd set norecover
#ceph osd set norebalance
#ceph osd set nobackfill
#ceph osd set nodown
#ceph osd set pause
4. Stop all Ceph services (see the sketch after this list):
4.1. First the OSD nodes, one by one
4.2. Lastly the monitor nodes, one by one
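
For step 4, the exact commands depend on the init system; on systemd-managed 
nodes a sketch would be (older sysvinit-style installs use 'service ceph stop' 
instead):

# on each OSD node, one node at a time
systemctl stop ceph-osd.target
# once all OSDs are down, on each monitor node
systemctl stop ceph-mon.target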

Now we extracted the monmap with 'ceph-mon -i {mon-id} --extract-monmap 
/tmp/monmap'
Followed this manual: 
http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
 

And imported the new monmap to each monitor (while they all were stopped), 
changed ceph.conf on all nodes with the new IPs (don’t forget the clients).
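
The actual editing and re-injection follows that 'messy way' page; roughly 
(the mon ID and the new address here are placeholders):

# drop each mon's old entry and re-add it with its new IP
monmaptool --rm mon-a /tmp/monmap
monmaptool --add mon-a 192.168.100.1:6789 /tmp/monmap
# then, still with all mons stopped, inject the edited map into each monitor
ceph-mon -i mon-a --inject-monmap /tmp/monmap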
The last step was to change the IP configuration and the hosts files (in our 
case; again, don’t forget the clients) and shut down the nodes.

In the new datacenter we started the nodes and everything came up as usual.

(1. Power on the admin node)
2. Power on the monitor nodes
3. Power on the OSD nodes
4. Wait for all the nodes to come up, and verify that all the services are
up and that the connectivity between the nodes is fine.
5. Unset the noout, norecover, norebalance, nobackfill, nodown and
pause flags:
#ceph osd unset noout
#ceph osd unset norecover
#ceph osd unset norebalance
#ceph osd unset nobackfill
#ceph osd unset nodown
#ceph osd unset pause
6. Check that the cluster is in a healthy state and verify that all the
clients are able to access it.

I hope this helps someone for the future!


> Am 20.12.2018 um 18:18 schrieb Paul Emmerich :
> 
> I'd do it like this:
> 
> * create 2 new mons with the new IPs
> * update all clients to the 3 new mon IPs
> * delete two old mons
> * create 1 new mon
> * delete the last old mon
> 
> I think it's easier to create/delete mons than to change the IP of an
> existing mon. This doesn't even incur a downtime for the clients
> because they get notified about the new mons.
> 
> For the OSDs: stop OSDs, change IP, start OSDs
> 
> Don't change the IP of a running OSD, they don't like that
> 
> Paul
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> On Wed, Dec 19, 2018 at 8:55 PM Marcus Müller  
> wrote:
>> 
>> Hi all,
>> 
>> we’re running a ceph hammer cluster with 3 mons and 24 osds (3 same nodes) 
>> and need to migrate all servers to a new datacenter and change the IPs of 
>> the nodes.
>> 
>> I found this tutorial: 
>> http://docs.ceph.com/docs/hammer/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
>>  regarding the mons (should be easy) but nothing about the osds and which 
>> steps to do if you need to shutdown and migrate the cluster to a new 
>> datacenter.
>> 
>> Has anyone some ideas, how to and which steps I need?
>> 
>> Regards,
>> Marcus


Re: [ceph-users] size of inc_osdmap vs osdmap

2018-12-27 Thread Sergey Dolgov
We investigated the issue further and raised debug_mon to 20. During a small
osdmap change we get many messages like the following, for every PG of each
pool (across the whole cluster):

> 2018-12-25 19:28:42.426776 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_temp next_up === next_acting now, clear pg_temp
> 2018-12-25 19:28:42.426776 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_temp next_up === next_acting now, clear pg_temp
> 2018-12-25 19:28:42.426777 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_temp next_up === next_acting now, clear pg_temp
> 2018-12-25 19:28:42.426779 7f075af7d700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.1000 [97,812,841]/[] -> [97,812,841]/[97,812,841], priming []
> 2018-12-25 19:28:42.426780 7f075a77c700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 3.0 [84,370,847]/[] -> [84,370,847]/[84,370,847], priming []
> 2018-12-25 19:28:42.426781 7f075977a700 20 mon.1@0(leader).osd e1373789 prime_pg_temp 4.0 [404,857,11]/[] -> [404,857,11]/[404,857,11], priming []

though no pg_temps are created as a result (not a single backfill).

We suppose this behavior changed in commit
https://github.com/ceph/ceph/pull/16530/commits/ea723fbb88c69bd00fefd32a3ee94bf5ce53569c
because earlier the function *OSDMonitor::prime_pg_temp* would have returned at
https://github.com/ceph/ceph/blob/luminous/src/mon/OSDMonitor.cc#L1009, like it
does in Jewel: https://github.com/ceph/ceph/blob/jewel/src/mon/OSDMonitor.cc#L1214

I accept that we may be mistaken.
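
If anyone wants to look at the map contents themselves, something along these
lines should work (a sketch: ceph-dencoder has to come from a matching release,
and the inc_osdmap file is the blob copied out of an OSD's meta directory, as
in the file names quoted below):

# fetch and print a full map for a given epoch
ceph osd getmap 1357883 -o /tmp/osdmap.1357883
osdmaptool --print /tmp/osdmap.1357883 | head
# decode an incremental map blob to JSON
ceph-dencoder type OSDMap::Incremental import /tmp/inc_osdmap.1357882 decode dump_json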


On Wed, Dec 12, 2018 at 10:53 PM Gregory Farnum  wrote:

> Hmm that does seem odd. How are you looking at those sizes?
>
> On Wed, Dec 12, 2018 at 4:38 AM Sergey Dolgov  wrote:
>
>> Greg, for example, for our cluster of ~1000 OSDs:
>>
>> size osdmap.1357881__0_F7FE779D__none = 363KB (crush_version 9860,
>> modified 2018-12-12 04:00:17.661731)
>> size osdmap.1357882__0_F7FE772D__none = 363KB
>> size osdmap.1357883__0_F7FE74FD__none = 363KB (crush_version 9861,
>> modified 2018-12-12 04:00:27.385702)
>> size inc_osdmap.1357882__0_B783A4EA__none = 1.2MB
>>
>> The difference between epochs 1357881 and 1357883 is that the crush weight
>> of one OSD was increased by 0.01, so we get 5 new pg_temp entries in
>> osdmap.1357883, yet inc_osdmap is that huge.
>>
>> чт, 6 дек. 2018 г. в 06:20, Gregory Farnum :
>> >
>> > On Wed, Dec 5, 2018 at 3:32 PM Sergey Dolgov  wrote:
>> >>
>> >> Hi guys
>> >>
>> >> I faced strange behavior on crushmap changes. When I change the crush
>> >> weight of an OSD, I sometimes get an incremental osdmap (1.2MB) whose
>> >> size is significantly bigger than the size of the full osdmap (0.4MB).
>> >
>> >
>> > This is probably because when CRUSH changes, the new primary OSDs for a
>> PG will tend to set a "pg temp" value (in the OSDMap) that temporarily
>> reassigns it to the old acting set, so the data can be accessed while the
>> new OSDs get backfilled. Depending on the size of your cluster, the number
>> of PGs on it, and the size of the CRUSH change, this can easily be larger
>> than the rest of the map because it is data with size linear in the number
>> of PGs affected, instead of being more normally proportional to the number
>> of OSDs.
>> > -Greg
>> >
>> >>
>> >> I use Luminous 12.2.8. The cluster was installed long ago; I suppose it
>> >> was initially Firefly.
>> >> How can I view the content of an incremental osdmap, or can you give me
>> >> your opinion on this problem? I think the spikes of traffic right after
>> >> a crushmap change are related to this behavior.
>>
>>
>>
>> --
>> Best regards, Sergey Dolgov
>>
>

-- 
Best regards, Sergey Dolgov