[ceph-users] ceph on infiniband

2018-06-25 Thread Will Zhao
Hi:
We are using ceph on infiniband, configured with the default
configuration. The ms_type is async + posix. I see there are 3 kinds of
types. Which one is the most stable and gives the best performance? Which one
would you suggest I use in production?
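
For reference, the three messenger types selectable via ms_type are async+posix (the
default), async+rdma and async+dpdk, with async+posix being the widely tested one. A
minimal sketch of what switching to RDMA over the IB fabric would look like in ceph.conf
is below; the device name is only an example and the option names should be verified
against your release before relying on them:

[global]
ms_type = async+rdma
# which RDMA device/port the async messenger should use (example values)
ms_async_rdma_device_name = mlx5_0
ms_async_rdma_port_num = 1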
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread shadow_lin
This is the formatted pg dump result:
https://pasteboard.co/HrBZv3s.png

You can see the pg distribution of each pool on each osd is fine.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 10:32
Subject: Re: Re: Re: [ceph-users] Uneven data distribution with even pg distribution
after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

If you look at ceph pg dump, you'll see the size ceph believes each PG is. From 
your ceph df, your PGs for the rbd_pool will be almost zero. So if you have an 
osd with 6 of those PGs and another with none of them, but both osds have the 
same number of PGs overall... The osd with none of them will be more full than 
the other. I bet that the osd you had that was really full just had less of 
those PGs than the rest.
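
For anyone who wants to check this numerically rather than from a screenshot, a rough
way to sum the reported PG sizes per OSD from ceph pg dump is sketched below. It assumes
a Luminous-era JSON layout and that jq is installed; the field names may differ between
releases, so treat it as a starting point:

ceph pg dump --format=json 2>/dev/null \
  | jq -r '.pg_stats[] | .stat_sum.num_bytes as $b | .up[] | "\(.) \($b)"' \
  | awk '{sum[$1]+=$2} END {for (o in sum) printf "osd.%s %.1fG\n", o, sum[o]/2^30}' \
  | sort -V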


On Mon, Jun 25, 2018, 10:25 PM shadow_lin  wrote:

Hi David,
I am sure most (if not all) of the data is in one pool.
rbd_pool is only used for the omap of the EC-backed RBD images.

ceph df:

GLOBAL:
    SIZE AVAIL   RAW USED %RAW USED
    427T 100555G 329T     77.03

POOLS:
    NAME        ID USED %USED MAX AVAIL OBJECTS
    ec_rbd_pool 3  219T 81.40 50172G    57441718
    rbd_pool    4  144  0     37629G    19
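
For context, this kind of split (a small replicated pool carrying the RBD headers and
omap, with the image data on the EC pool) is normally created with the --data-pool
option; a sketch with an illustrative image name and size:

# size is in MB by default; header/omap go to rbd_pool, data to ec_rbd_pool
rbd create --size 102400 --data-pool ec_rbd_pool rbd_pool/testimage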



2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 10:21
Subject: Re: Re: [ceph-users] Uneven data distribution with even pg distribution
after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

You have 2 different pools. PGs in each pool are going to be a different size.  
It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's 
and Y's. Having equal PG counts on each osd is only balanced if you have a 
single pool or have a case where all PGs are identical in size. The latter is 
not likely.


On Mon, Jun 25, 2018, 10:02 PM shadow_lin  wrote:

Hi David,
I am afraid I can't run the command you provided now, because I tried
removing another osd on that host to see if it would make the data distribution
even, and it did.
The pg numbers of my pools are powers of 2.
Below is from my notes before I removed the other osd:
pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags
hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull
stripe_width 0 application rbd
PG distribution per osd for all pools:
https://pasteboard.co/HrBZv3s.png

What I don't understand is why the data distribution is uneven when the pg
distribution is even.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after
rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output 
of the following commands.  It will print out all of your pool names along with 
how many PGs are in that pool.  My guess is that you don't have a power of 2 
number of PGs in your pool.  Alternatively you might have multiple pools and 
the PGs from the various pools are just different sizes.


ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; 
do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df


For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32



GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36 7006G     263000
cephfs_metadata 6  141M   0.05  268G      11945
cephfs_data     7  10746G 43.41 14012G    2795782
rbd-replica-ssd 9  241G   47.30 268G      75061


On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

Hi List,
   The environment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSDs per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the osds, I noticed a lot of the data was moving onto
the other osd on the same host.

   Ceph osd df result is (osd.20 and osd.21 are in the same host and osd.20 was
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134

   I am using RBD only, so the objects should all be 4M. I don't understand why
osd.21 got significantly more data with the same PG count as the other osds.
   Is this behavior expected, did I misconfigure something, or is this some kind of bug?

   Thanks


2018

[ceph-users] FS Reclaims storage too slow

2018-06-25 Thread Zhang Qiang
Hi,

Is it normal that I deleted files from cephfs and ceph still hadn't
deleted the backing objects a day later? It only started to release the
storage space after I restarted the mds daemon.

I noticed the doc (http://docs.ceph.com/docs/mimic/dev/delayed-delete/)
says the file is marked as deleted on the MDS and deleted lazily.
What is the condition that triggers deletion of the backing objects? If it's
normal for the deletion to be delayed that much, is there any way to make it
faster? The cluster is near full.

I'm using jewel 10.2.3 both for ceph-fuse and mds.
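
For what it's worth, a rough way to watch whether the MDS is actually purging strays is
to poll its admin socket on the MDS host; this is only a sketch (the daemon name is a
placeholder and the exact counter names vary between releases):

# on the host running the active MDS; adjust the daemon name
ceph daemon mds.$(hostname -s) perf dump | grep -i stray
# and watch the pool usage actually shrink
watch -n 60 ceph df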

Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread David Turner
If you look at ceph pg dump, you'll see the size ceph believes each PG is.
>From your ceph df, your PGs for the rbd_pool will be almost zero. So if you
have an osd with 6 of those PGs and another with none of them, but both
osds have the same number of PGs overall... The osd with none of them will
be more full than the other. I bet that the osd you had that was really
full just had less of those PGs than the rest.

On Mon, Jun 25, 2018, 10:25 PM shadow_lin  wrote:

> Hi David,
> I am sure most(if not all) data are in one pool.
> rbd_pool is only for omap for EC rbd.
>
> ceph df:
>
> GLOBAL:
> SIZE AVAIL   RAW USED %RAW USED
> 427T 100555G 329T 77.03
>
> POOLS:
> NAME        ID USED %USED MAX AVAIL OBJECTS
> ec_rbd_pool 3  219T 81.40 50172G    57441718
> rbd_pool    4  144  0     37629G    19
>
>
> 2018-06-26
> --
> shadow_lin
> --
>
> *From:* David Turner
> *Sent:* 2018-06-26 10:21
> *Subject:* Re: Re: [ceph-users] Uneven data distribution with even pg
> distribution after rebalancing
>
> *To:* "shadow_lin"
> *Cc:* "ceph-users"
>
> You have 2 different pools. PGs in each pool are going to be a different
> size.  It's like saying 12x + 13y should equal 2x + 23y because they each
> have 25 X's and Y's. Having equal PG counts on each osd is only balanced if
> you have a single pool or have a case where all PGs are identical in size.
> The latter is not likely.
>
> On Mon, Jun 25, 2018, 10:02 PM shadow_lin  wrote:
>
>> Hi David,
>> I am afraid I can't run the command you provide now,because I tried
>> to remove another osd on that host to see if it would make the data
>> distribution even and it did.
>> The pg number of my pools are at power of 2.
>> Below is from my note before removed another osd:
>> pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2
>> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags
>> hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
>> pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0
>> object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags
>> hashpspool,nearfull stripe_width 0 application rbd
>> pg distribution of osd of all pools:
>> https://pasteboard.co/HrBZv3s.png
>>
>> What I don't understand is why data distribution is uneven when pg
>> distribution is even.
>>
>> 2018-06-26
>>
>> shadow_lin
>>
>>
>>
>> From: David Turner
>> Sent: 2018-06-26 01:24
>> Subject: Re: [ceph-users] Uneven data distribution with even pg distribution
>> after rebalancing
>> To: "shadow_lin"
>> Cc: "ceph-users"
>>
>> I should be able to answer this question for you if you can supply the
>> output of the following commands.  It will print out all of your pool names
>> along with how many PGs are in that pool.  My guess is that you don't have
>> a power of 2 number of PGs in your pool.  Alternatively you might have
>> multiple pools and the PGs from the various pools are just different sizes.
>>
>>
>> ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read
>> pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
>> ceph df
>>
>>
>> For me the output looks like this.
>> rbd: 64
>> cephfs_metadata: 64
>> cephfs_data: 256
>> rbd-ssd: 32
>>
>>
>>
>> GLOBAL:
>> SIZE   AVAIL  RAW USED %RAW USED
>> 46053G 26751G   19301G 41.91
>> POOLS:
>> NAME            ID USED   %USED MAX AVAIL OBJECTS
>> rbd-replica     4  897G   11.36 7006G     263000
>> cephfs_metadata 6  141M   0.05  268G      11945
>> cephfs_data     7  10746G 43.41 14012G    2795782
>> rbd-replica-ssd 9  241G   47.30 268G      75061
>>
>>
>> On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:
>>
>> Hi List,
>>The enviroment is:
>>Ceph 12.2.4
>>Balancer module on and in upmap mode
>>Failure domain is per host, 2 OSD per host
>>EC k=4 m=2
>>PG distribution is almost even before and after the rebalancing.
>>
>>
>>After marking out one of the osd,I noticed a lot of the data was
>> moving into the other osd on the same host .
>>
>>Ceph osd df result is(osd.20 and osd.21 are in the same host and
>> osd.20 was marked out):
>>
>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>> 19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
>> 21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
>> 22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
>> 23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134
>>
>>I am using RBD only so the objects should all be 4m .I don't
>> understand why osd 21 got significant more data
>> with the same pg as other osds.
>>Is this behavior expected or I misconfiged something or  some kind of
>> bug?
>>
>>

Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread shadow_lin
Hi David,
I am sure most (if not all) of the data is in one pool.
rbd_pool is only used for the omap of the EC-backed RBD images.

ceph df:
GLOBAL:
SIZE AVAIL   RAW USED %RAW USED
427T 100555G 329T 77.03
POOLS:
NAME        ID USED %USED MAX AVAIL OBJECTS
ec_rbd_pool 3  219T 81.40 50172G    57441718
rbd_pool    4  144  0     37629G    19



2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 10:21
Subject: Re: Re: [ceph-users] Uneven data distribution with even pg distribution
after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

You have 2 different pools. PGs in each pool are going to be a different size.  
It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's 
and Y's. Having equal PG counts on each osd is only balanced if you have a 
single pool or have a case where all PGs are identical in size. The latter is 
not likely.


On Mon, Jun 25, 2018, 10:02 PM shadow_lin  wrote:

Hi David,
I am afraid I can't run the command you provide now,because I tried to 
remove another osd on that host to see if it would make the data distribution 
even and it did.
The pg number of my pools are at power of 2.
Below is from my note before removed another osd:
pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash 
rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags 
hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull 
stripe_width 0 application rbd
pg distribution of osd of all pools:
https://pasteboard.co/HrBZv3s.png

What I don't understand is why data distribution is uneven when pg 
distribution is even.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after
rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output 
of the following commands.  It will print out all of your pool names along with 
how many PGs are in that pool.  My guess is that you don't have a power of 2 
number of PGs in your pool.  Alternatively you might have multiple pools and 
the PGs from the various pools are just different sizes.


ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; 
do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df


For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32



GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36 7006G     263000
cephfs_metadata 6  141M   0.05  268G      11945
cephfs_data     7  10746G 43.41 14012G    2795782
rbd-replica-ssd 9  241G   47.30 268G      75061


On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

Hi List,
   The enviroment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSD per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the osd,I noticed a lot of the data was moving into 
the other osd on the same host .

   Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20 was 
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134

   I am using RBD only so the objects should all be 4m .I don't understand why 
osd 21 got significant more data 
with the same pg as other osds.
   Is this behavior expected or I misconfiged something or  some kind of bug?

   Thanks


2018-06-25
shadow_lin 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread David Turner
You have 2 different pools. PGs in each pool are going to be a different
size.  It's like saying 12x + 13y should equal 2x + 23y because they each
have 25 X's and Y's. Having equal PG counts on each osd is only balanced if
you have a single pool or have a case where all PGs are identical in size.
The latter is not likely.

On Mon, Jun 25, 2018, 10:02 PM shadow_lin  wrote:

> Hi David,
> I am afraid I can't run the command you provide now,because I tried to
> remove another osd on that host to see if it would make the data
> distribution even and it did.
> The pg number of my pools are at power of 2.
> Below is from my note before removed another osd:
> pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2
> object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags
> hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
> pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags
> hashpspool,nearfull stripe_width 0 application rbd
> pg distribution of osd of all pools:
> https://pasteboard.co/HrBZv3s.png
>
> What I don't understand is why data distribution is uneven when pg
> distribution is even.
>
> 2018-06-26
>
> shadow_lin
>
>
>
> From: David Turner
> Sent: 2018-06-26 01:24
> Subject: Re: [ceph-users] Uneven data distribution with even pg distribution
> after rebalancing
> To: "shadow_lin"
> Cc: "ceph-users"
>
> I should be able to answer this question for you if you can supply the
> output of the following commands.  It will print out all of your pool names
> along with how many PGs are in that pool.  My guess is that you don't have
> a power of 2 number of PGs in your pool.  Alternatively you might have
> multiple pools and the PGs from the various pools are just different sizes.
>
>
> ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read
> pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
> ceph df
>
>
> For me the output looks like this.
> rbd: 64
> cephfs_metadata: 64
> cephfs_data: 256
> rbd-ssd: 32
>
>
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 46053G 26751G   19301G 41.91
> POOLS:
> NAME            ID USED   %USED MAX AVAIL OBJECTS
> rbd-replica     4  897G   11.36 7006G     263000
> cephfs_metadata 6  141M   0.05  268G      11945
> cephfs_data     7  10746G 43.41 14012G    2795782
> rbd-replica-ssd 9  241G   47.30 268G      75061
>
>
> On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:
>
> Hi List,
>The enviroment is:
>Ceph 12.2.4
>Balancer module on and in upmap mode
>Failure domain is per host, 2 OSD per host
>EC k=4 m=2
>PG distribution is almost even before and after the rebalancing.
>
>
>After marking out one of the osd,I noticed a lot of the data was moving
> into the other osd on the same host .
>
>Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20
> was marked out):
>
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
> 21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
> 22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
> 23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134
>
>I am using RBD only so the objects should all be 4m .I don't understand
> why osd 21 got significant more data
> with the same pg as other osds.
>Is this behavior expected or I misconfiged something or  some kind of
> bug?
>
>Thanks
>
>
> 2018-06-25
> shadow_lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread shadow_lin
Hi David,
I am afraid I can't run the command you provided now, because I tried
removing another osd on that host to see if it would make the data distribution
even, and it did.
The pg numbers of my pools are powers of 2.
Below is from my notes before I removed the other osd:
pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags
hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull
stripe_width 0 application rbd
PG distribution per osd for all pools:
https://pasteboard.co/HrBZv3s.png

What I don't understand is why the data distribution is uneven when the pg
distribution is even.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after
rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output 
of the following commands.  It will print out all of your pool names along with 
how many PGs are in that pool.  My guess is that you don't have a power of 2 
number of PGs in your pool.  Alternatively you might have multiple pools and 
the PGs from the various pools are just different sizes.


ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; 
do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df


For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32



GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36 7006G     263000
cephfs_metadata 6  141M   0.05  268G      11945
cephfs_data     7  10746G 43.41 14012G    2795782
rbd-replica-ssd 9  241G   47.30 268G      75061


On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

Hi List,
   The enviroment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSD per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the osd,I noticed a lot of the data was moving into 
the other osd on the same host .

   Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20 was 
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134

   I am using RBD only so the objects should all be 4m .I don't understand why 
osd 21 got significant more data 
with the same pg as other osds.
   Is this behavior expected or I misconfiged something or  some kind of bug?

   Thanks


2018-06-25
shadow_lin 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-25 Thread Brad Hubbard
Interesting...

Can I see the output of "ceph auth list" and can you test whether you
can query any other pg that has osd.21 as its primary?

On Mon, Jun 25, 2018 at 8:04 PM, Andrei Mikhailovsky  wrote:
> Hi Brad,
>
> here is the output:
>
> --
>
> root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg 
> 18.2 query
> 2018-06-25 10:59:12.100302 7fe23eaa1700  2 Event(0x7fe2400e0140 nevent=5000 
> time_id=1).set_owner idx=0 owner=140609690670848
> 2018-06-25 10:59:12.100398 7fe23e2a0700  2 Event(0x7fe24010d030 nevent=5000 
> time_id=1).set_owner idx=1 owner=140609682278144
> 2018-06-25 10:59:12.100445 7fe23da9f700  2 Event(0x7fe240139ec0 nevent=5000 
> time_id=1).set_owner idx=2 owner=140609673885440
> 2018-06-25 10:59:12.100793 7fe244b28700  1  Processor -- start
> 2018-06-25 10:59:12.100869 7fe244b28700  1 -- - start start
> 2018-06-25 10:59:12.100882 7fe244b28700  5 adding auth protocol: cephx
> 2018-06-25 10:59:12.101046 7fe244b28700  2 auth: KeyRing::load: loaded key 
> file /etc/ceph/ceph.client.admin.keyring
> 2018-06-25 10:59:12.101244 7fe244b28700  1 -- - --> 192.168.168.201:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0
> 2018-06-25 10:59:12.101264 7fe244b28700  1 -- - --> 192.168.168.202:6789/0 -- 
> auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0
> 2018-06-25 10:59:12.101690 7fe23e2a0700  1 -- 192.168.168.201:0/3046734987 
> learned_addr learned my addr 192.168.168.201:0/3046734987
> 2018-06-25 10:59:12.101890 7fe23e2a0700  2 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
> s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got 
> newly_acked_seq 0 vs out_seq 0
> 2018-06-25 10:59:12.102030 7fe23da9f700  2 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
> s=STATE_CONNECTING_WAIT_ACK_SEQ pgs=0 cs=0 l=1)._process_connection got 
> newly_acked_seq 0 vs out_seq 0
> 2018-06-25 10:59:12.102450 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 
> seq 1 0x7fe234002670 mon_map magic: 0 v1
> 2018-06-25 10:59:12.102494 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 
> seq 2 0x7fe234002b70 auth_reply(proto 2 0 (0) Success) v1
> 2018-06-25 10:59:12.102542 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 
> <== mon.1 192.168.168.202:6789/0 1  mon_map magic: 0 v1  505+0+0 
> (2386987630 0 0) 0x7fe234002670 con 0x7fe240176dc0
> 2018-06-25 10:59:12.102629 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 
> <== mon.1 192.168.168.202:6789/0 2  auth_reply(proto 2 0 (0) Success) v1 
>  33+0+0 (1469975654 0 0) 0x7fe234002b70 con 0x7fe240176dc0
> 2018-06-25 10:59:12.102655 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service mon
> 2018-06-25 10:59:12.102657 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service osd
> 2018-06-25 10:59:12.102658 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service mgr
> 2018-06-25 10:59:12.102661 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service auth
> 2018-06-25 10:59:12.102662 7fe23ca9d700 10 cephx: validate_tickets want 53 
> have 0 need 53
> 2018-06-25 10:59:12.102666 7fe23ca9d700 10 cephx client: handle_response ret 
> = 0
> 2018-06-25 10:59:12.102671 7fe23ca9d700 10 cephx client:  got initial server 
> challenge 6522ec95fb2eb487
> 2018-06-25 10:59:12.102673 7fe23ca9d700 10 cephx client: validate_tickets: 
> want=53 need=53 have=0
> 2018-06-25 10:59:12.102674 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service mon
> 2018-06-25 10:59:12.102675 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service osd
> 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service mgr
> 2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no 
> handler for service auth
> 2018-06-25 10:59:12.102677 7fe23ca9d700 10 cephx: validate_tickets want 53 
> have 0 need 53
> 2018-06-25 10:59:12.102678 7fe23ca9d700 10 cephx client: want=53 need=53 
> have=0
> 2018-06-25 10:59:12.102680 7fe23ca9d700 10 cephx client: build_request
> 2018-06-25 10:59:12.102702 7fe23da9f700  5 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
> s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 
> seq 1 0x7fe228001490 mon_map magic: 0 v1
> 2018-06-25 10:59:12.102739 7fe23ca9d700 10 cephx client: get auth session 
> key: client_challenge 80f2a24093f783c5
> 2018-06-25 10:59:12.102743 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 
> --> 192.168.168.202:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 
> 0x7fe224002080 con 0
> 2018-06-25 10:59:12.102737 7fe23da9f700  5 -- 192.168.168.201:0/3046734987 >> 
> 192.168.168.2

Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread David Turner
I should be able to answer this question for you if you can supply the
output of the following commands.  It will print out all of your pool names
along with how many PGs are in that pool.  My guess is that you don't have
a power of 2 number of PGs in your pool.  Alternatively you might have
multiple pools and the PGs from the various pools are just different sizes.

ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read
pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df

For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32

GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36 7006G     263000
cephfs_metadata 6  141M   0.05  268G      11945
cephfs_data     7  10746G 43.41 14012G    2795782
rbd-replica-ssd 9  241G   47.30 268G      75061

On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

> Hi List,
>The enviroment is:
>Ceph 12.2.4
>Balancer module on and in upmap mode
>Failure domain is per host, 2 OSD per host
>EC k=4 m=2
>PG distribution is almost even before and after the rebalancing.
>
>
>After marking out one of the osd,I noticed a lot of the data was moving
> into the other osd on the same host .
>
>Ceph osd df result is(osd.20 and osd.21 are in the same host and osd.20
> was marked out):
>
> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
> 19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
> 21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
> 22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
> 23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134
>
>I am using RBD only so the objects should all be 4m .I don't understand
> why osd 21 got significant more data
> with the same pg as other osds.
>Is this behavior expected or I misconfiged something or  some kind of
> bug?
>
>Thanks
>
>
> 2018-06-25
> shadow_lin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Increase queue_depth in KVM

2018-06-25 Thread Damian Dabrowski
Hello,

When I mount an rbd image with -o queue_depth=1024 I can see a big improvement,
mostly on writes (random writes improve from 3k IOPS with the standard
queue_depth to 24k IOPS with queue_depth=1024).

But is there any way to attach an rbd disk to a KVM instance with a custom
queue_depth? I can't find any information about it.

Thanks for any information.
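
Not an authoritative answer, but for reference these are the two knobs usually brought
up; pool/image names are placeholders and the libvirt/QEMU attribute names should be
double-checked against your versions:

# krbd side: queue_depth is a map option
rbd map rbd_pool/vmdisk -o queue_depth=1024
# QEMU with librbd has no queue_depth option; the closest equivalent is
# multiqueue virtio, e.g. on a raw QEMU command line:
#   -device virtio-blk-pci,drive=drive-rbd0,num-queues=4
# or, with libvirt, a queues attribute on the disk's <driver> element:
#   <driver name='qemu' type='raw' cache='writeback' queues='4'/>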
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")

2018-06-25 Thread Dyweni - Ceph-Users

Hi,

Is there any information you'd like to grab off this OSD?  Anything I 
can provide to help you troubleshoot this?


I ask, because if not, I'm going to reformat / rebuild this OSD (unless 
there is a faster way to repair this issue).


Thanks,
Dyweni



On 2018-06-25 07:30, Dyweni - Ceph-Users wrote:

Good Morning,

After removing roughly 20-some rbd snapshots, one of my OSDs has
begun flapping.


 ERROR 1 

2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738 pg[4.e8(
v 44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595
n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729
44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod
44721'485586 mlcod 44721'485586 active+clean+snapt
rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head
2018-06-25 06:46:41.314172 a1ce2700 -1
/var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t,
PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25
06:46:41.220388
/var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
217: FAILED assert(0 == "put on missing extent (nothing before)")

 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1bc) [0x2a2c314]
 2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned int,
std::vector
>*, bool*)+0x128) [0x2893650]
 3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned int,
std::vector
>*, std::set, 
std::allocator >*)+0xb8) [0x2791bdc]
 4: (BlueStore::_wctx_finish(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr, BlueStore::WriteContext*,
std::set,
std::allocator >*)+0x5c8) [0x27f3254]
 5: (BlueStore::_do_truncate(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr, unsigned long long,
std::set,
std::allocator >*)+0x360) [0x27f7834]
 6: (BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xb4) [0x27f81b4]
 7: (BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x1dc) [0x27f9638]
 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*,
ObjectStore::Transaction*)+0xe7c) [0x27e855c]
 9: (BlueStore::queue_transactions(ObjectStore::Sequencer*,
std::vector >&,
boost::intrusive_ptr, ThreadPool::TPHandle*)+0x67c)
[0x27e6f80]
 10: (ObjectStore::queue_transactions(ObjectStore::Sequencer*,
std::vector >&, Context*, Context*,
Context*, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x118) [0x1f9ce48]
 11:
(PrimaryLogPG::queue_transactions(std::vector >&,
boost::intrusive_ptr)+0x9c) [0x22dd754]
 12: (ReplicatedBackend::submit_transaction(hobject_t const&,
object_stat_sum_t const&, eversion_t const&,
std::unique_ptr >&&,
eversion_t const&, eversion_t const&, std::vector > const&,
boost::optional&, Context*, Context*, Context*, unsigned long long, osd_reqid_t,
boost::intrusive_ptr)+0x6f4) [0x25c0568]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*,
PrimaryLogPG::OpContext*)+0x7f4) [0x228ac98]
 14:
(PrimaryLogPG::simple_opc_submit(std::unique_ptr >)+0x1b8) [0x228bc54]
 15: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork
const&)+0x1970) [0x22c5d4c]
 16: (boost::statechart::detail::reaction_result
boost::statechart::custom_reaction::react(PrimaryLogPG::AwaitAsyncWork&, boost::statechart::event_base
const&, void const* const&)+0x58) [0x23b245c]
 17: (boost::statechart::detail::reaction_result
boost::statechart::simple_state,
(boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
boost::statechart::simple_state,
(boost::statechart::history_mode)0>
>(boost::statechart::simple_state, (boost::statechart::history_mode)0>&,
boost::statechart::event_base const&, void const*)+0x30) [0x23b0f04]
 18: (boost::statechart::detail::reaction_result
boost::statechart::simple_state,
(boost::statechart::history_mode)0>::local_react,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na> >(boost::statechart::event_base
const&, void const*)+0x28) [0x23af7cc]
 19: (boost::statechart::simple_state, (boost:
:statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x28) [0x23ad744]
 20:
(boost::statechart::detail::send_function,
boost::statechart::detail::rtti_policy>,
boost::statechart::event_base, void const*>::operator()()+0x40)
[0x21b6000]
 21: (boost::statechart::detail::reaction_result
boost::statechart::null_exception_translator::operator(),
boost::statechart::detail::rtti_policy>,
boost::statechart::event_base, 

[ceph-users] Proxmox with EMC VNXe 3200

2018-06-25 Thread Eneko Lacunza

Hi all,

We're planning the migration of a VMWare 5.5 cluster backed by an EMC
VNXe 3200 storage appliance to Proxmox.


The VNXe has about 3 years of warranty left and half the disks are
unprovisioned, so the current plan is to use the same VNXe for Proxmox
storage. After the warranty expires we'll most probably go Ceph, but that's
some years in the future.


The VNXe seems to support both iSCSI and NFS (CIFS too, but that is really
outside my tech tastes). I guess the best option performance-wise would be
iSCSI, but I like the simplicity of NFS. Any idea what the performance
impact of this choice (NFS vs. iSCSI) could be?


Has anyone had any experience with this kind of storage appliances?

Thanks a lot
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery after datacenter outage

2018-06-25 Thread Brett Niver
+Paul


On Mon, Jun 25, 2018 at 5:14 AM, Christian Zunker
 wrote:
> Hi Jason,
>
> your guesses were correct. Thank you for your support.
>
> Just in case, someone else stumbles upon this thread, some more links:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020722.html
> http://docs.ceph.com/docs/luminous/rados/operations/user-management/#authorization-capabilities
> http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
> https://github.com/ceph/ceph/pull/15991
>
> Jason Dillaman  schrieb am Fr., 22. Juni 2018 um 22:58
> Uhr:
>>
>> It sounds like your OpenStack users do not have the correct caps to
>> blacklist dead clients. See step 6 in the upgrade section of Luminous’
>> release notes or (preferably) use the new “profile rbd”-style caps if you
>> don’t use older clients.
>>
>> The reason why repairing the object map seemed to fix everything was
>> because I suspect you performed the op using the admin user, which had the
>> caps necessary to blacklist the dead clients and clean up the dirty
>> exclusive lock on the image.
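
For reference, the "profile rbd"-style caps mentioned above look like this in the
Luminous OpenStack guide linked further up; the client name and pool names are of course
specific to each deployment:

ceph auth caps client.cinder mon 'profile rbd' \
  osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'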
>>
>> On Fri, Jun 22, 2018 at 4:47 PM Gregory Farnum  wrote:
>>>
>>> On Fri, Jun 22, 2018 at 2:26 AM Christian Zunker
>>>  wrote:

 Hi List,

 we are running a ceph cluster (12.2.5) as backend to our OpenStack
 cloud.

 Yesterday our datacenter had a power outage. As if this wasn't enough,
 we also ended up with a partitioned ceph cluster because of networking problems.

 First of all thanks a lot to the ceph developers. After the network was
 back to normal, ceph recovered itself. You saved us from a lot of downtime,
 lack of sleep and insanity.

 Now to our problem/question:
 After ceph recovered, we tried to bring up our VMs. They have cinder
 volumes saved in ceph. All VMs didn't start because of I/O problems during
 start:
 [4.393246] JBD2: recovery failed
 [4.395949] EXT4-fs (vda1): error loading journal
 [4.400811] VFS: Dirty inode writeback failed for block device vda1
 (err=-5).
 mount: mounting /dev/vda1 on /root failed: Input/output error
 done.
 Begin: Running /scripts/local-bottom ... done.
 Begin: Running /scripts/init-bottom ... mount: mounting /dev on
 /root/dev failed: No such file or directory

 We tried to recover the disk with different methods, but all failed
 because of different reasons. What helped us at the end was a rebuild on 
 the
 object map of each image:
 rbd object-map rebuild volumes/

 From what we understood, object-map is a feature for ceph internal
 speedup. How can this lead to I/O errors in our VMs?
 Is this the expected way for a recovery?
 Did we miss something?
 Is there any documentation describing what leads to invalid object-maps
 and how to recover? (We did not find a doc on that topic...)
>>>
>>>
>>> An object map definitely shouldn't lead to IO errors in your VMs; in fact
>>> I thought it auto-repaired itself if necessary. Maybe the RBD guys can chime
>>> in here about probable causes of trouble.
>>>
>>> My *guess* is that perhaps your VMs or QEMU were configured to ignore
>>> barriers or some similar thing, so that when the power failed a write was
>>> "lost" as it got written to a new RBD object but not committed into the
>>> object map, but the FS or database journal recorded it as complete. I can't
>>> be sure about that though.
>>> -Greg
>>>



 regards
 Christian
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> --
>> Jason
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")

2018-06-25 Thread Dyweni - Ceph-Users

Good Morning,

After removing roughly 20-some rbd snapshots, one of my OSDs has begun
flapping.



 ERROR 1 

2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738 pg[4.e8( v 
44721'485588 (44697'484015,44721'485588] local-lis/les=44593/44595 
n=2972 ec=9422/9422 lis/c 44593/44593 les/c/f 44595/44595/40729 
44593/44593/44593) [8,7,10] r=0 lpr=44593 crt=44721'485588 lcod 
44721'485586 mlcod 44721'485586 active+clean+snapt

rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head
2018-06-25 06:46:41.314172 a1ce2700 -1 
/var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: 
In function 'void bluestore_extent_ref_map_t::put(uint64_t, uint32_t, 
PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25 06:46:41.220388
/var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc: 
217: FAILED assert(0 == "put on missing extent (nothing before)")


 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1bc) [0x2a2c314]
 2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned int, 
std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>*, bool*)+0x128) [0x2893650]
 3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned int, 
std::vectormempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>*, std::set, std::allocator >*)+0xb8) [0x2791bdc]
 4: (BlueStore::_wctx_finish(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, BlueStore::WriteContext*, 
std::set, 
std::allocator >*)+0x5c8) [0x27f3254]
 5: (BlueStore::_do_truncate(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr, unsigned long long, 
std::set, 
std::allocator >*)+0x360) [0x27f7834]
 6: (BlueStore::_do_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr)+0xb4) [0x27f81b4]
 7: (BlueStore::_remove(BlueStore::TransContext*, 
boost::intrusive_ptr&, 
boost::intrusive_ptr&)+0x1dc) [0x27f9638]
 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)+0xe7c) [0x27e855c]
 9: (BlueStore::queue_transactions(ObjectStore::Sequencer*, 
std::vectorstd::allocator >&, 
boost::intrusive_ptr, ThreadPool::TPHandle*)+0x67c) 
[0x27e6f80]
 10: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, 
std::vectorstd::allocator >&, Context*, Context*, 
Context*, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x118) 
[0x1f9ce48]
 11: 
(PrimaryLogPG::queue_transactions(std::vectorstd::allocator >&, 
boost::intrusive_ptr)+0x9c) [0x22dd754]
 12: (ReplicatedBackend::submit_transaction(hobject_t const&, 
object_stat_sum_t const&, eversion_t const&, 
std::unique_ptr >&&, 
eversion_t const&, eversion_t const&, std::vectorstd::allocator > const&, 
boost::optionalry_t>&, Context*, Context*, Context*, unsigned long long, osd_reqid_t, 
boost::intrusive_ptr)+0x6f4) [0x25c0568]
 13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, 
PrimaryLogPG::OpContext*)+0x7f4) [0x228ac98]
 14: 
(PrimaryLogPG::simple_opc_submit(std::unique_ptrstd::default_delete >)+0x1b8) [0x228bc54]
 15: (PrimaryLogPG::AwaitAsyncWork::react(PrimaryLogPG::DoSnapWork 
const&)+0x1970) [0x22c5d4c]
 16: (boost::statechart::detail::reaction_result 
boost::statechart::custom_reaction::reactboost::statechart::event_base, void 
const*>(PrimaryLogPG::AwaitAsyncWork&, boost::statechart::event_base 
const&, void const* const&)+0x58) [0x23b245c]
 17: (boost::statechart::detail::reaction_result 
boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0> 
>(boost::statechart::simple_state, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na>, (boost::statechart::history_mode)0>&, 
boost::statechart::event_base const&, void const*)+0x30) [0x23b0f04]
 18: (boost::statechart::detail::reaction_result 
boost::statechart::simple_statePrimaryLogPG::Trimming, boost::mpl::listmpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na>, 
(boost::statechart::history_mode)0>::local_react, 
mpl_::na, mpl_::na, mpl_::na, mpl

[ceph-users] Intel SSD DC P3520 PCIe for OSD 1480 TBW good idea?

2018-06-25 Thread Jelle de Jong

Hello everybody,

I am thinking about building a production three-node Ceph cluster with 3x
1.2TB Intel SSD DC P3520 PCIe storage devices, 10.8TB raw (7.2TB at 66%
for production).


I am not planning on a journal on a separate ssd. I assume there is no
advantage to this when using PCIe storage?


Network connection is to a Cisco SG550XG-8F8T 10GbE switch with Intel
X710-DA2 adapters (if someone knows a good mainline Linux budget replacement,
let me know).


https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p3520-series/dc-p3520-1-2tb-aic-3d1.html

Is this a good storage setup?

Mainboard: Intel® Server Board S2600CW2R
CPU: 2x Intel® Xeon® Processor E5-2630 v4 (25M Cache, 2.20 GHz)
Memory:  1x 64GB DDR4 ECC KVR24R17D4K4/64
Disk: 2x WD Gold 4TB 7200rpm 128MB SATA3
Storage: 3x Intel SSD DC P3520 1.2TB PCIe
Adapter: Intel Ethernet Converged Network Adapter X710-DA2

I want to try using NUMA to also run KVM guests alongside the OSDs. I
should have enough cores and only have a few osd processes.
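
In case it is useful, a rough sketch of how that separation is often done; node numbers,
CPU ranges and unit names are placeholders:

# inspect the topology first
numactl --hardware
# pin an OSD's CPUs via a systemd drop-in, e.g. 'systemctl edit ceph-osd@0':
#   [Service]
#   CPUAffinity=0-13
# and keep the KVM guests on the other node, e.g. in the libvirt domain XML:
#   <vcpu placement='static' cpuset='14-27'>4</vcpu>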


Kind regards,

Jelle de Jong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw failover help

2018-06-25 Thread Burkhard Linke

Hi,


On 06/20/2018 07:20 PM, David Turner wrote:

We originally used pacemaker to move a VIP between our RGWs, but ultimately
decided to go with an LB in front of them.  With an LB you can utilize both
RGWs while they're up, but the LB will shy away from either if they're down
until the check starts succeeding for that host again.  We do have 2 LBs
with pacemaker, but the LBs are in charge of 3 prod RGW realms and 2
staging realms.  Moving to the LB with pacemaker simplified our setup quite
a bit for HA.


We use a similar setup, but without an extra load balancer host.
Pacemaker is deployed on all hosts acting as RGW, together with haproxy
as the load balancer. haproxy is bound to the VIPs, does active checks on
the RGWs, and distributes RGW traffic to all three RGW servers in
our setup. It also takes care of SSL/TLS termination.


With this setup we are also able to use multiple VIPs (e.g. one for 
external traffic, one for internal traffic), and route them to different 
haproxy instances if possible.
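
To illustrate, a stripped-down haproxy configuration for such a setup might look roughly
like the following; addresses, hostnames, the certificate path and the health check are
placeholders to adapt:

frontend rgw_https
    bind 192.0.2.10:443 ssl crt /etc/haproxy/rgw.pem
    default_backend rgw_servers

backend rgw_servers
    balance roundrobin
    option httpchk GET /
    server rgw1 rgw1.example.com:7480 check
    server rgw2 rgw2.example.com:7480 check
    server rgw3 rgw3.example.com:7480 check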


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG status is "active+undersized+degraded"

2018-06-25 Thread Burkhard Linke

Hi,


On 06/22/2018 08:06 AM, dave.c...@dell.com wrote:

I saw this statement at this link (
http://docs.ceph.com/docs/master/rados/operations/crush-map/ ); is that the
reason which leads to the warning?

" This, combined with the default CRUSH failure domain, ensures that replicas or 
erasure code shards are separated across hosts and a single host failure will not affect 
availability."

Best Regards,
Dave Chen

-Original Message-
From: Chen2, Dave
Sent: Friday, June 22, 2018 1:59 PM
To: 'Burkhard Linke'; ceph-users@lists.ceph.com
Cc: Chen2, Dave
Subject: RE: [ceph-users] PG status is "active+undersized+degraded"

Hi Burkhard,

Thanks for your explanation. I created a new 2TB OSD on another node and it indeed
solved the issue; the status of the Ceph cluster is "health HEALTH_OK" now.

Another question: if three homogeneous OSDs are spread across only 2 nodes, I still get
the warning message and the status is "active+undersized+degraded". So is spreading the
three OSDs across 3 nodes a mandatory rule for Ceph? Is that only for HA
considerations? Is there any official Ceph documentation with guidance on this?


The default ceph crush rules try to distribute PG replicas among
hosts. With a default replication number of 3 (pool size = 3), this
requires at least three hosts. The pool also defines a minimum number of
PG replicas that must be available to allow I/O to a PG. This is usually
set to 2 (pool min_size = 2). The above status thus means that there are
enough copies for the min_size (-> active), but not enough for the size
(-> undersized + degraded).


Using fewer than three hosts requires changing the pool size to 2. But
this is strongly discouraged, since a sane automatic recovery of the data in
case of a netsplit or other temporary node failure is not possible. Do
not do this in a production setup.
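
For completeness, the settings involved can be inspected, and for a pure test setup
changed, like this (the pool name is a placeholder; as said above, size 2 is not for
production):

ceph osd pool get mypool size
ceph osd pool get mypool min_size
# test setups only:
ceph osd pool set mypool size 2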


For a production setup you should also consider node failures. The
default setup uses 3 replicas, so to allow a node failure, you need 4
hosts. Otherwise the self-healing feature of ceph cannot recover the
third replica. You also need to closely monitor your cluster's free
space to avoid a full cluster due to replicated PGs in case of a node
failure.


Regards,
Burkhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Balancer: change from crush-compat to upmap

2018-06-25 Thread Caspar Smit
Hi All,

I've been using the balancer module in crush-compat mode for quite a while
now and want to switch to upmap mode since all my clients are now luminous
(v12.2.5).

I've reweighted the compat weight-set back to as close as possible to the original
crush weights using 'ceph osd crush reweight-compat'.

Before I switch to upmap I presume I need to remove the compat weight-set
with:

ceph osd crush weight-set rm-compat

Will this have any significant impact (rebalancing lots of pgs), or will it
have very little effect since I already reweighted everything back
close to the crush default weights?
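
For reference, the overall sequence I expect to run is sketched below; corrections
welcome if any step is wrong or missing for this release:

ceph balancer off
ceph osd crush weight-set rm-compat
ceph osd set-require-min-compat-client luminous
ceph balancer mode upmap
ceph balancer on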

Thanks in advance and kind regards,
Caspar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Move Ceph-Cluster to another Datacenter

2018-06-25 Thread Mehmet

Hey Ceph people,

I need advice on how to move a ceph cluster from one datacenter to another
without any downtime :)


DC 1:
3 dedicated MON servers (also running MGR on these servers)
4 dedicated OSD servers (3x 12 OSDs, 1x 23 OSDs)

3 Proxmox nodes with a connection to our Ceph storage (not managed from
Proxmox! Ceph is a standalone installation)


DC 2:
No Ceph-related Server actualy

Luminous (12.2.4)
Only one Pool:
NAME ID USED   %USED MAX AVAIL OBJECTS
rbd  0  30638G 63.89 17318G    16778324

I need to move my Ceph installation from DC1 to DC2 and would really be
happy if you could give me some advice on how to do this without any
downtime and in a still performant manner.


The latency from DC1 to DC2 is ~1.5 ms; we could perhaps bring up a 10Gb
fiber connection between DC1 and DC2.


A second Ceph cluster in DC2 is not possible for cost reasons, but I
could bring a 5th OSD server online there.
So "RBD mirror" isn't actually a feasible way to do this, but I will try to make it
possible ^^ ...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-25 Thread Andrei Mikhailovsky
Hi Brad,

here is the output:

--

root@arh-ibstorage1-ib:/home/andrei# ceph --debug_ms 5 --debug_auth 20 pg 18.2 
query
2018-06-25 10:59:12.100302 7fe23eaa1700  2 Event(0x7fe2400e0140 nevent=5000 
time_id=1).set_owner idx=0 owner=140609690670848
2018-06-25 10:59:12.100398 7fe23e2a0700  2 Event(0x7fe24010d030 nevent=5000 
time_id=1).set_owner idx=1 owner=140609682278144
2018-06-25 10:59:12.100445 7fe23da9f700  2 Event(0x7fe240139ec0 nevent=5000 
time_id=1).set_owner idx=2 owner=140609673885440
2018-06-25 10:59:12.100793 7fe244b28700  1  Processor -- start
2018-06-25 10:59:12.100869 7fe244b28700  1 -- - start start
2018-06-25 10:59:12.100882 7fe244b28700  5 adding auth protocol: cephx
2018-06-25 10:59:12.101046 7fe244b28700  2 auth: KeyRing::load: loaded key file 
/etc/ceph/ceph.client.admin.keyring
2018-06-25 10:59:12.101244 7fe244b28700  1 -- - --> 192.168.168.201:6789/0 -- 
auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240174b80 con 0
2018-06-25 10:59:12.101264 7fe244b28700  1 -- - --> 192.168.168.202:6789/0 -- 
auth(proto 0 30 bytes epoch 0) v1 -- 0x7fe240175010 con 0
2018-06-25 10:59:12.101690 7fe23e2a0700  1 -- 192.168.168.201:0/3046734987 
learned_addr learned my addr 192.168.168.201:0/3046734987
2018-06-25 10:59:12.101890 7fe23e2a0700  2 -- 192.168.168.201:0/3046734987 >> 
192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ 
pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
2018-06-25 10:59:12.102030 7fe23da9f700  2 -- 192.168.168.201:0/3046734987 >> 
192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 s=STATE_CONNECTING_WAIT_ACK_SEQ 
pgs=0 cs=0 l=1)._process_connection got newly_acked_seq 0 vs out_seq 0
2018-06-25 10:59:12.102450 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >> 
192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 
seq 1 0x7fe234002670 mon_map magic: 0 v1
2018-06-25 10:59:12.102494 7fe23e2a0700  5 -- 192.168.168.201:0/3046734987 >> 
192.168.168.202:6789/0 conn(0x7fe240176dc0 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=472363 cs=1 l=1). rx mon.1 
seq 2 0x7fe234002b70 auth_reply(proto 2 0 (0) Success) v1
2018-06-25 10:59:12.102542 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 <== 
mon.1 192.168.168.202:6789/0 1  mon_map magic: 0 v1  505+0+0 
(2386987630 0 0) 0x7fe234002670 con 0x7fe240176dc0
2018-06-25 10:59:12.102629 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 <== 
mon.1 192.168.168.202:6789/0 2  auth_reply(proto 2 0 (0) Success) v1  
33+0+0 (1469975654 0 0) 0x7fe234002b70 con 0x7fe240176dc0
2018-06-25 10:59:12.102655 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service mon
2018-06-25 10:59:12.102657 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service osd
2018-06-25 10:59:12.102658 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service mgr
2018-06-25 10:59:12.102661 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service auth
2018-06-25 10:59:12.102662 7fe23ca9d700 10 cephx: validate_tickets want 53 have 
0 need 53
2018-06-25 10:59:12.102666 7fe23ca9d700 10 cephx client: handle_response ret = 0
2018-06-25 10:59:12.102671 7fe23ca9d700 10 cephx client:  got initial server 
challenge 6522ec95fb2eb487
2018-06-25 10:59:12.102673 7fe23ca9d700 10 cephx client: validate_tickets: 
want=53 need=53 have=0
2018-06-25 10:59:12.102674 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service mon
2018-06-25 10:59:12.102675 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service osd
2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service mgr
2018-06-25 10:59:12.102676 7fe23ca9d700 10 cephx: set_have_need_key no handler 
for service auth
2018-06-25 10:59:12.102677 7fe23ca9d700 10 cephx: validate_tickets want 53 have 
0 need 53
2018-06-25 10:59:12.102678 7fe23ca9d700 10 cephx client: want=53 need=53 have=0
2018-06-25 10:59:12.102680 7fe23ca9d700 10 cephx client: build_request
2018-06-25 10:59:12.102702 7fe23da9f700  5 -- 192.168.168.201:0/3046734987 >> 
192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 
seq 1 0x7fe228001490 mon_map magic: 0 v1
2018-06-25 10:59:12.102739 7fe23ca9d700 10 cephx client: get auth session key: 
client_challenge 80f2a24093f783c5
2018-06-25 10:59:12.102743 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 --> 
192.168.168.202:6789/0 -- auth(proto 2 32 bytes epoch 0) v1 -- 0x7fe224002080 
con 0
2018-06-25 10:59:12.102737 7fe23da9f700  5 -- 192.168.168.201:0/3046734987 >> 
192.168.168.201:6789/0 conn(0x7fe24017a420 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=333625 cs=1 l=1). rx mon.0 
seq 2 0x7fe2280019c0 auth_reply(proto 2 0 (0) Success) v1
2018-06-25 10:59:12.102776 7fe23ca9d700  1 -- 192.168.168.201:0/3046734987 <== 
mon.0 192.168.168.201:6789/0 1  mon_map magic: 0 v1  505+0+0 
(2386987630 0 0) 0x7fe228001490 con 0x7fe24017a420
2018-

Re: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
So my colleague Sean Crosby and I were looking through the logs (with debug mds
= 10) and found some references, just before the crash, to an inode number. We
converted it from hex to decimal and got something like 109953*5*627776 (the last
few digits are not necessarily correct). We bumped one digit up, i.e. to 109953*6*627776,
and used that as the value for take_inos, i.e.:

$ cephfs-table-tool all take_inos 1099536627776


After that, the MDS could start successfully and we have a HEALTH_OK cluster 
once more!
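
For anyone hitting the same assert, here is a minimal sketch of the hex to 
decimal conversion described above. The value below is only an example (not the 
number from our logs); substitute whatever inode your MDS log references, and 
only run take_inos while the MDS is stopped:

$ # convert the hex inode number from the MDS log to decimal,
$ # then bump it up a digit and pass it to take_inos as shown above
$ printf '%d\n' 0x10000000000
1099511627776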


It would still be useful if `show inode` in cephfs-table-tool actually showed us 
the max inode number, though. And I think take_inos should be documented in the 
Disaster Recovery guide as well. :)


We'll be monitoring the cluster for the next few days. Hopefully nothing too 
interesting to share after this! 😉


Cheers,

Linh


From: ceph-users  on behalf of Linh Vu 

Sent: Monday, 25 June 2018 7:06:45 PM
To: ceph-users
Subject: [ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't 
start (failing at MDCache::add_inode)


Hi all,


We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active 
and 1 standby MDS. The active MDS crashed and now won't start again with this 
same error:

###

 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 
2018-06-25 16:11:21.133236
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 262: FAILED assert(!p)
###

Right before that point is just a bunch of client connection requests.

There are also a few other inode errors such as:

###
2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : 
loaded dup inode 0x198f00a [2,head] v3426852030 at 
~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already 
exists at ~mds0/stray2/198f00a
###

We've done this for recovery:

$ make sure all MDS are shut down (all crashed by this point anyway)
$ ceph fs set myfs cluster_down true
$ cephfs-journal-tool journal export backup.bin
$ cephfs-journal-tool event recover_dentries summary
Events by type:
  FRAGMENT: 9
  OPEN: 29082
  SESSION: 15
  SUBTREEMAP: 241
  UPDATE: 171835
Errors: 0
$ cephfs-table-tool all reset session
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-table-tool all reset inode
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-journal-tool --rank=myfs:0 journal reset

old journal was 35714605847583~423728061

new journal start will be 35715031236608 (1660964 bytes past old end)
writing journal head
writing EResetJournal entry
done
$ ceph mds fail 0
$ ceph fs reset hpc_projects --yes-i-really-mean-it
$ start up MDS again

However, we keep getting the same error as above.

We found this: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html 
which describes a similar issue, with some suggestions on using the 
cephfs-table-tool take_inos command, as it looks like our problem is that we 
can't create new inodes. However, we don't quite understand the show inode or 
take_inos commands. On our cluster, we see this:

$ cephfs-table-tool 0 show inode
{
"0": {
"data": {
"version": 1,
"inotable": {
"projected_free": [
{
"start": 1099511627776,
"len": 1099511627776
}
],
"free": [
{
"start": 1099511627776,
"len": 1099511627776
}
]
}
},
"result": 0
}
}

Our test cluster shows the exact same output. Running `cephfs-table-tool 
all take_inos 10` (on the test cluster) doesn't seem to change the output 
above, and the inode numbers of newly created files don't jump +100K from where 
they were (likely we misunderstood how take_inos works). On our test cluster (no 
recovery or reset has been run on it), the latest max inode we see from creating 
files and running `ls -li` is 1099511627792, just a tiny bit bigger than the 
"start" value above, which seems to match the file count we've created on it.

How do we find out what is our latest max inode on our production cluster, when 
`show inode` doesn't seem to show us anything useful?


Also, FYI, over a week ago, we had a network failure, and had to perform 
recovery then. The recovery seemed OK, but there were some clients that were 
still running jobs from previously and seemed to have recovered so we were 
still in the proce

Re: [ceph-users] Recovery after datacenter outage

2018-06-25 Thread Christian Zunker
Hi Jason,

your guesses were correct. Thank you for your support.

Just in case, someone else stumbles upon this thread, some more links:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020722.html
http://docs.ceph.com/docs/luminous/rados/operations/user-management/#authorization-capabilities
http://docs.ceph.com/docs/luminous/rbd/rbd-openstack/#setup-ceph-client-authentication
https://github.com/ceph/ceph/pull/15991
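
In case it helps others, a sketch of the Luminous-style caps from the linked 
docs. The client and pool names below are the examples from the OpenStack guide, 
not necessarily what a given deployment uses:

$ # inspect the caps the OpenStack client currently has
$ ceph auth get client.cinder
$ # switch to the rbd profiles so librbd can blacklist dead lock holders
$ ceph auth caps client.cinder \
    mon 'profile rbd' \
    osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd-read-only pool=images'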

Jason Dillaman  schrieb am Fr., 22. Juni 2018 um
22:58 Uhr:

> It sounds like your OpenStack users do not have the correct caps to
> blacklist dead clients. See step 6 in the upgrade section of Luminous’
> release notes or (preferably) use the new “profile rbd”-style caps if you
> don’t use older clients.
>
> The reason why repairing the object map seemed to fix everything was
> because I suspect you performed the op using the admin user, which had the
> caps necessary to blacklist the dead clients and clean up the dirty
> exclusive lock on the image.
>
> On Fri, Jun 22, 2018 at 4:47 PM Gregory Farnum  wrote:
>
>> On Fri, Jun 22, 2018 at 2:26 AM Christian Zunker
>>  wrote:
>>
>>> Hi List,
>>>
>>> we are running a ceph cluster (12.2.5) as backend to our OpenStack cloud.
>>>
>>> Yesterday our datacenter had a power outage. As this wouldn't be enough,
>>> we also had a separated ceph cluster because of networking problems.
>>>
>>> First of all thanks a lot to the ceph developers. After the network was
>>> back to normal, ceph recovered itself. You saved us from a lot of downtime,
>>> lack of sleep and insanity.
>>>
>>> Now to our problem/question:
>>> After ceph recovered, we tried to bring up our VMs. They have cinder
>>> volumes stored in ceph. None of the VMs would start, because of I/O problems
>>> during boot:
>>> [4.393246] JBD2: recovery failed
>>> [4.395949] EXT4-fs (vda1): error loading journal
>>> [4.400811] VFS: Dirty inode writeback failed for block device vda1
>>> (err=-5).
>>> mount: mounting /dev/vda1 on /root failed: Input/output error
>>> done.
>>> Begin: Running /scripts/local-bottom ... done.
>>> Begin: Running /scripts/init-bottom ... mount: mounting /dev on
>>> /root/dev failed: No such file or directory
>>>
>>> We tried to recover the disk with different methods, but all failed
>>> for different reasons. What helped us in the end was rebuilding the
>>> object map of each image:
>>> rbd object-map rebuild volumes/
>>>
>>> From what we understood, object-map is an internal Ceph performance
>>> feature. How can this lead to I/O errors in our VMs?
>>> Is this the expected way for a recovery?
>>> Did we miss something?
>>> Is there any documentation describing what leads to invalid object-maps
>>> and how to recover? (We did not find a doc on that topic...)
>>>
>>
>> An object map definitely shouldn't lead to IO errors in your VMs; in fact
>> I thought it auto-repaired itself if necessary. Maybe the RBD guys can
>> chime in here about probable causes of trouble.
>>
>> My *guess* is that perhaps your VMs or QEMU were configured to ignore
>> barriers or some similar thing, so that when the power failed a write was
>> "lost" as it got written to a new RBD object but not committed into the
>> object map, but the FS or database journal recorded it as complete. I can't
>> be sure about that though.
>> -Greg
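
(A quick way to check that theory, as a sketch only; the domain name below is a 
placeholder:)

$ # inside the guest: ext4 uses barriers by default, "nobarrier" would show here
$ grep -w nobarrier /proc/mounts
$ # on the hypervisor: which cache mode QEMU is using for the RBD disk
$ virsh dumpxml <instance-domain> | grep -i 'cache='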
>>
>>
>>>
>>>
>>> regards
>>> Christian
>> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Help! Luminous 12.2.5 CephFS - MDS crashed and now won't start (failing at MDCache::add_inode)

2018-06-25 Thread Linh Vu
Hi all,


We have a Luminous 12.2.5 cluster, running entirely just CephFS with 1 active 
and 1 standby MDS. The active MDS crashed and now won't start again with this 
same error:

###

 0> 2018-06-25 16:11:21.136203 7f01c2749700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 In function 'void MDCache::add_inode(CInode*)' thread 7f01c2749700 time 
2018-06-25 16:11:21.133236
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.5/rpm/el7/BUILD/ceph-12.2.5/src/mds/MDCache.cc:
 262: FAILED assert(!p)
###

Right before that point is just a bunch of client connection requests.

There are also a few other inode errors such as:

###
2018-06-25 09:30:37.889166 7f934c1e5700 -1 log_channel(cluster) log [ERR] : 
loaded dup inode 0x198f00a [2,head] v3426852030 at 
~mds0/stray5/198f00a, but inode 0x198f00a.head v3426838533 already 
exists at ~mds0/stray2/198f00a
###

We've done this for recovery:

$ make sure all MDS are shut down (all crashed by this point anyway)
$ ceph fs set myfs cluster_down true
$ cephfs-journal-tool journal export backup.bin
$ cephfs-journal-tool event recover_dentries summary
Events by type:
  FRAGMENT: 9
  OPEN: 29082
  SESSION: 15
  SUBTREEMAP: 241
  UPDATE: 171835
Errors: 0
$ cephfs-table-tool all reset session
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-table-tool all reset inode
{
"0": {
"data": {},
"result": 0
}
}
$ cephfs-journal-tool --rank=myfs:0 journal reset

old journal was 35714605847583~423728061

new journal start will be 35715031236608 (1660964 bytes past old end)
writing journal head
writing EResetJournal entry
done
$ ceph mds fail 0
$ ceph fs reset hpc_projects --yes-i-really-mean-it
$ start up MDS again

However, we keep getting the same error as above.

We found this: 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-December/023136.html 
which describes a similar issue, with some suggestions on using the 
cephfs-table-tool take_inos command, as it looks like our problem is that we 
can't create new inodes. However, we don't quite understand the show inode or 
take_inos commands. On our cluster, we see this:

$ cephfs-table-tool 0 show inode
{
"0": {
"data": {
"version": 1,
"inotable": {
"projected_free": [
{
"start": 1099511627776,
"len": 1099511627776
}
],
"free": [
{
"start": 1099511627776,
"len": 1099511627776
}
]
}
},
"result": 0
}
}

Our test cluster shows the exact same output. Running `cephfs-table-tool 
all take_inos 10` (on the test cluster) doesn't seem to change the output 
above, and the inode numbers of newly created files don't jump +100K from where 
they were (likely we misunderstood how take_inos works). On our test cluster (no 
recovery or reset has been run on it), the latest max inode we see from creating 
files and running `ls -li` is 1099511627792, just a tiny bit bigger than the 
"start" value above, which seems to match the file count we've created on it.

How do we find out what is our latest max inode on our production cluster, when 
`show inode` doesn't seem to show us anything useful?
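
As a rough sanity check (assuming the first "free" range in the inotable starts 
at the next inode the MDS would hand out), the numbers above line up with the 
usual CephFS client inode base:

$ # 1099511627776 is simply 0x10000000000, the start of the client inode range
$ printf '0x%x\n' 1099511627776
0x10000000000
$ # and our observed max inode on the test cluster is only 16 past that base
$ echo $((1099511627792 - 1099511627776))
16

So on a freshly reset table the output above says nothing about inodes that 
already exist on disk, which is why take_inos with a value derived from the logs 
(or from the largest inode visible via `ls -li` or backups) seems to be the only 
practical way forward.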


Also, FYI: over a week ago we had a network failure and had to perform recovery 
then. The recovery seemed OK, but some clients were still running jobs from 
before the failure and appeared to have recovered, so we were still in the 
process of draining and rebooting them as they finished their jobs. Some came 
back with bad files, but nothing that caused trouble until now.

Very much appreciate any help!

Cheers,

Linh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] unfound blocks IO or gives IO error?

2018-06-25 Thread Dan van der Ster
On Fri, Jun 22, 2018 at 10:44 PM Gregory Farnum  wrote:
>
> On Fri, Jun 22, 2018 at 6:22 AM Sergey Malinin  wrote:
>>
>> From 
>> http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/ :
>>
>> "Now 1 knows that these object exist, but there is no live ceph-osd who has 
>> a copy. In this case, IO to those objects will block, and the cluster will 
>> hope that the failed node comes back soon; this is assumed to be preferable 
>> to returning an IO error to the user."
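
(For reference, a sketch of the commands for inspecting and, only as a last 
resort, giving up on unfound objects; the pg ids below are placeholders:)

$ # which PGs report unfound objects
$ ceph health detail | grep unfound
$ # list the missing/unfound objects for a given PG
$ ceph pg 2.4 list_missing
$ # last resort only: revert to an older copy or delete the unfound objects
$ ceph pg 2.4 mark_unfound_lost revert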
>
>
> This is definitely the default and the way I recommend you run a cluster. But 
> do keep in mind sometimes other layers in your stack have their own timeouts 
> and will start throwing errors if the Ceph library doesn't return an IO 
> quickly enough. :)

Right, that's understood. This is where virtio-blk behaves more nicely than
virtio-scsi: the latter has a timeout, while blk just blocks forever.
Across 5000 attached volumes we saw around 12 of these IO errors, and this
was the first time in 5 years of upgrades that an IO error happened...
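
For what it's worth, the guest-side timeout that bites here is the standard SCSI 
command timeout, which can be inspected and raised per device (sda below is a 
placeholder) if blocking is preferable to EIO during long recoveries:

$ cat /sys/block/sda/device/timeout
30
$ # raise it (seconds); persist via a udev rule if desired
$ echo 300 > /sys/block/sda/device/timeout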

-- dan


> -Greg
>
>>
>>
>> On 22.06.2018, at 16:16, Dan van der Ster  wrote:
>>
>> Hi all,
>>
>> Quick question: does an IO with an unfound object result in an IO
>> error or should the IO block?
>>
>> During a jewel to luminous upgrade some PGs passed through a state
>> with unfound objects for a few seconds. And this seems to match the
>> times when we had a few IO errors on RBD attached volumes.
>>
>> Wondering what is the correct behaviour here...
>>
>> Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com