Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread shadow_lin
Hi David,
I am sure most (if not all) of the data is in one pool;
rbd_pool only holds the omap data for the EC RBD images.

ceph df:
GLOBAL:
SIZE AVAIL   RAW USED %RAW USED
427T 100555G 329T 77.03
POOLS:
NAME        ID USED %USED MAX AVAIL  OBJECTS
ec_rbd_pool 3  219T 81.40    50172G 57441718
rbd_pool    4  144      0    37629G       19



2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 10:21
Subject: Re: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

You have 2 different pools. PGs in each pool are going to be a different size.  
It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's 
and Y's. Having equal PG counts on each osd is only balanced if you have a 
single pool or have a case where all PGs are identical in size. The latter is 
not likely.
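
One way to see the effect per OSD is to sum each PG's byte count over its up set. This is a rough sketch, assuming jq is available and that this release exposes the stats under .pg_stats in the JSON dump (newer releases nest them under .pg_map):

ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_stats[] | .up[] as $osd | "\($osd) \(.stat_sum.num_bytes)"' \
  | awk '{sum[$1]+=$2} END {for (o in sum) printf "osd.%s %.1f GiB\n", o, sum[o]/2^30}' \
  | sort -V

OSDs with identical PG counts can still end up with very different totals when the pools' PGs differ in size.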


On Mon, Jun 25, 2018, 10:02 PM shadow_lin  wrote:

Hi David,
I'm afraid I can't run the command you provided now, because I tried
removing another OSD on that host to see if it would make the data distribution
even, and it did.
The PG numbers of my pools are powers of 2.
Below is from my notes from before I removed the other OSD:
pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash 
rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags 
hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull 
stripe_width 0 application rbd
PG distribution per OSD for all pools:
https://pasteboard.co/HrBZv3s.png

What I don't understand is why the data distribution is uneven when the PG
distribution is even.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output 
of the following commands.  It will print out all of your pool names along with 
how many PGs are in that pool.  My guess is that you don't have a power of 2 
number of PGs in your pool.  Alternatively you might have multiple pools and 
the PGs from the various pools are just different sizes.


ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; 
do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df


For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32



GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36     7006G  263000
cephfs_metadata 6  141M    0.05      268G   11945
cephfs_data     7  10746G 43.41    14012G 2795782
rbd-replica-ssd 9  241G   47.30      268G   75061


On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

Hi List,
   The environment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSD per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the OSDs, I noticed a lot of data moving onto
the other OSD on the same host.

   The ceph osd df result is (osd.20 and osd.21 are on the same host and osd.20 was
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134

   I am using RBD only, so the objects should all be 4M. I don't understand why
osd.21 got significantly more data with the same PG count as the other OSDs.
   Is this behavior expected, did I misconfigure something, or is it some kind of bug?

   Thanks


2018-06-25
shadow_lin 


Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-25 Thread shadow_lin
Hi David,
I'm afraid I can't run the command you provided now, because I tried
removing another OSD on that host to see if it would make the data distribution
even, and it did.
The PG numbers of my pools are powers of 2.
Below is from my notes from before I removed the other OSD:
pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash 
rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags 
hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash 
rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull 
stripe_width 0 application rbd
PG distribution per OSD for all pools:
https://pasteboard.co/HrBZv3s.png

What I don't understand is why the data distribution is uneven when the PG
distribution is even.

2018-06-26 

shadow_lin 



From: David Turner
Sent: 2018-06-26 01:24
Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

I should be able to answer this question for you if you can supply the output 
of the following commands.  It will print out all of your pool names along with 
how many PGs are in that pool.  My guess is that you don't have a power of 2 
number of PGs in your pool.  Alternatively you might have multiple pools and 
the PGs from the various pools are just different sizes.


ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; 
do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
ceph df


For me the output looks like this.
rbd: 64
cephfs_metadata: 64
cephfs_data: 256
rbd-ssd: 32



GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
46053G 26751G   19301G 41.91
POOLS:
NAME            ID USED   %USED MAX AVAIL OBJECTS
rbd-replica     4  897G   11.36     7006G  263000
cephfs_metadata 6  141M    0.05      268G   11945
cephfs_data     7  10746G 43.41    14012G 2795782
rbd-replica-ssd 9  241G   47.30      268G   75061


On Sun, Jun 24, 2018 at 9:48 PM shadow_lin  wrote:

Hi List,
   The environment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSD per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the OSDs, I noticed a lot of data moving onto
the other OSD on the same host.

   The ceph osd df result is (osd.20 and osd.21 are on the same host and osd.20 was
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134

   I am using RBD only, so the objects should all be 4M. I don't understand why
osd.21 got significantly more data with the same PG count as the other OSDs.
   Is this behavior expected, did I misconfigure something, or is it some kind of bug?

   Thanks


2018-06-25
shadow_lin 


[ceph-users] Uneven data distribution with even pg distribution after rebalancing

2018-06-24 Thread shadow_lin
Hi List,
   The environment is:
   Ceph 12.2.4
   Balancer module on and in upmap mode
   Failure domain is per host, 2 OSD per host
   EC k=4 m=2
   PG distribution is almost even before and after the rebalancing.


   After marking out one of the OSDs, I noticed a lot of data moving onto
the other OSD on the same host.

   The ceph osd df result is (osd.20 and osd.21 are on the same host and osd.20 was
marked out):

ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134
   
   I am using RBD only, so the objects should all be 4M. I don't understand why
osd.21 got significantly more data with the same PG count as the other OSDs.
   Is this behavior expected, did I misconfigure something, or is it some kind of bug?
   
   Thanks


2018-06-25
shadow_lin 


Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread shadow_lin
Thanks.
Is there any workaround for 10.2.10 to avoid all OSDs starting to split at the same
time?
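
A possible stopgap, sketched here as an assumption rather than a tested recipe, is to stagger the existing filestore_split_multiple threshold per OSD so the OSDs don't all cross it at the same time:

# check whether the new option exists on this build first
ceph daemon osd.0 config get filestore_split_rand_factor
# if it doesn't, nudge the split threshold a little per OSD (example values only)
for osd in 0 1 2 3; do
  ceph tell osd.$osd injectargs "--filestore_split_multiple $((4 + osd % 3))"
done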

2018-04-01 


shadowlin




From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Sent: 2018-04-01 22:39
Subject: Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?
To: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>
Cc:

No, it is supported in the next version of Jewel 
http://tracker.ceph.com/issues/22658
 
From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of shadow_lin 
<shadow_...@163.com>
Date: Sunday, April 1, 2018 at 3:53 AM
To: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: [ceph-users] Does jewel 10.2.10 support 
filestore_split_rand_factor?
 
Hi list,
The document page of jewel has filestore_split_rand_factor config but I can't 
find the config by using 'ceph daemon osd.x config'.
 
ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


 
2018-04-01



shadow_lin


[ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

2018-04-01 Thread shadow_lin
Hi list,
The Jewel documentation page lists the filestore_split_rand_factor option, but I can't
find it using 'ceph daemon osd.x config show'.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)
ceph daemon osd.0 config show|grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",


2018-04-01


shadow_lin


Re: [ceph-users] ceph mgr balancer bad distribution

2018-03-29 Thread shadow_lin
Hi Stefan,
> Am 28.02.2018 um 13:47 schrieb Stefan Priebe - Profihost AG:
>> Hello,
>>
>> with jewel we always used the python crush optimizer which gave us a
>> pretty good distribution fo the used space.
>>
You mentioned a Python CRUSH optimizer for Jewel. Could you tell me where I can
find it? Can it be used with multiple pools?
Thank you.

2018-03-29 


shadowlin


Re: [ceph-users] remove big rbd image is very slow

2018-03-27 Thread shadow_lin
I have done that before, but most of the time I can't just delete the pool.
Is there any other way to speed up rbd image deletion?

2018-03-27 


shadowlin




From: Ilya Dryomov <idryo...@gmail.com>
Sent: 2018-03-26 20:09
Subject: Re: [ceph-users] remove big rbd image is very slow
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Sat, Mar 17, 2018 at 5:11 PM, shadow_lin <shadow_...@163.com> wrote: 
> Hi list, 
> My ceph version is jewel 10.2.10. 
> I tired to use rbd rm to remove a 50TB image(without object map because krbd 
> does't support it).It takes about 30mins to just complete about 3%. Is this 
> expected? Is there a way to make it faster? 
> I know there are scripts to delete rados objects of the rbd image to make it 
> faster. But is the slowness expected for rbd rm command? 
> 
> PS: I also encounter very slow rbd export for large rbd image(20TB image but 
> with only a few GB data).Takes hours to completed the export.I guess both 
> are related to object map not enabled, but krbd doesn't support object map 
> feature. 

If you don't have any other images in that pool, you can simply delete 
the pool with "ceph osd pool delete".  It'll take a second ;) 

Thanks, 

Ilya


[ceph-users] remove big rbd image is very slow

2018-03-17 Thread shadow_lin
Hi list,
My ceph version is jewel 10.2.10.
I tried to use rbd rm to remove a 50TB image (without object map, because krbd
doesn't support it). It takes about 30 minutes just to complete about 3%. Is this
expected? Is there a way to make it faster?
I know there are scripts that delete the RADOS objects of the rbd image to make it
faster. But is this slowness expected for the rbd rm command?

PS: I also see very slow rbd export for a large rbd image (a 20TB image with
only a few GB of data); it takes hours to complete the export. I guess both are
related to the object map not being enabled, but krbd doesn't support the object map feature.
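
For reference, a rough sketch of the object-deletion approach mentioned above (pool and image names are placeholders, jq is assumed, and the block_name_prefix should be verified with rbd info before running anything):

POOL=rbd            # placeholder
IMAGE=big-image     # placeholder
# data objects are named <block_name_prefix>.<offset>, e.g. rbd_data.<id>.*
PREFIX=$(rbd -p "$POOL" info "$IMAGE" --format json | jq -r .block_name_prefix)
# delete the backing objects in parallel, then remove the now-cheap image metadata
rados -p "$POOL" ls | grep "^$PREFIX" | xargs -P 8 -n 100 rados -p "$POOL" rm
rbd -p "$POOL" rm "$IMAGE"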




2018-03-18



shadowlin


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-11 Thread shadow_lin
Hi Jason,
How is the old target gateway blacklisted? Is it a feature that the target
gateway (which can support active/passive multipath) has to provide, or is it
done purely by the RBD exclusive lock?
I think the exclusive lock only lets one client write to the RBD at a time, but
another client can obtain the lock later once the lock is released.

2018-03-11 


shadowlin




From: Jason Dillaman <jdill...@redhat.com>
Sent: 2018-03-11 07:46
Subject: Re: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>
Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>

On Sat, Mar 10, 2018 at 10:11 AM, shadow_lin <shadow_...@163.com> wrote: 
> Hi Jason, 
> 
>>As discussed in this thread, for active/passive, upon initiator 
>>failover, we used the RBD exclusive-lock feature to blacklist the old 
>>"active" iSCSI target gateway so that it cannot talk w/ the Ceph 
>>cluster before new writes are accepted on the new target gateway. 
> 
> I can get during the new active target gateway was talking to rbd the old 
> active target gateway cannot write because of the RBD exclusive-lock 
> But after the new target gateway done the writes,if the old target gateway 
> had some blocked io during the failover,cant it then get the lock and 
> overwrite the new writes? 

Negative -- it's blacklisted so it cannot talk to the cluster. 

> PS: 
> Petasan say they can do active/active iscsi with patched suse kernel. 

I'll let them comment on these corner cases. 

> 2018-03-10 
>  
> shadowlin 
> 
>  
> 
> From: Jason Dillaman <jdill...@redhat.com>
> Sent: 2018-03-10 21:40
> Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
> To: "shadow_lin"<shadow_...@163.com>
> Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
> 
> On Sat, Mar 10, 2018 at 7:42 AM, shadow_lin <shadow_...@163.com> wrote: 
>> Hi Mike, 
>> So for now only suse kernel with target_rbd_core and tcmu-runner can run 
>> active/passive multipath safely? 
> 
> Negative, the LIO / tcmu-runner implementation documented here [1] is 
> safe for active/passive. 
> 
>> I am a newbie to iscsi. I think the stuck io get excuted cause overwrite 
>> problem can happen with both active/active and active/passive. 
>> What makes the active/passive safer than active/active? 
> 
> As discussed in this thread, for active/passive, upon initiator 
> failover, we used the RBD exclusive-lock feature to blacklist the old 
> "active" iSCSI target gateway so that it cannot talk w/ the Ceph 
> cluster before new writes are accepted on the new target gateway. 
> 
>> What mechanism should be implement to avoid the problem with 
>> active/passive 
>> and active/active multipath? 
> 
> Active/passive it solved as discussed above. For active/active, we 
> don't have a solution that is known safe under all failure conditions. 
> If LIO supported MCS (multiple connections per session) instead of 
> just MPIO (multipath IO), the initiator would provide enough context 
> to the target to detect IOs from a failover situation. 
> 
>> 2018-03-10 
>> ____ 
>> shadowlin 
>> 
>>  
>> 
>> From: Mike Christie <mchri...@redhat.com>
>> Sent: 2018-03-09 00:54
>> Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
>> To: "shadow_lin"<shadow_...@163.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
>> Cc:
>> 
>> On 03/07/2018 09:24 AM, shadow_lin wrote: 
>>> Hi Christie, 
>>> Is it safe to use active/passive multipath with krbd with exclusive lock 
>>> for lio/tgt/scst/tcmu? 
>> 
>> No. We tried to use lio and krbd initially, but there is a issue where 
>> IO might get stuck in the target/block layer and get executed after new 
>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could 
>> add some code tcmu's file_example handler which can be used with krbd so 
>> it works like the rbd one. 
>> 
>> I do know enough about SCST right now. 
>> 
>> 
>>> Is it safe to use active/active multipath If use suse kernel with 
>>> target_core_rbd? 
>>> Thanks. 
>>> 
>>> 2018-03-07 
>>

Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-10 Thread shadow_lin
Hi Jason,

>As discussed in this thread, for active/passive, upon initiator 
>failover, we used the RBD exclusive-lock feature to blacklist the old 
>"active" iSCSI target gateway so that it cannot talk w/ the Ceph 
>cluster before new writes are accepted on the new target gateway. 

I get that while the new active target gateway was talking to RBD, the old
active target gateway could not write because of the RBD exclusive lock.
But after the new target gateway has done its writes, if the old target gateway had
some blocked IO during the failover, can't it then get the lock and overwrite the
new writes?

PS:
PetaSAN says they can do active/active iSCSI with a patched SUSE kernel.

2018-03-10 



shadowlin




From: Jason Dillaman <jdill...@redhat.com>
Sent: 2018-03-10 21:40
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>
Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>

On Sat, Mar 10, 2018 at 7:42 AM, shadow_lin <shadow_...@163.com> wrote: 
> Hi Mike, 
> So for now only suse kernel with target_rbd_core and tcmu-runner can run 
> active/passive multipath safely? 

Negative, the LIO / tcmu-runner implementation documented here [1] is 
safe for active/passive. 

> I am a newbie to iscsi. I think the stuck io get excuted cause overwrite 
> problem can happen with both active/active and active/passive. 
> What makes the active/passive safer than active/active? 

As discussed in this thread, for active/passive, upon initiator 
failover, we used the RBD exclusive-lock feature to blacklist the old 
"active" iSCSI target gateway so that it cannot talk w/ the Ceph 
cluster before new writes are accepted on the new target gateway. 
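
For reference, a hedged sketch of commands to inspect that fencing from the Ceph side (the pool, image and address below are placeholders):

rbd status rbd_pool/image1        # current watchers of the image
rbd lock list rbd_pool/image1     # current exclusive-lock owner, if any
ceph osd blacklist ls             # entries added when a gateway gets fenced
ceph osd blacklist add 192.168.1.10:0/3418895123 3600   # manual fence for one hour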

> What mechanism should be implement to avoid the problem with active/passive 
> and active/active multipath? 

Active/passive is solved as discussed above. For active/active, we 
don't have a solution that is known safe under all failure conditions. 
If LIO supported MCS (multiple connections per session) instead of 
just MPIO (multipath IO), the initiator would provide enough context 
to the target to detect IOs from a failover situation. 

> 2018-03-10 
>  
> shadowlin 
> 
>  
> 
> From: Mike Christie <mchri...@redhat.com>
> Sent: 2018-03-09 00:54
> Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
> To: "shadow_lin"<shadow_...@163.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
> Cc:
> 
> On 03/07/2018 09:24 AM, shadow_lin wrote: 
>> Hi Christie, 
>> Is it safe to use active/passive multipath with krbd with exclusive lock 
>> for lio/tgt/scst/tcmu? 
> 
> No. We tried to use lio and krbd initially, but there is a issue where 
> IO might get stuck in the target/block layer and get executed after new 
> IO. So for lio, tgt and tcmu it is not safe as is right now. We could 
> add some code tcmu's file_example handler which can be used with krbd so 
> it works like the rbd one. 
> 
> I do know enough about SCST right now. 
> 
> 
>> Is it safe to use active/active multipath If use suse kernel with 
>> target_core_rbd? 
>> Thanks. 
>> 
>> 2018-03-07 
>>  
>> shadowlin 
>> 
>>  
>> 
>> *From:* Mike Christie <mchri...@redhat.com>
>> *Sent:* 2018-03-07 03:51
>> *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
>> *To:* "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
>> *Cc:*
>> 
>> On 03/06/2018 01:17 PM, Lazuardi Nasution wrote: 
>> > Hi, 
>> > 
>> > I want to do load balanced multipathing (multiple iSCSI 
>> gateway/exporter 
>> > nodes) of iSCSI backed with RBD images. Should I disable exclusive 
>> lock 
>> > feature? What if I don't disable that feature? I'm using TGT (manual 
>> > way) since I get so many CPU stuck error messages when I was using 
>> LIO. 
>> > 
>> 
>> You are using LIO/TGT with krbd right? 
>> 
>> You cannot or shouldn't do active/active multipathing. If you have the 
>> lock enabled then it bounces between paths for each IO and will be 
>> slow. 
>> If you do not have it enabled then you can end up with stale IO 
>> overwriting current data. 
>> 
>> 
>> 
>> 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

[1] http://docs.ceph.com/docs/master/rbd/iscsi-overview/ 

--  
Jason


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-10 Thread shadow_lin
Hi Mike,
So for now only the SUSE kernel with target_core_rbd, or tcmu-runner, can run
active/passive multipath safely?
I am a newbie to iSCSI. I think the problem of stuck IO getting executed and causing
overwrites can happen with both active/active and active/passive.
What makes active/passive safer than active/active?
What mechanism should be implemented to avoid the problem with active/passive and
active/active multipath?
2018-03-10 


shadowlin




From: Mike Christie <mchri...@redhat.com>
Sent: 2018-03-09 00:54
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
Cc:

On 03/07/2018 09:24 AM, shadow_lin wrote: 
> Hi Christie, 
> Is it safe to use active/passive multipath with krbd with exclusive lock 
> for lio/tgt/scst/tcmu? 

No. We tried to use lio and krbd initially, but there is an issue where
IO might get stuck in the target/block layer and get executed after new
IO. So for lio, tgt and tcmu it is not safe as is right now. We could
add some code to tcmu's file_example handler which can be used with krbd so
it works like the rbd one.

I do not know enough about SCST right now.


> Is it safe to use active/active multipath If use suse kernel with 
> target_core_rbd? 
> Thanks. 
>   
> 2018-03-07 
>  
> shadowlin 
>   
>  
>  
> *From:* Mike Christie <mchri...@redhat.com>
> *Sent:* 2018-03-07 03:51
> *Subject:* Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
> *To:* "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
> *Cc:*
>   
> On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:  
> > Hi,  
> >   
> > I want to do load balanced multipathing (multiple iSCSI 
> gateway/exporter  
> > nodes) of iSCSI backed with RBD images. Should I disable exclusive lock 
>  
> > feature? What if I don't disable that feature? I'm using TGT (manual  
> > way) since I get so many CPU stuck error messages when I was using LIO. 
>  
> >   
>   
> You are using LIO/TGT with krbd right?  
>   
> You cannot or shouldn't do active/active multipathing. If you have the  
> lock enabled then it bounces between paths for each IO and will be slow.  
> If you do not have it enabled then you can end up with stale IO  
> overwriting current data.  
>   
>   
>   


Re: [ceph-users] [jewel] High fs_apply_latency osds

2018-03-10 Thread shadow_lin
Hi Chris,
The OSDs are running on ARM nodes. Every node has a two-core 1.5 GHz 32-bit ARM
CPU and 2 GB of RAM and runs 2 OSDs. The HDDs are 10 TB and the journal is colocated
with the data on the same disk.
The drives are half full now, but the problem I described also happened when the
HDDs were empty. The filesystem is ext4 because I have some problems running XFS for
now.
I am trying to balance the PG distribution better now to see if it can ease the
high latency problem.

2018-03-10 


shadowlin




From: Chris Hoy Poy <chr...@base3.com.au>
Sent: 2018-03-10 09:44
Subject: Re: [ceph-users] [jewel] High fs_apply_latency osds
To: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>
Cc:

Hi Shadowlin,


Can you describe your hardware? CPU/RAM/hard drives involved, etc.


Also how your drives are set up?


How full are the drives? What filesystem is it?


Cheers
/chris






Sent from my SAMSUNG Galaxy S6 on the Telstra Mobile Network




---- Original message 
From: shadow_lin <shadow_...@163.com> 
Date: 10/3/18 1:41 am (GMT+08:00) 
To: ceph-users <ceph-users@lists.ceph.com> 
Subject: [ceph-users] [jewel] High fs_apply_latency osds 


Hi list,
During my write test,I find there are always some of the osds have high 
fs_apply_latency(1k-5kms,2-8times more than others). At first I think it is 
caused by unbalanced pg distribution, but after I reweight the osds the problem 
hasn't gone away.
I looked into the osds with high latency and find one thing in common.There are 
alot of read activities on these osds.Result of iostat shows there are 400-500 
r/s and 2000-3000 rKB/s on these high latnecy osds while the normal osds have 
around 100 r/s 300-400 rKB/s.
I tried to restart osd daemon with high latency.It did became normal sometime 
but there will be another high latency osds come out.And the new high latency 
osd has the same high read activities.

What are these read activities for? Is there a way to lower the latency?

Thanks


2018-03-10



shadowlin






[ceph-users] [jewel] High fs_apply_latency osds

2018-03-09 Thread shadow_lin
Hi list,
During my write test I find there are always some OSDs with high
fs_apply_latency (1k-5k ms, 2-8 times more than the others). At first I thought it was
caused by an unbalanced PG distribution, but after I reweighted the OSDs the problem
hasn't gone away.
I looked into the OSDs with high latency and found one thing in common: there is
a lot of read activity on these OSDs. iostat shows 400-500
r/s and 2000-3000 rKB/s on these high-latency OSDs, while the normal OSDs have
around 100 r/s and 300-400 rKB/s.
I tried restarting the OSD daemons with high latency. They did become normal for a
while, but then other high-latency OSDs appear, and the new high-latency
OSDs show the same high read activity.

What are these read activities for? Is there a way to lower the latency?

Thanks


2018-03-10



shadowlin


Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs

2018-03-08 Thread shadow_lin
Thanks for your advice.
I will try reweighting the OSDs of my cluster.

Why is Ceph so sensitive to unbalanced PG distribution under high load? The ceph
osd df result is: https://pastebin.com/ur4Q9jsA. The ceph osd perf result is:
https://pastebin.com/87DitPhV

No OSD has a very high PG count compared to the others. When the write test
load is low everything seems fine, but during the high write load test some of the
OSDs with more PGs can have 3-10 times the fs_apply_latency of the others.

My guess is that the highly loaded OSDs slow the whole cluster (because I have
only one pool spanning all OSDs) down to the level they can handle, so the other
OSDs see a lower load and have good latency.

Is this expected during high load (indicating the load is too high for the current
cluster to handle)?

How does Luminous solve the uneven PG distribution problem? I read that there
is a pg-upmap exception table in the osdmap in Luminous 12.2.x, and that with it
a perfect PG distribution among OSDs can be achieved.
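
For reference, a hedged sketch of the upmap route on Luminous 12.2.x (review what osdmaptool emits before applying anything; the pool name is a placeholder):

ceph osd set-require-min-compat-client luminous   # upmap needs Luminous-aware clients
ceph balancer mode upmap
ceph balancer on
# offline alternative: generate the pg-upmap-items commands and review them first
ceph osd getmap -o om
osdmaptool om --upmap upmap.sh --upmap-pool <pool-name>
cat upmap.sh    # emits "ceph osd pg-upmap-items ..." lines; run them once satisfied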

2018-03-09 

shadow_lin 



From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-09 06:45
Subject: Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

PGs being unevenly distributed is a common occurrence in Ceph.  Luminous 
started making some steps towards correcting this, but you're in Jewel.  There 
are a lot of threads in the ML archives about fixing PG distribution.  
Generally every method comes down to increasing the weight on OSDs with too few 
PGs and decreasing the weight on the OSDs with too many PGs.  There are a lot 
of schools of thought on the best way to implement this in your environment 
which has everything to do with your client IO patterns and workloads.  Looking 
into `ceph osd reweight-by-pg` might be a good place for you to start as you 
are only looking at 1 pool in your cluster.  If you have more pools, you 
generally need `ceph osd reweight-by-utilization`.
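
A rough illustration of the reweight route (dry-run first; 110 means OSDs more than 10% above the mean get adjusted, and the pool name is a placeholder):

ceph osd test-reweight-by-utilization 110   # preview only, changes nothing
ceph osd reweight-by-utilization 110
# a single-pool cluster can target that pool's PG counts instead
ceph osd test-reweight-by-pg 110 <pool-name>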


On Wed, Mar 7, 2018 at 8:19 AM shadow_lin <shadow_...@163.com> wrote:

Hi list,
   Ceph version is jewel 10.2.10 and all osd are using filestore.
The Cluster has 96 osds and 1 pool with size=2 replication with 4096 pg(base on 
pg calculate method from ceph doc for 100pg/per osd).
The osd with the most pg count has 104 PGs and there are 6 osds have above 100 
PGs
Most of the osd have around 7x-9x PGs
The osd with the least pg count has 58 PGs

During the write test some of the osds have very high fs_apply_latency like 
1000ms-4000ms while the normal ones are like 100-600ms. The osds with high 
latency are always the ones with more pg on it.

iostat on the high latency osd shows the hdds are having high %util at about 
95%-96% while the normal ones are having %util at 40%-60%

I think the reason to cause this is because the osds have more pgs need to 
handle more write request to it.Is this right?
But even though the pg distribution is not even but the variation is not that 
much.How could the performance be so sensitive to it?

Is there anything I can do to improve the performance and reduce the latency?

How can I make the pg distribution to be more even?

Thanks


2018-03-07



shadowlin



Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-07 Thread shadow_lin
Hi David,
Thanks for the info.
Could I assume that if I use active/passive multipath with the RBD exclusive lock,
then all targets which support rbd (via a block device) are safe?
2018-03-08 

shadow_lin 



From: David Disseldorp <dd...@suse.de>
Sent: 2018-03-08 08:47
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>
Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>

Hi shadowlin, 

On Wed, 7 Mar 2018 23:24:42 +0800, shadow_lin wrote: 

> Is it safe to use active/active multipath If use suse kernel with 
> target_core_rbd? 
> Thanks. 

A cross-gateway failover race-condition similar to what Mike described 
is currently possible with active/active target_core_rbd. It's a corner 
case that is dependent on a client assuming that unacknowledged I/O has 
been implicitly terminated and can be resumed via an alternate path, 
while the original gateway at the same time issues the original request 
such that it reaches the Ceph cluster after differing I/O to the same 
region via the alternate path. 
It's not something that we've observed in the wild, but is nevertheless 
a bug that is being worked on, with a resolution that should also be 
usable for active/active tcmu-runner. 

Cheers, David


Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

2018-03-07 Thread shadow_lin
Hi Christie,
Is it safe to use active/passive multipath with krbd with exclusive lock for
lio/tgt/scst/tcmu?
Is it safe to use active/active multipath if I use the SUSE kernel with
target_core_rbd?
Thanks.

2018-03-07 


shadowlin




From: Mike Christie
Sent: 2018-03-07 03:51
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "Lazuardi Nasution", "Ceph Users"
Cc:

On 03/06/2018 01:17 PM, Lazuardi Nasution wrote: 
> Hi, 
>  
> I want to do load balanced multipathing (multiple iSCSI gateway/exporter 
> nodes) of iSCSI backed with RBD images. Should I disable exclusive lock 
> feature? What if I don't disable that feature? I'm using TGT (manual 
> way) since I get so many CPU stuck error messages when I was using LIO. 
>  

You are using LIO/TGT with krbd right? 

You cannot or shouldn't do active/active multipathing. If you have the 
lock enabled then it bounces between paths for each IO and will be slow. 
If you do not have it enabled then you can end up with stale IO 
overwriting current data.


[ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs

2018-03-07 Thread shadow_lin
Hi list,
   The Ceph version is Jewel 10.2.10 and all OSDs are using filestore.
The cluster has 96 OSDs and 1 pool with size=2 replication and 4096 PGs (based on
the PG calculation method from the Ceph docs, targeting 100 PGs per OSD).
The OSD with the highest PG count has 104 PGs, and 6 OSDs have more than 100 PGs.
Most of the OSDs have around 70-90 PGs.
The OSD with the lowest PG count has 58 PGs.

During the write test some of the OSDs have very high fs_apply_latency, around
1000-4000 ms, while the normal ones are around 100-600 ms. The OSDs with high
latency are always the ones with more PGs.

iostat on the high-latency OSDs shows the HDDs at about 95%-96% %util, while the
normal ones are at 40%-60%.

I think the reason is that the OSDs with more PGs have to handle more write
requests. Is this right?
But even though the PG distribution is not even, the variation is not that
large. How can the performance be so sensitive to it?

Is there anything I can do to improve the performance and reduce the latency?

How can I make the PG distribution more even?

Thanks


2018-03-07



shadowlin


Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-07 Thread shadow_lin
What you said makes sense.
I have encountered a few hardware-related issues that caused one OSD to behave
abnormally and block all IO of the whole cluster (all OSDs are in one pool), which
makes me think about how to avoid this situation.

2018-03-07 

shadow_lin 



From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-07 13:51
Subject: Re: Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Marking osds down is not without risks. You are taking away one of the copies 
of data for every PG on that osd. Also you are causing every PG on that osd to 
peer. If that osd comes back up, every PG on it again needs to peer and then 
they need to recover.


That is a lot of load and risks to automate into the system. Now let's take 
into consideration other causes of slow requests like having more IO load than 
your spindle can handle, backfilling settings set too aggressively (related to 
the first option), or networking problems. If the mon is detecting slow 
requests on OSDs and marking them down, you could end up marking half of your 
cluster down or causing corrupt data by flapping OSDs.


The mon will mark osds down if those settings I mentioned are met. If the osd 
isn't unresponsive enough to not respond to other OSDs or the mons, then there 
really isn't much that ceph can do to automate this safely. There are just so 
many variables. If ceph was a closed system on specific hardware, it could 
certainly be monitoring that hardware closely for early warning signs... But 
people are running Ceph on everything they can compile it for including 
raspberry pis. The cluster admin, however, should be able to add their own 
early detection for failures.


You can monitor a lot about disks including things such as average await in a 
host to see if the disks are taking longer than normal to respond. That 
particular check led us to find that we had several storage nodes with bad 
cache batteries on the controllers. Finding that explained some slowness we had 
noticed in the cluster. It also led us to a better method to catch that 
scenario sooner.
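
A minimal sketch of that kind of check (iostat column names vary with the sysstat version):

iostat -x 5      # watch the await / %util columns per device on each OSD host
ceph osd perf    # per-OSD commit/apply latency as the cluster sees it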


On Tue, Mar 6, 2018, 11:22 PM shadow_lin <shadow_...@163.com> wrote:

Hi Turner,
Thanks for your insight.
I am wondering if the mon can detect slow/blocked request from certain osd why 
can't mon mark a osd with blocked request down if the request is blocked for a 
certain time.

2018-03-07 

shadow_lin 



From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-06 23:56
Subject: Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

There are multiple settings that affect this.  osd_heartbeat_grace is probably 
the most apt.  If an OSD is not getting a response from another OSD for more 
than the heartbeat_grace period, then it will tell the mons that the OSD is 
down.  Once mon_osd_min_down_reporters have told the mons that an OSD is down, 
then the OSD will be marked down by the cluster.  If the OSD does not then talk 
to the mons directly to say that it is up, it will be marked out after 
mon_osd_down_out_interval is reached.  If it does talk to the mons to say that 
it is up, then it should be responding again and be fine. 
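
To see what a given cluster actually uses for these, a small sketch (run each command on the host where that daemon lives; the mon id is a placeholder):

ceph daemon osd.0 config get osd_heartbeat_grace
ceph daemon mon.<mon-id> config get mon_osd_min_down_reporters
ceph daemon mon.<mon-id> config get mon_osd_down_out_interval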


In your case where the OSD is half up, half down... I believe all you can 
really do is monitor your cluster and troubleshoot OSDs causing problems like 
this.  Basically every storage solution is vulnerable to this.  Sometimes an 
OSD just needs to be restarted due to being in a bad state somehow, or simply 
removed from the cluster because the disk is going bad.


On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_...@163.com> wrote:

Hi list,
During my test of ceph,I find sometime the whole ceph cluster are blocked and 
the reason was one unfunctional osd.Ceph can heal itself if some osd is down, 
but it seems if some osd is half dead (have heart beat but can't handle 
request) then all the request which are directed to that osd would be blocked. 
If all osds are in one pool and the whole cluster would be blocked due to that 
one hanged osd.
I think this is because ceph will try to distribute the request to all osds and 
if one of the osd wont confirm the request is done then everything is blocked.
Is there a way to let ceph to mark the the crippled osd down if the requests 
direct to that osd are blocked more than certain time to avoid the whole 
cluster is blocked?

2018-03-04


shadow_lin 


Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-06 Thread shadow_lin
Hi Turner,
Thanks for your insight.
I am wondering: if the mon can detect slow/blocked requests from a certain OSD, why
can't the mon mark an OSD with blocked requests down when the requests have been
blocked for a certain time?

2018-03-07 

shadow_lin 



From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-06 23:56
Subject: Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

There are multiple settings that affect this.  osd_heartbeat_grace is probably 
the most apt.  If an OSD is not getting a response from another OSD for more 
than the heartbeat_grace period, then it will tell the mons that the OSD is 
down.  Once mon_osd_min_down_reporters have told the mons that an OSD is down, 
then the OSD will be marked down by the cluster.  If the OSD does not then talk 
to the mons directly to say that it is up, it will be marked out after 
mon_osd_down_out_interval is reached.  If it does talk to the mons to say that 
it is up, then it should be responding again and be fine.


In your case where the OSD is half up, half down... I believe all you can 
really do is monitor your cluster and troubleshoot OSDs causing problems like 
this.  Basically every storage solution is vulnerable to this.  Sometimes an 
OSD just needs to be restarted due to being in a bad state somehow, or simply 
removed from the cluster because the disk is going bad.


On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_...@163.com> wrote:

Hi list,
During my test of ceph,I find sometime the whole ceph cluster are blocked and 
the reason was one unfunctional osd.Ceph can heal itself if some osd is down, 
but it seems if some osd is half dead (have heart beat but can't handle 
request) then all the request which are directed to that osd would be blocked. 
If all osds are in one pool and the whole cluster would be blocked due to that 
one hanged osd.
I think this is because ceph will try to distribute the request to all osds and 
if one of the osd wont confirm the request is done then everything is blocked.
Is there a way to let ceph to mark the the crippled osd down if the requests 
direct to that osd are blocked more than certain time to avoid the whole 
cluster is blocked?

2018-03-04


shadow_lin 


[ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?

2018-03-03 Thread shadow_lin
Hi list,
During my tests of Ceph I find that sometimes the whole cluster gets blocked, and
the reason was one malfunctioning OSD. Ceph can heal itself if an OSD is down,
but it seems that if an OSD is half dead (it has a heartbeat but can't handle
requests), then all requests directed to that OSD get blocked.
If all OSDs are in one pool, the whole cluster gets blocked because of that
one hung OSD.
I think this is because Ceph distributes requests across all OSDs, and if one OSD
won't confirm a request is done, everything is blocked.
Is there a way to let Ceph mark the crippled OSD down if the requests
directed to it have been blocked for more than a certain time, to avoid blocking
the whole cluster?

2018-03-04


shadow_lin


Re: [ceph-users] how is iops from ceph -s client io section caculated?

2018-03-03 Thread shadow_lin
If it were because of replication, then the IOPS in ceph status should always be
relatively stable at the replication size times fio's IOPS.
From what I have seen, the IOPS in ceph status keeps increasing over time until it
becomes relatively stable.

2018-03-04 


lin.yunfan



From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-03 22:35
Subject: Re: [ceph-users] how is iops from ceph -s client io section caculated?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I would guess that the higher iops in ceph status are from iops calculated from 
replication. fio isn't aware of the backend replication iops, only what it's 
doing to the rbd


On Fri, Mar 2, 2018, 11:53 PM shadow_lin <shadow_...@163.com> wrote:

Hi list,
There is a client io section from the result of ceph -s. I found the value of 
it is kinda confusing.
I am using fio to test rbd seq write performance with 4m block.The throughput 
is about 2000MB/s and fio shows the iops is 500.But from the ceph -s client io 
section the throughput is about 2000MB/s too but the iops is not 
constantly,iops of the client io section keeps increasing from 1000iops to 
2000iops.And I found as the iops increased the throughput get lower(about 
10-20% ).
What the reason of the iops from ceph -s client io section to behave like this?


2018-03-03



lin.yunfan


[ceph-users] how is iops from ceph -s client io section caculated?

2018-03-02 Thread shadow_lin
Hi list,
There is a client io section in the output of ceph -s, and I find its values kind
of confusing.
I am using fio to test RBD sequential write performance with 4M blocks. The
throughput is about 2000 MB/s and fio shows 500 IOPS. The ceph -s client io
section shows about 2000 MB/s as well, but its IOPS value is not constant: it keeps
increasing from about 1000 IOPS to 2000 IOPS, and as the IOPS increases the
throughput drops by about 10-20%.
What is the reason the IOPS in the ceph -s client io section behaves like this?
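
For context, a hedged sketch of the kind of fio job used for such a test (the rbd ioengine options are assumptions; adjust pool, image and client names, and fio must be built with rbd support):

cat > rbd-seqwrite.fio <<'EOF'
[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio-test
rw=write
bs=4M
iodepth=32
runtime=60
time_based=1

[seq-write]
EOF
fio rbd-seqwrite.fio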


2018-03-03



lin.yunfan


[ceph-users] [luminous12.2.2]Cache tier doesn't work properly

2018-02-13 Thread shadow_lin
Hi list,
I am testing a cache tier in writeback mode with rados bench.
The test result is confusing: the write performance is worse than without a
cache tier.
My understanding is that a pool with a cache tier in writeback mode should perform
like an all-SSD pool (the client gets an ack after the data is written to hot
storage) as long as the cache doesn't need to be flushed.
But in my write test, the pool with the cache tier performs worse than even the
all-HDD pool.
I also inspected the pool stats and found there are only 244 objects in the
hot-pool and 695 objects in the cold pool (the write test wrote 695 objects), but
with my settings 695 objects shouldn't trigger a flush.

Is there any setting or concept I have misunderstood? My cache tier settings:

# ceph osd tier add cold-pool hot-pool
pool 'hot-pool' is now (or already was) a tier of 'cold-pool'
#
# ceph osd tier cache-mode hot-pool writeback
set cache-mode for pool 'hot-pool' to writeback
#
# ceph osd tier set-overlay cold-pool hot-pool
overlay for 'cold-pool' is now (or already was) 'hot-pool'
#
# ceph osd pool set hot-pool hit_set_type bloom
set pool 39 hit_set_type to bloom
#
# ceph osd pool set hot-pool hit_set_count 10
set pool 39 hit_set_count to 10
#
# ceph osd pool set hot-pool hit_set_period 3600
set pool 39 hit_set_period to 3600
#
# ceph osd pool set hot-pool target_max_bytes 24000
set pool 39 target_max_bytes to 24000
#
# ceph osd pool set hot-pool target_max_objects 30
set pool 39 target_max_objects to 30
#
# ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
set pool 39 cache_target_dirty_ratio to 0.4
#
# ceph osd pool set hot-pool cache_target_dirty_high_ratio 0.6
set pool 39 cache_target_dirty_high_ratio to 0.6
#
# ceph osd pool set hot-pool cache_target_full_ratio 0.8
set pool 39 cache_target_full_ratio to 0.8
#
# ceph osd pool set hot-pool cache_min_flush_age 600
set pool 39 cache_min_flush_age to 600
#
# ceph osd pool set hot-pool cache_min_evict_age 1800
set pool 39 cache_min_evict_age to 1800


2018-02-13


shadow_lin


[ceph-users] How does cache tier work in writeback mode?

2018-02-08 Thread shadow_lin
Hi list,
I am testing a cache tier in writeback mode.
The test result is confusing: the write performance is worse than without a
cache tier.

The hot storage pool is an all-SSD pool and the cold storage pool is an all-HDD
pool. I also created an hdd-pool and an ssd-pool with the same CRUSH rules as the
cache tier pools for comparison.
The pool config:
pool       OSDs  cap. (TB)  PGs    pool      OSDs  cap. (TB)  PGs
hot-pool   20    4.8        1024   ssd-pool  20    4.8        1024
cold-pool  140   1400       2048   hdd-pool  140   1400       2048


The cache tier config:
# ceph osd tier add cold-pool hot-pool
pool 'hot-pool' is now (or already was) a tier of 'cold-pool'
#
# ceph osd tier cache-mode hot-pool writeback
set cache-mode for pool 'hot-pool' to writeback
#
# ceph osd tier set-overlay cold-pool hot-pool
overlay for 'cold-pool' is now (or already was) 'hot-pool'
#
# ceph osd pool set hot-pool hit_set_type bloom
set pool 39 hit_set_type to bloom
#
# ceph osd pool set hot-pool hit_set_count 10
set pool 39 hit_set_count to 10
#
# ceph osd pool set hot-pool hit_set_period 3600
set pool 39 hit_set_period to 3600
#
# ceph osd pool set hot-pool target_max_bytes 24000
set pool 39 target_max_bytes to 24000
#
# ceph osd pool set hot-pool target_max_objects 30
set pool 39 target_max_objects to 30
#
# ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
set pool 39 cache_target_dirty_ratio to 0.4
#
# ceph osd pool set hot-pool cache_target_dirty_high_ratio 0.6
set pool 39 cache_target_dirty_high_ratio to 0.6
#
# ceph osd pool set hot-pool cache_target_full_ratio 0.8
set pool 39 cache_target_full_ratio to 0.8
#
# ceph osd pool set hot-pool cache_min_flush_age 600
set pool 39 cache_min_flush_age to 600
#
# ceph osd pool set hot-pool cache_min_evict_age 1800
set pool 39 cache_min_evict_age to 1800


Write Test

cold-pool(tier)  write test for 10s
# rados bench -p cold-pool 10 write --no-cleanup
 
hdd-pool  write test for 10s
# rados bench -p hdd-pool 10 write --no-cleanup
 
ssd-pool  write test for 10s
# rados bench -p ssd-pool 10 write --no-cleanup

result:
                 tier   hdd   ssd
objects          695    737   2550
bandwidth (MB/s) 272    289   1016
avg latency (s)  0.23   0.22  0.06


Read Test

# rados bench -p cold-pool 10 seq
 
# rados bench -p cold-pool 10 rand
 
# rados bench -p hdd-pool 10 seq
 
# rados bench -p hdd-pool 10 rand
 
# rados bench -p ssd-pool 10 seq
 
# rados bench -p ssd-pool 10 rand

seq result:
                 tier    hdd    ssd
bandwidth (MB/s) 806     789    1113
avg latency (s)  0.074   0.079  0.056

rand result:
                 tier    hdd    ssd
bandwidth (MB/s) 1106    790    1113
avg latency (s)  0.056   0.079  0.056



My understanding is that the pool with a cache tier in writeback mode should
perform like the all-SSD pool (the client gets an ack after the data is written to
hot storage) as long as the cache doesn't need to be flushed.
But in the write test, the pool with the cache tier performs worse than even the
all-HDD pool.
I also inspected the pool stats and found there are only 244 objects in the
hot-pool and 695 objects in the cold pool (the write test wrote 695 objects), but
with my settings 695 objects shouldn't trigger a flush.

Is there any setting or concept I have misunderstood?
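
A few commands that help confirm where the objects actually land and how many are dirty in the cache tier (a sketch; ceph df detail shows a DIRTY column for cache pools):

ceph df detail                                           # per-pool USED / OBJECTS / DIRTY
rados df                                                 # object counts per pool
ceph osd pool ls detail | grep -E 'hot-pool|cold-pool'   # confirms the tier/overlay relationship
ceph osd pool get hot-pool target_max_objects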




2018-02-09



lin.yunfan


Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-02-01 Thread shadow_lin
Hi Maged,
I haven't hit this problem, but I think I did read the bug report you
provided.
I just want to know the best practice for removing a journal/db/wal
partition when there are other partitions on the same SSD belonging to other OSDs,
without affecting those OSDs.
I have used ceph-disk zap a lot before, but I always zapped the whole
disk (journal/db/wal colocated with the data on the same disk), so I think I need
to do it manually if I only want to "zap" a certain partition.
Thanks.

2018-02-01 


lin.yunfan



From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-02-01 22:15
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Hi Lin,
We do the extra dd after zapping the disk. ceph-disk has a zap function that
uses wipefs to wipe fs traces, dd to zero 10MB at the start of each partition, then
sgdisk to remove the partition table; I believe ceph-volume does the same. After
this zap, for each data or db block that will be created on this device we use
the dd command to zero 500MB. This may be a bit overboard, but other users have
had similar issues:
http://tracker.ceph.com/issues/22354
Also, the initial zap will wipe the disk and zero the start of the partitions as
they used to be; it is possible the new disk will have a db block with a different
size, so the start of partitioning has changed.
I am not sure if your question was because you hit this issue, or you just want
to skip the extra dd step, or you are facing issues cleaning disks; if it is
the latter we can send you a patch that does this.
Maged
On 2018-02-01 15:04, shadow_lin wrote:
Hi Maged,
The problem you met beacuse of the left over of older cluster.Did you remove 
the db partition or you just use the old partition?
I thought Wido suggest to remove the partition then use the dd to be safe.Is it 
safe I don't remove the partition and just use dd the try to destory the data 
on that partition?
How would ceph-disk or ceph-volume do to the existing partition of 
journal,db,wal?Will it clean it or it just uses it without any action?

2018-02-01

 
lin.yunfan



From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-02-01 14:22
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "David Turner"<drakonst...@gmail.com>
Cc: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>

I would recommend, like Wido, using the dd command. The block db device holds the
metadata/allocation of the objects stored in the data block; not cleaning it is
asking for problems, and besides, it does not take any time. In our testing,
building a new cluster on top of an older installation, we did see many cases
where OSDs would not start and reported an error such as "fsid of cluster and/or
OSD does not match metadata in BlueFS superblock"... these errors do not appear if
we use the dd command.
On 2018-02-01 06:06, David Turner wrote:
I know that for filestore journals that is fine.  I think it is also safe for 
bluestore.  Doing Wido's recommendation of writing 100MB would be a good idea, 
but not necessary.


On Wed, Jan 31, 2018, 10:10 PM shadow_lin <shadow_...@163.com> wrote:
Hi David,
Thanks for your reply.
I am wondering: what if I don't remove the journal (wal/db for bluestore) partition
on the SSD and only zap the data disk, and then assign the journal (wal/db for
bluestore) partition to a new OSD? What would happen?

2018-02-01

 
lin.yunfan



From: David Turner <drakonst...@gmail.com>
Sent: 2018-01-31 17:24
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I use gdisk to remove the partition and partprobe for the OS to see the new 
partition table. You can script it with sgdisk.
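
A sketch of scripting that for one journal/db partition on a shared SSD (device and partition number are placeholders; zero the partition before deleting it, while its device node still exists, and triple-check the names):

DEV=/dev/sdX   # placeholder: the shared SSD
PART=3         # placeholder: the partition that belonged to the removed OSD
# for NVMe devices the partition node is ${DEV}p${PART} instead
dd if=/dev/zero of=${DEV}${PART} bs=1M count=100 oflag=direct   # wipe the start of that partition only
sgdisk --delete=${PART} ${DEV}                                  # drop just that entry from the GPT
partprobe ${DEV}                                                # make the kernel re-read the table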


On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote:
Hi list,
if I create an OSD with the journal (wal/db if it is bluestore) on the same HDD, I 
use ceph-disk zap to clean the disk when I want to remove the OSD and clean the 
data on the disk.
But if I use an SSD partition as the journal (wal/db if it is bluestore), how 
should I clean the journal (wal/db if it is bluestore) of the OSD I want to 
remove? Especially when other OSDs are using other partitions of the 
same SSD as journals (wal/db if it is bluestore).


2018-01-31


shadow_lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-02-01 Thread shadow_lin
Hi Maged,
The problem you met was because of leftovers from an older cluster. Did you remove 
the db partition, or did you just reuse the old partition?
I thought Wido suggested removing the partition and then using dd to be safe. Is it 
safe if I don't remove the partition and just use dd to try to destroy the data 
on that partition?
What would ceph-disk or ceph-volume do to an existing partition of 
journal, db, or wal? Will it clean it, or will it just use it without any action?

2018-02-01 


lin.yunfan



From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-02-01 14:22
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "David Turner"<drakonst...@gmail.com>
Cc: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>

I would recommend, as Wido did, using the dd command. The block db device holds the 
metadata/allocation of the objects stored in the data block; not cleaning this is asking 
for problems, and besides, it does not take much time. In our testing, building a new 
cluster on top of an older installation, we saw many cases where OSDs would not 
start and reported errors such as the fsid of the cluster and/or OSD not matching the 
metadata in the BlueFS superblock. These errors do not appear if we use the dd 
command.
On 2018-02-01 06:06, David Turner wrote:
I know that for filestore journals that is fine.  I think it is also safe for 
bluestore.  Doing Wido's recommendation of writing 100MB would be a good idea, 
but not necessary.


On Wed, Jan 31, 2018, 10:10 PM shadow_lin <shadow_...@163.com> wrote:
Hi David,
Thanks for your reply.
I am wondering: what if I don't remove the journal (wal/db for bluestore) partition 
on the SSD and only zap the data disk, and then assign the journal (wal/db for 
bluestore) partition to a new OSD? What would happen?

2018-02-01

 
lin.yunfan



From: David Turner <drakonst...@gmail.com>
Sent: 2018-01-31 17:24
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I use gdisk to remove the partition and partprobe for the OS to see the new 
partition table. You can script it with sgdisk.


On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote:
Hi list,
if I create an OSD with the journal (wal/db if it is bluestore) on the same HDD, I 
use ceph-disk zap to clean the disk when I want to remove the OSD and clean the 
data on the disk.
But if I use an SSD partition as the journal (wal/db if it is bluestore), how 
should I clean the journal (wal/db if it is bluestore) of the OSD I want to 
remove? Especially when other OSDs are using other partitions of the 
same SSD as journals (wal/db if it is bluestore).


2018-01-31


shadow_lin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-01-31 Thread shadow_lin
Hi David,
Thanks for your reply.
I am wondering: what if I don't remove the journal (wal/db for bluestore) partition 
on the SSD and only zap the data disk, and then assign the journal (wal/db for 
bluestore) partition to a new OSD? What would happen?

2018-02-01 


lin.yunfan



From: David Turner <drakonst...@gmail.com>
Sent: 2018-01-31 17:24
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I use gdisk to remove the partition and partprobe for the OS to see the new 
partition table. You can script it with sgdisk.


On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote:

Hi list,
if I create an OSD with the journal (wal/db if it is bluestore) on the same HDD, I 
use ceph-disk zap to clean the disk when I want to remove the OSD and clean the 
data on the disk.
But if I use an SSD partition as the journal (wal/db if it is bluestore), how 
should I clean the journal (wal/db if it is bluestore) of the OSD I want to 
remove? Especially when other OSDs are using other partitions of the 
same SSD as journals (wal/db if it is bluestore).


2018-01-31


shadow_lin 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?

2018-01-31 Thread shadow_lin
Hi list,
if I create an OSD with the journal (wal/db if it is bluestore) on the same HDD, I 
use ceph-disk zap to clean the disk when I want to remove the OSD and clean the 
data on the disk.
But if I use an SSD partition as the journal (wal/db if it is bluestore), how 
should I clean the journal (wal/db if it is bluestore) of the OSD I want to 
remove? Especially when other OSDs are using other partitions of the 
same SSD as journals (wal/db if it is bluestore).


2018-01-31


shadow_lin ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How ceph client read data from ceph cluster

2018-01-26 Thread shadow_lin
Hi Maged,
I just want to make sure I understand how a Ceph client reads from the cluster. So 
with the current version of Ceph (12.2.2), the client only reads from the primary 
OSD (one copy), is that true?

2018-01-27 


lin.yunfan



From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-01-26 20:27
Subject: Re: [ceph-users] How ceph client read data from ceph cluster
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>



On 2018-01-26 09:09, shadow_lin wrote:
Hi List,
I read an old article about how a Ceph client reads from a Ceph cluster. It said the 
client only reads from the primary OSD. Since a Ceph cluster in replicated mode 
has several copies of the data, reading from only one copy seems to waste the 
performance of concurrent reads from all the copies.
But that article is rather old, so maybe Ceph has improved to read from all the 
copies? I haven't found any info about that.
Any info about that would be appreciated.
Thanks
2018-01-26


shadow_lin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Hi 
In the majority of cases you will have more concurrent io requests than disks, so 
the load will already be distributed evenly. If this is not the case and you 
have a large cluster with fewer clients, you may consider using object/rbd 
striping, so each io will be divided into requests to different OSDs.
Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
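
A hedged sketch of the object/rbd striping mentioned above; the pool/image name and stripe values are placeholders, and non-default ("fancy") striping is a librbd feature that the kernel rbd client may not support depending on kernel version:

# Create an image whose data is striped in 64KB units across 16 objects at a
# time, so sequential client io fans out over more OSDs concurrently.
rbd create rbd/striped-img --size 102400 --stripe-unit 65536 --stripe-count 16
rbd info rbd/striped-img   # shows stripe_unit and stripe_count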


[ceph-users] How ceph client read data from ceph cluster

2018-01-25 Thread shadow_lin
Hi List,
I read an old article about how a Ceph client reads from a Ceph cluster. It said the 
client only reads from the primary OSD. Since a Ceph cluster in replicated mode 
has several copies of the data, reading from only one copy seems to waste the 
performance of concurrent reads from all the copies.
But that article is rather old, so maybe Ceph has improved to read from all the 
copies? I haven't found any info about that.
Any info about that would be appreciated.
Thanks
2018-01-26


shadow_lin ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limit deep scrub

2018-01-14 Thread shadow_lin
Hi,
You can try adjusting osd_scrub_chunk_min, osd_scrub_chunk_max and 
osd_scrub_sleep; a sketch of how to apply them follows the excerpt below.
   
osd scrub sleep

Description: Time to sleep before scrubbing next group of chunks. Increasing 
this value will slow down whole scrub operation while client operations will be 
less impacted.
Type: Float
Default: 0

osd scrub chunk min

Description: The minimal number of object store chunks to scrub during single 
operation. Ceph blocks writes to single chunk during scrub.
Type: 32-bit Integer
Default: 5
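
A hedged example of applying these at runtime with injectargs; the numbers are only illustrative, not recommendations, and they revert on OSD restart unless also put in ceph.conf:

# Throttle scrub impact across all OSDs:
ceph tell 'osd.*' injectargs '--osd_scrub_sleep 0.1'
ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5'
# osd_max_scrubs limits concurrent scrubs per OSD (default 1), not per cluster:
ceph tell 'osd.*' injectargs '--osd_max_scrubs 1'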


2018-01-15 



lin.yunfan



From: Karun Josy
Sent: 2018-01-15 06:53
Subject: [ceph-users] Limit deep scrub
To: "ceph-users"
Cc:

Hello,


It appears that the cluster is having many slow requests while it is scrubbing and 
deep scrubbing. Sometimes we can also see OSDs flapping.


So we have set the flags: noscrub, nodeep-scrub


When we unset them, 5 PGs start to scrub.
Is there a way to limit it to one at a time?
 

# ceph daemon osd.35 config show | grep scrub
"mds_max_scrub_ops_in_progress": "5",
"mon_scrub_inject_crc_mismatch": "0.00",
"mon_scrub_inject_missing_keys": "0.00",
"mon_scrub_interval": "86400",
"mon_scrub_max_keys": "100",
"mon_scrub_timeout": "300",
"mon_warn_not_deep_scrubbed": "0",
"mon_warn_not_scrubbed": "0",
"osd_debug_scrub_chance_rewrite_digest": "0",
"osd_deep_scrub_interval": "604800.00",
"osd_deep_scrub_randomize_ratio": "0.15",
"osd_deep_scrub_stride": "524288",
"osd_deep_scrub_update_digest_min_age": "7200",
"osd_max_scrubs": "1",
"osd_op_queue_mclock_scrub_lim": "0.001000",
"osd_op_queue_mclock_scrub_res": "0.00",
"osd_op_queue_mclock_scrub_wgt": "1.00",
"osd_requested_scrub_priority": "120",
"osd_scrub_auto_repair": "false",
"osd_scrub_auto_repair_num_errors": "5",
"osd_scrub_backoff_ratio": "0.66",
"osd_scrub_begin_hour": "0",
"osd_scrub_chunk_max": "25",
"osd_scrub_chunk_min": "5",
"osd_scrub_cost": "52428800",
"osd_scrub_during_recovery": "false",
"osd_scrub_end_hour": "24",
"osd_scrub_interval_randomize_ratio": "0.50",
"osd_scrub_invalid_stats": "true",
"osd_scrub_load_threshold": "0.50",
"osd_scrub_max_interval": "604800.00",
"osd_scrub_min_interval": "86400.00",
"osd_scrub_priority": "5",
"osd_scrub_sleep": "0.00",


 

Karun ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to speed up backfill

2018-01-10 Thread shadow_lin
Hi,
Mine is purely backfilling (I removed an OSD from the cluster), and it started 
at 600MB/s and ended at about 3MB/s.
How is your recovery made up? Is it backfill, log-replay PG recovery, or 
both?

2018-01-11 

shadow_lin 



From: Josef Zelenka <josef.zele...@cloudevelops.com>
Sent: 2018-01-11 15:26
Subject: Re: [ceph-users] How to speed up backfill
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Hi, our recovery slowed down significantly towards the end, however it was 
still about five times faster than the original speed. We suspected that this is 
somehow caused by threading (more objects transferred - more threads used), but 
this is only an assumption. 



On 11/01/18 05:02, shadow_lin wrote:

Hi,
I have tried these two methods, and for backfilling it seems only 
osd-max-backfills works.
How was your recovery speed when it came to the last few PGs or objects?

2018-01-11 

shadow_lin 



From: Josef Zelenka <josef.zele...@cloudevelops.com>
Sent: 2018-01-11 04:53
Subject: Re: [ceph-users] How to speed up backfill
To: "shadow_lin"<shadow_...@163.com>
Cc:

Hi, I had the same issue a few days back. I tried playing around with these two:
ceph tell 'osd.*' injectargs '--osd-max-backfills '
ceph tell 'osd.*' injectargs '--osd-recovery-max-active  '
and it helped greatly (increased our recovery speed 20x), but be careful not to 
overload your systems. 
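
The values in the commands above were lost in the archive; a hedged example with illustrative numbers (4 and 8 are placeholders, not recommendations):

# Raise backfill/recovery concurrency cluster-wide at runtime; higher values
# speed up backfill at the cost of client io, so tune carefully.
ceph tell 'osd.*' injectargs '--osd-max-backfills 4'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 8'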



On 10/01/18 17:50, shadow_lin wrote:

Hi all,
I am playing with the settings for backfill to try to find out how to control the 
speed of backfill.

So far I have only found that "osd max backfills" affects the backfill speed. But 
once all the PGs that need to be backfilled have begun backfilling, I can't find 
any way to speed up the backfills.

This is especially noticeable when it comes to the last PG to recover: the speed 
is only a few MB/s (when multiple PGs are being backfilled, the speed can be more 
than 600MB/s in my test).

I am a little confused about the backfill and recovery settings. Backfilling is a 
kind of recovery, but it seems the recovery settings are only about replaying PG 
logs to recover a PG.

Would changing "osd recovery max active" or other recovery settings have any 
effect on backfilling?

I tried "osd recovery op priority" and "osd recovery max active" with no 
luck.

Any advice would be greatly appreciated. Thanks

2018-01-11



lin.yunfan

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to speed up backfill

2018-01-10 Thread shadow_lin
Hi all,
I am playing with the settings for backfill to try to find out how to control the 
speed of backfill.

So far I have only found that "osd max backfills" affects the backfill speed. But 
once all the PGs that need to be backfilled have begun backfilling, I can't find 
any way to speed up the backfills.

This is especially noticeable when it comes to the last PG to recover: the speed 
is only a few MB/s (when multiple PGs are being backfilled, the speed can be more 
than 600MB/s in my test).

I am a little confused about the backfill and recovery settings. Backfilling is a 
kind of recovery, but it seems the recovery settings are only about replaying PG 
logs to recover a PG.

Would changing "osd recovery max active" or other recovery settings have any 
effect on backfilling?

I tried "osd recovery op priority" and "osd recovery max active" with no 
luck.

Any advice would be greatly appreciated. Thanks

2018-01-11



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bad crc causing osd hang and block all request.

2018-01-10 Thread shadow_lin
Thanks for your advice.
I rebuilt the OSD and haven't had this happen again, so it could have been 
corruption on the HDDs.

2018-01-11 


lin.yunfan



From: Konstantin Shalygin
Sent: 2018-01-09 12:11
Subject: Re: [ceph-users] Bad crc causing osd hang and block all request.
To: "ceph-users"
Cc:

> What could cause this problem?Is this caused by a faulty HDD? 
> what data's crc didn't match ? 

This may be caused by a faulty drive. Check your dmesg.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
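
A hedged example of the drive check suggested above; /dev/sdX is a placeholder for the OSD's data disk:

dmesg -T | grep -iE 'ata|i/o error|medium error'   # look for kernel-level disk errors
smartctl -a /dev/sdX                               # SMART health, reallocated/pending sectors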


[ceph-users] Bad crc causing osd hang and block all request.

2018-01-08 Thread shadow_lin
Hi lists,

ceph version:luminous 12.2.2

The cluster was doing a write throughput test when this problem happened.
The cluster health became ERROR:
Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK)
Clients couldn't write any data into the cluster.
osd.22 and osd.40 are the OSDs responsible for the problem.
osd.22's log shows the message below, repeating:
2018-01-07 20:44:52.202322 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 
10.0.2.21:6802/2785 conn(0x96aa9400 :6802 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY

2018-01-07 20:44:52.250600 b56db8e0  0 bad crc in data 3751247614 != exp 
3467727689

2018-01-07 20:44:52.252470 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 
10.0.2.21:6802/2785 conn(0x95c04000 :6802 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY

2018-01-07 20:44:52.300354 b5edb8e0  0 bad crc in data 3751247614 != exp 
3467727689

2018-01-07 20:44:52.302788 b56db8e0  0 -- 10.0.2.12:6802/2798 >> 
10.0.2.21:6802/2785 conn(0x978e7a00 :6802 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY

2018-01-07 20:44:52.350987 b56db8e0  0 bad crc in data 3751247614 != exp 
3467727689

2018-01-07 20:44:52.352953 b5edb8e0  0 -- 10.0.2.12:6802/2798 >> 
10.0.2.21:6802/2785 conn(0x97420e00 :6802 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg 
accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY

2018-01-07 20:44:52.400959 b5edb8e0  0 bad crc in data 3751247614 != exp 
3467727689

osd.40's log shows the message below, repeating:
2018-01-07 20:44:52.200709 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 
10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 
l=0).fault initiating reconnect

2018-01-07 20:44:52.251423 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 
10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 
l=0).fault initiating reconnect

2018-01-07 20:44:52.301166 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 
10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 
l=0).fault initiating reconnect

2018-01-07 20:44:52.351810 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 
10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 
l=0).fault initiating reconnect

2018-01-07 20:44:52.401782 b4e9e8e0  0 -- 10.0.2.21:6802/2785 >> 
10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 
l=0).fault initiating reconnect

The NIC of osd.22's host kept sending data to osd.40's host at about 50MB/s 
when this happened.

After rebooting osd.22 the cluster went back to normal.
This happened twice in my write test, with the same OSDs (osd.22 and osd.40).

What could cause this problem? Is it caused by a faulty HDD?
Which data's crc didn't match?


2018-01-09



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [luminous 12.2.2]bluestore cache uses much more memory than setting value

2018-01-06 Thread shadow_lin
Hi all,
I already know that luminous uses more memory for the bluestore cache 
than the config setting, but I was expecting 1.5x, not 7-8x.
Below is my bluestore cache setting:

[osd]

osd max backfills = 4

bluestore_cache_size = 134217728

bluestore_cache_kv_max = 134217728

osd client message size cap = 67108864

My OSD nodes have only 2GB of memory and I run 2 OSDs per node, so I set the 
cache value very low.

I was running a read throughput test and then found some of my OSDs were 
killed by the oom killer and restarted. I found the oom-killed OSDs used much more 
memory for bluestore_cache_data than the normal ones.
The oom-killed OSD used 795MB of RAM in mempools and 722MB in 
bluestore_cache_data.
The normal OSD used about 120MB of RAM in mempools and 17MB in 
bluestore_cache_data.
 
Graph of memory usage of the oom-killed OSD: 
https://pasteboard.co/H1GzihS.png


Graph of memory usage of the normal OSD: https://pasteboard.co/H1GzaeF.png

Is this a bug in the bluestore cache, or have I misunderstood the meaning of 
the bluestore cache settings in the config?



2018-01-06



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to monitor slow request?

2017-12-26 Thread shadow_lin
I am building a Ceph monitoring dashboard and I want to monitor how many slow 
requests are on each node. But I find that ceph.log sometimes only logs lines like 
the one below:
2017-12-27 14:59:47.852396 mon.mnc000 mon.0 192.168.99.80:6789/0 2147 : cluster 
[WRN] Health check failed: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)

There is no OSD id info about where the slow request happened.

What would be a proper way to monitor which OSD caused the slow requests and how 
many slow requests are on that OSD?
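
A hedged sketch of commands that can help attribute slow requests to specific OSDs on luminous; osd.12 is a placeholder:

# The mons summarize which OSDs currently have blocked/slow requests:
ceph health detail | grep -i -E 'slow|blocked'

# On the OSD host, inspect the ops themselves:
ceph daemon osd.12 dump_ops_in_flight    # ops currently in flight, with ages
ceph daemon osd.12 dump_historic_ops     # recent slowest ops, with per-stage durations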



2017-12-27


shadow_lin ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)

2017-12-26 Thread shadow_lin
I have disabled scrub before the test.

2017-12-27 

shadow_lin 



From: Webert de Souza Lima <webert.b...@gmail.com>
Sent: 2017-12-22 20:37
Subject: Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)
To: "ceph-users"<ceph-users@lists.ceph.com>
Cc:


On Thu, Dec 21, 2017 at 12:52 PM, shadow_lin <shadow_...@163.com> wrote:
After 18:00 the write throughput suddenly dropped and the OSD latency 
increased. TCMalloc started reclaiming the page heap freelist much more frequently. 
All of this happened very fast and every OSD had the identical pattern.
Could that be caused by OSD scrub?  Check your "osd_scrub_begin_hour"

  ceph daemon osd.$ID config show | grep osd_scrub




Regards,


Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
IRC NICK - WebertRLZ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [luminous 12.2.2] Cluster write performance degradation problem(possibly tcmalloc related)

2017-12-21 Thread shadow_lin
My testing cluster is an all-HDD cluster with 12 OSDs (one 10TB HDD each).
I monitor luminous 12.2.2 write performance and OSD memory usage with grafana 
graphs for statistic logging.
The test is done by using fio on a mounted rbd with the following fio parameters:
fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest
   I found there is a noticeable performance degradation over time.
   Graph of write throughput and iops:
   https://pasteboard.co/GZflpTO.png
   Graph of osd memory usage (2 of 12 osds; the patterns are identical):
   https://pasteboard.co/GZfmfzo.png
   Graph of osd perf:
   https://pasteboard.co/GZfmZNx.png

   There are some interesting findings from the graphs.
   After 18:00 the write throughput suddenly dropped and the OSD latency 
increased. TCMalloc started reclaiming the page heap freelist much more frequently. 
All of this happened very fast and every OSD had the identical pattern.

   I have done this kind of test several times with different bluestore 
cache settings and found that with more cache the performance degradation happens 
later.

   I don't know if this is a bug or whether I can fix it by modifying some of the 
config of my cluster. 
   Any advice or direction to look into is appreciated.

  Thanks
   
 


2017-12-21



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Luminous 12.2.2] Cluster peformance drops after certain point of time

2017-12-21 Thread shadow_lin
Thanks for your information, but I don't think that is my case; my cluster doesn't 
have any SSDs.

2017-12-21 


lin.yunfan



From: Denes Dolhay <de...@denkesys.com>
Sent: 2017-12-18 06:41
Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster peformance drops after certain point of time
To: "ceph-users"<ceph-users@lists.ceph.com>
Cc:

Hi,
This is just a tip, I do not know if this actually applies to you, but some 
ssds are decreasing their write throughput on purpose so they do not wear out 
the cells before the warranty period is over.


Denes.





On 12/17/2017 06:45 PM, shadow_lin wrote:

Hi All,
I am testing luminous 12.2.2 and found a strange behavior in my cluster.
   I was testing my cluster's throughput by using fio on a mounted rbd with 
the following fio parameters:
   fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest
   Everything was fine at the beginning, but after about 10 hrs of testing I 
found the performance dropped noticeably.
   Throughput dropped from 300-450MB/s to 250-350MB/s and OSD latency 
increased from 300ms to 400ms.
   I also noted that the heap stats showed the OSDs started reclaiming the page heap 
freelist much more frequently, but the RSS memory of the OSDs kept increasing.
  
  Below are links to grafana graphs of my cluster:
  cluster metrics: https://pasteboard.co/GYEOgV1.jpg
  osd mem metrics: https://pasteboard.co/GYEP74M.png
  In the graphs the performance dropped after 10:00.

 I am investigating what happened but haven't found any clue yet. If you 
know anything about how to solve this problem or where I should look, 
please let me know. 
 Thanks. 


2017-12-18



lin.yunfan

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Luminous 12.2.2] Cluster peformance drops after certain point of time

2017-12-18 Thread shadow_lin
Thanks for the information, but I think that is not my case because I am using 
only HDDs in my cluster.

From the command you provided I found that db_used_bytes is quite large, but I am 
not sure how the db used bytes relates to the amount of stored data and to 
performance.

ceph daemon osd.0 perf dump | jq '.bluefs' | grep -E '(db|slow)'
  "db_total_bytes": 400029646848,
  "db_used_bytes": 9347006464,
  "slow_total_bytes": 0,
  "slow_used_bytes": 0


2017-12-18 

shadow_lin 



From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: 2017-12-18 13:52
Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster peformance drops after certain point of time
To: "ceph-users"<ceph-users@lists.ceph.com>
Cc: "shadow_lin"<shadow_...@163.com>

I am testing luminous 12.2.2 and find a strange behavior of my cluster.
Check your block.db usage. Luminous 12.2.2 is affected 
http://tracker.ceph.com/issues/22264


[root@ceph-osd0]# ceph daemon osd.46 perf dump | jq '.bluefs' | grep -E 
'(db|slow)'
  "db_total_bytes": 30064762880,
  "db_used_bytes": 16777216,
  "slow_total_bytes": 240043163648,
  "slow_used_bytes": 659554304,___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Luminous 12.2.2] Cluster peformance drops after certain point of time

2017-12-17 Thread shadow_lin
Hi All,
I am testing luminous 12.2.2 and found a strange behavior in my cluster.
   I was testing my cluster's throughput by using fio on a mounted rbd with 
the following fio parameters:
   fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest
   Everything was fine at the beginning, but after about 10 hrs of testing I 
found the performance dropped noticeably.
   Throughput dropped from 300-450MB/s to 250-350MB/s and OSD latency 
increased from 300ms to 400ms.
   I also noted that the heap stats showed the OSDs started reclaiming the page heap 
freelist much more frequently, but the RSS memory of the OSDs kept increasing.
  
  Below are links to grafana graphs of my cluster:
  cluster metrics: https://pasteboard.co/GYEOgV1.jpg
  osd mem metrics: https://pasteboard.co/GYEP74M.png
  In the graphs the performance dropped after 10:00.

 I am investigating what happened but haven't found any clue yet. If you 
know anything about how to solve this problem or where I should look, 
please let me know. 
 Thanks. 


2017-12-18



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The way to minimize osd memory usage?

2017-12-10 Thread shadow_lin
My workload is mainly sequential writes (for surveillance usage). I am not sure how 
the cache would affect write performance, or why the memory usage keeps increasing as 
more data is written into the Ceph storage.

2017-12-11 


lin.yunfan



From: Peter Woodman <pe...@shortbus.org>
Sent: 2017-12-11 05:04
Subject: Re: [ceph-users] The way to minimize osd memory usage?
To: "David Turner"<drakonst...@gmail.com>
Cc: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>, "Konstantin Shalygin"<k0...@k0ste.ru>

I've had some success in this configuration by cutting the bluestore 
cache size down to 512mb and only one OSD on an 8tb drive. Still get 
occasional OOMs, but not terrible. Don't expect wonderful performance, 
though. 

Two OSDs would really be pushing it. 

On Sun, Dec 10, 2017 at 10:05 AM, David Turner <drakonst...@gmail.com> wrote: 
> The docs recommend 1GB/TB of OSDs. I saw people asking if this was still 
> accurate for bluestore and the answer was that it is more true for bluestore 
> than filestore. There might be a way to get this working at the cost of 
> performance. I would look at Linux kernel memory settings as much as ceph 
> and bluestore settings. Cache pressure is one that comes to mind that an 
> aggressive setting might help. 
> 
> 
> On Sat, Dec 9, 2017, 11:33 PM shadow_lin <shadow_...@163.com> wrote: 
>> 
>> The 12.2.1(12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) 
>> we are running is with the memory issues fix.And we are working on to 
>> upgrade to 12.2.2 release to see if there is any furthermore improvement. 
>> 
>> 2017-12-10 
>>  
>> lin.yunfan 
>>  
>> 
>> From: Konstantin Shalygin <k0...@k0ste.ru> 
>> Sent: 2017-12-10 12:29 
>> Subject: Re: [ceph-users] The way to minimize osd memory usage? 
>> To: "ceph-users"<ceph-users@lists.ceph.com> 
>> Cc: "shadow_lin"<shadow_...@163.com> 
>> 
>> 
>> > I am testing running ceph luminous(12.2.1-249-g42172a4 
>> > (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server. 
>> Try new 12.2.2 - this release should fix memory issues with Bluestore. 
>> 
>> ___ 
>> ceph-users mailing list 
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] The way to minimize osd memory usage?

2017-12-09 Thread shadow_lin
The 12.2.1 build we are running (12.2.1-249-g42172a4, 
42172a443183ffe6b36e85770e53fe678db293bf) includes the memory issue fix, and we 
are working on upgrading to the 12.2.2 release to see if there is any further improvement.

2017-12-10 


lin.yunfan



From: Konstantin Shalygin <k0...@k0ste.ru>
Sent: 2017-12-10 12:29
Subject: Re: [ceph-users] The way to minimize osd memory usage?
To: "ceph-users"<ceph-users@lists.ceph.com>
Cc: "shadow_lin"<shadow_...@163.com>

> I am testing running ceph luminous(12.2.1-249-g42172a4 
> (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server. 
Try new 12.2.2 - this release should fix memory issues with Bluestore. ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] The way to minimize osd memory usage?

2017-12-09 Thread shadow_lin
Hi All,
I am testing ceph luminous (12.2.1-249-g42172a4, 
42172a443183ffe6b36e85770e53fe678db293bf) on ARM servers.
Each ARM server has a two-core 1.4GHz CPU and 2GB of RAM, and I am running 2 OSDs 
per ARM server with 2x8TB (or 2x10TB) HDDs.
Now I am facing a constant OOM problem. I have tried upgrading ceph (to fix the OSD 
memory leak problem) and lowering the bluestore cache settings. The OOM problem 
did get better, but it still occurs constantly.

I am hoping someone can give me some advice on the following questions.

Is it impossible to run ceph on this hardware configuration, or is it possible to 
do some tuning to solve this problem (even at the cost of some performance to avoid 
the OOM problem)?

Is it a good idea to use raid0 to combine the 2 HDDs into one so I only run 
one OSD, to save some memory?

How is the memory usage of an OSD related to the size of the HDD?




PS: my ceph.conf bluestore cache settings:
[osd]
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864
osd client message size cap = 67108864



2017-12-10



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why degraded objects count keeps increasing as more data is wrote into cluster?

2017-11-07 Thread shadow_lin
Hi all,
I have a pool with 2 replicas (failure domain host) and I was testing it with fio 
writing to an rbd image (at about 450MB/s) when one of my hosts crashed. I rebooted the 
crashed host and the mon said all OSDs and hosts were online, but some PGs were 
in degraded status.
I thought it would recover, but after a while I found that even though all OSDs were 
up and in, the degraded object count kept increasing as more data 
was written into the cluster. If I stop writing data into the cluster, the 
degraded object count starts to decrease.
Is this how PG recovery should work, or did I do something wrong? Why would the 
degraded objects become more and more even when all OSDs are up and in?
Thanks 

2017-11-08



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [luminous][ERR] Error -2 reading object

2017-11-03 Thread shadow_lin
Hi all,

I am testing luminous with an EC-pool-backed rbd [k=8, m=2]. My luminous version 
is: ceph version 12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) 
luminous (stable)
My cluster had some OSD memory OOM problems, so some OSDs got OOM-killed. The 
cluster entered recovery state. During the recovery I found some log lines like the ones below:

2017-11-04 12:12:11.041217 osd.7 [ERR] Error -2 reading object 
4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head

2017-11-04 12:12:11.260225 osd.7 [ERR] Error -2 reading object 
4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head

2017-11-04 12:12:11.583279 osd.7 [ERR] Error -2 reading object 
4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head

 I haven't seen this error before and I can't google any information about 
it either.
 What could cause this error? Is there a way to fix it?

2017-11-04



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How would ec profile effect performance?

2017-11-02 Thread shadow_lin
Hi all,
I am wondering how the EC profile affects Ceph performance.
Will an EC profile of k=10,m=2 perform better than k=8,m=2, since there would be more 
chunks to write and read concurrently?
Will an EC profile of k=10,m=2 need more memory and CPU power than an EC profile of 
k=8,m=2?


2017-11-02



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-11-02 Thread shadow_lin
Hi Sage,
I did some more testing and found the following.
I used "ceph tell osd.6 heap stats" and found that:
osd.6 tcmalloc heap stats:
MALLOC:  404608432 (  385.9 MiB) Bytes in use by application
MALLOC: + 26599424 (   25.4 MiB) Bytes in page heap freelist
MALLOC: + 13442496 (   12.8 MiB) Bytes in central cache freelist
MALLOC: + 21112288 (   20.1 MiB) Bytes in transfer cache freelist
MALLOC: + 21702320 (   20.7 MiB) Bytes in thread cache freelists
MALLOC: +  3021024 (2.9 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =490485984 (  467.8 MiB) Actual memory used (physical + swap)
MALLOC: +162922496 (  155.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =653408480 (  623.1 MiB) Virtual address space used
MALLOC:
MALLOC:  12958  Spans in use
MALLOC: 32  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


The page heap is not released by the OSD itself and keeps increasing, but if I use 
"ceph tell osd.6 heap release" to release it manually, then the page heap freelist 
is released.

osd.6 tcmalloc heap stats:
MALLOC:  404608432 (  385.9 MiB) Bytes in use by application
MALLOC: + 26599424 (   25.4 MiB) Bytes in page heap freelist
MALLOC: + 13442496 (   12.8 MiB) Bytes in central cache freelist
MALLOC: + 21112288 (   20.1 MiB) Bytes in transfer cache freelist
MALLOC: + 21702320 (   20.7 MiB) Bytes in thread cache freelists
MALLOC: +  3021024 (2.9 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =490485984 (  467.8 MiB) Actual memory used (physical + swap)
MALLOC: +162922496 (  155.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =653408480 (  623.1 MiB) Virtual address space used
MALLOC:
MALLOC:  12958  Spans in use
MALLOC: 32  Thread heaps in use
MALLOC:   8192  Tcmalloc page si


I found this problem was discussed before at 
http://tracker.ceph.com/issues/12681; is it a tcmalloc problem?
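
A hedged sketch of working around this by releasing the freelist manually across OSDs (whether doing this periodically is advisable is not settled in this thread):

# Ask every OSD's tcmalloc to return its page heap freelist to the OS;
# could be run from cron as a stopgap on memory-tight nodes.
ceph tell 'osd.*' heap release

# Inspect a single OSD's allocator state before/after (osd.6 as in this thread):
ceph tell osd.6 heap stats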


2017-11-02 


lin.yunfan



From: Sage Weil <s...@newdream.net>
Sent: 2017-11-01 20:11
Subject: Re: Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Wed, 1 Nov 2017, shadow_lin wrote: 
> Hi Sage, 
> We have tried compiled the latest ceph source code from github. 
> The build is ceph version 12.2.1-249-g42172a4 
> (42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable). 
> The memory problem seems better but the memory usage of osd is still keep 
> increasing as more data are wrote into the rbd image and the memory usage 
> won't drop after the write is stopped. 
>Could you specify from which commit the memeory bug is fixed? 

f60a942023088cbba53a816e6ef846994921cab3 and the prior 2 commits. 

If you look at 'ceph daemon osd.nnn dump_mempools' you can see three 
bluestore pools.  This is what bluestore is using to account for its usage  
so it can know when to trim its cache.  Do those add up to the  
bluestore_cache_size - 512m (for rocksdb) that you have configured? 

sage 
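
A hedged sketch of the check Sage describes, using jq against the flat dump_mempools layout shown later in this thread (osd.0 is a placeholder):

# Sum the bytes of the bluestore cache mempools; compare the result against
# the configured bluestore_cache_size minus the rocksdb share mentioned above.
ceph daemon osd.0 dump_mempools | jq \
  '[.bluestore_cache_data.bytes, .bluestore_cache_onode.bytes, .bluestore_cache_other.bytes] | add'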


> Thanks 
> 2017-11-01 
>  
>  
> lin.yunfan 
>  
>  
>   发件人:Sage Weil <s...@newdream.net> 
> 发送时间:2017-10-24 20:03 
> 主题:Re: [ceph-users] [luminous]OSD memory usage increase when 
> writing a lot of data to cluster 
> 收件人:"shadow_lin"<shadow_...@163.com> 
> 抄送:"ceph-users"<ceph-users@lists.ceph.com> 
>   
> On Tue, 24 Oct 2017, shadow_lin wrote:  
> > Hi All, 
> > The cluster has 24 osd with 24 8TB hdd.  
> > Each osd server has 2GB ram and runs 2 OSDs with 2 8TB HDDs. I know the memory 
> > is below the recommended value, but this osd server is an ARM server so I 
> > can't do anything to add more ram.  
> > I created a replicated(2 rep) pool and an 20TB image and mounted to the test 
> > server with xfs fs.   
> >
> > I have set the ceph.conf to this(according to other related post suggested):  
> > [osd]  
> > bluestore_cache_size = 104857600  
> > bluestore_cache_size_hdd = 104857600  
> > bluestore_cache_size_ssd = 104857600  
> > bluestore_cache_kv_max = 103809024  
> >
> >  osd map cache size = 20  

[ceph-users] Re: Re: Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-11-01 Thread shadow_lin
Hi Sage,

This is the mempool dump of my osd.1

ceph daemon osd.0 dump_mempools
{
"bloom_filter": {
"items": 0,
"bytes": 0
},
"bluestore_alloc": {
"items": 10301352,
"bytes": 10301352
},
"bluestore_cache_data": {
"items": 0,
"bytes": 0
},
"bluestore_cache_onode": {
"items": 386,
"bytes": 145136
},
"bluestore_cache_other": {
"items": 91914,
"bytes": 779970
},
"bluestore_fsck": {
"items": 0,
"bytes": 0
},
"bluestore_txc": {
"items": 16,
"bytes": 7040
},
"bluestore_writing_deferred": {
"items": 11,
"bytes": 7600020
},
"bluestore_writing": {
"items": 0,
"bytes": 0
},
"bluefs": {
"items": 170,
"bytes": 5688
},
"buffer_anon": {
"items": 96726,
"bytes": 5685575
},
"buffer_meta": {
"items": 30,
"bytes": 1560
},
"osd": {
"items": 72,
"bytes": 554688
},
"osd_mapbl": {
"items": 0,
"bytes": 0
},
"osd_pglog": {
"items": 197946,
"bytes": 35743344
},
"osdmap": {
"items": 8007,
"bytes": 144024
},
"osdmap_mapping": {
"items": 0,
"bytes": 0
},
"pgmap": {
"items": 0,
"bytes": 0
},
"mds_co": {
"items": 0,
"bytes": 0
},
"unittest_1": {
"items": 0,
"bytes": 0
},
"unittest_2": {
"items": 0,
"bytes": 0
},
"total": {
"items": 10696630,
"bytes": 60968397
}
}


And  the memory use by ps:
ceph  8173 27.3 41.0 1509892 848768 ?  Ssl  Oct31 419:30 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f --setuser ceph --setgroup ceph

And ceph tell osd.0 heap stats
osd.0 tcmalloc heap stats:
MALLOC:  398397808 (  379.9 MiB) Bytes in use by application
MALLOC: +340647936 (  324.9 MiB) Bytes in page heap freelist
MALLOC: + 32574936 (   31.1 MiB) Bytes in central cache freelist
MALLOC: + 22581232 (   21.5 MiB) Bytes in transfer cache freelist
MALLOC: + 51663048 (   49.3 MiB) Bytes in thread cache freelists
MALLOC: +  3152096 (3.0 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =849017056 (  809.7 MiB) Actual memory used (physical + swap)
MALLOC: +128180224 (  122.2 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =977197280 (  931.9 MiB) Virtual address space used
MALLOC:
MALLOC:  16765  Spans in use
MALLOC: 32  Thread heaps in use
MALLOC:   8192  Tcmalloc page size


I have run the test with about 10 hrs of writing; so far no OOM has happened. The 
OSD uses at most 9xxMB of memory and stays stable at around 800-900MB.
I set the bluestore cache to 100MB with this config:
bluestore_cache_size = 104857600
   bluestore_cache_size_hdd = 104857600
   bluestore_cache_size_ssd = 104857600
   bluestore_cache_kv_max = 103809024  

   I am not sure how to calculate whether it is right, because if I use 
bluestore_cache_size - 512m it would be a negative value.
   Did you mean rocksdb would use about 512MB of memory?

2017-11-01 


lin.yunfan



From: Sage Weil <s...@newdream.net>
Sent: 2017-11-01 20:11
Subject: Re: Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Wed, 1 Nov 2017, shadow_lin wrote: 
> Hi Sage, 
> We have tried compiled the latest ceph source code from github. 
> The build is ceph version 12.2.1-249-g42172a4 
> (42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable). 
> The memory problem seems better but the memory usage of osd is still keep 
> increasing as more data are wrote into the rbd image and the memory usage 
> won't drop after the write is stopped. 
>Could you specify from which commit the memeory bug is fixed? 

f60a942023088cbba53a816e6ef846994921cab3 and the prior 2 commits. 

If you look at 'ceph daemon osd.nnn dump_mempools' you can see three 
blu

[ceph-users] Re: Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-31 Thread shadow_lin
Hi Sage,
We have tried compiling the latest ceph source code from github.
The build is ceph version 12.2.1-249-g42172a4 
(42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable).
The memory problem seems better, but the memory usage of the OSDs still keeps 
increasing as more data is written into the rbd image, and the memory usage doesn't 
drop after the write is stopped.
   Could you specify from which commit the memory bug is fixed?
Thanks
2017-11-01 


lin.yunfan



From: Sage Weil <s...@newdream.net>
Sent: 2017-10-24 20:03
Subject: Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Tue, 24 Oct 2017, shadow_lin wrote: 
> Hi All, 
> The cluster has 24 osd with 24 8TB hdd. 
> Each osd server has 2GB ram and runs 2OSD with 2 8TBHDD. I know the memory 
> is below the remmanded value, but this osd server is an ARM  server so I 
> can't do anything to add more ram. 
> I created a replicated(2 rep) pool and an 20TB image and mounted to the test 
> server with xfs fs.  
>   
> I have set the ceph.conf to this(according to other related post suggested): 
> [osd] 
> bluestore_cache_size = 104857600 
> bluestore_cache_size_hdd = 104857600 
> bluestore_cache_size_ssd = 104857600 
> bluestore_cache_kv_max = 103809024 
>   
>  osd map cache size = 20 
> osd map max advance = 10 
> osd map share max epochs = 10 
> osd pg epoch persisted max stale = 10 
> The bluestore cache setting did improve the situation,but if i try to write 
> 1TB data by dd command(dd if=/dev/zero of=test bs=1G count=1000)  to rbd the 
> osd will eventually be killed by oom killer. 
> If I only wirte like 100G  data once then everything is fine. 
>   
> Why does the osd memory usage keep increasing whle writing ? 
> Is there anything I can do to reduce the memory usage? 

There is a bluestore memory bug that was fixed just after 12.2.1 was  
released; it will be fixed in 12.2.2.  In the meantime, you can consider  
running the latest luminous branch (not fully tested) from 
https://shaman.ceph.com/builds/ceph/luminous. 

sage ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: mkfs rbd image is very slow

2017-10-31 Thread shadow_lin
Hi Jason,
Thank you for your advice.
The no-discard option works great. It now takes 5 min to format a 5TB rbd image with 
xfs and only seconds to format it with ext4.
Is there any drawback to formatting an rbd image with the no-discard option?
Thanks


2017-10-31 


lin.yunfan



From: Jason Dillaman <jdill...@redhat.com>
Sent: 2017-10-30 03:07
Subject: Re: [ceph-users] mkfs rbd image is very slow
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Try running "mkfs.xfs -K" which disables discarding to see if that 
improves the mkfs speed. The librbd-based implementation encountered a 
similar issue before when certain OSs sent very small discard extents 
for very large disks. 
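
A hedged example of the no-discard variants discussed in this thread; /dev/rbd0 is a placeholder, and whether skipping the initial discard is appropriate depends on the backing pool:

# XFS: -K skips the TRIM/discard pass over the whole device at mkfs time.
mkfs.xfs -K /dev/rbd0

# ext4: the nodiscard extended option does the same for mke2fs.
mkfs.ext4 -E nodiscard /dev/rbd0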

On Sun, Oct 29, 2017 at 10:16 AM, shadow_lin <shadow_...@163.com> wrote: 
> Hi all, 
> I am testing ec pool backed rbd image performace and found that it takes a 
> very long time to format the rbd image by mkfs. 
> I created a 5TB image and mounted it on the client(ubuntu 16.04 with 4.12 
> kernel) and use mkfs.ext4 and mkfs.xfs to format it. 
> It takes hours to finish the format and the load on some osds are high and I 
> can get slow request warning from time to time. 
> 
> What is a reasonable time to format a 5TB rbd image?What should I do to 
> improve it? 
> Thanks 
> 
> 2017-10-29 
>  
> Frank 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 



--  
Jason ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [Luminous]How to choose the proper ec profile?

2017-10-30 Thread shadow_lin
Hi all,
I am wondering how to choose the proper EC profile for a new luminous EC rbd 
image.
If I set k too high, what would the drawbacks be?

Is it a good idea to set k=10, m=2? It sounds tempting: the storage capacity 
overhead is low and the redundancy is good.

What is the difference in storage safety (redundancy) between k=10,m=2 and k=4,
m=2?

What would be a good EC profile for archive purposes (decent write performance and 
just-ok read performance)?

Thanks
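
Not an answer to the tradeoff question, but for reference a hedged sketch of defining and inspecting a profile; the profile name, pool name, PG count and k/m values are placeholders to illustrate the syntax:

# Define and inspect an EC profile:
ceph osd erasure-code-profile set archive-k10m2 k=10 m=2 crush-failure-domain=host
ceph osd erasure-code-profile get archive-k10m2

# Create an EC pool from it and allow overwrites so rbd can use it (luminous):
ceph osd pool create ecpool 256 256 erasure archive-k10m2
ceph osd pool set ecpool allow_ec_overwrites true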
2017-10-30



Frank___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mkfs rbd image is very slow

2017-10-29 Thread shadow_lin
Hi all,
I am testing EC-pool-backed rbd image performance and found that it takes a very 
long time to format the rbd image with mkfs.
I created a 5TB image, mounted it on the client (ubuntu 16.04 with a 4.12 
kernel), and used mkfs.ext4 and mkfs.xfs to format it.
It takes hours to finish the format, the load on some OSDs is high, and I 
get slow request warnings from time to time.

What is a reasonable time to format a 5TB rbd image? What should I do to improve 
it?
Thanks

2017-10-29



Frank___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Re: [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread shadow_lin
Hi Sage,
When will 12.2.2 be released?

2017-10-24 

lin.yunfan



From: Sage Weil <s...@newdream.net>
Sent: 2017-10-24 20:03
Subject: Re: [ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Tue, 24 Oct 2017, shadow_lin wrote: 
> Hi All, 
> The cluster has 24 osd with 24 8TB hdd. 
> Each osd server has 2GB ram and runs 2OSD with 2 8TBHDD. I know the memory 
> is below the remmanded value, but this osd server is an ARM  server so I 
> can't do anything to add more ram. 
> I created a replicated(2 rep) pool and an 20TB image and mounted to the test 
> server with xfs fs.  
>   
> I have set the ceph.conf to this(according to other related post suggested): 
> [osd] 
> bluestore_cache_size = 104857600 
> bluestore_cache_size_hdd = 104857600 
> bluestore_cache_size_ssd = 104857600 
> bluestore_cache_kv_max = 103809024 
>   
>  osd map cache size = 20 
> osd map max advance = 10 
> osd map share max epochs = 10 
> osd pg epoch persisted max stale = 10 
> The bluestore cache setting did improve the situation,but if i try to write 
> 1TB data by dd command(dd if=/dev/zero of=test bs=1G count=1000)  to rbd the 
> osd will eventually be killed by oom killer. 
> If I only wirte like 100G  data once then everything is fine. 
>   
> Why does the osd memory usage keep increasing whle writing ? 
> Is there anything I can do to reduce the memory usage? 

There is a bluestore memory bug that was fixed just after 12.2.1 was  
released; it will be fixed in 12.2.2.  In the meantime, you can consider  
running the latest luminous branch (not fully tested) from 
https://shaman.ceph.com/builds/ceph/luminous. 

sage ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [luminous]OSD memory usage increase when writing a lot of data to cluster

2017-10-24 Thread shadow_lin
Hi All,
The cluster has 24 OSDs with 24 8TB HDDs. 
Each OSD server has 2GB of RAM and runs 2 OSDs with 2 8TB HDDs. I know the memory is 
below the recommended value, but this OSD server is an ARM server, so I can't do 
anything to add more RAM.
I created a replicated (2 rep) pool and a 20TB image and mounted it on the test 
server with an xfs fs. 

I have set ceph.conf to this (according to what other related posts suggested):
[osd]
bluestore_cache_size = 104857600
bluestore_cache_size_hdd = 104857600
bluestore_cache_size_ssd = 104857600
bluestore_cache_kv_max = 103809024

 osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10

The bluestore cache settings did improve the situation, but if I try to write 1TB 
of data with a dd command (dd if=/dev/zero of=test bs=1G count=1000) to the rbd, the 
OSD will eventually be killed by the oom killer.
If I only write something like 100GB of data at once then everything is fine.

Why does the OSD memory usage keep increasing while writing?
Is there anything I can do to reduce the memory usage?

2017-10-24



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How does ceph pg repair work in jewel or later versions of ceph?

2017-05-05 Thread shadow_lin
I have read that pg repair simply copies the data from the primary OSD to the 
other OSDs. Is that true, or have later versions of ceph improved on that?
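
Not settling the question above, but for reference a hedged sketch of the related commands; the pg id 3.1a is a placeholder:

ceph health detail                                      # shows which PG is inconsistent
rados list-inconsistent-obj 3.1a --format=json-pretty   # per-shard error details (jewel and later)
ceph pg repair 3.1a                                     # ask the primary OSD to repair the PG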

2017-05-05



lin.yunfan___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com