Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
Hi David,

I am sure most (if not all) of the data is in one pool. rbd_pool only holds the omap objects for the EC-backed RBD images.

ceph df:

GLOBAL:
    SIZE    AVAIL      RAW USED    %RAW USED
    427T    100555G    329T        77.03
POOLS:
    NAME           ID    USED    %USED    MAX AVAIL    OBJECTS
    ec_rbd_pool    3     219T    81.40    50172G       57441718
    rbd_pool       4     144     0        37629G       19

2018-06-26
shadow_lin

From: David Turner
Sent: 2018-06-26 10:21
Subject: Re: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
To: "shadow_lin"
Cc: "ceph-users"

You have 2 different pools. PGs in each pool are going to be a different size. It's like saying 12x + 13y should equal 2x + 23y because they each have 25 X's and Y's. Having equal PG counts on each osd is only balanced if you have a single pool, or a case where all PGs are identical in size. The latter is not likely.

On Mon, Jun 25, 2018, 10:02 PM shadow_lin wrote:

> Hi David,
> I am afraid I can't run the command you provided now, because I tried removing another osd on that host to see whether it would make the data distribution even, and it did. The PG numbers of my pools are powers of 2. Below is from my notes taken before removing the other osd:
>
> pool 3 'ec_rbd_pool' erasure size 6 min_size 5 crush_rule 2 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 3248 flags hashpspool,ec_overwrites,nearfull stripe_width 16384 application rbd
> pool 4 'rbd_pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 3248 flags hashpspool,nearfull stripe_width 0 application rbd
>
> PG distribution across the osds for all pools: https://pasteboard.co/HrBZv3s.png
>
> What I don't understand is why the data distribution is uneven when the PG distribution is even.
>
> 2018-06-26
> shadow_lin
>
> From: David Turner
> Sent: 2018-06-26 01:24
> Subject: Re: [ceph-users] Uneven data distribution with even pg distribution after rebalancing
> To: "shadow_lin"
> Cc: "ceph-users"
>
> I should be able to answer this question for you if you can supply the output of the following commands. It will print out all of your pool names along with how many PGs are in each pool. My guess is that you don't have a power-of-2 number of PGs in your pool. Alternatively you might have multiple pools, and the PGs from the various pools are just different sizes.
>
> ceph osd lspools | tr ',' '\n' | awk '/^[0-9]/ {print $2}' | while read pool; do echo $pool: $(ceph osd pool get $pool pg_num | cut -d' ' -f2); done
> ceph df
>
> For me the output looks like this.
>
> rbd: 64
> cephfs_metadata: 64
> cephfs_data: 256
> rbd-ssd: 32
>
> GLOBAL:
>     SIZE      AVAIL     RAW USED    %RAW USED
>     46053G    26751G    19301G      41.91
> POOLS:
>     NAME               ID    USED      %USED    MAX AVAIL    OBJECTS
>     rbd-replica              4897G     11.36    7006G        263000
>     cephfs_metadata          6141M     0.05     268G         11945
>     cephfs_data        7     10746G    43.41    14012G       2795782
>     rbd-replica-ssd          9241G     47.30    268G         75061
>
> On Sun, Jun 24, 2018 at 9:48 PM shadow_lin wrote:
>
>> Hi List,
>> The environment is:
>> Ceph 12.2.4
>> Balancer module on and in upmap mode
>> Failure domain is per host, 2 OSDs per host
>> EC k=4 m=2
>>
>> PG distribution was almost even before and after the rebalancing. After marking out one of the osds, I noticed a lot of data moving onto the other osd on the same host. The ceph osd df result is (osd.20 and osd.21 are in the same host and osd.20 was marked out):
>>
>> ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
>> 19   hdd 9.09560  1.0 9313G 7079G 2233G 76.01 1.00 135
>> 21   hdd 9.09560  1.0 9313G 8123G 1190G 87.21 1.15 135
>> 22   hdd 9.09560  1.0 9313G 7026G 2287G 75.44 1.00 133
>> 23   hdd 9.09560  1.0 9313G 7026G 2286G 75.45 1.00 134
>>
>> I am using RBD only, so the objects should all be 4M. I don't understand why osd.21 got significantly more data with the same PG count as the other osds.
>> Is this behavior expected, did I misconfigure something, or is this some kind of bug?
>> Thanks
>>
>> 2018-06-25
>> shadow_lin
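One way to see David's point concretely is to count PGs per OSD broken down by pool rather than in aggregate. A minimal sketch, assuming the pgs_brief column layout of this era of Ceph (first field is the pgid, whose prefix before the dot is the pool id; fifth field is the acting set):

ceph pg dump pgs_brief 2>/dev/null | awk '
    $1 ~ /^[0-9]+\./ {
        split($1, id, ".")            # pool id is the part before the dot
        gsub(/[][]/, "", $5)          # acting set, e.g. [19,21,22] -> 19,21,22
        n = split($5, osd, ",")
        for (i = 1; i <= n; i++) pgs[id[1] "." osd[i]]++
    }
    END { for (k in pgs) print "pool.osd " k ": " pgs[k] " PGs" }'

If osd.21 carried the same total PG count but a larger share of the big ec_rbd_pool PGs (and more of the tiny rbd_pool ones on the other osds), the usage gap above follows directly.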
Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?
Thanks. Is there any workaround for 10.2.10 to avoid all OSDs starting to split at the same time?

2018-04-01
shadow_lin

From: Pavan Rallabhandi <prallabha...@walmartlabs.com>
Sent: 2018-04-01 22:39
Subject: Re: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?
To: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>
Cc:

No, it is supported in the next version of Jewel: http://tracker.ceph.com/issues/22658

From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of shadow_lin <shadow_...@163.com>
Date: Sunday, April 1, 2018 at 3:53 AM
To: ceph-users <ceph-users@lists.ceph.com>
Subject: EXT: [ceph-users] Does jewel 10.2.10 support filestore_split_rand_factor?

Hi list,
The Jewel documentation lists the filestore_split_rand_factor option, but I can't find it with 'ceph daemon osd.x config show'.

ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe)

# ceph daemon osd.0 config show | grep split
"mon_osd_max_split_count": "32",
"journaler_allow_split_entries": "true",
"mds_bal_split_size": "1",
"mds_bal_split_rd": "25000",
"mds_bal_split_wr": "1",
"mds_bal_split_bits": "3",
"filestore_split_multiple": "4",
"filestore_debug_verify_split": "false",

2018-04-01
shadow_lin
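One stopgap sometimes used without the random factor is to stagger the split threshold itself, so the OSDs cross it at different times rather than all at once. A rough sketch (the modulo staggering below is purely illustrative, not a recommendation):

for osd in $(seq 0 95); do
    mult=$((8 + (osd % 4) * 2))   # e.g. 8, 10, 12, 14 cycling across OSDs
    ceph tell osd.$osd injectargs "--filestore-split-multiple $mult"
done

Note that injectargs changes are lost on OSD restart, so the per-OSD values would also need to go into each OSD's section of ceph.conf to persist.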
Re: [ceph-users] ceph mgr balancer bad distribution
Hi Stefan,

> Am 28.02.2018 um 13:47 schrieb Stefan Priebe - Profihost AG:
>> Hello,
>>
>> with jewel we always used the python crush optimizer which gave us a
>> pretty good distribution of the used space.

You mentioned a python crush optimizer for Jewel. Could you tell me where I can find it? Can it be used with multiple pools?

Thank you.

2018-03-29
shadowlin
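The optimizer referenced here is most likely the python-crush package (pip install crush); the exact flags below are an assumption from memory, not verified documentation, so treat this as a sketch of the workflow rather than a reference:

# Assumed python-crush workflow; verify flag names against its own docs.
pip install crush
ceph report > report.json
crush optimize --crushmap report.json --out-path optimized.crush --pool 3
ceph osd setcrushmap -i optimized.crush

As far as I recall it optimized one pool at a time, which is part of why multi-pool clusters were harder to balance with it.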
Re: [ceph-users] remove big rbd image is very slow
I have done that before, but most of the time I can't just delete the pool. Is there any other way to speed up rbd image deletion?

2018-03-27
shadowlin

From: Ilya Dryomov <idryo...@gmail.com>
Sent: 2018-03-26 20:09
Subject: Re: [ceph-users] remove big rbd image is very slow
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

On Sat, Mar 17, 2018 at 5:11 PM, shadow_lin <shadow_...@163.com> wrote:
> Hi list,
> My ceph version is jewel 10.2.10.
> I tried to use rbd rm to remove a 50TB image (without object map, because krbd
> doesn't support it). It took about 30 minutes to complete just about 3%. Is this
> expected? Is there a way to make it faster?
> I know there are scripts that delete the rados objects of the rbd image to make
> it faster. But is the slowness expected for the rbd rm command?
>
> PS: I also encounter very slow rbd export for large rbd images (a 20TB image
> with only a few GB of data takes hours to export). I guess both are related to
> the object map not being enabled, but krbd doesn't support the object map
> feature.

If you don't have any other images in that pool, you can simply delete the pool with "ceph osd pool delete". It'll take a second ;)

Thanks,
Ilya
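The object-deletion scripts mentioned above generally work along the lines of this hedged sketch (mypool/myimage are placeholders; without an object map the ls is still a full pool scan, but the deletes run in parallel instead of one object at a time):

prefix=$(rbd info mypool/myimage | awk '/block_name_prefix/ {print $2}')
rados -p mypool ls | grep "^$prefix" | xargs -n 100 -P 16 rados -p mypool rm
rbd rm mypool/myimage    # now quick: only the header/metadata objects remain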
Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
Hi Jason,

How is the old target gateway blacklisted? Is that a feature the target gateway (to support active/passive multipath) has to provide, or does it come purely from the rbd exclusive lock? I thought the exclusive lock only ensures that one client can write to the rbd at a time, and that another client can obtain the lock later once it is released.

2018-03-11
shadowlin

From: Jason Dillaman <jdill...@redhat.com>
Sent: 2018-03-11 07:46
Subject: Re: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>
Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>

On Sat, Mar 10, 2018 at 10:11 AM, shadow_lin <shadow_...@163.com> wrote:
> Hi Jason,
>
>> As discussed in this thread, for active/passive, upon initiator
>> failover, we used the RBD exclusive-lock feature to blacklist the old
>> "active" iSCSI target gateway so that it cannot talk w/ the Ceph
>> cluster before new writes are accepted on the new target gateway.
>
> I understand that while the new active target gateway is talking to the rbd,
> the old active target gateway cannot write because of the RBD exclusive lock.
> But after the new target gateway has completed its writes, if the old target
> gateway had some blocked IO during the failover, can't it then get the lock
> and overwrite the new writes?

Negative -- it's blacklisted so it cannot talk to the cluster.

> PS:
> Petasan say they can do active/active iscsi with patched suse kernel.

I'll let them comment on these corner cases.

> 2018-03-10
> shadowlin
>
> From: Jason Dillaman <jdill...@redhat.com>
> Sent: 2018-03-10 21:40
> Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
> To: "shadow_lin"<shadow_...@163.com>
> Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
>
> On Sat, Mar 10, 2018 at 7:42 AM, shadow_lin <shadow_...@163.com> wrote:
>> Hi Mike,
>> So for now only the suse kernel with target_core_rbd and tcmu-runner can
>> run active/passive multipath safely?
>
> Negative, the LIO / tcmu-runner implementation documented here [1] is
> safe for active/passive.
>
>> I am a newbie to iscsi. I think the stuck-IO-gets-executed overwrite
>> problem can happen with both active/active and active/passive.
>> What makes active/passive safer than active/active?
>
> As discussed in this thread, for active/passive, upon initiator
> failover, we used the RBD exclusive-lock feature to blacklist the old
> "active" iSCSI target gateway so that it cannot talk w/ the Ceph
> cluster before new writes are accepted on the new target gateway.
>
>> What mechanism should be implemented to avoid the problem with
>> active/passive and active/active multipath?
>
> Active/passive is solved as discussed above. For active/active, we
> don't have a solution that is known safe under all failure conditions.
> If LIO supported MCS (multiple connections per session) instead of
> just MPIO (multipath IO), the initiator would provide enough context
> to the target to detect IOs from a failover situation.
>
>> 2018-03-10
>> shadowlin
>>
>> From: Mike Christie <mchri...@redhat.com>
>> Sent: 2018-03-09 00:54
>> Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
>> To: "shadow_lin"<shadow_...@163.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>
>> Cc:
>>
>> On 03/07/2018 09:24 AM, shadow_lin wrote:
>>> Hi Christie,
>>> Is it safe to use active/passive multipath with krbd with exclusive lock
>>> for lio/tgt/scst/tcmu?
>>
>> No. We tried to use lio and krbd initially, but there is an issue where
>> IO might get stuck in the target/block layer and get executed after new
>> IO. So for lio, tgt and tcmu it is not safe as is right now. We could
>> add some code to tcmu's file_example handler which can be used with krbd
>> so it works like the rbd one.
>>
>> I do not know enough about SCST right now.
>>
>>> Is it safe to use active/active multipath if using the suse kernel with
>>> target_core_rbd?
>>> Thanks.
>>>
>>> 2018-03-07
>
> [1] http://docs.ceph.com/docs/master/rbd/iscsi-overview/
>
> --
> Jason
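For reference, the initiator-side half of active/passive is usually enforced in /etc/multipath.conf. A sketch along the lines of the upstream ceph-iscsi initiator guidance (the exact values here are illustrative, not authoritative):

devices {
    device {
        vendor                 "LIO-ORG"
        hardware_handler       "1 alua"
        path_grouping_policy   "failover"      # a single active path; the rest standby
        path_selector          "queue-length 0"
        prio                   "alua"
        path_checker           "tur"
        failback               60
        no_path_retry          "queue"
    }
}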
Re: [ceph-users] [jewel] High fs_apply_latency osds
Hi Chris,

The OSDs are running on ARM nodes. Every node has a two-core 1.5GHz 32-bit ARM CPU and 2GB of RAM, and runs 2 OSDs. The HDDs are 10TB, and the journal is colocated with the data on the same disk.
The drives are half full now, but the problem I described also happened when the HDDs were empty.
The filesystem is ext4, because I have some problems running xfs for now.
I am trying to better balance the PG distribution now, to see if that eases the high-latency problem.

2018-03-10
shadowlin

From: Chris Hoy Poy <chr...@base3.com.au>
Sent: 2018-03-10 09:44
Subject: Re: [ceph-users] [jewel] High fs_apply_latency osds
To: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>
Cc:

Hi Shadowlin,

Can you describe your hardware? CPU/RAM/hard drives involved, etc.
Also, how are your drives set up? How full are the drives? What filesystem is it?

Cheers
/chris

Sent from my SAMSUNG Galaxy S6 on the Telstra Mobile Network

---- Original message ----
From: shadow_lin <shadow_...@163.com>
Date: 10/3/18 1:41 am (GMT+08:00)
To: ceph-users <ceph-users@lists.ceph.com>
Subject: [ceph-users] [jewel] High fs_apply_latency osds

Hi list,
During my write test I find there are always some OSDs with high fs_apply_latency (1k-5k ms, 2-8 times more than the others). At first I thought it was caused by unbalanced PG distribution, but after I reweighted the OSDs the problem didn't go away.
I looked into the OSDs with high latency and found one thing in common: there is a lot of read activity on these OSDs. iostat shows 400-500 r/s and 2000-3000 rKB/s on the high-latency OSDs, while the normal OSDs have around 100 r/s and 300-400 rKB/s.
I tried restarting the OSD daemons with high latency. They did become normal for a while, but then other high-latency OSDs appeared, with the same high read activity.
What is this read activity for? Is there a way to lower the latency?
Thanks

2018-03-10
shadowlin
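As a quick way to line the latency outliers up with the extra reads, something like this sketch may help (osd.12 and /dev/sdX are placeholders; the second step assumes the data disk can be identified from the OSD's mount point):

ceph osd perf | sort -k3 -n | tail -5   # the five worst fs_apply_latency values
mount | grep ceph-12                    # find the device behind the worst osd
iostat -x 5 /dev/sdX                    # watch r/s, rKB/s and %util on it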
Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs
Thanks for your advice. I will try reweighting the OSDs of my cluster.

Why is ceph so sensitive to unbalanced PG distribution under high load?
ceph osd df result: https://pastebin.com/ur4Q9jsA
ceph osd perf result: https://pastebin.com/87DitPhV
There is no OSD with a very high PG count compared to the others.
When the write-test load is low everything seems fine, but during high write load some of the OSDs with more PGs can have 3-10 times the fs_apply_latency of the others. My guess is that the heavily loaded OSDs slow the whole cluster down (because I have only one pool across all OSDs) to the level they can handle, so the other OSDs see less load and show good latency.
Is this expected under high load (indicating the load is too high for the current cluster to handle)?
How does Luminous solve the uneven PG distribution problem? I have read that there is a pg-upmap exception table in the osdmap in Luminous 12.2.x, and that with it a perfect PG distribution among OSDs can be achieved.

2018-03-09
shadow_lin

From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-09 06:45
Subject: Re: [ceph-users] Uneven pg distribution cause high fs_apply_latency on osds with more pgs
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

PGs being unevenly distributed is a common occurrence in Ceph. Luminous started making some steps towards correcting this, but you're on Jewel. There are a lot of threads in the ML archives about fixing PG distribution. Generally every method comes down to increasing the weight on OSDs with too few PGs and decreasing the weight on the OSDs with too many PGs. There are a lot of schools of thought on the best way to implement this in your environment, which has everything to do with your client IO patterns and workloads. Looking into `ceph osd reweight-by-pg` might be a good place for you to start, as you are only looking at 1 pool in your cluster. If you have more pools, you generally need `ceph osd reweight-by-utilization`.

On Wed, Mar 7, 2018 at 8:19 AM shadow_lin <shadow_...@163.com> wrote:

> Hi list,
> Ceph version is jewel 10.2.10 and all OSDs are using filestore.
> The cluster has 96 OSDs and 1 pool with size=2 replication and 4096 PGs (based on the PG calculation method from the ceph docs, for ~100 PGs per OSD).
> The OSD with the most PGs has 104, and 6 OSDs have above 100 PGs. Most of the OSDs have PG counts somewhere in the 70s-90s, and the OSD with the fewest has 58 PGs.
> During the write test some of the OSDs have very high fs_apply_latency, around 1000-4000ms, while the normal ones are at 100-600ms. The OSDs with high latency are always the ones with more PGs.
> iostat on the high-latency OSDs shows the HDDs at about 95-96% util, while the normal ones are at 40-60%.
> I think the reason is that the OSDs with more PGs have to handle more write requests. Is this right? But even though the PG distribution is not even, the variation is not that large; how can the performance be so sensitive to it?
> Is there anything I can do to improve the performance and reduce the latency? How can I make the PG distribution more even?
> Thanks
>
> 2018-03-07
> shadowlin
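A hedged sketch of both approaches mentioned here (the 110 threshold means only OSDs more than 10% over the average get touched; the test- variant previews the changes without applying them):

# Jewel: reweight by PG count, dry run first
ceph osd test-reweight-by-pg 110
ceph osd reweight-by-pg 110

# Luminous 12.2.x: let the mgr balancer maintain pg-upmap-items entries
ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on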
Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
Hi David,

Thanks for the info.
Could I assume that if active/passive multipath is used with the rbd exclusive lock, then all targets which support rbd (via block) are safe?

2018-03-08
shadow_lin

From: David Disseldorp <dd...@suse.de>
Sent: 2018-03-08 08:47
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "shadow_lin"<shadow_...@163.com>
Cc: "Mike Christie"<mchri...@redhat.com>, "Lazuardi Nasution"<mrxlazuar...@gmail.com>, "Ceph Users"<ceph-users@lists.ceph.com>

Hi shadowlin,

On Wed, 7 Mar 2018 23:24:42 +0800, shadow_lin wrote:

> Is it safe to use active/active multipath if using the suse kernel with
> target_core_rbd?
> Thanks.

A cross-gateway failover race condition similar to what Mike described is currently possible with active/active target_core_rbd. It's a corner case that depends on a client assuming that unacknowledged I/O has been implicitly terminated and can be resumed via an alternate path, while the original gateway at the same time issues the original request such that it reaches the Ceph cluster after differing I/O to the same region via the alternate path.

It's not something that we've observed in the wild, but it is nevertheless a bug that is being worked on, with a resolution that should also be usable for active/active tcmu-runner.

Cheers, David
Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
Hi Christie,

Is it safe to use active/passive multipath with krbd with exclusive lock for lio/tgt/scst/tcmu?
Is it safe to use active/active multipath if using the suse kernel with target_core_rbd?
Thanks.

2018-03-07
shadowlin

From: Mike Christie
Sent: 2018-03-07 03:51
Subject: Re: [ceph-users] iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock
To: "Lazuardi Nasution", "Ceph Users"
Cc:

On 03/06/2018 01:17 PM, Lazuardi Nasution wrote:
> Hi,
>
> I want to do load balanced multipathing (multiple iSCSI gateway/exporter
> nodes) of iSCSI backed with RBD images. Should I disable the exclusive lock
> feature? What if I don't disable that feature? I'm using TGT (the manual
> way) since I got so many CPU stuck error messages when I was using LIO.

You are using LIO/TGT with krbd right?

You cannot or shouldn't do active/active multipathing. If you have the lock enabled then it bounces between paths for each IO and will be slow. If you do not have it enabled then you can end up with stale IO overwriting current data.
Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
What you said makes sense. I have encountered a few hardware-related issues where one abnormally behaving OSD blocked all IO of the whole cluster (all OSDs in one pool), which makes me think about how to avoid this situation.

2018-03-07
shadow_lin

From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-07 13:51
Subject: Re: Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Marking osds down is not without risks. You are taking away one of the copies of data for every PG on that osd. Also you are causing every PG on that osd to peer. If that osd comes back up, every PG on it again needs to peer, and then they need to recover. That is a lot of load and risk to automate into the system. Now let's take into consideration other causes of slow requests, like having more IO load than your spindle can handle, backfill settings set too aggressively (related to the first cause), or networking problems. If the mon were detecting slow requests on OSDs and marking them down, you could end up marking half of your cluster down or corrupting data by flapping OSDs.

The mon will mark osds down if the settings I mentioned are met. If the osd isn't unresponsive enough to stop responding to other OSDs or the mons, then there really isn't much that ceph can do to automate this safely. There are just so many variables. If ceph were a closed system on specific hardware, it could certainly monitor that hardware closely for early warning signs... but people run Ceph on everything they can compile it for, including Raspberry Pis.

The cluster admin, however, should be able to add their own early detection for failures. You can monitor a lot about disks, including things such as average await on a host, to see if the disks are taking longer than normal to respond. That particular check led us to find that we had several storage nodes with bad cache batteries on the controllers. Finding that explained some slowness we had noticed in the cluster. It also led us to a better method to catch that scenario sooner.

On Tue, Mar 6, 2018, 11:22 PM shadow_lin <shadow_...@163.com> wrote:

> Hi Turner,
> Thanks for your insight.
> I am wondering: if the mon can detect slow/blocked requests on a certain osd, why can't the mon mark an osd with blocked requests down once the requests have been blocked for a certain time?
>
> 2018-03-07
> shadow_lin
>
> From: David Turner <drakonst...@gmail.com>
> Sent: 2018-03-06 23:56
> Subject: Re: [ceph-users] Why one crippled osd can slow down or block all request to the whole ceph cluster?
> To: "shadow_lin"<shadow_...@163.com>
> Cc: "ceph-users"<ceph-users@lists.ceph.com>
>
> There are multiple settings that affect this. osd_heartbeat_grace is probably the most apt. If an OSD is not getting a response from another OSD for more than the heartbeat_grace period, then it will tell the mons that the OSD is down. Once mon_osd_min_down_reporters have told the mons that an OSD is down, then the OSD will be marked down by the cluster. If the OSD does not then talk to the mons directly to say that it is up, it will be marked out after mon_osd_down_out_interval is reached. If it does talk to the mons to say that it is up, then it should be responding again and be fine.
>
> In your case where the OSD is half up, half down... I believe all you can really do is monitor your cluster and troubleshoot OSDs causing problems like this. Basically every storage solution is vulnerable to this. Sometimes an OSD just needs to be restarted due to being in a bad state somehow, or simply removed from the cluster because the disk is going bad.
>
> On Sun, Mar 4, 2018 at 2:28 AM shadow_lin <shadow_...@163.com> wrote:
>
>> Hi list,
>> During my tests of ceph I find that sometimes the whole ceph cluster is blocked, and the reason is one unfunctional osd. Ceph can heal itself if an osd is down, but it seems that if an osd is half dead (it has a heartbeat but can't handle requests) then all the requests directed to that osd are blocked. If all osds are in one pool, the whole cluster is blocked because of that one hung osd.
>> I think this is because ceph distributes requests across all osds, and if one osd won't confirm that a request is done then everything is blocked.
>> Is there a way to let ceph mark the crippled osd down if the requests directed to it have been blocked for more than a certain time, to avoid the whole cluster being blocked?
>>
>> 2018-03-04
>> shadow_lin
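For reference, the knobs David names live in ceph.conf; a sketch of where they would go (the values shown are the usual defaults for this era of Ceph, for orientation rather than as a recommendation):

[osd]
osd heartbeat grace = 20           # seconds without a heartbeat reply before peers report an OSD down

[mon]
mon osd min down reporters = 2     # distinct reporters needed before the mon marks it down
mon osd down out interval = 600    # seconds a down OSD waits before being marked out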
Re: [ceph-users] how is iops from ceph -s client io section calculated?
If it were because of replication, then the iops in ceph status should be relatively stable and should be the replication-size multiple of fio's iops. From what I have seen, the iops in ceph status keep increasing over time until they become relatively stable.

2018-03-04
lin.yunfan

From: David Turner <drakonst...@gmail.com>
Sent: 2018-03-03 22:35
Subject: Re: [ceph-users] how is iops from ceph -s client io section calculated?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I would guess that the higher iops in ceph status are iops calculated from replication. fio isn't aware of the backend replication iops, only what it's doing to the rbd.

On Fri, Mar 2, 2018, 11:53 PM shadow_lin <shadow_...@163.com> wrote:

> Hi list,
> There is a client io section in the output of ceph -s, and I find its value
> kind of confusing.
> I am using fio to test rbd sequential write performance with a 4M block size.
> The throughput is about 2000MB/s and fio shows 500 iops. But in the ceph -s
> client io section the throughput is also about 2000MB/s while the iops are
> not constant: they keep increasing from about 1000 to 2000, and I found that
> as the iops increase, the throughput gets lower (by about 10-20%).
> Why do the iops in the ceph -s client io section behave like this?
>
> 2018-03-03
> lin.yunfan
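For concreteness, the arithmetic behind this exchange (assuming, e.g., size=2 replication, as used elsewhere in these threads):

client iops  = 2000 MB/s / 4 MB per op = 500   (matches fio)
backend iops = 500 x 2 replicas        = 1000  object writes/s

A counter that simply included replica traffic would therefore sit near a stable 1000, rather than drifting from 1000 toward 2000 as observed, which is the poster's point.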
[ceph-users] How does cache tier work in writeback mode?
Hi list,

I am testing the cache tier in writeback mode.
The test result is confusing: the write performance is worse than without a cache tier.
The hot storage pool is an all-SSD pool and the cold storage pool is an all-HDD pool.
I also created an hdd-pool and an ssd-pool with the same crush rules as the cache tier pools, for comparison.

The pool config:

    pool         OSDs    cap. (TB)    pg
    hot-pool     20      4.8          1024
    ssd-pool     20      4.8          1024
    cold-pool    140     1400         2048
    hdd-pool     140     1400         2048

The cache tier config:

# ceph osd tier add cold-pool hot-pool
pool 'hot-pool' is now (or already was) a tier of 'cold-pool'
# ceph osd tier cache-mode hot-pool writeback
set cache-mode for pool 'hot-pool' to writeback
# ceph osd tier set-overlay cold-pool hot-pool
overlay for 'cold-pool' is now (or already was) 'hot-pool'
# ceph osd pool set hot-pool hit_set_type bloom
set pool 39 hit_set_type to bloom
# ceph osd pool set hot-pool hit_set_count 10
set pool 39 hit_set_count to 10
# ceph osd pool set hot-pool hit_set_period 3600
set pool 39 hit_set_period to 3600
# ceph osd pool set hot-pool target_max_bytes 24000
set pool 39 target_max_bytes to 24000
# ceph osd pool set hot-pool target_max_objects 30
set pool 39 target_max_objects to 30
# ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
set pool 39 cache_target_dirty_ratio to 0.4
# ceph osd pool set hot-pool cache_target_dirty_high_ratio 0.6
set pool 39 cache_target_dirty_high_ratio to 0.6
# ceph osd pool set hot-pool cache_target_full_ratio 0.8
set pool 39 cache_target_full_ratio to 0.8
# ceph osd pool set hot-pool cache_min_flush_age 600
set pool 39 cache_min_flush_age to 600
# ceph osd pool set hot-pool cache_min_evict_age 1800
set pool 39 cache_min_evict_age to 1800

Write test:

cold-pool (tier) write test for 10s:
# rados bench -p cold-pool 10 write --no-cleanup
hdd-pool write test for 10s:
# rados bench -p hdd-pool 10 write --no-cleanup
ssd-pool write test for 10s:
# rados bench -p ssd-pool 10 write --no-cleanup

Result:

                        tier    hdd     ssd
    objects             695     737     2550
    bandwidth (MB/s)    272     289     1016
    avg latency (s)     0.23    0.22    0.06

Read test:

# rados bench -p cold-pool 10 seq
# rados bench -p cold-pool 10 rand
# rados bench -p hdd-pool 10 seq
# rados bench -p hdd-pool 10 rand
# rados bench -p ssd-pool 10 seq
# rados bench -p ssd-pool 10 rand

seq result:

                        tier     hdd      ssd
    bandwidth (MB/s)    806      789      1113
    avg latency (s)     0.074    0.079    0.056

rand result:

                        tier     hdd      ssd
    bandwidth (MB/s)    1106     790      1113
    avg latency (s)     0.056    0.079    0.056

To my understanding, a pool with a cache tier in writeback mode should perform like an all-SSD pool (the client gets an ack once the data is written to the hot storage), as long as the cache doesn't need to be flushed.
But in the write test, the pool with the cache tier performed worse than even the all-HDD pool.
Inspecting the pool stats, I found only 244 objects in the hot-pool but 695 objects in the cold pool (the write test wrote 695 objects). With my settings, 695 objects shouldn't trigger flushing.
Is there any setting or concept I have misunderstood?

2018-02-09
lin.yunfan
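Two sanity checks worth running against a setup like this, as a sketch (pool name as in the post): confirm what the cache pool actually has set, since flushing begins at cache_target_dirty_ratio times target_max_bytes/objects (whichever is hit first), and watch the dirty count while the bench runs:

# ceph osd pool get hot-pool target_max_bytes
# ceph osd pool get hot-pool target_max_objects
# ceph osd pool get hot-pool cache_target_dirty_ratio
# ceph df detail

The hot-pool row of ceph df detail includes DIRTY and object counts. Note that if target_max_objects really were 30 as echoed above, 0.4 x 30 = 12 dirty objects would already start flushing, which would be consistent with objects ending up in the cold pool during the bench.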
Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
Hi Maged,

I haven't hit this problem, but I think I did read the bug report you provided.
I just want to know the best practice for removing a journal/db/wal partition when other partitions on the same SSD back other OSDs, without affecting those OSDs.
I have used ceph-disk zap a lot before, but only to zap a whole disk (journal/db/wal colocated with data on the same disk), so I think I need to do it manually if I only want to "zap" one partition.
Thanks.

2018-02-01
lin.yunfan

From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-02-01 22:15
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore)?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

Hi Lin,

We do the extra dd after zapping the disk. ceph-disk has a zap function that uses wipefs to wipe fs traces, dd to zero 10MB at partition starts, then sgdisk to remove the partition table; I believe ceph-volume does the same. After this zap, for each data or db block that will be created on this device, we use the dd command to zero 500MB. This may be a bit overboard, but other users have had similar issues: http://tracker.ceph.com/issues/22354

Also, the initial zap will wipe the disk and zero the start of the partitions as they used to be; it is possible the new disk will have a db block with a different size, so the start of the partitioning has changed.

I am not sure if your question was because you hit this issue, or you just want to skip the extra dd step, or you are facing issues cleaning disks. If it is the latter, we can send you a patch that does this.

Maged

On 2018-02-01 15:04, shadow_lin wrote:

Hi Maged,
The problem you met was because of leftovers from an older cluster. Did you remove the db partition, or did you just reuse the old partition?
I thought Wido suggested removing the partition and then using dd to be safe. Is it safe if I don't remove the partition and just use dd to destroy the data on it?
What would ceph-disk or ceph-volume do with an existing journal/db/wal partition? Will they clean it, or just use it without any action?

2018-02-01
lin.yunfan

From: Maged Mokhtar <mmokh...@petasan.org>
Sent: 2018-02-01 14:22
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore)?
To: "David Turner"<drakonst...@gmail.com>
Cc: "shadow_lin"<shadow_...@163.com>, "ceph-users"<ceph-users@lists.ceph.com>

I would recommend, as Wido did, using the dd command. The db block device holds the metadata/allocation of objects stored in the data block; not cleaning it is asking for problems, and besides, it does not take any time. In our testing, building a new cluster on top of an older installation, we saw many cases where osds would not start and reported errors such as the fsid of the cluster and/or OSD not matching the metadata in the BlueFS superblock... these errors do not appear if we use the dd command.

On 2018-02-01 06:06, David Turner wrote:

I know that for filestore journals that is fine. I think it is also safe for bluestore. Doing Wido's recommendation of writing 100MB would be a good idea, but not necessary.

On Wed, Jan 31, 2018, 10:10 PM shadow_lin <shadow_...@163.com> wrote:

Hi David,
Thanks for your reply.
I am wondering what happens if I don't remove the journal (wal/db for bluestore) partition on the SSD and only zap the data disk, and then assign that journal (wal/db) partition to a new OSD. What would happen?

2018-02-01
lin.yunfan

From: David Turner <drakonst...@gmail.com>
Sent: 2018-01-31 17:24
Subject: Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore)?
To: "shadow_lin"<shadow_...@163.com>
Cc: "ceph-users"<ceph-users@lists.ceph.com>

I use gdisk to remove the partition and partprobe for the OS to see the new partition table. You can script it with sgdisk.

On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote:

Hi list,
If I create an osd with the journal (wal/db if it is bluestore) on the same hdd, I use ceph-disk zap to clean the disk when I want to remove the osd and wipe its data.
But if I use an ssd partition as the journal (wal/db if it is bluestore), how should I clean the journal (wal/db) of the osd I want to remove, especially when other osds are using other partitions of the same ssd as their journals (wal/db)?

2018-01-31
shadow_lin
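Putting the thread's advice together, cleaning just one journal/db partition on a shared SSD would look roughly like this sketch (/dev/sdb2 stands in for the partition of the removed OSD, which must already be stopped; double-check the partition number, since the neighbouring partitions belong to live OSDs):

# zero the start of the old journal/db partition (500MB, per Maged's practice)
dd if=/dev/zero of=/dev/sdb2 bs=1M count=500 oflag=direct
# then either hand the partition to the new OSD, or delete it:
sgdisk --delete=2 /dev/sdb
partprobe /dev/sdb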
Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ?
Hi Maged, The problem you met beacuse of the left over of older cluster.Did you remove the db partition or you just use the old partition? I thought Wido suggest to remove the partition then use the dd to be safe.Is it safe I don't remove the partition and just use dd the try to destory the data on that partition? How would ceph-disk or ceph-volume do to the existing partition of journal,db,wal?Will it clean it or it just uses it without any action? 2018-02-01 lin.yunfan 发件人:Maged Mokhtar <mmokh...@petasan.org> 发送时间:2018-02-01 14:22 主题:Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ? 收件人:"David Turner"<drakonst...@gmail.com> 抄送:"shadow_lin"<shadow_...@163.com>,"ceph-users"<ceph-users@lists.ceph.com> I would recommend as Wido to use the dd command. block db device holds the metada/allocation of objects stored in data block, not cleaning this is asking for problems, besides it does not take any time. In our testing building new custer on top of older installation, we did see many cases where osds will not start and report an error such as fsid of cluster and/or OSD does not match metada in BlueFS superblock...these errors do not appear if we use the dd command. On 2018-02-01 06:06, David Turner wrote: I know that for filestore journals that is fine. I think it is also safe for bluestore. Doing Wido's recommendation of writing 100MB would be a good idea, but not necessary. On Wed, Jan 31, 2018, 10:10 PM shadow_lin <shadow_...@163.com> wrote: Hi David, Thanks for your reply. I am wondering what if I don't remove the journal(wal,db for bluestore) partion on the ssd and only zap the data disk.Then I assign the journal(wal,db for bluestore) partion to a new osd.What would happen? 2018-02-01 lin.yunfan 发件人:David Turner <drakonst...@gmail.com> 发送时间:2018-01-31 17:24 主题:Re: [ceph-users] How to clean data of osd with ssd journal(wal, db if it is bluestore) ? 收件人:"shadow_lin"<shadow_...@163.com> 抄送:"ceph-users"<ceph-users@lists.ceph.com> I use gdisk to remove the partition and partprobe for the OS to see the new partition table. You can script it with sgdisk. On Wed, Jan 31, 2018, 4:10 AM shadow_lin <shadow_...@163.com> wrote: Hi list, if I create an osd with journal(wal,db if it is bluestore) in the same hdd, I use ceph-disk zap to clean the disk when I want to remove the osd and clean the data on the disk. But if I use a ssd partition as the journal(wal,db if it is bluestore) , how should I clean the journal (wal,db if it is bluestore) of the osd I want to remove?Especially when there are other osds are using other partition of the same ssd as journals(wal,db if it is bluestore) . 2018-01-31 shadow_lin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to clean data of osd with ssd journal (wal, db if it is bluestore)?
Hi David, Thanks for your reply. I am wondering: what if I don't remove the journal (wal/db for bluestore) partition on the ssd and only zap the data disk, then assign that journal (wal/db for bluestore) partition to a new osd? What would happen? 2018-02-01 lin.yunfan From: David Turner <drakonst...@gmail.com> Sent: 2018-01-31 17:24 Subject: Re: [ceph-users] How to clean data of osd with ssd journal (wal, db if it is bluestore)? To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> I use gdisk to remove the partition and partprobe for the OS to see the new partition table. You can script it with sgdisk.
[ceph-users] How to clean data of osd with ssd journal (wal, db if it is bluestore)?
Hi list, If I create an osd with the journal (wal/db if it is bluestore) on the same hdd, I use ceph-disk zap to clean the disk when I want to remove the osd and clean the data on the disk. But if I use an ssd partition as the journal (wal/db if it is bluestore), how should I clean the journal (wal/db if it is bluestore) of the osd I want to remove? Especially when other osds are using other partitions of the same ssd as their journals (wal/db if it is bluestore). 2018-01-31 shadow_lin
Re: [ceph-users] How ceph client read data from ceph cluster
Hi Maged, I just want to make sure I understand how a ceph client reads from the cluster. So with the current version of ceph (12.2.2) the client only reads from the primary osd (one copy), is that true? 2018-01-27 lin.yunfan From: Maged Mokhtar <mmokh...@petasan.org> Sent: 2018-01-26 20:27 Subject: Re: [ceph-users] How ceph client read data from ceph cluster To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> On 2018-01-26 09:09, shadow_lin wrote: Hi List, I read an old article about how a ceph client reads from a ceph cluster. It said the client only reads from the primary osd. Since a ceph cluster in replicated mode has several copies of the data, reading from only one copy seems to waste the performance of concurrent reads from all the copies. But that article is rather old, so maybe ceph has since been improved to read from all the copies? I haven't found any info about that. Any info would be appreciated. Thanks 2018-01-26 shadow_lin Hi, In the majority of cases you will have more concurrent io requests than disks, so the load will already be distributed evenly. If this is not the case and you have a large cluster with fewer clients, you may consider using object/rbd striping so each io will be divided into different osd requests. Maged
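A hedged illustration of the object/rbd striping Maged mentions; the pool name, image name, and stripe values below are placeholders, not tuned recommendations:

    # Create an image whose data is striped across more objects, so a
    # sequential read fans out to more OSDs in parallel.
    rbd create mypool/striped-img --size 100G \
        --object-size 4M --stripe-unit 64K --stripe-count 8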
[ceph-users] How ceph client read data from ceph cluster
Hi List, I read an old article about how a ceph client reads from a ceph cluster. It said the client only reads from the primary osd. Since a ceph cluster in replicated mode has several copies of the data, reading from only one copy seems to waste the performance of concurrent reads from all the copies. But that article is rather old, so maybe ceph has since been improved to read from all the copies? I haven't found any info about that. Any info would be appreciated. Thanks 2018-01-26 shadow_lin
Re: [ceph-users] Limit deep scrub
Hi, you can try adjusting osd_scrub_chunk_min, osd_scrub_chunk_max and osd_scrub_sleep.

osd scrub sleep
Description: Time to sleep before scrubbing the next group of chunks. Increasing this value will slow down the whole scrub operation, while client operations will be less impacted.
Type: Float
Default: 0

osd scrub chunk min
Description: The minimal number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during scrub.
Type: 32-bit Integer
Default: 5

2018-01-15 lin.yunfan From: Karun Josy Sent: 2018-01-15 06:53 Subject: [ceph-users] Limit deep scrub To: "ceph-users" Hello, It appears that the cluster is having many slow requests while it is scrubbing and deep scrubbing. Also, we sometimes see osds flapping. So we have set the flags noscrub and nodeep-scrub. When we unset them, 5 PGs start to scrub. Is there a way to limit it to one at a time? # ceph daemon osd.35 config show | grep scrub "mds_max_scrub_ops_in_progress": "5", "mon_scrub_inject_crc_mismatch": "0.00", "mon_scrub_inject_missing_keys": "0.00", "mon_scrub_interval": "86400", "mon_scrub_max_keys": "100", "mon_scrub_timeout": "300", "mon_warn_not_deep_scrubbed": "0", "mon_warn_not_scrubbed": "0", "osd_debug_scrub_chance_rewrite_digest": "0", "osd_deep_scrub_interval": "604800.00", "osd_deep_scrub_randomize_ratio": "0.15", "osd_deep_scrub_stride": "524288", "osd_deep_scrub_update_digest_min_age": "7200", "osd_max_scrubs": "1", "osd_op_queue_mclock_scrub_lim": "0.001000", "osd_op_queue_mclock_scrub_res": "0.00", "osd_op_queue_mclock_scrub_wgt": "1.00", "osd_requested_scrub_priority": "120", "osd_scrub_auto_repair": "false", "osd_scrub_auto_repair_num_errors": "5", "osd_scrub_backoff_ratio": "0.66", "osd_scrub_begin_hour": "0", "osd_scrub_chunk_max": "25", "osd_scrub_chunk_min": "5", "osd_scrub_cost": "52428800", "osd_scrub_during_recovery": "false", "osd_scrub_end_hour": "24", "osd_scrub_interval_randomize_ratio": "0.50", "osd_scrub_invalid_stats": "true", "osd_scrub_load_threshold": "0.50", "osd_scrub_max_interval": "604800.00", "osd_scrub_min_interval": "86400.00", "osd_scrub_priority": "5", "osd_scrub_sleep": "0.00", Karun
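A hedged sketch of how those knobs could be applied at runtime; the values are illustrative starting points only:

    # Throttle scrubbing: smaller chunks plus a sleep between them.
    ceph tell 'osd.*' injectargs '--osd_scrub_chunk_min 1 --osd_scrub_chunk_max 5 --osd_scrub_sleep 0.1'

Note that osd_max_scrubs (already 1 in the dump above) caps concurrent scrubs per OSD, not per cluster, which is why several PGs on different OSDs can still scrub at once.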
Re: [ceph-users] How to speed up backfill
Hi, Mine is purely backfilling (I removed an osd from the cluster), and it started at 600MB/s and ended at about 3MB/s. What does your recovery consist of? Is it backfill, log-replay pg recovery, or both? 2018-01-11 shadow_lin From: Josef Zelenka <josef.zele...@cloudevelops.com> Sent: 2018-01-11 15:26 Subject: Re: [ceph-users] How to speed up backfill To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> Hi, our recovery slowed down significantly towards the end, however it was still about five times faster than the original speed. We suspected that this is somehow caused by threading (more objects transferred - more threads used), but this is only an assumption. On 11/01/18 05:02, shadow_lin wrote: Hi, I have tried these two methods, and for backfilling it seems only osd-max-backfills works. What was your recovery speed when it came to the last few pgs or objects? 2018-01-11 shadow_lin From: Josef Zelenka <josef.zele...@cloudevelops.com> Sent: 2018-01-11 04:53 Subject: Re: [ceph-users] How to speed up backfill To: "shadow_lin"<shadow_...@163.com> Hi, i had the same issue a few days back, i tried playing around with these two: ceph tell 'osd.*' injectargs '--osd-max-backfills <value>' ceph tell 'osd.*' injectargs '--osd-recovery-max-active <value>' and it helped greatly (increased our recovery speed 20x), but be careful not to overload your systems.
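A hedged example of the tuning Josef describes, with illustrative values; raise them gradually and revert to the defaults once recovery finishes:

    # Increase backfill/recovery concurrency at runtime (values are placeholders).
    ceph tell 'osd.*' injectargs '--osd-max-backfills 4 --osd-recovery-max-active 8'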
[ceph-users] How to speed up backfill
Hi all, I am playing with the backfill settings to try to find out how to control the speed of backfill. So far I have only found that "osd max backfills" has an effect on backfill speed. But once all the pgs that need backfilling have begun backfilling, I can't find any way to speed backfills up. Especially when it comes to the last pg to recover, the speed is only a few MB/s (when multiple pgs are being backfilled, the speed can be more than 600MB/s in my test). I am a little confused about the backfill and recovery settings. Backfilling is a kind of recovery, but it seems the recovery settings only concern replaying pg logs to recover a pg. Would changing "osd recovery max active" or other recovery settings have any effect on backfilling? I tried "osd recovery op priority" and "osd recovery max active" with no luck. Any advice would be greatly appreciated. Thanks 2018-01-11 lin.yunfan
Re: [ceph-users] Bad crc causing osd hang and block all request.
Thanks for your advice. I rebuilt the osd and haven't had this happen again, so it could have been corruption on the hdds. 2018-01-11 lin.yunfan From: Konstantin Shalygin Sent: 2018-01-09 12:11 Subject: Re: [ceph-users] Bad crc causing osd hang and block all request. To: "ceph-users" > What could cause this problem? Is this caused by a faulty HDD? > what data's crc didn't match? This may be caused by a faulty drive. Check your dmesg.
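A hedged sketch of the drive check Konstantin suggests; the device name is a placeholder:

    # Look for I/O errors in the kernel log, then check the drive's SMART data.
    dmesg -T | grep -iE 'i/o error|ata|blk_update_request'
    smartctl -a /dev/sdX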
[ceph-users] Bad crc causing osd hang and block all request.
Hi lists, ceph version: luminous 12.2.2. The cluster was doing a write throughput test when this problem happened. The cluster health became error: Health check update: 27 stuck requests are blocked > 4096 sec (REQUEST_STUCK). Clients couldn't write any data into the cluster. osd22 and osd40 are the osds responsible for the problem. osd22's log shows the messages below, repeating:
2018-01-07 20:44:52.202322 b56db8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x96aa9400 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969602 vs existing csq=969601 existing_state=STATE_STANDBY
2018-01-07 20:44:52.250600 b56db8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.252470 b5edb8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x95c04000 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969604 vs existing csq=969603 existing_state=STATE_STANDBY
2018-01-07 20:44:52.300354 b5edb8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.302788 b56db8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x978e7a00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969606 vs existing csq=969605 existing_state=STATE_STANDBY
2018-01-07 20:44:52.350987 b56db8e0 0 bad crc in data 3751247614 != exp 3467727689
2018-01-07 20:44:52.352953 b5edb8e0 0 -- 10.0.2.12:6802/2798 >> 10.0.2.21:6802/2785 conn(0x97420e00 :6802 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 969608 vs existing csq=969607 existing_state=STATE_STANDBY
2018-01-07 20:44:52.400959 b5edb8e0 0 bad crc in data 3751247614 != exp 3467727689
osd40's log shows the messages below, repeating:
2018-01-07 20:44:52.200709 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484865 cs=969601 l=0).fault initiating reconnect
2018-01-07 20:44:52.251423 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484866 cs=969603 l=0).fault initiating reconnect
2018-01-07 20:44:52.301166 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484867 cs=969605 l=0).fault initiating reconnect
2018-01-07 20:44:52.351810 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484868 cs=969607 l=0).fault initiating reconnect
2018-01-07 20:44:52.401782 b4e9e8e0 0 -- 10.0.2.21:6802/2785 >> 10.0.2.12:6802/2798 conn(0x90a66700 :-1 s=STATE_OPEN pgs=484869 cs=969609 l=0).fault initiating reconnect
The NIC of osd22's host kept sending data to osd40's host at about 50MBps while this was happening. After rebooting osd22 the cluster went back to normal. This happened twice in my write test, with the same osds (osd22 and osd40). What could cause this problem? Is it caused by a faulty HDD? Which data's crc didn't match? 2018-01-09 lin.yunfan
[ceph-users] [luminous 12.2.2] bluestore cache uses much more memory than setting value
Hi all, I already knew that luminous would use more memory for the bluestore cache than the config setting, but I was expecting 1.5x, not 7-8x. Below is my bluestore cache setting:
[osd]
osd max backfills = 4
bluestore_cache_size = 134217728
bluestore_cache_kv_max = 134217728
osd client message size cap = 67108864
My osd nodes have only 2G of memory and I run 2 osds per node, so I set the cache value very low. I was running a read throughput test and then found some of my osds were killed by the oom killer and restarted. I found the oom-killed osd used much more memory for bluestore_cache_data than the normal ones. The oom-killed osd used 795MB of ram in the mempool and 722MB in bluestore_cache_data. A normal osd used about 120MB of ram in the mempool and 17MB in bluestore_cache_data. Graph of memory usage of the oom-killed osd: https://pasteboard.co/H1GzihS.png Graph of memory usage of the normal osd: https://pasteboard.co/H1GzaeF.png Is this a bug in the bluestore cache, or did I misunderstand the meaning of the bluestore cache settings in the config? 2018-01-06 lin.yunfan
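A hedged way to gather the numbers above on a live OSD (the osd id is a placeholder):

    # Per-pool accounting of the OSD's tracked memory, including the
    # bluestore_cache_* pools mentioned above.
    ceph daemon osd.0 dump_mempools
    # tcmalloc's view of the same process.
    ceph tell osd.0 heap stats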
[ceph-users] How to monitor slow request?
I am building a ceph monitoring dashboard and I want to monitor how many slow requests there are on each node. But I find that ceph.log sometimes only logs lines like the one below: 2017-12-27 14:59:47.852396 mon.mnc000 mon.0 192.168.99.80:6789/0 2147 : cluster [WRN] Health check failed: 4 slow requests are blocked > 32 sec (REQUEST_SLOW) There is no osd id info about where the slow request happened. What would be a proper way to monitor which osd caused the slow requests, and how many slow requests are on that osd? 2017-12-27 shadow_lin
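A hedged sketch of one way to get per-OSD detail on luminous; the osd id is a placeholder:

    # 'ceph health detail' names the OSDs implicated in REQUEST_SLOW,
    # then the per-OSD admin socket shows the ops themselves.
    ceph health detail
    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops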
Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem (possibly tcmalloc related)
I had disabled scrubbing before the test. 2017-12-27 shadow_lin From: Webert de Souza Lima <webert.b...@gmail.com> Sent: 2017-12-22 20:37 Subject: Re: [ceph-users] [luminous 12.2.2] Cluster write performance degradation problem (possibly tcmalloc related) To: "ceph-users"<ceph-users@lists.ceph.com> On Thu, Dec 21, 2017 at 12:52 PM, shadow_lin <shadow_...@163.com> wrote: After 18:00 suddenly the write throughput dropped and the osd latency increased. TCmalloc started reclaiming the page heap freelist much more frequently. All of this happened very fast and every osd had the identical pattern. Could that be caused by OSD scrub? Check your "osd_scrub_begin_hour": ceph daemon osd.$ID config show | grep osd_scrub Regards, Webert Lima DevOps Engineer at MAV Tecnologia Belo Horizonte - Brasil IRC NICK - WebertRLZ
[ceph-users] [luminous 12.2.2] Cluster write performance degradation problem (possibly tcmalloc related)
My testing cluster is an all-hdd cluster with 12 osds (10T hdd each). I monitor luminous 12.2.2 write performance and osd memory usage with grafana graphs for statistics logging. The test is done by using fio on a mounted rbd with the following fio parameters: fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest I found there is noticeable performance degradation over time. Graph of write throughput and iops: https://pasteboard.co/GZflpTO.png Graph of osd memory usage (2 of 12 osds; the patterns are identical): https://pasteboard.co/GZfmfzo.png Graph of osd perf: https://pasteboard.co/GZfmZNx.png There are some interesting findings from the graphs. After 18:00 suddenly the write throughput dropped and the osd latency increased. TCmalloc started reclaiming the page heap freelist much more frequently. All of this happened very fast and every osd had the identical pattern. I have done this kind of test several times with different bluestore cache settings and found that with more cache the performance degradation happens later. I don't know if this is a bug or whether I can fix it by modifying some of my cluster's config. Any advice or direction to look into is appreciated. Thanks 2017-12-21 lin.yunfan
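A hedged way to correlate the drop with tcmalloc activity while the fio job runs (the interval and osd id are placeholders):

    # Sample tcmalloc stats once a minute during the test.
    watch -n 60 'ceph tell osd.0 heap stats'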
Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time
Thanks for your information, but I don't think that is my case. My cluster doesn't have any ssds. 2017-12-21 lin.yunfan From: Denes Dolhay <de...@denkesys.com> Sent: 2017-12-18 06:41 Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time To: "ceph-users"<ceph-users@lists.ceph.com> Hi, This is just a tip, and I do not know if this actually applies to you, but some ssds decrease their write throughput on purpose so they do not wear out the cells before the warranty period is over. Denes.
Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time
Thanks for the information, but I think that is not my case, because I am using only hdds in my cluster. From the command you provided I found the db_used_bytes is quite large, but I am not sure how the db used bytes relate to the amount of stored data and to performance. ceph daemon osd.0 perf dump | jq '.bluefs' | grep -E '(db|slow)' "db_total_bytes": 400029646848, "db_used_bytes": 9347006464, "slow_total_bytes": 0, "slow_used_bytes": 0 2017-12-18 shadow_lin From: Konstantin Shalygin <k0...@k0ste.ru> Sent: 2017-12-18 13:52 Subject: Re: [ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time To: "ceph-users"<ceph-users@lists.ceph.com> CC: "shadow_lin"<shadow_...@163.com> > I am testing luminous 12.2.2 and find a strange behavior of my cluster. Check your block.db usage. Luminous 12.2.2 is affected: http://tracker.ceph.com/issues/22264 [root@ceph-osd0]# ceph daemon osd.46 perf dump | jq '.bluefs' | grep -E '(db|slow)' "db_total_bytes": 30064762880, "db_used_bytes": 16777216, "slow_total_bytes": 240043163648, "slow_used_bytes": 659554304,
[ceph-users] [Luminous 12.2.2] Cluster performance drops after certain point of time
Hi All, I am testing luminous 12.2.2 and found a strange behavior of my cluster. I was testing my cluster's throughput by using fio on a mounted rbd with the following fio parameters: fio -directory=fiotest -direct=1 -thread -rw=write -ioengine=libaio -size=200G -group_reporting -bs=1m -iodepth 4 -numjobs=200 -name=writetest Everything was fine at the beginning, but after about 10 hrs of testing I found the performance dropped noticeably. Throughput dropped from 300-450MBps to 250-350MBps and osd latency increased from 300ms to 400ms. I also noted the heap stats showed the osds started reclaiming the page heap freelist much more frequently, while the rss memory of the osds kept increasing. Below are links to grafana graphs of my cluster. cluster metrics: https://pasteboard.co/GYEOgV1.jpg osd mem metrics: https://pasteboard.co/GYEP74M.png In the graphs the performance dropped after 10:00. I am investigating what happened but haven't found any clue yet. If you know anything about how to solve this problem or where I should look, please let me know. Thanks. 2017-12-18 lin.yunfan
Re: [ceph-users] The way to minimize osd memory usage?
My workload is mainly sequential writes (for surveillance use). I am not sure how the cache would affect write performance, or why the memory usage keeps increasing as more data is written into ceph storage. 2017-12-11 lin.yunfan From: Peter Woodman <pe...@shortbus.org> Sent: 2017-12-11 05:04 Subject: Re: [ceph-users] The way to minimize osd memory usage? To: "David Turner"<drakonst...@gmail.com> CC: "shadow_lin"<shadow_...@163.com>,"ceph-users"<ceph-users@lists.ceph.com>,"Konstantin Shalygin"<k0...@k0ste.ru> I've had some success in this configuration by cutting the bluestore cache size down to 512mb and running only one OSD on an 8tb drive. Still get occasional OOMs, but not terrible. Don't expect wonderful performance, though. Two OSDs would really be pushing it. On Sun, Dec 10, 2017 at 10:05 AM, David Turner <drakonst...@gmail.com> wrote: > The docs recommend 1GB/TB of OSDs. I saw people asking if this was still > accurate for bluestore and the answer was that it is more true for bluestore > than filestore. There might be a way to get this working at the cost of > performance. I would look at Linux kernel memory settings as much as ceph > and bluestore settings. Cache pressure is one that comes to mind where an > aggressive setting might help.
Re: [ceph-users] The way to minimize osd memory usage?
The 12.2.1 build (12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf)) we are running already includes the memory issue fix, and we are working on upgrading to the 12.2.2 release to see if there is any further improvement. 2017-12-10 lin.yunfan From: Konstantin Shalygin <k0...@k0ste.ru> Sent: 2017-12-10 12:29 Subject: Re: [ceph-users] The way to minimize osd memory usage? To: "ceph-users"<ceph-users@lists.ceph.com> CC: "shadow_lin"<shadow_...@163.com> > I am testing running ceph luminous(12.2.1-249-g42172a4 > (42172a443183ffe6b36e85770e53fe678db293bf) on ARM server. Try the new 12.2.2 - this release should fix memory issues with Bluestore.
[ceph-users] The way to minimize osd memory usage?
Hi All, I am testing ceph luminous (12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf)) on ARM servers. Each ARM server has a two-core 1.4GHz cpu and 2GB of ram, and I am running 2 osds per server with 2x8TB (or 2x10TB) hdds. Now I am facing a constant oom problem. I have tried upgrading ceph (to fix an osd memory leak) and lowering the bluestore cache settings. The oom problem did get better but still occurs constantly. I am hoping someone can give me some advice on the following questions. Is it impossible to run ceph on this hardware, or is it possible to do some tuning to solve this problem (even giving up some performance to avoid the oom problem)? Is it a good idea to use raid0 to combine the 2 HDDs into one, so I only need to run one osd and can save some memory (see the sketch below)? How is the memory usage of an osd related to the size of its HDD? PS: my ceph.conf bluestore cache settings:
[osd]
bluestore_cache_size = 104857600
bluestore_cache_kv_max = 67108864
osd client message size cap = 67108864
2017-12-10 lin.yunfan
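A hedged sketch of the raid0 idea from the question (device names are placeholders); note that losing either disk then takes out the whole combined OSD:

    # Combine two HDDs into one striped device, then deploy a single OSD on it.
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb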
[ceph-users] Why degraded objects count keeps increasing as more data is written into cluster?
Hi all, I have a pool with 2 replicas (failure domain host) and I was testing it with fio writing to an rbd image (at about 450MB/s) when one of my hosts crashed. I rebooted the crashed host and the mon said all osds and hosts were online, but some pgs were in degraded status. I thought it would recover, but after a while I found that even though all osds were up and in, the degraded object count kept increasing as more data was written into the cluster. If I stop writing data into the cluster, the degraded object count starts to decrease. Is this how pg recovery should work, or did I do something wrong? Why do the degraded objects become more and more numerous even when all osds are up and in? Thanks 2017-11-08 lin.yunfan
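A hedged way to watch this while it happens (commands as I recall them from the luminous CLI; verify against your version):

    # Which PGs are degraded right now, and overall recovery progress.
    ceph pg ls degraded
    ceph -s

Writes to a still-degraded PG create objects that are immediately short a replica, which is one reason the degraded count can rise while recovery is running.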
[ceph-users] [luminous][ERR] Error -2 reading object
Hi all, I am testing luminous with an ec pool backed rbd [k=8,m=2]. My luminous version is: ceph version 12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable) My cluster had some osd memory oom problems, so some osds got oom-killed and the cluster entered the recovery state. During the recovery I found some log lines like the ones below:
2017-11-04 12:12:11.041217 osd.7 [ERR] Error -2 reading object 4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head
2017-11-04 12:12:11.260225 osd.7 [ERR] Error -2 reading object 4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head
2017-11-04 12:12:11.583279 osd.7 [ERR] Error -2 reading object 4:4c277622:::rbd_data.3.390e74b0dc51.00186276:head
I haven't seen this error before and I can't google any information about it either. What could cause this error? Is there a way to fix it? 2017-11-04 lin.yunfan
[ceph-users] How would ec profile affect performance?
Hi all, I am wondering how the ec profile affects ceph performance. Will ec profile k=10,m=2 perform better than k=8,m=2, since there would be more chunks to write and read concurrently? Will ec profile k=10,m=2 need more memory and cpu power than ec profile k=8,m=2? 2017-11-02 lin.yunfan
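A hedged example of setting up the two profiles side by side for testing; the profile and pool names are placeholders:

    # Define both profiles, then build a test pool from one and benchmark it.
    ceph osd erasure-code-profile set ec82 k=8 m=2 crush-failure-domain=host
    ceph osd erasure-code-profile set ec102 k=10 m=2 crush-failure-domain=host
    ceph osd pool create ecpool82 128 128 erasure ec82

Keep in mind that with a host failure domain, k=10,m=2 needs at least 12 hosts, versus 10 for k=8,m=2.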
Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster
Hi Sage, I did some more testing and found this: I used "ceph tell osd.6 heap stats" and got
osd.6 tcmalloc heap stats:
MALLOC: 404608432 ( 385.9 MiB) Bytes in use by application
MALLOC: + 26599424 ( 25.4 MiB) Bytes in page heap freelist
MALLOC: + 13442496 ( 12.8 MiB) Bytes in central cache freelist
MALLOC: + 21112288 ( 20.1 MiB) Bytes in transfer cache freelist
MALLOC: + 21702320 ( 20.7 MiB) Bytes in thread cache freelists
MALLOC: + 3021024 ( 2.9 MiB) Bytes in malloc metadata
MALLOC: = 490485984 ( 467.8 MiB) Actual memory used (physical + swap)
MALLOC: + 162922496 ( 155.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: = 653408480 ( 623.1 MiB) Virtual address space used
MALLOC: 12958 Spans in use
MALLOC: 32 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
and the page heap won't be released by the osd itself and keeps increasing, but if I use "ceph tell osd.6 heap release" to manually release it, then the page heap freelist is released.
osd.6 tcmalloc heap stats:
MALLOC: 404608432 ( 385.9 MiB) Bytes in use by application
MALLOC: + 26599424 ( 25.4 MiB) Bytes in page heap freelist
MALLOC: + 13442496 ( 12.8 MiB) Bytes in central cache freelist
MALLOC: + 21112288 ( 20.1 MiB) Bytes in transfer cache freelist
MALLOC: + 21702320 ( 20.7 MiB) Bytes in thread cache freelists
MALLOC: + 3021024 ( 2.9 MiB) Bytes in malloc metadata
MALLOC: = 490485984 ( 467.8 MiB) Actual memory used (physical + swap)
MALLOC: + 162922496 ( 155.4 MiB) Bytes released to OS (aka unmapped)
MALLOC: = 653408480 ( 623.1 MiB) Virtual address space used
MALLOC: 12958 Spans in use
MALLOC: 32 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
I found this problem was discussed before at http://tracker.ceph.com/issues/12681; is it a tcmalloc problem? 2017-11-02 lin.yunfan From: Sage Weil <s...@newdream.net> Sent: 2017-11-01 20:11 Subject: Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> On Wed, 1 Nov 2017, shadow_lin wrote: > Hi Sage, > We have tried compiling the latest ceph source code from github. > The build is ceph version 12.2.1-249-g42172a4 > (42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable). > The memory problem seems better but the memory usage of the osd still keeps > increasing as more data is written into the rbd image, and the memory usage > won't drop after the writes stop. > Could you specify from which commit the memory bug is fixed? f60a942023088cbba53a816e6ef846994921cab3 and the prior 2 commits. If you look at 'ceph daemon osd.nnn dump_mempools' you can see three bluestore pools. This is what bluestore is using to account for its usage so it can know when to trim its cache. Do those add up to the bluestore_cache_size - 512m (for rocksdb) that you have configured? sage
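A hedged one-liner if you want to force the same reclaim across every OSD rather than one at a time:

    # Ask tcmalloc in each OSD to return its freelist pages to the OS.
    ceph tell 'osd.*' heap release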
Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster
Hi Sage, This is the mempool dump of my osd.0: ceph daemon osd.0 dump_mempools { "bloom_filter": { "items": 0, "bytes": 0 }, "bluestore_alloc": { "items": 10301352, "bytes": 10301352 }, "bluestore_cache_data": { "items": 0, "bytes": 0 }, "bluestore_cache_onode": { "items": 386, "bytes": 145136 }, "bluestore_cache_other": { "items": 91914, "bytes": 779970 }, "bluestore_fsck": { "items": 0, "bytes": 0 }, "bluestore_txc": { "items": 16, "bytes": 7040 }, "bluestore_writing_deferred": { "items": 11, "bytes": 7600020 }, "bluestore_writing": { "items": 0, "bytes": 0 }, "bluefs": { "items": 170, "bytes": 5688 }, "buffer_anon": { "items": 96726, "bytes": 5685575 }, "buffer_meta": { "items": 30, "bytes": 1560 }, "osd": { "items": 72, "bytes": 554688 }, "osd_mapbl": { "items": 0, "bytes": 0 }, "osd_pglog": { "items": 197946, "bytes": 35743344 }, "osdmap": { "items": 8007, "bytes": 144024 }, "osdmap_mapping": { "items": 0, "bytes": 0 }, "pgmap": { "items": 0, "bytes": 0 }, "mds_co": { "items": 0, "bytes": 0 }, "unittest_1": { "items": 0, "bytes": 0 }, "unittest_2": { "items": 0, "bytes": 0 }, "total": { "items": 10696630, "bytes": 60968397 } } And the memory use by ps: ceph 8173 27.3 41.0 1509892 848768 ? Ssl Oct31 419:30 /usr/bin/ceph-osd --cluster=ceph -i 0 -f --setuser ceph --setgroup ceph And "ceph tell osd.0 heap stats":
osd.0 tcmalloc heap stats:
MALLOC: 398397808 ( 379.9 MiB) Bytes in use by application
MALLOC: + 340647936 ( 324.9 MiB) Bytes in page heap freelist
MALLOC: + 32574936 ( 31.1 MiB) Bytes in central cache freelist
MALLOC: + 22581232 ( 21.5 MiB) Bytes in transfer cache freelist
MALLOC: + 51663048 ( 49.3 MiB) Bytes in thread cache freelists
MALLOC: + 3152096 ( 3.0 MiB) Bytes in malloc metadata
MALLOC: = 849017056 ( 809.7 MiB) Actual memory used (physical + swap)
MALLOC: + 128180224 ( 122.2 MiB) Bytes released to OS (aka unmapped)
MALLOC: = 977197280 ( 931.9 MiB) Virtual address space used
MALLOC: 16765 Spans in use
MALLOC: 32 Thread heaps in use
MALLOC: 8192 Tcmalloc page size
I have run the write test for about 10 hours; so far no oom has happened. The osd uses at most 9xxMB of memory and stays stable at around 800-900MB. I set the bluestore cache to 100MB with this config:
bluestore_cache_size = 104857600
bluestore_cache_size_hdd = 104857600
bluestore_cache_size_ssd = 104857600
bluestore_cache_kv_max = 103809024
I am not sure how to check whether it adds up, because if I compute bluestore_cache_size - 512m the value would be negative. Did you mean rocksdb would use about 512MB of memory? 2017-11-01 lin.yunfan From: Sage Weil <s...@newdream.net> Sent: 2017-11-01 20:11 Subject: Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> f60a942023088cbba53a816e6ef846994921cab3 and the prior 2 commits. If you look at 'ceph daemon osd.nnn dump_mempools' you can see three bluestore pools. This is what bluestore is using to account for its usage so it can know when to trim its cache. Do those add up to the bluestore_cache_size - 512m (for rocksdb) that you have configured? sage
Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster
Hi Sage, We have tried compiling the latest ceph source code from github. The build is ceph version 12.2.1-249-g42172a4 (42172a443183ffe6b36e85770e53fe678db293bf) luminous (stable). The memory problem seems better, but the memory usage of the osd still keeps increasing as more data is written into the rbd image, and the memory usage won't drop after the writes stop. Could you specify from which commit the memory bug is fixed? Thanks 2017-11-01 lin.yunfan From: Sage Weil <s...@newdream.net> Sent: 2017-10-24 20:03 Subject: Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> There is a bluestore memory bug that was fixed just after 12.2.1 was released; it will be fixed in 12.2.2. In the meantime, you can consider running the latest luminous branch (not fully tested) from https://shaman.ceph.com/builds/ceph/luminous. sage
Re: [ceph-users] mkfs rbd image is very slow
Hi Jason, Thank you for your advice. The no-discard option works great. It now takes 5 min to format the 5t rbd image with xfs, and only seconds to format with ext4. Is there any drawback to formatting an rbd image with the no-discard option? Thanks 2017-10-31 lin.yunfan From: Jason Dillaman <jdill...@redhat.com> Sent: 2017-10-30 03:07 Subject: Re: [ceph-users] mkfs rbd image is very slow To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> Try running "mkfs.xfs -K", which disables discarding, to see if that improves the mkfs speed. The librbd-based implementation encountered a similar issue before when certain OSs sent very small discard extents for very large disks. -- Jason
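Hedged examples of the no-discard formatting discussed above (the rbd device path is a placeholder):

    # xfs: -K skips discard at mkfs time.
    mkfs.xfs -K /dev/rbd0
    # ext4: the equivalent is the nodiscard extended option.
    mkfs.ext4 -E nodiscard /dev/rbd0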
[ceph-users] [Luminous] How to choose the proper ec profile?
Hi all, I am wondering how to choose the proper ec profile for a new luminous ec rbd image. If I set k too high, what would the drawback be? Is it a good idea to set k=10 m=2? It sounds tempting: the storage capacity overhead is low and the redundancy is good. What is the difference in storage safety (redundancy) between k=10 m=2 and k=4 m=2? What would be a good ec profile for archival purposes (decent write performance and just-ok read performance)? Thanks 2017-10-30 Frank
[ceph-users] mkfs rbd image is very slow
Hi all, I am testing ec pool backed rbd image performance and found that it takes a very long time to format the rbd image with mkfs. I created a 5TB image, mounted it on the client (ubuntu 16.04 with a 4.12 kernel) and used mkfs.ext4 and mkfs.xfs to format it. It takes hours to finish the format, the load on some osds is high, and I get slow request warnings from time to time. What is a reasonable time to format a 5TB rbd image? What should I do to improve it? Thanks 2017-10-29 Frank
Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster
Hi Sage, When will 12.2.2 be released? 2017-10-24 lin.yunfan From: Sage Weil <s...@newdream.net> Sent: 2017-10-24 20:03 Subject: Re: [ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster To: "shadow_lin"<shadow_...@163.com> CC: "ceph-users"<ceph-users@lists.ceph.com> There is a bluestore memory bug that was fixed just after 12.2.1 was released; it will be fixed in 12.2.2. In the meantime, you can consider running the latest luminous branch (not fully tested) from https://shaman.ceph.com/builds/ceph/luminous. sage
[ceph-users] [luminous] OSD memory usage increase when writing a lot of data to cluster
Hi All, The cluster has 24 osds with 24 8TB hdds. Each osd server has 2GB of ram and runs 2 osds with 2 8TB hdds. I know the memory is below the recommended value, but these osd servers are ARM servers, so I can't do anything to add more ram. I created a replicated (2 rep) pool and a 20TB image and mounted it on the test server with an xfs fs. I have set ceph.conf to this (as other related posts suggested):
[osd]
bluestore_cache_size = 104857600
bluestore_cache_size_hdd = 104857600
bluestore_cache_size_ssd = 104857600
bluestore_cache_kv_max = 103809024
osd map cache size = 20
osd map max advance = 10
osd map share max epochs = 10
osd pg epoch persisted max stale = 10
The bluestore cache settings did improve the situation, but if I try to write 1TB of data with a dd command (dd if=/dev/zero of=test bs=1G count=1000) to the rbd, the osd will eventually be killed by the oom killer. If I only write about 100G at a time, then everything is fine. Why does the osd memory usage keep increasing while writing? Is there anything I can do to reduce the memory usage? 2017-10-24 lin.yunfan
[ceph-users] How does ceph pg repair work in jewel or later versions of ceph?
I have read that pg repair simply copies the data from the primary osd to the other osds. Is that true, or have later versions of ceph improved on that? 2017-05-05 lin.yunfan