Re: [ceph-users] Effect of tunables on client system load
Thanks very much for the insights Greg! My most recent suspicion around the resource consumption is that, with my current configuration, xen is provisioning rbd-nbd storage for guests, rather than just using the kernel module like I was last time around. And (while I'm unsure of how this works) it seems there is a tapdisk process for each guest on each xenserver along with the rbd-nbd processes. Perhaps due to this use of NBD xenserver is taking a scenic route through userspace that it wasn't before... That said, gluster is attached via fuse... I apparently need to dig more into how Xen is attaching to Ceph vs gluster. Anyway, thanks again! Nate On Tue, Jun 13, 2017 at 5:30 PM, Gregory Farnum wrote: > > > On Thu, Jun 8, 2017 at 11:11 PM Nathanial Byrnes wrote: > >> Hi All, >> First, some background: >> I have been running a small (4 compute nodes) xen server cluster >> backed by both a small ceph cluster (4 other nodes with a total of 18x 1-spindle >> osd's) and a small gluster cluster (2 nodes each with a 14-spindle RAID >> array). I started with gluster 3-4 years ago, at first using NFS to access >> gluster, then upgraded to gluster FUSE. However, I had been fascinated with >> ceph since I first read about it, and probably added ceph as soon as XCP >> released a kernel with RBD support, possibly approaching 2 years ago. >> With Ceph, since I started out with the kernel RBD, I believe it >> locked me to Bobtail tunables. I connected to XCP via a project that tricks >> XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt >> infrastructure somehow... Only recently I've switched to a newer project >> that uses the RBD-NBD mapping instead. This should let me use whatever >> tunables my client SW supports, AFAIK. I have not yet changed my tunables as >> the data re-org will probably take a day or two (only 1Gb networking...). 
>> >> Over this time period, I've observed that my gluster backed guests >> tend not to consume as much of domain-0's (the Xen VM management host) >> resources as do my Ceph backed guests. To me, this is somewhat intuitive >> as the ceph client has to do more "thinking" than the gluster client. >> However, it seems to me that the IO performance of the VM guests is well >> outside what the difference in spindle count would suggest. I am open to >> the notion that there are probably quite a few sub-optimal design >> choices/constraints within the environment. However, I haven't the >> resources to conduct all that many experiments and benchmarks. So, over >> time I've ended up treating ceph as my resilient storage, and gluster as my >> more performant (3x vs 2x replication, and, as mentioned above, my gluster >> guests had quicker guest IO and lower dom-0 load). >> >> So, on to my questions: >> >> Would setting my tunables to jewel (my present release), or anything >> newer than bobtail (which is what I think I am set to if I read the ceph >> status warning correctly) reduce my dom-0 load and/or improve any aspects >> of the client IO performance? >> > > Unfortunately no. The tunables are entirely about how CRUSH works, and > while it's possible to construct pessimal CRUSH maps that are impossible to > satisfy and take a long time to churn through calculations, it's hard and > you clearly haven't done that here. I think you're just seeing that the > basic CPU cost of a Ceph IO is higher than in Gluster, or else there is > something unusual about the Xen configuration you have here compared to > more common deployments. > > >> >> Will adding nodes to the ceph cluster reduce load on dom-0, and/or >> improve client IO performance (I doubt the former and would expect the >> latter...)? >> > > In general adding nodes will increase parallel throughput (ie, async IO on > one client or the performance of multiple clients), but won't reduce > latencies. 
It shouldn't have much (any?) impact on client CPU usage (other > than if the client is pushing through more IO, it will use proportionally > more CPU), nor on the CPU usage of existing daemons. > > >> >> So, why did I bring up gluster at all? In an ideal world, I would like >> to have just one storage environment that would satisfy all my >> organization's needs. If forced to choose with the knowledge I have today, I >> would have to select gluster. I am hoping to come up with some actionable >> data points that might help me discover some of my mistakes which might >> explain my experience to date and maybe even help remedy said mistakes. As >> I mentioned earlier, I like ceph, more so than gluster, and would like to >> employ more within my environment. But, given budgetary constraints, I need >> to do what's best for my organization. >> >> > Yeah. I'm a little surprised you noticed it in the environment you > described, but there aren't many people running Xen on Ceph so perhaps > there's something odd happening with the setup it has there which I and > others aren't picking up on.
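For anyone landing on this thread from the archives: the tunables profile under discussion can be inspected and changed from any node with admin access. This is a sketch of the standard commands (the exact profile names and warning text depend on your release), not something taken from the thread itself:

```shell
# Show the CRUSH tunables the cluster is currently running with;
# the "profile" field reports bobtail/firefly/hammer/jewel etc.
ceph osd crush show-tunables

# Move to the jewel profile. This changes CRUSH placement and will
# trigger a large data migration -- slow on a 1Gb network, as noted above.
ceph osd crush tunables jewel
```

As Greg points out, this only affects placement, not the per-IO CPU cost on the client.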
Re: [ceph-users] osd_op_tp timeouts
I realized I sent this under the wrong thread; here I am sending it again: --- Hello all, I work in the same team as Tyler here, and I can provide more info. The cluster is indeed an RGW cluster, with many small (100 KB) objects similar to your use case, Bryan. But we have the blind bucket set up with "index_type": 1 for this particular bucket, as we wanted to avoid this bottleneck to begin with (we didn't need the listing feature). Would the bucket sharding still be a problem for blind buckets? Mark, would setting logging to 20 give any insights into what the threads are doing? Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Effect of tunables on client system load
On Thu, Jun 8, 2017 at 11:11 PM Nathanial Byrnes wrote: > Hi All, > First, some background: > I have been running a small (4 compute nodes) xen server cluster > backed by both a small ceph cluster (4 other nodes with a total of 18x 1-spindle > osd's) and a small gluster cluster (2 nodes each with a 14-spindle RAID > array). I started with gluster 3-4 years ago, at first using NFS to access > gluster, then upgraded to gluster FUSE. However, I had been fascinated with > ceph since I first read about it, and probably added ceph as soon as XCP > released a kernel with RBD support, possibly approaching 2 years ago. > With Ceph, since I started out with the kernel RBD, I believe it > locked me to Bobtail tunables. I connected to XCP via a project that tricks > XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt > infrastructure somehow... Only recently I've switched to a newer project > that uses the RBD-NBD mapping instead. This should let me use whatever > tunables my client SW supports, AFAIK. I have not yet changed my tunables as > the data re-org will probably take a day or two (only 1Gb networking...). > > Over this time period, I've observed that my gluster backed guests tend > not to consume as much of domain-0's (the Xen VM management host) resources > as do my Ceph backed guests. To me, this is somewhat intuitive as the ceph > client has to do more "thinking" than the gluster client. However, it seems > to me that the IO performance of the VM guests is well outside what the > difference in spindle count would suggest. I am open to the notion that > there are probably quite a few sub-optimal design choices/constraints > within the environment. However, I haven't the resources to conduct all > that many experiments and benchmarks. So, over time I've ended up > treating ceph as my resilient storage, and gluster as my more performant > (3x vs 2x replication, and, as mentioned above, my gluster guests had > quicker guest IO and lower dom-0 load). 
> > So, on to my questions: > > Would setting my tunables to jewel (my present release), or anything > newer than bobtail (which is what I think I am set to if I read the ceph > status warning correctly) reduce my dom-0 load and/or improve any aspects > of the client IO performance? > Unfortunately no. The tunables are entirely about how CRUSH works, and while it's possible to construct pessimal CRUSH maps that are impossible to satisfy and take a long time to churn through calculations, it's hard and you clearly haven't done that here. I think you're just seeing that the basic CPU cost of a Ceph IO is higher than in Gluster, or else there is something unusual about the Xen configuration you have here compared to more common deployments. > > Will adding nodes to the ceph cluster reduce load on dom-0, and/or > improve client IO performance (I doubt the former and would expect the > latter...)? > In general adding nodes will increase parallel throughput (ie, async IO on one client or the performance of multiple clients), but won't reduce latencies. It shouldn't have much (any?) impact on client CPU usage (other than if the client is pushing through more IO, it will use proportionally more CPU), nor on the CPU usage of existing daemons. > > So, why did I bring up gluster at all? In an ideal world, I would like > to have just one storage environment that would satisfy all my > organization's needs. If forced to choose with the knowledge I have today, I > would have to select gluster. I am hoping to come up with some actionable > data points that might help me discover some of my mistakes which might > explain my experience to date and maybe even help remedy said mistakes. As > I mentioned earlier, I like ceph, more so than gluster, and would like to > employ more within my environment. But, given budgetary constraints, I need > to do what's best for my organization. > > Yeah. 
I'm a little surprised you noticed it in the environment you described, but there aren't many people running Xen on Ceph so perhaps there's something odd happening with the setup it has there which I and others aren't picking up on. :/ Good luck! -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Living with huge bucket sizes
Hello all, I work in the same team as Tyler here, and I can provide more info. The cluster is indeed an RGW cluster, with many small (100 KB) objects similar to your use case, Bryan. But we have the blind bucket set up with "index_type": 1 for this particular bucket, as we wanted to avoid this bottleneck to begin with (we didn't need the listing feature). Would the bucket sharding still be a problem for blind buckets? Mark, would setting logging to 20 give any insights into what the threads are doing? Eric ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
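For readers wondering how a blind bucket is set up: the `index_type` Eric mentions lives in the zone's placement target. A rough sketch of the usual procedure follows (default zone assumed; the change only applies to buckets created afterwards):

```shell
# Export the zone, flip "index_type" to 1 (indexless/blind) in the
# relevant placement target, then load the edited definition back:
radosgw-admin zone get > zone.json
# ... edit zone.json, set "index_type": 1 in the placement target ...
radosgw-admin zone set < zone.json

# Restart RGW so newly created buckets pick up the blind index type
systemctl restart ceph-radosgw.target
```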
Re: [ceph-users] ceph pg repair : Error EACCES: access denied
What are the cephx permissions of the key you are using to issue repair commands? On Tue, Jun 13, 2017 at 8:31 AM Jake Grimmett wrote: > Dear All, > > I'm testing Luminous and have a problem repairing inconsistent pgs. This > occurs with v12.0.2 and is still present with v12.0.3-1507-g52f0deb > > # ceph health > HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors > > # ceph health detail > HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors > noout flag(s) set > pg 2.3 is active+clean+inconsistent, acting [53,50] > pg 2.11 is active+clean+inconsistent, acting [52,50] > 2 scrub errors > > # rados list-inconsistent-pg hotpool > ["2.3","2.11"] > > # rados list-inconsistent-obj 2.3 --format=json-pretty > No scrub information available for pg 2.3 > error 2: (2) No such file or directory > > # ceph pg repair 2.3 > Error EACCES: access denied > > # ceph pg scrub 2.3 > Error EACCES: access denied > > # ceph pg repair 2.11 > Error EACCES: access denied > > I'm using bluestore OSD's on Scientific Linux 7.3 > > Any ideas what might be the problem? > > thanks, > > Jake > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
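The question above can be answered on the cluster itself; a sketch, assuming the commands are issued as `client.admin` (a placeholder). On Luminous development releases, scrub/repair commands may also be gated on the new mgr cap, so granting one is worth trying -- that part is a guess, not a confirmed diagnosis:

```shell
# See which caps the key actually has
ceph auth get client.admin

# If there is no "caps mgr" line, grant one. Note this rewrites ALL
# caps for the key, so restate the mon/osd/mds caps at the same time:
ceph auth caps client.admin mon 'allow *' osd 'allow *' mds 'allow *' mgr 'allow *'
```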
Re: [ceph-users] Ceph Jewel XFS calltraces
On Tue, 13 Jun 2017 14:30:05 +0200, l...@jonas-server.de wrote: > [Tue Jun 13 13:18:48 2017] CPU: 3 PID: 3844 Comm: tp_fstore_op Not > tainted 4.4.0-75-generic #96-Ubuntu Looks like a kernel bug. However, this kernel isn't completely up to date; 4.4.0-79 is available. You'd probably be better off posting this on the xfs mailing list, though: linux-xfs (at) vger.kernel.org -- Emmanuel Florac | Direction technique | Intellique || +33 1 78 94 84 02 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph pg repair : Error EACCES: access denied
Dear All, I'm testing Luminous and have a problem repairing inconsistent pgs. This occurs with v12.0.2 and is still present with v12.0.3-1507-g52f0deb # ceph health HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors # ceph health detail HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors noout flag(s) set pg 2.3 is active+clean+inconsistent, acting [53,50] pg 2.11 is active+clean+inconsistent, acting [52,50] 2 scrub errors # rados list-inconsistent-pg hotpool ["2.3","2.11"] # rados list-inconsistent-obj 2.3 --format=json-pretty No scrub information available for pg 2.3 error 2: (2) No such file or directory # ceph pg repair 2.3 Error EACCES: access denied # ceph pg scrub 2.3 Error EACCES: access denied # ceph pg repair 2.11 Error EACCES: access denied I'm using bluestore OSD's on Scientific Linux 7.3 Any ideas what might be the problem? thanks, Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] osd_op_tp timeouts
Is this on an RGW cluster? If so, you might be running into the same problem I was seeing with large bucket sizes: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html The solution is to shard your buckets so the bucket index doesn't get too big. Bryan From: ceph-userson behalf of Tyler Bischel Date: Monday, June 12, 2017 at 5:12 PM To: "ceph-us...@ceph.com" Subject: [ceph-users] osd_op_tp timeouts Hi, We've been having this ongoing problem with threads timing out on the OSDs. Typically we'll see the OSD become unresponsive for about a minute, as threads from other OSDs time out. The timeouts don't seem to be correlated to high load. We turned up the logs to 10/10 for part of a day to catch some of these in progress, and saw the pattern below in the logs several times (grepping for individual threads involved in the time outs). We are using Jewel 10.2.7. Logs: 2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 5484'12967030 (0'0) modify 10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899 2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete 10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741 2017-06-12 18:45:12.530754 7f82ebfa8700 5 
write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 5484'12967030, trimmed: 2017-06-12 18:45:28.171843 7f82dc503700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.171877 7f82dc402700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.174900 7f82d8887700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.174979 7f82d8786700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.248499 7f82df05e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.248651 7f82df967700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.261044 7f82d8483700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 Metrics: OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0 to 16, IO In progress spikes from 0 to hundreds, IO Time Weighted, IO Time spike. Average Queue Size on the device spikes. One minute later, Write Time, Reads, and Read Time spike briefly. Any thoughts on what may be causing this behavior? --Tyler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
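The sharding fix Bryan references boils down to capping the per-shard index size. A sketch of the knobs involved (the shard count of 16 is an arbitrary example, and on Jewel this setting only affects newly created buckets, not existing ones):

```shell
# In ceph.conf on the RGW nodes, then restart radosgw:
#   [client.rgw]
#   rgw override bucket index max shards = 16

# Check how many objects a suspect bucket's index currently holds:
radosgw-admin bucket stats --bucket=mybucket
```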
Re: [ceph-users] v11.2.0 Disk activation issue while booting
I came across this a few times. My problem was with journals I set up by myself. I didn't give them the proper GUID partition type ID, so the udev rules didn't know how to make sure the partition looked correct. What the udev rules were unable to do was chown the journal block device as ceph:ceph so that it could be opened by the Ceph user. You can test by chowning the journal block device and trying to start the OSD again. Alternatively, if you want to see more information, you can start the daemon manually as opposed to starting it through systemd and see what its output looks like. On Tue, Jun 13, 2017 at 6:32 AM nokia ceph wrote: > Hello, > > Some osd's are not getting activated after a reboot operation, which causes > those particular osd's to land in a failed state. > > Here you can see mount points were not getting updated to osd-num and were > mounted at an incorrect mount point, so the osd's can't be > mounted/activated. > > Env:- RHEL 7.2 - EC 4+1, v11.2.0 bluestore. > > #grep mnt proc/mounts > /dev/sdh1 /var/lib/ceph/tmp/mnt.om4Lbq xfs > rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0 > /dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL xfs > rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0 > > From /var/log/messages.. > > -- > May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh2... > May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh1... > > > May 26 15:39:58 cn1 systemd: *start request repeated too quickly for* > ceph-disk@dev-sdh2.service => suspecting this could be the root cause. > May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: > /dev/sdh2. > May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh2.service entered > failed state. > May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh2.service failed. > May 26 15:39:58 cn1 systemd: start request repeated too quickly for > ceph-disk@dev-sdh1.service > May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: > /dev/sdh1. 
> May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh1.service entered > failed state. > May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh1.service failed. > -- > > But this issue occurs intermittently after a reboot operation. > > Note:- We haven't faced this problem in Jewel. > > Awaiting comments. > > Thanks > Jayaram > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
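To make the suggested test and fix concrete: a sketch assuming the journal partition is /dev/sdh2 and the OSD id is 12 (both placeholders). The type code below is the standard Ceph journal partition GUID that the udev rules match on:

```shell
# Quick test: hand the journal to the ceph user and retry the OSD
chown ceph:ceph /dev/sdh2
systemctl start ceph-osd@12

# Durable fix: stamp the partition with the Ceph journal type GUID
# so udev re-applies the ownership on every boot (partition 2 here)
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdh
partprobe /dev/sdh
```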
Re: [ceph-users] osd_op_tp timeouts
Hi Tyler, I wanted to make sure you got a reply to this, but unfortunately I don't have much to give you. It sounds like you already took a look at the disk metrics and ceph is probably not waiting on disk IO based on your description. If you can easily invoke the problem, you could attach gdb to the OSD and do a "thread apply all bt" to see what the threads are doing when it's timing out. Also, please open a tracker ticket if one doesn't already exist so we can make sure we get it recorded in case other people see the same thing. Mark On 06/12/2017 06:12 PM, Tyler Bischel wrote: Hi, We've been having this ongoing problem with threads timing out on the OSDs. Typically we'll see the OSD become unresponsive for about a minute, as threads from other OSDs time out. The timeouts don't seem to be correlated to high load. We turned up the logs to 10/10 for part of a day to catch some of these in progress, and saw the pattern below in the logs several times (grepping for individual threads involved in the time outs). We are using Jewel 10.2.7. 
*Logs:* 2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 5484'12967030 (0'0) modify 10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899 2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete 10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741 2017-06-12 18:45:12.530754 7f82ebfa8700 5 write_log with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, divergent_priors: 0, writeout_from: 5484'12967030, trimmed: 2017-06-12 18:45:28.171843 7f82dc503700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.171877 7f82dc402700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.174900 7f82d8887700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.174979 7f82d8786700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.248499 7f82df05e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.248651 7f82df967700 1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 2017-06-12 18:45:28.261044 7f82d8483700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15 *Metrics:* OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0 to 16, IO In progress spikes from 0 to hundreds, IO Time Weighted, IO Time spike. Average Queue Size on the device spikes. One minute later, Write Time, Reads, and Read Time spike briefly. Any thoughts on what may be causing this behavior? --Tyler ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
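Mark's gdb suggestion can be run non-interactively against a live OSD; a sketch (install the ceph debuginfo packages first, or the backtraces will be mostly bare addresses):

```shell
# One-shot backtrace of every thread in the (first) running ceph-osd.
# The process is paused only briefly while gdb walks the stacks.
gdb --batch -p "$(pidof ceph-osd | awk '{print $1}')" \
    -ex 'thread apply all bt' > osd-threads.txt 2>&1
```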
[ceph-users] Ceph Jewel XFS calltraces
Hello guys, we have currently an issue with our ceph setup based on XFS. Sometimes some nodes are dying with high load with this calltrace in dmesg: [Tue Jun 13 13:18:48 2017] BUG: unable to handle kernel NULL pointer dereference at 00a0 [Tue Jun 13 13:18:48 2017] IP: [] xfs_da3_node_read+0x30/0xb0 [xfs] [Tue Jun 13 13:18:48 2017] PGD 0 [Tue Jun 13 13:18:48 2017] Oops: [#1] SMP [Tue Jun 13 13:18:48 2017] Modules linked in: cpuid arc4 md4 nls_utf8 cifs fscache nfnetlink_queue nfnetlink xt_CHECKSUM xt_nat iptable_nat nf_nat_ipv4 xt_NFQUEUE xt_CLASSIFY ip6table_mangle dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag veth dummy bridge stp llc ebtable_filter ebtables iptable_mangle xt_CT iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter ip6_tables x_tables xfs ipmi_devintf dcdbas x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac edac_core input_leds joydev lpc_ich ioatdma shpchp 8250_fintek ipmi_si ipmi_msghandler acpi_pad acpi_power_meter [Tue Jun 13 13:18:48 2017] mac_hid vhost_net vhost macvtap macvlan kvm_intel kvm irqbypass cdc_ether nf_nat_ftp tcp_htcp nf_nat_pptp nf_nat_proto_gre nf_conntrack_ftp bonding nf_nat_sip nf_conntrack_sip nf_nat nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack usbnet mii lp parport autofs4 btrfs raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 raid1 hid_generic usbhid hid ixgbe igb vxlan ip6_udp_tunnel ahci dca udp_tunnel libahci i2c_algo_bit ptp megaraid_sas pps_core mdio wmi fjes [Tue Jun 13 13:18:48 2017] CPU: 3 PID: 3844 Comm: tp_fstore_op Not tainted 4.4.0-75-generic #96-Ubuntu [Tue Jun 13 13:18:48 2017] Hardware name: Dell Inc. 
PowerEdge R720/0XH7F2, BIOS 2.5.4 01/22/2016 [Tue Jun 13 13:18:48 2017] task: 881feda65400 ti: 883fbda08000 task.ti: 883fbda08000 [Tue Jun 13 13:18:48 2017] RIP: 0010:[] [] xfs_da3_node_read+0x30/0xb0 [xfs] [Tue Jun 13 13:18:48 2017] RSP: 0018:883fbda0bc88 EFLAGS: 00010286 [Tue Jun 13 13:18:48 2017] RAX: RBX: 8801102c5050 RCX: 0001 [Tue Jun 13 13:18:48 2017] RDX: RSI: RDI: 883fbda0bc38 [Tue Jun 13 13:18:48 2017] RBP: 883fbda0bca8 R08: 0001 R09: fffe [Tue Jun 13 13:18:48 2017] R10: 880007374ae0 R11: 0001 R12: 883fbda0bcd8 [Tue Jun 13 13:18:48 2017] R13: 880035ac4c80 R14: 0001 R15: 8b1f4885 [Tue Jun 13 13:18:48 2017] FS: 7fc574607700() GS:883fff04() knlGS: [Tue Jun 13 13:18:48 2017] CS: 0010 DS: ES: CR0: 80050033 [Tue Jun 13 13:18:48 2017] CR2: 00a0 CR3: 003fd828d000 CR4: 001426e0 [Tue Jun 13 13:18:48 2017] Stack: [Tue Jun 13 13:18:48 2017] c06b4b50 c0695ecc 883fbda0bde0 0001 [Tue Jun 13 13:18:48 2017] 883fbda0bd20 c06718b3 00030008 880e99b44010 [Tue Jun 13 13:18:48 2017] 360c65a8 88270f80b900 [Tue Jun 13 13:18:48 2017] Call Trace: [Tue Jun 13 13:18:48 2017] [] ? 
xfs_trans_roll+0x2c/0x50 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_attr3_node_inactive+0x183/0x220 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_attr3_node_inactive+0x1c9/0x220 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_attr3_root_inactive+0xac/0x100 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_attr_inactive+0x14c/0x1a0 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_inactive+0x85/0x120 [xfs] [Tue Jun 13 13:18:48 2017] [] xfs_fs_evict_inode+0xa5/0x100 [xfs] [Tue Jun 13 13:18:48 2017] [] evict+0xbe/0x190 [Tue Jun 13 13:18:48 2017] [] iput+0x1c1/0x240 [Tue Jun 13 13:18:48 2017] [] do_unlinkat+0x199/0x2d0 [Tue Jun 13 13:18:48 2017] [] SyS_unlink+0x16/0x20 [Tue Jun 13 13:18:48 2017] [] entry_SYSCALL_64_fastpath+0x16/0x71 [Tue Jun 13 13:18:48 2017] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 48 83 ec 10 48 c7 04 24 50 4b 6b c0 e8 dd fe ff ff 85 c0 75 46 48 85 db 74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 fa be 3e 74 [Tue Jun 13 13:18:48 2017] RIP [] xfs_da3_node_read+0x30/0xb0 [xfs] [Tue Jun 13 13:18:48 2017] RSP [Tue Jun 13 13:18:48 2017] CR2: 00a0 [Tue Jun 13 13:18:48 2017] ---[ end trace 5470d0d55cacb4ef ]--- The OSD has then the issue that it can not reach any other osd in the pool. -1043> 2017-06-13 13:24:00.917597 7fc539a72700 0 -- 192.168.14.19:6827/3389 >> 192.168.14.7:6805/3658 pipe(0x558219846000 sd=23 :6827 s=0 pgs=0 cs=0 l=0 c=0x55821a330400).accept connect_seq 7 vs existing 7 state standby -1042> 2017-06-13 13:24:00.918433 7fc539a72700 0 -- 192.168.14.19:6827/3389 >> 192.168.14.7:6805/3658
[ceph-users] v11.2.0 Disk activation issue while booting
Hello, Some osd's are not getting activated after a reboot operation, which causes those particular osd's to land in a failed state. Here you can see mount points were not getting updated to osd-num and were mounted at an incorrect mount point, so the osd's can't be mounted/activated. Env:- RHEL 7.2 - EC 4+1, v11.2.0 bluestore. #grep mnt proc/mounts /dev/sdh1 /var/lib/ceph/tmp/mnt.om4Lbq xfs rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0 /dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL xfs rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0 From /var/log/messages.. -- May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh2... May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh1... May 26 15:39:58 cn1 systemd: *start request repeated too quickly for* ceph-disk@dev-sdh2.service => suspecting this could be the root cause. May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: /dev/sdh2. May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh2.service entered failed state. May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh2.service failed. May 26 15:39:58 cn1 systemd: start request repeated too quickly for ceph-disk@dev-sdh1.service May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation: /dev/sdh1. May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh1.service entered failed state. May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh1.service failed. -- But this issue occurs intermittently after a reboot operation. Note:- We haven't faced this problem in Jewel. Awaiting comments. Thanks Jayaram ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] ceph durability calculation and test method
Hi all: I have some questions about the durability of ceph. I am trying to measure the durability of ceph. I know it should be related to host and disk failure probability, failure detection time, when recovery is triggered, and the recovery time. I use it with multiple replication, say k replicas. If I have N hosts, R racks, O osds per host, ignoring the switch, how should I define the failure probability of a disk and a host? I think they should be independent, and should be time-dependent. I googled it, but found little about it. I see AWS says it delivers 99.9% durability. How is this claimed? And can I design some test method to prove the durability? Or just let it run long enough and gather the statistics? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
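One common starting point for the calculation side of this question is a toy model: assume disk failures are independent, and count a data loss only when, after one replica dies, the remaining k-1 replicas of some PG also die before re-replication finishes. The AFR and recovery window below are assumed example numbers, not measurements, and the model ignores hosts, racks, and correlated failures entirely:

```shell
#!/bin/sh
# Toy k-replica durability model (independent failures only).
AFR=0.05        # assumed annual failure rate per disk (5%)
RECOVER_H=24    # assumed hours to re-replicate after a disk dies
K=3             # replica count

awk -v afr="$AFR" -v rh="$RECOVER_H" -v k="$K" 'BEGIN {
    p = afr * rh / 8760     # P(a given disk fails during the window)
    loss = p ^ (k - 1)      # P(the other k-1 replicas also fail)
    printf "per-incident loss probability: %.3e\n", loss
}'
```

A real estimate would then multiply by the expected number of failure incidents per year and account for how many PGs each pair of disks shares, which is where the N/R/O topology enters.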