Re: [ceph-users] Effect of tunables on client system load

2017-06-13 Thread Nathanial Byrnes
Thanks very much for the insights Greg!

My most recent suspicion around the resource consumption is that, with my
current configuration, Xen is provisioning rbd-nbd storage for guests
rather than just using the kernel module like I was last time around. And
(while I'm unsure of how this works) it seems there is a tapdisk process
for each guest on each XenServer, along with the rbd-nbd processes.
Perhaps due to this use of NBD, XenServer is taking a scenic route through
userspace that it wasn't taking before... That said, gluster is attached via
FUSE... I apparently need to dig more into how Xen is attaching to Ceph vs
gluster.
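Purely to illustrate the difference to myself -- the pool and image names
below are made up, not from my actual setup:

# kernel RBD (what the old LVM-over-RBD approach used): the mapping and all
# I/O stay inside the kernel
rbd map xen-pool/vm-disk-01        # -> /dev/rbd0

# rbd-nbd (what the newer project uses): one userspace rbd-nbd/librbd process
# per mapped image, every I/O crosses kernel -> nbd device -> userspace
rbd-nbd map xen-pool/vm-disk-01    # -> /dev/nbd0

so one tapdisk plus one rbd-nbd process per guest disk would at least make
the extra dom-0 load plausible.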

   Anyway, thanks again!

   Nate

On Tue, Jun 13, 2017 at 5:30 PM, Gregory Farnum  wrote:

>
>
> On Thu, Jun 8, 2017 at 11:11 PM Nathanial Byrnes  wrote:
>
>> Hi All,
>>First, some background:
>>I have been running a small (4 compute nodes) xen server cluster
>> backed by both a small ceph (4 other nodes with a total of 18x 1-spindle
>> osd's) and small gluster cluster (2 nodes each with a 14 spindle RAID
>> array). I started with gluster 3-4 years ago, at first using NFS to access
>> gluster, then upgraded to gluster FUSE. However, I had been fascinated with
>> ceph since I first read about it, and probably added ceph as soon as XCP
>> released a kernel with RBD support, possibly approaching 2 years ago.
>>With Ceph, since I started out with the kernel RBD, I believe it
>> locked me to Bobtail tunables. I connected to XCP via a project that tricks
>> XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt
>> infrastructure somehow... Only recently I've switched to a newer project
>> that uses the RBD-NBD mapping instead. This should let me use whatever
>> tunables my client SW supports, AFAIK. I have not yet changed my tunables as
>> the data re-org will probably take a day or two (only 1Gb networking...).
>>
>>Over this time period, I've observed that my gluster backed guests
>> tend not to consume as much of domain-0's (the Xen VM management host)
>> resources as do my Ceph backed guests. To me, this is somewhat intuitive
>> as the ceph client has to do more "thinking" than the gluster client.
>> However, it seems to me that the IO performance of the VM guests is well
>> outside what the difference in spindle count would suggest. I am open to
>> the notion that there are probably quite a few sub-optimal design
>> choices/constraints within the environment. However, I haven't the
>> resources to conduct all that many experiments and benchmarks. So, over
>> time I've ended up treating ceph as my resilient storage, and gluster as my
>> more performant (3x vs 2x replication, and, as mentioned above, my gluster
>> guests had quicker guest IO and lower dom-0 load).
>>
>> So, on to my questions:
>>
>>Would setting my tunables to jewel (my present release), or anything
>> newer than bobtail (which is what I think I am set to if I read the ceph
>> status warning correctly) reduce my dom-0 load and/or improve any aspects
>> of the client IO performance?
>>
>
> Unfortunately no. The tunables are entirely about how CRUSH works, and
> while it's possible to construct pessimal CRUSH maps that are impossible to
> satisfy and take a long time to churn through calculations, it's hard and
> you clearly haven't done that here. I think you're just seeing that the
> basic CPU cost of a Ceph IO is higher than in Gluster, or else there is
> something unusual about the Xen configuration you have here compared to
> more common deployments.
>
>
>>
>>Will adding nodes to the ceph cluster reduce load on dom-0, and/or
>> improve client IO performance (I doubt the former and would expect the
>> latter...)?
>>
>
> In general adding nodes will increase parallel throughput (ie, async IO on
> one client or the performance of multiple clients), but won't reduce
> latencies. It shouldn't have much (any?) impact on client CPU usage (other
> than if the client is pushing through more IO, it will use proportionally
> more CPU), nor on the CPU usage of existing daemons.
>
>
>>
>>So, why did I bring up gluster at all? In an ideal world, I would like
>> to have just one storage environment that would satisfy all my
>> organization's needs. If forced to choose with the knowledge I have today, I
>> would have to select gluster. I am hoping to come up with some actionable
>> data points that might help me discover some of my mistakes which might
>> explain my experience to date and maybe even help remedy said mistakes. As
>> I mentioned earlier, I like ceph, more so than gluster, and would like to
>> employ more within my environment. But, given budgetary constraints, I need
>> to do what's best for my organization.
>>
>>
> Yeah. I'm a little surprised you noticed it in the environment you
> described, but there aren't many people running Xen on Ceph so perhaps
> there's something odd happening with the setup it has there which I and
> others aren't picking up on. :/
>
> Good luck!
> -Greg

Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Eric Choi
I realized I sent this under the wrong thread, so here I am sending it again:

---

Hello all,

I work on the same team as Tyler, and I can provide more info here.

The cluster is indeed an RGW cluster, with many small (100 KB) objects
similar to your use case, Bryan.  But we have the blind bucket set up with
"index_type": 1 for this particular bucket, as we wanted to avoid this
bottleneck to begin with (we didn't need the listing feature).  Would the bucket
sharding still be a problem for blind buckets?
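For reference, one way to double-check the index type on a bucket (the bucket
name and instance id below are just placeholders) would be:

# look up the bucket entry to find its instance id
radosgw-admin metadata get bucket:my-bucket

# dump the bucket instance; "index_type": 1 is the indexless/blind case
radosgw-admin metadata get bucket.instance:my-bucket:<instance-id>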

Mark, would setting logging to 20 give any insights into what the threads are
doing?
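If it helps, I was thinking of bumping it at runtime rather than via a
restart, along these lines (osd.30 is just an example id from the logs):

# raise osd + messenger debug on one OSD without restarting it
ceph tell osd.30 injectargs '--debug_osd 20 --debug_ms 1'

# and drop it back down afterwards
ceph tell osd.30 injectargs '--debug_osd 0/5 --debug_ms 0/5'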


Eric


Re: [ceph-users] Effect of tunables on client system load

2017-06-13 Thread Gregory Farnum
On Thu, Jun 8, 2017 at 11:11 PM Nathanial Byrnes  wrote:

> Hi All,
>First, some background:
>I have been running a small (4 compute nodes) xen server cluster
> backed by both a small ceph (4 other nodes with a total of 18x 1-spindle
> osd's) and small gluster cluster (2 nodes each with a 14 spindle RAID
> array). I started with gluster 3-4 years ago, at first using NFS to access
> gluster, then upgraded to gluster FUSE. However, I had been fascinated with
> ceph since I first read about it, and probably added ceph as soon as XCP
> released a kernel with RBD support, possibly approaching 2 years ago.
>With Ceph, since I started out with the kernel RBD, I believe it
> locked me to Bobtail tunables. I connected to XCP via a project that tricks
> XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt
> infrastructure somehow... Only recently I've switched to a newer project
> that uses the RBD-NBD mapping instead. This should let me use whatever
> tunables my client SW supports, AFAIK. I have not yet changed my tunables as
> the data re-org will probably take a day or two (only 1Gb networking...).
>
>Over this time period, I've observed that my gluster backed guests tend
> not to consume as much of domain-0's (the Xen VM management host) resources
> as do my Ceph backed guests. To me, this is somewhat intuitive as the ceph
> client has to do more "thinking" than the gluster client. However, it seems
> to me that the IO performance of the VM guests is well outside what the
> difference in spindle count would suggest. I am open to the notion that
> there are probably quite a few sub-optimal design choices/constraints
> within the environment. However, I haven't the resources to conduct all
> that many experiments and benchmarks. So, over time I've ended up
> treating ceph as my resilient storage, and gluster as my more performant
> (3x vs 2x replication, and, as mentioned above, my gluster guests had
> quicker guest IO and lower dom-0 load).
>
> So, on to my questions:
>
>Would setting my tunables to jewel (my present release), or anything
> newer than bobtail (which is what I think I am set to if I read the ceph
> status warning correctly) reduce my dom-0 load and/or improve any aspects
> of the client IO performance?
>

Unfortunately no. The tunables are entirely about how CRUSH works, and
while it's possible to construct pessimal CRUSH maps that are impossible to
satisfy and take a long time to churn through calculations, it's hard and
you clearly haven't done that here. I think you're just seeing that the
basic CPU cost of a Ceph IO is higher than in Gluster, or else there is
something unusual about the Xen configuration you have here compared to
more common deployments.
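For reference, checking and changing the profile is just the following, though
as you noted the switch will kick off a big data reshuffle, so plan for it:

# show the tunables profile currently in effect
ceph osd crush show-tunables

# move to the jewel profile (expect significant data movement)
ceph osd crush tunables jewel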


>
>Will adding nodes to the ceph cluster reduce load on dom-0, and/or
> improve client IO performance (I doubt the former and would expect the
> latter...)?
>

In general adding nodes will increase parallel throughput (ie, async IO on
one client or the performance of multiple clients), but won't reduce
latencies. It shouldn't have much (any?) impact on client CPU usage (other
than if the client is pushing through more IO, it will use proportionally
more CPU), nor on the CPU usage of existing daemons.
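If you want to see the difference between the two effects on your own cluster,
rados bench separates them nicely -- the pool name is just a placeholder:

rados bench -p test-pool 30 write -t 1    # one op in flight: dominated by per-op latency
rados bench -p test-pool 30 write -t 16   # 16 ops in flight: shows parallel throughput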


>
>So, why did I bring up gluster at all? In an ideal world, I would like
> to have just one storage environment that would satisfy all my
> organization's needs. If forced to choose with the knowledge I have today, I
> would have to select gluster. I am hoping to come up with some actionable
> data points that might help me discover some of my mistakes which might
> explain my experience to date and maybe even help remedy said mistakes. As
> I mentioned earlier, I like ceph, more so than gluster, and would like to
> employ more within my environment. But, given budgetary constraints, I need
> to do what's best for my organization.
>
>
Yeah. I'm a little surprised you noticed it in the environment you
described, but there aren't many people running Xen on Ceph so perhaps
there's something odd happening with the setup it has there which I and
others aren't picking up on. :/

Good luck!
-Greg


Re: [ceph-users] Living with huge bucket sizes

2017-06-13 Thread Eric Choi
Hello all,

I work on the same team as Tyler, and I can provide more info here.

The cluster is indeed an RGW cluster, with many small (100 KB) objects
similar to your use case, Bryan.  But we have the blind bucket set up with
"index_type": 1 for this particular bucket, as we wanted to avoid this
bottleneck to begin with (we didn't need the listing feature).  Would the bucket
sharding still be a problem for blind buckets?

Mark, would setting logging to 20 give any insights into what the threads are
doing?


Eric


Re: [ceph-users] ceph pg repair : Error EACCES: access denied

2017-06-13 Thread Gregory Farnum
What are the cephx permissions of the key you are using to issue repair
commands?
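A quick way to check, assuming the commands run as the default admin key
(substitute whatever client you are actually using):

# show the caps attached to the key
ceph auth get client.admin

# caps can be adjusted if needed, e.g.
ceph auth caps client.admin mon 'allow *' osd 'allow *' mds 'allow *'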
On Tue, Jun 13, 2017 at 8:31 AM Jake Grimmett  wrote:

> Dear All,
>
> I'm testing Luminous and have a problem repairing inconsistent pgs. This
> occurs with v12.0.2 and is still present with v12.0.3-1507-g52f0deb
>
> # ceph health
> HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors
>
> # ceph health detail
> HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors
> noout flag(s) set
> pg 2.3 is active+clean+inconsistent, acting [53,50]
> pg 2.11 is active+clean+inconsistent, acting [52,50]
> 2 scrub errors
>
> #  rados list-inconsistent-pg hotpool
> ["2.3","2.11"]
>
> # rados list-inconsistent-obj 2.3 --format=json-pretty
> No scrub information available for pg 2.3
> error 2: (2) No such file or directory
>
> # ceph pg repair 2.3
> Error EACCES: access denied
>
> # ceph pg scrub 2.3
> Error EACCES: access denied
>
> # ceph pg repair 2.11
> Error EACCES: access denied
>
> I'm using bluestore OSD's on Scientific Linux 7.3
>
> Any ideas what might be the problem?
>
> thanks,
>
> Jake


Re: [ceph-users] Ceph Jewel XFS calltraces

2017-06-13 Thread Emmanuel Florac
Le Tue, 13 Jun 2017 14:30:05 +0200
l...@jonas-server.de écrivait:

> [Tue Jun 13 13:18:48 2017] CPU: 3 PID: 3844 Comm: tp_fstore_op Not 
> tainted 4.4.0-75-generic #96-Ubuntu

Looks like a kernel bug. However, this kernel isn't completely up to date;
4.4.0-79 is available. You'd probably be better off posting this on the xfs
mailing list, though: linux-xfs (at) vger.kernel.org

-- 

Emmanuel Florac |   Direction technique
|   Intellique
|   
|   +33 1 78 94 84 02





[ceph-users] ceph pg repair : Error EACCES: access denied

2017-06-13 Thread Jake Grimmett
Dear All,

I'm testing Luminous and have a problem repairing inconsistent pgs. This
occurs with v12.0.2 and is still present with v12.0.3-1507-g52f0deb

# ceph health
HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors

# ceph health detail
HEALTH_ERR noout flag(s) set; 2 pgs inconsistent; 2 scrub errors
noout flag(s) set
pg 2.3 is active+clean+inconsistent, acting [53,50]
pg 2.11 is active+clean+inconsistent, acting [52,50]
2 scrub errors

#  rados list-inconsistent-pg hotpool
["2.3","2.11"]

# rados list-inconsistent-obj 2.3 --format=json-pretty
No scrub information available for pg 2.3
error 2: (2) No such file or directory

# ceph pg repair 2.3
Error EACCES: access denied

# ceph pg scrub 2.3
Error EACCES: access denied

# ceph pg repair 2.11
Error EACCES: access denied

I'm using bluestore OSD's on Scientific Linux 7.3

Any ideas what might be the problem?

thanks,

Jake


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Bryan Stillwell
Is this on an RGW cluster?

If so, you might be running into the same problem I was seeing with large 
bucket sizes:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-June/018504.html

The solution is to shard your buckets so the bucket index doesn't get too big.
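For new buckets you can set a default shard count up front -- the value below
is only an example, size it to your expected object counts:

# ceph.conf on the RGW hosts; only affects buckets created after the change
[client.rgw.<your-instance>]
    rgw override bucket index max shards = 16

Existing buckets still have to be resharded separately.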

Bryan

From: ceph-users  on behalf of Tyler Bischel 

Date: Monday, June 12, 2017 at 5:12 PM
To: "ceph-us...@ceph.com" 
Subject: [ceph-users] osd_op_tp timeouts

Hi,
  We've been having this ongoing problem with threads timing out on the OSDs.  
Typically we'll see the OSD become unresponsive for about a minute, as threads 
from other OSDs time out.  The timeouts don't seem to be correlated to high 
load.  We turned up the logs to 10/10 for part of a day to catch some of these 
in progress, and saw the pattern below in the logs several times (grepping for 
individual threads involved in the timeouts).

We are using Jewel 10.2.7.

Logs:

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967027 lcod 5484'12967028 active] add_log_entry 
5484'12967030 (0'0) modify   
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
 by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484 pg[10.6d2( v 
5484'12967030 (5469'12963946,5484'12967030] local-les=5476 n=419 ec=593 les/c/f 
5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475 pi=4780-5474/109 luod=0'0 
lua=5484'12967019 crt=5484'12967028 lcod 5484'12967028 active] append_log: 
trimming to 5484'12967028 entries 5484'12967028 (5484'12967026) delete   
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
 by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to: 0'0, 
dirty_from: 4294967295'18446744073709551615, dirty_divergent_priors: false, 
divergent_priors: 0, writeout_from: 5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy 
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15



Metrics:
OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0 to 16, 
IO In progress spikes from 0 to hundreds, IO Time Weighted, IO Time spike.  
Average Queue Size on the device spikes.  One minute later, Write Time, Reads, 
and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler



Re: [ceph-users] v11.2.0 Disk activation issue while booting

2017-06-13 Thread David Turner
I came across this a few times.  My problem was with journals I set up by
myself.  I didn't give them the proper GUID partition type ID so the udev
rules didn't know how to make sure the partition looked correct.  What the
udev rules were unable to do was chown the journal block device as
ceph:ceph so that it could be opened by the Ceph user.  You can test by
chowning the journal block device and trying to start the OSD again.

Alternatively, if you want to see more information, you can start the daemon
manually as opposed to starting it through systemd and see what its output
looks like.
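Roughly what that looks like -- /dev/sdh is taken from your log snippet, the
OSD id is just a placeholder:

# quick test: fix ownership on the journal partition and retry the OSD
chown ceph:ceph /dev/sdh2
systemctl start ceph-osd@12

# longer term, tag the partition with the journal type GUID so the udev rules
# handle it on every boot (this is the GUID ceph-disk expects, if I recall it right)
sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdh

# or run the daemon in the foreground to see what it actually complains about
ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph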

On Tue, Jun 13, 2017 at 6:32 AM nokia ceph  wrote:

> Hello,
>
> Some OSDs are not getting activated after a reboot operation, which causes
> those particular OSDs to land in a failed state.
>
> Here you can see the mount points were not getting updated to the osd-num
> path and were instead left at an incorrect temporary mount point, so the
> OSDs could not be mounted/activated.
>
> Env:- RHEL 7.2 - EC 4+1, v11.2.0 bluestore.
>
> #grep mnt proc/mounts
> /dev/sdh1 /var/lib/ceph/tmp/mnt.om4Lbq xfs
> rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0
> /dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL xfs
> rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0
>
> From /var/log/messages..
>
> --
> May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh2...
> May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh1...
>
>
> May 26 15:39:58 cn1 systemd: *start request repeated too quickly for*
> ceph-disk@dev-sdh2.service   => suspecting this could be root cause.
> May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation:
> /dev/sdh2.
> May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh2.service entered
> failed state.
> May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh2.service failed.
> May 26 15:39:58 cn1 systemd: start request repeated too quickly for
> ceph-disk@dev-sdh1.service
> May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation:
> /dev/sdh1.
> May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh1.service entered
> failed state.
> May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh1.service failed.
> --
>
> But this issue occurs intermittently after a reboot operation.
>
> Note: We haven't faced this problem in Jewel.
>
> Awaiting for comments.
>
> Thanks
> Jayaram


Re: [ceph-users] osd_op_tp timeouts

2017-06-13 Thread Mark Nelson

Hi Tyler,

I wanted to make sure you got a reply to this, but unfortunately I don't 
have much to give you.  It sounds like you already took a look at the 
disk metrics and ceph is probably not waiting on disk IO based on your 
description.  If you can easily invoke the problem, you could attach gdb 
to the OSD and do a "thread apply all bt" to see what the threads are 
doing when it's timing out.  Also, please open a tracker ticket if one 
doesn't already exist so we can make sure we get it recorded in case 
other people see the same thing.
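Something like the following, with the matching debuginfo (or -dbg) package
installed so the backtraces come out symbolized; pick the pid of the OSD
that's timing out:

gdb -p <ceph-osd-pid>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

Catching it while the heartbeat_map warnings are firing is the useful part.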


Mark

On 06/12/2017 06:12 PM, Tyler Bischel wrote:

Hi,
  We've been having this ongoing problem with threads timing out on the
OSDs.  Typically we'll see the OSD become unresponsive for about a
minute, as threads from other OSDs time out.  The timeouts don't seem to
be correlated to high load.  We turned up the logs to 10/10 for part of
a day to catch some of these in progress, and saw the pattern below in
the logs several times (grepping for individual threads involved in the
timeouts).

We are using Jewel 10.2.7.

*Logs:*

2017-06-12 18:45:12.530698 7f82ebfa8700 10 osd.30 pg_epoch: 5484
pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476
n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967027 lcod
5484'12967028 active] add_log_entry 5484'12967030 (0'0) modify
10:4b771c01:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0634p1mdw1-11203-593EE138-2E:head
by client.1274027169.0:3107075054 2017-06-12 18:45:12.523899

2017-06-12 18:45:12.530718 7f82ebfa8700 10 osd.30 pg_epoch: 5484
pg[10.6d2( v 5484'12967030 (5469'12963946,5484'12967030] local-les=5476
n=419 ec=593 les/c/f 5476/5476/0 5474/5475/5455) [27,16,30] r=2 lpr=5475
pi=4780-5474/109 luod=0'0 lua=5484'12967019 crt=5484'12967028 lcod
5484'12967028 active] append_log: trimming to 5484'12967028 entries
5484'12967028 (5484'12967026) delete
10:4b796a74:::0b405695-e5a7-467f-bb88-37ce8153f1ef.1270728618.3834_filter0469p1mdw1-21390-593EE137-57:head
by client.1274027164.0:3183456083 2017-06-12 18:45:12.491741

2017-06-12 18:45:12.530754 7f82ebfa8700  5 write_log with: dirty_to:
0'0, dirty_from: 4294967295'18446744073709551615,
dirty_divergent_priors: false, divergent_priors: 0, writeout_from:
5484'12967030, trimmed:

2017-06-12 18:45:28.171843 7f82dc503700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.171877 7f82dc402700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174900 7f82d8887700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.174979 7f82d8786700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248499 7f82df05e700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.248651 7f82df967700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15

2017-06-12 18:45:28.261044 7f82d8483700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f82ebfa8700' had timed out after 15


*Metrics:*

OSD Disk IO Wait spikes from 2ms to 1s, CPU Procs Blocked spikes from 0
to 16, IO In progress spikes from 0 to hundreds, IO Time Weighted, IO
Time spike.  Average Queue Size on the device spikes.  One minute later,
Write Time, Reads, and Read Time spike briefly.

Any thoughts on what may be causing this behavior?

--Tyler





[ceph-users] Ceph Jewel XFS calltraces

2017-06-13 Thread list

Hello guys,

we currently have an issue with our Ceph setup based on XFS. Sometimes
some nodes die under high load with this call trace in dmesg:


[Tue Jun 13 13:18:48 2017] BUG: unable to handle kernel NULL pointer 
dereference at 00a0
[Tue Jun 13 13:18:48 2017] IP: [] 
xfs_da3_node_read+0x30/0xb0 [xfs]

[Tue Jun 13 13:18:48 2017] PGD 0
[Tue Jun 13 13:18:48 2017] Oops:  [#1] SMP
[Tue Jun 13 13:18:48 2017] Modules linked in: cpuid arc4 md4 nls_utf8 
cifs fscache nfnetlink_queue nfnetlink xt_CHECKSUM xt_nat iptable_nat 
nf_nat_ipv4 xt_NFQUEUE xt_CLASSIFY ip6table_mangle dccp_diag dccp 
tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag veth 
dummy bridge stp llc ebtable_filter ebtables iptable_mangle xt_CT 
iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables 
xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter 
ip6_tables x_tables xfs ipmi_devintf dcdbas x86_pkg_temp_thermal 
intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel aesni_intel ipmi_ssif aes_x86_64 lrw gf128mul 
glue_helper ablk_helper cryptd sb_edac edac_core input_leds joydev 
lpc_ich ioatdma shpchp 8250_fintek ipmi_si ipmi_msghandler acpi_pad 
acpi_power_meter
[Tue Jun 13 13:18:48 2017]  mac_hid vhost_net vhost macvtap macvlan 
kvm_intel kvm irqbypass cdc_ether nf_nat_ftp tcp_htcp nf_nat_pptp 
nf_nat_proto_gre nf_conntrack_ftp bonding nf_nat_sip nf_conntrack_sip 
nf_nat nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack usbnet mii 
lp parport autofs4 btrfs raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid10 
raid1 hid_generic usbhid hid ixgbe igb vxlan ip6_udp_tunnel ahci dca 
udp_tunnel libahci i2c_algo_bit ptp megaraid_sas pps_core mdio wmi fjes
[Tue Jun 13 13:18:48 2017] CPU: 3 PID: 3844 Comm: tp_fstore_op Not 
tainted 4.4.0-75-generic #96-Ubuntu
[Tue Jun 13 13:18:48 2017] Hardware name: Dell Inc. PowerEdge 
R720/0XH7F2, BIOS 2.5.4 01/22/2016
[Tue Jun 13 13:18:48 2017] task: 881feda65400 ti: 883fbda08000 
task.ti: 883fbda08000
[Tue Jun 13 13:18:48 2017] RIP: 0010:[]  
[] xfs_da3_node_read+0x30/0xb0 [xfs]

[Tue Jun 13 13:18:48 2017] RSP: 0018:883fbda0bc88  EFLAGS: 00010286
[Tue Jun 13 13:18:48 2017] RAX:  RBX: 8801102c5050 
RCX: 0001
[Tue Jun 13 13:18:48 2017] RDX:  RSI:  
RDI: 883fbda0bc38
[Tue Jun 13 13:18:48 2017] RBP: 883fbda0bca8 R08: 0001 
R09: fffe
[Tue Jun 13 13:18:48 2017] R10: 880007374ae0 R11: 0001 
R12: 883fbda0bcd8
[Tue Jun 13 13:18:48 2017] R13: 880035ac4c80 R14: 0001 
R15: 8b1f4885
[Tue Jun 13 13:18:48 2017] FS:  7fc574607700() 
GS:883fff04() knlGS:
[Tue Jun 13 13:18:48 2017] CS:  0010 DS:  ES:  CR0: 
80050033
[Tue Jun 13 13:18:48 2017] CR2: 00a0 CR3: 003fd828d000 
CR4: 001426e0

[Tue Jun 13 13:18:48 2017] Stack:
[Tue Jun 13 13:18:48 2017]  c06b4b50 c0695ecc 
883fbda0bde0 0001
[Tue Jun 13 13:18:48 2017]  883fbda0bd20 c06718b3 
00030008 880e99b44010
[Tue Jun 13 13:18:48 2017]  360c65a8 88270f80b900 
 

[Tue Jun 13 13:18:48 2017] Call Trace:
[Tue Jun 13 13:18:48 2017]  [] ? 
xfs_trans_roll+0x2c/0x50 [xfs]
[Tue Jun 13 13:18:48 2017]  [] 
xfs_attr3_node_inactive+0x183/0x220 [xfs]
[Tue Jun 13 13:18:48 2017]  [] 
xfs_attr3_node_inactive+0x1c9/0x220 [xfs]
[Tue Jun 13 13:18:48 2017]  [] 
xfs_attr3_root_inactive+0xac/0x100 [xfs]
[Tue Jun 13 13:18:48 2017]  [] 
xfs_attr_inactive+0x14c/0x1a0 [xfs]
[Tue Jun 13 13:18:48 2017]  [] xfs_inactive+0x85/0x120 
[xfs]
[Tue Jun 13 13:18:48 2017]  [] 
xfs_fs_evict_inode+0xa5/0x100 [xfs]

[Tue Jun 13 13:18:48 2017]  [] evict+0xbe/0x190
[Tue Jun 13 13:18:48 2017]  [] iput+0x1c1/0x240
[Tue Jun 13 13:18:48 2017]  [] do_unlinkat+0x199/0x2d0
[Tue Jun 13 13:18:48 2017]  [] SyS_unlink+0x16/0x20
[Tue Jun 13 13:18:48 2017]  [] 
entry_SYSCALL_64_fastpath+0x16/0x71
[Tue Jun 13 13:18:48 2017] Code: 55 48 89 e5 41 54 53 4d 89 c4 48 89 fb 
48 83 ec 10 48 c7 04 24 50 4b 6b c0 e8 dd fe ff ff 85 c0 75 46 48 85 db 
74 41 49 8b 34 24 <48> 8b 96 a0 00 00 00 0f b7 52 08 66 c1 c2 08 66 81 
fa be 3e 74
[Tue Jun 13 13:18:48 2017] RIP  [] 
xfs_da3_node_read+0x30/0xb0 [xfs]

[Tue Jun 13 13:18:48 2017]  RSP 
[Tue Jun 13 13:18:48 2017] CR2: 00a0
[Tue Jun 13 13:18:48 2017] ---[ end trace 5470d0d55cacb4ef ]---

The OSD then has the issue that it cannot reach any other OSD in the
pool.


 -1043> 2017-06-13 13:24:00.917597 7fc539a72700  0 -- 
192.168.14.19:6827/3389 >> 192.168.14.7:6805/3658 pipe(0x558219846000 
sd=23 :6827
s=0 pgs=0 cs=0 l=0 c=0x55821a330400).accept connect_seq 7 vs existing 7 
state standby
 -1042> 2017-06-13 13:24:00.918433 7fc539a72700  0 -- 
192.168.14.19:6827/3389 >> 192.168.14.7:6805/3658 

[ceph-users] v11.2.0 Disk activation issue while booting

2017-06-13 Thread nokia ceph
Hello,

Some OSDs are not getting activated after a reboot operation, which causes
those particular OSDs to land in a failed state.

Here you can see the mount points were not getting updated to the osd-num
path and were instead left at an incorrect temporary mount point, so the
OSDs could not be mounted/activated.

Env:- RHEL 7.2 - EC 4+1, v11.2.0 bluestore.

#grep mnt proc/mounts
/dev/sdh1 /var/lib/ceph/tmp/mnt.om4Lbq xfs
rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0
/dev/sdh1 /var/lib/ceph/tmp/mnt.EayTmL xfs
rw,noatime,attr2,inode64,sunit=512,swidth=512,noquota 0 0

>From /var/log/messages..

--
May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh2...
May 26 15:39:58 cn1 systemd: Starting Ceph disk activation: /dev/sdh1...


May 26 15:39:58 cn1 systemd: *start request repeated too quickly for*
ceph-disk@dev-sdh2.service   => suspecting this could be root cause.
May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation:
/dev/sdh2.
May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh2.service entered failed
state.
May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh2.service failed.
May 26 15:39:58 cn1 systemd: start request repeated too quickly for
ceph-disk@dev-sdh1.service
May 26 15:39:58 cn1 systemd: Failed to start Ceph disk activation:
/dev/sdh1.
May 26 15:39:58 cn1 systemd: Unit ceph-disk@dev-sdh1.service entered failed
state.
May 26 15:39:58 cn1 systemd: ceph-disk@dev-sdh1.service failed.
--

But this issue occurs intermittently after a reboot operation.

Note: We haven't faced this problem in Jewel.

Awaiting for comments.

Thanks
Jayaram


[ceph-users] ceph durability calculation and test method

2017-06-13 Thread Z Will
Hi all:
 I have some questions about the durability of ceph. I am trying to
measure the durability of ceph. I know it should be related to host and
disk failure probability, failure detection time, when the recovery is
triggered, and the recovery time. I use ceph with multiple replicas, say k
replicas. If I have N hosts, R racks, and O osds per host, ignoring the
switch, how should I define the failure probability of a disk and of a
host? I think they should be independent, and time-dependent. I googled
it, but found little about it. I see AWS says it delivers 99.9%
durability. How is this claimed?
And can I design some test method to prove the durability? Or should I
just let it run long enough and gather the statistics?
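To make the question concrete, here is my back-of-envelope with made-up
numbers: if a disk has a 2% annualized failure rate and a failed disk takes 8
hours to detect and re-replicate, the chance that one specific other disk also
fails inside that window is roughly 0.02 * 8/8766 ~= 1.8e-5, and with 3
replicas a third overlapping failure is needed on top of that. Is this the
right way to start, and how should I scale it by the number of disk sets that
share data (placement groups), and fold in correlated host and rack failures?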