Re: [ceph-users] Global, Synchronous Blocked Requests

2015-11-27 Thread Daniel Maraio

Hello,

  Can you provide some further details? What is the size of your 
objects, and how many objects do you have in your buckets? Are you 
using bucket index sharding, and are you sharding your objects over 
multiple buckets? Is the cluster doing any scrubbing during these 
periods? It sounds like you may be having trouble with your RGW bucket 
index. In our cluster, much smaller than yours mind you, it was 
necessary to put the RGW bucket index onto its own set of OSDs to 
isolate it from the rest of the cluster IO. We are still using 
single-object bucket indexes but plan to move to sharded bucket 
indexes eventually.
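
  For the per-bucket object counts, something along these lines should 
do it (BUCKETNAME is a placeholder):

radosgw-admin bucket list
radosgw-admin bucket stats --bucket=BUCKETNAME | grep num_objects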


  You should determine which OSDs your bucket indexes are located on 
and see if a pattern emerges with the OSDs that have slow requests 
during these periods. You can use the command 'ceph pg ls-by-pool 
.rgw.buckets.index' to show which PGs/OSDs the bucket index resides on.
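
  A rough sketch of the kind of cross-check I mean (the log path below 
is the default monitor ceph.log; adjust as needed):

# list each index PG together with its up/acting OSD set
ceph pg ls-by-pool .rgw.buckets.index

# tally which OSDs are reporting slow requests and compare them against
# the OSDs in those acting sets
grep 'slow request' /var/log/ceph/ceph.log | grep -o 'osd\.[0-9]*' | sort | uniq -c | sort -rn | head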


- Daniel

On 11/27/2015 10:24 PM, Brian Felton wrote:

Greetings Ceph Community,

We are running a Hammer cluster (0.94.3-1) in production that recently 
experienced a sudden, severe performance degradation. We've been 
migrating data from an older non-Ceph cluster at a fairly steady pace 
for the past eight weeks (about 5TB a week).  Overnight, the ingress 
rate dropped by 95%.  Upon investigation, we found we were receiving 
hundreds of thousands of 'slow request' warnings.


The cluster is being used as an S3-compliant object storage solution.  
What has been extremely problematic is that all cluster writes are 
being blocked simultaneously.  When something goes wrong, we've 
observed our transfer jobs (6-8 jobs, running across 4 servers) all 
simultaneously block on writes for 10-60 seconds, then release and 
continue simultaneously.  The blocks occur very frequently (at least 
once a minute after the previous block has cleared).


Our setup is as follows:

 - 5 monitor nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3, kernel 3.13.0-48)
 - 2 RGW nodes (VMs: 2 vCPU, 4GB RAM, Ubuntu 14.04.3, kernel 3.13.0-48)
 - 9 storage nodes (Supermicro servers: 32 CPU, 256GB RAM, Ubuntu 14.04.3, kernel 3.13.0-46)


Each storage server contains 72 6TB SATA drives for Ceph (648 OSDs, 
~3.5PB in total).  Each disk is set up as its own ZFS zpool.  Each OSD 
has a 10GB journal, located within the disk's zpool.


Other information that might be pertinent:
 - All servers (and VMs) use NTP to sync clocks.
 - The cluster uses k=7, m=2 erasure coding.
 - Each storage server has 6 10Gbps ports, with 2 bonded for front-end 
traffic and 4 bonded for back-end traffic.
 - Ingress and egress traffic is typically a few MB/sec tops, and we've 
stress tested it at levels at least 100x what we normally see.
 - We have pushed a few hundred TB into the cluster during burn-in 
without issue.


Given the global nature of the failure, we initially suspected 
networking issues.  After a solid day of investigation, we were unable 
to find any reason to suspect the network (no dropped packets on FE or 
BE networks, no ping loss, no switch issues, reasonable iperf tests, 
etc.).  We next examined the storage nodes, but we found no failures 
of any kind (nothing in system/kernel logs, no ZFS errors, 
iostat/atop/etc. all normal, etc.).


We've also attempted the following, with no success:
 - Rolling restart of the storage nodes
 - Rolling restart of the mon nodes
 - Complete shutdown/restart of all mon nodes
 - Expansion of RGW capacity from 2 servers to 5
 - Uncontrollable sobbing

Nothing about the cluster has changed recently -- no OS patches, no 
Ceph patches, no software updates of any kind. For the months we've 
had the cluster operational, we've had no performance-related issues.  
In the days leading up to the major performance issue we're now 
experiencing, the logs did record 100 or so 'slow request' events of 
>30 seconds on consecutive days.  After that, the slow requests became 
constant, and now our logs are spammed with entries like the following:


2015-11-28 02:30:07.328347 osd.116 192.168.10.10:6832/1689576 1115 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 60.024165 secs
2015-11-28 02:30:07.328358 osd.116 192.168.10.10:6832/1689576 1116 : cluster [WRN] slow request 60.024165 seconds old, received at 2015-11-28 02:29:07.304113: osd_op(client.214858.0:6990585 default.184914.126_2d29cad4962d3ac08bb7c3153188d23f [create 0~0 [excl],setxattr user.rgw.idtag (22),writefull 0~523488,setxattr user.rgw.manifest (444),setxattr user.rgw.acl (371),setxattr user.rgw.content_type (1),setxattr user.rgw.etag (33)] 48.158d9795 ondisk+write+known_if_redirected e15933) currently commit_sent
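
For reference, while an op like this is blocked, the admin socket on 
the owning OSD can be queried to see where it is stuck - a sketch, run 
on the node hosting osd.116 from the log above:

ceph daemon osd.116 dump_ops_in_flight
ceph daemon osd.116 dump_historic_ops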


We've analyzed the logs on the monitor nodes (ceph.log and 
ceph-mon..log), and there doesn't appear to be a smoking gun.  The 
'slow request' events are spread fairly evenly across all 648 OSDs.


A 'ceph health detail' typically shows something like the following:


[ceph-users] Crush Ruleset Questions

2015-10-03 Thread Daniel Maraio

Hello,

  I've looked over the crush documentation but I am a little confused. 
Perhaps someone here can help me out!


  I have three chassis, each with 6 SSD OSDs, that I use for a 
writeback cache. I have removed one OSD from each server and I want to 
make a new replicated ruleset that uses just these three OSDs. I want 
to segregate the IO for the RGW bucket index on this new ruleset to 
isolate it from scrub, promote, and eviction operations.


  My question is, how do I make a ruleset that will use just these 
three OSDs? My current ruleset for these hosts looks like:


root cache {
id -27  # do not change unnecessarily
# weight 3.780
alg straw
hash 0  # rjenkins1
item osd-cache01 weight 1.260
item osd-cache02 weight 1.260
item osd-cache03 weight 1.260
}

host osd-cache01 {
id -3   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.15 weight 0.210
item osd.18 weight 0.210
item osd.19 weight 0.210
item osd.20 weight 0.210
item osd.21 weight 0.210
item osd.25 weight 0.210
}
host osd-cache02 {
id -4   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.26 weight 0.210
item osd.27 weight 0.210
item osd.28 weight 0.210
item osd.29 weight 0.210
item osd.30 weight 0.210
item osd.31 weight 0.210
}
host osd-cache03 {
id -5   # do not change unnecessarily
# weight 1.260
alg straw
hash 0  # rjenkins1
item osd.32 weight 0.210
item osd.33 weight 0.210
item osd.34 weight 0.210
item osd.35 weight 0.210
item osd.36 weight 0.210
item osd.37 weight 0.210
}
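
  For what it's worth, one possible approach is to decompile the crush 
map, add a dedicated root and rule containing only those three OSDs, 
and point the index pool at the new ruleset. This is only an untested 
sketch: osd.100/101/102 stand in for the three OSDs that were pulled 
out, and the bucket/rule names and ids are made up.

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add host buckets for the three OSDs (removing them
# from their old host buckets if they still appear there), a new root
# containing those hosts, and a rule that draws from it, e.g.:
#
#     host osd-cache01-index {
#         id -40                  # any unused negative id
#         alg straw
#         hash 0
#         item osd.100 weight 0.210
#     }
#     (repeat for osd-cache02-index / osd.101 and osd-cache03-index / osd.102)
#
#     root cache-index {
#         id -43
#         alg straw
#         hash 0
#         item osd-cache01-index weight 0.210
#         item osd-cache02-index weight 0.210
#         item osd-cache03-index weight 0.210
#     }
#
#     rule cache-index {
#         ruleset 10              # any unused ruleset number
#         type replicated
#         min_size 1
#         max_size 10
#         step take cache-index
#         step chooseleaf firstn 0 type host
#         step emit
#     }
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set .rgw.buckets.index crush_ruleset 10

Pointing the pool at the new rule will remap and backfill its PGs onto 
those OSDs, so it is worth doing in a quiet window.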


- Daniel


[ceph-users] Moving/Sharding RGW Bucket Index

2015-09-01 Thread Daniel Maraio

Hello,

  I have two large buckets in my RGW and I think the performance is 
being impacted by the bucket index. One bucket contains 9 million 
objects and the other has 22 million. I'd like to shard the bucket 
index and also change the ruleset of the .rgw.buckets.index pool to put 
it on our SSD root. I could not find any documentation on this. It 
looks like the bucket indexes can be rebuilt using the radosgw-admin 
bucket check command, but I'm not sure how to proceed. We can stop 
writes or take the cluster down completely if necessary. My initial 
thought was to back up the existing index pool and create a new one. 
I'm not sure if I can change the index_pool of an existing bucket; if 
that is possible, I assume I can point it at the new pool and execute a 
radosgw-admin bucket check command to rebuild/shard the index.
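
  A sketch of the commands involved (BUCKETNAME and INSTANCE_ID are 
placeholders, and I have not verified that changing index_pool on an 
existing bucket instance actually works):

radosgw-admin bucket stats --bucket=BUCKETNAME            # bucket id, index_pool, object counts
radosgw-admin metadata get bucket:BUCKETNAME              # maps the bucket name to its current instance
radosgw-admin metadata get bucket.instance:BUCKETNAME:INSTANCE_ID > instance.json
# edit instance.json, then write it back and rebuild the index:
radosgw-admin metadata put bucket.instance:BUCKETNAME:INSTANCE_ID < instance.json
radosgw-admin bucket check --bucket=BUCKETNAME --fix --check-objects

As far as I know, rgw_override_bucket_index_max_shards in ceph.conf 
only applies to newly created buckets, so on its own it would not shard 
the two existing indexes.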


  Does anyone have experience getting sharding running with an existing 
bucket, or even moving the index pool to a different ruleset? When I 
change the crush ruleset for the .rgw.buckets.index pool to my SSD root 
we run into issues: buckets cannot be created or listed and writes 
cease to work, though reads seem to work fine. Thanks for your time!


- Daniel


[ceph-users] RGW blocked threads/timeouts

2015-06-09 Thread Daniel Maraio

Hello Cephers,

  I have a question about something we experience in our cluster. When 
we add new capacity or suffer failures, we will often get blocked 
requests during the rebuilding. This leads to threads in the RGW 
blocking and eventually no longer serving new requests. I suspect that 
if we set the RGW thread timeouts low enough this could alleviate the 
problem. We don't necessarily care if a certain portion of requests 
gets dropped during this period, so long as the RGW can respond to some 
of them.


  So my question is: has anyone else experienced this, and what have 
you done to solve it? The two timeout settings I am looking at are 
listed below, and I'm not certain what the distinction is between them; 
perhaps someone could fill me in. Thank you and I appreciate the 
assistance!


  The documentation is not too clear about the differences and after 
some brief searching I didn't find any discussions about these values.


rgw op thread timeout
rgw op thread suicide timeout
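
  A sketch of how these would be set in ceph.conf on the gateway host 
(the section name and values are examples only; I believe the defaults 
are 600 seconds for the op thread timeout and 0, i.e. disabled, for the 
suicide timeout, but please verify against your version):

[client.radosgw.gateway]
    rgw op thread timeout = 120
    rgw op thread suicide timeout = 0

My understanding, worth double-checking, is that the op thread timeout 
only marks the worker thread as unhealthy in the heartbeat map and logs 
a warning, while the suicide timeout makes the daemon assert and die so 
it can be restarted.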

- Daniel


Re: [ceph-users] PG size distribution

2015-06-02 Thread Daniel Maraio

Hello,

  Thank you for the feedback, Jan, much appreciated! I won't post the 
whole tree as it is rather long, but here is an example of one of our 
hosts. All of the OSDs and hosts are weighted the same, with the 
exception of one host that is missing an OSD due to a broken backplane. 
We are only using hosts as buckets, so no rack/DC levels. We have not 
manually adjusted the crush map at all for this cluster.


 -1 302.26959 root default
-24  14.47998 host osd23
192   1.81000 osd.192  up  1.0  1.0
193   1.81000 osd.193  up  1.0  1.0
194   1.81000 osd.194  up  1.0  1.0
195   1.81000 osd.195  up  1.0  1.0
199   1.81000 osd.199  up  1.0  1.0
200   1.81000 osd.200  up  1.0  1.0
201   1.81000 osd.201  up  1.0  1.0
202   1.81000 osd.202  up  1.0  1.0
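
  If your version has it (it appeared in Hammer, I believe), 'ceph osd 
df' puts the utilization next to the weights, which makes the imbalance 
easier to see at a glance:

ceph osd df
ceph osd df tree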

  I appreciate your input and will likely follow the same path you 
have, slowly increasing the PGs and adjusting the weights as necessary. 
If anyone else has any further suggestions I'd love to hear them as well!


- Daniel


On 06/02/2015 01:33 PM, Jan Schermer wrote:

Post the output from your “ceph osd tree”.
We were in a similar situation: some of the OSDs were quite full while 
others had 50% free. This is exactly why we increased the number of 
PGs, and it helped to some degree.
Are all your hosts the same size? Does your CRUSH map select a host in 
the end? If you have only a few hosts with differing numbers of OSDs, 
the distribution will be poor (IMHO).

Anyway, when we started increasing the PG numbers we first created the 
PGs themselves (pg_num) in small increments, since that puts a lot of 
load on the OSDs and we were seeing slow requests with large increases.
So something like this:
for i in `seq 4096 64 8192` ; do ceph osd pool set poolname pg_num $i ; done
This ate a few gigs from the drives (1-2GB if I remember correctly).

Once that was finished, we increased pgp_num in larger and larger 
increments - at first 64 at a time, and then 512 at a time as we 
approached the target (16384 in our case). This does allocate more 
space temporarily, and it seems to just randomly move data around - one 
minute an OSD is fine, the next it is nearing full. One of us basically 
had to watch the process the whole time, reweighting the devices that 
were almost full.
With an increasing number of PGs it became much simpler, as the 
overhead was smaller, every bit of work was smaller, and all the 
management operations were a lot smoother.
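
A sketch of the analogous pgp_num loop, with a crude wait for the 
cluster to settle between steps (the pool name, bounds, and the health 
grep are placeholders/approximations to adapt):

for i in `seq 4096 64 16384` ; do
    ceph osd pool set poolname pgp_num $i
    # wait until backfill/recovery from the previous step has quieted down
    while ceph health | grep -q -E 'backfill|recover' ; do sleep 60 ; done
done
# and temporarily down-weight anything getting close to full, e.g. osd.123:
ceph osd reweight 123 0.9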

YMMV - our data distribution was poor from the start, hosts had 
differing weights due to differing numbers of OSDs, and there were some 
historical remnants from when we tried to load-balance the data by 
hand. We ended up in a much better state, but not a perfect one - some 
OSDs still have much more free space than others.
We haven't touched the CRUSH map at all during this process; once we do, 
and set newer tunables, the data distribution should become much more 
even.

I'd love to hear the others' input, since we are not sure why exactly 
this problem is present at all - I'd expect it to fill all the OSDs to 
the same or a close-enough level, but in reality we have OSDs with 
weight 1.0 which are almost empty and others with weight 0.5 which are 
nearly full. When adding data it seems to (subjectively) distribute it 
evenly...

Jan



[ceph-users] PG size distribution

2015-06-02 Thread Daniel Maraio

Hello,

  I have some questions about the size of my placement groups and how I 
can get a more even distribution. We currently have 160 2TB OSDs across 
20 chassis.  We have 133TB used in our radosgw pool with a replica size 
of 2. We want to move to 3 replicas but are concerned we may fill up 
some of our OSDs. Some OSDs have ~1.1TB free while others only have 
~600GB free. The radosgw pool has 4096 PGs; looking at the 
documentation, I probably want to increase this to 8192, but we have 
decided to hold off on that for now.


  So, now for the PG usage. I dumped the PG stats and noticed that 
there are two groups of PG sizes in my cluster. There are about 1024 
PGs that are each around 17-18GB in size; the rest of the PGs are all 
around 34-36GB. Any idea why there are two distinct groups? We only 
have the one pool with data in it, though there are several different 
buckets in the radosgw pool. The data in the pool ranges from small 
images to 4-6MB audio files. Will increasing the number of PGs on this 
pool provide a more even distribution?
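
  For what it's worth, a rough way to eyeball the two size groups from 
the PG stats (POOLID is a placeholder for the radosgw data pool's id, 
and the position of the bytes column should be checked against the 
header line that 'ceph pg dump' prints, since it can differ between 
versions):

ceph pg dump 2>/dev/null | head -2
ceph pg dump 2>/dev/null | awk '$1 ~ /^POOLID\./ {print $1, $7}' | sort -n -k2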


  Another thing to note is that the initial cluster was built lopsided, 
with some 4TB OSDs and some 2TB ones. We have since removed all the 4TB 
disks and are only using 2TB disks across the entire cluster. Not sure 
if this would have had any impact.


  Thank you for your time and I would appreciate any insight the 
community can offer.


- Daniel