Re: [ceph-users] Long peering - throttle at FileStore::queue_transactions
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil wrote: > On Mon, 4 Jan 2016, Guang Yang wrote: >> Hi Cephers, >> Happy New Year! I have a question about the long PG peering. >> >> Over the last several days I have been looking into the *long peering* >> problem we see when starting an OSD / OSD host: the two peering worker >> threads were throttled (stuck) while trying to queue new transactions >> (writing the pg log), so the peering process slowed down dramatically. >> >> The first question that came to me was: what were the transactions in the >> queue? The major ones, as I saw, included: >> >> - The osd_map and incremental osd_map. This happens if the OSD has >> been down for a while (in a large cluster), or when the cluster got >> upgraded, so that the osd_map epoch the down OSD had was far behind >> the latest osd_map epoch. During OSD boot, it needs to >> persist all those osd_maps, which generates lots of filestore transactions >> (linear in the epoch gap). >> > As the PG was not involved in most of those epochs, could we only take and >> > persist those osd_maps which matter to the PGs on the OSD? > > This part should happen before the OSD sends the MOSDBoot message, before > anyone knows it exists. There is a tunable threshold that controls how > recent the map has to be before the OSD tries to boot. If you're > seeing this in the real world, we probably just need to adjust that value > way down to something small(er). It queues the transactions and then sends out MOSDBoot, so there is still a chance that they contend with the peering ops (especially on large clusters where lots of activity generates many osdmap epochs). Any chance we could change *queue_transactions* to *apply_transactions*, so that we block there waiting for the osdmap to be persisted? At least we may be able to do that during OSD boot. The concern is that if the OSD is active, apply_transaction would take longer while holding the osd_lock. I couldn't find such a tunable; could you elaborate? Thanks! > > sage
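For anyone hitting the same throttle: the limits that gate FileStore::queue_transactions are the filestore queue settings, and they can be inspected and bumped on a live OSD through the admin socket. A minimal sketch (the socket path, osd id and values are placeholders; check the option names and defaults for your release):

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | egrep 'filestore_queue_max_(ops|bytes)'
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set filestore_queue_max_ops 500
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config set filestore_queue_max_bytes 209715200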
[ceph-users] Long peering - throttle at FileStore::queue_transactions
Hi Cephers, Happy New Year! I have a question about the long PG peering we are seeing. Over the last several days I have been looking into the *long peering* problem when we start an OSD / OSD host: what I observed was that the two peering worker threads were throttled (stuck) while trying to queue new transactions (writing the pg log), so the peering process slowed down dramatically. The first question that came to me was: what were the transactions in the queue? The major ones, as I saw, included: - The osd_map and incremental osd_map. This happens if the OSD has been down for a while (in a large cluster), or when the cluster got upgraded, so that the osd_map epoch the down OSD had was far behind the latest osd_map epoch. During OSD boot, it needs to persist all those osd_maps, which generates lots of filestore transactions (linear in the epoch gap). > As the PG was not involved in most of those epochs, could we only take and > persist those osd_maps which matter to the PGs on the OSD? - There are lots of deletion transactions: as a PG boots, it needs to merge the PG log from its peers, and for each deletion PG log entry it queues the deletion transaction immediately. > Could we delay queueing those transactions until all PGs on the host have > peered? Thanks, Guang
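One hedged way to see the epoch gap in question while an OSD is catching up after a restart, assuming your build exposes the status admin-socket command (the osd id is a placeholder):

$ ceph osd dump | head -1                                      # current cluster osdmap epoch
$ ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok status    # reports the oldest_map / newest_map epochs the OSD holds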
[ceph-users] OSD disk replacement best practise
Hi cephers, I am currently drafting the run book for OSD disk replacement. I think the rule of thumb is to reduce data migration (recovery/backfill), and I thought the following procedure should achieve that purpose: 1. ceph osd out osd.XXX (mark it out to trigger data migration) 2. ceph osd rm osd.XXX 3. ceph auth rm osd.XXX 4. provision a new OSD which will take XXX as the OSD id and migrate data back. With the above procedure the crush weight of the host never changes, so we can limit the data migration to only what is necessary. Does it make sense? Thanks, Guang
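One way to sanity-check that the run book above does not trigger a second round of rebalancing is to watch the CRUSH weights and recovery activity while stepping through it:

$ ceph osd tree         # the host bucket weight should be unchanged after step 2
$ ceph -w               # watch recovery/backfill while the replacement OSD is provisioned
$ ceph health detail    # confirm the cluster returns to HEALTH_OK once backfill completes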
[ceph-users] rgw geo-replication to another data store?
Hi cephers, We are investigating a backup solution for Ceph; in short, we would like a solution to back up a Ceph cluster to another data store (not a Ceph cluster; assume it has a SWIFT API). We would like to have both full backups and incremental backups on top of the full backup. After going through the geo-replication blueprint [1], I am thinking that we can leverage that effort and, instead of replicating the data into another Ceph cluster, make it replicate to another data store. At the same time, I have a couple of questions which need your help: 1) How does the radosgw-agent scale to multiple hosts? Our first investigation shows it only works on a single host, but I would like to confirm. 2) Can we configure the interval for incremental backup, like 1 hour / 1 day / 1 month? [1] https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery Thanks, Guang
Re: [ceph-users] Ask a performance question for the RGW
On Jun 30, 2014, at 3:59 PM, baijia...@126.com wrote: > Hello, > thanks for answering the question. > But when there are fewer than 50 thousand objects the latency is still very big. I > looked at the write ops for the bucket index object: from > "journaled_completion_queue" to "op_commit" cost 3.6 seconds, which means that > from "writing journal finished" to "op_commit" cost 3.6 seconds. > So I can't understand this; what happened? The operations updating the same bucket index object get serialized; one possibility is that the op was stuck there waiting for other ops to finish their work. > > thanks > baijia...@126.com > > From: Guang Yang > Sent: 2014-06-30 14:57 > To: baijiaruo > Cc: ceph-users > Subject: Re: [ceph-users] Ask a performance question for the RGW > Hello, > There is a known limitation of bucket scalability, and there is a blueprint > tracking it - > https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability. > > For the time being, I would recommend sharding at the application level (creating > multiple buckets) to work around this limitation. > > Thanks, > Guang > > On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote: > > > > > hello, everyone! > > > > When I use rest-bench to test RGW performance, the cmd is: > > ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600 > > -t 200 -b 524288 -no-cleanup write > > > > test result: > > Total time run: 362.962324 > > Total writes made: 48189 > > Write size: 524288 > > Bandwidth (MB/sec): 66.383 > > Stddev Bandwidth: 40.7776 > > Max bandwidth (MB/sec): 173 > > Min bandwidth (MB/sec): 0 > > Average Latency: 1.50435 > > Stddev Latency: 0.910731 > > Max latency: 9.12276 > > Min latency: 0.19867 > > > > My environment is 4 hosts and 40 disks (osds), but the test result is very bad: > > average latency is 1.5 seconds, and I find that writing the object metadata is very > > slow. Because it puts so many objects into one bucket, and we know writing object > > metadata calls the method "bucket_prepare_op", and testing finds this op is very > > slow, I found the osd which contains the bucket index object and looked at > > "bucket_prepare_op" via dump_historic_ops: > > { "description": "osd_op(client.4742.0:87613 .dir.default.4243.3 [call > > rgw.bucket_prepare_op] 3.3670fe74 e317)", > > "received_at": "2014-06-30 13:35:55.409597", > > "age": "51.148026", > > "duration": "4.130137", > > "type_data": [ > > "commit sent; apply or cleanup", > > { "client": "client.4742", > > "tid": 87613}, > > [ > > { "time": "2014-06-30 13:35:55.409660", > > "event": "waiting_for_osdmap"}, > > { "time": "2014-06-30 13:35:55.409669", > > "event": "queue op_wq"}, > > { "time": "2014-06-30 13:35:55.896766", > > "event": "reached_pg"}, > > { "time": "2014-06-30 13:35:55.896793", > > "event": "started"}, > > { "time": "2014-06-30 13:35:55.896796", > > "event": "started"}, > > { "time": "2014-06-30 13:35:55.899450", > > "event": "waiting for subops from [40,43]"}, > > { "time": "2014-06-30 13:35:55.899757", > > "event": "commit_queued_for_journal_write"}, > > { "time": "2014-06-30 13:35:55.899799", > > "event": "write_thread_in_journal_buffer"}, > > { "time": "2014-06-30 13:35:55.899910", > > "event": "journaled_completion_queued"}, > > { "time": "2014-06-30 13:35:55.899936", > > "event": "journal first callback"}, > > { "time": "2014-06-30 13:35:55.899944", > > "event": "queuing ondisk"},
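To confirm that ops on the bucket index object are queueing behind each other, a hedged approach is to find the OSD that holds the .dir.* index object and pull its slowest recent ops from the admin socket (the pool name and osd id below are placeholders):

$ ceph osd map <index_pool> .dir.default.4243.3      # shows the PG and acting OSDs for the bucket index object
$ ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok dump_historic_ops | grep -B2 -A40 bucket_prepare_op | less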
Re: [ceph-users] Ask a performance question for the RGW
Hello, There is a known limitation of bucket scalability, and there is a blueprint tracking it - https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability. For the time being, I would recommend sharding at the application level (creating multiple buckets) to work around this limitation. Thanks, Guang On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote: > > hello, everyone! > > When I use rest-bench to test RGW performance, the cmd is: > ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600 > -t 200 -b 524288 -no-cleanup write > > test result: > Total time run: 362.962324 > Total writes made: 48189 > Write size: 524288 > Bandwidth (MB/sec): 66.383 > Stddev Bandwidth: 40.7776 > Max bandwidth (MB/sec): 173 > Min bandwidth (MB/sec): 0 > Average Latency: 1.50435 > Stddev Latency: 0.910731 > Max latency: 9.12276 > Min latency: 0.19867 > > My environment is 4 hosts and 40 disks (osds), but the test result is very bad: > average latency is 1.5 seconds, and I find that writing the object metadata is very > slow. Because it puts so many objects into one bucket, and we know writing object > metadata calls the method "bucket_prepare_op", and testing finds this op is very > slow, I found the osd which contains the bucket index object and looked at > "bucket_prepare_op" via dump_historic_ops: > { "description": "osd_op(client.4742.0:87613 .dir.default.4243.3 [call > rgw.bucket_prepare_op] 3.3670fe74 e317)", > "received_at": "2014-06-30 13:35:55.409597", > "age": "51.148026", > "duration": "4.130137", > "type_data": [ > "commit sent; apply or cleanup", > { "client": "client.4742", > "tid": 87613}, > [ > { "time": "2014-06-30 13:35:55.409660", > "event": "waiting_for_osdmap"}, > { "time": "2014-06-30 13:35:55.409669", > "event": "queue op_wq"}, > { "time": "2014-06-30 13:35:55.896766", > "event": "reached_pg"}, > { "time": "2014-06-30 13:35:55.896793", > "event": "started"}, > { "time": "2014-06-30 13:35:55.896796", > "event": "started"}, > { "time": "2014-06-30 13:35:55.899450", > "event": "waiting for subops from [40,43]"}, > { "time": "2014-06-30 13:35:55.899757", > "event": "commit_queued_for_journal_write"}, > { "time": "2014-06-30 13:35:55.899799", > "event": "write_thread_in_journal_buffer"}, > { "time": "2014-06-30 13:35:55.899910", > "event": "journaled_completion_queued"}, > { "time": "2014-06-30 13:35:55.899936", > "event": "journal first callback"}, > { "time": "2014-06-30 13:35:55.899944", > "event": "queuing ondisk"}, > { "time": "2014-06-30 13:35:56.142104", > "event": "sub_op_commit_rec"}, > { "time": "2014-06-30 13:35:56.176950", > "event": "sub_op_commit_rec"}, > { "time": "2014-06-30 13:35:59.535301", > "event": "op_commit"}, > { "time": "2014-06-30 13:35:59.535331", > "event": "commit_sent"}, > { "time": "2014-06-30 13:35:59.539723", > "event": "op_applied"}, > { "time": "2014-06-30 13:35:59.539734", > "event": "done"}]]}, > > So why is it so slow from "journaled_completion_queued" to "op_commit", and > what happened? > thanks > > baijia...@126.com
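A minimal sketch of the application-level sharding suggested above, assuming an S3 client such as s3cmd and 16 pre-created buckets named mybucket-0 .. mybucket-15 (names, shard count and hash choice are all illustrative): the writer hashes the object key to pick a bucket, and readers apply the same hash to find it again.

$ key="user123/img_0001.jpg"
$ shard=$(( $(echo -n "$key" | cksum | cut -d' ' -f1) % 16 ))
$ s3cmd put ./img_0001.jpg "s3://mybucket-$shard/$key"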
[ceph-users] XFS - number of files in a directory
Hello Cephers, We used to have a Ceph cluster with the data pool set up with 3 replicas; we estimated the number of files (given disk size and object size) for each PG to be around 8K, and we disabled folder splitting, which means all files live in the root PG folder. Our testing showed good performance with that setup. Right now we are evaluating erasure coding, which splits each object into a number of chunks and increases the number of files several times over. Although XFS claims good support for large directories [1], some testing also showed that we may expect performance degradation for large directories. I would like to hear about your experience with this on your Ceph cluster if you are using XFS. Thanks. [1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf Thanks, Guang
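If you re-enable splitting to keep directories small instead, the knobs that control when a PG collection folder splits are the filestore split/merge thresholds; a hedged ceph.conf sketch (values are illustrative, and the split point is roughly filestore_split_multiple * abs(filestore_merge_threshold) * 16 objects per directory, so check the math for your release):

[osd]
    filestore merge threshold = 40
    filestore split multiple = 8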
Re: [ceph-users] Expanding pg's of an erasure coded pool
On May 28, 2014, at 5:31 AM, Gregory Farnum wrote: > On Sun, May 25, 2014 at 6:24 PM, Guang Yang wrote: >> On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: >> >>> This failure means the messenger subsystem is trying to create a >>> thread and is getting an error code back — probably due to a process >>> or system thread limit that you can turn up with ulimit. >>> >>> This is happening because a replicated PG primary needs a connection >>> to only its replicas (generally 1 or 2 connections), but with an >>> erasure-coded PG the primary requires a connection to m+n-1 replicas >>> (everybody who's in the erasure-coding set, including itself). Right >>> now our messenger requires a thread for each connection, so kerblam. >>> (And it actually requires a couple such connections because we have >>> separate heartbeat, cluster data, and client data systems.) >> Hi Greg, >> Is there any plan to refactor the messenger component to reduce the num of >> threads? For example, use event-driven mode. > > We've discussed it in very broad terms, but there are no concrete > designs and it's not on the schedule yet. If anybody has conclusive > evidence that it's causing them trouble they can't work around, that > would be good to know… Thanks for the response! We used to have a cluster where each OSD host had 11 disks (daemons); on each host there were around 15K threads. The system was stable, but when there was a cluster-wide change (e.g. OSD down / out, recovery) we observed the system load increasing, though there was no cascading failure. Most recently we have been evaluating Ceph on high-density hardware with each OSD host having 33 disks (daemons); on each host there are around 40K-50K threads, and with some OSD hosts down/out we started seeing the load increase sharply along with a large volume of thread creation/joining. We don't have strong evidence that the messenger thread model is the problem, nor of how much an event-driven approach would help, but I think that as we move to high-density hardware (for cost-saving purposes) the issue could be amplified. If there is any plan, it would be good to know, and we are very interested in getting involved. Thanks, Guang > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com
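For reference, a couple of quick ways to count the threads per host and per daemon when comparing the 11-disk and 33-disk configurations:

$ ps -eLf | grep '[c]eph-osd' | wc -l    # total ceph-osd threads on the host
$ ps -o nlwp= -C ceph-osd                # thread count for each ceph-osd process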
Re: [ceph-users] Expanding pg's of an erasure coded pool
On May 21, 2014, at 1:33 AM, Gregory Farnum wrote: > This failure means the messenger subsystem is trying to create a > thread and is getting an error code back — probably due to a process > or system thread limit that you can turn up with ulimit. > > This is happening because a replicated PG primary needs a connection > to only its replicas (generally 1 or 2 connections), but with an > erasure-coded PG the primary requires a connection to m+n-1 replicas > (everybody who's in the erasure-coding set, including itself). Right > now our messenger requires a thread for each connection, so kerblam. > (And it actually requires a couple such connections because we have > separate heartbeat, cluster data, and client data systems.) Hi Greg, Is there any plan to refactor the messenger component to reduce the num of threads? For example, use event-driven mode. Thanks, Guang > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com > > > On Tue, May 20, 2014 at 3:43 AM, Kenneth Waegeman > wrote: >> Hi, >> >> On a setup of 400 OSDs (20 nodes, with 20 OSDs per node), I first tried to >> create a erasure coded pool with 4096 pgs, but this crashed the cluster. >> I then started with 1024 pgs, expanding to 2048 (pg_num and pgp_num), when I >> then try to expand to 4096 (not even quite enough) the cluster crashes >> again. ( Do we need less of pg's with erasure coding?) >> >> The crash starts with individual OSDs crashing, eventually bringing down the >> mons (until there is no more quorum or too few osds) >> >> Out of the logs: >> >> >> -16> 2014-05-20 10:31:55.545590 7fd42f34d700 5 -- op tracker -- , seq: >> 14301, time: 2014-05-20 10:31:55.545590, event: started, request: >> pg_query(0.974 epoch 3315) v3 >> -15> 2014-05-20 10:31:55.545776 7fd42f34d700 1 -- >> 130.246.178.141:6836/10446 --> 130.246.179.191:6826/21854 -- pg_notify(0.974 >> epoch 3326) v5 -- ?+0 0xc8b4ec0 con 0x9 >> 026b40 >> -14> 2014-05-20 10:31:55.545807 7fd42f34d700 5 -- op tracker -- , seq: >> 14301, time: 2014-05-20 10:31:55.545807, event: done, request: >> pg_query(0.974 epoch 3315) v3 >> -13> 2014-05-20 10:31:55.559661 7fd3fdb0f700 1 -- >> 130.246.178.141:6837/10446 >> :/0 pipe(0xce0c380 sd=468 :6837 s=0 pgs=0 cs=0 >> l=0 c=0x1255f0c0).accept sd=468 130.246.179.191:60618/0 >> -12> 2014-05-20 10:31:55.564034 7fd3bf72f700 1 -- >> 130.246.178.141:6838/10446 >> :/0 pipe(0xe3f2300 sd=596 :6838 s=0 pgs=0 cs=0 >> l=0 c=0x129b5ee0).accept sd=596 130.246.179.191:43913/0 >> -11> 2014-05-20 10:31:55.627776 7fd42df4b700 1 -- >> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 3 >> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:52.994368) v2 47+0+0 >> (855262282 0 0) 0xb6863c0 con 0x1255b9c0 >> -10> 2014-05-20 10:31:55.629425 7fd42df4b700 1 -- >> 130.246.178.141:0/10446 <== osd.170 130.246.179.191:6827/21854 4 >> osd_ping(ping_reply e3316 stamp 2014-05-20 10:31:53.509621) v2 47+0+0 >> (2581193378 0 0) 0x93d6c80 con 0x1255b9c0 >>-9> 2014-05-20 10:31:55.631270 7fd42f34d700 1 -- >> 130.246.178.141:6836/10446 <== osd.169 130.246.179.191:6841/25473 2 >> pg_query(7.3ffs6 epoch 3326) v3 144+0+0 (221596234 0 0) 0x10b994a0 con >> 0x9383860 >>-8> 2014-05-20 10:31:55.631308 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631130, event: header_read, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-7> 2014-05-20 10:31:55.631315 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631133, event: throttled, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-6> 2014-05-20 10:31:55.631339 
7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631207, event: all_read, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-5> 2014-05-20 10:31:55.631343 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631303, event: dispatched, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-4> 2014-05-20 10:31:55.631349 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631349, event: waiting_for_osdmap, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-3> 2014-05-20 10:31:55.631363 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631363, event: started, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-2> 2014-05-20 10:31:55.631402 7fd42f34d700 5 -- op tracker -- , seq: >> 14302, time: 2014-05-20 10:31:55.631402, event: done, request: >> pg_query(7.3ffs6 epoch 3326) v3 >>-1> 2014-05-20 10:31:55.631488 7fd427b41700 1 -- >> 130.246.178.141:6836/10446 --> 130.246.179.191:6841/25473 -- >> pg_notify(7.3ffs6(14) epoch 3326) v5 -- ?+0 0xcc7b9c0 con 0x9383860 >> 0> 2014-05-20 10:31:55.632127 7fd42cb49700 -1 common/Thread.cc: In >> function 'void Thread::create(size_t)' thread 7fd42cb49700 time 2014-05-20 >> 10:31:55.630937 >> common/Thread.cc: 110: FAILED assert(ret == 0) >> >> ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74) >>
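Since the assert above comes from thread creation failing, the usual work-around (as Greg notes) is to raise the per-user and kernel-wide thread limits before the OSDs start; a hedged example, with illustrative values and assuming the daemons run as root:

$ ulimit -u 327680                      # per-shell, for a manually started daemon
# persistent equivalent in /etc/security/limits.conf:
root    soft    nproc    327680
root    hard    nproc    327680
$ sysctl -w kernel.pid_max=4194303
$ sysctl -w kernel.threads-max=4194303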
Re: [ceph-users] Firefly 0.80 rados bench cleanup / object removal broken?
Hi Matt, The problem you came across was due to a change made in rados bench along with the Firefly release; it aimed to solve the problem that when there are multiple rados bench instances (for writing), we want to be able to do a rados read for each run as well. Unfortunately, that change broke your use case. Here is my suggestion to solve your problem: 1. Remove the pre-defined metadata object: $ rados -p {pool_name} rm benchmark_last_metadata 2. Clean up by prefix: $ sudo rados -p {pool_name} cleanup --prefix bench Moving forward, you can use the new parameter '--run-name' to name each run and clean up on that basis; if you still want to do a slow linear search to clean up, be sure to remove the benchmark_last_metadata object before you kick off the cleanup. Let me know if that helps. Thanks, Guang On May 20, 2014, at 6:45 AM, matt.lat...@hgst.com wrote: > > I was experimenting previously with 0.72 , and could easily cleanup pool > objects from several previous rados bench (write) jobs with : > > rados -p cleanup bench (would remove all objects starting > with "bench") > > I quickly realised when I moved to 0.80 that my script was broken and > theoretically I now need: > > rados -p cleanup --prefix benchmark_data > > But this only works sometimes, and sometimes partially. Issuing the command > line twice seems to help a bit ! Also if I do "rados -p ls" > before hand, it seems to increase my chances of success, but often I am > still left with benchmark objects undeleted. I also tried using the > --run-name option to no avail. > > The story gets more bizarre now I have set up a "hot SSD" cachepool in > front of the backing OSD (SATA) pool. Objects won't delete from either pool > with rados cleanup I tried > > "rados -p cache-flush-evict-all" > > which worked (rados df shows all objects now on the backing pool). Then > bizarrely trying cleanup from the backing OSD pool just appears to copy > them back into the cachepool, and they remain on the backing pool. > > I can list individual object names with > > rados -p ls > > but rados rm will not remove individual objects stating "file > or directory not found". > > Are others seeing these things and any ways to work around or am I doing > something wrong? Are these commands now deprecated in which case what > should I use? > > Ubuntu 12.04, Kernel 3.14.0 > > Matt Latter
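A hedged example of the --run-name flow described above, mirroring the bench invocation style used elsewhere on this list (pool name, duration and run name are placeholders):

$ rados bench -p {pool_name} 60 write -t 16 --no-cleanup --run-name client1
$ rados -p {pool_name} cleanup --run-name client1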
[ceph-users] XFS tuning on OSD
Hello all, Recently I have been working on Ceph performance analysis on our cluster. Our OSD hardware looks like: 11 SATA disks, 4TB each, 7200RPM, 48GB RAM. When breaking down the latency, we found that half of the latency (average latency is around 60 milliseconds via radosgw) comes from file lookup and open (there could be a couple of disk seeks there). When looking at the file system cache (slabtop), we found that around 5M dentries / inodes are cached; however, the host has around 110 million files (and directories) in total. I am wondering if there is any good experience within the community of tuning for the same workload, e.g. changing the inode size, or using the mkfs.xfs -n size=64k option [1]? [1] http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option Thanks, Guang
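A hedged example of applying both tweaks mentioned above when (re)formatting an OSD data disk (the device and mount point are placeholders; -i size=2048 gives XFS more inode room for Ceph's xattrs, and -n size=64k is the larger directory block size from [1]):

$ mkfs.xfs -f -i size=2048 -n size=64k /dev/sdX1
$ mount -o noatime,inode64 /dev/sdX1 /var/lib/ceph/osd/ceph-NN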
[ceph-users] Ceph GET latency
Hi ceph-users, We are using Ceph (radosgw) to store user-generated images, and as GET latency is critical for us, most recently I did some investigation over the GET path to understand where the time is spent. I first confirmed that the latency comes from the OSD (read op), so we instrumented code to trace the GET request (the read op at the OSD side; to be more specific, each object with size [512K + 4M * x] is split into [1 + x] chunks, and each chunk needs one read op). Each read op needs to go through the following steps: 1. Dispatched and taken by an op thread for processing (processing not yet started). 0 - 20 ms, 94% 20 - 50 ms, 2% 50 - 100 ms, 2% 100ms+, 2% For those having 20ms+ latency, half of them are due to waiting for the pg lock (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089); the other half are yet to be investigated. 2. Get file xattr ('-'), which opens the file and populates the fd cache (https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230). 0 - 20 ms, 80% 20 - 50 ms, 8% 50 - 100 ms, 7% 100ms+, 5% The latency comes from (from more to less): file path lookup (https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294), file open, or fd cache lookup / add. Currently objects are stored in level 6 or level 7 folders (due to http://tracker.ceph.com/issues/7207, I stopped folder splitting). 3. Get more xattrs; this is fast thanks to the previous fd cache (rarely > 1ms). 4. Read the data. 0 - 20 ms, 84% 20 - 50 ms, 10% 50 - 100 ms, 4% 100ms+, 2% I decreased vfs_cache_pressure from its default value of 100 to 5 to make the VFS favor the dentry/inode cache over the page cache; unfortunately it did not help. Long story short, most of the long-latency read ops come from file system calls (for cold data); as our workload mainly stores objects smaller than 500KB, it generates a very large number of files. I would like to ask if people have experienced a similar issue and whether there is any suggestion I can try to boost GET performance. On the other hand, PUT could be sacrificed. Thanks, Guang
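For reference, the vfs_cache_pressure experiment above and a quick check of how many dentries/inodes actually stay resident look like this (values straight from the message; xfs_inode assumes XFS-backed OSDs):

$ sudo sysctl -w vm.vfs_cache_pressure=5
$ slabtop -o | egrep 'dentry|xfs_inode'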
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks all for the help. We finally identified the root cause of the issue was due to a lock contention happening at folder splitting and here is a tracking ticket (thanks Inktank for the fix!): http://tracker.ceph.com/issues/7207 Thanks, Guang On Tuesday, December 31, 2013 8:22 AM, Guang Yang wrote: Thanks Wido, my comments inline... >Date: Mon, 30 Dec 2013 14:04:35 +0100 >From: Wido den Hollander >To: ceph-users@lists.ceph.com >Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) > after running some time >On 12/30/2013 12:45 PM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. 
> > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615, 10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. > reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Mark, my comments inline... Date: Mon, 30 Dec 2013 07:36:56 -0600 From: Mark Nelson To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time On 12/30/2013 05:45 AM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. > > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615, 10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. 
> reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 > 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 > > Any idea why a couple of OSDs are so slow that impact the performance of > the entire cluster? You may want to use the dump_historic_ops command in the admin socket for the slow OSDs. That will give you some clues regarding where the ops are hanging up in the OSD. You can also crank the osd debugging way up on that node and search through the logs to see if there are any patterns or trends (consistent slowness, pauses, etc). It may also be useful to look and see if that OSD is pegging CPU and if so attach strace or perf to it and see what it's doing. [yguang] We have a job dump_historic_ops but unfortunate
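Following Mark's dump_historic_ops suggestion, a hedged way to pull and rank the slow ops on the suspect OSD (osd.598 from this thread; adjust the socket path to your layout):

$ ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops > /tmp/osd598_ops.json
$ grep '"duration"' /tmp/osd598_ops.json | sort -t'"' -k4 -rn | head    # longest recent ops first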
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Wido, my comments inline... >Date: Mon, 30 Dec 2013 14:04:35 +0100 >From: Wido den Hollander >To: ceph-users@lists.ceph.com >Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) > after running some time >On 12/30/2013 12:45 PM, Guang wrote: > Hi ceph-users and ceph-devel, > Merry Christmas and Happy New Year! > > We have a ceph cluster with radosgw, our customer is using S3 API to > access the cluster. > > The basic information of the cluster is: > bash-4.1$ ceph -s > cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 > health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors > monmap e1: 3 mons at > {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, > election epoch 40, quorum 0,1,2 osd151,osd152,osd153 > osdmap e129885: 787 osds: 758 up, 758 in > pgmap v1884502: 22203 pgs: 22125 active+clean, 1 > active+clean+scrubbing, 1 active+clean+inconsistent, 76 > active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 > TB avail > mdsmap e1: 0/0/1 up > > #When the latency peak happened, there was no scrubbing, recovering or > backfilling at the moment.# > > While the performance of the cluster (only with WRITE traffic) is stable > until Dec 25th, our monitoring (for radosgw access log) shows a > significant increase of average latency and 99% latency. > > And then I chose one OSD and try to grep slow requests logs and find > that most of the slow requests were waiting for subop, I take osd22 for > example. > > osd[561-571] are hosted by osd22. > -bash-4.1$ for i in {561..571}; do grep "slow request" ceph-osd.$i.log | > grep "2013-12-25 16"| grep osd_op | grep -oP "\d+,\d+" ; done > > ~/slow_osd.txt > -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr > 3586 656,598 > 289 467,629 > 284 598,763 > 279 584,598 > 203 172,598 > 182 598,6 > 155 629,646 > 83 631,598 > 65 631,593 > 21 616,629 > 20 609,671 > 20 609,390 > 13 609,254 > 12 702,629 > 12 629,641 > 11 665,613 > 11 593,724 > 11 361,591 > 10 591,709 > 9 681,609 > 9 609,595 > 9 591,772 > 8 613,662 > 8 575,591 > 7 674,722 > 7 609,603 > 6 585,605 > 5 613,691 > 5 293,629 > 4 774,591 > 4 717,591 > 4 613,776 > 4 538,629 > 4 485,629 > 3 702,641 > 3 608,629 > 3 593,580 > 3 591,676 > > It turns out most of the slow requests were waiting for osd 598, 629, I > ran the procedure on another host osd22 and got the same pattern. > > Then I turned to the host having osd598 and dump the perf counter to do > comparision. > > -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon > /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done > op_latency,subop_latency,total_ops > 0.192097526753471,0.0344513450167198,7549045 > 1.99137797628122,1.42198426157216,9184472 > 0.198062399664129,0.0387090378926376,6305973 > 0.621697271315762,0.396549768986993,9726679 > 29.5222496247375,18.246379615, 10860858 > 0.229250239525916,0.0557482067611005,8149691 > 0.208981698303654,0.0375553180438224,6623842 > 0.47474766302086,0.292583928601509,9838777 > 0.339477790083925,0.101288409388438,9340212 > 0.186448840141895,0.0327296517417626,7081410 > 0.807598201207144,0.0139762289702332,6093531 > (osd 598 is op hotspot as well) > > This double confirmed that osd 598 was having some performance issues > (it has around *30 seconds average op latency*!). > sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the > latency difference is not as significant as we saw from osd perf. 
> reads kbread writes kbwrite %busy avgqu await svctm > 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 > 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 > 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 > 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 > > Another disk at the same time for comparison (/dev/sdb). > reads kbread writes kbwrite %busy avgqu await svctm > 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 > 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 > 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 > 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 > > Any idea why a couple of OSDs are so slow that impact the performance of > the entire cluster? > What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a performance degradation over time. So if you run btrfs, that might be the problem. [yguang] We are running on xfs, journal and data share the same disk with different partitions. Wido > Thanks, > Guang > > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailin
[ceph-users] 'ceph osd reweight' VS 'ceph osd crush reweight'
Hello ceph-users, I am a little bit confused by these two options. I understand that crush reweight determines the weight of the OSD in the crush map, so it impacts I/O and utilization; however, I am confused by the osd reweight option: is that something that controls the I/O distribution across different OSDs on a single host? While looking at the code, I only found that if the 'osd weight' is 1 (0x1), it means the osd is up, and if it is 0, it means the osd is down. Please advise... Thanks, Guang
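For what it is worth, here is how the two commands are typically used (ids and weights are illustrative): ceph osd crush reweight changes the CRUSH weight, usually sized to the disk capacity in TB, and permanently shifts placement, while ceph osd reweight applies a temporary 0.0-1.0 override on top of it (1 means fully in, 0 behaves like marking the OSD out), which is handy for draining an overfull OSD without editing the CRUSH map:

$ ceph osd crush reweight osd.12 1.82
$ ceph osd reweight 12 0.85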
Re: [ceph-users] Rados bench result when increasing OSDs
Thanks Mark. I cannot connect to my hosts, I will do the check and get back to you tomorrow. Thanks, Guang 在 2013-10-24,下午9:47,Mark Nelson 写道: > On 10/24/2013 08:31 AM, Guang Yang wrote: >> Hi Mark, Greg and Kyle, >> Sorry to response this late, and thanks for providing the directions for >> me to look at. >> >> We have exact the same setup for OSD, pool replica (and even I tried to >> create the same number of PGs within the small cluster), however, I can >> still reproduce this constantly. >> >> This is the command I run: >> $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write >> >> With 24 OSDs: >> Average Latency: 0.00494123 >> Max latency: 0.511864 >> Min latency: 0.002198 >> >> With 330 OSDs: >> Average Latency:0.00913806 >> Max latency: 0.021967 >> Min latency: 0.005456 >> >> In terms of the crush rule, we are using the default one, for the small >> cluster, it has 3 OSD hosts (11 + 11 + 2), for the large cluster, we >> have 30 OSD hosts (11 * 30). >> >> I have a couple of questions: >> 1. Is it possible that latency is due to that we have only three layer >> hierarchy? like root -> host -> OSD, and as we are using the Straw (by >> default) bucket type, which has O(N) speed, and if host number increase, >> so that the computation actually increase. I suspect not as the >> computation is in the order of microseconds per my understanding. > > I suspect this is very unlikely as well. > >> >> 2. Is it possible because we have more OSDs, the cluster will need to >> maintain far more connections between OSDs which potentially slow things >> down? > > One thing here that might be very interesting is this: > > After you run your tests, if you do something like: > > find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} > dump_historic_ops \; > foo > > on each OSD server, you will get a dump of the 10 slowest operations > over the last 10 minutes for each OSD on each server, and it will tell > you were in each OSD operations were backing up. You can sort of search > through these files by greping for "duration" first, looking for the > long ones, and then going back and searching through the file for those > long durations and looking at the associated latencies. > > Something I have been investigating recently is time spent waiting for > osdmap propagation. It's something I haven't had time to dig into > meaningfully, but if we were to see that this was more significant on > your larger cluster vs your smaller one, that would be very interesting > news. > >> >> 3. Anything else i might miss? >> >> Thanks all for the constant help. >> >> Guang >> >> >> 在 2013-10-22,下午10:22,Guang Yang > <mailto:yguan...@yahoo.com>> 写道: >> >>> Hi Kyle and Greg, >>> I will get back to you with more details tomorrow, thanks for the >>> response. >>> >>> Thanks, >>> Guang >>> 在 2013-10-22,上午9:37,Kyle Bader >> <mailto:kyle.ba...@gmail.com>> 写道: >>> >>>> Besides what Mark and Greg said it could be due to additional hops >>>> through network devices. What network devices are you using, what is >>>> the network topology and does your CRUSH map reflect the network >>>> topology? >>>> >>>> On Oct 21, 2013 9:43 AM, "Gregory Farnum" >>> <mailto:g...@inktank.com>> wrote: >>>> >>>>On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang >>><mailto:yguan...@yahoo.com>> wrote: >>>>> Dear ceph-users, >>>>> Recently I deployed a ceph cluster with RadosGW, from a small >>>>one (24 OSDs) to a much bigger one (330 OSDs). 
>>>>> >>>>> When using rados bench to test the small cluster (24 OSDs), it >>>>showed the average latency was around 3ms (object size is 5K), >>>>while for the larger one (330 OSDs), the average latency was >>>>around 7ms (object size 5K), twice comparing the small cluster. >>>>> >>>>> The OSD within the two cluster have the same configuration, SAS >>>>disk, and two partitions for one disk, one for journal and the >>>>other for metadata. >>>>> >>>>> For PG numbers, the small cluster tested with the pool having >>>>100 PGs, and for the large cluster, the pool has 4 PGs (as I &
Re: [ceph-users] Rados bench result when increasing OSDs
Hi Mark, Greg and Kyle, Sorry for responding this late, and thanks for providing directions for me to look at. We have exactly the same setup for the OSDs and the pool replicas (and I even tried to create the same number of PGs within the small cluster); however, I can still reproduce this constantly. This is the command I run: $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write With 24 OSDs: Average Latency: 0.00494123 Max latency: 0.511864 Min latency: 0.002198 With 330 OSDs: Average Latency: 0.00913806 Max latency: 0.021967 Min latency: 0.005456 In terms of the crush rule, we are using the default one; the small cluster has 3 OSD hosts (11 + 11 + 2), and the large cluster has 30 OSD hosts (11 * 30). I have a couple of questions: 1. Is it possible that the latency is due to having only a three-layer hierarchy (root -> host -> OSD)? As we are using the straw bucket type (the default), which has O(N) speed, the computation increases as the number of hosts increases. I suspect not, as the computation is on the order of microseconds per my understanding. 2. Is it possible that because we have more OSDs, the cluster needs to maintain far more connections between OSDs, which potentially slows things down? 3. Anything else I might have missed? Thanks all for the constant help. Guang On 2013-10-22, at 10:22 PM, Guang Yang wrote: > Hi Kyle and Greg, > I will get back to you with more details tomorrow, thanks for the response. > > Thanks, > Guang > On 2013-10-22, at 9:37 AM, Kyle Bader wrote: >> Besides what Mark and Greg said it could be due to additional hops >> through network devices. What network devices are you using, what is the network >> topology and does your CRUSH map reflect the network topology? >> >> On Oct 21, 2013 9:43 AM, "Gregory Farnum" wrote: >> On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang wrote: >> > Dear ceph-users, >> > Recently I deployed a ceph cluster with RadosGW, going from a small one (24 >> > OSDs) to a much bigger one (330 OSDs). >> > >> > When using rados bench to test the small cluster (24 OSDs), the >> > average latency was around 3ms (object size is 5K), while for the larger >> > one (330 OSDs), the average latency was around 7ms (object size 5K), twice >> > that of the small cluster. >> > >> > The OSDs within the two clusters have the same configuration, SAS disks, and >> > two partitions per disk, one for the journal and the other for metadata. >> > >> > For PG numbers, the small cluster was tested with a pool having 100 PGs, and >> > for the large cluster, the pool has 4 PGs (as I want to further scale >> > the cluster, I chose a much larger PG count). >> > >> > Does my test result make sense? Like, when the PG number and OSD count increase, >> > the latency might drop? >> >> Besides what Mark said, can you describe your test in a little more >> detail? Writing/reading, length of time, number of objects, etc. >> -Greg >> Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Rados bench result when increasing OSDs
Hi Kyle and Greg, I will get back to you with more details tomorrow, thanks for the response. Thanks, Guang On 2013-10-22, at 9:37 AM, Kyle Bader wrote: > Besides what Mark and Greg said it could be due to additional hops through > network devices. What network devices are you using, what is the network > topology and does your CRUSH map reflect the network topology? > > On Oct 21, 2013 9:43 AM, "Gregory Farnum" wrote: > On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang wrote: > > Dear ceph-users, > > Recently I deployed a ceph cluster with RadosGW, going from a small one (24 OSDs) > > to a much bigger one (330 OSDs). > > > > When using rados bench to test the small cluster (24 OSDs), the > > average latency was around 3ms (object size is 5K), while for the larger > > one (330 OSDs), the average latency was around 7ms (object size 5K), twice > > that of the small cluster. > > > > The OSDs within the two clusters have the same configuration, SAS disks, and > > two partitions per disk, one for the journal and the other for metadata. > > > > For PG numbers, the small cluster was tested with a pool having 100 PGs, and > > for the large cluster, the pool has 4 PGs (as I want to further scale > > the cluster, I chose a much larger PG count). > > > > Does my test result make sense? Like, when the PG number and OSD count increase, > > the latency might drop? > > Besides what Mark said, can you describe your test in a little more > detail? Writing/reading, length of time, number of objects, etc. > -Greg > Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] Rados bench result when increasing OSDs
Thanks Mark for the response. My comments inline... From: Mark Nelson To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Rados bench result when increasing OSDs Message-ID: <52653b49.8090...@inktank.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed On 10/21/2013 09:13 AM, Guang Yang wrote: > Dear ceph-users, Hi! > Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) > to a much bigger one (330 OSDs). > > When using rados bench to test the small cluster (24 OSDs), it showed the > average latency was around 3ms (object size is 5K), while for the larger one > (330 OSDs), the average latency was around 7ms (object size 5K), twice > comparing the small cluster. Did you have the same number of concurrent requests going? [yguang] Yes. I run the test with 3 or 5 concurrent request, that does not change the result. > > The OSD within the two cluster have the same configuration, SAS disk, and > two partitions for one disk, one for journal and the other for metadata. > > For PG numbers, the small cluster tested with the pool having 100 PGs, and > for the large cluster, the pool has 4 PGs (as I will to further scale the > cluster, so I choose a much large PG). Forgive me if this is a silly question, but were the pools using the same level of replication? [yguang] Yes, both have 3 replicas. > > Does my test result make sense? Like when the PG number and OSD increase, the > latency might drop? You wouldn't necessarily expect a larger cluster to show higher latency if the nodes, pools, etc were all configured exactly the same, especially if you were using the same amount of concurrency. It's possible that you have some slow drives on the larger cluster that could be causing the average latency to increase. If there are more disks per node, that could do it too. [yguang] Glad to know this :) I will need to gather more information in terms of if there is any slow disk, will get back on this. Are there any other differences you can think of? [yguang] Another difference is, for the large cluster, as we expect to scale it to more than a thousand OSDs, we have a large PG number (4) pre-created. Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Rados bench result when increasing OSDs
Dear ceph-users, Recently I deployed a ceph cluster with RadosGW, going from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs), the average latency was around 7ms (object size 5K), twice that of the small cluster. The OSDs within the two clusters have the same configuration: SAS disks, with two partitions per disk, one for the journal and the other for metadata. For PG numbers, the small cluster was tested with a pool having 100 PGs, and for the large cluster, the pool has 4 PGs (as I want to further scale the cluster, I chose a much larger PG count). Does my test result make sense? Like, when the PG number and OSD count increase, the latency might drop? Thanks, Guang
Re: [ceph-users] ceph-deploy zap disk failure
Thanks all for the recommendation. I worked around it by modifying ceph-deploy to use the full path for sgdisk. Thanks, Guang On 2013-10-16, at 10:47 PM, Alfredo Deza wrote: > On Tue, Oct 15, 2013 at 9:19 PM, Guang wrote: >> -bash-4.1$ which sgdisk >> /usr/sbin/sgdisk >> >> Which path does ceph-deploy use? > > That is unexpected... these are the paths that ceph-deploy uses: > > '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin' > > So `/usr/sbin/` is there. I believe this is a case where $PATH gets > altered because of sudo (resetting the env variable). > > This should be fixed in the next release. In the meantime, you could > set the $PATH for non-interactive sessions (which is what ceph-deploy > does) for all users. I *think* that would be in `/etc/profile` > > >> >> Thanks, >> Guang >> >> On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote: >> >>> On Tue, Oct 15, 2013 at 10:52 AM, Guang wrote: Hi ceph-users, I am trying out the new ceph-deploy utility on RHEL6.4 and I came across a new issue: -bash-4.1$ ceph-deploy --version 1.2.7 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap server:/dev/sdb [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from remote host [ceph_deploy.osd][INFO ] Distro info: Red Hat Enterprise Linux Server 6.4 Santiago [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device [osd2.ceph.mobstor.bf1.yahoo.com][INFO ] Running command: sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found When I run disk zap on the host directly, it works without issues. Has anyone met the same issue? >>> >>> Can you run `which sgdisk` on that host? I want to make sure this is >>> not a $PATH problem. >>> >>> ceph-deploy tries to use the proper path remotely but it could be that >>> this one is not there. >>> >>> Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
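For anyone else hitting this before the ceph-deploy fix lands, the workaround amounts to resolving the binary yourself and invoking it by absolute path, so the $PATH that sudo hands to a non-interactive session no longer matters. A small illustrative sketch of that idea (not a ceph-deploy patch; the fallback path and target device are just examples, and shutil.which needs Python 3.3+):

import shutil
import subprocess

# Resolve sgdisk ourselves instead of relying on the PATH that sudo passes to a
# non-interactive session (which may not include /usr/sbin).
sgdisk = shutil.which('sgdisk') or '/usr/sbin/sgdisk'   # fallback path is an example

# Zap the disk with the absolute path, using the same flags shown in the log above.
device = '/dev/sdb'   # example target device
subprocess.check_call(['sudo', sgdisk, '--zap-all', '--clear', '--mbrtogpt', '--', device])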
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Greg. >>The typical case is going to depend quite a lot on your scale. [Guang] I am thinking of a scale of billions of objects, with sizes from several KB to several MB; my concern is the cache efficiency for such a use case. That said, I'm not sure why you'd want to use CephFS for a small-object store when you could just use raw RADOS, and avoid all the posix overheads. Perhaps I've misunderstood your use case? [Guang] No, you haven't. That is my use case :) I am also thinking of using RADOS directly without the POSIX layer on top, but before that I want to consider each option we have and compare the pros and cons. Thanks, Guang From: Gregory Farnum To: Guang Yang Cc: Gregory Farnum ; "ceph-us...@ceph.com" Sent: Tuesday, August 20, 2013 9:51 AM Subject: Re: [ceph-users] Usage pattern and design of Ceph On Monday, August 19, 2013, Guang Yang wrote: Thanks Greg. > > >Some comments inline... > > >On Sunday, August 18, 2013, Guang Yang wrote: > >Hi ceph-users, >>This is Guang and I am pretty new to Ceph, glad to meet you guys in the >>community! >> >> >>After walking through some of the Ceph documents, I have a couple of questions: >> 1. Is there any comparison between Ceph and AWS S3, in terms of the ability >>to handle different workloads (from KB to GB), with a corresponding >>performance report? > > >Not really; any comparison would be highly biased depending on your Amazon >ping and your Ceph cluster. We've got some internal benchmarks where Ceph >looks good, but they're not anything we'd feel comfortable publishing. > [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact >on the comparison. > 2. Looking at some industry solutions for distributed storage, GFS / >Haystack / HDFS all use a metadata server to store the logical-to-physical mapping >in memory and avoid disk I/O lookups when reading files; is that concern valid >for Ceph (in terms of the latency to read a file)? > > >These are very different systems. Thanks to CRUSH, RADOS doesn't need to do >any IO to find object locations; CephFS only does IO if the inode you request >has fallen out of the MDS cache (not terribly likely in general). This >shouldn't be an issue... >[Guang] Regarding "CephFS only does IO if the inode you request has fallen out of the >MDS cache": my understanding is that if we use CephFS, we will need to interact >with RADOS twice, the first time to retrieve metadata (file attributes, owner, >etc.) and the second time to load the data, and both times will need disk I/O for >the inode and the data. Is my understanding correct? The way some other >storage systems handle this is to cache the file handle in memory, so that they can >avoid the I/O to read the inode in. In the worst case this can happen with CephFS, yes. However, the client is not accessing metadata directly; it's going through the MetaData Server, which caches (lots of) metadata on its own, and the client can get leases as well (so it doesn't need to go to the MDS for each access, and can cache information on its own). The typical case is going to depend quite a lot on your scale. That said, I'm not sure why you'd want to use CephFS for a small-object store when you could just use raw RADOS, and avoid all the posix overheads. Perhaps I've misunderstood your use case? -Greg
> 3. Some industry research shows that one issue of file systems is the >metadata-to-data ratio, in terms of both access and storage, and some techniques >combine small files into large physical files to reduce the ratio (Haystack, for >example). If we want to use Ceph to store photos, should this be a concern, since >Ceph uses one physical file per object? > > >...although this might be. The issue basically comes down to how many disk >seeks are required to retrieve an item, and one way to reduce that number is >to hack the filesystem by keeping a small number of very large files and >calculating (or caching) where different objects are inside that file. Since >Ceph is designed for MB-sized objects it doesn't go to these lengths to >optimize that path like Haystack might (I'm not familiar with Haystack in >particular). >That said, you need some pretty extreme latency requirements before this >becomes an issue, and if you're also looking at HDFS or S3 I can't imagine >you're in that ballpark. You should be fine. :) >[Guang] Yep, that makes a lot of sense. >-Greg > >-- >Software Engineer #42 @ http://inktank.com | http://ceph.com > > > -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
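To make the raw-RADOS option concrete, here is a minimal sketch of what a small-object (photo) store looks like on librados, assuming the python-rados bindings and a hypothetical pool named 'photos'; put/get/delete go straight to RADOS objects, so there is no separate inode or directory metadata to fetch first:

import rados


class PhotoStore(object):
    """Minimal small-object store on raw RADOS (illustrative sketch only)."""

    def __init__(self, pool, conffile='/etc/ceph/ceph.conf'):
        self.cluster = rados.Rados(conffile=conffile)
        self.cluster.connect()
        self.ioctx = self.cluster.open_ioctx(pool)

    def put(self, key, data):
        # Single round trip; the client computes placement itself via CRUSH.
        self.ioctx.write_full(key, data)

    def get(self, key, max_size=8 * 1024 * 1024):
        # read() takes an upper bound on the length to fetch.
        return self.ioctx.read(key, length=max_size)

    def delete(self, key):
        self.ioctx.remove_object(key)

    def close(self):
        self.ioctx.close()
        self.cluster.shutdown()


# Example usage against the hypothetical 'photos' pool:
# store = PhotoStore('photos')
# store.put('photo-123', open('cat.jpg', 'rb').read())
# data = store.get('photo-123')
# store.close()

The trade-off versus CephFS is that you give up POSIX semantics (directories, rename, permissions) and manage object naming yourself.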
Re: [ceph-users] Usage pattern and design of Ceph
Then that makes total sense to me. Thanks, Guang From: Mark Kirkwood To: Guang Yang Cc: "ceph-users@lists.ceph.com" Sent: Tuesday, August 20, 2013 1:19 PM Subject: Re: [ceph-users] Usage pattern and design of Ceph On 20/08/13 13:27, Guang Yang wrote: > Thanks Mark. > > What are the design considerations for breaking large files into 4M chunks > rather than storing the large file directly? > > Quoting Wolfgang from the previous reply: => which is a good thing in terms of replication and OSD usage distribution ...which covers what I would have said quite well :-) Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
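To illustrate the distribution point: each 4M chunk becomes its own RADOS object with its own name, so CRUSH places every chunk (and its replicas) independently across OSDs instead of tying one huge object to a single placement group. A toy sketch of the bookkeeping (the chunk naming scheme here is made up for illustration and is not the actual RGW/striper object-name format):

CHUNK_SIZE = 4 * 1024 * 1024   # the 4M stripe size mentioned above


def chunk_names(object_name, total_size, chunk_size=CHUNK_SIZE):
    """Map one logical client object onto per-chunk RADOS object names
    (naming scheme is illustrative, not the real RGW/striper format)."""
    n_chunks = (total_size + chunk_size - 1) // chunk_size
    return ['%s.%08d' % (object_name, i) for i in range(n_chunks)]


def chunk_for_offset(offset, chunk_size=CHUNK_SIZE):
    """Return (chunk index, offset within that chunk) for a byte offset."""
    return offset // chunk_size, offset % chunk_size


# A 100 MB upload becomes 25 independent 4M objects, each placed separately:
# chunk_names('bigfile', 100 * 1024 * 1024) -> ['bigfile.00000000', ..., 'bigfile.00000024']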
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Greg. Some comments inline... On Sunday, August 18, 2013, Guang Yang wrote: Hi ceph-users, >This is Guang and I am pretty new to Ceph, glad to meet you guys in the >community! > > >After walking through some of the Ceph documents, I have a couple of questions: > 1. Is there any comparison between Ceph and AWS S3, in terms of the ability >to handle different workloads (from KB to GB), with a corresponding performance >report? Not really; any comparison would be highly biased depending on your Amazon ping and your Ceph cluster. We've got some internal benchmarks where Ceph looks good, but they're not anything we'd feel comfortable publishing. [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact on the comparison. 2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a metadata server to store the logical-to-physical mapping in memory and avoid disk I/O lookups when reading files; is that concern valid for Ceph (in terms of the latency to read a file)? These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any IO to find object locations; CephFS only does IO if the inode you request has fallen out of the MDS cache (not terribly likely in general). This shouldn't be an issue... [Guang] Regarding "CephFS only does IO if the inode you request has fallen out of the MDS cache": my understanding is that if we use CephFS, we will need to interact with RADOS twice, the first time to retrieve metadata (file attributes, owner, etc.) and the second time to load the data, and both times will need disk I/O for the inode and the data. Is my understanding correct? The way some other storage systems handle this is to cache the file handle in memory, so that they can avoid the I/O to read the inode in. 3. Some industry research shows that one issue of file systems is the metadata-to-data ratio, in terms of both access and storage, and some techniques combine small files into large physical files to reduce the ratio (Haystack, for example). If we want to use Ceph to store photos, should this be a concern, since Ceph uses one physical file per object? ...although this might be. The issue basically comes down to how many disk seeks are required to retrieve an item, and one way to reduce that number is to hack the filesystem by keeping a small number of very large files and calculating (or caching) where different objects are inside that file. Since Ceph is designed for MB-sized objects it doesn't go to these lengths to optimize that path like Haystack might (I'm not familiar with Haystack in particular). That said, you need some pretty extreme latency requirements before this becomes an issue, and if you're also looking at HDFS or S3 I can't imagine you're in that ballpark. You should be fine. :) [Guang] Yep, that makes a lot of sense. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
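The "no IO to find object locations" point is easiest to see with a toy model: the client hashes the object name to a PG and then runs a deterministic placement function over the cluster map it already holds, so no metadata server is consulted on the read path. The sketch below is a deliberately simplified stand-in for CRUSH (real CRUSH uses its own hash, walks a weighted bucket hierarchy and respects failure domains); it only shows that placement is pure computation on cached state:

import hashlib


def pg_for_object(object_name, pg_num):
    """Hash the object name to a placement group id (simplified; Ceph uses
    its own hash and a stable modulo, not md5)."""
    digest = hashlib.md5(object_name.encode('utf-8')).hexdigest()
    return int(digest, 16) % pg_num


def osds_for_pg(pg_id, osd_ids, replicas=3):
    """Toy stand-in for CRUSH: pick 'replicas' distinct OSDs deterministically
    from the cached cluster map, so every client computes the same answer."""
    chosen, attempt = [], 0
    while len(chosen) < replicas and attempt < 10 * replicas:
        digest = hashlib.md5(('%d.%d' % (pg_id, attempt)).encode('utf-8')).hexdigest()
        candidate = osd_ids[int(digest, 16) % len(osd_ids)]
        if candidate not in chosen:
            chosen.append(candidate)
        attempt += 1
    return chosen


# With a cached map of 330 OSDs and, say, 4096 PGs in the pool:
# pg = pg_for_object('photo-123', 4096)
# print(osds_for_pg(pg, list(range(330))))   # no server round trip needed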
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Mark. What are the design considerations for breaking large files into 4M chunks rather than storing the large file directly? Thanks, Guang From: Mark Kirkwood To: Guang Yang Cc: "ceph-users@lists.ceph.com" Sent: Monday, August 19, 2013 5:18 PM Subject: Re: [ceph-users] Usage pattern and design of Ceph On 19/08/13 18:17, Guang Yang wrote: > 3. Some industry research shows that one issue of file systems is the > metadata-to-data ratio, in terms of both access and storage, and some > techniques combine small files into large physical > files to reduce the ratio (Haystack, for example). If we want to use Ceph > to store photos, should this be a concern, since Ceph uses one physical file > per object? If you use Ceph as a pure object store, and get and put data via the basic rados api then sure, one client data object will be stored in one Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike api) then each client data object will be broken up into chunks at the rados level (typically 4M sized chunks). Regards Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Deploy Ceph on RHEL6.4
Hi ceph-users, I would like to check whether there is a manual or a set of steps I can follow to deploy Ceph on RHEL. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Usage pattern and design of Ceph
Hi ceph-users, This is Guang and I am pretty new to Ceph, glad to meet you guys in the community! After walking through some of the Ceph documents, I have a couple of questions: 1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different workloads (from KB to GB), with a corresponding performance report? 2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a metadata server to store the logical-to-physical mapping in memory and avoid disk I/O lookups when reading files; is that concern valid for Ceph (in terms of the latency to read a file)? 3. Some industry research shows that one issue of file systems is the metadata-to-data ratio, in terms of both access and storage, and some techniques combine small files into large physical files to reduce the ratio (Haystack, for example). If we want to use Ceph to store photos, should this be a concern, since Ceph uses one physical file per object? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com