Re: [ceph-users] Long peering - throttle at FileStore::queue_transactions
On Mon, Jan 4, 2016 at 7:21 PM, Sage Weil <s...@newdream.net> wrote:
> On Mon, 4 Jan 2016, Guang Yang wrote:
>> Hi Cephers,
>> Happy New Year! I have a question regarding long PG peering.
>>
>> Over the last several days I have been looking into the *long peering*
>> problem we see when we start an OSD / OSD host. What I observed was that
>> the two peering worker threads were throttled (stuck) when trying to
>> queue new transactions (writing the pg log), so the peering process is
>> dramatically slowed down.
>>
>> The first question that came to me was: what were the transactions in
>> the queue? The major ones, as I saw, included:
>>
>> - The osd_map and incremental osd_map. This happens if the OSD had been
>> down for a while (in a large cluster), or when the cluster got upgraded,
>> so that the osd_map epoch the down OSD had was far behind the latest
>> osd_map epoch. During OSD boot, it needs to persist all those osd_maps,
>> which generates lots of filestore transactions (linear in the epoch gap).
>> > As the PG was not involved in most of those epochs, could we only take
>> > and persist those osd_maps which matter to the PGs on the OSD?
>
> This part should happen before the OSD sends the MOSDBoot message, before
> anyone knows it exists. There is a tunable threshold that controls how
> recent the map has to be before the OSD tries to boot. If you're seeing
> this in the real world, we probably just need to adjust that value down
> to something small(er).

It queues the transactions and then sends out the MOSDBoot, so there is
still a chance that it contends with the peering ops (especially on large
clusters where there is a lot of activity generating many osdmap epochs).
Any chance we can change *queue_transactions* to *apply_transactions*, so
that we block there waiting for the osdmap to be persisted? At least we may
be able to do that during OSD boot. The concern is that if the OSD is
active, apply_transaction would take longer while holding osd_lock.

I can't find such a tunable, could you elaborate? Thanks!

> sage
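For anyone hitting the same throttling during peering, a rough way to
confirm that the filestore queue throttle is the bottleneck and to relax it
temporarily is sketched below. The throttle counter names and the chosen
values are illustrative assumptions and vary between releases, so verify
them against the version you run:

    # Check whether ops are piling up in the filestore queue of an OSD
    # (run on the OSD host; osd.0 is a placeholder).
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump \
        | python -m json.tool | grep -A 3 '"throttle-filestore_ops"'

    # Current queue limits for that OSD.
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get filestore_queue_max_ops
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get filestore_queue_max_bytes

    # Temporarily raise the limits while the OSD catches up on maps and pg
    # log writes (example values only).
    ceph tell osd.0 injectargs '--filestore_queue_max_ops 500 --filestore_queue_max_bytes 209715200'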
[ceph-users] Long peering - throttle at FileStore::queue_transactions
Hi Cephers,
Happy New Year! I have a question regarding long PG peering.

Over the last several days I have been looking into the *long peering*
problem we see when we start an OSD / OSD host. What I observed was that
the two peering worker threads were throttled (stuck) when trying to queue
new transactions (writing the pg log), so the peering process is
dramatically slowed down.

The first question that came to me was: what were the transactions in the
queue? The major ones, as I saw, included:

- The osd_map and incremental osd_map. This happens if the OSD had been
  down for a while (in a large cluster), or when the cluster got upgraded,
  so that the osd_map epoch the down OSD had was far behind the latest
  osd_map epoch. During OSD boot, it needs to persist all those osd_maps,
  which generates lots of filestore transactions (linear in the epoch gap).
  > As the PG was not involved in most of those epochs, could we only take
  > and persist those osd_maps which matter to the PGs on the OSD?

- There are lots of deletion transactions: as a PG boots, it needs to merge
  the PG log from its peers, and for a deletion entry in the log it queues
  the deletion transaction immediately.
  > Could we delay queueing those transactions until all PGs on the host
  > have peered?

Thanks,
Guang
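To get a feel for how large the epoch gap actually is before blaming the
transaction queue, one can compare the maps the booting OSD holds against
the cluster's current epoch. This is only a sketch; the exact fields in the
status output differ slightly between releases:

    # Current cluster epoch (first line of the osd dump).
    ceph osd dump | head -n 1

    # Oldest/newest map epochs the (still booting) OSD has persisted
    # locally; the gap to the cluster epoch is the number of full and
    # incremental maps it still has to fetch and persist.
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok status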
[ceph-users] OSD disk replacement best practise
Hi cephers,
I am currently drafting the run book for OSD disk replacement. I think the
rule of thumb is to reduce data migration (recovery/backfill), and I thought
the following procedure should achieve that purpose:

1. ceph osd out osd.XXX (mark it out to trigger data migration)
2. ceph osd rm osd.XXX
3. ceph auth rm osd.XXX
4. Provision a new OSD which will take XXX as the OSD id and migrate data back.

With the above procedure, the crush weight of the host never changes, so we
can limit data migration to only what is necessary. Does it make sense?

Thanks,
Guang
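For what it's worth, a sketch of how the whole sequence could look on the
command line, including re-provisioning, is below. The OSD id, device name,
hostname and the use of ceph-deploy are placeholders; adapt them to your own
provisioning tooling and verify each step against the release you run:

    ID=21
    # Stop the daemon on the OSD host and mark it out (data starts
    # migrating; the host's crush weight is untouched).
    sudo service ceph stop osd.$ID
    ceph osd out osd.$ID

    # Remove the OSD identity but leave the crush bucket/host weight alone.
    ceph osd rm osd.$ID
    ceph auth del osd.$ID        # "ceph auth rm" on newer releases

    # After the failed disk is physically replaced, re-create the OSD;
    # because the old id was freed, the new OSD normally gets the same id
    # back and only the replaced disk is backfilled.
    ceph osd create              # should return the freed id ($ID)
    # ... prepare/activate the new disk with your usual tooling, e.g.:
    # ceph-deploy osd prepare osdhost21:/dev/sdf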
[ceph-users] row geo-replication to another data store?
Hi cephers,
We are investigating a backup solution for Ceph. In short, we would like a
solution to back up a Ceph cluster to another data store (not a Ceph
cluster; assume it has a SWIFT API). We would like to have both full
backups and incremental backups on top of the full backup.

After going through the geo-replication blueprint [1], I am thinking that
we can leverage that effort and, instead of replicating the data into
another Ceph cluster, make it replicate to another data store. At the same
time, I have a couple of questions which need your help:

1) How does the radosgw-agent scale to multiple hosts? Our first
investigation shows it only works on a single host, but I would like to
confirm.
2) Can we configure the interval for incremental backup, e.g. 1 hour /
1 day / 1 month?

[1] https://wiki.ceph.com/Planning/Blueprints/Dumpling/RGW_Geo-Replication_and_Disaster_Recovery

Thanks,
Guang
Re: [ceph-users] Ask a performance question for the RGW
Hello,
There is a known limitation of bucket scalability, and there is a blueprint
tracking it:
https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability
For the time being, I would recommend sharding at the application level
(create multiple buckets) to work around this limitation.

Thanks,
Guang

On Jun 30, 2014, at 2:54 PM, baijia...@126.com wrote:
> hello, everyone!
>
> When I use rest-bench to test RGW performance, the command is:
>   ./rest-bench --access-key=ak --secret=sk --bucket=bucket_name --seconds=600 -t 200 -b 524288 --no-cleanup write
>
> Test result:
>   Total time run:         362.962324
>   Total writes made:      48189
>   Write size:             524288
>   Bandwidth (MB/sec):     66.383
>   Stddev Bandwidth:       40.7776
>   Max bandwidth (MB/sec): 173
>   Min bandwidth (MB/sec): 0
>   Average Latency:        1.50435
>   Stddev Latency:         0.910731
>   Max latency:            9.12276
>   Min latency:            0.19867
>
> My environment is 4 hosts and 40 disks (OSDs). The test result is very
> bad: average latency is 1.5 seconds, and I find that writing object
> metadata is very slow. Because it puts so many objects into one bucket,
> writing object metadata calls the method "bucket_prepare_op", and testing
> shows this op is very slow.
>
> I found the OSD which contains the bucket index object and looked at
> "bucket_prepare_op" via dump_historic_ops:
>
> { "description": "osd_op(client.4742.0:87613 .dir.default.4243.3 [call rgw.bucket_prepare_op] 3.3670fe74 e317)",
>   "received_at": "2014-06-30 13:35:55.409597",
>   "age": "51.148026",
>   "duration": "4.130137",
>   "type_data": [
>         "commit sent; apply or cleanup",
>         { "client": "client.4742",
>           "tid": 87613},
>         [
>             { "time": "2014-06-30 13:35:55.409660", "event": "waiting_for_osdmap"},
>             { "time": "2014-06-30 13:35:55.409669", "event": "queue op_wq"},
>             { "time": "2014-06-30 13:35:55.896766", "event": "reached_pg"},
>             { "time": "2014-06-30 13:35:55.896793", "event": "started"},
>             { "time": "2014-06-30 13:35:55.896796", "event": "started"},
>             { "time": "2014-06-30 13:35:55.899450", "event": "waiting for subops from [40,43]"},
>             { "time": "2014-06-30 13:35:55.899757", "event": "commit_queued_for_journal_write"},
>             { "time": "2014-06-30 13:35:55.899799", "event": "write_thread_in_journal_buffer"},
>             { "time": "2014-06-30 13:35:55.899910", "event": "journaled_completion_queued"},
>             { "time": "2014-06-30 13:35:55.899936", "event": "journal first callback"},
>             { "time": "2014-06-30 13:35:55.899944", "event": "queuing ondisk"},
>             { "time": "2014-06-30 13:35:56.142104", "event": "sub_op_commit_rec"},
>             { "time": "2014-06-30 13:35:56.176950", "event": "sub_op_commit_rec"},
>             { "time": "2014-06-30 13:35:59.535301", "event": "op_commit"},
>             { "time": "2014-06-30 13:35:59.535331", "event": "commit_sent"},
>             { "time": "2014-06-30 13:35:59.539723", "event": "op_applied"},
>             { "time": "2014-06-30 13:35:59.539734", "event": "done"}]]}
>
> So why is the step from journaled_completion_queued to op_commit so slow,
> and what happened?
>
> thanks
> baijia...@126.com
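To make the "shard at the application level" suggestion concrete, here is a
rough sketch of spreading objects across a fixed set of buckets by hashing
the object key. The bucket naming, the shard count and the use of s3cmd are
illustrative assumptions, not anything RGW provides out of the box:

    # Pre-create N buckets once; a few dozen is usually enough to keep any
    # single bucket index from becoming a hotspot.
    N=32
    for i in $(seq 0 $((N-1))); do s3cmd mb s3://mydata-shard-$i; done

    # On every PUT, derive the shard from a hash of the key so the mapping
    # is deterministic and the same shard is used on GET.
    key="images/user123/photo.jpg"
    shard=$(( 0x$(printf '%s' "$key" | md5sum | cut -c1-4) % N ))
    s3cmd put photo.jpg s3://mydata-shard-$shard/$key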
Re: [ceph-users] Ask a performance question for the RGW
On Jun 30, 2014, at 3:59 PM, baijia...@126.com wrote:
> Hello, thanks for answering the question. But even when there are fewer
> than 50 thousand objects, the latency is very high. Looking at the write
> ops for the bucket index object, the step from journaled_completion_queued
> to op_commit cost 3.6 seconds, which means that the time from "journal
> write finished" to op_commit was 3.6 seconds. I can't understand this --
> what happened?

The operations updating the same bucket index object get serialized; one
possibility is that this operation was stuck waiting for other ops on the
same bucket index object to finish their work.

Thanks,
Guang

> From: Guang Yang
> Sent: 2014-06-30 14:57
> To: baijiaruo
> Cc: ceph-users
> Subject: Re: [ceph-users] Ask a performance question for the RGW
>
> Hello,
> There is a known limitation of bucket scalability, and there is a
> blueprint tracking it:
> https://wiki.ceph.com/Planning/Blueprints/Submissions/rgw%3A_bucket_index_scalability
> For the time being, I would recommend sharding at the application level
> (create multiple buckets) to work around this limitation.
>
> Thanks,
> Guang
[ceph-users] XFS - number of files in a directory
Hello Cephers,
We used to run a Ceph cluster with our data pool at 3 replicas. We estimated
the number of files (given disk size and object size) for each PG to be
around 8K, and we disabled folder splitting, which means all files are
located in the root PG folder. Our testing showed good performance with that
setup.

Right now we are evaluating erasure coding, which splits each object into a
number of chunks and increases the number of files several times over.
Although XFS claims good support for large directories [1], some testing
also showed that we may expect performance degradation for large
directories.

I would like to check what your experience has been with this on your Ceph
cluster, if you are using XFS. Thanks.

[1] http://www.scs.stanford.edu/nyu/02fa/sched/xfs.pdf

Thanks,
Guang
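For reference, the "disable folder splitting" trick mentioned above is
usually done through the FileStore split/merge settings; the sketch below
shows the knobs involved with purely illustrative values. Roughly, a
subdirectory is split once it holds more than about
filestore_split_multiple * abs(filestore_merge_threshold) * 16 files, so
raising these numbers keeps objects in fewer, larger directories. Check the
option names and defaults against the release you run:

    # ceph.conf, [osd] section -- illustrative values only.
    [osd]
        filestore merge threshold = 40
        filestore split multiple  = 8

    # Inspect what a running OSD currently uses:
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show \
        | grep -E 'filestore_(merge_threshold|split_multiple)'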
Re: [ceph-users] Expanding pg's of an erasure coded pool
On May 28, 2014, at 5:31 AM, Gregory Farnum <g...@inktank.com> wrote:
> On Sun, May 25, 2014 at 6:24 PM, Guang Yang <yguan...@yahoo.com> wrote:
>> On May 21, 2014, at 1:33 AM, Gregory Farnum <g...@inktank.com> wrote:
>>> This failure means the messenger subsystem is trying to create a thread
>>> and is getting an error code back -- probably due to a process or
>>> system thread limit that you can turn up with ulimit. This is happening
>>> because a replicated PG primary needs a connection to only its replicas
>>> (generally 1 or 2 connections), but with an erasure-coded PG the
>>> primary requires a connection to m+n-1 replicas (everybody who's in the
>>> erasure-coding set, including itself). Right now our messenger requires
>>> a thread for each connection, so kerblam. (And it actually requires a
>>> couple such connections because we have separate heartbeat, cluster
>>> data, and client data systems.)
>> Hi Greg,
>> Is there any plan to refactor the messenger component to reduce the
>> number of threads? For example, use an event-driven model.
> We've discussed it in very broad terms, but there are no concrete designs
> and it's not on the schedule yet. If anybody has conclusive evidence that
> it's causing them trouble they can't work around, that would be good to
> know...

Thanks for the response!

We used to have a cluster with each OSD host having 11 disks (daemons); on
each host there are around 15K threads. The system is stable, but when there
is a cluster-wide change (e.g. OSD down / out, recovery), we observed the
system load increasing; there was no cascading failure though.

Most recently we have been evaluating Ceph on high density hardware with
each OSD host having 33 disks (daemons); on each host there are around
40K-50K threads. With some OSD hosts down/out, we started seeing the load
increase sharply and a large volume of thread joins/creations.

We don't have strong evidence that the messenger thread model is the
problem, nor how much an event-driven approach would help, but I think as we
move to high density hardware (for cost-saving purposes) the issue could be
amplified. If there is any plan, it would be good to know, and we are very
interested in getting involved.

Thanks,
Guang

> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
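Since the original failure was thread creation being refused, a few commands
that help find which limit is being hit are sketched below; whether the
per-process, per-user or system-wide limit bites first depends on the
distribution and init setup, so treat this as a checklist rather than a fix:

    # Threads held by a single ceph-osd process.
    ps -o nlwp= -p $(pidof ceph-osd | awk '{print $1}')

    # Total threads from all OSD daemons on the host.
    ps -eLf | grep -c [c]eph-osd

    # Per-user limit on processes/threads for the user running the OSDs.
    ulimit -u

    # System-wide ceilings; thread creation fails once either is reached.
    sysctl kernel.threads-max kernel.pid_max

    # Raise them if needed (values are only examples).
    sysctl -w kernel.pid_max=4194303
    sysctl -w kernel.threads-max=2097152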
Re: [ceph-users] Firefly 0.80 rados bench cleanup / object removal broken?
Hi Matt,
The problem you came across is due to a change made to rados bench in the
Firefly release. It aimed to solve the problem that, when there are multiple
rados bench instances writing, we want to be able to run a rados read bench
against each run's data as well. Unfortunately, that change broke your use
case. Here is my suggestion to solve your problem:

1. Remove the pre-defined metadata object:
   $ rados -p {pool_name} rm benchmark_last_metadata
2. Clean up by prefix:
   $ rados -p {pool_name} cleanup --prefix bench

Going forward, you can use the new parameter '--run-name' to name each run
and clean up on that basis. If you still want to do a slow linear search to
clean up, be sure to remove the benchmark_last_metadata object before you
kick off the cleanup.

Let me know if that helps.

Thanks,
Guang

On May 20, 2014, at 6:45 AM, matt.lat...@hgst.com wrote:
> I was experimenting previously with 0.72, and could easily clean up pool
> objects from several previous rados bench (write) jobs with:
>
>   rados -p poolname cleanup bench
>
> (would remove all objects starting with "bench"). I quickly realised when
> I moved to 0.80 that my script was broken and theoretically I now need:
>
>   rados -p poolname cleanup --prefix benchmark_data
>
> But this only works sometimes, and sometimes partially. Issuing the
> command twice seems to help a bit! Also if I do "rados -p poolname ls"
> beforehand, it seems to increase my chances of success, but often I am
> still left with benchmark objects undeleted. I also tried using the
> --run-name option to no avail.
>
> The story gets more bizarre now that I have set up a hot SSD cache pool
> in front of the backing OSD (SATA) pool. Objects won't delete from either
> pool with rados cleanup. I tried:
>
>   rados -p cachepoolname cache-flush-evict-all
>
> which worked (rados df shows all objects now on the backing pool). Then,
> bizarrely, trying cleanup from the backing OSD pool just appears to copy
> them back into the cache pool, and they remain on the backing pool.
>
> I can list individual object names with "rados -p poolname ls", but
> "rados rm objectname" will not remove individual objects, stating "file
> or directory not found".
>
> Are others seeing these things and are there ways to work around them, or
> am I doing something wrong? Are these commands now deprecated, and if so,
> what should I use?
>
> Ubuntu 12.04, Kernel 3.14.0
>
> Matt Latter
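A short example of the per-run workflow with the Firefly tool, for anyone
scripting this; the pool and run names are placeholders:

    POOL=testpool
    # Each concurrent writer tags its objects with its own run name.
    rados -p $POOL bench 60 write --no-cleanup --run-name clientA
    rados -p $POOL bench 60 write --no-cleanup --run-name clientB

    # Read back one specific run's data later.
    rados -p $POOL bench 60 seq --run-name clientA

    # Clean up each run independently.
    rados -p $POOL cleanup --run-name clientA
    rados -p $POOL cleanup --run-name clientB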
[ceph-users] Docs - trouble shooting mon
Hello,
Today I read the monitor troubleshooting doc
(https://ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/),
which has this section:

> Scrap the monitor and create a new one
>
> You should only take this route if you are positive that you won't lose
> the information kept by that monitor; that you have other monitors and
> that they are running just fine so that your new monitor is able to
> synchronize from the remaining monitors. Keep in mind that destroying a
> monitor, if there are no other copies of its contents, may lead to loss
> of data.

I would like to ask how to check whether "there are other copies of its
contents" for a given monitor instance?

Thanks,
Guang
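In practice the check usually boils down to confirming that the remaining
monitors form a healthy quorum, since every monitor in quorum carries a full
copy of the monitor store. A few commands that help verify that before
scrapping one, with the monitor name as a placeholder:

    # Overall monitor map and quorum membership.
    ceph mon stat
    ceph quorum_status --format json-pretty

    # Ask one of the surviving monitors directly over its admin socket
    # (useful when cluster commands hang because quorum is already shaky).
    sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd152.asok mon_status
    # "state" should be leader or peon, and "quorum" should list the other mons.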
[ceph-users] CEPH's data durability with different configurations
Hi all,
One goal of a storage system is to achieve certain durability SLAs, so we
replicate data with multiple copies and check consistency on a regular basis
(e.g. scrubbing). However, replication increases cost (a tradeoff between
cost and durability), and cluster-wide consistency checking can impact
performance (a tradeoff between performance and durability).

Most recently I have been trying to figure out the best configuration for
this, including:
1) How many copies do I need? (pool min_size and size)
2) How frequently should I run scrubbing and deep scrubbing?

Can someone share your experience tuning those numbers and the durability
you can achieve with them? BTW, S3 claims 99.999999999% durability of
objects over a given year, which seems super high on commodity hardware.

Thanks,
Guang
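For reference, these are the main knobs involved; the values shown are just
illustrative starting points, not recommendations, and the option names
should be verified against your release:

    # Replication level per pool (3 copies, keep serving I/O with 2 alive).
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2

    # Scrub scheduling, ceph.conf [osd] section (intervals in seconds).
    [osd]
        osd max scrubs          = 1          # concurrent scrubs per OSD
        osd scrub min interval  = 86400      # light scrub at most daily
        osd scrub max interval  = 604800     # force a light scrub weekly
        osd deep scrub interval = 2419200    # deep scrub every 4 weeks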
[ceph-users] A simple tool to do osd crush reweigh after creating pool to gain better PG distribution across OSDs
Hi all,
In order to deal with the PG uneven-distribution problem [1, 2], which
further leads to uneven disk usage, I recently developed a simple script
which does *osd crush* reweighting right after creating the pool that holds
the most significant data (e.g. .rgw.buckets). We have had good experience
tuning the distribution difference to less than 10% with this tool.

Here is the tool:
https://github.com/guangyy/ceph_misc/blob/master/osd_crush_reweight/ceph_osd_crush_reweight.pl

If you also see a relatively high PG distribution variance across OSDs by
default, you can check out the script and see whether it serves your
purpose. All reviews and suggestions (especially on the algorithm, for an
even better distribution) are welcome.

[1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg04216.html
[2] http://www.spinics.net/lists/ceph-devel/msg17509.html

Thanks,
Guang
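For anyone wanting to eyeball the imbalance before (or instead of) running
the script, a rough way to count PGs per OSD for one pool and nudge a crush
weight by hand is sketched below. The pool id (3), the weight value and the
exact column layout of "pg dump" are assumptions that may need adjusting for
your release:

    # Count how many PGs of pool 3 map to each OSD (primary and replicas).
    ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^3\./ {
        gsub(/[\[\]]/, "", $3); n = split($3, osds, ",");
        for (i = 1; i <= n; i++) count[osds[i]]++ }
        END { for (o in count) print "osd." o, count[o] }' | sort -t. -k2 -n

    # Nudge an over-full OSD down (crush weight, not the 0-1 "reweight").
    ceph osd crush reweight osd.17 1.75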
[ceph-users] XFS tunning on OSD
Hello all,
Recently I have been working on Ceph performance analysis on our cluster.
Our OSD hardware looks like:

  11 SATA disks, 4TB each, 7200 RPM
  48GB RAM

When breaking down the latency, we found that half of it (the average
latency is around 60 milliseconds via radosgw) comes from file lookup and
open (there could be a couple of disk seeks there). Looking at the file
system cache (slabtop), we found that around 5M dentries / inodes are
cached; however, the host has around 110 million files (and directories) in
total.

I am wondering if there is any good experience within the community tuning
for the same workload, e.g. changing the inode size, or using the
mkfs.xfs -n size=64k option [1]?

[1] http://xfs.org/index.php/XFS_FAQ#Q:_Performance:_mkfs.xfs_-n_size.3D64k_option

Thanks,
Guang
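A sketch of the XFS and VFS knobs usually tried for this kind of
metadata-heavy workload is below. These are illustrative values only (and
reformatting obviously requires rebuilding the OSD), so benchmark before
rolling anything out fleet-wide:

    # Larger inodes keep more xattrs inline; larger directory blocks reduce
    # seeks when scanning big directories (device name is a placeholder).
    mkfs.xfs -f -i size=2048 -n size=64k /dev/sdf1

    # Typical mount options for an OSD data partition.
    mount -o noatime,nodiratime,inode64,logbsize=256k /dev/sdf1 /var/lib/ceph/osd/ceph-5

    # Favour dentry/inode cache over page cache so hot metadata stays resident.
    sysctl -w vm.vfs_cache_pressure=10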
[ceph-users] PG folder hierarchy
Hello,
Most recently, when looking at PG folder splitting, I found that there is
only one sub folder in the top 3 / 4 levels, and the full 16 sub folders
only start appearing from level 6. What is the design consideration behind
this?

For example, if the PG root folder is '3.1905_head', the first level has
only one sub folder 'DIR_5', then one sub folder 'DIR_0', then 'DIR_9',
under which there are two sub folders 'DIR_1' and 'DIR_9', and starting from
there the next level has 16 sub folders.

If we started splitting into 16 sub folders at the very first level, we
might gain better performance with fewer dentry lookups (though most likely
the root levels are cached).

Thanks,
Guang
Re: [ceph-users] PG folder hierarchy
Got it. Thanks Greg for the response!

Thanks,
Guang

On Feb 26, 2014, at 11:51 AM, Gregory Farnum <g...@inktank.com> wrote:
> On Tue, Feb 25, 2014 at 7:13 PM, Guang <yguan...@yahoo.com> wrote:
>> Hello,
>> Most recently, when looking at PG folder splitting, I found that there is
>> only one sub folder in the top 3 / 4 levels, and the full 16 sub folders
>> only start appearing from level 6. What is the design consideration
>> behind this?
>>
>> For example, if the PG root folder is '3.1905_head', the first level has
>> only one sub folder 'DIR_5', then one sub folder 'DIR_0', then 'DIR_9',
>> under which there are two sub folders 'DIR_1' and 'DIR_9', and starting
>> from there the next level has 16 sub folders.
>>
>> If we started splitting into 16 sub folders at the very first level, we
>> might gain better performance with fewer dentry lookups (though most
>> likely the root levels are cached).
>
> It's an implementation detail of the FileStore (the part of the OSD that
> stores data in the filesystem). Each of those folders represents an
> ever-smaller division of the hash space that objects live in. The more PGs
> you have, the less hash space each one covers, so there's that trail of
> folders. It's a bit unfortunate, because as you mention it involves more
> metadata memory caching, but fixing it would require some fairly detailed
> code in a critical path. The cost of fixing it and the risk of breaking
> things haven't been worth it yet.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
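To make the "division of the hash space" concrete: FileStore places an
object according to the hex digits of its placement hash, one nibble per
directory level, starting from the least-significant nibble. Every object in
PG 3.1905 has a hash whose low bits equal 0x1905, so the first levels are
forced to DIR_5/DIR_0/DIR_9/DIR_1 (or DIR_9), which is exactly the
single-folder trail observed above. A rough way to see this on disk; the
object name, OSD path and example output are placeholders, and the layout is
specific to FileStore:

    # Show where one object's hash puts it (output illustrative).
    ceph osd map .rgw.buckets someobject
    # -> ... object 'someobject' -> pg 3.79f6b905 (3.1905) -> ...
    #    (the last nibbles of the hash match the PG id)

    # Walk the on-disk trail for that PG on one of its OSDs.
    find /var/lib/ceph/osd/ceph-12/current/3.1905_head -maxdepth 4 -type d | sort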
[ceph-users] Ceph GET latency
Hi ceph-users,
We are using Ceph (radosgw) to store user-generated images. As GET latency
is critical for us, I recently did some investigation of the GET path to
understand where the time is spent. I first confirmed that the latency came
from the OSD (read op), so we instrumented the code to trace the GET request
(the read op at the OSD side; to be more specific, each object of size
[512K + 4M * x] is split into [1 + x] chunks, and each chunk needs one read
op). Each read op goes through the following steps:

1. Dispatched and taken by an op thread for processing (processing not yet
   started):
     0 - 20 ms,   94%
     20 - 50 ms,   2%
     50 - 100 ms,  2%
     100 ms+,      2%
   For those with 20ms+ latency, half are due to waiting for the pg lock
   (https://github.com/ceph/ceph/blob/dumpling/src/osd/OSD.cc#L7089); the
   other half are yet to be investigated.

2. Get the file xattr ('-'), which opens the file and populates the fd
   cache (https://github.com/ceph/ceph/blob/dumpling/src/os/FileStore.cc#L230):
     0 - 20 ms,   80%
     20 - 50 ms,   8%
     50 - 100 ms,  7%
     100 ms+,      5%
   The latency comes from (in decreasing order): file path lookup
   (https://github.com/ceph/ceph/blob/dumpling/src/os/HashIndex.cc#L294),
   file open, and fd cache lookup / add. Currently objects are stored in
   level 6 or level 7 folders (due to http://tracker.ceph.com/issues/7207,
   I stopped folder splitting).

3. Get more xattrs -- this is fast thanks to the previous fd cache (rarely
   above 1 ms).

4. Read the data:
     0 - 20 ms,   84%
     20 - 50 ms,  10%
     50 - 100 ms,  4%
     100 ms+,      2%

I decreased vfs_cache_pressure from its default value of 100 to 5 to make
the VFS favor dentry/inode cache over page cache; unfortunately it does not
help.

Long story short, most of the long-latency read ops come from file system
calls (for cold data). As our workload mainly stores objects smaller than
500KB, it generates a very large number of files.

I would like to ask whether people have experienced similar issues, and
whether there is any suggestion I can try to boost GET performance. On the
other hand, PUT performance can be sacrificed.

Thanks,
Guang
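A couple of FileStore-level knobs that are commonly tried for exactly this
"cold small-object GET" profile are sketched below; the value is purely
illustrative and the option name should be checked against the running
release:

    # ceph.conf [osd] section -- keep more open file handles cached so
    # step 2 can skip the path lookup + open for warm objects.
    [osd]
        filestore fd cache size = 16384

    # Verify what a running OSD uses and watch whether latency moves.
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get filestore_fd_cache_size
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops | less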
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks all for the help. We finally identified the root cause of the issue:
it was lock contention happening during folder splitting. Here is the
tracking ticket (thanks Inktank for the fix!):
http://tracker.ceph.com/issues/7207

Thanks,
Guang

On Tuesday, December 31, 2013 8:22 AM, Guang Yang <yguan...@yahoo.com> wrote:
> Thanks Wido, my comments inline...
>
>> Date: Mon, 30 Dec 2013 14:04:35 +0100
>> From: Wido den Hollander <w...@42on.com>
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
>>
>> What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a
>> performance degradation over time. So if you run btrfs, that might be
>> the problem.
>
> [yguang] We are running on xfs; journal and data share the same disk with
> different partitions.
>
>> Wido
>
> Thanks
Re: [ceph-users] RADOS + deep scrubbing performance issues in production environment
+ceph-users. Does anybody have similar experience with scrubbing /
deep-scrubbing?

Thanks,
Guang

On Jan 29, 2014, at 10:35 AM, Guang <yguan...@yahoo.com> wrote:
> Glad to see there is some discussion around scrubbing / deep-scrubbing.
> We are experiencing the same: scrubbing can affect latency quite a bit,
> and so far I have found two slow patterns (dump_historic_ops):
>   1) waiting to be dispatched;
>   2) waiting in the op working queue to be fetched by an available op
>      thread.
>
> For the first slow pattern, it looks like there is a lock (the dispatcher
> stops working for 2 seconds and then resumes, same for the scrubber
> thread); that needs further investigation.
>
> For the second slow pattern, as scrubbing brings more ops (for the scrub
> check), it increases the op threads' workload (client ops have a lower
> priority). I think that could be improved by increasing the op thread
> number; I will confirm this analysis by adding more op threads and turning
> scrubbing on per OSD.
>
> Does the above observation and analysis make sense?
>
> Thanks,
> Guang
>
> On Jan 29, 2014, at 2:13 AM, Filippos Giannakos <philipg...@grnet.gr> wrote:
>> On Mon, Jan 27, 2014 at 10:45:48AM -0800, Sage Weil wrote:
>>> There is also
>>>   ceph osd set noscrub
>>> and then later
>>>   ceph osd unset noscrub
>>> I forget whether this pauses an in-progress PG scrub or just makes it
>>> stop when it gets to the next PG boundary.
>>> sage
>>
>> I bumped into those settings but I couldn't find any documentation about
>> them. When I first tried them, they didn't do anything immediately, so I
>> thought they weren't the answer. After your mention, I tried them again,
>> and after a while the deep-scrubbing stopped. So I'm guessing they stop
>> scrubbing at the next PG boundary.
>>
>> I see from this thread and others before it that some people think it is
>> a spindle issue. I'm not sure that it is just that. Reproducing it on an
>> idle cluster that can do more than 250MiB/second, and pausing for 4-5
>> seconds on a single request, sounds like an issue by itself. Maybe there
>> is too much locking or not enough priority given to the actual I/O?
>> Plus, the idea of throttling deep scrubbing based on iops sounds
>> appealing.
>>
>> Kind Regards,
>> --
>> Filippos
>> philipg...@grnet.gr
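For others tuning the same thing, the knobs usually involved in keeping
scrub impact down are sketched below; the values are examples only, and some
of these options appeared in different releases, so verify them before
relying on them:

    # Pause all (deep-)scrubbing cluster-wide while investigating.
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... and resume later.
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

    # Per-OSD limits, ceph.conf [osd] section (illustrative values).
    [osd]
        osd max scrubs           = 1        # concurrent scrub ops per OSD
        osd scrub load threshold = 0.5      # skip scrubs when loadavg is above this
        osd deep scrub interval  = 2419200  # deep scrub every 4 weeks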
Re: [ceph-users] Ceph cluster is unreachable because of authentication failure
Thanks Sage. I just captured part of the log (it was growing fast); the
process did not hang, but I saw the same pattern repeatedly. Should I
increase the log level and send it over email (it is constantly
reproducible)?

Thanks,
Guang

On Jan 18, 2014, at 12:05 AM, Sage Weil <s...@inktank.com> wrote:
> On Fri, 17 Jan 2014, Guang wrote:
>> Thanks Sage. I have further narrowed the problem down to: #any command
>> using a paxos service would hang#. Details:
>>
>> 1. I am able to run ceph status / osd dump, etc.; however, the results
>> are out of date (though I stopped all OSDs, this is not reflected in the
>> ceph status report).
>>
>> -bash-4.1$ sudo ceph -s
>>   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>    health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>    monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 123278, quorum 0,1,2 osd151,osd152,osd153
>>    osdmap e134893: 781 osds: 694 up, 751 in
>>     pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>    mdsmap e1: 0/0/1 up
>>
>> 2. If I run a command which uses paxos, the command hangs forever. This
>> includes "ceph osd set noup" (and also the commands an OSD sends to the
>> monitor when being started (create-or-move)). I attached the
>> corresponding monitor log (it looks like a bug).
>
> I see the osd set command coming through, but it arrives while paxos is
> converging and the log seems to end before the mon would normally process
> the delayed messages. Is there a reason why the log fragment you attached
> ends there, or did the process hang or something?
>
> Thanks-
> sage
>
>> On Jan 17, 2014, at 1:35 AM, Sage Weil <s...@inktank.com> wrote:
>>> Hi Guang,
>>> On Thu, 16 Jan 2014, Guang wrote:
>>>> I still have bad luck figuring out the problem causing the
>>>> authentication failure, so in order to get the cluster back, I tried:
>>>>   1. stop all daemons (mon and osd)
>>>>   2. change the configuration to disable cephx
>>>>   3. start the mon daemons (3 in total)
>>>>   4. start the osd daemons one by one
>>>>
>>>> After finishing step 3, the cluster became reachable ('ceph -s' gives
>>>> results):
>>>>
>>>> -bash-4.1$ sudo ceph -s
>>>>   cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2
>>>>    health HEALTH_WARN 2797 pgs degraded; 107 pgs down; 7503 pgs peering; 917 pgs recovering; 6079 pgs recovery_wait; 2957 pgs stale; 7771 pgs stuck inactive; 2957 pgs stuck stale; 16567 pgs stuck unclean; recovery 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%); 2 near full osd(s); 57/751 in osds are down; noout,nobackfill,norecover,noscrub,nodeep-scrub flag(s) set
>>>>    monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 106022, quorum 0,1,2 osd151,osd152,osd153
>>>>    osdmap e134893: 781 osds: 694 up, 751 in
>>>>     pgmap v2388518: 22203 pgs: 26 inactive, 14 active, 79 stale+active+recovering, 5020 active+clean, 242 stale, 4352 active+recovery_wait, 616 stale+active+clean, 177 active+recovering+degraded, 6714 peering, 925 stale+active+recovery_wait, 86 down+peering, 1547 active+degraded, 32 stale+active+recovering+degraded, 648 stale+peering, 21 stale+down+peering, 239 stale+active+degraded, 651 active+recovery_wait+degraded, 30 remapped+peering, 151 stale+active+recovery_wait+degraded, 4 stale+remapped+peering, 629 active+recovering; 79656 GB data, 363 TB used, 697 TB / 1061 TB avail; 54346804/779462977 degraded (6.972%); 9/259724199 unfound (0.000%)
>>>>    mdsmap e1: 0/0/1 up
>>>>
>>>> (at this point, all OSDs should be down). When I tried to start an OSD
>>>> daemon, the starting script hung, and the hung process is:
>>>>
>>>> root 80497 80496 0 08:18 pts/0 00:00:00 python /usr/bin/ceph --name=osd.22 --keyring=/var/lib/ceph/osd/ceph-22/keyring osd crush create-or-move -- 22 0.40 root=default host=osd173
>>>>
>>>> When I strace the starting script, I got the following trace (process
>>>> 75873 is the above process); it failed with a futex and then went into
>>>> an infinite loop:
>>>>
>>>> select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
>>>>
>>>> Any idea what might
Re: [ceph-users] Ceph cluster is unreachable because of authentication failure
Thanks Sage.

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok mon_status
{ "name": "osd151",
  "rank": 2,
  "state": "electing",
  "election_epoch": 85469,
  "quorum": [],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 1,
      "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0, "name": "osd152", "addr": "10.193.207.130:6789\/0"},
            { "rank": 1, "name": "osd153", "addr": "10.193.207.131:6789\/0"},
            { "rank": 2, "name": "osd151", "addr": "10.194.0.68:6789\/0"}]}}

And:

-bash-4.1$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.osd151.asok quorum_status
{ "election_epoch": 85480,
  "quorum": [0, 1, 2],
  "quorum_names": ["osd151", "osd152", "osd153"],
  "quorum_leader_name": "osd152",
  "monmap": { "epoch": 1,
      "fsid": "b9cb3ea9-e1de-48b4-9e86-6921e2c537d2",
      "modified": "0.000000",
      "created": "0.000000",
      "mons": [
            { "rank": 0, "name": "osd152", "addr": "10.193.207.130:6789\/0"},
            { "rank": 1, "name": "osd153", "addr": "10.193.207.131:6789\/0"},
            { "rank": 2, "name": "osd151", "addr": "10.194.0.68:6789\/0"}]}}

According to the above status, the election has finished and a leader has
been selected.

Thanks,
Guang

On Jan 14, 2014, at 10:55 PM, Sage Weil <s...@inktank.com> wrote:
> On Tue, 14 Jan 2014, GuangYang wrote:
>> Hi ceph-users and ceph-devel,
>> I came across an issue after restarting the monitors of the cluster:
>> authentication fails, which prevents running any ceph command.
>>
>> After we did some maintenance work, I restarted an OSD; however, I found
>> that the OSD would not join the cluster automatically after being
>> restarted, though a TCP dump showed it had already sent a message to the
>> monitor asking to be added into the cluster. So I suspected there might
>> be an issue with the monitors and I restarted them one by one (3 in
>> total). However, after restarting the monitors, all ceph commands fail
>> with an authentication timeout:
>>
>> 2014-01-14 12:00:30.499397 7fc7f195e700 0 monclient(hunting): authenticate timed out after 300
>> 2014-01-14 12:00:30.499440 7fc7f195e700 0 librados: client.admin authentication error (110) Connection timed out
>> Error connecting to cluster: Error
>>
>> Any idea why such an error happened (restarting an OSD results in the
>> same error)? I am thinking the authentication information is persisted
>> on the mon's local disk -- is there a chance that data got corrupted?
>
> That sounds unlikely, but you're right that the core problem is with the
> mons. What does
>
>   ceph daemon mon.`hostname` mon_status
>
> say? Perhaps they are not forming a quorum and that is what is preventing
> authentication.
>
> sage
[ceph-users] Fwd: [rgw - Bug #7073] (New) rgw gc max objs should have a prime number as default value
Hi ceph-users,
After reading through the GC related code, I am thinking of using a much
larger value for "rgw gc max objs" (like 997), and I don't see any side
effect from increasing this value. Did I miss anything?

Thanks,
Guang

Begin forwarded message:

From: redm...@tracker.ceph.com
Subject: [rgw - Bug #7073] (New) rgw gc max objs should have a prime number as default value
Date: December 31, 2013 3:28:53 PM GMT+08:00

Issue #7073 has been reported by Guang Yang.

Bug #7073: rgw gc max objs should have a prime number as default value
Author: Guang Yang
Status: New
Priority: Normal
Source: other
Severity: 3 - minor

Recently, when we troubleshot a latency increase on our Ceph cluster, we
observed that a couple of gc objects were hotspots which slowed down the
entire OSD. After checking the .rgw.gc pool, we found that a couple of gc
objects had tens of thousands of entries while the other gc objects had zero
entries. The problem is that we have a bad default value (32) for
"rgw gc max objs".

The data flow is:
1. Each object has an object ID with the pattern
   {client_id}.{per_request_increasing_number}, for example:
   0_default.4351.24557.
2. For each delete request, it needs to set a gc entry for the object. The
   way it does this is:
   2.1 hash the object ID (tag) to figure out which gc object to use (0 - 31);
   2.2 set two entries in that gc object.

The problem comes from step 2.1: as the default max objs is 32, the hashed
value of each string (object tag) is taken mod 32, which results in an
uneven distribution. It should definitely use a prime number to get an even
distribution.

I wrote a small program to simulate the above:

#include <iostream>
#include <sstream>
#include <string>
using namespace std;

unsigned str_hash(const char* str, unsigned length)
{
  unsigned long hash = 0;
  while (length--) {
    unsigned char c = *str++;
    hash = (hash + (c << 4) + (c >> 4)) * 11;
  }
  return hash;
}

int main()
{
  int gc_old[32] = {0, 0};
  int gc_new[31] = {0, 0};
  string base("0_default.4351.");
  ostringstream os;
  for (int i = 0; i < 10000; ++i) {
    os.clear();
    os << i;
    string tag = base + os.str();
    unsigned n = str_hash(tag.c_str(), tag.size());
    gc_old[n % 32]++;
    gc_new[n % 31]++;
  }
  cout << "with use max objs 32..." << endl;
  for (int i = 0; i < 32; ++i) {
    cout << "gc." << i << " " << gc_old[i] << endl;
  }
  cout << "with use max objs 31..." << endl;
  for (int i = 0; i < 31; ++i) {
    cout << "gc." << i << " " << gc_new[i] << endl;
  }
  return 0;
}

The output of the program is:

with use max objs 32...
gc.0  0
gc.1  0
gc.2  2317
gc.3  58
gc.4  0
gc.5  0
gc.6  68
gc.7  57
gc.8  0
gc.9  0
gc.10 68
gc.11 57
gc.12 0
gc.13 0
gc.14 67
gc.15 57
gc.16 0
gc.17 0
gc.18 2319
gc.19 55
gc.20 0
gc.21 0
gc.22 69
gc.23 57
gc.24 0
gc.25 0
gc.26 4569
gc.27 58
gc.28 0
gc.29 0
gc.30 68
gc.31 56

with use max objs 31...
gc.0  322
gc.1  287
gc.2  307
gc.3  315
gc.4  345
gc.5  333
gc.6  333
gc.7  323
gc.8  297
gc.9  324
gc.10 316
gc.11 354
gc.12 313
gc.13 331
gc.14 314
gc.15 312
gc.16 335
gc.17 320
gc.18 337
gc.19 317
gc.20 316
gc.21 340
gc.22 330
gc.23 322
gc.24 306
gc.25 350
gc.26 332
gc.27 327
gc.28 309
gc.29 292
gc.30 341

In order to avoid the hotspot, we should choose a prime number as the
default value and clearly document that if a user needs to change the value,
he / she should choose a prime number to get better performance.
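If you do raise the value, it is just a ceph.conf setting on the radosgw
side; the client section name below is a placeholder for however your
gateway instance is named, and the verification commands are only there to
confirm that pending gc entries spread out afterwards:

    # ceph.conf on the radosgw host (section name is deployment-specific).
    [client.radosgw.gateway]
        rgw gc max objs = 997   # prime, so tag_hash % rgw_gc_max_objs spreads evenly

    # After restarting radosgw, confirm the gc entries are spread across
    # gc.0 .. gc.996 rather than piling up on a few objects.
    radosgw-admin gc list --include-all | less
    rados -p .rgw.gc ls | head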
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Wido, my comments inline...

> Date: Mon, 30 Dec 2013 14:04:35 +0100
> From: Wido den Hollander <w...@42on.com>
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
>
> On 12/30/2013 12:45 PM, Guang wrote:
>> Hi ceph-users and ceph-devel,
>> Merry Christmas and Happy New Year!
>> We have a ceph cluster with radosgw; our customer uses the S3 API to
>> access the cluster.
>> ...
>> Any idea why a couple of OSDs are so slow that they impact the
>> performance of the entire cluster?
>
> What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a
> performance degradation over time. So if you run btrfs, that might be the
> problem.

[yguang] We are running on xfs; journal and data share the same disk with
different partitions.

> Wido

Thanks,
Guang
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Mark, my comments inline...

> Date: Mon, 30 Dec 2013 07:36:56 -0600
> From: Mark Nelson <mark.nel...@inktank.com>
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
>
> On 12/30/2013 05:45 AM, Guang wrote:
>> Hi ceph-users and ceph-devel,
>> Merry Christmas and Happy New Year!
>> We have a ceph cluster with radosgw; our customer uses the S3 API to
>> access the cluster.
>> ...
>> Any idea why a couple of OSDs are so slow that they impact the
>> performance of the entire cluster?
>
> You may want to use the dump_historic_ops command in the admin socket for
> the slow OSDs. That will give you some clues regarding where the ops are
> hanging up in the OSD. You can also crank the osd debugging way up on that
> node and search through the logs to see if there are any patterns or
> trends (consistent slowness, pauses, etc). It may also be useful to look
> and see if that OSD is pegging CPU and if so attach strace or perf to it
> and see what it's doing.

[yguang] We have a job running dump_historic_ops, but unfortunately it
wasn't running at the time (my bad), and as we are using this as a
pre-production system
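For reference, the commands Mark refers to look roughly like this; the OSD
id and debug levels are placeholders, and the higher debug levels generate a
lot of log volume, so turn them back down afterwards:

    # Show the slowest recent ops on the suspect OSD (run on its host).
    sudo ceph --admin-daemon /var/run/ceph/ceph-osd.598.asok dump_historic_ops

    # Crank up (and later restore) OSD + filestore debugging on that daemon.
    ceph tell osd.598 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    # ... reproduce the slowness, inspect /var/log/ceph/ceph-osd.598.log ...
    ceph tell osd.598 injectargs '--debug-osd 0/5 --debug-filestore 1/5 --debug-ms 0/5'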
[ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Hi ceph-users and ceph-devel, Merry Christmas and Happy New Year! We have a ceph cluster with radosgw, our customer is using S3 API to access the cluster. The basic information of the cluster is: bash-4.1$ ceph -s cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 40, quorum 0,1,2 osd151,osd152,osd153 osdmap e129885: 787 osds: 758 up, 758 in pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 TB avail mdsmap e1: 0/0/1 up #When the latency peak happened, there was no scrubbing, recovering or backfilling at the moment.# While the performance of the cluster (only with WRITE traffic) is stable until Dec 25th, our monitoring (for radosgw access log) shows a significant increase of average latency and 99% latency. And then I chose one OSD and try to grep slow requests logs and find that most of the slow requests were waiting for subop, I take osd22 for example. osd[561-571] are hosted by osd22. -bash-4.1$ for i in {561..571}; do grep slow request ceph-osd.$i.log | grep 2013-12-25 16| grep osd_op | grep -oP \d+,\d+ ; done ~/slow_osd.txt -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort –nr 3586 656,598 289 467,629 284 598,763 279 584,598 203 172,598 182 598,6 155 629,646 83 631,598 65 631,593 21 616,629 20 609,671 20 609,390 13 609,254 12 702,629 12 629,641 11 665,613 11 593,724 11 361,591 10 591,709 9 681,609 9 609,595 9 591,772 8 613,662 8 575,591 7 674,722 7 609,603 6 585,605 5 613,691 5 293,629 4 774,591 4 717,591 4 613,776 4 538,629 4 485,629 3 702,641 3 608,629 3 593,580 3 591,676 It turns out most of the slow requests were waiting for osd 598, 629, I ran the procedure on another host osd22 and got the same pattern. Then I turned to the host having osd598 and dump the perf counter to do comparision. -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done op_latency,subop_latency,total_ops 0.192097526753471,0.0344513450167198,7549045 1.99137797628122,1.42198426157216,9184472 0.198062399664129,0.0387090378926376,6305973 0.621697271315762,0.396549768986993,9726679 29.5222496247375,18.246379615, 10860858 0.229250239525916,0.0557482067611005,8149691 0.208981698303654,0.0375553180438224,6623842 0.47474766302086,0.292583928601509,9838777 0.339477790083925,0.101288409388438,9340212 0.186448840141895,0.0327296517417626,7081410 0.807598201207144,0.0139762289702332,6093531 (osd 598 is op hotspot as well) This double confirmed that osd 598 was having some performance issues (it has around 30 seconds average op latency!). sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the latency difference is not as significant as we saw from osd perf. reads kbread writes kbwrite %busy avgqu await svctm 37.3459.989.8 4106.9 61.8 1.6 12.24.9 42.3545.891.8 4296.3 69.7 2.4 17.65.2 42.0483.893.1 4263.6 68.8 1.8 13.35.1 39.7425.589.4 4327.0 68.5 1.8 14.05.3 Another disk at the same time for comparison (/dev/sdb). reads kbread writes kbwrite %busy avgqu await svctm 34.2502.680.13524.353.4 1.3 11.8 4.7 35.3560.983.73742.056.0 1.2 9.8 4.7 30.4371.5 78.8 3631.452.2 1.7 15.8 4.8 33.0389.4 78.8 3597.6 54.2 1.4 12.14.8 Any idea why a couple of OSDs are so slow that impact the performance of the entire cluster? 
Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
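One quick way to confirm that osd.598 (and 629) really are the hotspots, assuming the cluster can tolerate the temporary rebalance, is to take the suspect OSD out and watch whether the radosgw latency recovers; the commands below are only a sketch:

$ ceph osd out 598
$ ceph -w            # watch recovery progress and the access-log latency
$ ceph osd in 598    # put it back once the comparison is done

Checking dmesg and SMART data for /dev/sdf on that host is also worthwhile, since a single drive with long service times can stall every PG whose primary or replica lands on that OSD.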
Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time
Thanks Wido, my comments inline... Date: Mon, 30 Dec 2013 14:04:35 +0100 From: Wido den Hollander w...@42on.com To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph cluster performance degrade (radosgw) after running some time On 12/30/2013 12:45 PM, Guang wrote: Hi ceph-users and ceph-devel, Merry Christmas and Happy New Year! We have a ceph cluster with radosgw, our customer is using S3 API to access the cluster. The basic information of the cluster is: bash-4.1$ ceph -s cluster b9cb3ea9-e1de-48b4-9e86-6921e2c537d2 health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors monmap e1: 3 mons at {osd151=10.194.0.68:6789/0,osd152=10.193.207.130:6789/0,osd153=10.193.207.131:6789/0}, election epoch 40, quorum 0,1,2 osd151,osd152,osd153 osdmap e129885: 787 osds: 758 up, 758 in pgmap v1884502: 22203 pgs: 22125 active+clean, 1 active+clean+scrubbing, 1 active+clean+inconsistent, 76 active+clean+scrubbing+deep; 96319 GB data, 302 TB used, 762 TB / 1065 TB avail mdsmap e1: 0/0/1 up #When the latency peak happened, there was no scrubbing, recovering or backfilling at the moment.# While the performance of the cluster (only with WRITE traffic) is stable until Dec 25th, our monitoring (for radosgw access log) shows a significant increase of average latency and 99% latency. And then I chose one OSD and try to grep slow requests logs and find that most of the slow requests were waiting for subop, I take osd22 for example. osd[561-571] are hosted by osd22. -bash-4.1$ for i in {561..571}; do grep slow request ceph-osd.$i.log | grep 2013-12-25 16| grep osd_op | grep -oP \d+,\d+ ; done ~/slow_osd.txt -bash-4.1$ cat ~/slow_osd.txt | sort | uniq -c | sort ?nr 3586 656,598 289 467,629 284 598,763 279 584,598 203 172,598 182 598,6 155 629,646 83 631,598 65 631,593 21 616,629 20 609,671 20 609,390 13 609,254 12 702,629 12 629,641 11 665,613 11 593,724 11 361,591 10 591,709 9 681,609 9 609,595 9 591,772 8 613,662 8 575,591 7 674,722 7 609,603 6 585,605 5 613,691 5 293,629 4 774,591 4 717,591 4 613,776 4 538,629 4 485,629 3 702,641 3 608,629 3 593,580 3 591,676 It turns out most of the slow requests were waiting for osd 598, 629, I ran the procedure on another host osd22 and got the same pattern. Then I turned to the host having osd598 and dump the perf counter to do comparision. -bash-4.1$ for i in {594..604}; do sudo ceph --admin-daemon /var/run/ceph/ceph-osd.$i.asok perf dump | ~/do_calc_op_latency.pl; done op_latency,subop_latency,total_ops 0.192097526753471,0.0344513450167198,7549045 1.99137797628122,1.42198426157216,9184472 0.198062399664129,0.0387090378926376,6305973 0.621697271315762,0.396549768986993,9726679 29.5222496247375,18.246379615, 10860858 0.229250239525916,0.0557482067611005,8149691 0.208981698303654,0.0375553180438224,6623842 0.47474766302086,0.292583928601509,9838777 0.339477790083925,0.101288409388438,9340212 0.186448840141895,0.0327296517417626,7081410 0.807598201207144,0.0139762289702332,6093531 (osd 598 is op hotspot as well) This double confirmed that osd 598 was having some performance issues (it has around *30 seconds average op latency*!). sar shows slightly higher disk I/O for osd 598 (/dev/sdf) but the latency difference is not as significant as we saw from osd perf. reads kbread writes kbwrite %busy avgqu await svctm 37.3 459.9 89.8 4106.9 61.8 1.6 12.2 4.9 42.3 545.8 91.8 4296.3 69.7 2.4 17.6 5.2 42.0 483.8 93.1 4263.6 68.8 1.8 13.3 5.1 39.7 425.5 89.4 4327.0 68.5 1.8 14.0 5.3 Another disk at the same time for comparison (/dev/sdb). 
reads kbread writes kbwrite %busy avgqu await svctm 34.2 502.6 80.1 3524.3 53.4 1.3 11.8 4.7 35.3 560.9 83.7 3742.0 56.0 1.2 9.8 4.7 30.4 371.5 78.8 3631.4 52.2 1.7 15.8 4.8 33.0 389.4 78.8 3597.6 54.2 1.4 12.1 4.8 Any idea why a couple of OSDs are so slow that impact the performance of the entire cluster? What filesystem are you using? Btrfs or XFS? Btrfs still suffers from a performance degradation over time. So if you run btrfs, that might be the problem. [yguang] We are running on xfs, journal and data share the same disk with different partitions. Wido Thanks,___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] 'ceph osd reweight' VS 'ceph osd crush reweight'
Hello ceph-users, I am a little bit confused by these two options. I understand that crush reweight determines the weight of the OSD in the CRUSH map, so it impacts I/O and utilization; however, I am confused by the osd reweight option: is that something that controls the I/O distribution across different OSDs on a single host? While looking at the code, I only found that if 'osd weight' is 1 (0x1) it means the osd is up, and if it is 0 it means the osd is down. Please advise... Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
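For illustration, the two commands operate at different levels (osd.24 and the values below are made-up examples):

$ ceph osd crush reweight osd.24 1.82   # changes the OSD's weight in the CRUSH map, i.e. how much data CRUSH maps to it cluster-wide
$ ceph osd reweight 24 0.8              # sets the 0.0-1.0 override weight (1 = fully in, 0 = out) applied on top of CRUSH placement

The second one is typically used to temporarily shed some PGs from an over-full OSD, not to distribute I/O across the OSDs of a single host.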
Re: [ceph-users] Expanding ceph cluster by adding more OSDs
Hi Kyle, Thanks for your response. Though I haven't tested it, my gut feeling is the same: changing the PG number may result in re-shuffling of the data. In terms of the strategy you mentioned to expand a cluster, I have a few questions: 1. By adding a LITTLE more weight each time, my understanding is that the goal is to reduce the load on the OSD being added, is that right? If so, can we use the throttle settings to achieve the same goal? 2. If I would like to expand the cluster every quarter by 30% capacity, with this approach it might take a long time to add the new capacity, is my understanding correct? 3. Is there any automatic tool to do this, or will I need to closely monitor, dump the crush rule, edit it and push it back? I am testing a scenario of adding one OSD at a time (I have 330 OSDs in total), with the default weight. There are a couple of observations: 1) the recovery starts quickly (several hundred MB/s) and then slows down to around 10MB/s; 2) it impacts the online traffic quite a lot (from my observation, mainly on the recovering PGs). I tried to find best practices for expanding a cluster but had no luck; would anybody like to share their experience? Thanks very much. Thanks, Guang
Date: Thu, 10 Oct 2013 05:15:27 -0700 From: Kyle Bader kyle.ba...@gmail.com To: ceph-users@lists.ceph.com ceph-users@lists.ceph.com Subject: Re: [ceph-users] Expanding ceph cluster by adding more OSDs Message-ID: cafmfnwq+hbgsezme3vwom_gqcwikd1393rxc+xb0xgt4nxq...@mail.gmail.com Content-Type: text/plain; charset=utf-8 I've contracted and expanded clusters by up to a rack of 216 OSDs - 18 nodes, 12 drives each. New disks are configured with a CRUSH weight of 0 and I slowly add weight (0.1 to 0.01 increments), wait for the cluster to become active+clean and then add more weight. I was expanding after contraction so my PG count didn't need to be corrected, I tend to be liberal and opt for more PGs. If I hadn't contracted the cluster prior to expanding it I would probably add PGs after all the new OSDs have finished being weighted into the cluster.
On Wed, Oct 9, 2013 at 8:55 PM, Michael Lowe j.michael.l...@gmail.com wrote: I had those same questions, I think the answer I got was that it was better to have too few pg's than to have overloaded osd's. So add osd's then add pg's. I don't know the best increments to grow in, probably depends largely on the hardware in your osd's. Sent from my iPad
On Oct 9, 2013, at 11:34 PM, Guang yguan...@yahoo.com wrote: Thanks Mike. I get your point. There are still a few things confusing me: 1) We expand the Ceph cluster by adding more OSDs, which will trigger re-balancing of PGs across the old and new OSDs, and likely it will break the optimized PG number for the cluster. 2) We can add more PGs, which will trigger re-balancing of objects across the old and new PGs. So: 1) What is the recommended way to expand the cluster by adding OSDs (and potentially adding PGs), should we do them at the same time? 2) What is the recommended way to scale a cluster from like 1PB to 2PB, should we scale it to like 1.1PB to 1.2PB or move to 2PB directly? Thanks, Guang
On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote: There used to be, can't find it right now. Something like 'ceph osd set pg_num num' then 'ceph osd set pgp_num num' to actually move your data into the new pg's. I successfully did it several months ago, when bobtail was current. Sent from my iPad
On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote: Thanks Mike. Is there any documentation for that?
Thanks, Guang On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote: You can add PGs, the process is called splitting. I don't think PG merging, the reduction in the number of PGs, is ready yet. On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote: Hi ceph-users, Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per my understanding, the number of PGs for a pool should be fixed even we scale out / in the cluster by adding / removing OSDs, does that mean if we double the OSD numbers, the PG number for a pool is not optimal any more and there is no chance to correct it? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
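For reference, a sketch of the incremental weight-in approach Kyle describes (osd.330, the host name and the target weight are hypothetical; adjust the increment to your cluster):

$ ceph osd crush add osd.330 0 host=osd31      # new OSD starts with CRUSH weight 0
$ ceph osd crush reweight osd.330 0.1          # add a little weight
$ ceph health                                  # wait until everything is active+clean, then repeat
  ...
$ ceph osd crush reweight osd.330 1.0          # final weight reached

The PG split Michael mentions is done per pool, e.g. 'ceph osd pool set <pool> pg_num <num>' followed by 'ceph osd pool set <pool> pgp_num <num>', once the new OSDs are fully weighted in.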
[ceph-users] Adding a new OSD crash the monitors (assertion failure)
Hi all, Today I tried to add a new OSD into the cluster and immediately it get the monitors crashed. Platform: RHEL6.4 Steps to add new monitor: 1. sudo ceph-disk zap /dev/sdh 2. sudo ceph-disk activate /dev/sdh Then the monitor got crashed with the following logs: 013-10-30 02:17:14.252726 7f44395a9700 0 mon.ceph2@0(leader) e2 handle_command mon_command({prefix: osd crush create-or-move, args: [root=default, host=ceph8], id: 24, weight: 0.40998} v 0) v1 2013-10-30 02:17:14.252792 7f44395a9700 1 mon.ceph2@0(leader).paxos(paxos active c 322285..323030) is_readable now=2013-10-30 02:17:14.252794 lease_expire=2013-10-30 02:17:19.063672 has v0 lc 323030 2013-10-30 02:17:14.252911 7f44395a9700 0 mon.ceph2@0(leader).osd e916 create-or-move crush item name 'osd.24' initial_weight 0.41 at location {host=ceph8,root=default} 2013-10-30 02:17:14.255347 7f44395a9700 -1 crush/CrushWrapper.cc: In function 'int CrushWrapper::insert_item(CephContext*, int, float, std::string, const std::mapstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::basic_stringchar, std::char_traitschar, std::allocatorchar , std::lessstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::allocatorstd::pairconst std::basic_stringchar, std::char_traitschar, std::allocatorchar , std::basic_stringchar, std::char_traitschar, std::allocatorchar)' thread 7f44395a9700 time 2013-10-30 02:17:14.253030 crush/CrushWrapper.cc: 413: FAILED assert(!r) ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a) 1: (CrushWrapper::insert_item(CephContext*, int, float, std::string, std::mapstd::string, std::string, std::lessstd::string, std::allocatorstd::pairstd::string const, std::string const)+0x14b4) [0x6b9514] 2: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::string, std::mapstd::string, std::string, std::lessstd::string, std::allocatorstd::pairstd::string const, std::string const)+0x2d6) [0x6ba0f6] 3: (OSDMonitor::prepare_command(MMonCommand*)+0x150a) [0x5aa89a] 4: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x20b) [0x5b2e2b] 5: (PaxosService::dispatch(PaxosServiceMessage*)+0xa20) [0x58bea0] 6: (Monitor::handle_command(MMonCommand*)+0xdec) [0x557ddc] 7: (Monitor::_ms_dispatch(Message*)+0xc2f) [0x5600af] 8: (Monitor::handle_forward(MForward*)+0x990) [0x55f0c0] 9: (Monitor::_ms_dispatch(Message*)+0xd53) [0x5601d3] 10: (Monitor::ms_dispatch(Message*)+0x32) [0x578742] 11: (DispatchQueue::entry()+0x5a2) [0x7bdcc2] 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b690d] 13: /lib64/libpthread.so.0() [0x3208a07851] 14: (clone()+0x6d) [0x32086e890d] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Anyone else came across the same issue? Or am I missing anything when add a new OSD? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding a new OSD crash the monitors (assertion failure)
I just found the trick.. When I am using a default crush, which use straw bucket type, things are good. However, for the error I posted below, it is using tree bucket type. Is it related? Thanks, Guang On Oct 30, 2013, at 6:52 PM, Guang wrote: Hi all, Today I tried to add a new OSD into the cluster and immediately it get the monitors crashed. Platform: RHEL6.4 Steps to add new monitor: 1. sudo ceph-disk zap /dev/sdh 2. sudo ceph-disk activate /dev/sdh Then the monitor got crashed with the following logs: 013-10-30 02:17:14.252726 7f44395a9700 0 mon.ceph2@0(leader) e2 handle_command mon_command({prefix: osd crush create-or-move, args: [root=default, host=ceph8], id: 24, weight: 0.40998} v 0) v1 2013-10-30 02:17:14.252792 7f44395a9700 1 mon.ceph2@0(leader).paxos(paxos active c 322285..323030) is_readable now=2013-10-30 02:17:14.252794 lease_expire=2013-10-30 02:17:19.063672 has v0 lc 323030 2013-10-30 02:17:14.252911 7f44395a9700 0 mon.ceph2@0(leader).osd e916 create-or-move crush item name 'osd.24' initial_weight 0.41 at location {host=ceph8,root=default} 2013-10-30 02:17:14.255347 7f44395a9700 -1 crush/CrushWrapper.cc: In function 'int CrushWrapper::insert_item(CephContext*, int, float, std::string, const std::mapstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::basic_stringchar, std::char_traitschar, std::allocatorchar , std::lessstd::basic_stringchar, std::char_traitschar, std::allocatorchar , std::allocatorstd::pairconst std::basic_stringchar, std::char_traitschar, std::allocatorchar , std::basic_stringchar, std::char_traitschar, std::allocatorchar )' thread 7f44395a9700 time 2013-10-30 02:17:14.253030 crush/CrushWrapper.cc: 413: FAILED assert(!r) ceph version 0.67.3 (408cd61584c72c0d97b774b3d8f95c6b1b06341a) 1: (CrushWrapper::insert_item(CephContext*, int, float, std::string, std::mapstd::string, std::string, std::lessstd::string, std::allocatorstd::pairstd::string const, std::string const)+0x14b4) [0x6b9514] 2: (CrushWrapper::create_or_move_item(CephContext*, int, float, std::string, std::mapstd::string, std::string, std::lessstd::string, std::allocatorstd::pairstd::string const, std::string const)+0x2d6) [0x6ba0f6] 3: (OSDMonitor::prepare_command(MMonCommand*)+0x150a) [0x5aa89a] 4: (OSDMonitor::prepare_update(PaxosServiceMessage*)+0x20b) [0x5b2e2b] 5: (PaxosService::dispatch(PaxosServiceMessage*)+0xa20) [0x58bea0] 6: (Monitor::handle_command(MMonCommand*)+0xdec) [0x557ddc] 7: (Monitor::_ms_dispatch(Message*)+0xc2f) [0x5600af] 8: (Monitor::handle_forward(MForward*)+0x990) [0x55f0c0] 9: (Monitor::_ms_dispatch(Message*)+0xd53) [0x5601d3] 10: (Monitor::ms_dispatch(Message*)+0x32) [0x578742] 11: (DispatchQueue::entry()+0x5a2) [0x7bdcc2] 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x7b690d] 13: /lib64/libpthread.so.0() [0x3208a07851] 14: (clone()+0x6d) [0x32086e890d] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Anyone else came across the same issue? Or am I missing anything when add a new OSD? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
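One way to check which bucket algorithm a CRUSH map is actually using (file names here are arbitrary) is to dump and decompile it:

$ ceph osd getcrushmap -o /tmp/crushmap
$ crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
$ grep alg /tmp/crushmap.txt      # shows straw / tree / list / uniform for each bucket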
Re: [ceph-users] Rados bench result when increasing OSDs
Hi Mark, Greg and Kyle, Sorry to response this late, and thanks for providing the directions for me to look at. We have exact the same setup for OSD, pool replica (and even I tried to create the same number of PGs within the small cluster), however, I can still reproduce this constantly. This is the command I run: $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write With 24 OSDs: Average Latency: 0.00494123 Max latency: 0.511864 Min latency: 0.002198 With 330 OSDs: Average Latency:0.00913806 Max latency: 0.021967 Min latency: 0.005456 In terms of the crush rule, we are using the default one, for the small cluster, it has 3 OSD hosts (11 + 11 + 2), for the large cluster, we have 30 OSD hosts (11 * 30). I have a couple of questions: 1. Is it possible that latency is due to that we have only three layer hierarchy? like root - host - OSD, and as we are using the Straw (by default) bucket type, which has O(N) speed, and if host number increase, so that the computation actually increase. I suspect not as the computation is in the order of microseconds per my understanding. 2. Is it possible because we have more OSDs, the cluster will need to maintain far more connections between OSDs which potentially slow things down? 3. Anything else i might miss? Thanks all for the constant help. Guang 在 2013-10-22,下午10:22,Guang Yang yguan...@yahoo.com 写道: Hi Kyle and Greg, I will get back to you with more details tomorrow, thanks for the response. Thanks, Guang 在 2013-10-22,上午9:37,Kyle Bader kyle.ba...@gmail.com 写道: Besides what Mark and Greg said it could be due to additional hops through network devices. What network devices are you using, what is the network topology and does your CRUSH map reflect the network topology? On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com wrote: Dear ceph-users, Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), it showed the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs), the average latency was around 7ms (object size 5K), twice comparing the small cluster. The OSD within the two cluster have the same configuration, SAS disk, and two partitions for one disk, one for journal and the other for metadata. For PG numbers, the small cluster tested with the pool having 100 PGs, and for the large cluster, the pool has 4 PGs (as I will to further scale the cluster, so I choose a much large PG). Does my test result make sense? Like when the PG number and OSD increase, the latency might drop? Besides what Mark said, can you describe your test in a little more detail? Writing/reading, length of time, number of objects, etc. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rados bench result when increasing OSDs
Thanks Mark. I cannot connect to my hosts, I will do the check and get back to you tomorrow. Thanks, Guang 在 2013-10-24,下午9:47,Mark Nelson mark.nel...@inktank.com 写道: On 10/24/2013 08:31 AM, Guang Yang wrote: Hi Mark, Greg and Kyle, Sorry to response this late, and thanks for providing the directions for me to look at. We have exact the same setup for OSD, pool replica (and even I tried to create the same number of PGs within the small cluster), however, I can still reproduce this constantly. This is the command I run: $ rados bench -p perf_40k_PG -b 5000 -t 3 --show-time 10 write With 24 OSDs: Average Latency: 0.00494123 Max latency: 0.511864 Min latency: 0.002198 With 330 OSDs: Average Latency:0.00913806 Max latency: 0.021967 Min latency: 0.005456 In terms of the crush rule, we are using the default one, for the small cluster, it has 3 OSD hosts (11 + 11 + 2), for the large cluster, we have 30 OSD hosts (11 * 30). I have a couple of questions: 1. Is it possible that latency is due to that we have only three layer hierarchy? like root - host - OSD, and as we are using the Straw (by default) bucket type, which has O(N) speed, and if host number increase, so that the computation actually increase. I suspect not as the computation is in the order of microseconds per my understanding. I suspect this is very unlikely as well. 2. Is it possible because we have more OSDs, the cluster will need to maintain far more connections between OSDs which potentially slow things down? One thing here that might be very interesting is this: After you run your tests, if you do something like: find /var/run/ceph/*.asok -maxdepth 1 -exec sudo ceph --admin-daemon {} dump_historic_ops \; foo on each OSD server, you will get a dump of the 10 slowest operations over the last 10 minutes for each OSD on each server, and it will tell you were in each OSD operations were backing up. You can sort of search through these files by greping for duration first, looking for the long ones, and then going back and searching through the file for those long durations and looking at the associated latencies. Something I have been investigating recently is time spent waiting for osdmap propagation. It's something I haven't had time to dig into meaningfully, but if we were to see that this was more significant on your larger cluster vs your smaller one, that would be very interesting news. 3. Anything else i might miss? Thanks all for the constant help. Guang 在 2013-10-22,下午10:22,Guang Yang yguan...@yahoo.com mailto:yguan...@yahoo.com 写道: Hi Kyle and Greg, I will get back to you with more details tomorrow, thanks for the response. Thanks, Guang 在 2013-10-22,上午9:37,Kyle Bader kyle.ba...@gmail.com mailto:kyle.ba...@gmail.com 写道: Besides what Mark and Greg said it could be due to additional hops through network devices. What network devices are you using, what is the network topology and does your CRUSH map reflect the network topology? On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com mailto:g...@inktank.com wrote: On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com mailto:yguan...@yahoo.com wrote: Dear ceph-users, Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), it showed the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs), the average latency was around 7ms (object size 5K), twice comparing the small cluster. 
The OSD within the two cluster have the same configuration, SAS disk, and two partitions for one disk, one for journal and the other for metadata. For PG numbers, the small cluster tested with the pool having 100 PGs, and for the large cluster, the pool has 4 PGs (as I will to further scale the cluster, so I choose a much large PG). Does my test result make sense? Like when the PG number and OSD increase, the latency might drop? Besides what Mark said, can you describe your test in a little more detail? Writing/reading, length of time, number of objects, etc. -Greg Software Engineer #42 @ http://inktank.com http://inktank.com/ | http://ceph.com http://ceph.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
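A sketch of the loop Mark describes, with the output captured per OSD and the longest durations pulled out (paths are arbitrary and assume the default admin socket location):

$ for s in /var/run/ceph/ceph-osd.*.asok; do sudo ceph --admin-daemon $s dump_historic_ops > /tmp/$(basename $s).ops; done
$ grep -h duration /tmp/*.ops | sort -n -k2 | tail     # the slowest of the slow ops

Then search the per-OSD files for those durations to see which stage (queueing, journal, sub-op, osdmap wait) accounted for the time.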
Re: [ceph-users] Rados bench result when increasing OSDs
Thanks Mark for the response. My comments inline... From: Mark Nelson mark.nel...@inktank.com To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Rados bench result when increasing OSDs Message-ID: 52653b49.8090...@inktank.com Content-Type: text/plain; charset=ISO-8859-1; format=flowed On 10/21/2013 09:13 AM, Guang Yang wrote: Dear ceph-users, Hi! Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), it showed the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs), the average latency was around 7ms (object size 5K), twice comparing the small cluster. Did you have the same number of concurrent requests going? [yguang] Yes. I run the test with 3 or 5 concurrent request, that does not change the result. The OSD within the two cluster have the same configuration, SAS disk, and two partitions for one disk, one for journal and the other for metadata. For PG numbers, the small cluster tested with the pool having 100 PGs, and for the large cluster, the pool has 4 PGs (as I will to further scale the cluster, so I choose a much large PG). Forgive me if this is a silly question, but were the pools using the same level of replication? [yguang] Yes, both have 3 replicas. Does my test result make sense? Like when the PG number and OSD increase, the latency might drop? You wouldn't necessarily expect a larger cluster to show higher latency if the nodes, pools, etc were all configured exactly the same, especially if you were using the same amount of concurrency. It's possible that you have some slow drives on the larger cluster that could be causing the average latency to increase. If there are more disks per node, that could do it too. [yguang] Glad to know this :) I will need to gather more information in terms of if there is any slow disk, will get back on this. Are there any other differences you can think of? [yguang] Another difference is, for the large cluster, as we expect to scale it to more than a thousand OSDs, we have a large PG number (4) pre-created. Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rados bench result when increasing OSDs
Hi Kyle and Greg, I will get back to you with more details tomorrow, thanks for the response. Thanks, Guang 在 2013-10-22,上午9:37,Kyle Bader kyle.ba...@gmail.com 写道: Besides what Mark and Greg said it could be due to additional hops through network devices. What network devices are you using, what is the network topology and does your CRUSH map reflect the network topology? On Oct 21, 2013 9:43 AM, Gregory Farnum g...@inktank.com wrote: On Mon, Oct 21, 2013 at 7:13 AM, Guang Yang yguan...@yahoo.com wrote: Dear ceph-users, Recently I deployed a ceph cluster with RadosGW, from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), it showed the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs), the average latency was around 7ms (object size 5K), twice comparing the small cluster. The OSD within the two cluster have the same configuration, SAS disk, and two partitions for one disk, one for journal and the other for metadata. For PG numbers, the small cluster tested with the pool having 100 PGs, and for the large cluster, the pool has 4 PGs (as I will to further scale the cluster, so I choose a much large PG). Does my test result make sense? Like when the PG number and OSD increase, the latency might drop? Besides what Mark said, can you describe your test in a little more detail? Writing/reading, length of time, number of objects, etc. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Rados bench result when increasing OSDs
Dear ceph-users, Recently I deployed a ceph cluster with RadosGW, going from a small one (24 OSDs) to a much bigger one (330 OSDs). When using rados bench to test the small cluster (24 OSDs), the average latency was around 3ms (object size is 5K), while for the larger one (330 OSDs) the average latency was around 7ms (object size 5K), twice that of the small cluster. The OSDs within the two clusters have the same configuration: SAS disks, with two partitions per disk, one for the journal and the other for metadata. For PG numbers, the small cluster was tested with a pool having 100 PGs, and for the large cluster the pool has 4 PGs (as I will further scale the cluster, I chose a much larger PG count). Does my test result make sense? That is, when the PG number and OSD count increase, might the performance drop (higher latency)? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph-deploy zap disk failure
Thanks all for the recommendation. I worked around by modifying the ceph-deploy by giving and full path for sgdisk. Thanks, Guang 在 2013-10-16,下午10:47,Alfredo Deza alfredo.d...@inktank.com 写道: On Tue, Oct 15, 2013 at 9:19 PM, Guang yguan...@yahoo.com wrote: -bash-4.1$ which sgdisk /usr/sbin/sgdisk Which path does ceph-deploy use? That is unexpected... these are the paths that ceph-deploy uses: '/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin' So `/usr/sbin/` is there. I believe this is a case where $PATH gets altered because of sudo (resetting the env variable). This should be fixed in the next release. In the meantime, you could set the $PATH for non-interactive sessions (which is what ceph-deploy does) for all users. I *think* that would be in `/etc/profile` Thanks, Guang On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote: On Tue, Oct 15, 2013 at 10:52 AM, Guang yguan...@yahoo.com wrote: Hi ceph-users, I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a new issue: -bash-4.1$ ceph-deploy --version 1.2.7 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap server:/dev/sdb [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from remote host [ceph_deploy.osd][INFO ] Distro info: Red Hat Enterprise Linux Server 6.4 Santiago [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device [osd2.ceph.mobstor.bf1.yahoo.com][INFO ] Running command: sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found While I run disk zap on the host directly, it can work without issues. Anyone meet the same issue? Can you run `which sgdisk` on that host? I want to make sure this is not a $PATH problem. ceph-deploy tries to use the proper path remotely but it could be that this one is not there. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
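For anyone hitting the same thing, an alternative to patching ceph-deploy (only a sketch; whether it applies depends on how sudo is configured on the OSD hosts) is to make sure /usr/sbin is part of the PATH that sudo uses for non-interactive commands, e.g. via visudo:

$ sudo visudo
  # add or extend the line:
  Defaults secure_path = "/sbin:/bin:/usr/sbin:/usr/bin:/usr/local/sbin:/usr/local/bin"

so that remotely invoked commands like sgdisk resolve without a full path.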
Re: [ceph-users] ceph-deploy zap disk failure
-bash-4.1$ which sgdisk /usr/sbin/sgdisk Which path does ceph-deploy use? Thanks, Guang On Oct 15, 2013, at 11:15 PM, Alfredo Deza wrote: On Tue, Oct 15, 2013 at 10:52 AM, Guang yguan...@yahoo.com wrote: Hi ceph-users, I am trying with the new ceph-deploy utility on RHEL6.4 and I came across a new issue: -bash-4.1$ ceph-deploy --version 1.2.7 -bash-4.1$ ceph-deploy disk zap server:/dev/sdb [ceph_deploy.cli][INFO ] Invoked (1.2.7): /usr/bin/ceph-deploy disk zap server:/dev/sdb [ceph_deploy.osd][DEBUG ] zapping /dev/sdb on server [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] detect platform information from remote host [ceph_deploy.osd][INFO ] Distro info: Red Hat Enterprise Linux Server 6.4 Santiago [osd2.ceph.mobstor.bf1.yahoo.com][DEBUG ] zeroing last few blocks of device [osd2.ceph.mobstor.bf1.yahoo.com][INFO ] Running command: sudo sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb [osd2.ceph.mobstor.bf1.yahoo.com][ERROR ] sudo: sgdisk: command not found While I run disk zap on the host directly, it can work without issues. Anyone meet the same issue? Can you run `which sgdisk` on that host? I want to make sure this is not a $PATH problem. ceph-deploy tries to use the proper path remotely but it could be that this one is not there. Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph stats and monitoring
Hi, Can someone share their experience with monitoring a Ceph cluster? How is the work mentioned here going: http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/ceph_stats_and_monitoring_tools Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Expanding ceph cluster by adding more OSDs
Thanks Mike. Is there any documentation for that? Thanks, Guang On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote: You can add PGs, the process is called splitting. I don't think PG merging, the reduction in the number of PGs, is ready yet. On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote: Hi ceph-users, Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per my understanding, the number of PGs for a pool should be fixed even we scale out / in the cluster by adding / removing OSDs, does that mean if we double the OSD numbers, the PG number for a pool is not optimal any more and there is no chance to correct it? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Expanding ceph cluster by adding more OSDs
Thanks Mike. I get your point. There are still a few things confusing me: 1) We expand Ceph cluster by adding more OSDs, which will trigger re-balance PGs across the old new OSDs, and likely it will break the optimized PG numbers for the cluster. 2) We can add more PGs which will trigger re-balance objects across old new PGs. So: 1) What is the recommended way to expand the cluster by adding OSDs (and potentially adding PGs), should we do them at the same time? 2) What is the recommended way to scale a cluster from like 1PB to 2PB, should we scale it to like 1.1PB to 1.2PB or move to 2PB directly? Thanks, Guang On Oct 10, 2013, at 11:10 AM, Michael Lowe wrote: There used to be, can't find it right now. Something like 'ceph osd set pg_num num' then 'ceph osd set pgp_num num' to actually move your data into the new pg's. I successfully did it several months ago, when bobtail was current. Sent from my iPad On Oct 9, 2013, at 10:30 PM, Guang yguan...@yahoo.com wrote: Thanks Mike. Is there any documentation for that? Thanks, Guang On Oct 9, 2013, at 9:58 PM, Mike Lowe wrote: You can add PGs, the process is called splitting. I don't think PG merging, the reduction in the number of PGs, is ready yet. On Oct 8, 2013, at 11:58 PM, Guang yguan...@yahoo.com wrote: Hi ceph-users, Ceph recommends the PGs number of a pool is (100 * OSDs) / Replicas, per my understanding, the number of PGs for a pool should be fixed even we scale out / in the cluster by adding / removing OSDs, does that mean if we double the OSD numbers, the PG number for a pool is not optimal any more and there is no chance to correct it? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph monitoring / stats and troubleshooting tools
Hi ceph-users, After walking through the operations document, I still have several questions in terms of operation / monitoring for ceph, for which I need your help. Thanks! 1. Does ceph provide a built-in monitoring mechanism for Rados and RadosGW? Taking Rados for example, is it possible to monitor the health / latency / storage on a regular basis and ideally have a web UI? 2. One common troubleshooting requirement would be: given an object name, how can we locate the PG / OSD / physical file path for that object? Does Ceph provide such a utility? Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
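For question 2, the mapping from object name to PG and OSDs can be queried directly (pool and object names below are made up, and the output shown is only illustrative):

$ ceph osd map mypool myobject
osdmap e1234 pool 'mypool' (3) object 'myobject' -> pg 3.7a4679f2 (3.f2) -> up [598,629,12] acting [598,629,12]

On the primary OSD's host the backing file then lives under that OSD's data directory, e.g. /var/lib/ceph/osd/ceph-598/current/3.f2_head/.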
[ceph-users] Expanding ceph cluster by adding more OSDs
Hi ceph-users, Ceph recommends that the PG number of a pool be (100 * OSDs) / Replicas. Per my understanding, the number of PGs for a pool stays fixed even as we scale the cluster out / in by adding / removing OSDs; does that mean that if we double the number of OSDs, the PG number for the pool is no longer optimal and there is no way to correct it? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
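As a worked example of the guideline: with 330 OSDs and 3 replicas, (100 * 330) / 3 = 11000 PGs, which is commonly rounded up to the next power of two (16384); the question is what happens to that target once the OSD count changes.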
[ceph-users] ceph-deploy issues on RHEL6.4
Hi ceph-users, I recently deployed a ceph cluster with use of *ceph-deploy* utility, on RHEL6.4, during the time, I came across a couple of issues / questions which I would like to ask for your help. 1. ceph-deploy does not help to install dependencies (snappy leveldb gdisk python-argparse gperftools-libs) on the target host, so I will need to manually install those dependencies before performing 'ceph-deploy install {host_name}'. I am investigate the way to deploy ceph onto a hundred nodes and it is time-consuming to manually install those dependencies manually. Am I missing something here? I am thinking the dependency installation should be handled by *ceph-deploy* itself. 2. When performing 'ceph-deploy -v disk zap ceph.host.name:/dev/sdb', I have the following errors: [ceph_deploy.osd][DEBUG ] zapping /dev/sdc on ceph.host.name [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo Traceback (most recent call last): File /usr/bin/ceph-deploy, line 21, in module sys.exit(main()) File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 83, in newfunc return f(*a, **kw) File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main return args.func(args) File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 381, in disk disk_zap(args) File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 317, in disk_zap zap_r(disk) File /usr/lib/python2.6/site-packages/pushy/protocol/proxy.py, line 255, in lambda (conn.operator(type_, self, args, kwargs)) File /usr/lib/python2.6/site-packages/pushy/protocol/connection.py, line 66, in operator return self.send_request(type_, (object, args, kwargs)) File /usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py, line 329, in send_request return self.__handle(m) File /usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py, line 645, in __handle raise e pushy.protocol.proxy.ExceptionProxy: [Errno 2] No such file or directory And then I logon to the host to perform 'ceph-disk zap /dev/sdb' and it can be successful without any issues. 3. When performing 'ceph-deploy -v disk activate ceph.host.name:/dev/sdb', I have the following errors: ceph_deploy.osd][DEBUG ] Activating cluster ceph disks ceph.host.name:/dev/sdb: [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo [ceph_deploy.osd][DEBUG ] Activating host ceph.host.name disk /dev/sdb [ceph_deploy.osd][DEBUG ] Distro RedHatEnterpriseServer codename Santiago, will use sysvinit Traceback (most recent call last): File /usr/bin/ceph-deploy, line 21, in module sys.exit(main()) File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 83, in newfunc return f(*a, **kw) File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main return args.func(args) File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 379, in disk activate(args, cfg) File /usr/lib/python2.6/site-packages/ceph_deploy/osd.py, line 271, in activate cmd=cmd, ret=ret, out=out, err=err) NameError: global name 'ret' is not defined Also, I logon to the host to perform 'ceph-disk activate /dev/sdb' and it is good. Any help is appreciated. Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph deployment issue in physical hosts
Hi ceph-users, I deployed a cluster successfully in VMs, and today I tried to deploy a cluster in physical nodes. However, I came across a problem when I started creating a monitor. -bash-4.1$ ceph-deploy mon create x [ceph_deploy.mon][DEBUG ] Deploying mon, cluster ceph hosts [ceph_deploy.mon][DEBUG ] detecting platform for host web2 ... [ceph_deploy.sudo_pushy][DEBUG ] will use a remote connection with sudo [ceph_deploy.mon][INFO ] distro info: RedHatEnterpriseServer 6.4 Santiago [web2][DEBUG ] determining if provided host has same hostname in remote [web2][DEBUG ] deploying mon to web2 [web2][DEBUG ] remote hostname: web2 [web2][INFO ] write cluster configuration to /etc/ceph/{cluster}.conf [web2][DEBUG ] checking for done path: /var/lib/ceph/mon/ceph-web2/done [web2][INFO ] create a done file to avoid re-doing the mon deployment [web2][INFO ] create the init path if it does not exist [web2][INFO ] locating `service` executable... [web2][INFO ] found `service` executable: /sbin/service ssh: Could not resolve hostname web2: Name or service not known Traceback (most recent call last): File /usr/bin/ceph-deploy, line 21, in module sys.exit(main()) File /usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py, line 83, in newfunc return f(*a, **kw) File /usr/lib/python2.6/site-packages/ceph_deploy/cli.py, line 147, in main return args.func(args) File /usr/lib/python2.6/site-packages/ceph_deploy/mon.py, line 246, in mon mon_create(args) File /usr/lib/python2.6/site-packages/ceph_deploy/mon.py, line 105, in mon_create distro.mon.create(distro, rlogger, args, monitor_keyring) File /usr/lib/python2.6/site-packages/ceph_deploy/hosts/centos/mon/create.py, line 15, in create rconn = get_connection(hostname, logger) File /usr/lib/python2.6/site-packages/ceph_deploy/connection.py, line 13, in get_connection sudo=needs_sudo(), File /usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/connection.py, line 12, in __init__ self.gateway = execnet.makegateway('ssh=%s' % hostname) File /usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/multi.py, line 89, in makegateway gw = gateway_bootstrap.bootstrap(io, spec) File /usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_bootstrap.py, line 70, in bootstrap bootstrap_ssh(io, spec) File /usr/lib/python2.6/site-packages/ceph_deploy/lib/remoto/lib/execnet/gateway_bootstrap.py, line 42, in bootstrap_ssh raise HostNotFound(io.remoteaddress) execnet.gateway_bootstrap.HostNotFound: web2 Does anyone come across the same issue? Looks like I mis-configured the network environment? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph deployment issue in physical hosts
Thanks Wolfgang. -bash-4.1$ ping web2 PING web2 (10.193.244.209) 56(84) bytes of data. 64 bytes from web2 (10.193.244.209): icmp_seq=1 ttl=64 time=0.505 ms 64 bytes from web2 (10.193.244.209): icmp_seq=2 ttl=64 time=0.194 ms ... [I omit part of the host name]. It can ping to the host and I actually used ceph-deploy to install ceph onto the web2 remote host… Thanks, Guang Date: Wed, 25 Sep 2013 10:29:14 +0200 From: Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph deployment issue in physical hosts Message-ID: 52429eda.8070...@risc-software.at Content-Type: text/plain; charset=ISO-8859-1 On 09/25/2013 10:03 AM, Guang wrote: Hi ceph-users, I deployed a cluster successfully in VMs, and today I tried to deploy a cluster in physical nodes. However, I came across a problem when I started creating a monitor. -bash-4.1$ ceph-deploy mon create x ssh: Could not resolve hostname web2: Name or service not known Does anyone come across the same issue? Looks like I mis-configured the network environment? The machine you run ceph-deploy on doesn't know who web2 is. If this command succeeds: ping web2 then ceph deploy will at least be able to contact that host. hint: look at your /etc/hosts file. Thanks, Guang Wolfgang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph deployment issue in physical hosts
Thanks for the reply! I don't know the reason, but I work-around this issue by add a new entry in the /etc/hosts with something like 'web2 {id_address_of_web2}' and it can work. I am not sure if that is due to some mis-config by my end of the deployment script, will further investigate. Thanks all for the help! Guang On Sep 25, 2013, at 8:38 PM, Alfredo Deza wrote: On Wed, Sep 25, 2013 at 5:08 AM, Guang yguan...@yahoo.com wrote: Thanks Wolfgang. -bash-4.1$ ping web2 PING web2 (10.193.244.209) 56(84) bytes of data. 64 bytes from web2 (10.193.244.209): icmp_seq=1 ttl=64 time=0.505 ms 64 bytes from web2 (10.193.244.209): icmp_seq=2 ttl=64 time=0.194 ms ... [I omit part of the host name]. It can ping to the host and I actually used ceph-deploy to install ceph onto the web2 remote host… This is very unexpected, it most definitely sounds like at some point web2 is not resolvable (as the error says) but you are also right in that you initiate the deployment correctly with ceph-deploy doing work on the remote end. Are you able to SSH directly to this host from where you are executing ceph-deploy? (same user/login) Thanks, Guang Date: Wed, 25 Sep 2013 10:29:14 +0200 From: Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph deployment issue in physical hosts Message-ID: 52429eda.8070...@risc-software.at Content-Type: text/plain; charset=ISO-8859-1 On 09/25/2013 10:03 AM, Guang wrote: Hi ceph-users, I deployed a cluster successfully in VMs, and today I tried to deploy a cluster in physical nodes. However, I came across a problem when I started creating a monitor. -bash-4.1$ ceph-deploy mon create x ssh: Could not resolve hostname web2: Name or service not known Does anyone come across the same issue? Looks like I mis-configured the network environment? The machine you run ceph-deploy on doesn't know who web2 is. If this command succeeds: ping web2 then ceph deploy will at least be able to contact that host. hint: look at your /etc/hosts file. Thanks, Guang Wolfgang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph / RadosGW deployment questions
Hi ceph-users, I deployed a Ceph cluster (including RadosGW) with use of ceph-deploy on RHEL6.4, during the deployment, I have a couple of questions which need your help. 1. I followed the steps http://ceph.com/docs/master/install/rpm/ to deploy the RadosGW node, however, after the deployment, all requests failed with 500 returned. With some hints from http://irclogs.ceph.widodh.nl/index.php?date=2013-01-25, I changed the FastCgiExternalServer to FastCgiServer within rgw.conf. Is this change valid or I missed somewhere else which leads the need for this change? 2. It still does not work and the httpd has the following error log: [Mon Sep 23 07:34:32 2013] [crit] (98)Address already in use: FastCGI: can't create server /var/www/s3gw.fcgi: bind() failed [/tmp/radosgw.sock] which indicates that radosgw is not started properly, so that I manually run radosgw --rgw-socket-path=/tmp/radosgw.sock -c /etc/ceph/ceph.conf -n client.radosgw.gateway to start a radosgw daemon and then the gateway starts working as expected. Did I miss anything this part? 3. When I was trying to run ceph admin-daemon command on the radosGW host, it failed because it does not have the corresponding asok file, however, I am able to run the command on monitor host and found that the radosGW's information can be retrieved there. @monitor (monitor and gateway are deployed on different hosts). [xxx@startbart ceph]$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.startbart.asok config show | grep rgw rgw: 1\/5, rgw_data: \/var\/lib\/ceph\/radosgw\/ceph-startbart, rgw_enable_apis: s3, swift, swift_auth, admin, rgw_cache_enabled: true, rgw_cache_lru_size: 1, rgw_socket_path: , rgw_host: , rgw_port: , rgw_dns_name: , rgw_script_uri: , rgw_request_uri: , rgw_swift_url: , rgw_swift_url_prefix: swift, rgw_swift_auth_url: , rgw_swift_auth_entry: auth, rgw_keystone_url: , rgw_keystone_admin_token: , rgw_keystone_accepted_roles: Member, admin, rgw_keystone_token_cache_size: 1, rgw_keystone_revocation_interval: 900, rgw_admin_entry: admin, rgw_enforce_swift_acls: true, rgw_swift_token_expiration: 86400, rgw_print_continue: true, rgw_remote_addr_param: REMOTE_ADDR, rgw_op_thread_timeout: 600, rgw_op_thread_suicide_timeout: 0, rgw_thread_pool_size: 100, Is this expected? 4. cephx authentication. After reading through the cephx introduction, I got the feeling that cephx is for client to cluster authentication, so that each librados user will need to create a new key. However, this page http://ceph.com/docs/master/rados/operations/authentication/#enabling-cephx got me confused in terms of why should we create keys for mon and osd? And how does that fit into the authentication diagram? BTW, I found the keyrings under /var/lib/cecph/{role}/ for each roles, are they being used when talk to other roles? Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
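For questions 1 and 2, a minimal sketch of the wiring described in the docs of that era (host names and paths are placeholders, not a statement about this particular setup): in ceph.conf on the gateway host

[client.radosgw.gateway]
    host = gatewayhost
    keyring = /etc/ceph/keyring.radosgw.gateway
    rgw socket path = /tmp/radosgw.sock
    log file = /var/log/ceph/radosgw.log

with the Apache rgw.conf keeping FastCgiExternalServer pointed at the same socket (FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock), and the daemon started through its init script (sudo /etc/init.d/ceph-radosgw start) rather than by hand. With FastCgiServer instead, Apache tries to manage the socket itself, which is one way to end up with the 'bind() failed / address already in use' error.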
[ceph-users] Deploy a Ceph cluster to play around with
Hello ceph-users, ceph-devel, Nice to meet you in the community! Today I tried to deploy a Ceph cluster to play around with the API, and during the deployment I have a couple of questions for which I may need your help: 1) How many hosts do I need if I want to deploy a cluster with RadosGW (so that I can try out the S3 API)? Is it 3 OSDs + 1 Mon + 1 GW = 5 hosts at minimum? 2) I have a list of hardware; however, my host only has 1 disk with two partitions, one for boot and another for LVM members. Is it possible to deploy an OSD on such hardware (e.g. make a partition with ext4)? Or will I need another disk to do so? -bash-4.1$ ceph-deploy disk list myserver.com [ceph_deploy.osd][INFO ] Distro info: RedHatEnterpriseServer 6.3 Santiago [ceph_deploy.osd][DEBUG ] Listing disks on myserver.com... [repl101.mobstor.gq1.yahoo.com][INFO ] Running command: ceph-disk list [repl101.mobstor.gq1.yahoo.com][INFO ] /dev/sda : [repl101.mobstor.gq1.yahoo.com][INFO ] /dev/sda1 other, ext4, mounted on /boot [repl101.mobstor.gq1.yahoo.com][INFO ] /dev/sda2 other, LVM2_member Thanks, Guang ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
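If no spare raw disk is available, one option (a sketch based on the ceph-deploy quick start of that era; the directory path is arbitrary) is to back an OSD with a directory on the existing filesystem:

$ ssh myserver.com sudo mkdir -p /var/local/osd0
$ ceph-deploy osd prepare myserver.com:/var/local/osd0
$ ceph-deploy osd activate myserver.com:/var/local/osd0

For a test cluster the monitor and gateway can also be colocated with OSD hosts, so fewer than 5 physical machines are enough to try the S3 API.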
Re: [ceph-users] Usage pattern and design of Ceph
Then that makes total sense to me. Thanks, Guang From: Mark Kirkwood mark.kirkw...@catalyst.net.nz To: Guang Yang yguan...@yahoo.com Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com Sent: Tuesday, August 20, 2013 1:19 PM Subject: Re: [ceph-users] Usage pattern and design of Ceph On 20/08/13 13:27, Guang Yang wrote: Thanks Mark. What is the design considerations to break large files into 4M chunk rather than storing the large file directly? Quoting Wolfgang from previous reply: = which is a good thing in terms of replication and OSD usage distribution ...which covers what I would have said quite well :-) Cheers Mark___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Greg. The typical case is going to depend quite a lot on your scale. [Guang] I am thinking the scale as billions of objects with size from several KB to several MB, my concern is over the cache efficiency for such use case. That said, I'm not sure why you'd want to use CephFS for a small-object store when you could just use raw RADOS, and avoid all the posix overheads. Perhaps I've misunderstood your use case? [Guang] No, you don't. That is my use case :) I am also thinking of using RADOW directly without the above POSIX layer, but before that, I want to consider each option we have and compare the cons / pros. Thanks, Guang From: Gregory Farnum g...@inktank.com To: Guang Yang yguan...@yahoo.com Cc: Gregory Farnum g...@inktank.com; ceph-us...@ceph.com ceph-us...@ceph.com Sent: Tuesday, August 20, 2013 9:51 AM Subject: Re: [ceph-users] Usage pattern and design of Ceph On Monday, August 19, 2013, Guang Yang wrote: Thanks Greg. Some comments inline... On Sunday, August 18, 2013, Guang Yang wrote: Hi ceph-users, This is Guang and I am pretty new to ceph, glad to meet you guys in the community! After walking through some documents of Ceph, I have a couple of questions: 1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different work-loads (from KB to GB), with corresponding performance report? Not really; any comparison would be highly biased depending on your Amazon ping and your Ceph cluster. We've got some internal benchmarks where Ceph looks good, but they're not anything we'd feel comfortable publishing. [Guang] Yeah, I mean the solely server side time regardless of the RTT impact over the comparison. 2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use meta-server to store the logical-to-physical mapping within memory and avoid disk I/O lookup for file reading, is the concern valid for Ceph (in terms of latency to read file)? These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any IO to find object locations; CephFS only does IO if the inode you request has fallen out of the MDS cache (not terribly likely in general). This shouldn't be an issue... [Guang] CephFS only does IO if the inode you request has fallen out of the MDS cache, my understanding is, if we use CephFS, we will need to interact with Rados twice, the first time to retrieve meta-data (file attribute, owner, etc.) and the second time to load data, and both times will need disk I/O in terms of inode and data. Is my understanding correct? The way some other storage system tried was to cache the file handle in memory, so that it can avoid the I/O to read inode in. In the worst case this can happen with CephFS, yes. However, the client is not accessing metadata directly; it's going through the MetaData Server, which caches (lots of) metadata on its own, and the client can get leases as well (so it doesn't need to go to the MDS for each access, and can cache information on its own). The typical case is going to depend quite a lot on your scale. That said, I'm not sure why you'd want to use CephFS for a small-object store when you could just use raw RADOS, and avoid all the posix overheads. Perhaps I've misunderstood your use case? -Greg 3. 
Some industry research shows that one issue of file system is the metadata-to-data ratio, in terms of both access and storage, and some technic uses the mechanism to combine small files to large physical files to reduce the ratio (Haystack for example), if we want to use ceph to store photos, should this be a concern as Ceph use one physical file per object? ...although this might be. The issue basically comes down to how many disk seeks are required to retrieve an item, and one way to reduce that number is to hack the filesystem by keeping a small number of very large files an calculating (or caching) where different objects are inside that file. Since Ceph is designed for MB-sized objects it doesn't go to these lengths to optimize that path like Haystack might (I'm not familiar with Haystack in particular). That said, you need some pretty extreme latency requirements before this becomes an issue and if you're also looking at HDFS or S3 I can't imagine you're in that ballpark. You should be fine. :) [Guang] Yep, that makes a lot sense. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com -- Software Engineer #42 @ http://inktank.com | http://ceph.com___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
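Since raw RADOS is on the table for the small-object store, a quick way to get a feel for it from the command line (pool and object names are made up; librados exposes the same operations programmatically):

$ rados mkpool photos
$ rados -p photos put photo_0001 ./photo_0001.jpg
$ rados -p photos stat photo_0001
$ rados -p photos get photo_0001 /tmp/photo_0001.jpg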
[ceph-users] Usage pattern and design of Ceph
Hi ceph-users, This is Guang and I am pretty new to ceph; glad to meet you guys in the community! After walking through some of the Ceph documents, I have a couple of questions: 1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different workloads (from KB to GB), with a corresponding performance report? 2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a meta-server to store the logical-to-physical mapping in memory and avoid a disk I/O lookup when reading a file; is that concern valid for Ceph (in terms of the latency to read a file)? 3. Some industry research shows that one issue for file systems is the metadata-to-data ratio, in terms of both access and storage, and some techniques combine small files into large physical files to reduce that ratio (Haystack, for example). If we want to use ceph to store photos, should this be a concern, since Ceph uses one physical file per object? Thanks, Guang___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Deploy Ceph on RHEL6.4
Hi ceph-users, I would like to check whether there is any manual or set of steps that would let me try deploying Ceph on RHEL. Thanks, Guang
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Mark. What are the design considerations behind breaking large files into 4M chunks rather than storing the large file directly?

Thanks, Guang

From: Mark Kirkwood mark.kirkw...@catalyst.net.nz To: Guang Yang yguan...@yahoo.com Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com Sent: Monday, August 19, 2013 5:18 PM Subject: Re: [ceph-users] Usage pattern and design of Ceph

On 19/08/13 18:17, Guang Yang wrote: 3. Some industry research shows that one issue with filesystems is the metadata-to-data ratio, in terms of both access and storage, and one technique is to combine small files into large physical files to reduce that ratio (Haystack, for example). If we want to use Ceph to store photos, should this be a concern, given that Ceph uses one physical file per object?

If you use Ceph as a pure object store, and get and put data via the basic rados api, then sure, one client data object will be stored in one Ceph 'object'. However, if you use the rados gateway (S3 or Swift look-alike api) then each client data object will be broken up into chunks at the rados level (typically 4M sized chunks).

Regards, Mark
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
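As a rough sketch of the chunking Mark describes, a large upload can be cut into fixed-size pieces that each become a separate rados object; the 4M size matches the default he mentions, while the function and variable names are made up for illustration:

CHUNK_SIZE = 4 * 1024 * 1024  # the ~4M rados-level chunks Mark mentions

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    # Yield (index, piece) pairs; each piece would be stored as its own rados
    # object, so a large file is spread across many OSDs instead of one file.
    for offset in range(0, len(data), chunk_size):
        yield offset // chunk_size, data[offset:offset + chunk_size]

sizes = [len(piece) for _, piece in split_into_chunks(b"x" * (10 * 1024 * 1024))]
print(sizes)  # a 10M object becomes chunks of 4M, 4M and 2M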
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Greg. Some comments inline...

On Sunday, August 18, 2013, Guang Yang wrote: Hi ceph-users, This is Guang and I am pretty new to Ceph, glad to meet you guys in the community! After walking through some of the Ceph documentation, I have a couple of questions:

1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different workloads (from KB to GB), with a corresponding performance report?

Not really; any comparison would be highly biased depending on your Amazon ping and your Ceph cluster. We've got some internal benchmarks where Ceph looks good, but they're not anything we'd feel comfortable publishing.
[Guang] Yeah, I meant the server-side time only, regardless of the RTT impact on the comparison.

2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a metadata server to keep the logical-to-physical mapping in memory and avoid a disk I/O lookup when reading a file; is that concern valid for Ceph (in terms of the latency to read a file)?

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any IO to find object locations; CephFS only does IO if the inode you request has fallen out of the MDS cache (not terribly likely in general). This shouldn't be an issue...
[Guang] Regarding "CephFS only does IO if the inode you request has fallen out of the MDS cache": my understanding is that if we use CephFS, we will need to interact with RADOS twice, first to retrieve the metadata (file attributes, owner, etc.) and then to load the data, and both steps may need disk I/O for the inode and the data respectively. Is my understanding correct? The approach some other storage systems take is to cache the file handle in memory, so that the I/O to read the inode can be avoided.

3. Some industry research shows that one issue with filesystems is the metadata-to-data ratio, in terms of both access and storage, and one technique is to combine small files into large physical files to reduce that ratio (Haystack, for example). If we want to use Ceph to store photos, should this be a concern, given that Ceph uses one physical file per object?

...although this might be. The issue basically comes down to how many disk seeks are required to retrieve an item, and one way to reduce that number is to hack the filesystem by keeping a small number of very large files and calculating (or caching) where different objects are inside that file. Since Ceph is designed for MB-sized objects it doesn't go to these lengths to optimize that path like Haystack might (I'm not familiar with Haystack in particular). That said, you need some pretty extreme latency requirements before this becomes an issue, and if you're also looking at HDFS or S3 I can't imagine you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot of sense.

-Greg
--
Software Engineer #42 @ http://inktank.com | http://ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
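As a toy illustration of the "few very large files plus a cached offset index" idea Greg describes, the class below appends small blobs to one big file and keeps (offset, length) in memory, so a read costs at most one seek; the file name and index layout are invented for the example and have nothing to do with the real Haystack or Ceph internals:

import os

class PackedStore:
    def __init__(self, path="packed.data"):
        self.path = path
        self.index = {}                  # key -> (offset, length), kept in memory
        open(self.path, "ab").close()    # make sure the backing file exists

    def put(self, key, blob):
        # Append the blob to the single large file and remember where it went.
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(blob)
        self.index[key] = (offset, len(blob))

    def get(self, key):
        # One seek plus one read; no per-object inode lookup is needed.
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

store = PackedStore()
store.put("photo-1", b"...jpeg bytes...")
print(store.get("photo-1"))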
[ceph-users] Usage pattern and design of Ceph
Hi ceph-users, This is Guang and I am pretty new to Ceph, glad to meet you guys in the community! After walking through some of the Ceph documentation, I have a couple of questions:

1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different workloads (from KB to GB), with a corresponding performance report?

2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a metadata server to keep the logical-to-physical mapping in memory and avoid a disk I/O lookup when reading a file; is that concern valid for Ceph (in terms of the latency to read a file)?

3. Some industry research shows that one issue with filesystems is the metadata-to-data ratio, in terms of both access and storage, and one technique is to combine small files into large physical files to reduce that ratio (Haystack, for example). If we want to use Ceph to store photos, should this be a concern, given that Ceph uses one physical file per object?

Thanks, Guang
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD shared between clients
Thank you, Gandalf and Igor. I intuitively feel that building one cluster on top of another is not appropriate. Maybe I should give RadosGW a try first.

On Thu, May 2, 2013 at 3:00 AM, Igor Laskovy igor.lask...@gmail.com wrote: Or maybe, for the hosting use case, it is easier to implement RadosGW.

-- Yudong Guang guangyudongb...@gmail.com 786-554-3993 +86-138-1174-5701
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
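If RadosGW does turn out to be the better fit, sharing data between clients becomes ordinary S3-style puts and gets; the sketch below uses boto against a RadosGW endpoint, with the hostname, credentials and bucket name as placeholders rather than anything from this thread:

import boto
import boto.s3.connection

# Placeholder credentials and endpoint for a RadosGW instance.
conn = boto.connect_s3(
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    host="rgw.example.com",
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket("shared-data")
key = bucket.new_key("hello.txt")
key.set_contents_from_string("written by client A")

# Another client connecting the same way sees the object immediately; there is
# no per-client block cache to invalidate as with a raw RBD image.
print(bucket.get_key("hello.txt").get_contents_as_string())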
[ceph-users] RBD shared between clients
Hi, I've been trying to use the block device recently. I have a running cluster with 2 machines and 3 OSDs.

On a client machine, let's say A, I created an rbd image using `rbd create`, then formatted, mounted and wrote something to it, and everything was working fine.

However, a problem occurred when I tried to use this image on the other client, let's say B, on which I mapped the same image that was created on A. I found that changes made on either of them were not visible on the other client, but if I unmap the device and then map it again, the changes show up.

I tested the same thing with CephFS, and there was no such problem: every change made on one client is visible on the other client instantly.

I wonder whether this kind of behavior of the RADOS block device is normal or not. Is there any way that we can read and write the same image from multiple clients? Any idea is appreciated. Thanks
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com