Re: ceph and efficient access of distributed resources
The client does a 12MB read, which (because of the striping) gets broken into 3 separate 4MB reads, each of which is sent, all in parallel, to 3 distinct OSDs. The only bottleneck in such an operation is the client NIC. On 04/16/2013 01:06 PM, Gandalf Corvotempesta wrote: 2013/4/16 Mark Kampe : RADOS is the underlying storage cluster, but the access methods (block, object, and file) stripe their data across many RADOS objects, which CRUSH very effectively distributes across all of the servers. A 100MB read or write turns into dozens of parallel operations to servers all over the cluster. Let me try to explain. AFAIK ceph will split data into chunks of 4MB each, so a single 12MB file will be stored in 3 different chunks across multiple OSDs and then replicated multiple times (based on the value of the replica count). Let's assume a 12MB file and a 3x replica. RADOS will create 3x3 chunks for the same file, stored on 9 OSDs. When reading, AFAIK replicas are not used, so all reads are done to the "master copy". But are these 3 chunks read in parallel from multiple OSDs, or are all read requests done through a single OSD? In the first case we will have 3x bandwidth for read operations directed to a file with at least 3 chunks; in the latter we have a big bottleneck. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
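The fan-out described above can be sketched in a few lines. This is an illustrative model (assuming the default 4MB object size mentioned in the thread), not Ceph's actual striping code:

```python
STRIPE_SIZE = 4 * 2**20  # default 4MB RADOS object size, per the thread

def split_read(offset, length, stripe=STRIPE_SIZE):
    """Break a logical read into per-object (object_index, obj_offset, len) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        idx = offset // stripe            # which object this byte range lives in
        obj_off = offset % stripe         # offset within that object
        n = min(stripe - obj_off, end - offset)
        pieces.append((idx, obj_off, n))
        offset += n
    return pieces

# A 12MB read at offset 0 fans out into 3 full-object reads, each of which
# the client can issue to a different OSD in parallel.
print(split_read(0, 12 * 2**20))
# → [(0, 0, 4194304), (1, 0, 4194304), (2, 0, 4194304)]
```

Each piece maps (via CRUSH) to a distinct object and, usually, a distinct primary OSD, which is where the 3x read bandwidth comes from.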
Re: ceph and efficient access of distributed resources
On 04/16/13 00:20, Gandalf Corvotempesta wrote: 2013/4/16 Mark Kampe : The entire web is richly festooned with cache servers whose sole raison d'etre is to solve precisely this problem. They are so good at it that back-bone providers often find it more cash-efficient to buy more cache servers than to lay more fiber. Cache servers don't merely save disk I/O, they catch these requests before they reach the server (or even the backbone). Mine was just an example; there are many other cases where a frontend cache is not possible. I think that ceph should spread reads across the whole cluster by default (like a big RAID-1), to achieve a bandwidth improvement. At my previous distributed storage start-up (Parascale) we had the ability to distribute reads across copies for load distribution purposes, and everybody we talked to said "who cares!". Why? For hot-spot situations (as in your original example) higher level caching is far more effective than random traffic distribution. For lower level (e.g. coincidental) reuse, sending all the requests to a single server will usually perform better. Network I/O is much faster than disk I/O, and a single recipient will have N * the cache hit rate that N servers would have. What happens in the case of a big file (for example, 100MB) with multiple chunks? Is ceph smart enough to read multiple chunks from multiple servers simultaneously, or will the whole file be served by just one OSD? RADOS is the underlying storage cluster, but the access methods (block, object, and file) stripe their data across many RADOS objects, which CRUSH very effectively distributes across all of the servers. A 100MB read or write turns into dozens of parallel operations to servers all over the cluster.
Re: ceph and efficient access of distributed resources
If I correctly understand the discussion, you are correct that I/O could be saved by doing this ... were it not for the fact that the I/O in question is already being saved much more effectively by someone else. The entire web is richly festooned with cache servers whose sole raison d'etre is to solve precisely this problem. They are so good at it that back-bone providers often find it more cash-efficient to buy more cache servers than to lay more fiber. Cache servers don't merely save disk I/O, they catch these requests before they reach the server (or even the backbone). On 04/15/2013 01:06 PM, Gandalf Corvotempesta wrote: Currently reads always come from the primary OSD in the placement group rather than a secondary, even if the secondary is closer to the client. In this way, only one OSD will be involved in reading an object; this will result in a bottleneck if multiple clients need to access the same file. For example, a 3KB CSS file served by a webserver to 400 users will be read from just one OSD. 400 users directed to 1 OSD while (in case of replica 3) 2 other OSDs are available?
Re: Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'
It seems to me that the surviving OSDs still remember all of the osdmap and pgmap history back to "last epoch started" for all of their PGs. Isn't this enough to enable reconstruction of all of the pgmaps and osdmaps required to find any copy of any currently stored object? My history has given me biases, but I prefer reconstruction over snapshots because: (a) it enables recovery from more catastrophic incidents (e.g. a bug has corrupted all of the monitor stores or a fire has reduced all monitor nodes to slag) (b) it is less likely to result in inconsistencies involving object updates after the last snapshot (c) the ability to reconstruct is a superset of the ability to audit, so we get consistency audits for free. It tends to be a common source of discomfort among potential Ceph users that if their mons ever become unrecoverable, it's almost impossible to recover their data (compare to GlusterFS, where you can always pull data out of Gluster bricks unharmed, at least as long as you don't use striped volumes). With a file-backed mon store, I had hoped that eventually this might tie into btrfs snapshots such that you would have been able to roll back to a known good configuration in an emergency. With the switch to leveldb, I no longer foresee that ever happening. Mind sharing your thoughts on that?
Geographic DR for RGW
A few weeks ago, Yehuda Sadeh sent out a proposal for adding support for asynchronous remote site replication to the RADOS Gateway. We have done some preliminary planning and are now starting the implementation work. At the 100,000' level (from which height all water looks drinkable and all mountains climbable) the work can be divided into: 1. a bunch of changes within the gateway to create regions, add new attributes to buckets and objects, log new information and implement/expose some new operations 2. new RESTful APIs to exploit and manage the new behaviors, and associated unit test suites. 3. free-standing data and metadata synchronization agents that learn of and propagate changes 4. management agents to monitor, control, and report on this activity. 5. white-box test suites to stress change detection, reconciliation, propagation, and replay. We feel that we pretty much have to do (1) (lots of changes to a great deal of complicated code). Category (2) is new code, and (in principle) decoupled from the internals of the gateway, but it has many tendrils. C++ developers with some familiarity with the Gateway could definitely help here ... but it is questionable whether or not it makes sense to try to bring new people up to speed for a project that will only last a few months. In the near term, the most modular pieces with the lowest activation energy are (3), and in two months there may be enough working to enable work on (4) and (5). The synchronization and management agents are free-standing processes based on RESTful APIs, and so can be implemented in pretty much anything (Python, Java, C++, Ruby, ...). If there are other people who are able to help make this happen, we would love to invite your participation. 
This is an opportunity to: * accelerate the development of a strategic feature * help to shape some major new functionality * get very familiar with the Gateway code * play with eventual consistency, asynchronous pull replication * be one of the kool kids * earn Karma and improve the world through Open Source contribution thank you, mark.ka...@inktank.com VP, Engineering
Re: some performance issue
Writes are intrinsically more expensive (in both the file system and hardware), but it is not uncommon for individual small random writes to substantially outperform reads, even with O_DIRECT. If the I/O is not massively parallel, reads are going to be processed one at a time (e.g. ~6ms seek, ~4ms latency, and 27us transfer). Writes, however, are commonly accepted by the drive and then queued, enabling the drive to choose among the competing requests to significantly (e.g. 2-3x) reduce both average seek time and rotational latency. If the I/O is being buffered, the performance advantages for random writes can be even greater (due to a deeper request queue and potential request aggregation). Isolated random reads (with few cache hits) get a much smaller performance boost (if any) from buffered I/O. With massively parallel requests, however, the write advantage should evaporate. On 02/04/2013 09:15 AM, sheng qiu wrote: Hi Xiaoxi, thanks for your reply. On Mon, Feb 4, 2013 at 10:52 AM, Chen, Xiaoxi wrote: I doubt your data is correct, even the ext4 data; did you use O_DIRECT when doing the test? It's unusual to have 2X random write IOPS than random read. i did not use O_DIRECT, so the page cache is used during the test. one thing i guess why random write is better than random read is that since the I/O request size is 4KB, each write request that misses the page cache will allocate a new page and write the complete 4KB of dirty data there (since there are no partial writes, there is no need to fetch the missed data from the OSDs). While for read requests, it has to wait until the data are fetched from the OSDs.
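The serial-read vs. queued-write argument above can be made concrete with a rough service-time model. The numbers are the illustrative figures from the post (~6ms seek, ~4ms rotational latency, ~27us transfer, and the 2-3x reordering benefit), not measurements:

```python
# Rough single-drive service-time model for 4K random I/O.
seek_ms, half_rot_ms, xfer_ms = 6.0, 4.0, 0.027

# Serial random reads: each op pays the full seek + rotational latency.
read_iops = 1000 / (seek_ms + half_rot_ms + xfer_ms)

# Queued random writes: the drive reorders a deep queue, cutting seek and
# rotational latency by (say) 2.5x, per the 2-3x figure in the text.
reorder_gain = 2.5
write_iops = 1000 / ((seek_ms + half_rot_ms) / reorder_gain + xfer_ms)

print(round(read_iops), round(write_iops))  # 100 248
```

With these assumptions, queued random writes come out roughly 2.5x faster than serial random reads, matching the "2X random write IOPS" observation that prompted the question.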
Re: on disk encryption
Correct. I wasn't actually involved in this (or any other real) work, but as I recall the only real trick is how much key management you want: Do we want to be able to recover the key if a good disk is rescued from a destroyed server and added to a new server? Do we want to ensure that the keys are not persisted on the server, so that an entire server can be decommissioned without having to worry about the data being recovered by somebody who knows where to look? If you are willing to keep the key on the server and lose the data when the server fails, this is trivial. If you are unwilling to keep the key on the server, or if you need the disk to remain readable after the server is lost, we need some third party (like the monitors) to maintain the keys. We thought these might be important, so we were looking at how to get the monitors to keep track of the encryption keys. On 01/31/2013 03:42 PM, Marcus Sorensen wrote: Yes, anyone could do this now by setting up the OSDs on top of dm-crypted disks, correct? This would just automate the process, and manage keys for us?
Re: geo replication
Right now, your only option is synchronous replication, which happens at the speed of the slowest OSD ... so unless your WAN links are fast and fat, it comes at a non-negligible performance penalty. We will soon be sending out a proposal for an asynchronous replication mechanism with eventual consistency for the RADOS Gateway ... but that is a somewhat simpler problem (immutable objects, good change lists, and a WAN friendly protocol). Asynchronous RADOS replication is definitely on our list, but more complex and farther out. On 01/09/2013 01:19 PM, Gandalf Corvotempesta wrote: probably this was already asked before but i'm unable to find any answer. Is it possible to replicate a cluster geographically? GlusterFS does this with rsync (i think called automatically on every file write); does ceph do something similar? I don't think that using multiple geographically distributed OSDs with 10-15ms of latency will be good
Re: Are there significant performance enhancements in 0.56.x to be expected soon or planned in the near future?
Performance work is always ongoing, but I am not aware of any significant imminent enhancements. We are just wrapping up an investigation of the effects of various file system and I/O options on different types of traffic, and the next major area of focus will be RADOS Block Device and VMs over RBD. This is pretty far away from Hadoop and probably won't yield much fruit until March. There are a few people working on Hadoop integration, and I have not been closely following their activities, but I do not believe that any major performance work will be forthcoming in the next few weeks. On 01/09/2013 04:51 AM, Lachfeld, Jutta wrote: Hi all, in expectation of better performance, we are just switching from CEPH version 0.48 to 0.56.1 for comparisons between Hadoop with HDFS and Hadoop with CEPH FS. We are now wondering whether there are currently any development activities concerning further significant performance enhancements, or whether further significant performance enhancements are already planned for the near future. I would now be loath to start benchmarking with 0.56.1 and then, a month or so later, detect that there have been significant performance enhancements in CEPH in the meantime.
Re: RBD fio Performance concerns
Sequential is faster than random on a disk, but we are not doing I/O to a disk, but to a distributed storage cluster: small random operations are striped over multiple objects and servers, and so can proceed in parallel and take advantage of more nodes and disks. This parallelism can overcome the added latencies of network I/O to yield very good throughput. Small sequential read and write operations are serialized on a single server, NIC, and drive. This serialization eliminates parallelism, and the network and other queuing delays are no longer compensated for. This striping is a good idea for the small random I/O that is typical of the way Linux systems talk to their disks. But for other I/O patterns, it is not optimal. On 11/21/2012 01:47 PM, Sébastien Han wrote: Hi Mark, Well the most concerning thing is that I have 2 Ceph clusters and both of them show better rand than seq... I don't have enough background to argue with your assumptions, but I could try to shrink my test platform to a single OSD and see how it performs. We keep in touch on that one. But it seems that Alexandre and I have the same results (more rand than seq); he has (at least) one cluster and I have 2. Thus I start to think that's not an isolated issue. Is it different for you? Do you usually get more seq IOPS from an RBD than rand?
Re: RBD fio Performance concerns
Recall: 1. RBD volumes are striped (4M wide) across RADOS objects 2. distinct writes to a single RADOS object are serialized Your sequential 4K writes are direct, depth=256, so there are (at all times) 256 writes queued to the same object. All of your writes are waiting through a very long line, which is adding horrendous latency. If you want to do sequential I/O, you should do it buffered (so that the writes can be aggregated) or with a 4M block size (very efficient and avoiding object serialization). We do direct writes for benchmarking, not because it is a reasonable way to do I/O, but because it bypasses the buffer cache and enables us to directly measure cluster I/O throughput (which is what we are trying to optimize). Applications should usually do buffered I/O, to get the (very significant) benefits of caching and write aggregation. That's correct for some of the benchmarks. However even with 4K for seq, I still get less IOPS. See below my last fio: # fio rbd-bench.fio seq-read: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-read: (g=1): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 seq-write: (g=2): rw=write, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 rand-write: (g=3): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=256 fio 1.59 Starting 4 processes Jobs: 1 (f=1): [___w] [57.6% done] [0K/405K /s] [0 /99 iops] [eta 02m:59s] seq-read: (groupid=0, jobs=1): err= 0: pid=15096 read : io=801892KB, bw=13353KB/s, iops=3338 , runt= 60053msec slat (usec): min=8 , max=45921 , avg=296.69, stdev=1584.90 clat (msec): min=18 , max=133 , avg=76.37, stdev=16.63 lat (msec): min=18 , max=133 , avg=76.67, stdev=16.62 bw (KB/s) : min=0, max=14406, per=31.89%, avg=4258.24, stdev=6239.06 cpu : usr=0.87%, sys=5.57%, ctx=165281, majf=0, minf=279 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 
64=0.0%, >=64=0.1% issued r/w/d: total=200473/0/0, short=0/0/0 lat (msec): 20=0.01%, 50=9.46%, 100=90.45%, 250=0.10% rand-read: (groupid=1, jobs=1): err= 0: pid=16846 read : io=6376.4MB, bw=108814KB/s, iops=27203 , runt= 60005msec slat (usec): min=8 , max=12723 , avg=33.54, stdev=59.87 clat (usec): min=4642 , max=55760 , avg=9374.10, stdev=970.40 lat (usec): min=4671 , max=55788 , avg=9408.00, stdev=971.21 bw (KB/s) : min=105496, max=109136, per=100.00%, avg=108815.48, stdev=648.62 cpu : usr=8.26%, sys=49.11%, ctx=1486259, majf=0, minf=278 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued r/w/d: total=1632349/0/0, short=0/0/0 lat (msec): 10=83.39%, 20=16.56%, 50=0.04%, 100=0.01% seq-write: (groupid=2, jobs=1): err= 0: pid=18653 write: io=44684KB, bw=753502 B/s, iops=183 , runt= 60725msec slat (usec): min=8 , max=1246.8K, avg=5402.76, stdev=40024.97 clat (msec): min=25 , max=4868 , avg=1384.22, stdev=470.19 lat (msec): min=25 , max=4868 , avg=1389.62, stdev=470.17 bw (KB/s) : min=7, max= 2165, per=104.03%, avg=764.65, stdev=353.97 cpu : usr=0.05%, sys=0.35%, ctx=5478, majf=0, minf=21 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.3%, >=64=99.4% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued r/w/d: total=0/11171/0, short=0/0/0 lat (msec): 50=0.21%, 100=0.44%, 250=0.97%, 500=1.49%, 750=4.60% lat (msec): 1000=12.73%, 2000=66.36%, >=2000=13.20% rand-write: (groupid=3, jobs=1): err= 0: pid=20446 write: io=208588KB, bw=3429.5KB/s, iops=857 , runt= 60822msec slat (usec): min=10 , max=1693.9K, avg=1148.15, stdev=15210.37 clat (msec): min=22 , max=5639 , avg=297.37, stdev=430.27 lat (msec): min=22 , max=5639 , avg=298.52, stdev=430.84 bw (KB/s) : min=0, max= 7728, per=31.44%, avg=1078.21, 
stdev=2000.45 cpu : usr=0.34%, sys=1.61%, ctx=37183, majf=0, minf=19 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% issued r/w/d: total=0/52147/0, short=0/0/0 lat (msec): 50=2.82%, 100=25.63%, 250=46.12%, 500=10.36%, 750=5.10% lat (msec): 1000=2.91%, 2000=5.75%, >=2000=1.33% Run status group 0 (all jobs): READ: io=801892KB, aggrb=13353KB/s, minb=13673KB/s, maxb=13673KB/s, mint=60053msec, maxt=60053msec Run status group 1 (all jobs): READ: io=6376.4MB, aggrb=108814KB/s, minb=111425KB/s,
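The object-serialization effect described at the top of this post (direct 4K sequential writes at depth 256 all queuing on the same 4M object) can be illustrated with a small sketch. The 4M stripe width is from the post; the rest is an illustrative model, not RBD code:

```python
import random

OBJECT_SIZE = 4 * 2**20   # default RBD stripe width, per the post
BS = 4096                 # fio block size

def object_for(offset):
    """Which RADOS object a given volume offset maps to."""
    return offset // OBJECT_SIZE

# 256 queued *sequential* 4K writes: all land on one object, so they are
# serialized behind each other, adding enormous per-op latency.
inflight = [object_for(off) for off in range(0, 256 * BS, BS)]
print(len(set(inflight)))  # 1

# The same 256 writes scattered *randomly* over a 1G volume touch many
# distinct objects and can proceed in parallel.
random.seed(0)
rand = {object_for(random.randrange(0, 2**30, BS)) for _ in range(256)}
print(len(rand))  # close to the number of in-flight ops
```

This is why the fio output above shows random 4K writes (iops=857) beating sequential 4K writes (iops=183) on the same cluster.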
Re: RBD fio Performance concerns
On 11/15/2012 12:23 PM, Sébastien Han wrote: First of all, I would like to thank you for this well explained, structured and clear answer. I guess I got better IOPS thanks to the 10K disks. 10K RPM would bring your per-drive throughput (for 4K random writes) up to 142 IOPS and your aggregate cluster throughput up to 1700. This would predict a corresponding RADOSbench throughput somewhere above 425 (how much better depending on write aggregation and cylinder affinity). Your RADOSbench 708 now seems even more reasonable. To be really honest I wasn't so concerned about the RADOS benchmarks but more about the RBD fio benchmarks and the amount of IOPS that comes out of it, which I found a bit too low. Sticking with 4K random writes, it looks to me like you were running fio with libaio (which means direct, no buffer cache). Because it is direct, every I/O operation is really happening and the best sustained throughput you should expect from this cluster is the aggregate raw fio 4K write throughput (1700 IOPS) divided by two copies = 850 random 4K writes per second. If I read the output correctly you got 763, or about 90% of back-of-envelope. BUT, there are some footnotes (there always are with performance). If you had been doing buffered I/O you would have seen a lot more (up front) benefit from page caching ... but you wouldn't have been measuring real (and hence sustainable) I/O throughput ... which is ultimately limited by the heads on those twelve disk drives, where all of those writes ultimately wind up. It is easy to be fast if you aren't really doing the writes :-) I would have expected write aggregation and cylinder affinity to have eliminated some seeks and improved rotational latency, resulting in better than theoretical random write throughput. Against those expectations 763/850 IOPS is not so impressive. But, it looks to me like you were running fio in a 1G file with 100 parallel requests. The default RBD stripe width is 4M. 
This means that those 100 parallel requests were being spread across 256 (1G/4M) objects. People in the know tell me that writes to a single object are serialized, which means that many of those (potentially) parallel writes were to the same object, and hence serialized. This would increase the average request time for the colliding operations, and reduce the aggregate throughput correspondingly. Use a bigger file (or a narrower stripe) and this will get better. Thus, getting 763 random 4K write IOPs out of those 12 drives still sounds about right to me. On 15 nov. 2012, at 19:43, Mark Kampe wrote: Dear Sebastien, Ross Turn forwarded me your e-mail. You sent a great deal of information, but it was not immediately obvious to me what your specific concern was. You have 4 servers, 3 OSDs per, 2 copy, and you measured a radosbench (4K object creation) throughput of 2.9MB/s (or 708 IOPS). I infer that you were disappointed by this number, but it looks right to me. Assuming typical 7200 RPM drives, I would guess that each of them would deliver a sustained direct 4K random write performance in the general neighborhood of: 4ms seek (short seeks with write-settle-downs) 4ms latency (1/2 rotation) 0ms write (4K/144MB/s ~ 30us) - 8ms or about 125 IOPS Your twelve drives should therefore have a sustainable aggregate direct 4K random write throughput of 1500 IOPS. Each 4K object create involves four writes (two copies, each getting one data write and one data update). Thus I would expect a (crude) 4K create rate of 375 IOPS (1500/4). You are getting almost twice the expected raw IOPS ... and we should expect that a large number of parallel operations would realize some write/seek aggregation benefits ... so these numbers look right to me. Is this the number you were concerned about, or have I misunderstood? 
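The back-of-envelope arithmetic in the quoted reply can be reproduced directly. All figures are the post's own assumptions (typical 7200 RPM drives, two copies, one data write plus one update per copy):

```python
# Per-drive 4K random write service time, per the post.
seek_ms, latency_ms = 4.0, 4.0                 # short seeks, 1/2 rotation
# transfer time (~30us for 4K at 144MB/s) is negligible and ignored
per_drive_iops = 1000 / (seek_ms + latency_ms)  # 125

drives = 12
raw_iops = drives * per_drive_iops              # 1500 aggregate

# Each 4K object create = 2 copies x (1 data write + 1 update) = 4 writes.
writes_per_create = 4
expected_creates = raw_iops / writes_per_create  # 375

print(per_drive_iops, raw_iops, expected_creates)  # 125.0 1500.0 375.0
```

The measured 708 creates/s is roughly twice this crude estimate, which the post attributes to write/seek aggregation across parallel operations.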
Re: Using asphyxiate with Doxygen and Java?
TV is of the opinion that Asphyxiate was the right direction to move in, and that the sloth problems are solvable, but would require work. On 10/26/2012 6:26 PM, Noah Watkins wrote: I stumbled upon Breathe, and then asphyxiate. The doxygenfile directive in the latter doesn't seem to like what Doxygen produces from parsing JavaDoc markup, although I've read that the Doxygen produced should be compliant. Here is the error: AssertionError: cannot handle compounddef kind=class Before going any further I wanted to ping the list to see if anyone thinks it would be a good/bad idea to look into this. It'd be nice to have the Java documentation in Sphinx seamlessly. Any chance Breathe has gotten faster over time?
Re: Guidelines for Calculating IOPS?
Replication should have no effect on read throughput/IOPS. The client does a single write to the primary, and the primary then handles re-replication to the secondary copies. As such the client does not pay (in terms of CPU or NIC bandwidth) for the replication. Per-client throughput limitations should be largely independent of the replication. However, the replication does generate additional network and I/O activity between the OSDs. This means that the available aggregate throughput (of the entire cluster) is effectively cut in half when you move from one-copy to two. I am confused by your math: You say 385MB/s and 5250 IOPS (x8k) 5250 IOPS * 8192 = 43MB/s Do you mean that some of your clients are generating a lot of small block writes (at up to 5250 IPS) and that others of your clients are doing larger writes (with an aggregate throughput of 385MB/s)? For RADOS throughput: 385MB/s is a fairly small number 5250 buffered sequential IOPS is a very small number 5250 random IOPS is not a particularly large number, but will require several servers My guess is that the IOPS may drive the number of servers, and the drives per server will be the capacity divided by the number of required servers. So how many IOPS can you get per server? You are using RBD, and depending on the particulars of your stack, there may be a great deal of buffering and caching on the client side that can make the RADOS traffic much more efficient than the tributary client requests. Thus, I would suggest that you probably want to actually benchmark the application in question to measure the client-experienced throughput. On 10/19/12 07:47, Mike Dawson wrote: All, I am investigating the use of Ceph for a video surveillance project with the following minimum block storage requirements: 385 Mbps of constant write bandwidth 100TB storage requirement 5250 IOPS (size of ~8 KB) I believe 2 replicas would be acceptable. 
We intend to use large capacity (2 or 3TB) SATA 7200rpm 3.5" drives, if the IOPS work out properly. Is there a method / formula to estimate IOPS for RBD? Specifically I would like to understand: - How does replica count affect read/write IOPS? - I'm trying to understand best practice for when to optimize server count, drives per server, and drive capacity as it relates to IOPS. Is there a point of diminishing I/O performance using server chassis with lots of drive slots, like the 36-drive Supermicro SC847a?
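The unit mismatch called out in the reply above ("5250 IOPS * 8192 = 43MB/s") and the replication cost can both be checked in a few lines (MB here means 10^6 bytes, matching the post's 43MB/s figure):

```python
# The two stated requirements describe different workloads: 5250 8KB IOPS
# accounts for far less traffic than the 385MB/s aggregate bandwidth figure.
iops, block = 5250, 8 * 1024
small_write_mb_s = iops * block / 1e6
print(round(small_write_mb_s))  # 43

# With 2 replicas, every client write is performed twice inside the
# cluster, so cluster-internal write traffic doubles:
client_mb_s = 385
cluster_mb_s = client_mb_s * 2
print(cluster_mb_s)  # 770
```

This is why the reply asks whether some clients are doing small-block writes (driving the IOPS number) while others do large writes (driving the bandwidth number): the two figures cannot describe the same stream.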
Re: Client Location
I'm not a real engineer, so please forgive me if I misunderstand, but can't you create a separate rule for each data center (choosing first a local copy, and then remote copies), which should ensure that the primary is always local? Each data center would then use a different pool, associated with the appropriate location-sensitive rule. Does this approach get you the desired locality preference?
Re: Newbie questions
In your original question, I assumed you were talking about the partitioning of a single cluster. Now you are talking about Geographic Disaster Recovery: the replication of data across multiple (relatively) independent clusters. This is not yet supported, but it is most definitely on the road-map. On 10/6/2012 5:08 PM, Adam Nielsen wrote: The problem you are describing is called split-brain. Ceph has an odd number of monitors and quorum is required before objects can be served. The partition with the smaller number of monitors will wait harmlessly until connectivity is reestablished. Ah right, that makes sense. Is this set in stone or can it be configured? I'm just thinking that in this scenario it could be beneficial to allow read-only access from the partition with the smaller number of monitors, if there are also clients that can only see those hosts. (For example, a business with two sites, and the link between them goes down, so client PCs can only see their site-local servers.)
Re: Newbie questions
The problem you are describing is called split-brain. Ceph has an odd number of monitors and quorum is required before objects can be served. The partition with the smaller number of monitors will wait harmlessly until connectivity is reestablished. Adam Nielsen wrote: >Thanks both for your answers - very informative. I think I will set up a test >Ceph system on my home servers to try it out. > >I have one more question: > >Ceph seems to handle failed nodes well enough, but what about failed network >links? Say you have a few systems in two locations, connected by a single >link. If the link fails, you will have two isolated networks, each of which >will think the other has failed and presumably will try to go on as best it >can. What happens when the link comes back up again? What if the same file >was modified by both isolated clusters when the link was down? What version >will end up back in the cluster? > >Thanks again, >Adam.
Re: Release/branch naming; input requested
On 05/18/12 09:32, Sage Weil wrote: I think we can limit the relative branches to: master = integration, unstable, tip, bleeding edge (same as now) [next] = next upcoming release (same as now) current = most recent release stable = most recent stable release We have already signed one contract that obligates us to years of support, and once customers go into production they will be loathe to move to each new stable release as it comes out. Thus, I fear that we will be maintaining multiple stable releases ... but once a stable release ceases to be the newest, the bar on what has to be back-ported to it can be significantly raised, lowering the cost. I like Yehuda's suggestion of cephalopods, or other interesting sea creatures. As he points out, though, http://www.thecephalopodpage.org/taxa.php suggests that there may not be enough good choices that are strictly cephalopods, though. Although it might be ok? ... I don't have an opinion about themes, but I do suggest that they should be memorable and easily pronounced ... and taxonomic family names do not always have those characteristics. 3. What do we do with version numbers? With a 2-3 week iteration, we'll end up with something like 0.41.x, 0.56.x for Folsom integration (less than a year from now), and 0.57, 0.58 etc for "latest". We can keep those, they are completely orthogonal. These are exactly what they are: dev cycle numbers. I'm not too afraid of big numbers there, as they become uninteresting once you have the other naming scheme. They have the nice property of monotonically increasing which is useful internally. Given that most releases will only get a subset of what is in builds, I too think that builds should be orthogonal to releases.
Re: Logging braindump
On 03/22/12 09:38, Colin McCabe wrote: On Mon, Mar 19, 2012 at 1:53 PM, Tommi Virtanen wrote: [mmap'ed buffer discussion] I always thought mmap'ed circular buffers were an elegant approach for getting data that survived a process crash, but not paying the overhead of write(2) and read(2). The main problem is that you need special tools to read the circular buffer files off of the disk. As Sage commented, that is probably undesirable for many users.

(a) I actually favor not simply mmaping the circular buffer, but having a program that pulls the data out of memory and writes it to disk (à la Varnish). In addition to doing huge writes (greatly reducing the write overhead), it can filter what it processes, so that we have extensive logging for the last few seconds, and more manageable logs on disk extending farther back in time (modulo log rotation).

(b) The most interesting logs are probably the ones in coredumps (that didn't make it out to disk), for which we want a crawler/extractor anyway. It probably isn't very hard to make the program that extracts logs from memory also be able to pick the pockets of dead bodies (put a big self-identifying header on the front of each buffer). Note also that having the ability to extract the logs from a coredump pretty much eliminates any motivation to flush log entries out to disk promptly/expensively. If the process exits cleanly, we'll get the logs. If the process produces a coredump, we'll still get the logs.

(c) I have always loved text logs that I can directly view. Their immediate and effortless accessibility encourages their use, which encourages work in optimizing their content (lots of the stuff you need, and little else). But binary logs are less than half the size (cheaper to take and keep twice as much info), and a program that formats them can take arguments about which records/fields you want and how you want them formatted ...
and getting the output the way you want it (whether for browsing or subsequent reprocessing) is a huge win. You quickly get used to running the log-processing command, and the benefits persist.

(d) If somebody really wants text logs for archival, it is completely trivial to run the output of the log-extractor through the formatter before writing it to disk ... so the in-memory format need not be tied to the on-disk format. The rotation code won't care.

An mmap'ed buffer, even a lockless one, is a simple beast. Do you really need a whole library just for that? Maybe I'm just old-fashioned.

IMHO, surprisingly few things involving large numbers of performance-critical threads turn out to be simple :-) For example:

If we are logging a lot, buffer management has the potential to become a bottleneck ... so we need to be able to allocate a record of the required size from the circular buffer with atomic instructions (at least in non-wrap situations). But if records are allocated and then filled, we have to consider how to handle the case where the filling is delayed and the reader catches up with an incomplete log record (e.g. skip it, or wait, and for how long?). And while we hope this will never happen, we have to deal with what happens when the writer catches up with the reader, or worse, an incomplete log block ... where we might have to determine whether or not the owner is deceased (making it safe to break his record lock) ... or should we simply take down the service at that point (on the assumption that something has gone very wrong)?

If we are going to use multiple buffers, we may have to do a transaction dance (the last guy in has to close this buffer to new writes and start a new one, and somebody has to wait for pending additions to complete, queue this one for delivery, or perhaps even flush it to disk if we don't have some other thread/process doing this).
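The allocate-then-fill pattern discussed above can be made concrete with a sketch. This is hypothetical illustration code, not a proposal for Ceph's actual buffer: a writer claims space with one atomic add, and the incomplete-record hazard shows up as a committed flag the reader must check. Wrap-around and the writer-catches-reader case are deliberately left out, since they are exactly the hard part described above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <new>

struct RecordHeader {
    uint32_t size;                    // total bytes, header included
    std::atomic<uint32_t> committed;  // 0 while the writer is still filling
};

class RingLog {
public:
    RingLog(char* buf, size_t len) : buf_(buf), len_(len), head_(0) {}

    // Reserve space for a payload; returns nullptr once the
    // (non-wrapping, for simplicity) buffer is exhausted.
    RecordHeader* reserve(uint32_t payload_bytes) {
        uint32_t total = sizeof(RecordHeader) + payload_bytes;
        total = (total + 7u) & ~7u;            // keep records aligned
        size_t off = head_.fetch_add(total);   // the single atomic claim
        if (off + total > len_) return nullptr;
        auto* h = new (buf_ + off) RecordHeader;
        h->size = total;
        h->committed.store(0, std::memory_order_relaxed);
        return h;
    }

    static void commit(RecordHeader* h) {
        // Release ordering publishes the filled payload to the reader.
        h->committed.store(1, std::memory_order_release);
    }

private:
    char* buf_;
    size_t len_;
    std::atomic<size_t> head_;
};
```

A reader that encounters a record with `committed == 0` faces precisely the skip-or-wait decision the text raises.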
Re: ceph article for ;login:
Title: Ceph: a next generation, open source storage platform - Part I
Target size: 2K-3K words
Scope: CRUSH, RADOS, RGW, RBD (leave CephFS for part II)
Depth: os/storage concepts for literate non-programmers

Audience: Engineers, admins, and technically inclined Linux community members who may or may not have heard of Ceph before, but might find it interesting to learn more about the domain, technology, and product.

Goals:
  Generate positive buzz, leading more potential adopters (or even community members) to investigate further
  contributing to our image as smart people, great technology, and a high-road endeavor

Strategy:
  primary focus is technical
  educate readers about the challenges of petabyte-scale storage
  educate readers about effective approaches to addressing them
  showcase RADOS as an example of such solutions
  inspire people with the value and idealism of open source

Anti-Strategies:
  do not pitch the company/business (let technology do that)
  do not dis existing products (we're above that)

Suggested Table of Contents and word budget:
  1. introduction and challenges of petabyte-scale storage (200)
  2. solutions to these challenges (300-400)
  3. CRUSH and RADOS architecture (500 + pictures)
  4. how we address those challenges (400 + pictures)
  5. examples of products built on top of RADOS
     RBD: concept, layering, functionality, value (200 + picture)
     RGW: concept, layering, functionality, value (200 + picture)
  6. why open source is the right thing here (200)
  7. state of the project ... 50% technology, 25% community, 25% company (200)
  8. how to get involved (100)
Re: efficient removal of old objects
On 01/31/12 17:02, Tommi Virtanen wrote: To make my point even clearer: point me to another data store that has that idiom.

(a) Automatic expiration and deletion is, and has long been, a standard feature of archival systems ... and our RADOS clouds are much larger than most archival systems.

(b) I have no competent opinions on the short-term solution to this particular problem, but in the longer term I do not believe that garbage collection can or should be entrusted to clients. Clients are ephemeral and cannot be depended on to remember, a few years (or even hours) from now, that there were some files they were supposed to delete.

IMHO, object store intelligence is not merely about background replication and migration, but about "being able to take responsibility for the life cycle of the data they hold". The amount of data we store will quickly grow beyond the ability of external agents to manage it, and lifecycle automation will become increasingly critical.
Re: Problem while reading the paper about CRUSH
De-clustering means ensuring that the objects which all have one copy on a single volume have their other copies spread all over the cloud. This enables many-to-many recovery. The weights bias selection, e.g. so that we can discourage placement on a device that is more full. A "metric" is a unit or means of measuring an interesting quantity.

---mark---
Re: towards a user-mode diagnostic log mechanism
On 01/05/12 20:09, Colin McCabe wrote: Getting the system time is a surprisingly expensive operation, and this poses a problem for logging system designers. You can use the rdtsc CPU instruction, but unfortunately on some architectures CPU frequency scaling makes it very inaccurate. Even when it is accurate, it's not synchronized between multiple CPUs. Another option is to omit the time for most messages and just have a periodic timestamp that gets injected every so often -- perhaps every second, for example.

I agree it needs to be cheap ... but my experience with debugging problems in this sort of system suggests that we need the finest-grained timestamps we can get (on every single message). Even though the clocks on different nodes are not that closely synchronized, computing the relative offsets from initial transactions isn't hard ... and then it becomes possible to construct a total ordering of events with accurate timings.

Pantheios and log4cpp are two potential candidates. I don't know that much about either, unfortunately.

Good suggestions. I am also looking at varnish (suggested by Wido den Hollander), which does logging in a shared memory segment from which external processes can save it (or not). What was (for me) a new idea here is the clean decoupling of in-memory event capture from on-disk persistence and log rotation. After I thought about it for a few minutes, I concluded it had many nice consequences.

Honestly, I feel like logging is something that you will end up customizing pretty extensively to suit the needs of your application. But perhaps it's worth checking out what these libraries provide -- especially in terms of efficiency.

I agree that the data captured is going to be something we hone based on experience. I am (for now) most concerned with the mechanism ... because I don't want to start making big investments in instrumentation until we have a good mechanism, around which we can fix the APIs that instrumentation will use.
I'll try to review the suggested candidates and describe the mechanisms and advantages of each in another two weeks. Thank you very much for the feedback.
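Computing relative clock offsets from initial transactions, as suggested above, can be sketched with a Cristian-style estimate. This is an illustrative sketch, not anything in Ceph: a client records send and receive times around a request, the server reports its own clock, and half the round trip approximates the one-way delay.

```cpp
// Estimate how far the server's clock is ahead of the client's.
// t_send / t_recv are client-clock timestamps around a request;
// t_server is the server's clock reading while handling it.
double estimate_offset(double t_send, double t_recv, double t_server) {
    double rtt = t_recv - t_send;
    // The server presumably answered near the midpoint of the round trip.
    return t_server - (t_send + rtt / 2.0);
}
```

Applying a per-node offset to every log timestamp lets entries from different nodes be merged into the single total ordering described above.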
towards a user-mode diagnostic log mechanism
I'd like to keep this ball moving ... as I believe that the limitations of our current logging mechanisms are already making support difficult, and that is about to become worse. As a first step, I'd just like to get opinions on the general requirements we are trying to satisfy, and decisions we have to make along the way. Comments?

I. Requirements
   A. Primary Requirements (must have)
      1. information captured
         a. standard: time, sub-system, level, proc/thread
         b. additional: operation and parameters
         c. extensible for new operations
      2. efficiency
         a. run-time overhead < 1% (I believe this requires delayed-flush circular buffering)
         b. persistent space O(gigabytes per node-year)
      3. configurability
         a. capture level per sub-system
      4. persistence
         a. flushed out on process shut-down
         b. recoverable from user-mode core-dumps
      5. presentation
         a. output can be processed w/ grep, less, ...
   B. Secondary Requirements (nice to have)
      1. ease of use
         a. compatible with/convertible from existing calls
         b. run-time definition of new event records
      2. configurability
         a. size/rotation rules per sub-system
         b. separate in-memory/on-disk capture levels

II. Decisions to be made
   A. Capture circumstances
      1. some subset of procedure calls (I'm opposed to this, but it is an option)
      2. explicit event-logging calls
   B. Capture format
      1. ASCII text
      2. per-event binary format
      3. binary header + ASCII text
   C. Synchronization
      1. per-process vs per-thread buffers
   D. Flushing
      1. last writer flushes vs dedicated thread
      2. single- vs double-buffered output
   E. Available open-source candidates
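As an illustration of decision II.B.3 (binary header + ASCII text), a record might pair a fixed self-describing header with free-form message text. The field choices and magic value here are hypothetical, not a proposed on-disk format:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Fixed binary header: cheap to write, self-identifying for crash
// recovery, and uniform enough for tools to filter on. The message
// text that follows stays human-recoverable even without the tools.
struct LogHeader {
    uint32_t magic;      // self-identifying marker (illustrative value)
    uint32_t length;     // bytes of text that follow the header
    uint64_t timestamp;  // nanoseconds, as fine-grained as available
    uint16_t subsystem;
    uint16_t level;
    uint32_t thread_id;
};

// Serialize one record into a caller-provided buffer; returns bytes used.
size_t encode(char* out, const LogHeader& h, const std::string& text) {
    LogHeader hdr = h;
    hdr.magic = 0x10660666;  // arbitrary illustrative magic number
    hdr.length = static_cast<uint32_t>(text.size());
    std::memcpy(out, &hdr, sizeof(hdr));
    std::memcpy(out + sizeof(hdr), text.data(), text.size());
    return sizeof(hdr) + text.size();
}
```

The header stays fixed-size and versionable, which serves the "compatible with existing calls" and "recoverable from core-dumps" requirements above.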
moving towards release criteria
At present we are running automated nightlies to catch problems that slip past developers or only show up on long runs, and filing bugs when they fail ... but the decision of whether or not we are ready to push out a new release is not yet criterion-based. We have to start moving towards official release criteria.

I suggest that our initial release criteria should fall into four categories:

(1) functional validation: 100% passage of designated validation suites, with a formal process for managing the functional assertions to be tested (or designating specific assertions to be compliance-optional)

(2) regression tests: 100% passage of designated regression suites, with a formal process for designating which bugs do and do not require the creation of new regression test cases.

(3) performance: individual and aggregate throughput measurements, and key-event timings, will be made with controlled loads on a specified hardware configuration and compared against performance targets, with a formal process for defining the target metrics and requirements.

(4) reliability and robustness: a specified number of hours of (client-perceived) error-free operation under continuous load (with specified levels and characteristics), in the face of specified error injections ... with a formal process for defining the times, load characteristics, error injections, and acceptable performance.

Does this seem like the right general form for our release criteria? What changes would you suggest?

Once we agree on the general form of our release criteria, the next steps are:

(a) put some stakes in the ground for the initial requirements (knowing that they will evolve in scope, specificity, and rigour)

(b) propose some processes for the review, approval, and evolution of those standards, and the communication of the (current) requirements and results to the community.

(c) set a date for the first release to be subject to these criteria

comments?
Re: The costs of logging and not logging
I'm a big believer in asynchronous flushes of an in-memory ring-buffer. For user-mode processes a core-dump grave-robber can reliably pull out all of the un-flushed entries ... and the same process will also work for the vast majority of all kernel crashes.

String logging is popular because:
 1. It is trivial to do (in the instrumented code)
 2. It is trivial to read (in the recovered logs)
 3. It is easily usable by grep/perl/etc type tools

But binary logging:
 1. is much (e.g. 4-10x) smaller (especially for standard header info)
 2. is much faster to take (no formatting, less data to copy)
 3. is much faster to process (no re-parsing)
 4. is smaller to store on disk and faster to ship for diagnosis

and a log dumper can quickly produce output that is identical to what the strings would have been. So I also prefer binary logs ... even though they require the importation of additional classes. But ...

 (a) the log classes must be kept upwards compatible so that old logs can be read by newer tools.
 (b) the binary records should glow in the dark, so that they can be recovered even from corrupted ring-buffers and blocks whose meta-data has been lost.

I see two main issues with the slowness of the current logs:
 - all of the string rendering in the operator<<()'s is slow. Things like prefixing every line with a dump of the pg state are great for debugging, but make the overhead very high. We could scale all of that back, but it'd be a big project.
 - the logging always goes to a file, synchronously. We could write to a ring buffer and either write it out only on crash, or (at the very least) write it async.

I wonder, though, if something different might work. gcc lets you arbitrarily instrument function calls with -finstrument-functions.
Something that logs function calls and arguments to an in-memory ring buffer and dumps that out on crash could potentially have a low overhead (if we restrict it to certain code) and would give us lots of insight into what happened leading up to the crash.

This does have the advantage of being automatic ... but it is much more information, perhaps without much more value. My experience with logging is that you don't have to capture very much information, and that in fact we often go back to weed out no-longer-interesting information. Not only does too much information take cycles and space, but it also makes it hard to find "the good stuff". I think that human architects can make very good decisions about what information should be captured, and when.
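The "glow in the dark" property argued for above can be sketched as a magic-marker scan. This is hypothetical code: a grave-robber tool searches raw memory for record markers without trusting any buffer metadata, so it works even on corrupted ring-buffers or core dumps.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative marker value; any distinctive constant would do.
constexpr uint32_t LOG_MAGIC = 0x106f10ad;

// Scan a raw byte range and return the offset of every record marker
// found. No header, index, or buffer metadata is consulted, which is
// exactly what makes the records recoverable from damaged memory.
std::vector<size_t> find_records(const char* buf, size_t len) {
    std::vector<size_t> offsets;
    for (size_t i = 0; i + sizeof(uint32_t) <= len; ++i) {
        uint32_t v;
        std::memcpy(&v, buf + i, sizeof(v));  // unaligned-safe read
        if (v == LOG_MAGIC) offsets.push_back(i);
    }
    return offsets;
}
```

A real tool would follow each marker with a length and checksum so that partially overwritten records can be detected and skipped.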
The costs of logging and not logging
The bugs we most dread are situations that only happen rarely, and are only detected long after the damage has been done. Given the business we are in, we will face many of them. We apparently have such bugs open at this very moment.

In most cases, the primary debugging tools one has are audit and diagnostic logs ... which WE do not have, because they are too expensive (being synchronously written with C++ streams) to leave enabled all the time. I think it is a mistake to think of audit and diagnostic logs as a tool to be turned on when we have a problem to debug. There should be a basic level of logging that is always enabled (so we will have data after the first instance of the bug) ... which can be cranked up from verbose to bombastic when we find a problem that won't yield to more moderate interrogation:

 (a) after the problem happens is too late to start collecting data.
 (b) these logs are gold mines of information for a myriad of purposes we cannot yet even imagine.

This can only be done if the logging mechanism is sufficiently inexpensive that we are not afraid to use it:
 - low execution overhead from the logging operations
 - reasonable memory costs for buffering
 - small enough on disk that we can keep them for months

Not having such a mechanism is (if I correctly understand) already hurting us for internal debugging, and will quickly cripple us when we have customer (i.e. people who cannot diagnose problems for themselves) problems to debug.

There are many tricks to make logging cheap, and the sizes acceptable. There are probably a dozen open-source implementations that already do what we need, and if they don't, something basic can be built in a two-digit number of hours. The real cost is not in the mechanism but in adapting existing code to use it. This cost can be mitigated by making the changes opportunistically ... one component at a time, as dictated by need/fear. But we cannot make that change-over until we have a mechanism.
Because the greatest cost is not the mechanism, but the change-over, we should give more than passing thought to what mechanism to choose ... so that the decision we make remains a good one for the next few years. This may be something that we need to do sooner, rather than later.

regards,
---mark---