Re: ceph replication and data redundancy
Hi everybody,

While searching the documentation, I found information in the section on adding/removing a monitor about the Paxos system used to establish quorum. According to the documentation, in a catastrophe scenario I need to remove the monitors configured in the other buildings. For better efficiency, I think I'll keep one monitor per building and, if the two other buildings fail, delete those two monitors from the configuration in order to access the data again. I'll simulate that and see if it goes well.

Thanks for your help and advice.

Regards,
--
Gomes do Vale Victor
System, Network and Security engineer.

2013/1/20 Gregory Farnum g...@inktank.com:

(Sorry for the blank email just now, my client got a little eager!)

Apart from the things that Wido has mentioned, you say you've set up 4 nodes and each one has a monitor on it. That's why you can't do anything when you bring down two nodes: the monitor cluster requires a strict majority in order to continue operating, which is why we recommend odd numbers. If you set up a different node as a monitor (simulating one in a different data center) and then bring down two nodes, things should keep working.
-Greg

On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:

Hi,

On 01/17/2013 10:55 AM, Ulysse 31 wrote:

Hi all,

I'm not sure if this is the right mailing list; if not, sorry for that, tell me the appropriate one and I'll go there.

Here is my current project: the company I work for has several buildings, linked to each other by gigabit trunk links, which lets us have multiple machines on the same LAN in different buildings. We need to archive some data (5 to 10 TB), but we want that data present in each building so that, if a building is lost (catastrophe scenario), we still have the data.
Rather than using simple storage machines synced by rsync, we thought of re-using older desktop machines we have in stock and building a clustered filesystem on them. Speed is clearly not the goal of this data storage; we would just store old projects on it from time to time and access it in rare cases. The most important thing is to keep that data archived somewhere.

Ok, keep that in mind. All writes to RADOS are synchronous, so if you experience high latency or some congestion on your network, Ceph will become slow.

I was interested in Ceph because we can declare, using the CRUSH map, a hierarchical way to place replicated data. So for a test, I built a sample cluster composed of 4 nodes, installed under Debian squeeze with the current bobtail stable version of Ceph. In my sample I wanted to simulate 2 nodes per building; each node has a 2 TB disk and runs mon/osd/mds (I know it is not optimized, but it is just a sample), the OSD uses XFS on /dev/sda3, and I made a CRUSH map like this:

---
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host server-0 {
    id -2    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.0 weight 1.000
}
host server-1 {
    id -5    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.1 weight 1.000
}
host server-2 {
    id -6    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.2 weight 1.000
}
host server-3 {
    id -7    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0    # rjenkins1
    item osd.3 weight 1.000
}
rack bat0 {
    id -3    # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item server-0 weight 1.000
    item server-1 weight 1.000
}
rack bat1 {
    id -4    # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item server-2 weight 1.000
    item server-3 weight 1.000
}
root root {
    id -1    # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0    # rjenkins1
    item bat0 weight 3.000
    item bat1 weight 3.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take root
    step chooseleaf firstn 0 type rack
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take root
    step chooseleaf firstn 0 type rack
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take root
    step chooseleaf firstn 0 type rack
    step emit
}

# end crush map
---

Using this CRUSH map, coupled with a default pool data size of 2 (replication 2), allowed me to be sure to have a duplicate of all data in both sample buildings, bat0 and bat1. Then I mounted it on a client with ceph-fuse, using:

ceph-fuse -m server-2:6789 /mnt/mycephfs

(server-2 is located in bat1). Everything works fine as expected: I can write/read data from one or more clients, no problems there. Just to
Re: ceph replication and data redundancy
On 01/21/2013 08:14 AM, Ulysse 31 wrote:

Hi everybody,

While searching the documentation, I found information in the section on adding/removing a monitor about the Paxos system used to establish quorum. According to the documentation, in a catastrophe scenario I need to remove the monitors configured in the other buildings. For better efficiency, I think I'll keep one monitor per building and, if the two other buildings fail, delete those two monitors from the configuration in order to access the data again. I'll simulate that and see if it goes well.

Thanks for your help and advice.

If you are set on that approach, you could just as well add a third monitor in one of the buildings (whichever you feel to be more resilient), and cut down the chances of an unavailable cluster if the other one fails. It doesn't solve your problem, but if the building with just one monitor fails, your cluster will still be available; if it's the other way around, you could do the manual recovery just the same anyway.

-Joao

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
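To make the strict-majority rule from this thread concrete, here is a minimal Python sketch that checks which monitor layouts keep quorum when one site is lost. It is only arithmetic on monitor counts, not Ceph code; the site names and the helper names (`has_quorum`, `survives`) are made up for illustration.

```python
def has_quorum(total_monitors, alive_monitors):
    """Strict majority: more than half of all configured monitors must be up."""
    return alive_monitors > total_monitors // 2


def survives(layout, failed_site):
    """Does the monitor cluster keep quorum if every monitor at failed_site is lost?"""
    total = sum(layout.values())
    alive = total - layout[failed_site]
    return has_quorum(total, alive)


# Monitor counts per site; the layouts mirror the ones discussed in the thread.
layouts = {
    "4 monitors, 2 per building": {"bat0": 2, "bat1": 2},
    "2 + 1 monitors": {"bat0": 2, "bat1": 1},
    "1 + 1 + neutral site": {"bat0": 1, "bat1": 1, "neutral": 1},
}

for name, layout in layouts.items():
    for site in layout:
        status = "quorum kept" if survives(layout, site) else "quorum lost"
        print(f"{name}: lose {site} -> {status}")
```

With an even split, losing either building leaves exactly half of the monitors, which is not a majority; the 2 + 1 layout survives only the loss of the smaller site, and a third monitor at a neutral site survives the loss of either building.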
[PATCH] Always Signal() the first Cond when changing the maximum
Removes a test by which the waiting queue is only Signal()ed if the new maximum is lower than the current maximum. There is no evidence of a use case where such a restriction would be useful. In addition, waking up a thread when the maximum increases gives it a chance to immediately continue the suspended process instead of waiting for the next put().

For additional context see the discussion at http://marc.info/?t=13586893831r=1w=4

Signed-off-by: Loic Dachary l...@dachary.org
---
 src/common/Throttle.cc |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/common/Throttle.cc b/src/common/Throttle.cc
index 844263a..82ffe7a 100644
--- a/src/common/Throttle.cc
+++ b/src/common/Throttle.cc
@@ -65,7 +65,7 @@ Throttle::~Throttle()
 void Throttle::_reset_max(int64_t m)
 {
   assert(lock.is_locked());
-  if (m < ((int64_t)max.read()) && !cond.empty())
+  if (!cond.empty())
     cond.front()->SignalOne();
   logger->set(l_throttle_max, m);
   max.set((size_t)m);
--
1.7.10.4
Re: ceph replication and data redundancy
On 01/21/2013 02:08 PM, Joao Eduardo Luis wrote:

On 01/21/2013 08:14 AM, Ulysse 31 wrote:

Hi everybody,

While searching the documentation, I found information in the section on adding/removing a monitor about the Paxos system used to establish quorum. According to the documentation, in a catastrophe scenario I need to remove the monitors configured in the other buildings. For better efficiency, I think I'll keep one monitor per building and, if the two other buildings fail, delete those two monitors from the configuration in order to access the data again. I'll simulate that and see if it goes well.

Thanks for your help and advice.

If you are set on that approach, you could just as well add a third monitor in one of the buildings (whichever you feel to be more resilient), and cut down the chances of an unavailable cluster if the other one fails. It doesn't solve your problem, but if the building with just one monitor fails, your cluster will still be available; if it's the other way around, you could do the manual recovery just the same anyway.

Another approach: if possible, try to add a 3rd monitor in a neutral place. I don't know what your network looks like, but you might be able to put up a monitor in an external datacenter and do something with a VPN? That assumes both buildings have their own external internet connection.

Wido

-Joao
Re: Throttle::wait use case clarification
On 01/21/2013 12:02 AM, Gregory Farnum wrote:

On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote:

Hi,

While working on unit tests for Throttle.{cc,h} I tried to figure out a use case related to the Throttle::wait method but couldn't: https://github.com/ceph/ceph/pull/34/files#L3R258

Although it was not a blocker and I managed to reach 100% coverage anyway, it got me curious and I would very much appreciate pointers to understand the rationale.

wait() can be called to set a new maximum before waiting for all pending threads to get what they asked for. Since the maximum has changed, wait() wakes up the first thread: the conditions under which it decided to go to sleep have changed and the conclusion may be different. However, it only does so when the new maximum is less than the current one. For instance:

A) decision does not change
   max = 10, current 9
   thread 1 tries to get 5 but only 1 is available, it goes to sleep
   wait(8)
   max = 8, current 9
   wakes up thread 1
   thread 1 tries to get 5 but current is already beyond the maximum, it goes to sleep

B) decision changes
   max = 10, current 1
   thread 1 tries to get 10 but only 9 is available, it goes to sleep
   wait(9)
   max = 9, current 1
   wakes up thread 1
   thread 1 tries to get 10 which is above the maximum: it succeeds because current is below the new maximum

It will not wake up a thread if the maximum increases, for instance:

   max = 10, current 9
   thread 1 tries to get 5 but only 1 is available, it goes to sleep
   wait(20)
   max = 20, current 9
   does *not* wake up thread 1
   keeps waiting until another thread put(N) with N >= 0, although there now is 11 available and it would allow it to get 5 out of it

Why is it not desirable for thread 1 to wake up in this case? When debugging a real-world situation, I think it would show up as a thread blocked although the throttle it is waiting on has enough to satisfy its request. What am I missing?

Cheers

Looking through the history of that test (in _reset_max), I think it's an accident and we actually want to be waking up the front if the maximum increases (or possibly in all cases, in case the front is a very large request we're going to let through anyway). Want to submit a patch? :)

:-) Here it is. make check does not complain. I've not run teuthology + qa-suite though. I figured out how to run teuthology but did not yet try qa-suite.

http://marc.info/?l=ceph-devel&m=135877502606311&w=4

The other possibility I was trying to investigate is that it had something to do with handling get() requests larger than the max correctly, but I can't find any evidence of that one...

I've run the Throttle unit tests after uncommenting https://github.com/ceph/ceph/pull/34/files#L3R269 and commenting out https://github.com/ceph/ceph/pull/34/files#L3R266 and it passes.

I'm not sure if I should have posted the proposed Throttle unit test to the list instead of proposing it as a pull request https://github.com/ceph/ceph/pull/34 What is best?
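The three scenarios in this message can be replayed with a small Python model of the decision a sleeping thread re-evaluates when it is woken. This is a sketch under an assumption: the rule in `should_wait` is inferred from the scenarios described in the message, not copied from Throttle.cc, and the function name is hypothetical.

```python
def should_wait(maximum, current, request):
    """Return True if a get(request) would have to sleep.

    Assumed rule, inferred from the scenarios in the message:
    - a request that fits under the maximum waits while current + request
      would exceed the maximum;
    - a request larger than the maximum waits only while current is
      already above the maximum.
    """
    if request <= maximum:
        return current + request > maximum
    return current > maximum


# (label, old max, new max, current, request), taken from the message.
scenarios = [
    ("A) wait(8)", 10, 8, 9, 5),
    ("B) wait(9)", 10, 9, 1, 10),
    ("increase, wait(20)", 10, 20, 9, 5),
]

for label, old_max, new_max, current, request in scenarios:
    before = should_wait(old_max, current, request)
    after = should_wait(new_max, current, request)
    print(f"{label}: sleeps before={before}, would still sleep after={after}")
```

In the third scenario the model says the thread no longer needs to sleep once the maximum is 20, which is why never signalling it on an increase leaves it blocked until the next put().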
Re: Ceph docs page down
Hi, On 01/21/2013 05:11 PM, Travis Rhoden wrote: I suspect someone is working on it, but http://ceph.com/docs/master/ is returning HTTP 404. I don't know if anyone is working on it, but in the meantime, use: http://eu.ceph.com/docs/master/ Wido - Travis
Re: Ceph docs page down
They're back up now. They're auto-generated from git, the build failed due to an intermittent network connectivity problem, and eventually the old content timed out because master wasn't updated over the weekend. Need to special-case the docs gitbuilder, or increase the timeout, or something so that this doesn't happen again.. sage On Mon, 21 Jan 2013, Wido den Hollander wrote: Hi, On 01/21/2013 05:11 PM, Travis Rhoden wrote: I suspect someone is working on it, but http://ceph.com/docs/master/ is returning HTTP 404. I don't know if anyone is working on it, but in the meantime, use: http://eu.ceph.com/docs/master/ Wido - Travis
Re: Inktank team @ FOSDEM 2013 ?
Hi guys,

See you at FOSDEM ;-)

Cheers,
--
Regards,
Sébastien Han.

On Sun, Jan 20, 2013 at 6:13 PM, Constantinos Venetsanopoulos c...@grnet.gr wrote:

Hello Loic, Sebastien, Patrick,

that's great news! I'm sure we'll have some very interesting stuff to talk about. Saturday 14:00 @ K.3.201 also seems fine.

Thanks,
Constantinos

On 1/20/2013 6:19 PM, Patrick McGarry wrote:

Hey guys,

I will be attending from Inktank and would love to meet all of you folks. Also, for what it's worth, the wiki has been deprecated and mostly redirects to the doc. If you are trying to take a look at Sebastien's page you'll most likely have to use: http://wiki.ceph.com/deprecated/FOSDEM2013

Thanks.

Best Regards,
Patrick

On Sun, Jan 20, 2013 at 6:10 AM, Loic Dachary l...@dachary.org wrote:

On 01/20/2013 10:43 AM, Constantinos Venetsanopoulos wrote:

Hello,

I'd like to ask if anybody from the team will be attending FOSDEM 2013 [1], held in Brussels. I'm just asking because we will be giving a talk there in the cloud devroom and will be referencing our experiences with RADOS. So it would be nice if we could also meet in person with any of you guys.

Hi,

Sebastien Han and myself will be there (but we're not with Inktank ;-). It would be great to organize an informal meeting to discuss Ceph, and I've just added a page to the wiki (defunct but still useful) listing our names: http://ceph.com/wiki/FOSDEM2013

Cheers
Re: Concepts of whole cluster snapshots/backups and backups in general.
For anyone looking for a solution, I want to outline the one I will go with in my upcoming setup.

First, I want to say I'm looking forward to the geo-replication feature, which hopefully includes async replication, and if there is someday some sort of snapshotted replica, that would be another awesome thing.

But for the time being:

For object storage (radosgw) I will use the admin function to retrieve all buckets, retrieve all objects within these buckets, and then back them up into another offsite Ceph cluster or onto some distributed logic on top of simple rsync + fs.

For block storage (rbd) I will use the snapshot feature to snapshot all images for a non-corrupt and up-to-date backup, and then back them up into another offsite Ceph cluster or onto some distributed logic on top of simple rsync + fs.

For the distributed fs (cephfs) I could use the snapshot feature too, but I'm not yet using cephfs, so it's just a consideration.

Some considerations:

I will probably go with an offsite Ceph cluster with 2 replicas, because this will distribute the load and make retrieval and so on easier. (It could be accessible read-only to users...)

The actual data retrieval from the live Ceph cluster will probably be done by some chef script running from time to time.

With the approach of using Ceph at the backend, buckets could be named after the date/time of the actual snapshot and therefore provide reverting capabilities. (planned: monthly and 3 days)

Some things to keep in mind:

Copying all data will drain the network bandwidth.

Keeping snapshots of the data of a production environment at scale with a replica count of 2 seems like a lot of storage is thrown away.
- use 3+ replicas in production and perhaps only 1 replica in the offsite backup?

With 3 replicas + 1 offsite replica + 1 monthly snapshot + 2-3 daily snapshots, the total amount of storage needed for 1 TB of data would be 7-8 TB of actual storage. It seems like a crazy idea to make 4-5 additional copies... (perhaps only keep 1 monthly and the last day or 2)

Just a few ideas I wanted to throw at anyone interested.

Cheers
Michael

On Tue, Jan 15, 2013 at 10:36 PM, Michael Grosser m...@seetheprogress.com wrote:

Hey,

within this mail, "data" refers to RADOS chunks, i.e. the actual data behind fs/object/block storage.

I was thinking about different scenarios which could lead to data loss.

1. The usual stupid customer deleting some important data.
2. The not so usual totally corrupted cluster after an upgrade or the like.
3. The fun-to-think-about "datacenter struck by [disaster], nothing left" scenario.

While thinking about these scenarios, I wondered how these disasters and the mentioned data loss could be prevented. Telling customers data is lost, be it self-inflicted or nature-inflicted, is nothing you want or should need to do.

But what are the technical solutions to provide another layer of disaster recovery (not just one datacenter with n replicas)?

Some ideas which came to mind:

1. Snapshotting (ability to get user-deleted files + revert to an old state after corruption)
2. Offsite backup (ability to recover from a lost datacenter)

With these ideas a few problems came to mind.

Is it cost-effective to back up the whole cluster (it would probably back up all replicas, which is not good at all)?

Is there a way to snapshot the current state and back it up to some offsite server array, which could be another Ceph cluster or a NAS?

Do you really want to snapshot the non-readable Ceph objects from RADOS? Shouldn't a backup always be readable?

The simplest solution, which darkfaded from IRC came up with, was using special replicas. Additional replicas which only sync hourly, daily or monthly and detach after the sync could be a solution. But how could that be done?

Some benefits of this solution:
1. Readable, because it could be a fully functioning cluster. Doable? Is there a need to replicate gateways etc., or could that be integrated within a special replica backup?
2. Easy recovery, just make the needed replica the master.
3. No new system. Ceph in and out \o/
4. Offsite backup possibility.
5. Versioned states via different replicas: hourly, daily, monthly

Some problems:
1. strain on the Ceph cluster when the sync is done for each special replica
2. additional disk space needed (could be double the already used amount when using 3 replicas with one current, one daily, one monthly replica)
3. more costs
4. more complex solution?

Could someone shed some light on how to have replicas without the write having to be acknowledged for every replica, so that they are only a mirror instead of a full replica? Could this replica-based backup be used as a current snapshot in another datacenter? Wouldn't that be the async feature, which isn't really possible yet?

I hope this mail is not too cluttered, and I'm looking forward to the thread about it. Hopefully we can not only collect some ideas and solutions, but also hear about some current implementations from some bigger players.

Cheers
Michael
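The storage estimate in the follow-up above (7-8 TB of raw storage for 1 TB of data) is simple enough to check with a few lines of Python. This sketch assumes, as the message's own estimate does, that every snapshot is kept as a full copy; the function and parameter names are made up for illustration.

```python
def raw_storage_tb(data_tb, prod_replicas, offsite_replicas,
                   monthly_snapshots, daily_snapshots):
    """Raw storage needed if every replica and snapshot is a full copy of the data."""
    copies = prod_replicas + offsite_replicas + monthly_snapshots + daily_snapshots
    return data_tb * copies


# Plan from the message: 3 replicas + 1 offsite + 1 monthly + 2-3 daily snapshots.
for daily in (2, 3):
    print("full plan, daily snapshots =", daily, "->",
          raw_storage_tb(1, 3, 1, 1, daily), "TB")

# Reduced plan suggested at the end: 1 monthly and only the last day or two.
for daily in (1, 2):
    print("reduced plan, daily snapshots =", daily, "->",
          raw_storage_tb(1, 3, 1, 1, daily), "TB")
```

Snapshots that only store changed data would need less than a full copy each, so these figures are an upper bound for that case.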
Re: questions on networks and hardware
Thanks all for your responses! Some comments inline.

On Jan 20, 2013, at 10:16 AM, Wido den Hollander w...@widodh.nl wrote:

On 01/19/2013 12:34 AM, John Nielsen wrote:

I'm planning a Ceph deployment which will include:
10Gbit/s public/client network
10Gbit/s cluster network
dedicated mon hosts (3 to start)
dedicated storage hosts (multiple disks, one XFS and OSD per disk, 3-5 to start)
dedicated RADOS gateway host (1 to start)

I've done some initial testing and read through most of the docs but I still have a few questions. Please respond even if you just have a suggestion or response for one of them.

If I have cluster network and public network entries under [global] or [osd], do I still need to specify public addr and cluster addr for each OSD individually?

Already answered, but no. You don't need to. The OSDs will bind to the available IP in that network.

Nice. That's how I hoped/assumed it would work, but I have seen some configurations on the web that include both, so I wanted to make sure.

Which network(s) should the monitor hosts be on? If both, is it valid to have more than one mon addr entry per mon host or is there a different way to do it?

They should be on the public network since the clients also need to be able to reach the monitors.

It sounds (from other followups) like there is work on adding some awareness of the cluster network to the monitor but it's not there yet. I'll stay tuned. It would be nice if the monitors and OSDs together could form a reachability map for the cluster and give the option of using the public network for the affected subset of OSD traffic in the event of a problem on the cluster network. Having the monitor(s) connected to the cluster network might be useful for that...

Then again, a human should be involved if there is a network failure anyway; manually changing the network settings in the Ceph config as a temporary workaround is already an option, I suppose.

I'd like to have 2x 10Gbit/s NICs on the gateway host and maximize throughput. Any suggestions on how to best do that? I'm assuming it will talk to the OSDs on the Ceph public/client network, so does that imply a third even-more-public network for the gateway's clients?

No, there is no third network. You could trunk the 2 NICs with LACP or something. Since the client will open TCP connections to all the OSDs, you will get pretty good balancing over the available bandwidth.

Good suggestion, thanks. I actually just started using LACP to pair up 1Gb NICs on a small test cluster and it's proven beneficial, even with less-than-ideal hashing from the switch.

To put my question here a different way: suppose I really really want to segregate the Ceph traffic from the HTTP traffic and I set up the IPs and routing necessary to do so (3rd network idea). Is there any reason NOT to do that?

Is it worthwhile to have 10G NICs on the monitor hosts? (The storage hosts will each have 2x 10Gbit/s NICs.)

No, not really. 1Gbit should be more than enough for your monitors. 3 monitors should also be good. No need to go for 5 or 7.

Maybe I'll have the monitors just be on the public network but use LACP with dual 1Gbit NICs for fault tolerance (since I'll have the NICs onboard anyway).

I think this has come up before, but has anyone written up something with more details on setting up gateways? Hardware recommendations, strategies to improve caching and performance, multiple gateway setups with and without a load balancer, etc.

Not that I know of. I'm still trying to play with RGW and Varnish in front of it, but haven't really taken the time yet. The goal is to offload a lot of the caching to Varnish and have Varnish 'ban' objects when they change.

You could also use Varnish as a load balancer. But in this case you could also use RR-DNS or LVS with Direct Routing; that way only incoming traffic goes through the load balancer and return traffic goes directly to your network's gateway.

I definitely plan to look into Varnish at some point; I'll be sure to follow up here if I learn anything interesting.

Thanks again,

JN
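As an illustration of Wido's remark that an OSD binds to whichever local IP falls inside the configured network, here is a minimal Python sketch of that selection rule using the standard `ipaddress` module. It is not Ceph's implementation; the subnets, host addresses and the `pick_address` helper are hypothetical.

```python
import ipaddress


def pick_address(local_addresses, network):
    """Return the first local address that lies inside the given network, or None."""
    net = ipaddress.ip_network(network)
    for addr in local_addresses:
        if ipaddress.ip_address(addr) in net:
            return addr
    return None


# Hypothetical storage host with a management, a public and a cluster interface.
host_addresses = ["192.168.5.7", "10.0.0.21", "10.0.1.21"]

public_network = "10.0.0.0/24"   # assumed value of the "public network" entry
cluster_network = "10.0.1.0/24"  # assumed value of the "cluster network" entry

print("public addr: ", pick_address(host_addresses, public_network))
print("cluster addr:", pick_address(host_addresses, cluster_network))
```

Under this rule, per-OSD public addr and cluster addr entries are only needed when a host has no address, or more than one address, in the configured subnet.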
Re: flashcache
On Jan 19, 2013, at 7:56 PM, Joseph Glanville joseph.glanvi...@orionvm.com.au wrote:

I assume it is now an EoIB driver. Does it replace the IPoIB driver?

Nope, it is an upper-layer thing: https://lwn.net/Articles/509448/

Aye, it's effectively a NAT translation layer that strips Ethernet headers and grafts on IPoIB headers, thus using the same wire protocol and allowing communication from EoIB to IPoIB. However, this approach is a little dirty and has been nacked by the netdev community, so we aren't likely to see it in the mainline kernel... basically ever.

Just to clarify:

EoIB has been around for a while (at least in the Mellanox software, not sure about mainline). It uses the mlx4_vnic module and is a true Ethernet encapsulation over InfiniBand. Unfortunately the newer Mellanox switches won't support it any more, and the ones that do have entered Limited Support. (Not to be confused with mlx4_en, which just turns a ConnectX card into a 10G Ethernet NIC.)

IPoIB is IP over InfiniBand without Ethernet (the data link layer is straight InfiniBand).

eIPoIB is (or will be, maybe) Ethernet over IP over InfiniBand. It is intended to work with both Linux bridging and regular IB switches that support IPoIB. (Allowing e.g. unmodified KVM guests on hypervisors connected to an IPoIB fabric.) Both Joseph's comments and the LWN link above are referring to eIPoIB. Last I heard (from a pretty direct source in the last couple of weeks), Mellanox is still working on this but doesn't have anything generally available yet. Here's hoping... but the feedback on netdev was quite negative, to put it mildly.

JN
Consistently reading/writing rados objects via command line
I would like to store some objects in rados and retrieve them in a consistent manner. In my initial tests, if I do a 'rados -p foo put test /tmp/test', then while it is uploading I can do a 'rados -p foo get test /tmp/blah' on another machine, and it will download a partially written file without returning an error code, so the downloader cannot tell the file is corrupt/incomplete.

My question is: how do I read/write objects in rados via the command line in such a way that the downloader does not get a corrupt or incomplete file? It's fine if it just returns an error on the client and I can try again; I just need to be notified on error.
Re: Consistently reading/writing rados objects via command line
On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote:

I would like to store some objects in rados and retrieve them in a consistent manner. In my initial tests, if I do a 'rados -p foo put test /tmp/test', then while it is uploading I can do a 'rados -p foo get test /tmp/blah' on another machine, and it will download a partially written file without returning an error code, so the downloader cannot tell the file is corrupt/incomplete.

My question is: how do I read/write objects in rados via the command line in such a way that the downloader does not get a corrupt or incomplete file? It's fine if it just returns an error on the client and I can try again; I just need to be notified on error.

You must be writing large-ish objects? By default the rados tool will upload objects 4MB at a time and you're trying to download mid-way through the full object upload. You can add a --block-size 20971520 to upload 20MB in a single operation, but make sure you don't exceed the osd max write size (90MB by default). This is all client-side stuff, though; from the RADOS object store's perspective, the file is complete after each 4MB write.

If you want something more sophisticated (like handling larger objects) you'll need to do at least some minimal tooling of your own, e.g. by setting an object xattr before starting and after finishing the file change, then checking for that presence when reading (and locking on reads or doing a check when the read completes). You can do that with the setxattr, rmxattr, and getxattr options.
-Greg
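The numbers in Greg's reply can be checked with a short Python sketch: how many write operations an upload is split into for a given block size, and whether an object is small enough to go up in a single write. The 4MB and 90MB figures are the defaults quoted in the message; the helper names are hypothetical and nothing here calls the rados tool.

```python
MIB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 4 * MIB   # default upload chunk quoted in the message
OSD_MAX_WRITE_SIZE = 90 * MIB  # default "osd max write size" quoted in the message


def writes_needed(object_size, block_size=DEFAULT_BLOCK_SIZE):
    """Number of separate write operations for an upload (ceiling division)."""
    return -(-object_size // block_size)


def single_write_block_size(object_size, osd_max_write=OSD_MAX_WRITE_SIZE):
    """Block size that uploads the object in one write, or None if it is too large."""
    return object_size if object_size <= osd_max_write else None


size = 20 * MIB
print(size)                                   # 20971520, the value used in the reply
print(writes_needed(size))                    # with the default 4MB chunks
print(writes_needed(size, block_size=size))   # with --block-size 20971520
print(single_write_block_size(100 * MIB))     # too large for a single write
```

Only the single-write case avoids a reader observing a partially uploaded object; anything larger than the OSD limit needs the marker approach described in the reply.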
Re: Consistently reading/writing rados objects via command line
On Mon, 21 Jan 2013, Gregory Farnum wrote:

On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote:

I would like to store some objects in rados and retrieve them in a consistent manner. In my initial tests, if I do a 'rados -p foo put test /tmp/test', then while it is uploading I can do a 'rados -p foo get test /tmp/blah' on another machine, and it will download a partially written file without returning an error code, so the downloader cannot tell the file is corrupt/incomplete.

My question is: how do I read/write objects in rados via the command line in such a way that the downloader does not get a corrupt or incomplete file? It's fine if it just returns an error on the client and I can try again; I just need to be notified on error.

You must be writing large-ish objects? By default the rados tool will upload objects 4MB at a time and you're trying to download mid-way through the full object upload. You can add a --block-size 20971520 to upload 20MB in a single operation, but make sure you don't exceed the osd max write size (90MB by default). This is all client-side stuff, though; from the RADOS object store's perspective, the file is complete after each 4MB write.

If you want something more sophisticated (like handling larger objects) you'll need to do at least some minimal tooling of your own, e.g. by setting an object xattr before starting and after finishing the file change, then checking for that presence when reading (and locking on reads or doing a check when the read completes). You can do that with the setxattr, rmxattr, and getxattr options.

With a bit of additional support in the rados tool, we could write to object $foo.tmp with key $foo, and then clone it into position and delete the .tmp.

If they're really big objects, though, you may also be better off with radosgw, which provides striping and atomicity.

sage
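The xattr-marker idea from Greg's reply could be wrapped around the rados command line roughly as follows. This is a sketch under several assumptions, not a tested recipe: the attribute name `upload_state` and its values are invented here, it assumes that setxattr works on an object that does not exist yet and that put leaves existing xattrs in place, and it uses a per-upload token (a variation on the reply's suggestion) so a reader can detect an upload that started and finished during its read.

```python
import subprocess
import uuid

MARKER = "upload_state"  # hypothetical xattr name, not a Ceph convention


def rados(pool, *args, run=subprocess.run):
    """Run `rados -p POOL ARGS...`; raises CalledProcessError on a non-zero exit."""
    result = run(["rados", "-p", pool, *args],
                 check=True, capture_output=True, text=True)
    return result.stdout


def put_marked(pool, obj, path, run=subprocess.run):
    """Upload a file, bracketing the write with a marker xattr."""
    rados(pool, "setxattr", obj, MARKER, "writing", run=run)
    rados(pool, "put", obj, path, run=run)
    # A fresh token per upload lets readers notice that the object was replaced.
    rados(pool, "setxattr", obj, MARKER, "complete:" + uuid.uuid4().hex, run=run)


def get_checked(pool, obj, path, run=subprocess.run):
    """Download a file, failing if an upload was in progress or happened meanwhile."""
    def state():
        return rados(pool, "getxattr", obj, MARKER, run=run).strip()

    before = state()
    if not before.startswith("complete:"):
        raise RuntimeError("object %s is still being written" % obj)
    rados(pool, "get", obj, path, run=run)
    if state() != before:
        raise RuntimeError("object %s changed during download, retry" % obj)
```

A writer would call `put_marked("foo", "test", "/tmp/test")` and a reader `get_checked("foo", "test", "/tmp/blah")`, retrying on `RuntimeError`. The `run` parameter exists only so the command runner can be replaced in tests.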
handling fs errors
We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed suicide. XFS seemed to unwedge itself a bit after that, as the daemon was able to restart and continue.

The problem is that during that 180s the OSD was claiming to be alive but not able to do any IO. That heartbeat check is meant as a sanity check against a wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the cluster quickly enough and client IO stalled.

We could simply change that timeout to something close to the heartbeat interval (currently default is 20s). That will make ceph-osd much more sensitive to fs stalls that may be transient (high load, whatever).

Another option would be to make the osd heartbeat replies conditional on whether the internal heartbeat is healthy. Then the heartbeat warnings could start at 10-20s, ping replies would pause, but the suicide could still be 180s out. If the stall is short-lived, pings will continue, the osd will mark itself back up (if it was marked down) and continue.

Having written that out, the last option sounds like the obvious choice. Any other thoughts?

sage
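The second option described above, pausing ping replies once the internal heartbeat is unhealthy while keeping the suicide timeout far out, can be sketched as a small state function. The thresholds are the figures from the message (a 20s grace and a 180s suicide timeout); the function and state names are hypothetical and do not correspond to actual ceph-osd code or configuration options.

```python
def osd_state(stall_seconds, grace=20.0, suicide_timeout=180.0):
    """Classify an OSD by how long its internal heartbeat has been stalled."""
    if stall_seconds >= suicide_timeout:
        return "suicide"      # give up and exit, as the daemon did after 180s
    if stall_seconds >= grace:
        return "unhealthy"    # warn and stop answering peer pings
    return "healthy"


def replies_to_ping(stall_seconds, grace=20.0, suicide_timeout=180.0):
    """Peers only get a ping reply while the internal heartbeat is healthy."""
    return osd_state(stall_seconds, grace, suicide_timeout) == "healthy"


for stall in (5, 30, 120, 200):
    print(stall, osd_state(stall), replies_to_ping(stall))
```

With this split, a stall longer than the grace period makes peers report the OSD as failed quickly, while a stall that clears before the suicide timeout lets the daemon resume replying and mark itself back up.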
Re: handling fs errors
On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com wrote:

We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed suicide. XFS seemed to unwedge itself a bit after that, as the daemon was able to restart and continue. The problem is that during that 180s the OSD was claiming to be alive but not able to do any IO. That heartbeat check is meant as a sanity check against a wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the cluster quickly enough and client IO stalled. We could simply change that timeout to something close to the heartbeat interval (currently default is 20s). That will make ceph-osd much more sensitive to fs stalls that may be transient (high load, whatever). Another option would be to make the osd heartbeat replies conditional on whether the internal heartbeat is healthy. Then the heartbeat warnings could start at 10-20s, ping replies would pause, but the suicide could still be 180s out. If the stall is short-lived, pings will continue, the osd will mark itself back up (if it was marked down) and continue. Having written that out, the last option sounds like the obvious choice. Any other thoughts?

Another option would be to have the osd reply to the ping with some health description.

Yehuda
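Yehuda's variant, where the ping reply carries a health description instead of being withheld, might look like the sketch below. Everything here is an assumption for illustration: the `PingReply` fields, the `Health` values, and the 60s tolerance a peer applies are not taken from Ceph.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    STALLED = "stalled"


@dataclass(frozen=True)
class PingReply:
    """Hypothetical ping reply that carries a health description."""
    osd_id: int
    health: Health
    stalled_for: float  # seconds since the internal heartbeat last progressed


def build_reply(osd_id: int, last_progress: float, now: float,
                grace: float = 20.0) -> PingReply:
    """Always reply, but describe whether local IO is making progress."""
    stalled_for = max(0.0, now - last_progress)
    health = Health.OK if stalled_for < grace else Health.STALLED
    return PingReply(osd_id, health, stalled_for)


def peer_should_report_failure(reply: PingReply, tolerate: float = 60.0) -> bool:
    """The peer, not the stalled OSD, decides how long a stall is acceptable."""
    return reply.health is Health.STALLED and reply.stalled_for >= tolerate
```

The difference from simply pausing replies is that the receiving side can tell a stalled but reachable OSD from an unreachable one and apply its own policy.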
Re: handling fs errors
On Tue, Jan 22, 2013 at 10:05 AM, Sage Weil s...@inktank.com wrote:

We observed an interesting situation over the weekend. The XFS volume ceph-osd locked up (hung in xfs_ilock) for somewhere between 2 and 4 minutes. After 3 minutes (180s), ceph-osd gave up waiting and committed suicide. XFS seemed to unwedge itself a bit after that, as the daemon was able to restart and continue. The problem is that during that 180s the OSD was claiming to be alive but not able to do any IO. That heartbeat check is meant as a sanity check against a wedged kernel, but waiting so long meant that the ceph-osd wasn't failed by the cluster quickly enough and client IO stalled. We could simply change that timeout to something close to the heartbeat interval (currently default is 20s). That will make ceph-osd much more sensitive to fs stalls that may be transient (high load, whatever). Another option would be to make the osd heartbeat replies conditional on whether the internal heartbeat is healthy. Then the heartbeat warnings could start at 10-20s, ping replies would pause, but the suicide could still be 180s out. If the stall is short-lived, pings will continue, the osd will mark itself back up (if it was marked down) and continue. Having written that out, the last option sounds like the obvious choice. Any other thoughts? sage

It seems possible to run into domino-style failure marks there if the lock is triggered frequently enough and depends only on the sheer amount of workload. By the way, was that fs aged, or were you able to catch the lock on a fresh one? And which kernel were you running there?

Thanks!