Re: ceph init script didn't stop the ceph.
ramu ramu.freesystems at gmail.com writes: No error messages are displayed, and the command 'ceph osd down 1' is not working either. When I run that command, ceph-osd.1.log shows the error message: map e38 wrongly marked me down.
Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)
Hi Sage, thanks for your response.

> If you turn off the journal completely, you will see bursty write commits from the perspective of the client, because the OSD is periodically doing a sync or snapshot and only acking the writes then. If you enable the journal, the OSD will reply with a commit as soon as the write is stable in the journal. That's one reason why it is there--file system commits are heavyweight and slow.

Yes, of course - I don't want to deactivate the journal; using a journal on a fast SSD or NVRAM is the right way.

> If we left the file system to its own devices and did a sync every 10 seconds, the disk would sit idle while a bunch of dirty data accumulated in cache, and then the sync/snapshot would take a really long time. This is horribly inefficient (the disk is idle half the time), and useless (the delayed write behavior makes sense for local workloads, but not servers where there is a client on the other end batching its writes). To prevent this, 'filestore flusher' will prod the kernel to flush out any written data to the disk quickly. Then, when we get around to doing the sync/snapshot it is pretty quick, because only fs metadata and just-written data needs to be flushed.

Mmm, I disagree. If you flush quickly, it works fine with a sequential write workload. But if you have a lot of random writes with 4k blocks, for example, you are going to get a lot of disk seeks. The way ZFS or a NetApp SAN works, they take random writes into a fast journal and then flush them to slow storage sequentially every 20s. To compare with ZFS or NetApp: I can achieve around 2io/s on random 4K writes with 4GB of NVRAM and 10 x 7200rpm disks. With Ceph I'm around 2000io/s with the same config (3 nodes with 10x7200rpm disks, 2x replication), so around the real disk IO limit without any write cache. So for now I think I'm going to use SSDs for my OSDs; I have an 80% random-write workload. (No seeks, so constant random writes are not a problem.)

BTW: maybe the wiki is wrong - http://ceph.com/wiki/OSD_journal section Motivation: Enterprise products like NetApp filers cheat by journaling all writes to NVRAM and then taking their time to flush things out to disk efficiently. This gives you very low-latency writes _and_ efficient disk IO at the expense of hardware. This is why I thought Ceph worked like this.

Thanks again, -Alexandre

- Original message - From: Sage Weil s...@inktank.com To: Alexandre DERUMIER aderum...@odiso.com Cc: ceph-devel@vger.kernel.org, Mark Nelson mark.nel...@inktank.com, Stefan Priebe s.pri...@profihost.ag Sent: Thursday, 21 June 2012 18:03:45 Subject: Re: filestore flusher = false , correct my problem of constant write (need info on this parameter)

Hi Alexandre, [Sorry I didn't follow up earlier; I didn't understand your question.] If you turn off the journal completely, you will see bursty write commits from the perspective of the client, because the OSD is periodically doing a sync or snapshot and only acking the writes then. If you enable the journal, the OSD will reply with a commit as soon as the write is stable in the journal. That's one reason why it is there--file system commits are heavyweight and slow. If we left the file system to its own devices and did a sync every 10 seconds, the disk would sit idle while a bunch of dirty data accumulated in cache, and then the sync/snapshot would take a really long time.
This is horribly inefficient (the disk is idle half the time), and useless (the delayed write behavior makes sense for local workloads, but not servers where there is a client on the other end batching its writes). To prevent this, 'filestore flusher' will prod the kernel to flush out any written data to the disk quickly. Then, when we get around to doing the sync/snapshot it is pretty quick, because only fs metadata and just-written data needs to be flushed. So: the behavior you're seeing is normal, and good. Did I understand your confusion correctly? Thanks! sage

On Wed, 20 Jun 2012, Alexandre DERUMIER wrote: Hi, I have tried to disable the filestore flusher with filestore flusher = false, filestore max sync interval = 30 and filestore min sync interval = 29 in the osd config. Now I see a correct sync every 30s when doing rados bench: rados -p pool3 bench 60 write -t 16. Seekwatcher movies: before -- http://odisoweb1.odiso.net/seqwrite-radosbench-flusherenable.mpg after - http://odisoweb1.odiso.net/seqwrite-radosbench-flusherdisable.mpg Shouldn't that be the normal behaviour? What exactly is filestore flusher vs syncfs? This seems to work fine with rados bench, but when I launch a benchmark with fio from my guest VM, I see constant writes again. (I'll try to debug that today.) My target is to be able to absorb small random writes and flush them every 30s. Regards, Alexandre
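For anyone reproducing this test, the options discussed above go in the [osd] section of ceph.conf; the values below are simply the ones Alexandre used, not a general recommendation:

[osd]
    ; do not prod the kernel to flush written data early
    filestore flusher = false
    ; bound the interval between periodic syncs/snapshots, in seconds
    filestore max sync interval = 30
    filestore min sync interval = 29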
Re: ceph init script didn't stop the ceph.
Hi Dan Mick, thanks for the reply. I tried -v as well; it still can't stop. All the daemons are also still running.
Rolling upgrades possible?
I guess this has been asked before, I'm just new to the list and wondered whether it's possible to do rolling upgrades of mons, osds and radosgw? We will soon be in the process of migrating from our current storage solution to Ceph/RGW. We will only use the object storage, actually mainly the S3-interface radosgw supplies. Right now we have a very small test-installation - 1 mon, 2 osds where the mon also runs rgw. Next week I've heard that 0.48 might be released, if we upgrade to that, do we have to shut down the cluster during the upgrade or can we do a rolling upgrade while still responding to PUTs and GETs? If not possible yet, is this in the pipeline? Best, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Recommendations for OSDs, RGW and MON
Currently we're running a test cluster with 1 mon, 1 radosgw and 2 osds. RGW runs on the same host as the mon, while the osds reside on two different servers. We have thought of maybe running more than 1 osd on each storage server, where the osds use different disks of course - is this something reasonable or would performance/stability suffer? Is there any recommendation against running rgw/mon on the same server? Would a better setup be to put osd/mon/rgw on each server and load-balance rgw? Of course we might add more OSDs at some point and I guess we don't want to run mons/rgw on those. Also, on Ubuntu 12.04, does anybody have experience with ceph on btrfs performance or is the recommendation still to run on xfs? All this is for running only the object storage part of ceph (only accessed through RGW's S3 interface). Thanks, John
Re: Rolling upgrades possible?
On 06/22/2012 11:23 AM, John Axel Eriksson wrote: I guess this has been asked before, I'm just new to the list and wondered whether it's possible to do rolling upgrades of mons, osds and radosgw? We will soon be in the process of migrating from our current storage solution to Ceph/RGW. We will only use the object storage, actually mainly the S3-interface radosgw supplies. Right now we have a very small test-installation - 1 mon, 2 osds where the mon also runs rgw. Next week I've heard that 0.48 might be released, if we upgrade to that, do we have to shut down the cluster during the upgrade or can we do a rolling upgrade while still responding to PUTs and GETs? If not possible yet, is this in the pipeline?

Currently there is no guarantee that rolling upgrades will work; I suspect, however, that with 0.48 this will become a priority. With 0.48 there will be an on-disk format change, but I don't know if the protocol between the daemons will change. Towards 0.48 I wouldn't bet on a rolling upgrade, but you can always try with a test cluster :) Wido

Best, John
Re: Rolling upgrades possible?
On Fri, Jun 22, 2012 at 1:23 PM, John Axel Eriksson j...@insane.se wrote: I guess this has been asked before, I'm just new to the list and wondered whether it's possible to do rolling upgrades of mons, osds and radosgw? We will soon be in the process of migrating from our current storage solution to Ceph/RGW. We will only use the object storage, actually mainly the S3-interface radosgw supplies. Right now we have a very small test-installation - 1 mon, 2 osds where the mon also runs rgw. Next week I've heard that 0.48 might be released, if we upgrade to that, do we have to shut down the cluster during the upgrade or can we do a rolling upgrade while still responding to PUTs and GETs? If not possible yet, is this in the pipeline? Best, John

It would not be possible with only one mon - you need at least three for continuous operation, so you can add them right now and then try to upgrade the cluster nodes one by one. By the way, does the recent change to the OSDs' on-disk content mean the data format is close to stabilizing (which would in theory allow flawless per-node upgrades)? If so, is there an approximate timeline? About one and a half months ago an in-list discussion mentioned such stabilization as coming very soon(tm), so I'd be happy to have a more exact timeline before pushing a ceph-based infrastructure into production.
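For reference, the minimal ceph.conf monitor sections for a three-monitor quorum look roughly like this (hostnames and addresses here are made up for illustration):

[mon.a]
    host = server1
    mon addr = 192.168.0.1:6789
[mon.b]
    host = server2
    mon addr = 192.168.0.2:6789
[mon.c]
    host = server3
    mon addr = 192.168.0.3:6789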
Re: Recommendations for OSDs, RGW and MON
On 06/22/2012 11:28 AM, John Axel Eriksson wrote:

> Currently we're running a test cluster with 1 mon, 1 radosgw and 2 osds. RGW runs on the same host as the mon while the osds reside on two different servers. We have thought of maybe running more than 1 osd on each storage server, where the osds use different disks of course - is this something reasonable or would performance/stability suffer?

No, that is not a problem at all. You can run multiple OSDs on one server. Just make sure you have something like 1GB ~ 2GB of memory available per OSD. See: http://www.ceph.com/docs/master/rec/

> Is there any recommendation against running rgw/mon on the same server? Would a better setup be to put osd/mon/rgw on each server and loadbalancing rgw? Of course we might add more osd:s at some point and I guess we don't want to run mons/rgw on those.

You can mix the RGW and MON daemons, but you are better off letting the OSDs run on their own, dedicated machines. In a later stage you can always move the monitors to new machines.

> Also, on Ubuntu 12.04, does anybody have experience with ceph on btrfs performance or is the recommendation still to run on xfs?

I wouldn't run btrfs with the stock 12.04 kernel. The story goes (see the ml archive) that with kernel 3.5 there have been some btrfs improvements, but if you are only using the RGW, XFS might be your best option. http://www.ceph.com/docs/master/rec/filesystem/

Wido

> All this is for running only the object storage part of ceph (only accessed through RGW's S3 interface). Thanks, John
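As a sketch only (the hostname, paths and devices below are hypothetical), two OSDs sharing one storage server but using separate disks could be configured along these lines, with roughly 1-2 GB of RAM planned per OSD as noted above:

[osd.0]
    host = storage01
    osd data = /srv/osd.0              ; XFS filesystem on its own disk, e.g. /dev/sdb1
    osd journal = /srv/osd.0.journal
[osd.1]
    host = storage01
    osd data = /srv/osd.1              ; second OSD on a separate disk, e.g. /dev/sdc1
    osd journal = /srv/osd.1.journal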
Re: RBD layering design draft
On Monday, 18 June 2012, 10:00:32 you wrote: On Fri, Jun 15, 2012 at 1:48 PM, Josh Durgin josh.dur...@inktank.com wrote: $ rbd unpreserve pool/image@snap Error unpreserving: child images rely on this image UX nit: this should also say what image it found. rbd: Cannot unpreserve: Still in use by pool2/image2

What if it's in use by a lot of images? Should it print them all, or should it print something like Still in use by pool2/image2 and 50 others, use list_children to see them all? Guido
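Putting the draft's commands together, the lifecycle being discussed looks roughly like this (all of it proposed syntax from the design draft, not a released interface):

$ rbd preserve pool/image@snap                      # protect the snapshot before cloning
$ rbd clone --parent pool/image@snap pool2/child1   # create a child backed by the snapshot
$ rbd unpreserve pool/image@snap                    # refused while pool2/child1 still exists
rbd: Cannot unpreserve: Still in use by pool2/child1

with some list_children operation to enumerate the remaining clones when needed.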
Re: RBD layering design draft
On 06/15/2012 03:48 PM, Josh Durgin wrote: Then you can perform the clone: $ rbd clone --parent pool/parent@snap pool2/child1 Based on my comments above, if the parent had not been preserved it would automatically be at this point, by virtue of the fact it has a clone associated with it. Since there is always exactly one parent and one child, I'd say drop the --parent and just have the parent and child be defined by their position. If the parent could be optionally skipped for some reason, then make it be the second one. I think that would be a very bad idea. clone source target would be a good idea; nearly all similar commandline utilities (cp, mv, ln) work like that. clone target source would be counterintuitive and probably lead to otherwise avoidable mistakes. Guido -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
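To make the analogy explicit, the suggestion is that clone take its arguments in the same source-then-target order as cp, mv and ln (again, proposed syntax only):

$ rbd clone pool/parent@snap pool2/child1    # source first, target second, like cp/mv/ln

rather than the reverse order or a --parent flag naming the source.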
Re: RBD layering design draft
On Friday, 22 June 2012, 02:02:38 Alex Elsayed wrote: Dan Mick dan.mick at inktank.com writes: On 06/18/2012 11:01 AM, Sage Weil wrote: On Mon, 18 Jun 2012, Josh Durgin wrote: $ rbd copyup pool2/child1 disown and adopt? :) (actually it started as a joke, but really I kinda like that; it fits with the parent-child name) The issue I see with that is that the argument refers to the child rather than the parent, so it doesn't match. I personally like 'unshare' since it'll also work in the dedup case, but if we stick with the parent/child terminology 'emancipate' might work (although it lacks a good reverse).

AFAIK the word started in ancient Rome as meaning to release slaves into freedom, so I suppose the opposite would be enslave? Guido
Re: Unmountable btrfs filesystems
On Saturday, 16 June 2012, 14:12:03 Mark Nelson wrote: btrfsck might tell you what's wrong. Sounds like there is a btrfs-restore command in the dangerdonteveruse branch you could try. Beyond that, I guess it just really comes down to tradeoffs.

I've had similar problems in the recent past. It turns out Ceph makes heavy use of btrfs snapshots when running on btrfs, and btrfs-restore will not restore those, so it cannot be used to restore a broken osd. Guido
Re: Unmountable btrfs filesystems
On Sunday, 17 June 2012, 15:55:42 Martin Mailand wrote: Hi Wido, until recently there were still a few bugs in btrfs which could be hit quite easily with ceph. The last big one was fixed here http://www.spinics.net/lists/ceph-devel/msg06270.html

I keep hearing things along the lines of yes, btrfs is really really close to ready, we just had some really nasty bug in the last release, so you absolutely have to run the very latest Linux kernel - and I have been hearing that since at least Linux 3.1. I think I will probably wait until there have been at least three major Linux releases with no serious btrfs issues before I start using it in production. Guido
Re: all rbd users: set 'filestore fiemap = false'
On Mon, Jun 18, 2012 at 08:32:50AM -0700, Sage Weil wrote: On Mon, 18 Jun 2012, Christoph Hellwig wrote: On Sun, Jun 17, 2012 at 09:02:15PM -0700, Sage Weil wrote: that data over the wire. We have observed incorrect/changing FIEMAP on both btrfs: both btrfs and? Whoops, it was XFS. :/ If you manage to extract a minimal test case I'd love to see it, FIEMAP is a complete mess, although most of the time the errors actually are on the users side due to it's complicated semantics. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD layering design draft
On Fri, Jun 22, 2012 at 7:36 AM, Guido Winkelmann guido-c...@thisisnotatest.de wrote: rbd: Cannot unpreserve: Still in use by pool2/image2 What if it's in use by a lot of images? Should it print them all, or should it print something like Still in use by pool2/image2 and 50 others, use list_children to see them all? As walking through all the (potential) clones is an expensive operation, this should abort as soon as possible, and just complain about the one encountered so far. That could easily be a difference of a few seconds vs tens of seconds. We don't even know the count, without paying that cost, so that can't be printed either. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: reproducable osd crash
I'm still able to crash the ceph cluster while doing a lot of random I/O and then shut down the KVM. Stefan Am 21.06.2012 21:57, schrieb Stefan Priebe: OK i discovered this time that all osds had the same disk usage before crash. After starting the osd again i got this one: /dev/sdb1 224G 23G 191G 11% /srv/osd.30 /dev/sdc1 224G 1,5G 213G 1% /srv/osd.31 /dev/sdd1 224G 1,5G 213G 1% /srv/osd.32 /dev/sde1 224G 1,6G 213G 1% /srv/osd.33 So instead of 1,5GB osd 30 now uses 23G. Stefan Am 21.06.2012 15:23, schrieb Stefan Priebe - Profihost AG: Mhm is this normal (ceph health is NOW OK again) /dev/sdb1 224G 655M 214G 1% /srv/osd.20 /dev/sdc1 224G 640M 214G 1% /srv/osd.21 /dev/sdd1 224G 34G 181G 16% /srv/osd.22 /dev/sde1 224G 608M 214G 1% /srv/osd.23 Why does one OSD has so much more used space than the others? On my other OSD nodes all have around 600MB-700MB. Even when i reformat /dev/sdd1 after the backfill it has again 34GB? Stefan Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG: Another strange thing. Why does THIS OSD have 24GB and the others just 650MB? /dev/sdb1 224G 654M 214G 1% /srv/osd.20 /dev/sdc1 224G 638M 214G 1% /srv/osd.21 /dev/sdd1 224G 24G 190G 12% /srv/osd.22 /dev/sde1 224G 607M 214G 1% /srv/osd.23 When i start now the OSD again it seems to hang for forever. Load goes up to 200 and I/O Waits rise vom 0% to 20%. Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG: Hello list, i'm able to reproducably crash osd daemons. How i can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS: btrfs Journal: 2GB tmpfs per OSD OSD: 3x servers with 4x Intel SSD OSDs each 10GBE Network rbd_cache_max_age: 2.0 rbd_cache_size: 33554432 Disk is set to writeback. Start a KVM VM via PXE with the disk attached in writeback mode. Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes. # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt Strangely exactly THIS OSD also has the most log entries: 64K ceph-osd.20.log 64K ceph-osd.21.log 1,3M ceph-osd.22.log 64K ceph-osd.23.log But all OSDs are set to debug osd = 20. dmesg shows: ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000] I uploaded the following files: priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD üu priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump priebe_fio_randwrite_ceph-osd.bz2 = osd binary Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RBD layering design draft
On Thu, Jun 21, 2012 at 2:51 PM, Alex Elder el...@dreamhost.com wrote: Before cloning a snapshot, you must mark it as preserved, to prevent it from being deleted while child images refer to it: :: $ rbd preserve pool/image@snap Why is it necessary to do this? I think it may be desirable to So the snapshot will not be removed. See this: http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/6595/focus=6675 $ rbd clone --parent pool/parent@snap pool2/child1 Based on my comments above, if the parent had not been preserved it would automatically be at this point, by virtue of the fact it has a clone associated with it. The client creating the child typically has no write access to the parent, and cannot do anything to it. To delete the parent, you must first mark it unpreserved, which checks that there are no children left: :: Please show what happens here if this is done at this point: $ rbd snap rm pool/image@snap rbd: Cannot remove a preserved snapshot: pool/image@snap or something like that. Note that the preserve and unpreserve operations are valid on snapshots, not RBD images or clones. That's a very good point. Perhaps the command should be rbd snap preserve and rbd snap unpreserve. In the initial implementation, called 'trivial layering', there will be no tracking of which objects exist in a clone. A read that hits a non-existent object will attempt to read from the parent object, and this will continue recursively until an object exists or an image with no parent is found. So a non-existent object in a clone is a bit like a hole in a file, but instead of implicitly backing it with zeroes it backs it with the data found at the same range as the snapshot the clone was based on? Yes. Continuation of that: will the clone store sparse objects, or always copy all the data for that object from the parent? That is, what happens if I write 1 byte to a fresh clone? (And remember that block sizes can differ.) If a clone had snapshots, does this mean a snapshot can include non-existent objects in it? I don't like the phrase include non-existent objects, and find that an overambitious topological exercise, but yes, a snapshot may be sparse. Reads fall through toward parents until they find something -- or run out of parents, in which case they read zeros. Does this mean that an attempt to read beyond the end of an RBD snapshot is not an error if the read is being done for a clone whose size has been increased from what it was originally? (In that case, the correct action would be to read the range as zeroes.) This was discussed later in the email, and I see you responded to that part. In addition to knowing which parent a given image has, we want to be able to tell if a preserved image still has children. This is accomplished with a new per-pool object, `rbd_children`, which maps (parent pool, parent id, parent snapshot id) to a list of child My first thought was, why does the parent snapshot need to know the *identity* of its descendant clones? The main thing it seems to need is a count of the number of clones it has. Maintaining that count in a distributed system, without listing the things that are in it, gets challenging. Idempotent counters are challenging. Maintaining it as a set is easier, significantly more debuggable, and unlikely to be too costly. Plus it lets us serve rbd children faster. The other thing though is that you shouldn't store the mapping in the rbd_children object. 
Instead, you should only store the child object ids there, and consult those objects to identify their parents. Otherwise you end up with problems related to possible discrepancy between what a child points to and what the rbd_children mapping says. The question we need to ask is who here is a child of $FOO. Needing an indirection for every member makes that cost a lot more. image ids. This is stored in the same pool as the child image because the client creating a clone already has read/write access to everything in this pool. This lets a client with read-only access to one pool clone a snapshot from that pool into a pool they have full access to. It increases the cost of unpreserving an image, since this This is really a bad feature of this design because it doesn't scale. So we ought to be thinking about a better way to do it if possible. That would be nice. Good luck! We await your email, though not holding our breath ;) To support resizing of layered images, we need to keep track of the minimum size the image ever was, so that if a child image is shrunk We don't want the minimum size. We want to know the highest valid offset in the image: - Upon cloning, the last valid offset of the clone is set to the last valid offset of the snapshot. - If an image is resized larger, the last valid offset remains the same. - If an image is resized smaller, the last valid offset is reduced to the new, smaller size. - If
Re: Rolling upgrades possible?
A rolling upgrade to 0.48 will be possible, provided the old version is reasonably recent (0.45ish or later; I need to confirm that). The upgrade will be a bit awkward because of the disk format upgrade, however. Each ceph-osd will need to do a conversion on startup which can take a while, so you will want to restart them on a per-host or per-rack basis (depending on how your CRUSH map is structured). The monitors are also going through an encoding change, but will only make the transition after all members of the quorum run the new code. If you start the upgrade with a degraded cluster and have another failure, you'll need to make sure the recovering node(s) run new code. The goal is to make all future upgrades possible using rolling upgrades. It will be tricky with some of the OSD changes coming, but that is the goal. sage

On Fri, 22 Jun 2012, John Axel Eriksson wrote: I guess this has been asked before, I'm just new to the list and wondered whether it's possible to do rolling upgrades of mons, osds and radosgw? We will soon be in the process of migrating from our current storage solution to Ceph/RGW. We will only use the object storage, actually mainly the S3-interface radosgw supplies. Right now we have a very small test-installation - 1 mon, 2 osds where the mon also runs rgw. Next week I've heard that 0.48 might be released, if we upgrade to that, do we have to shut down the cluster during the upgrade or can we do a rolling upgrade while still responding to PUTs and GETs? If not possible yet, is this in the pipeline? Best, John
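As an illustration of the per-host approach (the exact service invocation depends on your distribution; the commands below assume the stock sysvinit script shipped with the packages):

# on each storage host in turn, after installing the new packages:
$ sudo /etc/init.d/ceph restart osd    # each local ceph-osd converts its on-disk format on startup
$ ceph -s                              # wait until the cluster reports a healthy state again
# then move on to the next host (or rack, matching how your CRUSH map is structured)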
[GIT PULL] Ceph fixes for -rc4
Hi Linus, Please pull the following Ceph fixes from git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are a couple of fixes from Yan for bad pointer dereferences in the messenger code and when fiddling with page->private after page migration, a fix from Alex for a use-after-free in the osd client code, and a couple fixes for the message refcounting and shutdown ordering. Thanks! sage

Alex Elder (1):
      libceph: osd_client: don't drop reply reference too early

Sage Weil (2):
      libceph: use con get/put ops from osd_client
      libceph: flush msgr queue during mon_client shutdown

Yan, Zheng (2):
      ceph: check PG_Private flag before accessing page->private
      rbd: Clear ceph_msg->bio_iter for retransmitted message

 fs/ceph/addr.c         | 21 -
 net/ceph/ceph_common.c |  7 ---
 net/ceph/messenger.c   |  4
 net/ceph/mon_client.c  |  8
 net/ceph/osd_client.c  | 12 ++--
 5 files changed, 30 insertions(+), 22 deletions(-)
[PATCH 1/9] libceph: encapsulate and document connect sequence
Encapsulate the code handles the initial phase of establishing a ceph connection with a peer, and add a bunch of documentation about what's involved. Change process_banner() to return 1 on success rather than 0, to allow the new ceph_con_connect_response() to return 0 to indicate the response has not yet been completely read. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 71 ++- 1 file changed, 54 insertions(+), 17 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1472,7 +1472,7 @@ static int process_banner(struct ceph_co ceph_pr_addr(con-msgr-inst.addr.in_addr)); } - return 0; + return 1; } static void fail_protocol(struct ceph_connection *con) @@ -1970,6 +1970,57 @@ static void process_message(struct ceph_ prepare_read_tag(con); } +/* + * Initiate the first phase of establishing a connection with + * the peer (connecting). This phase consists of: + * - client requests TCP connection to server + * - server accepts TCP connection from client + * - client sends banner to server + * - server receives and validates client's banner + * - client sends little-endian encoded own socket (IP) address + * - server recieves, validates, and records client's encoded address + * If all is well to this point, then we begin processing the + * connect response. + */ +static int ceph_con_connect(struct ceph_connection *con) +{ + set_bit(CONNECTING, con-state); + + con_out_kvec_reset(con); + prepare_write_banner(con); + prepare_read_banner(con); + + BUG_ON(con-in_msg); + con-in_tag = CEPH_MSGR_TAG_READY; + dout(%s initiating connect on %p new state %lu\n, + __func__, con, con-state); + + return ceph_tcp_connect(con); +} + +/* + * Handle the response from the first phase of establishing a + * connection with the peer. This consists of: + * - server sends banner to client + * - client receives and validates server's banner + * - server sends little-endian encoded own socket (IP) address + * - client recieves, validates, and records server's encoded address + * - server sends little-endian encoded socket (IP) address for client + * - client recieves and records its encoded address supplied by server + * If all is well to this point, then we can transition to the + * NEGOTIATING state. + */ +static int ceph_con_connect_response(struct ceph_connection *con) +{ + int ret; + + dout(%s connecting\n, __func__); + ret = read_partial_banner(con); + if (ret 0) + ret = process_banner(con); + + return ret; +} /* * Write something to the socket. Called in a worker thread when the @@ -1986,17 +2037,7 @@ more: /* open the socket first? */ if (con-sock == NULL) { - set_bit(CONNECTING, con-state); - - con_out_kvec_reset(con); - prepare_write_banner(con); - prepare_read_banner(con); - - BUG_ON(con-in_msg); - con-in_tag = CEPH_MSGR_TAG_READY; - dout(try_write initiating connect on %p new state %lu\n, -con, con-state); - ret = ceph_tcp_connect(con); + ret = ceph_con_connect(con); if (ret 0) { con-error_msg = connect error; goto out; @@ -2095,13 +2136,9 @@ more: } if (test_bit(CONNECTING, con-state)) { - dout(try_read connecting\n); - ret = read_partial_banner(con); + ret = ceph_con_connect_response(con); if (ret = 0) goto out; - ret = process_banner(con); - if (ret 0) - goto out; clear_bit(CONNECTING, con-state); set_bit(NEGOTIATING, con-state); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/9] libceph: encapsulate and document negotiation phase
Encapsulate the code handles the negotiation phase of establishing a ceph connection with a peer, and add a bunch of documentation about what's involved. Change process_connect() to return 1 on success rather than 0, to allow the new ceph_con_negotiate_response() to return 0 to indicate the response has not yet been completely read. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 107 +++ 1 file changed, 91 insertions(+), 16 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1633,7 +1633,7 @@ static int process_connect(struct ceph_c con-error_msg = protocol error, garbage tag during connect; return -1; } - return 0; + return 1; } @@ -2023,6 +2023,82 @@ static int ceph_con_connect_response(str } /* + * The first phase of connecting with the peer succeeded. Now start + * the second phase (negotiating), which consists of: + * - client sends a connect message to server, specifying + *information about itself, including the protocol it intends to + *use and the features it supports. + * - if authorizer data is needed for the connection, its length is + *recorded in the connect message, and client sends its content + *immediately after the connect message + * - server receives the connect message from the client, and if it + *indicates authorizer data follows, reads that also. + * If all is well to this point, then we begin processing the + * negotiation response. + */ +static int ceph_con_negotiate(struct ceph_connection *con) +{ + int ret; + + clear_bit(CONNECTING, con-state); + set_bit(NEGOTIATING, con-state); + + /* Banner was good, exchange connection info */ + ret = prepare_write_connect(con); + if (ret = 0) + prepare_read_connect(con); + + return ret; +} + +/* + * Handle the response from the negotiating phase of connecting the + * peer. This consists of: + * - server validates the connect message (and possibly authorizer + *data), and sends a response to the client: + * - if the protocol version supplied by the client is not what + *was expected, response is a BADPROTOVER tag + * - if the features supported by the client are missing + *features required by the server, response is a FEATURES + *tag. + * - if the features supported by the client are missing + * - if authorizer data is supplied by the client and it is not + *valid, response is a BADAUTHORIZER tag. + * - (There are some other conditions related to message and + *connection sequence numbers but they are not covered here) + * - Otherwise the response will begin with a READY tag, and + *will include a ceph connect reply message, which will + *include the features supported by the server, and the + *server's own authorization data. + * - client validates the connect message (and possibly authorizer + *data) from the server: + * - If the tag indicates a bad protocol or mismatching + *features, the connection attempt is abandoned, so the ceph + *connection is reset and closed. + * - If the tag indicates a bad authorizer, a second connect + *attempt is initiated. If a second attempt fails due to a + *bad authorizer, the connection attempt fails. + * - If the tag indicates READY, the client will check the + *features supported by the server. If the server's + *features do not include a feature required by the client, + *the connection attempt is abandoned, so the ceph + *connection is reset and closed. + * If no failures occurred to this point, the connection is established. 
+ */ +static int ceph_con_negotiate_response(struct ceph_connection *con) +{ + int ret; + + dout(%s negotiating\n, __func__); + + ret = read_partial_connect(con); + if (ret 0) + ret = process_connect(con); + + return ret; +} + +/* * Write something to the socket. Called in a worker thread when the * socket appears to be writeable and we have something ready to send. */ @@ -2136,31 +2212,30 @@ more: } if (test_bit(CONNECTING, con-state)) { + /* +* See if we got the response we expect from our +* connection request. +*/ ret = ceph_con_connect_response(con); if (ret = 0) goto out; - clear_bit(CONNECTING, con-state); - set_bit(NEGOTIATING, con-state); - - /* Banner is good, exchange connection info */ - ret = prepare_write_connect(con); - if (ret 0) - goto out; -
[PATCH 3/9] libceph: close the connection's socket on reset
When a ceph connection is reset, all its state is cleared. However the underlying socket never actually gets closed. Do that, to essentially make the reset process complete. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |1 + 1 file changed, 1 insertion(+) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -492,6 +492,7 @@ static void reset_connection(struct ceph } con-in_seq = 0; con-in_seq_acked = 0; + con_close_socket(con); } /* -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] libceph: don't close socket in OPENING state
The only way a socket enters OPENING state is via ceph_con_open(). The only times ceph_con_open() is called are: - In fs/ceph/mds_client.c:register_session(), where it occurs soon after a call to ceph_con_init(). - In fs/ceph/mds_client.c:send_mds_reconnect(). This is called in two places. - In fs/ceph/mds_client.c:check_new_map(), it is called after a call to ceph_con_close() - Or in fs/ceph/mds_client.c:peer_reset(), which is also only called after reset_connection, which includes a call to ceph_con_close(). - In net/ceph/mon_client.c:__open_session(), where it's called right after a call to ceph_con_init(). - In net/ceph/osd_client.c:__reset_osd(), right after a call to ceph_con_close(). - In net/ceph/osd_client.c:__map_request(), shortly after a call to create_osd(), which includes a call to ceph_con_init(). After a call to ceph_con_init(), the state of a ceph connection is CLOSED, and its socket pointer is null. Similarly, after a call to ceph_con_close(), the state of the connection is CLOSED, the underlying socket is closed, and the connection's socket pointer is null. Therefore, there is no reason to call con_close_socket() when a connection is found to be in OPENING state in con_work(), because the socket will already be closed, and the connection will already be in CLOSED state. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |1 - 1 file changed, 1 deletion(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2387,7 +2387,6 @@ restart: if (test_and_clear_bit(OPENING, con-state)) { /* reopen w/ new peer */ dout(con_work OPENING\n); - con_close_socket(con); } ret = try_read(con); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/9] libceph: change TAG_CLOSE handling
Currently, if a connection is READY in try_read(), and a CLOSE tag is the what is received next, the connection's state changes from CONNECTED to CLOSED and try_read() returns. If this happens, control returns to con_work(), and try_write() is called. If there was queued data to send, try_write() appears to attempt to send it despite the receipt of the CLOSE tag. Eventually, try_write() will return either: - A non-negative value, in which case con_work() will end, and will at some point get triggered to run by an event. - -EAGAIN, in which case control returns to the top of con_work() - Some other error, which will cause con_work() to call ceph_fault(), which will close the socket and force a new connection sequence to be initiated on the next write. At the top of con_work(), if the connection is in CLOSED state, the same fault handling will be done as would happen for any other error. Instead of messing with the connection state deep inside try_read(), just have try_read() return a negative value (an errno), and let the fault handling code in con_work() take care of resetting the connection right away. This will also close the connection before needlessly sending any queued data to the other end. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2273,8 +2273,7 @@ more: prepare_read_ack(con); break; case CEPH_MSGR_TAG_CLOSE: - clear_bit(CONNECTED, con-state); - set_bit(CLOSED, con-state); /* fixme */ + ret = -EIO; goto out; default: goto bad_tag; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/9] libceph: kill fail_protocol()
In the negotiating phase of establishing a connection, the server can indicate various connection failures using special tag values. The tags can mean: that the client does not have features needed by the server; that the protocol advertised by the client is not what the server expects; or that the authorizer data provided by the client was not adequate to grant access. These three cases are handled in process_connect(), which calls fail_protocal() for all three. The result of that is that the connection gets reset, and the connection gets moved to CLOSED state. The previous patch description walks through what happens when a connection gets marked CLOSED within try_read(), and why it's sufficient (and better) to simply have it return a negative value. So just do that--don't bother with fail_protocol(), just return a negative value in these cases and let the caller sort out resetting things. Return -EIO in these cases rather than -1 (which can be confused with -EPERM). We can get rid of fail_protocol() because it is no longer used. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 15 +++ 1 file changed, 3 insertions(+), 12 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1476,12 +1476,6 @@ static int process_banner(struct ceph_co return 1; } -static void fail_protocol(struct ceph_connection *con) -{ - reset_connection(con); - set_bit(CLOSED, con-state); /* in case there's queued work */ -} - static int process_connect(struct ceph_connection *con) { u64 sup_feat = con-msgr-supported_features; @@ -1499,8 +1493,7 @@ static int process_connect(struct ceph_c ceph_pr_addr(con-peer_addr.in_addr), sup_feat, server_feat, server_feat ~sup_feat); con-error_msg = missing required protocol features; - fail_protocol(con); - return -1; + return -EIO; case CEPH_MSGR_TAG_BADPROTOVER: pr_err(%s%lld %s protocol version mismatch, @@ -1510,8 +1503,7 @@ static int process_connect(struct ceph_c le32_to_cpu(con-out_connect.protocol_version), le32_to_cpu(con-in_reply.protocol_version)); con-error_msg = protocol version mismatch; - fail_protocol(con); - return -1; + return -EIO; case CEPH_MSGR_TAG_BADAUTHORIZER: con-auth_retry++; @@ -1597,8 +1589,7 @@ static int process_connect(struct ceph_c ceph_pr_addr(con-peer_addr.in_addr), req_feat, server_feat, req_feat ~server_feat); con-error_msg = missing required protocol features; - fail_protocol(con); - return -1; + return -EIO; } clear_bit(NEGOTIATING, con-state); set_bit(CONNECTED, con-state); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] libceph: close connection on reset tag
When a CEPH_MSGR_TAG_RESETSESSION tag is received, the connection should be reset, dropping any pending messages and preparing for a new connection to be negotiated. Currently, reset_connection() is called to do this, but that only drops messages. To really get the connection fully reset, call ceph_con_close() instead. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1533,7 +1533,8 @@ static int process_connect(struct ceph_c pr_err(%s%lld %s connection reset\n, ENTITY_NAME(con-peer_name), ceph_pr_addr(con-peer_addr.in_addr)); - reset_connection(con); + ceph_con_close(con); + ret = prepare_write_connect(con); if (ret 0) return ret; -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 8/9] libceph: close connection on connect failure
The only time the CLOSED state is set on a ceph connection is in ceph_con_init() and ceph_con_close(). Both of these will ensure the connection's socket is closed. Therefore there is no need to close the socket in con_work() if the connection is found to be in CLOSED state. Rearrange things a bit in ceph_con_close() so we only manipulate the state and flag bits *after* we've acquired the connection mutex. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -502,6 +502,8 @@ void ceph_con_close(struct ceph_connecti { dout(con_close %p peer %s\n, con, ceph_pr_addr(con-peer_addr.in_addr)); + + mutex_lock(con-mutex); clear_bit(NEGOTIATING, con-state); clear_bit(CONNECTING, con-state); clear_bit(CONNECTED, con-state); @@ -512,11 +514,13 @@ void ceph_con_close(struct ceph_connecti clear_bit(KEEPALIVE_PENDING, con-flags); clear_bit(WRITE_PENDING, con-flags); - mutex_lock(con-mutex); + /* Clear everything out */ reset_connection(con); con-peer_global_seq = 0; cancel_delayed_work(con-work); + mutex_unlock(con-mutex); + queue_con(con); } EXPORT_SYMBOL(ceph_con_close); @@ -2372,7 +2376,6 @@ restart: } if (test_bit(CLOSED, con-state)) { /* e.g. if we are replaced */ dout(con_work CLOSED\n); - con_close_socket(con); goto done; } if (test_and_clear_bit(OPENING, con-state)) { -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/9] libceph: set CONNECTING state even earlier
Move the setting of the CONNECTING state in a ceph connection all the way back to where a connection first gets opened. At that point the connection's socket pointer is still null, and the connection sequence is about to begin. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c |3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) Index: b/net/ceph/messenger.c === --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -533,6 +533,7 @@ void ceph_con_open(struct ceph_connectio dout(con_open %p %s\n, con, ceph_pr_addr(addr-in_addr)); set_bit(OPENING, con-state); WARN_ON(!test_and_clear_bit(CLOSED, con-state)); + set_bit(CONNECTING, con-state); memcpy(con-peer_addr, addr, sizeof(*addr)); con-delay = 0; /* reset backoff memory */ @@ -1981,8 +1982,6 @@ static void process_message(struct ceph_ */ static int ceph_con_connect(struct ceph_connection *con) { - set_bit(CONNECTING, con-state); - con_out_kvec_reset(con); prepare_write_banner(con); prepare_read_banner(con); -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: reproducable osd crash
Stefan, I'm looking at your logs and coredump now. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is right now a showstopper to me. Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AGs.pri...@profihost.ag: Hello list, i'm able to reproducably crash osd daemons. How i can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS: btrfs Journal: 2GB tmpfs per OSD OSD: 3x servers with 4x Intel SSD OSDs each 10GBE Network rbd_cache_max_age: 2.0 rbd_cache_size: 33554432 Disk is set to writeback. Start a KVM VM via PXE with the disk attached in writeback mode. Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes. # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt Strangely exactly THIS OSD also has the most log entries: 64K ceph-osd.20.log 64K ceph-osd.21.log 1,3Mceph-osd.22.log 64K ceph-osd.23.log But all OSDs are set to debug osd = 20. dmesg shows: ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000] I uploaded the following files: priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD üu priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump priebe_fio_randwrite_ceph-osd.bz2 = osd binary Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: reproducable osd crash
I am still looking into the logs. -Sam On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote: Stefan, I'm looking at your logs and coredump now. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is right now a showstopper to me. Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AGs.pri...@profihost.ag: Hello list, i'm able to reproducably crash osd daemons. How i can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS: btrfs Journal: 2GB tmpfs per OSD OSD: 3x servers with 4x Intel SSD OSDs each 10GBE Network rbd_cache_max_age: 2.0 rbd_cache_size: 33554432 Disk is set to writeback. Start a KVM VM via PXE with the disk attached in writeback mode. Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes. # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt Strangely exactly THIS OSD also has the most log entries: 64K ceph-osd.20.log 64K ceph-osd.21.log 1,3M ceph-osd.22.log 64K ceph-osd.23.log But all OSDs are set to debug osd = 20. dmesg shows: ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000] I uploaded the following files: priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD üu priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump priebe_fio_randwrite_ceph-osd.bz2 = osd binary Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: reproducable osd crash
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which is not quite 0.47.3. You can get the version with binary -v, or (in my case) examining strings in the binary. I'm retrieving that version to analyze the core dump. On 06/21/2012 11:43 PM, Stefan Priebe wrote: Does anybody have an idea? This is right now a showstopper to me. Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost AGs.pri...@profihost.ag: Hello list, i'm able to reproducably crash osd daemons. How i can reproduce: Kernel: 3.5.0-rc3 Ceph: 0.47.3 FS: btrfs Journal: 2GB tmpfs per OSD OSD: 3x servers with 4x Intel SSD OSDs each 10GBE Network rbd_cache_max_age: 2.0 rbd_cache_size: 33554432 Disk is set to writeback. Start a KVM VM via PXE with the disk attached in writeback mode. Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes. # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1; halt Strangely exactly THIS OSD also has the most log entries: 64K ceph-osd.20.log 64K ceph-osd.21.log 1,3Mceph-osd.22.log 64K ceph-osd.23.log But all OSDs are set to debug osd = 20. dmesg shows: ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000] I uploaded the following files: priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't crash priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD üu priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump priebe_fio_randwrite_ceph-osd.bz2 = osd binary Stefan -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Unable to restart Mon after reboot
Hi all, I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17. Following a reboot of the servers, one of the mon daemons crashes on startup with FAILED assert(r>0). The MDS and the OSD start and run fine, as do the mon daemons on the other two servers. The debug log is at http://pastebin.com/tXwvd44Z I would really appreciate any comments - especially if I am missing something obvious. David
Re: Unable to restart Mon after reboot
Hi David: The code there is trying to read some stuff off the monitor's storage to initialize, and apparently failing in an odd way. It's trying to read the file 'latest' from the monitor directory (/data/mon0); the file can be opened, and stat says it's 4289 bytes long, but apparently the read is succeeding without error, but only getting back 0 bytes (i.e., not an error, but apparently end of file). See if there's a file /data/mon0/latest of length 4289, and see if something is odd about its permissions (like maybe the read bits are turned off, or maybe the filesystem it's on has errors). On 06/22/2012 05:31 PM, David Blundell wrote: Hi all, I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17. Following a reboot of the servers, one of the mon daemons crashes on startup with FAILED assert(r0) MDS and the OSD start and run fine as do the mon daemons on the other two servers. The debug log is at http://pastebin.com/tXwvd44Z I would really appreciate any comments - especially if I am missing something obvious. David-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
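A quick way to check the points above from a shell (paths taken from this thread; adjust for your mon data directory):

$ ls -l /data/mon0/latest                     # does the file exist, and is it really 4289 bytes?
$ stat /data/mon0/latest                      # permissions and ownership
$ dd if=/data/mon0/latest of=/dev/null bs=4k  # does a plain read return the data, or immediate EOF?
$ dmesg | tail                                # any btrfs errors from the underlying filesystem?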
Re: Performance benchmark of rbd
Hi Eric, Do you have find any clue about slow random write iops ? I'm doing some benchmark from a kvm guest with fio, random 4K block, fio --filename=$DISK --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 --group_reporting --name=file1 journal is on tmpfs and storage is 15k drive I can't have more than 1000-2000 iops. I Don't understand why I don't have a lot more iops. If journal is on tmpfs, it should be around 3iops on a gigabit link (using all the bandwith) I also try use rbd_caching on my kvm guest, didn't change nothing. sequential write with 4MB block can use the full the gigabit link (around 100MB/S) Is the bottleneck the in rbd protocol ? - Mail original - De: Eric YH Chen eric_yh_c...@wiwynn.com À: mark nelson mark.nel...@inktank.com Cc: ceph-devel@vger.kernel.org, Chris YT Huang chris_yt_hu...@wiwynn.com, Victor CY Chang victor_cy_ch...@wiwynn.com Envoyé: Jeudi 14 Juin 2012 03:26:12 Objet: RE: Performance benchmark of rbd Hi, Mark: I forget to mention one thing, I create the rbd at the same machine and test it. That means the network latency may be lower than normal case. 1. I use ext4 as the backend filesystem and with following attribute. data=writeback,noatime,nodiratime,user_xattr 2. I use the default replication number, I think it is 2, right? 3. On my platform, I have 192GB memory 4. Sorry about the column name is left-right reversal. Here is the correct one Seq-write Seq-read 32 KB 23 MB/s 690 MB/s 512 KB 26 MB/s 960 MB/s 4 MB 27 MB/s 1290 MB/s 32 MB 36 MB/s 1435 MB/s 5. If I put all the journal data on a SSD device (Intel 520). The sequence write performance would reach 135MB/s instead of 27MB/s in original. (object size = 4MB). And others are no different, including random-write. I am curious why the SSD device doesn't help the performance of random-write. 6. For the random read write, the data I provided before was correct. But I can give you the detail. Is it too high than what you expected? rand-write-4k rand-write-16k bw iops bw iops 3,524 881 9,032 564 mix-4k (50/50) r:bw r:iops w:bw w:iops 2,925 731 2,924 731 mix-8k (50/50) r:bw r:iops w:bw w:iops 4,509 563 4,509 563 mix-16k (50/50) r:bw r:iops w:bw w:iops 8,366 522 8,345 521 7. Here is the hw raid cache policy we used now. Write Policy Write Back with BBU Read Policy ReadAhead If you are interested in how HW raid help the performance, I can do for little help, since we also want to know what is the best configuration on our platform. Any test you want to know? Furthermore, is there any suggestion for our platform that can improve the performance? Thanks! -Original Message- From: Mark Nelson [mailto:mark.nel...@inktank.com] Sent: Wednesday, June 13, 2012 8:30 PM To: Eric YH Chen/WYHQ/Wiwynn Cc: ceph-devel@vger.kernel.org Subject: Re: Performance benchmark of rbd Hi Eric! On 6/13/12 5:06 AM, eric_yh_c...@wiwynn.com wrote: Hi, all: I am doing some benchmark of rbd. The platform is on a NAS storage. CPU: Intel E5640 2.67GHz Memory: 192 GB Hard Disk: SATA 250G * 1, 7200 rpm (H0) + SATA 1T * 12 , 7200rpm (H1~ H12) RAID Card: LSI 9260-4i OS: Ubuntu12.04 with Kernel 3.2.0-24 Network: 1 Gb/s We create 12 OSD on H1 ~ H12 with the journal is put on H0. Just to make sure I understand, you have a single node with 12 OSDs and 3 mons, and all 12 OSDs are using the H0 disk for their journals? What filesystem are you using for the OSDs? How much replication? We also create 3 MON in the cluster. In briefly, we setup a ceph cluster all-in-one, with 3 monitors and 12 OSD. 
The benchmark tool we used is fio 2.0.3. We had 7 basic test case 1) sequence write with bs=64k 2) sequence read with bs=64k 3) random write with bs=4k 4) random write with bs=16k 5) mix read/write with bs=4k 6) mix read/write with bs=8k 7) mix read/write with bs=16k We create several rbd with different object size for the benchmark. 1. size = 20G, object size = 32KB 2. size = 20G, object size = 512KB 3. size = 20G, object size = 4MB 4. size = 20G, object size = 32MB Given how much memory you have, you may want to increase the amount of data you are writing during each test to rule out caching. We have some conclusion after the benchmark. a. We can get better performance of sequence read/write when the object size is bigger. Seq-read Seq-write 32 KB 23 MB/s 690 MB/s 512 KB 26 MB/s 960 MB/s 4 MB 27 MB/s 1290 MB/s 32 MB 36 MB/s 1435 MB/s Which test are these results from? I'm suspicious that the write numbers are so high. Figure that even with a local client and 1X replication, your journals and data partitions are each writing out a copy of the data. You don't have enough disk in that box to sustain 1.4GB/s to both even under perfectly ideal conditions. Given that it sounds like you are using a single 7200rpm disk for 12 journals, I would