Re: Client Location
Hi there, The basic setup I'm trying to get is a backend to a hypervisor cluster, so that auto-failover and live migration work. The main thing is that we have a number of datacenters with a gigabit interconnect that is not always 100% reliable. In the event of a failure we want all the virtual machines to fail over to the remaining datacenters, so we need all the data in each location. The other issue is that within each datacenter we can use link aggregation to increase the bandwidth between hypervisors and the ceph cluster, but between the datacenters we only have the gigabit, so it becomes essential to have the hypervisors looking at the storage in the same datacenter. Another consideration is that the virtual machines might get migrated between datacenters without any failure, and the main problem I see with what Mark suggests is that in this mode the migrated VM would still be connecting to the OSDs in the remote datacenter. To be honest I'm fairly new to ceph and I know I'm asking for everything and the kitchen sink! Any thoughts would be very helpful though. Thanks James - Original Message - From: Gregory Farnum g...@inktank.com To: Mark Kampe mark.ka...@inktank.com Cc: James Horner james.hor...@precedent.co.uk, ceph-devel@vger.kernel.org Sent: Tuesday, October 9, 2012 5:48:37 PM Subject: Re: Client Location On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe mark.ka...@inktank.com wrote: I'm not a real engineer, so please forgive me if I misunderstand, but can't you create a separate rule for each data center (choosing first a local copy, and then remote copies), which should ensure that the primary is always local? Each data center would then use a different pool, associated with the appropriate location-sensitive rule. Does this approach get you the desired locality preference? This sounds right to me — I think maybe there's a misunderstanding about how CRUSH works. What precisely are you after, James? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
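For reference, the per-datacenter rule Mark describes could be sketched roughly as follows. This is an untested illustration which assumes the CRUSH map already contains datacenter buckets named dc1 and dc2; the pool used by hypervisors in dc1 would point at this ruleset, and a mirror-image rule would serve dc2:

    rule dc1_local_primary {
        ruleset 3
        type replicated
        min_size 2
        max_size 10
        # place the first (primary) replica on a host in the local datacenter
        step take dc1
        step chooseleaf firstn 1 type host
        step emit
        # place the remaining replicas on hosts in the remote datacenter
        step take dc2
        step chooseleaf firstn -1 type host
        step emit
    }

With a rule like this the primary OSD for every object in the dc1 pool sits in dc1, so local clients talk to a local primary, while the remaining copies still land in dc2 for failover.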
Ignore O_SYNC for rbd cache
Hi, Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite fantastic performance - on both reads and writes Ceph completely utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical limit of the sum of all bandwidths, bearing in mind the replication level. The only thing that may bring down overall performance is O_SYNC|O_DIRECT writes, which will be issued by almost every database server in the default setup. Assuming that the database config may be untouchable and that somehow I can build a very reliable hardware setup which will never fail on power, should ceph have an option to ignore these flags? Maybe there are other real-world cases for including such an option, or I am very wrong to even think about fooling the client application in this way. Thank you for any suggestion! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Ceph Disk Controller Performance Article Part1
Hi Guys, Just wanted to let you know we've published a short introductory article on the ceph blog looking at write performance on a couple of different RAID/SAS controllers configured in different ways. Hopefully you guys find it useful! We'll likely be publishing more articles in the future that dig deeper and wider into ceph performance on the test platform being used. The article is available here: http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/ Thanks, Mark -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] rbd: finish up basic format 2 support
On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder el...@inktank.com wrote: This series includes updates for two patches posted previously. -Alex Greetings, We're gearing up to test v0.52 (specifically the RBD stuff) on our cluster. After reading this series of posts about rbd format 2 patches I began wondering if we should start testing these patches as well or not. To put it simply, what I'd like to know is: Is it enough to use the 3.6 vanilla kernel client to take full advantage of the rbd changes in v0.52 (i.e. new RBD cloning features)? Do we have any benefits from applying any of these patches on top of v3.6 and using format 2, assuming that we stick to v0.52 on the server, or is this strictly v0.53 and beyond stuff? I apologize if this is a dumb question, but by looking at the v0.52 changelog, at doc/rbd/* and the list, it doesn't seem clear how this fits with v0.52. Thanks in advance Best regards Cláudio -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ignore O_SYNC for rbd cache
On Wed, 10 Oct 2012, Andrey Korolyov wrote: Hi, Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite fantastic performance - on both reads and writes Ceph completely utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical limit of the sum of all bandwidths, bearing in mind the replication level. The only thing that may bring down overall performance is O_SYNC|O_DIRECT writes, which will be issued by almost every database server in the default setup. Assuming that the database config may be untouchable and that somehow I can build a very reliable hardware setup which will never fail on power, should ceph have an option to ignore these flags? Maybe there are other real-world cases for including such an option, or I am very wrong to even think about fooling the client application in this way. I certainly wouldn't recommend it, but there are probably use cases where it makes sense (i.e., the data isn't as important as the performance). Any such option would probably be called rbd async flush danger danger = true and would trigger a flush but not wait for it, or perhaps rbd ignore flush danger danger = true which would not honor flush at all. This would jeopardize the integrity of the file system living on the RBD image; file systems rely on flush to order their commits, and playing fast and loose with that can lead to any number of corruptions. The only silver lining is that in the not-so-distant past (3-4 years ago) this was poorly supported by the block layer and file systems alike, and ext3 didn't crash and burn quite as often as you might have expected. Anyway, not something I would recommend, certainly not for a generic VM platform. Maybe if you have a specific performance-sensitive application you can afford to let crash and burn... sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] rbd: finish up basic format 2 support
On 10/10/2012 08:55 AM, Cláudio Martins wrote: On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder el...@inktank.com wrote: This series includes updates for two patches posted previously. -Alex Greetings, We're gearing up to test v0.52 (specifically the RBD stuff) on our cluster. After reading this series of posts about rbd format 2 patches I began wondering if we should start testing these patches as well or not. To put it simply, what I'd like to know is: Is it enough to use the 3.6 vanilla kernel client to take full advantage of the rbd changes in v0.52 (i.e. new RBD cloning features)? Do we have any benefits from applying any of these patches on top of v3.6 and using format 2, assuming that we stick to v0.52 on the server, or is this strictly v0.53 and beyond stuff? I apologize if this is a dumb question, but by looking at the v0.52 changelog, at doc/rbd/* and the list, it doesn't seem clear how this fits with v0.52. Thanks in advance Best regards Cláudio These patches support using format 2, to make adding new features easy, but this is not very useful to you yet. They don't yet support any new features (like cloning) - that's the next step, but it will take a bunch more work. To use rbd cloning, you'll need to access rbd through userspace (e.g. with qemu and librbd) for now. Josh -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
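For anyone who wants to try the userspace cloning path Josh mentions, the workflow looks roughly like this; it is a sketch based on the documented layering commands, and the pool and image names are made up:

    rbd create --format 2 --size 10240 rbd/base
    rbd snap create rbd/base@gold
    rbd snap protect rbd/base@gold      # clones require a protected snapshot
    rbd clone rbd/base@gold rbd/child
    # attach the clone to a guest through librbd, e.g.
    #   qemu ... -drive file=rbd:rbd/child,format=raw,if=virtio

Cloned (format 2) images can only be used through userspace librbd for now, which is why the qemu route is shown rather than the kernel client.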
Re: Ignore O_SYNC for rbd cache
On 10/10/2012 09:23 AM, Sage Weil wrote: On Wed, 10 Oct 2012, Andrey Korolyov wrote: Hi, Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite fantastic performance - on both reads and writes Ceph completely utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical limit of the sum of all bandwidths, bearing in mind the replication level. The only thing that may bring down overall performance is O_SYNC|O_DIRECT writes, which will be issued by almost every database server in the default setup. Assuming that the database config may be untouchable and that somehow I can build a very reliable hardware setup which will never fail on power, should ceph have an option to ignore these flags? Maybe there are other real-world cases for including such an option, or I am very wrong to even think about fooling the client application in this way. I certainly wouldn't recommend it, but there are probably use cases where it makes sense (i.e., the data isn't as important as the performance). Any such option would probably be called rbd async flush danger danger = true and would trigger a flush but not wait for it, or perhaps rbd ignore flush danger danger = true which would not honor flush at all. qemu already has a cache=unsafe option which does exactly that. This would jeopardize the integrity of the file system living on the RBD image; file systems rely on flush to order their commits, and playing fast and loose with that can lead to any number of corruptions. The only silver lining is that in the not-so-distant past (3-4 years ago) this was poorly supported by the block layer and file systems alike, and ext3 didn't crash and burn quite as often as you might have expected. Anyway, not something I would recommend, certainly not for a generic VM platform. Maybe if you have a specific performance-sensitive application you can afford to let crash and burn... sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
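For completeness, the qemu knob Josh refers to is set per drive; an illustrative invocation for an rbd-backed guest might look like this (image name and auth id are made up):

    qemu-system-x86_64 ... \
        -drive file=rbd:rbd/vm-disk:id=admin,format=raw,if=virtio,cache=unsafe

cache=unsafe caches writes and ignores flush requests from the guest entirely, which is exactly the crash-consistency trade-off Sage describes above.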
Re: Client Location
On Wed, 10 Oct 2012, James Horner wrote: Hi there, The basic setup I'm trying to get is a backend to a hypervisor cluster, so that auto-failover and live migration work. The main thing is that we have a number of datacenters with a gigabit interconnect that is not always 100% reliable. In the event of a failure we want all the virtual machines to fail over to the remaining datacenters, so we need all the data in each location. The other issue is that within each datacenter we can use link aggregation to increase the bandwidth between hypervisors and the ceph cluster, but between the datacenters we only have the gigabit, so it becomes essential to have the hypervisors looking at the storage in the same datacenter. The ceph replication is synchronous, so even if you are writing to a local OSD, it will be updating the replica at the remote DC. The 1gbps link may quickly become a bottleneck. This is a matter of having your cake and eating it too... you can't seamlessly fail over to another DC if you don't synchronously replicate to it. Another consideration is that the virtual machines might get migrated between datacenters without any failure, and the main problem I see with what Mark suggests is that in this mode the migrated VM would still be connecting to the OSDs in the remote datacenter. The new rbd cloning functionality can be used to 'migrate' an image by cloning it to a different pool (the new local DC) and then later (in the background, whenever) doing a 'flatten' to migrate the data from the parent to the clone. Performance will be slower initially but will improve once the data is migrated. This isn't a perfect solution for your use case, but it would work. sage To be honest I'm fairly new to ceph and I know I'm asking for everything and the kitchen sink! Any thoughts would be very helpful though. Thanks James - Original Message - From: Gregory Farnum g...@inktank.com To: Mark Kampe mark.ka...@inktank.com Cc: James Horner james.hor...@precedent.co.uk, ceph-devel@vger.kernel.org Sent: Tuesday, October 9, 2012 5:48:37 PM Subject: Re: Client Location On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe mark.ka...@inktank.com wrote: I'm not a real engineer, so please forgive me if I misunderstand, but can't you create a separate rule for each data center (choosing first a local copy, and then remote copies), which should ensure that the primary is always local? Each data center would then use a different pool, associated with the appropriate location-sensitive rule. Does this approach get you the desired locality preference? This sounds right to me — I think maybe there's a misunderstanding about how CRUSH works. What precisely are you after, James? -Greg -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
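A rough sketch of the clone-then-flatten migration Sage describes, assuming one pool per datacenter (the pool names dc1-pool and dc2-pool are made up for illustration):

    # snapshot and protect the image in the pool local to the source DC
    rbd snap create dc1-pool/vm-disk@migrate
    rbd snap protect dc1-pool/vm-disk@migrate
    # clone it into the pool local to the destination DC and repoint the VM at the clone
    rbd clone dc1-pool/vm-disk@migrate dc2-pool/vm-disk
    # later, in the background, copy the parent's data into the clone
    rbd flatten dc2-pool/vm-disk

Until the flatten completes, reads of blocks not yet written in the clone still go back to the parent in the remote datacenter, which matches the "slower initially" caveat above.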
Re: [PATCH v5] rbd: fix the memory leak of bio_chain_clone
On 10/09/2012 08:26 PM, Alex Elder wrote: On 09/11/2012 02:17 PM, Alex Elder wrote: On 09/06/2012 06:30 AM, Guangliang Zhao wrote: The bio_pair alloced in bio_chain_clone would not be freed, this will cause a memory leak. It could be freed actually only after 3 times release, because the reference count of bio_pair is initialized to 3 when bio_split and bio_pair_release only drops the reference count. The function bio_pair_release must be called three times for releasing bio_pair, and the callback functions of bios on the requests will be called when the last release time in bio_pair_release, however, these functions will also be called in rbd_req_cb. In other words, they will be called twice, and it may cause serious consequences. I just want you to know I'm looking at this patch now. This is a pretty complex bit of code though, so it may take me a bit to get back to you. Sorry about the long delay. I've finally had a chance to look a little more closely at your patch. I had to sort of port what you supplied so it fit the current code, which has changed a little since you first sent this. It looks to me like it should work. Rather than using bio_split() when a bio is more than is needed to satisfy a particular segment of a request, you create a clone of the bio and pass it back to the caller. The next call will use that clone rather than the original as it continues processing the next segment of the request. The original bio in this case will be freed as before, and the clone will be freed (drop a reference) in a subsequent call when it gets used up. I've done enough testing with this to be satisfied this works correctly. Do you have a test that you used to verify this both performed correctly when a split was found and no longer leaked anything? I am still interested to know if you had a particular way to verify that that the leak was occurring (or not). But we obviously won't be leaking any bio_pairs any more... -Alex I'm going to put it through some testing myself. I might want to make small revisions to a comment here or there, but otherwise I'll take it in unless I find it fails something. Thanks a lot. Reviewed-by: Alex Elder el...@inktank.com This patch clones bio chain from the origin directly instead of bio_split. The old bios which will be split may be modified by the callback fn, so their copys need to be saved(called split_bio). The new bio chain can be released whenever we don't need it. This patch can just handle the split of *single page* bios, but it's enough here for the following reasons: Only bios span across multiple osds need to be split, and these bios *must* be single page because of rbd_merge_bvec. With the function, the new bvec will not permitted to merge, if it make the bio cross the osd boundary, except it is the first one. In other words, there are two types of bio: - the bios don't cross the osd boundary They have one or more pages. The value of offset will always be 0 in this case, so nothing will be changed, and the code changes tmp bios doesn't take effact at all. - the bios cross the osd boundary Each one have only one page. These bios need to be split, and the offset is used to indicate the next bio, it makes sense only in this instance. The original bios may be modifid by the callback fn before the next bio_chain_clone() called, when a bio need to be split, so its copy will be saved. 
Signed-off-by: Guangliang Zhao gz...@suse.com --- drivers/block/rbd.c | 102 ++- 1 file changed, 60 insertions(+), 42 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 9917943..a605e1c 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -717,50 +717,70 @@ static void zero_bio_chain(struct bio *chain, int start_ofs) } } -/* +/** * bio_chain_clone - clone a chain of bios up to a certain length. - * might return a bio_pair that will need to be released. + * @old: bio to clone + * @split_bio: bio which will be split + * @offset: start point for bio clone + * @len: length of bio chain + * @gfp_mask: allocation priority + * + * Value of split_bio will be !NULL only when there is a bio need to be + * split. NULL otherwise. + * + * RETURNS: + * Pointer to new bio chain on success, NULL on failure. */ -static struct bio *bio_chain_clone(struct bio **old, struct bio **next, - struct bio_pair **bp, - int len, gfp_t gfpmask) +static struct bio *bio_chain_clone(struct bio **old, struct bio **split_bio, + int *offset, int len, gfp_t gfpmask) { -struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail = NULL; -int total = 0; - -if (*bp) { -bio_pair_release(*bp); -*bp = NULL; -} +struct bio *tmp, *old_chain, *split, *new_chain = NULL, *tail = NULL; +int total = 0, need = len; +
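On the question of how to exercise and verify this: one way to force the split path is direct I/O on a mapped device that straddles a 4 MB object boundary, for example (illustrative; /dev/rbd0 stands in for any mapped image, and this will overwrite data there):

    # write 512 KiB starting 512 bytes before the first 4 MiB object boundary,
    # so the request spans two objects and has to be split
    dd if=/dev/zero of=/dev/rbd0 oflag=direct bs=512 seek=8191 count=1024

Running such writes in a loop while watching kmemleak or slab usage in /proc/slabinfo is one rough way to see whether split-related allocations keep growing.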
Re: Unable to build CEPH packages
Hi Hemant - I'll be happy to help you with the problem. The first things that would be helpful for me to know are what version of ceph you are trying to build, what distribution you are building on, and what your yum repositories are. You can get the last piece of information with the yum repolist command. Thanks, Gary On Oct 10, 2012, at 2:34 AM, hemant surale wrote: Hi Folks, I was trying to build ceph from source code, to have a stable setup for VMs. While I was executing yum install rpm-build rpmdevtools the error observed was: No package rpm-build is available No package rpmdevtools is available All previous steps are working fine, but this error is observed while building the ceph packages per http://ceph.com/docs/master/source/build-packages/ Please help me out. Thanks Regards, Hemant Surale. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
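As a first check, rpm-build and rpmdevtools normally come from the standard distribution repositories, so something along these lines usually narrows the problem down (illustrative; on RHEL/CentOS, rpmdevtools may need the EPEL repository enabled):

    # list the repositories yum can currently see
    yum repolist
    # then retry the build dependencies
    yum install rpm-build rpmdevtools

If yum repolist comes back empty or is missing the base/updates repos, the repo configuration under /etc/yum.repos.d/ is the place to look.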
Re: Determining RBD storage utilization
On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote: Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exact due to rounding from kB, MB, GB, etc, but they should be darn close, a la ext4 keeps some reserved space, 5% by default, for when the disk is full so you are still able to use the filesystem and clean it up. 500G * 0.05 = 25G 500G - (25G + 7.5G) = 467G Can't tell you where the 7.5G comes from though! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
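If that reserved space matters on a large data volume, it can be shrunk or removed after the fact with tune2fs; for example, against the mapped rbd device from the df output above:

    # show the current reserved block count
    tune2fs -l /dev/rbd44 | grep -i 'reserved block'
    # cut the root-reserved space to 1% (or 0 for a pure data volume)
    tune2fs -m 1 /dev/rbd44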
Re: Determining RBD storage utilization
Damien, Thanks for solving that part of the mystery. I can't believe I forgot about that. Thanks for the reminder and the clear explanation. - Travis On Wed, Oct 10, 2012 at 1:28 PM, Damien Churchill dam...@gmail.com wrote: On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote: Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exact due to rounding from kB, MB, GB, etc, but they should be darn close, a la ext4 keeps some reserved space, 5% by default, for when the disk is full so you are still able to use the filesystem and clean it up. 500G * 0.05 = 25G 500G - (25G + 7.5G) = 467G Can't tell you where the 7.5G comes from though! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Determining RBD storage utilization
On 10/10/2012 10:10 AM, Travis Rhoden wrote: Hey folks, I have two questions about determining how much storage has been used *inside* of an RBD. First, I'm confused by the output of df. I've created, mapped, and mounted a 500GB RBD, and see the following: # df -h /srv/test Filesystem Size Used Avail Use% Mounted on /dev/rbd44 500G 7.5G 467G 2% /srv/test # cd /srv/test # du -sh . 20K . Any idea why a brand-new mount with no files added shows 7.5GB of used space? Does this come from the file system formatting (ext4 in this case)? Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exactly due to rounding from kB, MB, GB, etc, but they should be darn close, a la df -h /dev/sda1 Filesystem Size Used Avail Use% Mounted on /dev/sda1 15G 1.7G 13G 12% / Second question: is it possible to know how much storage has been used in the RBD without mounting it and running df or du? For the same RBD as above, I see: # rbd info test rbd image 'test': size 500 GB in 128000 objects order 22 (4096 KB objects) block_name_prefix: rb.0.18f9.2d9c66c6 parent: (pool -1) Is there perhaps a way to know the number of objects that have been 'used'? Then I could take that and multiply by the object size (4MB). You can get an upper bound by looking at the number of objects in the image: rados --pool rbd ls | grep -c '^rb\.0\.18f9\.2d9c66c6' Each object represents a section of the block device, but they may not be entirely filled (objects are sparse), so this will probably still be a higher estimate than df. Also note that listing all the objects in a pool is an expensive operation, so it shouldn't be done very often. Josh I'm running 0.48.1argonaut on Ubuntu 12.04. RBD maps are also on Ubuntu 12.04, with the stock 3.2.0-29-generic kernel. Thanks, - Travis -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
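Putting Josh's suggestion together into one step, a rough upper bound on the space consumed by the image can be computed like this (remember that the pool-wide listing is expensive, as noted):

    # count the image's objects and multiply by the 4 MB object size
    count=$(rados --pool rbd ls | grep -c '^rb\.0\.18f9\.2d9c66c6')
    echo "upper bound: $((count * 4)) MB"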
Re: Determining RBD storage utilization
I don't know the ext4 internals at all, but filesystems tend to require allocation tables of various sorts (for managing extents, etc). 7.5GB out of 500GB seems a little large for that metadata, but isn't ridiculously so... On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote: On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote: Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exact due to rounding from kB, MB, GB, etc, but they should be darn close, a la ext4 keeps some reserved space, 5% by default, for when the disk is full so you are still able to use the filesystem and clean it up. 500G * 0.05 = 25G 500G - (25G + 7.5G) = 467G Can't tell you where the 7.5G comes from though! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Determining RBD storage utilization
Thanks for the input, Gregory and Josh. What I am hearing is that this has everything to do with the filesystem, and nothing to do with the block device on Ceph. Thanks again, - Travis On Wed, Oct 10, 2012 at 1:55 PM, Gregory Farnum g...@inktank.com wrote: I don't know the ext4 internals at all, but filesystems tend to require allocation tables of various sorts (for managing extents, etc). 7.5GB out of 500GB seems a little large for that metadata, but isn't ridiculously so... On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote: On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote: Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exact due to rounding from kB, MB, GB, etc, but they should be darn close, a la ext4 keeps some reserved space, 5% by default, for when the disk is full so you are still able to use the filesystem and clean it up. 500G * 0.05 = 25G 500G - (25G + 7.5G) = 467G Can't tell you where the 7.5G comes from though! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with ceph osd create uuid
After applying the patch, we went through 65 successful cluster reinstalls without encountering the error (previously it would happen at least every 8-10 reinstalls). Therefore it really looks like this fixed the issue. Thanks! On Mon, Oct 8, 2012 at 5:17 PM, Sage Weil s...@inktank.com wrote: Hi Mandell, I see the bug. I pushed a fix to wip-mon-command-race, 5011485e5e3fc9952ea58cd668e6feefc98024bf, and I believe fixes it, but I wasn't able to easily reproduce it myself so I'm not 100% certain. Can you give it a go? Thanks! sage On Mon, 8 Oct 2012, Mandell Degerness wrote: osd dump output: [root@node-172-20-0-14 ~]# ceph osd dump 2 dumped osdmap epoch 2 epoch 2 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:57.297479 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 1 osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 564d7166-07b7-48cc-9b50-46ef7b260d5c [root@node-172-20-0-14 ~]# ceph osd dump 3 dumped osdmap epoch 3 epoch 3 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:58.299491 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 1 osd.0 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c [root@node-172-20-0-14 ~]# ceph osd dump 4 dumped osdmap epoch 4 epoch 4 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:59.304087 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 3 osd.0 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c osd.1 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 3351a0f0-f6e8-430a-b7a4-ea613a3ddf35 osd.2 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 3f04cdbe-a468-42d3-a465-2487cc369d90 On Mon, Oct 8, 2012 at 3:49 PM, Sage Weil s...@inktank.com wrote: On Mon, 8 Oct 2012, Mandell Degerness wrote: Sorry, I should have used the https link: https://gist.github.com/af546ece91be0ba268d3 What do 'ceph osd dump 2', 'ceph osd dump 3', and 'ceph osd dump 4' say? thanks! 
sage On Mon, Oct 8, 2012 at 3:20 PM, Mandell Degerness mand...@pistoncloud.com wrote: Here is the log I got when running with the options suggested by sage: g...@gist.github.com:af546ece91be0ba268d3.git On Mon, Oct 8, 2012 at 11:34 AM, Sage Weil s...@inktank.com wrote: Hi Mandell, On Mon, 8 Oct 2012, Mandell Degerness wrote: Hi list, I've run into a bit of a weird error and I'm hoping that you can tell me what is going wrong. There seems to be a race condition in the way I am using ceph osd create uuid and actually creating the OSD's. The log from one of the servers is at: https://gist.github.com/528e347a5c0ffeb30abd The process I am trying to follow (for the OSDs) is: 1) Create XFS file system on disk. 2) Use FS UUID as source to get a new OSD id #. 'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672' (Returns 2.) 3) Pass the UUID and OSD id to the create osd command ceph-osd -c /etc/ceph/ceph.conf --fsid e61c1b11-4a1c-47aa-868d-7b51b1e610d3 --osd-uuid 32895846-ca1c-4265-9ce7-9f2a42b41672 -i 2 --mkfs --osd-journal-size 8192 4) Start the OSD, as part of the start process, I verify that the whoami and osd fsid agree (in case this disk came from a previous cluster, somehow) - should be just a sanity check 'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672' (Returns 1!) This is clearly a race condition because we have several cluster creations without this happening and then this happens about once every 8
Re: [PATCH v5] rbd: fix the memory leak of bio_chain_clone
On 10/09/2012 08:26 PM, Alex Elder wrote: On 09/11/2012 02:17 PM, Alex Elder wrote: On 09/06/2012 06:30 AM, Guangliang Zhao wrote: The bio_pair alloced in bio_chain_clone would not be freed, this will cause a memory leak. It could be freed actually only after 3 times release, because the reference count of bio_pair is initialized to 3 when bio_split and bio_pair_release only drops the reference count. The function bio_pair_release must be called three times for releasing bio_pair, and the callback functions of bios on the requests will be called when the last release time in bio_pair_release, however, these functions will also be called in rbd_req_cb. In other words, they will be called twice, and it may cause serious consequences. I just want you to know I'm looking at this patch now. This is a pretty complex bit of code though, so it may take me a bit to get back to you. Sorry about the long delay. I've finally had a chance to look a little more closely at your patch. I had to sort of port what you supplied so it fit the current code, which has changed a little since you first sent this. I'm sorry to report that I'm getting a consistent failure when running xfstests #13 when running with this patch applied over rbd images. I don't have time to look at it any more today but we need to get this fixed soon. -Alex It looks to me like it should work. Rather than using bio_split() when a bio is more than is needed to satisfy a particular segment of a request, you create a clone of the bio and pass it back to the caller. The next call will use that clone rather than the original as it continues processing the next segment of the request. The original bio in this case will be freed as before, and the clone will be freed (drop a reference) in a subsequent call when it gets used up. Do you have a test that you used to verify this both performed correctly when a split was found and no longer leaked anything? I'm going to put it through some testing myself. I might want to make small revisions to a comment here or there, but otherwise I'll take it in unless I find it fails something. Thanks a lot. Reviewed-by: Alex Elder el...@inktank.com This patch clones bio chain from the origin directly instead of bio_split. The old bios which will be split may be modified by the callback fn, so their copys need to be saved(called split_bio). The new bio chain can be released whenever we don't need it. This patch can just handle the split of *single page* bios, but it's enough here for the following reasons: Only bios span across multiple osds need to be split, and these bios *must* be single page because of rbd_merge_bvec. With the function, the new bvec will not permitted to merge, if it make the bio cross the osd boundary, except it is the first one. In other words, there are two types of bio: - the bios don't cross the osd boundary They have one or more pages. The value of offset will always be 0 in this case, so nothing will be changed, and the code changes tmp bios doesn't take effact at all. - the bios cross the osd boundary Each one have only one page. These bios need to be split, and the offset is used to indicate the next bio, it makes sense only in this instance. The original bios may be modifid by the callback fn before the next bio_chain_clone() called, when a bio need to be split, so its copy will be saved. 
Signed-off-by: Guangliang Zhao gz...@suse.com --- drivers/block/rbd.c | 102 ++- 1 file changed, 60 insertions(+), 42 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 9917943..a605e1c 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -717,50 +717,70 @@ static void zero_bio_chain(struct bio *chain, int start_ofs) } } -/* +/** * bio_chain_clone - clone a chain of bios up to a certain length. - * might return a bio_pair that will need to be released. + * @old: bio to clone + * @split_bio: bio which will be split + * @offset: start point for bio clone + * @len: length of bio chain + * @gfp_mask: allocation priority + * + * Value of split_bio will be !NULL only when there is a bio need to be + * split. NULL otherwise. + * + * RETURNS: + * Pointer to new bio chain on success, NULL on failure. */ -static struct bio *bio_chain_clone(struct bio **old, struct bio **next, - struct bio_pair **bp, - int len, gfp_t gfpmask) +static struct bio *bio_chain_clone(struct bio **old, struct bio **split_bio, + int *offset, int len, gfp_t gfpmask) { -struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail = NULL; -int total = 0; - -if (*bp) { -bio_pair_release(*bp); -*bp = NULL; -} +struct bio *tmp, *old_chain, *split, *new_chain = NULL, *tail = NULL; +int total = 0, need = len; +split = *split_bio; +
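For anyone wanting to reproduce the failure Alex mentions, xfstests test 013 can be pointed at an rbd-backed device along roughly these lines (image, device, and mount point names are illustrative):

    rbd map test-image            # appears as e.g. /dev/rbd1
    mkfs.xfs /dev/rbd1
    mkdir -p /mnt/test
    export TEST_DEV=/dev/rbd1
    export TEST_DIR=/mnt/test
    cd xfstests && ./check 013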
Re: Determining RBD storage utilization
On Wed, 10 Oct 2012, Gregory Farnum wrote: I don't know the ext4 internals at all, but filesystems tend to require allocation tables of various sorts (for managing extents, etc). 7.5GB out of 500GB seems a little large for that metadata, but isn't ridiculously so... ext3/4 are particularly bad about this, with lots of space statically set aside for inodes and allocation metadata. s On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote: On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote: Additionally, 500G - 7.5G != 467G (the number shown as Avail). Why the huge discrepancy? I don't expect the numbers to add up exact due to rounding from kB, MB, GB, etc, but they should be darn close, a la ext4 keeps some reserved space, 5% by default, for when the disk is full so you are still able to use the filesystem and clean it up. 500G * 0.05 = 25G 500G - (25G + 7.5G) = 467G Can't tell you where the 7.5G comes from though! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
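If that static metadata overhead is a concern, ext4's inode density can be tuned when the filesystem is (re)created; a sketch, with a ratio that is illustrative and should be sized to the expected file count:

    # allocate one inode per 1 MiB instead of the default one per 16 KiB,
    # which shrinks the static inode tables considerably on a large volume
    mkfs.ext4 -i 1048576 /dev/rbd44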
Re: Problem with ceph osd create uuid
Wonderful, thanks! sage On Wed, 10 Oct 2012, Nick Bartos wrote: After applying the patch, we went through 65 successful cluster reinstalls without encountering the error (previously it would happen at least every 8-10 reinstalls). Therefore it really looks like this fixed the issue. Thanks! On Mon, Oct 8, 2012 at 5:17 PM, Sage Weil s...@inktank.com wrote: Hi Mandell, I see the bug. I pushed a fix to wip-mon-command-race, 5011485e5e3fc9952ea58cd668e6feefc98024bf, and I believe fixes it, but I wasn't able to easily reproduce it myself so I'm not 100% certain. Can you give it a go? Thanks! sage On Mon, 8 Oct 2012, Mandell Degerness wrote: osd dump output: [root@node-172-20-0-14 ~]# ceph osd dump 2 dumped osdmap epoch 2 epoch 2 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:57.297479 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 1 osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 564d7166-07b7-48cc-9b50-46ef7b260d5c [root@node-172-20-0-14 ~]# ceph osd dump 3 dumped osdmap epoch 3 epoch 3 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:58.299491 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 1 osd.0 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c [root@node-172-20-0-14 ~]# ceph osd dump 4 dumped osdmap epoch 4 epoch 4 fsid d82665b6-3435-44b8-a89e-f7185f78d09d created 2012-10-08 21:29:52.232400 modifed 2012-10-08 21:29:59.304087 flags pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 max_osd 3 osd.0 up in weight 1 up_from 3 up_thru 0 down_at 0 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c osd.1 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 3351a0f0-f6e8-430a-b7a4-ea613a3ddf35 osd.2 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 exists,new 3f04cdbe-a468-42d3-a465-2487cc369d90 On Mon, Oct 8, 2012 at 3:49 PM, Sage Weil s...@inktank.com wrote: On Mon, 8 Oct 2012, Mandell Degerness wrote: Sorry, I should have used the https link: https://gist.github.com/af546ece91be0ba268d3 What do 'ceph osd dump 2', 'ceph osd dump 3', and 'ceph osd dump 4' say? thanks! 
sage On Mon, Oct 8, 2012 at 3:20 PM, Mandell Degerness mand...@pistoncloud.com wrote: Here is the log I got when running with the options suggested by sage: g...@gist.github.com:af546ece91be0ba268d3.git On Mon, Oct 8, 2012 at 11:34 AM, Sage Weil s...@inktank.com wrote: Hi Mandell, On Mon, 8 Oct 2012, Mandell Degerness wrote: Hi list, I've run into a bit of a weird error and I'm hoping that you can tell me what is going wrong. There seems to be a race condition in the way I am using ceph osd create uuid and actually creating the OSD's. The log from one of the servers is at: https://gist.github.com/528e347a5c0ffeb30abd The process I am trying to follow (for the OSDs) is: 1) Create XFS file system on disk. 2) Use FS UUID as source to get a new OSD id #. 'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672' (Returns 2.) 3) Pass the UUID and OSD id to the create osd command ceph-osd -c /etc/ceph/ceph.conf --fsid e61c1b11-4a1c-47aa-868d-7b51b1e610d3 --osd-uuid 32895846-ca1c-4265-9ce7-9f2a42b41672 -i 2 --mkfs --osd-journal-size 8192 4) Start the OSD, as part of the start process, I verify that the whoami and osd fsid agree (in case this disk came from a previous cluster, somehow) - should be just a sanity check 'ceph', 'osd',
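For reference, the provisioning sequence described in this thread maps onto roughly these per-disk commands (a sketch; device names and mount paths are illustrative and depend on the osd data setting in ceph.conf):

    mkfs.xfs -f /dev/sdb1
    uuid=$(blkid -o value -s UUID /dev/sdb1)
    id=$(ceph osd create "$uuid")              # allocate an OSD id for this uuid
    mkdir -p /var/lib/ceph/osd/ceph-$id
    mount /dev/sdb1 /var/lib/ceph/osd/ceph-$id
    ceph-osd -c /etc/ceph/ceph.conf --osd-uuid "$uuid" -i "$id" \
        --mkfs --osd-journal-size 8192
    # before starting the daemon, sanity-check that the on-disk whoami/fsid
    # agree with what 'ceph osd dump' reports for osd.$id

The race discussed above was on the monitor side of the repeated 'ceph osd create <uuid>' calls (hence the wip-mon-command-race fix), not in this sequence itself.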
rgw_rest.cc build failure
This needed for latest master: diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc index 53bbeca..3612a9e 100644 --- a/src/rgw/rgw_rest.cc +++ b/src/rgw/rgw_rest.cc @@ -1,4 +1,5 @@ #include errno.h +#include limits.h #include common/Formatter.h #include common/utf8.h to fix: CXXradosgw-rgw_rest.o rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*, bool*)’: rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*, bool*)’: rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope - Noah -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rgw_rest.cc build failure
I'll apply this, can I assume you have signed off this patch? On Wed, Oct 10, 2012 at 2:25 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: This needed for latest master: diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc index 53bbeca..3612a9e 100644 --- a/src/rgw/rgw_rest.cc +++ b/src/rgw/rgw_rest.cc @@ -1,4 +1,5 @@ #include errno.h +#include limits.h #include common/Formatter.h #include common/utf8.h to fix: CXXradosgw-rgw_rest.o rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*, bool*)’: rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*, bool*)’: rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope - Noah -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rgw_rest.cc build failure
On Wed, Oct 10, 2012 at 2:29 PM, Yehuda Sadeh yeh...@inktank.com wrote: I'll apply this, can I assume you have signed off this patch? Ahh, yes, sorry. Signed-off-by: Noah Watkins noahwatk...@gmail.com On Wed, Oct 10, 2012 at 2:25 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: This needed for latest master: diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc index 53bbeca..3612a9e 100644 --- a/src/rgw/rgw_rest.cc +++ b/src/rgw/rgw_rest.cc @@ -1,4 +1,5 @@ #include errno.h +#include limits.h #include common/Formatter.h #include common/utf8.h to fix: CXXradosgw-rgw_rest.o rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*, bool*)’: rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope rgw/rgw_rest.cc: In static member function ‘static int RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*, bool*)’: rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope - Noah -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL v5] java: add libcephfs Java bindings
Laszlo, James: Changes based on your previous feedback are ready for review. I pushed the changes here: git://github.com/noahdesu/ceph.git wip-java-cephfs Thanks! - Noah From 0d8c4dc39f9b8f2e264bb2503c053418ad72b705 Mon Sep 17 00:00:00 2001 From: Noah Watkins noahwatk...@gmail.com Date: Wed, 10 Oct 2012 13:57:03 -0700 Subject: [PATCH] java: update deb bits from ceph-devel feedback Signed-off-by: Noah Watkins noahwatk...@gmail.com --- debian/.gitignore|3 ++- debian/control | 10 -- debian/libceph-java.jlibs|1 + debian/libceph-jni.install |1 + debian/libceph1-java.install |2 -- debian/rules |1 + src/java/.gitignore |2 +- src/java/Makefile.am |8 src/java/README |2 +- src/java/build.xml |6 +++--- 10 files changed, 22 insertions(+), 14 deletions(-) create mode 100644 debian/libceph-java.jlibs create mode 100644 debian/libceph-jni.install delete mode 100644 debian/libceph1-java.install diff --git a/debian/.gitignore b/debian/.gitignore index c5b73ce..2fd5a05 100644 --- a/debian/.gitignore +++ b/debian/.gitignore @@ -30,5 +30,6 @@ /rest-bench-dbg /rest-bench /python-ceph -/libceph1-java +/libceph-java +/libceph-jni /tmp diff --git a/debian/control b/debian/control index 579855f..62c85d9 100644 --- a/debian/control +++ b/debian/control @@ -319,8 +319,14 @@ Description: Python libraries for the Ceph distributed filesystem This package contains Python libraries for interacting with Ceph's RADOS object storage, and RBD (RADOS block device). -Package: libceph1-java +Package: libceph-java Section: java +Architecture: all +Depends: libceph-jni, ${java:Depends}, ${misc:Depends} +Description: Java libraries for the Ceph File System. + +Package: libceph-jni Architecture: linux-any +Section: libs Depends: libcephfs1, ${shlibs:Depends}, ${java:Depends}, ${misc:Depends} -Description: Java libraries for the Ceph File System +Description: Java Native Interface library for CephFS Java bindings. diff --git a/debian/libceph-java.jlibs b/debian/libceph-java.jlibs new file mode 100644 index 000..952a190 --- /dev/null +++ b/debian/libceph-java.jlibs @@ -0,0 +1 @@ +src/java/ceph.jar diff --git a/debian/libceph-jni.install b/debian/libceph-jni.install new file mode 100644 index 000..072b990 --- /dev/null +++ b/debian/libceph-jni.install @@ -0,0 +1 @@ +usr/lib/libcephfs_jni.so* usr/lib/jni diff --git a/debian/libceph1-java.install b/debian/libceph1-java.install deleted file mode 100644 index 98133e4..000 --- a/debian/libceph1-java.install +++ /dev/null @@ -1,2 +0,0 @@ -usr/lib/libcephfs_jni.so* usr/lib/jni -usr/lib/libcephfs.jar usr/share/java diff --git a/debian/rules b/debian/rules index b848ddc..6d61385 100755 --- a/debian/rules +++ b/debian/rules @@ -93,6 +93,7 @@ install: build # Add here commands to install the package into debian/testpack. # Build architecture-independent files here. binary-indep: build install + jh_installlibs -v -i # We have nothing to do by default. # Build architecture-dependent files here. 
diff --git a/src/java/.gitignore b/src/java/.gitignore index 8208e2b..b8eb0e9 100644 --- a/src/java/.gitignore +++ b/src/java/.gitignore @@ -1,4 +1,4 @@ *.class -libcephfs.jar +ceph.jar native/com_ceph_fs_CephMount.h TEST-*.txt diff --git a/src/java/Makefile.am b/src/java/Makefile.am index 5c54f36..87d763d 100644 --- a/src/java/Makefile.am +++ b/src/java/Makefile.am @@ -24,20 +24,20 @@ CEPH_PROXY=java/com/ceph/fs/CephMount.class $(CEPH_PROXY): $(JAVA_SRC) export CLASSPATH=java/ ; - $(JAVAC) java/com/ceph/fs/*.java + $(JAVAC) -source 1.5 -target 1.5 java/com/ceph/fs/*.java $(JAVA_H): $(CEPH_PROXY) export CLASSPATH=java/ ; \ $(JAVAH) -jni -o $@ com.ceph.fs.CephMount -libcephfs.jar: $(CEPH_PROXY) +ceph.jar: $(CEPH_PROXY) $(JAR) cf $@ $(JAVA_CLASSES:%=-C java %) # $(ESCAPED_JAVA_CLASSES:%=-C java %) javadir = $(libdir) -java_DATA = libcephfs.jar +java_DATA = ceph.jar BUILT_SOURCES = $(JAVA_H) -CLEANFILES = -rf java/com/ceph/fs/*.class $(JAVA_H) libcephfs.jar +CLEANFILES = -rf java/com/ceph/fs/*.class $(JAVA_H) ceph.jar endif diff --git a/src/java/README b/src/java/README index ca39a44..d58ab8a 100644 --- a/src/java/README +++ b/src/java/README @@ -33,7 +33,7 @@ Ant is used to run the unit test (apt-get install ant). For example: 1. The tests depend on the compiled wrappers. If the wrappers are installed as part of a package (e.g. Debian package) then this should 'just work'. Ant will -also look in the current directory for 'libcephfs.jar' and in ../.libs for the +also look in the current directory for 'ceph.jar' and in ../.libs for the JNI library. If all else fails, set the environment variables CEPHFS_JAR, and CEPHFS_JNI_LIB accordingly. diff --git a/src/java/build.xml b/src/java/build.xml index f846ca4..203ffc0 100644 --- a/src/java/build.xml +++
Re: [GIT PULL v5] java: add libcephfs Java bindings
Hi Noah, On Wed, 2012-10-10 at 15:00 -0700, Noah Watkins wrote: Laszlo, James: Changes based on your previous feedback are ready for review. I pushed the changes here: git://github.com/noahdesu/ceph.git wip-java-cephfs Checking only the diff, as it's 3 am here. It looks quite OK. But will check it further in the afternoon. Laszlo/GCS -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL v5] java: add libcephfs Java bindings
On Wed, Oct 10, 2012 at 5:53 PM, Laszlo Boszormenyi (GCS) g...@debian.hu wrote: Hi Noah, On Wed, 2012-10-10 at 15:00 -0700, Noah Watkins wrote: Laszlo, James: Changes based on your previous feedback are ready for review. I pushed the changes here: git://github.com/noahdesu/ceph.git wip-java-cephfs Checking only the diff, as it's 3 am here. It looks quite OK. But will check it further in the afternoon. Ok, great. The one thing I was most curious about is if the ceph.jar reference in debian/libceph-java.jlibs is correct. Previously I was able to reference its installation path within debian/tmp (usr/lib/ceph.jar), but jh_installlibs was only able to find the jar when I referenced its build location (src/java/ceph.jar). Thanks! Laszlo/GCS -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] rbd: simplify rbd_do_op() et al
These three patches simplify a few paths through the code involving read and write requests. -Alex [PATCH 1/3] rbd: kill rbd_req_{read,write}() [PATCH 2/3] rbd: drop rbd_do_op() opcode and flags [PATCH 3/3] rbd: consolidate rbd_do_op() calls -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] rbd: consolidate rbd_do_op() calls
The two calls to rbd_do_op() from rbd_rq_fn() differ only in the value passed for the snapshot id and the snapshot context. For reads the snapshot always comes from the mapping, and for writes the snapshot id is always CEPH_NOSNAP. The snapshot context is always null for reads. For writes, the snapshot context always comes from the rbd header, but it is acquired under protection of header semaphore and could change thereafter, so we can't simply use what's available inside rbd_do_op(). Eliminate the snapid parameter from rbd_do_op(), and set it based on the I/O direction inside that function instead. Always pass the snapshot context acquired in the caller, but reset it to a null pointer inside rbd_do_op() if the operation is a read. As a result, there is no difference in the read and write calls to rbd_do_op() made in rbd_rq_fn(), so just call it unconditionally. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 26 +- 1 file changed, 9 insertions(+), 17 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 396af14..ca28036 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -1163,7 +1163,6 @@ done: static int rbd_do_op(struct request *rq, struct rbd_device *rbd_dev, struct ceph_snap_context *snapc, -u64 snapid, u64 ofs, u64 len, struct bio *bio, struct rbd_req_coll *coll, @@ -1177,6 +1176,7 @@ static int rbd_do_op(struct request *rq, u32 payload_len; int opcode; int flags; + u64 snapid; seg_name = rbd_segment_name(rbd_dev, ofs); if (!seg_name) @@ -1187,10 +1187,13 @@ static int rbd_do_op(struct request *rq, if (rq_data_dir(rq) == WRITE) { opcode = CEPH_OSD_OP_WRITE; flags = CEPH_OSD_FLAG_WRITE|CEPH_OSD_FLAG_ONDISK; + snapid = CEPH_NOSNAP; payload_len = seg_len; } else { opcode = CEPH_OSD_OP_READ; flags = CEPH_OSD_FLAG_READ; + snapc = NULL; + snapid = rbd_dev-mapping.snap_id; payload_len = 0; } @@ -1518,24 +1521,13 @@ static void rbd_rq_fn(struct request_queue *q) kref_get(coll-kref); bio = bio_chain_clone(rq_bio, next_bio, bp, op_size, GFP_ATOMIC); - if (!bio) { + if (bio) + (void) rbd_do_op(rq, rbd_dev, snapc, + ofs, op_size, + bio, coll, cur_seg); + else rbd_coll_end_req_index(rq, coll, cur_seg, -ENOMEM, op_size); - goto next_seg; - } - - /* init OSD command: write or read */ - if (do_write) - (void) rbd_do_op(rq, rbd_dev, - snapc, CEPH_NOSNAP, - ofs, op_size, bio, - coll, cur_seg); - else - (void) rbd_do_op(rq, rbd_dev, - NULL, rbd_dev-mapping.snap_id, - ofs, op_size, bio, - coll, cur_seg); -next_seg: size -= op_size; ofs += op_size; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html