Re: Client Location

2012-10-10 Thread James Horner
Hi There

The basic setup I'm trying to get is a backend to a hypervisor cluster, so that
auto-failover and live migration work. The main thing is that we have a number
of datacenters with a gigabit interconnect that is not always 100% reliable. In
the event of a failure we want all the virtual machines to fail over to the
remaining datacenters, so we need all the data in each location.
The other issue is that within each datacenter we can use link aggregation to
increase the bandwidth between hypervisors and the ceph cluster, but between the
datacenters we only have the gigabit, so it becomes essential to have the
hypervisors looking at the storage in the same datacenter.
Another consideration is that the virtual machines might get migrated between
datacenters without any failure, and the main problem I see with Mark's
suggestion is that in this mode the migrated VM would still be connecting to
the OSDs in the remote datacenter.

Tbh I'm fairly new to Ceph and I know I'm asking for everything and the kitchen
sink! Any thoughts would be very helpful though.

Thanks
James

- Original Message -
From: Gregory Farnum g...@inktank.com
To: Mark Kampe mark.ka...@inktank.com
Cc: James Horner james.hor...@precedent.co.uk, ceph-devel@vger.kernel.org
Sent: Tuesday, October 9, 2012 5:48:37 PM
Subject: Re: Client Location

On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe mark.ka...@inktank.com wrote:
 I'm not a real engineer, so please forgive me if I misunderstand,
 but can't you create a separate rule for each data center (choosing
 first a local copy, and then remote copies), which should ensure
 that the primary is always local.  Each data center would then
 use a different pool, associated with the appropriate location-
 sensitive rule.

 Does this approach get you the desired locality preference?

This sounds right to me — I think maybe there's a misunderstanding
about how CRUSH works. What precisely are you after, James?
-Greg
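
(For reference, a minimal sketch of what such a location-aware rule could look
like in a decompiled CRUSH map. The datacenter bucket names dc1/dc2, the pool
name and the ruleset number are invented for illustration, and the exact
syntax depends on your Ceph version:)

  rule dc1-primary {
          ruleset 3
          type replicated
          min_size 2
          max_size 3
          step take dc1
          step chooseleaf firstn 1 type host
          step emit
          step take dc2
          step chooseleaf firstn -1 type host
          step emit
  }

  # fetch and decompile the map, add the rule, recompile and install it,
  # then point a per-DC pool at the rule
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
  ceph osd pool create dc1-rbd 128
  ceph osd pool set dc1-rbd crush_ruleset 3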


Ignore O_SYNC for rbd cache

2012-10-10 Thread Andrey Korolyov
Hi,

Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default
CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite
fantastic performance - on both reads and writes Ceph completely
utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical
limit of the sum of all disk bandwidths, bearing in mind the replication
level. The only thing that may bring down overall performance is
O_SYNC|O_DIRECT writes, which will be issued by almost every database
server in the default setup. Assuming that the database config may be
untouchable and somehow I can build a very reliable hardware setup which
will never fail on power, should Ceph have an option to ignore these
flags? Maybe there are other real-world cases for including such an
option, or I am very wrong even to think of fooling client applications
in this way.
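
A quick way to see the kind of sync-write penalty described above (purely
illustrative; the mount point and sizes are made up):

  # buffered writes vs. per-write sync vs. direct I/O on an RBD-backed fs
  dd if=/dev/zero of=/mnt/rbd/test.img bs=4k count=10000
  dd if=/dev/zero of=/mnt/rbd/test.img bs=4k count=10000 oflag=sync
  dd if=/dev/zero of=/mnt/rbd/test.img bs=4k count=10000 oflag=direct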

Thank you for any suggestion!


Ceph Disk Controller Performance Article Part1

2012-10-10 Thread Mark Nelson

Hi Guys,

Just wanted to let you know we've published a short introductory article 
on the ceph blog looking at write performance on a couple of different 
RAID/SAS controllers configured in different ways.  Hopefully you guys 
find it useful!  We'll likely be publishing more articles in the future 
that dig deeper and wider into ceph performance on the test platform 
being used.


The article is available here:

http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/

Thanks,
Mark


Re: [PATCH 0/4] rbd: finish up basic format 2 support

2012-10-10 Thread Cláudio Martins

On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder el...@inktank.com wrote:
 This series includes updates for two patches posted previously.
 
   -Alex

 Greetings,

 We're gearing up to test v0.52 (specifically the RBD stuff) on our
cluster. After reading this series of posts about rbd format 2 patches
I began wondering if we should start testing these patches as well or
not. To put it simply, what I'd like to know is:

 Is it enough to use the 3.6 vanilla kernel client to take full
advantage of the rbd changes in v0.52 (i.e. new RBD cloning features)?

 Do we have any benefits from applying any of these patches on top of
v3.6 and using format 2, assuming that we stick to v0.52 on the
server, or is this strictly v0.53 and beyond stuff?


 I apologize if this is a dumb question, but by looking at the v0.52
changelog, at doc/rbd/* and the list, it doesn't seem clear how this
fits with v0.52.

 Thanks in advance

Best regards

Cláudio



Re: Ignore O_SYNC for rbd cache

2012-10-10 Thread Sage Weil
On Wed, 10 Oct 2012, Andrey Korolyov wrote:
 Hi,
 
 Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default
 CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite
 fantastic performance - on both reads and writes Ceph completely
 utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical
 limit of the sum of all disk bandwidths, bearing in mind the replication
 level. The only thing that may bring down overall performance is
 O_SYNC|O_DIRECT writes, which will be issued by almost every database
 server in the default setup. Assuming that the database config may be
 untouchable and somehow I can build a very reliable hardware setup which
 will never fail on power, should Ceph have an option to ignore these
 flags? Maybe there are other real-world cases for including such an
 option, or I am very wrong even to think of fooling client applications
 in this way.

I certainly wouldn't recommend it, but there are probably use cases where 
it makes sense (i.e., the data isn't as important as the performance).  
Any such option would probably be called

 rbd async flush danger danger = true

and would trigger a flush but not wait for it, or perhaps

 rbd ignore flush danger danger = true

which would not honor flush at all. 

This would jeopardize the integrity of the file system living on the RBD
image; file systems rely on flush to order their commits, and playing fast
and loose with that can lead to any number of corruptions.  The only silver
lining is that in the not-so-distant past (3-4 years ago) this was poorly
supported by the block layer and file systems alike, and ext3 didn't crash
and burn quite as often as you might have expected.

Anyway, not something I would recommend, certainly not for a generic VM
platform.  Maybe if you have a specific performance-sensitive application
you can afford to let crash and burn...

sage


Re: [PATCH 0/4] rbd: finish up basic format 2 support

2012-10-10 Thread Josh Durgin

On 10/10/2012 08:55 AM, Cláudio Martins wrote:


On Tue, 09 Oct 2012 13:57:09 -0700 Alex Elder el...@inktank.com wrote:

This series includes updates for two patches posted previously.

-Alex


  Greetings,

  We're gearing up to test v0.52 (specifically the RBD stuff) on our
cluster. After reading this series of posts about rbd format 2 patches
I began wondering if we should start testing these patches as well or
not. To put it simply, what I'd like to know is:

  Is it enough to use the 3.6 vanilla kernel client to take full
advantage of the rbd changes in v0.52 (i.e. new RBD cloning features)?

  Do we have any benefits from applying any of these patches on top of
v3.6 and using format 2, assuming that we stick to v0.52 on the
server, or is this strictly v0.53 and beyond stuff?


  I apologize if this is a dumb question, but by looking at the v0.52
changelog, at doc/rbd/* and the list, it doesn't seem clear how this
fits with v0.52.

  Thanks in advance

Best regards

Cláudio


These patches support using format 2, to make adding new features
easy, but this is not very useful to you yet. They don't yet support
any new features (like cloning) - that's the next step, but it will
take a bunch more work.

To use rbd cloning, you'll need to access rbd through userspace (e.g.
with qemu and librbd) for now.
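
As a rough sketch of what the userspace-only path looks like (the image name,
size and the exact flag spelling are assumptions for this era of the rbd
tool):

  # format 2 has to be requested explicitly at creation time; such images
  # can only be used through librbd (qemu, rbd CLI), not the 3.6 krbd
  rbd create --size 10240 --format 2 rbd/parent
  rbd info rbd/parent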

Josh


Re: Ignore O_SYNC for rbd cache

2012-10-10 Thread Josh Durgin

On 10/10/2012 09:23 AM, Sage Weil wrote:

On Wed, 10 Oct 2012, Andrey Korolyov wrote:

Hi,

Recent tests on my test rack with a 20G IB (IPoIB, 64k MTU, default
CUBIC, CFQ, LSI SAS 2108 w/ wb cache) interconnect show quite
fantastic performance - on both reads and writes Ceph completely
utilizes the disk bandwidth, reaching as much as 0.9 of the theoretical
limit of the sum of all disk bandwidths, bearing in mind the replication
level. The only thing that may bring down overall performance is
O_SYNC|O_DIRECT writes, which will be issued by almost every database
server in the default setup. Assuming that the database config may be
untouchable and somehow I can build a very reliable hardware setup which
will never fail on power, should Ceph have an option to ignore these
flags? Maybe there are other real-world cases for including such an
option, or I am very wrong even to think of fooling client applications
in this way.


I certainly wouldn't recommend it, but there are probably use cases where
it makes sense (i.e., the data isn't as important as the performance).
Any such option would probably be called

  rbd async flush danger danger = true

and would trigger a flush but not wait for it, or perhaps

  rbd ignore flush danger danger = true

which would not honor flush at all.


qemu already has a cache=unsafe option which does exactly that.
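
For example, something along these lines on the qemu command line (the pool
and image names here are made up; cache=unsafe really does drop all flush
guarantees for the guest):

  qemu-system-x86_64 -m 1024 \
      -drive format=rbd,file=rbd:rbd/vm-disk,cache=unsafe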


This would jeopardize the integrity of the file system living on the RBD
image; file systems rely on flush to order their commits, and playing fast
and loose with that can lead to any number of corruptions.  The only silver
lining is that in the not-so-distant past (3-4 years ago) this was poorly
supported by the block layer and file systems alike, and ext3 didn't crash
and burn quite as often as you might have expected.

Anyway, not something I would recommend, certainly not for a generic VM
platform.  Maybe if you have a specific performance-sensitive application
you can afford to let crash and burn...

sage




Re: Client Location

2012-10-10 Thread Sage Weil
On Wed, 10 Oct 2012, James Horner wrote:
 Hi There
 
 The basic setup I'm trying to get is a backend to a hypervisor cluster,
 so that auto-failover and live migration work. The main thing is that
 we have a number of datacenters with a gigabit interconnect that is not
 always 100% reliable. In the event of a failure we want all the virtual
 machines to fail over to the remaining datacenters, so we need all the
 data in each location.

 The other issue is that within each datacenter we can use link
 aggregation to increase the bandwidth between hypervisors and the ceph
 cluster, but between the datacenters we only have the gigabit, so it
 becomes essential to have the hypervisors looking at the storage in the
 same datacenter.

Ceph replication is synchronous, so even if you are writing to a local
OSD, it will be updating the replica at the remote DC.  The 1gbps link may
quickly become a bottleneck.  This is a matter of having your cake and
eating it too... you can't seamlessly fail over to another DC if you don't
synchronously replicate to it.

 Another consideration is that the virtual machines might get migrated
 between datacenters without any failure, and the main problem I see with
 Mark's suggestion is that in this mode the migrated VM would still be
 connecting to the OSDs in the remote datacenter.

The new rbd cloning functionality can be used to 'migrate' an image by
cloning it to a different pool (the new local DC) and then later (in the
background, whenever) doing a 'flatten' to migrate the data from the
parent to the clone.  Performance will be slower initially but improve
once the data is migrated.

This isn't a perfect solution for your use-case, but it would work.
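
A sketch of that flow with the rbd command line (pool and image names are
invented here, and the exact commands depend on how new your rbd tool is):

  # parent lives in pool dc1-rbd; "move" it to dc2-rbd
  rbd snap create dc1-rbd/vmdisk@migrate
  rbd snap protect dc1-rbd/vmdisk@migrate
  rbd clone dc1-rbd/vmdisk@migrate dc2-rbd/vmdisk
  # repoint the VM at dc2-rbd/vmdisk, then later, in the background:
  rbd flatten dc2-rbd/vmdisk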

sage

 Tbh I'm fairly new to Ceph and I know I'm asking for everything and the
 kitchen sink! Any thoughts would be very helpful though.
 
 Thanks
 James
 
 - Original Message -
 From: Gregory Farnum g...@inktank.com
 To: Mark Kampe mark.ka...@inktank.com
 Cc: James Horner james.hor...@precedent.co.uk, ceph-devel@vger.kernel.org
 Sent: Tuesday, October 9, 2012 5:48:37 PM
 Subject: Re: Client Location
 
 On Tue, Oct 9, 2012 at 9:43 AM, Mark Kampe mark.ka...@inktank.com wrote:
  I'm not a real engineer, so please forgive me if I misunderstand,
  but can't you create a separate rule for each data center (choosing
  first a local copy, and then remote copies), which should ensure
  that the primary is always local.  Each data center would then
  use a different pool, associated with the appropriate location-
  sensitive rule.
 
  Does this approach get you the desired locality preference?
 
 This sounds right to me — I think maybe there's a misunderstanding
 about how CRUSH works. What precisely are you after, James?
 -Greg


Re: [PATCH v5] rbd: fix the memory leak of bio_chain_clone

2012-10-10 Thread Alex Elder

On 10/09/2012 08:26 PM, Alex Elder wrote:

On 09/11/2012 02:17 PM, Alex Elder wrote:

On 09/06/2012 06:30 AM, Guangliang Zhao wrote:

The bio_pair allocated in bio_chain_clone would not be freed,
which causes a memory leak. It can actually be freed only
after three releases, because the reference count of a bio_pair
is initialized to 3 by bio_split, and bio_pair_release only
drops the reference count.

The function bio_pair_release must be called three times to
release a bio_pair, and the callback functions of the bios on the
requests are called on the last release in bio_pair_release;
however, these functions will also be called in rbd_req_cb. In
other words, they will be called twice, and that may cause serious
consequences.



I just want you to know I'm looking at this patch now.  This is a
pretty complex bit of code though, so it may take me a bit to get
back to you.


Sorry about the long delay.  I've finally had a chance to look a
little more closely at your patch.

I had to sort of port what you supplied so it fit the current
code, which has changed a little since you first sent this.

It looks to me like it should work.  Rather than using bio_split()
when a bio is more than is needed to satisfy a particular
segment of a request, you create a clone of the bio and pass
it back to the caller.  The next call will use that clone
rather than the original as it continues processing the next
segment of the request.  The original bio in this case will
be freed as before, and the clone will be freed (drop a reference)
in a subsequent call when it gets used up.


I've done enough testing with this to be satisfied this
works correctly.


Do you have a test that you used to verify this both performed
correctly when a split was found and no longer leaked anything?


I am still interested to know if you had a particular way
to verify that the leak was occurring (or not).  But
we obviously won't be leaking any bio_pairs any more...

-Alex


I'm going to put it through some testing myself.  I might want
to make small revisions to a comment here or there, but otherwise
I'll take it in unless I find it fails something.

Thanks a lot.

Reviewed-by: Alex Elder el...@inktank.com


This patch clones the bio chain from the original directly instead of
using bio_split. The old bios which will be split may be modified by
the callback fn, so their copies need to be saved (called split_bio).
The new bio chain can be released whenever we don't need it.

This patch can only handle the split of *single page* bios, but
that is enough here for the following reasons:

Only bios spanning multiple osds need to be split, and these bios
*must* be single page because of rbd_merge_bvec. With that function,
a new bvec is not permitted to merge if it would make the bio cross
the osd boundary, unless it is the first one. In other words, there
are two types of bio:

- bios that don't cross the osd boundary
  They have one or more pages. The value of offset will
  always be 0 in this case, so nothing will be changed, and
  the code changing tmp bios doesn't take effect at all.

- bios that cross the osd boundary
  Each one has only one page. These bios need to be split,
  and the offset is used to indicate the next bio; it makes
  sense only in this instance.

The original bios may be modified by the callback fn before the next
bio_chain_clone() call when a bio needs to be split, so their copy
will be saved.

Signed-off-by: Guangliang Zhao gz...@suse.com
---

  drivers/block/rbd.c |  102
++-
  1 file changed, 60 insertions(+), 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9917943..a605e1c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -717,50 +717,70 @@ static void zero_bio_chain(struct bio *chain,
int start_ofs)
  }
  }

-/*
+/**
   * bio_chain_clone - clone a chain of bios up to a certain length.
- * might return a bio_pair that will need to be released.
+ * @old: bio to clone
+ * @split_bio: bio which will be split
+ * @offset: start point for bio clone
+ * @len: length of bio chain
+ * @gfp_mask: allocation priority
+ *
+ * Value of split_bio will be !NULL only when there is a bio need to be
+ * split. NULL otherwise.
+ *
+ * RETURNS:
+ * Pointer to new bio chain on success, NULL on failure.
   */
-static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
-   struct bio_pair **bp,
-   int len, gfp_t gfpmask)
+static struct bio *bio_chain_clone(struct bio **old, struct bio
**split_bio,
+   int *offset, int len, gfp_t gfpmask)
  {
-struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail =
NULL;
-int total = 0;
-
-if (*bp) {
-bio_pair_release(*bp);
-*bp = NULL;
-}
+struct bio *tmp, *old_chain, *split, *new_chain = NULL, *tail =
NULL;
+int total = 0, need = len;

+

Re: Unable to build CEPH packages

2012-10-10 Thread Gary Lowell
Hi Hemant -

I'll be happy to help you with the problem.  The first things that would be
helpful for me to know are what version of ceph you are trying to build, what
distribution you are building on, and what your yum repositories are.  You can
get the last piece of information with the yum repolist command.
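
Concretely, output along these lines would cover it (these commands assume an
RPM-based distro; adjust as needed):

  cat /etc/redhat-release    # distribution and release
  yum repolist               # configured repositories
  git describe               # ceph version, if building from a git checkout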

Thanks,
Gary


On Oct 10, 2012, at 2:34 AM, hemant surale wrote:

 Hi Folks,
 
 I was trying to build ceph from source code, to have a stable setup for VMs.
 
 While I was executing 'yum install rpm-build rpmdevtools', the error
 observed is 'No package rpm-build is available' / 'No package rpmdevtools
 is available'.
 
 All previous steps are working fine, but this error is observed while
 building the ceph packages:
 http://ceph.com/docs/master/source/build-packages/
 
 Please help me out.
 
 Thanks & Regards,
 Hemant Surale.


Re: Determining RBD storage utilization

2012-10-10 Thread Damien Churchill
On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote:
 Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
 the huge discrepancy?  I don't expect the numbers to add up exact due
 to rounding from kB, MB, GB, etc, but they should be darn close, a la

ext4 keeps some reserved space, 5% by default, for when the disk is
full so you are still able to use the filesystem and clean it up.

500G * 0.05 = 25G
500G - (25G + 7.5G) = 467G

Can't tell you where the 7.5G comes from though!
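
If that reserve is unwanted on a large RBD-backed filesystem, it can be
inspected and reduced with tune2fs (the device name below is just the one
from this thread):

  tune2fs -l /dev/rbd44 | grep -i 'reserved block count'
  tune2fs -m 1 /dev/rbd44    # shrink the root reserve from 5% to 1%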


Re: Determining RBD storage utilization

2012-10-10 Thread Travis Rhoden
Damien,

Thanks for solving that part of the mystery.  I can't believe I forgot
about that.  Thanks for the reminder and the clear explanation.

 - Travis

On Wed, Oct 10, 2012 at 1:28 PM, Damien Churchill dam...@gmail.com wrote:
 On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote:
 Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
 the huge discrepancy?  I don't expect the numbers to add up exact due
 to rounding from kB, MB, GB, etc, but they should be darn close, a la

 ext4 keeps some reserved space, 5% by default, for when the disk is
 full so you are still able to use the filesystem and clean it up.

 500G * 0.05 = 25G
 500G - (25G + 7.5G) = 467G

 Can't tell you where the 7.5G comes from though!


Re: Determining RBD storage utilization

2012-10-10 Thread Josh Durgin

On 10/10/2012 10:10 AM, Travis Rhoden wrote:

Hey folks,

I have two questions about determining how much storage has been used
*inside* of an RBD.

First, I'm confused by the output of df.  I've created, mapped, and
mounted a 500GB RBD, and see the following:

# df -h /srv/test
Filesystem  Size  Used Avail Use% Mounted on
/dev/rbd44  500G  7.5G  467G   2% /srv/test

# cd /srv/test
# du -sh .
20K .


Any ideas why a brand-new mount, with no files added, shows 7.5GB of used
space?  Does this come from the file system formatting (ext4 in this
case)?
Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
the huge discrepancy?  I don't expect the numbers to add up exact due
to rounding from kB, MB, GB, etc, but they should be darn close, a la

df -h /dev/sda1
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda115G  1.7G   13G  12% /


Second question, is it possible to know how much storage has been used
in the RBD without mounting it and running df or du?  For the same RBD
as above, I see:

# rbd info test
rbd image 'test':
size 500 GB in 128000 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.18f9.2d9c66c6
parent:  (pool -1)

Is there perhaps a way to know the number of objects that have been
'used'?  Then I could take that and multiply by the object size (4MB).


You can get an upper bound by looking at the number of objects in the
image:

rados --pool rbd ls | grep -c '^rb\.0\.18f9\.2d9c66c6'

Each object represents a section of the block device, but they may not
be entirely filled (objects are sparse), so this will probably still be
a higher estimate than df. Also note that listing all the objects in a
pool is an expensive operation, so it shouldn't be done very often.
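
Putting that together, a rough (upper-bound) estimate for the image above,
using the prefix and 4 MB object size reported by 'rbd info':

  OBJS=$(rados --pool rbd ls | grep -c '^rb\.0\.18f9\.2d9c66c6')
  echo "$((OBJS * 4)) MB allocated (at most)"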

Josh


I'm running 0.48.1argonaut on Ubuntu 12.04.
RBD maps are also on Ubuntu 12.04, with the stock 3.2.0-29-generic kernel.

Thanks,

  - Travis





Re: Determining RBD storage utilization

2012-10-10 Thread Gregory Farnum
I don't know the ext4 internals at all, but filesystems tend to
require allocation tables of various sorts (for managing extents,
etc). 7.5GB out of 500GB seems a little large for that metadata, but
isn't ridiculously so...

On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote:
 On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote:
 Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
 the huge discrepancy?  I don't expect the numbers to add up exact due
 to rounding from kB, MB, GB, etc, but they should be darn close, a la

 ext4 keeps some reserved space, 5% by default, for when the disk is
 full so you are still able to use the filesystem and clean it up.

 500G * 0.05 = 25G
 500G - (25G + 7.5G) = 467G

 Can't tell you where the 7.5G comes from though!


Re: Determining RBD storage utilization

2012-10-10 Thread Travis Rhoden
Thanks for the input, Gregory and Josh.

What I am hearing is that this has everything to do with the
filesystem, and nothing to do with the block device on Ceph.

Thanks again,

 - Travis

On Wed, Oct 10, 2012 at 1:55 PM, Gregory Farnum g...@inktank.com wrote:
 I don't know the ext4 internals at all, but filesystems tend to
 require allocation tables of various sorts (for managing extents,
 etc). 7.5GB out of 500GB seems a little large for that metadata, but
 isn't ridiculously so...

 On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote:
 On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote:
 Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
 the huge discrepancy?  I don't expect the numbers to add up exact due
 to rounding from kB, MB, GB, etc, but they should be darn close, a la

 ext4 keeps some reserved space, 5% by default, for when the disk is
 full so you are still able to use the filesystem and clean it up.

 500G * 0.05 = 25G
 500G - (25G + 7.5G) = 467G

 Can't tell you where the 7.5G comes from though!


Re: Problem with ceph osd create uuid

2012-10-10 Thread Nick Bartos
After applying the patch, we went through 65 successful cluster
reinstalls without encountering the error (previously it would happen
at least every 8-10 reinstalls).  Therefore it really looks like this
fixed the issue.  Thanks!


On Mon, Oct 8, 2012 at 5:17 PM, Sage Weil s...@inktank.com wrote:
 Hi Mandell,

 I see the bug.  I pushed a fix to wip-mon-command-race,
 5011485e5e3fc9952ea58cd668e6feefc98024bf, and I believe it fixes it, but I
 wasn't able to easily reproduce it myself so I'm not 100% certain.  Can
 you give it a go?

 Thanks!
 sage


 On Mon, 8 Oct 2012, Mandell Degerness wrote:

 osd dump output:

 [root@node-172-20-0-14 ~]# ceph osd dump 2
 dumped osdmap epoch 2
 epoch 2
 fsid d82665b6-3435-44b8-a89e-f7185f78d09d
 created 2012-10-08 21:29:52.232400
 modifed 2012-10-08 21:29:57.297479
 flags

 pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
 pg_num 64 pgp_num 64 last_change 1 owner 0
 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
 pgp_num 64 last_change 1 owner 0

 max_osd 1
 osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
 last_clean_interval [0,0) :/0 :/0 :/0 exists,new
 564d7166-07b7-48cc-9b50-46ef7b260d5c


 [root@node-172-20-0-14 ~]# ceph osd dump 3
 dumped osdmap epoch 3
 epoch 3
 fsid d82665b6-3435-44b8-a89e-f7185f78d09d
 created 2012-10-08 21:29:52.232400
 modifed 2012-10-08 21:29:58.299491
 flags

 pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
 pg_num 64 pgp_num 64 last_change 1 owner 0
 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
 pgp_num 64 last_change 1 owner 0

 max_osd 1
 osd.0 up   in  weight 1 up_from 3 up_thru 0 down_at 0
 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723
 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c


 [root@node-172-20-0-14 ~]# ceph osd dump 4
 dumped osdmap epoch 4
 epoch 4
 fsid d82665b6-3435-44b8-a89e-f7185f78d09d
 created 2012-10-08 21:29:52.232400
 modifed 2012-10-08 21:29:59.304087
 flags

 pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
 pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
 pg_num 64 pgp_num 64 last_change 1 owner 0
 pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
 pgp_num 64 last_change 1 owner 0

 max_osd 3
 osd.0 up   in  weight 1 up_from 3 up_thru 0 down_at 0
 last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723
 172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c
 osd.1 down out weight 0 up_from 0 up_thru 0 down_at 0
 last_clean_interval [0,0) :/0 :/0 :/0 exists,new
 3351a0f0-f6e8-430a-b7a4-ea613a3ddf35
 osd.2 down out weight 0 up_from 0 up_thru 0 down_at 0
 last_clean_interval [0,0) :/0 :/0 :/0 exists,new
 3f04cdbe-a468-42d3-a465-2487cc369d90




 On Mon, Oct 8, 2012 at 3:49 PM, Sage Weil s...@inktank.com wrote:
  On Mon, 8 Oct 2012, Mandell Degerness wrote:
  Sorry, I should have used the https link:
 
  https://gist.github.com/af546ece91be0ba268d3
 
  What do 'ceph osd dump 2', 'ceph osd dump 3', and 'ceph osd dump 4' say?
 
  thanks!
  sage
 
 
  On Mon, Oct 8, 2012 at 3:20 PM, Mandell Degerness
  mand...@pistoncloud.com wrote:
   Here is the log I got when running with the options suggested by sage:
  
   g...@gist.github.com:af546ece91be0ba268d3.git
  
   On Mon, Oct 8, 2012 at 11:34 AM, Sage Weil s...@inktank.com wrote:
   Hi Mandell,
  
   On Mon, 8 Oct 2012, Mandell Degerness wrote:
   Hi list,
  
   I've run into a bit of a weird error and I'm hoping that you can tell
   me what is going wrong.  There seems to be a race condition in the way
   I am using ceph osd create uuid and actually creating the OSD's.
   The log from one of the servers is at:
  
   https://gist.github.com/528e347a5c0ffeb30abd
  
   The process I am trying to follow (for the OSDs) is:
  
   1) Create XFS file system on disk.
   2) Use FS UUID as source to get a new OSD id #.
   'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672'
   (Returns 2.)
   3) Pass the UUID and OSD id to the create osd command
  
   ceph-osd -c /etc/ceph/ceph.conf --fsid
   e61c1b11-4a1c-47aa-868d-7b51b1e610d3 --osd-uuid
   32895846-ca1c-4265-9ce7-9f2a42b41672 -i 2 --mkfs --osd-journal-size
   8192
   4) Start the OSD, as part of the start process, I verify that the
   whoami and osd fsid agree (in case this disk came from a previous
   cluster, somehow) - should be just a sanity check
   'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672'
   (Returns 1!)
  
   This is clearly a race condition because we have several cluster
   creations without this happening and then this happens about once
   every 8 

Re: [PATCH v5] rbd: fix the memory leak of bio_chain_clone

2012-10-10 Thread Alex Elder

On 10/09/2012 08:26 PM, Alex Elder wrote:

On 09/11/2012 02:17 PM, Alex Elder wrote:

On 09/06/2012 06:30 AM, Guangliang Zhao wrote:

The bio_pair allocated in bio_chain_clone would not be freed,
which causes a memory leak. It can actually be freed only
after three releases, because the reference count of a bio_pair
is initialized to 3 by bio_split, and bio_pair_release only
drops the reference count.

The function bio_pair_release must be called three times to
release a bio_pair, and the callback functions of the bios on the
requests are called on the last release in bio_pair_release;
however, these functions will also be called in rbd_req_cb. In
other words, they will be called twice, and that may cause serious
consequences.



I just want you to know I'm looking at this patch now.  This is a
pretty complex bit of code though, so it may take me a bit to get
back to you.


Sorry about the long delay.  I've finally had a chance to look a
little more closely at your patch.

I had to sort of port what you supplied so it fit the current
code, which has changed a little since you first sent this.


I'm sorry to report that I'm getting a consistent failure in
xfstests #13 when running with this patch applied over
rbd images.

I don't have time to look at it any more today but we need to
get this fixed soon.

-Alex

It looks to me like it should work.  Rather than using bio_split()
when a bio is more than is needed to satisfy a particular
segment of a request, you create a clone of the bio and pass
it back to the caller.  The next call will use that clone
rather than the original as it continues processing the next
segment of the request.  The original bio in this case will
be freed as before, and the clone will be freed (drop a reference)
in a subsequent call when it gets used up.

Do you have a test that you used to verify this both performed
correctly when a split was found and no longer leaked anything?

I'm going to put it through some testing myself.  I might want
to make small revisions to a comment here or there, but otherwise
I'll take it in unless I find it fails something.

Thanks a lot.

Reviewed-by: Alex Elder el...@inktank.com


This patch clones the bio chain from the original directly instead of
using bio_split. The old bios which will be split may be modified by
the callback fn, so their copies need to be saved (called split_bio).
The new bio chain can be released whenever we don't need it.

This patch can only handle the split of *single page* bios, but
that is enough here for the following reasons:

Only bios spanning multiple osds need to be split, and these bios
*must* be single page because of rbd_merge_bvec. With that function,
a new bvec is not permitted to merge if it would make the bio cross
the osd boundary, unless it is the first one. In other words, there
are two types of bio:

- bios that don't cross the osd boundary
  They have one or more pages. The value of offset will
  always be 0 in this case, so nothing will be changed, and
  the code changing tmp bios doesn't take effect at all.

- bios that cross the osd boundary
  Each one has only one page. These bios need to be split,
  and the offset is used to indicate the next bio; it makes
  sense only in this instance.

The original bios may be modified by the callback fn before the next
bio_chain_clone() call when a bio needs to be split, so their copy
will be saved.

Signed-off-by: Guangliang Zhao gz...@suse.com
---

  drivers/block/rbd.c |  102
++-
  1 file changed, 60 insertions(+), 42 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 9917943..a605e1c 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -717,50 +717,70 @@ static void zero_bio_chain(struct bio *chain,
int start_ofs)
  }
  }

-/*
+/**
   * bio_chain_clone - clone a chain of bios up to a certain length.
- * might return a bio_pair that will need to be released.
+ * @old: bio to clone
+ * @split_bio: bio which will be split
+ * @offset: start point for bio clone
+ * @len: length of bio chain
+ * @gfp_mask: allocation priority
+ *
+ * Value of split_bio will be !NULL only when there is a bio need to be
+ * split. NULL otherwise.
+ *
+ * RETURNS:
+ * Pointer to new bio chain on success, NULL on failure.
   */
-static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
-   struct bio_pair **bp,
-   int len, gfp_t gfpmask)
+static struct bio *bio_chain_clone(struct bio **old, struct bio
**split_bio,
+   int *offset, int len, gfp_t gfpmask)
  {
-struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail =
NULL;
-int total = 0;
-
-if (*bp) {
-bio_pair_release(*bp);
-*bp = NULL;
-}
+struct bio *tmp, *old_chain, *split, *new_chain = NULL, *tail =
NULL;
+int total = 0, need = len;

+split = *split_bio;
+

Re: Determining RBD storage utilization

2012-10-10 Thread Sage Weil
On Wed, 10 Oct 2012, Gregory Farnum wrote:
 I don't know the ext4 internals at all, but filesystems tend to
 require allocation tables of various sorts (for managing extents,
 etc). 7.5GB out of 500GB seems a little large for that metadata, but
 isn't ridiculously so...

ext3/4 are particularly bad about this, with lots of space statically set 
aside for inodes and allocation metadata.
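
For what it's worth, if the filesystem on an RBD image will mostly hold
large files, that static overhead can be trimmed at mkfs time, e.g.:

  mkfs.ext4 -T largefile /dev/rbd44    # allocates far fewer inodes up front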

s


 
 On Wed, Oct 10, 2012 at 10:28 AM, Damien Churchill dam...@gmail.com wrote:
  On 10 October 2012 18:10, Travis Rhoden trho...@gmail.com wrote:
  Additionally, 500G - 7.5G != 467G (the number shown as Avail).  Why
  the huge discrepancy?  I don't expect the numbers to add up exact due
  to rounding from kB, MB, GB, etc, but they should be darn close, a la
 
  ext4 keeps some reserved space, 5% by default, for when the disk is
  full so you are still able to use the filesystem and clean it up.
 
  500G * 0.05 = 25G
  500G - (25G + 7.5G) = 467G
 
  Can't tell you where the 7.5G comes from though!


Re: Problem with ceph osd create uuid

2012-10-10 Thread Sage Weil
Wonderful, thanks!

sage

On Wed, 10 Oct 2012, Nick Bartos wrote:

 After applying the patch, we went through 65 successful cluster
 reinstalls without encountering the error (previously it would happen
 at least every 8-10 reinstalls).  Therefore it really looks like this
 fixed the issue.  Thanks!
 
 
 On Mon, Oct 8, 2012 at 5:17 PM, Sage Weil s...@inktank.com wrote:
  Hi Mandell,
 
  I see the bug.  I pushed a fix to wip-mon-command-race,
  5011485e5e3fc9952ea58cd668e6feefc98024bf, and I believe it fixes it, but I
  wasn't able to easily reproduce it myself so I'm not 100% certain.  Can
  you give it a go?
 
  Thanks!
  sage
 
 
  On Mon, 8 Oct 2012, Mandell Degerness wrote:
 
  osd dump output:
 
  [root@node-172-20-0-14 ~]# ceph osd dump 2
  dumped osdmap epoch 2
  epoch 2
  fsid d82665b6-3435-44b8-a89e-f7185f78d09d
  created 2012-10-08 21:29:52.232400
  modifed 2012-10-08 21:29:57.297479
  flags
 
  pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
  64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
  pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
  pg_num 64 pgp_num 64 last_change 1 owner 0
  pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
  pgp_num 64 last_change 1 owner 0
 
  max_osd 1
  osd.0 down out weight 0 up_from 0 up_thru 0 down_at 0
  last_clean_interval [0,0) :/0 :/0 :/0 exists,new
  564d7166-07b7-48cc-9b50-46ef7b260d5c
 
 
  [root@node-172-20-0-14 ~]# ceph osd dump 3
  dumped osdmap epoch 3
  epoch 3
  fsid d82665b6-3435-44b8-a89e-f7185f78d09d
  created 2012-10-08 21:29:52.232400
  modifed 2012-10-08 21:29:58.299491
  flags
 
  pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
  64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
  pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
  pg_num 64 pgp_num 64 last_change 1 owner 0
  pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
  pgp_num 64 last_change 1 owner 0
 
  max_osd 1
  osd.0 up   in  weight 1 up_from 3 up_thru 0 down_at 0
  last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723
  172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c
 
 
  [root@node-172-20-0-14 ~]# ceph osd dump 4
  dumped osdmap epoch 4
  epoch 4
  fsid d82665b6-3435-44b8-a89e-f7185f78d09d
  created 2012-10-08 21:29:52.232400
  modifed 2012-10-08 21:29:59.304087
  flags
 
  pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num
  64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
  pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins
  pg_num 64 pgp_num 64 last_change 1 owner 0
  pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 64
  pgp_num 64 last_change 1 owner 0
 
  max_osd 3
  osd.0 up   in  weight 1 up_from 3 up_thru 0 down_at 0
  last_clean_interval [0,0) 172.20.0.13:6800/1723 172.20.0.13:6801/1723
  172.20.0.13:6802/1723 exists,up 564d7166-07b7-48cc-9b50-46ef7b260d5c
  osd.1 down out weight 0 up_from 0 up_thru 0 down_at 0
  last_clean_interval [0,0) :/0 :/0 :/0 exists,new
  3351a0f0-f6e8-430a-b7a4-ea613a3ddf35
  osd.2 down out weight 0 up_from 0 up_thru 0 down_at 0
  last_clean_interval [0,0) :/0 :/0 :/0 exists,new
  3f04cdbe-a468-42d3-a465-2487cc369d90
 
 
 
 
  On Mon, Oct 8, 2012 at 3:49 PM, Sage Weil s...@inktank.com wrote:
   On Mon, 8 Oct 2012, Mandell Degerness wrote:
   Sorry, I should have used the https link:
  
   https://gist.github.com/af546ece91be0ba268d3
  
   What do 'ceph osd dump 2', 'ceph osd dump 3', and 'ceph osd dump 4' say?
  
   thanks!
   sage
  
  
   On Mon, Oct 8, 2012 at 3:20 PM, Mandell Degerness
   mand...@pistoncloud.com wrote:
Here is the log I got when running with the options suggested by sage:
   
g...@gist.github.com:af546ece91be0ba268d3.git
   
On Mon, Oct 8, 2012 at 11:34 AM, Sage Weil s...@inktank.com wrote:
Hi Mandell,
   
On Mon, 8 Oct 2012, Mandell Degerness wrote:
Hi list,
   
I've run into a bit of a weird error and I'm hoping that you can 
tell
me what is going wrong.  There seems to be a race condition in the 
way
I am using ceph osd create uuid and actually creating the OSD's.
The log from one of the servers is at:
   
https://gist.github.com/528e347a5c0ffeb30abd
   
The process I am trying to follow (for the OSDs) is:
   
1) Create XFS file system on disk.
2) Use FS UUID as source to get a new OSD id #.
'ceph', 'osd', 'create', '32895846-ca1c-4265-9ce7-9f2a42b41672'
(Returns 2.)
3) Pass the UUID and OSD id to the create osd command
   
ceph-osd -c /etc/ceph/ceph.conf --fsid
e61c1b11-4a1c-47aa-868d-7b51b1e610d3 --osd-uuid
32895846-ca1c-4265-9ce7-9f2a42b41672 -i 2 --mkfs --osd-journal-size
8192
4) Start the OSD, as part of the start process, I verify that the
whoami and osd fsid agree (in case this disk came from a previous
cluster, somehow) - should be just a sanity check
'ceph', 'osd', 

rgw_rest.cc build failure

2012-10-10 Thread Noah Watkins
This is needed for the latest master:

diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc
index 53bbeca..3612a9e 100644
--- a/src/rgw/rgw_rest.cc
+++ b/src/rgw/rgw_rest.cc
@@ -1,4 +1,5 @@
 #include <errno.h>
+#include <limits.h>

 #include "common/Formatter.h"
 #include "common/utf8.h"

to fix:

  CXXradosgw-rgw_rest.o
rgw/rgw_rest.cc: In static member function ‘static int
RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*,
bool*)’:
rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope
rgw/rgw_rest.cc: In static member function ‘static int
RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*,
bool*)’:
rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope

- Noah


Re: rgw_rest.cc build failure

2012-10-10 Thread Yehuda Sadeh
I'll apply this, can I assume you have signed off this patch?

On Wed, Oct 10, 2012 at 2:25 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 This is needed for the latest master:

 diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc
 index 53bbeca..3612a9e 100644
 --- a/src/rgw/rgw_rest.cc
 +++ b/src/rgw/rgw_rest.cc
 @@ -1,4 +1,5 @@
  #include <errno.h>
 +#include <limits.h>

  #include "common/Formatter.h"
  #include "common/utf8.h"

 to fix:

   CXXradosgw-rgw_rest.o
 rgw/rgw_rest.cc: In static member function ‘static int
 RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*,
 bool*)’:
 rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope
 rgw/rgw_rest.cc: In static member function ‘static int
 RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*,
 bool*)’:
 rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope

 - Noah


Re: rgw_rest.cc build failure

2012-10-10 Thread Noah Watkins
On Wed, Oct 10, 2012 at 2:29 PM, Yehuda Sadeh yeh...@inktank.com wrote:
 I'll apply this, can I assume you have signed off this patch?

Ahh, yes, sorry.

Signed-off-by: Noah Watkins noahwatk...@gmail.com


 On Wed, Oct 10, 2012 at 2:25 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 This is needed for the latest master:

 diff --git a/src/rgw/rgw_rest.cc b/src/rgw/rgw_rest.cc
 index 53bbeca..3612a9e 100644
 --- a/src/rgw/rgw_rest.cc
 +++ b/src/rgw/rgw_rest.cc
 @@ -1,4 +1,5 @@
  #include <errno.h>
 +#include <limits.h>

  #include "common/Formatter.h"
  #include "common/utf8.h"

 to fix:

   CXXradosgw-rgw_rest.o
 rgw/rgw_rest.cc: In static member function ‘static int
 RESTArgs::get_uint64(req_state*, const string, uint64_t, uint64_t*,
 bool*)’:
 rgw/rgw_rest.cc:326:15: error: ‘ULLONG_MAX’ was not declared in this scope
 rgw/rgw_rest.cc: In static member function ‘static int
 RESTArgs::get_int64(req_state*, const string, int64_t, int64_t*,
 bool*)’:
 rgw/rgw_rest.cc:351:15: error: ‘LLONG_MAX’ was not declared in this scope

 - Noah


Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-10 Thread Noah Watkins
Laszlo, James:

Changes based on your previous feedback are ready for review. I pushed the 
changes here:

  git://github.com/noahdesu/ceph.git wip-java-cephfs

Thanks!
- Noah

From 0d8c4dc39f9b8f2e264bb2503c053418ad72b705 Mon Sep 17 00:00:00 2001
From: Noah Watkins noahwatk...@gmail.com
Date: Wed, 10 Oct 2012 13:57:03 -0700
Subject: [PATCH] java: update deb bits from ceph-devel feedback

Signed-off-by: Noah Watkins noahwatk...@gmail.com
---
 debian/.gitignore|3 ++-
 debian/control   |   10 --
 debian/libceph-java.jlibs|1 +
 debian/libceph-jni.install   |1 +
 debian/libceph1-java.install |2 --
 debian/rules |1 +
 src/java/.gitignore  |2 +-
 src/java/Makefile.am |8 
 src/java/README  |2 +-
 src/java/build.xml   |6 +++---
 10 files changed, 22 insertions(+), 14 deletions(-)
 create mode 100644 debian/libceph-java.jlibs
 create mode 100644 debian/libceph-jni.install
 delete mode 100644 debian/libceph1-java.install

diff --git a/debian/.gitignore b/debian/.gitignore
index c5b73ce..2fd5a05 100644
--- a/debian/.gitignore
+++ b/debian/.gitignore
@@ -30,5 +30,6 @@
 /rest-bench-dbg
 /rest-bench
 /python-ceph
-/libceph1-java
+/libceph-java
+/libceph-jni
 /tmp
diff --git a/debian/control b/debian/control
index 579855f..62c85d9 100644
--- a/debian/control
+++ b/debian/control
@@ -319,8 +319,14 @@ Description: Python libraries for the Ceph distributed 
filesystem
  This package contains Python libraries for interacting with Ceph's
  RADOS object storage, and RBD (RADOS block device).
 
-Package: libceph1-java
+Package: libceph-java
 Section: java
+Architecture: all
+Depends: libceph-jni, ${java:Depends}, ${misc:Depends}
+Description: Java libraries for the Ceph File System.
+
+Package: libceph-jni
 Architecture: linux-any
+Section: libs
 Depends: libcephfs1, ${shlibs:Depends}, ${java:Depends}, ${misc:Depends}
-Description: Java libraries for the Ceph File System
+Description: Java Native Interface library for CephFS Java bindings.
diff --git a/debian/libceph-java.jlibs b/debian/libceph-java.jlibs
new file mode 100644
index 000..952a190
--- /dev/null
+++ b/debian/libceph-java.jlibs
@@ -0,0 +1 @@
+src/java/ceph.jar
diff --git a/debian/libceph-jni.install b/debian/libceph-jni.install
new file mode 100644
index 000..072b990
--- /dev/null
+++ b/debian/libceph-jni.install
@@ -0,0 +1 @@
+usr/lib/libcephfs_jni.so* usr/lib/jni
diff --git a/debian/libceph1-java.install b/debian/libceph1-java.install
deleted file mode 100644
index 98133e4..000
--- a/debian/libceph1-java.install
+++ /dev/null
@@ -1,2 +0,0 @@
-usr/lib/libcephfs_jni.so* usr/lib/jni
-usr/lib/libcephfs.jar usr/share/java
diff --git a/debian/rules b/debian/rules
index b848ddc..6d61385 100755
--- a/debian/rules
+++ b/debian/rules
@@ -93,6 +93,7 @@ install: build
 # Add here commands to install the package into debian/testpack.
 # Build architecture-independent files here.
 binary-indep: build install
+   jh_installlibs -v -i
 
 # We have nothing to do by default.
 # Build architecture-dependent files here.
diff --git a/src/java/.gitignore b/src/java/.gitignore
index 8208e2b..b8eb0e9 100644
--- a/src/java/.gitignore
+++ b/src/java/.gitignore
@@ -1,4 +1,4 @@
 *.class
-libcephfs.jar
+ceph.jar
 native/com_ceph_fs_CephMount.h
 TEST-*.txt
diff --git a/src/java/Makefile.am b/src/java/Makefile.am
index 5c54f36..87d763d 100644
--- a/src/java/Makefile.am
+++ b/src/java/Makefile.am
@@ -24,20 +24,20 @@ CEPH_PROXY=java/com/ceph/fs/CephMount.class
 
 $(CEPH_PROXY): $(JAVA_SRC)
export CLASSPATH=java/ ;
-   $(JAVAC) java/com/ceph/fs/*.java
+   $(JAVAC) -source 1.5 -target 1.5 java/com/ceph/fs/*.java
 
 $(JAVA_H): $(CEPH_PROXY)
export CLASSPATH=java/ ; \
$(JAVAH) -jni -o $@ com.ceph.fs.CephMount
 
-libcephfs.jar: $(CEPH_PROXY)
+ceph.jar: $(CEPH_PROXY)
$(JAR) cf $@ $(JAVA_CLASSES:%=-C java %) # $(ESCAPED_JAVA_CLASSES:%=-C 
java %)
 
 javadir = $(libdir)
-java_DATA = libcephfs.jar
+java_DATA = ceph.jar
 
 BUILT_SOURCES = $(JAVA_H)
 
-CLEANFILES = -rf java/com/ceph/fs/*.class $(JAVA_H) libcephfs.jar
+CLEANFILES = -rf java/com/ceph/fs/*.class $(JAVA_H) ceph.jar
 
 endif
diff --git a/src/java/README b/src/java/README
index ca39a44..d58ab8a 100644
--- a/src/java/README
+++ b/src/java/README
@@ -33,7 +33,7 @@ Ant is used to run the unit test (apt-get install ant). For 
example:
 
 1. The tests depend on the compiled wrappers. If the wrappers are installed as
 part of a package (e.g. Debian package) then this should 'just work'. Ant will
-also look in the current directory for 'libcephfs.jar' and in ../.libs for the
+also look in the current directory for 'ceph.jar' and in ../.libs for the
 JNI library.  If all else fails, set the environment variables CEPHFS_JAR, and
 CEPHFS_JNI_LIB accordingly.
 
diff --git a/src/java/build.xml b/src/java/build.xml
index f846ca4..203ffc0 100644
--- a/src/java/build.xml
+++ 

Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-10 Thread Laszlo Boszormenyi (GCS)
Hi Noah,

On Wed, 2012-10-10 at 15:00 -0700, Noah Watkins wrote:
 Laszlo, James:
 
 Changes based on your previous feedback are ready for review. I pushed the 
 changes here:
 
   git://github.com/noahdesu/ceph.git wip-java-cephfs
 Checking only the diff, as it's 3 am here. It looks quite OK. But will
check it further in the afternoon.

Laszlo/GCS



Re: [GIT PULL v5] java: add libcephfs Java bindings

2012-10-10 Thread Noah Watkins
On Wed, Oct 10, 2012 at 5:53 PM, Laszlo Boszormenyi (GCS) g...@debian.hu 
wrote:
 Hi Noah,

 On Wed, 2012-10-10 at 15:00 -0700, Noah Watkins wrote:
 Laszlo, James:

 Changes based on your previous feedback are ready for review. I pushed the 
 changes here:

   git://github.com/noahdesu/ceph.git wip-java-cephfs
  Checking only the diff, as it's 3 am here. It looks quite OK. But will
 check it further in the afternoon.

Ok, great. The one thing I was most curious about is if the ceph.jar
reference in debian/libceph-java.jlibs is correct. Previously I was
able to reference its installation path within debian/tmp
(usr/lib/ceph.jar), but jh_installlibs was only able to find the jar
when I referenced its build location (src/java/ceph.jar).

Thanks!


 Laszlo/GCS



[PATCH 0/3] rbd: simplify rbd_do_op() et al

2012-10-10 Thread Alex Elder

These three patches simplify a few paths through the code
involving read and write requests.

-Alex

[PATCH 1/3] rbd: kill rbd_req_{read,write}()
[PATCH 2/3] rbd: kill drop rbd_do_op() opcode and flags
[PATCH 3/3] rbd: consolidate rbd_do_op() calls


[PATCH 3/3] rbd: consolidate rbd_do_op() calls

2012-10-10 Thread Alex Elder

The two calls to rbd_do_op() from rbd_rq_fn() differ only in the
value passed for the snapshot id and the snapshot context.

For reads the snapshot always comes from the mapping, and for writes
the snapshot id is always CEPH_NOSNAP.

The snapshot context is always null for reads.  For writes, the
snapshot context always comes from the rbd header, but it is
acquired under protection of header semaphore and could change
thereafter, so we can't simply use what's available inside
rbd_do_op().

Eliminate the snapid parameter from rbd_do_op(), and set it
based on the I/O direction inside that function instead.  Always
pass the snapshot context acquired in the caller, but reset it
to a null pointer inside rbd_do_op() if the operation is a read.

As a result, there is no difference in the read and write calls
to rbd_do_op() made in rbd_rq_fn(), so just call it unconditionally.

Signed-off-by: Alex Elder el...@inktank.com
---
 drivers/block/rbd.c |   26 +-
 1 file changed, 9 insertions(+), 17 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 396af14..ca28036 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -1163,7 +1163,6 @@ done:
 static int rbd_do_op(struct request *rq,
 struct rbd_device *rbd_dev,
 struct ceph_snap_context *snapc,
-u64 snapid,
 u64 ofs, u64 len,
 struct bio *bio,
 struct rbd_req_coll *coll,
@@ -1177,6 +1176,7 @@ static int rbd_do_op(struct request *rq,
u32 payload_len;
int opcode;
int flags;
+   u64 snapid;

seg_name = rbd_segment_name(rbd_dev, ofs);
if (!seg_name)
@@ -1187,10 +1187,13 @@ static int rbd_do_op(struct request *rq,
if (rq_data_dir(rq) == WRITE) {
opcode = CEPH_OSD_OP_WRITE;
flags = CEPH_OSD_FLAG_WRITE|CEPH_OSD_FLAG_ONDISK;
+   snapid = CEPH_NOSNAP;
payload_len = seg_len;
} else {
opcode = CEPH_OSD_OP_READ;
flags = CEPH_OSD_FLAG_READ;
+   snapc = NULL;
+   snapid = rbd_dev-mapping.snap_id;
payload_len = 0;
}

@@ -1518,24 +1521,13 @@ static void rbd_rq_fn(struct request_queue *q)
kref_get(coll-kref);
bio = bio_chain_clone(rq_bio, next_bio, bp,
  op_size, GFP_ATOMIC);
-   if (!bio) {
+   if (bio)
+   (void) rbd_do_op(rq, rbd_dev, snapc,
+   ofs, op_size,
+   bio, coll, cur_seg);
+   else
rbd_coll_end_req_index(rq, coll, cur_seg,
   -ENOMEM, op_size);
-   goto next_seg;
-   }
-
-   /* init OSD command: write or read */
-   if (do_write)
-   (void) rbd_do_op(rq, rbd_dev,
-   snapc, CEPH_NOSNAP,
-   ofs, op_size, bio,
-   coll, cur_seg);
-   else
-   (void) rbd_do_op(rq, rbd_dev,
-   NULL, rbd_dev-mapping.snap_id,
-   ofs, op_size, bio,
-   coll, cur_seg);
-next_seg:
size -= op_size;
ofs += op_size;

--
1.7.9.5
