Re: PG Backend Proposal

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, Samuel Just wrote:
> I think there are some tricky edge cases with the above approach.  You
> might end up with two pg replicas in the same acting set which happen
> for reasons of history to have the same chunk for one or more objects.
>  That would have to be detected and repaired even though the object
> would be missing from neither replica (and might not even be in the pg
> log).  The erasure_code_rank would have to be somehow maintained
> through recovery (do we remember the original holder of a particular
> chunk in case it ever comes back?).
> 
> The chunk rank doesn't *need* to match the acting set position, but
> there are some good reasons to arrange for that to be the case:
> 1) Otherwise, we need something else to assign the chunk ranks
> 2) This way, a new primary can determine which osds hold which
> replicas of which chunk rank by looking at past osd maps.
> 
> It seems to me that given an OSDMap and an object, we should know
> immediately where all chunks should be stored since a future primary
> may need to do that without access to the objects themselves.
> 
> Importantly, while it may be possible for an acting set transition
> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
> mode which will cause replacement to behave well for erasure codes:
> 
> initial: [0,1,2]
> 0 fails: [3,1,2]
> 2 fails: [3,1,4]
> 0 recovers: [0,1,4]
> 
> We do, however, need to decouple primariness from position in the
> acting set so that backfill can work well.

BTW, this reminds me: this might also be a good time to add the ability in 
the OSDMap to probabilistically reorder acting sets (and adjust 
primariness) for any mapping so that we can cheaply shift read traffic 
around (e.g., based on normal statistical skew, or temporary workload 
balance).  Currently the only way to do this also involves moving data.

s


> -Sam
> 
> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary  wrote:
> > Hi Sam,
> >
> > I'm under the impression that
> > https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
> > assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
> >
> > The chunk rank does not need to match the OSD position in the acting set. 
> > As long as each object chunk is stored with its rank in an attribute, 
> > changing the order of the acting set does not require to move the chunks 
> > around.
> >
> > With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on 
> > [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute 
> > set to their rank.
> >
> > If the acting set changes to [2,1,0] the read would reorder the chunk based 
> > on their 'erasure_code_rank' attribute instead of the rank of the OSD they 
> > originate from in the current acting set. And then be able to decode them 
> > with the erasure code library, which requires that the chunks are provided 
> > in a specific order.
> >
> > When doing a full write, the chunks are written in the same order as the 
> > acting set. This implies that the order of the chunks of the previous 
> > version of the object may be different but I don't see a problem with that.
> >
> > When doing an append, the primary must first retrieve the order in which 
> > the objects are stored by retrieving their 'erasure_code_rank' attribute, 
> > because the order of the acting set is not the same as the order of the 
> > chunks. It then maps the chunks to the OSDs matching their rank and pushes 
> > them to the OSDs.
> >
> > The only downside is that it may make things more complicated to implement 
> > optimizations based on the fact that, sometimes, chunks can just be 
> > concatenated to recover the content of the object and don't need to be 
> > decoded ( when using systematic codes and the M data chunks are available ).
> >
> > Cheers
> >
> > On 01/08/2013 19:14, Loic Dachary wrote:
> >>
> >>
> >> On 01/08/2013 18:42, Loic Dachary wrote:
> >>> Hi Sam,
> >>>
> >>> When the acting set changes order two chunks for the same object may 
> >>> co-exist in the same placement group. The key should therefore also 
> >>> contain the chunk number.
> >>>
> >>> That's probably the most sensible comment I have so far. This document is 
> >>> immensely useful (even in its current state) because it shows me your 
> >>> perspective on the implementation.
> >>>
> >>> I'm puzzled by:
> >>
> >> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
> >> spurious non version chunks would get in the way.
> >>
> >> :-)
> >>
> >>>
> >>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires 
> >>> that we retain the deleted object until all replicas have persisted the 
> >>> deletion event. ErasureCoded backend will therefore need to store objects 
> >>> with the version at which they were created included in the key provided 
> >>> to the filestore. Old versions of an object can be pruned when all
> >>> replicas have committed up to the log event deleting the object.

Re: [PATCH] ceph: fix bugs about handling short-read for sync read mode.

2013-08-01 Thread Sage Weil
On Fri, 2 Aug 2013, majianpeng wrote:
> cephfs . show_layout
> >layout.data_pool: 0
> >layout.object_size:   4194304
> >layout.stripe_unit:   4194304
> >layout.stripe_count:  1
> 
> TestA:
> >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
> >dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
> >dd if=test of=/dev/null bs=6M count=1 iflag=direct
> The messages from func striped_read are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
> ceph:   file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
> ceph:   file.c:381  : zero tail 4194304
> ceph:   file.c:390  : striped_read returns 6291456
> The hole in the file is from 2M to 4M. But actually it zeroes the last 4M,
> including the last 2M area which isn't a hole.
> Using this patch, the messages are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
> ceph:   file.c:358  :  zero gap 2097152 to 4194304
> ceph:   file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
> ceph:   file.c:384  : striped_read returns 6291456
> 
> TestB:
> >echo majianpeng > test
> >dd if=test of=/dev/null bs=2M count=1 iflag=direct
> The messages are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
> ceph:   file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
> ceph:   file.c:390  : striped_read returns 11
> For this case, it did one more striped_read call, which is meaningless.
> Using this patch, the message are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
> ceph:   file.c:384  : striped_read returns 11
> 
> Big thanks to Yan Zheng for the patch.

Thanks for working through this!

Do you mind putting the test into a simple .sh script (or whatever) that 
verifies this is working properly (that the read returns the correct data) 
on a file in the current directory?  Then we can add this to the 
collection of tests in ceph.git/qa/workunits.  Probably writing the same 
thing to $CWD/test and /tmp/test.$$ and running cmp to verify they match 
is the simplest.
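
Something along these lines might do; a rough sketch only (it assumes the 
current directory is a CephFS mount with the default 4 MB object size, and the 
file names are placeholders):

  #!/bin/sh -ex

  # Write the same sparse layout (2M data, 2M hole, 2M data) to a CephFS file
  # in the cwd and to a local reference file, then read the CephFS copy back
  # with O_DIRECT and compare the two.
  CEPH_FILE=./shortread-test.$$
  LOCAL_FILE=/tmp/shortread-test.$$
  CHUNK=/tmp/shortread-chunk.$$
  OUT=/tmp/shortread-out.$$
  rm -f $CEPH_FILE $LOCAL_FILE

  dd if=/dev/urandom of=$CHUNK bs=1M count=2

  # local reference copy (buffered I/O is fine here)
  dd if=$CHUNK of=$LOCAL_FILE bs=1M count=2
  dd if=$CHUNK of=$LOCAL_FILE bs=1M count=2 seek=4 conv=notrunc

  # same layout on the CephFS side, written with O_DIRECT
  dd if=$CHUNK of=$CEPH_FILE bs=1M count=2 oflag=direct
  dd if=$CHUNK of=$CEPH_FILE bs=1M count=2 seek=4 oflag=direct conv=notrunc

  # the 6M direct read spans the hole; the zeroed gap must come back intact
  dd if=$CEPH_FILE of=$OUT bs=6M count=1 iflag=direct
  cmp $LOCAL_FILE $OUT

  rm -f $CEPH_FILE $LOCAL_FILE $CHUNK $OUT
  echo OK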

Thanks!
sage



> 
> Signed-off-by: Jianpeng Ma 
> Reviewed-by: Yan, Zheng 
> ---
>  fs/ceph/file.c | 40 +---
>  1 file changed, 17 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 2ddf061..3d8d14d 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -349,44 +349,38 @@ more:
>   dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
>ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
>  
> - if (ret > 0) {
> - int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
> -
> - if (read < pos - off) {
> - dout(" zero gap %llu to %llu\n", off + read, pos);
> - ceph_zero_page_vector_range(page_align + read,
> - pos - off - read, pages);
> + if (ret >= 0) {
> + int  didpages;
> + if (was_short && (pos + ret < inode->i_size)) {
> + u64 tmp = min(this_len - ret,
> +  inode->i_size - pos - ret);
> + dout(" zero gap %llu to %llu\n",
> + pos + ret, pos + ret + tmp);
> + ceph_zero_page_vector_range(page_align + read + ret,
> + tmp, pages);
> + ret += tmp;
>   }
> +
> + didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>   pos += ret;
>   read = pos - off;
>   left -= ret;
>   page_pos += didpages;
>   pages_left -= didpages;
>  
> - /* hit stripe? */
> - if (left && hit_stripe)
> + /* hit stripe and need continue*/
> + if (left && hit_stripe && pos < inode->i_size)
>   goto more;
> +
>   }
>  
> - if (was_short) {
> + if (ret >= 0) {
> + ret = read;
>   /* did we bounce off eof? */
>   if (pos + left > inode->i_size)
>   *checkeof = 1;
> -
> - /* zero trailing bytes (inside i_size) */
> - if (left > 0 && pos < inode->i_size) {
> - if (pos + left > inode->i_size)
> - left = inode->i_size - pos;
> -
> - dout("zero tail %d\n", left);
> - ceph_zero_page_vector_range(page_align + read, left,
> - pages);
> - read += left;
> - }
>   }
>  
> - if (ret >= 0)
> - ret = read;
>   dout("striped_read returns %d\n", ret);
>   return ret;
>  }
> -- 
> 1.8.1.2

Re: PG Backend Proposal

2013-08-01 Thread Samuel Just
I think there are some tricky edge cases with the above approach.  You
might end up with two pg replicas in the same acting set which happen
for reasons of history to have the same chunk for one or more objects.
 That would have to be detected and repaired even though the object
would be missing from neither replica (and might not even be in the pg
log).  The erasure_code_rank would have to be somehow maintained
through recovery (do we remember the original holder of a particular
chunk in case it ever comes back?).

The chunk rank doesn't *need* to match the acting set position, but
there are some good reasons to arrange for that to be the case:
1) Otherwise, we need something else to assign the chunk ranks
2) This way, a new primary can determine which osds hold which
replicas of which chunk rank by looking at past osd maps.

It seems to me that given an OSDMap and an object, we should know
immediately where all chunks should be stored since a future primary
may need to do that without access to the objects themselves.

Importantly, while it may be possible for an acting set transition
like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
mode which will cause replacement to behave well for erasure codes:

initial: [0,1,2]
0 fails: [3,1,2]
2 fails: [3,1,4]
0 recovers: [0,1,4]

We do, however, need to decouple primariness from position in the
acting set so that backfill can work well.
-Sam
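
For what it's worth, that kind of mapping can already be simulated offline from 
the CRUSH map alone, which is roughly the property a future primary would rely 
on. A rough sketch (the rule id and input range are placeholders, and the exact 
crushtool flags may differ between versions):

  # grab the cluster's crush map and ask crushtool which OSDs a range of
  # inputs would map to for rule 0 with 3 chunks/replicas
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 \
      --min-x 0 --max-x 9 --show-mappings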

On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary  wrote:
> Hi Sam,
>
> I'm under the impression that
> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>
> The chunk rank does not need to match the OSD position in the acting set. As 
> long as each object chunk is stored with its rank in an attribute, changing 
> the order of the acting set does not require to move the chunks around.
>
> With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on 
> [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute set 
> to their rank.
>
> If the acting set changes to [2,1,0] the read would reorder the chunk based 
> on their 'erasure_code_rank' attribute instead of the rank of the OSD they 
> originate from in the current acting set. And then be able to decode them 
> with the erasure code library, which requires that the chunks are provided in 
> a specific order.
>
> When doing a full write, the chunks are written in the same order as the 
> acting set. This implies that the order of the chunks of the previous version 
> of the object may be different but I don't see a problem with that.
>
> When doing an append, the primary must first retrieve the order in which the 
> objects are stored by retrieving their 'erasure_code_rank' attribute, because 
> the order of the acting set is not the same as the order of the chunks. It 
> then maps the chunks to the OSDs matching their rank and pushes them to the 
> OSDs.
>
> The only downside is that it may make things more complicated to implement 
> optimizations based on the fact that, sometimes, chunks can just be 
> concatenated to recover the content of the object and don't need to be 
> decoded ( when using systematic codes and the M data chunks are available ).
>
> Cheers
>
> On 01/08/2013 19:14, Loic Dachary wrote:
>>
>>
>> On 01/08/2013 18:42, Loic Dachary wrote:
>>> Hi Sam,
>>>
>>> When the acting set changes order two chunks for the same object may 
>>> co-exist in the same placement group. The key should therefore also contain 
>>> the chunk number.
>>>
>>> That's probably the most sensible comment I have so far. This document is 
>>> immensely useful (even in its current state) because it shows me your 
>>> perspective on the implementation.
>>>
>>> I'm puzzled by:
>>
>> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
>> spurious non version chunks would get in the way.
>>
>> :-)
>>
>>>
>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that 
>>> we retain the deleted object until all replicas have persisted the deletion 
>>> event. ErasureCoded backend will therefore need to store objects with the 
>>> version at which they were created included in the key provided to the 
>>> filestore. Old versions of an object can be pruned when all replicas have 
>>> committed up to the log event deleting the object.
>>>
>>> because I don't understand why the version would be necessary. I thought 
>>> that deleting an erasure coded object could be even easier than erasing a 
>>> replicated object because it cannot be resurrected if enough chunks are 
>>> lots, therefore you don't need to wait for ack from all OSDs in the up set. 
>>> I'm obviously missing something.
>>>
>>> I failed to understand how important the pg logs were to maintaining the 
>>> consistency of the PG. For some reason I thought about them only in terms 
>>> of 

Re: PG Backend Proposal

2013-08-01 Thread Loic Dachary
Hi Sam,

I'm under the impression that
https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.

The chunk rank does not need to match the OSD position in the acting set. As 
long as each object chunk is stored with its rank in an attribute, changing the 
order of the acting set does not require moving the chunks around.

With M=2 + K=1 and the acting set [0,1,2], chunks M0, M1, K0 are written to 
[0,1,2] respectively; each of them has the 'erasure_code_rank' attribute set 
to its rank.

If the acting set changes to [2,1,0], a read would reorder the chunks based on 
their 'erasure_code_rank' attribute instead of the rank of the OSD they 
originate from in the current acting set, and would then be able to decode them 
with the erasure code library, which requires that the chunks be provided in a 
specific order.

When doing a full write, the chunks are written in the same order as the acting 
set. This implies that the order of the chunks of the previous version of the 
object may be different but I don't see a problem with that.

When doing an append, the primary must first retrieve the order in which the 
object's chunks are stored by reading their 'erasure_code_rank' attribute, because 
the order of the acting set is not the same as the order of the chunks. It then 
maps each chunk to the OSD matching its rank and pushes it to that OSD.

The only downside is that it may make things more complicated to implement 
optimizations based on the fact that, sometimes, chunks can just be 
concatenated to recover the content of the object and don't need to be decoded 
( when using systematic codes and the M data chunks are available ).

Cheers

On 01/08/2013 19:14, Loic Dachary wrote:
> 
> 
> On 01/08/2013 18:42, Loic Dachary wrote:
>> Hi Sam,
>>
>> When the acting set changes order two chunks for the same object may 
>> co-exist in the same placement group. The key should therefore also contain 
>> the chunk number. 
>>
>> That's probably the most sensible comment I have so far. This document is 
>> immensely useful (even in its current state) because it shows me your 
>> perspective on the implementation. 
>>
>> I'm puzzled by:
> 
> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
> spurious non version chunks would get in the way.
> 
> :-)
> 
>>
>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that 
>> we retain the deleted object until all replicas have persisted the deletion 
>> event. ErasureCoded backend will therefore need to store objects with the 
>> version at which they were created included in the key provided to the 
>> filestore. Old versions of an object can be pruned when all replicas have 
>> committed up to the log event deleting the object.
>>
>> because I don't understand why the version would be necessary. I thought 
>> that deleting an erasure coded object could be even easier than erasing a 
>> replicated object because it cannot be resurrected if enough chunks are 
>> lots, therefore you don't need to wait for ack from all OSDs in the up set. 
>> I'm obviously missing something.
>>
>> I failed to understand how important the pg logs were to maintaining the 
>> consistency of the PG. For some reason I thought about them only in terms of 
>> being a light weight version of the operation logs. Adding a payload to the 
>> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
>> would have never thought or dared think the logs could be extended in such a 
>> way. Given the recent problems with logs writes having a high impact on 
>> performances ( I'm referring to what forced you to introduce code to reduce 
>> the amount of logs being written to only those that have been changed 
>> instead of the complete logs ) I thought about the pg logs as something 
>> immutable.
>>
>> I'm still trying to figure out how PGBackend::perform_write / read / 
>> try_rollback would fit in the current backfilling / write / read / scrubbing 
>> ... code path. 
>>
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: Rados Protocoll

2013-08-01 Thread Noah Watkins
Hi Niklas,

The RADOS reference implementation in C++ is quite large. Reproducing
it all in another language would be interesting, but I'm curious if
wrapping the C interface is not an option for you? There are Java
bindings that are being worked on here:
https://github.com/wido/rados-java.

There are links on ceph.com/docs to some information about Ceph, as
well as videos on YouTube and links to academic papers.

-Noah

On Thu, Aug 1, 2013 at 1:01 PM, Niklas Goerke  wrote:
> Hi,
>
> I was wondering why there is no native Java implementation of librados. I'm
> thinking about creating one and I'm thus looking for a documentation of the
> RADOS protocol.
> Also the way I see it librados implements the crush algorithm. Is there a
> documentation for it?
> Also an educated guess about whether the RADOS Protocol is due to changes
> would be very much appreciated.
>
> Thank you in advance
>
> Niklas


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Can you dump your osd settings?
sudo ceph --admin-daemon ceph-osd..asok config show
-Sam
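
For example, to pull out just the recovery/backfill throttles (substitute the 
osd id; the socket path below is only the default location and the option list 
is not exhaustive):

  # dump the running config of one osd via its admin socket and filter
  # for the recovery/backfill related options
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok config show \
      | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'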

On Thu, Aug 1, 2013 at 12:07 PM, Stefan Priebe  wrote:
> Mike we already have the async patch running. Yes it helps but only helps it
> does not solve. It just hides the issue ...
> Am 01.08.2013 20:54, schrieb Mike Dawson:
>
>> I am also seeing recovery issues with 0.61.7. Here's the process:
>>
>> - ceph osd set noout
>>
>> - Reboot one of the nodes hosting OSDs
>>  - VMs mounted from RBD volumes work properly
>>
>> - I see the OSD's boot messages as they re-join the cluster
>>
>> - Start seeing active+recovery_wait, peering, and active+recovering
>>  - VMs mounted from RBD volumes become unresponsive.
>>
>> - Recovery completes
>>  - VMs mounted from RBD volumes regain responsiveness
>>
>> - ceph osd unset noout
>>
>> Would joshd's async patch for qemu help here, or is there something else
>> going on?
>>
>> Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY
>>
>> Thanks,
>>
>> Mike Dawson
>> Co-Founder & Director of Cloud Architecture
>> Cloudapt LLC
>> 6330 East 75th Street, Suite 170
>> Indianapolis, IN 46250
>>
>> On 8/1/2013 2:34 PM, Samuel Just wrote:
>>>
>>> Can you reproduce and attach the ceph.log from before you stop the osd
>>> until after you have started the osd and it has recovered?
>>> -Sam
>>>
>>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>>  wrote:

 Hi,

 i still have recovery issues with cuttlefish. After the OSD comes back
 it seem to hang for around 2-4 minutes and then recovery seems to start
 (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
 get a lot of slow request messages an hanging VMs.

 What i noticed today is that if i leave the OSD off as long as ceph
 starts to backfill - the recovery and "re" backfilling wents absolutely
 smooth without any issues and no slow request messages at all.

 Does anybody have an idea why?

 Greets,
 Stefan


Rados Protocoll

2013-08-01 Thread Niklas Goerke

Hi,

I was wondering why there is no native Java implementation of librados. 
I'm thinking about creating one and I'm thus looking for documentation 
of the RADOS protocol.
Also, the way I see it, librados implements the CRUSH algorithm. Is there 
documentation for it?
Also, an educated guess about whether the RADOS protocol is subject to 
change would be very much appreciated.


Thank you in advance

Niklas


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe
Mike, we already have the async patch running. Yes, it helps, but it only 
helps; it does not solve the problem. It just hides the issue ...

Am 01.08.2013 20:54, schrieb Mike Dawson:

I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout

- Reboot one of the nodes hosting OSDs
 - VMs mounted from RBD volumes work properly

- I see the OSD's boot messages as they re-join the cluster

- Start seeing active+recovery_wait, peering, and active+recovering
 - VMs mounted from RBD volumes become unresponsive.

- Recovery completes
 - VMs mounted from RBD volumes regain responsiveness

- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else
going on?

Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/1/2013 2:34 PM, Samuel Just wrote:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Mike Dawson

I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout

- Reboot one of the nodes hosting OSDs
- VMs mounted from RBD volumes work properly

- I see the OSD's boot messages as they re-join the cluster

- Start seeing active+recovery_wait, peering, and active+recovering
- VMs mounted from RBD volumes become unresponsive.

- Recovery completes
- VMs mounted from RBD volumes regain responsiveness

- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else 
going on?


Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/1/2013 2:34 PM, Samuel Just wrote:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe

here it is

Am 01.08.2013 20:36, schrieb Samuel Just:

For now, just the main ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:

m 01.08.2013 20:34, schrieb Samuel Just:


Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam



Sure which log levels?



On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:


Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan





ceph.log.gz
Description: application/gzip


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Is there a bug open for this?  I suspect we don't sufficiently
throttle the snapshot removal work.
-Sam

On Thu, Aug 1, 2013 at 7:50 AM, Andrey Korolyov  wrote:
> Second this. Also for long-lasting snapshot problem and related
> performance issues I may say that cuttlefish improved things greatly,
> but creation/deletion of large snapshot (hundreds of gigabytes of
> commited data) still can bring down cluster for a minutes, despite
> usage of every possible optimization.
>
> On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>>
>> i still have recovery issues with cuttlefish. After the OSD comes back
>> it seem to hang for around 2-4 minutes and then recovery seems to start
>> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
>> get a lot of slow request messages an hanging VMs.
>>
>> What i noticed today is that if i leave the OSD off as long as ceph
>> starts to backfill - the recovery and "re" backfilling wents absolutely
>> smooth without any issues and no slow request messages at all.
>>
>> Does anybody have an idea why?
>>
>> Greets,
>> Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
It doesn't have log levels, should be in /var/log/ceph/ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:36 AM, Samuel Just  wrote:
> For now, just the main ceph.log.
> -Sam
>
> On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:
>> m 01.08.2013 20:34, schrieb Samuel Just:
>>
>>> Can you reproduce and attach the ceph.log from before you stop the osd
>>> until after you have started the osd and it has recovered?
>>> -Sam
>>
>>
>> Sure which log levels?
>>
>>
>>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>>  wrote:

 Hi,

 i still have recovery issues with cuttlefish. After the OSD comes back
 it seem to hang for around 2-4 minutes and then recovery seems to start
 (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
 get a lot of slow request messages an hanging VMs.

 What i noticed today is that if i leave the OSD off as long as ceph
 starts to backfill - the recovery and "re" backfilling wents absolutely
 smooth without any issues and no slow request messages at all.

 Does anybody have an idea why?

 Greets,
 Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
For now, just the main ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:
> m 01.08.2013 20:34, schrieb Samuel Just:
>
>> Can you reproduce and attach the ceph.log from before you stop the osd
>> until after you have started the osd and it has recovered?
>> -Sam
>
>
> Sure which log levels?
>
>
>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>>
>>> Hi,
>>>
>>> i still have recovery issues with cuttlefish. After the OSD comes back
>>> it seem to hang for around 2-4 minutes and then recovery seems to start
>>> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
>>> get a lot of slow request messages an hanging VMs.
>>>
>>> What i noticed today is that if i leave the OSD off as long as ceph
>>> starts to backfill - the recovery and "re" backfilling wents absolutely
>>> smooth without any issues and no slow request messages at all.
>>>
>>> Does anybody have an idea why?
>>>
>>> Greets,
>>> Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe

m 01.08.2013 20:34, schrieb Samuel Just:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam


Sure which log levels?


On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> i still have recovery issues with cuttlefish. After the OSD comes back
> it seem to hang for around 2-4 minutes and then recovery seems to start
> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
> get a lot of slow request messages an hanging VMs.
>
> What i noticed today is that if i leave the OSD off as long as ceph
> starts to backfill - the recovery and "re" backfilling wents absolutely
> smooth without any issues and no slow request messages at all.
>
> Does anybody have an idea why?
>
> Greets,
> Stefan


Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, Yan, Zheng wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju  wrote:
> > From: Sha Zhengju 
> >
> > Following we will begin to add memcg dirty page accounting around
> __set_page_dirty_
> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> avoid exporting
> > those details to filesystems.
> >
> > Signed-off-by: Sha Zhengju 
> > ---
> >  fs/ceph/addr.c |   13 +
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 3e68ac1..1445bf1 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >         if (unlikely(!mapping))
> >                 return !TestSetPageDirty(page);
> >
> > -       if (TestSetPageDirty(page)) {
> > +       if (!__set_page_dirty_nobuffers(page)) {
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

Unfortunately I only remember being frustrated by this code.  :)  Looking 
at it now, though, it seems like the minimum fix is to set the 
page->private before marking the page dirty.  I don't know the locking 
rules around that, though.  If that is potentially racy, maybe the safest 
thing would be if __set_page_dirty_nobuffers() took a void* to set 
page->private to atomically while holding the tree_lock.

sage

> 
> >                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >                      mapping->host, page, page->index);
> >                 return 0;
> > @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >              snapc, snapc->seq, snapc->num_snaps);
> >         spin_unlock(&ci->i_ceph_lock);
> >
> > -       /* now adjust page */
> > -       spin_lock_irq(&mapping->tree_lock);
> >         if (page->mapping) {    /* Race with truncate? */
> > -               WARN_ON_ONCE(!PageUptodate(page));
> > -               account_page_dirtied(page, page->mapping);
> > -               radix_tree_tag_set(&mapping->page_tree,
> > -                               page_index(page), PAGECACHE_TAG_DIRTY);
> > -
> 
> this code was copied from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.
> 
> Regards
> Yan, Zheng
> 
> >                 /*
> >                  * Reference snap context in page->private.  Also set
> >                  * PagePrivate so that we get invalidatepage callback.
> > @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
> >                 undo = 1;
> >         }
> >
> > -       spin_unlock_irq(&mapping->tree_lock);
> 
> 
> 
> 
> > -
> >         if (undo)
> >                 /* whoops, we failed to dirty the page */
> >                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> >
> > -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > -
> >         BUG_ON(!PageDirty(page));
> >         return 1;
> >  }
> > --
> > 1.7.9.5
> >

v0.67-rc3 Dumpling release candidate

2013-08-01 Thread Sage Weil
We've tagged and pushed out packages for another release candidate for 
Dumpling.  At this point things are looking very good.  There are a few 
odds and ends with the CLI changes but the core ceph functionality is 
looking quite stable.  Please test!

Packages are available in the -testing repos:

http://ceph.com/debian-testing/
http://ceph.com/rpm-testing/
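
For example, on a Debian/Ubuntu host the -testing repo can be enabled along 
these lines (a rough sketch; codename handling and any release-key setup may 
vary by distro):

  # add the ceph.com -testing apt repo for the running distro codename,
  # then install/upgrade the ceph packages from it
  echo "deb http://ceph.com/debian-testing/ $(lsb_release -sc) main" | \
      sudo tee /etc/apt/sources.list.d/ceph-testing.list
  sudo apt-get update && sudo apt-get install ceph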

Note that at any time you can also run the latest code for the pending 
release from

http://gitbuilder.ceph.com/ceph-deb-$DISTRO-x86_64-basic/ref/next/

The draft release notes for v0.67 dumpling are at

http://ceph.com/docs/master/release-notes/

You can see our bug queue at

http://tracker.ceph.com/projects/ceph/issues?query_id=27

Happy testing!
sage


Re: [PATCH] mds: remove waiting lock before merging with neighbours

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, David Disseldorp wrote:
> Hi,
> 
> Did anyone get a chance to look at this change?
> Any comments/feedback/ridicule would be appreciated.

Sorry, not yet--and Greg just headed out for vacation yesterday.  It's on 
my list to look at when I have some time tonight or tomorrow, though. 
Thanks!  

I'm hopeful this will clear up some of the locking hangs we've seen with 
the samba and flock tests...

sage


> 
> Cheers, David


Re: PG Backend Proposal

2013-08-01 Thread Loic Dachary


On 01/08/2013 18:42, Loic Dachary wrote:
> Hi Sam,
> 
> When the acting set changes order two chunks for the same object may co-exist 
> in the same placement group. The key should therefore also contain the chunk 
> number. 
> 
> That's probably the most sensible comment I have so far. This document is 
> immensely useful (even in its current state) because it shows me your 
> perspective on the implementation. 
> 
> I'm puzzled by:

I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
spurious non version chunks would get in the way.

:-)

> 
> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
> retain the deleted object until all replicas have persisted the deletion 
> event. ErasureCoded backend will therefore need to store objects with the 
> version at which they were created included in the key provided to the 
> filestore. Old versions of an object can be pruned when all replicas have 
> committed up to the log event deleting the object.
> 
> because I don't understand why the version would be necessary. I thought that 
> deleting an erasure coded object could be even easier than erasing a 
> replicated object because it cannot be resurrected if enough chunks are lots, 
> therefore you don't need to wait for ack from all OSDs in the up set. I'm 
> obviously missing something.
> 
> I failed to understand how important the pg logs were to maintaining the 
> consistency of the PG. For some reason I thought about them only in terms of 
> being a light weight version of the operation logs. Adding a payload to the 
> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
> would have never thought or dared think the logs could be extended in such a 
> way. Given the recent problems with logs writes having a high impact on 
> performances ( I'm referring to what forced you to introduce code to reduce 
> the amount of logs being written to only those that have been changed instead 
> of the complete logs ) I thought about the pg logs as something immutable.
> 
> I'm still trying to figure out how PGBackend::perform_write / read / 
> try_rollback would fit in the current backfilling / write / read / scrubbing 
> ... code path. 
> 
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: PG Backend Proposal

2013-08-01 Thread Samuel Just
DELETE can always be rolled forward, but there may be other operations
in the log that can't be (like an append).  So we need to be able to
roll it back (I think).  perform_write, read, and try_rollback probably
don't matter to backfill or scrubbing.  You are correct, we need to
include the chunk number in the object as well!
-Sam

On Thu, Aug 1, 2013 at 9:42 AM, Loic Dachary  wrote:
> Hi Sam,
>
> When the acting set changes order two chunks for the same object may co-exist 
> in the same placement group. The key should therefore also contain the chunk 
> number.
>
> That's probably the most sensible comment I have so far. This document is 
> immensely useful (even in its current state) because it shows me your 
> perspective on the implementation.
>
> I'm puzzled by:
>
> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
> retain the deleted object until all replicas have persisted the deletion 
> event. ErasureCoded backend will therefore need to store objects with the 
> version at which they were created included in the key provided to the 
> filestore. Old versions of an object can be pruned when all replicas have 
> committed up to the log event deleting the object.
>
> because I don't understand why the version would be necessary. I thought that 
> deleting an erasure coded object could be even easier than erasing a 
> replicated object because it cannot be resurrected if enough chunks are lots, 
> therefore you don't need to wait for ack from all OSDs in the up set. I'm 
> obviously missing something.
>
> I failed to understand how important the pg logs were to maintaining the 
> consistency of the PG. For some reason I thought about them only in terms of 
> being a light weight version of the operation logs. Adding a payload to the 
> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
> would have never thought or dared think the logs could be extended in such a 
> way. Given the recent problems with logs writes having a high impact on 
> performances ( I'm referring to what forced you to introduce code to reduce 
> the amount of logs being written to only those that have been changed instead 
> of the complete logs ) I thought about the pg logs as something immutable.
>
> I'm still trying to figure out how PGBackend::perform_write / read / 
> try_rollback would fit in the current backfilling / write / read / scrubbing 
> ... code path.
>
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>


PG Backend Proposal

2013-08-01 Thread Loic Dachary
Hi Sam,

When the acting set changes order, two chunks for the same object may co-exist 
in the same placement group. The key should therefore also contain the chunk 
number. 

That's probably the most sensible comment I have so far. This document is 
immensely useful (even in its current state) because it shows me your 
perspective on the implementation. 

I'm puzzled by:

CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
retain the deleted object until all replicas have persisted the deletion event. 
ErasureCoded backend will therefore need to store objects with the version at 
which they were created included in the key provided to the filestore. Old 
versions of an object can be pruned when all replicas have committed up to the 
log event deleting the object.

because I don't understand why the version would be necessary. I thought that 
deleting an erasure coded object could be even easier than erasing a replicated 
object because it cannot be resurrected if enough chunks are lost; therefore 
you don't need to wait for an ack from all OSDs in the up set. I'm obviously 
missing something.

I failed to understand how important the pg logs were to maintaining the 
consistency of the PG. For some reason I thought about them only in terms of 
being a light weight version of the operation logs. Adding a payload to the 
pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would 
have never thought or dared think the logs could be extended in such a way. 
Given the recent problems with log writes having a high impact on performance 
(I'm referring to what forced you to introduce code to reduce the amount of 
log entries being written to only those that have changed instead of the complete 
logs), I thought of the pg logs as something immutable.

I'm still trying to figure out how PGBackend::perform_write / read / 
try_rollback would fit in the current backfilling / write / read / scrubbing 
... code path. 

https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: still recovery issues with cuttlefish

2013-08-01 Thread Andrey Korolyov
Second this. Also, regarding the long-lasting snapshot problem and related
performance issues, I can say that cuttlefish improved things greatly,
but creation/deletion of a large snapshot (hundreds of gigabytes of
committed data) can still bring down the cluster for minutes, despite
the use of every possible optimization.

On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> i still have recovery issues with cuttlefish. After the OSD comes back
> it seem to hang for around 2-4 minutes and then recovery seems to start
> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
> get a lot of slow request messages an hanging VMs.
>
> What i noticed today is that if i leave the OSD off as long as ceph
> starts to backfill - the recovery and "re" backfilling wents absolutely
> smooth without any issues and no slow request messages at all.
>
> Does anybody have an idea why?
>
> Greets,
> Stefan


Re: [PATCH] Add missing buildrequires for Fedora

2013-08-01 Thread Danny Al-Gaaf
Hi,

I've opened a pull request with some additional fixes for this issue:
https://github.com/ceph/ceph/pull/478

Danny

Am 30.07.2013 09:53, schrieb Erik Logtenberg:
> Hi,
> 
> This patch adds two buildrequires to the ceph.spec file, that are needed
> to build the rpms under Fedora. Danny Al-Gaaf commented that the
> snappy-devel dependency should actually be added to the leveldb-devel
> package. I will try to get that fixed too, in the mean time, this patch
> does make sure Ceph builds on Fedora.
> 
> Signed-off-by: Erik Logtenberg 
> ---
> 



Re: [PATCH] mds: remove waiting lock before merging with neighbours

2013-08-01 Thread David Disseldorp
Hi,

Did anyone get a chance to look at this change?
Any comments/feedback/ridicule would be appreciated.

Cheers, David


[PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem

2013-08-01 Thread Sha Zhengju
From: Sha Zhengju 

In the following we will begin to add memcg dirty page accounting around 
__set_page_dirty_{buffers,nobuffers} in the vfs layer, so we'd better use the 
vfs interface to avoid exporting those details to filesystems.

Signed-off-by: Sha Zhengju 
---
 fs/ceph/addr.c |   13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..1445bf1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
-   if (TestSetPageDirty(page)) {
+   if (!__set_page_dirty_nobuffers(page)) {
dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 mapping->host, page, page->index);
return 0;
@@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
 snapc, snapc->seq, snapc->num_snaps);
spin_unlock(&ci->i_ceph_lock);
 
-   /* now adjust page */
-   spin_lock_irq(&mapping->tree_lock);
if (page->mapping) {/* Race with truncate? */
-   WARN_ON_ONCE(!PageUptodate(page));
-   account_page_dirtied(page, page->mapping);
-   radix_tree_tag_set(&mapping->page_tree,
-   page_index(page), PAGECACHE_TAG_DIRTY);
-
/*
 * Reference snap context in page->private.  Also set
 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
undo = 1;
}
 
-   spin_unlock_irq(&mapping->tree_lock);
-
if (undo)
/* whoops, we failed to dirty the page */
ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
BUG_ON(!PageDirty(page));
return 1;
 }
-- 
1.7.9.5



Re: LFS & Ceph

2013-08-01 Thread Chmouel Boudjnah
Hello,

Sorry for the late answer as I was travelling lately.

The LFS work has been in a heavy state of progress by Peter
(in CC) and others; there is some documentation in this review:

https://review.openstack.org/#/c/30051/

(summarized in Pete's gist here
https://gist.github.com/portante/5488238/raw/66b0bf2a91a8ca75301fa68dc0fef2d3dc76e5a2/gistfile1.txt)

some preliminary work here :

https://review.openstack.org/#/c/35381/

and some old documentation here :

https://raw.github.com/zaitcev/swift-lfs/master/doc/source/lfs_plugin.rst

I am sure Pete would appreciate the feedback.

Cheers,
Chmouel.

On 24/07/2013 17:00, Loic Dachary wrote:
> Hi,
>
> Thanks for taking the time to discuss LFS today @ OSCON :-) Would you be so 
> kind as to send links to the current discussion about the LFS driver API ?
>
> Cheers
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe - Profihost AG
Hi,

I still have recovery issues with cuttlefish. After the OSD comes back,
it seems to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages and hanging VMs.

What I noticed today is that if I leave the OSD off until ceph
starts to backfill, the recovery and "re"-backfilling go absolutely
smoothly, without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: Re: question about striped_read

2013-08-01 Thread Yan, Zheng
On Thu, Aug 1, 2013 at 2:30 PM, majianpeng  wrote:
>>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng  wrote:
On Wed, Jul 31, 2013 at 3:32 PM, majianpeng  wrote:
>
>  [snip]
> Test case
> A: touch file
> dd if=file of=/dev/null bs=5M count=1 iflag=direct
> B: [data(2M)|hole(2m)][data(2M)]
>dd if=file of=/dev/null bs=8M count=1 iflag=direct
> C: [data(4M)[hole(4M)][hole(4M)][data(2M)]
>   dd if=file of=/dev/null bs=16M count=1 iflag=direct
> D: touch file;truncate -s 5M file
>   dd if=file of=/dev/null bs=8M count=1 iflag=direct
>
> Those cases can work.
> Now I handle short reads differently for 'ret > 0' and 'ret == 0'.
> For a short read where ret > 0, it doesn't issue another read but instead
> zeroes the remaining area.
> This removes one meaningless read operation.
>

This patch looks good. But I still hope not to duplicate code.

how about change
 "hit_stripe = this_len < left;"
to
 "hit_stripe = this_len < left && (ret == this_len || pos + this_len <
inode->i_size);"

>>> To make the code easy to understand, I didn't apply your suggestion, but I
>>> added this check to the judgement of
>>> whether to read more content.
>>> The follow is the latest patch.Can you check?
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 2ddf061..3d8d14d 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -349,44 +349,38 @@ more:
>>> dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
>>>  ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : 
>>> "");
>>>
>>> -   if (ret > 0) {
>>> -   int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>>> -
>>> -   if (read < pos - off) {
>>> -   dout(" zero gap %llu to %llu\n", off + read, pos);
>>> -   ceph_zero_page_vector_range(page_align + read,
>>> -   pos - off - read, 
>>> pages);
>>> +   if (ret >= 0) {
>>> +   int  didpages;
>>> +   if (was_short && (pos + ret < inode->i_size)) {
>>> +   u64 tmp = min(this_len - ret,
>>> +inode->i_size - pos - ret);
>>> +   dout(" zero gap %llu to %llu\n",
>>> +   pos + ret, pos + ret + tmp);
>>> +   ceph_zero_page_vector_range(page_align + read + ret,
>>> +   tmp, pages);
>>> +   ret += tmp;
>>> }
>>> +
>>> +   didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>>> pos += ret;
>>> read = pos - off;
>>> left -= ret;
>>> page_pos += didpages;
>>> pages_left -= didpages;
>>>
>>> -   /* hit stripe? */
>>> -   if (left && hit_stripe)
>>> +   /* hit stripe and need continue*/
>>> +   if (left && hit_stripe && pos < inode->i_size)
>>> goto more;
>>> +
>>> }
>>>
>>> -   if (was_short) {
>>> +   if (ret >= 0) {
>>> +   ret = read;
>>> /* did we bounce off eof? */
>>> if (pos + left > inode->i_size)
>>> *checkeof = 1;
>>> -
>>> -   /* zero trailing bytes (inside i_size) */
>>> -   if (left > 0 && pos < inode->i_size) {
>>> -   if (pos + left > inode->i_size)
>>> -   left = inode->i_size - pos;
>>> -
>>> -   dout("zero tail %d\n", left);
>>> -   ceph_zero_page_vector_range(page_align + read, left,
>>> -   pages);
>>> -   read += left;
>>> -   }
>>> }
>>>
>>> -   if (ret >= 0)
>>> -   ret = read;
>>
>>I think this line should be "if (read > 0) ret = read;". Other than
>>this, your patch looks good.
> Because you mentioned this, I noticed that for ceph_sync_read/write the result
> is the && of every striped read/write.
> That is, if we hit one error, the total result is an error; it can't return a
> partial result.

This behavior is not correct. If we read/write some data, then meet an
error, we should return the size we have
read/written. I think all other FS behave like this. See
generic_file_aio_read() and do_generic_file_read().

Regards
Yan, Zheng



> I think I should write another patch for that.
>>You can add "Reviewed-by: Yan, Zheng " to your
>>formal patch.
> Ok ,thanks your times.
>
> Thanks!
> Jianpeng Ma
>>
>>Regards
>>Yan, Zheng
>>
>>
>>> dout("striped_read returns %d\n", ret);
>>> return ret;
>>>  }
>>>
>>> Thanks!
>>> Jianpeng Ma
> Thanks!
> Jianpeng Ma
>>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng  wrote:
On Wed, Jul 31, 2013 at 3:32 PM, maj