Re: PG Backend Proposal

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, Samuel Just wrote:
> I think there are some tricky edge cases with the above approach.  You
> might end up with two pg replicas in the same acting set which happen
> for reasons of history to have the same chunk for one or more objects.
>  That would have to be detected and repaired even though the object
> would be missing from neither replica (and might not even be in the pg
> log).  The erasure_code_rank would have to be somehow maintained
> through recovery (do we remember the original holder of a particular
> chunk in case it ever comes back?).
> 
> The chunk rank doesn't *need* to match the acting set position, but
> there are some good reasons to arrange for that to be the case:
> 1) Otherwise, we need something else to assign the chunk ranks
> 2) This way, a new primary can determine which osds hold which
> replicas of which chunk rank by looking at past osd maps.
> 
> It seems to me that given an OSDMap and an object, we should know
> immediately where all chunks should be stored since a future primary
> may need to do that without access to the objects themselves.
> 
> Importantly, while it may be possible for an acting set transition
> like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
> mode which will cause replacement to behave well for erasure codes:
> 
> initial: [0,1,2]
> 0 fails: [3,1,2]
> 2 fails: [3,1,4]
> 0 recovers: [0,1,4]
> 
> We do, however, need to decouple primariness from position in the
> acting set so that backfill can work well.

BTW, this reminds me: this might also be a good time to add the ability in 
the OSDMap to probabilistically reorder acting sets (and adjust 
primariness) for any mapping so that we can cheaply shift read traffic 
around (e.g., based on normal statistical skew, or temporary workload 
balance).  Currently the only way to do this also involves moving data.

s


> -Sam
> 
> On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary  wrote:
> > Hi Sam,
> >
> > I'm under the impression that
> > https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
> > assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
> >
> > The chunk rank does not need to match the OSD position in the acting set. 
> > As long as each object chunk is stored with its rank in an attribute, 
> > changing the order of the acting set does not require to move the chunks 
> > around.
> >
> > With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on 
> > [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute 
> > set to their rank.
> >
> > If the acting set changes to [2,1,0] the read would reorder the chunk based 
> > on their 'erasure_code_rank' attribute instead of the rank of the OSD they 
> > originate from in the current acting set. And then be able to decode them 
> > with the erasure code library, which requires that the chunks are provided 
> > in a specific order.
> >
> > When doing a full write, the chunks are written in the same order as the 
> > acting set. This implies that the order of the chunks of the previous 
> > version of the object may be different but I don't see a problem with that.
> >
> > When doing an append, the primary must first retrieve the order in which 
> > the objects are stored by retrieving their 'erasure_code_rank' attribute, 
> > because the order of the acting set is not the same as the order of the 
> > chunks. It then maps the chunks to the OSDs matching their rank and pushes 
> > them to the OSDs.
> >
> > The only downside is that it may make things more complicated to implement 
> > optimizations based on the fact that, sometimes, chunks can just be 
> > concatenated to recover the content of the object and don't need to be 
> > decoded ( when using systematic codes and the M data chunks are available ).
> >
> > Cheers
> >
> > On 01/08/2013 19:14, Loic Dachary wrote:
> >>
> >>
> >> On 01/08/2013 18:42, Loic Dachary wrote:
> >>> Hi Sam,
> >>>
> >>> When the acting set changes order two chunks for the same object may 
> >>> co-exist in the same placement group. The key should therefore also 
> >>> contain the chunk number.
> >>>
> >>> That's probably the most sensible comment I have so far. This document is 
> >>> immensely useful (even in its current state) because it shows me your 
> >>> perspective on the implementation.
> >>>
> >>> I'm puzzled by:
> >>
> >> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
> >> spurious non version chunks would get in the way.
> >>
> >> :-)
> >>
> >>>
> >>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires 
> >>> that we retain the deleted object until all replicas have persisted the 
> >>> deletion event. ErasureCoded backend will therefore need to store objects 
> >>> with the version at which they were created included in the key provided 
> >>> to the filestore. Old versions of an object can be pruned when all
> >>> replicas have committed up to the log event deleting the object.

Re: [PATCH] ceph: fix bugs about handling short-read for sync read mode.

2013-08-01 Thread Sage Weil
On Fri, 2 Aug 2013, majianpeng wrote:
> cephfs . show_layout
> >layout.data_pool: 0
> >layout.object_size:   4194304
> >layout.stripe_unit:   4194304
> >layout.stripe_count:  1
> 
> TestA:
> >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
> >dd if=/dev/urandom of=test bs=1M count=2 seek=4  oflag=direct
> >dd if=test of=/dev/null bs=6M count=1 iflag=direct
> The messages from func striped_read are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
> ceph:   file.c:350  : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
> ceph:   file.c:381  : zero tail 4194304
> ceph:   file.c:390  : striped_read returns 6291456
> The hole in the file is from 2M to 4M. But actually it zeroes the last 4M,
> including the last 2M area which isn't a hole.
> Using this patch, the messages are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
> ceph:   file.c:358  :  zero gap 2097152 to 4194304
> ceph:   file.c:350  : striped_read 4194304~2097152 (read 4194304) got 2097152
> ceph:   file.c:384  : striped_read returns 6291456
> 
> TestB:
> >echo majianpeng > test
> >dd if=test of=/dev/null bs=2M count=1 iflag=direct
> The messages are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
> ceph:   file.c:350  : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
> ceph:   file.c:390  : striped_read returns 11
> For this case, it did one more striped_read call, which is meaningless.
> Using this patch, the message are:
> ceph:   file.c:350  : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
> ceph:   file.c:384  : striped_read returns 11
> 
> Big thanks to Yan Zheng for the patch.

Thanks for working through this!

Do you mind putting the test into a simple .sh script (or whatever) that 
verifies this is working properly (that the read returns the correct data) 
on a file in the current directory?  Then we can add this to the 
collection of tests in ceph.git/qa/workunits.  Probably writing the same 
thing to $CWD/test and /tmp/test.$$ and running cmp to verify they match 
is the simplest.
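
Something along these lines might do; a rough sketch only (it assumes the 
current directory is a CephFS mount with the default 4 MB object size, and the 
file names are placeholders):

  #!/bin/sh -ex

  # Write the same sparse layout (2M data, 2M hole, 2M data) to a CephFS file
  # in the cwd and to a local reference file, then read the CephFS copy back
  # with O_DIRECT and compare the two.
  CEPH_FILE=./shortread-test.$$
  LOCAL_FILE=/tmp/shortread-test.$$
  CHUNK=/tmp/shortread-chunk.$$
  OUT=/tmp/shortread-out.$$
  rm -f $CEPH_FILE $LOCAL_FILE

  dd if=/dev/urandom of=$CHUNK bs=1M count=2

  # local reference copy (buffered I/O is fine here)
  dd if=$CHUNK of=$LOCAL_FILE bs=1M count=2
  dd if=$CHUNK of=$LOCAL_FILE bs=1M count=2 seek=4 conv=notrunc

  # same layout on the CephFS side, written with O_DIRECT
  dd if=$CHUNK of=$CEPH_FILE bs=1M count=2 oflag=direct
  dd if=$CHUNK of=$CEPH_FILE bs=1M count=2 seek=4 oflag=direct conv=notrunc

  # the 6M direct read spans the hole; the zeroed gap must come back intact
  dd if=$CEPH_FILE of=$OUT bs=6M count=1 iflag=direct
  cmp $LOCAL_FILE $OUT

  rm -f $CEPH_FILE $LOCAL_FILE $CHUNK $OUT
  echo OK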

Thanks!
sage



> 
> Signed-off-by: Jianpeng Ma 
> Reviewed-by: Yan, Zheng 
> ---
>  fs/ceph/file.c | 40 +---
>  1 file changed, 17 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 2ddf061..3d8d14d 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -349,44 +349,38 @@ more:
>   dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
>ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");
>  
> - if (ret > 0) {
> - int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
> -
> - if (read < pos - off) {
> - dout(" zero gap %llu to %llu\n", off + read, pos);
> - ceph_zero_page_vector_range(page_align + read,
> - pos - off - read, pages);
> + if (ret >= 0) {
> + int  didpages;
> + if (was_short && (pos + ret < inode->i_size)) {
> + u64 tmp = min(this_len - ret,
> +  inode->i_size - pos - ret);
> + dout(" zero gap %llu to %llu\n",
> + pos + ret, pos + ret + tmp);
> + ceph_zero_page_vector_range(page_align + read + ret,
> + tmp, pages);
> + ret += tmp;
>   }
> +
> + didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>   pos += ret;
>   read = pos - off;
>   left -= ret;
>   page_pos += didpages;
>   pages_left -= didpages;
>  
> - /* hit stripe? */
> - if (left && hit_stripe)
> + /* hit stripe and need continue*/
> + if (left && hit_stripe && pos < inode->i_size)
>   goto more;
> +
>   }
>  
> - if (was_short) {
> + if (ret >= 0) {
> + ret = read;
>   /* did we bounce off eof? */
>   if (pos + left > inode->i_size)
>   *checkeof = 1;
> -
> - /* zero trailing bytes (inside i_size) */
> - if (left > 0 && pos < inode->i_size) {
> - if (pos + left > inode->i_size)
> - left = inode->i_size - pos;
> -
> - dout("zero tail %d\n", left);
> - ceph_zero_page_vector_range(page_align + read, left,
> - pages);
> - read += left;
> - }
>   }
>  
> - if (ret >= 0)
> - ret = read;
>   dout("striped_read returns %d\n", ret);
>   return ret;
>  }
> -- 
> 1.8.1.2

Re: PG Backend Proposal

2013-08-01 Thread Samuel Just
I think there are some tricky edge cases with the above approach.  You
might end up with two pg replicas in the same acting set which happen
for reasons of history to have the same chunk for one or more objects.
 That would have to be detected and repaired even though the object
would be missing from neither replica (and might not even be in the pg
log).  The erasure_code_rank would have to be somehow maintained
through recovery (do we remember the original holder of a particular
chunk in case it ever comes back?).

The chunk rank doesn't *need* to match the acting set position, but
there are some good reasons to arrange for that to be the case:
1) Otherwise, we need something else to assign the chunk ranks
2) This way, a new primary can determine which osds hold which
replicas of which chunk rank by looking at past osd maps.

It seems to me that given an OSDMap and an object, we should know
immediately where all chunks should be stored since a future primary
may need to do that without access to the objects themselves.

Importantly, while it may be possible for an acting set transition
like [0,1,2]->[2,1,0] to occur in some pathological case, CRUSH has a
mode which will cause replacement to behave well for erasure codes:

initial: [0,1,2]
0 fails: [3,1,2]
2 fails: [3,1,4]
0 recovers: [0,1,4]

We do, however, need to decouple primariness from position in the
acting set so that backfill can work well.
-Sam
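
For what it's worth, that kind of mapping can already be simulated offline from 
the CRUSH map alone, which is roughly the property a future primary would rely 
on. A rough sketch (the rule id and input range are placeholders, and the exact 
crushtool flags may differ between versions):

  # grab the cluster's crush map and ask crushtool which OSDs a range of
  # inputs would map to for rule 0 with 3 chunks/replicas
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 \
      --min-x 0 --max-x 9 --show-mappings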

On Thu, Aug 1, 2013 at 4:54 PM, Loic Dachary  wrote:
> Hi Sam,
>
> I'm under the impression that
> https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
> assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.
>
> The chunk rank does not need to match the OSD position in the acting set. As 
> long as each object chunk is stored with its rank in an attribute, changing 
> the order of the acting set does not require to move the chunks around.
>
> With M=2+K=1 and the acting set is [0,1,2] chunks M0,M1,K0 are written on 
> [0,1,2] respectively, each of them have the 'erasure_code_rank' attribute set 
> to their rank.
>
> If the acting set changes to [2,1,0] the read would reorder the chunk based 
> on their 'erasure_code_rank' attribute instead of the rank of the OSD they 
> originate from in the current acting set. And then be able to decode them 
> with the erasure code library, which requires that the chunks are provided in 
> a specific order.
>
> When doing a full write, the chunks are written in the same order as the 
> acting set. This implies that the order of the chunks of the previous version 
> of the object may be different but I don't see a problem with that.
>
> When doing an append, the primary must first retrieve the order in which the 
> objects are stored by retrieving their 'erasure_code_rank' attribute, because 
> the order of the acting set is not the same as the order of the chunks. It 
> then maps the chunks to the OSDs matching their rank and pushes them to the 
> OSDs.
>
> The only downside is that it may make things more complicated to implement 
> optimizations based on the fact that, sometimes, chunks can just be 
> concatenated to recover the content of the object and don't need to be 
> decoded ( when using systematic codes and the M data chunks are available ).
>
> Cheers
>
> On 01/08/2013 19:14, Loic Dachary wrote:
>>
>>
>> On 01/08/2013 18:42, Loic Dachary wrote:
>>> Hi Sam,
>>>
>>> When the acting set changes order two chunks for the same object may 
>>> co-exist in the same placement group. The key should therefore also contain 
>>> the chunk number.
>>>
>>> That's probably the most sensible comment I have so far. This document is 
>>> immensely useful (even in its current state) because it shows me your 
>>> perspective on the implementation.
>>>
>>> I'm puzzled by:
>>
>> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
>> spurious non version chunks would get in the way.
>>
>> :-)
>>
>>>
>>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that 
>>> we retain the deleted object until all replicas have persisted the deletion 
>>> event. ErasureCoded backend will therefore need to store objects with the 
>>> version at which they were created included in the key provided to the 
>>> filestore. Old versions of an object can be pruned when all replicas have 
>>> committed up to the log event deleting the object.
>>>
>>> because I don't understand why the version would be necessary. I thought 
>>> that deleting an erasure coded object could be even easier than erasing a 
>>> replicated object because it cannot be resurrected if enough chunks are 
>>> lots, therefore you don't need to wait for ack from all OSDs in the up set. 
>>> I'm obviously missing something.
>>>
>>> I failed to understand how important the pg logs were to maintaining the 
>>> consistency of the PG. For some reason I thought about them only in terms 
>>> of 

Re: PG Backend Proposal

2013-08-01 Thread Loic Dachary
Hi Sam,

I'm under the impression that
https://github.com/athanatos/ceph/blob/wip-erasure-coding-doc/doc/dev/osd_internals/erasure_coding.rst#distinguished-acting-set-positions
assumes acting[0] stores all chunk[0], acting[1] stores all chunk[1] etc.

The chunk rank does not need to match the OSD position in the acting set. As 
long as each object chunk is stored with its rank in an attribute, changing the 
order of the acting set does not require moving the chunks around.

With M=2 + K=1 and the acting set [0,1,2], chunks M0, M1, K0 are written to 
[0,1,2] respectively; each of them has the 'erasure_code_rank' attribute set 
to its rank.

If the acting set changes to [2,1,0], a read would reorder the chunks based on 
their 'erasure_code_rank' attribute instead of the rank of the OSD they 
originate from in the current acting set, and would then be able to decode them 
with the erasure code library, which requires that the chunks be provided in a 
specific order.

When doing a full write, the chunks are written in the same order as the acting 
set. This implies that the order of the chunks of the previous version of the 
object may be different but I don't see a problem with that.

When doing an append, the primary must first retrieve the order in which the 
object's chunks are stored by reading their 'erasure_code_rank' attribute, because 
the order of the acting set is not the same as the order of the chunks. It then 
maps each chunk to the OSD matching its rank and pushes it to that OSD.

The only downside is that it may make things more complicated to implement 
optimizations based on the fact that, sometimes, chunks can just be 
concatenated to recover the content of the object and don't need to be decoded 
( when using systematic codes and the M data chunks are available ).

Cheers

On 01/08/2013 19:14, Loic Dachary wrote:
> 
> 
> On 01/08/2013 18:42, Loic Dachary wrote:
>> Hi Sam,
>>
>> When the acting set changes order two chunks for the same object may 
>> co-exist in the same placement group. The key should therefore also contain 
>> the chunk number. 
>>
>> That's probably the most sensible comment I have so far. This document is 
>> immensely useful (even in its current state) because it shows me your 
>> perspective on the implementation. 
>>
>> I'm puzzled by:
> 
> I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
> spurious non version chunks would get in the way.
> 
> :-)
> 
>>
>> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that 
>> we retain the deleted object until all replicas have persisted the deletion 
>> event. ErasureCoded backend will therefore need to store objects with the 
>> version at which they were created included in the key provided to the 
>> filestore. Old versions of an object can be pruned when all replicas have 
>> committed up to the log event deleting the object.
>>
>> because I don't understand why the version would be necessary. I thought 
>> that deleting an erasure coded object could be even easier than erasing a 
>> replicated object because it cannot be resurrected if enough chunks are 
>> lots, therefore you don't need to wait for ack from all OSDs in the up set. 
>> I'm obviously missing something.
>>
>> I failed to understand how important the pg logs were to maintaining the 
>> consistency of the PG. For some reason I thought about them only in terms of 
>> being a light weight version of the operation logs. Adding a payload to the 
>> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
>> would have never thought or dared think the logs could be extended in such a 
>> way. Given the recent problems with logs writes having a high impact on 
>> performances ( I'm referring to what forced you to introduce code to reduce 
>> the amount of logs being written to only those that have been changed 
>> instead of the complete logs ) I thought about the pg logs as something 
>> immutable.
>>
>> I'm still trying to figure out how PGBackend::perform_write / read / 
>> try_rollback would fit in the current backfilling / write / read / scrubbing 
>> ... code path. 
>>
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
>> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>>
>> Cheers
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: Rados Protocoll

2013-08-01 Thread Noah Watkins
Hi Niklas,

The RADOS reference implementation in C++ is quite large. Reproducing
it all in another language would be interesting, but I'm curious if
wrapping the C interface is not an option for you? There are Java
bindings that are being worked on here:
https://github.com/wido/rados-java.

There are links on ceph.com/docs to some information about Ceph, as
well as videos on YouTube and links to academic papers.

-Noah

On Thu, Aug 1, 2013 at 1:01 PM, Niklas Goerke  wrote:
> Hi,
>
> I was wondering why there is no native Java implementation of librados. I'm
> thinking about creating one and I'm thus looking for a documentation of the
> RADOS protocol.
> Also the way I see it librados implements the crush algorithm. Is there a
> documentation for it?
> Also an educated guess about whether the RADOS Protocol is due to changes
> would be very much appreciated.
>
> Thank you in advance
>
> Niklas


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Can you dump your osd settings?
sudo ceph --admin-daemon ceph-osd..asok config show
-Sam
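
For example, to pull out just the recovery/backfill throttles (substitute the 
osd id; the socket path below is only the default location and the option list 
is not exhaustive):

  # dump the running config of one osd via its admin socket and filter
  # for the recovery/backfill related options
  sudo ceph --admin-daemon /var/run/ceph/ceph-osd.<id>.asok config show \
      | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'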

On Thu, Aug 1, 2013 at 12:07 PM, Stefan Priebe  wrote:
> Mike we already have the async patch running. Yes it helps but only helps it
> does not solve. It just hides the issue ...
> Am 01.08.2013 20:54, schrieb Mike Dawson:
>
>> I am also seeing recovery issues with 0.61.7. Here's the process:
>>
>> - ceph osd set noout
>>
>> - Reboot one of the nodes hosting OSDs
>>  - VMs mounted from RBD volumes work properly
>>
>> - I see the OSD's boot messages as they re-join the cluster
>>
>> - Start seeing active+recovery_wait, peering, and active+recovering
>>  - VMs mounted from RBD volumes become unresponsive.
>>
>> - Recovery completes
>>  - VMs mounted from RBD volumes regain responsiveness
>>
>> - ceph osd unset noout
>>
>> Would joshd's async patch for qemu help here, or is there something else
>> going on?
>>
>> Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY
>>
>> Thanks,
>>
>> Mike Dawson
>> Co-Founder & Director of Cloud Architecture
>> Cloudapt LLC
>> 6330 East 75th Street, Suite 170
>> Indianapolis, IN 46250
>>
>> On 8/1/2013 2:34 PM, Samuel Just wrote:
>>>
>>> Can you reproduce and attach the ceph.log from before you stop the osd
>>> until after you have started the osd and it has recovered?
>>> -Sam
>>>
>>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>>  wrote:

 Hi,

 i still have recovery issues with cuttlefish. After the OSD comes back
 it seem to hang for around 2-4 minutes and then recovery seems to start
 (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
 get a lot of slow request messages an hanging VMs.

 What i noticed today is that if i leave the OSD off as long as ceph
 starts to backfill - the recovery and "re" backfilling wents absolutely
 smooth without any issues and no slow request messages at all.

 Does anybody have an idea why?

 Greets,
 Stefan


Rados Protocoll

2013-08-01 Thread Niklas Goerke

Hi,

I was wondering why there is no native Java implementation of librados. 
I'm thinking about creating one and I'm thus looking for documentation 
of the RADOS protocol.
Also, the way I see it, librados implements the CRUSH algorithm. Is there 
documentation for it?
Also, an educated guess about whether the RADOS protocol is subject to 
change would be very much appreciated.


Thank you in advance

Niklas


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe
Mike, we already have the async patch running. Yes, it helps, but it only 
helps; it does not solve the problem. It just hides the issue ...

Am 01.08.2013 20:54, schrieb Mike Dawson:

I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout

- Reboot one of the nodes hosting OSDs
 - VMs mounted from RBD volumes work properly

- I see the OSD's boot messages as they re-join the cluster

- Start seeing active+recovery_wait, peering, and active+recovering
 - VMs mounted from RBD volumes become unresponsive.

- Recovery completes
 - VMs mounted from RBD volumes regain responsiveness

- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else
going on?

Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/1/2013 2:34 PM, Samuel Just wrote:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Mike Dawson

I am also seeing recovery issues with 0.61.7. Here's the process:

- ceph osd set noout

- Reboot one of the nodes hosting OSDs
- VMs mounted from RBD volumes work properly

- I see the OSD's boot messages as they re-join the cluster

- Start seeing active+recovery_wait, peering, and active+recovering
- VMs mounted from RBD volumes become unresponsive.

- Recovery completes
- VMs mounted from RBD volumes regain responsiveness

- ceph osd unset noout

Would joshd's async patch for qemu help here, or is there something else 
going on?


Output of ceph -w at: http://pastebin.com/raw.php?i=JLcZYFzY

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250

On 8/1/2013 2:34 PM, Samuel Just wrote:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe

here it is

Am 01.08.2013 20:36, schrieb Samuel Just:

For now, just the main ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:

m 01.08.2013 20:34, schrieb Samuel Just:


Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam



Sure which log levels?



On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:


Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan





ceph.log.gz
Description: application/gzip


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Is there a bug open for this?  I suspect we don't sufficiently
throttle the snapshot removal work.
-Sam

On Thu, Aug 1, 2013 at 7:50 AM, Andrey Korolyov  wrote:
> Second this. Also for long-lasting snapshot problem and related
> performance issues I may say that cuttlefish improved things greatly,
> but creation/deletion of large snapshot (hundreds of gigabytes of
> commited data) still can bring down cluster for a minutes, despite
> usage of every possible optimization.
>
> On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG
>  wrote:
>> Hi,
>>
>> i still have recovery issues with cuttlefish. After the OSD comes back
>> it seem to hang for around 2-4 minutes and then recovery seems to start
>> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
>> get a lot of slow request messages an hanging VMs.
>>
>> What i noticed today is that if i leave the OSD off as long as ceph
>> starts to backfill - the recovery and "re" backfilling wents absolutely
>> smooth without any issues and no slow request messages at all.
>>
>> Does anybody have an idea why?
>>
>> Greets,
>> Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
It doesn't have log levels, should be in /var/log/ceph/ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:36 AM, Samuel Just  wrote:
> For now, just the main ceph.log.
> -Sam
>
> On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:
>> m 01.08.2013 20:34, schrieb Samuel Just:
>>
>>> Can you reproduce and attach the ceph.log from before you stop the osd
>>> until after you have started the osd and it has recovered?
>>> -Sam
>>
>>
>> Sure which log levels?
>>
>>
>>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>>  wrote:

 Hi,

 i still have recovery issues with cuttlefish. After the OSD comes back
 it seem to hang for around 2-4 minutes and then recovery seems to start
 (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
 get a lot of slow request messages an hanging VMs.

 What i noticed today is that if i leave the OSD off as long as ceph
 starts to backfill - the recovery and "re" backfilling wents absolutely
 smooth without any issues and no slow request messages at all.

 Does anybody have an idea why?

 Greets,
 Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
For now, just the main ceph.log.
-Sam

On Thu, Aug 1, 2013 at 11:34 AM, Stefan Priebe  wrote:
> m 01.08.2013 20:34, schrieb Samuel Just:
>
>> Can you reproduce and attach the ceph.log from before you stop the osd
>> until after you have started the osd and it has recovered?
>> -Sam
>
>
> Sure which log levels?
>
>
>> On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
>>  wrote:
>>>
>>> Hi,
>>>
>>> i still have recovery issues with cuttlefish. After the OSD comes back
>>> it seem to hang for around 2-4 minutes and then recovery seems to start
>>> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
>>> get a lot of slow request messages an hanging VMs.
>>>
>>> What i noticed today is that if i leave the OSD off as long as ceph
>>> starts to backfill - the recovery and "re" backfilling wents absolutely
>>> smooth without any issues and no slow request messages at all.
>>>
>>> Does anybody have an idea why?
>>>
>>> Greets,
>>> Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe

m 01.08.2013 20:34, schrieb Samuel Just:

Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam


Sure which log levels?


On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:

Hi,

i still have recovery issues with cuttlefish. After the OSD comes back
it seem to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages an hanging VMs.

What i noticed today is that if i leave the OSD off as long as ceph
starts to backfill - the recovery and "re" backfilling wents absolutely
smooth without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: still recovery issues with cuttlefish

2013-08-01 Thread Samuel Just
Can you reproduce and attach the ceph.log from before you stop the osd
until after you have started the osd and it has recovered?
-Sam

On Thu, Aug 1, 2013 at 1:22 AM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> i still have recovery issues with cuttlefish. After the OSD comes back
> it seem to hang for around 2-4 minutes and then recovery seems to start
> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
> get a lot of slow request messages an hanging VMs.
>
> What i noticed today is that if i leave the OSD off as long as ceph
> starts to backfill - the recovery and "re" backfilling wents absolutely
> smooth without any issues and no slow request messages at all.
>
> Does anybody have an idea why?
>
> Greets,
> Stefan


Re: [PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, Yan, Zheng wrote:
> On Thu, Aug 1, 2013 at 7:51 PM, Sha Zhengju  wrote:
> > From: Sha Zhengju 
> >
> > Following we will begin to add memcg dirty page accounting around
> __set_page_dirty_
> > {buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
> avoid exporting
> > those details to filesystems.
> >
> > Signed-off-by: Sha Zhengju 
> > ---
> >  fs/ceph/addr.c |   13 +
> >  1 file changed, 1 insertion(+), 12 deletions(-)
> >
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > index 3e68ac1..1445bf1 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
> >         if (unlikely(!mapping))
> >                 return !TestSetPageDirty(page);
> >
> > -       if (TestSetPageDirty(page)) {
> > +       if (!__set_page_dirty_nobuffers(page)) {
> it's too early to set the radix tree tag here. We should set page's snapshot
> context and increase the i_wrbuffer_ref first. This is because once the tag
> is set, writeback thread can find and start flushing the page.

Unfortunately I only remember being frustrated by this code.  :)  Looking 
at it now, though, it seems like the minimum fix is to set the 
page->private before marking the page dirty.  I don't know the locking 
rules around that, though.  If that is potentially racy, maybe the safest 
thing would be if __set_page_dirty_nobuffers() took a void* to set 
page->private to atomically while holding the tree_lock.

sage

> 
> >                 dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> >                      mapping->host, page, page->index);
> >                 return 0;
> > @@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
> >              snapc, snapc->seq, snapc->num_snaps);
> >         spin_unlock(&ci->i_ceph_lock);
> >
> > -       /* now adjust page */
> > -       spin_lock_irq(&mapping->tree_lock);
> >         if (page->mapping) {    /* Race with truncate? */
> > -               WARN_ON_ONCE(!PageUptodate(page));
> > -               account_page_dirtied(page, page->mapping);
> > -               radix_tree_tag_set(&mapping->page_tree,
> > -                               page_index(page), PAGECACHE_TAG_DIRTY);
> > -
> 
> this code was copied from __set_page_dirty_nobuffers(). I think the reason
> Sage did this is to handle the race described in
> __set_page_dirty_nobuffers()'s comment. But I'm wonder if "page->mapping ==
> NULL" can still happen here. Because truncate_inode_page() unmap page from
> processes's address spaces first, then delete page from page cache.
> 
> Regards
> Yan, Zheng
> 
> >                 /*
> >                  * Reference snap context in page->private.  Also set
> >                  * PagePrivate so that we get invalidatepage callback.
> > @@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
> >                 undo = 1;
> >         }
> >
> > -       spin_unlock_irq(&mapping->tree_lock);
> 
> 
> 
> 
> > -
> >         if (undo)
> >                 /* whoops, we failed to dirty the page */
> >                 ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> >
> > -       __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > -
> >         BUG_ON(!PageDirty(page));
> >         return 1;
> >  }
> > --
> > 1.7.9.5
> >

v0.67-rc3 Dumpling release candidate

2013-08-01 Thread Sage Weil
We've tagged and pushed out packages for another release candidate for 
Dumpling.  At this point things are looking very good.  There are a few 
odds and ends with the CLI changes but the core ceph functionality is 
looking quite stable.  Please test!

Packages are available in the -testing repos:

http://ceph.com/debian-testing/
http://ceph.com/rpm-testing/
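
For example, on a Debian/Ubuntu host the -testing repo can be enabled along 
these lines (a rough sketch; codename handling and any release-key setup may 
vary by distro):

  # add the ceph.com -testing apt repo for the running distro codename,
  # then install/upgrade the ceph packages from it
  echo "deb http://ceph.com/debian-testing/ $(lsb_release -sc) main" | \
      sudo tee /etc/apt/sources.list.d/ceph-testing.list
  sudo apt-get update && sudo apt-get install ceph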

Note that at any time you can also run the latest code for the pending 
release from

http://gitbuilder.ceph.com/ceph-deb-$DISTRO-x86_64-basic/ref/next/

The draft release notes for v0.67 dumpling are at

http://ceph.com/docs/master/release-notes/

You can see our bug queue at

http://tracker.ceph.com/projects/ceph/issues?query_id=27

Happy testing!
sage


Re: [PATCH] mds: remove waiting lock before merging with neighbours

2013-08-01 Thread Sage Weil
On Thu, 1 Aug 2013, David Disseldorp wrote:
> Hi,
> 
> Did anyone get a chance to look at this change?
> Any comments/feedback/ridicule would be appreciated.

Sorry, not yet--and Greg just headed out for vacation yesterday.  It's on 
my list to look at when I have some time tonight or tomorrow, though. 
Thanks!  

I'm hopeful this will clear up some of the locking hangs we've seen with 
the samba and flock tests...

sage


> 
> Cheers, David


Re: PG Backend Proposal

2013-08-01 Thread Loic Dachary


On 01/08/2013 18:42, Loic Dachary wrote:
> Hi Sam,
> 
> When the acting set changes order two chunks for the same object may co-exist 
> in the same placement group. The key should therefore also contain the chunk 
> number. 
> 
> That's probably the most sensible comment I have so far. This document is 
> immensely useful (even in its current state) because it shows me your 
> perspective on the implementation. 
> 
> I'm puzzled by:

I get it ( thanks to yanzheng ). Object is deleted, then created again ... 
spurious non version chunks would get in the way.

:-)

> 
> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
> retain the deleted object until all replicas have persisted the deletion 
> event. ErasureCoded backend will therefore need to store objects with the 
> version at which they were created included in the key provided to the 
> filestore. Old versions of an object can be pruned when all replicas have 
> committed up to the log event deleting the object.
> 
> because I don't understand why the version would be necessary. I thought that 
> deleting an erasure coded object could be even easier than erasing a 
> replicated object because it cannot be resurrected if enough chunks are lots, 
> therefore you don't need to wait for ack from all OSDs in the up set. I'm 
> obviously missing something.
> 
> I failed to understand how important the pg logs were to maintaining the 
> consistency of the PG. For some reason I thought about them only in terms of 
> being a light weight version of the operation logs. Adding a payload to the 
> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
> would have never thought or dared think the logs could be extended in such a 
> way. Given the recent problems with logs writes having a high impact on 
> performances ( I'm referring to what forced you to introduce code to reduce 
> the amount of logs being written to only those that have been changed instead 
> of the complete logs ) I thought about the pg logs as something immutable.
> 
> I'm still trying to figure out how PGBackend::perform_write / read / 
> try_rollback would fit in the current backfilling / write / read / scrubbing 
> ... code path. 
> 
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
> 
> Cheers
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: PG Backend Proposal

2013-08-01 Thread Samuel Just
DELETE can always be rolled forward, but there may be other operations
in the log that can't be (like an append).  So we need to be able to
roll it back (I think).  perform_write, read, and try_rollback probably
don't matter to backfill or scrubbing.  You are correct, we need to
include the chunk number in the object as well!
-Sam

On Thu, Aug 1, 2013 at 9:42 AM, Loic Dachary  wrote:
> Hi Sam,
>
> When the acting set changes order two chunks for the same object may co-exist 
> in the same placement group. The key should therefore also contain the chunk 
> number.
>
> That's probably the most sensible comment I have so far. This document is 
> immensely useful (even in its current state) because it shows me your 
> perspective on the implementation.
>
> I'm puzzled by:
>
> CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
> retain the deleted object until all replicas have persisted the deletion 
> event. ErasureCoded backend will therefore need to store objects with the 
> version at which they were created included in the key provided to the 
> filestore. Old versions of an object can be pruned when all replicas have 
> committed up to the log event deleting the object.
>
> because I don't understand why the version would be necessary. I thought that 
> deleting an erasure coded object could be even easier than erasing a 
> replicated object because it cannot be resurrected if enough chunks are lots, 
> therefore you don't need to wait for ack from all OSDs in the up set. I'm 
> obviously missing something.
>
> I failed to understand how important the pg logs were to maintaining the 
> consistency of the PG. For some reason I thought about them only in terms of 
> being a light weight version of the operation logs. Adding a payload to the 
> pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I 
> would have never thought or dared think the logs could be extended in such a 
> way. Given the recent problems with logs writes having a high impact on 
> performances ( I'm referring to what forced you to introduce code to reduce 
> the amount of logs being written to only those that have been changed instead 
> of the complete logs ) I thought about the pg logs as something immutable.
>
> I'm still trying to figure out how PGBackend::perform_write / read / 
> try_rollback would fit in the current backfilling / write / read / scrubbing 
> ... code path.
>
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
> https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
>


PG Backend Proposal

2013-08-01 Thread Loic Dachary
Hi Sam,

When the acting set changes order, two chunks for the same object may co-exist 
in the same placement group. The key should therefore also contain the chunk 
number. 

That's probably the most sensible comment I have so far. This document is 
immensely useful (even in its current state) because it shows me your 
perspective on the implementation. 

I'm puzzled by:

CEPH_OSD_OP_DELETE: The possibility of rolling back a delete requires that we 
retain the deleted object until all replicas have persisted the deletion event. 
ErasureCoded backend will therefore need to store objects with the version at 
which they were created included in the key provided to the filestore. Old 
versions of an object can be pruned when all replicas have committed up to the 
log event deleting the object.

because I don't understand why the version would be necessary. I thought that 
deleting an erasure coded object could be even easier than erasing a replicated 
object because it cannot be resurrected if enough chunks are lost; therefore 
you don't need to wait for an ack from all OSDs in the up set. I'm obviously 
missing something.

I failed to understand how important the pg logs were to maintaining the 
consistency of the PG. For some reason I thought about them only in terms of 
being a light weight version of the operation logs. Adding a payload to the 
pg_log_entry ( i.e. APPEND size or attribute ) is a new idea for me and I would 
have never thought or dared think the logs could be extended in such a way. 
Given the recent problems with log writes having a high impact on performance 
(I'm referring to what forced you to introduce code to reduce the amount of 
log entries being written to only those that have changed instead of the complete 
logs), I thought of the pg logs as something immutable.

I'm still trying to figure out how PGBackend::perform_write / read / 
try_rollback would fit in the current backfilling / write / read / scrubbing 
... code path. 

https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/doc/dev/osd_internals/erasure_coding.rst
https://github.com/athanatos/ceph/blob/ba5c97eda4fe72a25831031a2cffb226fed8d9b7/src/osd/PGBackend.h

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.





Re: still recovery issues with cuttlefish

2013-08-01 Thread Andrey Korolyov
Second this. Also, regarding the long-lasting snapshot problem and related
performance issues, I can say that cuttlefish improved things greatly,
but creation/deletion of a large snapshot (hundreds of gigabytes of
committed data) can still bring down the cluster for minutes, despite
the use of every possible optimization.

On Thu, Aug 1, 2013 at 12:22 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi,
>
> i still have recovery issues with cuttlefish. After the OSD comes back
> it seem to hang for around 2-4 minutes and then recovery seems to start
> (pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
> get a lot of slow request messages an hanging VMs.
>
> What i noticed today is that if i leave the OSD off as long as ceph
> starts to backfill - the recovery and "re" backfilling wents absolutely
> smooth without any issues and no slow request messages at all.
>
> Does anybody have an idea why?
>
> Greets,
> Stefan


Re: [PATCH] Add missing buildrequires for Fedora

2013-08-01 Thread Danny Al-Gaaf
Hi,

I've opened a pull request with some additional fixes for this issue:
https://github.com/ceph/ceph/pull/478

Danny

Am 30.07.2013 09:53, schrieb Erik Logtenberg:
> Hi,
> 
> This patch adds two buildrequires to the ceph.spec file, that are needed
> to build the rpms under Fedora. Danny Al-Gaaf commented that the
> snappy-devel dependency should actually be added to the leveldb-devel
> package. I will try to get that fixed too, in the mean time, this patch
> does make sure Ceph builds on Fedora.
> 
> Signed-off-by: Erik Logtenberg 
> ---
> 



Re: [PATCH] mds: remove waiting lock before merging with neighbours

2013-08-01 Thread David Disseldorp
Hi,

Did anyone get a chance to look at this change?
Any comments/feedback/ridicule would be appreciated.

Cheers, David


[PATCH V5 2/8] fs/ceph: vfs __set_page_dirty_nobuffers interface instead of doing it inside filesystem

2013-08-01 Thread Sha Zhengju
From: Sha Zhengju 

In the following we will begin to add memcg dirty page accounting around 
__set_page_dirty_{buffers,nobuffers} in the vfs layer, so we'd better use the 
vfs interface to avoid exporting those details to filesystems.

Signed-off-by: Sha Zhengju 
---
 fs/ceph/addr.c |   13 +
 1 file changed, 1 insertion(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..1445bf1 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -76,7 +76,7 @@ static int ceph_set_page_dirty(struct page *page)
if (unlikely(!mapping))
return !TestSetPageDirty(page);
 
-   if (TestSetPageDirty(page)) {
+   if (!__set_page_dirty_nobuffers(page)) {
dout("%p set_page_dirty %p idx %lu -- already dirty\n",
 mapping->host, page, page->index);
return 0;
@@ -107,14 +107,7 @@ static int ceph_set_page_dirty(struct page *page)
 snapc, snapc->seq, snapc->num_snaps);
spin_unlock(&ci->i_ceph_lock);
 
-   /* now adjust page */
-   spin_lock_irq(&mapping->tree_lock);
if (page->mapping) {/* Race with truncate? */
-   WARN_ON_ONCE(!PageUptodate(page));
-   account_page_dirtied(page, page->mapping);
-   radix_tree_tag_set(&mapping->page_tree,
-   page_index(page), PAGECACHE_TAG_DIRTY);
-
/*
 * Reference snap context in page->private.  Also set
 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +119,10 @@ static int ceph_set_page_dirty(struct page *page)
undo = 1;
}
 
-   spin_unlock_irq(&mapping->tree_lock);
-
if (undo)
/* whoops, we failed to dirty the page */
ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
BUG_ON(!PageDirty(page));
return 1;
 }
-- 
1.7.9.5



Re: LFS & Ceph

2013-08-01 Thread Chmouel Boudjnah
Hello,

Sorry for the late answer as I was travelling lately.

The LFS work has been in a heavy state of progress by Peter
(in CC) and others; there is some documentation in this review:

https://review.openstack.org/#/c/30051/

(summarized in Pete's gist here
https://gist.github.com/portante/5488238/raw/66b0bf2a91a8ca75301fa68dc0fef2d3dc76e5a2/gistfile1.txt)

some preliminary work here :

https://review.openstack.org/#/c/35381/

and some old documentation here :

https://raw.github.com/zaitcev/swift-lfs/master/doc/source/lfs_plugin.rst

I am sure Pete would appreciate the feedback.

Cheers,
Chmouel.

On 24/07/2013 17:00, Loic Dachary wrote:
> Hi,
>
> Thanks for taking the time to discuss LFS today @ OSCON :-) Would you be so 
> kind as to send links to the current discussion about the LFS driver API ?
>
> Cheers
>

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.


still recovery issues with cuttlefish

2013-08-01 Thread Stefan Priebe - Profihost AG
Hi,

I still have recovery issues with cuttlefish. After the OSD comes back,
it seems to hang for around 2-4 minutes and then recovery seems to start
(pgs in recovery_wait start to decrement). This is with ceph 0.61.7. I
get a lot of slow request messages and hanging VMs.

What I noticed today is that if I leave the OSD off until ceph
starts to backfill, the recovery and "re"-backfilling go absolutely
smoothly, without any issues and no slow request messages at all.

Does anybody have an idea why?

Greets,
Stefan


Re: Re: question about striped_read

2013-08-01 Thread Yan, Zheng
On Thu, Aug 1, 2013 at 2:30 PM, majianpeng  wrote:
>>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng  wrote:
On Wed, Jul 31, 2013 at 3:32 PM, majianpeng  wrote:
>
>  [snip]
> Test case
> A: touch file
> dd if=file of=/dev/null bs=5M count=1 iflag=direct
> B: [data(2M)|hole(2m)][data(2M)]
>dd if=file of=/dev/null bs=8M count=1 iflag=direct
> C: [data(4M)[hole(4M)][hole(4M)][data(2M)]
>   dd if=file of=/dev/null bs=16M count=1 iflag=direct
> D: touch file;truncate -s 5M file
>   dd if=file of=/dev/null bs=8M count=1 iflag=direct
>
> Those cases can work.
> Now I handle short reads differently for 'ret > 0' and 'ret == 0'.
> For a short read where ret > 0, it doesn't issue another read but instead
> zeroes the remaining area.
> This removes one meaningless read operation.
>

This patch looks good. But I still hope not to duplicate code.

how about change
 "hit_stripe = this_len < left;"
to
 "hit_stripe = this_len < left && (ret == this_len || pos + this_len <
inode->i_size);"

>>> To make the code easy to understand, I didn't apply your suggestion, but I
>>> added this check to the judgement of
>>> whether to read more content.
>>> The follow is the latest patch.Can you check?
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 2ddf061..3d8d14d 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -349,44 +349,38 @@ more:
>>> dout("striped_read %llu~%u (read %u) got %d%s%s\n", pos, left, read,
>>>  ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : 
>>> "");
>>>
>>> -   if (ret > 0) {
>>> -   int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>>> -
>>> -   if (read < pos - off) {
>>> -   dout(" zero gap %llu to %llu\n", off + read, pos);
>>> -   ceph_zero_page_vector_range(page_align + read,
>>> -   pos - off - read, 
>>> pages);
>>> +   if (ret >= 0) {
>>> +   int  didpages;
>>> +   if (was_short && (pos + ret < inode->i_size)) {
>>> +   u64 tmp = min(this_len - ret,
>>> +inode->i_size - pos - ret);
>>> +   dout(" zero gap %llu to %llu\n",
>>> +   pos + ret, pos + ret + tmp);
>>> +   ceph_zero_page_vector_range(page_align + read + ret,
>>> +   tmp, pages);
>>> +   ret += tmp;
>>> }
>>> +
>>> +   didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
>>> pos += ret;
>>> read = pos - off;
>>> left -= ret;
>>> page_pos += didpages;
>>> pages_left -= didpages;
>>>
>>> -   /* hit stripe? */
>>> -   if (left && hit_stripe)
>>> +   /* hit stripe and need continue*/
>>> +   if (left && hit_stripe && pos < inode->i_size)
>>> goto more;
>>> +
>>> }
>>>
>>> -   if (was_short) {
>>> +   if (ret >= 0) {
>>> +   ret = read;
>>> /* did we bounce off eof? */
>>> if (pos + left > inode->i_size)
>>> *checkeof = 1;
>>> -
>>> -   /* zero trailing bytes (inside i_size) */
>>> -   if (left > 0 && pos < inode->i_size) {
>>> -   if (pos + left > inode->i_size)
>>> -   left = inode->i_size - pos;
>>> -
>>> -   dout("zero tail %d\n", left);
>>> -   ceph_zero_page_vector_range(page_align + read, left,
>>> -   pages);
>>> -   read += left;
>>> -   }
>>> }
>>>
>>> -   if (ret >= 0)
>>> -   ret = read;
>>
>>I think this line should be "if (read > 0) ret = read;". Other than
>>this, your patch looks good.
> Because you mentioned this, I noticed that for ceph_sync_read/write the result
> is the && of every striped read/write.
> That is, if we hit one error, the total result is an error; it can't return a
> partial result.

This behavior is not correct. If we read/write some data, then meet an
error, we should return the size we have
read/written. I think all other FS behave like this. See
generic_file_aio_read() and do_generic_file_read().

Regards
Yan, Zheng



> I think I should write another patch for that.
>>You can add "Reviewed-by: Yan, Zheng " to your
>>formal patch.
> Ok ,thanks your times.
>
> Thanks!
> Jianpeng Ma
>>
>>Regards
>>Yan, Zheng
>>
>>
>>> dout("striped_read returns %d\n", ret);
>>> return ret;
>>>  }
>>>
>>> Thanks!
>>> Jianpeng Ma
> Thanks!
> Jianpeng Ma
>>On Thu, Aug 1, 2013 at 9:45 AM, majianpeng  wrote:
On Wed, Jul 31, 2013 at 3:32 PM, maj