Re: rename whilst in use

2013-02-28 Thread Josh Durgin

On 02/27/2013 10:01 PM, Wolfgang Hennerbichler wrote:

On 02/27/2013 11:01 PM, Josh Durgin wrote:


Since it doesn't appear in rbd ls, this suggests that the old
rbd_id.mysql3 object still exists.


Yes. Is there a way I can safely delete it?

rados ls -p rd | grep mysql | less
mysql3.rbd
rbd_id.mysql3.new

=> here it is (and yes, it's format==1). Can I just say rados rm ...
without having to fear that everything breaks down?


If you've just had this problem with a format 1 image, you can safely
rados rm the old header (oldname.rbd). It's the only thing that wouldn't
be cleaned up by a rename while it's in use.


Unfortunately my scrollback buffer isn't long enough to find out about
the actual error I got during the first rename, but it should be easy to
reproduce.


I can't reproduce it (at least not with format 2 images). Can you?


I currently only have this production system to play with, so I can't
really try. But you're right, it's a format 1 image. (That was the use
case: I tried to "convert" a format 1 image to a format 2 image by
creating the .new image and copying stuff over.)


I'd expect this problem with format 1 images, since they don't separate
the header object from the name of the image.


So true. :)


Thanks, it certainly seems like a bug.


As it's a format==1 bug, maybe it's of less importance. So: is rados rm
mysql3.rbd a good idea? :)


If mysql3 is the old name of the image that was renamed while in use,
yes.
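
A minimal sketch of that cleanup, assuming the pool is rd and the stale
header is mysql3.rbd as in the listing above:

rados -p rd stat mysql3.rbd   # confirm the leftover old-name header exists
rados -p rd rm mysql3.rbd     # removes only that object; the data objects
                              # and rbd_id.mysql3.new are untouched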


Josh


Wolfgang

PS: I didn't receive mail (not even your reply) from the mailing list for
about 12 hours; did your mail server break? Is your mail server backed by
ceph-fs? :)


Not yet :) I'm not sure why that happened, I didn't notice any issues
with vger or my other emails today.


[PATCH] libceph: complete lingering requests only once

2013-02-28 Thread Alex Elder
An osd request marked to linger will be re-submitted in the event
a connection to the target osd gets dropped.  Currently, if there
is a callback function associated with a request it will be called
each time a request is submitted--which for lingering requests can
be more than once.

Change it so a request--including lingering ones--will get completed
(from the perspective of the user of the osd client) exactly once.

This resolves:
http://tracker.ceph.com/issues/3967

Signed-off-by: Alex Elder 
---
 include/linux/ceph/osd_client.h |1 +
 net/ceph/osd_client.c   |5 +
 2 files changed, 6 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 1dd5d46..a79f833 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -85,6 +85,7 @@ struct ceph_osd_request {
s32   r_reply_op_result[CEPH_OSD_MAX_OP];
int   r_got_reply;
int   r_linger;
+   int   r_completed;

struct ceph_osd_client *r_osdc;
struct kref   r_kref;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1d9ebf9..a28c976a 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1174,6 +1174,7 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
u32 reassert_epoch;
u64 reassert_version;
u32 osdmap_epoch;
+   int already_completed;
int i;

tid = le64_to_cpu(msg->hdr.tid);
@@ -1282,7 +1283,11 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
((flags & CEPH_OSD_FLAG_WRITE) == 0))
__unregister_request(osdc, req);

+   already_completed = req->r_completed;
+   req->r_completed = 1;
mutex_unlock(&osdc->request_mutex);
+   if (already_completed)
+   goto done;

if (req->r_callback)
req->r_callback(req, msg);
-- 
1.7.9.5



Re: maintanance on osd host

2013-02-28 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:37 PM, Stefan Priebe - Profihost AG
 wrote:
> Hi Greg,
>   Hi Sage,
>
> Am 26.02.2013 21:27, schrieb Gregory Farnum:
>> On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe  
>> wrote:
>> "out" and "down" are quite different — are you sure you tried "down"
>> and not "out"? (You reference out in your first email, rather than
>> down.)
>> -Greg
>
> Sorry, that's it: I misread down / out. Wouldn't it make sense to
> mark the osd automatically down when shutting down via the init script?
> It doesn't seem to make sense to hope for the automatic detection when
> somebody uses the init script.

Yes, yes it would. http://tracker.ceph.com/issues/4267 :)
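
Until that lands, a sketch of the manual sequence (assuming osd.3 is the
daemon being serviced; exact command support varies by version):

ceph osd set noout        # optional: keep the osd from being marked out
                          # (and data rebalanced) during the maintenance
service ceph stop osd.3   # stop the daemon
ceph osd down 3           # tell the monitors immediately, rather than
                          # waiting for failure detection
# ... do the maintenance, restart, then: ceph osd unset noout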
-Greg


Re: OSD memory usage

2013-02-28 Thread Bryan K. Wright
Hi folks,

I've been looking into my problem with OSDs that use
up a lot of memory.  Left running, I've seen them swell to 
over 8 GB of resident memory.  I'd really like to have some
way of limiting the maximum memory footprint of an OSD.
Is there a knob to do this?

I've just today recompiled ceph-osd with tcmalloc
turned on, so I could do some memory profiling.  (The RPMs
from ceph.com don't have it turned on.)  Here's an example
of what I see from pprof:

http://ayesha.phys.virginia.edu/~bryan/junk2.pdf
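
For reference, a sketch of the profiling workflow that produces a dump
like the one above (tcmalloc's heap profiler driven via ceph's tell
interface and rendered with google-perftools' pprof; exact syntax varies
by version):

ceph tell osd.0 heap start_profiler    # begin sampling allocations
ceph tell osd.0 heap dump              # writes osd.0.profile.*.heap files
pprof --pdf /usr/bin/ceph-osd \
  /var/log/ceph/osd.0.profile.*.heap > osd0-heap.pdf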

Any suggestions would be appreciated.

Thanks,
Bryan

-- 

Bryan Wright  |"If you take cranberries and stew them like 
Physics Department| applesauce, they taste much more like prunes
University of Virginia| than rhubarb does."  --  Groucho 
Charlottesville, VA  22901| 
(434) 924-7218| br...@virginia.edu





Re: OSD memory usage

2013-02-28 Thread Samuel Just
Looks like it would have to be the pg_interval_t maps copied into the
MOSDPGNotify messages or something in the OSDMap...  Can you confirm
that all of your OSDs are running the same version?  There are two
paths in that method, one for handling contemporary OSDs and a second
path for handling OSDs prior to 0.5something.  Thanks for the heap
map!
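
One quick way to check per-daemon versions (a sketch, assuming the
default admin socket paths):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok version   # repeat per osd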
-Sam

On Thu, Feb 28, 2013 at 10:57 AM, Bryan K. Wright
 wrote:
> Hi folks,
>
> I've been looking into my problem with OSDs that use
> up a lot of memory.  Left running, I've seen them swell to
> over 8 GB of resident memory.  I'd really like to have some
> way of limiting the maximum memory footprint of an OSD.
> Is there a knob to do this?
>
> I've just today recompiled ceph-osd with tcmalloc
> turned on, so I could do some memory profiling.  (The RPMs
> from ceph.com don't have it turned on.)  Here's an example
> of what I see from pprof:
>
> http://ayesha.phys.virginia.edu/~bryan/junk2.pdf
>
> Any suggestions would be appreciated.
>
> Thanks,
> Bryan
>
> --
> 
> Bryan Wright  |"If you take cranberries and stew them like
> Physics Department| applesauce, they taste much more like prunes
> University of Virginia| than rhubarb does."  --  Groucho
> Charlottesville, VA  22901|
> (434) 924-7218| br...@virginia.edu
> 
>
>


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-28 Thread Jim Schutt
Hi Sage,

On 02/26/2013 12:36 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Jim Schutt wrote:
>>> I think the right solution is to make an option that will setsockopt on 
>>> SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
>>> wip-tcp.  Do you mind checking to see if this addresses the issue (without 
>>> manually adjusting things in /proc)?
>>
>> I'll be happy to test it out...
> 
> That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
> rcvbuf'.

I've verified that I can reproduce the slowdown with the
default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
and 'ms tcp rcvbuf' at 0.

I've also verified that I could not reproduce any slowdown when
I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.
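
For anyone following along, the tested setting would look something like
this in ceph.conf (a sketch; the value is in bytes, and 0 leaves the
kernel's autotuning in effect):

[osd]
    ms tcp rcvbuf = 262144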

So, that's great news - sorry for the delay in testing.

Also, FWIW, I ended up testing with commits cb15e6e0f4 and
c346282940 cherry-picked on top of next as of a day or
so ago (commit f58601d681), as for some reason wip-tcp
wouldn't work for me - ceph-mon was non-responsive in
some way I didn't dig into.

-- Jim

> 
> Thanks!
> sage
> 
> 




Re: OSD memory usage

2013-02-28 Thread Bryan K. Wright

sam.j...@inktank.com said:
>  Can you confirm that all of your OSDs are running the same version? 

Yes, they're all the same version, 0.56.3.

Thanks,
Bryan

-- 

Bryan Wright  |"If you take cranberries and stew them like 
Physics Department| applesauce, they taste much more like prunes
University of Virginia| than rhubarb does."  --  Groucho 
Charlottesville, VA  22901| 
(434) 924-7218| br...@virginia.edu





Re: OSD memory usage

2013-02-28 Thread Samuel Just
In the first message, you indicated that they were running 56.2; did you
upgrade prior to getting the heap dump?
-Sam

On Thu, Feb 28, 2013 at 11:55 AM, Bryan K. Wright
 wrote:
>
> sam.j...@inktank.com said:
>>  Can you confirm that all of your OSDs are running the same version?
>
> Yes, they're all the same version, 0.56.3.
>
> Thanks,
> Bryan
>
> --
> 
> Bryan Wright  |"If you take cranberries and stew them like
> Physics Department| applesauce, they taste much more like prunes
> University of Virginia| than rhubarb does."  --  Groucho
> Charlottesville, VA  22901|
> (434) 924-7218| br...@virginia.edu
> 
>
>


Re: OSD memory usage

2013-02-28 Thread Samuel Just
Oops, misread the first message; you can disregard my last message.
-Sam

On Thu, Feb 28, 2013 at 12:01 PM, Samuel Just  wrote:
> In the first message, you indicated that they were running 56.2; did you
> upgrade prior to getting the heap dump?
> -Sam
>
> On Thu, Feb 28, 2013 at 11:55 AM, Bryan K. Wright
>  wrote:
>>
>> sam.j...@inktank.com said:
>>>  Can you confirm that all of your OSDs are all running the same version?
>>
>> Yes, they're all the same version, 0.56.3.
>>
>> Thanks,
>> Bryan
>>
>> --
>> 
>> Bryan Wright  |"If you take cranberries and stew them like
>> Physics Department| applesauce, they taste much more like prunes
>> University of Virginia| than rhubarb does."  --  Groucho
>> Charlottesville, VA  22901|
>> (434) 924-7218| br...@virginia.edu
>> 
>>
>>


Re: [RFC PATCH 0/6] Understanding delays due to throttling under very heavy write load

2013-02-28 Thread Sage Weil
On Thu, 28 Feb 2013, Jim Schutt wrote:
> Hi Sage,
> 
> On 02/26/2013 12:36 PM, Sage Weil wrote:
> > On Tue, 26 Feb 2013, Jim Schutt wrote:
> >>> I think the right solution is to make an option that will setsockopt on 
> >>> SO_RECVBUF to some value (say, 256KB).  I pushed a branch that does this, 
> >>> wip-tcp.  Do you mind checking to see if this addresses the issue 
> >>> (without 
> >>> manually adjusting things in /proc)?
> >>
> >> I'll be happy to test it out...
> > 
> > That would be great!  It's branch wip-tcp, and the setting is 'ms tcp 
> > rcvbuf'.
> 
> I've verified that I can reproduce the slowdown with the
> default value of 1 for /proc/sys/net/ipv4/tcp_moderate_rcvbuf,
> and 'ms tcp rcvbuf' at 0.
> 
> I've also verified that I could not reproduce any slowdown when
> I configure 'ms tcp rcvbuf' to 256 KiB on OSDs.
> 
> So, that's great news - sorry for the delay in testing.

Awesome--thanks so much for testing that!  Pulling it into master now.
 
> Also, FWIW, I ended up testing with commits cb15e6e0f4 and
> c346282940 cherry-picked on top of next as of a day or
> so ago (commit f58601d681), as for some reason wip-tcp
> wouldn't work for me - ceph-mon was non-responsive in
> some way I didn't dig into.

Yeah, sorry about that.  I rebased wip-tcp a few days ago but you may have 
picked up the previous version.

sage


Re: [ceph-users] snapshot, clone and mount a VM-Image

2013-02-28 Thread Josh Durgin

On 02/16/2013 03:51 AM, Jens Kristian Søgaard wrote:

Hi Sage,


1) Decide what output format to use.  We want to use something that is


I have given it some thought, and my initial suggestion to keep things
simple is to use the QCOW2 image format.

The bird's-eye view of the process would be as follows:

* Initial backup

User supplied information: pool, image name

Create rbd snapshot of the image named "backup_1", where 1 could be a
timestamp or an integer count.

Save the snapshot to a standard qcow2 image. Similar to:

qemu-img convert rbd:data/myimage@backup_1 -O qcow2
data_myimage_backup_1.qcow2

Note: I don't know if qemu-img actually supports reading from snapshots
currently.


It does.


* Incremental backup

User supplied information: pool, image name, path to initial backup or
previous incremental file

Create rbd snapshot of the image named "backup_2", where 2 could be a
timestamp or an integer count.

Determine previous snapshot identifier from given file name.

Determine objects changed from the snapshot given by that identifier and
the newly created snapshot.

Construct QCOW2 L1- and L2-tables in memory from that changeset.

Create new qcow2 image with the previous backup file as the backing
image, and write out the tables and changed blocks.

Delete previous rbd snapshot.


* Restoring and mounting


The use of the QCOW2 format means that we can use existing tools for
restoring and mounting the backups.

To restore a backup the user can simply choose either the initial backup
file or an incremental, and use qemu-img to copy that to a new rbd image.

To mount the initial backup or an incremental, the user can use qemu-nbd
to mount and explore the backup to determine which one to restore.

The performance of restores and mounts would of course be weakened if the
backup consists of a large number of incrementals. In that case the
existing qemu-img tool could be used to flatten the backup.
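
To make that concrete, a sketch of the restore/mount side using the stock
tools (file names follow the naming scheme above; assumes qemu built with
rbd support and the nbd kernel module loaded):

# restore: qcow2 reads follow the backing chain, so converting any
# incremental yields the complete image as of that backup
qemu-img convert data_myimage_backup_2.qcow2 -O rbd rbd:data/myimage_restored

# mount read-only to inspect a backup before deciding what to restore
qemu-nbd --read-only --connect=/dev/nbd0 data_myimage_backup_2.qcow2
mount -o ro /dev/nbd0p1 /mnt/inspect

# flatten a long incremental chain into a standalone image
qemu-img convert -O qcow2 data_myimage_backup_9.qcow2 data_myimage_flat.qcow2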


* Pros/cons

The QCOW2 format supports compression, so we could implement compressed
backups without much effort.

The disadvantage to using QCOW2 like this is that we do not have any
checksumming or safeguards against potential errors such as users
mixing up images.

Another disadvantage to this approach is that vital information is
stored in the actual filename of the backup file. I don't see any place
in the QCOW2 file format for storing this information inside the file,
sadly.

We could opt for storing it inside a plain text file that accompanies
the QCOW2 file, or tarballing the qcow2 file and that plain text file.


qcow2 seems like a good initial format given the existing tools. We
could always add another format later, or wrap it with extra
information like you suggest.

Have you had a chance to start implementing this yet? It'd be great to
get it working in the next month.

Josh


Re: [ceph-users] snapshot, clone and mount a VM-Image

2013-02-28 Thread Sage Weil
On Thu, 28 Feb 2013, Josh Durgin wrote:
> On 02/16/2013 03:51 AM, Jens Kristian Søgaard wrote:
> > We could opt for storing it inside a plain text file that accompanies
> > the QCOW2 file, or tarballing the qcow2 file and that plain text file.
> 
> qcow2 seems like a good initial format given the existing tools. We
> could always add another format later, or wrap it with extra
> information like you suggest.
> 
> Have you had a chance to start implementing this yet? It'd be great to
> get it working in the next month.

Just so you know, David has been working on the librados "list snaps" 
operation that you'll need so the tool can tell which blocks have changed 
or not.  The code is currently in the wip-4207 branch.  We expect it will 
be merged in the next few days, and should be part of v0.59.

I suspect the next step would be a function in the 'rbd' tool that would 
do the export.  Then a similar 'import' tool to go with it?
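
For illustration only, a sketch of the shape such a pair might take (the
subcommand names and flags here are hypothetical, since the interface
hadn't been designed yet):

# hypothetical: write only the blocks that changed between two snapshots
rbd export-diff --from-snap backup_1 data/myimage@backup_2 myimage_1_to_2.diff
# hypothetical: replay that diff onto a copy of the image
rbd import-diff myimage_1_to_2.diff data/myimage_restored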

sage


Re: [ceph-users] snapshot, clone and mount a VM-Image

2013-02-28 Thread Ian Colle
Please see:
http://tracker.ceph.com/issues/4084 rbd: incremental backups

http://tracker.ceph.com/issues/3387 librbd: expose changed objects since a
given snapshot

http://tracker.ceph.com/issues/3272 send/receive rbd snapshots


It would be great if we could track this discussion in those tickets.

Ian R. Colle
Ceph Program Manager
Inktank
Cell: +1.303.601.7713 
Email: i...@inktank.com






On 2/28/13 4:33 PM, "Josh Durgin"  wrote:

>On 02/16/2013 03:51 AM, Jens Kristian Søgaard wrote:
>> Hi Sage,
>>
>>> 1) Decide what output format to use.  We want to use something that is
>>
>> I have given it some thought, and my initial suggestion to keep things
>> simple is to use the QCOW2 image format.
>>
>> The bird's-eye view of the process would be as follows:
>>
>> * Initial backup
>>
>> User supplied information: pool, image name
>>
>> Create rbd snapshot of the image named "backup_1", where 1 could be a
>> timestamp or an integer count.
>>
>> Save the snapshot to a standard qcow2 image. Similar to:
>>
>> qemu-img convert rbd:data/myimage@backup_1 -O qcow2
>> data_myimage_backup_1.qcow2
>>
>> Note: I don't know if qemu-img actually supports reading from snapshots
>> currently.
>
>It does.
>
>> * Incremental backup
>>
>> User supplied information: pool, image name, path to initial backup or
>> previous incremental file
>>
>> Create rbd snapshot of the image named "backup_2", where 2 could be a
>> timestamp or an integer count.
>>
>> Determine previous snapshot identifier from given file name.
>>
>> Determine objects changed from the snapshot given by that identifier and
>> the newly created snapshot.
>>
>> Construct QCOW2 L1- and L2-tables in memory from that changeset.
>>
>> Create new qcow2 image with the previous backup file as the backing
>> image, and write out the tables and changed blocks.
>>
>> Delete previous rbd snapshot.
>>
>>
>> * Restoring and mounting
>>
>>
>> The use of the QCOW2 format means that we can use existing tools for
>> restoring and mounting the backups.
>>
>> To restore a backup the user can simply choose either the initial backup
>> file or an incremental, and use qemu-img to copy that to a new rbd
>>image.
>>
>> To mount the initial backup or an incremental, the user can use qemu-nbd
>> to mount and explore the backup to determine which one to restore.
>>
>> The performance of restores and mounts would of course be weakened if the
>> backup consists of a large number of incrementals. In that case the
>> existing qemu-img tool could be used to flatten the backup.
>>
>>
>> * Pros/cons
>>
>> The QCOW2 format supports compression, so we could implement compressed
>> backups without much effort.
>>
>> The disadvantage to using QCOW2 like this is that we do not have any
>> checksumming or safeguards against potential errors such as users
>> mixing up images.
>>
>> Another disadvantage to this approach is that vital information is
>> stored in the actual filename of the backup file. I don't see any place
>> in the QCOW2 file format for storing this information inside the file,
>> sadly.
>>
>> We could opt for storing it inside a plain text file that accompanies
>> the QCOW2 file, or tarballing the qcow2 file and that plain text file.
>
>qcow2 seems like a good initial format given the existing tools. We
>could always add another format later, or wrap it with extra
>information like you suggest.
>
>Have you had a chance to start implementing this yet? It'd be great to
>get it working in the next month.
>
>Josh


Re: Mon losing touch with OSDs

2013-02-28 Thread Chris Dunlop
On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
>> On Sat, 23 Feb 2013, Chris Dunlop wrote:
>>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
 On Sat, 23 Feb 2013, Chris Dunlop wrote:
> On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
>> I just looked at the logs.  I can't tell what happened to cause that 
>> 10 
>> second delay.. strangely, messages were passing from 0 -> 1, but 
>> nothing 
>> came back from 1 -> 0 (although 1 was queuing, if not sending, them).
>>> 
>>> Is there any way of telling where they were delayed, i.e. in the 1's 
>>> output
>>> queue or 0's input queue?
>> 
>> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
>> generate a lot of logging, though.
> 
> I really don't want to load the system with too much logging, but I'm 
> happy
> modifying code...  Are there specific interesting debug outputs which I 
> can
> modify so they're output under "ms = 1"?
 
 I'm basically interested in everything in writer() and write_message(), 
 and reader() and read_message()...
>>> 
>>> Like this?
>> 
>> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
>> that this is the lion's share of what debug 20 will spam to the log, but 
>> hopefully the load is manageable!
> 
> Good idea on the '2'. I'll get that installed and wait for it to happen again.

FYI...

To avoid running out of disk space for the massive logs, I
started using logrotate on the ceph logs every two hours, which
does a 'service ceph reload' to re-open the log files.
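
For reference, a minimal sketch of that arrangement (logrotate has no
sub-daily schedule of its own, so the config below is invoked as
'logrotate -f ceph-2hourly.conf' from a 2-hourly cron entry; paths assume
the stock layout):

/var/log/ceph/*.log {
    rotate 12
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        service ceph reload >/dev/null 2>&1 || true
    endscript
}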

In the week since doing that I haven't seen any 'slow requests'
at all (the load has stayed the same as before the change),
which means the issue with the osds dropping out, then the
system not recovering properly, also hasn't happened.

That's a bit suspicious, no?

I've now put the log dirs on each machine on their own 2TB
partition and reverted back to the default daily rotates.

And once more we're waiting... Godot, is that you?


Chris


Re: Mon losing touch with OSDs

2013-02-28 Thread Sage Weil
On Fri, 1 Mar 2013, Chris Dunlop wrote:
> On Sat, Feb 23, 2013 at 01:02:53PM +1100, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 05:52:11PM -0800, Sage Weil wrote:
> >> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>> On Fri, Feb 22, 2013 at 05:30:04PM -0800, Sage Weil wrote:
>  On Sat, 23 Feb 2013, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 04:13:21PM -0800, Sage Weil wrote:
> >> On Sat, 23 Feb 2013, Chris Dunlop wrote:
> >>> On Fri, Feb 22, 2013 at 03:43:22PM -0800, Sage Weil wrote:
>  On Sat, 23 Feb 2013, Chris Dunlop wrote:
> > On Fri, Feb 22, 2013 at 01:57:32PM -0800, Sage Weil wrote:
> >> I just looked at the logs.  I can't tell what happened to cause 
> >> that 10 
> >> second delay.. strangely, messages were passing from 0 -> 1, but 
> >> nothing 
> >> came back from 1 -> 0 (although 1 was queuing, if not sending, 
> >> them).
> >>> 
> >>> Is there any way of telling where they were delayed, i.e. in the 1's 
> >>> output
> >>> queue or 0's input queue?
> >> 
> >> Yeah, if you bump it up to 'debug ms = 20'.  Be aware that that will 
> >> generate a lot of logging, though.
> > 
> > I really don't want to load the system with too much logging, but I'm 
> > happy
> > modifying code...  Are there specific interesting debug outputs which I 
> > can
> > modify so they're output under "ms = 1"?
>  
>  I'm basically interested in everything in writer() and write_message(), 
>  and reader() and read_message()...
> >>> 
> >>> Like this?
> >> 
> >> Yeah.  You could do 2 instead of 1 so you can turn it down.  I suspect 
> >> that this is the lion's share of what debug 20 will spam to the log, but 
> >> hopefully the load is manageable!
> > 
> > Good idea on the '2'. I'll get that installed and wait for it to happen 
> > again.
> 
> FYI...
> 
> To avoid running out of disk space for the massive logs, I
> started using logrotate on the ceph logs every two hours, which
> does a 'service ceph reload' to re-open the log files.
> 
> In the week since doing that I haven't seen any 'slow requests'
> at all (the load has stayed the same as before the change),
> which means the issue with the osds dropping out, then the
> system not recovering properly, also hasn't happened.
> 
> That's a bit suspicious, no?

I suspect the logging itself is changing the timing.  Let's wait and see 
if we get lucky... 

sage

> 
> I've now put the log dirs on each machine on their own 2TB
> partition and reverted back to the default daily rotates.
> 
> And once more we're waiting... Godot, is that you?
> 
> 
> Chris


[PATCH 0/7] ceph: misc fixes

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

These patches are also in:
  git://github.com/ukernel/linux.git wip-ceph

Regards
Yan, Zheng


[PATCH 1/7] ceph: fix LSSNAP regression

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

commit 6e8575faa8 makes parse_reply_info_extra() return -EIO for LSSNAP replies.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/mds_client.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index d958420..608ffcf 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -265,7 +265,8 @@ static int parse_reply_info_extra(void **p, void *end,
 {
if (info->head->op == CEPH_MDS_OP_GETFILELOCK)
return parse_reply_info_filelock(p, end, info, features);
-   else if (info->head->op == CEPH_MDS_OP_READDIR)
+   else if (info->head->op == CEPH_MDS_OP_READDIR ||
+info->head->op == CEPH_MDS_OP_LSSNAP)
return parse_reply_info_dir(p, end, info, features);
else if (info->head->op == CEPH_MDS_OP_CREATE)
return parse_reply_info_create(p, end, info, features);
-- 
1.7.11.7



[PATCH 2/7] ceph: queue cap release when trimming cap

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

So the client will later send a cap release message to the MDS.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/caps.c   | 6 +++---
 fs/ceph/mds_client.c | 2 ++
 fs/ceph/super.h  | 2 ++
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 1e1e020..5d5c32b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -997,9 +997,9 @@ static int send_cap_msg(struct ceph_mds_session *session,
return 0;
 }
 
-static void __queue_cap_release(struct ceph_mds_session *session,
-   u64 ino, u64 cap_id, u32 migrate_seq,
-   u32 issue_seq)
+void __queue_cap_release(struct ceph_mds_session *session,
+u64 ino, u64 cap_id, u32 migrate_seq,
+u32 issue_seq)
 {
struct ceph_msg *msg;
struct ceph_mds_cap_release *head;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 608ffcf..ccc68b0 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1197,6 +1197,8 @@ static int trim_caps_cb(struct inode *inode, struct ceph_cap *cap, void *arg)
session->s_trim_caps--;
if (oissued) {
/* we aren't the only cap.. just remove us */
+   __queue_cap_release(session, ceph_ino(inode), cap->cap_id,
+   cap->mseq, cap->issue_seq);
__ceph_remove_cap(cap);
} else {
/* try to drop referring dentries */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 604526a..4353ebc 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -755,6 +755,8 @@ static inline void ceph_remove_cap(struct ceph_cap *cap)
 extern void ceph_put_cap(struct ceph_mds_client *mdsc,
 struct ceph_cap *cap);
 
+extern void __queue_cap_release(struct ceph_mds_session *session, u64 ino,
+   u64 cap_id, u32 migrate_seq, u32 issue_seq);
 extern void ceph_queue_caps_release(struct inode *inode);
extern int ceph_write_inode(struct inode *inode, struct writeback_control *wbc);
 extern int ceph_fsync(struct file *file, loff_t start, loff_t end,
-- 
1.7.11.7



[PATCH 3/7] ceph: set mds_want according to cap import message

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

The MDS ignores cap update messages if migrate_seq mismatches, so when
receiving a cap import message with a higher migrate_seq, set mds_wanted
according to the cap import message.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/caps.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5d5c32b..61f3833 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -553,6 +553,7 @@ retry:
cap->implemented = 0;
cap->mds = mds;
cap->mds_wanted = 0;
+   cap->mseq = 0;
 
cap->ci = ci;
__insert_cap_node(ci, cap);
@@ -628,7 +629,10 @@ retry:
cap->cap_id = cap_id;
cap->issued = issued;
cap->implemented |= issued;
-   cap->mds_wanted |= wanted;
+   if (mseq > cap->mseq)
+   cap->mds_wanted = wanted;
+   else
+   cap->mds_wanted |= wanted;
cap->seq = seq;
cap->issue_seq = seq;
cap->mseq = mseq;
-- 
1.7.11.7



[PATCH 4/7] ceph: use I_COMPLETE inode flag instead of D_COMPLETE flag

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

commit c6ffe10015 moved the flag that tracks whether the dcache contents
for a directory are complete from the inode to the dentry. The problem
is there are lots of places that use ceph_dir_{set,clear,test}_complete()
while holding i_ceph_lock, but ceph_dir_{set,clear,test}_complete() may
sleep because they call dput().

This patch basically reverts that commit. For ceph_d_prune(), it's
called with both the dentry to prune and the parent dentry locked, so
it's safe to access the parent dentry's d_inode and clear its
I_COMPLETE flag.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/caps.c   |  8 ---
 fs/ceph/dir.c| 62 ++--
 fs/ceph/inode.c  | 30 +++--
 fs/ceph/mds_client.c |  6 ++---
 fs/ceph/super.h  | 23 ++-
 5 files changed, 34 insertions(+), 95 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 61f3833..76634f4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -490,15 +490,17 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap,
ci->i_rdcache_gen++;
 
/*
-* if we are newly issued FILE_SHARED, clear D_COMPLETE; we
+* if we are newly issued FILE_SHARED, clear I_COMPLETE; we
 * don't know what happened to this directory while we didn't
 * have the cap.
 */
if ((issued & CEPH_CAP_FILE_SHARED) &&
(had & CEPH_CAP_FILE_SHARED) == 0) {
ci->i_shared_gen++;
-   if (S_ISDIR(ci->vfs_inode.i_mode))
-   ceph_dir_clear_complete(&ci->vfs_inode);
+   if (S_ISDIR(ci->vfs_inode.i_mode)) {
+   dout(" marking %p NOT complete\n", &ci->vfs_inode);
+   ci->i_ceph_flags &= ~CEPH_I_COMPLETE;
+   }
}
 }
 
diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 8c1aabe..76821be 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -107,7 +107,7 @@ static unsigned fpos_off(loff_t p)
  * falling back to a "normal" sync readdir if any dentries in the dir
  * are dropped.
  *
- * D_COMPLETE tells indicates we have all dentries in the dir.  It is
+ * I_COMPLETE tells indicates we have all dentries in the dir.  It is
  * defined IFF we hold CEPH_CAP_FILE_SHARED (which will be revoked by
  * the MDS if/when the directory is modified).
  */
@@ -198,8 +198,8 @@ more:
filp->f_pos++;
 
/* make sure a dentry wasn't dropped while we didn't have parent lock */
-   if (!ceph_dir_test_complete(dir)) {
-   dout(" lost D_COMPLETE on %p; falling back to mds\n", dir);
+   if (!ceph_i_test(dir, CEPH_I_COMPLETE)) {
+   dout(" lost I_COMPLETE on %p; falling back to mds\n", dir);
err = -EAGAIN;
goto out;
}
@@ -284,7 +284,7 @@ static int ceph_readdir(struct file *filp, void *dirent, filldir_t filldir)
if ((filp->f_pos == 2 || fi->dentry) &&
!ceph_test_mount_opt(fsc, NOASYNCREADDIR) &&
ceph_snap(inode) != CEPH_SNAPDIR &&
-   ceph_dir_test_complete(inode) &&
+   (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1)) {
spin_unlock(&ci->i_ceph_lock);
err = __dcache_readdir(filp, dirent, filldir);
@@ -350,7 +350,7 @@ more:
 
if (!req->r_did_prepopulate) {
dout("readdir !did_prepopulate");
-   fi->dir_release_count--;/* preclude D_COMPLETE */
+   fi->dir_release_count--;/* preclude I_COMPLETE */
}
 
/* note next offset and last dentry name */
@@ -429,7 +429,8 @@ more:
 */
spin_lock(&ci->i_ceph_lock);
if (ci->i_release_count == fi->dir_release_count) {
-   ceph_dir_set_complete(inode);
+   dout(" marking %p complete\n", inode);
+   ci->i_ceph_flags |= CEPH_I_COMPLETE;
ci->i_max_offset = filp->f_pos;
}
spin_unlock(&ci->i_ceph_lock);
@@ -604,7 +605,7 @@ static struct dentry *ceph_lookup(struct inode *dir, struct dentry *dentry,
fsc->mount_options->snapdir_name,
dentry->d_name.len) &&
!is_root_ceph_dentry(dir, dentry) &&
-   ceph_dir_test_complete(dir) &&
+   (ci->i_ceph_flags & CEPH_I_COMPLETE) &&
(__ceph_caps_issued_mask(ci, CEPH_CAP_FILE_SHARED, 1))) {
spin_unlock(&ci->i_ceph_lock);
dout(" dir %p complete, -ENOENT\n", dir);
@@ -908,7 +909,7 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry,
 */
 
/* d_move screws up d_subdirs order */
-   ceph_dir_clear_complete(new_dir);
+   ceph_i_clear(new_dir, CEPH_I_COMPLETE);
 
	d_move(old_dentry, new_dentry);

[PATCH 5/7] ceph: revert commit 22cddde104

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

commit 22cddde104 breaks the atomicity of the write operation; it also
introduces a deadlock between write and truncate.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/addr.c   | 51 +++-
 fs/ceph/file.c   | 73 +++-
 fs/ceph/mds_client.c |  1 +
 3 files changed, 48 insertions(+), 77 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cfef3e0..d662025 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1067,51 +1067,23 @@ static int ceph_write_begin(struct file *file, struct address_space *mapping,
struct page **pagep, void **fsdata)
 {
struct inode *inode = file->f_dentry->d_inode;
-   struct ceph_inode_info *ci = ceph_inode(inode);
-   struct ceph_file_info *fi = file->private_data;
struct page *page;
pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-   int r, want, got = 0;
-
-   if (fi->fmode & CEPH_FILE_MODE_LAZY)
-   want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
-   else
-   want = CEPH_CAP_FILE_BUFFER;
-
-   dout("write_begin %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
-inode, ceph_vinop(inode), pos, len, inode->i_size);
-   r = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, &got, pos+len);
-   if (r < 0)
-   return r;
-   dout("write_begin %p %llx.%llx %llu~%u  got cap refs on %s\n",
-inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
-   if (!(got & (CEPH_CAP_FILE_BUFFER|CEPH_CAP_FILE_LAZYIO))) {
-   ceph_put_cap_refs(ci, got);
-   return -EAGAIN;
-   }
+   int r;
 
do {
/* get a page */
page = grab_cache_page_write_begin(mapping, index, 0);
-   if (!page) {
-   r = -ENOMEM;
-   break;
-   }
+   if (!page)
+   return -ENOMEM;
+   *pagep = page;
 
dout("write_begin file %p inode %p page %p %d~%d\n", file,
 inode, page, (int)pos, (int)len);
 
r = ceph_update_writeable_page(file, pos, len, page);
-   if (r)
-   page_cache_release(page);
} while (r == -EAGAIN);
 
-   if (r) {
-   ceph_put_cap_refs(ci, got);
-   } else {
-   *pagep = page;
-   *(int *)fsdata = got;
-   }
return r;
 }
 
@@ -1125,12 +1097,10 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
  struct page *page, void *fsdata)
 {
struct inode *inode = file->f_dentry->d_inode;
-   struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
struct ceph_mds_client *mdsc = fsc->mdsc;
unsigned from = pos & (PAGE_CACHE_SIZE - 1);
int check_cap = 0;
-   int got = (unsigned long)fsdata;
 
dout("write_end file %p inode %p page %p %d~%d (%d)\n", file,
 inode, page, (int)pos, (int)copied, (int)len);
@@ -1153,19 +1123,6 @@ static int ceph_write_end(struct file *file, struct address_space *mapping,
up_read(&mdsc->snap_rwsem);
page_cache_release(page);
 
-   if (copied > 0) {
-   int dirty;
-   spin_lock(&ci->i_ceph_lock);
-   dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
-   spin_unlock(&ci->i_ceph_lock);
-   if (dirty)
-   __mark_inode_dirty(inode, dirty);
-   }
-
-   dout("write_end %p %llx.%llx %llu~%u  dropping cap refs on %s\n",
-inode, ceph_vinop(inode), pos, len, ceph_cap_string(got));
-   ceph_put_cap_refs(ci, got);
-
if (check_cap)
ceph_check_caps(ceph_inode(inode), CHECK_CAPS_AUTHONLY, NULL);
 
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 9c4325e..a949805 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -718,53 +718,63 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct ceph_osd_client *osdc =
&ceph_sb_to_client(inode->i_sb)->client->osdc;
loff_t endoff = pos + iov->iov_len;
-   int got = 0;
-   int ret, err, written;
+   int want, got = 0;
+   int ret, err;
 
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
 retry_snap:
-   written = 0;
if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
return -ENOSPC;
__ceph_do_pending_vmtruncate(inode);
+   dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
+inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
+inode->i_size);
+   if (fi->fmode & CEPH_FILE_MODE_LAZY)
+   want = CEPH_CAP_FILE_BUFFER | CEPH_CAP_FILE_LAZYIO;
+   else
+   want = CEPH_CAP_FILE_BUFFER;
+  

[PATCH 6/7] ceph: don't early drop Fw cap

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

ceph_aio_write() has an optimization that marks the CEPH_CAP_FILE_WR
cap dirty before data is copied to the page cache and the inode size is
updated. The optimization avoids slow cap revocation caused by
balance_dirty_pages(), but introduces an inode size update race: if
ceph_check_caps() flushes the dirty cap before the inode size is
updated, the MDS can miss the new inode size. So just remove the
optimization.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/file.c | 42 +-
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index a949805..28ef273 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -724,9 +724,12 @@ static ssize_t ceph_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
+   sb_start_write(inode->i_sb);
 retry_snap:
-   if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL))
-   return -ENOSPC;
+   if (ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_FULL)) {
+   ret = -ENOSPC;
+   goto out;
+   }
__ceph_do_pending_vmtruncate(inode);
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
 inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
@@ -750,29 +753,10 @@ retry_snap:
ret = ceph_sync_write(file, iov->iov_base, iov->iov_len,
&iocb->ki_pos);
} else {
-   /*
-* buffered write; drop Fw early to avoid slow
-* revocation if we get stuck on balance_dirty_pages
-*/
-   int dirty;
-
-   spin_lock(&ci->i_ceph_lock);
-   dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_FILE_WR);
-   spin_unlock(&ci->i_ceph_lock);
-   ceph_put_cap_refs(ci, got);
-
-   ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
-   if ((ret >= 0 || ret == -EIOCBQUEUED) &&
-   ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host)
-|| ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
-   err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
-   if (err < 0)
-   ret = err;
-   }
-
-   if (dirty)
-   __mark_inode_dirty(inode, dirty);
-   goto out;
+   mutex_lock(&inode->i_mutex);
+   ret = __generic_file_aio_write(iocb, iov, nr_segs,
+  &iocb->ki_pos);
+   mutex_unlock(&inode->i_mutex);
}
 
if (ret >= 0) {
@@ -790,12 +774,20 @@ out_put:
 ceph_cap_string(got));
ceph_put_cap_refs(ci, got);
 
+   if (ret >= 0 &&
+   ((file->f_flags & O_SYNC) || IS_SYNC(file->f_mapping->host) ||
+ceph_osdmap_flag(osdc->osdmap, CEPH_OSDMAP_NEARFULL))) {
+   err = vfs_fsync_range(file, pos, pos + ret - 1, 1);
+   if (err < 0)
+   ret = err;
+   }
 out:
if (ret == -EOLDSNAPC) {
dout("aio_write %p %llx.%llx %llu~%u got EOLDSNAPC, retrying\n",
 inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len);
goto retry_snap;
}
+   sb_end_write(inode->i_sb);
 
return ret;
 }
-- 
1.7.11.7



[PATCH 7/7] ceph: acquire i_mutex in __ceph_do_pending_vmtruncate

2013-02-28 Thread Yan, Zheng
From: "Yan, Zheng" 

Make __ceph_do_pending_vmtruncate() acquire i_mutex if the caller
does not hold the mutex, so that ceph_aio_read() can call it safely.

Signed-off-by: Yan, Zheng 
---
 fs/ceph/file.c  |  6 +++---
 fs/ceph/inode.c | 18 +-
 fs/ceph/super.h |  2 +-
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 28ef273..b9eedd4 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -653,7 +653,7 @@ static ssize_t ceph_aio_read(struct kiocb *iocb, const struct iovec *iov,
dout("aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n",
 inode, ceph_vinop(inode), pos, (unsigned)len, inode);
 again:
-   __ceph_do_pending_vmtruncate(inode);
+   __ceph_do_pending_vmtruncate(inode, true);
if (fi->fmode & CEPH_FILE_MODE_LAZY)
want = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
else
@@ -730,7 +730,7 @@ retry_snap:
ret = -ENOSPC;
goto out;
}
-   __ceph_do_pending_vmtruncate(inode);
+   __ceph_do_pending_vmtruncate(inode, true);
dout("aio_write %p %llx.%llx %llu~%u getting caps. i_size %llu\n",
 inode, ceph_vinop(inode), pos, (unsigned)iov->iov_len,
 inode->i_size);
@@ -801,7 +801,7 @@ static loff_t ceph_llseek(struct file *file, loff_t offset, int whence)
int ret;
 
mutex_lock(&inode->i_mutex);
-   __ceph_do_pending_vmtruncate(inode);
+   __ceph_do_pending_vmtruncate(inode, false);
 
if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) {
ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 42c5769..2b3fee7 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1416,7 +1416,7 @@ out:
 
 
 /*
- * called by trunc_wq; take i_mutex ourselves
+ * called by trunc_wq;
  *
  * We also truncate in a separate thread as well.
  */
@@ -1427,9 +1427,7 @@ static void ceph_vmtruncate_work(struct work_struct *work)
struct inode *inode = &ci->vfs_inode;
 
dout("vmtruncate_work %p\n", inode);
-   mutex_lock(&inode->i_mutex);
-   __ceph_do_pending_vmtruncate(inode);
-   mutex_unlock(&inode->i_mutex);
+   __ceph_do_pending_vmtruncate(inode, true);
iput(inode);
 }
 
@@ -1453,12 +1451,10 @@ void ceph_queue_vmtruncate(struct inode *inode)
 }
 
 /*
- * called with i_mutex held.
- *
  * Make sure any pending truncation is applied before doing anything
  * that may depend on it.
  */
-void __ceph_do_pending_vmtruncate(struct inode *inode)
+void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock)
 {
struct ceph_inode_info *ci = ceph_inode(inode);
u64 to;
@@ -1491,7 +1487,11 @@ retry:
 ci->i_truncate_pending, to);
spin_unlock(&ci->i_ceph_lock);
 
+   if (needlock)
+   mutex_lock(&inode->i_mutex);
truncate_inode_pages(inode->i_mapping, to);
+   if (needlock)
+   mutex_unlock(&inode->i_mutex);
 
spin_lock(&ci->i_ceph_lock);
if (to == ci->i_truncate_size) {
@@ -1544,7 +1544,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
if (ceph_snap(inode) != CEPH_NOSNAP)
return -EROFS;
 
-   __ceph_do_pending_vmtruncate(inode);
+   __ceph_do_pending_vmtruncate(inode, false);
 
err = inode_change_ok(inode, attr);
if (err != 0)
@@ -1722,7 +1722,7 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr)
 ceph_cap_string(dirtied), mask);
 
ceph_mdsc_put_request(req);
-   __ceph_do_pending_vmtruncate(inode);
+   __ceph_do_pending_vmtruncate(inode, false);
return err;
 out:
spin_unlock(&ci->i_ceph_lock);
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index efbcb56..e5f1875 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -694,7 +694,7 @@ extern int ceph_readdir_prepopulate(struct ceph_mds_request *req,
 extern int ceph_inode_holds_cap(struct inode *inode, int mask);
 
 extern int ceph_inode_set_size(struct inode *inode, loff_t size);
-extern void __ceph_do_pending_vmtruncate(struct inode *inode);
+extern void __ceph_do_pending_vmtruncate(struct inode *inode, bool needlock);
 extern void ceph_queue_vmtruncate(struct inode *inode);
 
 extern void ceph_queue_invalidate(struct inode *inode);
-- 
1.7.11.7
