rbd image association

2013-06-03 Thread Roman Alekseev

Hi,

Is it possible to associate a certain rbd device with the appropriate rbd 
image (for example: image1 <-> /dev/rbd0)?
We always need to map image1 to /dev/rbd0, image2 to /dev/rbd1, etc., 
because I'm using /dev/rbd* in some configuration files.


Thanks

--
Kind regards,

R. Alekseev



Re: rbd image association

2013-06-03 Thread Roman Alekseev

On 03.06.2013 11:08, Wolfgang Hennerbichler wrote:

On Mon, Jun 03, 2013 at 10:52:14AM +0400, Roman Alekseev wrote:

Hi,

Is it possible to associate a certain rbd device with the appropriate
rbd image (for example: image1 <-> /dev/rbd0)?
We always need to map image1 to /dev/rbd0, image2 to /dev/rbd1, etc.,
because I'm using /dev/rbd* in some configuration files.

Hi,

I can't remember exactly, but I think there is something like 
/dev/rbd/pool/imagename which simply symlinks to /dev/rbdX. This should 
solve your issue.
  

Thanks

HTH
Wolfgang


Dear Wolfgang,

I tried the command 'ln -s /dev/rbd1 image1', but it creates the wrong 
symlink. Most likely there is a more specific command to symlink the 
image and device correctly.


Thanks.

--
Kind regards,

R. Alekseev



Re: rbd image association

2013-06-03 Thread Wido den Hollander

On 06/03/2013 09:17 AM, Roman Alekseev wrote:

On 03.06.2013 11:08, Wolfgang Hennerbichler wrote:

On Mon, Jun 03, 2013 at 10:52:14AM +0400, Roman Alekseev wrote:

Hi,

Is it possible to associate a certain rbd device with the appropriate
rbd image (for example: image1 <-> /dev/rbd0)?
We always need to map image1 to /dev/rbd0, image2 to /dev/rbd1, etc.,
because I'm using /dev/rbd* in some configuration files.

Hi,

I can't remember exactly, but I think there is something like
/dev/rbd/pool/imagename which simply symlinks to /dev/rbdX. This
should solve your issue.

Thanks

HTH
Wolfgang


Dear Wolfgang,

I tried the command 'ln -s /dev/rbd1 image1', but it creates the wrong
symlink. Most likely there is a more specific command to symlink the
image and device correctly.



Why are you symlinking? Like Wolfgang mentioned, udev takes care of this 
for you.


Just use /dev/rbd/pool/imagename which is created by udev.
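
For illustration, after mapping an image the udev-created symlink can be 
used directly (pool, image, and device names below are only examples):

# rbd map rbd/image1
# readlink /dev/rbd/rbd/image1
../../rbd0

Pointing the configuration files at /dev/rbd/<pool>/<image> then keeps 
working even if the /dev/rbdX number changes between mappings.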


Thanks.




--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: rbd image association

2013-06-03 Thread Wolfgang Hennerbichler
On Mon, Jun 03, 2013 at 11:17:16AM +0400, Roman Alekseev wrote:
 Dear Wolfgang,
 
 I was trying to use the command ln -s /dev/rbd1 image1 but it
 creates wrong symlink. Most likely there is more specific command to
 symlink the image and device correctly.

No, you should not have to do anything manually. I was saying udev should 
create the symlink for you upon image mapping. If it doesn't, check your udev 
configuration. It does this for me on Ubuntu Server.
 
 Thanks.

Wolfgang

-- 
http://www.wogri.com


Re: rbd image association

2013-06-03 Thread Roman Alekseev

On 03.06.2013 11:34, Wido den Hollander wrote:

udev takes care of this for you.


Do you mean we need to create some rules in the /etc/udev/rules.d/ directory, 
and after running 'rbd -p pool map image' my image will be mapped to the 
appropriate device?


If so, could you please provide me with the rules which should be present 
in that file?


Thanks

--
Kind regards,

R. Alekseev



Re: Speed up 'rbd rm'

2013-06-03 Thread Chris Dunlop
On Thu, May 30, 2013 at 07:04:28PM -0700, Josh Durgin wrote:
 On 05/30/2013 06:40 PM, Chris Dunlop wrote:
 On Thu, May 30, 2013 at 01:50:14PM -0700, Josh Durgin wrote:
 On 05/29/2013 07:23 PM, Chris Dunlop wrote:
 On Wed, May 29, 2013 at 12:21:07PM -0700, Josh Durgin wrote:
 On 05/28/2013 10:59 PM, Chris Dunlop wrote:
 I see there's a new commit to speed up an 'rbd rm':

 http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926

 Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, 
 v0.61.2) to speed up the remove?

 You'll need 537386d906b8c0e395433461dcb03a82eb33f34f as well. It should
 apply cleanly to 0.61.2, and probably 0.56.6 too.

 Thanks. I'll see how I go, I may just leave the 'rm' running all
 weekend rather than futzing around recompiling ceph and getting
 off the mainline track.

# time rbd rm rbd/large-image
Removing image: 36% complete...Terminated
real    2819m37.117s

I.e. 47 hours and only 36% complete before I gave up (I wanted
to restart that server). At that rate it would take 5.5 days to
remove!

 If you're mainly interested in getting rid of the accidentally created 1.5PB
 image, you can just delete the header (and id object if it's format 2)
 and then 'rbd rm' will just remove it from the rbd_directory index, and
 not try to delete all the non-existent data objects.

 Yes, that's my main interest. Sorry, I haven't yet delved far
 into the details of how the rbd stuff hangs together: can you
 give me a hint or point me towards any docs regarding what 'delete
 the header (and id object)' would look like?
 
 For a format 2 image, 'rbd info imagename' will show a block_prefix
 like 'rbd_data.101574b0dc51'.
 
 The random suffix after the '.' is the id of the image.
 For format 2, the header is named after this id, so you'd do:
 
 rados -p poolname rm rbd_header.101574b0dc51
 
 For format 1 images, the header object is named after the image name,
 like 'imagename.rbd'.

 After removing the header object manually, rbd rm will clean up the
 rest.
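
Putting Josh's steps together, the manual cleanup for a format 2 image
would look roughly like this (the id suffix here is the one from Josh's
example; the exact 'rbd info' field name may vary by version):

# rbd info rbd/large-image | grep prefix
        block_name_prefix: rbd_data.101574b0dc51
# rados -p rbd rm rbd_header.101574b0dc51
# rbd rm rbd/large-image   # now only cleans up the rbd_directory entry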

The problematic image is format 2.

If it's tricky to remove manually, it's not doing any harm just
sitting there (is it?), so I guess I can just wait until the
parallelized delete is available in a stable release, i.e.
dumpling, or backported to bobtail or cuttlefish.

Cheers,

Chris.


krbd + format=2 ?

2013-06-03 Thread Chris Dunlop
G'day,

Sage's recent pull message to Linus said:


Please pull the following Ceph patches from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

This is a big pull.  Most of it is the culmination of Alex's work to
implement RBD image layering, which is now complete (yay!).


Am I correct in thinking that 'RBD image layering... is now complete'
implies there should be full(?) support for format=2?

I pulled the for-linus branch (@ 3abef3b) on top of 3.10.0-rc4, and it's
letting me map a format=2 image (created under bobtail); however, reading
from the block device returns zeros rather than the data. The same image
correctly shows data (an NTFS filesystem) when mounted into kvm using librbd.


# uname -r
3.10.0-rc4-00010-g0326739
# rbd ls -l
NAME   SIZE PARENT FMT PROT LOCK 
xxx   1536G          2
# rbd map rbd/xxx
# rbd showmapped
id pool image   snap device
1  rbd  xxx -/dev/rbd1 
# dd if=/dev/rbd1 of=/tmp/xxx count=20480
20480+0 records in
20480+0 records out
10485760 bytes (10 MB) copied, 0.757754 s, 13.8 MB/s
# od -c /tmp/xxx | less
000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
5000



Cheers,

Chris


Re: [PATCH 0/2] librados: Add RADOS locks to the C/C++ API

2013-06-03 Thread Filippos Giannakos

Hi Josh,

On 05/31/2013 10:44 PM, Josh Durgin wrote:

On 05/30/2013 06:02 AM, Filippos Giannakos wrote:

The following patches export the RADOS advisory locks functionality to the
C/C++ librados API. The extra API calls added are inspired by the relevant
functions of librbd.


This looks good to me overall. I wonder if we should create a new
library in the future for these kinds of things that are built on top
of librados. Other generally useful class client operations could go
there, as well as generally useful things built on top of librados,
like methods for striping over many objects.


Thanks for the review. I will incorporate all your suggestions in a new
patch, which I will submit shortly.
As for the new library you mention, it is a good idea, but for now I
think that the basic RADOS locking functionality should live in the core
librados API.
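
As a sketch, the C side of the proposed calls could be used like this
(treat the exact names and signatures as assumptions based on the posted
patches, not a settled API):

/* Hedged sketch: exclusive advisory lock on one object via librados. */
#include <rados/librados.h>
#include <sys/time.h>

static int with_object_lock(rados_ioctx_t io)
{
	struct timeval dur = { .tv_sec = 30 };  /* auto-expire after 30s */
	int r;

	/* Take an exclusive lock named "mylock" on object "obj". */
	r = rados_lock_exclusive(io, "obj", "mylock", "cookie-1",
				 "example", &dur, 0);
	if (r < 0)
		return r;	/* e.g. -EBUSY if someone else holds it */

	/* ... operate on "obj" while holding the lock ... */

	return rados_unlock(io, "obj", "mylock", "cookie-1");
}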

Kind Regards,
--
Filippos.
philipg...@grnet.gr


Re: rbd image association

2013-06-03 Thread Sage Weil
On Mon, 3 Jun 2013, Roman Alekseev wrote:
 On 03.06.2013 11:34, Wido den Hollander wrote:
  udev takes care of this for you.
 
 Do you mean we need to create some rules in the /etc/udev/rules.d/ directory, and
 after running 'rbd -p pool map image' my image will be mapped to the appropriate
 device?

The ceph package installs a udev rules file in that directory; you 
shouldn't have to do anything other than the 'rbd map ..' command.  If it 
is not already present, there must be something wrong with the package on 
the platform you are using.  What OS is it?

 
 If so, could you please provide me with the rules which should be present
 in that file?
 
 Thanks
 
 -- 
 Kind regards,
 
 R. Alekseev
 


Re: rbd image association

2013-06-03 Thread Roman Alekseev

On 03.06.2013 18:49, Sage Weil wrote:

On Mon, 3 Jun 2013, Roman Alekseev wrote:

On 03.06.2013 11:34, Wido den Hollander wrote:

udev takes care of this for you.

Do you mean we need to create some rules in the /etc/udev/rules.d/ directory, and
after running 'rbd -p pool map image' my image will be mapped to the appropriate
device?

The ceph package installs a udev rules file in that directory; you
shouldn't have to do anything other than the 'rbd map ..' command.  If it
is not already present, there must be something wrong with the package on
the platform you are using.  What OS is it?


If so, could you please provide me with the rules which should be present
in that file?

Thanks

--
Kind regards,

R. Alekseev



Yep, you're right; there is only the file 70-persistent-net.rules in the 
/etc/udev/rules.d/ directory.

It is Debian Wheezy, 3.2.0-4-amd64.

thanks

--
Kind regards,

R. Alekseev



Re: rbd image association

2013-06-03 Thread Sage Weil
On Mon, 3 Jun 2013, Roman Alekseev wrote:
 On 03.06.2013 18:49, Sage Weil wrote:
  On Mon, 3 Jun 2013, Roman Alekseev wrote:
   On 03.06.2013 11:34, Wido den Hollander wrote:
udev takes care of this for you.
   Do you mean we need to create some rules in the /etc/udev/rules.d/
   directory, and after running 'rbd -p pool map image' my image will be
   mapped to the appropriate device?
  The ceph package installs a udev rules file in that directory; you
  shouldn't have to do anything other than the 'rbd map ..' command.  If it
  is not already present, there must be something wrong with the package on
  the platform you are using.  What OS is it?
  
   If so, could you please provide me with the rules which should be
   present in that file?
   
   Thanks
   
   -- 
   Kind regards,
   
   R. Alekseev
   
 Yep, you're right; there is only the file 70-persistent-net.rules in the
 /etc/udev/rules.d/ directory.
 It is Debian Wheezy, 3.2.0-4-amd64.

The file you want is

/lib/udev/rules.d/50-rbd.rules

which is part of the librbd1 package.  Run 'dpkg -L librbd1' to verify it 
is there...
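
For reference, that file contains a single rule along these lines
(paraphrased from memory, so check the installed copy; it runs
ceph-rbdnamer to resolve the pool and image name for the symlink):

KERNEL=="rbd[0-9]*", PROGRAM="/usr/bin/ceph-rbdnamer %n", SYMLINK+="rbd/%c{1}/%c{2}"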

sage


ceph branch status

2013-06-03 Thread ceph branch robot
-- All Branches --

Alex Elder el...@inktank.com
2013-05-21 14:37:01 -0500   wip-rbd-testing

Babu Shanmugam a...@enovance.com
2013-05-30 10:28:23 +0530   wip-rgw-geo-enovance

Dan Mick dan.m...@inktank.com
2012-12-18 12:27:36 -0800   wip-rbd-striping
2013-03-15 17:27:54 -0700   wip-cephtool-stderr

David Zafman david.zaf...@inktank.com
2013-01-28 20:26:34 -0800   wip-wireshark-zafman
2013-03-22 18:14:10 -0700   wip-snap-test-fix
2013-05-30 17:18:24 -0700   wip-3527

Gary Lowell gary.low...@inktank.com
2013-05-28 13:58:22 -0700   last

Gary Lowell glow...@inktank.com
2013-01-28 22:49:45 -0800   wip-3930
2013-02-05 19:29:11 -0800   wip.cppchecker
2013-02-10 22:21:52 -0800   wip-3955
2013-02-26 19:28:48 -0800   wip-system-leveldb
2013-02-27 22:32:57 -0800   bobtail-leveldb
2013-03-01 18:55:35 -0800   wip-da-spec-1
2013-03-19 11:28:15 -0700   wip-3921
2013-04-11 23:00:05 -0700   wip-init-radosgw
2013-04-17 23:30:11 -0700   wip-4725
2013-04-21 22:06:37 -0700   wip-4752
2013-04-22 14:11:37 -0700   wip-4632
2013-05-10 09:32:33 -0700   wip-build-doc
2013-05-31 11:20:40 -0700   wip-doc-prereq

Greg Farnum g...@inktank.com
2013-02-13 14:46:38 -0800   wip-mds-snap-fix
2013-02-22 19:57:53 -0800   wip-4248-snapid-journaling
2013-05-01 17:06:27 -0700   wip-optracker-4354
2013-05-31 10:35:18 -0700   bobtail
2013-05-31 13:28:31 -0700   wip-rgw-geo-rebase-test
2013-05-31 17:08:27 -0700   wip-rgw-geo-rebase

James Page james.p...@ubuntu.com
2013-02-27 22:50:38 +   wip-debhelper-8

Joao Eduardo Luis joao.l...@inktank.com
2013-04-18 00:01:24 +0100   wip-4521-tool
2013-04-22 15:14:28 +0100   wip-4748
2013-04-24 16:42:11 +0100   wip-4521
2013-04-30 18:45:22 +0100   wip-mon-compact-dbg
2013-05-21 01:46:13 +0100   wip-monstoretool-foo
2013-05-31 16:26:02 +0100   wip-mon-cache-first-last-committed
2013-05-31 18:54:38 +0100   wip-mon-trim
2013-05-31 21:00:28 +0100   wip-mon-trim-b

Joe Buck jbb...@gmail.com
2013-05-02 16:32:33 -0700   wip-buck-add-terasort
2013-05-30 23:02:32 -0700   wip-rgw-geo-enovance-buck

John Wilkins john.wilk...@inktank.com
2012-12-21 15:14:37 -0800   wip-mon-docs

Josh Durgin josh.dur...@inktank.com
2013-03-01 14:45:23 -0800   wip-rbd-workunit-debug
2013-04-29 14:32:00 -0700   wip-rbd-close-image

Matt Benjamin m...@linuxbox.com
2013-05-21 09:45:30 -0700   wip-libcephfs

Noah Watkins noahwatk...@gmail.com
2013-01-05 11:58:38 -0800   wip-localized-read-tests
2013-01-11 12:49:28 -0800   wip-osx-upstream
2013-01-11 13:01:11 -0800   wip-osx
2013-02-01 10:28:26 -0800   wip-java-deb-warning
2013-04-22 15:23:09 -0700   wip-cls-lua
2013-05-30 13:29:42 -0700   wip-hadoop-doc

Roald van Loon roaldvanl...@gmail.com
2012-12-24 22:26:56 +   wip-dout

Sage Weil s...@inktank.com
2012-07-14 17:40:21 -0700   wip-osd-redirect
2012-07-28 13:56:47 -0700   wip-journal
2012-11-13 08:58:57 -0800   wip-fd-simple-cache
2012-11-30 13:47:27 -0800   wip-osd-readhole
2012-12-07 14:38:46 -0800   wip-osd-alloc
2012-12-12 22:18:02 -0800   automake-python
2012-12-14 17:11:31 -0800   wip_cur_perf_journal
2012-12-27 13:38:52 -0800   wip-linuxbox
2013-01-06 20:39:37 -0800   wip-msg-refs
2013-01-16 11:39:28 -0800   wip-client-layout-temp
2013-01-27 11:06:08 -0800   wip-argonaut-leveldb
2013-01-29 13:46:02 -0800   wip-readdir
2013-02-08 10:07:54 -0800   wip-bobtail-logrotate
2013-02-11 07:05:15 -0800   wip-sim-journal-clone
2013-02-12 08:29:29 -0800   wip-dump
2013-02-12 23:17:11 -0800   wip-monc
2013-02-18 20:50:57 -0800   wip-osd-scrub
2013-02-19 13:18:13 -0800   wip-4116-workaround
2013-03-11 17:27:15 -0700   wip-deploy-rgw
2013-03-15 17:10:24 -0700   wip-log-4192
2013-04-15 20:34:47 -0700   bobtail-dc
2013-04-18 13:51:36 -0700   argonaut
2013-04-25 10:38:45 -0700   wip-init2
2013-04-26 12:25:58 -0700   wip-mon-fwd
2013-05-02 14:50:38 -0700   wip-rbd-clear-layering
2013-05-03 09:01:53 -0700   wip-leveldb-reopen
2013-05-07 15:56:57 -0700   wip-ceph-tool
2013-05-08 09:46:53 -0700   wip-4578-bobtail
2013-05-08 13:39:25 -0700   wip-4951
2013-05-10 08:28:50 -0700   wip-rgw-crash
2013-05-22 09:03:53 -0700   wip-4895-cuttlefish
2013-05-22 10:33:35 -0700   unstable
2013-05-22 13:39:58 -0700   wip-notcmalloc
2013-05-23 19:32:56 -0700   wip-libcephfs-rebased
2013-05-27 12:45:33 -0700   wip-mon-compaction
2013-05-27 12:45:41 -0700   wip-cuttlefish-mon
2013-05-28 


Ceph killed by OS because of OOM under high load

2013-06-03 Thread Chen, Xiaoxi
Hi,
As my previous mail reported some weeks ago, we are suffering from OSD 
crashes, OSD flapping, system reboots, etc.; all these stability issues 
really stop us from digging further into Ceph characterization. The good 
news is that we seem to have found the cause. Our experiments are 
explained below.

Environment:
	We have 2 machines, one client and one Ceph node, connected via 10GbE.
	The client machine is very powerful, with 64 cores and 256GB RAM.
	The Ceph machine has 32 cores and 64GB RAM, but we limited the
	available RAM to 8GB via the grub configuration. 12 OSDs on top of
	12x 5400 RPM 1TB disks, with 4x DC S3700 SSDs as journals.
	Both client and Ceph are v0.61.2.
	We run 12 rados bench instances on the client node as a stress load
	against the Ceph node, each instance with 256 concurrent operations.

Experiments and results:
	1. default ceph + default client:  OK
	2. tuned ceph + default client:  FAIL, one OSD killed by the OS due
	   to OOM, with all swap space exhausted. (tuning: large queue ops /
	   large queue bytes / no flusher / sync_flush = true)
	3. tuned ceph WITHOUT large queue bytes + default client:  OK
	4. tuned ceph WITHOUT large queue bytes + aggressive client:  FAIL,
	   one OSD killed by OOM and one suicide because of a 150s op thread
	   timeout. (aggressive client: objecter_inflight_ops and
	   objecter_inflight_bytes both set to 10x the default)

Conclusions:
	a. Under heavy load, some tuning makes Ceph unstable, especially the
	   queue-bytes-related settings (deduced from 1+2+3).
	b. Ceph doesn't control the length of the OSD queue. This is a
	   critical issue: with an aggressive client or many concurrent
	   clients, the OSD queue becomes too long to fit in memory, and the
	   OSD daemon ends up being killed (deduced from 3+4).
	c. Observing OSD daemon memory usage shows that if I use 'killall
	   rados' to kill all the rados bench instances, the ceph-osd daemon
	   does not free the allocated memory; it remains at very high usage.
	   (A freshly started ceph-osd uses ~0.5GB; under load it uses ~6GB;
	   after killing rados it stays at 5-6GB. Restarting ceph resolves
	   this.)

We haven't captured any logs yet, but since it's really easy to reproduce, 
we can reproduce and provide any log/profiling info on request.
Any input/suggestions are highly appreciated. Thanks.




Xiaoxi
 


Re: [ceph-users] Ceph killed by OS because of OOM under high load

2013-06-03 Thread Gregory Farnum
On Mon, Jun 3, 2013 at 8:47 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
 Hi,
 As my previous mail reported some weeks ago, we are suffering from
 OSD crashes, OSD flapping, system reboots, etc.; all these stability
 issues really stop us from digging further into Ceph characterization.
 The good news is that we seem to have found the cause. Our experiments
 are explained below.

 Environment:
 We have 2 machines, one client and one Ceph node, connected via 10GbE.
 The client machine is very powerful, with 64 cores and 256GB RAM.
 The Ceph machine has 32 cores and 64GB RAM, but we limited the
 available RAM to 8GB via the grub configuration. 12 OSDs on top of
 12x 5400 RPM 1TB disks, with 4x DC S3700 SSDs as journals.
 Both client and Ceph are v0.61.2.
 We run 12 rados bench instances on the client node as a stress load
 against the Ceph node, each instance with 256 concurrent operations.

 Experiments and results:
 1. default ceph + default client:  OK
 2. tuned ceph + default client:  FAIL, one OSD killed by the OS due to
    OOM, with all swap space exhausted. (tuning: large queue ops / large
    queue bytes / no flusher / sync_flush = true)
 3. tuned ceph WITHOUT large queue bytes + default client:  OK
 4. tuned ceph WITHOUT large queue bytes + aggressive client:  FAIL, one
    OSD killed by OOM and one suicide because of a 150s op thread timeout.
    (aggressive client: objecter_inflight_ops and objecter_inflight_bytes
    both set to 10x the default)

 Conclusions:
 a. Under heavy load, some tuning makes Ceph unstable, especially the
    queue-bytes-related settings (deduced from 1+2+3).
 b. Ceph doesn't control the length of the OSD queue. This is a critical
    issue: with an aggressive client or many concurrent clients, the OSD
    queue becomes too long to fit in memory, and the OSD daemon ends up
    being killed (deduced from 3+4).
 c. Observing OSD daemon memory usage shows that if I use 'killall rados'
    to kill all the rados bench instances, the ceph-osd daemon does not
    free the allocated memory; it remains at very high usage. (A freshly
    started ceph-osd uses ~0.5GB; under load it uses ~6GB; after killing
    rados it stays at 5-6GB. Restarting ceph resolves this.)

You don't have enough RAM for your OSDs. We really recommend 1-2GB per
daemon; 600MB/daemon is dangerous. You might be able to make it work,
but you'll definitely need to change the queue lengths and things.
Speaking of which...yes, the OSDs do control their queue lengths, but
it's not dynamic tuning and by default it will let clients stack up
500MB of in-progress writes. With such wimpy systems you'll want to
turn that down, probably alongside various journal and disk wait
queues.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
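
For illustration, the 500MB backlog Greg mentions corresponds to the OSD
client message size cap; a hedged ceph.conf sketch for such a low-memory
box might look like this (values are guesses for 8GB total, not tested
recommendations):

[osd]
	; default is 500MB of in-progress client writes per OSD
	osd client message size cap = 104857600   ; 100MB

[client]
	; client-side throttles on in-flight ops (defaults ~1024 ops / 100MB)
	objecter inflight ops = 512
	objecter inflight op bytes = 52428800     ; 50MB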


Re: [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-03 Thread Davidlohr Bueso
On Fri, 2013-05-31 at 23:07 -0400, Jeff Layton wrote:
 This is not the first attempt at doing this. The conversion to the
 i_lock was originally attempted by Bruce Fields a few years ago. His
 approach was NAK'ed since it involved ripping out the deadlock
 detection. People also really seem to like /proc/locks for debugging, so
 keeping that in is probably worthwhile.

Yep, we need to keep this. FWIW, lslocks(8) relies on /proc/locks.

Thanks,
Davidlohr



Re: rationale for a PGLog::merge_old_entry case

2013-06-03 Thread Samuel Just
In all three cases, we know the authoritative log does not contain an
entry for oe.soid, therefore:

If oe.prior_version  log.tail, we must already have processed an
earlier entry for that object resulting in the object being correctly
marked missing (or not) (specifically, the entry for
oe.prior_version).

If log.tail >= oe.prior_version > eversion_t(), the missing entry
should have need set at oe.prior_version (revise_need).
oe.prior_version cannot be divergent because all divergent entries
must fall within the log (otherwise, we would have backfilled).

If oe.prior_version == eversion_t(), the object no longer exists, and
the object should be removed from the missing set.

Hope that helps.
-Sam

On Sun, Jun 2, 2013 at 4:09 AM, Loic Dachary l...@dachary.org wrote:
 Hi Sam,

 TL;DR:

 When there is no new entry, what is the rationale for merge_old_entry to remove
 the object from missing only if the tail is eversion_t() and the object's
 prior_version is also eversion_t()?
 https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/osd/PGLog.cc#L330

 Long version:

 The conditions are created with:

 info.log_tail = eversion_t();
 oe.soid.hash = 1;
 oe.op = pg_log_entry_t::DELETE;
 oe.prior_version = eversion_t();

 missing.add(oe.soid, eversion_t(1,1), eversion_t());

 as shown in
 https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L467
 I double checked with gdb and when called with

 EXPECT_FALSE(merge_old_entry(t, oe, info, remove_snap, dirty_log));
 https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L481

 it reaches

 missing.rm(oe.soid, missing.missing[oe.soid].need);
 https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/osd/PGLog.cc#L330

 and the expected side effects are observed:

 EXPECT_FALSE(dirty_log);
 EXPECT_TRUE(remove_snap.empty());
 EXPECT_TRUE(t.empty());
 EXPECT_FALSE(missing.have_missing());
 EXPECT_TRUE(log.empty());
 EXPECT_EQ(0U, ondisklog.length());
 https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L483

 Cheers

 --
 Loïc Dachary, Artisan Logiciel Libre
 All that is necessary for the triumph of evil is that good people do nothing.



Re: [PATCH v1 00/11] locks: scalability improvements for file locking

2013-06-03 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:23PM -0400, Jeff Layton wrote:
 Executive summary (tl;dr version): This patchset represents an overhaul
 of the file locking code with an aim toward improving its scalability
 and making the code a bit easier to understand.

Thanks for working on this, that code could use some love!

 Longer version:
 
 When the BKL was finally ripped out of the kernel in 2010, the strategy
 taken for the file locking code was to simply turn it into a new
 file_lock_locks spinlock. It was an expedient way to deal with the file
 locking code at the time, but having a giant spinlock around all of this
 code is clearly not great for scalability. Red Hat has bug reports that
 go back into the 2.6.18 era that point to BKL scalability problems in
 the file locking code and the file_lock_lock suffers from the same
 issues.
 
 This patchset is my first attempt to make this code less dependent on
 global locking. The main change is to switch most of the file locking
 code to be protected by the inode->i_lock instead of the file_lock_lock.
 
 While that works for most things, there are a couple of global data
 structures (lists in the current code) that need a global lock to
 protect them. So we still need a global lock in order to deal with
 those. The remaining patches are intended to make that global locking
 less painful. The big gain is made by turning the blocked_list into a
 hashtable, which greatly speeds up the deadlock detection code.
 
 I rolled a couple of small programs in order to test this code. The
 first one just forks off 128 children and has them lock and unlock the
 same file 10k times. Running this under time against a file on tmpfs
 gives typical values like this:

What kind of hardware was this?

 
 Unpatched (3.10-rc3-ish):
 real  0m5.283s
 user  0m0.380s
 sys   0m20.469s
 
 Patched (same base kernel):
 real  0m5.099s
 user  0m0.478s
 sys   0m19.662s
 
 ...so there seems to be some modest performance gain in this test. I
 think that's almost entirely due to the change to a hashtable and to
 optimize removing and readding blocked locks to the global lists. Note
 that with this code we have to take two spinlocks instead of just one,
 and that has some performance impact too. So the real peformance gain
 from that hashtable conversion is eaten up to some degree by this.

Might be nice to look at some profiles to confirm all of that.  I'd also
be curious how much variation there was in the results above, as they're
pretty close.

 The next test just forks off a bunch of children that each create their
 own file and then lock and unlock it 20k times. Obviously, the locks in
 this case are uncontended. Running that under time typically gives
 these rough numbers.
 
 Unpatched (3.10-rc3-ish):
 real  0m8.836s
 user  0m1.018s
 sys   0m34.094s
 
 Patched (same base kernel):
 real  0m4.965s
 user  0m1.043s
 sys   0m18.651s
 
 In this test, we see the real benefit of moving to the i_lock for most
 of this code. The run time is almost cut in half in this test. With
 these changes locking different inodes needs very little serialization.
 
 If people know of other file locking performance tests, then I'd be
 happy to try them out too. It's possible that this might make some
 workloads slower, and it would be helpful to know what they are (and
 address them) if so.
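
For concreteness, a minimal sketch of the first test described above might
look like the following; this is an assumed reconstruction, not Jeff's
actual test program:

/* Fork 128 children that each lock and unlock the same file 10k times. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC 128
#define ITERS 10000

int main(void)
{
	int i, fd = open("/tmp/lockfile", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			struct flock fl = { .l_type = F_WRLCK,
					    .l_whence = SEEK_SET };
			int j;

			for (j = 0; j < ITERS; j++) {
				fcntl(fd, F_SETLKW, &fl);  /* acquire */
				fl.l_type = F_UNLCK;
				fcntl(fd, F_SETLK, &fl);   /* release */
				fl.l_type = F_WRLCK;
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}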
 
 This is not the first attempt at doing this. The conversion to the
 i_lock was originally attempted by Bruce Fields a few years ago. His
 approach was NAK'ed since it involved ripping out the deadlock
 detection. People also really seem to like /proc/locks for debugging, so
 keeping that in is probably worthwhile.

Yes, there's already code that depends on it.

The deadlock detection, though--I still wonder if we could get away with
ripping it out.  Might be worth at least giving an option to configure
it out as a first step.
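
A hypothetical Kconfig knob for that might look like the following
(purely illustrative; no such option exists today):

config FILE_LOCKING_DEADLOCK_DETECTION
	bool "Detect deadlocks in POSIX file locks"
	depends on FILE_LOCKING
	default y
	help
	  Walk the blocked-lock graph on each blocking F_SETLKW and fail
	  with EDEADLK when a cycle is found. Saying N skips the scan.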

--b.

 There's more work to be done in this area and this patchset is just a
 start. There's a horrible thundering herd problem when a blocking lock
 is released, for instance. There was also interest in solving the goofy
  "unlock on any close" POSIX lock semantics at this year's LSF. I think
 this patchset will help lay the groundwork for those changes as well.
 
 Comments and suggestions welcome.
 
 Jeff Layton (11):
   cifs: use posix_unblock_lock instead of locks_delete_block
   locks: make generic_add_lease and generic_delete_lease static
   locks: comment cleanups and clarifications
   locks: make added in __posix_lock_file a bool
   locks: encapsulate the fl_link list handling
   locks: convert to i_lock to protect i_flock list
   locks: only pull entries off of blocked_list when they are really
 unblocked
   locks: convert fl_link to a hlist_node
   locks: turn the blocked_list into a hashtable
   locks: add a new lm_owner_key lock operation
   locks: give the blocked_hash its own spinlock
 
  Documentation/filesystems/Locking |   27 +++-
  fs/afs/flock.c|5 +-
  

Re: [PATCH v1 01/11] cifs: use posix_unblock_lock instead of locks_delete_block

2013-06-03 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:24PM -0400, Jeff Layton wrote:
 commit 66189be74 ("CIFS: Fix VFS lock usage for oplocked files") exported
 the locks_delete_block symbol. There's already an exported helper
 function that provides this capability however, so make cifs use that
 instead and turn locks_delete_block back into a static function.
 
 Note that if fl->fl_next == NULL then this lock has already been through
 locks_delete_block(), so we should be OK to ignore an ENOENT error here
 and simply not retry the lock.

ACK.

--b.

 
 Cc: Pavel Shilovsky piastr...@gmail.com
 Signed-off-by: Jeff Layton jlay...@redhat.com
 ---
  fs/cifs/file.c |2 +-
  fs/locks.c |3 +--
  include/linux/fs.h |5 -
  3 files changed, 2 insertions(+), 8 deletions(-)
 
 diff --git a/fs/cifs/file.c b/fs/cifs/file.c
 index 48b29d2..44a4f18 100644
 --- a/fs/cifs/file.c
 +++ b/fs/cifs/file.c
 @@ -999,7 +999,7 @@ try_again:
  	rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
   if (!rc)
   goto try_again;
 - locks_delete_block(flock);
 + posix_unblock_lock(file, flock);
   }
   return rc;
  }
 diff --git a/fs/locks.c b/fs/locks.c
 index cb424a4..7a02064 100644
 --- a/fs/locks.c
 +++ b/fs/locks.c
 @@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock 
 *waiter)
  
  /*
   */
 -void locks_delete_block(struct file_lock *waiter)
 +static void locks_delete_block(struct file_lock *waiter)
  {
   lock_flocks();
   __locks_delete_block(waiter);
   unlock_flocks();
  }
 -EXPORT_SYMBOL(locks_delete_block);
  
  /* Insert waiter into blocker's block list.
   * We use a circular list so that processes can be easily woken up in
 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index 43db02e..b9d7816 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct 
 file_lock **);
  extern int lease_modify(struct file_lock **, int);
  extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
  extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
 -extern void locks_delete_block(struct file_lock *waiter);
  extern void lock_flocks(void);
  extern void unlock_flocks(void);
  #else /* !CONFIG_FILE_LOCKING */
 @@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, 
 loff_t start,
   return 1;
  }
  
 -static inline void locks_delete_block(struct file_lock *waiter)
 -{
 -}
 -
  static inline void lock_flocks(void)
  {
  }
 -- 
 1.7.1
 


Re: [PATCH v1 02/11] locks: make generic_add_lease and generic_delete_lease static

2013-06-03 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:25PM -0400, Jeff Layton wrote:
 Signed-off-by: Jeff Layton jlay...@redhat.com

ACK.

--b.

 ---
  fs/locks.c |4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/locks.c b/fs/locks.c
 index 7a02064..e3140b8 100644
 --- a/fs/locks.c
 +++ b/fs/locks.c
 @@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
   return type;
  }
  
 -int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
 +static int generic_add_lease(struct file *filp, long arg, struct file_lock 
 **flp)
  {
   struct file_lock *fl, **before, **my_before = NULL, *lease;
  	struct dentry *dentry = filp->f_path.dentry;
 @@ -1402,7 +1402,7 @@ out:
   return error;
  }
  
 -int generic_delete_lease(struct file *filp, struct file_lock **flp)
 +static int generic_delete_lease(struct file *filp, struct file_lock **flp)
  {
   struct file_lock *fl, **before;
  	struct dentry *dentry = filp->f_path.dentry;
 -- 
 1.7.1
 


Re: [PATCH v1 03/11] locks: comment cleanups and clarifications

2013-06-03 Thread J. Bruce Fields
On Fri, May 31, 2013 at 11:07:26PM -0400, Jeff Layton wrote:
 Signed-off-by: Jeff Layton jlay...@redhat.com
 ---
  fs/locks.c |   24 +++-
  include/linux/fs.h |6 ++
  2 files changed, 25 insertions(+), 5 deletions(-)
 
 diff --git a/fs/locks.c b/fs/locks.c
 index e3140b8..a7d2253 100644
 --- a/fs/locks.c
 +++ b/fs/locks.c
 @@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
  int leases_enable = 1;
  int lease_break_time = 45;
  
 +/*
 + * The i_flock list is ordered by:
 + *
 + * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
 + * 2) lock owner
 + * 3) lock range start
 + * 4) lock range end
 + *
 + * Obviously, the last two criteria only matter for POSIX locks.
 + */

Thanks, yes, that needs documenting!  Though I wonder if this is the
place people will look for it.

  #define for_each_lock(inode, lockp) \
  	for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
  
 @@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, struct 
 file_lock *request, str
   }
  
   lock_flocks();
 + /*
 +  * New lock request. Walk all POSIX locks and look for conflicts. If
 +  * there are any, either return -EAGAIN or put the request on the
 +  * blocker's list of waiters.
 +  */

This though, seems a) not 100% accurate (it could also return EDEADLCK,
for example), b) mostly redundant with respect to the following code.

  	if (request->fl_type != F_UNLCK) {
   for_each_lock(inode, before) {
   fl = *before;
 @@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, struct 
 file_lock *request, str
  		before = &fl->fl_next;
   }
  
 - /* Process locks with this owner.  */
 + /* Process locks with this owner. */
  	while ((fl = *before) && posix_same_owner(request, fl)) {
   /* Detect adjacent or overlapping regions (if same lock type)
*/
 @@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, struct 
 file_lock *request, str
   }
  
   /*
 -  * The above code only modifies existing locks in case of
 -  * merging or replacing.  If new lock(s) need to be inserted
 -  * all modifications are done bellow this, so it's safe yet to
 -  * bail out.
 +  * The above code only modifies existing locks in case of merging or
 +  * replacing. If new lock(s) need to be inserted all modifications are
 +  * done below this, so it's safe yet to bail out.
*/
   error = -ENOLCK; /* no luck */
  	if (right && left == right && !new_fl2)
 diff --git a/include/linux/fs.h b/include/linux/fs.h
 index b9d7816..ae377e9 100644
 --- a/include/linux/fs.h
 +++ b/include/linux/fs.h
 @@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
  /* that will die - we need it for nfs_lock_info */
  #include <linux/nfs_fs_i.h>
  
 +/*
 + * struct file_lock represents a generic file lock. It's used to represent
 + * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
 + * note that the same struct is used to represent both a request for a lock 
 and
 + * the lock itself, but the same object is never used for both.

Yes, and I do find that confusing.  I wonder if there's a sensible way
to use separate structs for the different uses.

--b.

 + */
  struct file_lock {
   struct file_lock *fl_next;  /* singly linked list for this inode  */
   struct list_head fl_link;   /* doubly linked list of all locks */
 -- 
 1.7.1
 


[PATCH 1/2] mds: initialize some member variables of MDCache

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

I added some member variables to class MDCache, but forgot to
initialize them.

Fixes: #5236
Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 8c17172..e2ecba8 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -147,6 +147,7 @@ MDCache::MDCache(MDS *m)
 (0.9 *(g_conf->osd_max_write_size << 20));
 
   discover_last_tid = 0;
+  open_ino_last_tid = 0;
   find_ino_peer_last_tid = 0;
 
   last_cap_id = 0;
@@ -155,6 +156,10 @@ MDCache::MDCache(MDS *m)
   client_lease_durations[1] = 30.0;
   client_lease_durations[2] = 300.0;
 
+  resolves_pending = false;
+  rejoins_pending = false;
+  cap_imports_num_opening = 0;
+
   opening_root = open = false;
   lru.lru_set_max(g_conf->mds_cache_size);
   lru.lru_set_midpoint(g_conf->mds_cache_mid);
-- 
1.8.1.4



[PATCH 7/9] ceph: check migrate seq before changing auth cap

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

We may receive old request reply from the exporter MDS after receiving
the importer MDS' cap import message.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 54c290b..790f88b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -612,9 +612,11 @@ retry:
__cap_delay_requeue(mdsc, ci);
}
 
-   if (flags & CEPH_CAP_FLAG_AUTH)
-   ci->i_auth_cap = cap;
-   else if (ci->i_auth_cap == cap) {
+   if (flags & CEPH_CAP_FLAG_AUTH) {
+   if (ci->i_auth_cap == NULL ||
+   ceph_seq_cmp(ci->i_auth_cap->mseq, mseq) < 0)
+   ci->i_auth_cap = cap;
+   } else if (ci->i_auth_cap == cap) {
 ci->i_auth_cap = NULL;
 spin_lock(&mdsc->cap_dirty_lock);
 if (!list_empty(&ci->i_dirty_item)) {
-- 
1.8.1.4



[PATCH 0/9] fixes for kclient

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

This patch series is also in:
  git://github.com/ukernel/linux.git wip-ceph

Regards
Yan, Zheng


[PATCH 2/9] libceph: call r_unsafe_callback when unsafe reply is received

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

We can't use !req->r_sent to check if an OSD request is sent for the
first time, because __cancel_request() zeros req->r_sent when the OSD
map changes. Rather than adding a new variable to ceph_osd_request to
indicate if it's sent for the first time, we can call the unsafe
callback only when an unsafe OSD reply is received. If the OSD's first
reply is safe, just skip calling the unsafe callback.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 net/ceph/osd_client.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 536c0e5..6972d17 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1338,10 +1338,6 @@ static void __send_request(struct ceph_osd_client *osdc,
 
 ceph_msg_get(req->r_request); /* send consumes a ref */
 
-   /* Mark the request unsafe if this is the first time it's being sent. */
-
-   if (!req->r_sent && req->r_unsafe_callback)
-   req->r_unsafe_callback(req, true);
 req->r_sent = req->r_osd->o_incarnation;
 
 ceph_con_send(req->r_osd->o_con, req->r_request);
@@ -1432,8 +1428,6 @@ static void handle_osds_timeout(struct work_struct *work)
 
  static void complete_request(struct ceph_osd_request *req)
  {
-   if (req->r_unsafe_callback)
-   req->r_unsafe_callback(req, false);
 complete_all(&req->r_safe_completion);  /* fsync waiter */
  }
 
@@ -1560,14 +1554,20 @@ static void handle_reply(struct ceph_osd_client *osdc, 
struct ceph_msg *msg,
 mutex_unlock(&osdc->request_mutex);
 
 if (!already_completed) {
+   if (req->r_unsafe_callback &&
+   result >= 0 && !(flags & CEPH_OSD_FLAG_ONDISK))
+   req->r_unsafe_callback(req, true);
 if (req->r_callback)
 req->r_callback(req, msg);
 else
 complete_all(&req->r_completion);
 }
 
-   if (flags & CEPH_OSD_FLAG_ONDISK)
+   if (flags & CEPH_OSD_FLAG_ONDISK) {
+   if (req->r_unsafe_callback && already_completed)
+   req->r_unsafe_callback(req, false);
 complete_request(req);
+   }
 
  done:
 dout("req=%p req->r_linger=%d\n", req, req->r_linger);
-- 
1.8.1.4



[PATCH 4/9] ceph: fix cap release race

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

ceph_encode_inode_release() can race with ceph_open() and release
caps wanted by open files. So it should call __ceph_caps_wanted()
to get the wanted caps.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 22 ++
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index da0f9b8..54c290b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3042,21 +3042,19 @@ int ceph_encode_inode_release(void **p, struct inode 
*inode,
 (cap->issued & unless) == 0)) {
 if ((cap->issued & drop) &&
 (cap->issued & unless) == 0) {
-   dout("encode_inode_release %p cap %p %s -> "
-"%s\n", inode, cap,
+   int wanted = __ceph_caps_wanted(ci);
+   if ((ci->i_ceph_flags & CEPH_I_NODELAY) == 0)
+   wanted |= cap->mds_wanted;
+   dout("encode_inode_release %p cap %p "
+"%s -> %s, wanted %s -> %s\n", inode, cap,
 ceph_cap_string(cap->issued),
-ceph_cap_string(cap->issued & ~drop));
+ceph_cap_string(cap->issued & ~drop),
+ceph_cap_string(cap->mds_wanted),
+ceph_cap_string(wanted));
+
 cap->issued &= ~drop;
 cap->implemented &= ~drop;
-   if (ci->i_ceph_flags & CEPH_I_NODELAY) {
-   int wanted = __ceph_caps_wanted(ci);
-   dout("  wanted %s -> %s (act %s)\n",
-ceph_cap_string(cap->mds_wanted),
-ceph_cap_string(cap->mds_wanted &
-~wanted),
-ceph_cap_string(wanted));
-   cap->mds_wanted = wanted;
-   }
+   cap->mds_wanted = wanted;
 } else {
 dout("encode_inode_release %p cap %p %s"
  " (force)\n", inode, cap,
-- 
1.8.1.4



[PATCH 5/9] ceph: reset iov_len when discarding cap release messages

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..e2d7e56 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1391,6 +1391,7 @@ static void discard_cap_releases(struct ceph_mds_client 
*mdsc,
 num = le32_to_cpu(head->num);
 dout("discard_cap_releases mds%d %p %u\n", session->s_mds, msg, num);
 head->num = cpu_to_le32(0);
+   msg->front.iov_len = sizeof(*head);
 session->s_num_cap_releases += num;
 
/* requeue completed messages */
-- 
1.8.1.4



[PATCH 8/9] ceph: clear migrate seq when MDS restarts

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e2d7e56..ce7a789 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2455,6 +2455,7 @@ static int encode_caps_cb(struct inode *inode, struct 
ceph_cap *cap,
 spin_lock(&ci->i_ceph_lock);
 cap->seq = 0;/* reset cap seq */
 cap->issue_seq = 0;  /* and issue_seq */
+   cap->mseq = 0;   /* and migrate_seq */
 
 if (recon_state->flock) {
 rec.v2.cap_id = cpu_to_le64(cap->cap_id);
-- 
1.8.1.4



[PATCH 3/9] libceph: fix truncate size calculation

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Check the 'not truncated yet' case.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 net/ceph/osd_client.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 6972d17..93efdfb 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -733,12 +733,14 @@ struct ceph_osd_request *ceph_osdc_new_request(struct 
ceph_osd_client *osdc,
 
 object_size = le32_to_cpu(layout->fl_object_size);
 object_base = off - objoff;
-   if (truncate_size <= object_base) {
-   truncate_size = 0;
-   } else {
-   truncate_size -= object_base;
-   if (truncate_size > object_size)
-   truncate_size = object_size;
+   if (!(truncate_seq == 1 && truncate_size == -1ULL)) {
+   if (truncate_size <= object_base) {
+   truncate_size = 0;
+   } else {
+   truncate_size -= object_base;
+   if (truncate_size > object_size)
+   truncate_size = object_size;
+   }
}
 
osd_req_op_extent_init(req, 0, opcode, objoff, objlen,
-- 
1.8.1.4



[PATCH 1/9] libceph: fix safe completion

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

handle_reply() calls complete_request() only if the first OSD reply
has the ONDISK flag.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 include/linux/ceph/osd_client.h |  1 -
 net/ceph/osd_client.c   | 16 
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 186db0b..ce6df39 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -145,7 +145,6 @@ struct ceph_osd_request {
s32   r_reply_op_result[CEPH_OSD_MAX_OP];
int   r_got_reply;
int   r_linger;
-   int   r_completed;
 
struct ceph_osd_client *r_osdc;
struct kref   r_kref;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a3395fd..536c0e5 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1525,6 +1525,8 @@ static void handle_reply(struct ceph_osd_client *osdc, 
struct ceph_msg *msg,
 for (i = 0; i < numops; i++)
 req->r_reply_op_result[i] = ceph_decode_32(&p);
 
+   already_completed = req->r_got_reply;
+
 if (!req->r_got_reply) {
 
 req->r_result = result;
@@ -1555,16 +1557,14 @@ static void handle_reply(struct ceph_osd_client *osdc, 
struct ceph_msg *msg,
 ((flags & CEPH_OSD_FLAG_WRITE) == 0))
 __unregister_request(osdc, req);
 
-   already_completed = req->r_completed;
-   req->r_completed = 1;
 mutex_unlock(&osdc->request_mutex);
-   if (already_completed)
-   goto done;
 
-   if (req->r_callback)
-   req->r_callback(req, msg);
-   else
-   complete_all(&req->r_completion);
+   if (!already_completed) {
+   if (req->r_callback)
+   req->r_callback(req, msg);
+   else
+   complete_all(&req->r_completion);
+   }
 
 if (flags & CEPH_OSD_FLAG_ONDISK)
complete_request(req);
-- 
1.8.1.4



[PATCH 9/9] ceph: move inode to proper flushing list when auth MDS changes

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 790f88b..458a66e 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1982,8 +1982,14 @@ static void kick_flushing_inode_caps(struct 
ceph_mds_client *mdsc,
 cap = ci->i_auth_cap;
 dout("kick_flushing_inode_caps %p flushing %s flush_seq %lld\n", inode,
  ceph_cap_string(ci->i_flushing_caps), ci->i_cap_flush_seq);
+
 __ceph_flush_snaps(ci, &session, 1);
+
 if (ci->i_flushing_caps) {
+   spin_lock(&mdsc->cap_dirty_lock);
+   list_move_tail(&ci->i_flushing_item, &session->s_cap_flushing);
+   spin_unlock(&mdsc->cap_dirty_lock);
+
 delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH,
  __ceph_caps_used(ci),
  __ceph_caps_wanted(ci),
-- 
1.8.1.4



[PATCH 6/9] ceph: fix race between page writeback and truncate

2013-06-03 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com

The client can receive a truncate request from the MDS at any time,
so the page writeback code needs to get i_size, truncate_seq and
truncate_size atomically.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/addr.c | 84 --
 1 file changed, 40 insertions(+), 44 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..3500b74 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -438,13 +438,12 @@ static int writepage_nounlock(struct page *page, struct 
writeback_control *wbc)
struct ceph_inode_info *ci;
struct ceph_fs_client *fsc;
struct ceph_osd_client *osdc;
-   loff_t page_off = page_offset(page);
-   int len = PAGE_CACHE_SIZE;
-   loff_t i_size;
-   int err = 0;
struct ceph_snap_context *snapc, *oldest;
-   u64 snap_size = 0;
+   loff_t page_off = page_offset(page);
long writeback_stat;
+   u64 truncate_size, snap_size = 0;
+   u32 truncate_seq;
+   int err = 0, len = PAGE_CACHE_SIZE;
 
 dout("writepage %p idx %lu\n", page, page->index);
 
@@ -474,13 +473,20 @@ static int writepage_nounlock(struct page *page, struct 
writeback_control *wbc)
}
ceph_put_snap_context(oldest);
 
+   spin_lock(&ci->i_ceph_lock);
+   truncate_seq = ci->i_truncate_seq;
+   truncate_size = ci->i_truncate_size;
+   if (!snap_size)
+   snap_size = i_size_read(inode);
+   spin_unlock(&ci->i_ceph_lock);
+
/* is this a partial page at end of file? */
-   if (snap_size)
-   i_size = snap_size;
-   else
-   i_size = i_size_read(inode);
-   if (i_size < page_off + len)
-   len = i_size - page_off;
+   if (page_off >= snap_size) {
+   dout("%p page eof %llu\n", page, snap_size);
+   goto out;
+   }
+   if (snap_size < page_off + len)
+   len = snap_size - page_off;
 
 dout("writepage %p page %p index %lu on %llu~%u snapc %p\n",
  inode, page, page->index, page_off, len, snapc);
@@ -494,7 +500,7 @@ static int writepage_nounlock(struct page *page, struct 
writeback_control *wbc)
 err = ceph_osdc_writepages(osdc, ceph_vino(inode),
   &ci->i_layout, snapc,
   page_off, len,
-  ci->i_truncate_seq, ci->i_truncate_size,
+  truncate_seq, truncate_size,
   &inode->i_mtime, page, 1);
 if (err < 0) {
 dout("writepage setting page/mapping error %d %p\n", err, page);
@@ -631,25 +637,6 @@ static void writepages_finish(struct ceph_osd_request *req,
ceph_osdc_put_request(req);
 }
 
-static struct ceph_osd_request *
-ceph_writepages_osd_request(struct inode *inode, u64 offset, u64 *len,
-   struct ceph_snap_context *snapc, int num_ops)
-{
-   struct ceph_fs_client *fsc;
-   struct ceph_inode_info *ci;
-   struct ceph_vino vino;
-
-   fsc = ceph_inode_to_client(inode);
-   ci = ceph_inode(inode);
-   vino = ceph_vino(inode);
-   /* BUG_ON(vino.snap != CEPH_NOSNAP); */
-
-   return ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
-   vino, offset, len, num_ops, CEPH_OSD_OP_WRITE,
-   CEPH_OSD_FLAG_WRITE|CEPH_OSD_FLAG_ONDISK,
-   snapc, ci->i_truncate_seq, ci->i_truncate_size, true);
-}
-
 /*
  * initiate async writeback
  */
@@ -658,7 +645,8 @@ static int ceph_writepages_start(struct address_space 
*mapping,
 {
 struct inode *inode = mapping->host;
struct ceph_inode_info *ci = ceph_inode(inode);
-   struct ceph_fs_client *fsc;
+   struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
+   struct ceph_vino vino = ceph_vino(inode);
pgoff_t index, start, end;
int range_whole = 0;
int should_loop = 1;
@@ -670,7 +658,8 @@ static int ceph_writepages_start(struct address_space 
*mapping,
 unsigned wsize = 1 << inode->i_blkbits;
struct ceph_osd_request *req = NULL;
int do_sync;
-   u64 snap_size;
+   u64 truncate_size, snap_size;
+   u32 truncate_seq;
 
/*
 * Include a 'sync' in the OSD request if this is a data
@@ -685,7 +674,6 @@ static int ceph_writepages_start(struct address_space 
*mapping,
  wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
  (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));
 
-   fsc = ceph_inode_to_client(inode);
 if (fsc->mount_state == CEPH_MOUNT_SHUTDOWN) {
 pr_warning("writepage_start %p on forced umount\n", inode);
return -EIO; /* we're in a forced umount, don't write! */
@@ -728,6 +716,14 @@ retry:
snap_size = i_size_read(inode);
 dout(" oldest snapc is %p seq %lld (%d snaps)\n",
  snapc, snapc->seq,