rbd image association
Hi,

Is it possible to associate a certain rbd device with the appropriate rbd image (for example: image1 -> /dev/rbd0)? We always need to map image1 to /dev/rbd0, image2 to /dev/rbd1, etc., because I'm using /dev/rbd* in some configuration files.

Thanks

--
Kind regards,
R. Alekseev

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rbd image association
On 03.06.2013 11:08, Wolfgang Hennerbichler wrote:
> On Mon, Jun 03, 2013 at 10:52:14AM +0400, Roman Alekseev wrote:
> > Hi, Is it possible to associate a certain rbd device with the appropriate rbd image
> > (for example: image1 -> /dev/rbd0)? We always need to map image1 to /dev/rbd0,
> > image2 to /dev/rbd1, etc., because I'm using /dev/rbd* in some configuration files.
>
> Hi, I can't remember exactly, but I think there is something like /dev/rbd/pool/imagename which simply symlinks to /dev/rbdX. This should solve your issue.
>
> HTH
> Wolfgang

Dear Wolfgang,

I was trying to use the command "ln -s /dev/rbd1 image1" but it creates a wrong symlink. Most likely there is a more specific command to symlink the image and device correctly.

Thanks.

--
Kind regards,
R. Alekseev
Re: rbd image association
On 06/03/2013 09:17 AM, Roman Alekseev wrote:
> On 03.06.2013 11:08, Wolfgang Hennerbichler wrote:
> > Hi, I can't remember exactly, but I think there is something like /dev/rbd/pool/imagename which simply symlinks to /dev/rbdX. This should solve your issue.
>
> Dear Wolfgang,
>
> I was trying to use the command "ln -s /dev/rbd1 image1" but it creates a wrong symlink. Most likely there is a more specific command to symlink the image and device correctly.

Why are you symlinking? Like Wolfgang mentioned, udev takes care of this for you. Just use /dev/rbd/pool/imagename, which is created by udev.

--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
Re: rbd image association
On Mon, Jun 03, 2013 at 11:17:16AM +0400, Roman Alekseev wrote:
> Dear Wolfgang,
>
> I was trying to use the command "ln -s /dev/rbd1 image1" but it creates a wrong symlink. Most likely there is a more specific command to symlink the image and device correctly.

No, you should not have to do anything manually. I was saying udev should create the symlink for you upon image mapping. If it doesn't, check your udev configuration. It does this for me on Ubuntu Server.

Wolfgang

--
http://www.wogri.com
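If udev really is not creating the link, the manual equivalent is simple — a sketch is below, with all pool/image/device names hypothetical. Note the argument order of ln -s (TARGET first, LINK NAME second); reversing them is an easy way to end up with the "wrong symlink" described above.

```shell
#!/bin/sh
# Sketch only: recreate the stable /dev/rbd/<pool>/<image> symlink by hand,
# the way udev normally does automatically. Names are hypothetical.
make_rbd_link() {
    root=$1 pool=$2 image=$3 dev=$4
    mkdir -p "$root/rbd/$pool"
    # ln -s takes the TARGET first, then the LINK NAME
    ln -sfn "$dev" "$root/rbd/$pool/$image"
    readlink "$root/rbd/$pool/$image"
}

# Demo against a scratch directory; on a real host (as root) you would run:
#   make_rbd_link /dev rbd image1 /dev/rbd1
demo=$(mktemp -d)
make_rbd_link "$demo" rbd image1 /dev/rbd1   # prints /dev/rbd1
```

This is only a fallback for debugging; as the follow-ups note, the packaged udev rule should make it unnecessary.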
Re: rbd image association
On 03.06.2013 11:34, Wido den Hollander wrote:
> udev takes care of this for you

Do you mean we need to create some rules in the /etc/udev/rules.d/ directory, and after running "rbd -p pool map image" my image will be mapped to the appropriate device? If so, could you please provide me with the commands which should be present in that file?

Thanks

--
Kind regards,
R. Alekseev
Re: Speed up 'rbd rm'
On Thu, May 30, 2013 at 07:04:28PM -0700, Josh Durgin wrote:
> On 05/30/2013 06:40 PM, Chris Dunlop wrote:
> > On Thu, May 30, 2013 at 01:50:14PM -0700, Josh Durgin wrote:
> > > On 05/29/2013 07:23 PM, Chris Dunlop wrote:
> > > > On Wed, May 29, 2013 at 12:21:07PM -0700, Josh Durgin wrote:
> > > > > On 05/28/2013 10:59 PM, Chris Dunlop wrote:
> > > > > > I see there's a new commit to speed up an 'rbd rm':
> > > > > > http://tracker.ceph.com/projects/ceph/repository/revisions/40956410169709c32a282d9b872cb5f618a48926
> > > > > > Is it safe to cherry-pick this commit on top of 0.56.6 (or, if not, v0.61.2) to speed up the remove?
> > > > >
> > > > > You'll need 537386d906b8c0e395433461dcb03a82eb33f34f as well. It should apply cleanly to 0.61.2, and probably 0.56.6 too.
> > > >
> > > > Thanks. I'll see how I go. I may just leave the 'rm' running all weekend rather than futzing around recompiling ceph and getting off the mainline track.
> >
> >   # time rbd rm rbd/large-image
> >   Removing image: 36% complete...Terminated
> >   real    2819m37.117s
> >
> > I.e. 47 hours and only 36% complete before I gave up (I wanted to restart that server). At that rate it would take 5.5 days to remove!
> >
> > > If you're mainly interested in getting rid of the accidentally-1.5PB image, you can just delete the header (and id object if it's format 2) and then 'rbd rm' will just remove it from the rbd_directory index, and not try to delete all the non-existent data objects.
> >
> > Yes, that's my main interest. Sorry, I haven't yet delved far into the details of how the rbd stuff hangs together: can you give me a hint or point me towards any docs regarding what "delete the header (and id object)" would look like?
>
> For a format 2 image, 'rbd info imagename' will show a block_prefix like 'rbd_data.101574b0dc51'. The random suffix after the '.' is the id of the image. For format 2, the header is named after this id, so you'd do:
>
>   rados -p poolname rm rbd_header.101574b0dc51
>
> For format 1 images, the header object is named after the image name, like 'imagename.rbd'. After removing the header object manually, 'rbd rm' will clean up the rest.

The problematical image is format 2. If it's tricky to manually remove, it's not doing any harm just sitting there (is it??), so I guess I can just wait until the parallelized delete is available in a stable release, i.e. dumpling, or backported to bobtail or cuttlefish.

Cheers,

Chris.
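To make Josh's manual-cleanup recipe concrete, here is a small sketch that derives the header object name from the block_name_prefix. The sample prefix line is hard-coded (it reuses the hypothetical id from Josh's example); on a live cluster it would come from `rbd info <image>`, and the actual removal commands are shown as comments since they need a real cluster.

```shell
#!/bin/sh
# Sketch: build the rbd_header object name for a format-2 image from the
# block_name_prefix reported by `rbd info`. The sample line is a stand-in.
prefix_line='block_name_prefix: rbd_data.101574b0dc51'
image_id=${prefix_line##*.}          # text after the last '.' -> the image id
header_obj="rbd_header.$image_id"
echo "$header_obj"                   # prints rbd_header.101574b0dc51

# Against the real cluster you would then run:
#   rados -p <poolname> rm "$header_obj"   # remove the header object
#   rbd rm <imagename>                     # now only drops the directory entry
```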
krbd + format=2 ?
G'day,

Sage's recent pull message to Linus said:

> Please pull the following Ceph patches from
> git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus
>
> This is a big pull. Most of it is culmination of Alex's work to implement RBD image layering, which is now complete (yay!).

Am I correct in thinking "RBD image layering... is now complete" implies there should be full(?) support for format=2?

I pulled the for-linus branch (@ 3abef3b) on top of 3.10.0-rc4, and it's letting me map a format=2 image (created under bobtail), however reading from the block device returns zeros rather than the data. The same image correctly shows data (NTFS filesystem) when mounted into kvm using librbd.

  # uname -r
  3.10.0-rc4-00010-g0326739
  # rbd ls -l
  NAME   SIZE  PARENT  FMT  PROT  LOCK
  xxx   1536G            2
  # rbd map rbd/xxx
  # rbd showmapped
  id  pool  image  snap  device
  1   rbd   xxx    -     /dev/rbd1
  # dd if=/dev/rbd1 of=/tmp/xxx count=20480
  20480+0 records in
  20480+0 records out
  10485760 bytes (10 MB) copied, 0.757754 s, 13.8 MB/s
  # od -c /tmp/xxx | less
  0000000  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
  *
  50000000

Cheers,

Chris
Re: [PATCH 0/2] librados: Add RADOS locks to the C/C++ API
Hi Josh,

On 05/31/2013 10:44 PM, Josh Durgin wrote:
> On 05/30/2013 06:02 AM, Filippos Giannakos wrote:
> > The following patches export the RADOS advisory locks functionality to the C/C++ librados API. The extra API calls added are inspired by the relevant functions of librbd.
>
> This looks good to me overall. I wonder if we should create a new library in the future for these kinds of things that are built on top of librados. Other generally useful class client operations could go there, as well as generally useful things built on top of librados, like methods for striping over many objects.

Thanks for the review. I will incorporate all your suggestions in a new patch, which I will submit shortly. As for the new library you mention, it is a good idea, but for now I think that the basic RADOS locking functionality should be in the core librados API.

Kind Regards,

--
Filippos.
philipg...@grnet.gr
Re: rbd image association
On Mon, 3 Jun 2013, Roman Alekseev wrote:
> On 03.06.2013 11:34, Wido den Hollander wrote:
> > udev takes care of this for you
>
> Do you mean we need to create some rules in the /etc/udev/rules.d/ directory, and after running "rbd -p pool map image" my image will be mapped to the appropriate device?

The ceph package installs a udev rules file in that directory; you shouldn't have to do anything other than the 'rbd map ...' command. If it is not already present, there must be something wrong with the package on the platform you are using. What OS is it?

> If so, could you please provide me with the commands which should be present in that file?
>
> Thanks
>
> --
> Kind regards,
> R. Alekseev
Re: rbd image association
On 03.06.2013 18:49, Sage Weil wrote:
> The ceph package installs a udev rules file in that directory; you shouldn't have to do anything other than the 'rbd map ...' command. If it is not already present, there must be something wrong with the package on the platform you are using. What OS is it?

Yep, you're right, there is only the file 70-persistent-net.rules in the /etc/udev/rules.d/ directory. It is Debian Wheezy, 3.2.0-4-amd64.

Thanks

--
Kind regards,
R. Alekseev
Re: rbd image association
On Mon, 3 Jun 2013, Roman Alekseev wrote:
> On 03.06.2013 18:49, Sage Weil wrote:
> > The ceph package installs a udev rules file in that directory; you shouldn't have to do anything other than the 'rbd map ...' command. If it is not already present, there must be something wrong with the package on the platform you are using. What OS is it?
>
> Yep, you're right, there is only the file 70-persistent-net.rules in the /etc/udev/rules.d/ directory. It is Debian Wheezy, 3.2.0-4-amd64.

The file you want is

  /lib/udev/rules.d/50-rbd.rules

which is part of the librbd1 package.

  dpkg -L librbd1

to verify it is there...

sage
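For readers wondering what that rule actually does: the 50-rbd.rules shipped around this release looks roughly like the line below (reproduced from memory, so treat it as a sketch rather than the authoritative file contents). It runs ceph-rbdnamer to resolve the device's pool and image name, then adds the stable symlink under /dev/rbd/:

```
KERNEL=="rbd[0-9]*", PROGRAM="/usr/bin/ceph-rbdnamer %n", SYMLINK+="rbd/%c{1}/%c{2}"
```

With that rule in place, mapping an image produces both /dev/rbdX and the name-stable /dev/rbd/<pool>/<image> link automatically.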
ceph branch status
-- All Branches --

Alex Elder el...@inktank.com
  2013-05-21 14:37:01 -0500  wip-rbd-testing

Babu Shanmugam a...@enovance.com
  2013-05-30 10:28:23 +0530  wip-rgw-geo-enovance

Dan Mick dan.m...@inktank.com
  2012-12-18 12:27:36 -0800  wip-rbd-striping
  2013-03-15 17:27:54 -0700  wip-cephtool-stderr

David Zafman david.zaf...@inktank.com
  2013-01-28 20:26:34 -0800  wip-wireshark-zafman
  2013-03-22 18:14:10 -0700  wip-snap-test-fix
  2013-05-30 17:18:24 -0700  wip-3527

Gary Lowell gary.low...@inktank.com
  2013-05-28 13:58:22 -0700  last

Gary Lowell glow...@inktank.com
  2013-01-28 22:49:45 -0800  wip-3930
  2013-02-05 19:29:11 -0800  wip.cppchecker
  2013-02-10 22:21:52 -0800  wip-3955
  2013-02-26 19:28:48 -0800  wip-system-leveldb
  2013-02-27 22:32:57 -0800  bobtail-leveldb
  2013-03-01 18:55:35 -0800  wip-da-spec-1
  2013-03-19 11:28:15 -0700  wip-3921
  2013-04-11 23:00:05 -0700  wip-init-radosgw
  2013-04-17 23:30:11 -0700  wip-4725
  2013-04-21 22:06:37 -0700  wip-4752
  2013-04-22 14:11:37 -0700  wip-4632
  2013-05-10 09:32:33 -0700  wip-build-doc
  2013-05-31 11:20:40 -0700  wip-doc-prereq

Greg Farnum g...@inktank.com
  2013-02-13 14:46:38 -0800  wip-mds-snap-fix
  2013-02-22 19:57:53 -0800  wip-4248-snapid-journaling
  2013-05-01 17:06:27 -0700  wip-optracker-4354
  2013-05-31 10:35:18 -0700  bobtail
  2013-05-31 13:28:31 -0700  wip-rgw-geo-rebase-test
  2013-05-31 17:08:27 -0700  wip-rgw-geo-rebase

James Page james.p...@ubuntu.com
  2013-02-27 22:50:38 +0000  wip-debhelper-8

Joao Eduardo Luis joao.l...@inktank.com
  2013-04-18 00:01:24 +0100  wip-4521-tool
  2013-04-22 15:14:28 +0100  wip-4748
  2013-04-24 16:42:11 +0100  wip-4521
  2013-04-30 18:45:22 +0100  wip-mon-compact-dbg
  2013-05-21 01:46:13 +0100  wip-monstoretool-foo
  2013-05-31 16:26:02 +0100  wip-mon-cache-first-last-committed
  2013-05-31 18:54:38 +0100  wip-mon-trim
  2013-05-31 21:00:28 +0100  wip-mon-trim-b

Joe Buck jbb...@gmail.com
  2013-05-02 16:32:33 -0700  wip-buck-add-terasort
  2013-05-30 23:02:32 -0700  wip-rgw-geo-enovance-buck

John Wilkins john.wilk...@inktank.com
  2012-12-21 15:14:37 -0800  wip-mon-docs

Josh Durgin josh.dur...@inktank.com
  2013-03-01 14:45:23 -0800  wip-rbd-workunit-debug
  2013-04-29 14:32:00 -0700  wip-rbd-close-image

Matt Benjamin m...@linuxbox.com
  2013-05-21 09:45:30 -0700  wip-libcephfs

Noah Watkins noahwatk...@gmail.com
  2013-01-05 11:58:38 -0800  wip-localized-read-tests
  2013-01-11 12:49:28 -0800  wip-osx-upstream
  2013-01-11 13:01:11 -0800  wip-osx
  2013-02-01 10:28:26 -0800  wip-java-deb-warning
  2013-04-22 15:23:09 -0700  wip-cls-lua
  2013-05-30 13:29:42 -0700  wip-hadoop-doc

Roald van Loon roaldvanl...@gmail.com
  2012-12-24 22:26:56 +0000  wip-dout

Sage Weil s...@inktank.com
  2012-07-14 17:40:21 -0700  wip-osd-redirect
  2012-07-28 13:56:47 -0700  wip-journal
  2012-11-13 08:58:57 -0800  wip-fd-simple-cache
  2012-11-30 13:47:27 -0800  wip-osd-readhole
  2012-12-07 14:38:46 -0800  wip-osd-alloc
  2012-12-12 22:18:02 -0800  automake-python
  2012-12-14 17:11:31 -0800  wip_cur_perf_journal
  2012-12-27 13:38:52 -0800  wip-linuxbox
  2013-01-06 20:39:37 -0800  wip-msg-refs
  2013-01-16 11:39:28 -0800  wip-client-layout-temp
  2013-01-27 11:06:08 -0800  wip-argonaut-leveldb
  2013-01-29 13:46:02 -0800  wip-readdir
  2013-02-08 10:07:54 -0800  wip-bobtail-logrotate
  2013-02-11 07:05:15 -0800  wip-sim-journal-clone
  2013-02-12 08:29:29 -0800  wip-dump
  2013-02-12 23:17:11 -0800  wip-monc
  2013-02-18 20:50:57 -0800  wip-osd-scrub
  2013-02-19 13:18:13 -0800  wip-4116-workaround
  2013-03-11 17:27:15 -0700  wip-deploy-rgw
  2013-03-15 17:10:24 -0700  wip-log-4192
  2013-04-15 20:34:47 -0700  bobtail-dc
  2013-04-18 13:51:36 -0700  argonaut
  2013-04-25 10:38:45 -0700  wip-init2
  2013-04-26 12:25:58 -0700  wip-mon-fwd
  2013-05-02 14:50:38 -0700  wip-rbd-clear-layering
  2013-05-03 09:01:53 -0700  wip-leveldb-reopen
  2013-05-07 15:56:57 -0700  wip-ceph-tool
  2013-05-08 09:46:53 -0700  wip-4578-bobtail
  2013-05-08 13:39:25 -0700  wip-4951
  2013-05-10 08:28:50 -0700  wip-rgw-crash
  2013-05-22 09:03:53 -0700  wip-4895-cuttlefish
  2013-05-22 10:33:35 -0700  unstable
  2013-05-22 13:39:58 -0700  wip-notcmalloc
  2013-05-23 19:32:56 -0700  wip-libcephfs-rebased
  2013-05-27 12:45:33 -0700  wip-mon-compaction
  2013-05-27 12:45:41 -0700  wip-cuttlefish-mon
  2013-05-28
Ceph killed by OS because of OOM under high load
Hi,

As my previous mail reported some weeks ago, we are suffering from OSD crashes, OSD flipping, system reboots, etc.; all these stability issues really stop us from digging further into ceph characterization. The good news is that we seem to have found the cause. I explain our experiments below:

Environment:

We have 2 machines, one for the client and one for ceph, connected via 10GbE. The client machine is very powerful, with 64 cores and 256G RAM. The ceph machine has 32 cores and 64G RAM, but we limited the available RAM to 8GB via the grub configuration. There are 12 OSDs on top of 12 * 5400 RPM 1T disks, with 4 * DCS 3700 SSDs as journals. Both client and ceph are v0.61.2. We run 12 rados bench instances on the client node as a stress load against the ceph node, each instance with 256 concurrent ops.

Experiments and results:

1. default ceph + default client: OK
2. tuned ceph + default client: FAIL. One OSD was killed by the OS due to OOM, and all swap space was run out. (tuning: large queue ops / large queue bytes / no flusher / sync_flush = true)
3. tuned ceph WITHOUT large queue bytes + default client: OK
4. tuned ceph WITHOUT large queue bytes + aggressive client: FAIL. One OSD was killed by OOM and one suicided because of a 150s op thread timeout. (aggressive client: objecter_inflight_ops and objecter_inflight_bytes are both set to 10X the default)

Conclusions. We would like to say:

a. Under heavy load, some tuning will make ceph unstable, especially the queue-bytes-related settings (deduced from 1+2+3).
b. Ceph doesn't do any control on the length of the OSD queue. This is a critical issue: with an aggressive client or a lot of concurrent clients, the OSD queue will become too long to fit in memory, resulting in the OSD daemon being killed (deduced from 3+4).
c. An observation of OSD daemon memory usage shows that if I use "killall rados" to kill all the rados bench instances, the ceph OSD daemon does not free the allocated memory; instead, it retains very high memory usage (a freshly started ceph used ~0.5 GB, and under load it used ~6 GB; after killing rados it still retains 5~6 GB; restarting ceph solves this).

We don't capture any logs now, but since it's really easy to reproduce, we can reproduce and provide any log/profiling info per request. Any inputs/suggestions are highly appreciated.

Thanks

Xiaoxi
Re: [ceph-users] Ceph killed by OS because of OOM under high load
On Mon, Jun 3, 2013 at 8:47 AM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
> Hi,
>
> As my previous mail reported some weeks ago, we are suffering from OSD crashes, OSD flipping, system reboots, etc. [full report snipped -- see the previous message]
>
> An observation of OSD daemon memory usage shows that if I use "killall rados" to kill all the rados bench instances, the ceph OSD daemon does not free the allocated memory; instead, it retains very high memory usage (a freshly started ceph used ~0.5 GB, and under load it used ~6 GB; after killing rados it still retains 5~6 GB; restarting ceph solves this).

You don't have enough RAM for your OSDs. We really recommend 1-2GB per daemon; 600MB/daemon is dangerous. You might be able to make it work, but you'll definitely need to change the queue lengths and things.

Speaking of which... yes, the OSDs do control their queue lengths, but it's not dynamic tuning, and by default it will let clients stack up 500MB of in-progress writes. With such wimpy systems you'll want to turn that down, probably alongside various journal and disk wait queues.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
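For what it's worth, the 500MB default Greg mentions sounds like the "osd client message size cap" option. A ceph.conf sketch for a RAM-starved node might look like the fragment below; the value is an illustrative guess for this kind of setup, not a recommendation, so verify the option name and default against your ceph version before relying on it:

```
[osd]
; Cap on in-flight client data buffered per OSD (default in this era: ~500 MB).
; 104857600 = 100 MB, an illustrative value for a node with well under
; 1 GB of RAM per OSD daemon.
osd client message size cap = 104857600
```

Lowering this cap makes the OSD push back on clients sooner instead of buffering their writes until the kernel OOM-kills the daemon.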
Re: [PATCH v1 00/11] locks: scalability improvements for file locking
On Fri, 2013-05-31 at 23:07 -0400, Jeff Layton wrote:
> This is not the first attempt at doing this. The conversion to the i_lock was originally attempted by Bruce Fields a few years ago. His approach was NAK'ed since it involved ripping out the deadlock detection. People also really seem to like /proc/locks for debugging, so keeping that in is probably worthwhile.

Yep, we need to keep this. FWIW, lslocks(8) relies on /proc/locks.

Thanks,
Davidlohr
Re: rationale for a PGLog::merge_old_entry case
In all three cases, we know the authoritative log does not contain an entry for oe.soid, therefore:

If oe.prior_version > log.tail, we must already have processed an earlier entry for that object resulting in the object being correctly marked missing (or not) (specifically, the entry for oe.prior_version).

If log.tail >= oe.prior_version > eversion_t(), the missing entry should have need set at oe.prior_version (revise_need). oe.prior_version cannot be divergent because all divergent entries must fall within the log (otherwise, we would have backfilled).

If oe.prior_version == eversion_t(), the object no longer exists, and the object should be removed from the missing set.

Hope that helps.
-Sam

On Sun, Jun 2, 2013 at 4:09 AM, Loic Dachary l...@dachary.org wrote:
> Hi Sam,
>
> TL;DR: When there is no new entry, what is the rationale for merge_old_entry to remove the object from missing only if the tail is eversion_t() and the object's prior_version is also eversion_t()?
> https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/osd/PGLog.cc#L330
>
> Long version: The conditions are created with:
>
>   info.log_tail = eversion_t();
>   oe.soid.hash = 1;
>   oe.op = pg_log_entry_t::DELETE;
>   oe.prior_version = eversion_t();
>   missing.add(oe.soid, eversion_t(1,1), eversion_t());
>
> as shown in https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L467
>
> I double checked with gdb, and when called with
>
>   EXPECT_FALSE(merge_old_entry(t, oe, info, remove_snap, dirty_log));
>   https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L481
>
> it reaches
>
>   missing.rm(oe.soid, missing.missing[oe.soid].need);
>   https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/osd/PGLog.cc#L330
>
> and the expected side effects are observed:
>
>   EXPECT_FALSE(dirty_log);
>   EXPECT_TRUE(remove_snap.empty());
>   EXPECT_TRUE(t.empty());
>   EXPECT_FALSE(missing.have_missing());
>   EXPECT_TRUE(log.empty());
>   EXPECT_EQ(0U, ondisklog.length());
>   https://github.com/dachary/ceph/blob/f58299db098d5f18c817b516fa6ffaa76245e57d/src/test/osd/TestPGLog.cc#L483
>
> Cheers
>
> --
> Loïc Dachary, Artisan Logiciel Libre
> All that is necessary for the triumph of evil is that good people do nothing.
Re: [PATCH v1 00/11] locks: scalability improvements for file locking
On Fri, May 31, 2013 at 11:07:23PM -0400, Jeff Layton wrote:
> Executive summary (tl;dr version): This patchset represents an overhaul of the file locking code with an aim toward improving its scalability and making the code a bit easier to understand.

Thanks for working on this, that code could use some love!

> Longer version: When the BKL was finally ripped out of the kernel in 2010, the strategy taken for the file locking code was to simply turn it into a new file_lock_lock spinlock. It was an expedient way to deal with the file locking code at the time, but having a giant spinlock around all of this code is clearly not great for scalability. Red Hat has bug reports that go back into the 2.6.18 era that point to BKL scalability problems in the file locking code, and the file_lock_lock suffers from the same issues.
>
> This patchset is my first attempt to make this code less dependent on global locking. The main change is to switch most of the file locking code to be protected by the inode->i_lock instead of the file_lock_lock. While that works for most things, there are a couple of global data structures (lists in the current code) that need a global lock to protect them. So we still need a global lock in order to deal with those. The remaining patches are intended to make that global locking less painful. The big gain is made by turning the blocked_list into a hashtable, which greatly speeds up the deadlock detection code.
>
> I rolled a couple of small programs in order to test this code. The first one just forks off 128 children and has them lock and unlock the same file 10k times. Running this under "time" against a file on tmpfs gives typical values like this:

What kind of hardware was this?

>   Unpatched (3.10-rc3-ish):
>   real    0m5.283s
>   user    0m0.380s
>   sys     0m20.469s
>
>   Patched (same base kernel):
>   real    0m5.099s
>   user    0m0.478s
>   sys     0m19.662s
>
> ...so there seems to be some modest performance gain in this test. I think that's almost entirely due to the change to a hashtable and to optimizing removing and readding blocked locks to the global lists. Note that with this code we have to take two spinlocks instead of just one, and that has some performance impact too. So the real performance gain from that hashtable conversion is eaten up to some degree by this.

Might be nice to look at some profiles to confirm all of that. I'd also be curious how much variation there was in the results above, as they're pretty close.

> The next test just forks off a bunch of children that each create their own file and then lock and unlock it 20k times. Obviously, the locks in this case are uncontended. Running that under "time" typically gives these rough numbers:
>
>   Unpatched (3.10-rc3-ish):
>   real    0m8.836s
>   user    0m1.018s
>   sys     0m34.094s
>
>   Patched (same base kernel):
>   real    0m4.965s
>   user    0m1.043s
>   sys     0m18.651s
>
> In this test, we see the real benefit of moving to the i_lock for most of this code. The run time is almost cut in half in this test. With these changes, locking different inodes needs very little serialization.
>
> If people know of other file locking performance tests, then I'd be happy to try them out too. It's possible that this might make some workloads slower, and it would be helpful to know what they are (and address them) if so.
>
> This is not the first attempt at doing this. The conversion to the i_lock was originally attempted by Bruce Fields a few years ago. His approach was NAK'ed since it involved ripping out the deadlock detection. People also really seem to like /proc/locks for debugging, so keeping that in is probably worthwhile.

Yes, there's already code that depends on it. The deadlock detection, though--I still wonder if we could get away with ripping it out. Might be worth at least giving an option to configure it out as a first step.

--b.

> There's more work to be done in this area and this patchset is just a start. There's a horrible thundering herd problem when a blocking lock is released, for instance. There was also interest in solving the goofy "unlock on any close" POSIX lock semantics at this year's LSF. I think this patchset will help lay the groundwork for those changes as well.
>
> Comments and suggestions welcome.
>
> Jeff Layton (11):
>   cifs: use posix_unblock_lock instead of locks_delete_block
>   locks: make generic_add_lease and generic_delete_lease static
>   locks: comment cleanups and clarifications
>   locks: make "added" in __posix_lock_file a bool
>   locks: encapsulate the fl_link list handling
>   locks: convert to i_lock to protect i_flock list
>   locks: only pull entries off of blocked_list when they are really unblocked
>   locks: convert fl_link to a hlist_node
>   locks: turn the blocked_list into a hashtable
>   locks: add a new "lm_owner_key" lock operation
>   locks: give the blocked_hash its own spinlock
>
>  Documentation/filesystems/Locking |   27 +++-
>  fs/afs/flock.c                    |    5 +-
Re: [PATCH v1 01/11] cifs: use posix_unblock_lock instead of locks_delete_block
On Fri, May 31, 2013 at 11:07:24PM -0400, Jeff Layton wrote: commit 66189be74 (CIFS: Fix VFS lock usage for oplocked files) exported the locks_delete_block symbol. There's already an exported helper function that provides this capability however, so make cifs use that instead and turn locks_delete_block back into a static function. Note that if fl->fl_next == NULL then this lock has already been through locks_delete_block(), so we should be OK to ignore an ENOENT error here and simply not retry the lock.

ACK. --b.

Cc: Pavel Shilovsky piastr...@gmail.com
Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/cifs/file.c     | 2 +-
 fs/locks.c         | 3 +--
 include/linux/fs.h | 5 -----
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 48b29d2..44a4f18 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -999,7 +999,7 @@ try_again:
 		rc = wait_event_interruptible(flock->fl_wait, !flock->fl_next);
 		if (!rc)
 			goto try_again;
-		locks_delete_block(flock);
+		posix_unblock_lock(file, flock);
 	}
 	return rc;
 }

diff --git a/fs/locks.c b/fs/locks.c
index cb424a4..7a02064 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -496,13 +496,12 @@ static void __locks_delete_block(struct file_lock *waiter)
 /* */
-void locks_delete_block(struct file_lock *waiter)
+static void locks_delete_block(struct file_lock *waiter)
 {
 	lock_flocks();
 	__locks_delete_block(waiter);
 	unlock_flocks();
 }
-EXPORT_SYMBOL(locks_delete_block);

 /* Insert waiter into blocker's block list.
  * We use a circular list so that processes can be easily woken up in

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 43db02e..b9d7816 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1006,7 +1006,6 @@ extern int vfs_setlease(struct file *, long, struct file_lock **);
 extern int lease_modify(struct file_lock **, int);
 extern int lock_may_read(struct inode *, loff_t start, unsigned long count);
 extern int lock_may_write(struct inode *, loff_t start, unsigned long count);
-extern void locks_delete_block(struct file_lock *waiter);
 extern void lock_flocks(void);
 extern void unlock_flocks(void);
 #else /* !CONFIG_FILE_LOCKING */
@@ -1151,10 +1150,6 @@ static inline int lock_may_write(struct inode *inode, loff_t start,
 	return 1;
 }
-static inline void locks_delete_block(struct file_lock *waiter)
-{
-}
-
 static inline void lock_flocks(void)
 {
 }
--
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 02/11] locks: make generic_add_lease and generic_delete_lease static
On Fri, May 31, 2013 at 11:07:25PM -0400, Jeff Layton wrote: Signed-off-by: Jeff Layton jlay...@redhat.com

ACK. --b.

---
 fs/locks.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 7a02064..e3140b8 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1337,7 +1337,7 @@ int fcntl_getlease(struct file *filp)
 	return type;
 }
-int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
+static int generic_add_lease(struct file *filp, long arg, struct file_lock **flp)
 {
 	struct file_lock *fl, **before, **my_before = NULL, *lease;
 	struct dentry *dentry = filp->f_path.dentry;
@@ -1402,7 +1402,7 @@ out:
 	return error;
 }
-int generic_delete_lease(struct file *filp, struct file_lock **flp)
+static int generic_delete_lease(struct file *filp, struct file_lock **flp)
 {
 	struct file_lock *fl, **before;
 	struct dentry *dentry = filp->f_path.dentry;
--
1.7.1
Re: [PATCH v1 03/11] locks: comment cleanups and clarifications
On Fri, May 31, 2013 at 11:07:26PM -0400, Jeff Layton wrote: Signed-off-by: Jeff Layton jlay...@redhat.com
---
 fs/locks.c         | 24 +++-
 include/linux/fs.h | 6 ++
 2 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index e3140b8..a7d2253 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -150,6 +150,16 @@ static int target_leasetype(struct file_lock *fl)
 int leases_enable = 1;
 int lease_break_time = 45;
+/*
+ * The i_flock list is ordered by:
+ *
+ * 1) lock type -- FL_LEASEs first, then FL_FLOCK, and finally FL_POSIX
+ * 2) lock owner
+ * 3) lock range start
+ * 4) lock range end
+ *
+ * Obviously, the last two criteria only matter for POSIX locks.
+ */

Thanks, yes, that needs documenting! Though I wonder if this is the place people will look for it.

 #define for_each_lock(inode, lockp) \
 	for (lockp = &inode->i_flock; *lockp != NULL; lockp = &(*lockp)->fl_next)
@@ -806,6 +816,11 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 	}
 	lock_flocks();
+	/*
+	 * New lock request. Walk all POSIX locks and look for conflicts. If
+	 * there are any, either return -EAGAIN or put the request on the
+	 * blocker's list of waiters.
+	 */

This though, seems a) not 100% accurate (it could also return EDEADLCK, for example), b) mostly redundant with respect to the following code.

 	if (request->fl_type != F_UNLCK) {
 		for_each_lock(inode, before) {
 			fl = *before;
@@ -844,7 +859,7 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 		before = &fl->fl_next;
 	}
-	/* Process locks with this owner. */
+	/* Process locks with this owner. */
 	while ((fl = *before) && posix_same_owner(request, fl)) {
 		/* Detect adjacent or overlapping regions (if same lock type) */
@@ -930,10 +945,9 @@ static int __posix_lock_file(struct inode *inode, struct file_lock *request, str
 	}
 	/*
-	 * The above code only modifies existing locks in case of
-	 * merging or replacing. If new lock(s) need to be inserted
-	 * all modifications are done bellow this, so it's safe yet to
-	 * bail out.
+	 * The above code only modifies existing locks in case of merging or
+	 * replacing. If new lock(s) need to be inserted all modifications are
+	 * done below this, so it's safe yet to bail out.
 	 */
 	error = -ENOLCK; /* "no luck" */
 	if (right && left == right && !new_fl2)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b9d7816..ae377e9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -926,6 +926,12 @@ int locks_in_grace(struct net *);
 /* that will die - we need it for nfs_lock_info */
 #include <linux/nfs_fs_i.h>
+/*
+ * struct file_lock represents a generic "file lock". It's used to represent
+ * POSIX byte range locks, BSD (flock) locks, and leases. It's important to
+ * note that the same struct is used to represent both a request for a lock and
+ * the lock itself, but the same object is never used for both.

Yes, and I do find that confusing. I wonder if there's a sensible way to use separate structs for the different uses. --b.

+ */
 struct file_lock {
 	struct file_lock *fl_next;	/* singly linked list for this inode */
 	struct list_head fl_link;	/* doubly linked list of all locks */
--
1.7.1
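The i_flock ordering documented in this patch (leases first, then flock locks, then POSIX locks; POSIX locks further ordered by owner, then range start, then range end) can be restated as a comparator. This is an illustration over an invented, simplified lock record, not the kernel's struct file_lock:

```c
/* Sketch of the i_flock sort order; all names here are invented. */
#include <assert.h>

enum lk_type { LK_LEASE = 0, LK_FLOCK = 1, LK_POSIX = 2 };

struct lk {
	enum lk_type type;
	long owner;	/* only meaningful for POSIX locks */
	long start;	/* range start, POSIX only */
	long end;	/* range end, POSIX only */
};

/* negative: a sorts before b; zero: no ordering between them */
static int lk_cmp(const struct lk *a, const struct lk *b)
{
	if (a->type != b->type)			/* 1) lock type */
		return a->type < b->type ? -1 : 1;
	if (a->type != LK_POSIX)		/* only POSIX orders further */
		return 0;
	if (a->owner != b->owner)		/* 2) lock owner */
		return a->owner < b->owner ? -1 : 1;
	if (a->start != b->start)		/* 3) range start */
		return a->start < b->start ? -1 : 1;
	if (a->end != b->end)			/* 4) range end */
		return a->end < b->end ? -1 : 1;
	return 0;
}
```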
[PATCH 1/2] mds: initialize some member variables of MDCache
From: Yan, Zheng zheng.z@intel.com

I added some member variables to class MDCache, but forgot to initialize them.

Fixes: #5236
Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 src/mds/MDCache.cc | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/mds/MDCache.cc b/src/mds/MDCache.cc
index 8c17172..e2ecba8 100644
--- a/src/mds/MDCache.cc
+++ b/src/mds/MDCache.cc
@@ -147,6 +147,7 @@ MDCache::MDCache(MDS *m)
     (0.9 *(g_conf->osd_max_write_size << 20));
   discover_last_tid = 0;
+  open_ino_last_tid = 0;
   find_ino_peer_last_tid = 0;
   last_cap_id = 0;
@@ -155,6 +156,10 @@ MDCache::MDCache(MDS *m)
   client_lease_durations[1] = 30.0;
   client_lease_durations[2] = 300.0;
+  resolves_pending = false;
+  rejoins_pending = false;
+  cap_imports_num_opening = 0;
+
   opening_root = open = false;
   lru.lru_set_max(g_conf->mds_cache_size);
   lru.lru_set_midpoint(g_conf->mds_cache_mid);
--
1.8.1.4
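The bug class fixed above is easy to reintroduce: when members are added to a large state object, a hand-written constructor that assigns each field individually can silently miss the new ones. A C sketch of the defensive alternative, with invented names loosely echoing the patch:

```c
/*
 * Zero-initializing the whole struct ("= {0}") means newly added members
 * start out 0/false even if nobody remembers to update an init routine.
 * All names here are illustrative, not the MDCache members themselves.
 */
#include <stdbool.h>
#include <stdint.h>

struct cache_state {
	uint64_t discover_last_tid;
	uint64_t open_ino_last_tid;	/* imagine this was added later */
	bool resolves_pending;		/* ...and this */
	bool rejoins_pending;		/* ...and this */
};

static struct cache_state cache_state_init(void)
{
	struct cache_state s = {0};	/* every member zeroed, old and new */
	return s;
}
```

C++ offers the same guarantee via value-initialization or default member initializers in the class definition.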
[PATCH 7/9] ceph: check migrate seq before changing auth cap
From: Yan, Zheng zheng.z@intel.com

We may receive an old request reply from the exporter MDS after receiving the importer MDS' cap import message.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 54c290b..790f88b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -612,9 +612,11 @@ retry:
 		__cap_delay_requeue(mdsc, ci);
 	}
-	if (flags & CEPH_CAP_FLAG_AUTH)
-		ci->i_auth_cap = cap;
-	else if (ci->i_auth_cap == cap) {
+	if (flags & CEPH_CAP_FLAG_AUTH) {
+		if (ci->i_auth_cap == NULL ||
+		    ceph_seq_cmp(ci->i_auth_cap->mseq, mseq) < 0)
+			ci->i_auth_cap = cap;
+	} else if (ci->i_auth_cap == cap) {
 		ci->i_auth_cap = NULL;
 		spin_lock(&mdsc->cap_dirty_lock);
 		if (!list_empty(&ci->i_dirty_item)) {
--
1.8.1.4
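The ceph_seq_cmp() call above is a wraparound-safe ("serial number") comparison of 32-bit sequence counters: it stays correct when the counter wraps past 2^32. A userspace sketch of that style of comparison (the exact kernel helper may differ in detail):

```c
/*
 * Wraparound-safe comparison of 32-bit sequence numbers: the sign of the
 * difference, interpreted as signed, gives the ordering mod 2^32.
 */
#include <stdint.h>

static int seq_cmp(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b);	/* <0: a before b; >0: a after b */
}
```

This is why the patch can safely decide whether an incoming migrate_seq is newer than the one already recorded, even across counter wrap.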
[PATCH 0/9] fixes for kclient
From: Yan, Zheng zheng.z@intel.com

This patch series is also available in:
git://github.com/ukernel/linux.git wip-ceph

Regards
Yan, Zheng
[PATCH 2/9] libceph: call r_unsafe_callback when unsafe reply is received
From: Yan, Zheng zheng.z@intel.com

We can't use !req->r_sent to check if an OSD request is sent for the first time, because __cancel_request() zeros req->r_sent when the OSD map changes. Rather than adding a new variable to ceph_osd_request to indicate if it's sent for the first time, we can call the unsafe callback only when an unsafe OSD reply is received. If the OSD's first reply is safe, just skip calling the unsafe callback.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 net/ceph/osd_client.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 536c0e5..6972d17 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1338,10 +1338,6 @@ static void __send_request(struct ceph_osd_client *osdc,
 	ceph_msg_get(req->r_request); /* send consumes a ref */
-	/* Mark the request unsafe if this is the first time it's being sent. */
-
-	if (!req->r_sent && req->r_unsafe_callback)
-		req->r_unsafe_callback(req, true);
 	req->r_sent = req->r_osd->o_incarnation;
 	ceph_con_send(&req->r_osd->o_con, req->r_request);
@@ -1432,8 +1428,6 @@ static void handle_osds_timeout(struct work_struct *work)
 static void complete_request(struct ceph_osd_request *req)
 {
-	if (req->r_unsafe_callback)
-		req->r_unsafe_callback(req, false);
 	complete_all(&req->r_safe_completion);  /* fsync waiter */
 }
@@ -1560,14 +1554,20 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
 	mutex_unlock(&osdc->request_mutex);

 	if (!already_completed) {
+		if (req->r_unsafe_callback &&
+		    result >= 0 && !(flags & CEPH_OSD_FLAG_ONDISK))
+			req->r_unsafe_callback(req, true);
 		if (req->r_callback)
 			req->r_callback(req, msg);
 		else
 			complete_all(&req->r_completion);
 	}
-	if (flags & CEPH_OSD_FLAG_ONDISK)
+	if (flags & CEPH_OSD_FLAG_ONDISK) {
+		if (req->r_unsafe_callback && already_completed)
+			req->r_unsafe_callback(req, false);
 		complete_request(req);
+	}
 done:
 	dout("req=%p req->r_linger=%d\n", req, req->r_linger);
--
1.8.1.4
[PATCH 4/9] ceph: fix cap release race
From: Yan, Zheng zheng.z@intel.com

ceph_encode_inode_release() can race with ceph_open() and release caps wanted by open files. So it should call __ceph_caps_wanted() to get the wanted caps.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 22 ++++++++++----------
 1 file changed, 10 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index da0f9b8..54c290b 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3042,21 +3042,19 @@ int ceph_encode_inode_release(void **p, struct inode *inode,
 	    (cap->issued & unless) == 0)) {
 		if ((cap->issued & drop) &&
 		    (cap->issued & unless) == 0) {
-			dout("encode_inode_release %p cap %p %s -> "
-			     "%s\n", inode, cap,
+			int wanted = __ceph_caps_wanted(ci);
+			if ((ci->i_ceph_flags & CEPH_I_NODELAY) == 0)
+				wanted |= cap->mds_wanted;
+			dout("encode_inode_release %p cap %p "
+			     "%s -> %s, wanted %s -> %s\n", inode, cap,
 			     ceph_cap_string(cap->issued),
-			     ceph_cap_string(cap->issued & ~drop));
+			     ceph_cap_string(cap->issued & ~drop),
+			     ceph_cap_string(cap->mds_wanted),
+			     ceph_cap_string(wanted));
+
 			cap->issued &= ~drop;
 			cap->implemented &= ~drop;
-			if (ci->i_ceph_flags & CEPH_I_NODELAY) {
-				int wanted = __ceph_caps_wanted(ci);
-				dout("  wanted %s -> %s (act %s)\n",
-				     ceph_cap_string(cap->mds_wanted),
-				     ceph_cap_string(cap->mds_wanted &
-						     ~wanted),
-				     ceph_cap_string(wanted));
-				cap->mds_wanted = wanted;
-			}
+			cap->mds_wanted = wanted;
 		} else {
 			dout("encode_inode_release %p cap %p %s (force)\n",
 			     inode, cap,
--
1.8.1.4
[PATCH 5/9] ceph: reset iov_len when discarding cap release messages
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4f22671..e2d7e56 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1391,6 +1391,7 @@ static void discard_cap_releases(struct ceph_mds_client *mdsc,
 	num = le32_to_cpu(head->num);
 	dout("discard_cap_releases mds%d %p %u\n", session->s_mds, msg, num);
 	head->num = cpu_to_le32(0);
+	msg->front.iov_len = sizeof(*head);
 	session->s_num_cap_releases += num;

 	/* requeue completed messages */
--
1.8.1.4
[PATCH 8/9] ceph: clear migrate seq when MDS restarts
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index e2d7e56..ce7a789 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2455,6 +2455,7 @@ static int encode_caps_cb(struct inode *inode, struct ceph_cap *cap,
 	spin_lock(&ci->i_ceph_lock);
 	cap->seq = 0;        /* reset cap seq */
 	cap->issue_seq = 0;  /* and issue_seq */
+	cap->mseq = 0;       /* and migrate_seq */

 	if (recon_state->flock) {
 		rec.v2.cap_id = cpu_to_le64(cap->cap_id);
--
1.8.1.4
[PATCH 3/9] libceph: fix truncate size calculation
From: Yan, Zheng zheng.z@intel.com

Check the "not truncated yet" case.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 net/ceph/osd_client.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 6972d17..93efdfb 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -733,12 +733,14 @@ struct ceph_osd_request *ceph_osdc_new_request(struct ceph_osd_client *osdc,
 	object_size = le32_to_cpu(layout->fl_object_size);
 	object_base = off - objoff;
-	if (truncate_size <= object_base) {
-		truncate_size = 0;
-	} else {
-		truncate_size -= object_base;
-		if (truncate_size > object_size)
-			truncate_size = object_size;
+	if (!(truncate_seq == 1 && truncate_size == -1ULL)) {
+		if (truncate_size <= object_base) {
+			truncate_size = 0;
+		} else {
+			truncate_size -= object_base;
+			if (truncate_size > object_size)
+				truncate_size = object_size;
+		}
 	}

 	osd_req_op_extent_init(req, 0, opcode, objoff, objlen,
--
1.8.1.4
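The fixed calculation above maps a file-level truncate_size into an object-relative one, except in the "not truncated yet" sentinel case (truncate_seq == 1, truncate_size == -1). A standalone userspace restatement, names mirroring the patch but purely illustrative:

```c
/*
 * Sketch of the per-object truncate_size clamping after the fix.
 * object_base is the file offset where this object begins; object_size
 * is the object's full size.
 */
#include <stdint.h>

static uint64_t object_truncate_size(uint32_t truncate_seq,
				     uint64_t truncate_size,
				     uint64_t object_base,
				     uint64_t object_size)
{
	/* "not truncated yet": leave the sentinel value untouched */
	if (truncate_seq == 1 && truncate_size == UINT64_MAX)
		return truncate_size;

	if (truncate_size <= object_base)
		return 0;			/* object entirely truncated */
	truncate_size -= object_base;		/* make object-relative */
	if (truncate_size > object_size)
		truncate_size = object_size;	/* clamp to object bounds */
	return truncate_size;
}
```

The pre-fix code ran the clamping unconditionally, so the -1 sentinel was clamped to object_size and the "never truncated" state was lost.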
[PATCH 1/9] libceph: fix safe completion
From: Yan, Zheng zheng.z@intel.com

handle_reply() calls complete_request() only if the first OSD reply has the ONDISK flag.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 include/linux/ceph/osd_client.h |  1 -
 net/ceph/osd_client.c           | 16 ++++++++--------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 186db0b..ce6df39 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -145,7 +145,6 @@ struct ceph_osd_request {
 	s32               r_reply_op_result[CEPH_OSD_MAX_OP];
 	int               r_got_reply;
 	int               r_linger;
-	int               r_completed;

 	struct ceph_osd_client *r_osdc;
 	struct kref       r_kref;

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index a3395fd..536c0e5 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1525,6 +1525,8 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
 	for (i = 0; i < numops; i++)
 		req->r_reply_op_result[i] = ceph_decode_32(&p);

+	already_completed = req->r_got_reply;
+
 	if (!req->r_got_reply) {
 		req->r_result = result;
@@ -1555,16 +1557,14 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
 	    ((flags & CEPH_OSD_FLAG_WRITE) == 0))
 		__unregister_request(osdc, req);

-	already_completed = req->r_completed;
-	req->r_completed = 1;
 	mutex_unlock(&osdc->request_mutex);

-	if (already_completed)
-		goto done;
-
-	if (req->r_callback)
-		req->r_callback(req, msg);
-	else
-		complete_all(&req->r_completion);
+	if (!already_completed) {
+		if (req->r_callback)
+			req->r_callback(req, msg);
+		else
+			complete_all(&req->r_completion);
+	}

 	if (flags & CEPH_OSD_FLAG_ONDISK)
 		complete_request(req);
--
1.8.1.4
[PATCH 9/9] ceph: move inode to proper flushing list when auth MDS changes
From: Yan, Zheng zheng.z@intel.com

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 790f88b..458a66e 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1982,8 +1982,14 @@ static void kick_flushing_inode_caps(struct ceph_mds_client *mdsc,
 	cap = ci->i_auth_cap;
 	dout("kick_flushing_inode_caps %p flushing %s flush_seq %lld\n", inode,
 	     ceph_cap_string(ci->i_flushing_caps), ci->i_cap_flush_seq);
+
 	__ceph_flush_snaps(ci, &session, 1);
+
 	if (ci->i_flushing_caps) {
+		spin_lock(&mdsc->cap_dirty_lock);
+		list_move_tail(&ci->i_flushing_item, &session->s_cap_flushing);
+		spin_unlock(&mdsc->cap_dirty_lock);
+
 		delayed = __send_cap(mdsc, cap, CEPH_CAP_OP_FLUSH,
 				     __ceph_caps_used(ci),
 				     __ceph_caps_wanted(ci),
--
1.8.1.4
[PATCH 6/9] ceph: fix race between page writeback and truncate
From: Yan, Zheng zheng.z@intel.com

The client can receive a truncate request from the MDS at any time, so the page writeback code needs to get i_size, truncate_seq and truncate_size atomically.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/addr.c | 84 ++++++++++++++++++++++-----------------------------
 1 file changed, 40 insertions(+), 44 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 3e68ac1..3500b74 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -438,13 +438,12 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	struct ceph_inode_info *ci;
 	struct ceph_fs_client *fsc;
 	struct ceph_osd_client *osdc;
-	loff_t page_off = page_offset(page);
-	int len = PAGE_CACHE_SIZE;
-	loff_t i_size;
-	int err = 0;
 	struct ceph_snap_context *snapc, *oldest;
-	u64 snap_size = 0;
+	loff_t page_off = page_offset(page);
 	long writeback_stat;
+	u64 truncate_size, snap_size = 0;
+	u32 truncate_seq;
+	int err = 0, len = PAGE_CACHE_SIZE;

 	dout("writepage %p idx %lu\n", page, page->index);
@@ -474,13 +473,20 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	}
 	ceph_put_snap_context(oldest);

+	spin_lock(&ci->i_ceph_lock);
+	truncate_seq = ci->i_truncate_seq;
+	truncate_size = ci->i_truncate_size;
+	if (!snap_size)
+		snap_size = i_size_read(inode);
+	spin_unlock(&ci->i_ceph_lock);
+
 	/* is this a partial page at end of file? */
-	if (snap_size)
-		i_size = snap_size;
-	else
-		i_size = i_size_read(inode);
-	if (i_size < page_off + len)
-		len = i_size - page_off;
+	if (page_off >= snap_size) {
+		dout("%p page eof %llu\n", page, snap_size);
+		goto out;
+	}
+	if (snap_size < page_off + len)
+		len = snap_size - page_off;

 	dout("writepage %p page %p index %lu on %llu~%u snapc %p\n",
 	     inode, page, page->index, page_off, len, snapc);
@@ -494,7 +500,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	err = ceph_osdc_writepages(osdc, ceph_vino(inode),
 				   ci->i_layout, snapc,
 				   page_off, len,
-				   ci->i_truncate_seq, ci->i_truncate_size,
+				   truncate_seq, truncate_size,
 				   &inode->i_mtime, &page, 1);
 	if (err < 0) {
 		dout("writepage setting page/mapping error %d %p\n", err, page);
@@ -631,25 +637,6 @@ static void writepages_finish(struct ceph_osd_request *req,
 	ceph_osdc_put_request(req);
 }

-static struct ceph_osd_request *
-ceph_writepages_osd_request(struct inode *inode, u64 offset, u64 *len,
-			    struct ceph_snap_context *snapc, int num_ops)
-{
-	struct ceph_fs_client *fsc;
-	struct ceph_inode_info *ci;
-	struct ceph_vino vino;
-
-	fsc = ceph_inode_to_client(inode);
-	ci = ceph_inode(inode);
-	vino = ceph_vino(inode);
-	/* BUG_ON(vino.snap != CEPH_NOSNAP); */
-
-	return ceph_osdc_new_request(&fsc->client->osdc, &ci->i_layout,
-			vino, offset, len, num_ops, CEPH_OSD_OP_WRITE,
-			CEPH_OSD_FLAG_WRITE|CEPH_OSD_FLAG_ONDISK,
-			snapc, ci->i_truncate_seq, ci->i_truncate_size, true);
-}
-
 /*
  * initiate async writeback
  */
@@ -658,7 +645,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 {
 	struct inode *inode = mapping->host;
 	struct ceph_inode_info *ci = ceph_inode(inode);
-	struct ceph_fs_client *fsc;
+	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
+	struct ceph_vino vino = ceph_vino(inode);
 	pgoff_t index, start, end;
 	int range_whole = 0;
 	int should_loop = 1;
@@ -670,7 +658,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 	unsigned wsize = 1 << inode->i_blkbits;
 	struct ceph_osd_request *req = NULL;
 	int do_sync;
-	u64 snap_size;
+	u64 truncate_size, snap_size;
+	u32 truncate_seq;

 	/*
 	 * Include a 'sync' in the OSD request if this is a data
@@ -685,7 +674,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
 	     (wbc->sync_mode == WB_SYNC_ALL ? "ALL" : "HOLD"));

-	fsc = ceph_inode_to_client(inode);
 	if (fsc->mount_state == CEPH_MOUNT_SHUTDOWN) {
 		pr_warning("writepage_start %p on forced umount\n", inode);
 		return -EIO; /* we're in a forced umount, don't write! */
@@ -728,6 +716,14 @@ retry:
 		snap_size = i_size_read(inode);
 	dout(" oldest snapc is %p seq %lld (%d snaps)\n",
 	     snapc, snapc->seq,
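The pattern this patch applies is general: when several related fields (here i_size, truncate_seq, truncate_size) change together under one lock, a reader must sample all of them inside a single critical section, never one at a time. A userspace sketch with pthreads, using invented names:

```c
/*
 * Take one consistent snapshot of related fields under a mutex, instead
 * of reading each field separately (which can mix values from before and
 * after a concurrent truncate). Names are illustrative only.
 */
#include <pthread.h>
#include <stdint.h>

struct inode_trunc {
	pthread_mutex_t lock;	/* stand-in for i_ceph_lock */
	uint64_t size;
	uint32_t truncate_seq;
	uint64_t truncate_size;
};

struct trunc_snap {
	uint64_t size;
	uint32_t truncate_seq;
	uint64_t truncate_size;
};

static struct trunc_snap snap_trunc(struct inode_trunc *in)
{
	struct trunc_snap s;

	pthread_mutex_lock(&in->lock);
	s.size = in->size;			/* all three fields are */
	s.truncate_seq = in->truncate_seq;	/* read in one critical  */
	s.truncate_size = in->truncate_size;	/* section               */
	pthread_mutex_unlock(&in->lock);
	return s;
}
```

The writeback code then works only with the snapshot, so a truncate arriving mid-writepage cannot produce a mixed view.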