RE: slow fio random read benchmark, need help

2012-11-01 Thread Dietmar Maurer
I do not really understand that network latency argument. If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph? Note: network latency is the same in both cases. What am I missing? -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-

[PATCH 2/3] mm: Only enforce stable page writes if the backing device requires it

2012-11-01 Thread Darrick J. Wong
Create a helper function to check if a backing device requires stable page writes and, if so, perform the necessary wait. Then, make it so that all points in the memory manager that handle making pages writable use the helper function. This should provide stable page write support to most

[RFC PATCH v2 0/3] mm/fs: Implement faster stable page writes on filesystems

2012-11-01 Thread Darrick J. Wong
Hi all, This patchset makes some key modifications to the original 'stable page writes' patchset. First, it provides users (devices and filesystems) of a backing_dev_info the ability to declare whether or not it is necessary to ensure that page contents cannot change during writeout, whereas the

Re: slow fio random read benchmark, need help

2012-11-01 Thread Stefan Priebe - Profihost AG
On 01.11.2012 08:38, Dietmar Maurer wrote: I do not really understand that network latency argument. If one can get 40K iops with iSCSI, why can't I get the same with rados/ceph? Note: network latency is the same in both cases. What am I missing? Good question. Also, I've seen 20k iops on ceph

[PATCH 1/2] ceph: Don't update i_max_size when handling non-auth cap

2012-11-01 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com The cap from non-auth mds doesn't have a meaningful max_size value. Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/caps.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index

[PATCH 1/2] mds: Don't acquire replica object's versionlock

2012-11-01 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Both CInode and CDentry's versionlocks are of type LocalLock. Acquiring LocalLock in replica object is useless and problematic. For example, if two requests try acquiring a replica object's versionlock, the first request succeeds, the second request is added

[PATCH 2/2] ceph: Fix i_size update race

2012-11-01 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com ceph_aio_write() has an optimization that marks cap CEPH_CAP_FILE_WR dirty before data is copied to page cache and inode size is updated. If ceph_check_caps() flushes the dirty cap before the inode size is updated, MDS can miss the new inode size. The fix is

[PATCH 2/2] mds: Allow try_eval to eval unstable locks in freezing object

2012-11-01 Thread Yan, Zheng
From: Yan, Zheng zheng.z@intel.com Unstable locks hold auth_pins on the object, which prevents a freezing object from becoming frozen and then unfreezing. So try_eval() should not wait for a freezing object. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Locker.cc | 4 ++-- 1 file changed,

Re: slow fio random read benchmark, need help

2012-11-01 Thread Gregory Farnum
I'm not sure that latency addition is quite correct. Most use cases do multiple IOs at the same time, and good benchmarks tend to reflect that. I suspect the IO limitations here are a result of QEMU's storage handling (or possibly our client layer) more than anything else; Josh can talk

Re: slow fio random read benchmark, need help

2012-11-01 Thread Stefan Priebe - Profihost AG
On 01.11.2012 11:40, Gregory Farnum wrote: I'm not sure that latency addition is quite correct. Most use cases do multiple IOs at the same time, and good benchmarks tend to reflect that. I suspect the IO limitations here are a result of QEMU's storage handling (or possibly our client

Re: [PATCH 5/6] rbd: get additional info in parent spec

2012-11-01 Thread Alex Elder
On 10/31/2012 08:49 PM, Josh Durgin wrote: I know you've got a queue of these already, but here's another: rbd_dev_probe_update_spec() could definitely use some warnings to distinguish its error cases. Reviewed-by: Josh Durgin josh.dur...@inktank.com Finally! I was going to accuse you of

Re: [PATCH 6/6] rbd: probe the parent of an image if present

2012-11-01 Thread Alex Elder
On 10/31/2012 09:07 PM, Josh Durgin wrote: This all makes sense, but it reminds me of another issue we'll need to address: http://www.tracker.newdream.net/issues/2533 I was not aware of that one. That's no good. We don't need to watch the header of a parent snapshot, since it's immutable

Re: [PATCH 1/3] bdi: Track users that require stable page writes

2012-11-01 Thread Jan Kara
On Thu 01-11-12 00:58:13, Darrick J. Wong wrote: This creates a per-backing-device counter that tracks the number of users which require pages to be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nobody requires stable pages. As I wrote

[PATCH 0/5] rbd: little cleanups

2012-11-01 Thread Alex Elder
These are a handful of fairly minor cleanup items I've been putting off sending out until some of the meatier stuff got done. -Alex [PATCH 1/5] rbd: document rbd_spec structure [PATCH 2/5] rbd: kill rbd_spec->image_name_len [PATCH 3/5] rbd: kill

[PATCH 1/5] rbd: document rbd_spec structure

2012-11-01 Thread Alex Elder
I promised Josh I would document whether there were any restrictions needed for accessing fields of an rbd_spec structure. This adds a big block of comments that documents the structure and how it is used--including the fact that we don't attempt to synchronize access to it. Signed-off-by: Alex

[PATCH 5/5] rbd: use kmemdup()

2012-11-01 Thread Alex Elder
This replaces two kmalloc()/memcpy() combinations with a single call to kmemdup(). Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c |7 ++- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 3378963..cf7b405

[PATCH] ceph: define ceph_encode_8_safe()

2012-11-01 Thread Alex Elder
It's kind of a silly macro, but ceph_encode_8_safe() is the only one missing from an otherwise pretty complete set. It's not used, but neither are a couple of the others in this set. While in there, insert some whitespace to tidy up the alignment of the line-terminating backslashes in some of

[PATCH] rbd: fix ceph_pg_poolid_by_name()

2012-11-01 Thread Alex Elder
Currently ceph_pg_poolid_by_name() returns an int, which is used to encode a ceph pool id. This could be a problem because a pool id (at least in some cases) is a 64-bit value. We have a defined pool id value that represents no pool, and that's a very sensible return value here. This patch
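The narrowing concern in this patch description can be illustrated with a small sketch. It only models what a typical two's-complement platform does when a 64-bit value is stored in a 32-bit signed int; the helper name `as_c_int` and the sentinel value `NO_POOL` are hypothetical and are not taken from the patch.

```python
def as_c_int(value: int) -> int:
    """Model storing `value` in a 32-bit signed int (two's-complement wrap assumed)."""
    value &= 0xFFFFFFFF                      # keep only the low 32 bits
    return value - 2**32 if value >= 2**31 else value

# Hypothetical "no pool" sentinel: a 64-bit value with every bit set.
NO_POOL = 2**64 - 1

print(as_c_int(7))          # a small pool id survives unchanged
print(as_c_int(2**32 + 7))  # the high bits are dropped, so this aliases pool 7
print(as_c_int(NO_POOL))    # the sentinel collapses to -1
```

Two distinct 64-bit ids can therefore map to the same int, which is the kind of ambiguity a 64-bit return type avoids.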

[PATCH 0/4] rbd: improve warnings

2012-11-01 Thread Alex Elder
This series adds a utility function rbd_warn() that will provide a central and unified way to generate warning messages from rbd. It then fleshes out some warning messages in a few areas. There is more to be done, but for now I'm just getting the mechanism and these initial uses of it in place.

[PATCH 1/4] rbd: define and use rbd_warn()

2012-11-01 Thread Alex Elder
Define a new function rbd_warn() that produces a boilerplate warning message, identifying in the resulting message the affected rbd device in the best way available. Use it in a few places that now use pr_warning(). Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 43

[PATCH 2/4] rbd: add warning messages for missing arguments

2012-11-01 Thread Alex Elder
Tell the user (via dmesg) what was wrong with the arguments provided via /sys/bus/rbd/add. Signed-off-by: Alex Elder el...@inktank.com --- drivers/block/rbd.c | 24 1 file changed, 16 insertions(+), 8 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c

[PATCH 3/4] rbd: add a warning in bio_chain_clone_range()

2012-11-01 Thread Alex Elder
Add a warning in bio_chain_clone_range() to help a user determine what exactly might have led to a failure. There is only one; please say something if you disagree with the following reasoning. There are three places this can return abnormally: - Initially, if there is nothing to clone. It

[PATCH 4/4] rbd: add warnings to rbd_dev_probe_update_spec()

2012-11-01 Thread Alex Elder
Josh suggested adding warnings to this function to help users diagnose problems. Other than memory allocation errors, there are two places where errors can be returned. Both represent problems that should have been caught earlier, and as such might well have been handled with BUG_ON() calls.

Re: slow fio random read benchmark, need help

2012-11-01 Thread Marcus Sorensen
In this case he's doing a direct random read, so the ios queue one at a time on his various multipath channels. He may have defined a depth that sends a bunch at once, but they still get split up; he could run a blktrace to verify. If they could merge he could maybe send multiples, or perhaps he

Re: slow fio random read benchmark, need help

2012-11-01 Thread Marcus Sorensen
Actually that didn't illustrate my point very well, since you see individual requests being sent to the driver without waiting for individual completion, but if you look at the full output you can see that once the queue is full, you're at the mercy of waiting for individual IOs to complete before

RE: slow fio random read benchmark, need help

2012-11-01 Thread Dietmar Maurer
For the record, I'm not saying that it's the entire reason why the performance is lower (obviously since iSCSI is better), I'm just saying that when you're talking about high iops, adding 100us (best case gigabit) to each request and response is significant. iSCSI also uses the network (also
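The latency argument in this thread can be put into numbers with Little's law (throughput = requests in flight / time per request). The sketch below is illustrative only: the 25 us base service time assumes the 40K iops figure was measured at queue depth 1, which the thread does not state, and the 200 us of added latency takes the "100us to each request and response" remark literally.

```python
def iops(queue_depth: int, latency_us: float) -> float:
    """Little's law: completed IOs per second for a given per-IO latency."""
    return queue_depth * 1_000_000 / latency_us

BASE_US = 25         # assumption: 1 s / 40,000 iops at queue depth 1
ADDED_US = 2 * 100   # assumption: 100 us on the request plus 100 us on the response

for depth in (1, 8, 32):
    without = round(iops(depth, BASE_US))
    with_added = round(iops(depth, BASE_US + ADDED_US))
    print(f"depth={depth:2d}  base={without:8d}  with added latency={with_added:8d}")
```

Under these assumptions the added latency dominates at queue depth 1, while a deeper queue recovers throughput as long as requests really are in flight concurrently, which is the point of contention between the posters.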

Re: [PATCH 1/3] bdi: Track users that require stable page writes

2012-11-01 Thread Boaz Harrosh
On 11/01/2012 12:58 AM, Darrick J. Wong wrote: This creates a per-backing-device counter that tracks the number of users which require pages to be held immutable during writeout. Eventually it will be used to waive wait_for_page_writeback() if nobody requires stable pages. There are two

Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

2012-11-01 Thread Boaz Harrosh
On 11/01/2012 12:58 AM, Darrick J. Wong wrote: Fix up the filesystems that provide their own ->page_mkwrite handlers to provide stable page writes if necessary. Signed-off-by: Darrick J. Wong darrick.w...@oracle.com --- fs/9p/vfs_file.c |1 + fs/afs/write.c |4 ++--

Re: Need CRYPTO_CXXFLAGS with latest master?

2012-11-01 Thread Gary Lowell
Hi Noah - What platform are you building on, and are you building with nss or cryptopp? Thanks, Gary On Oct 31, 2012, at 8:22 PM, Noah Watkins wrote: Whoops, here is the original error: CXX test_idempotent_sequence.o In file included from ./os/LFNIndex.h:27:0, from

Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

2012-11-01 Thread Jeff Layton
On Thu, 1 Nov 2012 11:43:26 -0700 Boaz Harrosh bharr...@panasas.com wrote: On 11/01/2012 12:58 AM, Darrick J. Wong wrote: Fix up the filesystems that provide their own ->page_mkwrite handlers to provide stable page writes if necessary. Signed-off-by: Darrick J. Wong

Re: Ceph journal

2012-11-01 Thread Mark Nelson
On 11/01/2012 04:18 PM, Gandalf Corvotempesta wrote: 2012/10/31 Stefan Kleijkers ste...@kleijkers.nl: As far as I know, this is correct. You get an ACK (on the write) back after it landed on ALL three journals (or/and osds in case of BTRFS in parallel mode). So if you lose one node, you still
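If the acknowledgement really is sent only after every journal has the write, as the quoted poster believes, then the client-visible write latency is set by the slowest replica. A toy sketch of that reading follows; the function name and the latency figures are made up for illustration.

```python
def ack_latency_ms(journal_latencies_ms):
    """Time until the ACK, assuming it waits for every replica's journal write."""
    return max(journal_latencies_ms)

# Hypothetical per-journal commit times for one write on a 3-replica pool.
replicas = [1.2, 0.9, 4.0]
print(ack_latency_ms(replicas))  # the slowest journal decides when the client sees the ACK
```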

Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

2012-11-01 Thread Boaz Harrosh
On 11/01/2012 01:22 PM, Jeff Layton wrote: Hmm...I don't know... I've never been crazy about using the page lock for this, but in the absence of a better way to guarantee stable pages, it was what I ended up with at the time. cifs_writepages will hold the page lock until kernel_sendmsg

Re: Cephfs losing files and corrupting others

2012-11-01 Thread Sam Lang
On Thu 01 Nov 2012 11:22:59 AM CDT, Nathan Howell wrote: We have a small (3 node) Ceph cluster that occasionally has issues. It loses files and directories, truncates them or fills the contents with NULL bytes. So far we haven't been able to build a repro case but it seems to happen when bulk

Assertion failure in ceph_readlink()

2012-11-01 Thread Noah Watkins
I'm getting the following assertion failure when running a test that creates a symlink and then tries to read it using ceph_readlink(). This is the failure, and the test is shown below (and is in wip-java-symlinks). Also note that if the test below is altered to use relative paths for both

Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

2012-11-01 Thread Darrick J. Wong
On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote: On Thu, 1 Nov 2012 11:43:26 -0700 Boaz Harrosh bharr...@panasas.com wrote: On 11/01/2012 12:58 AM, Darrick J. Wong wrote: Fix up the filesystems that provide their own ->page_mkwrite handlers to provide stable page writes if

Re: [PATCH 1/3] bdi: Track users that require stable page writes

2012-11-01 Thread Boaz Harrosh
On 11/01/2012 11:57 AM, Darrick J. Wong wrote: On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote: On 11/01/2012 12:58 AM, Darrick J. Wong wrote: This creates a per-backing-device counter that tracks the number of users which require pages to be held immutable during writeout.

Re: Cephfs losing files and corrupting others

2012-11-01 Thread Gregory Farnum
On Thu, Nov 1, 2012 at 11:32 PM, Sam Lang sam.l...@inktank.com wrote: On Thu 01 Nov 2012 11:22:59 AM CDT, Nathan Howell wrote: We have a small (3 node) Ceph cluster that occasionally has issues. It loses files and directories, truncates them or fills the contents with NULL bytes. So far we

Re: Assertion failure in ceph_readlink()

2012-11-01 Thread Sam Lang
On 11/01/2012 05:38 PM, Noah Watkins wrote: I'm getting the following assertion failure when running a test that creates a symlink and then tries to read it using ceph_readlink(). This is the failure, and the test is shown below (and is in wip-java-symlinks). Also note that if the test below

Re: [PATCH 1/3] bdi: Track users that require stable page writes

2012-11-01 Thread Jan Kara
On Thu 01-11-12 15:56:34, Boaz Harrosh wrote: On 11/01/2012 11:57 AM, Darrick J. Wong wrote: On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote: On 11/01/2012 12:58 AM, Darrick J. Wong wrote: This creates a per-backing-device counter that tracks the number of users which

Re: Assertion failure in ceph_readlink()

2012-11-01 Thread Noah Watkins
filepath path(relpath); Inode *in; - int r = path_walk(path, in); + int r = path_walk(path, in, false); if (r < 0) return r; Fixes both cases. Thanks! -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org

Re: Cephfs losing files and corrupting others

2012-11-01 Thread Nathan Howell
On Thu, Nov 1, 2012 at 3:32 PM, Sam Lang sam.l...@inktank.com wrote: Do the writes succeed? I.e. the programs creating the files don't get errors back? Are you seeing any problems with the ceph mds or osd processes crashing? Can you describe your I/O workload during these bulk loads? How

RBD trim / unmap support?

2012-11-01 Thread Stefan Priebe
Hello list, does rbd support trim / unmap? Or is it planned to support it? Greets, Stefan

Re: RBD trim / unmap support?

2012-11-01 Thread Josh Durgin
On 11/01/2012 04:33 PM, Stefan Priebe wrote: Hello list, does rbd support trim / unmap? Or is it planned to support it? Greets, Stefan librbd (and thus qemu) support it. The rbd kernel module does not yet. See http://ceph.com/docs/master/rbd/qemu-rbd/#enabling-discard-trim Josh

Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

2012-11-01 Thread Jeff Layton
On Thu, 1 Nov 2012 15:47:30 -0700 Darrick J. Wong darrick.w...@oracle.com wrote: On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote: On Thu, 1 Nov 2012 11:43:26 -0700 Boaz Harrosh bharr...@panasas.com wrote: On 11/01/2012 12:58 AM, Darrick J. Wong wrote: Fix up the

Re: Assertion failure in ceph_readlink()

2012-11-01 Thread Sam Lang
On 11/01/2012 06:22 PM, Noah Watkins wrote: filepath path(relpath); Inode *in; - int r = path_walk(path, in); + int r = path_walk(path, in, false); if (r < 0) return r; Fixes both cases. Thanks! I discovered a few more bugs in path_walk() for the symlink case while

Re: Cephfs losing files and corrupting others

2012-11-01 Thread Yan, Zheng
On Fri, Nov 2, 2012 at 7:30 AM, Nathan Howell nathan.d.how...@gmail.com wrote: On Thu, Nov 1, 2012 at 3:32 PM, Sam Lang sam.l...@inktank.com wrote: Do the writes succeed? I.e. the programs creating the files don't get errors back? Are you seeing any problems with the ceph mds or osd processes

Re: Need CRYPTO_CXXFLAGS with latest master?

2012-11-01 Thread Noah Watkins
I guess I'm going to have to retract this problem, as I can't reproduce it today. No clue what happened :) On Thu, Nov 1, 2012 at 12:05 PM, Gary Lowell gary.low...@inktank.com wrote: Hi Noah - What platform are you building on, and are you building with nss or cryptopp? Thanks, Gary

Re: Need CRYPTO_CXXFLAGS with latest master?

2012-11-01 Thread Gary Lowell
Let me know if you see the problem again. It's probably something in the autotools dependencies resolution. Cheers, Gary On Nov 1, 2012, at 8:54 PM, Noah Watkins wrote: I guess I'm going to have to retract this problem, as I can't reproduce it today. No clue what happened :) On Thu, Nov