Re: [PATCH 04/39] mds: make sure table request id unique

2013-03-21 Thread Gregory Farnum
On Thu, Mar 21, 2013 at 1:07 AM, Yan, Zheng zheng.z@intel.com wrote: On 03/21/2013 02:31 AM, Greg Farnum wrote: On Tuesday, March 19, 2013 at 11:49 PM, Yan, Zheng wrote: On 03/20/2013 02:15 PM, Sage Weil wrote: On Wed, 20 Mar 2013, Yan, Zheng wrote: On 03/20/2013 07:09 AM, Greg Farnum

Re: [PATCH 12/39] mds: compose and send resolve messages in batch

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com Software Engineer #42 @ http://inktank.com | http://ceph.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Resolve messages for all MDS are the same, so we can compose and send them in

Re: [PATCH 13/39] mds: don't send resolve message between active MDS

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com When MDS cluster is resolving, current behavior is sending subtree resolve message to all other MDS and waiting for all other MDS' resolve message. The problem is that active MDS

Re: [PATCH 14/39] mds: set resolve/rejoin gather MDS set in advance

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com For active MDS, it may receive resolve/resolve message before receiving resolve/rejoin, maybe? Other than that, Reviewed-by: Greg Farnum g...@inktank.com the mdsmap message that

Re: [PATCH 15/39] mds: don't send MDentry{Link,Unlink} before receiving cache rejoin

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com The active MDS calls MDCache::rejoin_scour_survivor_replicas() when it receives the cache rejoin message. The function will remove the

Re: [PATCH 16/39] mds: send cache rejoin messages after gathering all resolves

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 10 ++ src/mds/MDCache.h | 5 + 2 files changed,

Re: [PATCH 17/39] mds: send resolve acks after master updates are safely logged

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 33 + src/mds/MDCache.h | 7

Re: [PATCH 19/39] mds: remove MDCache::rejoin_fetch_dirfrags()

2013-03-20 Thread Gregory Farnum
Nice. Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com In commit 77946dcdae (mds: fetch missing inodes from disk), I introduced MDCache::rejoin_fetch_dirfrags(). But it basically duplicates

Re: [PATCH 20/39] mds: include replica nonce in MMDSCacheRejoin::inode_strong

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com So the recovering MDS can properly handle cache expire messages. Also increase the nonce value when sending the cache rejoin acks. Signed-off-by: Yan, Zheng zheng.z@intel.com

Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack

2013-03-20 Thread Gregory Farnum
This needs to handle versioning the encoding based on peer feature bits too. On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Cache rejoin ack message already encodes inode base, make it also encode dirfrag base. This allows the

Re: [PATCH 21/39] mds: encode dirfrag base in cache rejoin ack

2013-03-20 Thread Gregory Farnum
On Wed, Mar 20, 2013 at 4:33 PM, Gregory Farnum g...@inktank.com wrote: This needs to handle versioning the encoding based on peer feature bits too. On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: + void add_dirfrag_base(CDir *dir) { +::encode(dir->dirfrag

Re: [PATCH 23/39] mds: reqid for rejoinning authpin/wrlock need to be list

2013-03-20 Thread Gregory Farnum
I think Sage is right, we can just bump the MDS protocol instead of spending a feature bit on OTW changes — but this is another message we should update to the new encoding macros while we're making that bump. The rest looks good! -Greg On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng

Re: [PATCH 24/39] mds: take object's versionlock when rejoinning xlock

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 12 1 file changed, 12 insertions(+) diff --git

Re: [PATCH 25/39] mds: share inode max size after MDS recovers

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com The MDS may crash after journaling the new max size, but before sending the new max size to the client. Later when the MDS recovers, the

Re: [PATCH 26/39] mds: issue caps when lock state in replica become SYNC

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com because client can request READ caps from non-auth MDS. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Locker.cc | 2 ++

Re: [PATCH 27/39] mds: send lock action message when auth MDS is in proper state.

2013-03-20 Thread Gregory Farnum
On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com For rejoining object, don't send lock ACK message because lock states are still uncertain. The lock ACK may confuse object's auth MDS and trigger assertion. If object's auth MDS

Re: [PATCH 28/39] mds: add dirty imported dirfrag to LogSegment

2013-03-20 Thread Gregory Farnum
Whoops! Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/CDir.cc | 7 +-- src/mds/CDir.h | 2 +-

Re: [PATCH 29/39] mds: avoid double auth pin for file recovery

2013-03-20 Thread Gregory Farnum
This looks good on its face but I haven't had the chance to dig through the recovery queue stuff yet (it's on my list following some issues with recovery speed). How'd you run across this? If it's being added to the recovery queue multiple times I want to make sure we don't have some other

Re: [PATCH 30/39] mds: check MDS peer's state through mdsmap

2013-03-20 Thread Gregory Farnum
Yep. Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Migrator.cc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)

Re: [PATCH 31/39] mds: unfreeze subtree if import aborts in PREPPED state

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Migrator.cc | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-)

Re: [PATCH 32/39] mds: fix export cancel notification

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com The comment says that if the importer is dead, bystanders thinks the exporter is the only auth, as per mdcache->handle_mds_failure(). But

Re: [PATCH 33/39] mds: notify bystanders if export aborts

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com So bystanders know the subtree is single auth earlier. Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Migrator.cc | 34

Re: [PATCH 34/39] mds: don't open dirfrag while subtree is frozen

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)

Re: [PATCH 35/39] mds: clear dirty inode rstat if import fails

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/CDir.cc | 1 + src/mds/Migrator.cc | 2 ++ 2 files changed, 3

Re: [PATCH 36/39] mds: try merging subtree after clear EXPORTBOUND

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/Migrator.cc | 8 1 file changed, 4 insertions(+), 4 deletions(-)

Re: [PATCH 37/39] mds: eval inodes with caps imported by cache rejoin message

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com Signed-off-by: Yan, Zheng zheng.z@intel.com --- src/mds/MDCache.cc | 1 + 1 file changed, 1 insertion(+) diff --git

Re: [PATCH 38/39] mds: don't replicate purging dentry

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com open_remote_ino is racy, it's possible someone deletes the inode's last linkage while the MDS is discovering the inode. Signed-off-by:

Re: [PATCH 39/39] mds: clear scatter dirty if replica inode has no auth subtree

2013-03-20 Thread Gregory Farnum
Reviewed-by: Greg Farnum g...@inktank.com On Sun, Mar 17, 2013 at 7:51 AM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com This avoids sending superfluous scatterlock state to recovering MDS Signed-off-by: Yan, Zheng zheng.z@intel.com ---

Re: [PATCH 2/2] ceph: use i_release_count to indicate dir's completeness

2013-03-12 Thread Gregory Farnum
On Tuesday, March 12, 2013 at 9:50 PM, Yan, Zheng wrote: On 03/13/2013 09:24 AM, Greg Farnum wrote: On Monday, March 11, 2013 at 5:42 AM, Yan, Zheng wrote: From: Yan, Zheng zheng.z@intel.com (mailto:zheng.z@intel.com) Current ceph code tracks directory's completeness in two

Re: When ceph synchronizes journal to disk?

2013-03-04 Thread Gregory Farnum
On Sun, Mar 3, 2013 at 4:36 AM, Xing Lin xing...@cs.utah.edu wrote: Hi, There were some discussions about this before on the mailing list but I am still confused with this. I thought Ceph would flush data from the journal to disk when either the journal is full or when the time to do
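For reference, a quick way to see the sync settings this thread asks about on a live OSD is the admin socket; a minimal sketch, where the socket path is an assumption to adjust for your cluster and Ceph version:

    # show the filestore sync-interval options currently in effect on osd.0
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep 'filestore.*sync_interval'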

Re: [PATCH 6/7] ceph: don't early drop Fw cap

2013-03-04 Thread Gregory Farnum
On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com ceph_aio_write() has an optimization that marks CEPH_CAP_FILE_WR cap dirty before data is copied to page cache and inode size is updated. The optimization avoids slow cap

Re: [PATCH 0/7] ceph: misc fixes

2013-03-04 Thread Gregory Farnum
On Thu, Feb 28, 2013 at 10:46 PM, Yan, Zheng zheng.z@intel.com wrote: From: Yan, Zheng zheng.z@intel.com These patches are also in: git://github.com/ukernel/linux.git wip-ceph 1, 2, 3, 5, 7 all look good to me. If you can double-check Sage's concerns on 4 and my questions on 6 I'll

Re: [PATCH 2/3] libceph: define mds_alloc_msg() method

2013-03-04 Thread Gregory Farnum
This looks like a faithful reshuffling to me. But... On Mon, Mar 4, 2013 at 10:12 AM, Alex Elder el...@inktank.com wrote: diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 6ec6051..c7d4278 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2804,55 +2804,34 @@

Re: [PATCH 1/3] libceph: drop mutex while allocating a message

2013-03-04 Thread Gregory Farnum
On Mon, Mar 4, 2013 at 10:12 AM, Alex Elder el...@inktank.com wrote: In ceph_con_in_msg_alloc(), if no alloc_msg method is defined for a connection a new message is allocated with ceph_msg_new(). Drop the mutex before making this call, and make sure we're still connected when we get it back

Re: [PATCH 3/3] libceph: no need for alignment for mds message

2013-03-04 Thread Gregory Farnum
Looks good. Reviewed-by: Greg Farnum g...@inktank.com On Mon, Mar 4, 2013 at 10:12 AM, Alex Elder el...@inktank.com wrote: Currently, incoming mds messages never use page data, which means there is no need to set the page_alignment field in the message. Signed-off-by: Alex Elder

Fwd: teuthology heads up

2013-03-04 Thread Gregory Farnum
We probably should have sent this to ceph-devel in the first place — sorry guys! -Greg -- Forwarded message -- From: Sam Lang sam.l...@inktank.com Date: Thu, Jan 31, 2013 at 7:40 AM Subject: teuthology heads up To: Dev d...@inktank.com Hi All, I just pushed a change to the way

Re: maintanance on osd host

2013-02-28 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:37 PM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote: Hi Greg, Hi Sage, Am 26.02.2013 21:27, schrieb Gregory Farnum: On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: out and down are quite different — are you sure you

Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 9:57 AM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 05:09:30PM -0800, Gregory Farnum wrote: On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've

Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

2013-02-26 Thread Gregory Farnum
On Mon, Feb 25, 2013 at 4:01 PM, Gregory Farnum g...@inktank.com wrote: On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng zheng.z@intel.com wrote: On 02/23/2013 02:54 AM, Gregory Farnum wrote: I haven't spent that much time in the kernel client, but this patch isn't working out for me

Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small (1M) files. The folder is mounted by only one client by default. In case of overload, other clients spawn to mount the same folder and

Re: maintanance on osd host

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 11:44 AM, Stefan Priebe s.pri...@profihost.ag wrote: Hi Sage, Am 26.02.2013 18:24, schrieb Sage Weil: On Tue, 26 Feb 2013, Stefan Priebe - Profihost AG wrote: But that redults in a 1-3s hickup for all KVM vms. This is not what I want. You can do kill $pid

Re: Crash and strange things on MDS

2013-02-26 Thread Gregory Farnum
On Tue, Feb 26, 2013 at 1:57 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 26, 2013 at 12:26:17PM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 11:58 AM, Kevin Decherf ke...@kdecherf.com wrote: We have one folder per application (php, java, ruby). Every application has small

Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

2013-02-25 Thread Gregory Farnum
On Fri, Feb 22, 2013 at 8:31 PM, Yan, Zheng zheng.z@intel.com wrote: On 02/23/2013 02:54 AM, Gregory Farnum wrote: I haven't spent that much time in the kernel client, but this patch isn't working out for me. In particular, I'm pretty sure we need to preserve this: diff --git a/fs/ceph

Re: CephFS: stable release?

2013-02-24 Thread Gregory Farnum
On Saturday, February 23, 2013 at 2:14 AM, Gandalf Corvotempesta wrote: Hi all, do you have an ETA about a stable release (or something usable in production) for CephFS? Short answer: no. However, we do have a team of people working on the FS again as of a month or so ago. We're doing a

Re: Ceph scalar replicas performance

2013-02-22 Thread Gregory Farnum
On Thu, Feb 21, 2013 at 5:01 PM, kelvin_hu...@wiwynn.com wrote: Hi all, I have some problem after my scalar performance test !! Setup: Linux kernel: 3.2.0 OS: Ubuntu 12.04 Storage server : 11 HDD (each storage server has 11 osd, 7200 rpm, 1T) + 10GbE NIC + RAID card: LSI MegaRAID SAS

Re: [PATCH 3/3] ceph: fix vmtruncate deadlock

2013-02-22 Thread Gregory Farnum
I haven't spent that much time in the kernel client, but this patch isn't working out for me. In particular, I'm pretty sure we need to preserve this: diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 5d5c32b..b9d8417 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2067,12 +2067,6 @@

Re: [PATCH] ceph: fix statvfs fr_size

2013-02-22 Thread Gregory Farnum
On Fri, Feb 22, 2013 at 3:23 PM, Sage Weil s...@inktank.com wrote: Different versions of glibc are broken in different ways, but the short of it is that for the time being, frsize should == bsize, and be used as the multiple for the blocks, free, and available fields. This mirrors what is

Re: enable old OSD snapshot to re-join a cluster

2013-02-20 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva ol...@gnu.org wrote: It recently occurred to me that I messed up an OSD's storage, and decided that the easiest way to bring it back was to roll it back to an earlier snapshot I'd taken (along the lines of clustersnap) and let it recover from

Re: Crash and strange things on MDS

2013-02-19 Thread Gregory Farnum
On Sat, Feb 16, 2013 at 10:24 AM, Kevin Decherf ke...@kdecherf.com wrote: On Sat, Feb 16, 2013 at 11:36:09AM -0600, Sam Lang wrote: On Fri, Feb 15, 2013 at 7:02 PM, Kevin Decherf ke...@kdecherf.com wrote: It seems better now, I didn't see any storm so far. But we observe high latency on

Re: Hadoop DNS/topology details

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: Here is the information that I've found so far regarding the operation of Hadoop w.r.t. DNS/topology. There are two parts, the file system client requirements, and other consumers of topology information. -- File

Re: Hadoop DNS/topology details

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 4:39 PM, Sage Weil s...@inktank.com wrote: On Tue, 19 Feb 2013, Noah Watkins wrote: On Feb 19, 2013, at 2:22 PM, Gregory Farnum g...@inktank.com wrote: On Tue, Feb 19, 2013 at 2:10 PM, Noah Watkins jayh...@cs.ucsc.edu wrote: That is just truly annoying

Re: Crash and strange things on MDS

2013-02-19 Thread Gregory Farnum
On Tue, Feb 19, 2013 at 5:00 PM, Kevin Decherf ke...@kdecherf.com wrote: On Tue, Feb 19, 2013 at 10:15:48AM -0800, Gregory Farnum wrote: Looks like you've got ~424k dentries pinned, and it's trying to keep 400k inodes in cache. So you're still a bit oversubscribed, yes. This might just

Re: [ceph] Fix more performance issues found by cppcheck (#51)

2013-02-14 Thread Gregory Farnum
Hey Danny, I've merged in most of these (commit ffda2eab4695af79abdc9ed9bf001c3cd662a1f2) but had comments on a couple: d99764e8c72a24eaba0542944f497cc2d9e154b4 is a patch on gtest. We did import that wholesale into our repository as that's what they recommend, but I'd prefer to get patches by

Re: [ceph-commit] [ceph/ceph] e330b7: mon: create fail_mds_gid() helper; make 'ceph mds ...

2013-02-14 Thread Gregory Farnum
On Thu, Feb 14, 2013 at 11:39 AM, GitHub nore...@github.com wrote: Branch: refs/heads/master Home: https://github.com/ceph/ceph Commit: e330b7ec54f89ca799ada376d5615e3c1dfc54f0 https://github.com/ceph/ceph/commit/e330b7ec54f89ca799ada376d5615e3c1dfc54f0 Author: Sage Weil

Further thoughts on fsck for CephFS

2013-02-14 Thread Gregory Farnum
Sage sent out an early draft of what we were thinking about doing for fsck on CephFS at the beginning of the week, but it was a bit incomplete and still very much a work in progress. I spent a good chunk of today thinking about it more so that we can start planning ticket-level chunks of work. The

Re: Write Replication on Degraded PGs

2013-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2013 at 3:40 AM, Ben Rowland ben.rowl...@gmail.com wrote: Hi, Apologies that this is a fairly long post, but hopefully all my questions are similar (or even invalid!) Does Ceph allow writes to proceed if it's not possible to satisfy the rules for replica placement across

Re: Crash and strange things on MDS

2013-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2013 at 3:47 AM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 11, 2013 at 10:54 AM, Kevin Decherf ke...@kdecherf.com wrote: Furthermore, I observe another strange thing more or less related to the storms

Re: rbd export speed limit

2013-02-13 Thread Gregory Farnum
On Wed, Feb 13, 2013 at 12:27 PM, Stefan Priebe s.pri...@profihost.ag wrote: Hi, Am 13.02.2013 21:21, schrieb Sage Weil: This results in writes up to 400Mb/s per OSD and then results in aborted / hanging task in VMs. Is it possible to give trim commands lower priority? Is that 400Mb or

Re: primary pg and replica pg are placed on the same node under ssd-primary rule

2013-02-12 Thread Gregory Farnum
Unfortunately this part doesn't fit into the CRUSH language. In order to do it and segregate properly by node you need to have separate SSD and HDD nodes, rather than interspersing them. (Or if you were brave you could set up some much more specified rules and pull each replica from a different

Re: preferred OSD

2013-02-11 Thread Gregory Farnum
On Fri, Feb 8, 2013 at 4:45 PM, Sage Weil s...@inktank.com wrote: Hi Marcus- On Fri, 8 Feb 2013, Marcus Sorensen wrote: I know people have been disscussing on and off about providing a preferred OSD for things like multi-datacenter, or even within a datacenter, choosing an OSD that would

Re: Unable to mount cephfs - can't read superblock

2013-02-11 Thread Gregory Farnum
On Sat, Feb 9, 2013 at 2:13 PM, Adam Nielsen a.niel...@shikadi.net wrote: $ ceph -s health HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean monmap e1: 1 mons at {0=192.168.0.6:6789/0}, election epoch 0, quorum 0 0 osdmap e3: 1 osds: 1 up, 1 in pgmap v119: 192 pgs: 192 active+degraded; 0

Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote: References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7) 1: /usr/bin/ceph-mds() [0x817e82] 2: (()+0xf140) [0x7f9091d30140] 3:

Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-02-11 Thread Gregory Farnum
1 Isaac - Original Message - From: Isaac Otsiabah zmoo...@yahoo.com To: Gregory Farnum g...@inktank.com Cc: ceph-devel@vger.kernel.org ceph-devel@vger.kernel.org Sent: Friday, January 25, 2013 9:51 AM Subject: Re: osd down (for 2 about 2 minutes) error after adding

Re: OSD Weights

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 12:43 PM, Holcombe, Christopher cholc...@cscinfo.com wrote: Hi Everyone, I just wanted to confirm my thoughts on the ceph osd weightings. My understanding is they are a statistical distribution number. My current setup has 3TB hard drives and they all have the
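The usual convention per the Ceph docs is to weight each OSD roughly by its capacity in TB, so uniform 3 TB drives all carry the same weight. A minimal sketch, with the OSD ids and the exact value chosen only for illustration:

    # weight ~= capacity in TB; identical 3 TB drives get identical weights
    ceph osd crush reweight osd.0 3.0
    ceph osd crush reweight osd.1 3.0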

Re: rest mgmt api

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 2:00 PM, Sage Weil s...@inktank.com wrote: On Mon, 11 Feb 2013, Gregory Farnum wrote: On Wed, Feb 6, 2013 at 12:14 PM, Sage Weil s...@inktank.com wrote: On Wed, 6 Feb 2013, Dimitri Maziuk wrote: On 02/06/2013 01:34 PM, Sage Weil wrote: I think the one caveat here

Re: Crash and strange things on MDS

2013-02-11 Thread Gregory Farnum
On Mon, Feb 11, 2013 at 2:24 PM, Kevin Decherf ke...@kdecherf.com wrote: On Mon, Feb 11, 2013 at 12:25:59PM -0800, Gregory Farnum wrote: On Mon, Feb 4, 2013 at 10:01 AM, Kevin Decherf ke...@kdecherf.com wrote: References: [1] http://www.spinics.net/lists/ceph-devel/msg04903.html [2] ceph

Re: OSD down

2013-02-10 Thread Gregory Farnum
The OSD daemon is getting back EIO when it tries to do a read. Sounds like your disk is going bad. -Greg PS: This question is a good fit for the new ceph-users list. :) On Sunday, February 10, 2013 at 9:45 AM, Olivier Bonvalet wrote: Hi, I have an OSD which often stopped (ceph 0.56.2),

Re: Possible filesystem corruption or something else?

2013-02-09 Thread Gregory Farnum
On Saturday, February 9, 2013 at 6:23 AM, John Axel Eriksson wrote: Three times now, twice on one osd, once on another we've had the osd crash. Restarting it wouldn't help - it would crash with the same error. The only way I found to get it up again was to reformat both the journal disk and

Re: ceph mkfs failed

2013-02-08 Thread Gregory Farnum
On Fri, Feb 8, 2013 at 1:18 PM, sheng qiu herbert1984...@gmail.com wrote: ok, i have figured out it. That looks like a LevelDB issue given the backtrace (and the OSD isn't responding because it crashed). If you figured out why LevelDB crashed, it'd be good to know so that other people can

Quantal gitbuilder is broken

2013-02-08 Thread Gregory Farnum
I'm not sure who's responsible for this, but I see everything is red on our i386 Quantal gitbuilder. Probably just needs libboost-program-options installed, based on the error I'm seeing in one of my branches? Although there are some warnings I'm not too used to seeing in there as well. -Greg --
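A plausible fix on the Quantal builder would be installing the missing Boost development package; the exact package name below is my assumption, not confirmed in the thread:

    # hypothetical fix for missing Boost.Program_options headers on Ubuntu 12.10
    sudo apt-get install libboost-program-options-dev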

Re: ceph mkfs failed

2013-02-07 Thread Gregory Farnum
On Thu, Feb 7, 2013 at 12:42 PM, sheng qiu herbert1984...@gmail.com wrote: Hi Dan, thanks for your reply. after some code tracking, i found it failed at this point : in file leveldb/db/db_impl.cc -- NewDB() log::Writer log(file); std::string record; new_db.EncodeTo(record);

Re: Throttle::wait use case clarification

2013-02-05 Thread Gregory Farnum
/pull/39 Cheers On 02/05/2013 01:22 AM, Gregory Farnum wrote: Loic, Sorry for the delay in getting back to you about these patches. :( I finally got some time to look over them, and in general it's all good! I do have some comments, though. On Mon, Jan 21, 2013 at 5:44 AM, Loic Dachary l

Re: Ceph Development with Eclipse

2013-02-05 Thread Gregory Farnum
I haven't done the initial setup in several years, but as I recall, once Ceph was built it was a simple matter of doing New Makefile Project with Existing Code in Eclipse. Make sure you've got the C++ version of Eclipse. Other than that, I'm afraid you'll need to go through the Eclipse support

Re: Increase number of pg in running system

2013-02-05 Thread Gregory Farnum
On Tuesday, February 5, 2013 at 5:49 PM, Sage Weil wrote: On Tue, 5 Feb 2013, Mandell Degerness wrote: I would like very much to specify pg_num and pgp_num for the default pools, but they are defaulting to 64 (no OSDs are defined in the config file). I have tried using the options indicated

Re: [0.48.3] OSD memory leak when scrubbing

2013-02-04 Thread Gregory Farnum
Set your /proc/sys/kernel/core_pattern file. :) http://linux.die.net/man/5/core -Greg On Mon, Feb 4, 2013 at 1:08 PM, Sébastien Han han.sebast...@gmail.com wrote: ok I finally managed to get something on my test cluster, unfortunately, the dump goes to / any idea to change the destination
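A minimal sketch of redirecting core dumps as suggested, assuming a writable /tmp; the pattern itself is just an example, see core(5) for the format specifiers:

    # write cores as /tmp/core.<executable>.<pid> instead of the working directory
    echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern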

Re: Throttle::wait use case clarification

2013-02-04 Thread Gregory Farnum
Loic, Sorry for the delay in getting back to you about these patches. :( I finally got some time to look over them, and in general it's all good! I do have some comments, though. On Mon, Jan 21, 2013 at 5:44 AM, Loic Dachary l...@dachary.org wrote: Looking through the history of that test (in

Re: Paxos and long-lasting deleted data

2013-02-03 Thread Gregory Farnum
On Sunday, February 3, 2013 at 11:45 AM, Andrey Korolyov wrote: Just an update: this data stayed after pool deletion, so there is probably a way to delete garbage bytes on a live pool without doing any harm (hope so), since it can be dissected from actual pool data placement, in theory.

Re: Ceph Development with Eclipse‏

2013-02-02 Thread Gregory Farnum
I actually still do this. Set up Ceph outside of Eclipse initially, then import it as a project with an existing Makefile. It should pick up on everything it needs to well enough. :) -Greg On Saturday, February 2, 2013 at 9:40 AM, charles L wrote: Hi I am a beginner at c++ and
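In other words, build Ceph once from the shell so the Makefiles exist, then point Eclipse at the tree via New Makefile Project with Existing Code. A sketch of the autotools build of that era, with flags and parallelism purely illustrative:

    # build once outside Eclipse, then import the tree as an existing-Makefile project
    git clone https://github.com/ceph/ceph.git && cd ceph
    ./autogen.sh && ./configure && make -j4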

Re: Paxos and long-lasting deleted data

2013-01-31 Thread Gregory Farnum
Can you pastebin the output of rados -p rbd ls? On Thu, Jan 31, 2013 at 10:17 AM, Andrey Korolyov and...@xdel.ru wrote: Hi, Please take a look, this data remains for days and seems not to be deleted in future too: pool name category KB objects clones

Re: Paxos and long-lasting deleted data

2013-01-31 Thread Gregory Farnum
On Thu, Jan 31, 2013 at 10:50 AM, Andrey Korolyov and...@xdel.ru wrote: http://xdel.ru/downloads/ceph-log/rados-out.txt.gz On Thu, Jan 31, 2013 at 10:31 PM, Gregory Farnum g...@inktank.com wrote: Can you pastebin the output of rados -p rbd ls? Well, that sure is a lot of rbd objects. Looks

Re: Maintenance mode

2013-01-31 Thread Gregory Farnum
Try ceph osd set noout beforehand and then ceph osd unset noout. That will prevent any OSDs from getting removed from the mapping, so no data will be rebalanced. I don't think there's a way to prevent OSDs from getting zapped on an individual basis, though. This is described briefly in the
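A sketch of the maintenance window described above; how you stop and start the OSDs depends on your distribution:

    ceph osd set noout        # down OSDs stay 'in', so no data is rebalanced
    # stop the OSDs on the host, do the maintenance, start them again
    ceph osd unset noout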

Re: HEALTH_ERR 18624 pgs stuck inactive; 18624 pgs stuck unclean; no osds

2013-01-30 Thread Gregory Farnum
I believe this is because you specified hostname rather than host for the OSDs in your ceph.conf. hostname isn't a config option that anything in Ceph recognizes. :) -Greg On Wednesday, January 30, 2013 at 8:12 AM, femi anjorin wrote: Hi, Can anyone help with this? I am running a

Re: [PATCH 0/6] fix build (ceph.spec)

2013-01-30 Thread Gregory Farnum
On Wednesday, January 30, 2013 at 10:00 AM, Danny Al-Gaaf wrote: This set fixes some issues in the spec file. I'm not sure what the reason for #35e5d74e5c5786bc91df5dc10b5c08c77305df4e was. But I would revert it and fix the underlaying issues instead. That is a pretty obtuse commit

Re: Geo-replication with RADOS GW

2013-01-28 Thread Gregory Farnum
On Monday, January 28, 2013 at 9:54 AM, Ben Rowland wrote: Hi, I'm considering using Ceph to create a cluster across several data centres, with the strict requirement that writes should go to both DCs. This seems possible by specifying rules in the CRUSH map, with an understood latency hit

Re: lagging peering wq

2013-01-25 Thread Gregory Farnum
On Friday, January 25, 2013 at 9:50 AM, Sage Weil wrote: Faidon/paravoid's cluster has a bunch of OSDs that are up, but the pg queries indicate they are tens of thousands of epochs behind: history: { epoch_created: 14, last_epoch_started: 88174, last_epoch_clean: 88174, last_epoch_split:

Re: osd down (for 2 about 2 minutes) error after adding a new host to my cluster

2013-01-24 Thread Gregory Farnum
) Isaac - Original Message - From: Gregory Farnum g...@inktank.com (mailto:g...@inktank.com) To: Isaac Otsiabah zmoo...@yahoo.com (mailto:zmoo...@yahoo.com) Cc: ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org) ceph-devel@vger.kernel.org (mailto:ceph-devel@vger.kernel.org

Re: ceph write path

2013-01-24 Thread Gregory Farnum
On Thursday, January 24, 2013 at 6:41 PM, sheng qiu wrote: Hi, i am trying to understand the ceph codes on client side. for write path, if it's aio_write, ceph_write_begin() allocates pages in the page cache to buffer the written data; however, I did not see it allocate any space on the

Re: Using a Data Pool

2013-01-23 Thread Gregory Farnum
On Wednesday, January 23, 2013 at 5:01 AM, Paul Sherriffs wrote: Hello All; I have been trying to associate a directory to a data pool (both called 'Media') according to a previous thread on this list. It all works except the last line: ceph osd pool create Media 500 500 ceph

Re: some questions about ceph

2013-01-23 Thread Gregory Farnum
On Wednesday, January 23, 2013 at 3:35 PM, Yue Li wrote: Hi, i have some questions about ceph. ceph provide a POSIX client for users. for aio-read/write, it still use page cache on client side (seems to me). How long will the page cache expire (in case the data on server side has

Re: handling fs errors

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 5:12 AM, Wido den Hollander wrote: On 01/22/2013 07:12 AM, Yehuda Sadeh wrote: On Mon, Jan 21, 2013 at 10:05 PM, Sage Weil s...@inktank.com (mailto:s...@inktank.com) wrote: We observed an interesting situation over the weekend. The XFS volume ceph-osd

Re: Throttle::wait use case clarification

2013-01-22 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:44 AM, Loic Dachary wrote: On 01/21/2013 12:02 AM, Gregory Farnum wrote: On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote: Hi, While working on unit tests for Throttle.{cc,h} I tried

Re: ssh passwords

2013-01-22 Thread Gregory Farnum
On Tuesday, January 22, 2013 at 10:24 AM, Gandalf Corvotempesta wrote: Hi all, i'm trying my very first ceph installation following the 5-minutes quickstart: http://ceph.com/docs/master/start/quick-start/#install-debian-ubuntu just a question: why ceph is asking me for SSH password? Is ceph

Re: Consistently reading/writing rados objects via command line

2013-01-21 Thread Gregory Farnum
On Monday, January 21, 2013 at 5:01 PM, Nick Bartos wrote: I would like to store some objects in rados, and retrieve them in a consistent manner. In my initial tests, if I do a 'rados -p foo put test /tmp/test', while it is uploading I can do a 'rados -p foo get test /tmp/blah' on another

Re: osd max write size

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 11:06 AM, Stefan Priebe wrote: Hi, what is the purpose or idea behind this setting? Couple different things: 1) The OSDs can't accept writes which won't fit inside their journal, and if you have a small journal you could conceivably attempt to write

Re: questions on networks and hardware

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 12:30 PM, Gandalf Corvotempesta wrote: 2013/1/19 John Nielsen li...@jnielsen.net (mailto:li...@jnielsen.net): I'm planning a Ceph deployment which will include: 10Gbit/s public/client network 10Gbit/s cluster network I'm still trying to know if a

Re: ceph replication and data redundancy

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote: Hi, On 01/17/2013 10:55 AM, Ulysse 31 wrote: Hi all, I'm not sure if it's the good mailing, if not, sorry for that, tell me the appropriate one, i'll go for it. Here is my actual project : The company i work for has

Re: ceph replication and data redundancy

2013-01-20 Thread Gregory Farnum
(Sorry for the blank email just now, my client got a little eager!) Apart from the things that Wido has mentioned, you say you've set up 4 nodes and each one has a monitor on it. That's why you can't do anything when you bring down two nodes — the monitor cluster requires a strict majority in
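The arithmetic behind that: a monitor quorum is floor(n/2) + 1, so with one monitor per node on 4 nodes, two nodes down leaves only 2 of 4 monitors and the cluster stalls. A tiny illustration:

    # quorum size for n monitors is n/2 + 1 (integer division)
    n=4; echo $(( n / 2 + 1 ))    # -> 3, so at most one of the four monitors may be down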

Re: Throttle::wait use case clarification

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 5:39 AM, Loic Dachary wrote: Hi, While working on unit tests for Throttle.{cc,h} I tried to figure out a use case related to the Throttle::wait method but couldn't https://github.com/ceph/ceph/pull/34/files#L3R258 Although it was not a blocker and I

Re: questions on networks and hardware

2013-01-20 Thread Gregory Farnum
On Sunday, January 20, 2013 at 2:43 PM, Gandalf Corvotempesta wrote: 2013/1/20 Gregory Farnum g...@inktank.com (mailto:g...@inktank.com): This is a bit embarrassing, but if you're actually using two networks and the cluster network fails but the client network stays up, things behave

Re: max useful journal size

2013-01-18 Thread Gregory Farnum
On Fri, Jan 18, 2013 at 2:20 PM, Travis Rhoden trho...@gmail.com wrote: Hey folks, The Ceph docs give the following recommendation on sizing your journal: osd journal size = {2 * (expected throughput * filestore min sync interval)} The default value of min sync interval is .01. If you use
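A worked example of that sizing rule, using made-up numbers: an assumed 100 MB/s of expected throughput and a 5 s filestore sync interval, neither of which comes from the thread:

    # osd journal size (MB) = 2 * expected_throughput_MB_per_s * sync_interval_s
    echo $(( 2 * 100 * 5 ))    # -> 1000, i.e. 'osd journal size = 1000' in ceph.conf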
