Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-16 Thread Yan, Zheng
On Thu, Jan 17, 2013 at 5:52 AM, Gregory Farnum g...@inktank.com wrote:
 My biggest concern with this was how it worked on clusters with
 multiple data pools, and Sage's initial response was to either
 1) create an object for each inode that lives in the metadata pool,
 and holds the backtraces (rather than putting them as attributes on
 the first object in the file), or
 2) use a more sophisticated data structure, perhaps built on Eleanor's
 b-tree project from last summer
 (http://ceph.com/community/summer-adventures-with-ceph-building-a-b-tree/)

 I had thought that we could just query each data pool for the object,
 but Sage points out that 100-pool clusters aren't exactly unreasonable
 and that would take quite a lot of query time. And having the
 backtraces in the data pools significantly complicates things with our
 rules about setting layouts on new files.

 So this is going to need some kind of revision; please suggest alternatives!
 -Greg

How about using a DHT to map regular files to their parent directories,
and then using backtraces to find the parent directory's path?
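
To make the suggestion concrete, here is a minimal standalone C++ sketch of
the idea; the identifiers, shard count, and object naming are my own
assumptions, not existing Ceph code.  The file-ino to parent-ino mapping is
sharded across a fixed set of lookup objects by hash, and the parent
directory's path is then resolved from its backtrace.::

 #include <cstdint>
 #include <string>
 #include <unordered_map>

 // Hypothetical: split the file-ino -> parent-dir-ino mapping across a
 // fixed number of shard objects, so no single MDS holds the whole table.
 static const uint64_t NUM_LOOKUP_SHARDS = 1024;

 // Name of the shard object responsible for a given ino.
 std::string lookup_shard_for(uint64_t ino) {
   return "inolookup." + std::to_string(ino % NUM_LOOKUP_SHARDS);
 }

 // Conceptual contents of one shard: file ino -> parent directory ino.
 struct LookupShard {
   std::unordered_map<uint64_t, uint64_t> parent_of;
 };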

Regards
Yan, Zheng


Re: mds: first stab at lookup-by-ino problem/soln description

2013-01-16 Thread Gregory Farnum
On Wed, Jan 16, 2013 at 3:54 PM, Sam Lang sam.l...@inktank.com wrote:

 On Wed, Jan 16, 2013 at 3:52 PM, Gregory Farnum g...@inktank.com wrote:

 My biggest concern with this was how it worked on clusters with
 multiple data pools, and Sage's initial response was to either
 1) create an object for each inode that lives in the metadata pool,
 and holds the backtraces (rather than putting them as attributes on
 the first object in the file), or
 2) use a more sophisticated data structure, perhaps built on Eleanor's
 b-tree project from last summer
 (http://ceph.com/community/summer-adventures-with-ceph-building-a-b-tree/)

 I had thought that we could just query each data pool for the object,
 but Sage points out that 100-pool clusters aren't exactly unreasonable
 and that would take quite a lot of query time. And having the
 backtraces in the data pools significantly complicates things with our
 rules about setting layouts on new files.

 So this is going to need some kind of revision; please suggest
 alternatives!


 Correct me if I'm wrong, but this seems like it's only an issue in the NFS
 reexport case, as fsck can walk through the data objects in each pool (in
 parallel?) and verify back/forward consistency, so we won't have to guess
 which pool an ino is in.

 Given that, if we could stuff the pool id in the ino for the file returned
 through the client interfaces, then we wouldn't have to guess.

 -sam

I'm not familiar with the interfaces at work there. Do we have a free
32 bits we can steal in order to do that stuffing? (I *think* it would
go in the NFS filehandle structure rather than the ino, right?)
We would also need to store that information in order to eventually
replace the anchor table, but of course that's much easier to deal
with. If we can just do it this way, that still leaves handling files
which don't have any data written yet — under our current system,
users can apply a data layout to any inode which has not had data
written to it yet. Unfortunately that gets hard to deal with if a user
touches a bunch of files and then comes back to place them the next
day. :/ I suppose untouched files could have the special property
that their lookup data is stored in the metadata pool and it gets
moved as soon as they have data — in the typical case files are
written right away and so this wouldn't be any more writes, just a bit
more logic.
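
To illustrate the stuffing, here is a rough sketch of a filehandle payload
(the struct name and layout are hypothetical, not an existing Ceph or NFS
structure) that carries the data pool id alongside the untouched 64-bit ino::

 #include <cstdint>

 // Hypothetical NFS filehandle payload: keep the ino intact and carry
 // the data pool id next to it, so lookup-by-ino knows exactly which
 // pool holds the file's first object (and thus its backtrace).
 struct ceph_nfs_fh {
   uint64_t ino;       // inode number, unchanged
   int32_t  pool_id;   // data pool of the file's first object
   uint32_t reserved;  // spare bits for future use
 };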
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


mds: first stab at lookup-by-ino problem/soln description

2013-01-15 Thread Sage Weil
One of the first things we need to fix in the MDS is how we support 
lookup-by-ino.  It's important for fsck, NFS reexport, and (insofar as 
there are limitations to the current anchor table design) hard links and 
snapshots.

Below is a description of the problem and a rough sketch of my proposed 
solution.  This is the first time I thought about the lookup algorithm in 
any detail, so I've probably missed something, and the 'ghost entries' bit 
is what came to mind on the plane.  Hopefully we can think of something a 
bit lighter weight.

Anyway, poke holes if anything isn't clear, if you have any better ideas, 
or if it's time to refine further.  This is just a starting point for the 
conversation.


The problem
-----------

The MDS stores all fs metadata (files, inodes) in a hierarchy,
allowing it to distribute responsibility among ceph-mds daemons by
partitioning the namespace hierarchically.  This is also a huge win
for inode prefetching: loading the directory gets you both the names
and the inodes in a single IO.

One consequence of this is that we do not have a flat inode table
that lets us look up files by inode number.  We *can* find
directories by ino simply because they are stored in an object named
after the ino.  However, we can't populate the cache this way because
the metadata in cache must be fully attached to the root to avoid
various forms of MDS anarchy.

Lookup-by-ino is currently needed for hard links.  The first link to a
file is deemed the primary link, and that is where the inode is
stored.  Any additional links are internally remote links, and
reference the inode by ino.  However, there are other uses for
lookup-by-ino, including NFS reexport and fsck.
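
As a rough model of that distinction (simplified C++, not the real MDS
dentry code), the first link embeds the inode while any further link
carries only the ino, which is exactly what forces a lookup-by-ino::

 #include <cstdint>
 #include <memory>
 #include <string>

 struct Inode;  // full inode metadata, stored with its primary dentry

 // Simplified dentry: the primary link owns the inode; a remote link
 // holds only the ino and must be resolved by lookup-by-ino.
 struct Dentry {
   std::string name;
   std::shared_ptr<Inode> primary;  // set for the primary link
   uint64_t remote_ino = 0;         // set instead for remote links
 };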

Anchor table
------------

The anchor table is currently used to locate inodes that have hard
links.  Inodes in the anchor table are said to be anchored, and can
be found by ino alone with no knowledge of their path.  Normally, only
inodes that have hard links need to be anchored.  There are a few
other cases, but they are not relevant here.

The anchor table is a flat table of records like:

 ino -> (parent ino, hash(name), refcount)

All parent inos referenced in the table also have records.  The
refcount includes both other records listing a given ino as parent and
the anchor itself (i.e., the inode).  To anchor an inode, we insert
records for the ino and all ancestors (if they are not already present).

An anchor removal decrements the refcount on the ino's record.  Once a
refcount hits 0 the record can be removed, and the parent ino's refcount
can be decremented.

A directory rename involves changing the parent ino value for an
existing record, populating the new ancestors into the table (as
needed), and decrementing the old parent's refcount.
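
As a concrete model of the three rules above (insertion, removal, rename),
here is a small standalone C++ sketch; this is a simplification of mine,
not the actual anchor table implementation::

 #include <cstdint>
 #include <map>

 // One record per anchored ino: ino -> (parent ino, hash(name), refcount).
 struct AnchorRecord {
   uint64_t parent_ino = 0;
   uint32_t name_hash = 0;   // hash of the linking dentry name
   uint32_t refcount = 0;    // child records plus the anchor itself
 };

 using AnchorTable = std::map<uint64_t, AnchorRecord>;

 // Drop one reference; on hitting 0, remove the record and recursively
 // drop a reference on the parent (the removal rule above).
 void put_ref(AnchorTable& table, uint64_t ino) {
   auto it = table.find(ino);
   if (it == table.end())
     return;
   if (--it->second.refcount == 0) {
     uint64_t parent = it->second.parent_ino;
     table.erase(it);
     if (parent)
       put_ref(table, parent);
   }
 }

 // Directory rename: repoint the record at the new parent (whose own
 // ancestor chain must already be populated) and release the old parent.
 void rename_anchor(AnchorTable& table, uint64_t ino,
                    uint64_t new_parent, uint32_t new_name_hash) {
   AnchorRecord& rec = table.at(ino);
   uint64_t old_parent = rec.parent_ino;
   rec.parent_ino = new_parent;
   rec.name_hash = new_name_hash;
   table.at(new_parent).refcount++;  // new parent gains a child reference
   put_ref(table, old_parent);       // old parent loses one
 }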

This all works great if there are a small number of anchors, but does
not scale.  The entire table is managed by a single MDS, and is
currently kept in memory.  Anchoring every inode in the system would
therefore be impractical.

But we want lookup-by-ino for NFS reexport, and something
similar/related for fsck.


Current lookup by ino procedure
-------------------------------

::

 lookup_ino(ino)
   send message mds.N -> mds.0
     anchor lookup $ino
   get reply message mds.0 -> mds.N
     reply contains record for $ino and all ancestors (an anchor trace)
   parent = deepest ancestor in trace that we have in our cache
   while parent != ino
     child = parent.lookup(hash(name))
     if not found
       restart from the top
     parent = child


Directory backpointers
----------------------

There is partial infrastructure for supporting fsck that is already maintained
for directories.  Each directory object (the first object for the directory,
if there are multiple fragments) has an attr that provides a snapshot of
its ancestors, called a backtrace.::

 struct inode_backtrace_t {
   inodeno_t ino;   // my ino
   vector<inode_backpointer_t> ancestors;
 }

 struct inode_backpointer_t {
   inodeno_t dirino;// containing directory ino
   string dname;// linking dentry name
   version_t version;   // child's version at time of backpointer creation
 };

The backpointer version field is there to establish *when* this
particular pointer was valid.  For a large directory /a/b/c, every
item in c would have an attr that describes the ancestors /a/b/c.  If
c were renamed to /c, we only update c's backtrace and not its
children, so any future observer needs to be able to tell that the /c
forward pointer is newer than the backtrace's backpointer.
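
For example, an fsck-like observer could settle such a disagreement with a
one-line check (the helper is a sketch of mine, reusing the structs above)::

 // After /a/b/c is renamed to /c, c's children still carry the stale
 // /a/b/c backtrace until it is rewritten; the newer version wins.
 bool backpointer_is_stale(const inode_backpointer_t& bp,
                           version_t forward_version) {
   return forward_version > bp.version;
 }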

We already maintain backtraces for all directories in the system for
use by a future fsck tool.  They are updated asynchronously after
directory renames.


Proposal: file backtraces
-------------------------

Extend the current backtrace infrastructure to include files as well
as directories.  Attach the backtrace to the first object for the file
(normally named something like $ino.000).
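
Resolving a file's path from such a backtrace is then just a walk over the
ancestors.  A minimal sketch of mine, which assumes ancestors[0] is the
immediate parent and the last entry hangs off the root::

 #include <string>

 // Rebuild the recorded path, e.g. /a/b/f for a file f whose backtrace
 // lists [ (b, "f"), (a, "b"), (root, "a") ].
 std::string path_from_backtrace(const inode_backtrace_t& bt) {
   std::string path;
   for (auto it = bt.ancestors.rbegin(); it != bt.ancestors.rend(); ++it)
     path += "/" + it->dname;
   return path;
 }

As with the anchor trace lookup above, the result is only a hint: each hop
still has to be verified against the live forward pointers, since the
backtrace may be stale after a rename.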

Initially have the MDS set that attr sometime after the