I really like the idea. One scenario that keeps bothering us is that too 
many small files make file system indexing slow (a single read request 
can take more than 10 disk I/Os for path lookup).

If we pursue this proposal, is there a chance we can take it one step 
further: instead of storing one physical file per object, we allocate a 
big file (tens of GB) and map each object to a chunk within that big file? 
The descriptors for those big files could then be cached, avoiding the 
disk I/O needed to open a file. At the very least, could we keep the design 
flexible enough that anyone who wants to implement it this way can 
leverage the existing framework?
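
A minimal Python sketch of the chunk-in-a-big-file layout described above, 
under stated assumptions: ChunkStore, its fixed chunk_size, and the 
in-memory index are hypothetical names and simplifications, not part of 
the proposal. A real store would presumably keep the index in the kv db 
and handle variable-sized objects and free-space reuse.

```python
import os


class ChunkStore:
    """Maps object names to (offset, length) extents inside one large file."""

    def __init__(self, path, chunk_size=4096):
        self.chunk_size = chunk_size  # fixed slot size (assumption)
        self.index = {}               # object name -> (offset, length)
        self.next_offset = 0          # simple append-only slot allocation
        # The big file is opened once; later reads need no path lookup.
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

    def put(self, name, data):
        if len(data) > self.chunk_size:
            raise ValueError("object larger than one chunk")
        if name in self.index:
            # Overwrite in place: reuse the slot already assigned.
            offset = self.index[name][0]
        else:
            offset = self.next_offset
            self.next_offset += self.chunk_size
        os.pwrite(self.fd, data, offset)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, length = self.index[name]
        return os.pread(self.fd, length, offset)

    def close(self):
        os.close(self.fd)
```

With this layout a read is one index lookup plus one pread() on an already 
open descriptor, which is the path-lookup saving the scenario above is 
after.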

Thanks,
Guang

On Jul 31, 2014, at 1:25 PM, Sage Weil <sw...@redhat.com> wrote:

> After the latest set of bug fixes to the FileStore file naming code I am 
> newly inspired to replace it with something less complex.  Right now I'm 
> mostly thinking about HDDs, although some of this may map well onto hybrid 
> SSD/HDD as well.  It may or may not make sense for pure flash.
> 
> Anyway, here are the main flaws with the overall approach that FileStore 
> uses:
> 
> - It tries to maintain a direct mapping of object names to file names.  
> This is problematic because of 255 character limits, rados namespaces, pg 
> prefixes, and the pg directory hashing we do to allow efficient split, for 
> starters.  It is also problematic because we often want to do things like 
> rename but can't make it happen atomically in combination with the rest of 
> our transaction.
> 
> - The PG directory hashing (that we do to allow efficient split) can have 
> a big impact on performance, particularly when ingesting lots of data.  
> (And when benchmarking.)  It's also complex.
> 
> - We often overwrite or replace entire objects.  These are "easy" 
> operations to do safely without doing complete data journaling, but the 
> current design is not conducive to doing anything clever (and it's complex 
> enough that I wouldn't want to add any cleverness on top).
> 
> - Objects may contain only key/value data, but we still have to create an 
> inode for them and look that up first.  This only matters for some 
> workloads (rgw indexes, cephfs directory objects).
> 
> Instead, I think we should try a hybrid approach that more heavily 
> leverages a key/value db in combination with the file system.  The kv db 
> might be leveldb, rocksdb, LMDB, BDB, or whatever else; for now we just 
> assume it provides transactional key/value storage and efficient range 
> operations.  Here's the basic idea:
> 
> - The mapping from names to object lives in the kv db.  The object 
> metadata is in a structure we can call an "onode" to avoid confusing it 
> with the inodes in the backing file system.  The mapping is a simple 
> ghobject_t -> onode map; there is no PG collection.  The PG collections 
> still exist but really only as ranges of those keys.  We will need to be 
> slightly clever with the coll_t to distinguish between "bare" PGs (that 
> live in this flat mapping) and the other collections (*_temp and 
> metadata), but that should be easy.  This makes PG splitting "free" as far 
> as the objects go.
> 
> - The onodes are relatively small.  They will contain the xattrs and 
> basic metadata like object size.  They will also identify the file name of 
> the backing file in the file system (if size > 0).
> 
> - The backing file can be a random, short file name.  We can just make a 
> one or two level deep set of directories, and let the directories get 
> reasonably big... whatever we decide the backing fs can handle 
> efficiently.  We can also store a file handle in the onode and use the 
> open by handle API; this should let us go directly from onode (in our kv 
> db) to the on-disk inode without looking at the directory at all, and fall 
> back to using the actual file name only if that fails for some reason 
> (say, someone mucked around with the backing files).  The backing file 
> need not have any xattrs on it at all (except perhaps some simple id to 
> verify it does in fact belong to the referring onode, just as a sanity 
> check).
> 
> - The name -> onode mapping can live in a disjoint part of the kv 
> namespace so that the other kv stuff associated with the file (like omap 
> pairs or big xattrs or whatever) doesn't blow up those parts of the 
> db and slow down lookup.
> 
> - We can keep a simple LRU of recent onodes in memory and avoid the kv 
> lookup for hot objects.
> 
> - Previously complicated operations like rename are now trivial: we just 
> update the kv db with a transaction.  The backing file never gets renamed, 
> ever, and the other object omap data is keyed by a unique (onode) id, not 
> the name.
> 
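
The flat name -> onode mapping in the list above can be sketched roughly 
as follows. This is an illustration under assumptions, not the actual 
design: OnodeStore is a hypothetical in-memory stand-in for the kv db, the 
"pg/name" key encoding is invented for the example, and the onode is a 
plain dict.

```python
import bisect
import itertools
import secrets


class OnodeStore:
    """In-memory stand-in for a kv db holding name -> onode mappings."""

    def __init__(self):
        self.keys = []    # sorted keys, emulating the kv db's ordered keyspace
        self.onodes = {}  # key -> onode
        self.omap = {}    # (onode id, omap key) -> value; keyed by id, not name
        self._ids = itertools.count(1)

    @staticmethod
    def key(pg, name):
        # Hypothetical key encoding: PG prefix, '/', object name.
        return f"{pg}/{name}"

    def create(self, pg, name, size=0):
        k = self.key(pg, name)
        if k in self.onodes:
            raise KeyError(f"{k} already exists")
        onode = {
            "id": next(self._ids),
            "size": size,
            "xattrs": {},
            # Random, short backing file name, only if there is data.
            "backing_file": secrets.token_hex(4) if size > 0 else None,
        }
        bisect.insort(self.keys, k)
        self.onodes[k] = onode
        return onode

    def list_pg(self, pg):
        # A PG is just a key range: everything in [pg + "/", pg + "0"),
        # since "0" is the character right after "/" in ASCII.
        lo = bisect.bisect_left(self.keys, pg + "/")
        hi = bisect.bisect_left(self.keys, pg + "0")
        return self.keys[lo:hi]

    def rename(self, pg, old, new):
        # Only the key changes; the onode, its id, its omap entries and
        # its backing file are untouched.
        old_k, new_k = self.key(pg, old), self.key(pg, new)
        if new_k in self.onodes:
            raise KeyError(f"{new_k} already exists")
        onode = self.onodes.pop(old_k)
        self.keys.remove(old_k)
        bisect.insort(self.keys, new_k)
        self.onodes[new_k] = onode
```

In a real kv db the two steps in rename() would go into a single 
transaction, and list_pg() would be a range iterator rather than a list 
slice.
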
> Initially, for simplicity, we can start with the existing data journaling 
> behavior.  However, I think there are opportunities to improve the 
> situation there.  There is a pending wip-transactions branch in which I 
> started to rejigger the ObjectStore::Transaction interface a bit so that 
> you identify objects by handle and then operate on them.  Although it 
> doesn't change the encoding yet, once it does, we can make the 
> implementation take advantage of that, by avoiding duplicate name lookups.  
> It will also let us do things like clearly identify when an object is 
> entirely new; in that case, we might forgo data journaling and instead 
> write the data to the (new) file, fsync, and then commit the journal entry 
> with the transaction that uses it.  (On remount a simple cleanup process 
> can throw out new but unreferenced backing files.)  It would also make it 
> easier to track all recently touched files and bulk fsync them instead of 
> doing a syncfs (if we decide that is faster).
> 
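
A rough sketch of the "write the new file, fsync, then commit" ordering 
for entirely new objects, plus the remount cleanup pass. All names here 
are hypothetical: the journal is modeled as a plain Python list (a real 
journal would itself have to be made durable), and fsync of the containing 
directory is omitted for brevity.

```python
import os
import secrets


def write_new_object(data_dir, journal, name, data):
    """Write data to a fresh backing file, fsync it, then journal the reference."""
    fname = secrets.token_hex(8)  # random, short backing file name
    path = os.path.join(data_dir, fname)
    # "xb" fails if the file already exists, so we never clobber live data.
    with open(path, "xb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # data is on disk before anything references it
    # Only now commit the entry that makes the file reachable. The object
    # data itself never passes through the journal.
    journal.append({"op": "create", "name": name, "file": fname})
    return fname


def cleanup_unreferenced(data_dir, journal):
    """On remount, drop backing files that no committed entry refers to."""
    referenced = {entry["file"] for entry in journal}
    removed = []
    for fname in os.listdir(data_dir):
        if fname not in referenced:
            os.remove(os.path.join(data_dir, fname))
            removed.append(fname)
    return sorted(removed)
```

A crash between the fsync and the journal append leaves an orphan file 
that cleanup_unreferenced() removes; a crash after the append leaves a 
fully written, referenced file.
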
> Anyway, at the end of the day, small writes or overwrites would still be 
> journaled, but large writes or large new objects would not, which would (I 
> think) be a pretty big improvement.  Overall, I think the design will be 
> much simpler to reason about, and there are several potential avenues to 
> be clever and make improvements.  I'm not sure we can say the same about 
> the FileStore design, which suffers from the fact that it has evolved 
> slowly over the last 9 years or so.
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

