-------- Original Message --------
Subject: Re: Crazy idea of cleanup the inode_record btrfsck things with SQL?
From: Robert White <rwh...@pobox.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>, linux-btrfs <linux-btrfs@vger.kernel.org>
Date: 2014-12-04 03:18
On 12/01/2014 05:17 PM, Qu Wenruo wrote:
But I am also somewhat tired of adding new structures and new search
functions, or even making larger changes to
the btrfsck record infrastructure, whenever I find it can't provide what
a new recovery feature is going
to need.

DISCLAIMER: My actual background is modeling, so I come from the idea that btrfsck is basically building a model of the filesystem and comparing the theoretically correct model to the actual filesystem as found... and I have spent nearly zero time inside the btrfsck code base itself...

I've looked at what I _think_ you are trying to do with the lookup/correlate phase finding peer/parent/child references, and I think other things would make far more sense than SQL.

C++ Standard Template Library at a minimum; life is too short to reinvent data lookups in memory. But it's also too short to re-factor SQL queries or wait for degenerate full-table scans.
C++ didn't come to me as my first thought, since I haven't used it for a long, long time... :-(

The boost "multi-index container", if you are going to bring in an external library, would serve you far better for dealing with future structural changes. It would directly index the fields of your (parent, child, name) relation and let you switch iteration on the fly (e.g. you can be walking parents and use the same iterator to start walking names, etc.). (There's even a cookbook item for adding an LRU cache as a natural index for space-constrained items.) Again, you'd have to switch over to C++, but it implements exactly what I think you are looking for. The multi-index container declaration syntax gets a little thick, but once declared the usage model is really simple. (You pay most or all of the cost during "update", which is really a remove-then-re-add, but you won't be updating as often as adding, so...)
Yeah, C++ seems much easier to switch to, but as preparation it would be better to clean up the inode_record code from
cmds-check.c and put it into a single file.
After the cleanup, I think we could even have different backends to test with.
(Only for testing; full multi-backend support also seems like overkill.)

My "crazy-ist idea" would be to maybe mmap all the metadata from the filesystem into the process space and build a page-table-like index over that. The virtual memory demand paging will become your caching buddy. mapped_block+offest gets you directly to the whole native structure. You might have to juggle mappings (map and unmap them) on systems with arger filesystems but smaller virtual tables (e.g. classic pentium 32 bit). Even then, you'd be juggling mappings and the kernel would be doing all the real caching. The more data you can leave in the mmapped ranges the less storage management takes place in the local code.

IMHO, finding a USB stick to make a swap space on is probably just as good as, if not better than, doing it with a whole other filesystem and the "put my temp files there" option during emergency maintenance.
Sadly, there are already users complaining about the memory usage:
http://comments.gmane.org/gmane.comp.file-systems.btrfs/34573

A 13T fs may contain several hundred GB of metadata and already takes over 8G of memory. It may take 10+ GB of memory, so a USB stick may help, but what if it happens on a remote server?
Or can you tolerate the snail-slow I/O speed?
So it's better to store the records on disk when things get really huge.

Map and unmap would be good, but it may not resolve the problem at all.
The main memory user in btrfsck is the extent records, which
we can't free until we have read them all in and checked them, so even if we mmap/unmap, it can only help with
the extent_buffers (which are already freed when unused, according to their refs).

Now I miss the kernel's page cache so much...

Anyway, btrfsck really needs to take its memory usage into account now.



If the mailing list engine honors attachments, you will find FileEntries.h from one of my projects, offered as a simplified example of a directory entry cache implemented in a boost multi-index container.

(Note that this is fully working code from an upcoming addition to my sourceforge project, so it's not just theoretical implementation ideas.)

struct FileEntry {
  struct NameIndex {};
  struct DirectoryIndex {};
  struct INodeIndex {};
  struct TypedIndex {};
  dev_t         Device;
  long          ParentINode;
  long          INode;
  std::string   Name;
  enum FileType Type;
};

The empty structs inside the struct act as index identifiers/names (because C++ templates only disambiguate on types, FileEntry::NameIndex is the type name of the index of file names, etc.).

Any other data could be added to FileEntry without disrupting any existing code for indexing or iterating.

The container itself is a little gruesome textually:

typedef multi_index_container<
  FileEntry,
  indexed_by<
    ordered_non_unique<
      tag<FileEntry::NameIndex>,
      member<
        FileEntry,
        std::string,
        &FileEntry::Name
      >
    >,
    ordered_non_unique<
      tag<FileEntry::DirectoryIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
          dev_t,
          &FileEntry::Device
        >,
        member<
          FileEntry,
          long,
          &FileEntry::ParentINode
        >
      >
    >,
    ordered_non_unique<
      tag<FileEntry::INodeIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
          dev_t,
          &FileEntry::Device
        >,
        member<
          FileEntry,
          long,
          &FileEntry::INode
        >
      >
    >,
    ordered_non_unique<
      tag<FileEntry::TypedIndex>,
      composite_key<
        FileEntry,
        member<
          FileEntry,
          enum FileType,
          &FileEntry::Type
        >,
        member<
          FileEntry,
          std::string,
          &FileEntry::Name
        >
      >
    >
  >
> FileSet;


But after that it's super easy to use, uh, if you know STL iterator speak anyway. 8-)

Having a file set is easy and type-safe in any type-able context (such as a member of another structure, or a global):
FileSet result;

Picking your index/view has to be done by (admittedly ugly-looking) type:
FileSet::index<FileEntry::NameIndex>::type

But then when you make one and plumb it ("result" being the working set):
FileSet::index<FileEntry::NameIndex>::type & by_name = result.get<FileEntry::NameIndex>();

by_name is now a fully functional view of the set as an iterable container, with operations like by_name.erase(some_name) and such.
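
A rough usage sketch, assuming the FileEntry/FileSet declarations above and the usual boost headers (the sample values and the FT_REGULAR enumerator are made up for illustration):

FileSet result;

// Populate the working set; every index stays consistent automatically.
result.insert(FileEntry{/*Device*/ 1, /*ParentINode*/ 2, /*INode*/ 100,
                        "passwd", FT_REGULAR});
result.insert(FileEntry{1, 2, 101, "group", FT_REGULAR});

// The same records, viewed ordered by name...
FileSet::index<FileEntry::NameIndex>::type & by_name =
    result.get<FileEntry::NameIndex>();
auto named = by_name.equal_range("passwd");

// ...or by (device, parent inode), i.e. "what does this directory contain?"
FileSet::index<FileEntry::DirectoryIndex>::type & by_dir =
    result.get<FileEntry::DirectoryIndex>();
auto children = by_dir.equal_range(boost::make_tuple(dev_t(1), 2L));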

It gets you what you say you want from SQL but without all the indirection or duplication of SQL.

(Note that the attached example will fill a FileSet object in memory with the entire (sub)tree of a file system if you call Fill() with the deep selector, or just fill the set with the contents of a single directory when the shallow selector is used.)

etc.

So anyway. I've been in this modeling space for a while. I looked at other means of modeling such as SQL and I decided this one was "the best" for what I was doing.

The results are fast, deterministic, and type-safe with no internal typecasts or full table scan results.
Great implementation! It almost fits all the needs, except the ability to save it on disk...

In fact, I already have an idea for implementing it in pure C; it won't be as generic as the C++ templates, but it should
work anyway.

So the remaining problem will be the ability to cache things on disk for cases like the 13T array.

For C/C++, saving records to disk would mean a simplified version of the page cache, which will still be a huge challenge. For SQL it would be much easier, but I consider it the last resort now, since the C++ implementation seems quite good.
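
One possible direction, just as a hedged sketch (the record layout and helper below are hypothetical, not btrfs-progs code): instead of writing a full userspace cache, back the records with a temp file mapped MAP_SHARED and let the kernel write cold pages back to disk rather than pinning anonymous memory.

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Placeholder for the real extent record; the fields are illustrative only.
struct extent_record {
  unsigned long long start;
  unsigned long long nr;
  unsigned int refs;
};

// Allocate an array of records backed by a temp file, so pages the checker
// is not currently touching can be reclaimed and re-read on demand.
extent_record *alloc_record_arena(const char *tmp_path, size_t count)
{
  int fd = open(tmp_path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0)
    return nullptr;

  size_t bytes = count * sizeof(extent_record);
  if (ftruncate(fd, bytes) < 0) {
    close(fd);
    return nullptr;
  }

  void *base = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
  close(fd);  /* the mapping keeps the file contents accessible */
  return base == MAP_FAILED ? nullptr : static_cast<extent_record *>(base);
}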

Anyway, the memory usage problem still needs to be resolved.

Thanks,
Qu