Stefan Fuhrmann wrote on Sun, Feb 12, 2012 at 13:57:23 +0100: > On 12.02.2012 06:27, Branko Čibej wrote: > >On 12.02.2012 02:52, Stefan Fuhrmann wrote: > > > >>The silly part of FSFS is that it does not optimize access > >>paths, yet, but stores changes individually. The challenge > >>is our two-dimensional key space and the fact that different > >>operations traverse the data along different dimensions > >>(e.g. log ./. checkout). > >Interestingly enough, the 2D keyspace isn't that big a problem. > > On the physical level, it becomes a problem as files > model a linear data stream, i.e. you have to find a > mapping onto a 1-dimensional structure that keeps > related / dependent information closely together. >
So perhaps solve _that_ problem? Write a filesystem/disk driver for two-dimensional files, that store their bytes in the same physical sector on multiple platters. A reader can then read either along a sector, or read the same sector on cosecutive platters... > > The real > >issue is that we don't even represent all the relevant keys, because of > >the lazy copying of subtrees. That's what actually prevents us from > >doing one-shot lookups of arbitrary path@rev; and even then, we'd only > >really have to do a step-by-step top-down resolve if the initial fast > >lookup failed. > > How do you verify that path@rev even exists? > > For instance, you could maintain an index of (only) > every path ever touched. As long as we want to keep > our O(1) delete operations for paths, entries in that > index cannot be updated / removed. So, we need If the cache contains only fspaths, with no metadata about them, sure. But if the cache can map fspaths to a set of revision ranges... > to always verify that there is actually a path from /@rev > to path@rev. > > >>Question: how many entries would a direct lookup structure > >>need to have (i.e. path@rev -> data pointer)? Keep in > >>mind that may valid paths like /branches/foo/bar will never > >>be mentioned anywhere in a SVN repository because they > >>never got touched under that name. A rough estimation for > >>a fully expanded list of entries is > >> > >>#nodesInTrunk * #openBranches * #revisions > >> > >>This yields 10^9 entries for small repositories and>10^14 > >>for KDE-sized ones. Clearly impractical. > >So that's not how you do it. :) > > > >You'd only need one reference per actual node, not per possible node > >lookup paths including revisions. Obviously you have to resolve path@rev > >to a concrete node before you can do anything with its attributes. With > >the caveat that there's a nasty edge case involving not-yet-lazy-copied > >child nodes; but given the way we currently crawl the tree, that's not > >really an issue. > > We simply follow the tree, i.e. one lookup per parent. > With any luck most of the intermediate nodes have > already been cached, which is exactly what makes our > differential trees so efficient: as long as a sub-tree is > not being modified, all copies and future revisions > share the *same* representation. > > But maybe we are simply talking past each other here. > So, could you give a short summary of what you consider > wrong / inefficient about SVN / FSFS and how you would > like to see this addressed. > > -- Stefan^2.