Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags
Eric Blake wrote:
> On 04/24/2011 11:49 AM, Jamie Lokier wrote:
> > My problem with FIND_* is that we are messing with the well
> > understood semantics of lseek().  fcntl() looks a better fit for
> > FIND_HOLE/DATA anyway.
>
> With fcntl(), it would have to be something like:
>
>     off_t offset = start;
>     if (fcntl (fd, F_FIND_HOLE, &offset) != 0)
>         ; /* find failed */
>     /* offset is now set to the next location after start */

Yes that, or a pointer-to-struct in the style of other fcntl()
operations.

A struct offers more flexibility, such as a limit on search distance
(may be useful on filesystems where searching reads a lot of metadata
and you don't really want to scan all of a large file), and whether to
include unwritten prealloc space or written-known-zeros space.

I thought of fcntl() because historically it's often how you get quirky
information about files and how to read them, on many OSes.

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
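The pointer-to-struct variant suggested above might look something like
this (a sketch only: F_FIND_HOLE, the struct name, and every field here
are hypothetical; no such interface exists in the kernel):

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical argument block for an fcntl()-based extent search.
 * All names, fields, and flag bits below are invented to illustrate
 * the extra knobs a struct makes room for. */
struct find_extent {
    off_t start;        /* in:  where to begin searching */
    off_t max_dist;     /* in:  give up after scanning this many bytes */
    unsigned int flags; /* in:  see hypothetical FIND_F_* bits below */
    off_t result;       /* out: start of the next hole (or data) */
};

/* Hypothetical flag bits for the policy questions raised above. */
#define FIND_F_PREALLOC_IS_HOLE  0x1  /* treat unwritten prealloc as hole */
#define FIND_F_ZEROES_ARE_HOLE   0x2  /* treat written-known-zeros as hole */
```

The caller would fill in `start`, `max_dist`, and `flags`, pass a
pointer via fcntl(), and read `result` back, so the search-distance
limit and the prealloc/zeros policy ride along without needing extra
fcntl() commands.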
Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags
Sunil Mushran wrote:
> On 04/22/2011 04:50 AM, Eric Blake wrote:
> > That blog also mentioned the useful idea of adding FIND_HOLE and
> > FIND_DATA, not implemented in Solaris, but which could easily be
> > provided as additional lseek constants in Linux to locate the start
> > of the next chunk without repositioning, and which could ease the
> > application programmer's life a bit.  After all, cp wants to know
> > where data ends without repositioning (FIND_HOLE), read() that much
> > data (which repositions in the process), then skip to the next chunk
> > of data (SEEK_DATA) - two lseek() calls per iteration if we have 4
> > constants, but 3 per iteration if we only have SEEK_HOLE and have to
> > manually rewind.
>
>     while (1) {
>         read(block);
>         if (block_all_zeroes)
>             lseek(SEEK_DATA);
>     }
>
> What's wrong with the above?  If this is the case, even SEEK_HOLE is
> not needed, but should be added as it is already in Solaris.

Apart from the obvious waste of effort (scanning *all* data for zeros
is cheap but not free, if the file is mostly non-hole zeros), you can't
do a pread() version of the above in parallel over different parts of
the same file/device.

My problem with FIND_* is that we are messing with the well understood
semantics of lseek().  fcntl() looks a better fit for FIND_HOLE/DATA
anyway.

-- Jamie
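For comparison, the cp-style loop using only the SEEK_HOLE/SEEK_DATA
pair this patch adds can be sketched as follows (error handling
abbreviated; a real copy would also ftruncate() the destination to the
source length so a trailing hole survives):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy only the data extents of `in` to `out`: SEEK_HOLE finds where
 * the current data run ends, SEEK_DATA skips to the start of the next
 * run.  On filesystems without hole support the kernel's generic
 * fallback reports the whole file as one data run, so this degrades
 * to a plain copy. */
static int copy_sparse(int in, int out)
{
    off_t data = lseek(in, 0, SEEK_DATA);     /* first data run */
    while (data >= 0) {
        off_t hole = lseek(in, data, SEEK_HOLE); /* end of this run */
        if (hole < 0)
            return -1;
        char buf[65536];
        off_t pos = data;
        while (pos < hole) {
            size_t want = (size_t)(hole - pos) < sizeof buf
                              ? (size_t)(hole - pos) : sizeof buf;
            ssize_t n = pread(in, buf, want, pos);
            if (n <= 0)
                return -1;
            if (pwrite(out, buf, (size_t)n, pos) != n)
                return -1;
            pos += n;
        }
        data = lseek(in, hole, SEEK_DATA); /* ENXIO past the last run */
    }
    return 0;
}
```

Because it uses pread()/pwrite() for the data itself, the only
repositioning is the two lseek() calls per iteration the thread is
counting, and independent offset ranges could be handled by parallel
workers.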
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote:
> I have noticed that events in Btrfs develop by scenario not predicted
> by the paper of academic Ohad Rodeh (in spite of the announce that
> Btrfs is based on this paper).  This is why I have started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".

-- Jamie
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote:
> I'll try to help, but I am rather pessimistic here: working out
> algorithms is something which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

> > Note that it's not necessarily a problem to have a few nodes with
> > low utilisation.  Deliberate violation of the classic balancing
> > heuristic is often useful for performance.[*]
>
> Ok, let's violate this, but not to the detriment of utilisation:
> internal fragmentation is a horror for file systems, the enemy #1
> (which was completely defeated in the last century, BTW).
>
> > The problem you've found is only a real problem when there are _too
> > many_ nodes with low utilisation.
>
> IMHO this is a problem, as we can not estimate the number of such
> nodes.  Do you have any sane upper boundary for this number?  I don't
> know of one.  Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not provide
a hard upper boundary.  I'm really not sure, though.  I haven't read
the code.

> > [*] For example when filling a tree by inserting contiguously
> > ascending keys, the classic "split into two when full" heuristic
> > gives the worst possible results (50% lost space), and deliberately
> > underfilling the most actively updated nodes, which is not permitted
> > at all by the classic algorithm, gives denser packing in the end
> > (almost zero lost space).
>
> At the end of what?

At the end of inserting keys in ascending order.
Just think about the classic B-tree when that is done: node[1] fills to
100% utilisation, then is split into two nodes at 50% (call them
node[1] and node[2]).  Node[2] fills to 100%, then splits into node[2]
at 50% and node[3].  Etc. etc.: the result is a million nodes at 50%
utilisation, except the last one.

If instead you fill node[1], then *don't* split it but permit node[2]
to start at 0%, let that fill to 100%, then don't split node[2] and let
node[3] start at 0%, etc. etc.: the result is half a million nodes at
100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)

To handle the case of multiple independent ascending runs of keys being
written in parallel, you generalise to maintain more than one below-50%
block, near the active insertion heads.

> > The above describes classic B-trees where updates are assumed to be
> > written immediately.  Things are different when there's a memory
> > cache and delayed allocation/writing, and packing can be reorganised
> > in memory before sending to storage.
>
> I hope you don't speak about on-line defragmentation?

No, not at all.

> Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's obvious
when you think about it.  But maybe it's not obvious to everyone -
perhaps this is even the heuristic modification missing from btrfs
which it would need to avoid the scenario which started this thread?

-- Jamie
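The arithmetic of those two scenarios is easy to check with a small
simulation (standalone sketch, not btrfs code; the capacity of 100
records per node and the key count are arbitrary):

```c
#include <assert.h>
#include <stddef.h>

enum { CAP = 100 };  /* records per leaf node (arbitrary) */

/* Classic policy for ascending inserts: when the rightmost leaf is
 * full, split it into two half-full leaves and keep inserting into
 * the new right half. */
static size_t classic_leaves(size_t nkeys)
{
    size_t leaves = 1, fill = 0;
    for (size_t i = 0; i < nkeys; i++) {
        if (fill == CAP) {   /* split: each half keeps CAP/2 */
            leaves++;
            fill = CAP / 2;
        }
        fill++;
    }
    return leaves;
}

/* Underfill policy: never split; when the rightmost leaf is full,
 * open a new, initially 0% leaf to its right. */
static size_t packed_leaves(size_t nkeys)
{
    size_t leaves = 1, fill = 0;
    for (size_t i = 0; i < nkeys; i++) {
        if (fill == CAP) {
            leaves++;
            fill = 0;
        }
        fill++;
    }
    return leaves;
}
```

For a million ascending keys the classic policy ends with roughly twice
as many leaves (all at 50%, bar the last) as the underfill policy (all
at 100%, bar the last), which is exactly the factor-of-two the text
describes.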
Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)
Edward Shishkin wrote:
> If you decide to base your file system on some algorithms, then please
> use the original ones from proper academic papers.  DO NOT modify the
> algorithms in solitude: this is a very fragile thing!  All such
> modifications must be reviewed by specialists in the theory of
> algorithms.  Such review can be done in various scientific magazines
> of proper level.
>
> Personally I don't see any way to improve the situation with Btrfs
> except fully redesigning the last one.  If you want to base your file
> system on the paper of Ohad Rodeh, then please use *exactly* the
> Bayer B-trees that he refers to.  That said, make sure that all
> records you put in the tree have equal length and all non-root nodes
> of your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records
(key -> string) as a range of fixed-length records
([key, index] -> byte) and applying the standard B-tree algorithms to
that, which guarantees the algorithmic properties such as space
utilisation and time.  Then you can substitute a compressed
representation of contiguous index runs, which amounts to nothing more
than just storing the strings (split where the algorithm says to do
so) and endpoint indexes, and because this compression does not expand
(in any way that matters), the classic algorithmic properties are
still guaranteed.

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

> As to current Btrfs code: *NOT ACK*!!!  I don't think we need such
> file systems.

Btrfs provides many useful features that other filesystems don't.
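The modelling step in that argument can be made concrete (an
illustration only; the struct and function names are invented, and a
real implementation would never materialise the expansion - it exists
to show the equivalence):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One fixed-length record in the model: a variable-length record
 * (key -> string) becomes one of these per byte of the string,
 * ordered by (key, index). */
struct fixed_rec {
    unsigned long key;   /* the original record's key */
    size_t index;        /* this byte's position within the string */
    unsigned char byte;  /* one byte of the string's value */
};

/* Expand one variable-length record into its run of fixed-length
 * records.  Returns the number produced (== strlen(value)). */
static size_t expand(unsigned long key, const char *value,
                     struct fixed_rec *out)
{
    size_t len = strlen(value);
    for (size_t i = 0; i < len; i++) {
        out[i].key = key;
        out[i].index = i;
        out[i].byte = (unsigned char)value[i];
    }
    return len;
}
```

A classic B-tree over these fixed records inherits the standard
utilisation and time bounds; storing each contiguous run as one string
plus its endpoint indexes is exactly the non-expanding compression the
argument relies on.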
We definitely need it, or something like it.

You have identified a bug.  It's not a corruption bug, but it's
definitely a bug, and probably affects performance as well as space
utilisation.

It is not a deep design bug; it is just a result of the packing and
balancing heuristics.

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.

Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]

The problem you've found is only a real problem when there are _too
many_ nodes with low utilisation.

[*] For example when filling a tree by inserting contiguously ascending
keys, the classic "split into two when full" heuristic gives the worst
possible results (50% lost space), and deliberately underfilling the
most actively updated nodes, which is not permitted at all by the
classic algorithm, gives denser packing in the end (almost zero lost
space).  It's also faster.  The trick is to make sure there's just the
right number of underfilled nodes...

-- Jamie
Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview
Dan Magenheimer wrote:
> Most important, cleancache is ephemeral.  Pages which are copied into
> cleancache have an indefinite lifetime which is completely unknowable
> by the kernel and so may or may not still be in cleancache at any
> later time.  Thus, as its name implies, cleancache is not suitable
> for dirty pages.  The pseudo-RAM has complete discretion over what
> pages to preserve and what pages to discard, and when.

Fwiw, the feature sounds useful to userspace too, for those things with
memory-hungry caches like web browsers.  Any plans to make it available
to userspace?

Thanks,
-- Jamie
Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Linus Torvalds wrote:
> > This is about storage allocation, not aliases.  Storage allocation
> > only depends on lifetime.
>
> Well, the thing is, code motion does extend life-times, and if you
> think you can move stores across each other (even when you can see
> that they alias statically) due to type-based alias decisions, that
> does essentially end up making what _used_ to be disjoint lifetimes
> now be potentially overlapping.

Sometimes code motion makes code faster and/or smaller, but uses more
stack space.  If you want to keep the stack use down, it blocks some
other optimisations.

Register allocation is similar: code motion optimisations may use more
registers due to overlapping lifetimes, which causes more register
spills and changes the code.  The two interact; it's not trivial to
optimise fully.

-- Jamie
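A small example of the lifetime point (illustrative only; whether the
compiler actually overlaps the slots depends on the optimiser and
target):

```c
#include <assert.h>
#include <stddef.h>

/* Two large locals with disjoint lifetimes: a compiler may place `a`
 * and `b` in the same stack slot, since `a` is dead before `b` is
 * born.  If code motion hoisted a store to `b` above the last use of
 * `a`, the lifetimes would overlap and the frame would need room for
 * both arrays - the stack-space cost described above. */
static int sum_twice(const int *src, size_t n)
{
    int total = 0;
    {
        int a[256];                 /* lifetime: this block only */
        for (size_t i = 0; i < n; i++)
            a[i] = src[i] * 2;
        for (size_t i = 0; i < n; i++)
            total += a[i];
    }                               /* `a` is dead from here on */
    {
        int b[256];                 /* may reuse a's stack slot */
        for (size_t i = 0; i < n; i++)
            b[i] = src[i] + 1;
        for (size_t i = 0; i < n; i++)
            total += b[i];
    }
    return total;
}
```

The same trade-off plays out with registers instead of stack slots:
moving the `b` loop's stores earlier keeps more values live at once,
which is where the extra spills come from.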
Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning
Harvey Harrison wrote:
> Oh yeah, and figure out what actually breaks on alpha such that they
> added the following (arch/alpha/include/asm/compiler.h)
>
> #ifdef __KERNEL__
> /* Some idiots over in linux/compiler.h thought inline should imply
>    always_inline.  This breaks stuff.  We'll include this file
>    whenever we run into such problems. */

Does always_inline complain if the function isn't inlinable, while
inline allows it?  That would explain the alpha comment.

-- Jamie
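For reference, the distinction in question looks like this in GCC
syntax (a minimal sketch; GCC/Clang-specific attribute):

```c
#include <assert.h>

/* always_inline forces inlining even at -O0, and GCC treats failure
 * to inline such a function as a hard error rather than falling back
 * to a call - which is the kind of breakage the alpha comment could
 * be hinting at. */
static inline __attribute__((always_inline)) int twice(int x)
{
    return x * 2;   /* trivially inlinable, so this compiles cleanly */
}

/* Plain `inline` is only a hint: the compiler is free to ignore it
 * and emit an ordinary out-of-line call instead. */
static inline int twice_hint(int x)
{
    return x * 2;
}
```

So a function that the compiler cannot inline (for whatever
target-specific reason) is accepted with plain `inline` but rejected
once `inline` is made to imply `always_inline`, matching the question
above.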
Re: Btrfs for mainline
J. Bruce Fields wrote:
> Old kernel versions may still get booted after btrfs has gotten a
> reputation for stability.  E.g. if I move my / to btrfs in 2.6.34,
> then one day need to boot back to 2.6.30 to track down some
> regression, the reminder that I'm moving back to some sort of btrfs
> dark ages might be welcome.

Require a mount option, allow_unstable_version, until it's stable?
The stable version can ignore the option.  In your example, you
wouldn't use the option, and in the btrfs dark ages it would refuse to
mount.

-- Jamie