Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-25 Thread Jamie Lokier
Eric Blake wrote:
> On 04/24/2011 11:49 AM, Jamie Lokier wrote:
> > My problem with FIND_* is that we are messing with the well understood
> > semantics of lseek().
> >
> > fcntl() looks a better fit for FIND_HOLE/DATA anyway.
>
> With fcntl(), it would have to be something like:
>
>     off_t offset = start;
>     if (fcntl (fd, F_FIND_HOLE, &offset) != 0)
>         ; // find failed
>     // offset is now set to the next location after start

Yes that, or a pointer-to-struct in the style of other fcntl()
operations.

A struct offers more flexibility, such as a limit on search distance
(may be useful on filesystems where searching reads a lot of metadata
and you don't really want to scan all of a large file), and whether to
include unwritten prealloc space or written-known-zeros space.
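A struct-based call might look something like this (purely an illustrative
sketch: F_FIND_HOLE, struct find_region and its fields are hypothetical
names, not an existing kernel interface):

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical fcntl() argument in the style of struct flock.  None of
 * these names exist in any kernel; they only illustrate what a
 * struct-based interface could carry beyond a bare offset. */
struct find_region {
    off_t    start;    /* in: offset to begin searching from */
    off_t    found;    /* out: offset of the next hole (or data) */
    off_t    max_scan; /* in: give up after examining this many bytes */
    uint32_t flags;    /* in: e.g. include unwritten prealloc space,
                        *     or written-known-zeros space */
};

/* Usage would then be along the lines of:
 *
 *     struct find_region fr = { .start = pos, .max_scan = 1 << 30 };
 *     if (fcntl(fd, F_FIND_HOLE, &fr) == 0)  // F_FIND_HOLE: hypothetical
 *         pos = fr.found;
 */
```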

I thought of fcntl() because historically it's often how you get quirky
information about files and how to read them, on many OSes.

-- Jamie
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-24 Thread Jamie Lokier
Sunil Mushran wrote:
> On 04/22/2011 04:50 AM, Eric Blake wrote:
> > That blog also mentioned the useful idea of adding FIND_HOLE and
> > FIND_DATA, not implemented in Solaris, but which could easily be
> > provided as additional lseek constants in Linux to locate the start of
> > the next chunk without repositioning, and which could ease application
> > programmers' lives a bit.  After all, cp wants to know where data ends
> > without repositioning (FIND_HOLE), read() that much data (which
> > repositions in the process), then skip to the next chunk of data
> > (SEEK_DATA) - two lseek() calls per iteration if we have 4 constants,
> > but 3 per iteration if we only have SEEK_HOLE and have to manually rewind.
>
> while (1) {
>     read(block);
>     if (block_all_zeroes)
>         lseek(SEEK_DATA);
> }
>
> What's wrong with the above?  If this is the case, even SEEK_HOLE
> is not needed, but it should be added as it is already in Solaris.

Apart from the obvious waste of effort (scanning *all* data for zeros
is cheap but not free if the file is mostly non-hole zeros), you can't
do a pread() version of the above in parallel over different parts of
the same file/device.
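For reference, the two-lseek()-per-iteration sparse copy Eric describes
might be sketched like this (assuming the proposed SEEK_DATA/SEEK_HOLE
lseek constants with Solaris-style semantics; the function name is mine):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#ifndef SEEK_DATA
#define SEEK_DATA 3   /* values as in the proposed patch */
#define SEEK_HOLE 4
#endif

/* Copy src to dst, skipping holes instead of read()ing everything and
 * scanning for zeros: one SEEK_DATA plus one SEEK_HOLE per data extent.
 * Because it uses pread()/pwrite(), several workers could run this in
 * parallel over disjoint ranges of the same file.  Returns 0 or -1. */
int copy_sparse(int src, int dst)
{
    char buf[65536];
    off_t end = lseek(src, 0, SEEK_END);
    off_t data = 0, hole;

    /* Give dst its full size up front so trailing holes survive. */
    if (end < 0 || ftruncate(dst, end) < 0)
        return -1;

    while ((data = lseek(src, data, SEEK_DATA)) >= 0) {
        hole = lseek(src, data, SEEK_HOLE);  /* end of this extent */
        if (hole < 0)
            return -1;
        while (data < hole) {
            size_t want = (size_t)(hole - data);
            if (want > sizeof buf)
                want = sizeof buf;
            ssize_t n = pread(src, buf, want, data);
            if (n <= 0)
                return -1;
            if (pwrite(dst, buf, (size_t)n, data) != n)
                return -1;
            data += n;
        }
    }
    /* SEEK_DATA past the last extent fails with ENXIO: that's EOF. */
    return errno == ENXIO ? 0 : -1;
}
```

A filesystem that can't track holes can still report the whole file as
one data extent, in which case the loop degrades to a plain copy.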

> My problem with FIND_* is that we are messing with the well understood
> semantics of lseek().

fcntl() looks a better fit for FIND_HOLE/DATA anyway.

-- Jamie


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
> I have noticed that events in Btrfs develop by scenario not predicted
> by the paper of academic Ohad Rodeh (in spite of the announce that
> Btrfs is based on this paper). This is why I have started to grumble..

In btrfs, "based on" means "started with the algorithm and ideas of",
not "uses the same algorithm as".

-- Jamie


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-23 Thread Jamie Lokier
Edward Shishkin wrote:
> I'll try to help, but I am rather pessimistic here: working out
> algorithms is something which doesn't like timelines..

Nonsense.  Working out algorithms is just work to an algorithm
designer, just like programming is work to a programmer.  Sure, some
things are harder than others, and there's an element of creativity.
But lots of it is just cranking the handle, even in algorithm design.

> > Note that it's not necessarily a problem to have a few nodes with low
> > utilisation.  Deliberate violation of the classic balancing heuristic
> > is often useful for performance.[*]
>
> Ok, let's violate this, but not to the detriment of utilisation:
> internal fragmentation is a horror for file systems, the enemy #1
> (which was completely defeated in the last century, BTW).
>
> > The problem you've found is only a
> > real problem when there are _too many_ nodes with low utilisation.
>
> IMHO this is a problem, as we cannot estimate the number of such nodes.
> Do you have any sane upper boundary for this number?  I don't know of one.
> Maybe I have missed something?

Well, I don't know if btrfs maintains enough statistics or other
implicit processes to guarantee a sane upper boundary for this.  The
impression I'm getting from the thread is that it relies on heuristic
behaviour which is usually sufficient (and perhaps a bit faster than
updating statistics on the disk would be), but that it does not
provide a hard upper boundary.  I'm really not sure, though.  I
haven't read the code.

> > [*] For example when filling a tree by inserting contiguously
> > ascending keys, the classic "split into two when full" heuristic gives
> > the worst possible results (50% lost space), and deliberately
> > underfilling the most actively updated nodes, which is not permitted
> > at all by the classic algorithm, gives denser packing in the end
> > (almost zero lost space).
>
> At the end of what?

At the end of inserting keys in ascending order.  Just think about the
classic B-tree when that is done: Node[1] fills to 100% utilisation,
then is split into two nodes at 50% (call them node[1] and node[2]).
Node[2] fills to 100%, then splits into node[2] at 50% and node[3].
Etc etc: Result is a million nodes at 50% utilisation, except the last
one.

If instead you fill node[1], then *don't* split it but permit node[2]
to have 0% to start with, let that fill to 100%, then don't split
node[2] and let node[3] start with 0%, etc. etc: Result is half a
million nodes at 100% utilisation, except the last one.

Both fit the desired bounds, but the second is more desirable in
practice, especially as it's common behaviour to fill with contiguous,
ascending keys in a filesystem (or blob-filled database) where data
extents are part of the tree :-)
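The arithmetic above can be checked with a toy simulation (a sketch; the
leaf capacity and the policy names are mine, nothing btrfs-specific):

```c
/* Toy model: insert nkeys ascending keys into B-tree leaves of
 * capacity CAP and return the resulting space utilisation.
 *
 * classic != 0: when the rightmost leaf fills, split it - the frozen
 *   left half keeps CAP/2 keys (the textbook heuristic).
 * classic == 0: when the rightmost leaf fills, freeze it at 100% and
 *   start a brand-new empty leaf (the deliberate-underfill trick). */
enum { CAP = 100 };

double fill_utilisation(long nkeys, int classic)
{
    long leaves = 1, in_last = 0;

    for (long k = 0; k < nkeys; k++) {
        if (in_last == CAP) {             /* rightmost leaf is full */
            leaves++;
            in_last = classic ? CAP / 2 : 0;
        }
        in_last++;                        /* insert the new key */
    }
    return (double)nkeys / ((double)leaves * CAP);
}
```

With a capacity of 100 and a million ascending keys, the classic policy
comes out at roughly 50% utilisation and the underfilling policy at
essentially 100%, matching the half-a-million vs million-node count above.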

To handle the cases of multiple independent ascending runs of keys
being written in parallel, you generalise to maintain more than one
below-50% block, near the active insertion heads.

The above describes classic B-trees where updates are assumed to be
written immediately.  Things are different when there's a memory cache
and delayed allocation/writing, and packing can be reorganised in
memory before sending to storage.

> I hope you don't speak about on-line defragmentation?

No, not at all.

> Can you point me to the paper (if any)?

Sorry, no, I haven't seen this described in a paper.  Imho it's
obvious when you think about it.  But maybe it's not obvious to
everyone - perhaps this is even the heuristic modification missing
from btrfs which it would need to avoid the scenario which started
this thread?

-- Jamie


Re: Btrfs: broken file system design (was Unbound(?) internal fragmentation in Btrfs)

2010-06-18 Thread Jamie Lokier
Edward Shishkin wrote:
> If you decide to base your file system on some algorithms then please
> use the original ones from proper academic papers.  DO NOT modify the
> algorithms in solitude: this is a very fragile thing!  All such
> modifications must be reviewed by specialists in the theory of
> algorithms.  Such review can be done in various scientific magazines of
> proper level.
>
> Personally I don't see any way to improve the situation with Btrfs
> except fully redesigning it.  If you want to base your file
> system on the paper of Ohad Rodeh, then please, use *exactly* the
> Bayer B-trees that he refers to.  That said, make sure that all
> records you put into the tree have equal length and all non-root nodes
> of your tree are at least half filled.

First, thanks Edward for identifying a specific problem with the
current btrfs implementation.

I've studied modified B-trees quite a lot and know enough to be sure
that they are quite robust when you modify them in all sorts of ways.

Moreover, you are incorrect to say there's an intrinsic algorithmic
problem with variable-length records.  It is not true; if Knuth said
so, Knuth was mistaken.

This is easily shown by modelling variable-length records (key ->
string) as a range of fixed-length records ([key,index] -> byte) and
applying the standard B-tree algorithms to that, which guarantees
algorithmic properties such as space utilisation and time.  You can
then substitute a compressed representation of contiguous index runs,
which amounts to nothing more than storing the strings (split where
the algorithm says to do so) together with their endpoint indexes; and
because this compression does not expand (in any way that matters),
the classic algorithmic properties are still guaranteed.
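The modelling step can be made concrete with a toy sketch (hypothetical
names, nothing btrfs-specific):

```c
#include <string.h>

/* Model a variable-length record (key -> string) as a run of
 * fixed-length records ([key,index] -> byte).  Every record the B-tree
 * sees is now the same size, so the classic utilisation and time
 * bounds apply; a real implementation would then store each contiguous
 * run compressed back down to (key, fragment, endpoint indexes). */
struct fixed_rec {
    int  key;
    int  index;   /* position of this byte within the string */
    char byte;
};

/* Expand (key -> s) into out[]; returns the number of fixed records. */
int expand(int key, const char *s, struct fixed_rec *out)
{
    int n = (int)strlen(s);
    for (int i = 0; i < n; i++)
        out[i] = (struct fixed_rec){ key, i, s[i] };
    return n;
}

/* Reassemble a contiguous run back into a string - the "compressed
 * representation" is just the fragment plus its endpoint indexes. */
void compress_run(const struct fixed_rec *r, int n, char *s)
{
    for (int i = 0; i < n; i++)
        s[i] = r[i].byte;
    s[n] = '\0';
}
```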

Variable-length keys are a different business.  Those are trickier,
but as far as I know, btrfs doesn't use them.

> As to current Btrfs code: *NOT ACK*!!!  I don't think we need such
> file systems.

Btrfs provides many useful features that other filesystems don't.  We
definitely need it, or something like it.  You have identified a bug.
It's not a corruption bug, but it's definitely a bug, and probably
affects performance as well as space utilisation.

It is not a deep design bug; it is just a result of the packing and
balancing heuristics.

If you are still interested, please apply your knowledge of B-tree
algorithms to understanding why btrfs fails to balance the tree
sufficiently well, and then propose a fix.

Note that it's not necessarily a problem to have a few nodes with low
utilisation.  Deliberate violation of the classic balancing heuristic
is often useful for performance.[*]  The problem you've found is only a
real problem when there are _too many_ nodes with low utilisation.

[*] For example when filling a tree by inserting contiguously
ascending keys, the classic "split into two when full" heuristic gives
the worst possible results (50% lost space), and deliberately
underfilling the most actively updated nodes, which is not permitted
at all by the classic algorithm, gives denser packing in the end
(almost zero lost space).  It's also faster.  The trick is to make
sure there's just the right number of underfilled nodes...

-- Jamie


Re: [PATCH V2 0/7] Cleancache (was Transcendent Memory): overview

2010-06-02 Thread Jamie Lokier
Dan Magenheimer wrote:
> Most important, cleancache is ephemeral.  Pages which are copied into
> cleancache have an indefinite lifetime which is completely unknowable
> by the kernel and so may or may not still be in cleancache at any later time.
> Thus, as its name implies, cleancache is not suitable for dirty pages.  The
> pseudo-RAM has complete discretion over what pages to preserve and what
> pages to discard and when.

Fwiw, the feature sounds useful to userspace too, for those things
with memory hungry caches like web browsers.  Any plans to make it
available to userspace?

Thanks,
-- Jamie


Re: gcc inlining heuristics was Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-12 Thread Jamie Lokier
Linus Torvalds wrote:
> > This is about storage allocation, not aliases.  Storage allocation only
> > depends on lifetime.
>
> Well, the thing is, code motion does extend life-times, and if you think
> you can move stores across each other (even when you can see that they
> alias statically) due to type-based alias decisions, that does essentially
> end up making what _used_ to be disjoint lifetimes now be potentially
> overlapping.

Sometimes code motion makes code faster and/or smaller but use more
stack space.  If you want to keep the stack use down, it blocks some
other optimisations.

Register allocation is similar: code motion optimisations may use more
registers due to overlapping lifetimes, which causes more register
spills and changes the code.  The two interact; it's not trivial to
optimise fully.

-- Jamie


Re: [PATCH -v7][RFC]: mutex: implement adaptive spinning

2009-01-09 Thread Jamie Lokier
Harvey Harrison wrote:
> Oh yeah, and figure out what actually breaks on alpha such that they added
> the following (arch/alpha/include/asm/compiler.h)
>
> #ifdef __KERNEL__
> /* Some idiots over in linux/compiler.h thought inline should imply
>    always_inline.  This breaks stuff.  We'll include this file whenever
>    we run into such problems.  */

Does always_inline complain if the function isn't inlinable, while
inline allows it?  That would explain the alpha comment.

-- Jamie


Re: Btrfs for mainline

2009-01-06 Thread Jamie Lokier
J. Bruce Fields wrote:
> Old kernel versions may still get booted after btrfs has gotten a
> reputation for stability.  E.g. if I move my / to btrfs in 2.6.34, then
> one day need to boot back to 2.6.30 to track down some regression, the
> reminder that I'm moving back to some sort of btrfs dark-ages might be
> welcome.

Require a mount option allow_unstable_version until it's stable?
The stable version can ignore the option.

In your example, you wouldn't use the option, and in the btrfs dark
ages it would refuse to mount.

-- Jamie