Re: [PATCH 00/37] Permit filesystem local caching

2008-02-26 Thread Daniel Phillips
On Tuesday 26 February 2008 06:33, David Howells wrote:
> > Suppose one were to take a mundane approach to the persistent cache
> > problem instead of layering filesystems.  What you would do then is
> > change NFS's ->write_page and variants to fiddle the persistent
> > cache
> 
> It is a requirement laid down by the Linux NFS fs maintainers that the writes
> to the cache be asynchronous, even if the writes to NFS aren't.

As it happens, I will be hanging out for the next few days with said
NFS maintainers, so it would help to be as informed as possible about
your patch set.

> Note further that NFS's write_page() != writing to the cache.  Writing to the
> cache is typically done by NFS's readpages().

Yes, of course.  But also by ->write_page, no?

> > Which I could eventually find out by reading all the patches but asking you
> > is so much more fun :-)
> 
> And a waste of my time.  I've provided documentation in the main FS-Cache
> patch, both as text files and in comments in header files that answer your
> questions.  Please read them first.

37 patches, none of which has "Documentation" in the subject line, and
you did not provide a diffstat in patch 0 for the patch set as a whole.
If I had known the documentation was there, of course I would have read
it.  It is great to see this level of documentation, but I do not think
it is fair to blame your (one) reader for missing it.

See the smiley above?  The _real_ reason I am asking you is that I do
not think anybody understands your patch set, in spite of your
considerable efforts to address that.  Discussion in public, right or
wrong, is the only way to fix that.  It is counterproductive to drive
readers away from the discussion for fear that they may miss some point
obvious to the original author, or perhaps already discussed earlier on
lkml, and get flamed for it.

Obviously, the patch set is not going to be perfect when it goes in and
it would be a silly abuse of the open source process to require that,
but the parts where it touches the rest of the system have to be really
well understood, and it is clear from the level of participation in the
thread that they are not.

One bit that already came out of this, which you have alluded to
several times yourself but somehow seem to keep glossing over, is that
you need a ->direct_bio file operations method.  So does loopback mount.
It might be worth putting some effort into seeing how ->direct_IO can
be refactored to make that happen.  You can get it in separately on the
basis of helping loopback, and it will make your patches nicer.
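
Just to make the suggestion concrete, here is one guess at the shape of
such a method.  Nothing below exists in any posted patch; the struct
name, the signature and the loff_t stand-in are all hypothetical, a
sketch of the idea rather than a proposal for the actual interface:

struct bio;
struct file;
typedef long long loff_t;   /* stand-in so the sketch is self-contained */

struct file_operations_sketch {
	/* submit a bio against the file's blocks, bypassing the page
	 * cache; loop (and a netfs cache) could use this instead of
	 * going through ->direct_IO */
	int (*direct_bio)(struct file *file, struct bio *bio, loff_t pos);
};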

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-26 Thread Daniel Phillips
I need to respond to this in pieces... first the bit that is bugging
me:

> >   * two new page flags
> 
> I need to keep track of two bits of per-cached-page information:
> 
>  (1) This page is known by the cache, and that the cache must be informed if
>  the page is going to go away.

I still do not understand the life cycle of this bit.  What does the
cache do when it learns the page has gone away?  How is it informed?
Who owns the page cache in which such a page lives: the NFS client, the
filesystem that hosts the page, or a third page cache owned by the
cache itself?  (See my basic confusion about how many page cache
levels you have, below.)

Suppose one were to take a mundane approach to the persistent cache
problem instead of layering filesystems.  What you would do then is
change NFS's ->write_page and variants to fiddle the persistent
cache as well as the network, instead of just the network as now.
This fiddling could even consist of ->write calls to another
filesystem, though working directly with the bio interface would
yield the fastest, and therefore to my mind, best result.
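
To put a shape on that, here is the control flow in pseudo-kernel C.
Every name below (nfs_send_page, cache_write_page, the function itself)
is hypothetical, and the whole thing is only a sketch of the teeing
idea, not code against any real tree:

struct page;
struct writeback_control;

int nfs_send_page(struct page *page);      /* hypothetical: network side */
int cache_write_page(struct page *page);   /* hypothetical: persistent cache */

static int nfs_writepage_teed(struct page *page,
			      struct writeback_control *wbc)
{
	int err = nfs_send_page(page);     /* to the server, as now */

	if (!err)
		err = cache_write_page(page);  /* and to the local cache */
	return err;
}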

In any case, you find out how to write the page to backing store by
asking the filesystem, which in the naive approach would be nfs
augmented with caching library calls.  The filesystem keeps its own
metadata around to know how to map the page to disk.  So again
naively, this metadata could tell the nfs client that the page is
not mapped to disk at all.  So I do not see what your per-page bit
is for, obviously because I do not fully understand your caching
scheme.  Which I could eventually find out by reading all the
patches but asking you is so much more fun :-)

By the way, how many levels of page caching for the same data are
there, is it:

  1) nfs client
  2) cache layer's own page cache
  3) filesystem hosting the cache

or just:

  1) nfs client page cache
  2) filesystem hosting the cache

I think it is the second, but that is already double caching, which
has got to hurt.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-25 Thread Daniel Phillips
On Monday 25 February 2008 15:19, David Howells wrote:
> So I guess there's a problem in cachefiles's efficiency - possibly due
> to the fact that it tries to be fully asynchronous.

OK, not just my imagination, and it makes me feel better about the patch 
set because efficiency bugs are fixable while fundamental limitations 
are not.

How much of a hurry are you in to merge this feature?  You have bits 
like this:

"Add a function to install a monitor on the page lock waitqueue for a 
particular page, thus allowing the page being unlocked to be detected.
This is used by CacheFiles to detect read completion on a page in the 
backing filesystem so that it can then copy the data to the waiting 
netfs page."

We already have that hook, it is called bio_endio.  My strong intuition 
is that your whole mechanism should sit directly on the block device, 
no matter how attractive it seems to be able to piggyback on the 
namespace and layout management code of existing filesystems.  I see 
your current effort as the moral equivalent of FUSE: you are able to 
demonstrate certain desirable behavioral properties, but you are unable 
to reach full theoretical efficiency because there are layers and 
layers of interface gunk interposed between the netfs user and the 
cache device.

That said, I also see you have put a huge amount of work into this over 
the years, it is nicely broken out, you are responsive and easy to work 
with, all arguments for an early merge.  Against that, you invade core 
kernel for reasons that are not necessarily justified:

  * two new page flags
  * a new fileops method
  * many changes to LSM including new object class and new hooks
  * separate fs*id from task struct
  * new page-private destructor hook
  * probably other bits I missed

Would it be correct to say that some of these changes are to support 
disconnected operation?  If so, you really have two patch sets:

  1) Persistent netfs cache
  2) Disconnected netfs operation

You have some short snappers that look generally useful:

  * add_wait_queue_tail (cool)
  * write to a file without a struct file (includes ->mapping cleanup,
probably good)
  * export fsync_super

Why not hunt around for existing in-kernel users that would benefit so 
these can be submitted as standalone patches, shortening the remaining 
patch set and partially overcoming objections due to core kernel 
changes?

One thing I don't see is users coming on to lkml and saying "please 
merge this, it works great for me".  Since you probably have such 
users, why not give them a poke? 

Your cachefilesd is going to need anti-deadlock medicine like ddsnap 
has.  Since you don't seem at all worried about that right now, I 
suspect you have not hammered this code really heavily, correct?  
Without preventative measures, any memory-using daemon sitting in the 
block IO path will deadlock if you hit it hard enough.

A couple of years ago you explained the purpose of the new page flags to 
me and there is no way I can find that email again.  Could you explain 
it again please?  Meanwhile I am doing my duty and reading your OLS 
slides etc.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread Daniel Phillips
On Friday 22 February 2008 04:48, David Howells wrote:
> > But looking up the object in the cache should be nearly free - much less
> > than a microsecond per block.
> 
> The problem is that you have to do a database lookup of some sort, possibly
> involving several synchronous disk operations.

Right, so the obvious optimization strategy for this corner of it is to
decimate the synchronous disk ops for the average case, for which there
are a variety of options, one of which you already suggested.

> CacheFiles does a disk lookup by taking the key given to it by NFS, turning it
> into a set of file or directory names, and doing a short pathwalk to the target
> cache file.  Throwing in extra indices won't necessarily help.  What matters is
> how quick the backing filesystem is at doing lookups.  As it turns out, Ext3 is
> a fair bit better than BTRFS when the disk cache is cold.

All understood.  I am eventually going to suggest cutting the backing
filesystem entirely out of the picture, with a view to improving both
efficiency and transparency, hopefully with a code size reduction as
well.  But you are up and running with the filesystem approach, enough
to tackle the basic algorithm questions, which is worth a lot.

I really do not like the idea of force-fitting this cache into a generic
VFS model.  Sun was collectively smoking some serious crack when they
cooked that one up.  But there is also the ageless principle "isness is
more important than niceness".

> > > The metadata problem is quite a tricky one since it increases with the
> > > number of files you're dealing with.  As things stand in my patches, when
> > > NFS, for example, wants to access a new inode, it first has to go to the
> > > server to lookup the NFS file handle, and only then can it go to the cache
> > > to find out if there's a matching object in the cache.
> > 
> > So without the persistent cache it can omit the LOOKUP and just send the
> > filehandle as part of the READ?
> 
> What 'it'?  Note that to get the filehandle, you have to do a LOOKUP op.  With
> the cache, we could actually cache the results of lookups that we've done,
> however, we don't know that the results are still valid without going to the
> server:-/

What I was trying to say.  It => the cache logic.

> AFS has a way around that - it versions its vnode (inode) IDs.

Which would require a change to NFS, not an option because you hope to
work with standard servers?  Of course with years to think about this,
the required protocol changes were put into v4.  Not.

/me hopes for an NFS hack to show up and explain the thinking there

Actually, there are many situations where changing both the client (you
must do that anyway) and the server is logistically practical.  In fact
that is true for all actual use cases I know of for this cache model.
So elaborating the protocol is not an option to reject out of hand.  A
hack along those lines could (should?) be provided as an opportunistic
option.

Have you completely exhausted optimization ideas for the file handle
lookup?

> > > The reason my client going to my server is so quick is that the server has
> > > the dcache and the pagecache preloaded, so that across-network lookup
> > > operations are really, really quick, as compared to the synchronous
> > > slogging of the local disk to find the cache object.
> > 
> > Doesn't that just mean you have to preload the lookup table for the
> > persistent cache so you can determine whether you are caching the data
> > for a filehandle without going to disk?
> 
> Where "lookup table" == "dcache".  That would be good yes.  cachefilesd
> prescans all the files in the cache, which ought to do just that, but it
> doesn't seem to be very effective.  I'm not sure why.

RCU?  Anyway, it is something to be tracked down and put right.

> > Your big can't-get-there-from-here is the round trip to the server to
> > determine whether you should read from the local cache.  Got any ideas?
> 
> I'm not sure what you mean.  Your statement should probably read "... to
> determine _what_ you should read from the local cache".

What I tried to say.  So still... got any ideas?  That extra synchronous
network round trip is a killer.  Can it be made streaming/async to keep
throughput healthy?

> > And where is the Trond-meister in all of this?
> 
> Keeping quiet as far as I can tell.

/me does the Trond summoning dance

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Daniel Phillips
On Thursday 21 February 2008 16:07, David Howells wrote:
> The way the client works is like this:

Thanks for the excellent ascii art, that cleared up the confusion right
away.

> What are you trying to do exactly?  Are you actually playing with it, or just
> looking at the numbers I've produced?

Trying to see if you are offering enough of a win to justify testing it,
and if that works out, then going shopping for a bin of rotten vegetables
to throw at your design, which I hope you will perceive as useful.

In short I am looking for a reason to throw engineering effort at it.
From the numbers you have posted I think you are missing some basic
efficiencies that could take this design from the sorta-ok zone to wow!

I think you may already be in the wow zone for taking load off a server
and I know of applications where an NFS server gets hammered so badly
that having the client suck a little in the unloaded case is a price
worth paying.  But the whole idea would be much more attractive if the
regressions were smaller.

> > Who is supposed to win big?  Is this mainly about reducing the load on 
> > the server, or is the client supposed to win even with a lightly loaded 
> > server?
> 
> These are difficult questions to answer.  The obvious answer to both is "it
> depends", and the real answer to both is "it's a compromise".
> 
> Inserting a cache adds overhead: you have to look in the cache to see if your
> objects are mirrored there, and then you have to look in the cache to see if
> the data you want is stored there; and then you might have to go to the server
> anyway and then schedule a copy to be stored in the cache.

But looking up the object in the cache should be nearly free - much less
than a microsecond per block.  If not then there are design issues.  I
suspect that you are doing yourself a disservice by going all the way
through the vfs to do this cache lookup, but this needs to be proved.

> The characteristics of this type of cache depend on a number of things: the
> filesystem backing it being the most obvious variable, but also how fragmented
> it is and the properties of the disk drive or drives it is on.

Double caching, and the VM's unawareness of it, has got to hurt.

> The metadata problem is quite a tricky one since it increases with the number
> of files you're dealing with.  As things stand in my patches, when NFS, for
> example, wants to access a new inode, it first has to go to the server to
> lookup the NFS file handle, and only then can it go to the cache to find out
> if there's a matching object in the cache.

So without the persistent cache it can omit the LOOKUP and just send the
filehandle as part of the READ?

> Worse, the cache must then perform several synchronous disk bound metadata
> operations before it can be possible to read from the cache.  Worse still,
> this means that a read on the network file cannot proceed until (a) we've
> been to the server *plus* (b) we've been to the disk.
> 
> The reason my client going to my server is so quick is that the server has
> the dcache and the pagecache preloaded, so that across-network lookup
> operations are really, really quick, as compared to the synchronous slogging
> of the local disk to find the cache object.

Doesn't that just mean you have to preload the lookup table for the
persistent cache so you can determine whether you are caching the data
for a filehandle without going to disk?

> I can probably improve this a little by pre-loading the subindex directories
> (hash tables) that I use to reduce the directory size in the cache, but I
> don't know by how much.

Ah, I should have read ahead.  I think the correct answer is "a lot".
Your big can't-get-there-from-here is the round trip to the server to
determine whether you should read from the local cache.  Got any ideas?

And where is the Trond-meister in all of this?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Daniel Phillips
Hi David,

I am trying to spot the numbers that show the sweet spot for this 
optimization, without much success so far.

Who is supposed to win big?  Is this mainly about reducing the load on 
the server, or is the client supposed to win even with a lightly loaded 
server?

When you say Ext3 cache vs NFS cache, is the first on the server and the
second on the client?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-20 Thread Daniel Phillips
Hi David,

On Wednesday 20 February 2008 08:05, David Howells wrote:
> These patches add local caching for network filesystems such as NFS.

Have you got before/after benchmark results?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] ext3 freeze feature

2008-01-31 Thread Daniel Phillips
On Friday 25 January 2008 05:33, Theodore Tso wrote:
> and then detect the 
> deadlock case where the process holding the file descriptor used to
> freeze the filesystem gets frozen because it attempted to write to the
> filesystem --- at which point it gets some kind of signal (which
> defaults to killing the process), and the filesystem is unfrozen and
> as part of the unfreeze you wake up all of the processes that were put
> to sleep for touching the frozen filesystem.

Hi Ted,

There are a few holes:

  * The process may try to handle the signal and end up blocking on
    the filesystem again.

  * The process might pass the fd to another process by forking or
    fd passing.

  * The process holding the fd might be trying to take a lock held
    by another process that is blocked on the filesystem, and infinite
    variations on that theme.

Remembering the task that did the ioctl might work out better than
remembering the fd.  Or just don't try to be so fancy, and rely on the
application to take appropriate measures to ensure it will not access
the filesystem, such as memlocking and not execing.
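
A minimal sketch of what those "appropriate measures" might look like
from the application side, assuming the freeze is requested through an
ioctl on a file descriptor as in the RFC.  The ioctl name and number
below are hypothetical placeholders; the point is only the mlockall()
call made before freezing, so the freezer cannot fault on its own
frozen filesystem:

#include <sys/mman.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <stdio.h>

#define FS_IOC_FREEZE  _IOWR('X', 119, int)   /* hypothetical ioctl */

int main(void)
{
	int fd = open("/mnt/frozen-fs", O_RDONLY);

	if (fd < 0)
		return 1;

	/* lock everything in RAM so we take no page faults while frozen */
	if (mlockall(MCL_CURRENT | MCL_FUTURE))
		perror("mlockall");

	if (ioctl(fd, FS_IOC_FREEZE, 0))
		perror("freeze");

	/* ... snapshot or back up the volume here, then thaw ... */
	return 0;
}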

The freezer also needs to run in PF_MEMALLOC mode or similar
unless it can be sure it will not cause pageout to the frozen filesystem
under low memory conditions.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-17 Thread Daniel Phillips
On Jan 17, 2008 7:29 AM, Szabolcs Szakacsits <[EMAIL PROTECTED]> wrote:
> Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:

I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted.  But of course
more votes for this is better.  If online incremental fsck is
something people want, then please speak up here and that will very
definitely help make it happen.

On the walk-before-run principle, it would initially just be
filesystem checking, not repair.  But even this would help, by setting
per-group checked flags that offline fsck could use to do a much
quicker repair pass.  And it will let you know when a volume needs to
be taken offline without having to build in planned downtime just in
case, which already eats a bunch of nines.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Btrfs-devel] [ANNOUNCE] Btrfs v0.10 (online growing/shrinking, ext3 conversion, and more)

2008-01-17 Thread Daniel Phillips
On Jan 17, 2008 1:25 PM, Chris Mason <[EMAIL PROTECTED]> wrote:
> So, I've put v0.11 out there.  It fixes those two problems and will also
> compile on older (2.6.18) enterprise kernels.
>
> v0.11 does not have any disk format changes.

Hi Chris,

First, massive congratulations for bringing this to fruition in such a
short time.

Now back to the regular carping: why even support older kernels?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Daniel Phillips
Hi Pavel,

Along with this effort, could you let me know whether the world actually
cares about online fsck?  Now we know how to do it, I think, but is it
worth the effort?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Daniel Phillips
On Jan 15, 2008 7:15 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > Writeback cache on disk in itself is not bad, it only gets bad if the
> > disk is not engineered to save all its dirty cache on power loss,
> > using the disk motor as a generator or alternatively a small battery.
> > It would be awfully nice to know which brands fail here, if any,
> > because writeback cache is a big performance booster.
>
> AFAIK no drive saves the cache. The worst case cache flush for drives is
> several seconds with no retries and a couple of minutes if something
> really bad happens.
>
> This is why the kernel has some knowledge of barriers and uses them to
> issue flushes when needed.

Indeed, you are right, which is supported by actual measurements:

http://sr5tech.com/write_back_cache_experiments.htm

Sorry for implying that anybody has engineered a drive that can do
such a nice thing with writeback cache.

The "disk motor as a generator" tale may not be purely folklore.  When
an IDE drive is not in writeback mode, something special needs to done
to ensure the last write to media is not a scribble.

A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when
the line power is off.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-15 Thread Daniel Phillips
On Jan 15, 2008 6:07 PM, Pavel Machek <[EMAIL PROTECTED]> wrote:
> I had write cache enabled on my main computer. Oops. I guess that
> means we do need better documentation.

Writeback cache on disk in itself is not bad; it only gets bad if the
disk is not engineered to save all its dirty cache on power loss,
using the disk motor as a generator or alternatively a small battery.
It would be awfully nice to know which brands fail here, if any,
because writeback cache is a big performance booster.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-13 Thread Daniel Phillips
Hi Ted,

On Saturday 12 January 2008 06:51, Theodore Tso wrote:
> What is very hard to check is whether or not the link count on the
> inode is correct.  Suppose the link count is 1, but there are
> actually two directory entries pointing at it.  Now when someone
> unlinks the file through one of the directory hard entries, the link
> count will go to zero, and the blocks will start to get reused, even
> though the inode is still accessible via another pathname.  Oops. 
> Data Loss.
>
> This is why doing incremental, on-line fsck'ing is *hard*.  You're
> not going to find this while doing each directory one at a time, and
> if the filesystem is changing out from under you, it gets worse.  And
> it's not just the hard link count.  There is a similar issue with the
> block allocation bitmap.  Detecting the case where two files are
> simultaneously can't be done if you are doing it incrementally, and
> if the filesystem is changing out from under you, it's impossible,
> unless you also have the filesystem telling you every single change
> while it is happening, and you keep an insane amount of bookkeeping.

In this case I am listening to Chicken Little carefully and really do 
believe the sky will fall if we fail to come up with an incremental 
online fsck some time in the next few years.  I realize the challenge 
verges on insane, but I have been slowly chewing away at this question 
for some time.

Val proposes to simplify the problem by restricting the scope of block 
pointers and hard links.  Best of luck with that, the concept of fault 
isolation domains has a nice ring to it.  I prefer to stick close to 
tried and true Ext3 and not change the basic algorithms.

Rather than restricting pointers, I propose to add a small amount of new 
metadata to accelerate global checking.  The idea is to be able to 
build per-group reverse maps very quickly, to support mapping physical 
blocks back to inodes that own them, and mapping inodes back to the 
directories that reference them.

I see on-the-fly filesystem reverse mapping as useful for more than just 
online fsck.  For example it would be nice to be able to work backwards 
efficiently from a list of changed blocks such as ddsnap produces to a 
list of file level changes.

The amount of metadata required to support efficient on-the-fly reverse 
mapping is surprisingly small: 2K per block group per terabyte, in a 
fixed location at the base of each group.  This is consistent with my 
goal of producing code that is mergable for Ext4 and backportable to 
Ext3.

Building a block reverse map for a given group is easy and efficient.  
The first pass walks across the inode table and already maps most of 
the physical blocks for typical usage patterns, because most files only 
have direct pointers.  Index blocks discovered in the first pass go 
onto a list to be processed by subsequent passes, which may discover 
additional index blocks.  Just keep pushing the index blocks back onto 
the list and the algorithm terminates when the list is empty.  This 
builds a reverse map for the group including references to external 
groups.
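
For concreteness, here is a toy, userspace-only model of that worklist
pass.  All the numbers and helper names are made up; real code would be
walking the ext3 inode table and decoding real index blocks, but the
shape of the algorithm - map direct pointers immediately, queue index
blocks, keep popping until the list is empty - is the same:

#include <stdio.h>

#define MAX_WORK 64

static int worklist[MAX_WORK];   /* index blocks still to decode */
static int owner[MAX_WORK];      /* inode that owns each queued entry */
static int top;

/* stand-in for recording "blk belongs to inode" in the reverse map */
static void rmap_add(int inode, int blk)
{
	printf("block %4d -> inode %d\n", blk, inode);
}

/* stand-in for reading an index block and returning the blocks it points at */
static int decode_index(int idx, int out[2])
{
	out[0] = idx - 900;
	out[1] = idx - 800;
	return 2;
}

int main(void)
{
	/* pass 1: walk the inode table; direct pointers map immediately,
	 * index blocks (>= 1000 in this toy) are queued with their owner */
	int itable[2][3] = { { 10, 11, 1000 }, { 20, 1002, 0 } };

	for (int ino = 0; ino < 2; ino++)
		for (int j = 0; j < 3 && itable[ino][j]; j++) {
			int blk = itable[ino][j];

			if (blk >= 1000) {
				owner[top] = ino;
				worklist[top++] = blk;
			} else {
				rmap_add(ino, blk);
			}
		}

	/* later passes: decode queued index blocks; any further index
	 * blocks discovered would be pushed back on, and the algorithm
	 * terminates when the list is empty */
	while (top > 0) {
		int ino = owner[--top];
		int ptrs[2], n = decode_index(worklist[top], ptrs);

		for (int k = 0; k < n; k++)
			rmap_add(ino, ptrs[k]);
	}
	return 0;
}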

Note that the recent metadata clustering patch from Abhishek Rai will 
speed up this group mapping algorithm significantly because (almost) 
all the index blocks can be picked up in one linear read.   This should 
only take a few milliseconds.  One more reason why I think his patch is 
an Important Patch[tm].

A data block may be up to four groups removed from its home group, 
therefore the reverse mapping process must follow pointers across 
groups and map each file entirely to be sure that all pointers to the 
group being checked have been discovered.  It is possible to construct 
a case where a group contains a lot of inodes of big files that are 
mostly stored in other groups.  Mapping such a group could possibly 
require examining all the index blocks on the entire volume.  That 
would be about 2**18 index blocks per terabyte, which is still within 
the realm of practicality.

To generate the inode reverse map for a group, walk each directory in the group,
decoding the index blocks by hand.  Strictly speaking, directories 
ought to pass block level checking before being reverse mapped, but 
there could be many directories in the same group spilling over into a 
lot of external groups, so getting all the directory inodes to pass 
block level checks at the same time could be difficult with filesystem 
writing going on between fsck episodes.  Instead, just go ahead and 
assume a directory file is ok, and if this is not the case the 
directory walk will fail or a block level check will eventually pick up 
the problem.

The worst case for directory mapping is much worse than the worst case 
for block mapping.  A single directory could fill an entire volume.  
For such a large directory, reverse mapping is not possible without 
keeping the filesystem suspended for an unreasonable time.  Either make 
the reverse map incremental and maintained on the fly 

Re: [RFD] Incremental fsck

2008-01-12 Thread Daniel Phillips
On Wednesday 09 January 2008 01:16, Andreas Dilger wrote:
> While an _incremental_ fsck isn't so easy for existing filesystem
> types, what is pretty easy to automate is making a read-only snapshot
> of a filesystem via LVM/DM and then running e2fsck against that.  The
> kernel and filesystem have hooks to flush the changes from cache and
> make the on-disk state consistent.
>
> You can then set the the ext[234] superblock mount count and last
> check time via tune2fs if all is well, or schedule an outage if there
> are inconsistencies found.
>
> There is a copy of this script at:
> http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html
>
> Note that it might need some tweaks to run with DM/LVM2
> commands/output, but is mostly what is needed.

You can do this now with ddsnap (an out-of-tree device mapper target) 
either by checking a local snapshot or a replicated snapshot on a 
different machine, see:

http://zumastor.org/

Doing the check on a remote machine seems attractive because the fsck 
does not create a load on the server.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-11 Thread Daniel Phillips
Hi Jens,

This looks really useful.

On Wednesday 09 January 2008 00:52, Jens Axboe wrote:
> Disadvantages:
>
> - The file block mappings must not change while loop is using the
> file. This means that we have to ensure exclusive access to the file
> and this is the bit that is currently missing in the implementation.
> It would be nice if we could just do this via open(), ideas
> welcome... 

Get_block methods are pretty fast and you have caching in the level 
above you, so you might be able to get away with no cache of physical 
addresses at all, in which case you just need i_mutex and i_alloc_sem 
at get_block time.  This would save a pile of code and still have the 
main benefit of avoiding double caching.

If you use ->get_block instead of bmap, it will fill in file holes for 
you, but of course get_block is not exposed, and Al is likely to bark 
at anyone who exposes it.

Instead of exposing get_block you could expose an aops method 
like ->bio_transfer that would hide the use of *_get_block in a library 
routine, just as __blockdev_direct_IO does.  Chances are, there are 
other users besides loop that would be interested in a generic way of 
performing bio transfers to files.
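
Again purely to illustrate, something along these lines.  Neither the
name ->bio_transfer nor this signature exists anywhere; it is one guess
at the shape, and the loff_t typedef is only there so the sketch stands
alone:

struct bio;
struct address_space;
typedef long long loff_t;   /* stand-in for the kernel type */

struct address_space_operations_sketch {
	/* map the file region at pos and submit the bio against it,
	 * using the filesystem's get_block internally (and filling
	 * holes), the way __blockdev_direct_IO hides that plumbing
	 * today */
	int (*bio_transfer)(struct address_space *mapping,
			    struct bio *bio, loff_t pos, int rw);
};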

I presume you would fall back to the existing approach for any 
filesystem without get_block.  You could handle this transparently with 
a default library method that does read/write.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] VFS: apply coding standards to fs/ioctl.c

2007-10-28 Thread Daniel Phillips
On 10/28/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> While you're at it, it's probably worth splitting this out into
> a small helper function.

Why? Is the same pattern called from more than one place?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-09-01 Thread Daniel Phillips
On Friday 31 August 2007 14:41, Alasdair G Kergon wrote:
> On Thu, Aug 30, 2007 at 04:20:35PM -0700, Daniel Phillips wrote:
> > Resubmitting a bio or submitting a dependent bio from
> > inside a block driver does not need to be throttled because all
> > resources required to guarantee completion must have been obtained
> > _before_ the bio was allowed to proceed into the block layer.
>
> I'm toying with the idea of keeping track of the maximum device stack
> depth for each stacked device, and only permitting it to increase in
> controlled circumstances.

Hi Alasdair,

What kind of circumstances did you have in mind?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-30 Thread Daniel Phillips
On Wednesday 29 August 2007 01:53, Evgeniy Polyakov wrote:
> Then, if of course you will want, which I doubt, you can reread
> previous mails and find that it was pointed to that race and
> possibilities to solve it way too long ago.

What still bothers me about your response is that, while you know the 
race exists and do not disagree with my example, you don't seem to see 
that that race can eventually lock up the block device by repeatedly 
losing throttle counts which are never recovered.  What prevents that?

> > --- 2.6.22.clean/block/ll_rw_blk.c  2007-07-08 16:32:17.0 -0700
> > +++ 2.6.22/block/ll_rw_blk.c  2007-08-24 12:07:16.0 -0700
> > @@ -3237,6 +3237,15 @@ end_io:
> >   */
> >  void generic_make_request(struct bio *bio)
> >  {
> > +   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
> > +
> > +   if (q && q->metric) {
> > +   int need = bio->bi_reserved = q->metric(bio);
> > +   bio->queue = q;
>
> In case you have stacked device, this entry will be rewritten and you
> will lost all your account data.

It is a weakness all right.  Well,

-   if (q && q->metric) {
+   if (q && q->metric && !bio->queue) {

which fixes that problem.  Maybe there is a better fix possible.  Thanks 
for the catch!

The original conception was that this block throttling would apply only 
to the highest level submission of the bio, the one that crosses the 
boundary between filesystem (or direct block device application) and 
block layer.  Resubmitting a bio or submitting a dependent bio from 
inside a block driver does not need to be throttled because all 
resources required to guarantee completion must have been obtained 
_before_ the bio was allowed to proceed into the block layer.

The other principle we are trying to satisfy is that the throttling 
should not be released until bio->endio, which I am not completely sure 
about with the patch as modified above.  Your earlier idea of having 
the throttle protection only cover the actual bio submission is 
interesting and may be effective in some cases, in fact it may cover 
the specific case of ddsnap.  But we don't have to look any further 
than ddraid (distributed raid) to find a case it doesn't cover - the 
additional memory allocated to hold parity data has to be reserved 
until parity data is deallocated, long after the submission completes.
So while you manage to avoid some logistical difficulties, it also looks 
like you didn't solve the general problem.

Hopefully I will be able to report on whether my patch actually works 
soon, when I get back from vacation.  The mechanism in ddsnap this is 
supposed to replace is effective, it is just ugly and tricky to verify.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 10:54, Evgeniy Polyakov wrote:
> On Tue, Aug 28, 2007 at 10:27:59AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > We do not care about one cpu being able to increase its counter
> > > higher than the limit, such inaccuracy (maximum bios in flight
> > > thus can be more than limit, difference is equal to the number of
> > > CPUs - 1) is a price for removing atomic operation. I thought I
> > > pointed it in the original description, but might forget, that if
> > > it will be an issue, that atomic operations can be introduced
> > > there. Any uber-precise measurements in the case when we are
> > > close to the edge will not give us any benefit at all, since were
> > > are already in the grey area.
> >
> > This is not just inaccurate, it is suicide.  Keep leaking throttle
> > counts and eventually all of them will be gone.  No more IO
> > on that block device!
>
> First, because number of increased and decreased operations are the
> same, so it will dance around limit in both directions.

No.  Please go and read it the description of the race again.  A count
gets irretrievably lost because the write operation of the first
decrement is overwritten by the second. Data gets lost.  Atomic 
operations exist to prevent that sort of thing.  You either need to use 
them or have a deep understanding of SMP read and write ordering in 
order to preserve data integrity by some equivalent algorithm.
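
For anyone who wants to see the effect rather than take my word for it,
here is a little userspace demonstration (nothing to do with the actual
patch, just the same read-modify-write pattern): the plain counter loses
decrements under contention, the atomic one never does.  Build with
something like cc -pthread and the plain counter will almost never come
back to zero.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define N 1000000

static volatile int plain;   /* decremented with a racy read-modify-write */
static atomic_int safe;      /* decremented atomically */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < N; i++) {
		plain = plain - 1;          /* counts can be lost here */
		atomic_fetch_sub(&safe, 1); /* never loses a count */
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	plain = 2 * N;
	atomic_store(&safe, 2 * N);

	pthread_create(&a, NULL, worker, NULL);
	pthread_create(&b, NULL, worker, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	printf("plain:  %d (should be 0)\n", plain);
	printf("atomic: %d\n", atomic_load(&safe));
	return 0;
}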

> Let's solve problems in order of their appearence. If bio structure
> will be allowed to grow, then the whole patches can be done better.

How about like the patch below.  This throttles any block driver by
implementing a throttle metric method so that each block driver can
keep track of its own resource consumption in units of its choosing.
As an (important) example, it implements a simple metric for device
mapper devices.  Other block devices will work as before, because
they do not define any metric.  Short, sweet and untested, which is
why I have not posted it until now.

This patch originally kept its accounting info in backing_dev_info,
however that structure seems to be in some flux and it is just a part of
struct queue anyway, so I lifted the throttle accounting up into
struct queue.  We should be able to report on the efficacy of this
patch in terms of deadlock prevention pretty soon.

--- 2.6.22.clean/block/ll_rw_blk.c  2007-07-08 16:32:17.0 -0700
+++ 2.6.22/block/ll_rw_blk.c  2007-08-24 12:07:16.0 -0700
@@ -3237,6 +3237,15 @@ end_io:
  */
 void generic_make_request(struct bio *bio)
 {
+   struct request_queue *q = bdev_get_queue(bio->bi_bdev);
+
+   if (q && q->metric) {
+   int need = bio->bi_reserved = q->metric(bio);
+   bio->queue = q;
+   wait_event_interruptible(q->throttle_wait, atomic_read(&q->available) >= need);
+   atomic_sub(need, &q->available);
+   }
+
if (current->bio_tail) {
/* make_request is active */
*(current->bio_tail) = bio;
--- 2.6.22.clean/drivers/md/dm.c  2007-07-08 16:32:17.0 -0700
+++ 2.6.22/drivers/md/dm.c  2007-08-24 12:14:23.0 -0700
@@ -880,6 +880,11 @@ static int dm_any_congested(void *conges
return r;
 }
 
+static unsigned dm_metric(struct bio *bio)
+{
+   return bio->bi_vcnt;
+}
+
 /*-
  * An IDR is used to keep track of allocated minor numbers.
  *---*/
@@ -997,6 +1002,10 @@ static struct mapped_device *alloc_dev(i
goto bad1_free_minor;
 
md->queue->queuedata = md;
+   md->queue->metric = dm_metric;
+   atomic_set(&md->queue->available, md->queue->capacity = 1000);
+   init_waitqueue_head(&md->queue->throttle_wait);
+
md->queue->backing_dev_info.congested_fn = dm_any_congested;
md->queue->backing_dev_info.congested_data = md;
blk_queue_make_request(md->queue, dm_request);
--- 2.6.22.clean/fs/bio.c   2007-07-08 16:32:17.0 -0700
+++ 2.6.22/fs/bio.c 2007-08-24 12:10:41.0 -0700
@@ -1025,7 +1025,12 @@ void bio_endio(struct bio *bio, unsigned
bytes_done = bio->bi_size;
}
 
-   bio->bi_size -= bytes_done;
+   if (!(bio->bi_size -= bytes_done) && bio->bi_reserved) {
+   struct request_queue *q = bio->queue;
+   atomic_add(bio->bi_reserved, &q->available);
+   bio->bi_reserved = 0; /* just in case */
+   wake_up(&q->throttle_wait);
+   }
bio->bi_sector += (bytes_done >> 9);
 
if (bio->bi_end_io)
--- 2.6.22

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-28 Thread Daniel Phillips
On Tuesday 28 August 2007 02:35, Evgeniy Polyakov wrote:
> On Mon, Aug 27, 2007 at 02:57:37PM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Say Evgeniy, something I was curious about but forgot to ask you
> > earlier...
> >
> > On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> > > ...All oerations are not atomic, since we do not care about
> > > precise number of bios, but a fact, that we are close or close
> > > enough to the limit.
> > > ... in bio->endio
> > > + q->bio_queued--;
> >
> > In your proposed patch, what prevents the race:
> >
> > cpu1                                cpu2
> >
> > read q->bio_queued
> >                                     read q->bio_queued
> > q->bio_queued--
> > write q->bio_queued - 1
> >                                     q->bio_queued--
> >                                     write q->bio_queued - 1
> >
> > Whoops! We leaked a throttle count.
>
> We do not care about one cpu being able to increase its counter
> higher than the limit, such inaccuracy (maximum bios in flight thus
> can be more than limit, difference is equal to the number of CPUs -
> 1) is a price for removing atomic operation. I thought I pointed it
> in the original description, but might forget, that if it will be an
> issue, that atomic operations can be introduced there. Any
> uber-precise measurements in the case when we are close to the edge
> will not give us any benefit at all, since were are already in the
> grey area.

This is not just inaccurate, it is suicide.  Keep leaking throttle 
counts and eventually all of them will be gone.  No more IO
on that block device!

> Another possibility is to create a queue/device pointer in the bio
> structure to hold original device and then in its backing dev
> structure add a callback to recalculate the limit, but it increases
> the size of the bio. Do we need this?

Different issue.  Yes, I think we need a nice simple approach like
that, and prove it is stable before worrying about the size cost.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-27 Thread Daniel Phillips
Say Evgeniy, something I was curious about but forgot to ask you 
earlier...

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> ...All oerations are not atomic, since we do not care about precise
> number of bios, but a fact, that we are close or close enough to the
> limit. 
> ... in bio->endio
> + q->bio_queued--;

In your proposed patch, what prevents the race:

cpu1                                cpu2

read q->bio_queued
                                    read q->bio_queued
q->bio_queued--
write q->bio_queued - 1
                                    q->bio_queued--
                                    write q->bio_queued - 1

Whoops! We leaked a throttle count.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 05:46, Evgeniy Polyakov wrote:
> > The throttling of the virtual device must begin in
> > generic_make_request and last to ->endio.  You release the throttle
> > of the virtual device at the point you remap the bio to an
> > underlying device, which you have convinced yourself is ok, but it
> > is not.  You seem to miss the fact that whatever resources the
> > virtual device has allocated are no longer protected by the
> > throttle count *of the virtual device*, or you do not
>
> Because it is charged to another device.

Great.  You charged the resource to another device, but you did not 
limit the amount of resources that the first device can consume.  Which 
misses the whole point.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:50, Evgeniy Polyakov wrote:
> On Tue, Aug 14, 2007 at 04:35:43AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > > > And it will not solve the deadlock problem in general.  (Maybe
> > > > it works for your virtual device, but I wonder...)  If the
> > > > virtual device allocates memory during generic_make_request
> > > > then the memory needs to be throttled.
> > >
> > > Daniel, if device process bio by itself, it has a limit and thus
> > > it will wait in generic_make_request()
> >
> > What will make it wait?
>
> gneric_make_request() for given block device.

Not good enough, that only makes one thread wait.  Look here:

http://lkml.org/lkml/2007/8/13/788

An unlimited number of threads can come in, each consuming resources of 
the virtual device, and violating the throttling rules.

The throttling of the virtual device must begin in generic_make_request 
and last to ->endio.  You release the throttle of the virtual device at 
the point you remap the bio to an underlying device, which you have 
convinced yourself is ok, but it is not.  You seem to miss the fact 
that whatever resources the virtual device has allocated are no longer 
protected by the throttle count *of the virtual device*, or you do not 
see why that is a bad thing.  It is a very bad thing, roughly like 
leaving some shared data outside a spin_lock/unlock.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 04:30, Evgeniy Polyakov wrote:
> > And it will not solve the deadlock problem in general.  (Maybe it
> > works for your virtual device, but I wonder...)  If the virtual
> > device allocates memory during generic_make_request then the memory
> > needs to be throttled.
>
> Daniel, if device process bio by itself, it has a limit and thus it
> will wait in generic_make_request()

What will make it wait?

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-14 Thread Daniel Phillips
On Tuesday 14 August 2007 01:46, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 06:04:06AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > Perhaps you never worried about the resources that the device
> > mapper mapping function allocates to handle each bio and so did not
> > consider this hole significant.  These resources can be
> > significant, as is the case with ddsnap.  It is essential to close
> > that window through with the virtual device's queue limit may be
> > violated.  Not doing so will allow deadlock.
>
> This is not a bug, this is special kind of calculation - total limit
> is number of physical devices multiplied by theirs limits. It was
> done _on purpose_ to allow different device to have different limits
> (for example in distributed storage project it is possible to have
> both remote and local node in the same device, but local device
> should not have _any_ limit at all, but network one should).
>
> Virtual device essentially has _no_ limit. And that as done on
> purpose.

And it will not solve the deadlock problem in general.  (Maybe it works 
for your virtual device, but I wonder...)  If the virtual device 
allocates memory during generic_make_request then the memory needs to 
be throttled.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:12, Jens Axboe wrote:
> > It is a system wide problem.  Every block device needs throttling,
> > otherwise queues expand without limit.  Currently, block devices
> > that use the standard request library get a slipshod form of
> > throttling for free in the form of limiting in-flight request
> > structs.  Because the amount of IO carried by a single request can
> > vary by two orders of magnitude, the system behavior of this
> > approach is far from predictable.
>
> Is it? Consider just 10 standard sata disks. The next kernel revision
> will have sg chaining support, so that allows 32MiB per request. Even
> if we disregard reads (not so interesting in this discussion) and
> just look at potentially pinned dirty data in a single queue, that
> number comes to 4GiB PER disk. Or 40GiB for 10 disks. Auch.
>
> So I still think that this throttling needs to happen elsewhere, you
> cannot rely the block layer throttling globally or for a single
> device. It just doesn't make sense.

You are right, so long as the unit of throttle accounting remains one 
request.  This is not what we do in ddsnap.  Instead we inc/dec the 
throttle counter by the number of bvecs in each bio, which produces a 
nice steady data flow to the disk under a wide variety of loads, and 
provides the memory resource bound we require.

One throttle count per bvec will not be the right throttling metric for 
every driver.  To customize this accounting metric for a given driver 
we already have the backing_dev_info structure, which provides 
per-device-instance accounting functions and instance data.  Perfect! 
This allows us to factor the throttling mechanism out of the driver, so 
the only thing the driver has to do is define the throttle accounting 
if it needs a custom one.

We can avoid affecting the traditional behavior quite easily: for
example, if backing_dev_info->throttle_fn (new method) is null then we
either do not throttle at all (and rely on the struct request in-flight
limit) or we move the in-flight request throttling logic into core as
the default throttling method, simplifying the request library and not
changing its behavior.
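
In sketch form, the hook and its fallback could look like this.  The
struct name and the helper are my own stand-ins for the idea, not
anything posted:

struct bio;

struct backing_dev_info_sketch {
	/* how many throttle units does this bio consume?  NULL means no
	 * custom metric: fall back to one unit per bio, or to the
	 * traditional in-flight struct request limit */
	unsigned (*throttle_fn)(struct bio *bio);
};

static unsigned throttle_units(struct backing_dev_info_sketch *bdi,
			       struct bio *bio)
{
	return bdi->throttle_fn ? bdi->throttle_fn(bio) : 1;
}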

> > These deadlocks are first and foremost, block layer deficiencies. 
> > Even the network becomes part of the problem only because it lies
> > in the block IO path.
>
> The block layer has NEVER guaranteed throttling, so it can - by
> definition - not be a block layer deficiency.

The block layer has always been deficient by not providing accurate 
throttling, or any throttling at all for some devices.  We have 
practical proof that this causes deadlock and a good theoretical basis 
for describing exactly how it happens.

To be sure, vm and net are co-conspirators, however the block layer 
really is the main actor in this little drama.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:18, Evgeniy Polyakov wrote:
> > Say you have a device mapper device with some physical device
> > sitting underneath, the classic use case for this throttle code. 
> > Say 8,000 threads each submit an IO in parallel.  The device mapper
> > mapping function will be called 8,000 times with associated
> > resource allocations, regardless of any throttling on the physical
> > device queue.
>
> Each thread will sleep in generic_make_request(), if limit is
> specified correctly, then allocated number of bios will be enough to
> have a progress.

The problem is, the sleep does not occur before the virtual device 
mapping function is called.  Let's consider two devices, a physical 
device named pdev and a virtual device sitting on top of it called 
vdev.   vdev's throttle limit is just one element, but we will see that 
in spite of this, two bios can be handled by the vdev's mapping method 
before any IO completes, which violates the throttling rules. According 
to your patch it works like this:

 Thread 1                            Thread 2

<sees vdev->q->bio_queued
 is zero>
vdev->q->bio_queued++
<mapping function runs,
 allocating resources>
blk_set_bdev(bio, pdev)
 vdev->bio_queued--
                                     <sees vdev->q->bio_queued
                                      is zero>
                                     vdev->q->bio_queued++
                                     <mapping function runs,
                                      allocating resources>

whoops!  Our virtual device mapping function has now allocated
resources for two in-flight bios in spite of having its throttle
limit set to 1.

Perhaps you never worried about the resources that the device mapper 
mapping function allocates to handle each bio and so did not consider 
this hole significant.  These resources can be significant, as is the 
case with ddsnap.  It is essential to close that window through with 
the virtual device's queue limit may be violated.  Not doing so will 
allow deadlock.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 05:04, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 04:04:26AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > > > Oops, and there is also:
> > > >
> > > > 3) The bio throttle, which is supposed to prevent deadlock, can
> > > > itself deadlock.  Let me see if I can remember how it goes.
> > > >
> > > >   * generic_make_request puts a bio in flight
> > > >   * the bio gets past the throttle and initiates network IO
> > > >   * net calls sk_alloc->alloc_pages->shrink_caches
> > > >   * shrink_caches submits a bio recursively to our block device
> > > >   * this bio blocks on the throttle
> > > >   * net may never get the memory it needs, and we are wedged
> > >
> > > If system is in such condition, it is already broken - throttle
> > > limit must be lowered (next time) not to allow such situation.
> >
> > Agreed that the system is broken, however lowering the throttle
> > limit gives no improvement in this case.
>
> How is it ever possible? The whole idea of throttling is to remove
> such situation, and now you say it can not be solved.

It was solved, by not throttling writeout that comes from shrink_caches.
Ugly.

> If limit is for 
> 1gb of pending block io, and system has for example 2gbs of ram (or
> any other resonable parameters), then there is no way we can deadlock
> in allocation, since it will not force page reclaim mechanism.

The problem is that sk_alloc (called from our block driver via 
socket->write) would recurse into shrink_pages, which recursively 
submits IO to our block driver and blocks on the throttle.  Subtle 
indeed, and yet another demonstration of why vm recursion is a Bad 
Thing.

I will find a traceback for you tomorrow, which makes this deadlock much 
clearer.

Regards
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 04:03, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 03:12:33AM -0700, Daniel Phillips ([EMAIL PROTECTED]) wrote:
> > > This is not a very good solution, since it requires all users of
> > > the bios to know how to free it.
> >
> > No, only the specific ->endio needs to know that, which is set by
> > the bio owner, so this knowledge lies in exactly the right place. 
> > A small handful of generic endios all with the same destructor are
> > used nearly everywhere.
>
> That is what I meant - there will be no way to just alloc a bio and
> put it, helpers for generic bio sets must be exported and each and
> every bi_end_io() must be changed to check reference counter and they
> must know how they were allocated.

There are fewer non-generic bio allocators than you think.

> Endio callback is of course quite rare and additional atomic
> reading will not kill the system, but why introduce another read?
> It is possible to provide a flag for endio callback that it is last,
> but it still requires to change every single callback - why do we
> want this?

We don't.  Struct bio does not need to be shrunk.  Jens wanted to talk 
about what fields could be eliminated if we wanted to shrink it.  It is 
about time to let that lie, don't you think?

> So, I'm a bit lost...
>
> You say it is too big 

Did not say that.

> and some parts can be removed or combined

True.

> and  then that size does not matter.

Also true, backed up by numbers on real systems.

> Last/not-last checks in the code is 
> not clear design, so I do not see why it is needed at all if not for
> size shrinking.

Not needed, indeed.  Accurate throttling is needed.  If the best way to 
throttle requires expanding struct bio a little then we should not let 
concerns about the cost  of an int or two stand in the way.  Like Jens, 
I am more concerned about the complexity cost, and that is minimized in 
my opinion by throttling in the generic code rather than with custom 
code in each specialized block driver.

Your patch does throttle in the generic code, great.  Next thing is to 
be sure that it completely closes the window for reserve leakage, which 
is not yet clear.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:23, Evgeniy Polyakov wrote:
> On Sun, Aug 12, 2007 at 10:36:23PM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > (previous incomplete message sent accidentally)
> >
> > On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> > > On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
> > >
> > > So, what did we decide? To bloat bio a bit (add a queue pointer)
> > > or to use physical device limits? The latter requires to replace
> > > all occurence of bio->bi_bdev = something_new with
> > > blk_set_bdev(bio, somthing_new), where queue limits will be
> > > appropriately charged. So far I'm testing second case, but I only
> > > changed DST for testing, can change all other users if needed
> > > though.
> >
> > Adding a queue pointer to struct bio and using physical device
> > limits as in your posted patch both suffer from the same problem:
> > you release the throttling on the previous queue when the bio moves
> > to a new one, which is a bug because memory consumption on the
> > previous queue then becomes unbounded, or limited only by the
> > number of struct requests that can be allocated.  In other words,
> > it reverts to the same situation we have now as soon as the IO
> > stack has more than one queue.  (Just a shorter version of my
> > previous post.)
>
> No. Since all requests for virtual device end up in physical devices,
> which have limits, this mechanism works. Virtual device will
> essentially call either generic_make_request() for new physical
> device (and thus will sleep is limit is over), or will process bios
> directly, but in that case it will sleep in generic_make_request()
> for virutal device.

What can happen is, as soon as you unthrottle the previous queue, 
another thread can come in and put another request on it.  Sure, that 
thread will likely block on the physical throttle and so will the rest 
of the incoming threads, but it still allows the higher level queue to 
grow past any given limit, with the help of lots of threads.  JVM for 
example?

Say you have a device mapper device with some physical device sitting 
underneath, the classic use case for this throttle code.  Say 8,000 
threads each submit an IO in parallel.  The device mapper mapping 
function will be called 8,000 times with associated resource 
allocations, regardless of any throttling on the physical device queue.

Anyway, your approach is awfully close to being airtight, there is just 
a small hole.  I would be more than happy to be proved wrong about 
that, but the more I look, the more I see that hole.

> > 1) One throttle count per submitted bio is too crude a measure.  A
> > bio can carry as few as one page or as many as 256 pages.  If you
> > take only
>
> It does not matter - we can count bytes, pages, bio vectors or
> whatever we like, its just a matter of counter and can be changed
> without problem.

Quite true.  In some cases the simple inc/dec per bio works just fine.  
But the general case where finer granularity is required comes up in 
existing code, so there needs to be a plan.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 01:14, Evgeniy Polyakov wrote:
> > Oops, and there is also:
> >
> > 3) The bio throttle, which is supposed to prevent deadlock, can
> > itself deadlock.  Let me see if I can remember how it goes.
> >
> >   * generic_make_request puts a bio in flight
> >   * the bio gets past the throttle and initiates network IO
> >   * net calls sk_alloc->alloc_pages->shrink_caches
> >   * shrink_caches submits a bio recursively to our block device
> >   * this bio blocks on the throttle
> >   * net may never get the memory it needs, and we are wedged
>
> If system is in such condition, it is already broken - throttle limit
> must be lowered (next time) not to allow such situation.

Agreed that the system is broken, however lowering the throttle limit 
gives no improvement in this case.

This is not theoretical, but a testable, repeatable result.  
Instructions to reproduce should show up tomorrow.

This bug is now solved in a kludgy way.  Now, Peter's patch set offers a 
much cleaner way to fix this little problem, along with at least one 
other nasty that it already fixed.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 03:22, Jens Axboe wrote:
> I never compared the bio to struct page, I'd obviously agree that
> shrinking struct page was a worthy goal and that it'd be ok to uglify
> some code to do that. The same isn't true for struct bio.

I thought I just said that.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 03:06, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > Of course not.  Nothing I said stops endio from being called in the
> > usual way as well.  For this to work, endio just needs to know that
> > one call means "end" and the other means "destroy", this is
> > trivial.
>
> Sorry Daniel, but your suggestions would do nothing more than uglify
> the code and design.

Pretty much exactly what was said about shrinking struct page, ask Bill.  
The difference was, shrinking struct page actually mattered whereas 
shrinking struct bio does not, and neither does expanding it by a few 
bytes.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:18, Evgeniy Polyakov wrote:
> On Mon, Aug 13, 2007 at 02:08:57AM -0700, Daniel Phillips 
([EMAIL PROTECTED]) wrote:
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely seperate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem.  When bio_put hits zero it calls ->endio
> > instead of the destructor.  The ->endio sees that the count is zero
> > and destroys the bio.
>
> This is not a very good solution, since it requires all users of the
> bios to know how to free it.

No, only the specific ->endio needs to know that, which is set by the 
bio owner, so this knowledge lies in exactly the right place.  A small 
handful of generic endios all with the same destructor are used nearly 
everywhere.

> Right now it is hidden. 
> And adds additional atomic check (although reading is quite fast) in
> the end_io.

Actual endio happens once in the lifetime of the transfer, this read 
will be entirely lost in the noise.

> And for what purpose? To eat 8 bytes on 64bit platform? 
> This will not reduce its size noticebly, so the same number of bios
> will be in the cache's page, so what is a gain? All this cleanups and
> logic complicatins should be performed only if after size shring
> increased number of bios can fit into cache's page, will it be done
> after such cleanups?

Well, exactly.  My point from the beginning was that the size of struct 
bio is not even close to being a problem and adding a few bytes to it 
in the interest of doing the cleanest fix to a core kernel bug is just 
not a dominant issue.

I suppose that leaving out the word "bloated" and skipping straight to 
the "doesn't matter" proof would have saved some bandwidth.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 02:13, Jens Axboe wrote:
> On Mon, Aug 13 2007, Daniel Phillips wrote:
> > On Monday 13 August 2007 00:45, Jens Axboe wrote:
> > > On Mon, Aug 13 2007, Jens Axboe wrote:
> > > > > You did not comment on the one about putting the bio
> > > > > destructor in the ->endio handler, which looks dead simple. 
> > > > > The majority of cases just use the default endio handler and
> > > > > the default destructor.  Of the remaining cases, where a
> > > > > specialized destructor is needed, typically a specialized
> > > > > endio handler is too, so combining is free.  There are few if
> > > > > any cases where a new specialized endio handler would need to
> > > > > be written.
> > > >
> > > > We could do that without too much work, I agree.
> > >
> > > But that idea fails as well, since reference counts and IO
> > > completion are two completely seperate entities. So unless end IO
> > > just happens to be the last user holding a reference to the bio,
> > > you cannot free it.
> >
> > That is not a problem.  When bio_put hits zero it calls ->endio
> > instead of the destructor.  The ->endio sees that the count is zero
> > and destroys the bio.
>
> You can't be serious? You'd stall end io completion notification
> because someone holds a reference to a bio.

Of course not.  Nothing I said stops endio from being called in the 
usual way as well.  For this to work, endio just needs to know that one 
call means "end" and the other means "destroy", this is trivial.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 00:45, Jens Axboe wrote:
> On Mon, Aug 13 2007, Jens Axboe wrote:
> > > You did not comment on the one about putting the bio destructor
> > > in the ->endio handler, which looks dead simple.  The majority of
> > > cases just use the default endio handler and the default
> > > destructor.  Of the remaining cases, where a specialized
> > > destructor is needed, typically a specialized endio handler is
> > > too, so combining is free.  There are few if any cases where a
> > > new specialized endio handler would need to be written.
> >
> > We could do that without too much work, I agree.
>
> But that idea fails as well, since reference counts and IO completion
> are two completely seperate entities. So unless end IO just happens
> to be the last user holding a reference to the bio, you cannot free
> it.

That is not a problem.  When bio_put hits zero it calls ->endio instead 
of the destructor.  The ->endio sees that the count is zero and 
destroys the bio.
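
To spell that out, a sketch of the idea follows.  my_bio_set is a 
made-up bio_set, bi_private is assumed to hold a completion for the 
sake of the example, and I am glossing over the exact bi_end_io 
prototype, which has changed between kernel versions.

    static void my_end_io(struct bio *bio, int error)
    {
            if (!atomic_read(&bio->bi_cnt)) {
                    /* second call, refcount hit zero: act as the
                     * destructor */
                    bio_free(bio, my_bio_set);
                    return;
            }
            /* first call, normal completion notification */
            complete(bio->bi_private);
    }

    /* and bio_put() becomes, in essence: */
    void sketch_bio_put(struct bio *bio)
    {
            if (atomic_dec_and_test(&bio->bi_cnt))
                    bio->bi_end_io(bio, 0);   /* the "destroy" call */
    }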

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-13 Thread Daniel Phillips
On Monday 13 August 2007 00:28, Jens Axboe wrote:
> On Sun, Aug 12 2007, Daniel Phillips wrote:
> > Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can
> > derive efficiently from BIO_POOL_IDX() provided the bio was
> > allocated in the standard way.
>
> That would only be feasible, if we ruled that any bio in the system
> must originate from the standard pools.

Not at all.

> > This leaves a little bit of clean up to do for bios not allocated
> > from a standard pool.
>
> Please suggest how to do such a cleanup.

Easy, use the BIO_POOL bits to know bi_max_vecs, the same as for a 
bio from the standard pool.  Just put the power of two size in the bits 
and map that number to the standard pool arrangement with a table 
lookup.
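
Something along these lines, where the table mirrors the standard 
bvec slab sizes and the non-pool branch is the invented part:

    static const unsigned pool_nr_vecs[] = { 1, 4, 16, 64, 128, 256 };

    static inline unsigned bio_max_vecs(struct bio *bio)
    {
            unsigned idx = BIO_POOL_IDX(bio);

            if (idx < ARRAY_SIZE(pool_nr_vecs))
                    return pool_nr_vecs[idx];
            /* bios not from a standard pool would encode a power of
             * two size in the same bits, as suggested above */
            return 1 << idx;
    }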

> > On the other hand, vm writeout deadlock ranks smack dab at the top
> > of the list, so that is where the patching effort must go for the
> > forseeable future.  Without bio throttling, the ddsnap load can go
> > to 24 MB for struct bio alone.  That definitely moves the needle. 
> > in short, we save 3,200 times more memory by putting decent
> > throttling in place than by saving an int in struct bio.
>
> Then fix the damn vm writeout. I always thought it was silly to
> depend on the block layer for any sort of throttling. If it's not a
> system wide problem, then throttle the io count in the
> make_request_fn handler of that problematic driver.

It is a system wide problem.  Every block device needs throttling, 
otherwise queues expand without limit.  Currently, block devices that 
use the standard request library get a slipshod form of throttling for 
free in the form of limiting in-flight request structs.  Because the 
amount of IO carried by a single request can vary by two orders of 
magnitude, the system behavior of this approach is far from 
predictable.

> > You did not comment on the one about putting the bio destructor in
> > the ->endio handler, which looks dead simple.  The majority of
> > cases just use the default endio handler and the default
> > destructor.  Of the remaining cases, where a specialized destructor
> > is needed, typically a specialized endio handler is too, so
> > combining is free.  There are few if any cases where a new
> > specialized endio handler would need to be written.
>
> We could do that without too much work, I agree.

OK, we got one and another is close to cracking, enough of that.

> > As far as code stability goes, current kernels are horribly
> > unstable in a variety of contexts because of memory deadlock and
> > slowdowns related to the attempt to fix the problem via dirty
> > memory limits.  Accurate throttling of bio traffic is one of the
> > two key requirements to fix this instability, the other other is
> > accurate writeout path reserve management, which is only partially
> > addressed by BIO_POOL.
>
> Which, as written above and stated many times over the years on lkml,
> is not a block layer issue imho.

Whoever stated that was wrong, but this should be no surprise.  There 
have been many wrong things said about this particular bug over the 
years.  The one thing that remains constant is, Linux continues to 
deadlock under a variety of loads both with and without network 
involvement, making it effectively useless as a storage platform.

These deadlocks are first and foremost, block layer deficiencies.  Even 
the network becomes part of the problem only because it lies in the 
block IO path.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-13 Thread Daniel Phillips
On Sunday 12 August 2007 22:36, I wrote:
> Note!  There are two more issues I forgot to mention earlier.

Oops, and there is also:

3) The bio throttle, which is supposed to prevent deadlock, can itself 
deadlock.  Let me see if I can remember how it goes.

  * generic_make_request puts a bio in flight
  * the bio gets past the throttle and initiates network IO
  * net calls sk_alloc->alloc_pages->shrink_caches
  * shrink_caches submits a bio recursively to our block device
  * this bio blocks on the throttle
  * net may never get the memory it needs, and we are wedged

I need to review a backtrace to get this precisely right, however you 
can see the danger.  In ddsnap we kludge around this problem by not 
throttling any bio submitted in PF_MEMALLOC mode, which effectively 
increases our reserve requirement by the amount of IO that mm will 
submit to a given block device before deciding the device is congested 
and should be left alone.  This works, but is sloppy and disgusting.
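
Paraphrased (not lifted verbatim from dm-ddsnap.c, and devinfo and 
handle_bio are stand-in names), the kludge amounts to:

    static int ddsnap_map(struct devinfo *info, struct bio *bio)
    {
            /* bios submitted by the mm in PF_MEMALLOC mode bypass the
             * throttle; the enlarged reserve covers them */
            if (!(current->flags & PF_MEMALLOC))
                    down(&info->throttle_sem);

            return handle_bio(info, bio);   /* the real mapping work */
    }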

The right thing to do is to make sure that the mm knows about our 
throttle accounting in backing_dev_info so it will not push IO to our 
device when it knows that the IO will just block on congestion.  
Instead, shrink_caches will find some other less congested block device 
or give up, causing alloc_pages to draw from the memalloc reserve to 
satisfy the sk_alloc request.

The mm already uses backing_dev_info this way, we just need to set the 
right bits in the backing_dev_info state flags.  I think Peter posted a 
patch set that included this feature at some point.
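
The hook is the congested_fn in backing_dev_info, which the mm already 
consults before pushing writeout at a device.  A minimal sketch of 
wiring a throttle into it, with bio_count and bio_limit standing in 
for whatever the real accounting fields end up being called:

    static int throttle_congested(void *data, int bdi_bits)
    {
            struct devinfo *info = data;  /* per-device throttle state */

            /* report write congestion as soon as the throttle is full,
             * so the mm picks another device instead of blocking */
            if (atomic_read(&info->bio_count) >= info->bio_limit)
                    return bdi_bits & (1 << BDI_write_congested);
            return 0;
    }

    /* at device creation time: */
    q->backing_dev_info.congested_fn = throttle_congested;
    q->backing_dev_info.congested_data = info;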

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
(previous incomplete message sent accidentally)

On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.

2) Exposing the per-block device throttle limits via sysfs or similar is 
really not a good long term solution for system administration.  
Imagine our help text: "just keep trying smaller numbers until your 
system deadlocks".  We really need to figure this out internally and 
get it correct.  I can see putting in a temporary userspace interface 
just for experimentation, to help determine what really is safe, and 
what size the numbers should be to approach optimal throughput in a 
fully loaded memory state.
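
Going back to point 1), here is a sketch of the variable-size draw.  
Struct throttle and the bi_throttle field are hypothetical, and I am 
glossing over the test-and-charge atomicity discussed elsewhere in 
this thread:

    static unsigned bio_draw(struct bio *bio)
    {
            /* charge by payload, not per bio: a 4K bio draws 1,
             * a 256-page bio draws 256 */
            return DIV_ROUND_UP(bio->bi_size, PAGE_SIZE);
    }

    static void throttle_bio(struct throttle *t, struct bio *bio)
    {
            unsigned draw = bio_draw(bio);

            wait_event(t->wait,
                       atomic_read(&t->count) + draw <= t->limit);
            atomic_add(draw, &t->count);
            /* remember the draw: bi_size shrinks as the transfer
             * completes, so it cannot be recomputed at endio time */
            bio->bi_throttle = draw;        /* hypothetical field */
    }

    static void unthrottle_bio(struct throttle *t, struct bio *bio)
    {
            atomic_sub(bio->bi_throttle, &t->count);
            wake_up(&t->wait);
    }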

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
On Wednesday 08 August 2007 02:54, Evgeniy Polyakov wrote:
> On Tue, Aug 07, 2007 at 10:55:38PM +0200, Jens Axboe 
([EMAIL PROTECTED]) wrote:
>
> So, what did we decide? To bloat bio a bit (add a queue pointer) or
> to use physical device limits? The latter requires to replace all
> occurence of bio->bi_bdev = something_new with blk_set_bdev(bio,
> somthing_new), where queue limits will be appropriately charged. So
> far I'm testing second case, but I only changed DST for testing, can
> change all other users if needed though.

Adding a queue pointer to struct bio and using physical device limits as 
in your posted patch both suffer from the same problem: you release the 
throttling on the previous queue when the bio moves to a new one, which 
is a bug because memory consumption on the previous queue then becomes 
unbounded, or limited only by the number of struct requests that can be 
allocated.  In other words, it reverts to the same situation we have 
now as soon as the IO stack has more than one queue.  (Just a shorter 
version of my previous post.)

We can solve this by having the bio only point at the queue to which it 
was originally submitted, since throttling the top level queue 
automatically throttles all queues lower down the stack.  Alternatively 
the bio can point at the block_device or straight at the 
backing_dev_info, which is the per-device structure it actually needs 
to touch.

Note!  There are two more issues I forgot to mention earlier.

1) One throttle count per submitted bio is too crude a measure.  A bio 
can carry as few as one page or as many as 256 pages.  If you take only 
one throttle count per bio and that data will be transferred over the 
network then you have to assume that (a little more than) 256 pages of 
sk_alloc reserve will be needed for every bio, resulting in a grossly 
over-provisioned reserve.  The precise reserve calculation we want to 
do is per-block device, and you will find hooks like this already 
living in backing_dev_info.  We need to place our own fn+data there to 
calculate the throttle draw for each bio.  Unthrottling gets trickier 
with variable size throttle draw.  In ddsnap, we simply write the 
amount we drew from the throttle into (the private data of) bio for use 
later by unthrottle, thus avoiding the issue that the bio fields we 
used to calculate might have changed during the lifetime of the bio.  
This would translate into one more per-bio field.



the throttling performs another function: keeping a reasonable amount of 
IO in flight for the device.  The definition of "reasonable" is 
complex.  For a hard disk it depends on the physical distance between 
sector addresses of the bios in flight.  In ddsnap we make a crude but 
workable approximation that 


 In general, a per block device 

The throttle count needs to cover 

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-12 Thread Daniel Phillips
On Tuesday 07 August 2007 13:55, Jens Axboe wrote:
> I don't like structure bloat, but I do like nice design. Overloading
> is a necessary evil sometimes, though. Even today, there isn't enough
> room to hold bi_rw and bi_flags in the same variable on 32-bit archs,
> so that concern can be scratched. If you read bio.h, that much is
> obvious.

Sixteen bits in bi_rw are consumed by queue priority.  Is there a reason 
this lives in struct bio instead of struct request?

> If you check up on the iommu virtual merging, you'll understand the
> front and back size members. They may smell dubious to you, but
> please take the time to understand why it looks the way it does.

Virtual merging is only needed at the physical device, so why do these 
fields live in struct bio instead of struct request?

> Changing the number of bvecs is integral to how bio buildup current
> works.

Right, that is done by bi_vcnt.  I meant bi_max_vecs, which you can 
derive efficiently from BIO_POOL_IDX() provided the bio was allocated 
in the standard way.  This leaves a little bit of clean up to do for 
bios not allocated from a standard pool.

Incidentally, why does the bvl need to be memset to zero on allocation?  
bi_vcnt already tells you which bvecs are valid and the only field in a 
bvec that can reasonably default to zero is the offset, which ought to 
be set every time a bvec is initialized anyway.

> > bi_destructor could be combined.  I don't see a lot of users of
> > bi_idx,
>
> bi_idx is integral to partial io completions.

Struct request has a remaining submission sector count so what does 
bi_idx do that is different?

> > that looks like a soft target.  See what happened to struct page
> > when a couple of folks got serious about attacking it, some really
> > deep hacks were done to pare off a few bytes here and there.  But
> > struct bio as a space waster is not nearly in the same ballpark.
>
> So show some concrete patches and examples, hand waving and
> assumptions is just a waste of everyones time.

Average struct bio memory footprint ranks near the bottom of the list of 
things that suck most about Linux storage.  At idle I see 8K in use 
(reserves); during updatedb it spikes occasionally to 50K; under a 
heavy  load generated by ddsnap on a storage box it sometimes goes to 
100K with bio throttling in place.  Really not moving the needle.

On the other hand, vm writeout deadlock ranks smack dab at the top of 
the list, so that is where the patching effort must go for the 
forseeable future.  Without bio throttling, the ddsnap load can go to 
24 MB for struct bio alone.  That definitely moves the needle.  In 
short, we save 3,200 times more memory by putting decent throttling in 
place than by saving an int in struct bio.

That said, I did a little analysis to get an idea of where the soft 
targets are in struct bio, and to get to know the bio layer a little 
better.  Maybe these few hints will get somebody interested enough to 
look further.

> > It would be interesting to see if bi_bdev could be made read only.
> > Generally, each stage in the block device stack knows what the next
> > stage is going to be, so why do we have to write that in the bio? 
> > For error reporting from interrupt context?  Anyway, if Evgeniy
> > wants to do the patch, I will happily unload the task of convincing
> > you that random fields are/are not needed in struct bio :-)
>
> It's a trade off, otherwise you'd have to pass the block device
> around a lot.

Which costs very little, probably less than trashing an extra field's 
worth of cache.

> And it's, again, a design issue. A bio contains 
> destination information, that means device/offset/size information.
> I'm all for shaving structure bytes where it matters, but not for the
> sake of sacrificing code stability or design. I consider struct bio
> quite lean and have worked hard to keep it that way. In fact, iirc,
> the only addition to struct bio since 2001 is the iommu front/back
> size members. And I resisted those for quite a while.

You did not comment on the one about putting the bio destructor in 
the ->endio handler, which looks dead simple.  The majority of cases 
just use the default endio handler and the default destructor.  Of the 
remaining cases, where a specialized destructor is needed, typically a 
specialized endio handler is too, so combining is free.  There are few 
if any cases where a new specialized endio handler would need to be 
written.

As far as code stability goes, current kernels are horribly unstable in 
a variety of contexts because of memory deadlock and slowdowns related 
to the attempt to fix the problem via dirty memory limits.  Accurate 
throttling of bio traffic is one of the two key requirements to fix 
this instability, the other is accurate writeout path reserve 
management, which is only partially addressed by BIO_POOL.

Nice to see you jumping in Jens.  Now it is over to the other side of 
the thread where Evgeniy has posted a 

Re: [1/1] Block device throttling [Re: Distributed storage.]

2007-08-12 Thread Daniel Phillips
Hi Evgeniy,

Sorry for not getting back to you right away, I was on the road with 
limited email access.  Incidentally, the reason my mails to you keep 
bouncing is, your MTA is picky about my mailer's IP reversing to a real 
hostname.  I will take care of that pretty soon, but for now my direct 
mail to you is going to bounce and you will only see the lkml copy.

On Wednesday 08 August 2007 03:17, Evgeniy Polyakov wrote:
> This throttling mechanism allows to limit maximum amount of queued
> bios per physical device. By default it is turned off and old block
> layer behaviour with unlimited number of bios is used. When turned on
> (queue limit is set to something different than -1U via
> blk_set_queue_limit()), generic_make_request() will sleep until there
> is room in the queue. number of bios is increased in
> generic_make_request() and reduced either in bio_endio(), when bio is
> completely processed (bi_size is zero), and recharged from original
> queue when new device is assigned to bio via blk_set_bdev(). All
> oerations are not atomic, since we do not care about precise number
> of bios, but a fact, that we are close or close enough to the limit.
>
> Tested on distributed storage device - with limit of 2 bios it works
> slow :)

it seems to me you need:

-   if (q) {
+   if (q && q->bio_limit != -1) {

This patch is short and simple, and will throttle more accurately than 
the current simplistic per-request allocation limit.  However, it fails 
to throttle device mapper devices.  This is because no request is 
allocated by the device mapper queue method, instead the mapping call 
goes straight through to the mapping function.  If the mapping function 
allocates memory (typically the case) then this resource usage evades 
throttling and deadlock becomes a risk.

There are three obvious fixes:

   1) Implement bio throttling in each virtual block device
   2) Implement bio throttling generically in device mapper
   3) Implement bio throttling for all block devices

Number 1 is the approach we currently use in ddsnap, but it is ugly and 
repetitious.  Number 2 is a possibility, but I favor number 3 because 
it is a system-wide solution to a system-wide problem, does not need to 
be repeated for every block device that lacks a queue, heads in the 
direction of code subtraction, and allows system-wide reserve 
accounting. 

Your patch is close to the truth, but it needs to throttle at the top 
(virtual) end of each block device stack instead of the bottom 
(physical) end.  It does head in the direction of eliminating your own 
deadlock risk indeed, however there are block devices it does not 
cover.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-07 Thread Daniel Phillips
On Tuesday 07 August 2007 05:05, Jens Axboe wrote:
> On Sun, Aug 05 2007, Daniel Phillips wrote:
> > A simple way to solve the stable accounting field issue is to add a
> > new pointer to struct bio that is owned by the top level submitter
> > (normally generic_make_request but not always) and is not affected
> > by any recursive resubmission.  Then getting rid of that field
> > later becomes somebody's summer project, which is not all that
> > urgent because struct bio is already bloated up with a bunch of
> > dubious fields and is a transient structure anyway.
>
> Thanks for your insights. Care to detail what bloat and dubious
> fields struct bio has?

First obvious one I see is bi_rw separate from bi_flags.  Front_size and 
back_size smell dubious.  Is max_vecs really necessary?  You could 
reasonably assume bi_vcnt rounded up to a power of two and bury the 
details of making that work behind wrapper functions to change the 
number of bvecs, if anybody actually needs that.  Bi_endio and 
bi_destructor could be combined.  I don't see a lot of users of bi_idx, 
that looks like a soft target.  See what happened to struct page when a 
couple of folks got serious about attacking it, some really deep hacks 
were done to pare off a few bytes here and there.  But struct bio as a 
space waster is not nearly in the same ballpark.

It would be interesting to see if bi_bdev could be made read only.  
Generally, each stage in the block device stack knows what the next 
stage is going to be, so why do we have to write that in the bio?  For 
error reporting from interrupt context?  Anyway, if Evgeniy wants to do 
the patch, I will happily unload the task of convincing you that random 
fields are/are not needed in struct bio :-)

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Sunday 05 August 2007 08:01, Evgeniy Polyakov wrote:
> On Sun, Aug 05, 2007 at 01:06:58AM -0700, Daniel Phillips wrote:
> > > DST original code worked as device mapper plugin too, but its two
> > > additional allocations (io and clone) per block request ended up
> > > for me as a show stopper.
> >
> > Ah, sorry, I misread.  A show stopper in terms of efficiency, or in
> > terms of deadlock?
>
> At least as in terms of efficiency. Device mapper lives in happy
> world where memory does not end and allocations are fast.

Are you saying that things are different for a network block device 
because it needs to do GFP_ATOMIC allocations?  If so then that is just 
a misunderstanding.  The global page reserve Peter and I use is 
available in interrupt context just like GFP_ATOMIC.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Sunday 05 August 2007 08:08, Evgeniy Polyakov wrote:
> If we are sleeping in memory pool, then we already do not have memory
> to complete previous requests, so we are in trouble.

Not at all.  Any requests in flight are guaranteed to get the resources 
they need to complete.  This is guaranteed by the combination of memory 
reserve management and request queue throttling.  In logical terms, 
reserve management plus queue throttling is necessary and sufficient to 
prevent these deadlocks.  Conversely, the absence of either one allows 
deadlock.

> This can work 
> for devices which do not require additional allocations (like usual
> local storage), but not for network connected ones.

It works for network devices too, and also for a fancy device like 
ddsnap, which is the moral equivalent of a filesystem implemented in 
user space.

> If not in device, then at least it should say to block layer about
> its limits. What about new function to register queue...

Yes, a new internal API is needed eventually.  However, no new api is 
needed right at the moment because we can just hard code the reserve 
sizes and queue limits and audit them by hand, which is not any more 
sloppy than several other kernel subsystems.  The thing is, we need to 
keep any obfuscating detail out of the initial patches because these 
principles are hard enough to explain already without burying them in 
hundreds of lines of API fluff.

That said, the new improved API should probably not be a new way to 
register, but a set of function calls you can use after the queue is 
created, which follows the pattern of the existing queue API.
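
In other words, something in the spirit of the existing blk_queue_* 
setters; the names and fields below are invented for illustration:

    /* hypothetical additions, following the blk_queue_* pattern */
    void blk_queue_throttle_limit(struct request_queue *q,
                                  unsigned limit)
    {
            q->bio_limit = limit;           /* placeholder field */
    }

    void blk_queue_throttle_metric(struct request_queue *q,
                                   unsigned (*metric)(struct bio *))
    {
            /* how much each bio draws from the throttle: bios,
             * pages, bytes, or something device specific */
            q->bio_metric = metric;         /* placeholder field */
    }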

> ...which will get 
> maximum number of bios in flight and sleep in generic_make_request()
> when new bio is going to be submitted and it is about to exceed the
> limit?

Exactly.  This is what ddsnap currently does and it works.  But we did 
not change generic_make_request for this driver, instead we throttled 
the driver from the time it makes a request to its user space server, 
until the reply comes back.  We did it that way because it was easy and 
was the only segment of the request lifeline that could not be fixed by 
other means.  A proper solution for all block devices will move the 
throttling up into generic_make_request, as you say below.

> By default things will be like they are now, except additional
> non-atomic increment and branch in generic_make_request() and
> decrement and wake in bio_end_io()?

->endio is called in interrupt context, so the accounting needs to be 
atomic as far as I can see.

We actually account the total number of bio pages in flight, otherwise 
you would need to assume the largest possible bio and waste a huge 
amount of reserve memory.  A counting semaphore works fine for this 
purpose, with some slight inefficiency that is nigh on unmeasurable in 
the block IO path.  What the semaphore does is make the patch small and 
easy to understand, which is important at this point.
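
Concretely, the counting semaphore pattern is about this simple, as a 
sketch of the ddsnap approach rather than the actual code; devinfo is 
a stand-in name:

    static void throttle_init(struct devinfo *info)
    {
            /* allow, say, 1000 pages of bio payload in flight */
            sema_init(&info->throttle_sem, 1000);
    }

    /* submission path, process context: one count per payload page */
    static void throttle(struct devinfo *info, struct bio *bio)
    {
            int i, pages = DIV_ROUND_UP(bio->bi_size, PAGE_SIZE);

            for (i = 0; i < pages; i++)
                    down(&info->throttle_sem);
    }

    /* completion path: up() is safe from interrupt context, which is
     * one reason a counting semaphore is convenient here */
    static void unthrottle(struct devinfo *info, unsigned pages)
    {
            while (pages--)
                    up(&info->throttle_sem);
    }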

> I can cook up such a patch if idea worth efforts.

It is.  There are some messy details...  You need a place to store the 
accounting variable/semaphore and need to be able to find that place 
again in ->endio.  Trickier than it sounds, because of the unstructured 
way drivers rewrite ->bi_bdev.   Peterz has already poked at this in a 
number of different ways, typically involving backing_dev_info, which 
seems like a good idea to me.

A simple way to solve the stable accounting field issue is to add a new 
pointer to struct bio that is owned by the top level submitter 
(normally generic_make_request but not always) and is not affected by 
any recursive resubmission.  Then getting rid of that field later 
becomes somebody's summer project, which is not all that urgent because 
struct bio is already bloated up with a bunch of dubious fields and is 
a transient structure anyway.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Saturday 04 August 2007 09:44, Evgeniy Polyakov wrote:
> > On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> > > * storage can be formed on top of remote nodes and be
> > > exported simultaneously (iSCSI is peer-to-peer only, NBD requires
> > > device mapper and is synchronous)
> >
> > In fact, NBD has nothing to do with device mapper.  I use it as a
> > physical target underneath ddraid (a device mapper plugin) just
> > like I would use your DST if it proves out.
>
> I meant to create a storage on top of several nodes one needs to have
> device mapper or something like that on top of NBD itself. To further
> export resulted device one needs another userspace NDB application
> and so on. DST simplifies that greatly.
>
> DST original code worked as device mapper plugin too, but its two
> additional allocations (io and clone) per block request ended up for
> me as a show stopper.

Ah, sorry, I misread.  A show stopper in terms of efficiency, or in 
terms of deadlock?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-05 Thread Daniel Phillips
On Saturday 04 August 2007 09:37, Evgeniy Polyakov wrote:
> On Fri, Aug 03, 2007 at 06:19:16PM -0700, I wrote:
> > To be sure, I am not very proud of this throttling mechanism for
> > various reasons, but the thing is, _any_ throttling mechanism no
> > matter how sucky solves the deadlock problem.  Over time I want to
> > move the
>
> make_request_fn is always called in process context,

Yes, as is submit_bio which calls it.  The decision re where it is best 
to throttle, in submit_bio or in make_request_fn, has more to do with 
system factoring, that is, is throttling something that _every_ block 
device should have (yes I think) or is it a delicate, optional thing 
that needs a tweakable algorithm per block device type (no I think).

The big worry I had was that by blocking on congestion in the 
submit_bio/make_request_fn I might stuff up system-wide mm writeout.  
But a while ago that part of the mm was tweaked (by Andrew if I recall 
correctly) to use a pool of writeout threads and understand the concept 
of one of them blocking on some block device, and not submit more 
writeout to the same block device until the first thread finishes its 
submission.  Meanwhile, other mm writeout threads carry on with other 
block devices.

> we can wait in it for memory in mempool. Although that means we
> already in trouble. 

Not at all.  This whole block writeout path needs to be written to run 
efficiently even when normal system memory is completely gone.  All it 
means when we wait on a mempool is that the block device queue is as 
full as we are ever going to let it become, and that means the block 
device is working as hard as it can (subject to a small caveat: for 
some loads a device can work more efficiently if it can queue up larger 
numbers of requests down at the physical elevators).

By the way, ddsnap waits on a counting semaphore, not a mempool.  That 
is because we draw our reserve memory from the global memalloc reserve, 
not from a mempool.  And that is not only because it takes less code to 
do so, but mainly because global pools as opposed to lots of little 
special purpose pools seem like a good idea to me.  Though I will admit 
that with our current scheme we need to allow for the total of the 
maximum reserve requirements for all memalloc users in the memalloc 
pool, so it does not actually save any memory vs dedicated pools.  We 
could improve that if we wanted to, by having hard and soft reserve 
requirements: the global reserve actually only needs to be as big as 
the total of the hard requirements.  With this idea, if by some unlucky 
accident every single pool user got itself maxed out at the same time, 
we would still not exceed our share of the global reserve.  
Under "normal" low memory situations, a block device would typically be 
free to grab reserve memory up to its soft limit, allowing it to 
optimize over a wider range of queued transactions.   My little idea 
here is: allocating specific pages to a pool is kind of dumb, all we 
really want to do is account precisely for the number of pages we are 
allowed to draw from the global reserve.

OK, I kind of digressed, but this all counts as explaining the details 
of what Peter and I have been up to for the last year (longer for me).  
At this point, we don't need to do the reserve accounting in the most 
absolutely perfect way possible, we just need to get something minimal 
in place to fix the current deadlock problems, then we can iteratively 
improve it.

> I agree, any kind of high-boundary leveling must be implemented in
> device itself, since block layer does not know what device is at the
> end and what it will need to process given block request.

I did not say the throttling has to be implemented in the device, only 
that we did it there because it was easiest to code that up and try it 
out (it worked).  This throttling really wants to live at a higher 
level, possibly submit_bio()...bio->endio().  Someone at OLS (James 
Bottomley?) suggested it would be better done at the request queue 
layer, but I do not immediately see why that should be.  I guess this 
is going to come down to somebody throwing out a patch for interested 
folks to poke at.  But this detail is a fine point.  The big point is 
to have _some_ throttling mechanism in place on the block IO path, 
always.

Device mapper in particular does not have any throttling itself: calling 
submit_bio on a device mapper device directly calls the device mapper 
bio dispatcher.  Default initialized block device queue do provide a 
crude form of throttling based on limiting the number of requests.  
This is insufficiently precise to do a good job in the long run, but it 
works for now because the current gaggle of low level block drivers do 
not have a lot of resource requirements and tend to behave fairly 
predictably (except for some irritating issues re very slow devices 
working in parallel with very fast devices, but... worry about that 
later).  Network block driv

Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote:
> > I see bits that worry me, e.g.:
> >
> > +   req = mempool_alloc(st->w->req_pool, GFP_NOIO);
> >
> > which seems to be callable in response to a local request, just the
> > case where NBD deadlocks.  Your mempool strategy can work reliably
> > only if you can prove that the pool allocations of the maximum
> > number of requests you can have in flight do not exceed the size of
> > the pool.  In other words, if you ever take the pool's fallback
> > path to normal allocation, you risk deadlock.
>
> mempool should be allocated to be able to catch up with maximum
> in-flight requests, in my tests I was unable to force block layer to
> put more than 31 pages in sync, but in one bio. Each request is
> essentially dealyed bio processing, so this must handle maximum
> number of in-flight bios (if they do not cover multiple nodes, if
> they do, then each node requires own request).

It depends on the characteristics of the physical and virtual block 
devices involved.  Slow block devices can produce surprising effects.  
Ddsnap still qualifies as "slow" under certain circumstances (big 
linear write immediately following a new snapshot). Before we added 
throttling we would see as many as 800,000 bios in flight.  Nice to 
know the system can actually survive this... mostly.  But memory 
deadlock is a clear and present danger under those conditions and we 
did hit it (not to mention that read latency sucked beyond belief). 

Anyway, we added a simple counting semaphore to throttle the bio traffic 
to a reasonable number and behavior became much nicer, but most 
importantly, this satisfies one of the primary requirements for 
avoiding block device memory deadlock: a strictly bounded amount of bio 
traffic in flight.  In fact, we allow some bounded number of 
non-memalloc bios *plus* however much traffic the mm wants to throw at 
us in memalloc mode, on the assumption that the mm knows what it is 
doing and imposes its own bound of in flight bios per device.   This 
needs auditing obviously, but the mm either does that or is buggy.  In 
practice, with this throttling in place we never saw more than 2,000 in 
flight no matter how hard we hit it, which is about the number we were 
aiming at.  Since we draw our reserve from the main memalloc pool, we 
can easily handle 2,000 bios in flight, even under extreme conditions.

See:
http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
down(&info->throttle_sem);

To be sure, I am not very proud of this throttling mechanism for various 
reasons, but the thing is, _any_ throttling mechanism no matter how 
sucky solves the deadlock problem.  Over time I want to move the 
throttling up into bio submission proper, or perhaps incorporate it in 
device mapper's queue function, not quite as high up the food chain.  
Only some stupid little logistical issues stopped me from doing it one 
of those ways right from the start.   I think Peter has also tried some 
things in this area.  Anyway, that part is not pressing because the 
throttling can be done in the virtual device itself as we do it, even 
if it is not very pretty there.  The point is: you have to throttle the 
bio traffic.  The alternative is to die a horrible death under 
conditions that may be rare, but _will_ hit somebody.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
Hi Mike,

On Thursday 02 August 2007 21:09, Mike Snitzer wrote:
> But NBD's synchronous nature is actually an asset when coupled with
> MD raid1 as it provides guarantees that the data has _really_ been
> mirrored remotely.

And bio completion doesn't?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
Hi Evgeniy,

Nit alert:

On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> * storage can be formed on top of remote nodes and be exported
>   simultaneously (iSCSI is peer-to-peer only, NBD requires device
>   mapper and is synchronous)

In fact, NBD has nothing to do with device mapper.  I use it as a 
physical target underneath ddraid (a device mapper plugin) just like I 
would use your DST if it proves out.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 07:53, Peter Zijlstra wrote:
> On Fri, 2007-08-03 at 17:49 +0400, Evgeniy Polyakov wrote:
> > On Fri, Aug 03, 2007 at 02:27:52PM +0200, Peter Zijlstra wrote:
> > ...my main position is to
> > allocate per socket reserve from socket's queue, and copy data
> > there from main reserve, all of which are allocated either in
> > advance (global one) or per sockoption, so that there would be no
> > fairness issues what to mark as special and what to not.
> >
> > Say we have a page per socket, each socket can assign a reserve for
> > itself from own memory, this accounts both tx and rx side. Tx is
> > not interesting, it is simple, rx has global reserve (always
> > allocated on startup or sometime way before reclaim/oom)where data
> > is originally received (including skb, shared info and whatever is
> > needed, page is just an exmaple), then it is copied into per-socket
> > reserve and reused for the next packet. Having per-socket reserve
> > allows to have progress in any situation not only in cases where
> > single action must be received/processed, and allows to be
> > completely fair for all users, but not only special sockets, thus
> > admin for example would be allowed to login, ipsec would work and
> > so on...
>
> Ah, I think I understand now. Yes this is indeed a good idea!
>
> It would be quite doable to implement this on top of that I already
> have. We would need to extend the socket with a sock_opt that would
> reserve a specified amount of data for that specific socket. And then
> on socket demux check if the socket has a non zero reserve and has
> not yet exceeded said reserve. If so, process the packet.
>
> This would also quite neatly work for -rt where we would not want
> incomming packet processing to be delayed by memory allocations.

At this point we need "anything that works" in mainline as a starting 
point.  By erring on the side of simplicity we can make this 
understandable for folks who haven't spent the last two years wallowing 
in it.  The page per socket approach is about as simple as it gets.  I 
therefore propose we save our premature optimizations for later.

It will also help our cause if we keep any new internal APIs to strictly 
what is needed to make deadlock go away.  Not a whole lot more than 
just the flag to mark a socket as part of the vm writeout path when you 
get right down to essentials.
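
For illustration only, the flag could look something like this; 
SOCK_VMIO is a placeholder name, not an existing sock flag:

    /* mark a socket as belonging to the vm writeout path */
    static inline void sk_set_vmio(struct sock *sk)
    {
            sock_set_flag(sk, SOCK_VMIO);   /* placeholder flag */
    }

    /* receive path, once normal memory is exhausted: only sockets on
     * the writeout path may dip into the emergency reserve, everything
     * else waits for the crunch to pass */
    static inline int sk_may_use_reserve(struct sock *sk)
    {
            return sock_flag(sk, SOCK_VMIO);
    }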

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-03 Thread Daniel Phillips
On Friday 03 August 2007 06:49, Evgeniy Polyakov wrote:
> ...rx has global reserve (always allocated on
> startup or sometime way before reclaim/oom)where data is originally
> received (including skb, shared info and whatever is needed, page is
> just an exmaple), then it is copied into per-socket reserve and
> reused for the next packet. Having per-socket reserve allows to have
> progress in any situation not only in cases where single action must
> be received/processed, and allows to be completely fair for all
> users, but not only special sockets, thus admin for example would be
> allowed to login, ipsec would work and so on...

And when the global reserve is entirely used up your system goes back to 
dropping vm writeout acknowledgements, not so good.  I like your 
approach, and specifically the copying idea cuts out considerable 
complexity.  But I believe the per-socket flag to mark a socket as part 
of the vm writeout path is not optional, and in this case it will be a 
better world if it is a slightly unfair world in favor of vm writeout 
traffic.

Ssh will still work fine even with vm getting priority access to the 
pool.  During memory crunches, non-vm ssh traffic may get bumped till 
after the crunch, but vm writeout is never supposed to hog the whole 
machine.  If vm writeout hogs your machine long enough to delay an ssh 
login then that is a vm bug and should be fixed at that level.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Distributed storage.

2007-08-02 Thread Daniel Phillips
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce first release of the distributed storage
> subsystem, which allows to form a storage on top of remote and local
> nodes, which in turn can be exported to another storage as a node to
> form tree-like storages.

Excellent!  This is precisely what the doctor ordered for the 
OCFS2-based distributed storage system I have been mumbling about for 
some time.  In fact the dd in ddsnap and ddraid stands for "distributed 
data".  The ddsnap/raid devices do not include an actual network 
transport, that is expected to be provided by a specialized block 
device, which up till now has been NBD.  But NBD has various 
deficiencies as you note, in addition to its tendency to deadlock when 
accessed locally.  Your new code base may be just the thing we always 
wanted.  We (zumastor et al) will take it for a drive and see if 
anything breaks.

Memory deadlock is a concern of course.  From a cursory glance through, 
it looks like this code is pretty vm-friendly and you have thought 
quite a lot about it, however I respectfully invite peterz 
(obsessive/compulsive memory deadlock hunter) to help give it a good 
going over with me.

I see bits that worry me, e.g.:

+   req = mempool_alloc(st->w->req_pool, GFP_NOIO);

which seems to be callable in response to a local request, just the case 
where NBD deadlocks.  Your mempool strategy can work reliably only if 
you can prove that the pool allocations of the maximum number of 
requests you can have in flight do not exceed the size of the pool.  In 
other words, if you ever take the pool's fallback path to normal 
allocation, you risk deadlock.
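
To spell the invariant out in toy form (numbers and names made up): progress 
must come from completions returning objects to the pool, never from the 
allocator fallback, so requests in flight have to be capped at the pool size.

#include <assert.h>

#define POOL_SIZE 32

static int in_flight;                    /* requests currently holding a pool object */

static void submit_request(void)
{
        assert(in_flight < POOL_SIZE);   /* caller must throttle before submitting */
        in_flight++;                     /* takes one object from the pool */
}

static void complete_request(void)
{
        assert(in_flight > 0);
        in_flight--;                     /* returns the object to the pool */
}

int main(void)
{
        for (int i = 0; i < POOL_SIZE; i++)
                submit_request();        /* pool exhausted: now we wait, not allocate */
        complete_request();
        submit_request();
        return 0;
}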

Anyway, if this is as grand as it seems then I would think we ought to 
factor out a common transfer core that can be used by all of NBD, 
iSCSI, ATAoE and your own kernel server, in place of the roll-yer-own 
code those things have now.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] extent mapped page cache

2007-07-12 Thread Daniel Phillips
On Tuesday 10 July 2007 14:03, Chris Mason wrote:
> This patch aims to demonstrate one way to replace buffer heads with a
> few extent trees...

Hi Chris,

Quite terse commentary on algorithms and data structures, but I suppose
that is not a problem because Jon has a whole week to reverse engineer
it for us.

What did you have in mind for subpages?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-06 Thread Daniel Phillips
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > do you think it is a bit premature to dismiss something even without
> > > ever seeing the code?
> >
> > You told me you are using a dlm for a single-node application, is there
> > anything more I need to know?
>
> I would still like to know why you consider it a "sin". On OpenVMS it is
> fast, provides a way of cleaning up...

There is something hard about handling EPIPE?

> and does not introduce single point 
> of failure as it is the case with a daemon. And if we ever want to spread
> the load between 2 boxes we easily can do it.

But you said it runs on an aging Alpha, surely you do not intend to expand it 
to two aging Alphas?  And what makes you think that socket-based 
synchronization keeps you from spreading out the load over multiple boxes?

> Why would I not want to use it?

It is not the right tool for the job from what you have told me.  You want to 
get a few bytes of information from one task to another?  Use a socket, as 
God intended.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> do you think it is a bit premature to dismiss something even without
> ever seeing the code?

You told me you are using a dlm for a single-node application, is there 
anything more I need to know?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > By the way, you said "alpha server" not "alpha servers", was that just a
> > slip? Because if you don't have a cluster then why are you using a dlm?
>
> No, it is not a slip. The application is running on just one node, so we
> do not really use "distributed" part. However we make heavy use of the
> rest of lock manager features, especially lock value blocks.

Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature 
without even having the excuse you were forced to use it.  Why don't you just 
have a daemon that sends your values over a socket?  That should be all of a 
day's coding.
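
For scale, "a day's coding" means roughly the following shape - a purely 
hypothetical sketch, nothing to do with the real application, passing a small 
value block between two processes over a Unix socket:

#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int sv[2];
        char lvb[16] = "sequence 42";    /* a small value block, VMS-style */
        char copy[16] = "";

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                return 1;
        if (fork() == 0) {               /* child: the "daemon" side */
                read(sv[1], copy, sizeof copy);
                printf("daemon got: %s\n", copy);
                _exit(0);
        }
        write(sv[0], lvb, sizeof lvb);   /* client publishes its value */
        wait(NULL);
        return 0;
}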

Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. 
But you nicely supported my claim that most who think they should be using a 
dlm, really shouldn't.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > > > The only current users of dlms are cluster filesystems.  There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear.  The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > about moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch?  That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.

I did not say "potential", I said there are zero dlm applications at the 
moment.  Nobody has picked up the prototype (g)dlm api, used it in an 
application and said "gee this works great, look what it does".

I also claim that most developers who think that using a dlm for application 
synchronization would be really cool are probably wrong.  Use sockets for 
synchronization exactly as you would for a single-node, multi-tasking 
application and you will end up with less code, more obviously correct code 
that is probably more efficient, and... you get an optimal single-node 
version for free.

And I also claim that there is precious little reason to have a full-featured 
dlm in-kernel.  Being in-kernel has no benefit for a userspace application.  
But being in-kernel does add kernel bloat, because there will be extra 
features lathered on that are not needed by the only in-kernel user, the 
cluster filesystem.

In the case of your port, you'd be better off hacking up a userspace library 
to provide OpenVMS dlm semantics exactly, not almost.

By the way, you said "alpha server" not "alpha servers", was that just a slip?  
Because if you don't have a cluster then why are you using a dlm?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > > > The only current users of dlms are cluster filesystems.  There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear.  The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> about moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.

What stops you from trying it with the patch?  That kind of feedback would be 
worth way more than $0.02.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The only current users of dlms are cluster filesystems.  There are zero
> > users of the userspace dlm api.
>
> That is incorrect...

Application users Lars, sorry if I did not make that clear.  The issue is 
whether we need to export an all-singing-all-dancing dlm api from kernel to 
userspace today, or whether we can afford to take the necessary time to get 
it right while application writers take their time to have a good think about 
whether they even need it.

> ...and you're contradicting yourself here:

How so?  Above talks about dlm, below talks about cluster membership.

> > What does have to be resolved is a common API for node management.  It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management.  Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-05 Thread Daniel Phillips
On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <[EMAIL PROTECTED]> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <[EMAIL PROTECTED]> wrote:
> > > >  We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface?   ie: how many syscalls would it take?
> >
> > Four functions:
> >   create_lockspace()
> >   release_lockspace()
> >   lock()
> >   unlock()
>
> Neat.  I'd be inclined to make them syscalls then.  I don't suppose anyone
> is likely to object if we reserve those slots.

Better take a look at the actual parameter lists to those calls before jumping 
to conclusions...
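
To make the point concrete, a VMS-style lock request carries roughly this 
much baggage.  This is an illustrative shape only, not the actual prototype 
of either in-kernel dlm:

#include <stddef.h>

/* Illustrative only - the general shape of a VMS-style lock request, to show
 * why lock() is not a four-argument syscall. */
typedef void (*ast_fn)(void *astarg);
typedef void (*bast_fn)(void *astarg, int blocked_mode);

int sketch_dlm_lock(void *lockspace,              /* from create_lockspace()     */
                    int requested_mode,           /* NL/CR/CW/PR/PW/EX           */
                    void *lksb,                   /* status block incl. the lvb  */
                    unsigned int flags,           /* convert, noqueue, ...       */
                    const void *resource_name, size_t name_len,
                    unsigned int parent_lock_id,
                    ast_fn completion_ast, void *astarg,
                    bast_fn blocking_ast);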

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-04 Thread Daniel Phillips
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.  That's why I asked (thus
> far unsuccessfully):
>
>Are you saying that the posix-file lookalike interface provides
>access to part of the functionality, but there are other APIs which are
>used to access the rest of the functionality?  If so, what is that
>interface, and why cannot that interface offer access to 100% of the
>functionality, thus making the posix-file tricks unnecessary?

There is no such interface at the moment, nor is one needed in the immediate 
future.  Let's look at the arguments for exporting a dlm to userspace:

  1) Since we already have a dlm in kernel, why not just export that and save
 100K of userspace library?  Answer: because we don't want userspace-only
 dlm features bulking up the kernel.  Answer #2: the extra syscalls and
 interface baggage serve no useful purpose.

  2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
 Answer: only support tools need to do that.  A cut-down locking api is
 entirely appropriate for this.

  3) But the kernel dlm is the only one we have!  Answer: easily fixed, a
 simple matter of coding.  But please bear in mind that dlm-style
 synchronization is probably a bad idea for most cluster applications,
 particularly ones that already do their synchronization via sockets.

In other words, exporting the full dlm api is a red herring.  It has nothing 
to do with getting cluster filesystems up and running.  It is really just 
marketing: it sounds like a great thing for userspace to get a dlm "for 
free", but it isn't free, it contributes to kernel bloat and it isn't even 
the most efficient way to do it.

If after considering that, we _still_ want to export a dlm api from kernel, 
then can we please take the necessary time and get it right?  The full api 
requires not only syscall-style elements, but asynchronous events as well, 
similar to aio.  I do not think anybody has a good answer to this today, nor 
do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC?  Is the RAC team 
waiting with bated breath to adopt your kernel-based dlm?  If not, why not?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

Now, I see the ocfs2 guys are all ready to back down on this one, but I will 
at least argue weakly in favor.

Sick is a nice word for it, but it is actually not that far off.  Normally, 
this fs will acquire a lock whenever the user creates a virtual file and the 
create will block until the global lock arrives.  With O_NONBLOCK, it will 
return, erm... ETXTBSY (!) immediately.  Is that not what O_NONBLOCK is 
supposed to accomplish?
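
In userspace terms the trylock case looks something like this - the mount 
point and lock name are invented, and the errno is the ETXTBSY mentioned 
above:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* O_RDWR asks for an exclusive lock, O_NONBLOCK makes it a trylock. */
        int fd = open("/dlm/mydomain/mylock", O_RDWR | O_NONBLOCK);
        if (fd < 0 && errno == ETXTBSY) {
                printf("lock is busy, try again later\n");
                return 1;
        }
        /* ... lock held while the file is open ... */
        if (fd >= 0)
                close(fd);       /* close(2) drops the lock */
        return 0;
}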

> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Way too cute.  Oh well, whatever.

The explicit way is syscalls or a set of ioctls, which he already has the 
makings of.  If there is going to be a userspace api, I would hope it looks 
more like the contents of userdlm.c than the traditional Vaxcluster API, 
which sucks beyond belief.

Another explicit way is to do it with a whole set of virtual attributes 
instead of just a single file trying to capture the whole model.  That is 
really unappealing, but I am afraid that is exactly what a whole lot of 
sysfs/configfs usage is going to end up looking like.

But more to the point: we have no urgent need for a userspace dlm api at the 
moment.  Nothing will break if we just put that issue off for a few months, 
quite the contrary.

If the only user is their tools I would say let it go ahead and be cute, even 
sickeningly so.  It is not supposed to be a general dlm api, at least that is 
my understanding.  It is just supposed to be an interface for their tools.  
Of course it would help to know exactly how those tools use it.  Too sleepy 
to find out tonight...

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs.  It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods.  You already have nearly all the right methods in nearly all the
> > right places.
>
>  configfs, like sysfs, does not support ->open() or ->release()
> callbacks.

struct configfs_item_operations {
        void (*release)(struct config_item *);
        ssize_t (*show)(struct config_item *, struct attribute *, char *);
        ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
        int (*allow_link)(struct config_item *src, struct config_item *target);
        int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
        struct config_item *(*make_item)(struct config_group *group, const char *name);
        struct config_group *(*make_group)(struct config_group *group, const char *name);
        int (*commit_item)(struct config_item *item);
        void (*drop_item)(struct config_group *group, struct config_item *item);
};

You do have ->release and ->make_item/group.

If I may hand you a more substantive argument: you don't support user-driven
creation of files in configfs, only directories.  Dlmfs supports user-created
files.  But you know, there isn't actually a good reason not to support
user-created files in configfs, as dlmfs demonstrates.

Anyway, goodnight.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Sunday 04 September 2005 00:30, Joel Becker wrote:
> You asked why dlmfs can't go into sysfs, and I responded.

And you got me!  In the heat of the moment I overlooked the fact that you and 
Greg haven't agreed to the merge yet ;-)

Clearly, I ought to have asked why dlmfs can't be done by configfs.  It is the 
same paradigm: drive the kernel logic from user-initiated vfs methods.  You 
already have nearly all the right methods in nearly all the right places.

Regards,

Daniel




-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Saturday 03 September 2005 23:06, Joel Becker wrote:
>  dlmfs is *tiny*.  The VFS interface is less than his claimed 500
> lines of savings.

It is 640 lines.

> The few VFS callbacks do nothing but call DLM 
> functions.  You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
>  In addition, sysfs cannot support the dlmfs model.  In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock.  sysfs doesn't support
> mkdir(2) or mknod(2) at all.

I said "configfs" in the email to which you are replying.

>  More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace.  O_RDONLY acquires a shared read lock (PR
> in VMS parlance).  O_RDWR gets an exclusive lock (X).  O_NONBLOCK is a
> trylock.  Here, dlmfs is using the VFS for complete lifetiming.  A lock
> is released via close(2).  If a process dies, close(2) happens.  In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
>  sysfs does not allow hooking into ->open() or ->release().  So
> this model, and the inherent lifetiming that comes with it, cannot be 
> used.

Configfs has a per-item release method.  Configfs has a group open method.  
What is it that configfs can't do, or can't be made to do trivially?

> If dlmfs was changed to use a less intuitive model that fits 
> sysfs, all the handling of lifetimes and cleanup would have to be added.

The model you came up with for dlmfs is beyond cute, it's downright clever.  
Why mar that achievement by then failing to capitalize on the framework you 
already have in configfs?

By the way, do you agree that dlmfs is too inefficient to be an effective way 
of exporting your dlm api to user space, except for slow-path applications 
like you have here?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Linux-cluster] Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.

You create a dlm domain when a directory is created.  You create a lock 
resource when a file of that name is opened.  You lock the resource when the 
file is opened.  You access the lvb by reading and writing the file.  Why 
doesn't that fit the configfs-nee-sysfs model?  If it does, the payoff will be 
about
500 lines saved.
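
Spelled out as the plain file operations it maps onto (domain and resource 
names invented, error handling omitted):

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        char lvb[16] = "";

        mkdir("/dlm/mydomain", 0755);                     /* create the domain   */
        int fd = open("/dlm/mydomain/resource1",          /* take exclusive lock */
                      O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                return 1;
        read(fd, lvb, sizeof lvb);                        /* fetch the lvb       */
        lvb[0]++;                                         /* ... use it ...      */
        pwrite(fd, lvb, sizeof lvb, 0);                   /* publish the new lvb */
        close(fd);                                        /* drop the lock       */
        return 0;
}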

This little dlm fs is very slick, but grossly inefficient.  Maybe efficiency 
doesn't matter here since it is just your slow-path userspace tools taking 
these locks.  Please do not even think of proposing this as a way to export a 
kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it.  You have factored the dlm 
calls far more attractively than the bad old bazillion-parameter Vaxcluster 
legacy.  You are almost in system call zone there.  (But note my earlier 
comment on dlms in general: until there are dlm-based applications, merging a 
general-purpose dlm API is pointless and has nothing to do with getting your 
filesystem merged.)

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-03 Thread Daniel Phillips
On Saturday 03 September 2005 06:35, David Teigland wrote:
> Just a new version, not a big difference.  The ondisk format changed a
> little making it incompatible with the previous versions.  We'd been
> holding out on the format change for a long time and thought now would be
> a sensible time to finally do it.

What exactly was the format change, and for what purpose?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-02 Thread Daniel Phillips
On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...

Dumb question, why can't you use sysfs for this instead of rolling your own?

Side note: you seem to have deleted all the 2.6.12-rc4 patches.  Perhaps you 
forgot that there are dozens of lkml archives pointing at them?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-02 Thread Daniel Phillips
On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.

The only current users of dlms are cluster filesystems.  There are zero users 
of the userspace dlm api.  Therefore, the (g)dlm userspace interface actually 
has nothing to do with the needs of gfs.  It should be taken out of the gfs 
patch and merged later, when or if user space applications emerge that need 
it.  Maybe in the meantime it will be possible to come up with a userspace 
dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is because 
it supposedly cuts down on userspace-kernel communication with the cluster 
filesystems.  Then why should a userspace application bother with an 
awkward interface to an in-kernel dlm?  This is obviously suboptimal.  Why 
not have a userspace dlm for userspace apps, if indeed there are any 
userspace apps that would need to use dlm-style synchronization instead of 
more typical socket-based synchronization, or Posix locking, which is already 
exposed via a standard api?
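
The standard api in question is plain fcntl(2) record locking, which a 
cluster filesystem can make cluster-wide without inventing any new userspace 
interface.  A sketch:

#include <fcntl.h>
#include <unistd.h>

/* Take an exclusive advisory lock on the whole file, blocking until granted. */
int lock_whole_file(int fd)
{
        struct flock fl = {
                .l_type   = F_WRLCK,     /* exclusive */
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,           /* 0 = to end of file */
        };
        return fcntl(fd, F_SETLKW, &fl);
}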

There is actually nothing wrong with having multiple, completely different 
dlms active at the same time.  There is no urgent need to merge them into the 
one true dlm.  It would be a lot better to let them evolve separately and 
pick the winner a year or two from now.  Just think of the dlm as part of the 
cfs until then.

What does have to be resolved is a common API for node management.  It is not 
just cluster filesystems and their lock managers that have to interface to 
node management.  Below the filesystem layer, cluster block devices and 
cluster volume management need to be coordinated by the same system, and 
above the filesystem layer, applications also need to be hooked into it.  
This work is, in a word, incomplete.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ia_attr_flags - time to die

2005-09-02 Thread Daniel Phillips
On Friday 02 September 2005 15:41, Miklos Szeredi wrote:
> Already dead ;)
>
> 2.6.13-mm1: remove-ia_attr_flags.patch
>
> Miklos

Wow, the pace of Linux development really is picking up.  Now patches are 
applied before I even send them!

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ia_attr_flags - time to die

2005-09-02 Thread Daniel Phillips
Struct iattr is not involved any more in such things as NOATIME inode flags.
There are no in-tree users of ia_attr_flags.

Signed-off-by: Daniel Phillips <[EMAIL PROTECTED]>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h 2.6.13-rc5-mm1/fs/hostfs/hostfs.h
--- 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h 2005-08-09 18:23:11.0 -0400
+++ 2.6.13-rc5-mm1/fs/hostfs/hostfs.h   2005-09-01 17:54:40.0 -0400
@@ -49,7 +49,6 @@ struct hostfs_iattr {
struct timespec ia_atime;
struct timespec ia_mtime;
struct timespec ia_ctime;
-   unsigned intia_attr_flags;
 };
 
 extern int stat_file(const char *path, unsigned long long *inode_out,
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/fs.h 2.6.13-rc5-mm1/include/linux/fs.h
--- 2.6.13-rc5-mm1.clean/include/linux/fs.h 2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/fs.h   2005-09-01 18:27:42.0 -0400
@@ -282,19 +282,9 @@ struct iattr {
struct timespec ia_atime;
struct timespec ia_mtime;
struct timespec ia_ctime;
-   unsigned intia_attr_flags;
 };
 
 /*
- * This is the inode attributes flag definitions
- */
-#define ATTR_FLAG_SYNCRONOUS   1   /* Syncronous write */
-#define ATTR_FLAG_NOATIME  2   /* Don't update atime */
-#define ATTR_FLAG_APPEND   4   /* Append-only file */
-#define ATTR_FLAG_IMMUTABLE8   /* Immutable file */
-#define ATTR_FLAG_NODIRATIME   16  /* Don't update atime for directory */
-
-/*
  * Includes for diskquotas.
  */
 #include <linux/quota.h>
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-01 Thread Daniel Phillips
On Thursday 01 September 2005 06:46, David Teigland wrote:
> I'd like to get a list of specific things remaining for merging.

Where are the benchmarks and stability analysis?  How many hours does it 
survive cerberus running on all nodes simultaneously?  Where are the 
testimonials from users?  How long has there been a gfs2 filesystem?  Note 
that Reiser4 is still not in mainline a year after it was first offered; why 
do you think gfs2 should be in mainline after one month?

So far, all catches are surface things like bogus spinlocks.  Substantive 
issues have not even begun to be addressed.  Patience please, this is going 
to take a while.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: GFS, what's remaining

2005-09-01 Thread Daniel Phillips
On Thursday 01 September 2005 10:49, Alan Cox wrote:
> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> >   possibly gain (or vice versa)
> >
> > - Relative merits of the two offerings
>
> You missed the important one - people actively use it and have been for
> some years. Same reason with have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.

I thought that gfs2 just appeared last month.  Or is it really still just gfs?  
If there are substantive changes from gfs to gfs2 then obviously they have 
had practically zero testing, let alone posted benchmarks, testimonials, etc.  
If it is really still just gfs then the silly-rename should be undone.

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Announcing Journaled File System (JFS) release 1.0.0 available

2001-06-28 Thread Daniel Phillips

On Thursday 28 June 2001 16:22, Steve Best wrote:
> June 28, 2001:
>
> IBM is pleased to announce the v 1.0.0 release of the open source
> Journaled File System (JFS), a high-performance, and scalable file
> system for Linux.
>
> http://oss.software.ibm.com/jfs

Congratulations, and thanks for being so clued in about how to run your 
project.  Example: the way you provide the source - direct links to cvs, tgz 
and patches, no annoying cgi.  You guys get it, unlike some other names I 
won't mention (Sun ;-).

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-28 Thread Daniel Phillips

On Thursday 28 June 2001 03:48, Alexander Viro wrote:
> On Thu, 28 Jun 2001, Daniel Phillips wrote:
> > > Advantages: no extra memory use, no indirection, no memory allocation
> > > overhead.
> >
> > An advantage you overlooked: clean up fs.h so it doesn't have to include
> > every filesystem in the known universe.
> >
> > All of this also applies to struct super_block.
>
> ... in both schemes. Think for a second - you don't need to include
> anything into fs.h to have
>   void *i_private;/* pointer to fs-private data */
> in struct inode. IOW, that's not an advantage at all - they are not
> different in that respect.

Yes indeed, both are huge improvements over the union and both require 
changes to every filesystem.  Given that equivalence I'd prefer the one that 
runs faster and uses less memory.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-28 Thread Daniel Phillips

On Thursday 28 June 2001 07:39, Alexander Viro wrote:
> BTW, cost of extra dereferncing is trivial - when we access ext2-specific
> part of inode we usually
>   a) do it more than once in a given function
>   b) access a lot of stuff outside of struct inode.

It's not the only cost:

  - The memory manager overhead is doubled, inode slab fragmentation is
doubled

  - We use an average of half a cache line more per inode, depending on inode 
size

If we choose not to align the inode objects in slab then we waste an extra 
full cache line (half wasted at the end of the generic inode and half at the 
beginning of the specific part).

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [PATCH][RFC] inode->u.nfs_i allocated separately

2001-06-27 Thread Daniel Phillips

On Wednesday 27 June 2001 23:22, Linus Torvalds wrote:
> we could _easily_ have the setup
>
>   struct ext2_inode {
>   struct inode inode; /* Generic fields */
>   specific-ext2 struct;   /* specific fields */
>   };
>
> and then when ext2 calls down to the generic VFS layer it just passes
>
>   &ext2_inode->inode
>
> down, and when it gets a "struct inode *" it uses "inode_to_ext2()" to
> convert it to an ext2 inode pointer.
>
> This is what the "struct list_head" thing does, and it works remarkably
> well. It allows for embedding a list (or a hundred) into any object. The
> above would take the same approach, and allow embedding an inode (and
> maybe several) into any object.
>
> Advantages: no extra memory use, no indirection, no memory allocation
> overhead.

An advantage you overlooked: clean up fs.h so it doesn't have to include 
every filesystem in the known universe.

All of this also applies to struct super_block.

> Disadvantages: ??
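
For anyone who has not seen the trick: recovering the containing structure 
from the embedded inode is just pointer arithmetic, as in this simplified 
stand-alone sketch:

#include <stddef.h>
#include <stdio.h>

struct inode    { long i_ino; };                 /* stand-in for the VFS inode */
struct my_inode {
        struct inode vfs_inode;                  /* generic fields, embedded   */
        int private_stuff;                       /* fs-specific fields         */
};

/* container_of-style conversion: inner pointer back to the outer structure. */
#define inode_to_my(ip) \
        ((struct my_inode *)((char *)(ip) - offsetof(struct my_inode, vfs_inode)))

int main(void)
{
        struct my_inode mi = { .vfs_inode = { .i_ino = 7 }, .private_stuff = 42 };
        struct inode *ip = &mi.vfs_inode;        /* what the VFS passes around  */
        printf("%d\n", inode_to_my(ip)->private_stuff);   /* prints 42          */
        return 0;
}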

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: inode->i_blksize and inode->i_blocks

2001-06-04 Thread Daniel Phillips

On Monday 04 June 2001 22:41, Bryan Henderson wrote:
> I'm more confused than ever about the i_blocks (filesize divided by
> 512) field.  I note that in both the inode and the stat() result, the
> filesize in bytes exists, and presumably always has.  So why would
> anyone ever want to know separately how many 512 byte units of data
> are in the file?

Files can have holes.

> FS code appears to explicitly allow for a
> filesystem driver to denominate i_blocks in other units, but any
> other unit would appear to break the stat () interface.

This can be fixed with a multiply.
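
A quick way to see both points at once: st_size reports the length, 
st_blocks counts the 512-byte units actually allocated, and a hole makes 
them disagree.  Sketch only, error handling omitted:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        int fd = open("sparse.tmp", O_CREAT | O_RDWR | O_TRUNC, 0644);
        struct stat st;

        pwrite(fd, "x", 1, 1024 * 1024);     /* one byte at offset 1MB -> big hole */
        fstat(fd, &st);
        printf("size %lld, allocated %lld\n",
               (long long)st.st_size, (long long)st.st_blocks * 512);
        close(fd);
        unlink("sparse.tmp");
        return 0;
}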

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: inode->i_blksize and inode->i_blocks

2001-06-04 Thread Daniel Phillips

On Monday 04 June 2001 19:02, Bryan Henderson wrote:
> >Are there any deeper reasons,
> >why
> >a) inode->i_blksize is set to PAGESIZE eg. 4096 independent of the
> > actual block size of the file system?
>
> Well, why not?  The field tells what is a good chunk size to read or
> write for maximum performance.  If the I/O is done in PAGESIZE cache
> page units, then that's the best number to use.

But we already know that from PAGE_SIZE; this seems like a complete 
waste.

> I suppose in the very first unix filesystems, the field may have
> meant filesystem block size, which was identical to the highest
> performing read/write size, and that may account for its name.
>
> >b) the number of blocks is counted in 512 Bytes and not in the
> > actual blocksize of the filesystem?
>
> I can't see how the number of actual blocks would be helpful,
> especially since as described above, we don't even know how big they
> are.  We don't even know that they're fixed size or that a concept of
> a block even exists.

Counting in 512 byte units was just a mistake.  The correct units to 
count in are sb->s_blocksize.  It's a little tricky to change that now 
but it still may happen.

> >(is this for historical reasons??)
>
> That would be my guess.  Though I can't think of any particular
> programs that would measure a file by multiplying this number by 512.

The original NEC floppy disk controllers used 512 byte sectors.

> In any case, the inode fields are defined as they are because they
> implement a standard stat() interface that includes these same
> numbers.

We can fix things up in cp_old/new_stat if we want.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-27 Thread Daniel Phillips

On Sunday 27 May 2001 15:32, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > It won't, the open for "." is handled in the VFS, not the
> > filesystem - it will open the directory.  (Without needing to be
> > told it's a directory via O_DIRECTORY.)  If you do open("magicdev")
> > you'll get the device, because that's handled by magicdevfs.
>
> You really mean that "magicdev" is a directory and:
>
>   open("magicdev/.", O_RDONLY);
>   open("magicdev", O_RDONLY);
>
> would both succeed but open different objects?

Yes, and:

open("magicdev/.", O_RDONLY | O_DIRECTORY);
open("magicdev", O_RDONLY | O_DIRECTORY);

will both succeed and open the same object.

> > I'm not claiming there isn't breakage somewhere,
>
> you break UNIX fundamentals.  But I'm quite relieved now because I'm
> pretty sure that something like that will never go into the kernel.

OK, I'll take that as "I couldn't find a piece of code that breaks, so 
it's on to the legal issues".

SUS doesn't seem to have a lot to say about this.  The nearest thing to 
a ruling I found was "The special filename dot refers to the directory 
specified by its predecessor".  Which is not the same thing as:

   open("foo", O_RDONLY) == open ("foo/.", O_RDONLY)

I don't know about POSIX (I don't have it: a pox on standards 
organizations that don't make their standards freely available) but SUS 
doesn't seem to forbid this.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-25 Thread Daniel Phillips

On Thursday 24 May 2001 22:59, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > > > Readdir fills in a directory type, so ls sees it as a directory
> > > > and does the right thing.  On the other hand, we know we're on
> > > > a device filesystem so we will next open the name as a regular
> > > > file, and find ISCHR or ISBLK: good.
> > >
> > > ??? The kernel may know it, but the app?  Or do you really want
> > > to give different stat data on stat(2) and fstat(2)?  These flags
> > > are currently used by archive/backup prgs.  It's a hint that
> > > these files are not regular files and shouldn't be opened for
> > > reading. Having a 'd' would mean that they would really try to
> > > enter the directory and save it's contents.  Don't know what
> > > happens in this case to your "special" files ;-)
> >
> > I guess that's much like the question 'what happens in proc?'.
>
> And that's already bad enough.  Most of the "files" in proc should
> be fifos!  And using proc as an excuse to introduce another set of
> magic dirs?  No, thanks.

Wait a second, I thought proc was here to stay.  Wait another
second, device nodes are already magic.  Magic is magic, just
choose your color ;-)

This set of magic dirs is supposed to clean things up, not mess things 
up.  We already saw how the side-effects-on-open problem in ls -l goes 
away.  There's a much bigger problem I'd love to deal with: the 'no 
hierarchy can please everybody' problem.  In database terms, a hierarchy 
is an insufficiently general model for real-world problems; in other 
words, hierarchies never worked.  Tables work.  That's where I'm trying 
to go with this, so please bear with me.  This is not just a solution in 
search of a problem.

> > Correct me if I'm wrong, but what we learn from the proc example
> > is that tarring your whole source tree starting at / is not
> > something you want to do.
>
> IMHO it would be better to fix proc instead of adding more magic.  At
> the moment you have to exclude /proc.  You want to add /dev.

Well, actually no, ls -R, tar, zip, etc, work pretty well with the 
scheme I've described.

> And
> next? Exclude all $HOME/dev (in case process name spaces get added)? 
> Or make fifos magic too and add all of them to the exclude list?  But
> there's no central place for fifos.  So lets add more magic :-(

No, no, no, agreed and sometimes magic is good.  It's not deep magic.  
The only new thing here is the interpretation of the O_DIRECTORY flag, 
or rather, the lack of it.

> > What *won't* happen is, you won't get side effects from opening
> > your serial ports (you'd have to open them without O_DIRECTORY
> > to get that) so that seems like a little step forward.
>
> As already said: depending on O_DIRECTORY breaks POSIX compliance
> and that alone should kill this idea...

Thanks, two good points:
  - libc5 will get confused when doing ls in /magicdev
  - POSIX specifically forbids this

I'll put this away until I've specifically dug into both of them.  OK, 
over and out, thanks for your commentary.

/me peruses man pages

Oops, oh wait, there's already another open point: your breakage 
examples both rely on opening ".".  You're right, "." should always be 
a directory and I believe that's enforced by the VFS.  So we don't have 
an example of breakage yet.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-25 Thread Daniel Phillips

On Friday 25 May 2001 00:00, Hans Reiser wrote:
> Daniel Phillips wrote:
> > I suppose I'm just reiterating the obvious, but we should
> > eventually have a generic filesystem transaction API at the VFS
> > level, once we have enough data points to know what the One True
> > API should be.
>
> Daniel, implementing transactions is not a trivial thing as you
> probably know. It requires that you resolve such issues as, what
> happens if the user forgets to close the transaction, issues of
> lock/transaction duration, of transaction batching, of levels of
> isolation, of concurrent transactions modifying global fs metadata
> and some but not all of those concurrent transactions receiving a
> rollback, and of permissions relating to keeping transactions open. 
> I would encourage you to participate in the reiser4 design discussion
> we will be having over the next 6 months, and give us your opinions. 
> Josh will be leading that design effort for the ReiserFS team.

Graciously accepted.  Coming up with something sensible in a mere 6 
months would be a minor miracle. ;-)

- what happens if the user forgets to close the transaction?

   I plan to set a checkpoint there (because the transaction got
   too big) and log the fact that it's open.

- issues of lock/transaction duration

   Once again relying on checkpoints, when the transaction gets
   uncomfortably big for cache, set a checkpoint.  I haven't thought
   about locks

- transaction batching

   1) Explicit transaction batch close 2) Cache gets past a certain 
   fullness.  In both cases, no new transactions are allowed to start
   and as soon as all current ones are closed we close the batch.

- of levels of isolation
- concurrent transactions modifying global fs metadata
   and some but not all of those concurrent transactions receiving a
   rollback

   First I was going to write 'huh?' here, then I realized you're   
   talking about real database ops, not just filesystem ops.  I had
   in mind something more modest: transactions are 'mv', 'read/write'
   (if the 'atomic read/write' is set), other filesystem operations I've
   forgotten, and anything the user puts between open_xact and  
   close_xact.  You are raising the ante a little ;-)

   In my case (Tux2) I could do an efficient rollback to the beginning
   of the batch (phase), but then I would have had to keep an in-memory
   log of the transactions for selective replay.  With a journal log you
   can obviously do the same thing, perhaps more efficiently if your
   journal design supports undo/redo.

   The above is a pure flight of fancy, we won't be seeing anything
   so fancy as an API across filesystems.

- permissions relating to keeping transactions open. 
   We can see this one in the light of a simple filesystem  
   transaction: what happens if we are in the middle of a mv and
   someone changes the permissions?  Go with the starting or
   ending permissions?

Well, the database side of this is really interesting, but to get 
something generic across filesystems, the scope pretty well has to be 
limited to journal-type transactions, don't you think?
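
To pin down what "journal-type transactions" might mean at the API level, 
something as small as this would do - entirely hypothetical, not a proposal:

/* Hypothetical shape for a minimal cross-filesystem transaction API, just to
 * bound the discussion: group a few namespace operations so they land on disk
 * atomically, nothing database-like. */

typedef long xact_t;

xact_t open_xact(int fd_of_fs_root);     /* start a transaction on this fs   */
int    close_xact(xact_t x);             /* commit: all ops land or none do  */
int    abort_xact(xact_t x);             /* roll back everything since open  */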

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-25 Thread Daniel Phillips

On Thursday 24 May 2001 23:26, Alexander Viro wrote:
> On Thu, 24 May 2001, Edgar Toernig wrote:
> > > What *won't* happen is, you won't get side effects from opening
> > > your serial ports (you'd have to open them without O_DIRECTORY
> > > to get that) so that seems like a little step forward.
> >
> > As already said: depending on O_DIRECTORY breaks POSIX compliance
> > and that alone should kill this idea...
>
> What really kills that idea is the fact that you can trick
> applications into opening your serial ports _without_ O_DIRECTORY.

Err, I thought we already had that problem, but worse: an ordinary
ls -l will do it.  This way, we harmlessly list the device's properties 
instead.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-24 Thread Daniel Phillips

On Thursday 24 May 2001 16:39, Oliver Xymoron wrote:
> On Thu, 24 May 2001, Marko Kreen wrote:
> > On Thu, May 24, 2001 at 02:23:27AM +0200, Edgar Toernig wrote:
> > > Daniel Phillips wrote:
> > > > > > It's going to be marked 'd', it's a directory, not a file.
> > > > >
> > > > > Aha.  So you lose the S_ISCHR/BLK attribute.
> > > >
> > > > Readdir fills in a directory type, so ls sees it as a directory
> > > > and does the right thing.  On the other hand, we know we're on
> > > > a device filesystem so we will next open the name as a regular
> > > > file, and find ISCHR or ISBLK: good.
> > >
> > > ??? The kernel may know it, but the app?  Or do you really want
> > > to give different stat data on stat(2) and fstat(2)?  These flags
> > > are currently used by archive/backup prgs.  It's a hint that
> > > these files are not regular files and shouldn't be opened for
> > > reading. Having a 'd' would mean that they would really try to
> > > enter the directory and save it's contents.  Don't know what
> > > happens in this case to your "special" files ;-)
> >
> > IMHO the CHR/BLK is not needed.  Think of /proc.  In the future,
> > the backup tools will be told to ignore /dev, that's all.
>
> The /dev dir should not be special. At least not to the kernel. I
> have device files in places other than /dev, and you probably do too
> (hint: anonymous FTP).

True.  If we're using a special filesystem for devices we can express
the desired restriction in terms of 'don't back up this filesystem type'
or 'don't go outside the root filesystem'.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-24 Thread Daniel Phillips

On Tuesday 22 May 2001 22:10, Andreas Dilger wrote:
> Peter Braam writes:
> > File system journal recovery can corrupt a snapshot, because it
> > copies data that needs to be preserved in a snapshot. During
> > journal replay such data may be copied again, but the source can
> > have new data already.
>
> The way it is implemented in reiserfs is to wait for existing
> transactions to complete, entirely flush the journal and block all
> new transactions from starting.  Stephen implemented a journal flush
> API to do this for ext3, but the hooks to call it from LVM are not in
> place yet.  This way the journal is totally empty at the time the
> snapshot is done, so the read-only copy does not need to do journal
> recovery, so no problems can arise.

I suppose I'm just reiterating the obvious, but we should eventually
have a generic filesystem transaction API at the VFS level, once we
have enough data points to know what the One True API should be.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-24 Thread Daniel Phillips

On Thursday 24 May 2001 02:23, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > > > It's going to be marked 'd', it's a directory, not a file.
> > >
> > > Aha.  So you lose the S_ISCHR/BLK attribute.
> >
> > Readdir fills in a directory type, so ls sees it as a directory and
> > does the right thing.  On the other hand, we know we're on a device
> > filesystem so we will next open the name as a regular file, and
> > find ISCHR or ISBLK: good.
>
> ??? The kernel may know it, but the app?  Or do you really want to
> give different stat data on stat(2) and fstat(2)?  These flags are
> currently used by archive/backup prgs.  It's a hint that these files
> are not regular files and shouldn't be opened for reading.
> Having a 'd' would mean that they would really try to enter the
> directory and save it's contents.  Don't know what happens in this
> case to your "special" files ;-)

I guess that's much like the question 'what happens in proc?'.

Recursively entering the device directory is ok as long as everything
inside it is ok.  I tried zipping /proc/bus -r and what I got is what I'd
expect if I'd cat'ed every non-directory entry.  Maybe it's not right - 
zipping /proc/kcore is kind of
interesting.  Regardless, we are no worse than proc here.  In fact,
since we don't anticipate putting an elephant like kcore in as a
device property, we're a little nicer to get along with.

Correct me if I'm wrong, but what we learn from the proc example
is that tarring your whole source tree starting at / is not something
you want to do.  Just extend that idea to /dev - however, if you do
it, it will produce pretty reasonable results.

What *won't* happen is, you won't get side effects from opening
your serial ports (you'd have to open them without O_DIRECTORY
to get that) so that seems like a little step forward.

I'm still thinking about some of your other comments.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-23 Thread Daniel Phillips

On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> IMO the whole idea of arguments following the device name is junk
> (incl a "/ctrl").

You know I didn't suggest that, right?  I find it pretty strange too, but
I'm listening to hear the technical arguments.

> Just think about the implications of the original "/dev/ttyS0/19200"
> suggestion.  It sounds nice and tempting.  But which programs will
> benefit.  Which gets confused.  What will be cleaned up.  After some
> thoughts you'll find out that it's useless ;-)

You know I didn't suggest that either, right?  But I'm with you, I don't
like it at'all, not least because we might change baud rate on the fly.

> And with special "ctrl" devices (ie /dev/ttyS0 and /dev/ttyS0ctrl):
> This _may_ work for some kind of devices.  But serial ports are one
> example where it simply will _not_.  It requires that you know the
> name of the device.  For ttys this is often not the case.
> Even if you manage to get some name for stdin for example - now I 
> should simply attach a "ctrl" to that name to get a control channel???
> At least dangerous.  If I'm lucky I only get an EPERM...

Again, I'll provide a sympathetic ear, but it wasn't my suggestion.

> Ciao, ET.

And you were referring to who?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-23 Thread Daniel Phillips

On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> Daniel Phillips wrote:
> > On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > > What I'd like to see:
> > > > >
> > > > > - An interface for registering an array of related devices
> > > > > (almost always two: raw and ctl) and their legacy device
> > > > > numbers with a single userspace callout that does whatever
> > > > > /dev/ creation needs to be done. Thus, naming and permissions
> > > > > live in user space. No "device node is also a directory"
> > > > > weirdness...
> > > >
> > > > Could you be specific about what is weird about it?
> > >
> > > *boggle*
> > >
> > >[general sense of unease]
>
> I fully agree with Oliver.  It's an abomination.

We are, or at least, I am, investigating this question purely on
technical grounds - name calling is a noop.  I'd be happy to find a
real reason why this is a bad idea but so far none has been
presented.

Don't get me wrong, the fact that people I respect have reservations
about the idea does mean something to me, but this still needs to be
investigated properly.  Now on to the technical content...

> > > I don't think it's likely to be even workable. Just consider the
> > > directory entry for a moment - is it going to be marked d or
> > > [cb]?
> >
> > It's going to be marked 'd', it's a directory, not a file.
>
> Aha.  So you lose the S_ISCHR/BLK attribute.

Readdir fills in a directory type, so ls sees it as a directory and does
the right thing.  On the other hand, we know we're on a device 
filesystem so we will next open the name as a regular file, and find
ISCHR or ISBLK: good.

The rule for this filesystem is: if you open with O_DIRECTORY then
directory operations are permitted, nothing else.  If you open without
O_DIRECTORY then directory operations are forbidden (as
usual) and normal device semantics apply.

If there is weirdness anywhere, it's right here with this rule.  The
question is: what if anything breaks?

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
>
> Directories are not allowed to be read from/written to.  The VFS may
> support it, but it's not (current) UNIX.

Here, we obey this rule: if you open it with O_DIRECTORY then you
can't read from or write to it.

> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> So with your definition, I have a fs-object that is marked as a
> directory but opening it opens a device.  Pretty nice..

No, you have to open it without O_DIRECTORY to get your device
fd handle.

> How I'm supposed to list it's contents?  open+readdir?

Nothing breaks here, ls works as it always did.

This is what ls does:

open("foobar", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl64(0x3, 0x2, 0x1, 0x2) = -1 ENOSYS (Function not implemented)
fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
brk(0x805b000)  = 0x805b000
getdents64(0x3, 0x8058270, 0x1000, 0x26) = -1 ENOSYS (Function not implemented)
getdents(3, /* 2 entries */, 2980)  = 28
getdents(3, /* 0 entries */, 2980)  = 0
close(3)= 0

Note that ls doesn't do anything as inconvenient as opening 
foobar as a normal file first, expecting that operation to fail.

> But the open has nasty side effects.
> So you have a directory that you are not allowed
> to list (because of the possible side effects) but is allowed to be
> read from/written to maybe even issue ioctls to?. 

No, you would get side effects only if you open as a regular file.
I'd agree that that sucks, but that's not what we're trying to fix
just now.

> And you call that sane???

I would hope it seems saner now, after the clarification.
Please, if you know something that actually breaks, tell me.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Daniel Phillips

On Tuesday 22 May 2001 19:49, Oliver Xymoron wrote:
> On Tue, 22 May 2001, Daniel Phillips wrote:
> > > I don't think it's likely to be even workable. Just consider the
> > > directory entry for a moment - is it going to be marked d or
> > > [cb]?
> >
> > It's going to be marked 'd', it's a directory, not a file.
>
> Are we talking about the same proposal?  The one where I can open
> /dev/dsp and /dev/dsp/ctl? But I can still do 'cat /dev/hda >
> /dev/dsp'?

We already support read/write on directories in the VFS, that's not a
problem.

> It's still a file. If it's not a file anymore, it ain't UNIX.

It's a file with the directory bit set, I believe that's UNIX.

> > > If it doesn't have the directory bit set, Midnight commander
> > > won't let me look at it, and I wouldn't blame cd or ls for
> > > complaining. If it does have the 'd' bit set, I wouldn't blame
> > > cp, tar, find, or a million other programs if they did the wrong
> > > thing. They've had 30 years to expect that files aren't
> > > directories. They're going to act weird.
> >
> > No problem, it's a directory.
> >
> > > Linus has been kicking this idea around for a couple years now
> > > and it's still a cute solution looking for a problem. It just
> > > doesn't belong in UNIX.
> >
> > Hmm, ok, do we still have any *technical* reasons?
>
> If you define *technical* to not include design, sure.

Sorry, I don't see what you mean.  Do you mean the design is
difficult?

> Oh, did I mention unnecessary, solvable in userspace?

That's exactly the point: the generic filesystem allows all the
funny-shaped stuff to be dealt with in user space.  The
filesystem itself is lovely and clean.

BTW, I didn't realize I was reinventing Linus's wheel; this just
seemed very obvious and natural to me.  So I have to believe
there's a technical obstacle somewhere.

Has anyone written code to demonstrate the idea?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Daniel Phillips

On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> On Mon, 21 May 2001, Daniel Phillips wrote:
> > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > What I'd like to see:
> > >
> > > - An interface for registering an array of related devices
> > > (almost always two: raw and ctl) and their legacy device numbers
> > > with a single userspace callout that does whatever /dev/ creation
> > > needs to be done. Thus, naming and permissions live in user
> > > space. No "device node is also a directory" weirdness...
> >
> > Could you be specific about what is weird about it?
>
> *boggle*
>
>[general sense of unease]
>
> I don't think it's likely to be even workable. Just consider the
> directory entry for a moment - is it going to be marked d or [cb]?

It's going to be marked 'd', it's a directory, not a file.

> If it doesn't have the directory bit set, Midnight commander won't
> let me look at it, and I wouldn't blame cd or ls for complaining. If it
> does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> million other programs if they did the wrong thing. They've had 30
> years to expect that files aren't directories. They're going to act
> weird.

No problem, it's a directory.

> Linus has been kicking this idea around for a couple years now and
> it's still a cute solution looking for a problem. It just doesn't
> belong in UNIX.

Hmm, ok, do we still have any *technical* reasons?

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Daniel Phillips

On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> What I'd like to see:
>
> - An interface for registering an array of related devices (almost
> always two: raw and ctl) and their legacy device numbers with a
> single userspace callout that does whatever /dev/ creation needs to
> be done. Thus, naming and permissions live in user space. No "device
> node is also a directory" weirdness...

Could you be specific about what is weird about it?

> ...which is overkill in the vast majority of cases.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace

2001-05-22 Thread Daniel Phillips

On Monday 21 May 2001 14:43, [EMAIL PROTECTED] wrote:
> How about:
>
>   # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
>   # ls /dev/mypartition
>   basesizedevicetype
>
> Generally, we shouldn't care which order the kernel enumerates
> devices in or which device number gets assigned internally.  If
> we did need to care, we'd just do:
>
>   # echo 666 >/dev/mypartition/number
>
> Only a single thing is of interest.
> What is the communication between user space and kernel
> that transports device identities?

It doesn't change; the same symbolic names still work.  What's
happening in my example is that we've gotten rid of the
can't-get-there-from-here device naming hierarchy.  It should
be clear by now that we can't capture 'physical device location'
and 'device function' in one tree.  So instead, 'physical device'
is a property of 'logical device'.  The tree is now optional.

> Note that there is user (human) / user space (programs) / kernel.
>
> This user has interesting machinery in his hands,
> but his programs have only strings (path names, fake or not)
> to give to the kernel in open() and mount() calls.
>
> Now the device path is so complicated that the user is unable to
> describe it using a path name. devfs made an attempt listing
> controller, lun, etc etc but /dev/ide/host0/bus1/target1/lun0/disc is
> not very attractive, and things only get worse.

Yes, we flatten that by making host, bus, target and lun all
properties of /proc/ide/hda.

Our mistake up to now is that we've tried to carry the logical
view and physical view of the device in one name, or equivalently,
in path+name.  Let the physical device be a property of the logical
device and we no longer have our thumb tied to our nose.

> When I go to a bookshop to buy a book, I can do so without specifying
> all of Author, Editors, Title, Publisher, Date, ISBN, nr of pages,
> ... A few items suffice. Often the Title alone will do.
>
> We want an interface where the kernel exports what it has to offer
> and the user can pick. Yes, that Zip drive - never mind the bus.
> But can distinguish - Yes, that USB Zip drive, not the one
> on the parallel port.

100% agreed.  IOW, when the device *does* move we can usually
deduce where it's moved to, so let's update the hda's bus location
automatically whenever we can (log a message!) and only bother
the user about it if it's ambiguous.  For good measure, have a
system setting that says 'on a scale of 0 to 5, this is how interested
I am in being bothered about the fact that a device seems to have
moved'.

> The five minute hack would number devices 1, 2, 3 in order of
> detection, offer the detection message in
> /devices//detectionmessage and a corresponding device node in
> /devices//devicenode. The sysadmin figures out what is what,
> makes a collection of symlinks with his favorite names, and everybody
> is happy.
>
> Until the next reboot. Or until device removal and addition.
> There must be a way to give permanence to an association
> between name and device. Symlinks into a virtual filesystem
> like /devices are not good enough. Turning the five minute
> hack into a ten minute hack we take the md5sum of the part
> of the bootmessage that is expected to be the same the next time
> we encounter this device and use that as device number.
>
> I think a system somewhat in this style could be made to work well.

Yes, we are advocating the same thing.  I didn't mention that the
device properties are supposed to be persistent, did I?  If you
accept the idea of persistent device properties then the obvious
thing to do is to match them up against the detected devices.

I didn't want to bring up the persistence angle right away because
it raises the question of where you store the persistent data for the
root device.  Until the namespace issue is resolved this is mainly
a distraction.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace

2001-05-22 Thread Daniel Phillips

On Monday 21 May 2001 10:14, Lars Marowsky-Bree wrote:
> On 2001-05-19T16:25:47,
>
>    Daniel Phillips <[EMAIL PROTECTED]> said:
> > How about:
> >
> >   # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
> >   # ls /dev/mypartition
> >   base  sizedevice  type
> >   # cat /dev/mypartition/size
> >   1048576
> >   # cat /dev/mypartition/device
> >   /dev/sda
> >   # mke2fs /dev/mypartition
>
> Ek. You want to run mke2fs on a _directory_ ?

Could you be specific about what is wrong with that?  Assuming that
this device directory lives on a special purpose filesystem?

> If anything, /dev/mypartition/realdev

Then every fstab in the world has to change, not to mention adding
verbosity to interactive commands.

--
Daniel

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace

2001-05-21 Thread Daniel Phillips

On Saturday 19 May 2001 08:23, Ben LaHaise wrote:
>  /dev/sda/offset=1024,limit=2048
> -> open a device that gives a view of sda at an
>   offset of 1KB to 2KB

Whatever we end up with, can we express it in terms of base and
size, please?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: [RFD w/info-PATCH] device arguments from lookup, partion code in userspace

2001-05-21 Thread Daniel Phillips

On Saturday 19 May 2001 13:37, Eric W. Biederman wrote:
> For creating partitions you might want to do:
>   cat 1024 2048 > /dev/sda/newpartition

How about:

  # mkpart /dev/sda /dev/mypartition -o size=1024k,type=swap
  # ls /dev/mypartition
  base  sizedevice  type
  # cat /dev/mypartition/size
  1048576
  # cat /dev/mypartition/device
  /dev/sda
  # mke2fs /dev/mypartition

The information that was specified is persistent in /dev.  We can 
rearrange our physical devices any way we want without affecting
the name we chose in /dev.  When the kernel enumerates devices
at startup, our persistent information had better match, or we will
have to take some corrective action.

Generally, we shouldn't care which order the kernel enumerates
devices in or which device number gets assigned internally.  If we
did need to care, we'd just do:

  # echo 666 >/dev/mypartition/number

setting a persistent device minor number.  The major number is
inherited via the partition's /device property.

To set the minor number back to 'don't care':

  # rm /dev/mypartition/number

By taking the physical device off the top of the food chain we gain
the flexibility to move the device from bus to bus, for example, and
only the partition's device property changes, nothing in our fstab.
It's no great leap to set things up so that not even the /device
property would need to change.

Note that we can have a hierarchy of partitions this way if
we want to, since /dev/mypartition is just another block
device.
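
To make that concrete, here is a small sketch of how a tool in the
mke2fs mould could consume this layout: read the size property as an
ordinary text file, then open the same name, without O_DIRECTORY, to
get at the block device itself.  The /dev/mypartition node and its
size property are the hypothetical names from the mkpart example
above, not an existing interface:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Properties are plain files inside the partition node. */
        unsigned long long size = 0;
        FILE *f = fopen("/dev/mypartition/size", "r");
        if (f) {
                if (fscanf(f, "%llu", &size) != 1)
                        size = 0;
                fclose(f);
        }
        printf("partition size: %llu bytes\n", size);

        /* The same name, opened without O_DIRECTORY, is the block
           device itself, ready for filesystem structures to be
           written to it. */
        int fd = open("/dev/mypartition", O_RDWR);
        if (fd >= 0) {
                /* ... mke2fs would write its metadata here ... */
                close(fd);
        }
        return 0;
}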

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: ext3 for 2.4

2001-05-18 Thread Daniel Phillips

On Friday 18 May 2001 11:10, Alexander Viro wrote:
> On Thu, 17 May 2001, Daniel Phillips wrote:
> > Well, if you look how I did the index, it works with blocks and
> > buffers while still staying entirely in the page cache.  This was
> > Stephen's suggestion, and it integrates reliably with Al's
> > page-oriented code. So I'm mixing pages and blocks together and
> > it's working pretty well.
>
> ... or, in immortal words of Hans, "Yura, run the benchmarks"...
>
> > BTW, the parts of Al's patch that I converted from pages to blocks
> > got shorter and easier to read.
>
> 
> No offense, but your code is a twisted mess to the degree
> when bugs are hard to see just because of obfuscation.

Al, I was refering to your code, not my code ;-)

> Please, find and describe an obvious bug in ext2_add_entry()
> (dx-2.4.4-6.pcache version). After deobfuscation it becomes
> immediately visible - the only reason why it doesn't stick out like a
> sore tumb is that code is a spaghettish mess.
>
> Generally, when one has to draw a flowchart to figure out
> what happens in a function and to find lifetimes of local variables
> (couple of dozens of them) it means only one thing: function is
> _crap_.
>
> And yes, flowchart is what I finally had to resort to.
> Daniel, I don't care if you consider writing hairy code as a DSW and
> frankly, I'm less than impressed by the S of D being demonstrated in
> that particular case.
> 

But Al, don't tease me, what bug?

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: ext3 for 2.4

2001-05-17 Thread Daniel Phillips

On Thursday 17 May 2001 17:53, Andrew Morton wrote:
> It's probably worth thinking about adding a fourth journalling
> mode: `journal=none'.  Roll it all up into a single codebase
> and call it ext4.

Or ext5 (= ext2 + ext3).

> It rather depends on where the buffercache ends up.  ext3 is
> a client of JBD (nee JFS).  JBD does *block* level journalling.
> Any major change at that level will take rather some adjusting
> to.

Well, if you look at how I did the index, it works with blocks and buffers
while still staying entirely in the page cache.  This was Stephen's 
suggestion, and it integrates reliably with Al's page-oriented code.  
So I'm mixing pages and blocks together and it's working pretty well.  
BTW, the parts of Al's patch that I converted from pages to blocks got 
shorter and easier to read.

I'm now working on some code to handle non-data blocks in a similar 
way, so if this works out it could make the conversion an awful lot 
less painful for you.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]



Re: ext3 for 2.4

2001-05-17 Thread Daniel Phillips

On Thursday 17 May 2001 13:20, Andrew Morton wrote:
> Summary: ext3 works, page_launder() doesn't :)
>
> The tree is based on the porting work which Peter Braam did.  It's
> in cvs in Jeff Garzik's home on sourceforge.  Info on CVS is at
> http://sourceforge.net/cvs/?group_id=3242 - the module name
> is `ext3'.  There's a README there which describes how to
> apply the patchset.

Congratulations to all.

Naturally, Ext3 will need a fast directory index, and quickly too, 
before people start running benchmarks against ReiserFS and XFS. :-)

Could you take a look at my indexing patch and see what the journalling 
issues are?  (If any)

I have three flavors for you to choose from:

  1) Good old buffer cache
  2) Page cache, block oriented
  3) Page cache, blocks and pages

The first two are from the same patch, with a compilation option:

  http://nl.linux.org/~phillips/htree/dx.testme-2.4.4

And the third is a combination of two patches:

  ftp://ftp.math.psu.edu/pub/viro/ext2-dir-patch-S4.gz
  http://nl.linux.org/~phillips/htree/dx.pcache-2.4.4-6

Please take a look and see which style fits best.  The pcache patch is
the forward-looking one; it's preferred.

--
Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]


