Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-22 Thread Oliver Xymoron

On Mon, 21 May 2001, Daniel Phillips wrote:

> On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > What I'd like to see:
> >
> > - An interface for registering an array of related devices (almost
> > always two: raw and ctl) and their legacy device numbers with a
> > single userspace callout that does whatever /dev/ creation needs to
> > be done. Thus, naming and permissions live in user space. No "device
> > node is also a directory" weirdness...
>
> Could you be specific about what is weird about it?

*boggle*

Without precedent in any other UNIX? Or other operating systems, for that
matter? Can you honestly say it doesn't strike you as weird? It's beating
the least surprise rule with a big stick, fercryinoutloud.

Ok, so technically UNIX directories were once just files. But it's been a
long time since people thought exposing that implementation detail was a
good idea, and anyway, it's the opposite situation (and no longer true on
modern fses).

I don't think it's likely to be even workable. Just consider the directory
entry for a moment - is it going to be marked d or [cb]? If it doesn't
have the directory bit set, Midnight commander won't let me look at it,
and I wouldn't blame cd or ls for complaining. If it does have the 'd' bit
set, I wouldn't blame cp, tar, find, or a million other programs if they
did the wrong thing. They've had 30 years to expect that files aren't
directories. They're going to act weird.

Linus has been kicking this idea around for a couple years now and it's
still a cute solution looking for a problem. It just doesn't belong in
UNIX.

More importantly, there's no call for the weirdness. Look, we've already
got to have a userspace callout for new devices so that we can do config,
firmware downloading, automounting, etc. There's no reason we can't stick
the rest of the dynamic /dev/ magic in userspace with the same mechanism.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-24 Thread Andreas Dilger

Peter Braam writes:
> On Tue, 22 May 2001, Andreas Dilger wrote:
> > Actually, the LVM snapshot
> > interface has (optional) hooks into the filesystem to ensure that it
> > is consistent at the time the snapshot is created.
> 
> File system journal recovery can corrupt a snapshot, because it copies
> data that needs to be preserved in a snapshot. During journal replay such
> data may be copied again, but the source can have new data already.

The way it is implemented in reiserfs is to wait for existing transactions
to complete, entirely flush the journal and block all new transactions from
starting.  Stephen implemented a journal flush API to do this for ext3, but
the hooks to call it from LVM are not in place yet.  This way the journal is
totally empty at the time the snapshot is done, so the read-only copy does
not need to do journal recovery, so no problems can arise.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-24 Thread Andreas Dilger

Jeff writes:
> Here's a dumb question, and I apologize if I am questioning computer
> science dogma...
> 
> Why are LVM and EVMS(competing LVM project) needed at all?
> 
> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink
> 
> On-line optimization (defrag, etc) shouldn't be hard once you have the
> ability to move blocks and files around, which would come with the
> ability to grow and shrink blkdevs and fs's.

You're missing virtual->physical block mapping allowing you to move parts
of the device around, freedom from the need for contiguous disk space.

In the end, what you've described above is pretty much what LVM does (and
EVMS does better).  Having the various components inside a single layer
like EVMS gives you a lot move flexibility, IMHO.  You also don't have
the issue of wasted minor numbers for unused partitions, or too few minor
numbers in other cases.

For example, with MD RAID you still need devices of equal size to create
a RAID 1 mirror, or part of one device is wasted.  With EVMS you can (in
the future, or right now with AIX/HPUX LVM) do the RAID 1 mirroring on a
per-logical-extent basis and you get your physical extents from any device.
Because your virtual->physical mapping is already abstract, it also allows
you to add mirroring to any existing LVM device without interruption.

Cheers, Andreas

PS - I used to think shrinking a filesystem online was useful, but there
 are a huge amount of problems with this and very few real-life
 benefits, as long as you can at least do offline shrinking.  With
 proper LVM usage, the need to shrink a filesystem never really
 happens in practise, unlike the partition case where you always
 have to guess in advance how big a filesystem needs to be, and then
 add 10% for a safety margin.  With LVM you just create the minimal
 sized device you need now, and freely grow it in the future.
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-24 Thread Andreas Dilger

Linus writes:
> There are some strong arguments that we should have filesystem
> "backdoors" for maintenance purposes, including backup. 
> 
> You can, of course, so parts of this on a LVM level, and doing backups
> with "disk snapshots" may be a valid approach. However, even that is
> debatable: there is very little that says that the disk image has to be
> up-to-date at any particular point in time, so even with a disk snapshot
> capability (which is not necessarily reasonable under all circumstances)
> there are arguments for maintenance interfaces.

Actually, the LVM snapshot interface has (optional) hooks into the filesystem
to ensure that it is consistent at the time the snapshot is created.  For
most filesystems, it will call fsync_dev(dev) so that all buffers are written
to disk.  However, for journalled filesystems, LVM needs to write out the
journal and mark the filesystem clean because the snapshot is a read-only
block device.  In this case it calls fsync_dev_lockfs(dev) which will call
the write_super_lockfs() method for the filesystem (if it exists) which
tells the filesystem to flush the journal, block transactions, and mark the
filesystem clean until the unlockfs() method is called.

Reiserfs and XFS both use this to make consistent snapshots of the live
filesystem.  Unfortunately, XFS checks filesystem UUIDs at mount time,
which means you can't mount two copies of the same filesystem (even read-only).

> Things like "lazy fsck" (ie fsck while already running the filesystem) and
> defragmentation simply is not feasible on a LVM level.

Yes, with consistent LVM snapshots you can do fsck on the read-only copy.
In 99.9*% cases you will not detect any errors and you can continue.  If
you _do_ detect an error you probably want to stop everything and fix it
(fsck repairing an in-use filesystem is too twisted and dangerous, IMHO,
and a huge amount of effort for an extremely rare situation).

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-22 Thread Oliver Xymoron

On Tue, 22 May 2001, Daniel Phillips wrote:

> > I don't think it's likely to be even workable. Just consider the
> > directory entry for a moment - is it going to be marked d or [cb]?
>
> It's going to be marked 'd', it's a directory, not a file.

Are we talking about the same proposal?  The one where I can open /dev/dsp
and /dev/dsp/ctl? But I can still do 'cat /dev/hda > /dev/dsp'?

It's still a file. If it's not a file anymore, it ain't UNIX.

> > If it doesn't have the directory bit set, Midnight commander won't
> > let me look at it, and I wouldn't blame cd or ls for complaining. If it
> > does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> > million other programs if they did the wrong thing. They've had 30
> > years to expect that files aren't directories. They're going to act
> > weird.
>
> No problem, it's a directory.
>
> > Linus has been kicking this idea around for a couple years now and
> > it's still a cute solution looking for a problem. It just doesn't
> > belong in UNIX.
>
> Hmm, ok, do we still have any *technical* reasons?

If you define *technical* to not include design, sure. Oh, did I
mention unnecessary, solvable in userspace?

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-23 Thread Oliver Xymoron

On Wed, 23 May 2001, Daniel Phillips wrote:

> > > > *boggle*
> > > >
> > > >[general sense of unease]
> >
> > I fully agree with Oliver.  It's an abomination.
>
> We are, or at least, I am, investigating this question purely on
> technical grounds - name calling is a noop.  I'd be happy to find a
> real reason why this is a bad idea but so far none has been
> presented.

I will agree that the thing can be done in principle. You're not going to
find anyone who's going to argue that part. All other things being equal,
I actually think it's a neat idea.

The part that is a problem is people, namely people who write programs.
They've had decades to expect that directories are not also files, and if
they happen to do things like check whether a file is not a directory
before opening it, it's _our fault_ if they get confused.

Consider the recent subtle change to fork() that was reversed because it
uncovered an unforseen bug in bash. The proposed change is not at all
subtle, is entirely without precedent, and is likely to break much.

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-24 Thread Oliver Xymoron

On Thu, 24 May 2001, Marko Kreen wrote:

> On Thu, May 24, 2001 at 02:23:27AM +0200, Edgar Toernig wrote:
> > Daniel Phillips wrote:
> > > > > It's going to be marked 'd', it's a directory, not a file.
> > > >
> > > > Aha.  So you lose the S_ISCHR/BLK attribute.
> > >
> > > Readdir fills in a directory type, so ls sees it as a directory and does
> > > the right thing.  On the other hand, we know we're on a device
> > > filesystem so we will next open the name as a regular file, and find
> > > ISCHR or ISBLK: good.
> >
> > ??? The kernel may know it, but the app?  Or do you really want to
> > give different stat data on stat(2) and fstat(2)?  These flags are
> > currently used by archive/backup prgs.  It's a hint that these files
> > are not regular files and shouldn't be opened for reading.
> > Having a 'd' would mean that they would really try to enter the
> > directory and save it's contents.  Don't know what happens in this
> > case to your "special" files ;-)
>
> IMHO the CHR/BLK is not needed.  Think of /proc.  In the future,
> the backup tools will be told to ignore /dev, that's all.

The /dev dir should not be special. At least not to the kernel. I have
device files in places other than /dev, and you probably do too (hint:
anonymous FTP).

--
 "Love the dolphins," she advised him. "Write by W.A.S.T.E.."

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]devicearguments from lookup)

2001-05-24 Thread Andreas Dilger

Malcolm Beattie writes:
> Andreas Dilger writes:
> > PS - I used to think shrinking a filesystem online was useful, but there
> >  are a huge amount of problems with this and very few real-life
> >  benefits, as long as you can at least do offline shrinking.  With
> >  proper LVM usage, the need to shrink a filesystem never really
> >  happens in practise, unlike the partition case where you always
> >  have to guess in advance how big a filesystem needs to be, and then
> >  add 10% for a safety margin.  With LVM you just create the minimal
> >  sized device you need now, and freely grow it in the future.
> 
> In an attempt to nudge you back towards your previous opinion: consider
> a system-wide spool or tmp filesystem. It would be nice to be able to
> add in a few extra volumes for a busy period but then shrink it down
> again when usage returns to normal. In the absence of the ability to
> shrink a live filesystem, storage management becomes a much harder job.
> You can't throw in a spare volume or two where it's needed without
> careful thought because you'll be ratchetting up the space on that one
> filesystem without being able to change your mind and reduce it again
> later. You'll end up with stingy storage admins who refuse to give you
> a bunch of extra filesystem space for a while because they can't get it
> back again afterwards.

I suppose it depends a bit on how your system is administered.  On LVM
systems, I tend to allocate new volumes for special situations like this.
When the special need is gone, you simply remove the whole thing.  Yes,
this is a bit of a hack for not having online shrinking, but I have not
really had a _big_ need to do that.

The only time I've really needed online shrinking is when someone
screwed up and made / or /var way too huge for some (bad) reason and
you can't unmount it conveniently.  Under AIX, you can't shrink JFS
even unmounted so it meant backup/restore.  Even so, having empty
space in a filesystem is not a reason to panic, while having no free
space in a filesystem _is_ a reason to panic, hence online growing
of ext2.

Cheers, Andreas
-- 
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/   -- Dogbert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/