Re: PUFFS ADVLOCK is too greedy

2011-11-14 Thread David Laight
On Fri, Nov 11, 2011 at 11:11:35AM +0100, Emmanuel Dreyfus wrote:
 
 Here is where the extra ADVLOCK happens:
 sys_exit -> exit1 -> fd_free -> fd_close -> VOP_ADVLOCK
 
 In a nutshell, NetBSD clears locks on all file descriptors when a process
 exits, including read-only descriptors, and FUSE filesystems seem to
 be uncomfortable with that.
 
 I am not sure where this should be fixed: should we avoid unlocking
 read-only files in fd_close()? That seems a good idea performance-wise.

I think that is required by POSIX.
Any close() of a file by a process removes ALL locks held by that
process (on that file), not just those acquired through that fd.

Yes, that has nasty side effects, especially in threaded processes,
or when library routines open/close files that might have locks
on them in other parts of the process.
Similarly, there isn't a file locking scheme that can be used between
threads of a single process.
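
A minimal userland sketch of that first rule (illustrative only; the
file name is made up):

/* Illustration: lock a file through fd1, then close an unrelated
 * read-only descriptor fd2 for the same file.  Per POSIX, the lock
 * taken through fd1 is released by the close of fd2. */
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    int fd1 = open("/tmp/lockdemo", O_RDWR | O_CREAT, 0600);
    int fd2 = open("/tmp/lockdemo", O_RDONLY);
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,                     /* 0 = to EOF, i.e. whole file */
    };

    (void)fcntl(fd1, F_SETLK, &fl);     /* exclusive lock via fd1 */
    close(fd2);                         /* releases the fd1 lock too */
    close(fd1);
    return 0;
}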

David

-- 
David Laight: da...@l8s.co.uk


Re: MAXNAMLEN vs NAME_MAX

2011-11-14 Thread Matthew Mondor
On Sun, 13 Nov 2011 23:08:30 +
David Holland dholland-t...@netbsd.org wrote:

 I was recently talking to some people who'd been working with some
 (physicists, I think) doing data-intensive simulation of some kind,
 and that reminded me: for various reasons, many people who are doing
 serious data collection or simulation tend to encode vast amounts of
 metadata in the names of their data files. Arguably this is a bad way
 of doing things, but there are reasons for it and not so many clear
 alternatives... anyway, 256 character filenames often aren't enough in
 that context.

It's only my opinion, but they really should be using multiple files or
a database for the metadata, with a link to an actual file for the
data where necessary.
But I also tend to think the same of software relying on extended
attributes, resource forks and the like (with the possible exception of
a specialized facility for extended permissions :)

 (This sort of usage also often involves things like 50,000 files in
 one directory, so the columnizing behavior of ls is far from the top
 of the list of relevant issues.)

This reminds me, does anyone know about the current state of
UFS_DIRHASH?  I remember reading about some issues with it and ending up
disabling it on my kernels, yet huge directories can occur in a number
of scenarios (probably a more pressing issue than extending file names,
actually)...

   The 255 limit was just because that's how many bytes a one-byte length
   field permitted, not because anyone thought names that long made sense.
   But if you're going to increase it, why stop at 511?  That number
   means nothing - the next logical limit would be 65535, wouldn't it?
 
 Well... yes but there are other considerations. As you noted, going
 past one physical sector is problematic; going past one filesystem
 block very problematic. Plus, as long as MMU pages remain 4K,
 allocating contiguous kernel virtual space for path buffers (since if
 NAME_MAX were raised to 64K, PATH_MAX would have to be at least that
 large) could start to be a problem.

I agree, especially given all the software that allocates path/file name
buffers on the stack (and even on the heap it could be a general memory
waste at 64KB, aside from the memory management performance issues).
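
To make the stack point concrete, here is the common idiom (a sketch;
the function is made up):

#include <limits.h>
#include <stdio.h>

static void
show_entry(const char *dir, const char *name)
{
    char path[PATH_MAX];                /* ~1KB today; 64KB under such a bump */

    snprintf(path, sizeof(path), "%s/%s", dir, name);
    puts(path);
}

Every frame like that would quietly grow from about 1KB to 64KB of stack.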
-- 
Matt


Re: fs-independent quotas

2011-11-14 Thread Manuel Bouyer
On Sun, Nov 13, 2011 at 07:42:18PM -0500, Mouse wrote:
  The argument that ufs_quota_entry (or whatever its name is) will be
  good enough for any future filesystem is just not true.
 
 You have asserted that.

I also explained why, I think.

 
 Proof by repeated assertion is...unconvincing.

I can explain again, but I don't like writing the same thing over and
over again. I have more interesting things to do.

-- 
Manuel Bouyer bou...@antioche.eu.org
 NetBSD: 26 years of experience will always make the difference
--


Re: fs-independent quotas

2011-11-14 Thread David Holland
On Sun, Nov 13, 2011 at 10:36:55PM +0100, Manuel Bouyer wrote:
  On Sat, Oct 29, 2011 at 05:14:30PM +, David Holland wrote:
   [...]
  3. Abolish the proplib-based transport encoding. Since it turns out
 that the use of proplib for quotactl(2) is only to encode struct
 ufs_quota_entry for transport across the user/kernel boundary,
 converting it back on the other side, it seems to me that it's
 a completely pointless complication and a poor use of proplib.
 It's also messy, even compared to other proplib usage elsewhere.
 (Regarding claims of easier/better compatibility, see below.)
  
  Oh no, not again!
  
  The argument that ufs_quota_entry (or whatever its name is) will
  be good enough for any future filesystem is just not true. We have
  already been through this argument.

Yes, and your hypothetical examples haven't come close to convincing
me. And I agree there's no point thrashing back and forth any further;
this is why I've asked core to decide.

   3. There's already been some discussion of the compat issues in this
   thread. Basically it boils down to: if you send a program material
   that it's not expecting to receive, it won't be able to cope with it
   and will fail/die/crash. This is true whether the material is binary
   or a proplib bundle or text or whatever else.
  
  With a binary format it'll probably crash. With a text-based format it
  will notice the syntax error and return an error code. This is a big
  difference, especially for the kernel.

Neither is good enough if you're providing backwards compatibility; it
has to *work*. This is the standard we're committed to, and I continue
to think proplib offers no particular advantage in this regard,
especially for this kind of data.

(I don't think any semistructured or self-describing data model,
including the perfect one I'd replace proplib with if I could wave a
wand to do so, provides any particular advantage for procedure call
compatibility. Sure, you can tag data bundles with version codes and
such, but we can and do already do that by tagging the call itself and
have lots of support architecture in place for doing it that way. The
advantages appear when you're dealing with irregularly structured
material, like when there are large numbers of optional fields or
optional parameters and so forth.)
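
To illustrate what tagging the call itself looks like, a toy sketch (the
struct and function names here are made up, not the actual quota code):

#include <stdint.h>
#include <string.h>

/* Hypothetical v1 and v2 of a fixed binary record. */
struct example_qentry_v1 {
    uint32_t blocks;
    uint32_t files;
};

struct example_qentry_v2 {
    uint64_t blocks;
    uint64_t files;
    uint64_t grace;                     /* new in v2 */
};

/* Shim bound to the *old* entry point: old binaries keep passing v1
 * records; the shim widens them and forwards to the current code. */
static void
example_widen_v1(const struct example_qentry_v1 *ov,
    struct example_qentry_v2 *nv)
{
    memset(nv, 0, sizeof(*nv));
    nv->blocks = ov->blocks;
    nv->files = ov->files;
    /* grace didn't exist in v1, so the v2 default (0) applies */
}

The version information lives in which entry point was called, not
inside the data being passed.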

   4. If using split on the output of quota/repquota/whatnot isn't good
   enough (in some specific places we may want a machine-readable-output
   option) then the best way to access the quota system from Perl or
   Python (or Ruby or Lua or ...) is with a set of bindings for the new
   proposed libquota. This should be straightforward to set up once the
   new libquota is in place. I think the current quotactl(8) should just
  
  Are you going to provide those bindings? I'm interested in Perl.

I don't do Perl. I might be persuaded to do Python bindings, but it
would probably be more effective to enlist someone who already knows
how to do this and won't therefore make newbie mistakes with the
interpreter. Anyway, the hard part is making the library interface
available; wrapping Perl or Python around it should be entirely
trivial.
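
For illustration, the kind of C surface such a library might expose,
which a Perl or Python binding would wrap more or less one-to-one (a
sketch only; these names are hypothetical, not a committed interface):

#include <stdint.h>

struct quotahandle;                     /* opaque */

struct quotahandle *quota_open(const char *mountpoint);
int quota_get(struct quotahandle *, int idtype, uint32_t id,
    uint64_t *blocks, uint64_t *files);
void quota_close(struct quotahandle *);

A binding then only has to marshal a handle and some scalars, which is
why I'd call the wrapping trivial.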

   - As far as I can tell there is not and never has been support for
   manipulating quotas on unmounted filesystems.
  
  There was in quota(1), repquota(8) and edquota(8), and I've been using
  it with netbsd-5. Just read the code in the netbsd-5 branch to see it.

I've looked. I don't know what you're seeing, but all I see is code
for directly manipulating the quota files. There's no logic for
mounting anything to reach them. So it'll work if the quota files are
on / and the volume is for /home regardless of whether /home is
mounted... but only because it doesn't have to touch /home in this
case.

I don't think that feature is worth preserving either, since it's
purely a side-effect of having visible quota files. And I don't see
the point; it's not like mounting the volume to run edquota will cause
a catastrophe.

-- 
David A. Holland
dholl...@netbsd.org


Re: MAXNAMLEN vs NAME_MAX

2011-11-14 Thread David Holland
On Mon, Nov 14, 2011 at 04:03:09AM -0500, Matthew Mondor wrote:
   I was recently talking to some people who'd been working with some
   (physicists, I think) doing data-intensive simulation of some kind,
   and that reminded me: for various reasons, many people who are doing
   serious data collection or simulation tend to encode vast amounts of
   metadata in the names of their data files. Arguably this is a bad way
   of doing things, but there are reasons for it and not so many clear
   alternatives... anyway, 256 character filenames often aren't enough in
   that context.
  
  It's only my opinion, but they really should be using multiple files or
  a database for the metadata, with a link to an actual file for the
  data where necessary.

Perhaps, but telling people they should be working a different way
usually doesn't help. (Have you ever done any stuff like this? Even if
you have only a few settings and only a couple hundred output files,
there's still no decent way to arrange it but name the output files
after the settings.)

   (This sort of usage also often involves things like 50,000 files in
   one directory, so the columnizing behavior of ls is far from the top
   of the list of relevant issues.)
  
  This reminds me, does anyone know about the current state of
  UFS_DIRHASH?  I remember reading about some issues with it and ending up
  disabling it on my kernels, yet huge directories can occur in a number
  of scenarios (probably a more pressing issue than extending file names,
  actually)...

I don't know. At best it's not really a complete solution, anyway...

   Well... yes but there are other considerations. As you noted, going
   past one physical sector is problematic; going past one filesystem
   block very problematic. Plus, as long as MMU pages remain 4K,
   allocating contiguous kernel virtual space for path buffers (since if
   NAME_MAX were raised to 64K, PATH_MAX would have to be at least that
   large) could start to be a problem.
  
  I agree, especially given all the software that allocates path/file name
  buffers on the stack (and even on the heap it could be a general memory
  waste at 64KB, aside from the memory management performance issues).

Pathname buffers generally shouldn't be (and in NetBSD, aren't) on the
stack regardless. Even at only 1K each, it's really easy to blow a 4k
kernel stack with them. (In practice you can generally get away with
one; but two, as you need for rename, link, symlink, etc., is too
many.)
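
The usual kernel idiom looks roughly like this (a sketch; the function
name is made up and error handling is trimmed):

#include <sys/param.h>                  /* MAXPATHLEN */
#include <sys/namei.h>                  /* PNBUF_GET(), PNBUF_PUT() */
#include <sys/systm.h>                  /* copyinstr() */

static int
example_copyin_path(const char *upath)
{
    char *kpath = PNBUF_GET();          /* pool-backed, MAXPATHLEN bytes */
    int error;

    error = copyinstr(upath, kpath, MAXPATHLEN, NULL);
    /* ... use kpath here ... */
    PNBUF_PUT(kpath);                   /* back to the pool, off the stack */
    return error;
}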

Or I guess you don't mean in the kernel, do you...

-- 
David A. Holland
dholl...@netbsd.org


Re: bumping ARG_MAX

2011-11-14 Thread David Laight
On Sun, Nov 13, 2011 at 11:17:52PM +, David Holland wrote:
 pkgsrc has grown to the point where the following happens:
 
valkyrie% pwd
/usr/pkgsrc
valkyrie% grep foo */*/Makefile
/usr/bin/grep: Argument list too long.
Exit 1

Use: grep -r --include Makefile foo .
But don't forget the '.' - it should be the default with -r
(or omitting it should at least be an error).

David

-- 
David Laight: da...@l8s.co.uk


Re: MAXNAMLEN vs NAME_MAX

2011-11-14 Thread David Laight
On Mon, Nov 14, 2011 at 04:03:09AM -0500, Matthew Mondor wrote:
 On Sun, 13 Nov 2011 23:08:30 +
 David Holland dholland-t...@netbsd.org wrote:
 
  I was recently talking to some people who'd been working with some
  (physicists, I think) doing data-intensive simulation of some kind,
  and that reminded me: for various reasons, many people who are doing
  serious data collection or simulation tend to encode vast amounts of
  metadata in the names of their data files. Arguably this is a bad way
  of doing things, but there are reasons for it and not so many clear
  alternatives... anyway, 256 character filenames often aren't enough in
  that context.
 
 It's only my opinion, but they really should be using multiple files or
 a database for the metadata, with a link to an actual file for the
 data where necessary.

Or use '/' to separate the fields in their long filename :-)
(But then they'll hit the 32k/64k limit on subdirectories ...)

Thinks...   MD5 hash the user-specified filename and use that for
the 'real' name. Add some special fudgery so that readdir() works.
Then use some kind of overlay mount.
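
Something like this for the name-mapping half (a toy sketch; on NetBSD
MD5Data() from <md5.h> lives in libc, and the readdir() fudgery is left
as described):

#include <md5.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *user_name = "run=42,temp=300K,dt=0.001,seed=12345";
    char ondisk[33];                    /* 32 hex digits + NUL */

    /* Fixed-length on-disk name for an arbitrarily long user name. */
    MD5Data((const unsigned char *)user_name, strlen(user_name), ondisk);
    printf("%s -> %s\n", user_name, ondisk);
    return 0;
}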

David

-- 
David Laight: da...@l8s.co.uk


Re: bumping ARG_MAX

2011-11-14 Thread Mouse
valkyrie% grep foo */*/Makefile
 Use: grep -r --include Makefile foo .

That (a) will include Makefiles at other depths than two (which may not
be a problem in the specific example of pkgsrc, but in general makes it
non-equivalent), (b) is grep-specific, and (c) will walk the whole tree
to full depth even if there aren't any more Makefiles for it to find.

I think the point still stands.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML mo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: bumping ARG_MAX

2011-11-14 Thread David Holland
On Mon, Nov 14, 2011 at 05:39:02PM +, David Laight wrote:
   pkgsrc has grown to the point where the following happens:
   
  valkyrie% pwd
  /usr/pkgsrc
  valkyrie% grep foo */*/Makefile
  /usr/bin/grep: Argument list too long.
  Exit 1
  
  Use: grep -r --include Makefile foo .
  But don't forget the '.' - should be the default with -r
  (or at least an error).

Or use: find . -name Makefile -print | xargs grep foo
Or use: grep foo [a-m]*/*/Makefile; grep foo [n-z]*/*/Makefile

or whatever. That's completely not the point...

-- 
David A. Holland
dholl...@netbsd.org


Re: VOP_GETATTR: locking protocol change proposal

2011-11-14 Thread YAMAMOTO Takashi
hi,

 The vnode locking requirement currently allows calling VOP_GETATTR()
 on an unlocked vnode.  This is at odds with all other operations that
 read data or metadata and want at least a shared lock.  It also asks
 for trouble as the attributes may change while the operation is in
 progress.
 
 With the attached diff the locking protocol requests at least a shared
 lock and all calls to VOP_GETATTR() outside of file systems respect it.
 
 The calls from file systems need review (the NFS server is suspect, at least).
 
 I will commit this diff around Oct 14 if no one objects.

postgresql assumes an instant lseek(SEEK_END) to get the size of
its heap files.

http://rhaas.blogspot.com/2011/11/linux-lseek-scalability.html

Since fsync etc. keep the vnode lock during I/O, this might cause a
severe performance regression.
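
The pattern at stake is just this (an illustration, not the actual
postgres source):

#include <sys/types.h>
#include <unistd.h>

/* Expected to be effectively free: no I/O, just return the size. */
static off_t
heap_file_size(int fd)
{
    return lseek(fd, 0, SEEK_END);
}

If VOP_GETATTR() now needs the vnode lock, this call can stall behind a
long-running fsync() of the same file.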

YAMAMOTO Takashi

 
 --
 Juergen Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig (Germany)