subject:"\[PATCH 00\/14\] GFS"

Re: [PATCH 00/14] GFS

2005-08-11 Thread Arjan van de Ven

On Thu, 2005-08-11 at 14:06 +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
> > * + if (create)
> > +   down_write(>i_rw_mutex);
> > +   else
> > +   down_read(>i_rw_mutex);
> > 
> > why do you use a rwsem and not a regular semaphore? You are aware that
> > rwsems are far more expensive than regular ones right?  How skewed is
> > the read/write ratio?
> 
> Rough tests show around 4/1, that high or low?

that's quite borderline; if it was my code I'd not use a rwsem for that
ratio (my own rule of thumb, based on not a lot other than gut feeling)
is a 10/1 ratio at minimum... but it's not so low that it screams for
removing it. However it might well make your code a lot simpler so
it might still be worth simplifying.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-11 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * +   if (create)
> + down_write(>i_rw_mutex);
> + else
> + down_read(>i_rw_mutex);
> 
> why do you use a rwsem and not a regular semaphore? You are aware that
> rwsems are far more expensive than regular ones right?  How skewed is
> the read/write ratio?

Rough tests show around 4/1, that high or low?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-11 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

 * +   if (create)
 + down_write(ip-i_rw_mutex);
 + else
 + down_read(ip-i_rw_mutex);
 
 why do you use a rwsem and not a regular semaphore? You are aware that
 rwsems are far more expensive than regular ones right?  How skewed is
 the read/write ratio?

Rough tests show around 4/1, that high or low?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-11 Thread Arjan van de Ven

On Thu, 2005-08-11 at 14:06 +0800, David Teigland wrote:
 On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
 
  * + if (create)
  +   down_write(ip-i_rw_mutex);
  +   else
  +   down_read(ip-i_rw_mutex);
  
  why do you use a rwsem and not a regular semaphore? You are aware that
  rwsems are far more expensive than regular ones right?  How skewed is
  the read/write ratio?
 
 Rough tests show around 4/1, that high or low?

that's quite borderline; if it was my code I'd not use a rwsem for that
ratio (my own rule of thumb, based on not a lot other than gut feeling)
is a 10/1 ratio at minimum... but it's not so low that it screams for
removing it. However it might well make your code a lot simpler so
it might still be worth simplifying.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-10 Thread Kyle Moffett


On Aug 10, 2005, at 09:26:26, AJ Lewis wrote:

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:


On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:

So for every directory hierarchy on a shared filesystem, each  
user needs
to have the complete list of bindmounts needed, and automatically  
resync
that across all nodes when a new one is added or removed? And  
then have

that executed by root, because a regular user can't?


Do it in an initscripts and let users simply not do it, they  
shouldn't

even know what kind of filesystem they are on.


I'm just thinking of a 100-node cluster that has different mounts  
on different
nodes, and trying to update the bind mounts in a sane and efficient  
manner

without clobbering the various mount setups.  Ouch.


How about something like the following:
cpslink()  => Create a Context Dependent Symlink
readcpslink()  => Return the Context Dependent path data
readlink() => Return the path of the Context Dependent  
Symlink as it
  would be evaluated in the current context,  
basically as a

  normal symlink.
lstat()=> Return information on the Context Dependent  
Symlink in

  the same format as a regular symlink.
unlink()   => Delete the Context Dependent Symlink.

You would need an extra userspace tool that understands cpslink/ 
readcpslink to
create and get information on the links for now, but ls and ln could  
eventually
be updated, and until then the would provide sane behavior.  Perhaps  
this

should be extended into a new API for some of the strange things several
filesystems want to do in the VFS:
extlink()  => Create an extended filesystem link (with type  
specified)

readextlink()  => Return the path (and type) for the link

The filesystem could define how each type of link acts with respect  
to other
syscalls.  OpenAFS could use extlink() instead of their symlink magic  
for
adjusting the AFS volume hierarchy.  The new in-kernel AFS client  
could use it
in similar fashion (It has no method to adjust hierarchy, because  
it's still
read-only).  GFS could use it for their Context Dependent Symlinks.   
Since it
would pass the type in as well, it would be possible to use it for  
different

kinds of links on the same filesystem.

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
  -- Alan Kay



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-10 Thread AJ Lewis

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:
> On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
> > So for every directoy hiearchy on a shared filesystem, each user needs
> > to have the complete list of bindmounts needed, and automatically resync
> > that across all nodes when a new one is added or removed? And then have
> > that executed by root, because a regular user can't?
> 
> Do it in an initscripts and let users simply not do it, they shouldn't
> even know what kind of filesystem they are on.

I'm just thinking of a 100-node cluster that has different mounts on different
nodes, and trying to update the bind mounts in a sane and efficient manner
without clobbering the various mount setups.  Ouch.

-- 
AJ Lewis   Voice:  612-638-0500
Red HatE-Mail: [EMAIL PROTECTED]
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...



pgpsGmjTnZdeF.pgp
Description: PGP signature

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T12:05:11, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > > What would a syntax look like which in your opinion does not remove
> > > totally valid symlink targets for magic mushroom bullshit? Prefix with
> > > // (which, according to POSIX, allows for implementation-defined
> > > behaviour)? Something else, not allowed in a regular pathname?
> > None.  just don't do it.  Use bindmount, they're cheap and have sane
> > defined semtantics.
> 
> So for every directoy hiearchy on a shared filesystem, each user needs
> to have the complete list of bindmounts needed, and automatically resync
> that across all nodes when a new one is added or removed? And then have
> that executed by root, because a regular user can't?

Do it in an initscripts and let users simply not do it, they shouldn't
even know what kind of filesystem they are on.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T12:05:11, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> > What would a syntax look like which in your opinion does not remove
> > totally valid symlink targets for magic mushroom bullshit? Prefix with
> > // (which, according to POSIX, allows for implementation-defined
> > behaviour)? Something else, not allowed in a regular pathname?
> None.  just don't do it.  Use bindmount, they're cheap and have sane
> defined semtantics.

So for every directoy hiearchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?

Sure. Very cheap and sane. I'm buying.

Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 01:02:59PM +0200, Lars Marowsky-Bree wrote:
> What would a syntax look like which in your opinion does not remove
> totally valid symlink targets for magic mushroom bullshit? Prefix with
> // (which, according to POSIX, allows for implementation-defined
> behaviour)? Something else, not allowed in a regular pathname?

None.  just don't do it.  Use bindmount, they're cheap and have sane
defined semtantics.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T11:54:50, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> It works now.  Unlike context link which steal totally valid symlink
> targets for magic mushroom bullshit.

Right, that is a valid concern. Avoiding context dependent symlinks
entirely certainly is one possible path around this. But, let's just for
the sake of this discussion continue the other path for a bit, to
explore the options available for implementing CPS which don't result in
shivers running down the spine, because I believe CPS do have some
applications in which bind mounts are not entirely adequate
replacements.

(Unless, of course, you want a bind mount for each homedirectory which
might include architecture-specific subdirectories or for every
host-specific configuration file.)

What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?

If we can't find an acceptable way of implementing them, maybe it's time
to grab some magic mushrooms and come up with a new approach, then ;-)

Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 12:34:24PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T11:32:56, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > > Would a generic implementation of that higher up in the VFS be more
> > > acceptable?
> > No.  Use mount --bind
> 
> That's a working and less complex alternative for upto how many places
> at once? That works for non-root users how...?

It works now.  Unlike context link which steal totally valid symlink
targets for magic mushroom bullshit.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T11:32:56, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> > Would a generic implementation of that higher up in the VFS be more
> > acceptable?
> No.  Use mount --bind

That's a working and less complex alternative for upto how many places
at once? That works for non-root users how...?


Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 12:30:41PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-10T08:03:09, Christoph Hellwig <[EMAIL PROTECTED]> wrote:
> 
> > > Kindly lose the "Context Dependent Pathname" crap.
> > Same for ocfs2.
> 
> Would a generic implementation of that higher up in the VFS be more
> acceptable?

No.  Use mount --bind

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T08:03:09, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

> > Kindly lose the "Context Dependent Pathname" crap.
> Same for ocfs2.

Would a generic implementation of that higher up in the VFS be more
acceptable?

It's not like context-dependent symlinks are an arbitary feature, but
rather very useful in practice.

Sincerely,
Lars Marowsky-Brée <[EMAIL PROTECTED]>

-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Tue, Aug 09, 2005 at 04:20:45PM +0100, Al Viro wrote:
> On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
> > Hi, GFS (Global File System) is a cluster file system that we'd like to
> > see added to the kernel.  The 14 patches total about 900K so I won't send
> > them to the list unless that's requested.  Comments and suggestions are
> > welcome.  Thanks
> > 
> > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> > http://redhat.com/~teigland/gfs2/20050801/broken-out/
> 
> Kindly lose the "Context Dependent Pathname" crap.

Same for ocfs2.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Tue, Aug 09, 2005 at 04:20:45PM +0100, Al Viro wrote:
 On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
  Hi, GFS (Global File System) is a cluster file system that we'd like to
  see added to the kernel.  The 14 patches total about 900K so I won't send
  them to the list unless that's requested.  Comments and suggestions are
  welcome.  Thanks
  
  http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
  http://redhat.com/~teigland/gfs2/20050801/broken-out/
 
 Kindly lose the Context Dependent Pathname crap.

Same for ocfs2.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T08:03:09, Christoph Hellwig [EMAIL PROTECTED] wrote:

  Kindly lose the Context Dependent Pathname crap.
 Same for ocfs2.

Would a generic implementation of that higher up in the VFS be more
acceptable?

It's not like context-dependent symlinks are an arbitary feature, but
rather very useful in practice.


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 12:30:41PM +0200, Lars Marowsky-Bree wrote:
 On 2005-08-10T08:03:09, Christoph Hellwig [EMAIL PROTECTED] wrote:
 
   Kindly lose the Context Dependent Pathname crap.
  Same for ocfs2.
 
 Would a generic implementation of that higher up in the VFS be more
 acceptable?

No.  Use mount --bind

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T11:32:56, Christoph Hellwig [EMAIL PROTECTED] wrote:

  Would a generic implementation of that higher up in the VFS be more
  acceptable?
 No.  Use mount --bind

That's a working and less complex alternative for upto how many places
at once? That works for non-root users how...?


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 12:34:24PM +0200, Lars Marowsky-Bree wrote:
 On 2005-08-10T11:32:56, Christoph Hellwig [EMAIL PROTECTED] wrote:
 
   Would a generic implementation of that higher up in the VFS be more
   acceptable?
  No.  Use mount --bind
 
 That's a working and less complex alternative for upto how many places
 at once? That works for non-root users how...?

It works now.  Unlike context link which steal totally valid symlink
targets for magic mushroom bullshit.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T11:54:50, Christoph Hellwig [EMAIL PROTECTED] wrote:

 It works now.  Unlike context link which steal totally valid symlink
 targets for magic mushroom bullshit.

Right, that is a valid concern. Avoiding context dependent symlinks
entirely certainly is one possible path around this. But, let's just for
the sake of this discussion continue the other path for a bit, to
explore the options available for implementing CPS which don't result in
shivers running down the spine, because I believe CPS do have some
applications in which bind mounts are not entirely adequate
replacements.

(Unless, of course, you want a bind mount for each homedirectory which
might include architecture-specific subdirectories or for every
host-specific configuration file.)

What would a syntax look like which in your opinion does not remove
totally valid symlink targets for magic mushroom bullshit? Prefix with
// (which, according to POSIX, allows for implementation-defined
behaviour)? Something else, not allowed in a regular pathname?

If we can't find an acceptable way of implementing them, maybe it's time
to grab some magic mushrooms and come up with a new approach, then ;-)


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 01:02:59PM +0200, Lars Marowsky-Bree wrote:
 What would a syntax look like which in your opinion does not remove
 totally valid symlink targets for magic mushroom bullshit? Prefix with
 // (which, according to POSIX, allows for implementation-defined
 behaviour)? Something else, not allowed in a regular pathname?

None.  just don't do it.  Use bindmount, they're cheap and have sane
defined semtantics.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Lars Marowsky-Bree

On 2005-08-10T12:05:11, Christoph Hellwig [EMAIL PROTECTED] wrote:

  What would a syntax look like which in your opinion does not remove
  totally valid symlink targets for magic mushroom bullshit? Prefix with
  // (which, according to POSIX, allows for implementation-defined
  behaviour)? Something else, not allowed in a regular pathname?
 None.  just don't do it.  Use bindmount, they're cheap and have sane
 defined semtantics.

So for every directoy hiearchy on a shared filesystem, each user needs
to have the complete list of bindmounts needed, and automatically resync
that across all nodes when a new one is added or removed? And then have
that executed by root, because a regular user can't?

Sure. Very cheap and sane. I'm buying.


Sincerely,
Lars Marowsky-Brée [EMAIL PROTECTED]

-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-10 Thread Christoph Hellwig

On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
 On 2005-08-10T12:05:11, Christoph Hellwig [EMAIL PROTECTED] wrote:
 
   What would a syntax look like which in your opinion does not remove
   totally valid symlink targets for magic mushroom bullshit? Prefix with
   // (which, according to POSIX, allows for implementation-defined
   behaviour)? Something else, not allowed in a regular pathname?
  None.  just don't do it.  Use bindmount, they're cheap and have sane
  defined semtantics.
 
 So for every directoy hiearchy on a shared filesystem, each user needs
 to have the complete list of bindmounts needed, and automatically resync
 that across all nodes when a new one is added or removed? And then have
 that executed by root, because a regular user can't?

Do it in an initscripts and let users simply not do it, they shouldn't
even know what kind of filesystem they are on.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-10 Thread AJ Lewis

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:
 On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:
  So for every directoy hiearchy on a shared filesystem, each user needs
  to have the complete list of bindmounts needed, and automatically resync
  that across all nodes when a new one is added or removed? And then have
  that executed by root, because a regular user can't?
 
 Do it in an initscripts and let users simply not do it, they shouldn't
 even know what kind of filesystem they are on.

I'm just thinking of a 100-node cluster that has different mounts on different
nodes, and trying to update the bind mounts in a sane and efficient manner
without clobbering the various mount setups.  Ouch.

-- 
AJ Lewis   Voice:  612-638-0500
Red HatE-Mail: [EMAIL PROTECTED]
One Main Street SE, Suite 209
Minneapolis, MN 55414
   
Current GPG fingerprint = D9F8 EDCE 4242 855F A03D  9B63 F50C 54A8 578C 8715
Grab the key at: http://people.redhat.com/alewis/gpg.html or one of the
many keyservers out there...



pgpsGmjTnZdeF.pgp
Description: PGP signature

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-10 Thread Kyle Moffett


On Aug 10, 2005, at 09:26:26, AJ Lewis wrote:

On Wed, Aug 10, 2005 at 12:11:10PM +0100, Christoph Hellwig wrote:


On Wed, Aug 10, 2005 at 01:09:17PM +0200, Lars Marowsky-Bree wrote:

So for every directory hierarchy on a shared filesystem, each  
user needs
to have the complete list of bindmounts needed, and automatically  
resync
that across all nodes when a new one is added or removed? And  
then have

that executed by root, because a regular user can't?


Do it in an initscripts and let users simply not do it, they  
shouldn't

even know what kind of filesystem they are on.


I'm just thinking of a 100-node cluster that has different mounts  
on different
nodes, and trying to update the bind mounts in a sane and efficient  
manner

without clobbering the various mount setups.  Ouch.


How about something like the following:
cpslink()  = Create a Context Dependent Symlink
readcpslink()  = Return the Context Dependent path data
readlink() = Return the path of the Context Dependent  
Symlink as it
  would be evaluated in the current context,  
basically as a

  normal symlink.
lstat()= Return information on the Context Dependent  
Symlink in

  the same format as a regular symlink.
unlink()   = Delete the Context Dependent Symlink.

You would need an extra userspace tool that understands cpslink/ 
readcpslink to
create and get information on the links for now, but ls and ln could  
eventually
be updated, and until then the would provide sane behavior.  Perhaps  
this

should be extended into a new API for some of the strange things several
filesystems want to do in the VFS:
extlink()  = Create an extended filesystem link (with type  
specified)

readextlink()  = Return the path (and type) for the link

The filesystem could define how each type of link acts with respect  
to other
syscalls.  OpenAFS could use extlink() instead of their symlink magic  
for
adjusting the AFS volume hierarchy.  The new in-kernel AFS client  
could use it
in similar fashion (It has no method to adjust hierarchy, because  
it's still
read-only).  GFS could use it for their Context Dependent Symlinks.   
Since it
would pass the type in as well, it would be possible to use it for  
different

kinds of links on the same filesystem.

Cheers,
Kyle Moffett

--
Simple things should be simple and complex things should be possible
  -- Alan Kay



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-09 Thread Al Viro

On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

Kindly lose the "Context Dependent Pathname" crap.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-09 Thread Al Viro

On Tue, Aug 02, 2005 at 03:18:28PM +0800, David Teigland wrote:
 Hi, GFS (Global File System) is a cluster file system that we'd like to
 see added to the kernel.  The 14 patches total about 900K so I won't send
 them to the list unless that's requested.  Comments and suggestions are
 welcome.  Thanks
 
 http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
 http://redhat.com/~teigland/gfs2/20050801/broken-out/

Kindly lose the Context Dependent Pathname crap.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread Jörn Engel

On Mon, 8 August 2005 12:05:25 +0200, Arjan van de Ven wrote:
> On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
> > > 
> > > Please drop the extra braces.
> > 
> > Here and elsewhere we try to keep unused stuff off the stack.  Are you
> > suggesting that we're being overly cautious, or do you just dislike the
> > way it looks?
> 
> nice theory. In practice gcc 3.x still adds up all the stack space
> anyway and as long as gcc 3.x is a supported kernel compiler, you can't
> depend on this. Also.. please favor readability. gcc is getting smarter
> about stack use nowadays, and {}'s shouldn't be needed to help it, it
> tracks liveness of variables already.

Plus, you don't have to guess about stack usage.  Run "make
checkstack" or, better yet, run the objdump of fs/gfs/built-in.o
through the perl script.

Jörn

-- 
It's just what we asked for, but not what we want!
-- anonymous
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread Arjan van de Ven

On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
> > 
> > Please drop the extra braces.
> 
> Here and elsewhere we try to keep unused stuff off the stack.  Are you
> suggesting that we're being overly cautious, or do you just dislike the
> way it looks?

nice theory. In practice gcc 3.x still adds up all the stack space
anyway and as long as gcc 3.x is a supported kernel compiler, you can't
depend on this. Also.. please favor readability. gcc is getting smarter
about stack use nowadays, and {}'s shouldn't be needed to help it, it
tracks liveness of variables already.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread David Teigland

On Wed, Aug 03, 2005 at 09:44:06AM +0300, Pekka Enberg wrote:

> > +uint32_t gfs2_hash(const void *data, unsigned int len)
> > +{
> > + uint32_t h = 0x811C9DC5;
> > + h = hash_more_internal(data, len, h);
> > + return h;
> > +}
> 
> Is there a reason why you cannot use  or ?

See gfs2_hash_more() and comment; we hash discontiguous regions.

> > +#define RETRY_MALLOC(do_this, until_this) \
> > +for (;;) { \
> > + { do_this; } \
> > + if (until_this) \
> > + break; \
> > + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> > + printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> > + gfs2_malloc_warning = jiffies; \
> > + } \
> > + yield(); \
> > +}
> 
> Please drop this.

Done in the spot that could deal with an error, but there are three other
places that still need it.

> > +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
> > + struct gfs2_ea_location *el)
> > +{
> > + {
> > + struct ea_set es;
> > + int error;
> > +
> > + memset(, 0, sizeof(struct ea_set));
> > + es.es_er = er;
> > + es.es_el = el;
> > +
> > + error = ea_foreach(ip, ea_set_simple, );
> > + if (error > 0)
> > + return 0;
> > + if (error)
> > + return error;
> > + }
> > + {
> > + unsigned int blks = 2;
> > + if (!(ip->i_di.di_flags & GFS2_DIF_EA_INDIRECT))
> > + blks++;
> > + if (GFS2_EAREQ_SIZE_STUFFED(er) > ip->i_sbd->sd_jbsize)
> > + blks += DIV_RU(er->er_data_len,
> > +ip->i_sbd->sd_jbsize);
> > +
> > + return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
> > + }
> 
> Please drop the extra braces.

Here and elsewhere we try to keep unused stuff off the stack.  Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?

Thanks,
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread David Teigland

On Fri, Aug 05, 2005 at 03:14:15PM +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> > * +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800
> > e why?
> 
> I'm not sure, actually, apart from the comments:
> 
> do_div: /* For ia32 we need to pull some tricks to get past various versions
>of the compiler which do not like us using do_div in the middle
>of large functions. */
> 
> do_mod: /* Side effect free 64 bit mod operation */
> 
> fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
> perhaps this is an old problem that's now fixed?

I've looked into getting rid of these:

- The existing do_div() works fine for me with 64 bit numerators, so I'll
  get rid of the "fixed" version.

- The "fixed" do_mod() seems to be the only way to do 64 bit modulus.
  It would be great if I was wrong about that...

Thanks,
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread David Teigland

On Fri, Aug 05, 2005 at 03:14:15PM +0800, David Teigland wrote:
 On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

  * +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800
  e why?
 
 I'm not sure, actually, apart from the comments:
 
 do_div: /* For ia32 we need to pull some tricks to get past various versions
of the compiler which do not like us using do_div in the middle
of large functions. */
 
 do_mod: /* Side effect free 64 bit mod operation */
 
 fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
 perhaps this is an old problem that's now fixed?

I've looked into getting rid of these:

- The existing do_div() works fine for me with 64 bit numerators, so I'll
  get rid of the fixed version.

- The fixed do_mod() seems to be the only way to do 64 bit modulus.
  It would be great if I was wrong about that...

Thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread David Teigland

On Wed, Aug 03, 2005 at 09:44:06AM +0300, Pekka Enberg wrote:

  +uint32_t gfs2_hash(const void *data, unsigned int len)
  +{
  + uint32_t h = 0x811C9DC5;
  + h = hash_more_internal(data, len, h);
  + return h;
  +}
 
 Is there a reason why you cannot use linux/hash.h or linux/jhash.h?

See gfs2_hash_more() and comment; we hash discontiguous regions.

  +#define RETRY_MALLOC(do_this, until_this) \
  +for (;;) { \
  + { do_this; } \
  + if (until_this) \
  + break; \
  + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
  + printk(GFS2: out of memory: %s, %u\n, __FILE__, __LINE__); \
  + gfs2_malloc_warning = jiffies; \
  + } \
  + yield(); \
  +}
 
 Please drop this.

Done in the spot that could deal with an error, but there are three other
places that still need it.

  +static int ea_set_i(struct gfs2_inode *ip, struct gfs2_ea_request *er,
  + struct gfs2_ea_location *el)
  +{
  + {
  + struct ea_set es;
  + int error;
  +
  + memset(es, 0, sizeof(struct ea_set));
  + es.es_er = er;
  + es.es_el = el;
  +
  + error = ea_foreach(ip, ea_set_simple, es);
  + if (error  0)
  + return 0;
  + if (error)
  + return error;
  + }
  + {
  + unsigned int blks = 2;
  + if (!(ip-i_di.di_flags  GFS2_DIF_EA_INDIRECT))
  + blks++;
  + if (GFS2_EAREQ_SIZE_STUFFED(er)  ip-i_sbd-sd_jbsize)
  + blks += DIV_RU(er-er_data_len,
  +ip-i_sbd-sd_jbsize);
  +
  + return ea_alloc_skeleton(ip, er, blks, ea_set_block, el);
  + }
 
 Please drop the extra braces.

Here and elsewhere we try to keep unused stuff off the stack.  Are you
suggesting that we're being overly cautious, or do you just dislike the
way it looks?

Thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread Arjan van de Ven

On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
  
  Please drop the extra braces.
 
 Here and elsewhere we try to keep unused stuff off the stack.  Are you
 suggesting that we're being overly cautious, or do you just dislike the
 way it looks?

nice theory. In practice gcc 3.x still adds up all the stack space
anyway and as long as gcc 3.x is a supported kernel compiler, you can't
depend on this. Also.. please favor readability. gcc is getting smarter
about stack use nowadays, and {}'s shouldn't be needed to help it, it
tracks liveness of variables already.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-08 Thread Jörn Engel

On Mon, 8 August 2005 12:05:25 +0200, Arjan van de Ven wrote:
 On Mon, 2005-08-08 at 17:57 +0800, David Teigland wrote:
   
   Please drop the extra braces.
  
  Here and elsewhere we try to keep unused stuff off the stack.  Are you
  suggesting that we're being overly cautious, or do you just dislike the
  way it looks?
 
 nice theory. In practice gcc 3.x still adds up all the stack space
 anyway and as long as gcc 3.x is a supported kernel compiler, you can't
 depend on this. Also.. please favor readability. gcc is getting smarter
 about stack use nowadays, and {}'s shouldn't be needed to help it, it
 tracks liveness of variables already.

Plus, you don't have to guess about stack usage.  Run make
checkstack or, better yet, run the objdump of fs/gfs/built-in.o
through the perl script.

Jörn

-- 
It's just what we asked for, but not what we want!
-- anonymous
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-07 Thread Alan Cox

On Mer, 2005-08-03 at 16:33 +0200, Andi Kleen wrote:
> > http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> > http://redhat.com/~teigland/gfs2/20050801/broken-out/
> 
> I would suggest to not merge this before not all the code has not been
> reviewed by some experienced linux developer from outside the GFS team.

Are you volunteering ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-07 Thread Alan Cox

On Mer, 2005-08-03 at 16:33 +0200, Andi Kleen wrote:
  http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
  http://redhat.com/~teigland/gfs2/20050801/broken-out/
 
 I would suggest to not merge this before not all the code has not been
 reviewed by some experienced linux developer from outside the GFS team.

Are you volunteering ?

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Fri, Aug 05, 2005 at 12:07:50PM +0200, J?rn Engel wrote:
> On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
> > Do we go a step beyond this and use say the crc32() function from
> > linux/crc32.h?  Is this _function_ as standard and unchanging as the table
> > of crcs?  In my tests it doesn't produce the same results as our
> > gfs2_disk_hash() function, even with both using the same crc table.  I
> > don't mind adopting a new function and just writing a user space
> > equivalent for the tools if it's a fixed standard.
> 
> The function is basically set in stone.  Variants exists depending on
> how it is called.  I know of four variants, but there may be more:
> 
> 1. Initial value is 0
> 2. Initial value is 0x
> a) Result is taken as-is
> b) Result is XORed with 0x
> 
> Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
> function or something similar?

You're right, initial value 0x and xor result with 0x
matches the results from our function.  Great, we can get rid of
gfs2_disk_hash() and use crc32() directly.
Thanks,
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Jörn Engel

On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
> 
> linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
> This looks like a standard that's not going to change, as you've said, so
> including crc32table.h and getting rid of our own table would work fine.
> 
> Do we go a step beyond this and use say the crc32() function from
> linux/crc32.h?  Is this _function_ as standard and unchanging as the table
> of crcs?  In my tests it doesn't produce the same results as our
> gfs2_disk_hash() function, even with both using the same crc table.  I
> don't mind adopting a new function and just writing a user space
> equivalent for the tools if it's a fixed standard.

The function is basically set in stone.  Variants exists depending on
how it is called.  I know of four variants, but there may be more:

1. Initial value is 0
2. Initial value is 0x
a) Result is taken as-is
b) Result is XORed with 0x

Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
function or something similar?

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Fri, Aug 05, 2005 at 09:34:38AM +0200, Arjan van de Ven wrote:
> On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
> > On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> > 
> > > * +static const uint32_t crc_32_tab[] = .
> > > why do you duplicate this? The kernel has a perfectly good set of
> > > generic crc32 tables/functions just fine
> > 
> > The gfs2_disk_hash() function and the crc table on which it's based are a
> > part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> > unusual since gfs uses a hash table on-disk for its directory structure.
> > This header, including the hash function/table, must be included by user
> > space programs like fsck that want to decipher a fs, and any change to the
> > function or table would effectively make the fs corrupted.  Because of
> > this I think it's best for gfs to keep it's own copy as part of its ondisk
> > format spec.
> 
> for userspace there's libcrc32 as well. If it's *the* bog standard crc32
> I don't see a reason why your "spec" can't just reference that instead.
> And esp in the kernel you should just use the in kernel one not your own
> regardless; you can assume the in kernel one is optimized and it also
> keeps size down.

linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
This looks like a standard that's not going to change, as you've said, so
including crc32table.h and getting rid of our own table would work fine.

Do we go a step beyond this and use say the crc32() function from
linux/crc32.h?  Is this _function_ as standard and unchanging as the table
of crcs?  In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table.  I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Arjan van de Ven

On Fri, 2005-08-05 at 10:28 +0200, Jan Engelhardt wrote:
> >The gfs2_disk_hash() function and the crc table on which it's based are a
> >part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> >unusual since gfs uses a hash table on-disk for its directory structure.
> >This header, including the hash function/table, must be included by user
> >space programs like fsck that want to decipher a fs, and any change to the
> >function or table would effectively make the fs corrupted.  Because of
> >this I think it's best for gfs to keep it's own copy as part of its ondisk
> >format spec.
> 
> Tune the spec to use kernel and libcrc32 tables and bump the version number 
> of 
> the spec to e.g. GFS 2.1. That way, things transform smoothly and could go 
> out 
> eventually at some later date.

afaik the tables aren't actually different. So no need to bump the spec!

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Jan Engelhardt

>The gfs2_disk_hash() function and the crc table on which it's based are a
>part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
>unusual since gfs uses a hash table on-disk for its directory structure.
>This header, including the hash function/table, must be included by user
>space programs like fsck that want to decipher a fs, and any change to the
>function or table would effectively make the fs corrupted.  Because of
>this I think it's best for gfs to keep it's own copy as part of its ondisk
>format spec.

Tune the spec to use kernel and libcrc32 tables and bump the version number of 
the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out 
eventually at some later date.



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Arjan van de Ven

On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
> > * +static const uint32_t crc_32_tab[] = .
> > why do you duplicate this? The kernel has a perfectly good set of
> > generic crc32 tables/functions just fine
> 
> The gfs2_disk_hash() function and the crc table on which it's based are a
> part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
> unusual since gfs uses a hash table on-disk for its directory structure.
> This header, including the hash function/table, must be included by user
> space programs like fsck that want to decipher a fs, and any change to the
> function or table would effectively make the fs corrupted.  Because of
> this I think it's best for gfs to keep it's own copy as part of its ondisk
> format spec.

for userspace there's libcrc32 as well. If it's *the* bog standard crc32
I don't see a reason why your "spec" can't just reference that instead.
And esp in the kernel you should just use the in kernel one not your own
regardless; you can assume the in kernel one is optimized and it also
keeps size down.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-05 Thread Mike Christie

Mike Christie wrote:
> David Teigland wrote:
> 
>>On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
>>
>>
>>>* Why are you using bufferheads extensively in a new filesystem?
>>
>>
>>bh's are used for metadata, the log, and journaled data which need to be
>>written at the block granularity, not page.
>>
> 
> 
> In a scsi tree
> http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary

oh yeah it is in -mm too.

> there is a function, bio_map_kern(), in fs.c that maps a buffer into a
> bio. It does not have to be page granularity. Can something like that be
> used in these places?
> 
> --
> Linux-cluster mailing list
> [EMAIL PROTECTED]
> http://www.redhat.com/mailman/listinfo/linux-cluster

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-05 Thread Mike Christie

David Teigland wrote:
> On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
> 
>>* Why are you using bufferheads extensively in a new filesystem?
> 
> 
> bh's are used for metadata, the log, and journaled data which need to be
> written at the block granularity, not page.
> 

In a scsi tree
http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary
there is a function, bio_map_kern(), in fs.c that maps a buffer into a
bio. It does not have to be page granularity. Can something like that be
used in these places?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * +static const uint32_t crc_32_tab[] = .
> why do you duplicate this? The kernel has a perfectly good set of
> generic crc32 tables/functions just fine

The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted.  Because of
this I think it's best for gfs to keep it's own copy as part of its ondisk
format spec.

> * Why are you using bufferheads extensively in a new filesystem?

bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.

> why do you use a rwsem and not a regular semaphore? You are aware that
> rwsems are far more expensive than regular ones right?  How skewed is
> the read/write ratio?

Aware, yes, it's the only rwsem in gfs.  Specific skew, no, we'll have to
measure that.

> * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800
> e why?

I'm not sure, actually, apart from the comments:

do_div: /* For ia32 we need to pull some tricks to get past various versions
   of the compiler which do not like us using do_div in the middle
   of large functions. */

do_mod: /* Side effect free 64 bit mod operation */

fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
perhaps this is an old problem that's now fixed?

> * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
> +unsigned int size)
> +{
> + int error;
> +
> + if (bh)
> + error = copy_to_user(*buf, bh->b_data + offset, size);
> + else
> + error = clear_user(*buf, size);
> 
> that looks to be missing a few kmaps.. whats the guarantee that b_data
> is actually, like in lowmem?

This is only used in the specific case of reading a journaled-data file.
That seems to effectively be the same as reading a buffer of fs metadata.

> The diaper device is a block device within gfs that gets transparently
> inserted between the real device the and rest of the filesystem.
> 
> h why not use device mapper or something? Is this really needed?

This is needed for the "withdraw" feature (described in the comment) which
is fairly important.  We'll see if dm could be used instead.

Thanks,
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

 * +static const uint32_t crc_32_tab[] = .
 why do you duplicate this? The kernel has a perfectly good set of
 generic crc32 tables/functions just fine

The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted.  Because of
this I think it's best for gfs to keep it's own copy as part of its ondisk
format spec.

 * Why are you using bufferheads extensively in a new filesystem?

bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.

 why do you use a rwsem and not a regular semaphore? You are aware that
 rwsems are far more expensive than regular ones right?  How skewed is
 the read/write ratio?

Aware, yes, it's the only rwsem in gfs.  Specific skew, no, we'll have to
measure that.

 * +++ b/fs/gfs2/fixed_div64.h 2005-08-01 14:13:08.009808200 +0800
 e why?

I'm not sure, actually, apart from the comments:

do_div: /* For ia32 we need to pull some tricks to get past various versions
   of the compiler which do not like us using do_div in the middle
   of large functions. */

do_mod: /* Side effect free 64 bit mod operation */

fs/xfs/linux-2.6/xfs_linux.h (the origin of this file) has the same thing,
perhaps this is an old problem that's now fixed?

 * int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
 +unsigned int size)
 +{
 + int error;
 +
 + if (bh)
 + error = copy_to_user(*buf, bh-b_data + offset, size);
 + else
 + error = clear_user(*buf, size);
 
 that looks to be missing a few kmaps.. whats the guarantee that b_data
 is actually, like in lowmem?

This is only used in the specific case of reading a journaled-data file.
That seems to effectively be the same as reading a buffer of fs metadata.

 The diaper device is a block device within gfs that gets transparently
 inserted between the real device the and rest of the filesystem.
 
 h why not use device mapper or something? Is this really needed?

This is needed for the withdraw feature (described in the comment) which
is fairly important.  We'll see if dm could be used instead.

Thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-05 Thread Mike Christie

David Teigland wrote:
 On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
 
* Why are you using bufferheads extensively in a new filesystem?
 
 
 bh's are used for metadata, the log, and journaled data which need to be
 written at the block granularity, not page.
 

In a scsi tree
http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary
there is a function, bio_map_kern(), in fs.c that maps a buffer into a
bio. It does not have to be page granularity. Can something like that be
used in these places?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Linux-cluster] Re: [PATCH 00/14] GFS

2005-08-05 Thread Mike Christie

Mike Christie wrote:
 David Teigland wrote:
 
On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:


* Why are you using bufferheads extensively in a new filesystem?


bh's are used for metadata, the log, and journaled data which need to be
written at the block granularity, not page.

 
 
 In a scsi tree
 http://kernel.org/git/?p=linux/kernel/git/jejb/scsi-block-2.6.git;a=summary

oh yeah it is in -mm too.

 there is a function, bio_map_kern(), in fs.c that maps a buffer into a
 bio. It does not have to be page granularity. Can something like that be
 used in these places?
 
 --
 Linux-cluster mailing list
 [EMAIL PROTECTED]
 http://www.redhat.com/mailman/listinfo/linux-cluster

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Arjan van de Ven

On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
 On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
 
  * +static const uint32_t crc_32_tab[] = .
  why do you duplicate this? The kernel has a perfectly good set of
  generic crc32 tables/functions just fine
 
 The gfs2_disk_hash() function and the crc table on which it's based are a
 part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
 unusual since gfs uses a hash table on-disk for its directory structure.
 This header, including the hash function/table, must be included by user
 space programs like fsck that want to decipher a fs, and any change to the
 function or table would effectively make the fs corrupted.  Because of
 this I think it's best for gfs to keep it's own copy as part of its ondisk
 format spec.

for userspace there's libcrc32 as well. If it's *the* bog standard crc32
I don't see a reason why your spec can't just reference that instead.
And esp in the kernel you should just use the in kernel one not your own
regardless; you can assume the in kernel one is optimized and it also
keeps size down.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Jan Engelhardt

The gfs2_disk_hash() function and the crc table on which it's based are a
part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
unusual since gfs uses a hash table on-disk for its directory structure.
This header, including the hash function/table, must be included by user
space programs like fsck that want to decipher a fs, and any change to the
function or table would effectively make the fs corrupted.  Because of
this I think it's best for gfs to keep it's own copy as part of its ondisk
format spec.

Tune the spec to use kernel and libcrc32 tables and bump the version number of 
the spec to e.g. GFS 2.1. That way, things transform smoothly and could go out 
eventually at some later date.



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Arjan van de Ven

On Fri, 2005-08-05 at 10:28 +0200, Jan Engelhardt wrote:
 The gfs2_disk_hash() function and the crc table on which it's based are a
 part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
 unusual since gfs uses a hash table on-disk for its directory structure.
 This header, including the hash function/table, must be included by user
 space programs like fsck that want to decipher a fs, and any change to the
 function or table would effectively make the fs corrupted.  Because of
 this I think it's best for gfs to keep it's own copy as part of its ondisk
 format spec.
 
 Tune the spec to use kernel and libcrc32 tables and bump the version number 
 of 
 the spec to e.g. GFS 2.1. That way, things transform smoothly and could go 
 out 
 eventually at some later date.

afaik the tables aren't actually different. So no need to bump the spec!

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Fri, Aug 05, 2005 at 09:34:38AM +0200, Arjan van de Ven wrote:
 On Fri, 2005-08-05 at 15:14 +0800, David Teigland wrote:
  On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:
  
   * +static const uint32_t crc_32_tab[] = .
   why do you duplicate this? The kernel has a perfectly good set of
   generic crc32 tables/functions just fine
  
  The gfs2_disk_hash() function and the crc table on which it's based are a
  part of gfs2_ondisk.h: the ondisk metadata specification.  This is a bit
  unusual since gfs uses a hash table on-disk for its directory structure.
  This header, including the hash function/table, must be included by user
  space programs like fsck that want to decipher a fs, and any change to the
  function or table would effectively make the fs corrupted.  Because of
  this I think it's best for gfs to keep it's own copy as part of its ondisk
  format spec.
 
 for userspace there's libcrc32 as well. If it's *the* bog standard crc32
 I don't see a reason why your spec can't just reference that instead.
 And esp in the kernel you should just use the in kernel one not your own
 regardless; you can assume the in kernel one is optimized and it also
 keeps size down.

linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
This looks like a standard that's not going to change, as you've said, so
including crc32table.h and getting rid of our own table would work fine.

Do we go a step beyond this and use say the crc32() function from
linux/crc32.h?  Is this _function_ as standard and unchanging as the table
of crcs?  In my tests it doesn't produce the same results as our
gfs2_disk_hash() function, even with both using the same crc table.  I
don't mind adopting a new function and just writing a user space
equivalent for the tools if it's a fixed standard.

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread Jörn Engel

On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
 
 linux/lib/crc32table.h : crc32table_le[] is the same as our crc_32_tab[].
 This looks like a standard that's not going to change, as you've said, so
 including crc32table.h and getting rid of our own table would work fine.
 
 Do we go a step beyond this and use say the crc32() function from
 linux/crc32.h?  Is this _function_ as standard and unchanging as the table
 of crcs?  In my tests it doesn't produce the same results as our
 gfs2_disk_hash() function, even with both using the same crc table.  I
 don't mind adopting a new function and just writing a user space
 equivalent for the tools if it's a fixed standard.

The function is basically set in stone.  Variants exists depending on
how it is called.  I know of four variants, but there may be more:

1. Initial value is 0
2. Initial value is 0x
a) Result is taken as-is
b) Result is XORed with 0x

Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
function or something similar?

Jörn

-- 
And spam is a useful source of entropy for /dev/random too!
-- Jasmine Strong
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-05 Thread David Teigland

On Fri, Aug 05, 2005 at 12:07:50PM +0200, J?rn Engel wrote:
 On Fri, 5 August 2005 17:44:52 +0800, David Teigland wrote:
  Do we go a step beyond this and use say the crc32() function from
  linux/crc32.h?  Is this _function_ as standard and unchanging as the table
  of crcs?  In my tests it doesn't produce the same results as our
  gfs2_disk_hash() function, even with both using the same crc table.  I
  don't mind adopting a new function and just writing a user space
  equivalent for the tools if it's a fixed standard.
 
 The function is basically set in stone.  Variants exists depending on
 how it is called.  I know of four variants, but there may be more:
 
 1. Initial value is 0
 2. Initial value is 0x
 a) Result is taken as-is
 b) Result is XORed with 0x
 
 Maybe your code implements 1a, while you tried 2b with the lib/crc32.c
 function or something similar?

You're right, initial value 0x and xor result with 0x
matches the results from our function.  Great, we can get rid of
gfs2_disk_hash() and use crc32() directly.
Thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Mark Fasheh

On Wed, Aug 03, 2005 at 12:37:44PM +0200, Lars Marowsky-Bree wrote:
> On 2005-08-03T11:56:18, David Teigland <[EMAIL PROTECTED]> wrote:
> 
> > > * Why use your own journalling layer and not say ... jbd ?
> > Here's an analysis of three approaches to cluster-fs journaling and their
> > pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> 
> Very instructive read, thanks for the link.
While it may be true that for a full log, flushing for a *single* lock may
be more expensive in OCFS2, Ken ignores the fact that in our one big flush
we've made all locks on journalled resources immediately releasable.
According to that description, GFS2 would have to do a seperate transaction
flush (including the extra step of writing revoke records) for each lock
protecting a journalled resource. Assuming the same number of locks are
required to be dropped under both systems then for a number of locks > 1
OCFS2 will actually do less work - the actual metadata blocks would be the
same on either end, but JBD only has to write that the journal is now clean
to the journal superblock whereas GFS2 has to revoke the blocks for each
dropped lock.

Of course all of this talk completely avoids the fact that in any case these
things are expensive so a cluster file system has to take care to ping locks
as little as possible. OCFS2 takes great pains to make as many operations
node local (requiring no cluster locks) as possible - data allocation is
usually done from a node local pool which is refreshed from the main bitmap.
Deallocation happens similarly - we have a truncate log in which we record
deleted clusters. Each node has their own inode and metadata chain
allocators which another node will only lock for delete (a truncate log
style local metadata delete log could easily be added if that ever became a
problem).
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Andi Kleen

David Teigland <[EMAIL PROTECTED]> writes:

> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

I would suggest to not merge this before not all the code has not been
reviewed by some experienced linux developer from outside the GFS team.

-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Lars Marowsky-Bree

On 2005-08-03T11:56:18, David Teigland <[EMAIL PROTECTED]> wrote:

> > * Why use your own journalling layer and not say ... jbd ?
> Here's an analysis of three approaches to cluster-fs journaling and their
> pros/cons (including using jbd):  http://tinyurl.com/7sbqq

Very instructive read, thanks for the link.



-- 
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
"Ignorance more frequently begets confidence than does knowledge"

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread David Teigland

On Wed, Aug 03, 2005 at 11:17:09AM +0200, Arjan van de Ven wrote:
> On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
> > The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
> > on-disk instead of LE which is another useful way to verify endian
> > correctness.
> 
> that sounds wrong to be a compile option. If you really want to deal
> with dual disk endianness it really ought to be a runtime one (see jffs2
> for example).

We don't want BE to be an "option" per se; as developers we'd just like to
be able to compile it that way to verify gfs's endianness handling.  If
you think that's unmaintainable or a bad idea we'll rip it out.

> > > * +   while (!kthread_should_stop()) {
> > > + gfs2_scand_internal(sdp);
> > > +
> > > + set_current_state(TASK_INTERRUPTIBLE);
> > > + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> > > 
> > > you probably really want to check for signals if you do
> > > interruptible sleeps
> > 
> > I don't know why we'd be interested in signals here.
> 
> well.. because if you don't your schedule_timeout becomes a nop when you
> get one, which makes your loop a busy waiting one.

OK, it looks like we need to block/flush signals a la daemonize(); I guess
I mistakenly figured the kthread routines did everything daemonize did.
Thanks,
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Arjan van de Ven

On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
> The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
> on-disk instead of LE which is another useful way to verify endian
> correctness.

that sounds wrong to be a compile option. If you really want to deal
with dual disk endianness it really ought to be a runtime one (see jffs2
for example).



> > * Why use your own journalling layer and not say ... jbd ?
> 
> Here's an analysis of three approaches to cluster-fs journaling and their
> pros/cons (including using jbd):  http://tinyurl.com/7sbqq
> 
> > * + while (!kthread_should_stop()) {
> > +   gfs2_scand_internal(sdp);
> > +
> > +   set_current_state(TASK_INTERRUPTIBLE);
> > +   schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> > +   }
> > 
> > you probably really want to check for signals if you do interruptible sleeps
> 
> I don't know why we'd be interested in signals here.

well.. because if you don't your schedule_timeout becomes a nop when you
get one, which makes your loop a busy waiting one.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Arjan van de Ven


> I don't know anything about GFS, but expecting a filesystem author to
> use a journaling layer he does not want to is a bit arrogant.

good that I didn't expect that then.
I think it's fair enough to ask people if they can use it. If the answer
is "No because it doesn't fit our model " then that's fine. If the
answer is "eh yeah we could" then I think it's entirely reasonable to
expect people to use common code as opposed to adding new code.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Pekka Enberg

Hi David,

Some more comments below.

Pekka

On 8/2/05, David Teigland <[EMAIL PROTECTED]> wrote:
> +/**
> + * inode_create - create a struct gfs2_inode
> + * @i_gl: The glock covering the inode
> + * @inum: The inode number
> + * @io_gl: the iopen glock to acquire/hold (using holder in new gfs2_inode)
> + * @io_state: the state the iopen glock should be acquired in
> + * @ipp: pointer to put the returned inode in
> + *
> + * Returns: errno
> + */
> +
> +static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum,
> + struct gfs2_glock *io_gl, unsigned int io_state,
> + struct gfs2_inode **ipp)
> +{
> + struct gfs2_sbd *sdp = i_gl->gl_sbd;
> + struct gfs2_inode *ip;
> + int error = 0;
> +
> + RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip);

Why do you want to do this? The callers can handle ENOMEM just fine.

> +/**
> + * gfs2_random - Generate a random 32-bit number
> + *
> + * Generate a semi-crappy 32-bit pseudo-random number without using
> + * floating point.
> + *
> + * The PRNG is from "Numerical Recipes in C" (second edition), page 284.
> + *
> + * Returns: a 32-bit random number
> + */
> +
> +uint32_t gfs2_random(void)
> +{
> + gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F;
> + return gfs2_random_number;
> +}

Please consider moving this into lib/random.c. This one already appears in
drivers/net/hamradio/dmascc.c.

> +/**
> + * gfs2_hash - hash an array of data
> + * @data: the data to be hashed
> + * @len: the length of data to be hashed
> + *
> + * Take some data and convert it to a 32-bit hash.
> + *
> + * This is the 32-bit FNV-1a hash from:
> + * http://www.isthe.com/chongo/tech/comp/fnv/
> + *
> + * Returns: the hash
> + */
> +
> +uint32_t gfs2_hash(const void *data, unsigned int len)
> +{
> + uint32_t h = 0x811C9DC5;
> + h = hash_more_internal(data, len, h);
> + return h;
> +}

Is there a reason why you cannot use  or ?

> +void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
> +int (*compar) (const void *, const void *))
> +{
> + register char *pbase = (char *)base;
> + int i, j, k, h;
> + static int cols[16] = {1391376, 463792, 198768, 86961,
> +33936, 13776, 4592, 1968,
> +861, 336, 112, 48,
> +21, 7, 3, 1};
> +
> + for (k = 0; k < 16; k++) {
> + h = cols[k];
> + for (i = h; i < num_elem; i++) {
> + j = i;
> + while (j >= h &&
> +(*compar)((void *)(pbase + size * (j - h)),
> +  (void *)(pbase + size * j)) > 0) {
> + SWAP(pbase + size * j,
> +  pbase + size * (j - h),
> +  size);
> + j = j - h;
> + }
> + }
> + }
> +}

Please use sort() from lib/sort.c.

> +/**
> + * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw
> + * @ip:
> + * @function:
> + * @file:
> + * @line:

Please drop empty kerneldoc tags. (Appears in various other places as well.)

> +#define RETRY_MALLOC(do_this, until_this) \
> +for (;;) { \
> + { do_this; } \
> + if (until_this) \
> + break; \
> + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
> + printk("GFS2: out of memory: %s, %u\n", __FILE__, __LINE__); \
> + gfs2_malloc_warning = jiffies; \
> + } \
> + yield(); \
> +}

Please drop this.

> +int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip)
> +{
> + struct gfs2_sbd *sdp = dip->i_sbd;
> + struct posix_acl *acl = NULL;
> + struct gfs2_ea_request er;
> + mode_t mode = ip->i_di.di_mode;
> + int error;
> +
> + if (!sdp->sd_args.ar_posix_acl)
> + return 0;
> + if (S_ISLNK(ip->i_di.di_mode))
> + return 0;
> +
> + memset(, 0, sizeof(struct gfs2_ea_request));
> + er.er_type = GFS2_EATYPE_SYS;
> +
> + error = acl_get(dip, ACL_DEFAULT, , NULL,
> + _data, _data_len);
> + if (error)
> + return error;
> + if (!acl) {
> + mode &= ~current->fs->umask;
> + if (mode != ip->i_di.di_mode)
> + error = munge_mode(ip, mode);
> + return error;
> + }
> +
> + {
> + struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
> + error = -ENOMEM;
> + if (!clone)
> + goto out;
> + gfs2_memory_add(clone);
> + gfs2_memory_rm(acl);
> + posix_acl_release(acl);
> + acl = clone;
> + }

Please make this a real function. It is duplicated below.

> + if (error > 0) {
> + er.er_name =

Re: [PATCH 00/14] GFS

2005-08-03 Thread Jan Engelhardt


>> > because reiser got merged before jbd. Next question.
>>
>> That is the wrong reason.  We use our own journaling layer for the
>> reason that Vivaldi used his own melody.
>> 
>> [...] He might want to look at how reiser4 does wandering
>> logs instead of using jbd. but I would never claim that for sure
>> some other author should be expected to use it.  and something like
>> changing one's journaling system is not something to do just before a
>> merge.
>
> Do you see my point here?  If every person who added new kernel code
> just wrote their own thing without checking to see if it had already
> been done before, then there would be a lot of poorly maintained code
> in the kernel.  If a journalling layer already exists, _new_ journaled
> filesystems should either (A) use the layer as is, or (B) fix the layer
> so it has sufficient functionality for them to use, and submit patches.

Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be 
'eleet enough' for reiser* would overwhelm ext.

Lastly, there is the 'political' thing, when a -only 
specific change to jbd is rejected by all other jbd-using fs. (Basically the 
situation thing that leads to software forks, in any area.)



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread David Teigland

On Tue, Aug 02, 2005 at 01:16:53PM +0300, Pekka Enberg wrote:

> > +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
> > + unsigned int line)
> > +{
> > +   void *x;
> > +   for (;;) {
> > +   x = kmalloc(size, flags);
> > +   if (x)
> > +   return x;
> > +   if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
> > +   printk("GFS2: out of memory: %s, %u\n",
> > +  __FILE__, __LINE__);
> > +   gfs2_malloc_warning = jiffies;
> > +   }
> > +   yield();
> 
> This does not belong in a filesystem. It also seems like a very bad
> idea. What are you trying to do here? If you absolutely must not fail,
> use __GFP_NOFAIL instead.

will do, carried over from before NOFAIL existed

> -mm has memory leak detection patches and there are others floating
> around. Please do not introduce yet another subsystem-specific debug
> allocator.

ok, thanks
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread David Teigland

On Tue, Aug 02, 2005 at 01:16:53PM +0300, Pekka Enberg wrote:

  +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
  + unsigned int line)
  +{
  +   void *x;
  +   for (;;) {
  +   x = kmalloc(size, flags);
  +   if (x)
  +   return x;
  +   if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
  +   printk(GFS2: out of memory: %s, %u\n,
  +  __FILE__, __LINE__);
  +   gfs2_malloc_warning = jiffies;
  +   }
  +   yield();
 
 This does not belong in a filesystem. It also seems like a very bad
 idea. What are you trying to do here? If you absolutely must not fail,
 use __GFP_NOFAIL instead.

will do, carried over from before NOFAIL existed

 -mm has memory leak detection patches and there are others floating
 around. Please do not introduce yet another subsystem-specific debug
 allocator.

ok, thanks
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Jan Engelhardt


  because reiser got merged before jbd. Next question.

 That is the wrong reason.  We use our own journaling layer for the
 reason that Vivaldi used his own melody.
 
 [...] He might want to look at how reiser4 does wandering
 logs instead of using jbd. but I would never claim that for sure
 some other author should be expected to use it.  and something like
 changing one's journaling system is not something to do just before a
 merge.

 Do you see my point here?  If every person who added new kernel code
 just wrote their own thing without checking to see if it had already
 been done before, then there would be a lot of poorly maintained code
 in the kernel.  If a journalling layer already exists, _new_ journaled
 filesystems should either (A) use the layer as is, or (B) fix the layer
 so it has sufficient functionality for them to use, and submit patches.

Maybe jbd 'sucks' for something 'cool' like reiser*, and modifying jbd to be 
'eleet enough' for reiser* would overwhelm ext.

Lastly, there is the 'political' thing, when a your-favorite-jbd-fs-only 
specific change to jbd is rejected by all other jbd-using fs. (Basically the 
situation thing that leads to software forks, in any area.)



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Pekka Enberg

Hi David,

Some more comments below.

Pekka

On 8/2/05, David Teigland [EMAIL PROTECTED] wrote:
 +/**
 + * inode_create - create a struct gfs2_inode
 + * @i_gl: The glock covering the inode
 + * @inum: The inode number
 + * @io_gl: the iopen glock to acquire/hold (using holder in new gfs2_inode)
 + * @io_state: the state the iopen glock should be acquired in
 + * @ipp: pointer to put the returned inode in
 + *
 + * Returns: errno
 + */
 +
 +static int inode_create(struct gfs2_glock *i_gl, struct gfs2_inum *inum,
 + struct gfs2_glock *io_gl, unsigned int io_state,
 + struct gfs2_inode **ipp)
 +{
 + struct gfs2_sbd *sdp = i_gl-gl_sbd;
 + struct gfs2_inode *ip;
 + int error = 0;
 +
 + RETRY_MALLOC(ip = kmem_cache_alloc(gfs2_inode_cachep, GFP_KERNEL), ip);

Why do you want to do this? The callers can handle ENOMEM just fine.

 +/**
 + * gfs2_random - Generate a random 32-bit number
 + *
 + * Generate a semi-crappy 32-bit pseudo-random number without using
 + * floating point.
 + *
 + * The PRNG is from Numerical Recipes in C (second edition), page 284.
 + *
 + * Returns: a 32-bit random number
 + */
 +
 +uint32_t gfs2_random(void)
 +{
 + gfs2_random_number = 0x0019660D * gfs2_random_number + 0x3C6EF35F;
 + return gfs2_random_number;
 +}

Please consider moving this into lib/random.c. This one already appears in
drivers/net/hamradio/dmascc.c.

 +/**
 + * gfs2_hash - hash an array of data
 + * @data: the data to be hashed
 + * @len: the length of data to be hashed
 + *
 + * Take some data and convert it to a 32-bit hash.
 + *
 + * This is the 32-bit FNV-1a hash from:
 + * http://www.isthe.com/chongo/tech/comp/fnv/
 + *
 + * Returns: the hash
 + */
 +
 +uint32_t gfs2_hash(const void *data, unsigned int len)
 +{
 + uint32_t h = 0x811C9DC5;
 + h = hash_more_internal(data, len, h);
 + return h;
 +}

Is there a reason why you cannot use linux/hash.h or linux/jhash.h?

 +void gfs2_sort(void *base, unsigned int num_elem, unsigned int size,
 +int (*compar) (const void *, const void *))
 +{
 + register char *pbase = (char *)base;
 + int i, j, k, h;
 + static int cols[16] = {1391376, 463792, 198768, 86961,
 +33936, 13776, 4592, 1968,
 +861, 336, 112, 48,
 +21, 7, 3, 1};
 +
 + for (k = 0; k  16; k++) {
 + h = cols[k];
 + for (i = h; i  num_elem; i++) {
 + j = i;
 + while (j = h 
 +(*compar)((void *)(pbase + size * (j - h)),
 +  (void *)(pbase + size * j))  0) {
 + SWAP(pbase + size * j,
 +  pbase + size * (j - h),
 +  size);
 + j = j - h;
 + }
 + }
 + }
 +}

Please use sort() from lib/sort.c.

 +/**
 + * gfs2_io_error_inode_i - Flag an inode I/O error and withdraw
 + * @ip:
 + * @function:
 + * @file:
 + * @line:

Please drop empty kerneldoc tags. (Appears in various other places as well.)

 +#define RETRY_MALLOC(do_this, until_this) \
 +for (;;) { \
 + { do_this; } \
 + if (until_this) \
 + break; \
 + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) { \
 + printk(GFS2: out of memory: %s, %u\n, __FILE__, __LINE__); \
 + gfs2_malloc_warning = jiffies; \
 + } \
 + yield(); \
 +}

Please drop this.

 +int gfs2_acl_create(struct gfs2_inode *dip, struct gfs2_inode *ip)
 +{
 + struct gfs2_sbd *sdp = dip-i_sbd;
 + struct posix_acl *acl = NULL;
 + struct gfs2_ea_request er;
 + mode_t mode = ip-i_di.di_mode;
 + int error;
 +
 + if (!sdp-sd_args.ar_posix_acl)
 + return 0;
 + if (S_ISLNK(ip-i_di.di_mode))
 + return 0;
 +
 + memset(er, 0, sizeof(struct gfs2_ea_request));
 + er.er_type = GFS2_EATYPE_SYS;
 +
 + error = acl_get(dip, ACL_DEFAULT, acl, NULL,
 + er.er_data, er.er_data_len);
 + if (error)
 + return error;
 + if (!acl) {
 + mode = ~current-fs-umask;
 + if (mode != ip-i_di.di_mode)
 + error = munge_mode(ip, mode);
 + return error;
 + }
 +
 + {
 + struct posix_acl *clone = posix_acl_clone(acl, GFP_KERNEL);
 + error = -ENOMEM;
 + if (!clone)
 + goto out;
 + gfs2_memory_add(clone);
 + gfs2_memory_rm(acl);
 + posix_acl_release(acl);
 + acl = clone;
 + }

Please make this a real function. It is duplicated below.

 + if (error  0) {
 + er.er_name = GFS2_POSIX_ACL_ACCESS;
 + er.er_name_len = GFS2_POSIX_ACL_ACCESS_LEN;
 + posix_acl_to_xattr(acl, er.er_data,

Re: [PATCH 00/14] GFS

2005-08-03 Thread Arjan van de Ven


 I don't know anything about GFS, but expecting a filesystem author to
 use a journaling layer he does not want to is a bit arrogant.

good that I didn't expect that then.
I think it's fair enough to ask people if they can use it. If the answer
is No because it doesn't fit our model here then that's fine. If the
answer is eh yeah we could then I think it's entirely reasonable to
expect people to use common code as opposed to adding new code.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Arjan van de Ven

On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
 The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
 on-disk instead of LE which is another useful way to verify endian
 correctness.

that sounds wrong to be a compile option. If you really want to deal
with dual disk endianness it really ought to be a runtime one (see jffs2
for example).



  * Why use your own journalling layer and not say ... jbd ?
 
 Here's an analysis of three approaches to cluster-fs journaling and their
 pros/cons (including using jbd):  http://tinyurl.com/7sbqq
 
  * + while (!kthread_should_stop()) {
  +   gfs2_scand_internal(sdp);
  +
  +   set_current_state(TASK_INTERRUPTIBLE);
  +   schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
  +   }
  
  you probably really want to check for signals if you do interruptible sleeps
 
 I don't know why we'd be interested in signals here.

well.. because if you don't your schedule_timeout becomes a nop when you
get one, which makes your loop a busy waiting one.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread David Teigland

On Wed, Aug 03, 2005 at 11:17:09AM +0200, Arjan van de Ven wrote:
 On Wed, 2005-08-03 at 11:56 +0800, David Teigland wrote:
  The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
  on-disk instead of LE which is another useful way to verify endian
  correctness.
 
 that sounds wrong to be a compile option. If you really want to deal
 with dual disk endianness it really ought to be a runtime one (see jffs2
 for example).

We don't want BE to be an option per se; as developers we'd just like to
be able to compile it that way to verify gfs's endianness handling.  If
you think that's unmaintainable or a bad idea we'll rip it out.

   * +   while (!kthread_should_stop()) {
   + gfs2_scand_internal(sdp);
   +
   + set_current_state(TASK_INTERRUPTIBLE);
   + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
   
   you probably really want to check for signals if you do
   interruptible sleeps
  
  I don't know why we'd be interested in signals here.
 
 well.. because if you don't your schedule_timeout becomes a nop when you
 get one, which makes your loop a busy waiting one.

OK, it looks like we need to block/flush signals a la daemonize(); I guess
I mistakenly figured the kthread routines did everything daemonize did.
Thanks,
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Lars Marowsky-Bree

On 2005-08-03T11:56:18, David Teigland [EMAIL PROTECTED] wrote:

  * Why use your own journalling layer and not say ... jbd ?
 Here's an analysis of three approaches to cluster-fs journaling and their
 pros/cons (including using jbd):  http://tinyurl.com/7sbqq

Very instructive read, thanks for the link.



-- 
High Availability  Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business -- Charles Darwin
Ignorance more frequently begets confidence than does knowledge

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Andi Kleen

David Teigland [EMAIL PROTECTED] writes:

 Hi, GFS (Global File System) is a cluster file system that we'd like to
 see added to the kernel.  The 14 patches total about 900K so I won't send
 them to the list unless that's requested.  Comments and suggestions are
 welcome.  Thanks
 
 http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
 http://redhat.com/~teigland/gfs2/20050801/broken-out/

I would suggest to not merge this before not all the code has not been
reviewed by some experienced linux developer from outside the GFS team.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-03 Thread Mark Fasheh

On Wed, Aug 03, 2005 at 12:37:44PM +0200, Lars Marowsky-Bree wrote:
 On 2005-08-03T11:56:18, David Teigland [EMAIL PROTECTED] wrote:
 
   * Why use your own journalling layer and not say ... jbd ?
  Here's an analysis of three approaches to cluster-fs journaling and their
  pros/cons (including using jbd):  http://tinyurl.com/7sbqq
 
 Very instructive read, thanks for the link.
While it may be true that for a full log, flushing for a *single* lock may
be more expensive in OCFS2, Ken ignores the fact that in our one big flush
we've made all locks on journalled resources immediately releasable.
According to that description, GFS2 would have to do a seperate transaction
flush (including the extra step of writing revoke records) for each lock
protecting a journalled resource. Assuming the same number of locks are
required to be dropped under both systems then for a number of locks  1
OCFS2 will actually do less work - the actual metadata blocks would be the
same on either end, but JBD only has to write that the journal is now clean
to the journal superblock whereas GFS2 has to revoke the blocks for each
dropped lock.

Of course all of this talk completely avoids the fact that in any case these
things are expensive so a cluster file system has to take care to ping locks
as little as possible. OCFS2 takes great pains to make as many operations
node local (requiring no cluster locks) as possible - data allocation is
usually done from a node local pool which is refreshed from the main bitmap.
Deallocation happens similarly - we have a truncate log in which we record
deleted clusters. Each node has their own inode and metadata chain
allocators which another node will only lock for delete (a truncate log
style local metadata delete log could easily be added if that ever became a
problem).
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Kyle Moffett


On Aug 2, 2005, at 21:00:02, Hans Reiser wrote:

Arjan van de Ven wrote:

because reiser got merged before jbd. Next question.

That is the wrong reason.  We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.  Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment.  He might want to look at how reiser4 does wandering
logs instead of using jbd. but I would never claim that for sure
some other author should be expected to use it.  and something  
like

changing one's journaling system is not something to do just before a
merge.


I don't want to start another big reiser4 flamewar, but...

"I don't know anything about Reiser4, but expecting a filesystem author
to use a VFS layer he does not want to is a bit arrogant.  Now, if you
got into details, and said the linux VFS does X, Y, and Z, and Reiser4
does..."

Do you see my point here?  If every person who added new kernel code
just wrote their own thing without checking to see if it had already
been done before, then there would be a lot of poorly maintained code
in the kernel.  If a journalling layer already exists, _new_ journaled
filesystems should either (A) use the layer as is, or (B) fix the layer
so it has sufficient functionality for them to use, and submit patches.
That way if somebody later says, "Ah, crap, there's a bug in the kernel
journalling layer", and fixes it, there are not eight other filesystems
with their own open-coded layers that need to be audited for similar
mistakes.

This is similar to why some kernel developers did not like the Reiser4
code, because it implemented some private layers that looked kinda like
stuff the VFS should be doing  (Again, I don't want to get into that
argument again, I'm just bringing up the similarities to clarify _this_
particular point, as that one has been beaten to death enough already).

Now the question for GFS is still a valid one; there might be  
reasons to

not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used  
by 2

filesystems in -mm already).


Personally, I am of the opinion that if GFS cannot use jdb, the  
developers

ought to clarify why it isn't useable, and possibly submit fixes to make
it useful, so that others can share the benefits.

Cheers,
Kyle Moffett

--
I lost interest in "blade servers" when I found they didn't throw  
knives at

people who weren't supposed to be in your machine room.
  -- Anthony de Boer


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

> * The on disk structures are defined in terms of uint32_t and friends,
> which are NOT endian neutral. Why are they not le32/be32 and thus
> endian-defined? Did you run bitwise-sparse on GFS yet ?

GFS has had proper endian handling for many years, it's still correct as
far as we've been able to test.  I ran bitwise-sparse yesterday and didn't
find anything alarming.

> * None of your on disk structures are packet. Are you sure?

Quite, particular attention has been paid to aligning the structure
fields, you'll find "pad" fields throughout.  We'll write a quick test to
verify that packing doesn't change anything.

> +#define gfs2_16_to_cpu be16_to_cpu
> +#define gfs2_32_to_cpu be32_to_cpu
> +#define gfs2_64_to_cpu be64_to_cpu
> 
> why this pointless abstracting?

#ifdef GFS2_ENDIAN_BIG

#define gfs2_16_to_cpu be16_to_cpu
#define gfs2_32_to_cpu be32_to_cpu
#define gfs2_64_to_cpu be64_to_cpu

#define cpu_to_gfs2_16 cpu_to_be16
#define cpu_to_gfs2_32 cpu_to_be32
#define cpu_to_gfs2_64 cpu_to_be64

#else /* GFS2_ENDIAN_BIG */

#define gfs2_16_to_cpu le16_to_cpu
#define gfs2_32_to_cpu le32_to_cpu
#define gfs2_64_to_cpu le64_to_cpu

#define cpu_to_gfs2_16 cpu_to_le16
#define cpu_to_gfs2_32 cpu_to_le32
#define cpu_to_gfs2_64 cpu_to_le64

#endif /* GFS2_ENDIAN_BIG */

The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.

You should be able to use gfs in mixed architecture and mixed endian
clusters.  We don't have a mixed endian cluster to test, though.

> * +static const uint32_t crc_32_tab[] = .
> why do you duplicate this? The kernel has a perfectly good set of generic
> crc32 tables/functions just fine

We'll try them, they'll probably do fine.

> * Why use your own journalling layer and not say ... jbd ?

Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd):  http://tinyurl.com/7sbqq

> * +   while (!kthread_should_stop()) {
> + gfs2_scand_internal(sdp);
> +
> + set_current_state(TASK_INTERRUPTIBLE);
> + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
> + }
> 
> you probably really want to check for signals if you do interruptible sleeps

I don't know why we'd be interested in signals here.

> * why not use msleep() and friends instead of schedule_timeout(), you're
> not using the complex variants anyway

When unmounting we really appreciate waking up more often than the
timeout, otherwise the unmount sits and waits for the longest daemon's
msleep to complete.  I converted this to msleep recently but it was too
painful and had to go back.

We'll get to your other comments, thanks.
Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Hans Reiser

Arjan van de Ven wrote:

>On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
>  
>
>>>* Why use your own journalling layer and not say ... jbd ?
>>>  
>>>
>>Why does reiser use its own journalling layer and not say ... jbd ?
>>
>>
>
>because reiser got merged before jbd. Next question.
>  
>
That is the wrong reason.  We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.  Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment.  He might want to look at how reiser4 does wandering
logs instead of using jbd. but I would never claim that for sure
some other author should be expected to use it.  and something like
changing one's journaling system is not something to do just before a
merge.

>Now the question for GFS is still a valid one; there might be reasons to
>not use it (which is fair enough) but if there's no real reason then
>using jdb sounds a lot better given it's maturity (and it is used by 2
>filesystems in -mm already).
>
>
>
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [EMAIL PROTECTED]
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at  http://www.tux.org/lkml/
>
>
>  
>

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Arjan van de Ven

On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
> >* Why use your own journalling layer and not say ... jbd ?
> 
> Why does reiser use its own journalling layer and not say ... jbd ?

because reiser got merged before jbd. Next question.

Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used by 2
filesystems in -mm already).

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Jan Engelhardt


>* Why use your own journalling layer and not say ... jbd ?

Why does reiser use its own journalling layer and not say ... jbd ?



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Pekka Enberg

Hi David,

On 8/2/05, David Teigland <[EMAIL PROTECTED]> wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks

> +#define kmalloc_nofail(size, flags) \
> + gmalloc_nofail((size), (flags), __FILE__, __LINE__)

[snip]

> +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
> +   unsigned int line)
> +{
> + void *x;
> + for (;;) {
> + x = kmalloc(size, flags);
> + if (x)
> + return x;
> + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
> + printk("GFS2: out of memory: %s, %u\n",
> +__FILE__, __LINE__);
> + gfs2_malloc_warning = jiffies;
> + }
> + yield();

This does not belong in a filesystem. It also seems like a very bad
idea. What are you trying to do here? If you absolutely must not fail,
use __GFP_NOFAIL instead.

> + }
> +}
> +
> +#if defined(GFS2_MEMORY_SIMPLE)
> +
> +atomic_t gfs2_memory_count;
> +
> +void gfs2_memory_add_i(void *data, char *file, unsigned int line)
> +{
> + atomic_inc(_memory_count);
> +}
> +
> +void gfs2_memory_rm_i(void *data, char *file, unsigned int line)
> +{
> + if (data)
> + atomic_dec(_memory_count);
> +}
> +
> +void *gmalloc(unsigned int size, int flags, char *file, unsigned int line)
> +{
> + void *data = kmalloc(size, flags);
> + if (data)
> + atomic_inc(_memory_count);
> + return data;
> +}
> +
> +void *gmalloc_nofail(unsigned int size, int flags, char *file,
> +  unsigned int line)
> +{
> + atomic_inc(_memory_count);
> + return gmalloc_nofail_real(size, flags, file, line);
> +}
> +
> +void gfree(void *data, char *file, unsigned int line)
> +{
> + if (data) {
> + atomic_dec(_memory_count);
> + kfree(data);
> + }
> +}

-mm has memory leak detection patches and there are others floating
around. Please do not introduce yet another subsystem-specific debug allocator.

Pekka
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Arjan van de Ven

On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
> Hi, GFS (Global File System) is a cluster file system that we'd like to
> see added to the kernel.  The 14 patches total about 900K so I won't send
> them to the list unless that's requested.  Comments and suggestions are
> welcome.  Thanks
> 
> http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050801/broken-out/

* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet ?

* None of your on disk structures are packet. Are you sure?

* 
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu

why this pointless abstracting?

* +static const uint32_t crc_32_tab[] = .

why do you duplicate this? The kernel has a perfectly good set of generic crc32 
tables/functions just fine

* Why are you using bufferheads extensively in a new filesystem?

* + if (create)
+   down_write(>i_rw_mutex);
+   else
+   down_read(>i_rw_mutex);

why do you use a rwsem and not a regular semaphore? You are aware that rwsems 
are far more expensive than regular ones right?
How skewed is the read/write ratio?

* Why use your own journalling layer and not say ... jbd ?

* + while (!kthread_should_stop()) {
+   gfs2_scand_internal(sdp);
+
+   set_current_state(TASK_INTERRUPTIBLE);
+   schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+   }

you probably really want to check for signals if you do interruptible sleeps
(multiple places)

* why not use msleep() and friends instead of schedule_timeout(), you're not 
using the complex variants anyway

* +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800

e why?

* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+  unsigned int size)
+{
+   int error;
+
+   if (bh)
+   error = copy_to_user(*buf, bh->b_data + offset, size);
+   else
+   error = clear_user(*buf, size);

that looks to be missing a few kmaps.. whats the guarantee that b_data is 
actually, like in lowmem?

* [PATCH 08/14] GFS: diaper device

The diaper device is a block device within gfs that gets transparently
inserted between the real device the and rest of the filesystem.

h why not use device mapper or something? Is this really needed? Should it 
live in drivers/block ? Doesn't
this wrapper just increase the risk for memory deadlocks?

* [PATCH 06/14] GFS: logging and recovery

quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors 
agree to license their stuff under the GPL?

* do_lock_wait

that almost screams for using wait_event and related APIs

*
+static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
+{
+   spin_lock(>sd_log_lock);
+}
why the abstraction ?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 00/14] GFS

2005-08-02 Thread David Teigland

Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel.  The 14 patches total about 900K so I won't send
them to the list unless that's requested.  Comments and suggestions are
welcome.  Thanks

http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/

Dave

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 00/14] GFS

2005-08-02 Thread David Teigland

Hi, GFS (Global File System) is a cluster file system that we'd like to
see added to the kernel.  The 14 patches total about 900K so I won't send
them to the list unless that's requested.  Comments and suggestions are
welcome.  Thanks

http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050801/broken-out/

Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Arjan van de Ven

On Tue, 2005-08-02 at 15:18 +0800, David Teigland wrote:
 Hi, GFS (Global File System) is a cluster file system that we'd like to
 see added to the kernel.  The 14 patches total about 900K so I won't send
 them to the list unless that's requested.  Comments and suggestions are
 welcome.  Thanks
 
 http://redhat.com/~teigland/gfs2/20050801/gfs2-full.patch
 http://redhat.com/~teigland/gfs2/20050801/broken-out/

* The on disk structures are defined in terms of uint32_t and friends,
which are NOT endian neutral. Why are they not le32/be32 and thus
endian-defined? Did you run bitwise-sparse on GFS yet ?

* None of your on disk structures are packet. Are you sure?

* 
+#define gfs2_16_to_cpu be16_to_cpu
+#define gfs2_32_to_cpu be32_to_cpu
+#define gfs2_64_to_cpu be64_to_cpu

why this pointless abstracting?

* +static const uint32_t crc_32_tab[] = .

why do you duplicate this? The kernel has a perfectly good set of generic crc32 
tables/functions just fine

* Why are you using bufferheads extensively in a new filesystem?

* + if (create)
+   down_write(ip-i_rw_mutex);
+   else
+   down_read(ip-i_rw_mutex);

why do you use a rwsem and not a regular semaphore? You are aware that rwsems 
are far more expensive than regular ones right?
How skewed is the read/write ratio?

* Why use your own journalling layer and not say ... jbd ?

* + while (!kthread_should_stop()) {
+   gfs2_scand_internal(sdp);
+
+   set_current_state(TASK_INTERRUPTIBLE);
+   schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
+   }

you probably really want to check for signals if you do interruptible sleeps
(multiple places)

* why not use msleep() and friends instead of schedule_timeout(), you're not 
using the complex variants anyway

* +++ b/fs/gfs2/fixed_div64.h   2005-08-01 14:13:08.009808200 +0800

e why?

* int gfs2_copy2user(struct buffer_head *bh, char **buf, unsigned int offset,
+  unsigned int size)
+{
+   int error;
+
+   if (bh)
+   error = copy_to_user(*buf, bh-b_data + offset, size);
+   else
+   error = clear_user(*buf, size);

that looks to be missing a few kmaps.. whats the guarantee that b_data is 
actually, like in lowmem?

* [PATCH 08/14] GFS: diaper device

The diaper device is a block device within gfs that gets transparently
inserted between the real device the and rest of the filesystem.

h why not use device mapper or something? Is this really needed? Should it 
live in drivers/block ? Doesn't
this wrapper just increase the risk for memory deadlocks?

* [PATCH 06/14] GFS: logging and recovery

quoting the ren and stimpy show is nice.. but did the ren ans stimpy authors 
agree to license their stuff under the GPL?


* do_lock_wait

that almost screams for using wait_event and related APIs


*
+static inline void gfs2_log_lock(struct gfs2_sbd *sdp)
+{
+   spin_lock(sdp-sd_log_lock);
+}
why the abstraction ?



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Pekka Enberg

Hi David,

On 8/2/05, David Teigland [EMAIL PROTECTED] wrote:
 Hi, GFS (Global File System) is a cluster file system that we'd like to
 see added to the kernel.  The 14 patches total about 900K so I won't send
 them to the list unless that's requested.  Comments and suggestions are
 welcome.  Thanks

 +#define kmalloc_nofail(size, flags) \
 + gmalloc_nofail((size), (flags), __FILE__, __LINE__)

[snip]

 +void *gmalloc_nofail_real(unsigned int size, int flags, char *file,
 +   unsigned int line)
 +{
 + void *x;
 + for (;;) {
 + x = kmalloc(size, flags);
 + if (x)
 + return x;
 + if (time_after_eq(jiffies, gfs2_malloc_warning + 5 * HZ)) {
 + printk(GFS2: out of memory: %s, %u\n,
 +__FILE__, __LINE__);
 + gfs2_malloc_warning = jiffies;
 + }
 + yield();

This does not belong in a filesystem. It also seems like a very bad
idea. What are you trying to do here? If you absolutely must not fail,
use __GFP_NOFAIL instead.

 + }
 +}
 +
 +#if defined(GFS2_MEMORY_SIMPLE)
 +
 +atomic_t gfs2_memory_count;
 +
 +void gfs2_memory_add_i(void *data, char *file, unsigned int line)
 +{
 + atomic_inc(gfs2_memory_count);
 +}
 +
 +void gfs2_memory_rm_i(void *data, char *file, unsigned int line)
 +{
 + if (data)
 + atomic_dec(gfs2_memory_count);
 +}
 +
 +void *gmalloc(unsigned int size, int flags, char *file, unsigned int line)
 +{
 + void *data = kmalloc(size, flags);
 + if (data)
 + atomic_inc(gfs2_memory_count);
 + return data;
 +}
 +
 +void *gmalloc_nofail(unsigned int size, int flags, char *file,
 +  unsigned int line)
 +{
 + atomic_inc(gfs2_memory_count);
 + return gmalloc_nofail_real(size, flags, file, line);
 +}
 +
 +void gfree(void *data, char *file, unsigned int line)
 +{
 + if (data) {
 + atomic_dec(gfs2_memory_count);
 + kfree(data);
 + }
 +}

-mm has memory leak detection patches and there are others floating
around. Please do not introduce yet another subsystem-specific debug allocator.

Pekka
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Jan Engelhardt


* Why use your own journalling layer and not say ... jbd ?

Why does reiser use its own journalling layer and not say ... jbd ?



Jan Engelhardt
-- 
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Arjan van de Ven

On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
 * Why use your own journalling layer and not say ... jbd ?
 
 Why does reiser use its own journalling layer and not say ... jbd ?

because reiser got merged before jbd. Next question.

Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used by 2
filesystems in -mm already).



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Hans Reiser

Arjan van de Ven wrote:

On Tue, 2005-08-02 at 16:57 +0200, Jan Engelhardt wrote:
  

* Why use your own journalling layer and not say ... jbd ?
  

Why does reiser use its own journalling layer and not say ... jbd ?



because reiser got merged before jbd. Next question.
  

That is the wrong reason.  We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.  Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment.  He might want to look at how reiser4 does wandering
logs instead of using jbd. but I would never claim that for sure
some other author should be expected to use it.  and something like
changing one's journaling system is not something to do just before a
merge.

Now the question for GFS is still a valid one; there might be reasons to
not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used by 2
filesystems in -mm already).



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


  


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread David Teigland

On Tue, Aug 02, 2005 at 09:45:24AM +0200, Arjan van de Ven wrote:

 * The on disk structures are defined in terms of uint32_t and friends,
 which are NOT endian neutral. Why are they not le32/be32 and thus
 endian-defined? Did you run bitwise-sparse on GFS yet ?

GFS has had proper endian handling for many years, it's still correct as
far as we've been able to test.  I ran bitwise-sparse yesterday and didn't
find anything alarming.

 * None of your on disk structures are packet. Are you sure?

Quite, particular attention has been paid to aligning the structure
fields, you'll find pad fields throughout.  We'll write a quick test to
verify that packing doesn't change anything.

 +#define gfs2_16_to_cpu be16_to_cpu
 +#define gfs2_32_to_cpu be32_to_cpu
 +#define gfs2_64_to_cpu be64_to_cpu
 
 why this pointless abstracting?

#ifdef GFS2_ENDIAN_BIG

#define gfs2_16_to_cpu be16_to_cpu
#define gfs2_32_to_cpu be32_to_cpu
#define gfs2_64_to_cpu be64_to_cpu

#define cpu_to_gfs2_16 cpu_to_be16
#define cpu_to_gfs2_32 cpu_to_be32
#define cpu_to_gfs2_64 cpu_to_be64

#else /* GFS2_ENDIAN_BIG */

#define gfs2_16_to_cpu le16_to_cpu
#define gfs2_32_to_cpu le32_to_cpu
#define gfs2_64_to_cpu le64_to_cpu

#define cpu_to_gfs2_16 cpu_to_le16
#define cpu_to_gfs2_32 cpu_to_le32
#define cpu_to_gfs2_64 cpu_to_le64

#endif /* GFS2_ENDIAN_BIG */

The point is you can define GFS2_ENDIAN_BIG to compile gfs to be BE
on-disk instead of LE which is another useful way to verify endian
correctness.

You should be able to use gfs in mixed architecture and mixed endian
clusters.  We don't have a mixed endian cluster to test, though.

 * +static const uint32_t crc_32_tab[] = .
 why do you duplicate this? The kernel has a perfectly good set of generic
 crc32 tables/functions just fine

We'll try them, they'll probably do fine.

 * Why use your own journalling layer and not say ... jbd ?

Here's an analysis of three approaches to cluster-fs journaling and their
pros/cons (including using jbd):  http://tinyurl.com/7sbqq

 * +   while (!kthread_should_stop()) {
 + gfs2_scand_internal(sdp);
 +
 + set_current_state(TASK_INTERRUPTIBLE);
 + schedule_timeout(gfs2_tune_get(sdp, gt_scand_secs) * HZ);
 + }
 
 you probably really want to check for signals if you do interruptible sleeps

I don't know why we'd be interested in signals here.

 * why not use msleep() and friends instead of schedule_timeout(), you're
 not using the complex variants anyway

When unmounting we really appreciate waking up more often than the
timeout, otherwise the unmount sits and waits for the longest daemon's
msleep to complete.  I converted this to msleep recently but it was too
painful and had to go back.

We'll get to your other comments, thanks.
Dave

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 00/14] GFS

2005-08-02 Thread Kyle Moffett


On Aug 2, 2005, at 21:00:02, Hans Reiser wrote:

Arjan van de Ven wrote:

because reiser got merged before jbd. Next question.

That is the wrong reason.  We use our own journaling layer for the
reason that Vivaldi used his own melody.

I don't know anything about GFS, but expecting a filesystem author to
use a journaling layer he does not want to is a bit arrogant.  Now, if
you got into details, and said jbd does X, Y and Z, and GFS does the
same X and Y, and does not do Z as well as jbd, that would be a more
serious comment.  He might want to look at how reiser4 does wandering
logs instead of using jbd. but I would never claim that for sure
some other author should be expected to use it.  and something  
like

changing one's journaling system is not something to do just before a
merge.


I don't want to start another big reiser4 flamewar, but...

I don't know anything about Reiser4, but expecting a filesystem author
to use a VFS layer he does not want to is a bit arrogant.  Now, if you
got into details, and said the linux VFS does X, Y, and Z, and Reiser4
does...

Do you see my point here?  If every person who added new kernel code
just wrote their own thing without checking to see if it had already
been done before, then there would be a lot of poorly maintained code
in the kernel.  If a journalling layer already exists, _new_ journaled
filesystems should either (A) use the layer as is, or (B) fix the layer
so it has sufficient functionality for them to use, and submit patches.
That way if somebody later says, Ah, crap, there's a bug in the kernel
journalling layer, and fixes it, there are not eight other filesystems
with their own open-coded layers that need to be audited for similar
mistakes.

This is similar to why some kernel developers did not like the Reiser4
code, because it implemented some private layers that looked kinda like
stuff the VFS should be doing  (Again, I don't want to get into that
argument again, I'm just bringing up the similarities to clarify _this_
particular point, as that one has been beaten to death enough already).

Now the question for GFS is still a valid one; there might be  
reasons to

not use it (which is fair enough) but if there's no real reason then
using jdb sounds a lot better given it's maturity (and it is used  
by 2

filesystems in -mm already).


Personally, I am of the opinion that if GFS cannot use jdb, the  
developers

ought to clarify why it isn't useable, and possibly submit fixes to make
it useful, so that others can share the benefits.

Cheers,
Kyle Moffett

--
I lost interest in blade servers when I found they didn't throw  
knives at

people who weren't supposed to be in your machine room.
  -- Anthony de Boer


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

90 matches

Mail list logo