Re: Distributed storage.
Hi Mike,

On Thursday 02 August 2007 21:09, Mike Snitzer wrote:
> But NBD's synchronous nature is actually an asset when coupled with MD
> raid1 as it provides guarantees that the data has _really_ been
> mirrored remotely.

And bio completion doesn't?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Distributed storage.
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote:
> > I see bits that worry me, e.g.:
> >
> >     + req = mempool_alloc(st->w->req_pool, GFP_NOIO);
> >
> > which seems to be callable in response to a local request, just the
> > case where NBD deadlocks. Your mempool strategy can work reliably
> > only if you can prove that the pool allocations of the maximum
> > number of requests you can have in flight do not exceed the size of
> > the pool. In other words, if you ever take the pool's fallback path
> > to normal allocation, you risk deadlock.
>
> The mempool should be allocated to be able to catch up with the
> maximum number of in-flight requests; in my tests I was unable to
> force the block layer to put more than 31 pages in sync, but in one
> bio. Each request is essentially delayed bio processing, so this must
> handle the maximum number of in-flight bios (if they do not cover
> multiple nodes; if they do, then each node requires its own request).

It depends on the characteristics of the physical and virtual block
devices involved. Slow block devices can produce surprising effects.
Ddsnap still qualifies as slow under certain circumstances (a big linear
write immediately following a new snapshot). Before we added throttling
we would see as many as 800,000 bios in flight. Nice to know the system
can actually survive this... mostly. But memory deadlock is a clear and
present danger under those conditions and we did hit it (not to mention
that read latency sucked beyond belief). Anyway, we added a simple
counting semaphore to throttle the bio traffic to a reasonable number
and behavior became much nicer, but most importantly, this satisfies
one of the primary requirements for avoiding block device memory
deadlock: a strictly bounded amount of bio traffic in flight.

In fact, we allow some bounded number of non-memalloc bios *plus*
however much traffic the mm wants to throw at us in memalloc mode, on
the assumption that the mm knows what it is doing and imposes its own
bound on in-flight bios per device. This needs auditing obviously, but
the mm either does that or is buggy. In practice, with this throttling
in place we never saw more than 2,000 in flight no matter how hard we
hit it, which is about the number we were aiming at. Since we draw our
reserve from the main memalloc pool, we can easily handle 2,000 bios in
flight, even under extreme conditions. See:

   http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
   down(&info->throttle_sem);

To be sure, I am not very proud of this throttling mechanism for
various reasons, but the thing is, _any_ throttling mechanism no matter
how sucky solves the deadlock problem. Over time I want to move the
throttling up into bio submission proper, or perhaps incorporate it in
device mapper's queue function, not quite as high up the food chain.
Only some stupid little logistical issues stopped me from doing it one
of those ways right from the start. I think Peter has also tried some
things in this area. Anyway, that part is not pressing because the
throttling can be done in the virtual device itself as we do it, even
if it is not very pretty there.

The point is: you have to throttle the bio traffic. The alternative is
to die a horrible death under conditions that may be rare, but _will_
hit somebody.

Regards,

Daniel
Re: Distributed storage.
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce the first release of the distributed storage
> subsystem, which allows one to form a storage on top of remote and
> local nodes, which in turn can be exported to another storage as a
> node to form tree-like storages.

Excellent! This is precisely what the doctor ordered for the OCFS2-based
distributed storage system I have been mumbling about for some time. In
fact the dd in ddsnap and ddraid stands for "distributed data". The
ddsnap/raid devices do not include an actual network transport; that is
expected to be provided by a specialized block device, which up till now
has been NBD. But NBD has various deficiencies as you note, in addition
to its tendency to deadlock when accessed locally. Your new code base
may be just the thing we always wanted. We (zumastor et al) will take it
for a drive and see if anything breaks.

Memory deadlock is a concern of course. From a cursory glance through,
it looks like this code is pretty vm-friendly and you have thought quite
a lot about it; however I respectfully invite peterz (obsessive/
compulsive memory deadlock hunter) to help give it a good going over
with me. I see bits that worry me, e.g.:

    + req = mempool_alloc(st->w->req_pool, GFP_NOIO);

which seems to be callable in response to a local request, just the case
where NBD deadlocks. Your mempool strategy can work reliably only if you
can prove that the pool allocations of the maximum number of requests
you can have in flight do not exceed the size of the pool. In other
words, if you ever take the pool's fallback path to normal allocation,
you risk deadlock.

Anyway, if this is as grand as it seems then I would think we ought to
factor out a common transfer core that can be used by all of NBD, iSCSI,
ATAoE and your own kernel server, in place of the roll-yer-own code
those things have now.

Regards,

Daniel
Re: CFS review
Hi Linus,

On Wednesday 01 August 2007 19:17, Linus Torvalds wrote:
> And the "approximates" thing would be about the fact that we don't
> actually care about "absolute" microseconds as much as something that
> is in the "roughly a microsecond" area. So if we say "it doesn't have
> to be microseconds, but it should be within a factor of two of a ms",
> we could avoid all the expensive divisions (even if they turn into
> multiplications with reciprocals), and just let people *shift* the
> CPU counter instead.

On that theme, expressing the subsecond part of high precision time in
decimal instead of left-aligned binary always was an insane idea.
Applications end up with silly numbers of multiplies and divides (likely
as not incorrect) whereas they would often just need a simple shift as
you say, if the tv struct had been defined sanely from the start. As a
bonus, whenever precision gets bumped up, the new bits appear in
formerly zero locations on the right, meaning little if any code needs
to change.

What we have in the incumbent libc timeofday scheme is the moral
equivalent of BCD. Of course libc is unlikely ever to repent, but we can
at least put off converting into the awkward decimal format until the
last possible instant. In other words, I do not see why xtime is
expressed as a tv instead of simple 32.32 fixed point. Perhaps somebody
can elucidate me?

Regards,

Daniel
Re: [PATCH RFC] extent mapped page cache
On Tuesday 10 July 2007 14:03, Chris Mason wrote:
> This patch aims to demonstrate one way to replace buffer heads with a
> few extent trees...

Hi Chris,

Quite terse commentary on algorithms and data structures, but I suppose
that is not a problem because Jon has a whole week to reverse engineer
it for us.

What did you have in mind for subpages?

Regards,

Daniel
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
On Wednesday 11 July 2007 15:09, Neil Brown wrote:
> > > Has anyone fixed the infrequent crashes with 4K stacks and ext3
> > > -> LVM snapshot -> LVM -> DM mirror -> libata?
> >
> > Ahem: ext3 -> LVM snapshot -> LVM -> DM mirror -> DM crypt -> md ->
> > libata, or worse.
> >
> > No, it's not fixed. The model is wrong. Virtual block drivers
> > should not be calling submit_bio. The recursive IO submissions
> > should be handled on a dedicated stack, most probably allocated as
> > part of the request queue. This could be done easily in device
> > mapper and md, or better, in submit_bio.
>
> Maybe you should read the latest kernel source code. Particularly
> generic_make_request in block/ll_rw_blk.c.

And plus you've had that one sitting around since 2005, hats off for
nailing the issue from way out. Sorry for missing the action, I was
elsewhere.

Niggles begin here. I'm not sure I like the additional task_struct
encumbrance when the functions themselves could sort it out, albeit with
an API change affecting a gaggle of md and dm drivers. Hopefully there
are other users of the bio list fields, otherwise I would point out that
a per-queue stack is less memory than two per-bio fields. I didn't go
delving that far.

The pointer to the description of the barrier deadlock is not right: it
points to the problem report when it really ought to point to the
definitive analysis and include a subject line, because list archives
come and go:

   [PATCH] block: always requeue !fs requests at the front
   http://thread.gmane.org/gmane.linux.kernel/537473

Is there a good reason why we should not just put the whole analysis
from Tejun Heo in as a comment? It is terse enough.

In other words, looks good to me :)

Regards,

Daniel
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
On Wednesday 11 July 2007 10:54, Zan Lynx wrote:
> Jesper Juhl wrote:
> > Hi,
> >
> > I'm wondering if it's time to make 4K stacks the default and to
> > start considering removing the 8K stack option altogether soon?
> >
> > One of the big problem spots was XFS, but that got some stack usage
> > fixes recently, and the 4K stack option has been around for quite a
> > while now, so people really should have gotten around to fixing any
> > code that can't handle it. Are there still any big problem areas
> > remaining?
>
> Has anyone fixed the infrequent crashes with 4K stacks and ext3 ->
> LVM snapshot -> LVM -> DM mirror -> libata?

Ahem: ext3 -> LVM snapshot -> LVM -> DM mirror -> DM crypt -> md ->
libata, or worse.

No, it's not fixed. The model is wrong. Virtual block drivers should
not be calling submit_bio. The recursive IO submissions should be
handled on a dedicated stack, most probably allocated as part of the
request queue. This could be done easily in device mapper and md, or
better, in submit_bio.

Regards,

Daniel
Re: [patch] CFS scheduler, -v9
Hi Ingo,

I just thought I would mention this, because it is certainly on my mind.
I can't help wondering if other folks are also concerned about this.

The thing is, why don't you just send your patches to Con, who got this
whole ball rolling and did a bunch of great work, proving beyond any
reasonable doubt that he is capable of maintaining this subsystem,
whatever algorithm is finally adopted? Are you worried that Con might
steal your thunder? That somehow the scheduler is yours alone? That you
might be perceived as less of a genius if somebody else gets credit for
their good work? NIH?

My perception is that you barged in to take over just when Con got
things moving after the scheduler sat and rotted for several years. If
that is in any way accurate, then shame on you.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 15:52, Daniel Phillips wrote:

Ah, there's another issue: an interrupt can come in when esp is on the
ndis stack and above THREAD_SIZE, so do_IRQ will not find thread_info.
Sorry, this one is nasty.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
> > Is there a technical reason ("hard to implement" is a practical
> > reason) why all stacks need to be the same size?
>
> Because of
>
>     static inline struct thread_info *current_thread_info(void)
>     {
>             struct thread_info *ti;
>             __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
>             return ti;
>     }
>     [include/asm-i386/thread_info.h]
>
> which assumes that it can "round down" the stack pointer and then will
> find the thread_info of the current context there. Only works for
> identically sized stacks. Note that this function is heavily used in
> the kernel, either directly or indirectly. You cannot avoid it.
>
> My current assessment regarding differently sized threads for
> ndiswrapper: not feasible with vanilla kernels.

If so, it is not because of this. It just means you have to go back to
the idea of switching back to the original stack when the Windows driver
calls into the ndis API. (It must have been way too late last night when
I claimed the second stack switch wasn't necessary.) Other issues:

- Use a semaphore to serialize access to a single ndis stack... any
  spinlock or interrupt state issues? (I didn't notice any.)

- Copy parameters across the stack switch - a little tricky, but far
  from the trickiest bit of glue in the kernel

- Preempt - looks like it has to be disabled from switching to the ndis
  stack to switching back because of the thread_info problem

- It is best for Linux when life is a little hard for binary-only
  drivers, but not completely impossible.

When the smoke clears, ndiswrapper will be slightly slower than before
and we will be slightly closer to having some native drivers. In the
meantime, keeping the thing alive without impacting core is an
interesting puzzle.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 00:16, Daniel Phillips wrote:
> ...as long as ->task and ->previous_esp are initialized, staying on
> the bigger stack looks fine (previous_esp is apparently used only for
> backtrace) ... just like do_IRQ.

Ahem, but let me note before somebody else does: it isn't interrupt
context, it is normal process context - while an interrupt can ignore
most of the thread_info fields, a normal process has to worry about all
9. To be on the safe side, the first 8 need to be copied into and out
of the ndis stack, with preempt disabled until after the stack switch.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 21:59, Mark Lord wrote:
> Daniel Phillips wrote:
> > There are only two stacks involved, the normal kernel stack and
> > your new ndis stack. You save ESP of the kernel stack at the base
> > of the ndis stack. When the Windows code calls your api, you get
> > the ndis ESP, load the kernel ESP from the base of the ndis stack,
> > push the ndis ESP so you can get back to the ndis code later, and
> > continue on your merry way.

I must have been smoking something when I convinced myself that the
driver can't call into the kernel without switching back to the kernel
stack. But this is wrong: as long as ->task and ->previous_esp are
initialized, staying on the bigger stack looks fine (previous_esp is
apparently used only for backtrace).

> With CONFIG_PREEMPT, this will still cause trouble due to lack of
> "current" task info on the NDIS stack.
>
> One option is to copy (duplicate) the bottom-of-stack info when
> switching to the NDIS stack.

Yes, just like do_IRQ.

> Another option is to stick a Mutex around any use of the NDIS stack
> when calling into the foreign driver (might be done like this
> already??),

There is no mutex now, but this is the easy way to get by with just one
ndis stack.

> which will prevent PREEMPTion during the call.

We have preempt_enable/disable for that. But I am not sure preemption
needs to be disabled.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 18:28, Roland Dreier wrote:
>     Daniel> There are only two stacks involved, the normal kernel
>     Daniel> stack and your new ndis stack. You save ESP of the kernel
>     Daniel> stack at the base of the ndis stack. When the Windows
>     Daniel> code calls your api, you get the ndis ESP, load the kernel
>     Daniel> ESP from the base of the ndis stack, push the ndis ESP so
>     Daniel> you can get back to the ndis code later, and continue on
>     Daniel> your merry way.
>
> [...]
>
>     Daniel> You will allocate your own stack once on driver
>     Daniel> initialization.
>
> I'm not quite sure it's this trivial. Obviously there are more than
> two stacks involved, since there is more than one kernel stack! (One
> per task plus IRQ stacks) This is more than just a theoretical
> problem. It seems entirely possible that more than one task could be
> in the driver, and clearly they each need their own stack.

Semaphore :-)

Do you expect this to be heavily contended? On a very quick run through
the code, it seems you don't hold any spinlocks going into the driver
from process context.

Interrupts... they better fit into a 4K stack or it's game over.

Preemption while on the ndis stack... you can always disable preemption
in this region, but the semaphore should protect you. Task killed while
preempted... I dunno.

> So it's going to be at least a little harder than allocating a single
> stack for NDIS use when the driver starts up.
>
> I personally like the idea raised elsewhere in this thread of running
> the Windows driver in userspace by proxying interrupts, PCI access,
> etc. That seems more robust and probably allows some cool reverse
> engineering hacks.

I expect the userspace approach will be a lot more work and a lot more
overhead too, but then again it sounds like loads of fun.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 18:21, Andi Kleen wrote:
> On Wednesday 07 September 2005 00:19, Daniel Phillips wrote:
> > Andi, their stack will have to have a valid thread_info->task
> > because interrupts will use it. Out of interest, could you please
> > explain what for?
>
> No, with 4k stacks interrupts run on their own stack with their own
> thread_info. Or rather they mostly do. Currently do_IRQ does
> irq_enter, which refers to thread_info before switching to the
> interrupt stack; that order would likely need to be exchanged.

But then how would thread_info->task on the irq stack ever get
initialized?

My "what for" question was re why interrupt routines even need a valid
current. I see one answer out there on the web: statistical profiling.
Is that it?

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 13:23, Giridhar Pemmasani wrote: > Jan Kiszka wrote: > > The only way I see is to switch stacks back on ndiswrapper API entry. > > But managing all those stacks correctly is challenging, There are only two stacks involved, the normal kernel stack and your new ndis stack. You save ESP of the kernel stack at the base of the ndis stack. When the Windows code calls your api, you get the ndis ESP, load the kernel ESP from the base of the ndis stack, push the ndis ESP so you can get back to the ndis code later, and continue on your merry way. > > as you will likely not want to create a new stack on each switching > > point... You will allocate your own stack once on driver initialization. > This is what I had in mind before I saw this thread here. I, in fact, did > some work along those lines, but it is even more complicated than you > mentioned here: Windows uses different calling conventions (STDCALL, > FASTCALL, CDECL) so switching stacks by copying arguments/results gets > complicated. I missed something there. You would switch stacks before calling the Windows code and after the Windows code calls you (and respective returns) so you are always in your own code when you switch, hence you know how to copy the parameters. > I am still hoping that Andi's approach is possible (I don't understand how > we can make kernel see current info from private stack). He suggested you use your own private variant of current which would presumably read a copy of current you stored at the bottom of your own stack. But I don't see why your code would ever need current while you are on the private ndis stack. Andi, their stack will have to have a valid thread_info->task because interrupts will use it. Out of interest, could you please explain what for? Code like u32 stack[THREAD_SIZE/sizeof(u32)] is violated by a different sized stack, but apparently not in any way that matters. By the way, I use ndiswrapper, thanks a lot you guys! 
Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote: > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > > do you think it is a bit premature to dismiss something even without > > > ever seeing the code? > > > > You told me you are using a dlm for a single-node application, is there > > anything more I need to know? > > I would still like to know why you consider it a "sin". On OpenVMS it is > fast, provides a way of cleaning up... There is something hard about handling EPIPE? > and does not introduce single point > of failure as it is the case with a daemon. And if we ever want to spread > the load between 2 boxes we easily can do it. But you said it runs on an aging Alpha, surely you do not intend to expand it to two aging Alphas? And what makes you think that socket-based synchronization keeps you from spreading out the load over multiple boxes? > Why would I not want to use it? It is not the right tool for the job from what you have told me. You want to get a few bytes of information from one task to another? Use a socket, as God intended. Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > do you think it is a bit premature to dismiss something even without > ever seeing the code? You told me you are using a dlm for a single-node application, is there anything more I need to know? Regards, Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 21:59, Mark Lord wrote: > Daniel Phillips wrote: > > There are only two stacks involved, the normal kernel stack and your new > > ndis stack. You save ESP of the kernel stack at the base of the ndis > > stack. When the Windows code calls your api, you get the ndis ESP, load > > the kernel ESP from the base of the ndis stack, push the ndis ESP so you > > can get back to the ndis code later, and continue on your merry way. I must have been smoking something when I convinced myself that the driver can't call into the kernel without switching back to the kernel stack. But this is wrong: as long as ->task and ->previous_esp are initialized, staying on the bigger stack looks fine (previous_esp is apparently used only for backtrace). > With CONFIG_PREEMPT, this will still cause trouble due to lack of current > task info on the NDIS stack. One option is to copy (duplicate) the > bottom-of-stack info when switching to the NDIS stack. Yes, just like do_IRQ. > Another option is to stick a Mutex around any use of the NDIS stack when > calling into the foreign driver (might be done like this already??), There is no mutex now, but this is the easy way to get by with just one ndis stack. > which will prevent PREEMPTion during the call. We have preempt_enable/disable for that. But I am not sure preemption needs to be disabled. Regards, Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 00:16, Daniel Phillips wrote: > ...as long as ->task and ->previous_esp are initialized, staying on the > bigger stack looks fine (previous_esp is apparently used only for > backtrace) ... just like do_IRQ. Ahem, but let me note before somebody else does: it isn't interrupt context, it is normal process context - while an interrupt can ignore most of the thread_info fields, a normal process has to worry about all 9. To be on the safe side, the first 8 need to be copied into and out of the ndis stack, with preempt disabled until after the stack switch. Regards, Daniel
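That copy discipline can be sketched as follows. This is a Python model for illustration only (real code would be C with preemption disabled around the switch); the field names follow the 2.6-era i386 struct thread_info, with previous_esp set separately rather than copied since it only records the old stack for backtraces.

```python
# The eight fields a normal process context has to keep valid at the
# base of whatever stack it runs on (names per 2.6-era i386).
THREAD_INFO_FIELDS = (
    "task", "exec_domain", "flags", "status",
    "cpu", "preempt_count", "addr_limit", "restart_block",
)

def enter_ndis_stack(kernel_ti, ndis_ti):
    # Copy the live fields into the base of the private stack so any
    # code that derives `current` from the stack pointer finds valid
    # data.  Preemption would be disabled across this copy.
    for f in THREAD_INFO_FIELDS:
        ndis_ti[f] = kernel_ti[f]

def leave_ndis_stack(kernel_ti, ndis_ti):
    # Copy any changes back out before resuming on the kernel stack.
    for f in THREAD_INFO_FIELDS:
        kernel_ti[f] = ndis_ti[f]

kernel_ti = {f: f.upper() for f in THREAD_INFO_FIELDS}
ndis_ti = {}
enter_ndis_stack(kernel_ti, ndis_ti)
ndis_ti["flags"] = "TIF_NEED_RESCHED"   # changed while on the ndis stack
leave_ndis_stack(kernel_ti, ndis_ti)
```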
Re: GFS, what's remaining
On Monday 05 September 2005 19:37, Joel Becker wrote: > OCFS2, the new filesystem, is fully general purpose. It > supports all the usual stuff, is quite fast... So I have heard, but isn't it time to quantify that? How do you think you would stack up here: http://www.caspur.it/Files/2005/01/10/1105354214692.pdf Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote: > On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > By the way, you said "alpha server" not "alpha servers", was that just a > > slip? Because if you don't have a cluster then why are you using a dlm? > > No, it is not a slip. The application is running on just one node, so we > do not really use "distributed" part. However we make heavy use of the > rest of lock manager features, especially lock value blocks. Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature without even having the excuse you were forced to use it. Why don't you just have a daemon that sends your values over a socket? That should be all of a day's coding. Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. But you nicely supported my claim that most who think they should be using a dlm, really shouldn't. Regards, Daniel
Re: GFS, what's remaining
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote: > On Monday 05 September 2005 19:57, Daniel Phillips wrote: > > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > > > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > > > The only current users of dlms are cluster filesystems. There > > > > > > are zero users of the userspace dlm api. > > > > > > > > > > That is incorrect... > > > > > > > > Application users Lars, sorry if I did not make that clear. The > > > > issue is whether we need to export an all-singing-all-dancing dlm api > > > > from kernel to userspace today, or whether we can afford to take the > > > > necessary time to get it right while application writers take their > > > > time to have a good think about whether they even need it. > > > > > > If Linux fully supported OpenVMS DLM semantics we could start thinking > > > about moving our application onto a Linux box because our alpha server > > > is aging. > > > > > > That's just my user application writer $0.02. > > > > What stops you from trying it with the patch? That kind of feedback > > would be worth way more than $0.02. > > We do not have such plans at the moment and I prefer spending my free > time on tinkering with kernel, not rewriting some in-house application. > Besides, DLM is not the only thing that does not have a drop-in > replacement in Linux. > > You just said you did not know if there are any potential users for the > full DLM and I said there are some. I did not say "potential", I said there are zero dlm applications at the moment. Nobody has picked up the prototype (g)dlm api, used it in an application and said "gee this works great, look what it does". I also claim that most developers who think that using a dlm for application synchronization would be really cool are probably wrong. 
Use sockets for synchronization exactly as for a single-node, multi-tasking application and you will end up with less code, more obviously correct code, probably more efficient and... you get an optimal, single-node version for free. And I also claim that there is precious little reason to have a full-featured dlm in-kernel. Being in-kernel has no benefit for a userspace application. But being in-kernel does add kernel bloat, because there will be extra features lathered on that are not needed by the only in-kernel user, the cluster filesystem. In the case of your port, you'd be better off hacking up a userspace library to provide OpenVMS dlm semantics exactly, not almost. By the way, you said "alpha server" not "alpha servers", was that just a slip? Because if you don't have a cluster then why are you using a dlm? Regards, Daniel
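The socket argument in miniature: a sketch (Python for brevity; the protocol and names are invented for illustration) of two tasks synchronizing by passing a few bytes over a socketpair. The same structure works unchanged on a single node, and swapping the socketpair for an AF_INET connection spreads it across boxes.

```python
import socket
import threading

# Two connected endpoints; replace with an AF_INET socket to go
# multi-node with no change to the synchronization logic.
server, client = socket.socketpair()

def coordinator():
    # The "lock daemon" side: wait for a request, grant the resource.
    req = server.recv(16)
    server.sendall(b"GRANT " + req)

t = threading.Thread(target=coordinator)
t.start()
client.sendall(b"resource42")   # ask for the resource by name
reply = client.recv(32)         # blocks until the grant arrives
t.join()
```

The blocking `recv()` plays the role of a blocking lock request; a lock value block would just be a few more bytes carried in the grant message.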
Re: GFS, what's remaining
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > The only current users of dlms are cluster filesystems. There are > > > > zero users of the userspace dlm api. > > > > > > That is incorrect... > > > > Application users Lars, sorry if I did not make that clear. The issue is > > whether we need to export an all-singing-all-dancing dlm api from kernel > > to userspace today, or whether we can afford to take the necessary time > > to get it right while application writers take their time to have a good > > think about whether they even need it. > > If Linux fully supported OpenVMS DLM semantics we could start thinking > about moving our application onto a Linux box because our alpha server is > aging. > > That's just my user application writer $0.02. What stops you from trying it with the patch? That kind of feedback would be worth way more than $0.02. Regards, Daniel
Re: GFS, what's remaining
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > The only current users of dlms are cluster filesystems. There are zero > > users of the userspace dlm api. > > That is incorrect... Application users Lars, sorry if I did not make that clear. The issue is whether we need to export an all-singing-all-dancing dlm api from kernel to userspace today, or whether we can afford to take the necessary time to get it right while application writers take their time to have a good think about whether they even need it. > ...and you're contradicting yourself here: How so? Above talks about dlm, below talks about cluster membership. > > What does have to be resolved is a common API for node management. It is > > not just cluster filesystems and their lock managers that have to > > interface to node management. Below the filesystem layer, cluster block > > devices and cluster volume management need to be coordinated by the same > > system, and above the filesystem layer, applications also need to be > > hooked into it. This work is, in a word, incomplete. Regards, Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Monday 05 September 2005 05:19, Andrew Morton wrote: > David Teigland <[EMAIL PROTECTED]> wrote: > > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote: > > > David Teigland <[EMAIL PROTECTED]> wrote: > > > > We export our full dlm API through read/write/poll on a misc device. > > > > > > inotify did that for a while, but we ended up going with a straight > > > syscall interface. > > > > > > How fat is the dlm interface? ie: how many syscalls would it take? > > > > Four functions: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. Better take a look at the actual parameter lists to those calls before jumping to conclusions... Regards, Daniel
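To see the point about parameter lists, here is roughly what a full-featured lock() request involves once VMS-style semantics are in scope. The signature below is hypothetical, sketched in Python with invented names matching no real API, but each parameter corresponds to a standard dlm concept, and the two asynchronous callbacks are the part that a four-slot syscall interface has no clean answer for.

```python
import inspect

def lock(lockspace, resource, mode, flags,
         lock_value_block=None,  # small block of state carried with the lock
         parent=None,            # parent lock id, for lock hierarchies
         completion_ast=None,    # async callback when the lock is granted
         blocking_ast=None,      # async callback when another holder must yield
         timeout=None):
    """Hypothetical full dlm lock request (sketch only).  The callbacks
    imply asynchronous event delivery to userspace, which is exactly
    what does not map cleanly onto a plain synchronous syscall."""
    raise NotImplementedError("sketch only")

# Count the parameters that require async delivery machinery.
CALLBACK_PARAMS = [p for p in inspect.signature(lock).parameters
                   if p.endswith("_ast")]
```

Even ignoring the callbacks, the argument count already strains the syscall register-passing conventions that make "just reserve four slots" sound simple.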
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 03:28, Andrew Morton wrote: > If there is already a richer interface into all this code (such as a > syscall one) and it's feasible to migrate the open() tricksies to that API > in the future if it all comes unstuck then OK. That's why I asked (thus > far unsuccessfully): > >Are you saying that the posix-file lookalike interface provides >access to part of the functionality, but there are other APIs which are >used to access the rest of the functionality? If so, what is that >interface, and why cannot that interface offer access to 100% of the >functionality, thus making the posix-file tricks unnecessary? There is no such interface at the moment, nor is one needed in the immediate future. Let's look at the arguments for exporting a dlm to userspace: 1) Since we already have a dlm in kernel, why not just export that and save 100K of userspace library? Answer: because we don't want userspace-only dlm features bulking up the kernel. Answer #2: the extra syscalls and interface baggage serve no useful purpose. 2) But we need to take locks in the same lockspaces as the kernel dlm(s)! Answer: only support tools need to do that. A cut-down locking api is entirely appropriate for this. 3) But the kernel dlm is the only one we have! Answer: easily fixed, a simple matter of coding. But please bear in mind that dlm-style synchronization is probably a bad idea for most cluster applications, particularly ones that already do their synchronization via sockets. In other words, exporting the full dlm api is a red herring. It has nothing to do with getting cluster filesystems up and running. It is really just marketing: it sounds like a great thing for userspace to get a dlm "for free", but it isn't free, it contributes to kernel bloat and it isn't even the most efficient way to do it. If after considering that, we _still_ want to export a dlm api from kernel, then can we please take the necessary time and get it right? 
The full api requires not only syscall-style elements, but asynchronous events as well, similar to aio. I do not think anybody has a good answer to this today, nor do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC? Is the RAC team waiting with bated breath to adopt your kernel-based dlm? If not, why not?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

Now, I see the ocfs2 guys are all ready to back down on this one, but I will at least argue weakly in favor. Sick is a nice word for it, but it is actually not that far off. Normally, this fs will acquire a lock whenever the user creates a virtual file and the create will block until the global lock arrives. With O_NONBLOCK, it will return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is supposed to accomplish?

> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Way too cute.  Oh well, whatever.

The explicit way is syscalls or a set of ioctls, which he already has the makings of. If there is going to be a userspace api, I would hope it looks more like the contents of userdlm.c than the traditional Vaxcluster API, which sucks beyond belief.

Another explicit way is to do it with a whole set of virtual attributes instead of just a single file trying to capture the whole model. That is really unappealing, but I am afraid that is exactly what a whole lot of sysfs/configfs usage is going to end up looking like.

But more to the point: we have no urgent need for a userspace dlm api at the moment. Nothing will break if we just put that issue off for a few months, quite the contrary. If the only user is their tools I would say let it go ahead and be cute, even sickeningly so.
It is not supposed to be a general dlm api, at least that is my understanding. It is just supposed to be an interface for their tools. Of course it would help to know exactly how those tools use it. Too sleepy to find out tonight...

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs.  It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods.  You already have nearly all the right methods in nearly all
> > the right places.
>
> 	configfs, like sysfs, does not support ->open() or ->release()
> callbacks.

struct configfs_item_operations {
	void (*release)(struct config_item *);
	ssize_t (*show)(struct config_item *, struct attribute *, char *);
	ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
	int (*allow_link)(struct config_item *src, struct config_item *target);
	int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
	struct config_item *(*make_item)(struct config_group *group, const char *name);
	struct config_group *(*make_group)(struct config_group *group, const char *name);
	int (*commit_item)(struct config_item *item);
	void (*drop_item)(struct config_group *group, struct config_item *item);
};

You do have ->release and ->make_item/group. If I may hand you a more substantive argument: you don't support user-driven creation of files in configfs, only directories. Dlmfs supports user-created files.

But you know, there isn't actually a good reason not to support user-created files in configfs, as dlmfs demonstrates. Anyway, goodnight.

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:30, Joel Becker wrote:
> 	You asked why dlmfs can't go into sysfs, and I responded.

And you got me! In the heat of the moment I overlooked the fact that you and Greg haven't agreed to the merge yet ;-)

Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the same paradigm: drive the kernel logic from user-initiated vfs methods. You already have nearly all the right methods in nearly all the right places.

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 23:06, Joel Becker wrote:
> 	dlmfs is *tiny*.  The VFS interface is less than his claimed 500
> lines of savings.

It is 640 lines.

> 	The few VFS callbacks do nothing but call DLM
> functions.  You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
> 	In addition, sysfs cannot support the dlmfs model.  In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock.  sysfs doesn't support
> mkdir(2) or mknod(2) at all.

I said "configfs" in the email to which you are replying.

> 	More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace.  O_RDONLY acquires a shared read lock (PR
> in VMS parlance).  O_RDWR gets an exclusive lock (EX).  O_NONBLOCK is a
> trylock.  Here, dlmfs is using the VFS for complete lifetiming.  A lock
> is released via close(2).  If a process dies, close(2) happens.  In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
> 	sysfs does not allow hooking into ->open() or ->release().  So
> this model, and the inherent lifetiming that comes with it, cannot be
> used.

Configfs has a per-item release method. Configfs has a group open method. What is it that configfs can't do, or can't be made to do trivially?

> 	If dlmfs was changed to use a less intuitive model that fits
> sysfs, all the handling of lifetimes and cleanup would have to be added.

The model you came up with for dlmfs is beyond cute, it's downright clever. Why mar that achievement by then failing to capitalize on the framework you already have in configfs?

By the way, do you agree that dlmfs is too inefficient to be an effective way of exporting your dlm api to user space, except for slow-path applications like you have here?
Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.

You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved.

This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.)

Regards,

Daniel
Re: GFS, what's remaining
On Saturday 03 September 2005 06:35, David Teigland wrote:
> Just a new version, not a big difference.  The ondisk format changed a
> little making it incompatible with the previous versions.  We'd been
> holding out on the format change for a long time and thought now would be
> a sensible time to finally do it.

What exactly was the format change, and for what purpose?
Re: GFS, what's remaining
On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...

Dumb question, why can't you use sysfs for this instead of rolling your own?

Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you forgot that there are dozens of lkml archives pointing at them?

Regards,

Daniel
Re: GFS, what's remaining
On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.

The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out of the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is that it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with an awkward interface to an in-kernel dlm? This is obviously suboptimal. Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api?

There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then.

What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete.
Regards,

Daniel
Re: [PATCH] ia_attr_flags - time to die
On Friday 02 September 2005 15:41, Miklos Szeredi wrote:
> Already dead ;)
>
> 2.6.13-mm1: remove-ia_attr_flags.patch
>
> Miklos

Wow, the pace of Linux development really is picking up. Now patches are applied before I even send them!

Regards,

Daniel
[PATCH] ia_attr_flags - time to die
Struct iattr is not involved any more in such things as NOATIME inode flags. There are no in-tree users of ia_attr_flags.

Signed-off-by: Daniel Phillips <[EMAIL PROTECTED]>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h 2.6.13-rc5-mm1/fs/hostfs/hostfs.h
--- 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h	2005-08-09 18:23:11.0 -0400
+++ 2.6.13-rc5-mm1/fs/hostfs/hostfs.h	2005-09-01 17:54:40.0 -0400
@@ -49,7 +49,6 @@ struct hostfs_iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 extern int stat_file(const char *path, unsigned long long *inode_out,
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/fs.h 2.6.13-rc5-mm1/include/linux/fs.h
--- 2.6.13-rc5-mm1.clean/include/linux/fs.h	2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/fs.h	2005-09-01 18:27:42.0 -0400
@@ -282,19 +282,9 @@ struct iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 /*
- * This is the inode attributes flag definitions
- */
-#define ATTR_FLAG_SYNCRONOUS	1	/* Syncronous write */
-#define ATTR_FLAG_NOATIME	2	/* Don't update atime */
-#define ATTR_FLAG_APPEND	4	/* Append-only file */
-#define ATTR_FLAG_IMMUTABLE	8	/* Immutable file */
-#define ATTR_FLAG_NODIRATIME	16	/* Don't update atime for directory */
-
-/*
  * Includes for diskquotas.
  */
 #include <linux/quota.h>
Re: GFS, what's remaining
On Thursday 01 September 2005 06:46, David Teigland wrote: > I'd like to get a list of specific things remaining for merging. Where are the benchmarks and stability analysis? How many hours does it survive cerberos running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered; why do you think gfs2 should be in mainline after one month? So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while. Regards, Daniel
Re: GFS, what's remaining
On Thursday 01 September 2005 10:49, Alan Cox wrote: > On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason with have NTFS, HPFS, and all the others. On > that alone it makes sense to include. I thought that gfs2 just appeared last month. Or is it really still just gfs? If there are substantive changes from gfs to gfs2 then obviously they have had practically zero testing, let alone posted benchmarks, testimonials, etc. If it is really still just gfs then the silly-rename should be undone. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:34, [EMAIL PROTECTED] wrote: > On Tue, Aug 30, 2005 at 04:28:46PM -0700, Andrew Morton wrote: > > Sure, but all that copying-and-pasting really sucks. I'm sure there's > > some way of providing the slightly different semantics from the same > > codebase? > > Careful - you've almost reinvented the concept of library, which would > violate any number of patents... I will keep my eyes open for library candidates as I go. For example, the binary blob operations really cry out for it. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:28, Andrew Morton wrote: > Joel Becker <[EMAIL PROTECTED]> wrote: > > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > > But it would be stupid to forbid users from creating directories in > > > sysfs or to forbid kernel modules from directly tweaking a configfs > > > namespace. Why should the kernel not be able to add objects to a > > > directory a user created? It should be up to the module author to > > > decide these things. > > > > This is precisely why configfs is separate from sysfs. If both > > user and kernel can create objects, the lifetime of the object and its > > filesystem representation is very complex. Sysfs already has problems > > with people getting this wrong. configfs does not. > > The fact that sysfs and configfs have similar backing stores > > does not make them the same thing. > > Sure, but all that copying-and-pasting really sucks. I'm sure there's some > way of providing the slightly different semantics from the same codebase? I will have that patch ready later this week. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:25, Daniel Phillips wrote: > On Wednesday 31 August 2005 09:13, Joel Becker wrote: > > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > > But it would be stupid to forbid users from creating directories in > > > sysfs or to forbid kernel modules from directly tweaking a configfs > > > namespace. Why should the kernel not be able to add objects to a > > > directory a user created? It should be up to the module author to > > > decide these things. > > > > This is precisely why configfs is separate from sysfs. If both > > user and kernel can create objects, the lifetime of the object and its > > filesystem representation is very complex. Sysfs already has problems > > with people getting this wrong. configfs does not. > > Could you please give a specific case? More to the point: what makes you think that this apparent ruggedness will diminish after being re-integrated with sysfs? If you wish, you can avoid any dangers by not using sysfs's vfs bypass api. It should be up to the subsystem author. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:13, Joel Becker wrote: > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > But it would be stupid to forbid users from creating directories in sysfs > > or to forbid kernel modules from directly tweaking a configfs namespace. > > Why should the kernel not be able to add objects to a directory a user > > created? It should be up to the module author to decide these things. > > This is precisely why configfs is separate from sysfs. If both > user and kernel can create objects, the lifetime of the object and its > filesystem representation is very complex. Sysfs already has problems > with people getting this wrong. configfs does not. Could you please give a specific case? > The fact that sysfs and configfs have similar backing stores > does not make them the same thing. It is not just the backing store, it is most of the code, all the structures, most of the functionality, a good deal of the bugs... Regards, Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Tuesday 30 August 2005 19:06, Stephen Hemminger wrote: > On Wed, 31 Aug 2005 08:59:55 +1000 > > Daniel Phillips <[EMAIL PROTECTED]> wrote: > > Configfs rewritten as a single file and updated to use kobjects instead > > of its own clone of kobjects (config_items). > > Some style issues: > Mixed case in labels I certainly agree. This is strictly for comparison purposes and so I did not clean up the stylistic problems from the original... this time. > Bad identation I did lindent it however :-) > > + Done: > > Why the mixed case label? It shall die. > > +void config_group_init_type_name(struct kset *group, const char *name, > > struct kobj_type *type) +{ > > + kobject_set_name(&group->kobj, name); > > + group->kobj.ktype = type; > > + config_group_init(group); > > +} > > Use tabs not one space for indent. Urk. Kmail did that to me, it has been broken that way for a year or so. I will have to repost the whole set from a mailer that works. Regards, Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 08:59, Daniel Phillips wrote: > -obj-$(CONFIG_CONFIGFS_FS) += configfs.o > +obj-$(CONFIG_CONFIGFS_FS) += configfs.o ddbond.config.o This should just be: +obj-$(CONFIG_CONFIGFS_FS) += configfs.o However, the wrong version does provide a convenient way of compiling the example, I just... have... to... remember to delete it next time. Regards, Daniel
[RFC][PATCH 4 of 4] Configfs is really sysfs
A kernel code example that uses the modified configfs to define a simple configuration interface. Note the use of kobjects and ksets instead of config_items and config_groups. Also notice how much code is required to get a simple value from userspace to kernel space. This is a big problem that needs to be addressed soon, before we end up with tens or hundreds of thousands of lines of code bloat just to get and set variables from user space.

Regards,

Daniel

#include
#include
#include
#include

struct ddbond_info {
	struct kobject item;
	int controlsock;
};

static inline struct ddbond_info *to_ddbond_info(struct kobject *item)
{
	return container_of(item, struct ddbond_info, item);
}

static ssize_t ddbond_info_attr_show(struct kobject *item, struct attribute *attr, char *page)
{
	ssize_t count;
	struct ddbond_info *ddbond_info = to_ddbond_info(item);

	count = sprintf(page, "%d\n", ddbond_info->controlsock);
	return count;
}

static ssize_t ddbond_info_attr_store(struct kobject *item, struct attribute *attr, const char *page, size_t count)
{
	struct ddbond_info *ddbond_info = to_ddbond_info(item);
	unsigned long tmp;
	char *p = (char *)page;

	tmp = simple_strtoul(p, &p, 10);
	if (!p || (*p && (*p != '\n')))
		return -EINVAL;
	if (tmp > INT_MAX)
		return -ERANGE;
	ddbond_info->controlsock = tmp;
	return count;
}

static void ddbond_info_release(struct kobject *item)
{
	kfree(to_ddbond_info(item));
}

static struct kobj_type ddbond_info_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_info_attr_show,
		.store = ddbond_info_attr_store,
		.release = ddbond_info_release,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "sockname",
			.mode = S_IRUGO | S_IWUSR,
		},
		NULL,
	},
	.ct_owner = THIS_MODULE,
};

static struct kobject *ddbond_make_item(struct kset *group, const char *name)
{
	struct ddbond_info *ddbond_info;

	if (!(ddbond_info = kcalloc(1, sizeof(struct ddbond_info), GFP_KERNEL)))
		return NULL;
	kobject_init_type_name(&ddbond_info->item, name, &ddbond_info_type);
	return &ddbond_info->item;
}

static ssize_t ddbond_description(struct kobject *item, struct attribute *attr, char *page)
{
	return sprintf(page,
		"A ddbond block server has two components: a userspace server and a kernel\n"
		"io daemon. First start the server and give it the name of a socket it will\n"
		"create, then create a ddbond directory and write the same name into the\n"
		"socket attribute\n");
}

static struct kobj_type ddbond_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_description,
	},
	.ct_group_ops = &(struct configfs_group_operations){
		.make_item = ddbond_make_item,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "description",
			.mode = S_IRUGO,
		},
		NULL,
	}
};

static struct subsystem ddbond_subsys = {
	.kset = {
		.kobj = {
			.k_name = "ddbond",
			.ktype = &ddbond_type,
		},
	},
};

static int __init init_ddbond_config(void)
{
	int ret;

	config_group_init(&ddbond_subsys.kset);
	init_rwsem(&ddbond_subsys.rwsem);
	if ((ret = configfs_register_subsystem(&ddbond_subsys)))
		printk(KERN_ERR "Error %d while registering subsystem %s\n",
			ret, ddbond_subsys.kset.kobj.k_name);
	return ret;
}

static void __exit exit_ddbond_config(void)
{
	configfs_unregister_subsystem(&ddbond_subsys);
}

module_init(init_ddbond_config);
module_exit(exit_ddbond_config);
MODULE_LICENSE("GPL");
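For illustration, the input parsing done by ddbond_info_attr_store() in the example above can be modeled in plain userspace C (a sketch only: strtoul stands in for the kernel's simple_strtoul, and `parse_controlsock` is an invented helper name):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace model of the parsing in ddbond_info_attr_store(): accept
 * a decimal value optionally followed by a single newline (as produced
 * by `echo 42 > sockname`), and reject anything that does not fit in
 * an int. Returns 0 and stores the value, or -EINVAL / -ERANGE. */
static int parse_controlsock(const char *page, int *out)
{
	char *end;
	unsigned long tmp = strtoul(page, &end, 10);

	if (end == page || (*end && *end != '\n'))
		return -EINVAL;		/* no digits, or trailing garbage */
	if (tmp > INT_MAX)
		return -ERANGE;		/* too big for the int field */
	*out = (int)tmp;
	return 0;
}
```

The point stands that this handful of lines is the only real logic in the example; everything else is boilerplate for wiring one integer into the filesystem namespace.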
[RFC][PATCH 3 of 4] Configfs is really sysfs
Configfs rewritten as a single file and updated to use kobjects instead of its own clone of kobjects (config_items). diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2.6.13-rc5-mm1/fs/configfs/Makefile --- 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2005-08-09 18:23:30.0 -0400 +++ 2.6.13-rc5-mm1/fs/configfs/Makefile 2005-08-29 17:26:02.0 -0400 @@ -2,6 +2,5 @@ # Makefile for the configfs virtual filesystem # -obj-$(CONFIG_CONFIGFS_FS) += configfs.o +obj-$(CONFIG_CONFIGFS_FS) += configfs.o ddbond.config.o -configfs-objs := inode.o file.o dir.o symlink.o mount.o item.o diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2.6.13-rc5-mm1/fs/configfs/configfs.c --- 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2005-08-30 17:50:30.0 -0400 +++ 2.6.13-rc5-mm1/fs/configfs/configfs.c 2005-08-29 21:36:47.0 -0400 @@ -0,0 +1,1897 @@ +/* + * Based on sysfs: + * sysfs Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include +#include +#include +#include +#include +#include + +#define CONFIGFS_ROOT 0x0001 +#define CONFIGFS_DIR 0x0002 +#define CONFIGFS_ITEM_ATTR 0x0004 +#define CONFIGFS_ITEM_LINK 0x0020 +#define CONFIGFS_USET_DIR 0x0040 +#define CONFIGFS_USET_DEFAULT 0x0080 +#define CONFIGFS_USET_DROPPING 0x0100 +#define CONFIGFS_NOT_PINNED(CONFIGFS_ITEM_ATTR) + +struct sysfs_symlink { + struct list_head sl_list; + struct kobject *sl_target; +}; + +static inline struct kobject *to_kobj(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct kobject *)sd->s_element); +} + +static inline struct attribute *to_attr(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct attribute *)sd->s_element); +} + +static inline struct kobject *sysfs_get_kobject(struct dentry *dentry) +{ + struct kobject *kobj = NULL; + + spin_lock(_lock); + if (!d_unhashed(dentry)) { + struct sysfs_dirent *sd = dentry->d_fsdata; + if (sd->s_type 
& CONFIGFS_ITEM_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kobj = kobject_get(sl->sl_target); + } else + kobj = kobject_get(sd->s_element); + } + spin_unlock(_lock); + + return kobj; +} + +static kmem_cache_t *sysfs_dir_cachep; + +static void release_sysfs_dirent(struct sysfs_dirent *sd) +{ + if ((sd->s_type & CONFIGFS_ROOT)) + return; + kmem_cache_free(sysfs_dir_cachep, sd); +} + +static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd) +{ + if (sd) { + WARN_ON(!atomic_read(>s_count)); + atomic_inc(>s_count); + } + return sd; +} + +static void sysfs_put(struct sysfs_dirent *sd) +{ + WARN_ON(!atomic_read(>s_count)); + if (atomic_dec_and_test(>s_count)) + release_sysfs_dirent(sd); +} + +/* + * inode.c - basic inode and dentry operations. + */ + +static struct super_block *sysfs_sb; + +static struct address_space_operations sysfs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write +}; + +static struct backing_dev_info sysfs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, +}; + +static struct inode *sysfs_new_inode(mode_t mode) +{ + struct inode *inode = new_inode(sysfs_sb); + if (inode) { + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_mode = mode; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_mapping->a_ops = _aops; + inode->i_mapping->backing_dev_info = _backing_dev_info; + } + return inode; +} + +static int sysfs_create(struct dentry *dentry, int mode, int (*init) (struct inode *)) +{ + int error = 0; + struct inode *inode = NULL; + if (dentry) { + if (!dentry->d_inode) { + if ((inode = sysfs_new_inode(mode))) { + if (dentry->d_parent + && dentry->d_parent->d_inode) { + struct inode *p_inode = + dentry->d_parent->d_inode; + p_inode->i_mtime = p_inode->i_ctime = + CURRENT_TIME; + } + goto Proceed; + } else + error = 
-ENOMEM; + } else + error = -EEXIST; + }
[RFC][PATCH 2 of 4] Configfs is really sysfs
Sysfs rearranged as a single file for analysis purposes. diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2.6.13-rc5-mm1/fs/sysfs/Makefile --- 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2005-06-17 15:48:29.0 -0400 +++ 2.6.13-rc5-mm1/fs/sysfs/Makefile 2005-08-29 17:13:59.0 -0400 @@ -2,5 +2,4 @@ # Makefile for the sysfs virtual filesystem # -obj-y := inode.o file.o dir.o symlink.o mount.o bin.o \ - group.o +obj-y := sysfs.o diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2.6.13-rc5-mm1/fs/sysfs/sysfs.c --- 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2005-08-30 17:52:35.0 -0400 +++ 2.6.13-rc5-mm1/fs/sysfs/sysfs.c 2005-08-29 21:04:40.0 -0400 @@ -0,0 +1,1680 @@ +#include +#include +#include +#include +#include +#include +#include + +struct sysfs_symlink { + char *link_name; + struct kobject *sl_target; +}; + +static inline struct kobject *to_kobj(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct kobject *)sd->s_element); +} + +static inline struct attribute *to_attr(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct attribute *)sd->s_element); +} + +static inline struct kobject *sysfs_get_kobject(struct dentry *dentry) +{ + struct kobject *kobj = NULL; + + spin_lock(_lock); + if (!d_unhashed(dentry)) { + struct sysfs_dirent *sd = dentry->d_fsdata; + if (sd->s_type & SYSFS_KOBJ_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kobj = kobject_get(sl->sl_target); + } else + kobj = kobject_get(sd->s_element); + } + spin_unlock(_lock); + + return kobj; +} + +static kmem_cache_t *sysfs_dir_cachep; + +static void release_sysfs_dirent(struct sysfs_dirent *sd) +{ + if (sd->s_type & SYSFS_KOBJ_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kfree(sl->link_name); + kobject_put(sl->sl_target); + kfree(sl); + } + kfree(sd->s_iattr); + kmem_cache_free(sysfs_dir_cachep, sd); +} + +static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd) +{ + if (sd) { + 
WARN_ON(!atomic_read(>s_count)); + atomic_inc(>s_count); + } + return sd; +} + +static void sysfs_put(struct sysfs_dirent *sd) +{ + WARN_ON(!atomic_read(>s_count)); + if (atomic_dec_and_test(>s_count)) + release_sysfs_dirent(sd); +} + +/* + * inode.c - basic inode and dentry operations. + */ + +int sysfs_setattr(struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + struct sysfs_dirent *sd = dentry->d_fsdata; + struct iattr *sd_iattr; + unsigned int ia_valid = iattr->ia_valid; + int error; + + if (!sd) + return -EINVAL; + + sd_iattr = sd->s_iattr; + + error = inode_change_ok(inode, iattr); + if (error) + return error; + + error = inode_setattr(inode, iattr); + if (error) + return error; + + if (!sd_iattr) { + /* setting attributes for the first time, allocate now */ + sd_iattr = kmalloc(sizeof(struct iattr), GFP_KERNEL); + if (!sd_iattr) + return -ENOMEM; + /* assign default attributes */ + memset(sd_iattr, 0, sizeof(struct iattr)); + sd_iattr->ia_mode = sd->s_mode; + sd_iattr->ia_uid = 0; + sd_iattr->ia_gid = 0; + sd_iattr->ia_atime = sd_iattr->ia_mtime = sd_iattr->ia_ctime = + CURRENT_TIME; + sd->s_iattr = sd_iattr; + } + + /* attributes were changed atleast once in past */ + + if (ia_valid & ATTR_UID) + sd_iattr->ia_uid = iattr->ia_uid; + if (ia_valid & ATTR_GID) + sd_iattr->ia_gid = iattr->ia_gid; + if (ia_valid & ATTR_ATIME) + sd_iattr->ia_atime = timespec_trunc(iattr->ia_atime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_MTIME) + sd_iattr->ia_mtime = timespec_trunc(iattr->ia_mtime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_CTIME) + sd_iattr->ia_ctime = timespec_trunc(iattr->ia_ctime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_MODE) { + umode_t mode = iattr->ia_mode; + + if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID)) + mode &= ~S_ISGID; + sd_iattr->ia_mode = sd->s_mode = mode; + } + + return error; +} + +static struct inode_operations sysfs_inode_operations = { + .setattr = sysfs_setattr, +}; + 
+static struct super_block *sysfs_sb; + +static struct address_space_operations sysfs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write +}; + +static struct backing_dev_info sysfs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY |
[RFC][PATCH 1 of 4] Configfs is really sysfs
Hi Andrew, Configfs blithely ingests kobject.h and kobject.c into itself, just changing the names. Furthermore, more than half of configfs is copied verbatim from sysfs, the only difference being the name changes. After undoing the name changes and adding a few new fields to kobject structures, configfs is able to use the real thing instead of its own imitation. The changes I made to kobject.h and sysfs.h are:

* add module owner to kobj_type.
* add group_operations to kobj_type (because configfs does it this way, not because it is right)
* add a children field to kset. This is likely the same as the blandly named "list" field but I haven't confirmed it.
* add a default_groups field to kset, analogous to the default_attrs of kobj_type. Hmm, somebody seems to be mixing up types and containers here, but let's just close our eyes for now.
* add an s_links field to sysfs_dirent to support configfs's user-creatable symlinks.
* add two new methods to sysfs_ops for fancy symlink hooks
* add a questionable release method to sysfs_ops. Sysfs and configfs have slightly different notions of when to release objects; one of them is probably wrong.

That's it: no new fields in kobjects themselves, and just three or four fields in other allocatable structures. After these changes, no structures at all are left in configfs.h. Configfs is now running happily using the kobject machinery instead of its own mutated clones and, unsurprisingly, sysfs still runs happily too. These changes are all found in the first patch of this series. I then looked into exactly how configfs and sysfs are different. To reduce the noise, I concatenated all the files in each directory into two single files. With redundant declarations removed, configfs came in at 1897 lines and sysfs at 1680.
Diffing those two files shows:

diff -u fs/sysfs/sysfs.c fs/configfs/configfs.c | diffstat
 configfs.c | 1497 ++---
 1 files changed, 857 insertions(+), 640 deletions(-)

So we see that two thirds of sysfs made it into configfs unchanged. Of the remaining one third that configfs has not copied, about one third supports read/write/mmappable attribute files (why should configfs not have them too?), a little less than a third involves needlessly importing its own version of setattr, and the remainder, about 300 lines, exports the kernel interface for manipulating the user-visible sysfs tree. Allowing for a few lines of fluff, configfs's value add is about 750 lines of user space glue for namespace operations. Nothing below that glue layer is changed, except cosmetically. So configfs really is sysfs. By adding about 300 lines to configfs we can add the vfs-bypass code, and voila, configfs becomes sysfs. Another 200 lines gives us the binary blob attributes as well. There is no reason whatsoever for configfs and sysfs to live on as separate code bases. If we really want to make a distinction, we can make the distinction with a flag. But it would be stupid to forbid users from creating directories in sysfs or to forbid kernel modules from directly tweaking a configfs namespace. Why should the kernel not be able to add objects to a directory a user created? It should be up to the module author to decide these things. Please do not push configfs to stable in this form. It is not actually a new filesystem; it is an extension to sysfs. Merging it as is would add more than a thousand lines of pointless kernel bloat. If indeed we wish to present exactly the semantics configfs now offers, we do not need a separate code base to do so.
The four patches in this patch set:

 1) Add new fields to kobjects; update other headers to match
 2) Sysfs all in one file
 3) Configfs all in one file
 4) A configfs kernel example using sysfs instead of configfs structures

Regards,

Daniel

diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/configfs.h 2.6.13-rc5-mm1/include/linux/configfs.h
--- 2.6.13-rc5-mm1.clean/include/linux/configfs.h	2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/configfs.h	2005-08-29 18:30:41.0 -0400
@@ -46,120 +46,32 @@
 #define CONFIGFS_ITEM_NAME_LEN	20
 
-struct module;
-
-struct configfs_item_operations;
-struct configfs_group_operations;
-struct configfs_attribute;
-struct configfs_subsystem;
-
-struct config_item {
-	char			*ci_name;
-	char			ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
-	struct kref		ci_kref;
-	struct list_head	ci_entry;
-	struct config_item	*ci_parent;
-	struct config_group	*ci_group;
-	struct config_item_type	*ci_type;
-	struct dentry		*ci_dentry;
-};
-
-extern int config_item_set_name(struct config_item *, const char *, ...);
-
-static inline char *config_item_name(struct config_item * item)
-{
-	return
[RFC][PATCH 2 of 4] Configfs is really sysfs
Sysfs rearranged as a single file for analysis purposes.

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2.6.13-rc5-mm1/fs/sysfs/Makefile
--- 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile	2005-06-17 15:48:29.0 -0400
+++ 2.6.13-rc5-mm1/fs/sysfs/Makefile	2005-08-29 17:13:59.0 -0400
@@ -2,5 +2,4 @@
 #
 # Makefile for the sysfs virtual filesystem
 #
-obj-y		:= inode.o file.o dir.o symlink.o mount.o bin.o \
-		   group.o
+obj-y		:= sysfs.o
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2.6.13-rc5-mm1/fs/sysfs/sysfs.c
--- 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c	2005-08-30 17:52:35.0 -0400
+++ 2.6.13-rc5-mm1/fs/sysfs/sysfs.c	2005-08-29 21:04:40.0 -0400
@@ -0,0 +1,1680 @@
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/backing-dev.h>
+#include <linux/pagemap.h>
+#include <linux/fsnotify.h>
+
+struct sysfs_symlink {
+	char *link_name;
+	struct kobject *sl_target;
+};
+
+static inline struct kobject *to_kobj(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct kobject *)sd->s_element);
+}
+
+static inline struct attribute *to_attr(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct attribute *)sd->s_element);
+}
+
+static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+	struct kobject *kobj = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct sysfs_dirent *sd = dentry->d_fsdata;
+		if (sd->s_type & SYSFS_KOBJ_LINK) {
+			struct sysfs_symlink *sl = sd->s_element;
+			kobj = kobject_get(sl->sl_target);
+		} else
+			kobj = kobject_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return kobj;
+}
+
+static kmem_cache_t *sysfs_dir_cachep;
+
+static void release_sysfs_dirent(struct sysfs_dirent *sd)
+{
+	if (sd->s_type & SYSFS_KOBJ_LINK) {
+		struct sysfs_symlink *sl = sd->s_element;
+		kfree(sl->link_name);
+		kobject_put(sl->sl_target);
+		kfree(sl);
+	}
+	kfree(sd->s_iattr);
+	kmem_cache_free(sysfs_dir_cachep, sd);
+}
+static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd)
+{
+	if (sd) {
+		WARN_ON(!atomic_read(&sd->s_count));
+		atomic_inc(&sd->s_count);
+	}
+	return sd;
+}
+
+static void sysfs_put(struct sysfs_dirent *sd)
+{
+	WARN_ON(!atomic_read(&sd->s_count));
+	if (atomic_dec_and_test(&sd->s_count))
+		release_sysfs_dirent(sd);
+}
+
+/*
+ * inode.c - basic inode and dentry operations.
+ */
+
+int sysfs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	struct iattr *sd_iattr;
+	unsigned int ia_valid = iattr->ia_valid;
+	int error;
+
+	if (!sd)
+		return -EINVAL;
+
+	sd_iattr = sd->s_iattr;
+
+	error = inode_change_ok(inode, iattr);
+	if (error)
+		return error;
+
+	error = inode_setattr(inode, iattr);
+	if (error)
+		return error;
+
+	if (!sd_iattr) {
+		/* setting attributes for the first time, allocate now */
+		sd_iattr = kmalloc(sizeof(struct iattr), GFP_KERNEL);
+		if (!sd_iattr)
+			return -ENOMEM;
+		/* assign default attributes */
+		memset(sd_iattr, 0, sizeof(struct iattr));
+		sd_iattr->ia_mode = sd->s_mode;
+		sd_iattr->ia_uid = 0;
+		sd_iattr->ia_gid = 0;
+		sd_iattr->ia_atime = sd_iattr->ia_mtime = sd_iattr->ia_ctime =
+			CURRENT_TIME;
+		sd->s_iattr = sd_iattr;
+	}
+
+	/* attributes were changed at least once in the past */
+
+	if (ia_valid & ATTR_UID)
+		sd_iattr->ia_uid = iattr->ia_uid;
+	if (ia_valid & ATTR_GID)
+		sd_iattr->ia_gid = iattr->ia_gid;
+	if (ia_valid & ATTR_ATIME)
+		sd_iattr->ia_atime = timespec_trunc(iattr->ia_atime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_MTIME)
+		sd_iattr->ia_mtime = timespec_trunc(iattr->ia_mtime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_CTIME)
+		sd_iattr->ia_ctime = timespec_trunc(iattr->ia_ctime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_MODE) {
+		umode_t mode = iattr->ia_mode;
+
+		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+			mode &= ~S_ISGID;
+		sd_iattr->ia_mode = sd->s_mode = mode;
+	}
+
+	return error;
+}
+
+static struct inode_operations
sysfs_inode_operations = {
+	.setattr	= sysfs_setattr,
+};
+
+static struct super_block *sysfs_sb;
+
+static struct address_space_operations sysfs_aops = {
+	.readpage	= simple_readpage,
+	.prepare_write	= simple_prepare_write,
+	.commit_write	= simple_commit_write
+};
+
+static struct backing_dev_info sysfs_backing_dev_info = {
+	.ra_pages	= 0,	/*
[RFC][PATCH 3 of 4] Configfs is really sysfs
Configfs rewritten as a single file and updated to use kobjects instead of its own clone of kobjects (config_items).

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2.6.13-rc5-mm1/fs/configfs/Makefile
--- 2.6.13-rc5-mm1.clean/fs/configfs/Makefile	2005-08-09 18:23:30.0 -0400
+++ 2.6.13-rc5-mm1/fs/configfs/Makefile	2005-08-29 17:26:02.0 -0400
@@ -2,6 +2,5 @@
 #
 # Makefile for the configfs virtual filesystem
 #
-obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o
+obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o ddbond.config.o
-configfs-objs	:= inode.o file.o dir.o symlink.o mount.o item.o
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2.6.13-rc5-mm1/fs/configfs/configfs.c
--- 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c	2005-08-30 17:50:30.0 -0400
+++ 2.6.13-rc5-mm1/fs/configfs/configfs.c	2005-08-29 21:36:47.0 -0400
@@ -0,0 +1,1897 @@
+/*
+ * Based on sysfs:
+ *	sysfs Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/backing-dev.h>
+#include <linux/pagemap.h>
+#include <linux/configfs.h>
+
+#define CONFIGFS_ROOT		0x0001
+#define CONFIGFS_DIR		0x0002
+#define CONFIGFS_ITEM_ATTR	0x0004
+#define CONFIGFS_ITEM_LINK	0x0020
+#define CONFIGFS_USET_DIR	0x0040
+#define CONFIGFS_USET_DEFAULT	0x0080
+#define CONFIGFS_USET_DROPPING	0x0100
+#define CONFIGFS_NOT_PINNED	(CONFIGFS_ITEM_ATTR)
+
+struct sysfs_symlink {
+	struct list_head sl_list;
+	struct kobject *sl_target;
+};
+
+static inline struct kobject *to_kobj(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct kobject *)sd->s_element);
+}
+
+static inline struct attribute *to_attr(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct attribute *)sd->s_element);
+}
+
+static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+	struct kobject *kobj = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct sysfs_dirent *sd = dentry->d_fsdata;
+		if (sd->s_type & CONFIGFS_ITEM_LINK) {
+			struct sysfs_symlink *sl = sd->s_element;
+			kobj = kobject_get(sl->sl_target);
+		} else
+			kobj = kobject_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return kobj;
+}
+
+static kmem_cache_t *sysfs_dir_cachep;
+
+static void release_sysfs_dirent(struct sysfs_dirent *sd)
+{
+	if ((sd->s_type & CONFIGFS_ROOT))
+		return;
+	kmem_cache_free(sysfs_dir_cachep, sd);
+}
+
+static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd)
+{
+	if (sd) {
+		WARN_ON(!atomic_read(&sd->s_count));
+		atomic_inc(&sd->s_count);
+	}
+	return sd;
+}
+
+static void sysfs_put(struct sysfs_dirent *sd)
+{
+	WARN_ON(!atomic_read(&sd->s_count));
+	if (atomic_dec_and_test(&sd->s_count))
+		release_sysfs_dirent(sd);
+}
+
+/*
+ * inode.c - basic inode and dentry operations.
+ */
+
+static struct super_block *sysfs_sb;
+
+static struct address_space_operations sysfs_aops = {
+	.readpage	= simple_readpage,
+	.prepare_write	= simple_prepare_write,
+	.commit_write	= simple_commit_write
+};
+
+static struct backing_dev_info sysfs_backing_dev_info = {
+	.ra_pages	= 0,	/* No readahead */
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
+static struct inode *sysfs_new_inode(mode_t mode)
+{
+	struct inode *inode = new_inode(sysfs_sb);
+	if (inode) {
+		inode->i_blksize = PAGE_CACHE_SIZE;
+		inode->i_blocks = 0;
+		inode->i_mode = mode;
+		inode->i_uid = 0;
+		inode->i_gid = 0;
+		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		inode->i_mapping->a_ops = &sysfs_aops;
+		inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
+	}
+	return inode;
+}
+
+static int sysfs_create(struct dentry *dentry, int mode, int (*init)(struct inode *))
+{
+	int error = 0;
+	struct inode *inode = NULL;
+	if (dentry) {
+		if (!dentry->d_inode) {
+			if ((inode = sysfs_new_inode(mode))) {
+				if (dentry->d_parent &&
+				    dentry->d_parent->d_inode) {
+					struct inode *p_inode =
dentry->d_parent->d_inode;
+					p_inode->i_mtime = p_inode->i_ctime =
+						CURRENT_TIME;
+				}
+				goto Proceed;
+			} else
+
[RFC][PATCH 4 of 4] Configfs is really sysfs
A kernel code example that uses the modified configfs to define a simple configuration interface. Note the use of kobjects and ksets instead of config_items and config_groups. Also notice how much code is required to get a simple value from userspace to kernel space. This is a big problem that needs to be addressed soon, before we end up with tens or hundreds of thousands of lines of code bloat just to get and set variables from user space.

Regards,

Daniel

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/configfs.h>

struct ddbond_info {
	struct kobject item;
	int controlsock;
};

static inline struct ddbond_info *to_ddbond_info(struct kobject *item)
{
	return container_of(item, struct ddbond_info, item);
}

static ssize_t ddbond_info_attr_show(struct kobject *item,
		struct attribute *attr, char *page)
{
	ssize_t count;
	struct ddbond_info *ddbond_info = to_ddbond_info(item);

	count = sprintf(page, "%d\n", ddbond_info->controlsock);
	return count;
}

static ssize_t ddbond_info_attr_store(struct kobject *item,
		struct attribute *attr, const char *page, size_t count)
{
	struct ddbond_info *ddbond_info = to_ddbond_info(item);
	unsigned long tmp;
	char *p = (char *)page;

	tmp = simple_strtoul(p, &p, 10);
	if (!p || (*p && (*p != '\n')))
		return -EINVAL;
	if (tmp > INT_MAX)
		return -ERANGE;
	ddbond_info->controlsock = tmp;
	return count;
}

static void ddbond_info_release(struct kobject *item)
{
	kfree(to_ddbond_info(item));
}

static struct kobj_type ddbond_info_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_info_attr_show,
		.store = ddbond_info_attr_store,
		.release = ddbond_info_release,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "sockname",
			.mode = S_IRUGO | S_IWUSR,
		},
		NULL,
	},
	.ct_owner = THIS_MODULE,
};

static struct kobject *ddbond_make_item(struct kset *group, const char *name)
{
	struct ddbond_info *ddbond_info;

	if (!(ddbond_info = kcalloc(1, sizeof(struct ddbond_info), GFP_KERNEL)))
		return NULL;
	kobject_init_type_name(&ddbond_info->item, name, &ddbond_info_type);
	return &ddbond_info->item;
}

static ssize_t ddbond_description(struct kobject *item,
		struct attribute *attr, char *page)
{
	return sprintf(page,
		"A ddbond block server has two components: a userspace server and a kernel\n"
		"io daemon. First start the server and give it the name of a socket it will\n"
		"create, then create a ddbond directory and write the same name into the\n"
		"socket attribute\n");
}

static struct kobj_type ddbond_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_description,
	},
	.ct_group_ops = &(struct configfs_group_operations){
		.make_item = ddbond_make_item,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "description",
			.mode = S_IRUGO,
		},
		NULL,
	}
};

static struct subsystem ddbond_subsys = {
	.kset = {
		.kobj = {
			.k_name = "ddbond",
			.ktype = &ddbond_type,
		},
	},
};

static int __init init_ddbond_config(void)
{
	int ret;

	config_group_init(&ddbond_subsys.kset);
	init_rwsem(&ddbond_subsys.rwsem);
	if ((ret = configfs_register_subsystem(&ddbond_subsys)))
		printk(KERN_ERR "Error %d while registering subsystem %s\n",
			ret, ddbond_subsys.kset.kobj.k_name);
	return ret;
}

static void __exit exit_ddbond_config(void)
{
	configfs_unregister_subsystem(&ddbond_subsys);
}

module_init(init_ddbond_config);
module_exit(exit_ddbond_config);

MODULE_LICENSE("GPL");
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 08:59, Daniel Phillips wrote:
> -obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o
> +obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o ddbond.config.o

This should just be:

+obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o

However, the wrong version does provide a convenient way of compiling the example, I just... have... to... remember to delete it next time.

Regards,

Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Tuesday 30 August 2005 19:06, Stephen Hemminger wrote:
> On Wed, 31 Aug 2005 08:59:55 +1000 Daniel Phillips [EMAIL PROTECTED] wrote:
> > Configfs rewritten as a single file and updated to use kobjects
> > instead of its own clone of kobjects (config_items).
>
> Some style issues: Mixed case in labels

I certainly agree. This is strictly for comparison purposes and so I did not clean up the stylistic problems from the original... this time.

> Bad indentation

I did lindent it however :-)

> + Done:
>
> Why the mixed case label?

It shall die.

> +void config_group_init_type_name(struct kset *group, const char *name, struct kobj_type *type)
> +{
> + kobject_set_name(&group->kobj, name);
> + group->kobj.ktype = type;
> + config_group_init(group);
> +}
>
> Use tabs not one space for indent.

Urk. Kmail did that to me, it has been broken that way for a year or so. I will have to repost the whole set from a mailer that works.

Regards,

Daniel
Re: [RFC][PATCH 2 of 4] Configfs is really sysfs
(avoiding the kmail formatting problems this time.)

Sysfs rearranged as a single file for analysis purposes.
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:13, Joel Becker wrote:
> On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > But it would be stupid to forbid users from creating directories in
> > sysfs or to forbid kernel modules from directly tweaking a configfs
> > namespace. Why should the kernel not be able to add objects to a
> > directory a user created? It should be up to the module author to
> > decide these things.
>
> This is precisely why configfs is separate from sysfs. If both user
> and kernel can create objects, the lifetime of the object and its
> filesystem representation is very complex. Sysfs already has problems
> with people getting this wrong. configfs does not.

Could you please give a specific case?

> The fact that sysfs and configfs have similar backing stores does not
> make them the same thing.

It is not just the backing store, it is most of the code, all the structures, most of the functionality, a good deal of the bugs...

Regards,

Daniel
Re: [RFC][PATCH 4 of 4] Configfs is really sysfs
(without kmail bugs this time)

A kernel code example that uses the modified configfs to define a simple configuration interface.
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:25, Daniel Phillips wrote:
> On Wednesday 31 August 2005 09:13, Joel Becker wrote:
> > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > > But it would be stupid to forbid users from creating directories
> > > in sysfs or to forbid kernel modules from directly tweaking a
> > > configfs namespace. Why should the kernel not be able to add
> > > objects to a directory a user created? It should be up to the
> > > module author to decide these things.
> >
> > This is precisely why configfs is separate from sysfs. If both user
> > and kernel can create objects, the lifetime of the object and its
> > filesystem representation is very complex. Sysfs already has
> > problems with people getting this wrong. configfs does not.
>
> Could you please give a specific case?

More to the point: what makes you think that this apparent ruggedness will diminish after being re-integrated with sysfs? If you wish, you can avoid any dangers by not using sysfs's vfs-bypass api. It should be up to the subsystem author.

Regards,

Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:28, Andrew Morton wrote:
> Joel Becker [EMAIL PROTECTED] wrote:
> > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > > But it would be stupid to forbid users from creating directories
> > > in sysfs or to forbid kernel modules from directly tweaking a
> > > configfs namespace. Why should the kernel not be able to add
> > > objects to a directory a user created? It should be up to the
> > > module author to decide these things.
> >
> > This is precisely why configfs is separate from sysfs. If both user
> > and kernel can create objects, the lifetime of the object and its
> > filesystem representation is very complex. Sysfs already has
> > problems with people getting this wrong. configfs does not.
> >
> > The fact that sysfs and configfs have similar backing stores does
> > not make them the same thing.
>
> Sure, but all that copying-and-pasting really sucks. I'm sure there's
> some way of providing the slightly different semantics from the same
> codebase?

I will have that patch ready later this week.

Regards,

Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:34, [EMAIL PROTECTED] wrote:
> On Tue, Aug 30, 2005 at 04:28:46PM -0700, Andrew Morton wrote:
> > Sure, but all that copying-and-pasting really sucks. I'm sure
> > there's some way of providing the slightly different semantics from
> > the same codebase?
>
> Careful - you've almost reinvented the concept of library, which
> would violate any number of patents...

I will keep my eyes open for library candidates as I go. For example, the binary blob operations really cry out for it.

Regards,

Daniel
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Monday 22 August 2005 00:49, Eric W. Biederman wrote:
> I am confused. I am beginning to see shades of the devfs problems
> coming up again. sysfs is built to be world readable by everyone who
> has it mounted in their namespace. Writable files in sysfs I have
> never understood.

Sysfs is not like devfs by nature, it is more like procfs. It exposes properties of a device, not the device itself. It makes perfect sense that some of the properties should be writeable.

> Given that we now have files which do not conform to one uniform
> policy for everyone is there any reason why we do not want to allocate
> a character device major number for all config values and dynamically
> allocate a minor number for each config value? Giving each config
> value its own unique entry under /dev.

/dev is already busy enough without adding masses of entries that are not devices. I don't see that this would simplify the internal implementation either, the opposite actually. The user certainly will not have any use for temporary device numbers in this context.

On the other hand, it is clunky to force an application to go through the same parse/format interface as the user just to get/set a simple integer. Perhaps sysfs needs to be taught how to ioctl these properties. I see exposing property names and operating on them as orthogonal issues that are currently joined at the hip in an unnatural, but fixable way.

> Device nodes for each writable config value trivially handles
> persistence and user policy and should be easy to implement in the
> kernel. We already have a policy engine in userspace, udev, to handle
> all of the chaos.
>
> Why do we need another mechanism?

We need the mechanism that exposes subsystem instance properties as they appear and disappear with changing configuration. This is a new mechanism anyway, so implementing it using device nodes does not save anything, it only introduces a new requirement to allocate device numbers.

> Are device nodes out of fashion these days?

They are, at least for putting things in /dev that are not actual hardware.

Regards,

Daniel
[PATCH] Permissions don't stick on ConfigFS attributes (revised)
On Saturday 20 August 2005 13:01, Greg KH wrote:
> On Sat, Aug 20, 2005 at 10:50:51AM +1000, Daniel Phillips wrote:
> > Permissions set on ConfigFS attributes (aka files) do not stick.
>
> The recent changes to sysfs should be ported to configfs to do this.

No, it should go the other way; my fix is better. It would not require sysfs to have its own version of setattr. What I do like about Maneesh's fix is the handling of other inode attributes besides mode flags, but that is a detail; let's get the structural elements right first.

The revised patch fixes the vanishing permissions bug and kills the configfs bogon that made my first attempt subtly wrong (it changed permissions for all attribute files instead of just the chmoded one).

diff -up --recursive 2.6.12-mm2.clean/fs/configfs/dir.c 2.6.12-mm2/fs/configfs/dir.c
--- 2.6.12-mm2.clean/fs/configfs/dir.c	2005-08-12 00:53:06.0 -0400
+++ 2.6.12-mm2/fs/configfs/dir.c	2005-08-20 16:16:34.0 -0400
@@ -64,6 +64,17 @@ static struct dentry_operations configfs
 	.d_delete = configfs_d_delete,
 };
 
+static int configfs_d_delete_attr(struct dentry *dentry)
+{
+	((struct configfs_dirent *)dentry->d_fsdata)->s_mode = dentry->d_inode->i_mode;
+	return 1;
+}
+
+static struct dentry_operations configfs_attr_dentry_ops = {
+	.d_delete = configfs_d_delete_attr,
+	.d_iput = configfs_d_iput,
+};
+
 /*
  * Allocates a new configfs_dirent and links it to the parent configfs_dirent
  */
@@ -238,14 +249,11 @@ static void configfs_remove_dir(struct c
  */
 static int configfs_attach_attr(struct configfs_dirent * sd, struct dentry * dentry)
 {
-	struct configfs_attribute * attr = sd->s_element;
-	int error;
-
-	error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+	int error = configfs_create(dentry, sd->s_mode, init_file);
 	if (error)
 		return error;
 
-	dentry->d_op = &configfs_dentry_ops;
+	dentry->d_op = &configfs_attr_dentry_ops;
 	dentry->d_fsdata = configfs_get(sd);
 	sd->s_dentry = dentry;
 	d_rehash(dentry);
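The mechanism in the patch above is small but easy to misread: the dentry for an attribute file is transient, so a chmod would be lost when the dentry is dropped unless the mode is copied back into the long-lived configfs_dirent at d_delete time and reused when the file is looked up again. A userspace model of that flow (illustrative only; `Dirent` and `Dentry` are stand-ins, not kernel APIs):

```python
class Dirent:
    """Stand-in for struct configfs_dirent: the long-lived object
    that now caches the attribute's mode in s_mode."""
    def __init__(self, mode):
        self.s_mode = mode

class Dentry:
    """Stand-in for the transient dentry + inode pair."""
    def __init__(self, dirent):
        self.dirent = dirent
        # configfs_attach_attr() now creates the inode from sd->s_mode.
        self.i_mode = dirent.s_mode

    def chmod(self, mode):
        self.i_mode = mode

    def delete(self):
        # configfs_d_delete_attr(): copy the possibly-changed mode
        # back into the dirent before the dentry goes away.
        self.dirent.s_mode = self.i_mode

sd = Dirent(0o644)
d = Dentry(sd)
d.chmod(0o600)      # user chmods the attribute file
d.delete()          # dentry dropped (e.g. memory pressure)
d2 = Dentry(sd)     # attribute looked up again: mode sticks
print(oct(d2.i_mode))
```

Without the copy-back in `delete()`, the second lookup would recreate the file with the original 0o644, which is exactly the "permissions don't stick" bug.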
Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
On Saturday 20 August 2005 20:45, David Howells wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > Biased. Fs is a mixed case acronym, nuff said.
>
> But I'm still right :-)

Of course you are! We're only impugning your taste, not your logic ;-)

OK, the questions re your global consistency model are a bazillion times more significant. I have not forgotten about that, please stay tuned.

Regards,

Daniel
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Saturday 20 August 2005 16:31, Joel Becker wrote:
> On Fri, Aug 19, 2005 at 08:01:17PM -0700, Greg KH wrote:
> > The recent changes to sysfs should be ported to configfs to do this.
>
> Yeah, I've been meaning to do something, and reusing code is
> always a good plan.

Ending up with the same code in two different places in the core kernel is always a bad plan. Oh man. Just look at these two bodies of code; configfs is mostly just large tracts that are identical to sysfs except for name changes. Listen to what the code is trying to tell you!

SysFS:

struct kobject {
	const char		*k_name;
	char			name[KOBJ_NAME_LEN];
	struct kref		kref;
	struct list_head	entry;
	struct kobject		*parent;
	struct kset		*kset;
	struct kobj_type	*ktype;
	struct dentry		*dentry;
};

ConfigFS:

struct config_item {
	char			*ci_name;
	char			ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
	struct kref		ci_kref;
	struct list_head	ci_entry;
	struct config_item	*ci_parent;
	struct config_group	*ci_group;
	struct config_item_type	*ci_type;
	struct dentry		*ci_dentry;
};

Big difference, huh? As the designer of configfs, could you please offer your take on why it cannot be rolled back into sysfs, considering that it is mostly identical already?

Regards,

Daniel
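The parallel between the two structs is easy to check mechanically. A quick sketch (field names transcribed from the two structs quoted above; only the `ci_` namespace prefix is stripped before comparing):

```python
# Field lists transcribed from struct kobject and struct config_item.
kobject     = ["k_name", "name", "kref", "entry",
               "parent", "kset", "ktype", "dentry"]
config_item = ["ci_name", "ci_namebuf", "ci_kref", "ci_entry",
               "ci_parent", "ci_group", "ci_type", "ci_dentry"]

def stem(field):
    # Drop the ci_/k_ namespace prefix to compare the underlying names.
    for prefix in ("ci_", "k_"):
        if field.startswith(prefix):
            return field[len(prefix):]
    return field

pairs = list(zip(map(stem, kobject), map(stem, config_item)))
same = [a for a, b in pairs if a == b]
print(f"{len(same)} of {len(pairs)} fields identical modulo prefix: {same}")
```

The remaining mismatches (name/namebuf, kset/group, ktype/type) are renames of corresponding fields, not structural differences, which is the point being made to Joel.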
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Saturday 20 August 2005 11:22, Jon Smirl wrote:
> A patch for making sysfs attributes persistent has recently made it
> into Linus' tree.
>
> http://article.gmane.org/gmane.linux.hotplug.devel/7927/match=sysfs+permissions

Interesting, it handles more than just the file mode. But does anybody really care about the ctime/atime/mtime in sysfs? I can see how uid and gid could be useful.

My way of handling this (copying out the potentially changed iattrs when the dentry is destroyed) looks more compact than Maneesh's solution while being no less effective, once I get it right, that is. Does sysfs really need its own setattr?

A quibble: we normally use the term persistent to mean "saved on permanent storage". Going by that, Maneesh just fixed a bug and did not make iattrs persistent.

Regards,

Daniel