Re: Distributed storage.
Hi Mike,

On Thursday 02 August 2007 21:09, Mike Snitzer wrote:
> But NBD's synchronous nature is actually an asset when coupled with MD
> raid1 as it provides guarantees that the data has _really_ been
> mirrored remotely.

And bio completion doesn't?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Distributed storage.
On Friday 03 August 2007 03:26, Evgeniy Polyakov wrote:
> On Thu, Aug 02, 2007 at 02:08:24PM -0700, I wrote:
> > I see bits that worry me, e.g.:
> >
> >     + req = mempool_alloc(st->w->req_pool, GFP_NOIO);
> >
> > which seems to be callable in response to a local request, just the
> > case where NBD deadlocks. Your mempool strategy can work reliably
> > only if you can prove that the pool allocations of the maximum
> > number of requests you can have in flight do not exceed the size of
> > the pool. In other words, if you ever take the pool's fallback path
> > to normal allocation, you risk deadlock.
>
> The mempool should be allocated to be able to catch up with the
> maximum number of in-flight requests; in my tests I was unable to
> force the block layer to put more than 31 pages in sync, but in one
> bio. Each request is essentially delayed bio processing, so this must
> handle the maximum number of in-flight bios (if they do not cover
> multiple nodes; if they do, then each node requires its own request).

It depends on the characteristics of the physical and virtual block
devices involved. Slow block devices can produce surprising effects.
Ddsnap still qualifies as slow under certain circumstances (a big linear
write immediately following a new snapshot). Before we added throttling
we would see as many as 800,000 bios in flight. Nice to know the system
can actually survive this... mostly. But memory deadlock is a clear and
present danger under those conditions and we did hit it (not to mention
that read latency sucked beyond belief). Anyway, we added a simple
counting semaphore to throttle the bio traffic to a reasonable number
and behavior became much nicer, but most importantly, this satisfies
one of the primary requirements for avoiding block device memory
deadlock: a strictly bounded amount of bio traffic in flight.

In fact, we allow some bounded number of non-memalloc bios *plus*
however much traffic the mm wants to throw at us in memalloc mode, on
the assumption that the mm knows what it is doing and imposes its own
bound on in-flight bios per device. This needs auditing obviously, but
the mm either does that or is buggy. In practice, with this throttling
in place we never saw more than 2,000 in flight no matter how hard we
hit it, which is about the number we were aiming at. Since we draw our
reserve from the main memalloc pool, we can easily handle 2,000 bios in
flight, even under extreme conditions. See:

   http://zumastor.googlecode.com/svn/trunk/ddsnap/kernel/dm-ddsnap.c
   down(&info->throttle_sem);

To be sure, I am not very proud of this throttling mechanism for
various reasons, but the thing is, _any_ throttling mechanism no matter
how sucky solves the deadlock problem. Over time I want to move the
throttling up into bio submission proper, or perhaps incorporate it in
device mapper's queue function, not quite as high up the food chain.
Only some stupid little logistical issues stopped me from doing it one
of those ways right from the start. I think Peter has also tried some
things in this area. Anyway, that part is not pressing because the
throttling can be done in the virtual device itself as we do it, even
if it is not very pretty there.

The point is: you have to throttle the bio traffic. The alternative is
to die a horrible death under conditions that may be rare, but _will_
hit somebody.

Regards,

Daniel
Re: Distributed storage.
On Tuesday 31 July 2007 10:13, Evgeniy Polyakov wrote:
> Hi.
>
> I'm pleased to announce the first release of the distributed storage
> subsystem, which allows one to form a storage on top of remote and
> local nodes, which in turn can be exported to another storage as a
> node to form tree-like storages.

Excellent! This is precisely what the doctor ordered for the OCFS2-based
distributed storage system I have been mumbling about for some time. In
fact the dd in ddsnap and ddraid stands for "distributed data". The
ddsnap/raid devices do not include an actual network transport; that is
expected to be provided by a specialized block device, which up till now
has been NBD. But NBD has various deficiencies as you note, in addition
to its tendency to deadlock when accessed locally. Your new code base
may be just the thing we always wanted. We (zumastor et al) will take it
for a drive and see if anything breaks.

Memory deadlock is a concern of course. From a cursory glance through,
it looks like this code is pretty vm-friendly and you have thought quite
a lot about it; however I respectfully invite peterz (obsessive/
compulsive memory deadlock hunter) to help give it a good going over
with me. I see bits that worry me, e.g.:

    + req = mempool_alloc(st->w->req_pool, GFP_NOIO);

which seems to be callable in response to a local request, just the case
where NBD deadlocks. Your mempool strategy can work reliably only if you
can prove that the pool allocations of the maximum number of requests
you can have in flight do not exceed the size of the pool. In other
words, if you ever take the pool's fallback path to normal allocation,
you risk deadlock.

Anyway, if this is as grand as it seems then I would think we ought to
factor out a common transfer core that can be used by all of NBD, iSCSI,
ATAoE and your own kernel server, in place of the roll-yer-own code
those things have now.

Regards,

Daniel
Re: CFS review
Hi Linus,

On Wednesday 01 August 2007 19:17, Linus Torvalds wrote:
> And the "approximates" thing would be about the fact that we don't
> actually care about "absolute" microseconds as much as something that
> is in the "roughly a microsecond" area. So if we say "it doesn't have
> to be microseconds, but it should be within a factor of two of a ms",
> we could avoid all the expensive divisions (even if they turn into
> multiplications with reciprocals), and just let people *shift* the
> CPU counter instead.

On that theme, expressing the subsecond part of high precision time in
decimal instead of left-aligned binary always was an insane idea.
Applications end up with silly numbers of multiplies and divides (likely
as not incorrect) whereas they would often just need a simple shift as
you say, if the tv struct had been defined sanely from the start. As a
bonus, whenever precision gets bumped up, the new bits appear in
formerly zero locations on the right, meaning little if any code needs
to change.

What we have in the incumbent libc timeofday scheme is the moral
equivalent of BCD. Of course libc is unlikely ever to repent, but we can
at least put off converting into the awkward decimal format until the
last possible instant. In other words, I do not see why xtime is
expressed as a tv instead of simple 32.32 fixed point. Perhaps somebody
can elucidate me?

Regards,

Daniel
Re: [PATCH RFC] extent mapped page cache
On Tuesday 10 July 2007 14:03, Chris Mason wrote:
> This patch aims to demonstrate one way to replace buffer heads with a
> few extent trees...

Hi Chris,

Quite terse commentary on algorithms and data structures, but I suppose
that is not a problem because Jon has a whole week to reverse engineer
it for us.

What did you have in mind for subpages?

Regards,

Daniel
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
On Wednesday 11 July 2007 15:09, Neil Brown wrote:
> > > Has anyone fixed the infrequent crashes with 4K stacks and ext3
> > > -> LVM snapshot -> LVM -> DM mirror -> libata?
> >
> > Ahem: ext3 -> LVM snapshot -> LVM -> DM mirror -> DM crypt -> md ->
> > libata, or worse.
> >
> > No, it's not fixed. The model is wrong. Virtual block drivers
> > should not be calling submit_bio. The recursive IO submissions
> > should be handled on a dedicated stack, most probably allocated as
> > part of the request queue. This could be done easily in device
> > mapper and md, or better, in submit_bio.
>
> Maybe you should read the latest kernel source code. Particularly
> generic_make_request in block/ll_rw_blk.c.

And plus you've had that one sitting around since 2005, hats off for
nailing the issue from way out. Sorry for missing the action, I was
elsewhere.

Niggles begin here. I'm not sure I like the additional task_struct
encumbrance when the functions themselves could sort it out, albeit with
an API change affecting a gaggle of md and dm drivers. Hopefully there
are other users of the bio list fields, otherwise I would point out that
a per-queue stack is less memory than two per-bio fields. I didn't go
delving that far.

The pointer to the description of the barrier deadlock is not right: it
points to the problem report when it really ought to point to the
definitive analysis and include a subject line, because list archives
come and go:

   [PATCH] block: always requeue !fs requests at the front
   http://thread.gmane.org/gmane.linux.kernel/537473

Is there a good reason why we should not just put the whole analysis
from Tejun Heo in as a comment? It is terse enough.

In other words, looks good to me :)

Regards,

Daniel
Re: [PATCH][RFC] 4K stacks default, not a debug thing any more...?
On Wednesday 11 July 2007 10:54, Zan Lynx wrote:
> Jesper Juhl wrote:
> > Hi,
> >
> > I'm wondering if it's time to make 4K stacks the default and to
> > start considering removing the 8K stack option altogether soon?
> >
> > One of the big problem spots was XFS, but that got some stack usage
> > fixes recently, and the 4K stack option has been around for quite a
> > while now, so people really should have gotten around to fixing any
> > code that can't handle it. Are there still any big problem areas
> > remaining?
>
> Has anyone fixed the infrequent crashes with 4K stacks and ext3 ->
> LVM snapshot -> LVM -> DM mirror -> libata?

Ahem: ext3 -> LVM snapshot -> LVM -> DM mirror -> DM crypt -> md ->
libata, or worse.

No, it's not fixed. The model is wrong. Virtual block drivers should
not be calling submit_bio. The recursive IO submissions should be
handled on a dedicated stack, most probably allocated as part of the
request queue. This could be done easily in device mapper and md, or
better, in submit_bio.

Regards,

Daniel
Re: [patch] CFS scheduler, -v9
Hi Ingo,

I just thought I would mention this, because it is certainly on my mind.
I can't help wondering if other folks are also concerned about this.

The thing is, why don't you just send your patches to Con, who got this
whole ball rolling and did a bunch of great work, proving beyond any
reasonable doubt that he is capable of maintaining this subsystem,
whatever algorithm is finally adopted? Are you worried that Con might
steal your thunder? That somehow the scheduler is yours alone? That you
might be perceived as less of a genius if somebody else gets credit for
their good work? NIH?

My perception is that you barged in to take over just when Con got
things moving after the scheduler sat and rotted for several years. If
that is in any way accurate, then shame on you.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 15:52, Daniel Phillips wrote:

Ah, there's another issue: an interrupt can come in when esp is on the
ndis stack and above THREAD_SIZE, so do_IRQ will not find thread_info.
Sorry, this one is nasty.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
> > Is there a technical reason ("hard to implement" is a practical
> > reason) why all stacks need to be the same size?
>
> Because of
>
>     static inline struct thread_info *current_thread_info(void)
>     {
>             struct thread_info *ti;
>             __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
>             return ti;
>     }
>     [include/asm-i386/thread_info.h]
>
> which assumes that it can "round down" the stack pointer and then will
> find the thread_info of the current context there. Only works for
> identically sized stacks. Note that this function is heavily used in
> the kernel, either directly or indirectly. You cannot avoid it.
>
> My current assessment regarding differently sized threads for
> ndiswrapper: not feasible with vanilla kernels.

If so, it is not because of this. It just means you have to go back to
the idea of switching back to the original stack when the Windows driver
calls into the ndis API. (It must have been way too late last night when
I claimed the second stack switch wasn't necessary.) Other issues:

- Use a semaphore to serialize access to a single ndis stack... any
  spinlock or interrupt state issues? (I didn't notice any.)

- Copy parameters across the stack switch - a little tricky, but far
  from the trickiest bit of glue in the kernel

- Preempt - looks like it has to be disabled from switching to the ndis
  stack to switching back because of the thread_info problem

- It is best for Linux when life is a little hard for binary-only
  drivers, but not completely impossible.

When the smoke clears, ndiswrapper will be slightly slower than before
and we will be slightly closer to having some native drivers. In the
meantime, keeping the thing alive without impacting core is an
interesting puzzle.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 00:16, Daniel Phillips wrote:
> ...as long as ->task and ->previous_esp are initialized, staying on
> the bigger stack looks fine (previous_esp is apparently used only for
> backtrace) ... just like do_IRQ.

Ahem, but let me note before somebody else does: it isn't interrupt
context, it is normal process context - while an interrupt can ignore
most of the thread_info fields, a normal process has to worry about all
9. To be on the safe side, the first 8 need to be copied into and out
of the ndis stack, with preempt disabled until after the stack switch.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 21:59, Mark Lord wrote:
> Daniel Phillips wrote:
> > There are only two stacks involved, the normal kernel stack and
> > your new ndis stack. You save ESP of the kernel stack at the base
> > of the ndis stack. When the Windows code calls your api, you get
> > the ndis ESP, load the kernel ESP from the base of the ndis stack,
> > push the ndis ESP so you can get back to the ndis code later, and
> > continue on your merry way.

I must have been smoking something when I convinced myself that the
driver can't call into the kernel without switching back to the kernel
stack. But this is wrong: as long as ->task and ->previous_esp are
initialized, staying on the bigger stack looks fine (previous_esp is
apparently used only for backtrace).

> With CONFIG_PREEMPT, this will still cause trouble due to lack of
> "current" task info on the NDIS stack.
>
> One option is to copy (duplicate) the bottom-of-stack info when
> switching to the NDIS stack.

Yes, just like do_IRQ.

> Another option is to stick a Mutex around any use of the NDIS stack
> when calling into the foreign driver (might be done like this
> already??),

There is no mutex now, but this is the easy way to get by with just one
ndis stack.

> which will prevent PREEMPTion during the call.

We have preempt_enable/disable for that. But I am not sure preemption
needs to be disabled.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 18:28, Roland Dreier wrote:
>     Daniel> There are only two stacks involved, the normal kernel
>     Daniel> stack and your new ndis stack. You save ESP of the kernel
>     Daniel> stack at the base of the ndis stack. When the Windows
>     Daniel> code calls your api, you get the ndis ESP, load the kernel
>     Daniel> ESP from the base of the ndis stack, push the ndis ESP so
>     Daniel> you can get back to the ndis code later, and continue on
>     Daniel> your merry way.
>
> [...]
>
>     Daniel> You will allocate your own stack once on driver
>     Daniel> initialization.
>
> I'm not quite sure it's this trivial. Obviously there are more than
> two stacks involved, since there is more than one kernel stack! (One
> per task plus IRQ stacks) This is more than just a theoretical
> problem. It seems entirely possible that more than one task could be
> in the driver, and clearly they each need their own stack.

Semaphore :-)

Do you expect this to be heavily contended? On a very quick run through
the code, it seems you don't hold any spinlocks going into the driver
from process context.

Interrupts... they better fit into a 4K stack or it's game over.

Preemption while on the ndis stack... you can always disable preemption
in this region, but the semaphore should protect you. Task killed while
preempted... I dunno.

> So it's going to be at least a little harder than allocating a single
> stack for NDIS use when the driver starts up.
>
> I personally like the idea raised elsewhere in this thread of running
> the Windows driver in userspace by proxying interrupts, PCI access,
> etc. That seems more robust and probably allows some cool reverse
> engineering hacks.

I expect the userspace approach will be a lot more work and a lot more
overhead too, but then again it sounds like loads of fun.

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 18:21, Andi Kleen wrote:
> On Wednesday 07 September 2005 00:19, Daniel Phillips wrote:
> > Andi, their stack will have to have a valid thread_info->task
> > because interrupts will use it. Out of interest, could you please
> > explain what for?
>
> No, with 4k stacks interrupts run on their own stack with their own
> thread_info. Or rather they mostly do. Currently do_IRQ does
> irq_enter, which refers to thread_info before switching to the
> interrupt stack; that order would likely need to be exchanged.

But then how would thread_info->task on the irq stack ever get
initialized?

My "what for" question was re why interrupt routines even need a valid
current. I see one answer out there on the web: statistical profiling.
Is that it?

Regards,

Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 13:23, Giridhar Pemmasani wrote: > Jan Kiszka wrote: > > The only way I see is to switch stacks back on ndiswrapper API entry. > > But managing all those stacks correctly is challenging, There are only two stacks involved, the normal kernel stack and your new ndis stack. You save ESP of the kernel stack at the base of the ndis stack. When the Windows code calls your api, you get the ndis ESP, load the kernel ESP from the base of the ndis stack, push the ndis ESP so you can get back to the ndis code later, and continue on your merry way. > > as you will likely not want to create a new stack on each switching > > point... You will allocate your own stack once on driver initialization. > This is what I had in mind before I saw this thread here. I, in fact, did > some work along those lines, but it is even more complicated than you > mentioned here: Windows uses different calling conventions (STDCALL, > FASTCALL, CDECL) so switching stacks by copying arguments/results gets > complicated. I missed something there. You would switch stacks before calling the Windows code and after the Windows code calls you (and respective returns) so you are always in your own code when you switch, hence you know how to copy the parameters. > I am still hoping that Andi's approach is possible (I don't understand how > we can make kernel see current info from private stack). He suggested you use your own private variant of current which would presumably read a copy of current you stored at the bottom of your own stack. But I don't see why your code would ever need current while you are on the private ndis stack. Andi, their stack will have to have a valid thread_info->task because interrupts will use it. Out of interest, could you please explain what for? Code like u32 stack[THREAD_SIZE/sizeof(u32)] is violated by a different sized stack, but apparently not in any way that matters. By the way, I use ndiswrapper, thanks a lot you guys! 
Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote: > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote: > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > > > do you think it is a bit premature to dismiss something even without > > > ever seeing the code? > > > > You told me you are using a dlm for a single-node application, is there > > anything more I need to know? > > I would still like to know why you consider it a "sin". On OpenVMS it is > fast, provides a way of cleaning up... There is something hard about handling EPIPE? > and does not introduce single point > of failure as it is the case with a daemon. And if we ever want to spread > the load between 2 boxes we easily can do it. But you said it runs on an aging Alpha, surely you do not intend to expand it to two aging Alphas? And what makes you think that socket-based synchronization keeps you from spreading out the load over multiple boxes? > Why would I not want to use it? It is not the right tool for the job from what you have told me. You want to get a few bytes of information from one task to another? Use a socket, as God intended. Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote: > do you think it is a bit premature to dismiss something even without > ever seeing the code? You told me you are using a dlm for a single-node application, is there anything more I need to know? Regards, Daniel
Re: RFC: i386: kill !4KSTACKS
On Tuesday 06 September 2005 21:59, Mark Lord wrote: > Daniel Phillips wrote: > > There are only two stacks involved, the normal kernel stack and your new > > ndis stack. You save ESP of the kernel stack at the base of the ndis > > stack. When the Windows code calls your api, you get the ndis ESP, load > > the kernel ESP from the base of the ndis stack, push the ndis ESP so you > > can get back to the ndis code later, and continue on your merry way. I must have been smoking something when I convinced myself that the driver can't call into the kernel without switching back to the kernel stack. But this is wrong: as long as ->task and ->previous_esp are initialized, staying on the bigger stack looks fine (previous_esp is apparently used only for backtrace). > With CONFIG_PREEMPT, this will still cause trouble due to lack of current > task info on the NDIS stack. One option is to copy (duplicate) the > bottom-of-stack info when switching to the NDIS stack. Yes, just like do_IRQ. > Another option is to stick a Mutex around any use of the NDIS stack when > calling into the foreign driver (might be done like this already??), There is no mutex now, but this is the easy way to get by with just one ndis stack. > which will prevent PREEMPTion during the call. We have preempt_enable/disable for that. But I am not sure preemption needs to be disabled. Regards, Daniel
Re: RFC: i386: kill !4KSTACKS
On Wednesday 07 September 2005 00:16, Daniel Phillips wrote: > ...as long as ->task and ->previous_esp are initialized, staying on the > bigger stack looks fine (previous_esp is apparently used only for > backtrace) ... just like do_IRQ. Ahem, but let me note before somebody else does: it isn't interrupt context, it is normal process context - while an interrupt can ignore most of the thread_info fields, a normal process has to worry about all 9. To be on the safe side, the first 8 need to be copied into and out of the ndis stack, with preempt disabled until after the stack switch. Regards, Daniel
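That copy discipline can be sketched as follows. This is a Python model for illustration only (real code would be C with preemption disabled around the switch); the field names follow the 2.6-era i386 struct thread_info, with previous_esp set separately rather than copied since it only records the old stack for backtraces.

```python
# The eight fields a normal process context has to keep valid at the
# base of whatever stack it runs on (names per 2.6-era i386).
THREAD_INFO_FIELDS = (
    "task", "exec_domain", "flags", "status",
    "cpu", "preempt_count", "addr_limit", "restart_block",
)

def enter_ndis_stack(kernel_ti, ndis_ti):
    # Copy the live fields into the base of the private stack so any
    # code that derives `current` from the stack pointer finds valid
    # data.  Preemption would be disabled across this copy.
    for f in THREAD_INFO_FIELDS:
        ndis_ti[f] = kernel_ti[f]

def leave_ndis_stack(kernel_ti, ndis_ti):
    # Copy any changes back out before resuming on the kernel stack.
    for f in THREAD_INFO_FIELDS:
        kernel_ti[f] = ndis_ti[f]

kernel_ti = {f: f.upper() for f in THREAD_INFO_FIELDS}
ndis_ti = {}
enter_ndis_stack(kernel_ti, ndis_ti)
ndis_ti["flags"] = "TIF_NEED_RESCHED"   # changed while on the ndis stack
leave_ndis_stack(kernel_ti, ndis_ti)
```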
Re: GFS, what's remaining
On Monday 05 September 2005 19:37, Joel Becker wrote: > OCFS2, the new filesystem, is fully general purpose. It > supports all the usual stuff, is quite fast... So I have heard, but isn't it time to quantify that? How do you think you would stack up here: http://www.caspur.it/Files/2005/01/10/1105354214692.pdf Regards, Daniel
Re: GFS, what's remaining
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote: > On Monday 05 September 2005 23:02, Daniel Phillips wrote: > > By the way, you said "alpha server" not "alpha servers", was that just a > > slip? Because if you don't have a cluster then why are you using a dlm? > > No, it is not a slip. The application is running on just one node, so we > do not really use "distributed" part. However we make heavy use of the > rest of lock manager features, especially lock value blocks. Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature without even having the excuse you were forced to use it. Why don't you just have a daemon that sends your values over a socket? That should be all of a day's coding. Anyway, thanks for sticking your head up, and sorry if it sounds aggressive. But you nicely supported my claim that most who think they should be using a dlm, really shouldn't. Regards, Daniel
Re: GFS, what's remaining
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote: > On Monday 05 September 2005 19:57, Daniel Phillips wrote: > > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > > > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > > > The only current users of dlms are cluster filesystems. There > > > > > > are zero users of the userspace dlm api. > > > > > > > > > > That is incorrect... > > > > > > > > Application users Lars, sorry if I did not make that clear. The > > > > issue is whether we need to export an all-singing-all-dancing dlm api > > > > from kernel to userspace today, or whether we can afford to take the > > > > necessary time to get it right while application writers take their > > > > time to have a good think about whether they even need it. > > > > > > If Linux fully supported OpenVMS DLM semantics we could start thinking > > > about moving our application onto a Linux box because our alpha server > > > is aging. > > > > > > That's just my user application writer $0.02. > > > > What stops you from trying it with the patch? That kind of feedback > > would be worth way more than $0.02. > > We do not have such plans at the moment and I prefer spending my free > time on tinkering with kernel, not rewriting some in-house application. > Besides, DLM is not the only thing that does not have a drop-in > replacement in Linux. > > You just said you did not know if there are any potential users for the > full DLM and I said there are some. I did not say "potential", I said there are zero dlm applications at the moment. Nobody has picked up the prototype (g)dlm api, used it in an application and said "gee this works great, look what it does". I also claim that most developers who think that using a dlm for application synchronization would be really cool are probably wrong. 
Use sockets for synchronization exactly as for a single-node, multi-tasking application and you will end up with less code, more obviously correct code, probably more efficient and... you get an optimal, single-node version for free. And I also claim that there is precious little reason to have a full-featured dlm in-kernel. Being in-kernel has no benefit for a userspace application. But being in-kernel does add kernel bloat, because there will be extra features lathered on that are not needed by the only in-kernel user, the cluster filesystem. In the case of your port, you'd be better off hacking up a userspace library to provide OpenVMS dlm semantics exactly, not almost. By the way, you said "alpha server" not "alpha servers", was that just a slip? Because if you don't have a cluster then why are you using a dlm? Regards, Daniel
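The socket argument in miniature: a sketch (Python for brevity; the protocol and names are invented for illustration) of two tasks synchronizing by passing a few bytes over a socketpair. The same structure works unchanged on a single node, and swapping the socketpair for an AF_INET connection spreads it across boxes.

```python
import socket
import threading

# Two connected endpoints; replace with an AF_INET socket to go
# multi-node with no change to the synchronization logic.
server, client = socket.socketpair()

def coordinator():
    # The "lock daemon" side: wait for a request, grant the resource.
    req = server.recv(16)
    server.sendall(b"GRANT " + req)

t = threading.Thread(target=coordinator)
t.start()
client.sendall(b"resource42")   # ask for the resource by name
reply = client.recv(32)         # blocks until the grant arrives
t.join()
```

The blocking `recv()` plays the role of a blocking lock request; a lock value block would just be a few more bytes carried in the grant message.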
Re: GFS, what's remaining
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote: > On Monday 05 September 2005 10:49, Daniel Phillips wrote: > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > > > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > > > The only current users of dlms are cluster filesystems. There are > > > > zero users of the userspace dlm api. > > > > > > That is incorrect... > > > > Application users Lars, sorry if I did not make that clear. The issue is > > whether we need to export an all-singing-all-dancing dlm api from kernel > > to userspace today, or whether we can afford to take the necessary time > > to get it right while application writers take their time to have a good > > think about whether they even need it. > > If Linux fully supported OpenVMS DLM semantics we could start thinking > about moving our application onto a Linux box because our alpha server is > aging. > > That's just my user application writer $0.02. What stops you from trying it with the patch? That kind of feedback would be worth way more than $0.02. Regards, Daniel
Re: GFS, what's remaining
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote: > On 2005-09-03T01:57:31, Daniel Phillips <[EMAIL PROTECTED]> wrote: > > The only current users of dlms are cluster filesystems. There are zero > > users of the userspace dlm api. > > That is incorrect... Application users Lars, sorry if I did not make that clear. The issue is whether we need to export an all-singing-all-dancing dlm api from kernel to userspace today, or whether we can afford to take the necessary time to get it right while application writers take their time to have a good think about whether they even need it. > ...and you're contradicting yourself here: How so? Above talks about dlm, below talks about cluster membership. > > What does have to be resolved is a common API for node management. It is > > not just cluster filesystems and their lock managers that have to > > interface to node management. Below the filesystem layer, cluster block > > devices and cluster volume management need to be coordinated by the same > > system, and above the filesystem layer, applications also need to be > > hooked into it. This work is, in a word, incomplete. Regards, Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Monday 05 September 2005 05:19, Andrew Morton wrote: > David Teigland <[EMAIL PROTECTED]> wrote: > > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote: > > > David Teigland <[EMAIL PROTECTED]> wrote: > > > > We export our full dlm API through read/write/poll on a misc device. > > > > > > inotify did that for a while, but we ended up going with a straight > > > syscall interface. > > > > > > How fat is the dlm interface? ie: how many syscalls would it take? > > > > Four functions: > > create_lockspace() > > release_lockspace() > > lock() > > unlock() > > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone > is likely to object if we reserve those slots. Better take a look at the actual parameter lists to those calls before jumping to conclusions... Regards, Daniel
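To see the point about parameter lists, here is roughly what a full-featured lock() request involves once VMS-style semantics are in scope. The signature below is hypothetical, sketched in Python with invented names matching no real API, but each parameter corresponds to a standard dlm concept, and the two asynchronous callbacks are the part that a four-slot syscall interface has no clean answer for.

```python
import inspect

def lock(lockspace, resource, mode, flags,
         lock_value_block=None,  # small block of state carried with the lock
         parent=None,            # parent lock id, for lock hierarchies
         completion_ast=None,    # async callback when the lock is granted
         blocking_ast=None,      # async callback when another holder must yield
         timeout=None):
    """Hypothetical full dlm lock request (sketch only).  The callbacks
    imply asynchronous event delivery to userspace, which is exactly
    what does not map cleanly onto a plain synchronous syscall."""
    raise NotImplementedError("sketch only")

# Count the parameters that require async delivery machinery.
CALLBACK_PARAMS = [p for p in inspect.signature(lock).parameters
                   if p.endswith("_ast")]
```

Even ignoring the callbacks, the argument count already strains the syscall register-passing conventions that make "just reserve four slots" sound simple.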
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 03:28, Andrew Morton wrote: > If there is already a richer interface into all this code (such as a > syscall one) and it's feasible to migrate the open() tricksies to that API > in the future if it all comes unstuck then OK. That's why I asked (thus > far unsuccessfully): > >Are you saying that the posix-file lookalike interface provides >access to part of the functionality, but there are other APIs which are >used to access the rest of the functionality? If so, what is that >interface, and why cannot that interface offer access to 100% of the >functionality, thus making the posix-file tricks unnecessary? There is no such interface at the moment, nor is one needed in the immediate future. Let's look at the arguments for exporting a dlm to userspace: 1) Since we already have a dlm in kernel, why not just export that and save 100K of userspace library? Answer: because we don't want userspace-only dlm features bulking up the kernel. Answer #2: the extra syscalls and interface baggage serve no useful purpose. 2) But we need to take locks in the same lockspaces as the kernel dlm(s)! Answer: only support tools need to do that. A cut-down locking api is entirely appropriate for this. 3) But the kernel dlm is the only one we have! Answer: easily fixed, a simple matter of coding. But please bear in mind that dlm-style synchronization is probably a bad idea for most cluster applications, particularly ones that already do their synchronization via sockets. In other words, exporting the full dlm api is a red herring. It has nothing to do with getting cluster filesystems up and running. It is really just marketing: it sounds like a great thing for userspace to get a dlm "for free", but it isn't free, it contributes to kernel bloat and it isn't even the most efficient way to do it. If after considering that, we _still_ want to export a dlm api from kernel, then can we please take the necessary time and get it right? 
The full api requires not only syscall-style elements, but asynchronous events as well, similar to aio. I do not think anybody has a good answer to this today, nor do we even need it to begin porting applications to cluster filesystems.

Oracle guys: what is the distributed locking API for RAC? Is the RAC team waiting with bated breath to adopt your kernel-based dlm? If not, why not?

Regards,

Daniel
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick.  Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding?  Spare
> me.  O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock".  Not even close.

Now, I see the ocfs2 guys are all ready to back down on this one, but I will at least argue weakly in favor. Sick is a nice word for it, but it is actually not that far off. Normally, this fs will acquire a lock whenever the user creates a virtual file and the create will block until the global lock arrives. With O_NONBLOCK, it will return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is supposed to accomplish?

> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Way too cute.  Oh well, whatever.

The explicit way is syscalls or a set of ioctls, which he already has the makings of. If there is going to be a userspace api, I would hope it looks more like the contents of userdlm.c than the traditional Vaxcluster API, which sucks beyond belief.

Another explicit way is to do it with a whole set of virtual attributes instead of just a single file trying to capture the whole model. That is really unappealing, but I am afraid that is exactly what a whole lot of sysfs/configfs usage is going to end up looking like.

But more to the point: we have no urgent need for a userspace dlm api at the moment. Nothing will break if we just put that issue off for a few months, quite the contrary. If the only user is their tools I would say let it go ahead and be cute, even sickeningly so.
It is not supposed to be a general dlm api, at least that is my understanding. It is just supposed to be an interface for their tools. Of course it would help to know exactly how those tools use it. Too sleepy to find out tonight...

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs.  It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods.  You already have nearly all the right methods in nearly all
> > the right places.
>
> 	configfs, like sysfs, does not support ->open() or ->release()
> callbacks.

struct configfs_item_operations {
	void (*release)(struct config_item *);
	ssize_t (*show)(struct config_item *, struct attribute *, char *);
	ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
	int (*allow_link)(struct config_item *src, struct config_item *target);
	int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
	struct config_item *(*make_item)(struct config_group *group, const char *name);
	struct config_group *(*make_group)(struct config_group *group, const char *name);
	int (*commit_item)(struct config_item *item);
	void (*drop_item)(struct config_group *group, struct config_item *item);
};

You do have ->release and ->make_item/group. If I may hand you a more substantive argument: you don't support user-driven creation of files in configfs, only directories. Dlmfs supports user-created files.

But you know, there isn't actually a good reason not to support user-created files in configfs, as dlmfs demonstrates. Anyway, goodnight.

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Sunday 04 September 2005 00:30, Joel Becker wrote:
> 	You asked why dlmfs can't go into sysfs, and I responded.

And you got me! In the heat of the moment I overlooked the fact that you and Greg haven't agreed to the merge yet ;-)

Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the same paradigm: drive the kernel logic from user-initiated vfs methods. You already have nearly all the right methods in nearly all the right places.

Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 23:06, Joel Becker wrote:
> 	dlmfs is *tiny*.  The VFS interface is less than his claimed 500
> lines of savings.

It is 640 lines.

> 	The few VFS callbacks do nothing but call DLM
> functions.  You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
> 	In addition, sysfs cannot support the dlmfs model.  In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock.  sysfs doesn't support
> mkdir(2) or mknod(2) at all.

I said "configfs" in the email to which you are replying.

> 	More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace.  O_RDONLY acquires a shared read lock (PR
> in VMS parlance).  O_RDWR gets an exclusive lock (EX).  O_NONBLOCK is a
> trylock.  Here, dlmfs is using the VFS for complete lifetiming.  A lock
> is released via close(2).  If a process dies, close(2) happens.  In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
> 	sysfs does not allow hooking into ->open() or ->release().  So
> this model, and the inherent lifetiming that comes with it, cannot be
> used.

Configfs has a per-item release method. Configfs has a group open method. What is it that configfs can't do, or can't be made to do trivially?

> 	If dlmfs was changed to use a less intuitive model that fits
> sysfs, all the handling of lifetimes and cleanup would have to be added.

The model you came up with for dlmfs is beyond cute, it's downright clever. Why mar that achievement by then failing to capitalize on the framework you already have in configfs?

By the way, do you agree that dlmfs is too inefficient to be an effective way of exporting your dlm api to user space, except for slow-path applications like you have here?
Regards,

Daniel
Re: [Linux-cluster] Re: GFS, what's remaining
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.

You create a dlm domain when a directory is created. You create a lock resource when a file of that name is opened. You lock the resource when the file is opened. You access the lvb by read/writing the file. Why doesn't that fit the configfs-nee-sysfs model? If it does, the payoff will be about 500 lines saved.

This little dlm fs is very slick, but grossly inefficient. Maybe efficiency doesn't matter here since it is just your slow-path userspace tools taking these locks. Please do not even think of proposing this as a way to export a kernel-based dlm for general purpose use!

Your userdlm.c file has some hidden gold in it. You have factored the dlm calls far more attractively than the bad old bazillion-parameter Vaxcluster legacy. You are almost in system call zone there. (But note my earlier comment on dlms in general: until there are dlm-based applications, merging a general-purpose dlm API is pointless and has nothing to do with getting your filesystem merged.)

Regards,

Daniel
Re: GFS, what's remaining
On Saturday 03 September 2005 06:35, David Teigland wrote:
> Just a new version, not a big difference.  The ondisk format changed a
> little making it incompatible with the previous versions.  We'd been
> holding out on the format change for a long time and thought now would be
> a sensible time to finally do it.

What exactly was the format change, and for what purpose?
Re: GFS, what's remaining
On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...

Dumb question, why can't you use sysfs for this instead of rolling your own?

Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you forgot that there are dozens of lkml archives pointing at them?

Regards,

Daniel
Re: GFS, what's remaining
On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.

The only current users of dlms are cluster filesystems. There are zero users of the userspace dlm api. Therefore, the (g)dlm userspace interface actually has nothing to do with the needs of gfs. It should be taken out of the gfs patch and merged later, when or if user space applications emerge that need it. Maybe in the meantime it will be possible to come up with a userspace dlm api that isn't completely repulsive.

Also, note that the only reason the two current dlms are in-kernel is that it supposedly cuts down on userspace-kernel communication with the cluster filesystems. Then why should a userspace application bother with an awkward interface to an in-kernel dlm? This is obviously suboptimal. Why not have a userspace dlm for userspace apps, if indeed there are any userspace apps that would need to use dlm-style synchronization instead of more typical socket-based synchronization, or Posix locking, which is already exposed via a standard api?

There is actually nothing wrong with having multiple, completely different dlms active at the same time. There is no urgent need to merge them into the one true dlm. It would be a lot better to let them evolve separately and pick the winner a year or two from now. Just think of the dlm as part of the cfs until then.

What does have to be resolved is a common API for node management. It is not just cluster filesystems and their lock managers that have to interface to node management. Below the filesystem layer, cluster block devices and cluster volume management need to be coordinated by the same system, and above the filesystem layer, applications also need to be hooked into it. This work is, in a word, incomplete.
Regards,

Daniel
Re: [PATCH] ia_attr_flags - time to die
On Friday 02 September 2005 15:41, Miklos Szeredi wrote:
> Already dead ;)
>
> 2.6.13-mm1: remove-ia_attr_flags.patch
>
> Miklos

Wow, the pace of Linux development really is picking up. Now patches are applied before I even send them!

Regards,

Daniel
[PATCH] ia_attr_flags - time to die
Struct iattr is not involved any more in such things as NOATIME inode flags. There are no in-tree users of ia_attr_flags.

Signed-off-by: Daniel Phillips <[EMAIL PROTECTED]>

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h 2.6.13-rc5-mm1/fs/hostfs/hostfs.h
--- 2.6.13-rc5-mm1.clean/fs/hostfs/hostfs.h	2005-08-09 18:23:11.0 -0400
+++ 2.6.13-rc5-mm1/fs/hostfs/hostfs.h	2005-09-01 17:54:40.0 -0400
@@ -49,7 +49,6 @@ struct hostfs_iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 extern int stat_file(const char *path, unsigned long long *inode_out,
diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/fs.h 2.6.13-rc5-mm1/include/linux/fs.h
--- 2.6.13-rc5-mm1.clean/include/linux/fs.h	2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/fs.h	2005-09-01 18:27:42.0 -0400
@@ -282,19 +282,9 @@ struct iattr {
 	struct timespec	ia_atime;
 	struct timespec	ia_mtime;
 	struct timespec	ia_ctime;
-	unsigned int	ia_attr_flags;
 };
 
 /*
- * This is the inode attributes flag definitions
- */
-#define ATTR_FLAG_SYNCRONOUS	1	/* Syncronous write */
-#define ATTR_FLAG_NOATIME	2	/* Don't update atime */
-#define ATTR_FLAG_APPEND	4	/* Append-only file */
-#define ATTR_FLAG_IMMUTABLE	8	/* Immutable file */
-#define ATTR_FLAG_NODIRATIME	16	/* Don't update atime for directory */
-
-/*
  * Includes for diskquotas.
  */
 #include <linux/quota.h>
Re: GFS, what's remaining
On Thursday 01 September 2005 06:46, David Teigland wrote: > I'd like to get a list of specific things remaining for merging. Where are the benchmarks and stability analysis? How many hours does it survive cerberos running on all nodes simultaneously? Where are the testimonials from users? How long has there been a gfs2 filesystem? Note that Reiser4 is still not in mainline a year after it was first offered; why do you think gfs2 should be in mainline after one month? So far, all catches are surface things like bogus spinlocks. Substantive issues have not even begun to be addressed. Patience please, this is going to take a while. Regards, Daniel
Re: GFS, what's remaining
On Thursday 01 September 2005 10:49, Alan Cox wrote: > On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote: > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot > > possibly gain (or vice versa) > > > > - Relative merits of the two offerings > > You missed the important one - people actively use it and have been for > some years. Same reason with have NTFS, HPFS, and all the others. On > that alone it makes sense to include. I thought that gfs2 just appeared last month. Or is it really still just gfs? If there are substantive changes from gfs to gfs2 then obviously they have had practically zero testing, let alone posted benchmarks, testimonials, etc. If it is really still just gfs then the silly-rename should be undone. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:34, [EMAIL PROTECTED] wrote: > On Tue, Aug 30, 2005 at 04:28:46PM -0700, Andrew Morton wrote: > > Sure, but all that copying-and-pasting really sucks. I'm sure there's > > some way of providing the slightly different semantics from the same > > codebase? > > Careful - you've almost reinvented the concept of library, which would > violate any number of patents... I will keep my eyes open for library candidates as I go. For example, the binary blob operations really cry out for it. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:28, Andrew Morton wrote: > Joel Becker <[EMAIL PROTECTED]> wrote: > > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > > But it would be stupid to forbid users from creating directories in > > > sysfs or to forbid kernel modules from directly tweaking a configfs > > > namespace. Why should the kernel not be able to add objects to a > > > directory a user created? It should be up to the module author to > > > decide these things. > > > > This is precisely why configfs is separate from sysfs. If both > > user and kernel can create objects, the lifetime of the object and its > > filesystem representation is very complex. Sysfs already has problems > > with people getting this wrong. configfs does not. > > The fact that sysfs and configfs have similar backing stores > > does not make them the same thing. > > Sure, but all that copying-and-pasting really sucks. I'm sure there's some > way of providing the slightly different semantics from the same codebase? I will have that patch ready later this week. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:25, Daniel Phillips wrote: > On Wednesday 31 August 2005 09:13, Joel Becker wrote: > > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > > But it would be stupid to forbid users from creating directories in > > > sysfs or to forbid kernel modules from directly tweaking a configfs > > > namespace. Why should the kernel not be able to add objects to a > > > directory a user created? It should be up to the module author to > > > decide these things. > > > > This is precisely why configfs is separate from sysfs. If both > > user and kernel can create objects, the lifetime of the object and its > > filesystem representation is very complex. Sysfs already has problems > > with people getting this wrong. configfs does not. > > Could you please give a specific case? More to the point: what makes you think that this apparent ruggedness will diminish after being re-integrated with sysfs? If you wish, you can avoid any dangers by not using sysfs's vfs bypass api. It should be up to the subsystem author. Regards, Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:13, Joel Becker wrote: > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote: > > But it would be stupid to forbid users from creating directories in sysfs > > or to forbid kernel modules from directly tweaking a configfs namespace. > > Why should the kernel not be able to add objects to a directory a user > > created? It should be up to the module author to decide these things. > > This is precisely why configfs is separate from sysfs. If both > user and kernel can create objects, the lifetime of the object and its > filesystem representation is very complex. Sysfs already has problems > with people getting this wrong. configfs does not. Could you please give a specific case? > The fact that sysfs and configfs have similar backing stores > does not make them the same thing. It is not just the backing store, it is most of the code, all the structures, most of the functionality, a good deal of the bugs... Regards, Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Tuesday 30 August 2005 19:06, Stephen Hemminger wrote: > On Wed, 31 Aug 2005 08:59:55 +1000 > > Daniel Phillips <[EMAIL PROTECTED]> wrote: > > Configfs rewritten as a single file and updated to use kobjects instead > > of its own clone of kobjects (config_items). > > Some style issues: > Mixed case in labels I certainly agree. This is strictly for comparison purposes and so I did not clean up the stylistic problems from the original... this time. > Bad identation I did lindent it however :-) > > + Done: > > Why the mixed case label? It shall die. > > +void config_group_init_type_name(struct kset *group, const char *name, > > struct kobj_type *type) +{ > > + kobject_set_name(&group->kobj, name); > > + group->kobj.ktype = type; > > + config_group_init(group); > > +} > > Use tabs not one space for indent. Urk. Kmail did that to me, it has been broken that way for a year or so. I will have to repost the whole set from a mailer that works. Regards, Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 08:59, Daniel Phillips wrote: > -obj-$(CONFIG_CONFIGFS_FS) += configfs.o > +obj-$(CONFIG_CONFIGFS_FS) += configfs.o ddbond.config.o This should just be: +obj-$(CONFIG_CONFIGFS_FS) += configfs.o However, the wrong version does provide a convenient way of compiling the example, I just... have... to... remember to delete it next time. Regards, Daniel
[RFC][PATCH 4 of 4] Configfs is really sysfs
A kernel code example that uses the modified configfs to define a simple configuration interface. Note the use of kobjects and ksets instead of config_items and config_groups. Also notice how much code is required to get a simple value from userspace to kernel space. This is a big problem that needs to be addressed soon, before we end up with tens or hundreds of thousands of lines of code bloat just to get and set variables from user space.

Regards,

Daniel

#include
#include
#include
#include

struct ddbond_info {
	struct kobject item;
	int controlsock;
};

static inline struct ddbond_info *to_ddbond_info(struct kobject *item)
{
	return container_of(item, struct ddbond_info, item);
}

static ssize_t ddbond_info_attr_show(struct kobject *item, struct attribute *attr, char *page)
{
	ssize_t count;
	struct ddbond_info *ddbond_info = to_ddbond_info(item);

	count = sprintf(page, "%d\n", ddbond_info->controlsock);
	return count;
}

static ssize_t ddbond_info_attr_store(struct kobject *item, struct attribute *attr, const char *page, size_t count)
{
	struct ddbond_info *ddbond_info = to_ddbond_info(item);
	unsigned long tmp;
	char *p = (char *)page;

	tmp = simple_strtoul(p, &p, 10);
	if (!p || (*p && (*p != '\n')))
		return -EINVAL;
	if (tmp > INT_MAX)
		return -ERANGE;
	ddbond_info->controlsock = tmp;
	return count;
}

static void ddbond_info_release(struct kobject *item)
{
	kfree(to_ddbond_info(item));
}

static struct kobj_type ddbond_info_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_info_attr_show,
		.store = ddbond_info_attr_store,
		.release = ddbond_info_release,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "sockname",
			.mode = S_IRUGO | S_IWUSR,
		},
		NULL,
	},
	.ct_owner = THIS_MODULE,
};

static struct kobject *ddbond_make_item(struct kset *group, const char *name)
{
	struct ddbond_info *ddbond_info;

	if (!(ddbond_info = kcalloc(1, sizeof(struct ddbond_info), GFP_KERNEL)))
		return NULL;
	kobject_init_type_name(&ddbond_info->item, name, &ddbond_info_type);
	return &ddbond_info->item;
}

static ssize_t ddbond_description(struct kobject *item, struct attribute *attr, char *page)
{
	return sprintf(page,
		"A ddbond block server has two components: a userspace server and a kernel\n"
		"io daemon. First start the server and give it the name of a socket it will\n"
		"create, then create a ddbond directory and write the same name into the\n"
		"socket attribute\n");
}

static struct kobj_type ddbond_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_description,
	},
	.ct_group_ops = &(struct configfs_group_operations){
		.make_item = ddbond_make_item,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "description",
			.mode = S_IRUGO,
		},
		NULL,
	}
};

static struct subsystem ddbond_subsys = {
	.kset = {
		.kobj = {
			.k_name = "ddbond",
			.ktype = &ddbond_type,
		},
	},
};

static int __init init_ddbond_config(void)
{
	int ret;

	config_group_init(&ddbond_subsys.kset);
	init_rwsem(&ddbond_subsys.rwsem);
	if ((ret = configfs_register_subsystem(&ddbond_subsys)))
		printk(KERN_ERR "Error %d while registering subsystem %s\n",
			ret, ddbond_subsys.kset.kobj.k_name);
	return ret;
}

static void __exit exit_ddbond_config(void)
{
	configfs_unregister_subsystem(&ddbond_subsys);
}

module_init(init_ddbond_config);
module_exit(exit_ddbond_config);
MODULE_LICENSE("GPL");
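For illustration, the input parsing done by ddbond_info_attr_store() in the example above can be modeled in plain userspace C (a sketch only: strtoul stands in for the kernel's simple_strtoul, and `parse_controlsock` is an invented helper name):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Userspace model of the parsing in ddbond_info_attr_store(): accept
 * a decimal value optionally followed by a single newline (as produced
 * by `echo 42 > sockname`), and reject anything that does not fit in
 * an int. Returns 0 and stores the value, or -EINVAL / -ERANGE. */
static int parse_controlsock(const char *page, int *out)
{
	char *end;
	unsigned long tmp = strtoul(page, &end, 10);

	if (end == page || (*end && *end != '\n'))
		return -EINVAL;		/* no digits, or trailing garbage */
	if (tmp > INT_MAX)
		return -ERANGE;		/* too big for the int field */
	*out = (int)tmp;
	return 0;
}
```

The point stands that this handful of lines is the only real logic in the example; everything else is boilerplate for wiring one integer into the filesystem namespace.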
[RFC][PATCH 3 of 4] Configfs is really sysfs
Configfs rewritten as a single file and updated to use kobjects instead of its own clone of kobjects (config_items). diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2.6.13-rc5-mm1/fs/configfs/Makefile --- 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2005-08-09 18:23:30.0 -0400 +++ 2.6.13-rc5-mm1/fs/configfs/Makefile 2005-08-29 17:26:02.0 -0400 @@ -2,6 +2,5 @@ # Makefile for the configfs virtual filesystem # -obj-$(CONFIG_CONFIGFS_FS) += configfs.o +obj-$(CONFIG_CONFIGFS_FS) += configfs.o ddbond.config.o -configfs-objs := inode.o file.o dir.o symlink.o mount.o item.o diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2.6.13-rc5-mm1/fs/configfs/configfs.c --- 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2005-08-30 17:50:30.0 -0400 +++ 2.6.13-rc5-mm1/fs/configfs/configfs.c 2005-08-29 21:36:47.0 -0400 @@ -0,0 +1,1897 @@ +/* + * Based on sysfs: + * sysfs Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include +#include +#include +#include +#include +#include +#include + +#define CONFIGFS_ROOT 0x0001 +#define CONFIGFS_DIR 0x0002 +#define CONFIGFS_ITEM_ATTR 0x0004 +#define CONFIGFS_ITEM_LINK 0x0020 +#define CONFIGFS_USET_DIR 0x0040 +#define CONFIGFS_USET_DEFAULT 0x0080 +#define CONFIGFS_USET_DROPPING 0x0100 +#define CONFIGFS_NOT_PINNED(CONFIGFS_ITEM_ATTR) + +struct sysfs_symlink { + struct list_head sl_list; + struct kobject *sl_target; +}; + +static inline struct kobject *to_kobj(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct kobject *)sd->s_element); +} + +static inline struct attribute *to_attr(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct attribute *)sd->s_element); +} + +static inline struct kobject *sysfs_get_kobject(struct dentry *dentry) +{ + struct kobject *kobj = NULL; + + spin_lock(_lock); + if (!d_unhashed(dentry)) { + struct sysfs_dirent *sd = dentry->d_fsdata; + if (sd->s_type 
& CONFIGFS_ITEM_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kobj = kobject_get(sl->sl_target); + } else + kobj = kobject_get(sd->s_element); + } + spin_unlock(_lock); + + return kobj; +} + +static kmem_cache_t *sysfs_dir_cachep; + +static void release_sysfs_dirent(struct sysfs_dirent *sd) +{ + if ((sd->s_type & CONFIGFS_ROOT)) + return; + kmem_cache_free(sysfs_dir_cachep, sd); +} + +static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd) +{ + if (sd) { + WARN_ON(!atomic_read(>s_count)); + atomic_inc(>s_count); + } + return sd; +} + +static void sysfs_put(struct sysfs_dirent *sd) +{ + WARN_ON(!atomic_read(>s_count)); + if (atomic_dec_and_test(>s_count)) + release_sysfs_dirent(sd); +} + +/* + * inode.c - basic inode and dentry operations. + */ + +static struct super_block *sysfs_sb; + +static struct address_space_operations sysfs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write +}; + +static struct backing_dev_info sysfs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK, +}; + +static struct inode *sysfs_new_inode(mode_t mode) +{ + struct inode *inode = new_inode(sysfs_sb); + if (inode) { + inode->i_blksize = PAGE_CACHE_SIZE; + inode->i_blocks = 0; + inode->i_mode = mode; + inode->i_uid = 0; + inode->i_gid = 0; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_mapping->a_ops = _aops; + inode->i_mapping->backing_dev_info = _backing_dev_info; + } + return inode; +} + +static int sysfs_create(struct dentry *dentry, int mode, int (*init) (struct inode *)) +{ + int error = 0; + struct inode *inode = NULL; + if (dentry) { + if (!dentry->d_inode) { + if ((inode = sysfs_new_inode(mode))) { + if (dentry->d_parent + && dentry->d_parent->d_inode) { + struct inode *p_inode = + dentry->d_parent->d_inode; + p_inode->i_mtime = p_inode->i_ctime = + CURRENT_TIME; + } + goto Proceed; + } else + error = 
-ENOMEM; + } else + error = -EEXIST; + }
[RFC][PATCH 2 of 4] Configfs is really sysfs
Sysfs rearranged as a single file for analysis purposes. diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2.6.13-rc5-mm1/fs/sysfs/Makefile --- 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2005-06-17 15:48:29.0 -0400 +++ 2.6.13-rc5-mm1/fs/sysfs/Makefile 2005-08-29 17:13:59.0 -0400 @@ -2,5 +2,4 @@ # Makefile for the sysfs virtual filesystem # -obj-y := inode.o file.o dir.o symlink.o mount.o bin.o \ - group.o +obj-y := sysfs.o diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2.6.13-rc5-mm1/fs/sysfs/sysfs.c --- 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2005-08-30 17:52:35.0 -0400 +++ 2.6.13-rc5-mm1/fs/sysfs/sysfs.c 2005-08-29 21:04:40.0 -0400 @@ -0,0 +1,1680 @@ +#include +#include +#include +#include +#include +#include +#include + +struct sysfs_symlink { + char *link_name; + struct kobject *sl_target; +}; + +static inline struct kobject *to_kobj(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct kobject *)sd->s_element); +} + +static inline struct attribute *to_attr(struct dentry *dentry) +{ + struct sysfs_dirent *sd = dentry->d_fsdata; + return ((struct attribute *)sd->s_element); +} + +static inline struct kobject *sysfs_get_kobject(struct dentry *dentry) +{ + struct kobject *kobj = NULL; + + spin_lock(_lock); + if (!d_unhashed(dentry)) { + struct sysfs_dirent *sd = dentry->d_fsdata; + if (sd->s_type & SYSFS_KOBJ_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kobj = kobject_get(sl->sl_target); + } else + kobj = kobject_get(sd->s_element); + } + spin_unlock(_lock); + + return kobj; +} + +static kmem_cache_t *sysfs_dir_cachep; + +static void release_sysfs_dirent(struct sysfs_dirent *sd) +{ + if (sd->s_type & SYSFS_KOBJ_LINK) { + struct sysfs_symlink *sl = sd->s_element; + kfree(sl->link_name); + kobject_put(sl->sl_target); + kfree(sl); + } + kfree(sd->s_iattr); + kmem_cache_free(sysfs_dir_cachep, sd); +} + +static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd) +{ + if (sd) { + 
WARN_ON(!atomic_read(>s_count)); + atomic_inc(>s_count); + } + return sd; +} + +static void sysfs_put(struct sysfs_dirent *sd) +{ + WARN_ON(!atomic_read(>s_count)); + if (atomic_dec_and_test(>s_count)) + release_sysfs_dirent(sd); +} + +/* + * inode.c - basic inode and dentry operations. + */ + +int sysfs_setattr(struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + struct sysfs_dirent *sd = dentry->d_fsdata; + struct iattr *sd_iattr; + unsigned int ia_valid = iattr->ia_valid; + int error; + + if (!sd) + return -EINVAL; + + sd_iattr = sd->s_iattr; + + error = inode_change_ok(inode, iattr); + if (error) + return error; + + error = inode_setattr(inode, iattr); + if (error) + return error; + + if (!sd_iattr) { + /* setting attributes for the first time, allocate now */ + sd_iattr = kmalloc(sizeof(struct iattr), GFP_KERNEL); + if (!sd_iattr) + return -ENOMEM; + /* assign default attributes */ + memset(sd_iattr, 0, sizeof(struct iattr)); + sd_iattr->ia_mode = sd->s_mode; + sd_iattr->ia_uid = 0; + sd_iattr->ia_gid = 0; + sd_iattr->ia_atime = sd_iattr->ia_mtime = sd_iattr->ia_ctime = + CURRENT_TIME; + sd->s_iattr = sd_iattr; + } + + /* attributes were changed atleast once in past */ + + if (ia_valid & ATTR_UID) + sd_iattr->ia_uid = iattr->ia_uid; + if (ia_valid & ATTR_GID) + sd_iattr->ia_gid = iattr->ia_gid; + if (ia_valid & ATTR_ATIME) + sd_iattr->ia_atime = timespec_trunc(iattr->ia_atime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_MTIME) + sd_iattr->ia_mtime = timespec_trunc(iattr->ia_mtime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_CTIME) + sd_iattr->ia_ctime = timespec_trunc(iattr->ia_ctime, + inode->i_sb->s_time_gran); + if (ia_valid & ATTR_MODE) { + umode_t mode = iattr->ia_mode; + + if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID)) + mode &= ~S_ISGID; + sd_iattr->ia_mode = sd->s_mode = mode; + } + + return error; +} + +static struct inode_operations sysfs_inode_operations = { + .setattr = sysfs_setattr, +}; + 
+static struct super_block *sysfs_sb; + +static struct address_space_operations sysfs_aops = { + .readpage = simple_readpage, + .prepare_write = simple_prepare_write, + .commit_write = simple_commit_write +}; + +static struct backing_dev_info sysfs_backing_dev_info = { + .ra_pages = 0, /* No readahead */ + .capabilities = BDI_CAP_NO_ACCT_DIRTY |
[RFC][PATCH 1 of 4] Configfs is really sysfs
Hi Andrew, Configfs blithely ingests kobject.h and kobject.c into itself, just changing the names. Furthermore, more than half of configfs is copied verbatim from sysfs, the only difference being the name changes. After undoing the name changes and adding a few new fields to kobject structures, configfs is able to use the real thing instead of its own imitation. The changes I made to kobject.h and sysfs.h are:

* add module owner to kobj_type.
* add group_operations to kobj_type (because configfs does it this way, not because it is right)
* add a children field to kset. This is likely the same as the blandly named "list" field but I haven't confirmed it.
* add a default_groups field to kset, analogous to the default_attrs of kobj_type. Hmm, somebody seems to be mixing up types and containers here, but let's just close our eyes for now.
* add an s_links field to sysfs_dirent to support configfs's user-creatable symlinks.
* add two new methods to sysfs_ops for fancy symlink hooks
* add a questionable release method to sysfs_ops. Sysfs and configfs have slightly different notions of when to release objects; one of them is probably wrong.

That's it: no new fields in kobjects themselves, and just three or four fields in other allocatable structures. After these changes, no structures at all are left in configfs.h. Configfs is now running happily using the kobject machinery instead of its own mutated clones and, unsurprisingly, sysfs still runs happily too. These changes are all found in the first patch of this series. I then looked into exactly how configfs and sysfs are different. To reduce the noise, I concatenated all the files in each directory into two single files. With redundant declarations removed, configfs came in at 1897 lines and sysfs at 1680.
Diffing those two files shows:

diff -u fs/sysfs/sysfs.c fs/configfs/configfs.c | diffstat
 configfs.c | 1497 ++---
 1 files changed, 857 insertions(+), 640 deletions(-)

So we see that two thirds of sysfs made it into configfs unchanged. Of the remaining one third that configfs has not copied, about one third supports read/write/mmappable attribute files (why should configfs not have them too?), a little less than a third involves needlessly importing its own version of setattr, and the remainder, about 300 lines, exports the kernel interface for manipulating the user-visible sysfs tree. Allowing for a few lines of fluff, configfs's value add is about 750 lines of user space glue for namespace operations. Nothing below that glue layer is changed, except cosmetically. So configfs really is sysfs. By adding about 300 lines to configfs we can add the vfs-bypass code, and voila, configfs becomes sysfs. Another 200 lines gives us the binary blob attributes as well. There is no reason whatsoever for configfs and sysfs to live on as separate code bases. If we really want to make a distinction, we can make the distinction with a flag. But it would be stupid to forbid users from creating directories in sysfs or to forbid kernel modules from directly tweaking a configfs namespace. Why should the kernel not be able to add objects to a directory a user created? It should be up to the module author to decide these things. Please do not push configfs to stable in this form. It is not actually a new filesystem; it is an extension to sysfs. Merging it as is would add more than a thousand lines of pointless kernel bloat. If indeed we wish to present exactly the semantics configfs now offers, we do not need a separate code base to do so.
The four patches in this patch set:

 1) Add new fields to kobjects; update other headers to match
 2) Sysfs all in one file
 3) Configfs all in one file
 4) A configfs kernel example using sysfs instead of configfs structures

Regards,

Daniel

diff -up --recursive 2.6.13-rc5-mm1.clean/include/linux/configfs.h 2.6.13-rc5-mm1/include/linux/configfs.h
--- 2.6.13-rc5-mm1.clean/include/linux/configfs.h	2005-08-09 18:23:31.0 -0400
+++ 2.6.13-rc5-mm1/include/linux/configfs.h	2005-08-29 18:30:41.0 -0400
@@ -46,120 +46,32 @@
 #define CONFIGFS_ITEM_NAME_LEN	20
 
-struct module;
-
-struct configfs_item_operations;
-struct configfs_group_operations;
-struct configfs_attribute;
-struct configfs_subsystem;
-
-struct config_item {
-	char			*ci_name;
-	char			ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
-	struct kref		ci_kref;
-	struct list_head	ci_entry;
-	struct config_item	*ci_parent;
-	struct config_group	*ci_group;
-	struct config_item_type	*ci_type;
-	struct dentry		*ci_dentry;
-};
-
-extern int config_item_set_name(struct config_item *, const char *, ...);
-
-static inline char *config_item_name(struct config_item * item)
-{
-	return
[RFC][PATCH 2 of 4] Configfs is really sysfs
Sysfs rearranged as a single file for analysis purposes.

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile 2.6.13-rc5-mm1/fs/sysfs/Makefile
--- 2.6.13-rc5-mm1.clean/fs/sysfs/Makefile	2005-06-17 15:48:29.0 -0400
+++ 2.6.13-rc5-mm1/fs/sysfs/Makefile	2005-08-29 17:13:59.0 -0400
@@ -2,5 +2,4 @@
 #
 # Makefile for the sysfs virtual filesystem
 #
-obj-y		:= inode.o file.o dir.o symlink.o mount.o bin.o \
-		   group.o
+obj-y		:= sysfs.o
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c 2.6.13-rc5-mm1/fs/sysfs/sysfs.c
--- 2.6.13-rc5-mm1.clean/fs/sysfs/sysfs.c	2005-08-30 17:52:35.0 -0400
+++ 2.6.13-rc5-mm1/fs/sysfs/sysfs.c	2005-08-29 21:04:40.0 -0400
@@ -0,0 +1,1680 @@
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/backing-dev.h>
+#include <linux/pagemap.h>
+#include <linux/fsnotify.h>
+
+struct sysfs_symlink {
+	char *link_name;
+	struct kobject *sl_target;
+};
+
+static inline struct kobject *to_kobj(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct kobject *)sd->s_element);
+}
+
+static inline struct attribute *to_attr(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct attribute *)sd->s_element);
+}
+
+static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+	struct kobject *kobj = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct sysfs_dirent *sd = dentry->d_fsdata;
+		if (sd->s_type & SYSFS_KOBJ_LINK) {
+			struct sysfs_symlink *sl = sd->s_element;
+			kobj = kobject_get(sl->sl_target);
+		} else
+			kobj = kobject_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return kobj;
+}
+
+static kmem_cache_t *sysfs_dir_cachep;
+
+static void release_sysfs_dirent(struct sysfs_dirent *sd)
+{
+	if (sd->s_type & SYSFS_KOBJ_LINK) {
+		struct sysfs_symlink *sl = sd->s_element;
+		kfree(sl->link_name);
+		kobject_put(sl->sl_target);
+		kfree(sl);
+	}
+	kfree(sd->s_iattr);
+	kmem_cache_free(sysfs_dir_cachep, sd);
+}
+static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd)
+{
+	if (sd) {
+		WARN_ON(!atomic_read(&sd->s_count));
+		atomic_inc(&sd->s_count);
+	}
+	return sd;
+}
+
+static void sysfs_put(struct sysfs_dirent *sd)
+{
+	WARN_ON(!atomic_read(&sd->s_count));
+	if (atomic_dec_and_test(&sd->s_count))
+		release_sysfs_dirent(sd);
+}
+
+/*
+ * inode.c - basic inode and dentry operations.
+ */
+
+int sysfs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = dentry->d_inode;
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	struct iattr *sd_iattr;
+	unsigned int ia_valid = iattr->ia_valid;
+	int error;
+
+	if (!sd)
+		return -EINVAL;
+
+	sd_iattr = sd->s_iattr;
+
+	error = inode_change_ok(inode, iattr);
+	if (error)
+		return error;
+
+	error = inode_setattr(inode, iattr);
+	if (error)
+		return error;
+
+	if (!sd_iattr) {
+		/* setting attributes for the first time, allocate now */
+		sd_iattr = kmalloc(sizeof(struct iattr), GFP_KERNEL);
+		if (!sd_iattr)
+			return -ENOMEM;
+		/* assign default attributes */
+		memset(sd_iattr, 0, sizeof(struct iattr));
+		sd_iattr->ia_mode = sd->s_mode;
+		sd_iattr->ia_uid = 0;
+		sd_iattr->ia_gid = 0;
+		sd_iattr->ia_atime = sd_iattr->ia_mtime = sd_iattr->ia_ctime =
+			CURRENT_TIME;
+		sd->s_iattr = sd_iattr;
+	}
+
+	/* attributes were changed at least once in the past */
+
+	if (ia_valid & ATTR_UID)
+		sd_iattr->ia_uid = iattr->ia_uid;
+	if (ia_valid & ATTR_GID)
+		sd_iattr->ia_gid = iattr->ia_gid;
+	if (ia_valid & ATTR_ATIME)
+		sd_iattr->ia_atime = timespec_trunc(iattr->ia_atime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_MTIME)
+		sd_iattr->ia_mtime = timespec_trunc(iattr->ia_mtime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_CTIME)
+		sd_iattr->ia_ctime = timespec_trunc(iattr->ia_ctime,
+						inode->i_sb->s_time_gran);
+	if (ia_valid & ATTR_MODE) {
+		umode_t mode = iattr->ia_mode;
+
+		if (!in_group_p(inode->i_gid) && !capable(CAP_FSETID))
+			mode &= ~S_ISGID;
+		sd_iattr->ia_mode = sd->s_mode = mode;
+	}
+
+	return error;
+}
+
+static struct inode_operations
sysfs_inode_operations = {
+	.setattr	= sysfs_setattr,
+};
+
+static struct super_block *sysfs_sb;
+
+static struct address_space_operations sysfs_aops = {
+	.readpage	= simple_readpage,
+	.prepare_write	= simple_prepare_write,
+	.commit_write	= simple_commit_write
+};
+
+static struct backing_dev_info sysfs_backing_dev_info = {
+	.ra_pages	= 0,	/*
[RFC][PATCH 3 of 4] Configfs is really sysfs
Configfs rewritten as a single file and updated to use kobjects instead of its own clone of kobjects (config_items).

diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/Makefile 2.6.13-rc5-mm1/fs/configfs/Makefile
--- 2.6.13-rc5-mm1.clean/fs/configfs/Makefile	2005-08-09 18:23:30.0 -0400
+++ 2.6.13-rc5-mm1/fs/configfs/Makefile	2005-08-29 17:26:02.0 -0400
@@ -2,6 +2,5 @@
 #
 # Makefile for the configfs virtual filesystem
 #
-obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o
+obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o ddbond.config.o
-configfs-objs	:= inode.o file.o dir.o symlink.o mount.o item.o
diff -up --recursive 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c 2.6.13-rc5-mm1/fs/configfs/configfs.c
--- 2.6.13-rc5-mm1.clean/fs/configfs/configfs.c	2005-08-30 17:50:30.0 -0400
+++ 2.6.13-rc5-mm1/fs/configfs/configfs.c	2005-08-29 21:36:47.0 -0400
@@ -0,0 +1,1897 @@
+/*
+ * Based on sysfs:
+ *	sysfs Copyright (C) 2001, 2002, 2003 Patrick Mochel
+ *
+ * configfs Copyright (C) 2005 Oracle.  All rights reserved.
+ */
+
+#include <linux/fs.h>
+#include <linux/namei.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/backing-dev.h>
+#include <linux/pagemap.h>
+#include <linux/configfs.h>
+
+#define CONFIGFS_ROOT		0x0001
+#define CONFIGFS_DIR		0x0002
+#define CONFIGFS_ITEM_ATTR	0x0004
+#define CONFIGFS_ITEM_LINK	0x0020
+#define CONFIGFS_USET_DIR	0x0040
+#define CONFIGFS_USET_DEFAULT	0x0080
+#define CONFIGFS_USET_DROPPING	0x0100
+#define CONFIGFS_NOT_PINNED	(CONFIGFS_ITEM_ATTR)
+
+struct sysfs_symlink {
+	struct list_head sl_list;
+	struct kobject *sl_target;
+};
+
+static inline struct kobject *to_kobj(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct kobject *)sd->s_element);
+}
+
+static inline struct attribute *to_attr(struct dentry *dentry)
+{
+	struct sysfs_dirent *sd = dentry->d_fsdata;
+	return ((struct attribute *)sd->s_element);
+}
+
+static inline struct kobject *sysfs_get_kobject(struct dentry *dentry)
+{
+	struct kobject *kobj = NULL;
+
+	spin_lock(&dcache_lock);
+	if (!d_unhashed(dentry)) {
+		struct sysfs_dirent *sd = dentry->d_fsdata;
+		if (sd->s_type & CONFIGFS_ITEM_LINK) {
+			struct sysfs_symlink *sl = sd->s_element;
+			kobj = kobject_get(sl->sl_target);
+		} else
+			kobj = kobject_get(sd->s_element);
+	}
+	spin_unlock(&dcache_lock);
+
+	return kobj;
+}
+
+static kmem_cache_t *sysfs_dir_cachep;
+
+static void release_sysfs_dirent(struct sysfs_dirent *sd)
+{
+	if ((sd->s_type & CONFIGFS_ROOT))
+		return;
+	kmem_cache_free(sysfs_dir_cachep, sd);
+}
+
+static struct sysfs_dirent *sysfs_get(struct sysfs_dirent *sd)
+{
+	if (sd) {
+		WARN_ON(!atomic_read(&sd->s_count));
+		atomic_inc(&sd->s_count);
+	}
+	return sd;
+}
+
+static void sysfs_put(struct sysfs_dirent *sd)
+{
+	WARN_ON(!atomic_read(&sd->s_count));
+	if (atomic_dec_and_test(&sd->s_count))
+		release_sysfs_dirent(sd);
+}
+
+/*
+ * inode.c - basic inode and dentry operations.
+ */
+
+static struct super_block *sysfs_sb;
+
+static struct address_space_operations sysfs_aops = {
+	.readpage	= simple_readpage,
+	.prepare_write	= simple_prepare_write,
+	.commit_write	= simple_commit_write
+};
+
+static struct backing_dev_info sysfs_backing_dev_info = {
+	.ra_pages	= 0,	/* No readahead */
+	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
+};
+
+static struct inode *sysfs_new_inode(mode_t mode)
+{
+	struct inode *inode = new_inode(sysfs_sb);
+	if (inode) {
+		inode->i_blksize = PAGE_CACHE_SIZE;
+		inode->i_blocks = 0;
+		inode->i_mode = mode;
+		inode->i_uid = 0;
+		inode->i_gid = 0;
+		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		inode->i_mapping->a_ops = &sysfs_aops;
+		inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
+	}
+	return inode;
+}
+
+static int sysfs_create(struct dentry *dentry, int mode, int (*init)(struct inode *))
+{
+	int error = 0;
+	struct inode *inode = NULL;
+	if (dentry) {
+		if (!dentry->d_inode) {
+			if ((inode = sysfs_new_inode(mode))) {
+				if (dentry->d_parent &&
+				    dentry->d_parent->d_inode) {
+					struct inode *p_inode =
dentry->d_parent->d_inode;
+					p_inode->i_mtime = p_inode->i_ctime =
+						CURRENT_TIME;
+				}
+				goto Proceed;
+			} else
+
[RFC][PATCH 4 of 4] Configfs is really sysfs
A kernel code example that uses the modified configfs to define a simple configuration interface. Note the use of kobjects and ksets instead of config_items and config_groups. Also notice how much code is required to get a simple value from userspace to kernel space. This is a big problem that needs to be addressed soon, before we end up with tens or hundreds of thousands of lines of code bloat just to get and set variables from user space.

Regards,

Daniel

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/configfs.h>

struct ddbond_info {
	struct kobject item;
	int controlsock;
};

static inline struct ddbond_info *to_ddbond_info(struct kobject *item)
{
	return container_of(item, struct ddbond_info, item);
}

static ssize_t ddbond_info_attr_show(struct kobject *item,
		struct attribute *attr, char *page)
{
	ssize_t count;
	struct ddbond_info *ddbond_info = to_ddbond_info(item);

	count = sprintf(page, "%d\n", ddbond_info->controlsock);
	return count;
}

static ssize_t ddbond_info_attr_store(struct kobject *item,
		struct attribute *attr, const char *page, size_t count)
{
	struct ddbond_info *ddbond_info = to_ddbond_info(item);
	unsigned long tmp;
	char *p = (char *)page;

	tmp = simple_strtoul(p, &p, 10);
	if (!p || (*p && (*p != '\n')))
		return -EINVAL;
	if (tmp > INT_MAX)
		return -ERANGE;
	ddbond_info->controlsock = tmp;
	return count;
}

static void ddbond_info_release(struct kobject *item)
{
	kfree(to_ddbond_info(item));
}

static struct kobj_type ddbond_info_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_info_attr_show,
		.store = ddbond_info_attr_store,
		.release = ddbond_info_release,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "sockname",
			.mode = S_IRUGO | S_IWUSR,
		},
		NULL,
	},
	.ct_owner = THIS_MODULE,
};

static struct kobject *ddbond_make_item(struct kset *group, const char *name)
{
	struct ddbond_info *ddbond_info;

	if (!(ddbond_info = kcalloc(1, sizeof(struct ddbond_info), GFP_KERNEL)))
		return NULL;
	kobject_init_type_name(&ddbond_info->item, name, &ddbond_info_type);
	return &ddbond_info->item;
}

static ssize_t ddbond_description(struct kobject *item,
		struct attribute *attr, char *page)
{
	return sprintf(page,
		"A ddbond block server has two components: a userspace server and a kernel\n"
		"io daemon. First start the server and give it the name of a socket it will\n"
		"create, then create a ddbond directory and write the same name into the\n"
		"socket attribute\n");
}

static struct kobj_type ddbond_type = {
	.sysfs_ops = &(struct sysfs_ops){
		.show = ddbond_description,
	},
	.ct_group_ops = &(struct configfs_group_operations){
		.make_item = ddbond_make_item,
	},
	.default_attrs = (struct attribute *[]){
		&(struct attribute){
			.owner = THIS_MODULE,
			.name = "description",
			.mode = S_IRUGO,
		},
		NULL,
	}
};

static struct subsystem ddbond_subsys = {
	.kset = {
		.kobj = {
			.k_name = "ddbond",
			.ktype = &ddbond_type,
		},
	},
};

static int __init init_ddbond_config(void)
{
	int ret;

	config_group_init(&ddbond_subsys.kset);
	init_rwsem(&ddbond_subsys.rwsem);
	if ((ret = configfs_register_subsystem(&ddbond_subsys)))
		printk(KERN_ERR "Error %d while registering subsystem %s\n",
			ret, ddbond_subsys.kset.kobj.k_name);
	return ret;
}

static void __exit exit_ddbond_config(void)
{
	configfs_unregister_subsystem(&ddbond_subsys);
}

module_init(init_ddbond_config);
module_exit(exit_ddbond_config);

MODULE_LICENSE("GPL");
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 08:59, Daniel Phillips wrote:
> -obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o
> +obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o ddbond.config.o

This should just be:

+obj-$(CONFIG_CONFIGFS_FS)	+= configfs.o

However, the wrong version does provide a convenient way of compiling the example, I just... have... to... remember to delete it next time.

Regards,

Daniel
Re: [RFC][PATCH 3 of 4] Configfs is really sysfs
On Tuesday 30 August 2005 19:06, Stephen Hemminger wrote:
> On Wed, 31 Aug 2005 08:59:55 +1000 Daniel Phillips [EMAIL PROTECTED] wrote:
> > Configfs rewritten as a single file and updated to use kobjects
> > instead of its own clone of kobjects (config_items).
>
> Some style issues: Mixed case in labels

I certainly agree. This is strictly for comparison purposes and so I did not clean up the stylistic problems from the original... this time.

> Bad indentation

I did lindent it however :-)

> + Done:
>
> Why the mixed case label?

It shall die.

> +void config_group_init_type_name(struct kset *group, const char *name, struct kobj_type *type)
> +{
> + kobject_set_name(&group->kobj, name);
> + group->kobj.ktype = type;
> + config_group_init(group);
> +}
>
> Use tabs not one space for indent.

Urk. Kmail did that to me, it has been broken that way for a year or so. I will have to repost the whole set from a mailer that works.

Regards,

Daniel
Re: [RFC][PATCH 2 of 4] Configfs is really sysfs
(avoiding the kmail formatting problems this time.)

Sysfs rearranged as a single file for analysis purposes.
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:13, Joel Becker wrote:
> On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > But it would be stupid to forbid users from creating directories in
> > sysfs or to forbid kernel modules from directly tweaking a configfs
> > namespace. Why should the kernel not be able to add objects to a
> > directory a user created? It should be up to the module author to
> > decide these things.
>
> This is precisely why configfs is separate from sysfs. If both user
> and kernel can create objects, the lifetime of the object and its
> filesystem representation is very complex. Sysfs already has problems
> with people getting this wrong. configfs does not.

Could you please give a specific case?

> The fact that sysfs and configfs have similar backing stores does not
> make them the same thing.

It is not just the backing store, it is most of the code, all the structures, most of the functionality, a good deal of the bugs...

Regards,

Daniel
Re: [RFC][PATCH 4 of 4] Configfs is really sysfs
(without kmail bugs this time)

A kernel code example that uses the modified configfs to define a simple configuration interface.
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:25, Daniel Phillips wrote:
> On Wednesday 31 August 2005 09:13, Joel Becker wrote:
> > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > > But it would be stupid to forbid users from creating directories
> > > in sysfs or to forbid kernel modules from directly tweaking a
> > > configfs namespace. Why should the kernel not be able to add
> > > objects to a directory a user created? It should be up to the
> > > module author to decide these things.
> >
> > This is precisely why configfs is separate from sysfs. If both user
> > and kernel can create objects, the lifetime of the object and its
> > filesystem representation is very complex. Sysfs already has
> > problems with people getting this wrong. configfs does not.
>
> Could you please give a specific case?

More to the point: what makes you think that this apparent ruggedness will diminish after being re-integrated with sysfs? If you wish, you can avoid any dangers by not using sysfs's vfs-bypass api. It should be up to the subsystem author.

Regards,

Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:28, Andrew Morton wrote:
> Joel Becker [EMAIL PROTECTED] wrote:
> > On Wed, Aug 31, 2005 at 08:54:39AM +1000, Daniel Phillips wrote:
> > > But it would be stupid to forbid users from creating directories
> > > in sysfs or to forbid kernel modules from directly tweaking a
> > > configfs namespace. Why should the kernel not be able to add
> > > objects to a directory a user created? It should be up to the
> > > module author to decide these things.
> >
> > This is precisely why configfs is separate from sysfs. If both user
> > and kernel can create objects, the lifetime of the object and its
> > filesystem representation is very complex. Sysfs already has
> > problems with people getting this wrong. configfs does not.
> >
> > The fact that sysfs and configfs have similar backing stores does
> > not make them the same thing.
>
> Sure, but all that copying-and-pasting really sucks. I'm sure there's
> some way of providing the slightly different semantics from the same
> codebase?

I will have that patch ready later this week.

Regards,

Daniel
Re: [RFC][PATCH 1 of 4] Configfs is really sysfs
On Wednesday 31 August 2005 09:34, [EMAIL PROTECTED] wrote:
> On Tue, Aug 30, 2005 at 04:28:46PM -0700, Andrew Morton wrote:
> > Sure, but all that copying-and-pasting really sucks. I'm sure
> > there's some way of providing the slightly different semantics from
> > the same codebase?
>
> Careful - you've almost reinvented the concept of library, which
> would violate any number of patents...

I will keep my eyes open for library candidates as I go. For example, the binary blob operations really cry out for it.

Regards,

Daniel
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Monday 22 August 2005 00:49, Eric W. Biederman wrote:
> I am confused. I am beginning to see shades of the devfs problems
> coming up again. sysfs is built to be world readable by everyone who
> has it mounted in their namespace. Writable files in sysfs I have
> never understood.

Sysfs is not like devfs by nature, it is more like procfs. It exposes properties of a device, not the device itself. It makes perfect sense that some of the properties should be writeable.

> Given that we now have files which do not conform to one uniform
> policy for everyone is there any reason why we do not want to allocate
> a character device major number for all config values and dynamically
> allocate a minor number for each config value? Giving each config
> value its own unique entry under /dev.

/dev is already busy enough without adding masses of entries that are not devices. I don't see that this would simplify the internal implementation either, the opposite actually. The user certainly will not have any use for temporary device numbers in this context.

On the other hand, it is clunky to force an application to go through the same parse/format interface as the user just to get/set a simple integer. Perhaps sysfs needs to be taught how to ioctl these properties. I see exposing property names and operating on them as orthogonal issues that are currently joined at the hip in an unnatural, but fixable way.

> Device nodes for each writable config value trivially handles
> persistence and user policy and should be easy to implement in the
> kernel. We already have a policy engine in userspace, udev, to handle
> all of the chaos.
>
> Why do we need another mechanism?

We need the mechanism that exposes subsystem instance properties as they appear and disappear with changing configuration. This is a new mechanism anyway, so implementing it using device nodes does not save anything, it only introduces a new requirement to allocate device numbers.

> Are device nodes out of fashion these days?

They are, at least for putting things in /dev that are not actual hardware.

Regards,

Daniel
[PATCH] Permissions don't stick on ConfigFS attributes (revised)
On Saturday 20 August 2005 13:01, Greg KH wrote:
> On Sat, Aug 20, 2005 at 10:50:51AM +1000, Daniel Phillips wrote:
> > Permissions set on ConfigFS attributes (aka files) do not stick.
>
> The recent changes to sysfs should be ported to configfs to do this.

No, it should go the other way; my fix is better. It would not require sysfs to have its own version of setattr. What I do like about Maneesh's fix is the handling of other inode attributes besides mode flags, but that is a detail; let's get the structural elements right first.

The revised patch fixes the vanishing permissions bug and kills the configfs bogon that made my first attempt subtly wrong (it changed permissions for all attribute files instead of just the chmoded one).

diff -up --recursive 2.6.12-mm2.clean/fs/configfs/dir.c 2.6.12-mm2/fs/configfs/dir.c
--- 2.6.12-mm2.clean/fs/configfs/dir.c	2005-08-12 00:53:06.0 -0400
+++ 2.6.12-mm2/fs/configfs/dir.c	2005-08-20 16:16:34.0 -0400
@@ -64,6 +64,17 @@ static struct dentry_operations configfs
 	.d_delete = configfs_d_delete,
 };
 
+static int configfs_d_delete_attr(struct dentry *dentry)
+{
+	((struct configfs_dirent *)dentry->d_fsdata)->s_mode = dentry->d_inode->i_mode;
+	return 1;
+}
+
+static struct dentry_operations configfs_attr_dentry_ops = {
+	.d_delete = configfs_d_delete_attr,
+	.d_iput = configfs_d_iput,
+};
+
 /*
  * Allocates a new configfs_dirent and links it to the parent configfs_dirent
  */
@@ -238,14 +249,11 @@ static void configfs_remove_dir(struct c
  */
 static int configfs_attach_attr(struct configfs_dirent * sd, struct dentry * dentry)
 {
-	struct configfs_attribute * attr = sd->s_element;
-	int error;
-
-	error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+	int error = configfs_create(dentry, sd->s_mode, init_file);
 	if (error)
 		return error;
 
-	dentry->d_op = &configfs_dentry_ops;
+	dentry->d_op = &configfs_attr_dentry_ops;
 	dentry->d_fsdata = configfs_get(sd);
 	sd->s_dentry = dentry;
 	d_rehash(dentry);
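The mechanism in the patch above is small but easy to misread: the dentry for an attribute file is transient, so a chmod would be lost when the dentry is dropped unless the mode is copied back into the long-lived configfs_dirent at d_delete time and reused when the file is looked up again. A userspace model of that flow (illustrative only; `Dirent` and `Dentry` are stand-ins, not kernel APIs):

```python
class Dirent:
    """Stand-in for struct configfs_dirent: the long-lived object
    that now caches the attribute's mode in s_mode."""
    def __init__(self, mode):
        self.s_mode = mode

class Dentry:
    """Stand-in for the transient dentry + inode pair."""
    def __init__(self, dirent):
        self.dirent = dirent
        # configfs_attach_attr() now creates the inode from sd->s_mode.
        self.i_mode = dirent.s_mode

    def chmod(self, mode):
        self.i_mode = mode

    def delete(self):
        # configfs_d_delete_attr(): copy the possibly-changed mode
        # back into the dirent before the dentry goes away.
        self.dirent.s_mode = self.i_mode

sd = Dirent(0o644)
d = Dentry(sd)
d.chmod(0o600)      # user chmods the attribute file
d.delete()          # dentry dropped (e.g. memory pressure)
d2 = Dentry(sd)     # attribute looked up again: mode sticks
print(oct(d2.i_mode))
```

Without the copy-back in `delete()`, the second lookup would recreate the file with the original 0o644, which is exactly the "permissions don't stick" bug.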
Re: [RFC][PATCH] Rename PageChecked as PageMiscFS
On Saturday 20 August 2005 20:45, David Howells wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
> > Biased. Fs is a mixed case acronym, nuff said.
>
> But I'm still right :-)

Of course you are! We're only impugning your taste, not your logic ;-)

OK, the questions re your global consistency model are a bazillion times more significant. I have not forgotten about that, please stay tuned.

Regards,

Daniel
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Saturday 20 August 2005 16:31, Joel Becker wrote:
> On Fri, Aug 19, 2005 at 08:01:17PM -0700, Greg KH wrote:
> > The recent changes to sysfs should be ported to configfs to do this.
>
> Yeah, I've been meaning to do something, and reusing code is
> always a good plan.

Ending up with the same code in two different places in the core kernel is always a bad plan. Oh man. Just look at these two bodies of code; configfs is mostly just large tracts that are identical to sysfs except for name changes. Listen to what the code is trying to tell you!

SysFS:

struct kobject {
	const char		*k_name;
	char			name[KOBJ_NAME_LEN];
	struct kref		kref;
	struct list_head	entry;
	struct kobject		*parent;
	struct kset		*kset;
	struct kobj_type	*ktype;
	struct dentry		*dentry;
};

ConfigFS:

struct config_item {
	char			*ci_name;
	char			ci_namebuf[CONFIGFS_ITEM_NAME_LEN];
	struct kref		ci_kref;
	struct list_head	ci_entry;
	struct config_item	*ci_parent;
	struct config_group	*ci_group;
	struct config_item_type	*ci_type;
	struct dentry		*ci_dentry;
};

Big difference, huh? As the designer of configfs, could you please offer your take on why it cannot be rolled back into sysfs, considering that it is mostly identical already?

Regards,

Daniel
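The parallel between the two structs is easy to check mechanically. A quick sketch (field names transcribed from the two structs quoted above; only the `ci_` namespace prefix is stripped before comparing):

```python
# Field lists transcribed from struct kobject and struct config_item.
kobject     = ["k_name", "name", "kref", "entry",
               "parent", "kset", "ktype", "dentry"]
config_item = ["ci_name", "ci_namebuf", "ci_kref", "ci_entry",
               "ci_parent", "ci_group", "ci_type", "ci_dentry"]

def stem(field):
    # Drop the ci_/k_ namespace prefix to compare the underlying names.
    for prefix in ("ci_", "k_"):
        if field.startswith(prefix):
            return field[len(prefix):]
    return field

pairs = list(zip(map(stem, kobject), map(stem, config_item)))
same = [a for a, b in pairs if a == b]
print(f"{len(same)} of {len(pairs)} fields identical modulo prefix: {same}")
```

The remaining mismatches (name/namebuf, kset/group, ktype/type) are renames of corresponding fields, not structural differences, which is the point being made to Joel.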
Re: [PATCH] Permissions don't stick on ConfigFS attributes
On Saturday 20 August 2005 11:22, Jon Smirl wrote:
> A patch for making sysfs attributes persistent has recently made it
> into Linus' tree.
>
> http://article.gmane.org/gmane.linux.hotplug.devel/7927/match=sysfs+permissions

Interesting, it handles more than just the file mode. But does anybody really care about the ctime/atime/mtime in sysfs? I can see how uid and gid could be useful.

My way of handling this (copying out the potentially changed iattrs when the dentry is destroyed) looks more compact than Maneesh's solution while being no less effective, once I get it right, that is. Does sysfs really need its own setattr?

A quibble: we normally use the term persistent to mean "saved on permanent storage". Going by that, Maneesh just fixed a bug and did not make iattrs persistent.

Regards,

Daniel