[PATCH] Fix the warning when running make tags
From: Aneesh Kumar K.V <[EMAIL PROTECTED]>

make tags was giving the below warning.

ctags: Warning: arch/x86_64/kernel/head.S:124: null expansion of name pattern "\1"

Fix the same by making sure we take only the ENTRY pattern found at the
beginning of the line.

Signed-off-by: Aneesh Kumar K.V <[EMAIL PROTECTED]>
---
 Makefile |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 8a3c271..9c2670e 100644
--- a/Makefile
+++ b/Makefile
@@ -1316,7 +1316,7 @@ define xtags
 	    -I __initdata,__exitdata,__acquires,__releases  \
 	    -I EXPORT_SYMBOL,EXPORT_SYMBOL_GPL              \
 	    --extra=+f --c-kinds=+px                        \
-	    --regex-asm='/ENTRY\(([^)]*)\).*/\1/';          \
+	    --regex-asm='/^ENTRY\(([^)]*)\).*/\1/';         \
 	    $(all-kconfigs) | xargs $1 -a                   \
 	--langdef=kconfig                                   \
 	--language-force=kconfig                            \
--
1.5.2.2.571.ge1341-dirty

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
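To see why anchoring the pattern can make the warning go away, here is a minimal sketch using POSIX regcomp()/regexec(). It assumes that ctags interprets the pattern as an extended regex and that "null expansion" means group 1 matched an empty string. The helper entry_name() and the sample input lines are hypothetical; the actual content of head.S line 124 is not shown in the patch.

```c
#include <regex.h>
#include <string.h>

/* The two patterns from the patch, written as POSIX extended regexes. */
static const char *UNANCHORED = "ENTRY\\(([^)]*)\\).*";
static const char *ANCHORED   = "^ENTRY\\(([^)]*)\\).*";

/*
 * Try to extract the ENTRY() argument from one source line.
 * Returns  1 and fills `name` if a non-empty name was captured,
 *          0 if the pattern matched but group 1 was empty
 *            (assumed to be the "null expansion" case),
 *         -1 if there was no match, the pattern failed to compile,
 *            or the name did not fit into `name`.
 */
static int entry_name(const char *pattern, const char *line,
                      char *name, size_t name_len)
{
    regex_t re;
    regmatch_t m[2];
    int rc = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, line, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);

        if (len == 0) {
            rc = 0;
        } else if (len < name_len) {
            memcpy(name, line + m[1].rm_so, len);
            name[len] = '\0';
            rc = 1;
        }
    }
    regfree(&re);
    return rc;
}
```

With a hypothetical comment line such as "\t/* see ENTRY() */", the unanchored pattern matches with an empty name, while the anchored one does not match at all. A real "ENTRY(startup_64)" at the start of a line is found by both.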
Re: [PATCH, RFD]: Unbreak no-mmu mmap
Bryan Wu wrote:
> On Wed, 2007-06-20 at 12:00 +0900, Paul Mundt wrote:
> > On Fri, Jun 08, 2007 at 03:53:49PM +0200, Bernd Schmidt wrote:
> > > diff --git a/mm/nommu.c b/mm/nommu.c
> > > index 2b16b00..7480a95 100644
> > > --- a/mm/nommu.c
> > > +++ b/mm/nommu.c
> > [snip]
> > > +	/*
> > > +	 * Must always set the VM_SPLIT_PAGES flag for single-page allocations,
> > > +	 * to avoid trying to get the order of the compound page later on.
> > > +	 */
> > > +	if (len == PAGE_SIZE)
> > > +		vma->vm_flags |= VM_SPLIT_PAGES;
> > > +	else if (flags & MAP_SPLIT_PAGES
> >
> > And now you've just broken every non-blackfin nommu platform, as you've
> > only defined MAP_SPLIT_PAGES in asm-blackfin/mman.h.
> >
> > > +#ifdef CONFIG_NP2
> > > +	    || len < total_len
> > > +#endif
> >
> > And what is this? It only shows up in the blackfin defconfig. This is
> > not the place to be putting board-specific hacks.
>
> Yes, it is our own NP2 memory allocator option. I think Bernd will fix it.
>
> > There's no reason you can't add the MAP_SPLIT_PAGES define in all the
> > necessary places too.
> >
> > On Tue, Jun 19, 2007 at 07:26:19PM -0400, Robin Getz wrote:
> > > I'm assuming that since no one had any large objections, that this is
> > > OK, and we should send to Andrew to live in -mm for awhile?
> >
> > No real objections to the approach, but it would be nice if these sorts
> > of things were test compiled for at least one platform that isn't yours,
> > so the obviously broken stuff is fixed before it's posted and someone
> > else has to find out about it later.
>
> Exactly. Could you please do some simple test on your SH-NOMMU platform?
> And we are waiting for some feedback from other nommu arch maintainers.
> David and Greg, could you please help on this? Maybe Robin got some m68k
> nommu by hand which can be used for testing, I only have Blackfin, -:))

I have compiled the patch on m68knommu (after adding a MAP_SPLIT_PAGES
define). And it seems to work ok with simple testing.

I don't have a problem with the change, though please do add that
MAP_SPLIT_PAGES define in the appropriate mman.h includes.

And like Paul said, there is no place for CONFIG_NP2 in it currently.
Please take that out.

Regards
Greg

Greg Ungerer  --  Chief Software Dude       EMAIL:     [EMAIL PROTECTED]
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
Re: [PATCH] atkbd: cleanup only once
On Wed, Jun 27, 2007 at 12:59:32AM -0400, Dmitry Torokhov wrote:
> On Wednesday 27 June 2007 00:28, Greg KH wrote:
> > On Wed, Jun 27, 2007 at 12:34:09AM -0400, Dmitry Torokhov wrote:
> > > Hi Dave,
> > >
> > > On Wednesday 27 June 2007 06:59, Dave Young wrote:
> > > > Hi,
> > > >
> > > > If you press ctrl+alt+del several times as kernel booting (before user
> > > > level bootin), the kernel will oops. I found the ps2_command is called
> > > > more than once, then the ps2dev->serio maybe NULL pointer.
> > > >
> > > > 2.6.22-rc5 and 2.6.22-rc6 have same result.
> > > >
> > > > Signed-off-by: Dave Young <[EMAIL PROTECTED]>
> > > > ---
> > > > diff -upr linux/drivers/input/keyboard/atkbd.c
> > > > linux.new/drivers/input/keyboard/atkbd.c
> > > > --- linux/drivers/input/keyboard/atkbd.c 2007-06-27
> > > > 10:38:37.0 +
> > > > +++ linux.new/drivers/input/keyboard/atkbd.c 2007-06-27
> > > > 10:37:39.0 +
> > > > @@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
> > > >
> > > > static void atkbd_cleanup(struct serio *serio)
> > > > {
> > > > +	static int flag;
> > > > +
> > > > +	if(flag)
> > > > +		return;
> > > > +	flag = 1;
> > >
> > > Unfortunately this will prevent atkbd from resetting keyboard on 2nd
> > > suspend attempt. It will also not work if you have an active MUX and
> > > have a couple of keyboards connected.
> > >
> > > Greg, now that you removed rwsem from subsystem (and subsystem itself
> > > for that matter) there is nothing as far as I can see that stops
> > > several threads from running device_shutdown() simultaneously. I also
> > > do not see what would isolate device probing and shutting them down
> > > at the same time. Am I missing something?
> >
> > There was never anything stopping that from happening before. No driver
> > core code was using that rwsem, so it wasn't protecting anything,
> > despite people trying to use it as if it was :)
> >
>
> It did protect device_shutdown() from itself, didn't it?

Hm, yeah, it did, but that was it. If that was its goal, it sure wasn't
obvious at all.

Do you think the driver core needs to serialize this?

thanks,

greg k-h
Re: [RFC] fsblock
On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:
> On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:
> > On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:
>
> [ ... fsblocks vs extent range mapping ]
>
> > iomaps can double as range locks simply because iomaps are
> > expressions of ranges within the file. Seeing as you can only
> > access a given range exclusively to modify it, inserting an empty
> > mapping into the tree as a range lock gives an effective method of
> > allowing safe parallel reads, writes and allocation into the file.
> >
> > The fsblocks and the vm page cache interface cannot be used to
> > facilitate this because a radix tree is the wrong type of tree to
> > store this information in. A sparse, range based tree (e.g. btree)
> > is the right way to do this and it matches very well with
> > a range based API.
>
> I'm really not against the extent based page cache idea, but I kind of
> assumed it would be too big a change for this kind of generic setup. At
> any rate, if we'd like to do it, it may be best to ditch the idea of
> "attach mapping information to a page", and switch to "lookup mapping
> information and range locking for a page".

Well, the get_block equivalent API is an extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.

An extent based thing IMO really isn't appropriate as the main generic
layer here though. If it is really useful and popular, then it could
be turned into generic code and sit alongside fsblock or underneath
fsblock...

It definitely isn't trivial to drive the IO directly from something
like that which doesn't correspond to filesystem block size. You would
be splitting parts of your extent tree when things go dirty or uptodate
or partially under IO, etc., and joining things back up again when they
are mergeable. Not that it would be impossible, but it would be a lot
more heavyweight than fsblock.

I think using fsblock to drive the IO and keep the pagecache flags
uptodate and using a btree in the filesystem to manage extents of block
allocations wouldn't be a bad idea though. Do any filesystems actually
do this?
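As a rough illustration of the "mapping as range lock" idea discussed above, here is a minimal single-threaded sketch. It deliberately swaps the btree for a small fixed array, and all names (struct range_locks, range_trylock, range_unlock) are hypothetical rather than taken from any posted patch. A real implementation would also need its own locking around the table.

```c
#include <stdbool.h>

#define MAX_RANGES 16

/* One held range, covering the half-open interval [start, end). */
struct range {
    unsigned long start;
    unsigned long end;
    bool used;
};

struct range_locks {
    struct range slot[MAX_RANGES];
};

static bool range_overlaps(const struct range *r,
                           unsigned long start, unsigned long end)
{
    return r->used && start < r->end && r->start < end;
}

/*
 * Try to take exclusive ownership of [start, end).
 * Returns a slot index >= 0 on success, or -1 if the range is empty,
 * overlaps a range that is already held, or the table is full.
 */
static int range_trylock(struct range_locks *rl,
                         unsigned long start, unsigned long end)
{
    int free_slot = -1;

    if (start >= end)
        return -1;

    for (int i = 0; i < MAX_RANGES; i++) {
        if (range_overlaps(&rl->slot[i], start, end))
            return -1;
        if (!rl->slot[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;

    rl->slot[free_slot].start = start;
    rl->slot[free_slot].end = end;
    rl->slot[free_slot].used = true;
    return free_slot;
}

/* Release a range previously returned by range_trylock(). */
static void range_unlock(struct range_locks *rl, int slot)
{
    rl->slot[slot].used = false;
}
```

Two callers working on disjoint parts of a file both succeed, while a caller whose range overlaps a held one is refused until the holder unlocks.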
Re: NVidia Driver Support - 1680x1050 mode
OK where to start?

Firstly, this is really the wrong list you're writing to. Chances are
you'll want to ask your question at
http://www.nvnews.net/vbulletin/forumdisplay.php?s==14 if you're using the
Nvidia blob; otherwise, if you're using the 2D-only NV driver, you should
really aim your question at the Xorg guys and gals.

Secondly, how exactly did you tell it differently? Chances are your
/etc/conf.d/xorg.conf is wrong. If you were using the latest X server with
the latest binary blob (nvidia driver), chances are it would detect the
highest resolution and set it for you. The program nvidia-xconfig should
fix the file for you.

Oh, and thirdly, do you really think "it just works on Windows" is a good
incentive to get people to help you? Yes, it should work out of the box,
but unfortunately the world of Linux 3D drivers is mostly dominated by
companies that prefer keeping their drivers in a black box. Hopefully in
the not too distant future the nouveau project might remedy this.

Anyway, I hope this e-mail both helps with your problems and adds to your
understanding of how things work. The kernel mailing list is for kernel
issues (which include rivafb and nvidiafb, but not nv and nvidia 3D
issues), so if you ever plug in a hard drive and it's not working at full
speed, or something along those lines, that's when you should call.

Cheers

Mike

On 27/06/07, Marc Perkel <[EMAIL PROTECTED]> wrote:
> Trying to get my Asus M2NPV-VM motherboard and my
> Samsung SyncMaster 215tw Digital to work in 1680x1050
> mode but 1280x1024 is the most I can get. Chip Set is
> GeForce 6150. Looking in Xorg.0.log it seems to think
> that the panel size is 1280x1024 in spite of my
> setting telling it differently.
>
> Sorry if this is off topic but I thought that the
> smart people would be here. In Windows I just plug it
> in and it works. So I figure Linux should work too. :)
Re: [PATCH] ata: Add the SW NCQ support to sata_nv for MCP51/MCP55/MCP61
On Wed, 27 Jun 2007 11:04:44 +0800 "kuan luo" <[EMAIL PROTECTED]> wrote:

> Add the Software NCQ support to sata_nv.c for MCP51/MCP55/MCP61 SATA
> controller.
> NCQ function is disable by default, you can enable it with 'swncq=1'
>

This patch adds a large amount of new trailing whitespace.

> ---
> diff -Nurp a/sata_nv.c b/sata_nv.c
> --- a/sata_nv.c 2007-06-13 10:15:07.0 -0400
> +++ b/sata_nv.c 2007-06-26 12:52:27.0 -0400

Please prepare patches in `patch -p1' form.

> +typedef struct {
> +	u32 defer_bits;
> +	u8 front;
> +	u8 rear;
> +	unsigned int tag[ATA_MAX_QUEUE + 1];
> +}defer_queue_t;

Avoid adding typedefs.

> +static int swncq_enabled = 0;

Don't initialise static storage to zero: it needlessly increases the vmlinux
size.

> 	nv_hardreset, ata_std_postreset);
> }
>
> +static void nv_swncq_qc_to_dq(struct ata_port *ap, struct ata_queued_cmd *qc)
> +{
> +	struct nv_swncq_port_priv *pp = ap->private_data;
> +	defer_queue_t *dq = &pp->defer_queue;
> +
> +	/* queue is full */
> +	WARN_ON((dq->rear + 1) % (ATA_MAX_QUEUE + 1) == dq->front);

This is peculiar. The array is sized ATA_MAX_QUEUE+1 (ie: 33) and the code
uses ATA_MAX_QUEUE+1 everywhere. It looks to me like the ata code was
designed to queue up to 32 elements and all this code has taken that to 33.

What exactly is going on here?

> +
> +	dq->defer_bits |= (1 << qc->tag);
> +
> +	dq->tag[dq->rear] = qc->tag;
> +	dq->rear = (dq->rear + 1) % (ATA_MAX_QUEUE + 1);
> +
> +}
> +
> +static struct ata_queued_cmd *nv_swncq_qc_from_dq(struct ata_port *ap)
> +{
> +	struct nv_swncq_port_priv *pp = ap->private_data;
> +	defer_queue_t *dq = &pp->defer_queue;
> +	unsigned int tag;
> +
> +	if (dq->front == dq->rear) /* null queue */
> +		return NULL;
> +
> +	tag = dq->tag[dq->front];
> +	dq->tag[dq->front] = ATA_TAG_POISON;
> +	dq->front = (dq->front + 1) % (ATA_MAX_QUEUE + 1);

etc.

> +	WARN_ON(!(dq->defer_bits & (1 << tag)));
> +	dq->defer_bits &= ~(1 << tag);
> +
> +	return ata_qc_from_tag(ap, tag);
> +}
> +
> +	dq->front = dq->rear = 0;
> +	dq->defer_bits = 0;
> +	pp->qc_active = 0;
> +	pp->last_issue_tag = ATA_TAG_POISON;
> +	nv_swncq_fis_reinit(ap);
> +}
> +
> +static void nv_swncq_irq_clear(struct ata_port *ap, u32 val)
> +{
> +	void __iomem *mmio = ap->host->iomap[NV_MMIO_BAR];
> +	u32 flags = (val << (ap->port_no * NV_INT_PORT_SHIFT_MCP55));

I hope we'll never need to support more than two ports...

> +	writel(flags, mmio + NV_INT_STATUS_MCP55);
> +}
> +
> +static void nv_swncq_ncq_stop(struct ata_port *ap)
> +{
> +	struct nv_swncq_port_priv *pp = ap->private_data;
> +	unsigned int i;
> +	u32 sactive;
> +	u32 done_mask;
> +
> +	ata_port_printk(ap, KERN_ERR,
> +		"EH in SWNCQ mode,QC:qc_active 0x%X sactive 0x%X\n",
> +		ap->qc_active, ap->sactive);
> +	ata_port_printk(ap, KERN_ERR,
> +		"SWNCQ:qc_active 0x%X defer_bits 0x%X last_issue_tag 0x%x\n "
> +		"dhfis 0x%X dmafis 0x%X sdbfis 0x%X\n",
> +		pp->qc_active, pp->defer_queue.defer_bits, pp->last_issue_tag,
> +		pp->dhfis_bits, pp->dmafis_bits,
> +		pp->sdbfis_bits);
> +
> +	ata_port_printk(ap, KERN_ERR, "ATA_REG 0x%X ERR_REG 0x%X\n",
> +		ap->ops->check_status(ap), ioread8(ap->ioaddr.error_addr));
> +
> +	sactive = readl(pp->sactive_block);
> +	done_mask = pp->qc_active ^ sactive;
> +
> +	ata_port_printk(ap, KERN_ERR, "tag : dhfis dmafis sdbfis sacitve\n");
> +	for (i=0; i < ATA_MAX_QUEUE; i++) {

Missing spaces around the "=".

We have a script in scripts/checkpatch.pl which will inform you about many
of these little things. Please familiarise yourself with it.

> +		u8 err = 0;
> +		if (pp->qc_active & (1 << i))
> +			err = 0;
> +		else if (done_mask & (1 << i))
> +			err = 1;
> +		else
> +			continue;
> +
> +		ata_port_printk(ap, KERN_ERR,
> +			"tag 0x%x: %01x %01x %01x %01x %s\n", i,
> +			(pp->dhfis_bits >> i) & 0x1,
> +			(pp->dmafis_bits >> i) & 0x1 , (pp->sdbfis_bits >> i) & 0x1,
> +			(sactive >> i) & 0x1,
> +			(err ? "error!tag doesn't exit, but sactive bit is set" : " "));
> +	}
> +
> +	nv_swncq_pp_reinit(ap);
> +	ap->ops->irq_clear(ap);
> +	nv_swncq_bmdma_stop(ap);
> +	nv_swncq_irq_clear(ap, 0x);
> +}
>
> ...
>
> +
> +static void nv_swncq_fill_sg(struct ata_queued_cmd *qc)
> +{
> +	struct ata_port *ap = qc->ap;
> +	struct scatterlist *sg;
> +	unsigned int idx;
> +
> +	struct nv_swncq_port_priv *pp = ap->private_data;
> +
>
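On the "why 33?" question raised in the review: one possible explanation, not confirmed by the patch author in this thread, is the usual ring-buffer convention where front == rear means "empty" and one slot is always left unused so that "full" can be told apart from "empty". Under that convention, holding 32 tags needs 33 slots. The sketch below is a standalone illustration with hypothetical names (struct defer_ring, ring_push, ring_pop), and MAX_QUEUE merely stands in for ATA_MAX_QUEUE.

```c
#include <stdbool.h>

#define MAX_QUEUE  32                 /* stands in for ATA_MAX_QUEUE */
#define RING_SLOTS (MAX_QUEUE + 1)    /* one slot is always left empty */

struct defer_ring {
    unsigned char front;              /* index of the oldest entry */
    unsigned char rear;               /* index where the next entry goes */
    unsigned int tag[RING_SLOTS];
};

static bool ring_empty(const struct defer_ring *r)
{
    return r->front == r->rear;
}

static bool ring_full(const struct defer_ring *r)
{
    return (r->rear + 1) % RING_SLOTS == r->front;
}

/* Append a tag; returns false if the ring already holds MAX_QUEUE tags. */
static bool ring_push(struct defer_ring *r, unsigned int tag)
{
    if (ring_full(r))
        return false;
    r->tag[r->rear] = tag;
    r->rear = (r->rear + 1) % RING_SLOTS;
    return true;
}

/* Remove the oldest tag; returns false if the ring is empty. */
static bool ring_pop(struct defer_ring *r, unsigned int *tag)
{
    if (ring_empty(r))
        return false;
    *tag = r->tag[r->front];
    r->front = (r->front + 1) % RING_SLOTS;
    return true;
}
```

With RING_SLOTS equal to MAX_QUEUE + 1, exactly MAX_QUEUE pushes succeed before ring_full() becomes true. If the array had only 32 slots, the same full/empty tests would cap the queue at 31 entries.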
Re: [BUG] long freezes on thinkpad t60
Linus Torvalds wrote:
>
> On Tue, 26 Jun 2007, Nick Piggin wrote:
>
> > Hmm, not that I have a strong opinion one way or the other, but I
> > don't know that they would encourage bad code. They are not going to
> > reduce latency under a locked section, but will improve determinism
> > in the contended case.
>
> xadd really generally *is* slower than an add. One is often microcoded,
> the other is not.

Oh. I found xadd to be not hugely slower on my P4, but it was a little
bit.

> But the real problem is that your "unlock" sequence is now about two
> orders of magnitude slower than it used to be. So it used to be that a
> spinlocked sequence only had a single synchronization point, now it has
> two. *That* is really bad, and I guarantee that it makes your spinlocks
> effectively twice as slow for the non-contended parts.

I don't know why my unlock sequence should be that much slower? Unlocked
mov vs unlocked add? Definitely in dumb micro-benchmark testing it wasn't
twice as slow (IIRC).

> But your xadd thing might be worth looking at, just to see how expensive
> it is. As an _alternative_ to spinlocks, it's certainly viable.
>
> (Side note: why make it a word? Word operations are slower on many x86
> implementations, because they add yet another prefix. You only need a
> byte)

No real reason I guess. I'll change it.

--
SUSE Labs, Novell Inc.
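The xadd-based lock itself is not shown in this message, so the following is only a generic ticket-lock sketch in portable C11 atomics, under the assumption that the scheme being debated is of this kind. The names (struct ticket_lock, ticket_acquire, ticket_release) are hypothetical. The fetch-and-add in the acquire path is the operation a compiler would typically turn into a locked xadd on x86, and the counters are bytes, following the side note above.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Byte-sized tickets: at most 255 CPUs may be queued on the lock at once
 * before the counters wrap into each other.
 */
struct ticket_lock {
    _Atomic uint8_t next;    /* next ticket to hand out */
    _Atomic uint8_t owner;   /* ticket currently being served */
};

static void ticket_acquire(struct ticket_lock *l)
{
    /* Take a ticket: an atomic fetch-and-add (the "xadd" step). */
    uint8_t me = atomic_fetch_add_explicit(&l->next, 1,
                                           memory_order_relaxed);

    /* Spin until our ticket comes up; waiters are served in FIFO order. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
}

static void ticket_release(struct ticket_lock *l)
{
    /*
     * Only the lock holder writes `owner`, so a plain load followed by a
     * release store is enough here; no atomic read-modify-write is needed.
     */
    uint8_t cur = atomic_load_explicit(&l->owner, memory_order_relaxed);

    atomic_store_explicit(&l->owner, (uint8_t)(cur + 1),
                          memory_order_release);
}
```

The release path is an add on `owner` rather than a store of a constant, which is the "unlocked mov vs unlocked add" difference asked about above. Whether that costs anything measurable is exactly what the thread leaves open.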
Re: [PATCH] ata: Add the SW NCQ support to sata_nv for MCP51/MCP55/MCP61
kuan luo wrote:
> Add the Software NCQ support to sata_nv.c for MCP51/MCP55/MCP61 SATA
> controller.
> NCQ function is disable by default, you can enable it with 'swncq=1'
>
> Signed-off-by: Kuan Luo <[EMAIL PROTECTED]>
> Signed-off-by: Peer Chen <[EMAIL PROTECTED]>

Haven't reviewed in detail, but does look cleaner than the previous
version. Some people reported seeing some unrecognized FIS, etc. errors
with the previous version, have those been looked into/fixed?

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On 6/26/07, Rik van Riel <[EMAIL PROTECTED]> wrote:
> After going through the first malloc()/free() cycle, surely
> the memory will no longer be zeroed on the second malloc() ?

If returned to the system, sure.

> What makes the first brk malloc so special?

If the memory is zeroed it needs not be initialized by malloc. No
calloc zeroing, no pointer clearing.

Anyway, it's irrelevant what the benefits are, the fact is current
code depends on brk to zero the memory and you'd break the ABI if
you'd change it.
Re: 2.6 Linux for PowerPC supports kdb?
On Tue, 26 Jun 2007 23:03:55 -0400 Shan, Guo Wen (Gavin) wrote:

> Does anybody knew if 2.6 linux for PowerPC supports kdb?

PowerPC isn't listed AFAICT:
ftp://oss.sgi.com/www/projects/kdb/download/v4.4/README

I.e., all that I see are i386, x86_64, and ia64.

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On 6/26/07, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> I actually have the code for it, but I never posted it since it did not
> receive a too warm review (and the only user was the fdmap thingy).

Only user of sys_indirect? There will be quite a few right away.
Every syscall that returns a file descriptor needs O_CLOEXEC support
(socket, pipe, epoll_create, ...)

> OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
> brk2() would not be needed, no?

No. mmap calls create individual VMAs, which gets expensive. There
are also some hardware drivers which get more expensive the more VMAs
there are. I want to go away as much as possible from mmap for
malloc.
Re: [PATCH] atkbd: cleanup only once
2007/6/27, Dmitry Torokhov <[EMAIL PROTECTED]>:
On Wednesday 27 June 2007 00:28, Greg KH wrote:
> On Wed, Jun 27, 2007 at 12:34:09AM -0400, Dmitry Torokhov wrote:
> > Hi Dave,
> >
> > On Wednesday 27 June 2007 06:59, Dave Young wrote:
> > > Hi,
> > >
> > > If you press ctrl+alt+del several times as kernel booting (before user level bootin), the kernel will oops. I found the ps2_command is called more than once, then the ps2dev->serio maybe NULL pointer.
> > >
> > > 2.6.22-rc5 and 2.6.22-rc6 have same result.
> > >
> > > Signed-off-by: Dave Young <[EMAIL PROTECTED]>
> > > ---
> > > diff -upr linux/drivers/input/keyboard/atkbd.c linux.new/drivers/input/keyboard/atkbd.c
> > > --- linux/drivers/input/keyboard/atkbd.c 2007-06-27 10:38:37.0 +
> > > +++ linux.new/drivers/input/keyboard/atkbd.c 2007-06-27 10:37:39.0 +
> > > @@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
> > >
> > > static void atkbd_cleanup(struct serio *serio)
> > > {
> > > +	static int flag;
> > > +
> > > +	if(flag)
> > > +		return;
> > > +	flag = 1;
> >
> > Unfortunately this will prevent atkbd from resetting keyboard on 2nd
> > suspend attempt. It will also not work if you have an active MUX and
> > have a couple of keyboards connected.
> >
> > Greg, now that you removed rwsem from subsystem (and subsystem itself
> > for that matter) there is nothing as far as I can see that stops
> > several threads from running device_shutdown() simultaneously. I also
> > do not see what would isolate device probing and shutting them down
> > at the same time. Am I missing something?
>
> There was never anything stopping that from happening before. No driver
> core code was using that rwsem, so it wasn't protecting anything,
> despite people trying to use it as if it was :)
>

It did protect device_shutdown() from itself, didn't it?

--
Dmitry

How about checking ps2dev->serio in ps2_command before using it?

Regards
dave
Re: implement-file-posix-capabilities.patch
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Serge E. Hallyn wrote:
>
>> I don't particularly mind, but can you point out any case where
>> it is an advantage to have the one bit for f'E rather than just
>> drop f'E altogether? Instead of having
>
>>	f'I=something
>>	f'P=something
>>	f'E=off
>
>> we can always just remove the security.capability xattr. Right?

No.

Bear in mind that capabilities are all about trusting specific
applications with privilege as opposed to trusting the superuser to not
run dangerous applications.

There are three situations, we'll take them in turn:

 - no capabilities (fP=fI=fE=0): this is for applications that are not
   intended to operate with privilege. Because of the way the capability
   convolution rules work, such a program can't execute with privilege.
   Period.

 - with capabilities (fP and/or fI != 0), but fE=0 (off): this is for
   applications that are intended to operate with privilege, but they
   were designed to know about capabilities and they manipulate (raise
   and lower) capabilities as needed to do the things they do.

 - with capabilities, but fE=1 (on): this is a class of applications
   loosely called 'legacy'. They can use privilege to operate, but don't
   strictly need to know about it. For example, /bin/chown . Such a
   program will have fP=0,fI=CAP_CHOWN. Since the administrator sets
   fP=0,fI=CAP_CHOWN,fE=1 on the /bin/chown file, any process with
   CAP_CHOWN in its inheritable set (pI) can "exec /bin/chown" and have
   it do the thing historically reserved for the superuser...

In some future world, the legacy fE bit may become unnecessary because
every application will be rewritten to be careful about exercising
privilege explicitly. In the meantime, the fE bit can be used to drop
the setuid-0 bits from things like ping and traceroute.

>> If there's a case where that does not suffice, then I have no objection
>> to doing it this way.

Does that explain it?
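The message refers to the "capability convolution rules" without spelling them out. The sketch below assumes the commonly documented form of those rules for exec (pI' = pI, pP' = (fP & bset) | (fI & pI), pE' = fE ? pP' : 0), which is an assumption here rather than something stated in this thread. The type and function names (cap_set, struct file_caps, struct proc_caps, caps_after_exec) are hypothetical.

```c
#include <stdint.h>

typedef uint32_t cap_set;

/* CAP_CHOWN is capability number 0, so it is bit 0 of a capability set. */
#define CAP_CHOWN_BIT ((cap_set)1u << 0)

/* File capabilities: fP, fI and the single fE bit. */
struct file_caps {
    cap_set permitted;
    cap_set inheritable;
    int effective;
};

/* Process capabilities: pP, pI, pE. */
struct proc_caps {
    cap_set permitted;
    cap_set inheritable;
    cap_set effective;
};

/*
 * Assumed exec-time rules:
 *   pI' = pI
 *   pP' = (fP & bset) | (fI & pI)
 *   pE' = fE ? pP' : 0
 */
static struct proc_caps caps_after_exec(struct proc_caps p,
                                        struct file_caps f,
                                        cap_set bset)
{
    struct proc_caps n;

    n.inheritable = p.inheritable;
    n.permitted = (f.permitted & bset) | (f.inheritable & p.inheritable);
    n.effective = f.effective ? n.permitted : 0;
    return n;
}
```

Applied to the /bin/chown example (fP=0, fI=CAP_CHOWN, fE=1): a process with CAP_CHOWN in pI ends up with CAP_CHOWN permitted and effective after exec, a process without it gains nothing, and a file with fP=fI=0 never confers privilege.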
> 2) Allocate capability bit-31 for CAP_SETFCAP, and use it to gate
> whether the user can set this xattr on a file or not. CAP_SYS_ADMIN is
> way too overloaded and this functionality is special.
>
>> The functionality is special, but someone with CAP_SYS_ADMIN can always
>> unload the capability module and create the security.capability xattr
>> using the dummy module.

This argument leads down a rat hole. (As appears to have happened with
the non-modularization LSM thread elsewhere...)

The simple fact is that CAP_SYS_ADMIN is equivalent to every other
capability in the system if it can be used to load any flavor of kernel
module. Arguing that you don't need a capability for something because
you can do it with CAP_SYS_ADMIN is very close to admitting that you
prefer the superuser (uid=0) model.

>> If we do add this cap, do we want to make it apply to all security.*
>> xattrs?

I recommend limiting it to just capabilities. For now, you can leave the
other security attributes with the CAP_SYS_ADMIN ("misc") capability.

In the original 'POSIX' drafts, there is a separate notion of advisory
and mandatory access control; leveraging different capabilities to
override.

> 3) The cap_from_disk() interface checking needs some work. Most
> notably, size must be greater than sizeof(u32) or the very first line
> will do something nasty... I'd recommend you use code like this:
>
> [...] cap_from_disk(...)
> {
>	if (size != sizeof(struct vfs_cap_data)) {
[...]
> mistake at least once... The future is uncertain, so don't trust it will
> look the way you want it to. ;-)
>
>> Ok, so you're saying that when we do switch to 64-bit caps or some other
>> evolution, we switch to completely separate logic based on the
>> VFS_CAP_REVISION?

Yes.

>> That seems sane to me.

You might also want to verify that unallocated bits hold the value
'zero'. It's funny what people do when they realize they can silently
store bits in obscure places like this. That really messes up allocating
bits in the future. Add a check that is something like:

	if (version & ~(VFS_CAP_REVISION_MASK|VFS_CAP_FLAGS_EFFECTIVE)) {
		return -EINVAL;
	}

> 7) This one is subtle, and to my mind not well appreciated. In
> cap_bprm_apply_creds(), the wart of the global 'cap_bset' masking
> permitted bits can lead to problems like the one we saw a few years back
> with sendmail and capabilities. There is an assumption in setting
> permitted (they are called 'forced' in some documents) capabilities on a
> file that the file will execute with at least these. The inheritable
> ones are optional.
>
>> Hmm, changing the behavior of the cap_bset is something that seems to
>> belong in 8), though I see what you're saying, it does affect the
>> behavior of vfs caps.

I'm not really changing the behavior of cap_bset. I'm specifying the
behavior of fP. ;-)

[Your comments on cap_bset and CAP_SETPCAP are exactly where my
ax^H^H^Hscalpel will fall after all this VFS support is stable. But
*that* is a subject for a different thread... aka item 8.]
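Putting the two checks suggested above together (exact size, and no stray bits in the version word), a validation helper might look roughly like this. It is a sketch only: the struct layout, the numeric values of the VFS_CAP_* constants and the function name cap_check_disk() are assumptions made for illustration, the revision comparison is an extra step not spelled out in the message, and byte order is ignored.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values and layout, chosen only for this illustration. */
#define VFS_CAP_REVISION_MASK   0xFF000000u
#define VFS_CAP_REVISION        0x01000000u
#define VFS_CAP_FLAGS_EFFECTIVE 0x00000001u

struct vfs_cap_data {
    uint32_t version;       /* revision in the top byte, flags below it */
    uint32_t permitted;
    uint32_t inheritable;
};

/*
 * Validate an on-disk capability blob before trusting any of its fields.
 * Returns 0 if it looks sane, -EINVAL otherwise.
 */
static int cap_check_disk(const struct vfs_cap_data *d, size_t size)
{
    /* Check the size first, so a short buffer is never dereferenced. */
    if (size != sizeof(struct vfs_cap_data))
        return -EINVAL;

    /* Only the one revision this code understands is accepted. */
    if ((d->version & VFS_CAP_REVISION_MASK) != VFS_CAP_REVISION)
        return -EINVAL;

    /* Unallocated bits must be zero so they can be assigned later. */
    if (d->version & ~(VFS_CAP_REVISION_MASK | VFS_CAP_FLAGS_EFFECTIVE))
        return -EINVAL;

    return 0;
}
```

Rejecting unknown bits up front means a later revision can give them a meaning without having to cope with files that already carry garbage there.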
Re: [PATCH] atkbd: cleanup only once
On Wednesday 27 June 2007 00:28, Greg KH wrote:
> On Wed, Jun 27, 2007 at 12:34:09AM -0400, Dmitry Torokhov wrote:
> > Hi Dave,
> >
> > On Wednesday 27 June 2007 06:59, Dave Young wrote:
> > > Hi,
> > >
> > > If you press ctrl+alt+del several times as kernel booting (before user
> > > level bootin), the kernel will oops. I found the ps2_command is called
> > > more than once, then the ps2dev->serio maybe NULL pointer.
> > >
> > > 2.6.22-rc5 and 2.6.22-rc6 have same result.
> > >
> > > Signed-off-by: Dave Young <[EMAIL PROTECTED]>
> > > ---
> > > diff -upr linux/drivers/input/keyboard/atkbd.c
> > > linux.new/drivers/input/keyboard/atkbd.c
> > > --- linux/drivers/input/keyboard/atkbd.c 2007-06-27 10:38:37.0
> > > +
> > > +++ linux.new/drivers/input/keyboard/atkbd.c 2007-06-27
> > > 10:37:39.0 +
> > > @@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
> > >
> > > static void atkbd_cleanup(struct serio *serio)
> > > {
> > > +	static int flag;
> > > +
> > > +	if(flag)
> > > +		return;
> > > +	flag = 1;
> >
> > Unfortunately this will prevent atkbd from resetting keyboard on 2nd
> > suspend attempt. It will also not work if you have an active MUX and
> > have a couple of keyboards connected.
> >
> > Greg, now that you removed rwsem from subsystem (and subsystem itself
> > for that matter) there is nothing as far as I can see that stops
> > several threads from running device_shutdown() simultaneously. I also
> > do not see what would isolate device probing and shutting them down
> > at the same time. Am I missing something?
>
> There was never anything stopping that from happening before. No driver
> core code was using that rwsem, so it wasn't protecting anything,
> despite people trying to use it as if it was :)
>

It did protect device_shutdown() from itself, didn't it?

--
Dmitry
Re: [PATCH] atkbd: cleanup only once
On Wed, Jun 27, 2007 at 12:34:09AM -0400, Dmitry Torokhov wrote:
> Hi Dave,
>
> On Wednesday 27 June 2007 06:59, Dave Young wrote:
> > Hi,
> >
> > If you press ctrl+alt+del several times as kernel booting (before user
> > level bootin), the kernel will oops. I found the ps2_command is called more
> > than once, then the ps2dev->serio maybe NULL pointer.
> >
> > 2.6.22-rc5 and 2.6.22-rc6 have same result.
> >
> > Signed-off-by: Dave Young <[EMAIL PROTECTED]>
> > ---
> > diff -upr linux/drivers/input/keyboard/atkbd.c
> > linux.new/drivers/input/keyboard/atkbd.c
> > --- linux/drivers/input/keyboard/atkbd.c 2007-06-27 10:38:37.0
> > +
> > +++ linux.new/drivers/input/keyboard/atkbd.c 2007-06-27
> > 10:37:39.0 +
> > @@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
> >
> > static void atkbd_cleanup(struct serio *serio)
> > {
> > +	static int flag;
> > +
> > +	if(flag)
> > +		return;
> > +	flag = 1;
>
> Unfortunately this will prevent atkbd from resetting keyboard on 2nd
> suspend attempt. It will also not work if you have an active MUX and
> have a couple of keyboards connected.
>
> Greg, now that you removed rwsem from subsystem (and subsystem itself
> for that matter) there is nothing as far as I can see that stops
> several threads from running device_shutdown() simultaneously. I also
> do not see what would isolate device probing and shutting them down
> at the same time. Am I missing something?

There was never anything stopping that from happening before. No driver
core code was using that rwsem, so it wasn't protecting anything,
despite people trying to use it as if it was :)

That's why I removed it.

So, if you need to have a lock for your subsystem to serialize this,
please do so, I have no objection to it.

thanks,

greg k-h
Re: scheduling while atomic and DEBUG_SPINLOCK_SLEEP
On Tue, 2007-06-26 at 22:55 -0400, Jon Ringle wrote:
> Hello,
> Out of these two, the first one that is showing "in_atomic():1" seems
> more likely to me to be a potential cause of the "scheduling while
> atomic" dump.
>
> Does this logic seem reasonable? Are there other debugging techniques I
> can use to narrow down the cause for the "scheduling while atomic"?

you could start by giving us pointers to the sources of the two
drivers... without that... how can we look and help?
Re: [PATCH] atkbd: cleanup only once
Hi Dave,

On Wednesday 27 June 2007 06:59, Dave Young wrote:
> Hi,
>
> If you press ctrl+alt+del several times as kernel booting (before user level
> bootin), the kernel will oops. I found the ps2_command is called more than
> once, then the ps2dev->serio maybe NULL pointer.
>
> 2.6.22-rc5 and 2.6.22-rc6 have same result.
>
> Signed-off-by: Dave Young <[EMAIL PROTECTED]>
> ---
> diff -upr linux/drivers/input/keyboard/atkbd.c
> linux.new/drivers/input/keyboard/atkbd.c
> --- linux/drivers/input/keyboard/atkbd.c 2007-06-27 10:38:37.0
> +
> +++ linux.new/drivers/input/keyboard/atkbd.c 2007-06-27 10:37:39.0
> +
> @@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
>
> static void atkbd_cleanup(struct serio *serio)
> {
> +	static int flag;
> +
> +	if(flag)
> +		return;
> +	flag = 1;

Unfortunately this will prevent atkbd from resetting keyboard on 2nd
suspend attempt. It will also not work if you have an active MUX and
have a couple of keyboards connected.

Greg, now that you removed rwsem from subsystem (and subsystem itself
for that matter) there is nothing as far as I can see that stops
several threads from running device_shutdown() simultaneously. I also
do not see what would isolate device probing and shutting them down
at the same time. Am I missing something?

--
Dmitry
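The two objections to the function-level static flag (second suspend, several keyboards) can be reproduced with a small user-space model. This is only an illustration: struct kbd, the `resets` counter and the per-device variant are hypothetical and were not proposed in the thread. The sketch also says nothing about the concurrency question raised above, which a flag alone cannot solve.

```c
#include <stdbool.h>

/* Toy stand-in for a keyboard device. */
struct kbd {
    int resets;      /* how many times the keyboard was actually reset */
    bool cleaned;    /* per-device "cleanup already ran" marker */
};

/* Patch-style guard: one flag shared by every device and every call. */
static void cleanup_static_flag(struct kbd *k)
{
    static int flag;

    if (flag)
        return;
    flag = 1;
    k->resets++;
}

/* Hypothetical alternative: the marker lives in the device itself. */
static void cleanup_per_device(struct kbd *k)
{
    if (k->cleaned)
        return;
    k->cleaned = true;
    k->resets++;
}

/* Re-arm the marker, e.g. when the device comes back from suspend. */
static void kbd_resume(struct kbd *k)
{
    k->cleaned = false;
}
```

With the static flag, only the first keyboard ever gets reset, and never again after that. With per-device state, each keyboard is reset once per suspend cycle.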
Re: [patch 1/3] MAP_NOZERO - implement a new VM_NOZERO/MAP_NOZERO page retirement policy
On Wed, 27 Jun 2007, Rik van Riel wrote:

> Davide Libenzi wrote:
> > On Tue, 26 Jun 2007, Rik van Riel wrote:
> >
> > > SUID programs should not be able to use this feature,
> > > either.
> >
> > Why? A SUID program runs under the UID of the owner, and there should be
> > no problem in it seeing the owner's data.
>
> Because an SUID program can change its UID back.
>
> At least, one that was SUID root. OTOH, any
> program running as root can change UID, so we
> should probably not allow root to get nonzeroed
> pages.

Well, root can in general access the whole system in any case. At the moment,
root cannot access other UIDs' pages. Only their own. And this differs from
standard security policies where root can access everything.
Pages used internally by the kernel cannot be reused by anyone.

> > I tried to look, and the attempt to reuse _mapcount failed miserably :)
> > The last time we have the owner info (vma->mm) available, is before
> > processing of the other fields ends. OTOH I'm not a VM guru either, so I
> > may be wrong. It can share ->virtual (when enabled).
>
> I think the process that actually calls the page freeing
> functions is always the process that owned the page, so
> going for current->mm should work.

I'll try to see if that works out...

- Davide
Re: [PATCH try #2] security: Convert LSM into a static interface
* Crispin Cowan ([EMAIL PROTECTED]) wrote:
> and simple LSMs that can be
> unloaded safely can permit it.

there are none, and making the above possible is prohibitively expensive.
Re: [patch 1/3] MAP_NOZERO - implement a new VM_NOZERO/MAP_NOZERO page retirement policy
Davide Libenzi wrote:
> On Tue, 26 Jun 2007, Rik van Riel wrote:
>
> > SUID programs should not be able to use this feature,
> > either.
>
> Why? A SUID program runs under the UID of the owner, and there should be
> no problem in it seeing the owner's data.

Because an SUID program can change its UID back.

At least, one that was SUID root. OTOH, any
program running as root can change UID, so we
should probably not allow root to get nonzeroed
pages.

> But the patch post was more a quest for possible scenarios where the use
> of MAP_NOZERO can result in lower security WRT the same program (under the
> same security restrictions) not using such feature. If you have something
> specific in mind, please go ahead and shoot.

Besides the non-enforcing of SELinux security labels
(and maybe namespaces?), I cannot think of anything.

> > > When pages exit (unmapped from) a vma, they are marked with the effective
> > > UID of the mm_struct that owns it.
> > >
> > > --- linux-2.6.mod.orig/include/linux/mm_types.h 2007-06-21 14:02:06.0 -0700
> > > +++ linux-2.6.mod/include/linux/mm_types.h 2007-06-25 19:11:22.0 -0700
> > > @@ -64,6 +64,7 @@
> > >  	struct list_head lru;	/* Pageout list, eg. active_list
> > >  				 * protected by zone->lru_lock !
> > >  				 */
> > > +	int owner_uid;		/* Last owner of the page */
> > >  	/*
> > >  	 * On machines where all RAM is mapped into kernel address space,
> > >  	 * we can simply calculate the virtual address. On machines with
> >
> > Since this is only set when the page is freed, could
> > the owner_uid and security context be put inside a
> > union with some fields that are not otherwise used
> > for free pages?
>
> I tried to look, and the attempt to reuse _mapcount failed miserably :)
> The last time we have the owner info (vma->mm) available, is before
> processing of the other fields ends. OTOH I'm not a VM guru either, so I
> may be wrong. It can share ->virtual (when enabled).

I think the process that actually calls the page freeing
functions is always the process that owned the page, so
going for current->mm should work.

Getting the UID wrong for file pages caught in a truncate
is fine, since the process obviously already had access to
the data in that page.

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
Ulrich Drepper wrote:
> On 6/26/07, Rik van Riel <[EMAIL PROTECTED]> wrote:
> > Since programs can get back free()d memory after a malloc(),
> > with the old contents of the memory intact, surely your
> > MAP_NOZERO behavior could be the default for programs that
> > can get away with it?
> >
> > Maybe we could use some magic ELF header, similar to the
> > way non-executable stack is handled?
>
> No. This is an implementation detail of the libc version. The malloc
> as compiled today is expecting brk-ed memory to be zeroed. This
> default can of course be changed (it's a simple define) but you cannot
> make this the default behavior for brk.

After going through the first malloc()/free() cycle, surely
the memory will no longer be zeroed on the second malloc()?

What makes the first brk malloc so special?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
Re: Is it time for remove (crap) ALSA from kernel tree ?
On 06/26/2007 10:39 PM, Andreas Hartmetz wrote:

> Okay, here's a rant. As an interested kernel outsider and KDE
> developer(*), it looks to me like most kernel people are too focused on
> the history and feature lists of the particular technologies here. The
> real matter with ALSA is that you get a strong "ALSA hates me" feeling
> when dealing with it. There is bad documentation, bad API, and a config
> file syntax that is much harder to understand than necessary.

I'll agree to the documentation bit; the funny thing is that it's partly
caused by documentation actually being, or once having been, _better_ than
it is for the average subsystem. ALSA for example has the useful "Writing an
ALSA Driver" document from Takashi Iwai:

http://www.alsa-project.org/~iwai/writing-an-alsa-driver/index.html

Documentation becomes obsolete as code progresses though and yes, especially
on the user side of things the documentation is slow to follow. And then the
usual problem of no one ever removing obsolete junk from the web exacerbates
matters. Google will find you tons of useless, outdated crap but if you need
the information in the first place, you don't know that it _is_ obsolete.

And yes, this unfortunately includes www.alsa-project.org. For the longest
time it was advocating writing ~/.asoundrc files, for example through generic
driver boilerplate texts, while that was actually at that point mostly
counterproductive in getting ALSA functional.

As to the config file -- well, sure. The best thing about it is that normally
you don't need it...

The "bad API" I find interesting since you are a KDE developer. I'm not an
audio application developer myself so I don't have (m)any well thought out
opinions on it, but isn't the only thing in KDE4 that talks to ALSA the
Phonon ALSA backend? If you are talking in that context, I'm quite sure the
alsa-user and/or alsa-devel lists (@alsa-project.org) would like to hear
about any specific comments/problems. Getting the Phonon backend right from
the start is something that seems important.

> Then there is the kernel/library split that seems to have no convincing
> reason at all in its current form. Why not put the whole sound system in
> userland? It has been done before. Sound is just not performance critical
> at all and it's almost never mission critical

Heh. Sound may not be, but audio is. For the longest time, audio users stuck
with 2.4 kernels and the low-latency patches that were available for it due
to latency issues. Large parts of ALSA already are in userland in the form of
libasound and I expect moving everything over would not help much.

[ ... ]

> The track record of ALSA for me goes like this:
> - dmix finally started working automatically (at least on my Kubuntu
> system) about one year ago, about five years after everybody could see
> that this was badly needed. I couldn't get it to work before. The howtos
> somehow didn't work and ALSA's documentation isn't all that helpful.

dmix was really only implemented (or at least, made default) for casual
users. Hope it'll not come across as elitist but people who are serious about
music or audio don't actually need or want it. It's a consumer thing. To have
software mixing work you have to resample to a common rate and this is an
absolutely unthinkable horror to a serious user.

It's a good thing it's now default, but only because a majority of sound
users is not serious (simply because it's mostly all computer users).

> - Different desktop environments have different sound daemons to paper
> over the weaknesses of ALSA (no dmix by default / unfriendly API), which
> creates new problems. Yes there are other reasons for sound daemons, but
> I doubt anybody would have come up with the idea if it wasn't for ALSA.

Given that they existed before ALSA did this seems to be a somewhat odd
doubt.

> - I have an Envy24HT based soundcard in my desktop PC, which also goes to
> show that I'm really interested in sound issues.

Nice chip. I don't have one, and am not too sure about its native supported
rates but if you are mostly playing 44100 through it (ie, CD source audio)
I'd consider doing without dmix. A nice sounding chip like that shouldn't be
subjected to resampling really.

Someone recently informed me on the ALSA list that Envy24 indeed doesn't do
hardware mixing though, so I guess you may need it if you really do want to
also have the card available for desktop sounds.

> I have to run alsamixer after every bootup to unmute the left channel
> because restoring volume only works for the right channel. The left
> channel starts working after changing its volume.

Sounds like a rather debuggable problem. I'm (almost) sure someone will try
to get you a useful answer if you post to the [EMAIL PROTECTED] list :)

> - On my IBM/Lenovo R50e notebook with Intel chipset sound didn't work
> before I "muted" the "headphone jack sense" control in alsamixer. That
> took two hours or so. When both the master volume and the PCM volume
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On Tue, 26 Jun 2007, Ulrich Drepper wrote:

> On 6/26/07, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> > The following patch implements the sys_brk2() syscall, which is nothing
> > other than a sys_brk() with an extra "flags" parameter.
>
> Shouldn't we wait for Linus' sys_indirect to arrive and make this
> another syscall which takes advantage of it?

I actually have the code for it, but I never posted it since it did not
receive a too warm review (and the only user was the fdmap thingy).
OTOH glibc could implement __morecore using mmap(MAP_NOZERO), and hence
brk2() would not be needed, no?

- Davide
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On 6/26/07, Davide Libenzi <[EMAIL PROTECTED]> wrote:
> The following patch implements the sys_brk2() syscall, which is nothing
> other than a sys_brk() with an extra "flags" parameter.

Shouldn't we wait for Linus' sys_indirect to arrive and make this
another syscall which takes advantage of it?
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On 6/26/07, Rik van Riel <[EMAIL PROTECTED]> wrote:
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NOZERO behavior could be the default for programs that
> can get away with it?
>
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

No. This is an implementation detail of the libc version. The malloc
as compiled today is expecting brk-ed memory to be zeroed. This
default can of course be changed (it's a simple define) but you cannot
make this the default behavior for brk.
NVidia Driver Support - 1680x1050 mode
Trying to get my Asus M2NPV-VM motherboard and my Samsung SyncMaster 215tw Digital to work in 1680x1050 mode but 1280x1024 is the most I can get. Chip Set is GeForce 6150. Looking in Xorg.0.log it seems to think that the panel size is 1280x1024 in spite of my setting telling it differently. Sorry if this is off topic but I thought that the smart people would be here. In Windows I just plug it in and it works. So I figure Linux should work too. :)
Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3
Alexandre Oliva wrote:
> On Jun 26, 2007, Al Boldi <[EMAIL PROTECTED]> wrote:
> > I read your scenario of the vendor not giving you the source to mean:
> > not directly; i.e. they could give you a third-party download link.
>
> This has never been enough to comply with GPLv2.

Section 3a of the GPLv2 mentions "a medium customarily used for software
interchange". I would think the Internet is a medium customarily used for
software interchange, is it not?

Thanks!

--
Al
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
On Tue, 26 Jun 2007, Rik van Riel wrote:

> Davide Libenzi wrote:
> > The following patch implements the sys_brk2() syscall, which is nothing
> > other than a sys_brk() with an extra "flags" parameter. This can be used
> > to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> > pages if possible.
>
> Since programs can get back free()d memory after a malloc(),
> with the old contents of the memory intact, surely your
> MAP_NOZERO behavior could be the default for programs that
> can get away with it?
>
> Maybe we could use some magic ELF header, similar to the
> way non-executable stack is handled?

Well, the quick glibc patch simply uses an environment variable, just
because I wanted to bench the kernel build using the same glibc+gcc.
Yes, it can be the default behaviour for the allocator. The patch handles
calloc() correctly, by forcibly zeroing memory in such calls.
But other software must be taught too, to use MAP_NOZERO when it does not
need zeroed memory. I did that for the gcc garbage collector.

- Davide
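The calloc() point can be sketched in user space. This is a toy illustration only, not Davide's glibc patch: `dirty_pool`, `toy_pool_reset()`, `toy_alloc()` and `toy_calloc()` are hypothetical names, and the pool is deliberately pre-filled with a non-zero byte to stand in for pages handed over without zeroing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 4096

/* Stand-in for memory that was handed over without being zeroed. */
static _Alignas(16) unsigned char dirty_pool[POOL_SIZE];
static size_t pool_used;

/* Fill the pool with a recognizable non-zero pattern and reset the allocator. */
static void toy_pool_reset(void)
{
    memset(dirty_pool, 0xAA, sizeof(dirty_pool));
    pool_used = 0;
}

/* Bump allocator: like malloc(), it promises nothing about the contents. */
static void *toy_alloc(size_t size)
{
    /* Round up to a multiple of 16 so every returned pointer stays aligned. */
    size_t rounded = (size + 15u) & ~(size_t)15u;

    assert(pool_used <= POOL_SIZE);
    if (rounded < size || rounded > POOL_SIZE - pool_used)
        return NULL;
    void *p = dirty_pool + pool_used;
    pool_used += rounded;
    return p;
}

/* calloc() must return zeroed memory, so it clears the block explicitly. */
static void *toy_calloc(size_t nmemb, size_t size)
{
    if (size != 0 && nmemb > SIZE_MAX / size)
        return NULL;                 /* nmemb * size would overflow */
    void *p = toy_alloc(nmemb * size);
    if (p != NULL)
        memset(p, 0, nmemb * size);
    return p;
}
```

The design choice mirrors the description above: the zeroing cost moves from the page supplier to the one caller that actually depends on zeroed memory, while plain allocations skip it.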
Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)
On Tue, 26 Jun 2007, Roland McGrath wrote: > > Here's the next iteration. The arch-specific parts are now completely > > encapsulated. validate_settings is in a form which should be workable > > on all architectures. And the address, length, and type are passed as > > arguments to register_{kernel,user}_hw_breakpoint(). > > I like it! Good. My earlier stubbornness was caused by a desire to allow static initializers, but now I see that specifying the values in the registration call really isn't all that bad. > > I haven't tried to modify Kconfig at all. To do it properly would > > require making ptrace configurable, which is not something I want to > > tackle at the moment. > > You don't need to worry about that. Under utrace, CONFIG_PTRACE is > already separate and can be turned off. I don't think we need really to > finish the Kconfig stuff at all before I merge it into the utrace code. So far this work has all been based on the vanilla kernel. Should I switch over to basing it on -mm? > Calling send_sigtrap twice during the same exception does happen to be > harmless, but I don't think it should be presumed to be. It is just not > the right way to go about things that you send a signal twice when there > is one signal you want to generate. What happens when there are two ptrace exceptions at different points during the same system call? Won't we end up sending the signal twice no matter what? > Also, send_sigtrap is an i386-only function (not even x86_64 has the > same). Only x86_64 will share this actual code, but all others will be > modelled on it. I think it makes things simplest across the board if > the standard form is that when there is a ptrace exception, the notifier > does not return NOTIFY_STOP, so it falls through to the existing SIGTRAP > arch code. > > So, hmm. In the old do_debug code, if a notifier returns NOTIFY_STOP, > it bails immediately, before the db6 value is saved in current->thread. 
> This is the normal theory of notify_die use, where NOTIFY_STOP means to > completely swallow the event as if it never happened. In the event > there were some third party notifier involved, it ought to be able to > swallow its magic exceptions as before and have no user-visible db6 > change happen at the time of that exception. So how about this: > > get_debugreg(condition, 6); > set_debugreg(0UL, 6); /* The CPU does not clear it. */ > > if (notify_die(DIE_DEBUG, "debug", regs, condition, error_code, > SIGTRAP) == NOTIFY_STOP) > return; > > The kprobes notifier uses max priority, so it will run first. Its > notifier code uses my version. For a single-step that belongs to it, > it will return NOTIFY_STOP and nothing else happens (noone touches > vdr6). (I think I'm dredging up old territory by asking what happens > when kprobes steps over an insn that hits a data breakpoint, but I > don't recall atm.) In theory we should get an exception with both DR_STEP and DR_TRAPn set, meaning that neither notifier will return NOTIFY_STOP. But if the kprobes handler clears DR_STEP in the DR6 image passed to the hw_breakpoint handler, it should work out better. > vdr6 belongs wholly to hw_breakpoint, no other code refers to it > directly. hw_breakpoint's notifier sets vdr6 with non-DR_TRAPn bits, > if it's a user-mode exception. If it's a ptrace exception it also > sets the mapped DR_TRAPn bits. If it's not a ptrace exception and > only DR_TRAPn bits were newly set, then it returns NOTIFY_STOP. If > it's a spurious exception from lazy db7 setting, hw_breakpoint just > returns NOTIFY_STOP early. That sounds not quite right. To a user-space debugger, a system call should appear as an atomic operation. If multiple ptrace exceptions occur during a system call, all the relevant DR_TRAPn bits should be set in vdr6 together and all the other ones reset. How can we arrange that? There's also the question of whether to send the SIGTRAP. 
If extraneous bits are set in DR6 (e.g., because the CPU always sets some extra bits) then we will never get NOTIFY_STOP. Nevertheless, the signal should not always be sent. > > @@ -484,7 +495,8 @@ int copy_thread(int nr, unsigned long cl > > > > err = 0; > > out: > > - if (err && p->thread.io_bitmap_ptr) { > > + if (err) { > > + flush_thread_hw_breakpoint(p); > > kfree(p->thread.io_bitmap_ptr); > > p->thread.io_bitmap_max = 0; > > } > > This can call kfree(NULL). I would leave the original code alone, i.e.: > > if (err) > flush_thread_hw_breakpoint(p); > if (err && p->thread.io_bitmap_ptr) { > kfree(p->thread.io_bitmap_ptr); > p->thread.io_bitmap_max = 0; > } I disagree. kfree() is documented to return harmlessly when passed a NULL pointer, and lots of places in the kernel have been changed to remove useless tests for NULL before calls to
Re: [patch 1/3] MAP_NOZERO - implement a new VM_NOZERO/MAP_NOZERO page retirement policy
On Tue, 26 Jun 2007, Rik van Riel wrote:

> SUID programs should not be able to use this feature,
> either.

Why? A SUID program runs under the UID of the owner, and there should be
no problem in it seeing the owner's data.
But the patch post was more a quest for possible scenarios where the use
of MAP_NOZERO can result in lower security WRT the same program (under the
same security restrictions) not using such feature. If you have something
specific in mind, please go ahead and shoot.

> > When pages exit (unmapped from) a vma, they are marked with the effective
> > UID of the mm_struct that owns it.
> >
> > --- linux-2.6.mod.orig/include/linux/mm_types.h 2007-06-21 14:02:06.0 -0700
> > +++ linux-2.6.mod/include/linux/mm_types.h 2007-06-25 19:11:22.0 -0700
> > @@ -64,6 +64,7 @@
> >  	struct list_head lru;	/* Pageout list, eg. active_list
> >  				 * protected by zone->lru_lock !
> >  				 */
> > +	int owner_uid;		/* Last owner of the page */
> >  	/*
> >  	 * On machines where all RAM is mapped into kernel address space,
> >  	 * we can simply calculate the virtual address. On machines with
>
> Since this is only set when the page is freed, could
> the owner_uid and security context be put inside a
> union with some fields that are not otherwise used
> for free pages?

I tried to look, and the attempt to reuse _mapcount failed miserably :)
The last time we have the owner info (vma->mm) available, is before
processing of the other fields ends. OTOH I'm not a VM guru either, so I
may be wrong. It can share ->virtual (when enabled).

- Davide
2.6 Linux for PowerPC supports kdb?
Does anybody know if 2.6 Linux for PowerPC supports kdb?

Best Regards,
Gavin
Re: [RFC 1/7] cpuset write dirty map
On Tue, 26 Jun 2007, Andrew Morton wrote:

> Is in my queue somewhere. Could be that by the time I get to it it will
> need refreshing (again), we'll see.
>
> One open question is the interaction between these changes and with Peter's
> per-device-dirty-throttling changes. They also are in my queue somewhere.
> Having a 100:1 coder:reviewer ratio doesn't exactly make for swift
> progress.

H.. How can we help? I can look at some aspects of Peter's per device
throttling.
Re: [PATCH] hw_random: add quality categories
On Tue, Jun 26, 2007 at 04:45:24PM +0200, Michael Buesch wrote: > On Tuesday 26 June 2007 16:32:37 Matt Mackall wrote: > > > No wait. You are missing the whole point of this > > > quality category. > > > The whole point of it is to prevent defaulting to a bad RNG, if > > > there's a bad and a good one in a machine. > > > Well, what's bad. > > > It's easy. HWRNGs like the one in bcm43xx are bad. > > > It's proprietary and nobody knows what it does (I guess > > > it gathers the entropy from the network or something > > > and hashes that in hardware). > > > So such a device would be QUAL_LOW. > > > > If it's gathering its entropy from the network, it is not a QUAL_LOW > > RNG because it is not a hardware random number generator at all! > > > > Such a device is QUAL_PSEUDO or QUAL_UNKNOWN. If it's known or > > suspected to be bogus, it should be so marked. > > No, it should not be marked pseudo. It _is_ a RNG in hardware. Again, if it's not using an underlying physical process that's unpredictable, it does not deserve to be called a real HWRNG. It's no better than the software PRNG in the kernel at that point. If you have a reasonable suspicion that this is the case with the BCM part, then you should so mark it. > Where it gets its entropy from is unknown. (I'm just guessing > around). > PSEUDO is for example for entropy gathered from hardware sensors. Not sure what this means. Some hardware sensors are quite good sources of noise. What gets you into trouble is when the sources are either predictable (ie heavily correlated with fixed-frequency crosstalk), observable (ie wireless traffic), or controllable (ie wireless traffic). > > Once you've merged your LOW class with PSEUDO, you're left with a > > meaningless, unquantifiable distinction between NORMAL and HIGH. > > No, that's not true. I explained the difference to you and it's even > explained in the kdoc help text. Re-read it, please. 
> HIGH is for seperate dedicated extension devices that you buy and > stick into your machine. So it would default to that, as you want > to use that by default (why would you otherwise stick it in). I do not believe there exist devices that deserve to be classified as "HIGH". Any device that makes this claim probably instead deserves to be classified as "SNAKE OIL". Making a high-quality HWRNG is easy, and cheap (>$.05), and very hard to improve on except by upping the bandwidth. Anyone who tells you that their HWRNG is significantly or even measurably better than the one in, say, VIA Padlock, in any dimension except for speed, they are almost certainly LYING. Given that, I'd really rather not create an opportunity for such snake oil salesmen to claim to be "the only Linux-supported RNG to use QUAL_HIGH" or some such bullshit. > To say it again: It all is _just_ for defining a sane _default_ > policy. That's all. > Currently the policy is: "Select whatever comes first", which is > random. So it could select crap (bcm43xx) over not-so-crap (in-CPU-RNG). That's perfectly reasonable. And all I'm saying is please have only two levels: CRAP and NOTCRAP. Anything else just muddies the waters. -- Mathematics is the supreme nostalgia of our time. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
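The default-selection policy under discussion can be sketched as follows. This is only an illustration under stated assumptions, not the hw_random core's actual interface: `enum rng_quality`, `struct rng_dev` and `pick_default_rng()` are hypothetical names, and the two-level scale follows the "CRAP / NOTCRAP" suggestion rather than the LOW/NORMAL/HIGH categories of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-level scale, following the "CRAP / NOTCRAP" suggestion. */
enum rng_quality {
    RNG_QUAL_UNTRUSTED = 0,   /* unknown or suspect entropy source */
    RNG_QUAL_TRUSTED   = 1    /* backed by an unpredictable physical process */
};

/* Hypothetical registration record for one RNG device. */
struct rng_dev {
    const char *name;
    enum rng_quality quality;
};

/*
 * Pick the default RNG: the first registered device with the highest quality.
 * Ties keep registration order, so with equal quality this degrades to the
 * existing "whatever comes first" behaviour. Returns NULL for an empty list.
 */
static const struct rng_dev *pick_default_rng(const struct rng_dev *devs, size_t n)
{
    const struct rng_dev *best = NULL;

    assert(devs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (best == NULL || devs[i].quality > best->quality)
            best = &devs[i];
    }
    return best;
}
```

With only two levels there is no ranking among trusted devices, which is the point of the objection: the scale exists solely to keep a suspect device from becoming the default when a better one is present.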
Re: [patch 2/3] MAP_NOZERO - implement sys_brk2()
Davide Libenzi wrote:
> The following patch implements the sys_brk2() syscall, which is nothing
> other than a sys_brk() with an extra "flags" parameter. This can be used
> to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zero
> pages if possible.

Since programs can get back free()d memory after a malloc(),
with the old contents of the memory intact, surely your
MAP_NOZERO behavior could be the default for programs that
can get away with it?

Maybe we could use some magic ELF header, similar to the
way non-executable stack is handled?

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is. Each group
calls the other unpatriotic.
[PATCH] ata: Add the SW NCQ support to sata_nv for MCP51/MCP55/MCP61
Add the Software NCQ support to sata_nv.c for MCP51/MCP55/MCP61 SATA controller. NCQ function is disable by default, you can enable it with 'swncq=1' Signed-off-by: Kuan Luo <[EMAIL PROTECTED]> Signed-off-by: Peer Chen <[EMAIL PROTECTED]> --- diff -Nurp a/sata_nv.c b/sata_nv.c --- a/sata_nv.c 2007-06-13 10:15:07.0 -0400 +++ b/sata_nv.c 2007-06-26 12:52:27.0 -0400 @@ -169,6 +169,35 @@ enum { NV_ADMA_PORT_REGISTER_MODE = (1 << 0), NV_ADMA_ATAPI_SETUP_COMPLETE= (1 << 1), + /* MCP55 reg offset */ + NV_CTL_MCP55= 0x400, + NV_INT_STATUS_MCP55 = 0x440, + NV_INT_ENABLE_MCP55 = 0x444, + NV_NCQ_REG_MCP55= 0x448, + + /* MCP55 */ + NV_INT_ALL_MCP55= 0x, + NV_INT_PORT_SHIFT_MCP55 = 16, /* each port occupies 16 bits */ + NV_INT_MASK_MCP55 = NV_INT_ALL_MCP55 & 0xfffd, + + /* SWNCQ ENABLE BITS*/ + NV_CTL_PRI_SWNCQ= 0x02, + NV_CTL_SEC_SWNCQ= 0x04, + + /* SW NCQ status bits*/ + NV_SWNCQ_IRQ_DEV= (1 << 0), + NV_SWNCQ_IRQ_PM = (1 << 1), + NV_SWNCQ_IRQ_ADDED = (1 << 2), + NV_SWNCQ_IRQ_REMOVED= (1 << 3), + + NV_SWNCQ_IRQ_BACKOUT= (1 << 4), + NV_SWNCQ_IRQ_SDBFIS = (1 << 5), + NV_SWNCQ_IRQ_DHREGFIS = (1 << 6), + NV_SWNCQ_IRQ_DMASETUP = (1 << 7), + + NV_SWNCQ_IRQ_HOTPLUG= NV_SWNCQ_IRQ_ADDED | + NV_SWNCQ_IRQ_REMOVED, + }; /* ADMA Physical Region Descriptor - one SG segment */ @@ -226,6 +255,35 @@ struct nv_host_priv { unsigned long type; }; +typedef struct { + u32 defer_bits; + u8 front; + u8 rear; + unsigned inttag[ATA_MAX_QUEUE + 1]; +}defer_queue_t; + +struct nv_swncq_port_priv { + struct ata_prd *prd;/* our SG list */ + dma_addr_t prd_dma; /* and its DMA mapping */ + void __iomem*sactive_block; + u32 qc_active; + unsigned intlast_issue_tag; + spinlock_t lock; + /* fifo loop queue to store deferral command */ + defer_queue_t defer_queue; + + /* for NCQ interrupt analysis */ + u32 dhfis_bits; + u32 dmafis_bits; + u32 sdbfis_bits; + + unsigned intncq_saw_d2h:1; + unsigned intncq_saw_dmas:1; + unsigned intncq_saw_sdb:1; + unsigned intncq_saw_backout:1; +}; + + #define 
NV_ADMA_CHECK_INTR(GCTL, PORT) ((GCTL) & ( 1 << (19 + (12 * (PORT) static int nv_init_one (struct pci_dev *pdev, const struct pci_device_id *ent); @@ -263,13 +321,28 @@ static void nv_adma_host_stop(struct ata static void nv_adma_post_internal_cmd(struct ata_queued_cmd *qc); static void nv_adma_tf_read(struct ata_port *ap, struct ata_taskfile *tf); +static void nv_mcp55_thaw(struct ata_port *ap); +static void nv_mcp55_freeze(struct ata_port *ap); +static void nv_swncq_error_handler(struct ata_port *ap); +static int nv_swncq_port_start(struct ata_port *ap); +static void nv_swncq_qc_prep(struct ata_queued_cmd *qc); +static void nv_swncq_fill_sg(struct ata_queued_cmd *qc); +static unsigned int nv_swncq_qc_issue(struct ata_queued_cmd *qc); +static void nv_swncq_irq_clear(struct ata_port *ap, u32 val); +static irqreturn_t nv_swncq_interrupt(int irq, void *dev_instance); +#ifdef CONFIG_PM +static int nv_swncq_port_suspend(struct ata_port *ap, pm_message_t mesg); +static int nv_swncq_port_resume(struct ata_port *ap); +#endif + enum nv_host_type { GENERIC, NFORCE2, NFORCE3 = NFORCE2, /* NF2 == NF3 as far as sata_nv is concerned */ CK804, - ADMA + ADMA, + SWNCQ }; static const struct pci_device_id nv_pci_tbl[] = { @@ -280,13 +353,13 @@ static const struct pci_device_id nv_pci { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_CK804_SATA2), CK804 }, { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP04_SATA), CK804 }, { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP04_SATA2), CK804 }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP51_SATA), GENERIC }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP51_SATA2), GENERIC }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP55_SATA), GENERIC }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP55_SATA2), GENERIC }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA), GENERIC }, - { PCI_VDEVICE(NVIDIA, PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA2), GENERIC }, - { PCI_VDEVICE(NVIDIA, 
PCI_DEVICE_ID_NVIDIA_NFORCE_MCP61_SATA3), GENERIC }, + { PCI_VDEVICE(NVIDIA,
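The defer_queue_t in the patch above (defer_bits, front, rear, and a tag array of ATA_MAX_QUEUE + 1 entries) looks like a circular FIFO of deferred command tags. The excerpt does not show the enqueue and dequeue code, so the following is only a minimal userspace sketch of how such a ring could work. The names dq_init, dq_push, dq_pop, MAX_QUEUE and RING_SLOTS are hypothetical, and the value 32 is an assumption standing in for ATA_MAX_QUEUE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_QUEUE  32                 /* assumption: stands in for ATA_MAX_QUEUE */
#define RING_SLOTS (MAX_QUEUE + 1)    /* one spare slot tells "full" from "empty" */

struct defer_queue {
	uint32_t defer_bits;              /* bit n set => tag n is waiting */
	uint8_t front;                    /* next slot to dequeue */
	uint8_t rear;                     /* next free slot */
	unsigned int tag[RING_SLOTS];
};

void dq_init(struct defer_queue *dq)
{
	assert(dq != NULL);
	dq->defer_bits = 0;
	dq->front = 0;
	dq->rear = 0;
}

int dq_empty(const struct defer_queue *dq)
{
	return dq->front == dq->rear;
}

/* Returns 0 on success, -1 if the tag is invalid, already queued, or the ring is full. */
int dq_push(struct defer_queue *dq, unsigned int tag)
{
	uint8_t next = (uint8_t)((dq->rear + 1) % RING_SLOTS);

	if (tag >= MAX_QUEUE || next == dq->front)
		return -1;
	if (dq->defer_bits & (1u << tag))
		return -1;
	dq->tag[dq->rear] = tag;
	dq->rear = next;
	dq->defer_bits |= 1u << tag;
	return 0;
}

/* Returns the oldest deferred tag, or -1 if nothing is queued. */
int dq_pop(struct defer_queue *dq)
{
	unsigned int tag;

	if (dq_empty(dq))
		return -1;
	tag = dq->tag[dq->front];
	dq->front = (uint8_t)((dq->front + 1) % RING_SLOTS);
	dq->defer_bits &= ~(1u << tag);
	return (int)tag;
}
```

In this sketch the bitmap gives an O(1) "is this tag already deferred" test, while the ring preserves issue order. The real driver would also need to hold the port lock around these operations.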
Re: [patch 1/3] MAP_NOZERO - implement a new VM_NOZERO/MAP_NOZERO page retirement policy
Davide Libenzi wrote:

> This is the core implementation of the new VM_NOZERO page retirement policy (and the associated MAP_NOZERO). A new field owner_uid is added to the mm_struct, and it is kept set to the effective UID of the task that owns the mm_struct. A new field owner_uid is also added to the page struct.

You will also need to take the task's SELinux security context into account. SUID programs should not be able to use this feature, either.

> When pages exit (are unmapped from) a vma, they are marked with the effective UID of the mm_struct that owns it.

> --- linux-2.6.mod.orig/include/linux/mm_types.h 2007-06-21 14:02:06.0 -0700
> +++ linux-2.6.mod/include/linux/mm_types.h 2007-06-25 19:11:22.0 -0700
> @@ -64,6 +64,7 @@
>  	struct list_head lru;		/* Pageout list, eg. active_list
>  					 * protected by zone->lru_lock !
>  					 */
> +	int owner_uid;			/* Last owner of the page */
>  	/*
>  	 * On machines where all RAM is mapped into kernel address space,
>  	 * we can simply calculate the virtual address. On machines with

Since this is only set when the page is freed, could the owner_uid and security context be put inside a union with some fields that are not otherwise used for free pages?

-- 
Politics is the struggle between those who want to make their country the best in the world, and those who believe it already is. Each group calls the other unpatriotic.
[PATCH] atkbd: cleanup only once
Hi,

If you press ctrl+alt+del several times while the kernel is booting (before user-level boot), the kernel will oops. I found that ps2_command is called more than once, and by then ps2dev->serio may be a NULL pointer. 2.6.22-rc5 and 2.6.22-rc6 show the same result.

Signed-off-by: Dave Young <[EMAIL PROTECTED]>
---
diff -upr linux/drivers/input/keyboard/atkbd.c linux.new/drivers/input/keyboard/atkbd.c
--- linux/drivers/input/keyboard/atkbd.c	2007-06-27 10:38:37.0 +
+++ linux.new/drivers/input/keyboard/atkbd.c	2007-06-27 10:37:39.0 +
@@ -795,6 +795,11 @@ static int atkbd_activate(struct atkbd *
 static void atkbd_cleanup(struct serio *serio)
 {
+	static int flag;
+
+	if(flag)
+		return;
+	flag = 1;
 	struct atkbd *atkbd = serio_get_drvdata(serio);
 	ps2_command(&atkbd->ps2dev, NULL, ATKBD_CMD_RESET_BAT);
 }

Regards
dave
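The guard in the patch above is a run-once flag. Below is a minimal userspace sketch of the same idea, not the driver code itself: cleanup_once and reset_count are hypothetical names, and the counter merely stands in for the keyboard reset command. The declaration is kept ahead of the first statement, which the kernel's C style expects.

```c
#include <assert.h>
#include <stdbool.h>

int reset_count;            /* hypothetical: counts how often the "reset" ran */

/* Returns true if the cleanup ran, false if it was skipped as a repeat. */
bool cleanup_once(void)
{
	static bool done;       /* zero-initialised, shared by every caller */

	if (done)
		return false;
	done = true;
	reset_count++;          /* stand-in for the one-time reset command */
	assert(reset_count == 1);
	return true;
}
```

One thing to note about this design: a function-local static is shared by every caller, so with more than one device only the first would ever be cleaned up, and the check-then-set is not safe against concurrent callers. A per-device flag protected by the device's own locking would likely be a safer variant.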
scheduling while atomic and DEBUG_SPINLOCK_SLEEP
Hello, I am sometimes getting the following "scheduling while atomic" dump: [42949427.37] scheduling while atomic: sh/0x0002/144 [42949427.38] [] (dump_stack+0x0/0x14) from [] (schedule+0x628/0x6c8) [42949427.39] [] (schedule+0x0/0x6c8) from [] (__down_read+0xc4/0x128) [42949427.40] [] (__down_read+0x0/0x128) from [] (do_page_fault+0x84/0x214) [42949427.40] r5 = 0017 r4 = C02F6168 [42949427.41] [] (do_page_fault+0x0/0x214) from [] (do_DataAbort+0x3c/0xa4) [42949427.42] [] (do_DataAbort+0x0/0xa4) from [] (__dabt_svc+0x40/0x60) [42949427.43] r8 = 0093 r7 = C0186920 r6 = E5903048 r5 = CF21FD14 [42949427.43] r4 = [42949427.44] [] (do_alignment_ldrstr+0x0/0x130) from [] (do_alignment+0x238/0x34c) [42949427.45] r4 = CF21E000 [42949427.45] [] (do_alignment+0x0/0x34c) from [] (do_DataAbort+0x3c/0xa4) [42949427.46] [] (do_DataAbort+0x0/0xa4) from [] (__dabt_svc+0x40/0x60) [42949427.47] r8 = CFD51F34 r7 = 8000 r6 = 0001 r5 = CF21FE58 [42949427.47] r4 = [42949427.48] [] (get_index+0x0/0x5c) from [] (prio_tree_insert+0xac/0x28c) [42949427.49] [] (prio_tree_insert+0x0/0x28c) from [] (vma_prio_tree_insert+0x28/0x40) [42949427.50] [] (vma_prio_tree_insert+0x0/0x40) from [] (vma_link+0xe0/0x1d4) [42949427.50] r5 = CFC4F90C r4 = CF21E000 [42949427.51] [] (vma_link+0x0/0x1d4) from [] (do_mmap_pgoff+0x390/0x760) [42949427.52] r7 = CFC374E0 r6 = 1000 r5 = 4005E000 r4 = CFC4F90C [42949427.52] [] (do_mmap_pgoff+0x0/0x760) from [] (old_mmap+0x108/0x130) [42949427.53] [] (old_mmap+0x0/0x130) from [] (ret_fast_syscall+0x0/0x2c) So, I think I need to try to figure out why the preempt_count is 2. I enabled CONFIG_DEBUG_SPINLOCK_SLEEP thinking that it would give me more information about this problem. I got two different hits with this turned on. 
The first dump is coming from Intel's ixp400eth driver: [42949391.91] Debug: sleeping function called from invalid context at include/asm/semaphore.h:69 [42949391.91] in_atomic():1, irqs_disabled():128 [42949391.91] [] (dump_stack+0x0/0x14) from [] (__might_sleep+0xe8/0x114) [42949391.91] [] (__might_sleep+0x0/0x114) from [] (ixOsalMutexLock+0x190/0x1d8 [ixp400eth]) [42949391.91] r5 = BF0A3D84 r4 = CFC2D2C0 [42949391.91] [] (ixOsalMutexLock+0x0/0x1d8 [ixp400eth]) from [] (ixEthAccPortMulticastAddressLeaveAll+0x38/0x60 [ixp400eth]) [42949391.91] r8 = FF9D r7 = BF09ED88 r6 = C43F5260 r5 = CFD36000 [42949391.91] r4 = [42949391.91] [] (ixEthAccPortMulticastAddressLeaveAll+0x0/0x60 [ixp400eth]) from [] (dev_set_multicast_list+0x68/0x214 [ixp400eth]) [42949391.91] r4 = C43F5000 [42949391.91] [] (dev_set_multicast_list+0x0/0x214 [ixp400eth]) from [] (__dev_mc_upload+0x3c/0x40) [42949391.91] r7 = r6 = 1002 r5 = r4 = CFD36000 [42949391.91] [] (__dev_mc_upload+0x0/0x40) from [] (dev_mc_upload+0x30/0x44) [42949391.91] [] (dev_mc_upload+0x0/0x44) from [] (dev_open+0x70/0xcc) [42949391.91] r4 = C43F5000 [42949391.91] [] (dev_open+0x0/0xcc) from [] (dev_change_flags+0x68/0x138) [42949391.91] r5 = 1043 r4 = C43F5000 [42949391.91] [] (dev_change_flags+0x0/0x138) from [] (devinet_ioctl+0x64c/0x72c) [42949391.91] r7 = CFA09760 r6 = CFD36000 r5 = BEFA8D2C r4 = CFB99D40 [42949391.91] [] (devinet_ioctl+0x0/0x72c) from [] (inet_ioctl+0x1b0/0x1d4) [42949391.91] [] (inet_ioctl+0x0/0x1d4) from [] (sock_ioctl+0x184/0x2f0) [42949391.91] [] (sock_ioctl+0x0/0x2f0) from [] (do_ioctl+0x84/0xa0) [42949391.91] r8 = C002AE44 r7 = BEFA8D2C r6 = 8914 r5 = FFE7 [42949391.91] r4 = C43F1800 [42949391.91] [] (do_ioctl+0x0/0xa0) from [] (vfs_ioctl+0x94/0x314) [42949391.91] r7 = r6 = BEFA8D2C r5 = 0003 r4 = C43F1800 [42949391.91] [] (vfs_ioctl+0x0/0x314) from [] (sys_ioctl+0x40/0x64) [42949391.91] r8 = C002AE44 r7 = 0036 r6 = 8914 r5 = FFF7 [42949391.91] r4 = C43F1800 [42949391.91] [] (sys_ioctl+0x0/0x64) 
from [] (ret_fast_syscall+0x0/0x2c) [42949391.91] r6 = r5 = BEFA8E1C r4 = BEFA8D2C And the other one is from one of our own kernel modules: [42949490.89] Debug: sleeping function called from invalid context at mm/slab.c:2729 [42949490.89] in_atomic():0, irqs_disabled():128 [42949490.89] [] (dump_stack+0x0/0x14) from [] (__might_sleep+0xe8/0x114) [42949490.89] [] (__might_sleep+0x0/0x114) from [] (kmem_cache_alloc+0x74/0x84) [42949490.89] r5 = 00D0 r4 = CFFFE0C0 [42949490.89] [] (kmem_cache_alloc+0x0/0x84) from [] (request_irq+0x80/0xdc) [42949490.89] r6 = r5 = 0007 r4 = [42949490.89] [] (request_irq+0x0/0xdc) from [] (VbusHookInterrupt+0x2c/0x68 [dstdrv]) [42949490.89] [] (VbusHookInterrupt+0x0/0x68 [dstdrv]) from [] (VbusRegisterISR+0xcc/0xfc [dstdrv]) [42949490.89]
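For the first report above, one way to read the 0x0002 in the "scheduling while atomic" line is to split preempt_count into its fields. The following is a minimal userspace sketch, assuming the 2.6-era layout of 8 preemption bits, 8 softirq bits and 12 hardirq bits; that layout is an assumption here and should be checked against include/linux/hardirq.h for the tree in use. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths; verify against the kernel tree being debugged. */
#define PREEMPT_BITS  8
#define SOFTIRQ_BITS  8
#define HARDIRQ_BITS 12

struct atomic_ctx {
	unsigned int preempt_depth;   /* nested preempt_disable()/spinlock sections */
	unsigned int softirq_count;   /* softirq nesting */
	unsigned int hardirq_count;   /* hardirq nesting */
};

struct atomic_ctx decode_preempt_count(uint32_t count)
{
	struct atomic_ctx c;

	assert(PREEMPT_BITS + SOFTIRQ_BITS + HARDIRQ_BITS <= 32);
	c.preempt_depth = count & ((1u << PREEMPT_BITS) - 1);
	c.softirq_count = (count >> PREEMPT_BITS) & ((1u << SOFTIRQ_BITS) - 1);
	c.hardirq_count = (count >> (PREEMPT_BITS + SOFTIRQ_BITS)) &
			  ((1u << HARDIRQ_BITS) - 1);
	return c;
}
```

Under these assumptions a value of 2 decodes to a preemption depth of 2 with no softirq or hardirq nesting, which would point at two nested preempt-disabling sections (for example held spinlocks) rather than interrupt context.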
Re: [PATCH RFC #2] hwrng: Add type categories
On Tue, 26 Jun 2007, Matt Mackall wrote: > On Tue, Jun 26, 2007 at 08:21:51PM +0200, Michael Buesch wrote: > > Don't use the word "quality", as people seem to think of > > the entropy quality when hearing that word. > > Why do I so often feel compelled to respond with "did you read what I > wrote?" on this list? > > I object to your MEANINGLESS CATEGORIES. > > > This uses the word "type", which is probably better for > > understanding what the value really means. > > Please explain: > > a) how is bad different from pseudo? > b) how is onboard different than dedicated? Actually, I think I understand the reason behind (b). If someone adds a dedicated crypto/RNG engine to the system, he likely wants to use that and not anything else that might also be around. (a) is just broken, unless one is to take it as "never use it". And I am really not sure about (b). It *is* better than just using whatever crap we found first (or last), but it is the wrong solution for a problem that we really should not have in the first place if someone had thought a bit before adding a misc device for something that has no reason to be unique in a system. Instead of papering over the problem with borked solutions, maybe we should just export ALL HRNGs to userspace. While at it, please add whatever is needed so that userspace can talk to the kernel driver to get vital information about the HRNG device the driver might have (the current interface is a bad simplistic hack). Let userspace get the data from whichever HRNG it wants, process it in any way it wants and pipe it back through /dev/random IOCTLs. And let it do it for as many HRNGs it wants at the same time. And if you must have /dev/hw_random point somewhere, let udev scripts or something else like that take care of it. -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." 
-- The Silicon Valley Tarot Henrique Holschuh
Re: [AppArmor 00/44] AppArmor security module overview
On Tue, 26 Jun 2007 19:24:03 -0700 John Johansen <[EMAIL PROTECTED]> wrote: > > > > so... where do we stand with this? Fundamental, irreconcilable > > differences over the use of pathname-based security? > > > There certainly seems to be some differences of opinion over the use > of pathname-based-security. I was refreshed to have not been cc'ed on a lkml thread for once. I guess it couldn't last. Do you agree with the "irreconcilable" part? I think I do. I suspect that we're at the stage of having to decide between a) set aside the technical issues and grudgingly merge this stuff as a service to Suse and to their users (both of which entities are very important to us) and leave it all as an object lesson in how-not-to-develop-kernel-features. Minimisation of the impact on the rest of the kernel is of course very important here. versus b) leave it out and require that Suse wear the permanent cost and quality impact of maintaining it out-of-tree. It will still be an object lesson in how-not-to-develop-kernel-features. Sigh. Please don't put us in this position again. Get stuff upstream before shipping it to customers, OK? It ain't rocket science. > > Are there any other sticking points? > > > > > The conditional passing of the vfsmnt mount in the vfs, as done in this > patch series, has received a NAK. This problem results from NFS passing > a NULL nameidata into the vfs. We have a second patch series that we > have posted for discussion that addresses this by splitting the nameidata > struct. > Message-Id: <[EMAIL PROTECTED]> > Subject: [RFD 0/4] AppArmor - Don't pass NULL nameidata to > vfs_create/lookup/permission IOPs > > other issues that have been raised are: > - AppArmor does not currently mediate IPC and network communications. > Mediation of these is a wip > - the use of d_path to generate the pathname used for mediation when a > file is opened. 
> - Generating the pathname using a reverse walk is considered ugly > - A buffer is alloced to store the generated path name. > - The buffer size has a configurable upper limit which will cause > opens to fail if the pathname length exceeds this limit. This > is a fail closed behavior. > - there have been some concerns expressed about the performance > of this approach > We are evaluating our options on how best to address this issue. OK, useful summary, thanks. I'd encourage you to proceed apace.
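The d_path concerns quoted above (a reverse walk, an allocated buffer, and a size limit that makes opens fail closed) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct node is a hypothetical stand-in for a dentry with a parent pointer, build_path is a made-up name, and none of this is AppArmor's or the VFS's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dentry: a name plus a link to its parent. */
struct node {
	const char *name;
	const struct node *parent;    /* NULL for the root */
};

/*
 * Walk from the leaf up to the root, writing components into the END of
 * buf and moving backwards. Returns a pointer to the start of the path
 * inside buf, or NULL if the path does not fit (fail closed).
 */
char *build_path(const struct node *leaf, char *buf, size_t buflen)
{
	char *p;

	assert(leaf != NULL && buf != NULL);
	if (buflen < 2)
		return NULL;
	p = buf + buflen - 1;
	*p = '\0';
	if (leaf->parent == NULL) {   /* the root itself */
		*--p = '/';
		return p;
	}
	for (; leaf->parent != NULL; leaf = leaf->parent) {
		size_t n = strlen(leaf->name);

		if ((size_t)(p - buf) < n + 1)
			return NULL;          /* too long: refuse rather than truncate */
		p -= n;
		memcpy(p, leaf->name, n);
		*--p = '/';
	}
	return p;
}
```

Filling the buffer from the end avoids a second pass to reverse the components. Returning NULL instead of a truncated name is what makes the limit fail closed: a caller that cannot name the file denies the open.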
[patch 3/3] MAP_NOZERO - wire sys_brk2() to the x86 family
Wires up sys_brk2() to the x86 family.

Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>

- Davide

---
 arch/i386/kernel/syscall_table.S |    1 +
 arch/x86_64/ia32/ia32entry.S     |    1 +
 include/asm-i386/unistd.h        |    3 ++-
 include/asm-x86_64/unistd.h      |    2 ++
 4 files changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6.mod/arch/i386/kernel/syscall_table.S
===
--- linux-2.6.mod.orig/arch/i386/kernel/syscall_table.S	2007-06-25 19:14:46.0 -0700
+++ linux-2.6.mod/arch/i386/kernel/syscall_table.S	2007-06-26 18:08:30.0 -0700
@@ -323,3 +323,4 @@
 	.long sys_signalfd
 	.long sys_timerfd
 	.long sys_eventfd
+	.long sys_brk2
Index: linux-2.6.mod/arch/x86_64/ia32/ia32entry.S
===
--- linux-2.6.mod.orig/arch/x86_64/ia32/ia32entry.S	2007-06-25 19:14:46.0 -0700
+++ linux-2.6.mod/arch/x86_64/ia32/ia32entry.S	2007-06-26 18:08:30.0 -0700
@@ -719,4 +719,5 @@
 	.quad compat_sys_signalfd
 	.quad compat_sys_timerfd
 	.quad sys_eventfd
+	.quad sys_brk2
 ia32_syscall_end:
Index: linux-2.6.mod/include/asm-i386/unistd.h
===
--- linux-2.6.mod.orig/include/asm-i386/unistd.h	2007-06-25 19:14:46.0 -0700
+++ linux-2.6.mod/include/asm-i386/unistd.h	2007-06-26 18:08:30.0 -0700
@@ -329,10 +329,11 @@
 #define __NR_signalfd 321
 #define __NR_timerfd 322
 #define __NR_eventfd 323
+#define __NR_brk2 324
 #ifdef __KERNEL__
-#define NR_syscalls 324
+#define NR_syscalls 325
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
Index: linux-2.6.mod/include/asm-x86_64/unistd.h
===
--- linux-2.6.mod.orig/include/asm-x86_64/unistd.h	2007-06-25 19:14:46.0 -0700
+++ linux-2.6.mod/include/asm-x86_64/unistd.h	2007-06-26 18:08:30.0 -0700
@@ -630,6 +630,8 @@
 __SYSCALL(__NR_timerfd, sys_timerfd)
 #define __NR_eventfd 284
 __SYSCALL(__NR_eventfd, sys_eventfd)
+#define __NR_brk2 285
+__SYSCALL(__NR_brk2, sys_brk2)
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
[patch 1/3] MAP_NOZERO - implement a new VM_NOZERO/MAP_NOZERO page retirement policy
This is the core implementation of the new VM_NOZERO page retirement policy (and the associated MAP_NOZERO). A new field owner_uid is added to the mm_struct, and it is kept set to the effective UID of the task that owns the mm_struct. A new field owner_uid is also added to the page struct. When pages exit (are unmapped from) a vma, they are marked with the effective UID of the mm_struct that owns it.

When pages exit the allocator, their owner_uid is cleared, unless the new flag __GFP_UIDKEEP is passed to it. So every page fetcher other than the new alloc_zeroed_page_vma() clears the owner_uid and blocks all subsequent uses of the uncleared page itself. The new alloc_zeroed_page_vma() calls __alloc_pages() with the __GFP_UIDKEEP flag, and checks whether the VM_NOZERO flag is set in the vma and whether the owner_uid field of the page matches that of the mm_struct owning the vma. If either of these tests fails, the page is cleared in the usual way; otherwise it is passed back without being cleared.

Page-cache pages are (once unmapped) marked with the uid owning the inode of the mapping the pages are associated with.
Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]> - Davide --- include/asm-alpha/page.h |3 ++- include/asm-cris/page.h |3 ++- include/asm-generic/mman.h |1 + include/asm-h8300/page.h |3 ++- include/asm-i386/page.h |3 ++- include/asm-ia64/page.h |2 +- include/asm-m32r/page.h |3 ++- include/asm-m68knommu/page.h |3 ++- include/asm-s390/page.h |3 ++- include/asm-x86_64/page.h|3 ++- include/linux/gfp.h |5 + include/linux/highmem.h |7 +-- include/linux/mm.h | 16 include/linux/mm_types.h |1 + include/linux/mman.h |3 ++- include/linux/rmap.h |1 + include/linux/sched.h|3 +++ kernel/fork.c|1 + kernel/sys.c |3 +++ mm/filemap.c |2 ++ mm/mmap.c|3 ++- mm/page_alloc.c | 33 + mm/rmap.c| 14 ++ 23 files changed, 102 insertions(+), 17 deletions(-) Index: linux-2.6.mod/include/linux/sched.h === --- linux-2.6.mod.orig/include/linux/sched.h2007-06-21 13:59:38.0 -0700 +++ linux-2.6.mod/include/linux/sched.h 2007-06-21 14:01:28.0 -0700 @@ -386,6 +386,9 @@ /* aio bits */ rwlock_tioctx_list_lock; struct kioctx *ioctx_list; + + /* Effective UID of the owner of this mm_struct */ + uid_t owner_uid; }; struct sighand_struct { Index: linux-2.6.mod/mm/rmap.c === --- linux-2.6.mod.orig/mm/rmap.c2007-06-21 14:27:19.0 -0700 +++ linux-2.6.mod/mm/rmap.c 2007-06-25 17:42:59.0 -0700 @@ -627,6 +627,16 @@ } #endif +void page_set_owner(struct page *page, uid_t owner_uid) +{ + if (unlikely(PageCompound(page))) { + unsigned int nrpages = 1U << compound_order(page); + for (; nrpages; nrpages--, page++) + page_set_owner_uid(page, owner_uid); + } else + page_set_owner_uid(page, owner_uid); +} + /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from @@ -649,6 +659,10 @@ print_symbol (KERN_EMERG " vma->vm_file->f_op->mmap = %s\n", (unsigned long)vma->vm_file->f_op->mmap); BUG(); } + /* +* Record the last owner of the page. 
+*/ + page_set_owner(page, vma->vm_mm->owner_uid); /* * It would be tidy to reset the PageAnon mapping here, Index: linux-2.6.mod/kernel/fork.c === --- linux-2.6.mod.orig/kernel/fork.c2007-06-21 14:32:44.0 -0700 +++ linux-2.6.mod/kernel/fork.c 2007-06-24 21:23:52.0 -0700 @@ -342,6 +342,7 @@ mm->ioctx_list = NULL; mm->free_area_cache = TASK_UNMAPPED_BASE; mm->cached_hole_size = ~0UL; + mm->owner_uid = current->euid; if (likely(!mm_alloc_pgd(mm))) { mm->def_flags = 0; Index: linux-2.6.mod/include/linux/highmem.h === --- linux-2.6.mod.orig/include/linux/highmem.h 2007-06-21 14:38:02.0 -0700 +++ linux-2.6.mod/include/linux/highmem.h 2007-06-22 12:10:36.0 -0700 @@ -76,12 +76,7 @@ static inline struct page * alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr) { - struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr); - - if (page) -
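The zeroing decision described in the patch 1/3 changelog can be modelled in a few lines of userspace C. This is a sketch of the stated policy only, not of the kernel code (which is truncated above): struct page_model, retire_page, hand_out_page, NO_OWNER and the VM_NOZERO_SKETCH bit value are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_BYTES        4096
#define NO_OWNER          (-1)    /* hypothetical "owner cleared" marker */
#define VM_NOZERO_SKETCH  0x1u    /* hypothetical bit value for this model */

struct page_model {
	int owner_uid;                    /* last owner, set when the page is unmapped */
	unsigned char data[PAGE_BYTES];
};

/* Page leaves a vma: remember the effective UID of the owning mm. */
void retire_page(struct page_model *pg, int mm_owner_uid)
{
	assert(pg != NULL);
	pg->owner_uid = mm_owner_uid;
}

/*
 * Page is handed to a new mapping. It is left uncleared only when the vma
 * asked for VM_NOZERO AND the previous owner matches the new owner.
 * Returns true if the page was zeroed.
 */
bool hand_out_page(struct page_model *pg, unsigned int vm_flags, int mm_owner_uid)
{
	bool keep;

	assert(pg != NULL);
	keep = (vm_flags & VM_NOZERO_SKETCH) != 0 &&
	       pg->owner_uid != NO_OWNER &&
	       pg->owner_uid == mm_owner_uid;
	if (!keep)
		memset(pg->data, 0, sizeof(pg->data));
	return !keep;
}
```

The model makes the default explicit: any mismatch, or a missing flag, falls back to clearing, so stale data can only reach a mapping whose owner already had access to it.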
[patch 2/3] MAP_NOZERO - implement sys_brk2()
The following patch implements the sys_brk2() syscall, which is nothing other than sys_brk() with an extra "flags" parameter. This can be used to pass the new MAP_NOZERO bit, to ask the kernel to hand over non-zeroed pages if possible.

Signed-off-by: Davide Libenzi <[EMAIL PROTECTED]>

- Davide

---
 include/linux/mm.h       |    3 ++-
 include/linux/syscalls.h |    1 +
 mm/mmap.c                |   22 ++
 3 files changed, 21 insertions(+), 5 deletions(-)

Index: linux-2.6.mod/include/linux/mm.h
===
--- linux-2.6.mod.orig/include/linux/mm.h	2007-06-25 19:27:42.0 -0700
+++ linux-2.6.mod/include/linux/mm.h	2007-06-26 18:08:28.0 -0700
@@ -1099,7 +1099,8 @@
 }
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
-
+extern unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+				  unsigned long vmflags);
 extern unsigned long do_brk(unsigned long, unsigned long);
 /* filemap.c */
Index: linux-2.6.mod/include/linux/syscalls.h
===
--- linux-2.6.mod.orig/include/linux/syscalls.h	2007-06-25 19:14:49.0 -0700
+++ linux-2.6.mod/include/linux/syscalls.h	2007-06-26 18:08:28.0 -0700
@@ -263,6 +263,7 @@
 asmlinkage long sys_fremovexattr(int fd, char __user *name);
 asmlinkage unsigned long sys_brk(unsigned long brk);
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags);
 asmlinkage long sys_mprotect(unsigned long start, size_t len,
 				unsigned long prot);
 asmlinkage unsigned long sys_mremap(unsigned long addr,
Index: linux-2.6.mod/mm/mmap.c
===
--- linux-2.6.mod.orig/mm/mmap.c	2007-06-25 19:14:49.0 -0700
+++ linux-2.6.mod/mm/mmap.c	2007-06-26 18:08:28.0 -0700
@@ -35,6 +35,8 @@
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
+#define BRK_ALLOWED_FLAGS	(VM_NOZERO)
+
 static void unmap_region(struct mm_struct *mm,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
 		unsigned long start, unsigned long end);
@@ -234,7 +236,7 @@
 	return next;
 }
-asmlinkage unsigned long sys_brk(unsigned long brk)
+asmlinkage unsigned long sys_brk2(unsigned long brk, unsigned long flags)
 {
 	unsigned long rlim, retval;
 	unsigned long newbrk, oldbrk;
@@ -271,8 +273,10 @@
 	if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
 		goto out;
+	flags = BRK_ALLOWED_FLAGS & calc_vm_flag_bits(flags);
+
 	/* Ok, looks good - let it rip. */
-	if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
+	if (do_brk_flags(oldbrk, newbrk-oldbrk, flags) != oldbrk)
 		goto out;
 set_brk:
 	mm->brk = brk;
@@ -282,6 +286,11 @@
 	return retval;
 }
+asmlinkage unsigned long sys_brk(unsigned long brk)
+{
+	return sys_brk2(brk, 0);
+}
+
 #ifdef DEBUG_MM_RB
 static int browse_rb(struct rb_root *root)
 {
@@ -1863,7 +1872,8 @@
  * anonymous maps. eventually we may be able to do some
  * brk-specific accounting here.
  */
-unsigned long do_brk(unsigned long addr, unsigned long len)
+unsigned long do_brk_flags(unsigned long addr, unsigned long len,
+			   unsigned long vmflags)
 {
 	struct mm_struct * mm = current->mm;
 	struct vm_area_struct * vma, * prev;
@@ -1882,7 +1892,7 @@
 	if (is_hugepage_only_range(mm, addr, len))
 		return -EINVAL;
-	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
+	flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags | vmflags;
 	error = arch_mmap_check(addr, len, flags);
 	if (error)
@@ -1959,6 +1969,10 @@
 	return addr;
 }
+unsigned long do_brk(unsigned long addr, unsigned long len)
+{
+	return do_brk_flags(addr, len, 0);
+}
 EXPORT_SYMBOL(do_brk);
 /* Release all mmaps. */
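The flag handling in the patch above translates the caller's flags to vma bits and then masks them with an allow-list, while the old entry point becomes a thin wrapper that passes 0. A minimal userspace sketch of that pattern follows. All bit values and names here (the *_SKETCH macros, to_vm_bits, brk2_vm_flags, brk_vm_flags) are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define MAP_NOZERO_SKETCH 0x04ul      /* request bits as passed by userspace */
#define MAP_LOCKED_SKETCH 0x08ul
#define VM_NOZERO_SKETCH  0x01ul      /* corresponding vma bits */
#define VM_LOCKED_SKETCH  0x02ul
#define BRK_ALLOWED       (VM_NOZERO_SKETCH)

/* Stand-in for calc_vm_flag_bits(): translate request bits to vma bits. */
unsigned long to_vm_bits(unsigned long map_flags)
{
	unsigned long vm = 0;

	if (map_flags & MAP_NOZERO_SKETCH)
		vm |= VM_NOZERO_SKETCH;
	if (map_flags & MAP_LOCKED_SKETCH)
		vm |= VM_LOCKED_SKETCH;
	return vm;
}

/* New entry point: only allow-listed bits survive the mask. */
unsigned long brk2_vm_flags(unsigned long map_flags)
{
	unsigned long vm = BRK_ALLOWED & to_vm_bits(map_flags);

	assert((vm & ~BRK_ALLOWED) == 0);
	return vm;
}

/* Old entry point: same path, no extra flags. */
unsigned long brk_vm_flags(void)
{
	return brk2_vm_flags(0);
}
```

Note that masking silently drops flags that are not allowed rather than reporting an error, so in this pattern a caller cannot tell whether an unsupported flag was honoured.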
Re: [RFC] hwbkpt: Hardware breakpoints (was Kwatch)
On Tue, 26 Jun 2007, Roland McGrath wrote: > I needed the attached patch on top of the bptest patch for the current > code. Btw, that is a very nice little tester! I had already made some of those changes (the ones needed to make bptest build with the new hw_breakpoint code). I'll add in the others. > Below that is a patch to go on top of your current patch, with x86-64 > support. I've only tried a few trivial tests with bptest (including an > 8-byte bp), which worked great. It is a pretty faithful copy of your i386 > changes. I'm still not sure we have all that right, but you might as well > incorporate this into your patch. You should change the x86_64 code in > parallel with any i386 changes we decide on later, and I can test it and > send you any typo fixups or whatnot. Right. I may update a few comments... Alan Stern
[patch 0/3] MAP_NOZERO - VM_NOZERO/MAP_NOZERO early summer madness
I was using oprofile to sample some userspace code I am working on, and I was continuously noticing clear_page in the top three entries of the oprofile logs. Also, a simple kernel build, on my dual Opteron with 8GB of RAM, shows clear_page as the first kernel entry, second only to the userspace cc1 and as.

Most userspace code uses malloc() (and anonymous mappings) in such a way that the memory returned via kernel->glibc is written soon after. The POSIX malloc() definition also does not require the returned memory to be zeroed (unlike calloc(), which does). So I implemented a rather quick hack that introduces a new mmap() flag MAP_NOZERO (only valid for anonymous mappings) and the vma counterpart VM_NOZERO. Also, a new sys_brk2() has been introduced to accept a new flags parameter. A brief description of the patches follows in the next emails.

I first hacked Val's ebizzy to accept a new '-N' flag to make use of MAP_NOZERO:

http://infohost.nmt.edu/~val/patches/ebizzy.tar.gz
http://www.xmailserver.org/ebizzy-nzmmap-0.2.diff

On my box, ebizzy performance improved by between 10% and 15%. The userspace code I am working on (which uses malloc() quite heavily) saw a performance jump of around 14%. In both cases, clear_page dropped way down in the oprofile logs.

I then coded quick (and rather ugly) hacks for glibc and gcc to make them use the new features (MAP_NOZERO and sys_brk2()):

http://www.xmailserver.org/glibc-nzmalloc-tweaks
http://www.xmailserver.org/gcc-nozero-hack

I then tried a 2.6.22-rc5 kernel build using the newly built glibc and gcc (with and without the no-zero enabling options/env-vars). When using the no-zero mode, clear_page went way down in the oprofile logs and build time dropped by about 2.5% to 3%. I did not have the time (or the will) to tweak as and ld as well.

These are some test utilities to verify the no-zero behaviour of MAP_NOZERO (and sys_brk2()):

http://www.xmailserver.org/nzmmap-test.c
http://www.xmailserver.org/nzmalloc-test.c
http://www.xmailserver.org/smiffy.c

To run nzmalloc-test you need a patched glibc (using glibc-nzmalloc-tweaks). The smiffy one should be run under a user that has no other processes running and that owns no files on the system; it verifies that all the pages it gets from the kernel are zeroed (otherwise "Houston, we have a problem ..."). It has been running on my system without barfing for more than two days.

How crazy is that?

- Davide
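The kind of check the smiffy utility is described as doing (confirming that pages obtained from the kernel are zeroed) can be sketched with plain anonymous mmap(). This is not the author's smiffy.c, whose contents are not shown here. It is a minimal sketch using only the standard mmap()/munmap() calls, and anon_pages_are_zero is a hypothetical name.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1             /* for MAP_ANONYMOUS on strict compilers */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of fresh anonymous memory and scan it.
 * Returns 1 if every byte is zero, 0 if a non-zero byte was found,
 * and -1 if the mapping itself failed.
 */
int anon_pages_are_zero(size_t len)
{
	unsigned char *p;
	size_t i;
	int all_zero = 1;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			all_zero = 0;
			break;
		}
	}
	munmap(p, len);
	return all_zero;
}
```

On an unpatched kernel a plain anonymous mapping like this one should always scan as zero. With the proposed flag, a process that does not pass MAP_NOZERO would be expected to see the same result, which is the property a long-running checker would be watching for.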
Re: [PATCH RFC #2] hwrng: Add type categories
On Tue, Jun 26, 2007 at 05:45:17PM -0500, Matt Mackall wrote: > On Tue, Jun 26, 2007 at 08:21:51PM +0200, Michael Buesch wrote: > > Don't use the word "quality", as people seem to think of > > the entropy quality when hearing that word. > > Why do I so often feel compelled to respond with "did you read what I > wrote?" on this list? Ahh, I see you did respond to what I wrote earlier. Missed it due to travelling earlier today. Will respond to the earlier thread. -- Mathematics is the supreme nostalgia of our time.
Re: NDAs - ANY KNOWN RULES?
> Thanks for your explanations, > > but I know for sure it doesn't work. Then... do you have an actual question, or are you just trying to troll? And yes, there have been several such trolls lately on this list, and so far your postings have all the signs of being just another one. DO NOT FEED THE TROLLS
Re: [AppArmor 00/44] AppArmor security module overview
On Tue, Jun 26, 2007 at 04:52:02PM -0700, Andrew Morton wrote:
> On Tue, 26 Jun 2007 16:07:56 -0700
> [EMAIL PROTECTED] wrote:
>
> > This post contains patches to include the AppArmor application security
> > framework, with request for inclusion into -mm for wider testing.
>
> Patches 24 and 31 didn't come through.
>
Yes, sorry about that. I had a very odd authentication failure with those two mails and missed it. They have been resent.

> so... where do we stand with this? Fundamental, irreconcilable
> differences over the use of pathname-based security?
>
There certainly seem to be some differences of opinion over the use of pathname-based security.

> Are there any other sticking points?
>
The conditional passing of the vfsmnt mount in the vfs, as done in this patch series, has received a NAK. This problem results from NFS passing a NULL nameidata into the vfs. We have a second patch series that we have posted for discussion that addresses this by splitting the nameidata struct.

  Message-Id: <[EMAIL PROTECTED]>
  Subject: [RFD 0/4] AppArmor - Don't pass NULL nameidata to vfs_create/lookup/permission IOPs

Other issues that have been raised are:
- AppArmor does not currently mediate IPC and network communications. Mediation of these is a work in progress.
- The use of d_path to generate the pathname used for mediation when a file is opened.
  - Generating the pathname using a reverse walk is considered ugly.
  - A buffer is allocated to store the generated path name.
  - The buffer size has a configurable upper limit, which will cause opens to fail if the pathname length exceeds this limit. This is a fail-closed behavior.
  - There have been some concerns expressed about the performance of this approach.
  We are evaluating our options on how best to address this issue.
Re: [PATCH] hw_random: add quality categories
On Tue, 26 Jun 2007, Michael Buesch wrote: > On Tuesday 26 June 2007 16:06:25 Henrique de Moraes Holschuh wrote: > > Which, AFAIK, we can quantify as the minimum expected entropy in the output. > > The category is _not_ a measure of the entropy in the output. > It is _just_ to get the chance to get a sane _default_ policy > for which RNG is enabled by default, in the kernel. > It's just about a default policy. _Nothing_ else. Then why don't you call it "preference", or something to that effect? -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh
[md-accel PATCH 18/19] iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver
Adds the platform device definitions and the architecture specific support routines (i.e. register initialization and descriptor formats) for the iop-adma driver. Changelog: * add support for > 1k zero sum buffer sizes * added dma/aau platform devices to iq80321 and iq80332 setup * fixed the calculation in iop_desc_is_aligned * support xor buffer sizes larger than 16MB * fix places where software descriptors are assumed to be contiguous, only hardware descriptors are contiguous for up to a PAGE_SIZE buffer size * convert to async_tx * add interrupt support * add platform devices for 80219 boards * do not call platform register macros in driver code * remove switch() statements for compatible register offsets/layouts * change over to bitmap based capabilities * remove unnecessary ARM assembly statement * checkpatch.pl fixes * gpl v2 only correction Cc: Russell King <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- arch/arm/mach-iop32x/glantank.c|2 arch/arm/mach-iop32x/iq31244.c |5 arch/arm/mach-iop32x/iq80321.c |3 arch/arm/mach-iop32x/n2100.c |2 arch/arm/mach-iop33x/iq80331.c |3 arch/arm/mach-iop33x/iq80332.c |3 arch/arm/plat-iop/Makefile |2 arch/arm/plat-iop/adma.c | 209 include/asm-arm/arch-iop32x/adma.h |5 include/asm-arm/arch-iop33x/adma.h |5 include/asm-arm/hardware/iop3xx-adma.h | 891 include/asm-arm/hardware/iop3xx.h | 68 -- 12 files changed, 1138 insertions(+), 60 deletions(-) diff --git a/arch/arm/mach-iop32x/glantank.c b/arch/arm/mach-iop32x/glantank.c index 5776fd8..2b086ab 100644 --- a/arch/arm/mach-iop32x/glantank.c +++ b/arch/arm/mach-iop32x/glantank.c @@ -180,6 +180,8 @@ static void __init glantank_init_machine(void) platform_device_register(_i2c1_device); platform_device_register(_flash_device); platform_device_register(_serial_device); + platform_device_register(_dma_0_channel); + platform_device_register(_dma_1_channel); pm_power_off = glantank_power_off; } diff --git a/arch/arm/mach-iop32x/iq31244.c 
b/arch/arm/mach-iop32x/iq31244.c index d4eefbe..98cfa1c 100644 --- a/arch/arm/mach-iop32x/iq31244.c +++ b/arch/arm/mach-iop32x/iq31244.c @@ -298,9 +298,14 @@ static void __init iq31244_init_machine(void) platform_device_register(_i2c1_device); platform_device_register(_flash_device); platform_device_register(_serial_device); + platform_device_register(_dma_0_channel); + platform_device_register(_dma_1_channel); if (is_ep80219()) pm_power_off = ep80219_power_off; + + if (!is_80219()) + platform_device_register(_aau_channel); } static int __init force_ep80219_setup(char *str) diff --git a/arch/arm/mach-iop32x/iq80321.c b/arch/arm/mach-iop32x/iq80321.c index 8d9f491..18ad29f 100644 --- a/arch/arm/mach-iop32x/iq80321.c +++ b/arch/arm/mach-iop32x/iq80321.c @@ -181,6 +181,9 @@ static void __init iq80321_init_machine(void) platform_device_register(_i2c1_device); platform_device_register(_flash_device); platform_device_register(_serial_device); + platform_device_register(_dma_0_channel); + platform_device_register(_dma_1_channel); + platform_device_register(_aau_channel); } MACHINE_START(IQ80321, "Intel IQ80321") diff --git a/arch/arm/mach-iop32x/n2100.c b/arch/arm/mach-iop32x/n2100.c index d55005d..390a97d 100644 --- a/arch/arm/mach-iop32x/n2100.c +++ b/arch/arm/mach-iop32x/n2100.c @@ -245,6 +245,8 @@ static void __init n2100_init_machine(void) platform_device_register(_i2c0_device); platform_device_register(_flash_device); platform_device_register(_serial_device); + platform_device_register(_dma_0_channel); + platform_device_register(_dma_1_channel); pm_power_off = n2100_power_off; diff --git a/arch/arm/mach-iop33x/iq80331.c b/arch/arm/mach-iop33x/iq80331.c index 2b06318..433188e 100644 --- a/arch/arm/mach-iop33x/iq80331.c +++ b/arch/arm/mach-iop33x/iq80331.c @@ -136,6 +136,9 @@ static void __init iq80331_init_machine(void) platform_device_register(_uart0_device); platform_device_register(_uart1_device); platform_device_register(_flash_device); + 
platform_device_register(_dma_0_channel); + platform_device_register(_dma_1_channel); + platform_device_register(_aau_channel); } MACHINE_START(IQ80331, "Intel IQ80331") diff --git a/arch/arm/mach-iop33x/iq80332.c b/arch/arm/mach-iop33x/iq80332.c index 7889ce3..416c095 100644 --- a/arch/arm/mach-iop33x/iq80332.c +++ b/arch/arm/mach-iop33x/iq80332.c @@ -136,6 +136,9 @@ static void __init iq80332_init_machine(void) platform_device_register(_uart0_device); platform_device_register(_uart1_device); platform_device_register(_flash_device); +
[md-accel PATCH 19/19] ARM: Add drivers/dma to arch/arm/Kconfig
Cc: Russell King <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- arch/arm/Kconfig |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 50d9f3e..0cb2d4f 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1034,6 +1034,8 @@ source "drivers/mmc/Kconfig" source "drivers/rtc/Kconfig" +source "drivers/dma/Kconfig" + endmenu source "fs/Kconfig"
[md-accel PATCH 17/19] iop13xx: surface the iop13xx adma units to the iop-adma driver
Adds the platform device definitions and the architecture specific support routines (i.e. register initialization and descriptor formats) for the iop-adma driver. Changelog: * added 'descriptor pool size' to the platform data * add base support for buffer sizes larger than 16MB (hw max) * build error fix from Kirill A. Shutemov * rebase for async_tx changes * add interrupt support * do not call platform register macros in driver code * remove unnecessary ARM assembly statement * checkpatch.pl fixes * gpl v2 only correction Cc: Russell King <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- arch/arm/mach-iop13xx/setup.c | 217 + include/asm-arm/arch-iop13xx/adma.h| 544 include/asm-arm/arch-iop13xx/iop13xx.h | 38 +- 3 files changed, 774 insertions(+), 25 deletions(-) diff --git a/arch/arm/mach-iop13xx/setup.c b/arch/arm/mach-iop13xx/setup.c index bc48715..bfe0c87 100644 --- a/arch/arm/mach-iop13xx/setup.c +++ b/arch/arm/mach-iop13xx/setup.c @@ -25,6 +25,7 @@ #include #include #include +#include #define IOP13XX_UART_XTAL 4000 #define IOP13XX_SETUP_DEBUG 0 @@ -236,19 +237,143 @@ static unsigned long iq8134x_probe_flash_size(void) } #endif +/* ADMA Channels */ +static struct resource iop13xx_adma_0_resources[] = { + [0] = { + .start = IOP13XX_ADMA_PHYS_BASE(0), + .end = IOP13XX_ADMA_UPPER_PA(0), + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = IRQ_IOP13XX_ADMA0_EOT, + .end = IRQ_IOP13XX_ADMA0_EOT, + .flags = IORESOURCE_IRQ + }, + [2] = { + .start = IRQ_IOP13XX_ADMA0_EOC, + .end = IRQ_IOP13XX_ADMA0_EOC, + .flags = IORESOURCE_IRQ + }, + [3] = { + .start = IRQ_IOP13XX_ADMA0_ERR, + .end = IRQ_IOP13XX_ADMA0_ERR, + .flags = IORESOURCE_IRQ + } +}; + +static struct resource iop13xx_adma_1_resources[] = { + [0] = { + .start = IOP13XX_ADMA_PHYS_BASE(1), + .end = IOP13XX_ADMA_UPPER_PA(1), + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = IRQ_IOP13XX_ADMA1_EOT, + .end = IRQ_IOP13XX_ADMA1_EOT, + .flags = IORESOURCE_IRQ + }, + [2] = { + .start = 
IRQ_IOP13XX_ADMA1_EOC, + .end = IRQ_IOP13XX_ADMA1_EOC, + .flags = IORESOURCE_IRQ + }, + [3] = { + .start = IRQ_IOP13XX_ADMA1_ERR, + .end = IRQ_IOP13XX_ADMA1_ERR, + .flags = IORESOURCE_IRQ + } +}; + +static struct resource iop13xx_adma_2_resources[] = { + [0] = { + .start = IOP13XX_ADMA_PHYS_BASE(2), + .end = IOP13XX_ADMA_UPPER_PA(2), + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = IRQ_IOP13XX_ADMA2_EOT, + .end = IRQ_IOP13XX_ADMA2_EOT, + .flags = IORESOURCE_IRQ + }, + [2] = { + .start = IRQ_IOP13XX_ADMA2_EOC, + .end = IRQ_IOP13XX_ADMA2_EOC, + .flags = IORESOURCE_IRQ + }, + [3] = { + .start = IRQ_IOP13XX_ADMA2_ERR, + .end = IRQ_IOP13XX_ADMA2_ERR, + .flags = IORESOURCE_IRQ + } +}; + +static u64 iop13xx_adma_dmamask = DMA_64BIT_MASK; +static struct iop_adma_platform_data iop13xx_adma_0_data = { + .hw_id = 0, + .pool_size = PAGE_SIZE, +}; + +static struct iop_adma_platform_data iop13xx_adma_1_data = { + .hw_id = 1, + .pool_size = PAGE_SIZE, +}; + +static struct iop_adma_platform_data iop13xx_adma_2_data = { + .hw_id = 2, + .pool_size = PAGE_SIZE, +}; + +/* The ids are fixed up later in iop13xx_platform_init */ +static struct platform_device iop13xx_adma_0_channel = { + .name = "iop-adma", + .id = 0, + .num_resources = 4, + .resource = iop13xx_adma_0_resources, + .dev = { + .dma_mask = _adma_dmamask, + .coherent_dma_mask = DMA_64BIT_MASK, + .platform_data = (void *) _adma_0_data, + }, +}; + +static struct platform_device iop13xx_adma_1_channel = { + .name = "iop-adma", + .id = 0, + .num_resources = 4, + .resource = iop13xx_adma_1_resources, + .dev = { + .dma_mask = _adma_dmamask, + .coherent_dma_mask = DMA_64BIT_MASK, + .platform_data = (void *) _adma_1_data, + }, +}; + +static struct platform_device iop13xx_adma_2_channel = { + .name = "iop-adma", + .id = 0, + .num_resources = 4, + .resource = iop13xx_adma_2_resources, + .dev = { + .dma_mask = _adma_dmamask, + .coherent_dma_mask = DMA_64BIT_MASK, + .platform_data = (void *) _adma_2_data, + }, +}; + void __init
[md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
The Intel(R) IOP series of i/o processors integrates an Xscale core with raid acceleration engines. The capabilities per platform are: iop219: (2) copy engines iop321: (2) copy engines (1) xor and block fill engine iop33x: (2) copy and crc32c engines (1) xor, xor zero sum, pq, pq zero sum, and block fill engine iop13xx: (2) copy, crc32c, xor, xor zero sum, and block fill engines (1) copy, crc32c, xor, xor zero sum, pq, pq zero sum, and block fill engine The driver supports the features of the async_tx api: * asynchronous notification of operation completion * implicit (interrupt triggered) handling of inter-channel transaction dependencies The driver adapts to the platform it is running on by two methods. 1/ #include which defines the hardware specific iop_chan_* and iop_desc_* routines as a series of static inline functions 2/ The private platform data attached to the platform_device defines the capabilities of the channels 20070626: Callbacks are run in a tasklet. Given the recent discussion on LKML about killing tasklets in favor of workqueues I did a quick conversion of the driver. Raid5 resync performance dropped from 50MB/s to 30MB/s, so the tasklet implementation remains until a generic softirq interface is available. Changelog: * fixed a slot allocation bug in do_iop13xx_adma_xor that caused too few slots to be requested eventually leading to data corruption * enabled the slot allocation routine to attempt to free slots before returning -ENOMEM * switched the cleanup routine to solely use the software chain and the status register to determine if a descriptor is complete.
This is necessary to support other IOP engines that do not have status writeback capability * make the driver iop generic * modified the allocation routines to understand allocating a group of slots for a single operation * added a null xor initialization operation for the xor only channel on iop3xx * support xor operations on buffers larger than the hardware maximum * split the do_* routines into separate prep, src/dest set, submit stages * added async_tx support (dependent operations initiation at cleanup time) * simplified group handling * added interrupt support (callbacks via tasklets) * brought the pending depth inline with ioat (i.e. 4 descriptors) * drop dma mapping methods, suggested by Chris Leech * don't use inline in C files, Adrian Bunk * remove static tasklet declarations * make iop_adma_alloc_slots easier to read and remove chances for a corrupted descriptor chain * fix locking bug in iop_adma_alloc_chan_resources, Benjamin Herrenschmidt * convert capabilities over to dma_cap_mask_t * fixup sparse warnings * add descriptor flush before iop_chan_enable * checkpatch.pl fixes * gpl v2 only correction * move set_src, set_dest, submit to async_tx methods Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/dma/Kconfig |8 drivers/dma/Makefile|1 drivers/dma/iop-adma.c | 1465 +++ include/asm-arm/hardware/iop_adma.h | 120 +++ 4 files changed, 1594 insertions(+), 0 deletions(-) diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig index 492aa08..f27f5c7 100644 --- a/drivers/dma/Kconfig +++ b/drivers/dma/Kconfig @@ -31,4 +31,12 @@ config INTEL_IOATDMA default m ---help--- Enable support for the Intel(R) I/OAT DMA engine. + +config INTEL_IOP_ADMA +tristate "Intel IOP ADMA support" +depends on DMA_ENGINE && (ARCH_IOP32X || ARCH_IOP33X || ARCH_IOP13XX) +default m +---help--- + Enable support for the Intel(R) IOP Series RAID engines. 
+ endmenu diff --git a/drivers/dma/Makefile b/drivers/dma/Makefile index bdcfdbd..b3839b6 100644 --- a/drivers/dma/Makefile +++ b/drivers/dma/Makefile @@ -1,3 +1,4 @@ obj-$(CONFIG_DMA_ENGINE) += dmaengine.o obj-$(CONFIG_NET_DMA) += iovlock.o obj-$(CONFIG_INTEL_IOATDMA) += ioatdma.o +obj-$(CONFIG_INTEL_IOP_ADMA) += iop-adma.o diff --git a/drivers/dma/iop-adma.c b/drivers/dma/iop-adma.c new file mode 100644 index 000..3db12d6 --- /dev/null +++ b/drivers/dma/iop-adma.c @@ -0,0 +1,1465 @@ +/* + * offload engine driver for the Intel Xscale series of i/o processors + * Copyright © 2006, Intel Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + * + */ + +/*
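The per-platform capability list in the message above can be pictured as a small bitmask table. The following is only a user-space sketch under stated assumptions: the `CAP_*` names, the `chan_caps_sketch` struct, and the channel numbering (0 and 1 for the two copy-class engines, 2 for the third engine) are hypothetical and are not the driver's `dma_cap_mask_t` interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capability bits, one per operation named in the message. */
enum cap_sketch {
	CAP_COPY   = 1 << 0,
	CAP_CRC32C = 1 << 1,
	CAP_XOR    = 1 << 2,
	CAP_XOR_ZS = 1 << 3,	/* xor zero sum */
	CAP_PQ     = 1 << 4,
	CAP_PQ_ZS  = 1 << 5,	/* pq zero sum */
	CAP_FILL   = 1 << 6	/* block fill */
};

struct chan_caps_sketch {
	const char *platform;
	int chan;		/* assumed numbering, see lead-in */
	unsigned int caps;
};

/* Table transcribed from the capability list in the commit message. */
static const struct chan_caps_sketch cap_table[] = {
	{ "iop219",  0, CAP_COPY },
	{ "iop219",  1, CAP_COPY },
	{ "iop321",  0, CAP_COPY },
	{ "iop321",  1, CAP_COPY },
	{ "iop321",  2, CAP_XOR | CAP_FILL },
	{ "iop33x",  0, CAP_COPY | CAP_CRC32C },
	{ "iop33x",  1, CAP_COPY | CAP_CRC32C },
	{ "iop33x",  2, CAP_XOR | CAP_XOR_ZS | CAP_PQ | CAP_PQ_ZS | CAP_FILL },
	{ "iop13xx", 0, CAP_COPY | CAP_CRC32C | CAP_XOR | CAP_XOR_ZS | CAP_FILL },
	{ "iop13xx", 1, CAP_COPY | CAP_CRC32C | CAP_XOR | CAP_XOR_ZS | CAP_FILL },
	{ "iop13xx", 2, CAP_COPY | CAP_CRC32C | CAP_XOR | CAP_XOR_ZS |
			CAP_PQ | CAP_PQ_ZS | CAP_FILL },
};

/* Return 1 if the given channel advertises every bit in cap, else 0. */
static int chan_has_cap(const char *platform, int chan, unsigned int cap)
{
	size_t i;

	assert(platform != NULL);
	for (i = 0; i < sizeof(cap_table) / sizeof(cap_table[0]); i++)
		if (cap_table[i].chan == chan &&
		    strcmp(cap_table[i].platform, platform) == 0)
			return (cap_table[i].caps & cap) == cap;
	return 0;	/* unknown platform or channel */
}
```

A caller would ask, for example, `chan_has_cap("iop321", 2, CAP_XOR)` before routing an xor to that channel.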
[md-accel PATCH 14/19] md: handle_stripe5 - request io processing in raid5_run_ops
I/O submission requests were already handled outside of the stripe lock in handle_stripe. Now that handle_stripe is only tasked with finding work, this logic belongs in raid5_run_ops. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 71 ++-- 1 files changed, 13 insertions(+), 58 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e0ae26d..a09bc5f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2319,6 +2319,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf, "%d for r-m-w\n", i); set_bit(R5_LOCKED, >flags); set_bit(R5_Wantread, >flags); + if (!test_and_set_bit( + STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; s->locked++; } else { set_bit(STRIPE_DELAYED, >state); @@ -2342,6 +2345,9 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf, "%d for Reconstruct\n", i); set_bit(R5_LOCKED, >flags); set_bit(R5_Wantread, >flags); + if (!test_and_set_bit( + STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; s->locked++; } else { set_bit(STRIPE_DELAYED, >state); @@ -2538,6 +2544,9 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh, set_bit(R5_LOCKED, >flags); set_bit(R5_Wantwrite, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; + clear_bit(STRIPE_DEGRADED, >state); s->locked++; set_bit(STRIPE_INSYNC, >state); @@ -2923,12 +2932,16 @@ static void handle_stripe5(struct stripe_head *sh) dev = >dev[s.failed_num]; if (!test_bit(R5_ReWrite, >flags)) { set_bit(R5_Wantwrite, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; set_bit(R5_ReWrite, >flags); set_bit(R5_LOCKED, >flags); s.locked++; } else { /* let's read it back */ set_bit(R5_Wantread, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; set_bit(R5_LOCKED, >flags); s.locked++; } @@ -2989,64 +3002,6 @@ static void handle_stripe5(struct stripe_head *sh) test_bit(BIO_UPTODATE, 
>bi_flags) ? 0 : -EIO); } - for (i=disks; i-- ;) { - int rw; - struct bio *bi; - mdk_rdev_t *rdev; - if (test_and_clear_bit(R5_Wantwrite, >dev[i].flags)) - rw = WRITE; - else if (test_and_clear_bit(R5_Wantread, >dev[i].flags)) - rw = READ; - else - continue; - - bi = >dev[i].req; - - bi->bi_rw = rw; - if (rw == WRITE) - bi->bi_end_io = raid5_end_write_request; - else - bi->bi_end_io = raid5_end_read_request; - - rcu_read_lock(); - rdev = rcu_dereference(conf->disks[i].rdev); - if (rdev && test_bit(Faulty, >flags)) - rdev = NULL; - if (rdev) - atomic_inc(>nr_pending); - rcu_read_unlock(); - - if (rdev) { - if (s.syncing || s.expanding || s.expanded) - md_sync_acct(rdev->bdev, STRIPE_SECTORS); - - bi->bi_bdev = rdev->bdev; - pr_debug("for %llu schedule op %ld on disc %d\n", - (unsigned long long)sh->sector, bi->bi_rw, i); - atomic_inc(>count); - bi->bi_sector = sh->sector + rdev->data_offset; - bi->bi_flags = 1 << BIO_UPTODATE; - bi->bi_vcnt = 1; - bi->bi_max_vecs = 1; - bi->bi_idx = 0; -
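The hunks above repeat one idiom: set STRIPE_OP_IO in the pending mask and bump the operation count only if the bit was not already set. A minimal user-space sketch of that idiom follows; `test_and_set_bit_sketch` is a non-atomic stand-in for the kernel's atomic `test_and_set_bit()`, and the struct and bit number are hypothetical.

```c
#include <assert.h>

enum { STRIPE_OP_IO_SKETCH = 0 };	/* hypothetical bit number */

struct ops_sketch {
	unsigned long pending;	/* one bit per requested operation type */
	int count;		/* number of distinct operation types pending */
};

/* Non-atomic stand-in: set bit nr and return its previous value. */
static int test_and_set_bit_sketch(int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << nr;
	int old = (*addr & mask) != 0;

	*addr |= mask;
	return old;
}

/* Request i/o processing; count rises only on the first request. */
static void request_io(struct ops_sketch *ops)
{
	if (!test_and_set_bit_sketch(STRIPE_OP_IO_SKETCH, &ops->pending))
		ops->count++;
}

/* Issue n requests against a fresh state and report the resulting count. */
static int count_after_requests(int n)
{
	struct ops_sketch ops = { 0, 0 };
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		request_io(&ops);
	return ops.count;
}
```

However many devices in the stripe want a read or write, the count increases by one, which matches the patch treating i/o as a single operation type.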
[md-accel PATCH 15/19] md: remove raid5 compute_block and compute_parity5
replaced by raid5_run_ops Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 124 1 files changed, 0 insertions(+), 124 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index a09bc5f..0579d1f 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1509,130 +1509,6 @@ static void copy_data(int frombio, struct bio *bio, } \ } while(0) - -static void compute_block(struct stripe_head *sh, int dd_idx) -{ - int i, count, disks = sh->disks; - void *ptr[MAX_XOR_BLOCKS], *dest, *p; - - pr_debug("compute_block, stripe %llu, idx %d\n", - (unsigned long long)sh->sector, dd_idx); - - dest = page_address(sh->dev[dd_idx].page); - memset(dest, 0, STRIPE_SIZE); - count = 0; - for (i = disks ; i--; ) { - if (i == dd_idx) - continue; - p = page_address(sh->dev[i].page); - if (test_bit(R5_UPTODATE, >dev[i].flags)) - ptr[count++] = p; - else - printk(KERN_ERR "compute_block() %d, stripe %llu, %d" - " not present\n", dd_idx, - (unsigned long long)sh->sector, i); - - check_xor(); - } - if (count) - xor_blocks(count, STRIPE_SIZE, dest, ptr); - set_bit(R5_UPTODATE, >dev[dd_idx].flags); -} - -static void compute_parity5(struct stripe_head *sh, int method) -{ - raid5_conf_t *conf = sh->raid_conf; - int i, pd_idx = sh->pd_idx, disks = sh->disks, count; - void *ptr[MAX_XOR_BLOCKS], *dest; - struct bio *chosen; - - pr_debug("compute_parity5, stripe %llu, method %d\n", - (unsigned long long)sh->sector, method); - - count = 0; - dest = page_address(sh->dev[pd_idx].page); - switch(method) { - case READ_MODIFY_WRITE: - BUG_ON(!test_bit(R5_UPTODATE, >dev[pd_idx].flags)); - for (i=disks ; i-- ;) { - if (i==pd_idx) - continue; - if (sh->dev[i].towrite && - test_bit(R5_UPTODATE, >dev[i].flags)) { - ptr[count++] = page_address(sh->dev[i].page); - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) - wake_up(>wait_for_overlap); - - BUG_ON(sh->dev[i].written); - 
sh->dev[i].written = chosen; - check_xor(); - } - } - break; - case RECONSTRUCT_WRITE: - memset(dest, 0, STRIPE_SIZE); - for (i= disks; i-- ;) - if (i!=pd_idx && sh->dev[i].towrite) { - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) - wake_up(>wait_for_overlap); - - BUG_ON(sh->dev[i].written); - sh->dev[i].written = chosen; - } - break; - case CHECK_PARITY: - break; - } - if (count) { - xor_blocks(count, STRIPE_SIZE, dest, ptr); - count = 0; - } - - for (i = disks; i--;) - if (sh->dev[i].written) { - sector_t sector = sh->dev[i].sector; - struct bio *wbi = sh->dev[i].written; - while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) { - copy_data(1, wbi, sh->dev[i].page, sector); - wbi = r5_next_bio(wbi, sector); - } - - set_bit(R5_LOCKED, >dev[i].flags); - set_bit(R5_UPTODATE, >dev[i].flags); - } - - switch(method) { - case RECONSTRUCT_WRITE: - case CHECK_PARITY: - for (i=disks; i--;) - if (i != pd_idx) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); - } - break; - case READ_MODIFY_WRITE: - for (i = disks; i--;) - if (sh->dev[i].written) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); - } -
[md-accel PATCH 11/19] md: handle_stripe5 - add request/completion logic for async check ops
Check operations are scheduled when the array is being resynced or an explicit 'check/repair' command was sent to the array. Previously check operations would destroy the parity block in the cache such that even if parity turned out to be correct the parity block would be marked !R5_UPTODATE at the completion of the check. When the operation can be carried out by a dma engine the assumption is that it can check parity as a read-only operation. If raid5_run_ops notices that the check was handled by hardware it will preserve the R5_UPTODATE status of the parity disk. When a check operation determines that the parity needs to be repaired we reuse the existing compute block infrastructure to carry out the operation. Repair operations imply an immediate write back of the data, so to differentiate a repair from a normal compute operation the STRIPE_OP_MOD_REPAIR_PD flag is added. Changelog: * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 84 1 files changed, 65 insertions(+), 19 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 38b8167..89d3890 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2464,26 +2464,67 @@ static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh, struct stripe_head_state *s, int disks) { set_bit(STRIPE_HANDLE, >state); - if (s->failed == 0) { - BUG_ON(s->uptodate != disks); - compute_parity5(sh, CHECK_PARITY); - s->uptodate--; - if (page_is_zero(sh->dev[sh->pd_idx].page)) { - /* parity is correct (on disc, not in buffer any more) -*/ - set_bit(STRIPE_INSYNC, >state); - } else { - conf->mddev->resync_mismatches += STRIPE_SECTORS; - if (test_bit(MD_RECOVERY_CHECK, >mddev->recovery)) - /* don't try to repair!! 
*/ + /* Take one of the following actions: +* 1/ start a check parity operation if (uptodate == disks) +* 2/ finish a check parity operation and act on the result +* 3/ skip to the writeback section if we previously +*initiated a recovery operation +*/ + if (s->failed == 0 && + !test_bit(STRIPE_OP_MOD_REPAIR_PD, >ops.pending)) { + if (!test_and_set_bit(STRIPE_OP_CHECK, >ops.pending)) { + BUG_ON(s->uptodate != disks); + clear_bit(R5_UPTODATE, >dev[sh->pd_idx].flags); + sh->ops.count++; + s->uptodate--; + } else if ( + test_and_clear_bit(STRIPE_OP_CHECK, >ops.complete)) { + clear_bit(STRIPE_OP_CHECK, >ops.ack); + clear_bit(STRIPE_OP_CHECK, >ops.pending); + + if (sh->ops.zero_sum_result == 0) + /* parity is correct (on disc, +* not in buffer any more) +*/ set_bit(STRIPE_INSYNC, >state); else { - compute_block(sh, sh->pd_idx); - s->uptodate++; + conf->mddev->resync_mismatches += + STRIPE_SECTORS; + if (test_bit( +MD_RECOVERY_CHECK, >mddev->recovery)) + /* don't try to repair!! */ + set_bit(STRIPE_INSYNC, >state); + else { + set_bit(STRIPE_OP_COMPUTE_BLK, + >ops.pending); + set_bit(STRIPE_OP_MOD_REPAIR_PD, + >ops.pending); + set_bit(R5_Wantcompute, + >dev[sh->pd_idx].flags); + sh->ops.target = sh->pd_idx; + sh->ops.count++; + s->uptodate++; + } } } } - if (!test_bit(STRIPE_INSYNC, >state)) { + + /* check if we can clear a parity disk reconstruct */ + if (test_bit(STRIPE_OP_COMPUTE_BLK, >ops.complete) && + test_bit(STRIPE_OP_MOD_REPAIR_PD, >ops.pending)) { + + clear_bit(STRIPE_OP_MOD_REPAIR_PD, >ops.pending); + clear_bit(STRIPE_OP_COMPUTE_BLK, >ops.complete); +
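The message above relies on a check that reads the blocks without modifying them and reports a zero-sum result. The arithmetic can be sketched in user space as below; `zero_sum_check` and the flat buffer layout are assumptions for illustration, not the async_tx zero-sum interface.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Xor all ndisks blocks (data plus parity) together, byte by byte.
 * Returns 0 if every byte sums to zero (parity is correct), 1 otherwise.
 * The blocks are only read, so a correct parity block stays usable.
 */
static int zero_sum_check(const unsigned char *blocks, size_t ndisks,
			  size_t blk)
{
	size_t i, j;

	assert(blocks != NULL);
	for (j = 0; j < blk; j++) {
		unsigned char acc = 0;

		for (i = 0; i < ndisks; i++)
			acc ^= blocks[i * blk + j];
		if (acc)
			return 1;
	}
	return 0;
}

/* Two data blocks plus parity; optionally flip one parity bit first. */
static int check_demo(int corrupt)
{
	unsigned char s[3 * 4] = {
		0x11, 0x22, 0x33, 0x44,	/* data 0 */
		0xA0, 0xB0, 0xC0, 0xD0,	/* data 1 */
		0xB1, 0x92, 0xF3, 0x94,	/* parity = data 0 ^ data 1 */
	};

	if (corrupt)
		s[9] ^= 0x01;
	return zero_sum_check(s, 3, 4);
}
```

A zero result corresponds to the STRIPE_INSYNC branch in the diff; a nonzero result corresponds to the mismatch accounting and optional repair.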
[md-accel PATCH 12/19] md: handle_stripe5 - add request/completion logic for async read ops
When a read bio is attached to the stripe and the corresponding block is marked R5_UPTODATE, then a read (biofill) operation is scheduled to copy the data from the stripe cache to the bio buffer. handle_stripe flags the blocks to be operated on with the R5_Wantfill flag. If new read requests arrive while raid5_run_ops is running they will not be handled until handle_stripe is scheduled to run again. Changelog: * cleanup to_read and to_fill accounting * do not fail reads that have reached the cache Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 53 +--- include/linux/raid/raid5.h |2 +- 2 files changed, 26 insertions(+), 29 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 89d3890..3d0dca9 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2042,9 +2042,12 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh, bi = bi2; } - /* fail any reads if this device is non-operational */ - if (!test_bit(R5_Insync, >dev[i].flags) || - test_bit(R5_ReadError, >dev[i].flags)) { + /* fail any reads if this device is non-operational and +* the data has not reached the cache yet. 
+*/ + if (!test_bit(R5_Wantfill, >dev[i].flags) && + (!test_bit(R5_Insync, >dev[i].flags) || + test_bit(R5_ReadError, >dev[i].flags))) { bi = sh->dev[i].toread; sh->dev[i].toread = NULL; if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) @@ -2733,37 +2736,27 @@ static void handle_stripe5(struct stripe_head *sh) struct r5dev *dev = >dev[i]; clear_bit(R5_Insync, >flags); - pr_debug("check %d: state 0x%lx read %p write %p written %p\n", - i, dev->flags, dev->toread, dev->towrite, dev->written); - /* maybe we can reply to a read */ - if (test_bit(R5_UPTODATE, >flags) && dev->toread) { - struct bio *rbi, *rbi2; - pr_debug("Return read for disc %d\n", i); - spin_lock_irq(>device_lock); - rbi = dev->toread; - dev->toread = NULL; - if (test_and_clear_bit(R5_Overlap, >flags)) - wake_up(>wait_for_overlap); - spin_unlock_irq(>device_lock); - while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) { - copy_data(0, rbi, dev->page, dev->sector); - rbi2 = r5_next_bio(rbi, dev->sector); - spin_lock_irq(>device_lock); - if (--rbi->bi_phys_segments == 0) { - rbi->bi_next = return_bi; - return_bi = rbi; - } - spin_unlock_irq(>device_lock); - rbi = rbi2; - } - } + pr_debug("check %d: state 0x%lx toread %p read %p write %p " + "written %p\n", i, dev->flags, dev->toread, dev->read, + dev->towrite, dev->written); + + /* maybe we can request a biofill operation +* +* new wantfill requests are only permitted while +* STRIPE_OP_BIOFILL is clear +*/ + if (test_bit(R5_UPTODATE, >flags) && dev->toread && + !test_bit(STRIPE_OP_BIOFILL, >ops.pending)) + set_bit(R5_Wantfill, >flags); /* now count some things */ if (test_bit(R5_LOCKED, >flags)) s.locked++; if (test_bit(R5_UPTODATE, >flags)) s.uptodate++; if (test_bit(R5_Wantcompute, >flags)) s.compute++; - if (dev->toread) + if (test_bit(R5_Wantfill, >flags)) + s.to_fill++; + else if (dev->toread) s.to_read++; if (dev->towrite) { s.to_write++; @@ -2786,6 +2779,10 @@ static void handle_stripe5(struct stripe_head *sh) set_bit(R5_Insync, 
>flags); } rcu_read_unlock(); + + if (s.to_fill && !test_and_set_bit(STRIPE_OP_BIOFILL, >ops.pending)) + sh->ops.count++; + pr_debug("locked=%d uptodate=%d to_read=%d" " to_write=%d failed=%d failed_num=%d\n", s.locked, s.uptodate, s.to_read, s.to_write, diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h index 2d45eba..e9dfb2d 100644 ---
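The to_fill and to_read accounting in this patch can be sketched as follows. This is a simplified user-space model: the structs and `scan_reads` are hypothetical, plain ints stand in for the R5_* flag bits, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct dev_sketch {
	int uptodate;	/* stands in for R5_UPTODATE */
	int toread;	/* nonzero if a read bio is attached */
	int wantfill;	/* stands in for R5_Wantfill */
};

struct read_counts_sketch {
	int to_fill;
	int to_read;
};

/*
 * Flag cached blocks with waiting readers for biofill, unless a biofill
 * is already pending, then count fill requests separately from reads
 * that still have to be satisfied some other way.
 */
static struct read_counts_sketch scan_reads(struct dev_sketch *devs, int n,
					    int biofill_pending)
{
	struct read_counts_sketch s = { 0, 0 };
	int i;

	assert(devs != NULL && n >= 0);
	for (i = 0; i < n; i++) {
		if (devs[i].uptodate && devs[i].toread && !biofill_pending)
			devs[i].wantfill = 1;

		if (devs[i].wantfill)
			s.to_fill++;
		else if (devs[i].toread)
			s.to_read++;
	}
	return s;
}

/* Device 0: cached with a reader; device 1: reader only; device 2: idle. */
static int demo_to_fill(int biofill_pending)
{
	struct dev_sketch devs[3] = { { 1, 1, 0 }, { 0, 1, 0 }, { 1, 0, 0 } };

	return scan_reads(devs, 3, biofill_pending).to_fill;
}

static int demo_to_read(int biofill_pending)
{
	struct dev_sketch devs[3] = { { 1, 1, 0 }, { 0, 1, 0 }, { 1, 0, 0 } };

	return scan_reads(devs, 3, biofill_pending).to_read;
}
```

With a biofill already pending, the cached read is not flagged and is counted under to_read, which models the message's point that new requests wait for the next handle_stripe pass.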
[md-accel PATCH 13/19] md: handle_stripe5 - add request/completion logic for async expand ops
When a stripe is being expanded bulk copying takes place to move the data from the old stripe to the new. Since raid5_run_ops only operates on one stripe at a time these bulk copies are handled in-line under the stripe lock. In the dma offload case we poll for the completion of the operation. After the data has been copied into the new stripe the parity needs to be recalculated across the new disks. We reuse the existing postxor functionality to carry out this calculation. By setting STRIPE_OP_POSTXOR without setting STRIPE_OP_BIODRAIN the completion path in handle_stripe can differentiate expand operations from normal write operations. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 50 ++ 1 files changed, 38 insertions(+), 12 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3d0dca9..e0ae26d 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2646,6 +2646,7 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh, /* We have read all the blocks in this stripe and now we need to * copy some of them into a target stripe for expand.
*/ + struct dma_async_tx_descriptor *tx = NULL; clear_bit(STRIPE_EXPAND_SOURCE, >state); for (i = 0; i < sh->disks; i++) if (i != sh->pd_idx && (r6s && i != r6s->qd_idx)) { @@ -2671,9 +2672,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh, release_stripe(sh2); continue; } - memcpy(page_address(sh2->dev[dd_idx].page), - page_address(sh->dev[i].page), - STRIPE_SIZE); + + /* place all the copies on one channel */ + tx = async_memcpy(sh2->dev[dd_idx].page, + sh->dev[i].page, 0, 0, STRIPE_SIZE, + ASYNC_TX_DEP_ACK, tx, NULL, NULL); + set_bit(R5_Expanded, >dev[dd_idx].flags); set_bit(R5_UPTODATE, >dev[dd_idx].flags); for (j = 0; j < conf->raid_disks; j++) @@ -2686,6 +2690,12 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh, set_bit(STRIPE_HANDLE, >state); } release_stripe(sh2); + + /* done submitting copies, wait for them to complete */ + if (i + 1 >= sh->disks) { + async_tx_ack(tx); + dma_wait_for_async_tx(tx); + } } } @@ -2924,18 +2934,34 @@ static void handle_stripe5(struct stripe_head *sh) } } - if (s.expanded && test_bit(STRIPE_EXPANDING, >state)) { - /* Need to write out all blocks after computing parity */ - sh->disks = conf->raid_disks; - sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); - compute_parity5(sh, RECONSTRUCT_WRITE); + /* Finish postxor operations initiated by the expansion +* process +*/ + if (test_bit(STRIPE_OP_POSTXOR, >ops.complete) && + !test_bit(STRIPE_OP_BIODRAIN, >ops.pending)) { + + clear_bit(STRIPE_EXPANDING, >state); + + clear_bit(STRIPE_OP_POSTXOR, >ops.pending); + clear_bit(STRIPE_OP_POSTXOR, >ops.ack); + clear_bit(STRIPE_OP_POSTXOR, >ops.complete); + for (i = conf->raid_disks; i--; ) { - set_bit(R5_LOCKED, >dev[i].flags); - s.locked++; set_bit(R5_Wantwrite, >dev[i].flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; } - clear_bit(STRIPE_EXPANDING, >state); - } else if (s.expanded) { + } + + if (s.expanded && 
test_bit(STRIPE_EXPANDING, >state) && + !test_bit(STRIPE_OP_POSTXOR, >ops.pending)) { + /* Need to write out all blocks after computing parity */ + sh->disks = conf->raid_disks; + sh->pd_idx = stripe_to_pdidx(sh->sector, conf, + conf->raid_disks); + s.locked += handle_write_operations5(sh, 0, 1); + } else if (s.expanded && + !test_bit(STRIPE_OP_POSTXOR, >ops.pending)) { clear_bit(STRIPE_EXPAND_READY, >state); atomic_dec(>reshape_stripes); wake_up(>wait_for_overlap);
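The test that separates an expansion-initiated postxor from a write-initiated one reduces to two bit checks. A minimal sketch, assuming hypothetical mask values in place of the kernel's STRIPE_OP_* bit numbers:

```c
#include <assert.h>

/* Hypothetical masks; the kernel uses bit numbers with test_bit(). */
enum {
	OP_BIODRAIN_SKETCH = 1 << 0,
	OP_POSTXOR_SKETCH  = 1 << 1
};

/*
 * A completed postxor belongs to the expansion path when no biodrain
 * was requested alongside it; a normal write always requests both.
 */
static int postxor_done_for_expand(unsigned long pending,
				   unsigned long complete)
{
	return (complete & OP_POSTXOR_SKETCH) != 0 &&
	       (pending & OP_BIODRAIN_SKETCH) == 0;
}
```

This mirrors the condition in the handle_stripe5 hunk above, where STRIPE_OP_POSTXOR is tested in ops.complete and STRIPE_OP_BIODRAIN in ops.pending.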
[md-accel PATCH 09/19] md: handle_stripe5 - add request/completion logic for async write ops
After handle_stripe5 decides whether it wants to perform a read-modify-write, or a reconstruct write it calls handle_write_operations5. A read-modify-write operation will perform an xor subtraction of the blocks marked with the R5_Wantprexor flag, copy the new data into the stripe (biodrain) and perform a postxor operation across all up-to-date blocks to generate the new parity. A reconstruct write is run when all blocks are already up-to-date in the cache so all that is needed is a biodrain and postxor. On the completion path STRIPE_OP_PREXOR will be set if the operation was a read-modify-write. The STRIPE_OP_BIODRAIN flag is used in the completion path to differentiate write-initiated postxor operations versus expansion-initiated postxor operations. Completion of a write triggers i/o to the drives. Changelog: * make the 'rcw' parameter to handle_write_operations5 a simple flag, Neil Brown * remove test_and_set/test_and_clear BUG_ONs, Neil Brown Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 161 +--- 1 files changed, 138 insertions(+), 23 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 7c688f6..b2e88fe 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1815,7 +1815,79 @@ static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2) } } +static int +handle_write_operations5(struct stripe_head *sh, int rcw, int expand) +{ + int i, pd_idx = sh->pd_idx, disks = sh->disks; + int locked = 0; + + if (rcw) { + /* if we are not expanding this is a proper write request, and +* there will be bios with new data to be drained into the +* stripe cache +*/ + if (!expand) { + set_bit(STRIPE_OP_BIODRAIN, >ops.pending); + sh->ops.count++; + } + + set_bit(STRIPE_OP_POSTXOR, >ops.pending); + sh->ops.count++; + + for (i = disks; i--; ) { + struct r5dev *dev = >dev[i]; + + if (dev->towrite) { + set_bit(R5_LOCKED, >flags); + if (!expand) + clear_bit(R5_UPTODATE, >flags); + 
locked++; + } + } + } else { + BUG_ON(!(test_bit(R5_UPTODATE, >dev[pd_idx].flags) || + test_bit(R5_Wantcompute, >dev[pd_idx].flags))); + + set_bit(STRIPE_OP_PREXOR, >ops.pending); + set_bit(STRIPE_OP_BIODRAIN, >ops.pending); + set_bit(STRIPE_OP_POSTXOR, >ops.pending); + + sh->ops.count += 3; + + for (i = disks; i--; ) { + struct r5dev *dev = >dev[i]; + if (i == pd_idx) + continue; + + /* For a read-modify write there may be blocks that are +* locked for reading while others are ready to be +* written so we distinguish these blocks by the +* R5_Wantprexor bit +*/ + if (dev->towrite && + (test_bit(R5_UPTODATE, >flags) || + test_bit(R5_Wantcompute, >flags))) { + set_bit(R5_Wantprexor, >flags); + set_bit(R5_LOCKED, >flags); + clear_bit(R5_UPTODATE, >flags); + locked++; + } + } + } + + /* keep the parity disk locked while asynchronous operations +* are in flight +*/ + set_bit(R5_LOCKED, >dev[pd_idx].flags); + clear_bit(R5_UPTODATE, >dev[pd_idx].flags); + locked++; + pr_debug("%s: stripe %llu locked: %d pending: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + locked, sh->ops.pending); + + return locked; +} /* * Each stripe/dev can have one or more bion attached. @@ -2210,27 +2282,8 @@ static void handle_issuing_new_write_requests5(raid5_conf_t *conf, * we can start a write request */ if (s->locked == 0 && (rcw == 0 || rmw == 0) && - !test_bit(STRIPE_BIT_DELAY, >state)) { - pr_debug("Computing parity...\n"); - compute_parity5(sh, rcw == 0 ? - RECONSTRUCT_WRITE : READ_MODIFY_WRITE); - /* now every locked buffer is ready to be written */ - for (i = disks; i--; ) - if (test_bit(R5_LOCKED, >dev[i].flags)) { - pr_debug("Writing block %d\n", i); - s->locked++; -
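The two write strategies described in this message produce the same parity. The sketch below shows only the xor arithmetic in user space, for a single rewritten block; the block count, block size, and function names are hypothetical, and none of the stripe locking or flag handling from the patch is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NDATA 3	/* hypothetical number of data blocks per stripe */
#define BLK   8	/* hypothetical block size in bytes */

/* Xor src into dest, byte by byte. */
static void xor_into(unsigned char *dest, const unsigned char *src, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		dest[i] ^= src[i];
}

/* Read-modify-write: prexor, biodrain, postxor on the rewritten block. */
static void rmw_write(unsigned char data[NDATA][BLK], unsigned char *parity,
		      int idx, const unsigned char *new_data)
{
	assert(idx >= 0 && idx < NDATA);
	xor_into(parity, data[idx], BLK);	/* prexor: remove old data */
	memcpy(data[idx], new_data, BLK);	/* biodrain: copy new data in */
	xor_into(parity, data[idx], BLK);	/* postxor: add new data */
}

/* Reconstruct write: biodrain, then postxor across every data block. */
static void rcw_write(unsigned char data[NDATA][BLK], unsigned char *parity,
		      int idx, const unsigned char *new_data)
{
	int i;

	assert(idx >= 0 && idx < NDATA);
	memcpy(data[idx], new_data, BLK);
	memset(parity, 0, BLK);
	for (i = 0; i < NDATA; i++)
		xor_into(parity, data[i], BLK);
}

/* Apply the same write both ways and compare; returns 1 on agreement. */
static int rmw_matches_rcw(void)
{
	unsigned char a[NDATA][BLK], b[NDATA][BLK];
	unsigned char pa[BLK], pb[BLK], new_data[BLK];
	int i, j;

	for (i = 0; i < NDATA; i++)
		for (j = 0; j < BLK; j++)
			a[i][j] = b[i][j] = (unsigned char)(i * 31 + j * 7 + 1);
	for (j = 0; j < BLK; j++)
		new_data[j] = (unsigned char)(0xA5 ^ j);

	/* Both stripes start from a consistent parity block. */
	memset(pa, 0, BLK);
	for (i = 0; i < NDATA; i++)
		xor_into(pa, a[i], BLK);
	memcpy(pb, pa, BLK);

	rmw_write(a, pa, 1, new_data);
	rcw_write(b, pb, 1, new_data);
	return memcmp(pa, pb, BLK) == 0 && memcmp(a, b, sizeof(a)) == 0;
}
```

The read-modify-write path touches only the parity block and the rewritten block, while the reconstruct path needs every data block in the cache, which is the trade-off handle_stripe5 weighs through its rmw and rcw counts.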
[md-accel PATCH 10/19] md: handle_stripe5 - add request/completion logic for async compute ops
handle_stripe will compute a block when a backing disk has failed, or when it determines it can save a disk read by computing the block from all the other up-to-date blocks.

Previously a block would be computed under the lock and subsequent logic in handle_stripe could use the newly up-to-date block. With the raid5_run_ops implementation the compute operation is carried out at a later time outside the lock. To preserve the old functionality we take advantage of the dependency chain feature of async_tx to flag the block as R5_Wantcompute and then let other parts of handle_stripe operate on the block as if it were up-to-date. raid5_run_ops guarantees that the block will be ready before it is used in another operation.

However, this only works in cases where the compute and the dependent operation are scheduled at the same time. If a previous call to handle_stripe sets the R5_Wantcompute flag there is no facility to pass the async_tx dependency chain across successive calls to raid5_run_ops. The req_compute variable protects against this case.
Changelog: * remove the req_compute BUG_ON Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 149 ++-- include/linux/raid/raid5.h |2 - 2 files changed, 115 insertions(+), 36 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index b2e88fe..38b8167 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -2070,36 +2070,101 @@ handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh, } +/* __handle_issuing_new_read_requests5 - returns 0 if there are no more disks + * to process + */ +static int __handle_issuing_new_read_requests5(struct stripe_head *sh, + struct stripe_head_state *s, int disk_idx, int disks) +{ + struct r5dev *dev = >dev[disk_idx]; + struct r5dev *failed_dev = >dev[s->failed_num]; + + /* don't schedule compute operations or reads on the parity block while +* a check is in flight +*/ + if ((disk_idx == sh->pd_idx) && +test_bit(STRIPE_OP_CHECK, >ops.pending)) + return ~0; + + /* is the data in this block needed, and can we get it? */ + if (!test_bit(R5_LOCKED, >flags) && + !test_bit(R5_UPTODATE, >flags) && (dev->toread || + (dev->towrite && !test_bit(R5_OVERWRITE, >flags)) || +s->syncing || s->expanding || (s->failed && +(failed_dev->toread || (failed_dev->towrite && +!test_bit(R5_OVERWRITE, _dev->flags) +) { + /* 1/ We would like to get this block, possibly by computing it, +* but we might not be able to. +* +* 2/ Since parity check operations potentially make the parity +* block !uptodate it will need to be refreshed before any +* compute operations on data disks are scheduled. +* +* 3/ We hold off parity block re-reads until check operations +* have quiesced. 
+*/ + if ((s->uptodate == disks - 1) && + !test_bit(STRIPE_OP_CHECK, >ops.pending)) { + set_bit(STRIPE_OP_COMPUTE_BLK, >ops.pending); + set_bit(R5_Wantcompute, >flags); + sh->ops.target = disk_idx; + s->req_compute = 1; + sh->ops.count++; + /* Careful: from this point on 'uptodate' is in the eye +* of raid5_run_ops which services 'compute' operations +* before writes. R5_Wantcompute flags a block that will +* be R5_UPTODATE by the time it is needed for a +* subsequent operation. +*/ + s->uptodate++; + return 0; /* uptodate + compute == disks */ + } else if ((s->uptodate < disks - 1) && + test_bit(R5_Insync, >flags)) { + /* Note: we hold off compute operations while checks are +* in flight, but we still prefer 'compute' over 'read' +* hence we only read if (uptodate < * disks-1) +*/ + set_bit(R5_LOCKED, >flags); + set_bit(R5_Wantread, >flags); + if (!test_and_set_bit(STRIPE_OP_IO, >ops.pending)) + sh->ops.count++; + s->locked++; + pr_debug("Reading block %d (sync=%d)\n", disk_idx, + s->syncing); + } + } + + return ~0; +} + static void handle_issuing_new_read_requests5(struct stripe_head *sh, struct stripe_head_state *s, int disks) { int
[md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside sh->lock
When the raid acceleration work was proposed, Neil laid out the following attack plan:

1/ move the xor and copy operations outside spin_lock(&sh->lock)
2/ find/implement an asynchronous offload api

The raid5_run_ops routine uses the asynchronous offload api (async_tx) and the stripe_operations member of a stripe_head to carry out xor+copy operations asynchronously, outside the lock.

To perform operations outside the lock a new set of state flags is needed to track new requests, in-flight requests, and completed requests. In this new model handle_stripe is tasked with scanning the stripe_head for work, updating the stripe_operations structure, and finally dropping the lock and calling raid5_run_ops for processing. The following flags outline the requests that handle_stripe can make of raid5_run_ops:

STRIPE_OP_BIOFILL
 - copy data into request buffers to satisfy a read request
STRIPE_OP_COMPUTE_BLK
 - generate a missing block in the cache from the other blocks
STRIPE_OP_PREXOR
 - subtract existing data as part of the read-modify-write process
STRIPE_OP_BIODRAIN
 - copy data out of request buffers to satisfy a write request
STRIPE_OP_POSTXOR
 - recalculate parity for new data that has entered the cache
STRIPE_OP_CHECK
 - verify that the parity is correct
STRIPE_OP_IO
 - submit i/o to the member disks (note this was already performed outside the stripe lock, but it made sense to add it as an operation type)

The flow is:
1/ handle_stripe sets STRIPE_OP_* in sh->ops.pending
2/ raid5_run_ops reads sh->ops.pending, sets sh->ops.ack, and submits the operation to the async_tx api
3/ async_tx triggers the completion callback routine to set sh->ops.complete and release the stripe
4/ handle_stripe runs again to finish the operation and optionally submit new operations that were previously blocked

Note this patch just defines raid5_run_ops; subsequent commits (one per major operation type) modify handle_stripe to take advantage of this routine.
Changelog: * removed ops_complete_biodrain in favor of ops_complete_postxor and ops_complete_write. * removed the raid5_run_ops workqueue * call bi_end_io for reads in ops_complete_biofill, saves a call to handle_stripe * explicitly handle the 2-disk raid5 case (xor becomes memcpy), Neil Brown * fix race between async engines and bi_end_io call for reads, Neil Brown * remove unnecessary spin_lock from ops_complete_biofill * remove test_and_set/test_and_clear BUG_ONs, Neil Brown * remove explicit interrupt handling for channel switching, this feature was absorbed (i.e. it is now implicit) by the async_tx api Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 546 include/linux/raid/raid5.h | 81 ++- 2 files changed, 624 insertions(+), 3 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index d21fa7a..34fcda0 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -52,6 +52,7 @@ #include "raid6.h" #include +#include /* * Stripe cache @@ -324,6 +325,551 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector return sh; } +static int +raid5_end_read_request(struct bio *bi, unsigned int bytes_done, int error); +static int +raid5_end_write_request (struct bio *bi, unsigned int bytes_done, int error); + +static void ops_run_io(struct stripe_head *sh) +{ + raid5_conf_t *conf = sh->raid_conf; + int i, disks = sh->disks; + + might_sleep(); + + for (i = disks; i--; ) { + int rw; + struct bio *bi; + mdk_rdev_t *rdev; + if (test_and_clear_bit(R5_Wantwrite, >dev[i].flags)) + rw = WRITE; + else if (test_and_clear_bit(R5_Wantread, >dev[i].flags)) + rw = READ; + else + continue; + + bi = >dev[i].req; + + bi->bi_rw = rw; + if (rw == WRITE) + bi->bi_end_io = raid5_end_write_request; + else + bi->bi_end_io = raid5_end_read_request; + + rcu_read_lock(); + rdev = rcu_dereference(conf->disks[i].rdev); + if (rdev && test_bit(Faulty, >flags)) + rdev = NULL; + if (rdev) + 
atomic_inc(>nr_pending); + rcu_read_unlock(); + + if (rdev) { + if (test_bit(STRIPE_SYNCING, >state) || + test_bit(STRIPE_EXPAND_SOURCE, >state) || + test_bit(STRIPE_EXPAND_READY, >state)) + md_sync_acct(rdev->bdev, STRIPE_SECTORS); + + bi->bi_bdev = rdev->bdev; + pr_debug("%s: for %llu schedule op %ld on disc %d\n", + __FUNCTION__, (unsigned
[md-accel PATCH 08/19] md: common infrastructure for running operations with raid5_run_ops
All the handle_stripe operations that are to be transitioned to use raid5_run_ops need a method to coherently gather work under the stripe-lock and hand that work off to raid5_run_ops. The 'get_stripe_work' routine runs under the lock to read all the bits in sh->ops.pending that do not have the corresponding bit set in sh->ops.ack. This modified 'pending' bitmap is then passed to raid5_run_ops for processing. The transition from 'ack' to 'completion' does not need similar protection as the existing release_stripe infrastructure will guarantee that handle_stripe will run again after a completion bit is set, and handle_stripe can tolerate a sh->ops.completed bit being set while the lock is held. A call to async_tx_issue_pending_all() is added to raid5d to kick the offload engines once all pending stripe operations work has been submitted. This enables batching of the submission and completion of operations. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> Acked-By: NeilBrown <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 67 +--- 1 files changed, 58 insertions(+), 9 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 34fcda0..7c688f6 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -124,6 +124,7 @@ static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh) } md_wakeup_thread(conf->mddev->thread); } else { + BUG_ON(sh->ops.pending); if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, >state)) { atomic_dec(>preread_active_stripes); if (atomic_read(>preread_active_stripes) < IO_THRESHOLD) @@ -225,7 +226,8 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int BUG_ON(atomic_read(>count) != 0); BUG_ON(test_bit(STRIPE_HANDLE, >state)); - + BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete); + CHECK_DEVLOCK(); pr_debug("init_stripe called, stripe %llu\n", (unsigned long long)sh->sector); @@ -241,11 +243,11 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int for (i = 
sh->disks; i--; ) { struct r5dev *dev = >dev[i]; - if (dev->toread || dev->towrite || dev->written || + if (dev->toread || dev->read || dev->towrite || dev->written || test_bit(R5_LOCKED, >flags)) { - printk("sector=%llx i=%d %p %p %p %d\n", + printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n", (unsigned long long)sh->sector, i, dev->toread, - dev->towrite, dev->written, + dev->read, dev->towrite, dev->written, test_bit(R5_LOCKED, >flags)); BUG(); } @@ -325,6 +327,44 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector return sh; } +/* test_and_ack_op() ensures that we only dequeue an operation once */ +#define test_and_ack_op(op, pend) \ +do { \ + if (test_bit(op, >ops.pending) && \ + !test_bit(op, >ops.complete)) { \ + if (test_and_set_bit(op, >ops.ack)) \ + clear_bit(op, ); \ + else\ + ack++; \ + } else \ + clear_bit(op, ); \ +} while (0) + +/* find new work to run, do not resubmit work that is already + * in flight + */ +static unsigned long get_stripe_work(struct stripe_head *sh) +{ + unsigned long pending; + int ack = 0; + + pending = sh->ops.pending; + + test_and_ack_op(STRIPE_OP_BIOFILL, pending); + test_and_ack_op(STRIPE_OP_COMPUTE_BLK, pending); + test_and_ack_op(STRIPE_OP_PREXOR, pending); + test_and_ack_op(STRIPE_OP_BIODRAIN, pending); + test_and_ack_op(STRIPE_OP_POSTXOR, pending); + test_and_ack_op(STRIPE_OP_CHECK, pending); + if (test_and_clear_bit(STRIPE_OP_IO, >ops.pending)) + ack++; + + sh->ops.count -= ack; + BUG_ON(sh->ops.count < 0); + + return pending; +} + static int raid5_end_read_request(struct bio *bi, unsigned int bytes_done, int error); static int @@ -2487,7 +2527,6 @@ static void handle_stripe_expansion(raid5_conf_t *conf, struct stripe_head *sh, *schedule a write of some buffers *return confirmation of parity correctness * - * Parity calculations are done inside the stripe lock * buffers are taken off read_list or write_list, and bh_cache buffers * get BH_Lock set before the stripe lock is released.
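
For readers trying to picture the pending/ack/complete handshake that get_stripe_work takes part in, here is a small user-space sketch. It is only an approximation under stated assumptions: plain bit operations stand in for the kernel's atomic bitops, there is no locking, and the struct, enum and function names (ops_state, request_op, get_work, complete_op, retire_op) are invented for the illustration and do not appear in raid5.c.

```c
#include <assert.h>

/* illustrative operation numbers, not the kernel's STRIPE_OP_* values */
enum { OP_BIOFILL, OP_COMPUTE_BLK, OP_PREXOR, OP_BIODRAIN,
       OP_POSTXOR, OP_CHECK, OP_NR };

struct ops_state {
	unsigned long pending;	/* requested by the stripe handler */
	unsigned long ack;	/* picked up for execution */
	unsigned long complete;	/* finished by the completion callback */
	int count;		/* requested but not yet acknowledged */
};

/* step 1: the handler requests an operation */
static void request_op(struct ops_state *ops, int op)
{
	unsigned long bit = 1UL << op;

	if (!(ops->pending & bit)) {
		ops->pending |= bit;
		ops->count++;
	}
}

/* step 2: dequeue only work that is neither in flight nor complete */
static unsigned long get_work(struct ops_state *ops)
{
	unsigned long work = ops->pending & ~ops->ack & ~ops->complete;

	for (int op = 0; op < OP_NR; op++)
		if (work & (1UL << op))
			ops->count--;
	ops->ack |= work;
	assert(ops->count >= 0);
	return work;
}

/* step 3: the completion callback marks the operation done */
static void complete_op(struct ops_state *ops, int op)
{
	assert(ops->ack & (1UL << op));
	ops->complete |= 1UL << op;
}

/* step 4: the handler runs again and retires the finished operation */
static void retire_op(struct ops_state *ops, int op)
{
	unsigned long bit = 1UL << op;

	assert(ops->complete & bit);
	ops->pending &= ~bit;
	ops->ack &= ~bit;
	ops->complete &= ~bit;
}
```

The point of the ack bitmap in this sketch is that a second get_work() call made before completion returns no bits, so work that is already in flight is not submitted twice.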
[md-accel PATCH 05/19] raid5: refactor handle_stripe5 and handle_stripe6 (v2)
handle_stripe5 and handle_stripe6 have very deep logic paths handling the various states of a stripe_head. By introducing the 'stripe_head_state' and 'r6_state' objects, large portions of the logic can be moved to sub-routines. 'struct stripe_head_state' consumes all of the automatic variables that previously stood alone in handle_stripe5,6. 'struct r6_state' contains the handle_stripe6 specific variables like p_failed and q_failed. One of the nice side effects of the 'stripe_head_state' change is that it allows for further reductions in code duplication between raid5 and raid6. The following new routines are shared between raid5 and raid6: handle_completed_write_requests handle_requests_to_failed_array handle_stripe_expansion Changes in v2: * fixed 'conf->raid_disk-1' for the raid6 'handle_stripe_expansion' path Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 1488 +--- include/linux/raid/raid5.h | 16 2 files changed, 737 insertions(+), 767 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 4f51dfa..94e0920 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -1326,6 +1326,608 @@ static int stripe_to_pdidx(sector_t stripe, raid5_conf_t *conf, int disks) return pd_idx; } +static void +handle_requests_to_failed_array(raid5_conf_t *conf, struct stripe_head *sh, + struct stripe_head_state *s, int disks, + struct bio **return_bi) +{ + int i; + for (i = disks; i--; ) { + struct bio *bi; + int bitmap_end = 0; + + if (test_bit(R5_ReadError, >dev[i].flags)) { + mdk_rdev_t *rdev; + rcu_read_lock(); + rdev = rcu_dereference(conf->disks[i].rdev); + if (rdev && test_bit(In_sync, >flags)) + /* multiple read failures in one stripe */ + md_error(conf->mddev, rdev); + rcu_read_unlock(); + } + spin_lock_irq(>device_lock); + /* fail all writes first */ + bi = sh->dev[i].towrite; + sh->dev[i].towrite = NULL; + if (bi) { + s->to_write--; + bitmap_end = 1; + } + + if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) + 
wake_up(>wait_for_overlap); + + while (bi && bi->bi_sector < + sh->dev[i].sector + STRIPE_SECTORS) { + struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector); + clear_bit(BIO_UPTODATE, >bi_flags); + if (--bi->bi_phys_segments == 0) { + md_write_end(conf->mddev); + bi->bi_next = *return_bi; + *return_bi = bi; + } + bi = nextbi; + } + /* and fail all 'written' */ + bi = sh->dev[i].written; + sh->dev[i].written = NULL; + if (bi) bitmap_end = 1; + while (bi && bi->bi_sector < + sh->dev[i].sector + STRIPE_SECTORS) { + struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector); + clear_bit(BIO_UPTODATE, >bi_flags); + if (--bi->bi_phys_segments == 0) { + md_write_end(conf->mddev); + bi->bi_next = *return_bi; + *return_bi = bi; + } + bi = bi2; + } + + /* fail any reads if this device is non-operational */ + if (!test_bit(R5_Insync, >dev[i].flags) || + test_bit(R5_ReadError, >dev[i].flags)) { + bi = sh->dev[i].toread; + sh->dev[i].toread = NULL; + if (test_and_clear_bit(R5_Overlap, >dev[i].flags)) + wake_up(>wait_for_overlap); + if (bi) s->to_read--; + while (bi && bi->bi_sector < + sh->dev[i].sector + STRIPE_SECTORS) { + struct bio *nextbi = + r5_next_bio(bi, sh->dev[i].sector); + clear_bit(BIO_UPTODATE, >bi_flags); + if (--bi->bi_phys_segments == 0) { + bi->bi_next = *return_bi; + *return_bi = bi; + } + bi = nextbi; + } + } + spin_unlock_irq(>device_lock); +
[md-accel PATCH 04/19] async_tx: add the async_tx api
The async_tx api provides methods for describing a chain of asynchronous bulk memory transfers/transforms with support for inter-transactional dependencies. It is implemented as a dmaengine client that smooths over the details of different hardware offload engine implementations. Code that is written to the api can optimize for asynchronous operation and the api will fit the chain of operations to the available offload resources.

	I imagine that any piece of ADMA hardware would register with the
	'async_*' subsystem, and a call to async_X would be routed as
	appropriate, or be run in-line. - Neil Brown

async_tx exploits the capabilities of struct dma_async_tx_descriptor to provide an api of the following general format:

struct dma_async_tx_descriptor *
async_(..., struct dma_async_tx_descriptor *depend_tx,
	dma_async_tx_callback cb_fn, void *cb_param)
{
	struct dma_chan *chan = async_tx_find_channel(depend_tx, );
	struct dma_device *device = chan ? chan->device : NULL;
	int int_en = cb_fn ? 1 : 0;
	struct dma_async_tx_descriptor *tx = device ?
		device->device_prep_dma_(chan, len, int_en) : NULL;

	if (tx) { /* run asynchronously */
		...
		tx->tx_set_dest(addr, tx, index);
		...
		tx->tx_set_src(addr, tx, index);
		...
		async_tx_submit(chan, tx, flags, depend_tx, cb_fn, cb_param);
	} else { /* run synchronously */
		...
		...
		async_tx_sync_epilog(flags, depend_tx, cb_fn, cb_param);
	}

	return tx;
}

async_tx_find_channel() returns a capable channel from its pool. The channel pool is organized as a per-cpu array of channel pointers. The async_tx_rebalance() routine is tasked with managing these arrays. In the uniprocessor case async_tx_rebalance() tries to spread responsibility evenly over channels of similar capabilities. For example if there are two copy+xor channels, one will handle copy operations and the other will handle xor. In the SMP case async_tx_rebalance() attempts to spread the operations evenly over the cpus, e.g.
cpu0 gets copy channel0 and xor channel0 while cpu1 gets copy channel 1 and xor channel 1. When a dependency is specified async_tx_find_channel defaults to keeping the operation on the same channel. An xor->copy->xor chain will stay on one channel if it supports both operation types, otherwise the transaction will transition between a copy and an xor resource.

Currently the raid5 implementation in the MD raid456 driver has been converted to the async_tx api. A driver for the offload engines on the Intel Xscale series of I/O processors, iop-adma, is provided in a later commit. With the iop-adma driver and async_tx, raid456 is able to offload copy, xor, and xor-zero-sum operations to hardware engines.

On iop342 tiobench showed higher throughput for sequential writes (20 - 30% improvement) and sequential reads to a degraded array (40 - 55% improvement). For the other cases performance was roughly equal, +/- a few percentage points. On an x86-smp platform the performance of the async_tx implementation (in synchronous mode) was also +/- a few percentage points of the original implementation. According to 'top' on iop342 CPU utilization drops from ~50% to ~15% during a 'resync' while the speed according to /proc/mdstat doubles from ~25 MB/s to ~50 MB/s.

The tiobench command line used for testing was:

	tiobench --size 2048 --block 4096 --block 131072 --dir /mnt/raid --numruns 5

* iop342 had 1GB of memory available

Details:
* if CONFIG_DMA_ENGINE=n the asynchronous path is compiled away by making async_tx_find_channel a static inline routine that always returns NULL
* when a callback is specified for a given transaction an interrupt will fire at operation completion time and the callback will occur in a tasklet. If the channel does not support interrupts then a live polling wait will be performed
* the api is written as a dmaengine client that requests all available channels
* In support of dependencies the api implicitly schedules channel-switch interrupts.
The interrupt triggers the cleanup tasklet which causes pending operations to be scheduled on the next channel
* Xor engines treat an xor destination address differently than a software xor routine. To the software routine the destination address is an implied source, whereas engines treat it as a write-only destination. This patch modifies the xor_blocks routine to take an explicit destination address to mirror the hardware.

Changelog:
* fixed a leftover debug print
* don't allow callbacks in async_interrupt_cond
* fixed xor_block changes
* fixed usage of ASYNC_TX_XOR_DROP_DEST
* drop dma mapping methods, suggested by Chris Leech
* printk warning fixups from Andrew Morton
* don't use inline in C files, Adrian Bunk
* select
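
To make the note about xor destination semantics concrete, the following user-space sketch contrasts the two conventions. It is an assumption-laden illustration, not the xor_blocks or async_xor code: sw_style_xor and engine_style_xor are hypothetical names, and the buffers are plain byte arrays rather than pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* software style: dest is an implied source, dest ^= srcs[0] ^ ... */
static void sw_style_xor(uint8_t *dest, uint8_t *const *srcs, int n,
			 size_t len)
{
	assert(n > 0);
	for (int s = 0; s < n; s++)
		for (size_t i = 0; i < len; i++)
			dest[i] ^= srcs[s][i];
}

/* engine style: dest is write-only, dest = srcs[0] ^ srcs[1] ^ ... */
static void engine_style_xor(uint8_t *dest, uint8_t *const *srcs, int n,
			     size_t len)
{
	assert(n > 0);
	for (size_t i = 0; i < len; i++) {
		uint8_t acc = 0;

		for (int s = 0; s < n; s++)
			acc ^= srcs[s][i];
		dest[i] = acc;	/* previous contents of dest are ignored */
	}
}
```

With the same source list the two routines generally disagree, because the software style folds in whatever dest held beforehand. Listing dest explicitly among the sources of the engine-style call gives the software-style result again, which is one way to read why an explicit destination argument helps the two models line up.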
[md-accel PATCH 06/19] raid5: replace custom debug PRINTKs with standard pr_debug
Replaces PRINTK with pr_debug, and kills the RAID5_DEBUG definition in favor of the global DEBUG definition. To get local debug messages just add '#define DEBUG' to the top of the file. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/md/raid5.c | 116 ++-- 1 files changed, 58 insertions(+), 58 deletions(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 94e0920..d21fa7a 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -80,7 +80,6 @@ /* * The following can be used to debug the driver */ -#define RAID5_DEBUG0 #define RAID5_PARANOIA 1 #if RAID5_PARANOIA && defined(CONFIG_SMP) # define CHECK_DEVLOCK() assert_spin_locked(>device_lock) @@ -88,8 +87,7 @@ # define CHECK_DEVLOCK() #endif -#define PRINTK(x...) ((void)(RAID5_DEBUG && printk(x))) -#if RAID5_DEBUG +#ifdef DEBUG #define inline #define __inline__ #endif @@ -152,7 +150,8 @@ static void release_stripe(struct stripe_head *sh) static inline void remove_hash(struct stripe_head *sh) { - PRINTK("remove_hash(), stripe %llu\n", (unsigned long long)sh->sector); + pr_debug("remove_hash(), stripe %llu\n", + (unsigned long long)sh->sector); hlist_del_init(>hash); } @@ -161,7 +160,8 @@ static inline void insert_hash(raid5_conf_t *conf, struct stripe_head *sh) { struct hlist_head *hp = stripe_hash(conf, sh->sector); - PRINTK("insert_hash(), stripe %llu\n", (unsigned long long)sh->sector); + pr_debug("insert_hash(), stripe %llu\n", + (unsigned long long)sh->sector); CHECK_DEVLOCK(); hlist_add_head(>hash, hp); @@ -226,7 +226,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int BUG_ON(test_bit(STRIPE_HANDLE, >state)); CHECK_DEVLOCK(); - PRINTK("init_stripe called, stripe %llu\n", + pr_debug("init_stripe called, stripe %llu\n", (unsigned long long)sh->sector); remove_hash(sh); @@ -260,11 +260,11 @@ static struct stripe_head *__find_stripe(raid5_conf_t *conf, sector_t sector, in struct hlist_node *hn; CHECK_DEVLOCK(); - PRINTK("__find_stripe, sector %llu\n", (unsigned 
long long)sector); + pr_debug("__find_stripe, sector %llu\n", (unsigned long long)sector); hlist_for_each_entry(sh, hn, stripe_hash(conf, sector), hash) if (sh->sector == sector && sh->disks == disks) return sh; - PRINTK("__stripe %llu not in cache\n", (unsigned long long)sector); + pr_debug("__stripe %llu not in cache\n", (unsigned long long)sector); return NULL; } @@ -276,7 +276,7 @@ static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector { struct stripe_head *sh; - PRINTK("get_stripe, sector %llu\n", (unsigned long long)sector); + pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector); spin_lock_irq(>device_lock); @@ -537,8 +537,8 @@ static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done, if (bi == >dev[i].req) break; - PRINTK("end_read_request %llu/%d, count: %d, uptodate %d.\n", - (unsigned long long)sh->sector, i, atomic_read(>count), + pr_debug("end_read_request %llu/%d, count: %d, uptodate %d.\n", + (unsigned long long)sh->sector, i, atomic_read(>count), uptodate); if (i == disks) { BUG(); @@ -613,7 +613,7 @@ static int raid5_end_write_request (struct bio *bi, unsigned int bytes_done, if (bi == >dev[i].req) break; - PRINTK("end_write_request %llu/%d, count %d, uptodate: %d.\n", + pr_debug("end_write_request %llu/%d, count %d, uptodate: %d.\n", (unsigned long long)sh->sector, i, atomic_read(>count), uptodate); if (i == disks) { @@ -658,7 +658,7 @@ static void error(mddev_t *mddev, mdk_rdev_t *rdev) { char b[BDEVNAME_SIZE]; raid5_conf_t *conf = (raid5_conf_t *) mddev->private; - PRINTK("raid5: error called\n"); + pr_debug("raid5: error called\n"); if (!test_bit(Faulty, >flags)) { set_bit(MD_CHANGE_DEVS, >flags); @@ -929,7 +929,7 @@ static void compute_block(struct stripe_head *sh, int dd_idx) int i, count, disks = sh->disks; void *ptr[MAX_XOR_BLOCKS], *dest, *p; - PRINTK("compute_block, stripe %llu, idx %d\n", + pr_debug("compute_block, stripe %llu, idx %d\n", (unsigned long long)sh->sector, dd_idx); 
dest = page_address(sh->dev[dd_idx].page); @@ -960,7 +960,7 @@ static void compute_parity5(struct stripe_head *sh, int method) void *ptr[MAX_XOR_BLOCKS], *dest; struct bio *chosen; - PRINTK("compute_parity5, stripe %llu, method %d\n", + pr_debug("compute_parity5,
[md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with async_tx
The async_tx api tries to use a dma engine for an operation, but will fall back to an optimized software routine otherwise. Xor support is implemented using the raid5 xor routines. For organizational purposes this routine is moved to a common area. The following fixes are also made: * rename xor_block => xor_blocks, suggested by Adrian Bunk * ensure that xor.o initializes before md.o in the built-in case * checkpatch.pl fixes * mark calibrate_xor_blocks __init, Adrian Bunk Cc: Adrian Bunk <[EMAIL PROTECTED]> Cc: NeilBrown <[EMAIL PROTECTED]> Cc: Herbert Xu <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- crypto/Kconfig |6 ++ crypto/Makefile |6 ++ crypto/xor.c | 156 ++ drivers/md/Kconfig |1 drivers/md/Makefile |4 + drivers/md/md.c |2 - drivers/md/raid5.c | 10 +-- drivers/md/xor.c | 154 - include/linux/raid/xor.h |2 - 9 files changed, 178 insertions(+), 163 deletions(-) diff --git a/crypto/Kconfig b/crypto/Kconfig index 4ca0ab3..b749a1a 100644 --- a/crypto/Kconfig +++ b/crypto/Kconfig @@ -1,4 +1,10 @@ # +# Generic algorithms support +# +config XOR_BLOCKS + tristate + +# # Cryptographic API Configuration # diff --git a/crypto/Makefile b/crypto/Makefile index cce46a1..68e934b 100644 --- a/crypto/Makefile +++ b/crypto/Makefile @@ -50,3 +50,9 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += michael_mic.o obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o + +# +# generic algorithms and the async_tx api +# +obj-$(CONFIG_XOR_BLOCKS) += xor.o + diff --git a/crypto/xor.c b/crypto/xor.c new file mode 100644 index 000..8281ac5 --- /dev/null +++ b/crypto/xor.c @@ -0,0 +1,156 @@ +/* + * xor.c : Multiple Devices driver for Linux + * + * Copyright (C) 1996, 1997, 1998, 1999, 2000, + * Ingo Molnar, Matti Aarnio, Jakub Jelinek, Richard Henderson. + * + * Dispatch optimized RAID-5 checksumming functions. 
+ * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * You should have received a copy of the GNU General Public License + * (for example /usr/src/linux/COPYING); if not, write to the Free + * Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#define BH_TRACE 0 +#include +#include +#include +#include + +/* The xor routines to use. */ +static struct xor_block_template *active_template; + +void +xor_blocks(unsigned int count, unsigned int bytes, void **ptr) +{ + unsigned long *p0, *p1, *p2, *p3, *p4; + + p0 = (unsigned long *) ptr[0]; + p1 = (unsigned long *) ptr[1]; + if (count == 2) { + active_template->do_2(bytes, p0, p1); + return; + } + + p2 = (unsigned long *) ptr[2]; + if (count == 3) { + active_template->do_3(bytes, p0, p1, p2); + return; + } + + p3 = (unsigned long *) ptr[3]; + if (count == 4) { + active_template->do_4(bytes, p0, p1, p2, p3); + return; + } + + p4 = (unsigned long *) ptr[4]; + active_template->do_5(bytes, p0, p1, p2, p3, p4); +} +EXPORT_SYMBOL(xor_blocks); + +/* Set of all registered templates. */ +static struct xor_block_template *template_list; + +#define BENCH_SIZE (PAGE_SIZE) + +static void +do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2) +{ + int speed; + unsigned long now; + int i, count, max; + + tmpl->next = template_list; + template_list = tmpl; + + /* +* Count the number of XORs done during a whole jiffy, and use +* this to calculate the speed of checksumming. We use a 2-page +* allocation to have guaranteed color L1-cache layout. 
+*/ + max = 0; + for (i = 0; i < 5; i++) { + now = jiffies; + count = 0; + while (jiffies == now) { + mb(); /* prevent loop optimzation */ + tmpl->do_2(BENCH_SIZE, b1, b2); + mb(); + count++; + mb(); + } + if (count > max) + max = count; + } + + speed = max * (HZ * BENCH_SIZE / 1024); + tmpl->speed = speed; + + printk(KERN_INFO " %-10s: %5d.%03d MB/sec\n", tmpl->name, + speed / 1000, speed % 1000); +} + +static int __init +calibrate_xor_blocks(void) +{ + void *b1, *b2; + struct xor_block_template *f, *fastest; + + b1 = (void *) __get_free_pages(GFP_KERNEL, 2); + if (!b1) { + printk(KERN_WARNING "xor: Yikes! No memory available.\n"); + return
[md-accel PATCH 02/19] dmaengine: make clients responsible for managing channels
The current implementation assumes that a channel will only be used by one client at a time. In order to enable channel sharing the dmaengine core is changed to a model where clients subscribe to channel-available-events. Instead of tracking how many channels a client wants and how many it has received the core just broadcasts the available channels and lets the clients optionally take a reference. The core learns about the clients' needs at dma_event_callback time. In support of multiple operation types, clients can specify a capability mask to only be notified of channels that satisfy a certain set of capabilities. Changelog: * removed DMA_TX_ARRAY_INIT, no longer needed * dma_client_chan_free -> dma_chan_release: switch to global reference counting only at device unregistration time, before it was also happening at client unregistration time * clients now return dma_state_client to dmaengine (ack, dup, nak) * checkpatch.pl fixes Cc: Chris Leech <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/dma/dmaengine.c | 217 +++-- drivers/dma/ioatdma.c |1 drivers/dma/ioatdma.h |3 - include/linux/dmaengine.h | 58 +++- net/core/dev.c| 112 --- 5 files changed, 224 insertions(+), 167 deletions(-) diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index 379809f..5c5378e 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -37,11 +37,11 @@ * Each device has a channels list, which runs unlocked but is never modified * once the device is registered, it's just setup by the driver. * - * Each client has a channels list, it's only modified under the client->lock - * and in an RCU callback, so it's safe to read under rcu_read_lock(). + * Each client is responsible for keeping track of the channels it uses. See + * the definition of dma_event_callback in dmaengine.h. * * Each device has a kref, which is initialized to 1 when the device is - * registered. A kref_put is done for each class_device registered. 
When the + * registered. A kref_get is done for each class_device registered. When the * class_device is released, the coresponding kref_put is done in the release * method. Every time one of the device's channels is allocated to a client, * a kref_get occurs. When the channel is freed, the coresponding kref_put @@ -51,10 +51,12 @@ * references to finish. * * Each channel has an open-coded implementation of Rusty Russell's "bigref," - * with a kref and a per_cpu local_t. A single reference is set when on an - * ADDED event, and removed with a REMOVE event. Net DMA client takes an - * extra reference per outstanding transaction. The relase function does a - * kref_put on the device. -ChrisL + * with a kref and a per_cpu local_t. A dma_chan_get is called when a client + * signals that it wants to use a channel, and dma_chan_put is called when + * a channel is removed or a client using it is unregesitered. A client can + * take extra references per outstanding transaction, as is the case with + * the NET DMA client. The release function does a kref_put on the device. + * -ChrisL, DanW */ #include @@ -102,8 +104,19 @@ static ssize_t show_bytes_transferred(struct class_device *cd, char *buf) static ssize_t show_in_use(struct class_device *cd, char *buf) { struct dma_chan *chan = container_of(cd, struct dma_chan, class_dev); + int in_use = 0; + + if (unlikely(chan->slow_ref) && + atomic_read(>refcount.refcount) > 1) + in_use = 1; + else { + if (local_read(&(per_cpu_ptr(chan->local, + get_cpu())->refcount)) > 0) + in_use = 1; + put_cpu(); + } - return sprintf(buf, "%d\n", (chan->client ? 
1 : 0)); + return sprintf(buf, "%d\n", in_use); } static struct class_device_attribute dma_class_attrs[] = { @@ -129,42 +142,53 @@ static struct class dma_devclass = { /* --- client and device registration --- */ +#define dma_chan_satisfies_mask(chan, mask) \ + __dma_chan_satisfies_mask((chan), &(mask)) +static int +__dma_chan_satisfies_mask(struct dma_chan *chan, dma_cap_mask_t *want) +{ + dma_cap_mask_t has; + + bitmap_and(has.bits, want->bits, chan->device->cap_mask.bits, + DMA_TX_TYPE_END); + return bitmap_equal(want->bits, has.bits, DMA_TX_TYPE_END); +} + /** - * dma_client_chan_alloc - try to allocate a channel to a client + * dma_client_chan_alloc - try to allocate channels to a client * @client: _client * * Called with dma_list_mutex held. */ -static struct dma_chan *dma_client_chan_alloc(struct dma_client *client) +static void dma_client_chan_alloc(struct dma_client *client) { struct dma_device *device; struct dma_chan *chan; - unsigned long flags; int desc; /* allocated descriptor
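The capability test that __dma_chan_satisfies_mask performs in the patch above (bitmap_and followed by bitmap_equal) amounts to a subset check: a channel qualifies only if it offers every capability the client asked for. A minimal userspace sketch of that idea follows. It assumes a single unsigned long is wide enough for the mask, and every demo_* name is hypothetical rather than part of dmaengine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the DMA_* operation types named in the series. */
enum demo_tx_type {
	DEMO_MEMCPY,
	DEMO_XOR,
	DEMO_ZERO_SUM,
	DEMO_MEMSET,
	DEMO_TX_TYPE_END
};

/* Assumption: one word holds the whole mask; the kernel uses a bitmap. */
typedef unsigned long demo_cap_mask_t;

/* Return the mask bit for one operation type. */
static demo_cap_mask_t demo_cap(enum demo_tx_type t)
{
	assert(t < DEMO_TX_TYPE_END);
	return 1UL << t;
}

/* True when every bit set in 'want' is also set in 'chan_caps'. */
static bool demo_chan_satisfies_mask(demo_cap_mask_t chan_caps,
				     demo_cap_mask_t want)
{
	demo_cap_mask_t has = want & chan_caps;	/* mirrors bitmap_and() */

	return has == want;			/* mirrors bitmap_equal() */
}
```

An empty request mask is satisfied by any channel under this rule, which is worth keeping in mind when a client leaves its mask unset.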
[md-accel PATCH 01/19] dmaengine: refactor dmaengine around dma_async_tx_descriptor
The current dmaengine interface defines multiple routines per operation,
i.e. dma_async_memcpy_buf_to_buf, dma_async_memcpy_buf_to_page etc. Adding
more operation types (xor, crc, etc) to this model would result in an
unmanageable number of method permutations.

	Are we really going to add a set of hooks for each DMA
	engine whizbang feature?
		- Jeff Garzik

The descriptor creation process is refactored using the new common
dma_async_tx_descriptor structure. Instead of per driver
do_<operation>_<dest>_to_<src> methods, drivers integrate
dma_async_tx_descriptor into their private software descriptor and then
define a 'prep' routine per operation. The prep routine allocates a
descriptor and ensures that the tx_set_src, tx_set_dest, tx_submit
routines are valid. Descriptor creation and submission becomes:

	struct dma_device *dev;
	struct dma_chan *chan;
	struct dma_async_tx_descriptor *tx;

	tx = dev->device_prep_dma_<operation>(chan, len, int_flag)
	tx->tx_set_src(dma_addr_t, tx, index /* for multi-source ops */)
	tx->tx_set_dest(dma_addr_t, tx, index)
	tx->tx_submit(tx)

In addition to the refactoring, dma_async_tx_descriptor also lays the
groundwork for defining cross-channel-operation dependencies, and a
callback facility for asynchronous notification of operation completion.
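The prep / set_src / set_dest / submit sequence described above can be mocked in userspace to show the shape of the calls. This is only a sketch under stated assumptions: the demo_* types and functions are hypothetical, addresses are plain pointers cast to uintptr_t rather than dma_addr_t, and "submit" does the copy synchronously with memcpy instead of queueing work on a channel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified analogue of dma_async_tx_descriptor. */
struct demo_tx {
	uintptr_t src;
	uintptr_t dest;
	size_t len;
	void (*tx_set_src)(uintptr_t addr, struct demo_tx *tx, int index);
	void (*tx_set_dest)(uintptr_t addr, struct demo_tx *tx, int index);
	int (*tx_submit)(struct demo_tx *tx);
};

static void demo_set_src(uintptr_t addr, struct demo_tx *tx, int index)
{
	(void)index;	/* only meaningful for multi-source operations */
	tx->src = addr;
}

static void demo_set_dest(uintptr_t addr, struct demo_tx *tx, int index)
{
	(void)index;
	tx->dest = addr;
}

/* Performs the copy immediately and returns a positive cookie. */
static int demo_submit(struct demo_tx *tx)
{
	static int next_cookie = 1;

	memcpy((void *)tx->dest, (const void *)tx->src, tx->len);
	return next_cookie++;
}

/* "prep" routine: sets up a descriptor and guarantees the hooks are valid. */
static struct demo_tx *demo_prep_memcpy(struct demo_tx *slot, size_t len)
{
	assert(slot != NULL);
	memset(slot, 0, sizeof(*slot));
	slot->len = len;
	slot->tx_set_src = demo_set_src;
	slot->tx_set_dest = demo_set_dest;
	slot->tx_submit = demo_submit;
	return slot;
}

/* Client-side view: one prep call, then the three per-descriptor methods. */
static int demo_copy(void *dest, const void *src, size_t len)
{
	struct demo_tx slot;
	struct demo_tx *tx = demo_prep_memcpy(&slot, len);

	tx->tx_set_src((uintptr_t)src, tx, 0);
	tx->tx_set_dest((uintptr_t)dest, tx, 0);
	return tx->tx_submit(tx);
}
```

The point of the layout is that adding a new operation type means adding one prep routine, while the set and submit methods stay attached to the descriptor.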
Changelog: * drop dma mapping methods, suggested by Chris Leech * fix ioat_dma_dependency_added, also caught by Andrew Morton * fix dma_sync_wait, change from Andrew Morton * uninline large functions, change from Andrew Morton * add tx->callback = NULL to dmaengine calls to interoperate with async_tx calls * hookup ioat_tx_submit * convert channel capabilities to a 'cpumask_t like' bitmap * removed DMA_TX_ARRAY_INIT, no longer needed * checkpatch.pl fixes * make set_src, set_dest, and tx_submit descriptor specific methods Cc: Jeff Garzik <[EMAIL PROTECTED]> Cc: Chris Leech <[EMAIL PROTECTED]> Cc: Shannon Nelson <[EMAIL PROTECTED]> Signed-off-by: Dan Williams <[EMAIL PROTECTED]> --- drivers/dma/dmaengine.c | 182 ++ drivers/dma/ioatdma.c | 277 - drivers/dma/ioatdma.h |8 + include/linux/dmaengine.h | 230 +++-- 4 files changed, 455 insertions(+), 242 deletions(-) diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index 322ee29..379809f 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -59,6 +59,7 @@ #include #include +#include #include #include #include @@ -66,6 +67,7 @@ #include #include #include +#include static DEFINE_MUTEX(dma_list_mutex); static LIST_HEAD(dma_device_list); @@ -165,6 +167,24 @@ static struct dma_chan *dma_client_chan_alloc(struct dma_client *client) return NULL; } +enum dma_status dma_sync_wait(struct dma_chan *chan, dma_cookie_t cookie) +{ + enum dma_status status; + unsigned long dma_sync_wait_timeout = jiffies + msecs_to_jiffies(5000); + + dma_async_issue_pending(chan); + do { + status = dma_async_is_tx_complete(chan, cookie, NULL, NULL); + if (time_after_eq(jiffies, dma_sync_wait_timeout)) { + printk(KERN_ERR "dma_sync_wait_timeout!\n"); + return DMA_ERROR; + } + } while (status == DMA_IN_PROGRESS); + + return status; +} +EXPORT_SYMBOL(dma_sync_wait); + /** * dma_chan_cleanup - release a DMA channel's resources * @kref: kernel reference structure that contains the DMA channel device @@ -322,6 +342,25 @@ int 
dma_async_device_register(struct dma_device *device) if (!device) return -ENODEV; + /* validate device routines */ + BUG_ON(dma_has_cap(DMA_MEMCPY, device->cap_mask) && + !device->device_prep_dma_memcpy); + BUG_ON(dma_has_cap(DMA_XOR, device->cap_mask) && + !device->device_prep_dma_xor); + BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) && + !device->device_prep_dma_zero_sum); + BUG_ON(dma_has_cap(DMA_MEMSET, device->cap_mask) && + !device->device_prep_dma_memset); + BUG_ON(dma_has_cap(DMA_ZERO_SUM, device->cap_mask) && + !device->device_prep_dma_interrupt); + + BUG_ON(!device->device_alloc_chan_resources); + BUG_ON(!device->device_free_chan_resources); + BUG_ON(!device->device_dependency_added); + BUG_ON(!device->device_is_tx_complete); + BUG_ON(!device->device_issue_pending); + BUG_ON(!device->dev); + init_completion(>done); kref_init(>refcount); device->dev_id = id++; @@ -397,6 +436,149 @@ void dma_async_device_unregister(struct dma_device *device) } EXPORT_SYMBOL(dma_async_device_unregister); +/** + * dma_async_memcpy_buf_to_buf - offloaded copy between virtual addresses + * @chan: DMA channel to offload copy to + * @dest: destination address (virtual) + * @src: source address (virtual) + *
[md-accel PATCH 00/19] md raid acceleration and the async_tx api
Greetings,

Per Andrew's suggestion this is the md raid5 acceleration patch set
updated with more thorough changelogs to lower the barrier to entry for
reviewers. To get started with the code I would suggest the following
order:
[md-accel PATCH 01/19] dmaengine: refactor dmaengine around dma_async_tx_descriptor
[md-accel PATCH 04/19] async_tx: add the async_tx api
[md-accel PATCH 07/19] md: raid5_run_ops - run stripe operations outside sh->lock
[md-accel PATCH 16/19] dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines

The patch set can be broken down into three main categories:
1/ API (async_tx: patches 1 - 4)
2/ implementation (md changes: patches 5 - 15)
3/ driver (iop-adma: patches 16 - 19)

I have worked with Neil to get approval of the category 2 changes.
However for the category 1 and 3 changes there was no obvious
merge-path/maintainer to work through. I have thus far extrapolated
Neil's comments about 2 out to 1 and 3, Jeff gave some direction on an
early revision about the scalability of the API, and the patch set has
picked up various fixes and suggestions from being in -mm for a few
releases. Please help me ensure that this code is ready for Linus to pull
for 2.6.23.

git://lost.foo-projects.org/~dwillia2/git/iop md-accel-linus

Dan Williams (19):
      dmaengine: refactor dmaengine around dma_async_tx_descriptor
      dmaengine: make clients responsible for managing channels
      xor: make 'xor_blocks' a library routine for use with async_tx
      async_tx: add the async_tx api
      raid5: refactor handle_stripe5 and handle_stripe6 (v2)
      raid5: replace custom debug PRINTKs with standard pr_debug
      md: raid5_run_ops - run stripe operations outside sh->lock
      md: common infrastructure for running operations with raid5_run_ops
      md: handle_stripe5 - add request/completion logic for async write ops
      md: handle_stripe5 - add request/completion logic for async compute ops
      md: handle_stripe5 - add request/completion logic for async check ops
      md: handle_stripe5 - add request/completion logic for async read ops
      md: handle_stripe5 - add request/completion logic for async expand ops
      md: handle_stripe5 - request io processing in raid5_run_ops
      md: remove raid5 compute_block and compute_parity5
      dmaengine: driver for the iop32x, iop33x, and iop13xx raid engines
      iop13xx: surface the iop13xx adma units to the iop-adma driver
      iop3xx: surface the iop3xx DMA and AAU units to the iop-adma driver
      ARM: Add drivers/dma to arch/arm/Kconfig

Administrivia:

This patch set contains three new patches compared to the previous
release they are:
[md-accel PATCH 03/19] xor: make 'xor_blocks' a library routine for use with async_tx
[md-accel PATCH 05/19] raid5: refactor handle_stripe5 and handle_stripe6 (v2)
[md-accel PATCH 06/19] raid5: replace custom debug PRINTKs with standard pr_debug

net/core/dev.c is touched by the following:
[md-accel PATCH 02/19] dmaengine: make clients responsible for managing channels
Re: [-mm patch] remove nobh_{prepare,commit}_write()
On Tue, Jun 26, 2007 at 02:33:35PM -0700, Randy Dunlap wrote:
> On Tue, 26 Jun 2007 14:23:20 -0700 Andrew Morton wrote:
>
> > On Tue, 26 Jun 2007 15:48:58 -0500
> > Dave Kleikamp <[EMAIL PROTECTED]> wrote:
> >
> > > On Tue, 2007-06-26 at 13:32 -0700, Andrew Morton wrote:
> > > > On Fri, 15 Jun 2007 00:15:55 +0200
> > > > Adrian Bunk <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > nobh_{prepare,commit}_write() are no longer used.
> > > >
> > > > wth? What happened to ext2 and ext3 nobh mode? They seem to
> > > > have magically and unchangeloggedly disappeared?
> > >
> > > They were removed with Nick's new aops patches.
> >^secretly
> >
> > That much I worked out for myself. It's kinda staggering that a fairly
> > major feature in two fairly major filesystems got removed without even a
> > mention in the changelog. I don't recall having seen it discussed in email
> > but I obviously missed that bit.
> >
> > Look, I'm one micron from just dropping the whole lot. These changes
> > simply have not received the amount of energy, effort, care, attention and
> > testing which a change of this magnitude requires.
>
> so be sure to discuss that (not the patches themselves so much,
> but the process(es)) at the kernel summit etc

I did of course mention that nobh wasn't converted when sending the
patches. I asked for comments about how much it is used in real world.
Badari was the only one who replied about that but we didn't reach a
conclusion.

I don't know about energy, but I have seen lots of other patches cause a
lot more problems...
Re: Linux Kernel include files
On Jun 22, 2007, at 11:00:38, Adrian Bunk wrote:
> It would certainly help if Joerg would tell what exactly breaks, but
> I spot one likely problem in include/asm-i386/types.h:
>
> #if defined(__GNUC__) && !defined(__STRICT_ANSI__)
> typedef __signed__ long long __s64;
> typedef unsigned long long __u64;
> #endif
>
> It might make sense to remove the #if and simply require that a C
> compiler under Linux must know about the C99 "long long"?

Gah, this particular topic and a few other similar header-compatibility
ones show up once a month on LKML; I should probably just make a patch to
fix all the types.h files and be done with it. The proper solution is
this:

# if __STDC_VERSION__ >= 199901L
typedef signed long long __s64;
typedef unsigned long long __u64;
# elif defined(__GNUC__)
__extension__ typedef signed long long __s64;
__extension__ typedef unsigned long long __u64;
# else
# error "Your compiler doesn't support long long (IOW: It sucks). Please get a new one"
# endif

That way if you have any kind of vaguely-long-long-compatible compiler
then it will work, and otherwise you'll get a nice useful error message.
It also makes sure that GCC doesn't spew warnings/errors when in
c89-pedantic mode. The "__extension__" keyword is designed for use in
implementation header files which want to use GCC-isms unconditionally.

Cheers,
Kyle Moffett
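For anyone who wants to try the three-way selection outside the kernel tree, here is a small standalone sketch. It uses hypothetical demo_s64/demo_u64 names instead of __s64/__u64, compares against 199901L (the value C99 defines for __STDC_VERSION__), and adds a defined() guard plus a compile-time width check that are not part of the proposal above.

```c
#include <assert.h>
#include <limits.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L
/* C99 or later: long long is part of the language. */
typedef signed long long demo_s64;
typedef unsigned long long demo_u64;
#elif defined(__GNUC__)
/* GCC in C89 mode: __extension__ silences the -pedantic diagnostic. */
__extension__ typedef signed long long demo_s64;
__extension__ typedef unsigned long long demo_u64;
#else
#error "compiler has no usable long long"
#endif

/* Negative array size, hence a compile error, unless the types are 64 bits. */
typedef char demo_s64_is_64_bits[(sizeof(demo_s64) * CHAR_BIT == 64) ? 1 : -1];
typedef char demo_u64_is_64_bits[(sizeof(demo_u64) * CHAR_BIT == 64) ? 1 : -1];

/* Returns a value with only bit 63 set, which needs the full width. */
static demo_u64 demo_high_bit(void)
{
	demo_u64 v = 1;

	v <<= 63;
	assert(v != 0);
	return v;
}
```

Building this once with -std=c99 and once with -std=c89 -pedantic should exercise the first two branches on GCC-compatible compilers.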
Re: [PATCH] RFC: have tcp_recvmsg() check kthread_should_stop() and treat it as if it were signalled
On 6/27/07, Satyam Sharma <[EMAIL PROTECTED]> wrote:
[...]
On 6/26/07, Oleg Nesterov <[EMAIL PROTECTED]> wrote:
> On 06/26, Satyam Sharma wrote:
[...]
> > So could we have signals in _addition_ to kthread_stop_info and change
> > kthread_should_stop() to check for both:
> >
> > kthread_stop_info.k == current && signal_pending(current)
>
> No, this can't work in general. Some kthreads do flush_signals/dequeue_signal,
> so TIF_SIGPENDING can be lost anyway.

Yup, I had thought of precisely this issue yesterday as well. The mental
note I made to myself was that the force_sig(SIGKILL) and
wake_up_process() in kthread_stop() must be atomic so that the following
race is not possible:

Hmm, the issue seems to have more to do with the ordering of
flush_signals() w.r.t. checking kthread_should_stop() in the kthread's
code. I thought about how to tackle this, but there's no easy way to make
the stuff atomic like I thought earlier.

The problem, like you mentioned, is if the target kthread proactively
flushes its signals by hand *before* checking kthread_should_stop(). The
only way out seems to be to simply outlaw flush_signals() in kthreads (or
anything to do with signals), but that would be impossible to enforce ...

Satyam
Re: [PATCH try #2] security: Convert LSM into a static interface
On Jun 26, 2007, at 20:57:53, Crispin Cowan wrote: Kyle Moffett wrote: Let's go over the differences between "my fs" and "my LSM", and the similarities between "my VM" and "my LSM": Filesystems don't get hooked from virtually every userspace-initiated operation, whereas both VMs and LSMs do. VMs and LSMs attach anonymous state data to a large percentage of the allocated objects in the system, whereas filesystems allocate their own independent datastructure and use that. Would you want to "rmmod ext3" and then "modprobe ext2" while you have an ext2-as-ext3 filesystem *mounted*??? If you want a good analogy, that's a better one than the "my fs can't be a module" crap. This whole discussion boils down to 2 points: 1) As currently implemented, no LSM may be safely rmmod-ed 2) Someone has submitted a patch which fixes that problem (you can't rmmod them at all, so no crashes) If you really want to do modular LSMs, then you need to submit a patch which fixes all the race conditions in LSM removal *without* adding much extra overhead. I'm sure if your solutions works then everyone will be much more open to modular LSMs. Hmmm. You seem to be mostly concerned with safely rmmod'ing modules. In contrast, my main concern with the proposed patch is that it removes the ability to *insert* a module. You must have missed this in my emails: 2) When you "modprobe my_custom_security_module", how exactly do you expect that all the processes, files, shared memory segments, file descriptors, sockets, SYSV mutexes, packets, etc will get appropriate security pointers? This isn't even solvable the same way the "rmmod" problem is, since most of that isn't even accessible without iterating over the ENTIRE dcache, icache, every process, every process' file-descriptors, every socket, every unix socket, every anonymous socket, every SYSV shm object, every currently-in-process packet. I'd argue that security-module-insertion is actually MORE complicated than removal. 
Here's one example: TOMOYO cares about the process execution tree, but you can't penalize the no-LSM case by a percent or two to add that kind of data. When TOMOYO is loaded, it wants to do access control based on process execution trees for which data DOES NOT EXIST!!! Not only that, but the processes which originally ran the one you care about (and which you'd need to recreate that data) may have exited anywhere from seconds to years before. It is fundamentally IMPOSSIBLE to recreate that data, even if you could solve the problems of how to do it while the system is running without racing with existing process operations. Imagine a process which hasn't had security data tagged to it yet which opens thousands of FIFOs per second, waits for your tagging code to assign security data to them in the filesystem, and then removes them; if you did it right you could prevent the code from EVER completely tagging every object (even assuming you could recreate enough information). Such a need to add extra security data to multiple classes of objects is *fundamental* to any security module (isn't that the whole point?) As such, you can't just "modprobe" one and expect it to work. That's like mounting an ext2 filesystem, and then later trying to "modprobe ext3" and dynamically switch to the ext3 code and enable journalling all at once ON THE MOUNTED FILESYSTEM!!! Sure, theoretically it *could* be done, but the code complexity is hardly worth it (plus nobody has yet even tried posting patches to make it happen). Consider the use case of joe admin who is running enterprise- supported RHEL or SLES, and wants to try some newfangled LSM FooSecureMod thingie. So he grabs a machine, config's selinux=0 or apparmor=0 and loads his own module on boot, and plays with it. He even likes FooSecure, better than SELinux or AppArmor, and wants to roll it out across his data center. Flatly impossible. 
You simply cannot "load" a security module and hope to provide any useful information about the system's present state. If you want comprehensive security it has to be there before a single byte of userspace code is executed. SELinux sort-of handles unlabelled objects by treating them with a small set of initial "types", but that's only enough to get the system up enough to actually relabel objects with type-transitions (after init loads the selinux policy it reexecs itself, before doing anything else). So to solve the problem James & Kyle are concerned with, and preserve user choice, how about we *only* remove the ability to rmmod, and leave in place the ability to modprobe? Or even easier, LSMs that don't want to be unloaded can just block rmmod, and simple LSMs that can be unloaded safely can permit it. An LSM simple enough to unload would be too simple for anybody to want to load in the first place (even capabilities can have this
Re: i386 boot fail, EIP in __change_page_attr:166
2007/6/26, Chuck Ebbert <[EMAIL PROTECTED]>: On 06/25/2007 09:11 PM, dave young wrote: > Hi, > > 2007/6/25, Chuck Ebbert <[EMAIL PROTECTED]>: >> On 06/24/2007 11:43 PM, dave young wrote: >> > Hi, >> > I reconfig my kernel, boot and oops, EIP in __change_page_attr:166, I >> > tried 2.6.22-rc4-mm2 and 2.6.22-rc5 , same result. >> > >> > Anyone has some clues? >> > >> > here is my config file: >> >> Where are the oops messages? > Attached please find the screenshots. sorry for my phone camera resolution. > screen1.png : vga=ask select mode 6 > screen2.png : normal 80x25 console That's 2.6.22-rc4-mm2 which I don't have. netsc520 is doing iounmap() of an area it did ioremap_nocache() on earlier, because it has now failed to find a device. Why it went BUG() I have no idea. Hi, maybe some config option cause this issue, here is my current working-ok config file: # # Automatically generated make config: don't edit # Linux kernel version: 2.6.22-rc5 # Tue Jun 19 16:10:01 2007 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_QUICKLIST=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y # CONFIG_IPC_NS is not set CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set # CONFIG_UTS_NS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=14 # CONFIG_CPUSETS is not set 
CONFIG_SYSFS_DEPRECATED=y # CONFIG_RELAY is not set CONFIG_BLK_DEV_INITRD=y CONFIG_INITRAMFS_SOURCE="" # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not set CONFIG_RT_MUTEXES=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y # CONFIG_MODVERSIONS is not set # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_BLOCK=y CONFIG_LBD=y # CONFIG_BLK_DEV_IO_TRACE is not set CONFIG_LSF=y # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="anticipatory" # # Processor type and features # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ=y CONFIG_HIGH_RES_TIMERS=y CONFIG_SMP=y CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_PARAVIRT is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set # CONFIG_MPENTIUMM is not set # CONFIG_MCORE2 is not set CONFIG_MPENTIUM4=y # CONFIG_MK6 is not set # CONFIG_MK7 is not set # 
CONFIG_MK8 is not set # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 is not set # CONFIG_MVIAC7 is not set CONFIG_X86_GENERIC=y CONFIG_X86_CMPXCHG=y CONFIG_X86_L1_CACHE_SHIFT=7 CONFIG_X86_XADD=y CONFIG_RWSEM_XCHGADD_ALGORITHM=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y CONFIG_X86_CMOV=y CONFIG_X86_MINIMUM_CPU_MODEL=4 CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_NR_CPUS=8 # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set
Re: i386 boot fail, EIP in __change_page_attr:166
Hi, 2007/6/26, Jeremy Fitzhardinge <[EMAIL PROTECTED]>: dave young wrote: > Hi, > I reconfig my kernel, boot and oops, EIP in __change_page_attr:166, I > tried 2.6.22-rc4-mm2 and 2.6.22-rc5 , same result. oops output? dmesg output? Hardware config? How much memory? How big is your kernel? J kernel oops message only is captured by camera, please find my screenshot images. memory size 1G kernel size: 3.9M lspci -vv output: 00:00.0 Host bridge: Intel Corporation 945G/GZ/P/PL Express Memory Controller Hub (rev 02) Subsystem: Dell Unknown device 01d2 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [88] #0d [] Capabilities: [80] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- Capabilities: [90] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Address: Data: Capabilities: [a0] Express Root Port (Slot+) IRQ 0 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag- Device: Latency L0s <64ns, L1 <1us Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 128 bytes Link: Supported Speed 2.5Gb/s, Width x16, ASPM L0s, Port 2 Link: Latency L0s <256ns, L1 <4us Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch- Link: Speed 2.5Gb/s, Width x16 Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug- Surpise- Slot: Number 7680, PowerLimit 75.00 Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- Slot: AttnInd Off, PwrInd On, Power- Root: Correctable- Non-Fatal- Fatal- PME- 00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High Definition Audio Controller (rev 01) Subsystem: Dell Unknown device 01d2 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: 
Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [40] Express Root Port (Slot+) IRQ 0 Device: Supported: MaxPayload 128 bytes, PhantFunc 0, ExtTag- Device: Latency L0s unlimited, L1 unlimited Device: Errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+ Device: RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- Device: MaxPayload 128 bytes, MaxReadReq 128 bytes Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 1 Link: Latency L0s <256ns, L1 <4us Link: ASPM Disabled RCB 64 bytes CommClk+ ExtSynch- Link: Speed 2.5Gb/s, Width x0 Slot: AtnBtn- PwrCtrl- MRL- AtnInd- PwrInd- HotPlug+ Surpise+ Slot: Number 2, PowerLimit 10.00 Slot: Enabled AtnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- Slot: AttnInd Unknown, PwrInd Unknown, Power- Root: Correctable- Non-Fatal- Fatal- PME- Capabilities: [80] Message Signalled Interrupts: 64bit- Queue=0/0 Enable- Address: Data: Capabilities: [90] #0d [] Capabilities: [a0] Power Management version 2 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+) Status: D0 PME-Enable- DSel=0 DScale=0 PME- 00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01) (prog-if 00 [UHCI]) Subsystem: Dell Unknown device 01d2 Control: I/O+ Mem- BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- Reset- FastB2B- Capabilities: [50] #0d [] 00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01) Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- TAbort- SERR- http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[AppArmor 31/44] Add d_namespace_path() to compute namespace relative pathnames
In AppArmor, we are interested in pathnames relative to the namespace root.
This is the same as d_path() except for the root where the search ends. Add
a function for computing the namespace-relative path.

Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]>
Signed-off-by: John Johansen <[EMAIL PROTECTED]>

---
 fs/dcache.c            |    6 +++---
 fs/namespace.c         |   27 +++
 include/linux/dcache.h |    2 ++
 include/linux/mount.h  |    2 ++
 4 files changed, 34 insertions(+), 3 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1779,9 +1779,9 @@ shouldnt_be_hashed:
  *
  * Returns the buffer or an error code.
  */
-static char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
-		      struct dentry *root, struct vfsmount *rootmnt,
-		      char *buffer, int buflen, int fail_deleted)
+char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt,
+	       struct dentry *root, struct vfsmount *rootmnt,
+	       char *buffer, int buflen, int fail_deleted)
 {
 	int namelen, is_slash, vfsmount_locked = 0;
 
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1868,3 +1868,30 @@ void __put_mnt_ns(struct mnt_namespace *
 	release_mounts(&umount_list);
 	kfree(ns);
 }
+
+char *d_namespace_path(struct dentry *dentry, struct vfsmount *vfsmnt,
+		       char *buf, int buflen)
+{
+	struct vfsmount *rootmnt, *nsrootmnt = NULL;
+	struct dentry *root = NULL;
+	char *res;
+
+	read_lock(&current->fs->lock);
+	rootmnt = mntget(current->fs->rootmnt);
+	read_unlock(&current->fs->lock);
+	spin_lock(&vfsmount_lock);
+	if (rootmnt->mnt_ns)
+		nsrootmnt = mntget(rootmnt->mnt_ns->root);
+	spin_unlock(&vfsmount_lock);
+	mntput(rootmnt);
+	if (nsrootmnt)
+		root = dget(nsrootmnt->mnt_root);
+	res = __d_path(dentry, vfsmnt, root, nsrootmnt, buf, buflen, 1);
+	dput(root);
+	mntput(nsrootmnt);
+	/* Prevent empty path for lazily unmounted filesystems. */
+	if (!IS_ERR(res) && *res == '\0')
+		*--res = '.';
+	return res;
+}
+EXPORT_SYMBOL(d_namespace_path);
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -299,6 +299,8 @@ extern int d_validate(struct dentry *, s
  */
 extern char *dynamic_dname(struct dentry *, char *, int, const char *, ...);
 
+extern char *__d_path(struct dentry *, struct vfsmount *, struct dentry *,
+		      struct vfsmount *, char *, int, int);
 extern char * d_path(struct dentry *, struct vfsmount *, char *, int);
 
 /* Allocation counts.. */
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -103,5 +103,7 @@ extern void shrink_submounts(struct vfsm
 extern spinlock_t vfsmount_lock;
 extern dev_t name_to_dev_t(char *name);
 
+extern char *d_namespace_path(struct dentry *, struct vfsmount *, char *, int);
+
 #endif
 #endif /* _LINUX_MOUNT_H */
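The `*--res = '.'` step at the end of d_namespace_path in the patch above relies on the path being assembled right-to-left at the tail of the caller's buffer, so there is normally room in front of the result to prepend one more character. A minimal userspace sketch of that idea follows. The demo_* helpers are hypothetical and, unlike the patch, the fixup checks explicitly that a byte is available before the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Place 'path' at the very end of buf, NUL-terminated, and return a
 * pointer to its first character (NULL if it does not fit).
 */
static char *demo_build_tail(char *buf, int buflen, const char *path)
{
	char *res;
	size_t n;

	assert(buf != NULL && path != NULL);
	if (buflen < 1)
		return NULL;
	res = buf + buflen;
	n = strlen(path);
	*--res = '\0';
	if ((size_t)(res - buf) < n)
		return NULL;
	res -= n;
	memcpy(res, path, n);
	return res;
}

/* Turn an empty result into "." by stepping one byte back into the buffer. */
static char *demo_fixup_empty(char *res, char *buf)
{
	if (res && *res == '\0' && res > buf)
		*--res = '.';
	return res;
}
```

Because the returned pointer sits inside the buffer rather than at its start, callers must use the return value and not the buffer address they passed in.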
[AppArmor 24/44] Pass struct vfsmount to the inode_getxattr LSM hook
This is needed for computing pathnames in the AppArmor LSM. Signed-off-by: Tony Jones <[EMAIL PROTECTED]> Signed-off-by: Andreas Gruenbacher <[EMAIL PROTECTED]> Signed-off-by: John Johansen <[EMAIL PROTECTED]> --- fs/xattr.c |2 +- include/linux/security.h | 13 - security/dummy.c |3 ++- security/selinux/hooks.c |3 ++- 4 files changed, 13 insertions(+), 8 deletions(-) --- a/fs/xattr.c +++ b/fs/xattr.c @@ -116,7 +116,7 @@ vfs_getxattr(struct dentry *dentry, stru if (error) return error; - error = security_inode_getxattr(dentry, name); + error = security_inode_getxattr(dentry, mnt, name); if (error) return error; --- a/include/linux/security.h +++ b/include/linux/security.h @@ -391,7 +391,7 @@ struct request_sock; * @value identified by @name for @dentry and @mnt. * @inode_getxattr: * Check permission before obtaining the extended attributes - * identified by @name for @dentry. + * identified by @name for @dentry and @mnt. * Return 0 if permission is granted. * @inode_listxattr: * Check permission before obtaining the list of extended attribute @@ -1248,7 +1248,8 @@ struct security_operations { struct vfsmount *mnt, char *name, void *value, size_t size, int flags); - int (*inode_getxattr) (struct dentry *dentry, char *name); + int (*inode_getxattr) (struct dentry *dentry, struct vfsmount *mnt, + char *name); int (*inode_listxattr) (struct dentry *dentry); int (*inode_removexattr) (struct dentry *dentry, char *name); const char *(*inode_xattr_getsuffix) (void); @@ -1782,11 +1783,12 @@ static inline void security_inode_post_s security_ops->inode_post_setxattr (dentry, mnt, name, value, size, flags); } -static inline int security_inode_getxattr (struct dentry *dentry, char *name) +static inline int security_inode_getxattr (struct dentry *dentry, + struct vfsmount *mnt, char *name) { if (unlikely (IS_PRIVATE (dentry->d_inode))) return 0; - return security_ops->inode_getxattr (dentry, name); + return security_ops->inode_getxattr (dentry, mnt, name); } static inline int 
security_inode_listxattr (struct dentry *dentry) @@ -2487,7 +2489,8 @@ static inline void security_inode_post_s int flags) { } -static inline int security_inode_getxattr (struct dentry *dentry, char *name) +static inline int security_inode_getxattr (struct dentry *dentry, + struct vfsmount *mnt, char *name) { return 0; } --- a/security/dummy.c +++ b/security/dummy.c @@ -368,7 +368,8 @@ static void dummy_inode_post_setxattr (s { } -static int dummy_inode_getxattr (struct dentry *dentry, char *name) +static int dummy_inode_getxattr (struct dentry *dentry, + struct vfsmount *mnt, char *name) { return 0; } --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2393,7 +2393,8 @@ static void selinux_inode_post_setxattr( return; } -static int selinux_inode_getxattr (struct dentry *dentry, char *name) +static int selinux_inode_getxattr (struct dentry *dentry, struct vfsmount *mnt, + char *name) { return dentry_has_perm(current, NULL, dentry, FILE__GETATTR); } -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
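Editorial sketch, not part of the patch: the change above threads a `struct vfsmount *` through the `inode_getxattr` hook so that a path-based LSM can see where the file is mounted. The userspace model below only illustrates that dispatch shape. The structure layouts, the `path_inode_getxattr` hook, the `/secret` mount point and the `-13` return value are all assumptions made for illustration; the real kernel types are far larger.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified placeholders for the kernel structures (assumptions). */
struct vfsmount { const char *mountpoint; };
struct dentry { const char *name; int is_private; };

/* Hook table with the new three-argument signature from the patch. */
struct security_operations {
	int (*inode_getxattr)(struct dentry *dentry, struct vfsmount *mnt,
			      char *name);
};

/* Default hook: always grants permission, like security/dummy.c. */
static int dummy_inode_getxattr(struct dentry *dentry, struct vfsmount *mnt,
				char *name)
{
	(void)dentry; (void)mnt; (void)name;
	return 0;
}

/* Hypothetical path-based hook: refuse anything mounted under /secret. */
static int path_inode_getxattr(struct dentry *dentry, struct vfsmount *mnt,
			       char *name)
{
	(void)dentry; (void)name;
	if (mnt && strcmp(mnt->mountpoint, "/secret") == 0)
		return -13;	/* mimics -EACCES */
	return 0;
}

static struct security_operations dummy_ops = { dummy_inode_getxattr };
static struct security_operations path_ops = { path_inode_getxattr };
static struct security_operations *security_ops = &dummy_ops;

/* Wrapper in the style of the patched inline: private inodes bypass the hook. */
static int security_inode_getxattr(struct dentry *dentry,
				   struct vfsmount *mnt, char *name)
{
	if (dentry->is_private)
		return 0;
	return security_ops->inode_getxattr(dentry, mnt, name);
}
```

Without the `mnt` argument the hypothetical hook would have only the dentry to work with, which is why the patch description says the extra parameter is needed for computing pathnames.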
Re: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
On Tue, 26 Jun 2007 17:46:13 -0700 "Wessel, Jason" <[EMAIL PROTECTED]> wrote:

> > > 		}
> > > 	}
> >
> > Everything went quiet?
> >
> > If this patch has been tested and fixes the bug, can you
> > please send a version which is ready for merging?  (ie: add a
> > suitable description of what it does).
>
> I mailed Jarek separately.
>
> I had tested the patch with netconsole and kgdb and it does in fact fix
> the problem that was reported.

OK, thanks.  Please don't mail people separately!

I queued this up with a null changelog for now.
Re: [PATCH try #2] security: Convert LSM into a static interface
Kyle Moffett wrote: > Let's go over the differences between "my fs" and "my LSM", and the > similarities between "my VM" and "my LSM": Filesystems don't get > hooked from virtually every userspace-initiated operation, whereas > both VMs and LSMs do. VMs and LSMs attach anonymous state data to a > large percentage of the allocated objects in the system, whereas > filesystems allocate their own independent datastructure and use > that. Would you want to "rmmod ext3" and then "modprobe ext2" while > you have an ext2-as-ext3 filesystem *mounted*??? If you want a good > analogy, that's a better one than the "my fs can't be a module" crap. > > This whole discussion boils down to 2 points: > 1) As currently implemented, no LSM may be safely rmmod-ed > 2) Someone has submitted a patch which fixes that problem (you can't > rmmod them at all, so no crashes) > > If you really want to do modular LSMs, then you need to submit a patch > which fixes all the race conditions in LSM removal *without* adding > much extra overhead. I'm sure if your solutions works then everyone > will be much more open to modular LSMs. I said this before: Hmmm. You seem to be mostly concerned with safely rmmod'ing modules. In contrast, my main concern with the proposed patch is that it removes the ability to *insert* a module. Consider the use case of joe admin who is running enterprise-supported RHEL or SLES, and wants to try some newfangled LSM FooSecureMod thingie. So he grabs a machine, config's selinux=0 or apparmor=0 and loads his own module on boot, and plays with it. He even likes FooSecure, better than SELinux or AppArmor, and wants to roll it out across his data center. Without James's patch, he can do that, and at worst has a tainted kernel. RH or Novell or his favorite distro vendor can fix that with a wave of the hand and bless FooSecure as a module. 
With James's patch, he has to patch his kernels, and then enterprise
support is hopeless, to say nothing of the barrier to entry that "patch
and rebuild kernel" is more than many admins are willing to do.

So to solve the problem James & Kyle are concerned with, and preserve
user choice, how about we *only* remove the ability to rmmod, and leave
in place the ability to modprobe? Or even easier, LSMs that don't want
to be unloaded can just block rmmod, and simple LSMs that can be
unloaded safely can permit it.

Crispin

--
Crispin Cowan, Ph.D.               http://crispincowan.com/~crispin/
Director of Software Engineering   http://novell.com
AppArmor Chat: irc.oftc.net/#apparmor
Re: [PATCH 16/16] fix handling of integer constant expressions
Al Viro wrote:

>>> Hopefully correct handling of integer constant expressions.  Please,
>>> review.
>> Am I invoking sparse wrongly?  ./sparse -W -Wall doesn't diagnose
>> the following TU, for example.
>>
>> extern int a;
>> extern int as1[(a = 2)];
>
> sparse simply doesn't check that.  We don't have anything resembling
> support of VLA.

If it did support VLAs it would point out that this is
a constraint violation.  VLAs must have block or function
prototype scope.

--
Derek M. Jones                  tel: +44 (0) 1252 520 667
Knowledge Software Ltd          mailto:[EMAIL PROTECTED]
Applications Standards Conformance Testing    http://www.knosof.co.uk
Re: Problems with fb console [was Re: 2.6.12-rc4-mm2]
On Wed, 27 Jun 2007 02:35:27 +0200 "J.A. Magallón" <[EMAIL PROTECTED]> wrote: > On Mon, 16 May 2005 02:13:02 -0700, Andrew Morton <[EMAIL PROTECTED]> wrote: > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc4/2.6.12-rc4-mm2/ > > > > > > Hi... > > I have a (stupid, I suppose) problem with framebuffer console. > I have builtin VESAFB in this kernel, so: > > werewolf:/boot# grep _FB config-2.6.21-jam09 | grep =y > CONFIG_FB=y > CONFIG_FB_CFB_FILLRECT=y > CONFIG_FB_CFB_COPYAREA=y > CONFIG_FB_CFB_IMAGEBLIT=y > CONFIG_FB_DEFERRED_IO=y > CONFIG_FB_MODE_HELPERS=y > CONFIG_FB_VESA=y > werewolf:/boot# grep CONSO config-2.6.21-jam09 > # CONFIG_NETCONSOLE is not set > CONFIG_VT_CONSOLE=y > CONFIG_HW_CONSOLE=y > # CONFIG_VT_HW_CONSOLE_BINDING is not set > CONFIG_VGA_CONSOLE=y > CONFIG_DUMMY_CONSOLE=y > CONFIG_FRAMEBUFFER_CONSOLE=y > # CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set > # CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set > > I put this line in grub's menu.lst: > > kernel /boot/vmlinuz video=vesafb:mtrr,ywrap vga=0x31A ro root=/dev/sdc1 > > (tried both with hex and decimal). > > but grub keeps telling me it can't set that video mode, and I have no > /dev/fb0 device to try with fbset. I have a '29 fb' line in /proc/devices. > > Any ideas about why the device is missing ? udev is 113... > I have followed al the info I could get (linux/Documentation/fb/, Google ;) ) > and all say that what I'm doing should work. What am I doing wrong ? > Methinks that'll be git-newsetup changes? - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: NDAs - ANY KNOWN RULES?
On Tuesday, 26.06.2007 at 20:24 -0400, Daniel Barkalow wrote:
> On Wed, 27 Jun 2007, hermann pitton wrote:
>
> > Hi,
> >
> > such stuff causes a lot of troubles since long.
> >
> > Are there any rules, or can everybody go on as some sort of freelancer
> > exclusively on such? I don't like it!
>
> http://www.linux-foundation.org/en/NDA_program
>
> In short, the Linux Foundation can negotiate a reasonable NDA for you to
> sign, and they may be able to show you relevant documents as a freelancer
> under a reasonable and standardized contract.
>
> 	-Daniel
> *This .sig left intentionally blank*

Thanks for your explanations, but I know for sure it doesn't work.

Cheers,
Hermann
RE: [PATCH] Re: [2.6.21.1] soft lockup when removing netconsole module
> -Original Message- > From: Andrew Morton [mailto:[EMAIL PROTECTED] > > > > Signed-off-by: Jarek Poplawski <[EMAIL PROTECTED]> > > > > --- > > > > diff -Nurp 2.6.21-/net/core/netpoll.c 2.6.21/net/core/netpoll.c > > --- 2.6.21-/net/core/netpoll.c 2007-04-26 > 15:08:32.0 +0200 > > +++ 2.6.21/net/core/netpoll.c 2007-06-12 > 21:05:23.0 +0200 > > @@ -73,7 +73,8 @@ static void queue_process(struct work_st > > netif_tx_unlock(dev); > > local_irq_restore(flags); > > > > - schedule_delayed_work(>tx_work, HZ/10); > > + if (atomic_read(>refcnt)) > > + > schedule_delayed_work(>tx_work, HZ/10); > > return; > > } > > netif_tx_unlock(dev); > > @@ -780,9 +781,15 @@ void netpoll_cleanup(struct netpoll *np) > > if (atomic_dec_and_test(>refcnt)) { > > skb_queue_purge(>arp_tx); > > skb_queue_purge(>txq); > > - > cancel_rearming_delayed_work(>tx_work); > > + cancel_delayed_work(>tx_work); > > flush_scheduled_work(); > > > > + /* clean after last, unfinished work */ > > + if (!skb_queue_empty(>txq)) { > > + struct sk_buff *skb; > > + skb = > __skb_dequeue(>txq); > > + kfree_skb(skb); > > + } > > kfree(npinfo); > > } > > } > > Everything went quiet? > > If this patch has been tested and fixes the bug, can you > please send a version which is ready for merging? (ie: add a > suitable description of what it does). > > I mailed Jarek separately. I had tested the patch with netconsole and kgdb and it does in fact fix the problem that was reported. Jason. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 16/16] fix handling of integer constant expressions
On Wed, Jun 27, 2007 at 01:29:59AM +0100, Derek M Jones wrote:
> Al Viro wrote:
>
> >>>Hopefully correct handling of integer constant expressions.  Please,
> >>>review.
> >>Am I invoking sparse wrongly?  ./sparse -W -Wall doesn't diagnose
> >>the following TU, for example.
> >>
> >>extern int a;
> >>extern int as1[(a = 2)];
> >
> >sparse simply doesn't check that.  We don't have anything resembling
> >support of VLA.
>
> If it did support VLAs it would point out that this is
> a constraint violation.  VLAs must have block or function
> prototype scope.

I know.  It's just that "let's do something about array size checks" triggers
"yeah, but the poor sod who does that will have to sort VLAs *and* gcc
extensions around VLAs out", which is not a nice thought and so far nobody
had touched that area at all.
Re: [PATCH 16/16] fix handling of integer constant expressions
On Tue, Jun 26, 2007 at 05:25:06PM -0700, Linus Torvalds wrote:
>
>
> On Wed, 27 Jun 2007, Al Viro wrote:
> > >
> > > extern int a;
> > > extern int as1[(a = 2)];
> >
> > sparse simply doesn't check that.  We don't have anything resembling
> > support of VLA.
>
> Well, the above has two bugs that sparse could notice _independently_ of
> variable-sized arrays:
>  - assignment outside of a function
>  - variable size array that isn't an automatic variable

Right; what I'm saying is that we don't do any checks on array sizes at all,
mostly since nobody is brave enough to deal with VLAs (which we'll have to
do if we start doing that).

> (strictly speaking, that's not even a variable size - it's a constant 2,
> just with a non-constant expression - maybe you misread the "=" as an
> "==")

With == it would be a different bug ;-)

BTW, VLA can be not just auto variable - it can be used in derivation of
such (i.e. you can say int (*p)[n], just not for anything not in block
or prototype scope).  And $DEITY help us[1] when ({...}) comes into the
game, since it allows leaking types out of the scope they'd been declared
in...

[1] or gcc - it gets an ICE galore in that class of testcases
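Editorial sketch of the `int (*p)[n]` remark above: a variable length array type can appear in the derivation of another type, here a pointer to an n-element row, without the object itself being an array with automatic storage. The function name `row_sum` and its parameters are made up for illustration, and the code assumes a compiler with C99 VLA support.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sum one row of a rows x n matrix.  Both the parameter type and the
 * local pointer are derived from the VLA type int[n]; neither is an
 * array object of automatic storage duration.
 */
static int row_sum(int rows, int n, int data[rows][n], int row)
{
	int (*p)[n] = &data[row];	/* pointer to a VLA type */
	int sum = 0;

	for (int i = 0; i < n; i++)
		sum += (*p)[i];
	return sum;
}
```

Both declarations sit in block or prototype scope, which is the restriction mentioned in the message; the same declarator at file scope would be a constraint violation.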
Problems with fb console [was Re: 2.6.12-rc4-mm2]
On Mon, 16 May 2005 02:13:02 -0700, Andrew Morton <[EMAIL PROTECTED]> wrote:

>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc4/2.6.12-rc4-mm2/
>
>

Hi...

I have a (stupid, I suppose) problem with framebuffer console.
I have builtin VESAFB in this kernel, so:

werewolf:/boot# grep _FB config-2.6.21-jam09 | grep =y
CONFIG_FB=y
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
CONFIG_FB_DEFERRED_IO=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_VESA=y
werewolf:/boot# grep CONSO config-2.6.21-jam09
# CONFIG_NETCONSOLE is not set
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
CONFIG_VGA_CONSOLE=y
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
# CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY is not set
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set

I put this line in grub's menu.lst:

kernel /boot/vmlinuz video=vesafb:mtrr,ywrap vga=0x31A ro root=/dev/sdc1

(tried both with hex and decimal).

but grub keeps telling me it can't set that video mode, and I have no
/dev/fb0 device to try with fbset. I have a '29 fb' line in /proc/devices.

Any ideas about why the device is missing ? udev is 113...
I have followed all the info I could get (linux/Documentation/fb/, Google ;) )
and all say that what I'm doing should work. What am I doing wrong ?

TIA

--
J.A. Magallon \ Software is like sex: \ It's better when it's free
Mandriva Linux release 2008.0 (Cooker) for i586
Linux 2.6.21-jam09 (gcc 4.1.2 20070302 (4.1.2-1mdv2007.1)) SMP PREEMPT
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
Re: [PATCH 0/17] Add Texas Instruments OMAP LCD driver-v2
On Tue, 26 Jun 2007 18:00:22 +0530 "Trilok Soni" <[EMAIL PROTECTED]> wrote: > This patch series contains Texas Instruments OMAP LCD framebuffer > drivers. This driver is divided into > > * main omapfb driver, which handles most common functions across > processor series, like platform driver registration, ioctl handling, > much like fb skeleton. > > This driver then gets through the callback based on the > internal/external lcd controller and panel registered to it based on > processor and board. Internal/External LCD controller as per lcd panel > data registration is being done in separate files and so does patches. > > Overall this patches contains framebuffer driver for TI OMAP1 > (OMAP1510/1610/1710) and OMAP2 (OMAP2420/2430) and external > controllers used in Nokia Internal Tablets (N770/N800). > > These drivers were very well tested on OMAP GIT [1] tree from long > time. Most of the code for this driver is written by Imre Deak > <[EMAIL PROTECTED]>. It seems churlish to complain about the 10-15 minutes spent reassembling the mime mess when so much effort has gone into this work. But for the long-term, plze do have a talk with your email setup so that you no longer need to send patches as attachments, OK? > Also CCed to LKML for wider review, and this v2 went through checkpatch.pl and > I have modified the patches to accept most of checkpatch comments. 
Maybe you had an old version of checkpatch:

trailing statements should be on next line
#520: FILE: drivers/video/omap/omapfb_main.c:402:
+	else switch (var->bits_per_pixel) {

line over 80 characters
#1136: FILE: drivers/video/omap/omapfb_main.c:1018:
+static enum omapfb_update_mode omapfb_get_update_mode(struct omapfb_device *fbdev)

do not use assignment in if condition
#1251: FILE: drivers/video/omap/omapfb_main.c:1133:
+	if ((r = omapfb_query_plane(fbi, _info)) < 0)

do not use assignment in if condition
#1265: FILE: drivers/video/omap/omapfb_main.c:1147:
+	if ((r = omapfb_query_mem(fbi, _info)) < 0)

do not use assignment in if condition
#1279: FILE: drivers/video/omap/omapfb_main.c:1161:
+	if ((r = omapfb_get_color_key(fbdev, _key)) < 0)

do not use assignment in if condition
#1535: FILE: drivers/video/omap/omapfb_main.c:1417:
+	if ((r = device_create_file(fbdev->dev, _attr_caps_num)))

do not use assignment in if condition
#1538: FILE: drivers/video/omap/omapfb_main.c:1420:
+	if ((r = device_create_file(fbdev->dev, _attr_caps_text)))

do not use assignment in if condition
#1541: FILE: drivers/video/omap/omapfb_main.c:1423:
+	if ((r = sysfs_create_group(>dev->kobj, _attr_grp)))

do not use assignment in if condition
#1544: FILE: drivers/video/omap/omapfb_main.c:1426:
+	if ((r = sysfs_create_group(>dev->kobj, _attr_grp)))

do not use assignment in if condition
#1646: FILE: drivers/video/omap/omapfb_main.c:1528:
+	if ((r = fbinfo_init(fbdev, fbi)) < 0) {

else should follow close brace
#2001: FILE: drivers/video/omap/omapfb_main.c:1883:
+	}
+	else if (!strncmp(this_opt, "vxres:", 6))

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Plus there are quite a large number of extern declarations in C files, which is
poor practice, which checkpatch failed to detect (maintainer has been
notified).
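Editorial sketch of what the repeated "do not use assignment in if condition" complaint is asking for: split the assignment from the test. `query_plane()` is a hypothetical stand-in, not the driver's real helper, and `-22` merely mimics `-EINVAL`.

```c
#include <assert.h>

/* Hypothetical helper: a negative return value signals an error. */
static int query_plane(int plane)
{
	return plane < 0 ? -22 : plane;
}

/* The form checkpatch flags: assignment buried inside the condition. */
static int flagged_style(int plane)
{
	int r;

	if ((r = query_plane(plane)) < 0)
		return r;
	return 0;
}

/* The form it asks for: assign first, then test. */
static int preferred_style(int plane)
{
	int r;

	r = query_plane(plane);
	if (r < 0)
		return r;
	return 0;
}
```

The two functions behave identically; the second is simply easier to read and harder to confuse with a mistyped comparison.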
Re: [PATCH 16/16] fix handling of integer constant expressions
On Wed, 27 Jun 2007, Al Viro wrote:
> >
> > extern int a;
> > extern int as1[(a = 2)];
>
> sparse simply doesn't check that.  We don't have anything resembling
> support of VLA.

Well, the above has two bugs that sparse could notice _independently_ of
variable-sized arrays:
 - assignment outside of a function
 - variable size array that isn't an automatic variable

(strictly speaking, that's not even a variable size - it's a constant 2,
just with a non-constant expression - maybe you misread the "=" as an
"==")

		Linus
Re: NDAs - ANY KNOWN RULES?
On Wed, 27 Jun 2007, hermann pitton wrote:

> Hi,
>
> such stuff causes a lot of troubles since long.
>
> Are there any rules, or can everybody go on as some sort of freelancer
> exclusively on such? I don't like it!

http://www.linux-foundation.org/en/NDA_program

In short, the Linux Foundation can negotiate a reasonable NDA for you to
sign, and they may be able to show you relevant documents as a freelancer
under a reasonable and standardized contract.

	-Daniel
*This .sig left intentionally blank*
Re: [patch, v2.6.22-rc6] sys_time() speedup
On Tue, Jun 26, 2007 at 10:14:40AM -0700, Andrew Morton wrote:
> On my machine, time(2) doesn't do any syscall at all - it uses the vsyscall
> page.  I'd be surprised if a database uses sys_time() either.

Large boxes unfortunately can't always use vsyscalls... that's a real
pity. I also had to disable the vsyscalls64 to generate some number.

I think there shall be a perfectly accurate but not monotone mode for
gettimeofday so we can enable rdtscp (via sysctl or/and prctl). Aware
apps can enable the prctl, aware or brave admins can turn on the
sysctl. Vojtech and others should have proper patches to merge for
this.
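Editorial sketch of the two userspace calls under discussion. Whether either one actually enters the kernel depends on the C library, the architecture and the vsyscall configuration, so the code below only shows the API and that both report the same wall-clock second, give or take a tick. The helper name `clock_skew_seconds` is made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

/*
 * Absolute difference, in whole seconds, between gettimeofday() and
 * time().  Returns -1 if either call fails.
 */
static long clock_skew_seconds(void)
{
	struct timeval tv;
	time_t t;

	if (gettimeofday(&tv, NULL) != 0)
		return -1;
	t = time(NULL);
	if (t == (time_t)-1)
		return -1;
	return labs((long)(t - tv.tv_sec));
}
```

A program that only needs one-second resolution can use `time()`, which is the cheaper path the patch in this thread tries to speed up.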
Re: [PATCH] atl1: disable 64bit DMA
On Mon, 25 Jun 2007 23:18:55 +0200 Luca Tettamanti <[EMAIL PROTECTED]> wrote: > Il Mon, Jun 25, 2007 at 07:42:44AM -0500, Jay Cliburn ha scritto: > > Jay L. T. Cornwall wrote: > > >Jay Cliburn wrote: > > > > > >>For reasons not yet clear to me, it appears the L1 driver has a > > >>bug or the device itself has trouble with DMA in high memory. > > >>This patch, drafted by Luca Tettamanti, is being explored as a > > >>workaround. I'd be interested to know if it fixes your problem. > > > > > >Yes, it certainly seems to. Now running with this patch and 4GB > > >active, I've transferred about 15GB with no problem so far. It > > >usually oopses after a GB or two. > > > > > >I guess it's not an ideal solution, architecturally speaking, but > > >it's a good deal better than an unstable driver. If there's any > > >other patches you'd like me to test or traces to capture, I'm > > >happy to help out. Otherwise I'll run with this one for now since > > >it does the job! > > > > Okay Jay, thanks. > > > > Luca, would you please submit your patch to Jeff Garzik and netdev? > > Hi Jeff, > a couple of users reported hard lockups when using L1 NICs on machines > with 4GB or more of RAM. We're still waiting official confirmation > from the vendor, but it seems that L1 has problems doing DMA to/from > high memory (physical address above the 4GB limit). Passing 32bit DMA > mask cures the problem. > > Signed-Off-By: Luca Tettamanti <[EMAIL PROTECTED]> > > --- > I think that the patch should be included in 2.6.22. 
> > drivers/net/atl1/atl1_main.c | 15 +++ > 1 file changed, 3 insertions(+), 12 deletions(-) > > diff --git a/drivers/net/atl1/atl1_main.c > b/drivers/net/atl1/atl1_main.c index 6862c11..a730f15 100644 > --- a/drivers/net/atl1/atl1_main.c > +++ b/drivers/net/atl1/atl1_main.c > @@ -2097,21 +2097,16 @@ static int __devinit atl1_probe(struct > pci_dev *pdev, struct net_device *netdev; > struct atl1_adapter *adapter; > static int cards_found = 0; > - bool pci_using_64 = true; > int err; > > err = pci_enable_device(pdev); > if (err) > return err; > > - err = pci_set_dma_mask(pdev, DMA_64BIT_MASK); > + err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); > if (err) { > - err = pci_set_dma_mask(pdev, DMA_32BIT_MASK); > - if (err) { > - dev_err(>dev, "no usable DMA > configuration\n"); > - goto err_dma; > - } > - pci_using_64 = false; > + dev_err(>dev, "no usable DMA configuration\n"); > + goto err_dma; > } > /* Mark all PCI regions associated with PCI device >* pdev as being reserved by owner atl1_driver_name > @@ -2176,7 +2171,6 @@ static int __devinit atl1_probe(struct pci_dev > *pdev, > netdev->ethtool_ops = _ethtool_ops; > adapter->bd_number = cards_found; > - adapter->pci_using_64 = pci_using_64; > > /* setup the private structure */ > err = atl1_sw_init(adapter); > @@ -2193,9 +2187,6 @@ static int __devinit atl1_probe(struct pci_dev > *pdev, */ > /* netdev->features |= NETIF_F_TSO; */ > > - if (pci_using_64) > - netdev->features |= NETIF_F_HIGHDMA; > - > netdev->features |= NETIF_F_LLTX; > > /* Acked-by: Jay Cliburn <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
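Editorial sketch of what the mask change means in terms of addresses: a buffer is directly usable by the device only if every byte of it lies within the mask. This is a userspace model under stated assumptions, not kernel code; the mask macros are written out by hand and `dma_addr_reachable` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative masks, spelled out rather than taken from kernel headers. */
#define MASK_32BIT 0x00000000ffffffffULL
#define MASK_64BIT 0xffffffffffffffffULL

/*
 * Return 1 if the whole buffer [addr, addr + len) can be addressed
 * under the given mask, 0 otherwise.
 */
static int dma_addr_reachable(uint64_t addr, uint64_t len, uint64_t mask)
{
	uint64_t last;

	if (len == 0)
		return 0;
	last = addr + len - 1;
	if (last < addr)		/* address arithmetic wrapped around */
		return 0;
	return (last & ~mask) == 0;
}
```

With the 32-bit mask, a buffer at physical address 0x100000000 (the 4GB boundary) is out of reach, which matches the description of the workaround: the driver no longer lets the device see such addresses.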
Re: [PATCH 16/16] fix handling of integer constant expressions
On Wed, Jun 27, 2007 at 08:32:26AM +0900, Neil Booth wrote:
> Al Viro wrote:-
>
> > Hopefully correct handling of integer constant expressions.  Please, review.
>
> Am I invoking sparse wrongly?  ./sparse -W -Wall doesn't diagnose
> the following TU, for example.
>
> extern int a;
> extern int as1[(a = 2)];

sparse simply doesn't check that.  We don't have anything resembling
support of VLA.  Note that check for integer constant expression has
nothing to do with that;
	int x[(int)(0.6 + 0.6)];
is valid (if stupid).  And yes, footnote in 6.6 contradicts 6.7.5.2(1);
too bad...

We certainly need to do checks on array sizes; however, that part
("if it has static storage duration, it should not be a VLA") is minor.
And then there are gccisms:

size_t foo(int n)
{
	struct {
		int a[n];
		char b;
	} x;
	return offsetof(typeof(x), b);
}

Yes, it's eaten up just fine.  And yes, such structures are silently
accepted even with -pedantic -std=c99, which is a bug.  Sigh...  We'll need
to tackle VLAs at some point, but it certainly won't be fun ;-/
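Editorial sketch contrasting the two kinds of array size discussed above, in standard C only. It deliberately uses `(int)1.9`, where a floating constant is the immediate operand of a cast and the size is therefore an integer constant expression, rather than the `(int)(0.6 + 0.6)` form from the message, whose status is exactly the point in dispute. The names `fixed`, `fixed_bytes` and `vla_bytes` are made up, and a compiler with C99 VLA support is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* File scope: the size must be constant; (int)1.9 evaluates to 1. */
static int fixed[(int)1.9];

static size_t fixed_bytes(void)
{
	return sizeof fixed;		/* known at compile time */
}

/* Block scope: n is not constant, so v is a variable length array. */
static size_t vla_bytes(int n)
{
	int v[n];

	v[0] = 0;			/* touch the array so it is used */
	return sizeof v;		/* computed at run time */
}
```

Moving the declaration of `v` to file scope would be the constraint violation mentioned in the thread, since a VLA needs block or prototype scope.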
Re: [patch, v2.6.22-rc6] sys_time() speedup
On Tue, Jun 26, 2007 at 10:13:31AM -0700, Ray Lee wrote: > faster? Weird. It shouldn't. They must be doing something wrong, > therefore the patch is stupid." Just in case it's not obvious the above are Ray Lee words, mine not. --- #!/usr/bin/env stap # edited top.stp from systemtap global syscalls function print_top () { printf ("SYSCALL\t\t\t\tCOUNT\n") foreach ([name] in syscalls- limit 20) printf("%-20s\t\t%5d\n",name, syscalls[name]) printf("--\n") } probe syscall.time { syscalls[probefunc()]++ } probe syscall.gettimeofday { syscalls[probefunc()]++ } # print top syscalls every 5 seconds probe timer.ms(5000) { print_top () } --- The above while running various huge sql operations with real life postgresql app running sql in loop for a minute or so (sorry no mysql setup but the world isn't mysql, I'd rather want to see oracle if something): SYSCALL COUNT sys_gettimeofday 4998 sys_time 120 -- SYSCALL COUNT sys_gettimeofday 9989 sys_time 185 -- SYSCALL COUNT sys_gettimeofday15219 sys_time 335 -- SYSCALL COUNT sys_gettimeofday21215 sys_time 428 -- SYSCALL COUNT sys_gettimeofday26194 sys_time 629 -- SYSCALL COUNT sys_gettimeofday30752 sys_time 734 -- SYSCALL COUNT sys_gettimeofday37379 sys_time 976 -- SYSCALL COUNT sys_gettimeofday42381 sys_time 1125 -- SYSCALL COUNT sys_gettimeofday47722 sys_time 1391 -- SYSCALL COUNT sys_gettimeofday53138 sys_time 1520 -- SYSCALL COUNT sys_gettimeofday57499 sys_time 1651 -- SYSCALL COUNT sys_gettimeofday62314 sys_time 1712 -- SYSCALL COUNT sys_gettimeofday66874 sys_time 1827 -- SYSCALL COUNT sys_gettimeofday71757 sys_time 2007 -- SYSCALL COUNT sys_gettimeofday76335 sys_time 2240 SYSCALL COUNT sys_gettimeofday80469 sys_time 2354 -- SYSCALL COUNT sys_gettimeofday85420 sys_time 2519 -- SYSCALL COUNT sys_gettimeofday90662 sys_time 2648 -- SYSCALL COUNT sys_gettimeofday95513 sys_time 2909 -- SYSCALL COUNT sys_gettimeofday100767 sys_time 3111 -- SYSCALL COUNT sys_gettimeofday106553 sys_time 3427 -- SYSCALL COUNT sys_gettimeofday112300 
sys_time 3673 -- SYSCALL COUNT sys_gettimeofday115706 sys_time 3793 SYSCALL COUNT sys_gettimeofday119842 sys_time 3893 -- SYSCALL COUNT sys_gettimeofday123054 sys_time 4113 -- SYSCALL COUNT sys_gettimeofday126286 sys_time 4250 -- SYSCALL COUNT sys_gettimeofday129077 sys_time
[drm patch for 2.6.22-rc6] Add some pci ids for XGI chips
The attached patch just adds some XGI pci ids to the SIS driver. It should be harmless at this stage.. Dave.From 02031bf9190f6698e46e157196086ff416d0bf9b Mon Sep 17 00:00:00 2001 From: Ian Romanick <[EMAIL PROTECTED]> Date: Wed, 27 Jun 2007 06:38:00 +1000 Subject: [PATCH] Add support SiS based XGI chips to SiS DRM. This adds support for some of the XGI Volari family that are based on the SiS. Signed-off-by: Dave Airlie <[EMAIL PROTECTED]> --- drivers/char/drm/drm_pciids.h |2 ++ drivers/char/drm/sis_drv.h|8 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/char/drm/drm_pciids.h b/drivers/char/drm/drm_pciids.h index aa63350..30b200b 100644 --- a/drivers/char/drm/drm_pciids.h +++ b/drivers/char/drm/drm_pciids.h @@ -219,6 +219,8 @@ {0x1039, 0x6300, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, \ {0x1039, 0x6330, PCI_ANY_ID, PCI_ANY_ID, 0, 0, SIS_CHIP_315}, \ {0x1039, 0x7300, PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0}, \ + {0x18CA, 0x0040, PCI_ANY_ID, PCI_ANY_ID, 0, 0, SIS_CHIP_315}, \ + {0x18CA, 0x0042, PCI_ANY_ID, PCI_ANY_ID, 0, 0, SIS_CHIP_315}, \ {0, 0, 0} #define tdfx_PCI_IDS \ diff --git a/drivers/char/drm/sis_drv.h b/drivers/char/drm/sis_drv.h index 2b8d6f6..70d4ede 100644 --- a/drivers/char/drm/sis_drv.h +++ b/drivers/char/drm/sis_drv.h @@ -33,11 +33,11 @@ #define DRIVER_AUTHOR "SIS, Tungsten Graphics" #define DRIVER_NAME"sis" -#define DRIVER_DESC"SIS 300/630/540" -#define DRIVER_DATE"20060704" +#define DRIVER_DESC"SIS 300/630/540 and XGI V3XE/V5/V8" +#define DRIVER_DATE"20070626" #define DRIVER_MAJOR 1 -#define DRIVER_MINOR 2 -#define DRIVER_PATCHLEVEL 1 +#define DRIVER_MINOR 3 +#define DRIVER_PATCHLEVEL 0 enum sis_family { SIS_OTHER = 0, -- 1.4.4.2
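Editorial sketch of how an ID table like the one extended above is typically consulted: walk the entries until the zero terminator and compare vendor and device. The structure is a trimmed-down assumption (the real entries carry seven fields), and `match_id`, `ANY_ID` and the `CHIP_*` values are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu	/* wildcard, modeled on PCI_ANY_ID */

enum { CHIP_OTHER = 0, CHIP_315 = 1 };	/* illustrative values */

/* Trimmed-down table entry (assumption, not the kernel layout). */
struct id_entry {
	uint32_t vendor, device, subvendor, subdevice;
	unsigned long driver_data;
};

static const struct id_entry ids[] = {
	{ 0x1039, 0x6330, ANY_ID, ANY_ID, CHIP_315 },
	{ 0x18CA, 0x0040, ANY_ID, ANY_ID, CHIP_315 },	/* added by the patch */
	{ 0x18CA, 0x0042, ANY_ID, ANY_ID, CHIP_315 },	/* added by the patch */
	{ 0, 0, 0, 0, 0 }				/* terminator */
};

/* Return the first entry matching vendor/device, or NULL if none does. */
static const struct id_entry *match_id(uint32_t vendor, uint32_t device)
{
	const struct id_entry *e;

	for (e = ids; e->vendor != 0; e++)
		if (e->vendor == vendor && e->device == device)
			return e;
	return NULL;
}
```

Adding a device is then just a matter of adding a row, which is why the patch is described as harmless at this stage.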
Re: [RFD 1/4] Pass no useless nameidata to the create, lookup, and permission IOPs
In message <[EMAIL PROTECTED]>, [EMAIL PROTECTED] writes:
> The create, lookup, and permission inode operations are all passed a
> full nameidata. This is unfortunate because in nfsd and the mqueue
> filesystem, we must instantiate a struct nameidata but cannot provide
> all of the same information that a regular lookup would provide. The
> unused fields take up space on the stack, but more importantly, it is
> not obvious which fields have meaningful values and which don't, and so
> things might easily break.
>
> This patch introduces struct nameidata2 with only the fields that make
> sense independent of an actual lookup, and uses that struct in those
> places where a full nameidata is not needed.

I agree w/ Trond that a better name is needed other than 'nameidata2',
esp. for something that's a sub-structure (perhaps start it with a '__'?)

These changes would probably help stackable file systems (e.g., eCryptfs and
esp. Unionfs) a lot, b/c stackable f/s often call the lower f/s to lookup
files and such; and in most cases, we just need to pass the intent down, not
the full VFS-level state info.

> +/**
> + * Fields shared between nameidata and nameidata2 -- nameidata2 could
> + * simply be embedded in nameidata, but then the vfs code would become
> + * cluttered with dereferences.
> + */
> +#define __NAMEIDATA2				\
> +	struct dentry	*dentry;		\
> +	struct vfsmount *mnt;			\
> +	unsigned int	flags;			\
> +						\
> +	union {					\
> +		struct open_intent open;	\
> +	} intent;

Perhaps it is also time to put the dentry + mnt into a single struct path?
It's a small change, but it emphasizes that the two items here, dentry+mnt,
really define a single path to be passed around:

#define __NAMEIDATA	\
	struct path path; \
	unsigned int flags; \
	...

Of course, you'll have to change instances of nd->dentry to nd->path.dentry
and so on.

Erez.
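Editorial sketch of the `struct path` suggestion above, as a userspace model. `struct nameidata_min`, `nd_dentry` and `same_path` are hypothetical names; the kernel types are left opaque on purpose, and nothing here claims to match the eventual kernel layout.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque placeholders standing in for the kernel types. */
struct dentry;
struct vfsmount;

/* dentry + mnt together identify one path. */
struct path {
	struct dentry *dentry;
	struct vfsmount *mnt;
};

/* Hypothetical trimmed-down lookup state following the suggestion. */
struct nameidata_min {
	struct path path;
	unsigned int flags;
};

/* What used to be nd->dentry becomes nd->path.dentry. */
static struct dentry *nd_dentry(const struct nameidata_min *nd)
{
	return nd->path.dentry;
}

/* Passing the pair around as one object makes comparisons like this natural. */
static int same_path(const struct path *a, const struct path *b)
{
	return a->dentry == b->dentry && a->mnt == b->mnt;
}
```

The cost is the extra `.path` in every access, which is the clutter the quoted comment worries about; the gain is that the pair can be handed to lower layers as a single argument.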
Re: [PATCH try #2] security: Convert LSM into a static interface
On Jun 26, 2007, at 09:47:12, Serge E. Hallyn wrote:
> Quoting Kyle Moffett ([EMAIL PROTECTED]):
> > On Jun 25, 2007, at 16:37:58, Andreas Gruenbacher wrote:
> > > It's useful for some LSMs to be modular, and LSMs which are y/n
> > > options won't have any security architecture issues with unloading
> > > at all. The mere fact that SELinux cannot be built as a module is a
> > > rather weak argument for disabling LSM modules as a whole, so
> > > please don't.
> >
> > Here are a few questions for you:
> >
> > 1) What do you expect to happen to all the megs of security data
> > when you "rmmod selinux"?
>
> Read the sentence right above yours again. No one is saying we should
> be able to rmmod selinux.

Ok, so say we extend LSM to do what AppArmor or TOMOYO need, what do
you expect to happen when you "rmmod tomoyo", "rmmod apparmor", or
whatever? Each of those is also going to stick lots of context on
various objects during the course of running, the same way that the VM
subsystem sticks lots of context on filesystem pages while running.
Besides, even the standard "capabilities" module wants to attach a list
of capabilities to every process and defines inheritance rules for
them. Ergo you have the problems described below:

> > Do you maintain a massive linked list of security data (with all
> > the locking and performance problems) so that you can iterate over
> > it calling kfree()? What synchronization primitive do we have right
> > now which could safely stop all CPUs outside of security calls
> > while we NULL out and free security data and disable security
> > operations? Don't say "software suspend" and "process freezer",
> > since those have whole order-of-magnitude-complexity problems of
> > their own (and don't always work right either).
> >
> > 2) When you "modprobe my_custom_security_module", how exactly do
> > you expect that all the processes, files, shared memory segments,
> > file descriptors, sockets, SYSV mutexes, packets, etc will get
> > appropriate security pointers?
>
> Those don't all need labels for capabilities, for instance.
>
> This question is as wrong as the last one.

Ok, so let's just restrict ourselves to the simple dumb-as-dirt
capabilities module. Every process is "labeled" with capabilities while
running under that LSM, right? What happens when you "rmmod
capabilities"? Do you iterate over all the processes to remove their
security data even while they may be using it? Or do you just let it
leak? Some daemons test if capabilities are supported, and if so they
modify their capability set instead of forking a high-priv and a
low-priv process and doing IPC. When you remove the capabilities
module, suddenly all those programs will lose that critical
"low-privilege" data and become full root. What happens later when you
"modprobe capabilities"? Do you suddenly have to stop the system while
you iterate over EVERY process to set capabilities based on whether
it's root or not? It's also impossible to determine from a given state
in time what processes should have capabilities, as the model includes
inheritance, which includes processes that don't even exist anymore.

> > 3) This sounds suspiciously like "The mere fact that the
> > Linux-2.6-VM cannot be built as a module is a rather weak argument
> > for disabling VFS modules as a whole".
>
> No, your argument sounds like "my fs can't be a module so neither
> should any."

Let's go over the differences between "my fs" and "my LSM", and the
similarities between "my VM" and "my LSM": Filesystems don't get hooked
from virtually every userspace-initiated operation, whereas both VMs
and LSMs do. VMs and LSMs attach anonymous state data to a large
percentage of the allocated objects in the system, whereas filesystems
allocate their own independent data structure and use that. Would you
want to "rmmod ext3" and then "modprobe ext2" while you have an
ext2-as-ext3 filesystem *mounted*??? If you want a good analogy, that's
a better one than the "my fs can't be a module" crap.

This whole discussion boils down to 2 points:
  1) As currently implemented, no LSM may be safely rmmod-ed
  2) Someone has submitted a patch which fixes that problem (you can't
     rmmod them at all, so no crashes)

If you really want to do modular LSMs, then you need to submit a patch
which fixes all the race conditions in LSM removal *without* adding
much extra overhead. I'm sure if your solution works then everyone will
be much more open to modular LSMs.

I said this before:
So... Do you have a proposal for solving those rather fundamental
design gotchas? If so, I'm sure everybody here would love to see your
patch.

Cheers,
Kyle Moffett
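The capabilities scenario above can be illustrated with a small userspace sketch. This is not kernel code and not the LSM API: `struct task`, `struct security_ops`, `struct cap_blob`, `lsm_load()` and `lsm_unload()` are hypothetical stand-ins invented for the example. It only models the two hazards described in the mail, under the assumption that the fallback with no module loaded is plain "root may do everything".

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel object carrying an opaque LSM blob. */
struct task {
	int uid;
	void *security;	/* owned by whichever module attached it */
};

/* Hypothetical, heavily reduced hook table. */
struct security_ops {
	int (*task_alloc)(struct task *t);
	int (*task_may_admin)(const struct task *t);
};

/* Per-task state of the toy "capabilities" module. */
struct cap_blob {
	int may_admin;
};

static int cap_task_alloc(struct task *t)
{
	struct cap_blob *b = malloc(sizeof(*b));

	if (!b)
		return -1;
	b->may_admin = (t->uid == 0);
	t->security = b;
	return 0;
}

static int cap_task_may_admin(const struct task *t)
{
	const struct cap_blob *b = t->security;

	return b != NULL && b->may_admin;
}

/* Assumed fallback when no module is loaded: root-or-nothing. */
static int dummy_task_alloc(struct task *t)
{
	t->security = NULL;
	return 0;
}

static int dummy_task_may_admin(const struct task *t)
{
	return t->uid == 0;
}

static const struct security_ops dummy_ops = { dummy_task_alloc, dummy_task_may_admin };
static const struct security_ops cap_ops = { cap_task_alloc, cap_task_may_admin };
static const struct security_ops *active_ops = &dummy_ops;

static void lsm_load(const struct security_ops *ops) { active_ops = ops; }
/* Unload only swaps the table; nobody tracks the blobs already handed out. */
static void lsm_unload(void) { active_ops = &dummy_ops; }

/* Hazard 1: a root daemon that dropped privilege regains it on unload. */
int demo_privilege_regained_after_unload(void)
{
	struct task daemon = { 0, NULL };
	int before, after;

	lsm_load(&cap_ops);
	if (active_ops->task_alloc(&daemon) != 0)
		return -1;
	((struct cap_blob *)daemon.security)->may_admin = 0;	/* drop privilege */
	before = active_ops->task_may_admin(&daemon);
	lsm_unload();
	after = active_ops->task_may_admin(&daemon);
	free(daemon.security);	/* a real unload would have leaked this */
	return before == 0 && after == 1;
}

/* Hazard 2: a task that predates the load has no blob to consult. */
int demo_root_unlabeled_after_load(void)
{
	struct task old = { 0, NULL };
	int allowed;

	lsm_load(&cap_ops);
	allowed = active_ops->task_may_admin(&old);
	lsm_unload();
	return allowed;
}
```

In this toy model `demo_privilege_regained_after_unload()` returns 1 and `demo_root_unlabeled_after_load()` returns 0, which is the shape of the problem rather than a statement about any real module. A static interface avoids both cases simply because `lsm_unload()` and a late `lsm_load()` cannot happen.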
Re: [PATCH 7/7][TAKE5] ext4: support new modes
On Wed, Jun 27, 2007 at 12:59:08AM +0530, Amit K. Arora wrote:
> On Tue, Jun 26, 2007 at 12:14:00PM -0400, Andreas Dilger wrote:
> > On Jun 26, 2007 17:37 +0530, Amit K. Arora wrote:
> > > > I also thought another proposed flag was to determine whether mtime (and
> > > > maybe ctime) is changed when doing prealloc/dealloc space? Default should
> > > > probably be to change mtime/ctime, and have FA_FL_NO_MTIME. Someone else
> > > > should decide if we want to allow changing the file w/o changing ctime, if
> > > > that is required even though the file is not visibly changing. Maybe the
> > > > ctime update should be implicit if the size or mtime are changing?
> > >
> > > Is it really required ? I mean, why should we allow users not to update
> > > ctime/mtime even if the file metadata/data gets updated ? It sounds
> > > a bit "unnatural" to me.
> > > Is there any application scenario in your mind, when you suggest of
> > > giving this flexibility to userspace ?
> >
> > One reason is that XFS does NOT update the mtime/ctime when doing the
> > XFS_IOC_* allocation ioctls.

Not totally correct. XFS_IOC_ALLOCSP/FREESP change timestamps if they
change the file size (via the truncate call made to change the file
size). If they don't change the file size, then they are a no-op and
should not change the file size.

XFS_IOC_RESVSP/UNRESVSP don't change timestamps just like they don't
change file size. That is by design AFAICT so these calls can be used
by HSM-type applications that don't want to change timestamps when
punching out data blocks or preallocating new ones.

> Hmm.. I personally will call it a bug in XFS code then. :)

No, I'd call it useful. :)

> > > I think, modifying ctime/mtime should be dependent on the other flags.
> > > E.g., if we do not zero out data blocks on allocation/deallocation,
> > > update only ctime. Otherwise, update ctime and mtime both.
> >
> > I'm only being the advocate for requirements David Chinner has put
> > forward due to existing behaviour in XFS. This is one of the reasons
> > why I think the "flags" mechanism we now have - we can encode the
> > various different behaviours in any way we want and leave it to the
> > caller.
>
> I understand. May be we can confirm once more with David Chinner if this
> is really required. Will it really be a compatibility issue if new XFS
> preallocations (ie. via fallocate) update mtime/ctime?

It should be left up to the filesystem to decide. Only the filesystem
knows whether something changed and the timestamp should or should not
be updated.

Cheers,
Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
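The two timestamp policies discussed in this thread can be put side by side in a short sketch. Everything here is illustrative: `FA_FL_NO_MTIME` is only the flag name floated above and its value is made up, while `enum prealloc_op`, `xfs_ioctl_ts_update()` and `fallocate_ts_update()` are hypothetical helpers, not XFS or VFS functions. The sketch assumes "change timestamps" means both ctime and mtime, and it picks "no update" for the case the thread leaves open (no size change with mtime suppressed).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value; only the name comes from the discussion. */
#define FA_FL_NO_MTIME 0x1u

/* Bitmask of timestamps an operation should touch. */
enum ts_update { TS_NONE = 0, TS_CTIME = 1, TS_MTIME = 2 };

/* Stand-ins for the four XFS ioctls named in the mail. */
enum prealloc_op { OP_ALLOCSP, OP_FREESP, OP_RESVSP, OP_UNRESVSP };

/*
 * XFS behaviour as described above: ALLOCSP/FREESP touch timestamps
 * only when the file size changes; RESVSP/UNRESVSP never do.
 */
unsigned xfs_ioctl_ts_update(enum prealloc_op op, bool size_changed)
{
	switch (op) {
	case OP_ALLOCSP:
	case OP_FREESP:
		return size_changed ? (TS_CTIME | TS_MTIME) : TS_NONE;
	case OP_RESVSP:
	case OP_UNRESVSP:
		return TS_NONE;
	}
	return TS_NONE;
}

/*
 * Flag-driven default as proposed for fallocate: update mtime unless
 * FA_FL_NO_MTIME is set, and update ctime implicitly whenever the size
 * or mtime changes.
 */
unsigned fallocate_ts_update(unsigned flags, bool size_changed)
{
	unsigned ts = TS_NONE;

	if (!(flags & FA_FL_NO_MTIME))
		ts |= TS_MTIME;
	if (size_changed || (ts & TS_MTIME))
		ts |= TS_CTIME;
	return ts;
}
```

With these helpers, `fallocate_ts_update(FA_FL_NO_MTIME, false)` yields `TS_NONE`, matching what `xfs_ioctl_ts_update(OP_RESVSP, ...)` gives for an HSM-style reservation. Leaving the decision to the filesystem, as suggested in the reply, would amount to each filesystem supplying its own version of such a function.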
NDAs - ANY KNOWN RULES?
Hi,

such stuff has been causing a lot of trouble for a long time. Are there any rules, or can everybody go on as some sort of freelancer exclusively on such? I don't like it!

Cheers,
Hermann
Re: [AppArmor 00/44] AppArmor security module overview
On Tue, 26 Jun 2007 16:07:56 -0700 [EMAIL PROTECTED] wrote:
> This post contains patches to include the AppArmor application security
> framework, with request for inclusion into -mm for wider testing.

Patches 24 and 31 didn't come through.

Rolled-up diffstat (excluding 24&31):

 fs/attr.c                            |    7
 fs/dcache.c                          |  181
 fs/ecryptfs/inode.c                  |   41
 fs/exec.c                            |    3
 fs/fat/file.c                        |    2
 fs/hpfs/namei.c                      |    2
 fs/namei.c                           |  115
 fs/nfsd/nfs4recover.c                |    7
 fs/nfsd/nfs4xdr.c                    |    2
 fs/nfsd/vfs.c                        |   89
 fs/ntfs/file.c                       |    2
 fs/open.c                            |   50
 fs/reiserfs/file.c                   |    2
 fs/reiserfs/xattr.c                  |    8
 fs/splice.c                          |    4
 fs/stat.c                            |    2
 fs/sysfs/file.c                      |    2
 fs/utimes.c                          |   11
 fs/xattr.c                           |   75
 fs/xfs/linux-2.6/xfs_lrw.c           |    2
 include/linux/audit.h                |   12
 include/linux/fs.h                   |   27
 include/linux/nfsd/nfsd.h            |    3
 include/linux/security.h             |  182
 include/linux/sysctl.h               |    2
 include/linux/xattr.h                |   11
 ipc/mqueue.c                         |    2
 kernel/audit.c                       |    6
 kernel/sysctl.c                      |   27
 mm/filemap.c                         |   12
 mm/filemap_xip.c                     |    2
 mm/shmem.c                           |    2
 mm/tiny-shmem.c                      |    2
 net/unix/af_unix.c                   |    2
 security/Kconfig                     |    1
 security/Makefile                    |    1
 security/apparmor/Kconfig            |   10
 security/apparmor/Makefile           |   13
 security/apparmor/apparmor.h         |  265
 security/apparmor/apparmorfs.c       |  252
 security/apparmor/inline.h           |  211
 security/apparmor/list.c             |   94
 security/apparmor/locking.txt        |   68
 security/apparmor/lsm.c              |  817
 security/apparmor/main.c             | 1255
 security/apparmor/match.c            |  248
 security/apparmor/match.h            |   83
 security/apparmor/module_interface.c |  589
 security/apparmor/procattr.c         |  155
 security/commoncap.c                 |    7
 security/dummy.c                     |   43
 security/selinux/hooks.c             |   94
 52 files changed, 4701 insertions(+), 404 deletions(-)

which seems OK.

so... where do we stand with this? Fundamental, irreconcilable differences over the use of pathname-based security?

Are there any other sticking points?
Re: [RFD 0/4] AppArmor - Don't pass NULL nameidata to vfs_create/lookup/permission IOPs
On Tue, 2007-06-26 at 16:15 -0700, [EMAIL PROTECTED] wrote:
> To remove conditionally passing of vfsmounts to the LSM, a nameidata
> struct can be instantiated in the nfsd and mqueue filesystems. This
> however results in useless information being passed down, as not
> all fields in the nameidata struct will be meaningful. The nameidata
> struct is split creating struct nameidata2 that contains only the
> fields that will carry meaningful information.

I don't object to the concept per se, but could you please give it a more descriptive name? "struct vfs_intent" would be a lot more accurate than "nameidata2".

Trond