Re: Data corruption with raid5/dm-crypt/lvm/reiserfs on 2.6.19.2
Am Montag, den 22.01.2007, 11:56 -0800 schrieb Andrew Morton: > There has been a long history of similar problems when raid and dm-crypt > are used together. I thought a couple of months ago that we were hot on > the trail of a fix, but I don't think we ever got there. Perhaps > Christophe can comment? No, I think it's exactly this bug. Three month ago someone came up with a very reliable test case and I managed to nail down the bug. Readaheads that were aborted by the raid5 code (or some layer below) were signalled using a cleared BIO_UPTODATE bit, but no error code, and were missed as aborted by dm-crypt (all other layers apparently set the error code in this case, so this only happened with raid5) which could mess up the buffer cache. Anyway, it then turned out this bug was already "accidentally" fixed in 2.6.19 by RedHat in order to play nicely with make_request changes (the stuff to reduce stack usage with stacked block device layers), that's why you probably missed that it got fixed. The fix for pre-2.6.19 kernels went into some 2.6.16.x and 2.6.18.6. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Data corruption with raid5/dm-crypt/lvm/reiserfs on 2.6.19.2
Am Montag, den 22.01.2007, 11:56 -0800 schrieb Andrew Morton: There has been a long history of similar problems when raid and dm-crypt are used together. I thought a couple of months ago that we were hot on the trail of a fix, but I don't think we ever got there. Perhaps Christophe can comment? No, I think it's exactly this bug. Three month ago someone came up with a very reliable test case and I managed to nail down the bug. Readaheads that were aborted by the raid5 code (or some layer below) were signalled using a cleared BIO_UPTODATE bit, but no error code, and were missed as aborted by dm-crypt (all other layers apparently set the error code in this case, so this only happened with raid5) which could mess up the buffer cache. Anyway, it then turned out this bug was already accidentally fixed in 2.6.19 by RedHat in order to play nicely with make_request changes (the stuff to reduce stack usage with stacked block device layers), that's why you probably missed that it got fixed. The fix for pre-2.6.19 kernels went into some 2.6.16.x and 2.6.18.6. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [stable][PATCH < 2.6.19] Fix data corruption with dm-crypt over RAID5
Am Freitag, den 01.12.2006, 19:49 -0800 schrieb Chris Wright: > * Christophe Saout ([EMAIL PROTECTED]) wrote: > > Fix corruption issue with dm-crypt on top of software raid5. Cancelled > > readahead bio's that report no error, just have BIO_UPTODATE cleared > > were reported as successful reads to the higher layers (and leaving > > random content in the buffer cache). Already fixed in 2.6.19. > > I take it this is fixed a different way in 2.6.19? Mind clarifying the > difference? It's more or less fixed the same way in 2.6.19. The function was reordered by Milan Broz to accommodate the changed write path (commit 8b004457168995f2ae2a35327f885183a9e74141): > [PATCH] dm crypt: restructure for workqueue change > > Restructure part of the dm-crypt code in preparation for workqueue > changes. > > Use 'base_bio' or 'clone' variable names consistently throughout. No > functional changes are included in this patch. "No functional changes" actually included the correct fix, accidental or not. > Minor nit: introduces trailing whitespaces, cleaned it up locally. Ouch, sorry. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [stable][PATCH 2.6.19] Fix data corruption with dm-crypt over RAID5
Am Freitag, den 01.12.2006, 19:49 -0800 schrieb Chris Wright: * Christophe Saout ([EMAIL PROTECTED]) wrote: Fix corruption issue with dm-crypt on top of software raid5. Cancelled readahead bio's that report no error, just have BIO_UPTODATE cleared were reported as successful reads to the higher layers (and leaving random content in the buffer cache). Already fixed in 2.6.19. I take it this is fixed a different way in 2.6.19? Mind clarifying the difference? It's more or less fixed the same way in 2.6.19. The function was reordered by Milan Broz to accommodate the changed write path (commit 8b004457168995f2ae2a35327f885183a9e74141): [PATCH] dm crypt: restructure for workqueue change Restructure part of the dm-crypt code in preparation for workqueue changes. Use 'base_bio' or 'clone' variable names consistently throughout. No functional changes are included in this patch. No functional changes actually included the correct fix, accidental or not. Minor nit: introduces trailing whitespaces, cleaned it up locally. Ouch, sorry. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
[stable][PATCH < 2.6.19] Fix data corruption with dm-crypt over RAID5
Fix corruption issue with dm-crypt on top of software raid5. Cancelled readahead bio's that report no error, just have BIO_UPTODATE cleared were reported as successful reads to the higher layers (and leaving random content in the buffer cache). Already fixed in 2.6.19. Signed-off-by: Christophe Saout <[EMAIL PROTECTED]> --- linux-2.6.18.orig/drivers/md/dm-crypt.c 2006-09-20 05:42:06.0 +0200 +++ linux-2.6.18/drivers/md/dm-crypt.c 2006-12-02 03:03:36.0 +0100 @@ -717,13 +717,15 @@ if (bio->bi_size) return 1; + if (!bio_flagged(bio, BIO_UPTODATE) && !error) + error = -EIO; + bio_put(bio); /* * successful reads are decrypted by the worker thread */ - if ((bio_data_dir(bio) == READ) - && bio_flagged(bio, BIO_UPTODATE)) { + if (bio_data_dir(io->bio) == READ && !error) { kcryptd_queue_io(io); return 0; } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[stable][PATCH 2.6.19] Fix data corruption with dm-crypt over RAID5
Fix corruption issue with dm-crypt on top of software raid5. Cancelled readahead bio's that report no error, just have BIO_UPTODATE cleared were reported as successful reads to the higher layers (and leaving random content in the buffer cache). Already fixed in 2.6.19. Signed-off-by: Christophe Saout [EMAIL PROTECTED] --- linux-2.6.18.orig/drivers/md/dm-crypt.c 2006-09-20 05:42:06.0 +0200 +++ linux-2.6.18/drivers/md/dm-crypt.c 2006-12-02 03:03:36.0 +0100 @@ -717,13 +717,15 @@ if (bio-bi_size) return 1; + if (!bio_flagged(bio, BIO_UPTODATE) !error) + error = -EIO; + bio_put(bio); /* * successful reads are decrypted by the worker thread */ - if ((bio_data_dir(bio) == READ) -bio_flagged(bio, BIO_UPTODATE)) { + if (bio_data_dir(io-bio) == READ !error) { kcryptd_queue_io(io); return 0; } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [BUG mm] "fixed" i386 memcpy inlining buggy
Am Mittwoch, den 06.04.2005, 13:14 +0300 schrieb Denis Vlasenko: > Oh shit. I was trying to be too clever. I still run with this patch, > so it must be happening very rarely. Yes, that's right, it happened with code that's not in the mainline tree but could have happened anywhere. > Does this one compile ok? Yes, the case that failed is now okay. I changed it slightly to assign esi and edi directy on top of the functions, no asm section needed here. The compiler will make sure that they have the correct values when needed. In the case above the compiler now uses %ebx to save the loop counter instead of %esi. In drivers/cdrom/cdrom.c I'm observing one very strange thing though. It appears that the compiler decided to put the local variable edi on the stack for some unexplicable reason (or possibly there is?). Since the asm sections are memory barriers the compiler then saves the value of %edi on the stack before entering the next assembler section. 1f1c: a5 movsl %ds:(%esi),%es:(%edi) 1f1d: 89 7d 84mov%edi,0xff84(%ebp) 1f20: a5 movsl %ds:(%esi),%es:(%edi) 1f21: 89 7d 84mov%edi,0xff84(%ebp) 1f24: 66 a5 movsw %ds:(%esi),%es:(%edi) (this is a constant 10 byte memcpy) The only thing that would avoid this is to either tell the compiler to never put esi/edi in memory (which I think is not possibly across different versions of gcc) or to always generate a single asm section for all the different cases. signature.asc Description: This is a digitally signed message part
Re: [BUG mm] fixed i386 memcpy inlining buggy
Am Mittwoch, den 06.04.2005, 13:14 +0300 schrieb Denis Vlasenko: Oh shit. I was trying to be too clever. I still run with this patch, so it must be happening very rarely. Yes, that's right, it happened with code that's not in the mainline tree but could have happened anywhere. Does this one compile ok? Yes, the case that failed is now okay. I changed it slightly to assign esi and edi directy on top of the functions, no asm section needed here. The compiler will make sure that they have the correct values when needed. In the case above the compiler now uses %ebx to save the loop counter instead of %esi. In drivers/cdrom/cdrom.c I'm observing one very strange thing though. It appears that the compiler decided to put the local variable edi on the stack for some unexplicable reason (or possibly there is?). Since the asm sections are memory barriers the compiler then saves the value of %edi on the stack before entering the next assembler section. 1f1c: a5 movsl %ds:(%esi),%es:(%edi) 1f1d: 89 7d 84mov%edi,0xff84(%ebp) 1f20: a5 movsl %ds:(%esi),%es:(%edi) 1f21: 89 7d 84mov%edi,0xff84(%ebp) 1f24: 66 a5 movsw %ds:(%esi),%es:(%edi) (this is a constant 10 byte memcpy) The only thing that would avoid this is to either tell the compiler to never put esi/edi in memory (which I think is not possibly across different versions of gcc) or to always generate a single asm section for all the different cases. signature.asc Description: This is a digitally signed message part
Re: 2.6.12-rc2-mm1
Hi Andrew, > - Nobody said anything about the PM resume and DRI behaviour in > 2.6.12-rc1-mm4. So it's all perfect now? Yes, works for me. DRI (i915) is working again and USB is now happy after a PM resume too. signature.asc Description: This is a digitally signed message part
[BUG mm] "fixed" i386 memcpy inlining buggy
Hi Denis, the new i386 memcpy macro is a ticking timebomb. I've been debugging a new mISDN crash, just to find out that a memcpy was not inlined correctly. Andrew, you should drop the fix-i386-memcpy.patch (or have it fixed). This source code: mISDN_pid_t pid; [...] memcpy(>mgr->pid, , sizeof(mISDN_pid_t)); was compiled as: lea0xffa4(%ebp),%esi< %esi is loaded ( add$0x10,%ebx ) ( mov%ebx,%eax ) something else ( call 1613 ) %esi preserved mov0xffa0(%ebp),%edx mov0x74(%edx),%edi < %edi is loaded add$0x20,%edi offset in structure added ! mov$0x14,%esi!! < %esi overwritten! mov%esi,%ecx< %ecx loaded repz movsl %ds:(%esi),%es:(%edi) Apparently the compiled decided that the value 0x14 could be reused afterwards (which it does for an inlined memset of the same size some instructions below) and clobbers %esi. Looking at the macro: __asm__ __volatile__( "" : "=" (edi), "=" (esi) : "0" ((long) to),"1" ((long) from) : "memory" ); } if (n >= 5*4) { /* large block: use rep prefix */ int ecx; __asm__ __volatile__( "rep ; movsl" : "=" (ecx) it seems obvious that the compiled assumes it can reuse %esi and %edi for something else between the two __asm__ sections. These should probably be merged. signature.asc Description: This is a digitally signed message part
Re: 2.6.12-rc2-mm1
Hi Andrew, - Nobody said anything about the PM resume and DRI behaviour in 2.6.12-rc1-mm4. So it's all perfect now? Yes, works for me. DRI (i915) is working again and USB is now happy after a PM resume too. signature.asc Description: This is a digitally signed message part
[BUG mm] fixed i386 memcpy inlining buggy
Hi Denis, the new i386 memcpy macro is a ticking timebomb. I've been debugging a new mISDN crash, just to find out that a memcpy was not inlined correctly. Andrew, you should drop the fix-i386-memcpy.patch (or have it fixed). This source code: mISDN_pid_t pid; [...] memcpy(st-mgr-pid, pid, sizeof(mISDN_pid_t)); was compiled as: lea0xffa4(%ebp),%esi %esi is loaded ( add$0x10,%ebx ) ( mov%ebx,%eax ) something else ( call 1613 test_stack_protocol+0x83 ) %esi preserved mov0xffa0(%ebp),%edx mov0x74(%edx),%edi %edi is loaded add$0x20,%edi offset in structure added ! mov$0x14,%esi!! %esi overwritten! mov%esi,%ecx %ecx loaded repz movsl %ds:(%esi),%es:(%edi) Apparently the compiled decided that the value 0x14 could be reused afterwards (which it does for an inlined memset of the same size some instructions below) and clobbers %esi. Looking at the macro: __asm__ __volatile__( : =D (edi), =S (esi) : 0 ((long) to),1 ((long) from) : memory ); } if (n = 5*4) { /* large block: use rep prefix */ int ecx; __asm__ __volatile__( rep ; movsl : =c (ecx) it seems obvious that the compiled assumes it can reuse %esi and %edi for something else between the two __asm__ sections. These should probably be merged. signature.asc Description: This is a digitally signed message part
Re: x86-64 preemption fix from IRQ and BKL in 2.6.12-rc1-mm2
Am Sonntag, den 27.03.2005, 19:26 +0200 schrieb Andi Kleen: > > preempt_schedule_irq is not an i386 specific function and seems to take > > special care of BKL preemption and since reiserfs does use the BKL to do > > certain things I think this actually might be the problem...? > > Hmm, preempt_schedule_irq is not in mainline as far as I can see. > My patches are always for mainline; i dont do a special > patch kit for -mm* PREEMPT_BKL has been in mainline since 2.6.11-rc1, preempt_schedule_irq made it in 2.6.11-rc3. Please look here: http://linux.bkbits.net:8080/linux-2.6/search/?expr=preempt_schedule_irq=ChangeSet+comments For i386 the first change was to switch to preempt_schedule in this code path: http://linux.bkbits.net:8080/linux-2.6/[EMAIL PROTECTED] preempt_schedule takes care of setting PREEMPT_ACTIVE and resetting it afterwards, so I removed that from the assembler code. Then preempt_schedule_irq has been introduced to move the sti/cli back around the call to schedule: http://linux.bkbits.net:8080/linux-2.6/[EMAIL PROTECTED] So in the end the only thing that the patch I proposed was doing is to *additionally* handle the PREEMPT_BKL case so that schedule doesn't accidentally release the BKL semaphore when it shouldn't because we are preempting and nobody explicitly called schedule. Several other archs have done the same. No bug has shown up until the recent -mm kernel where the execution of this code path actually became possible (the "jc -> jnc" fix some lines above). > It looks like a unfortunate interaction with some other patches > in mm. Andrew, can you disable CONFIG_PREEMPT on x86-64 in > mm for now? These things are in 2.6.11 (except that they never got called because of the wrong interrupt flag check in the IRQ handler). > > Unfortunately I don't have a amd64 machine to play with, so can somebody > > please check this? > > How did you generate the crash dumps above then? Well, nobody minds if I play with a webserver in the middle of the night, as long as it works during the day. Shoot me. :) Both servers are running fine since I applied my patch last night. Now that I looked into it I think that it's obviously the correct solution. signature.asc Description: This is a digitally signed message part
inotify issue: iput called atomically
Hi Robert, it looks like you shouldn't call iput with spinlocks held. iput might call down into the filesystem to delete the inode and this can sleep. Mar 27 14:38:18 server Debug: sleeping function called from invalid context at include/asm/semaphore.h:102 Mar 27 14:38:18 server in_atomic():1, irqs_disabled():0 Mar 27 14:38:18 server [] dump_stack+0x17/0x20 Mar 27 14:38:18 server [] __might_sleep+0xa3/0xb0 Mar 27 14:38:18 server [] lock_kernel+0x2b/0x50 Mar 27 14:38:18 server [] reiserfs_delete_inode+0x13/0x100 Mar 27 14:38:18 server [] generic_delete_inode+0xa4/0x160 Mar 27 14:38:18 server [] iput+0x56/0x80 Mar 27 14:38:18 server [] remove_watch_no_event+0x8d/0x100 Mar 27 14:38:18 server [] inotify_ignore+0x49/0x90 Mar 27 14:38:18 server [] inotify_ioctl+0xcd/0x110 Mar 27 14:38:18 server [] do_ioctl+0x76/0x90 Mar 27 14:38:18 server [] vfs_ioctl+0x59/0x1c0 Mar 27 14:38:18 server [] sys_ioctl+0x39/0x60 Mar 27 14:38:18 server [] sysenter_past_esp+0x54/0x75 I've looked through the code and iput can usually be called after releasing spinlocks. It doesn't look trivial to implement though. signature.asc Description: This is a digitally signed message part
inotify issue: iput called atomically
Hi Robert, it looks like you shouldn't call iput with spinlocks held. iput might call down into the filesystem to delete the inode and this can sleep. Mar 27 14:38:18 server Debug: sleeping function called from invalid context at include/asm/semaphore.h:102 Mar 27 14:38:18 server in_atomic():1, irqs_disabled():0 Mar 27 14:38:18 server [c0103e37] dump_stack+0x17/0x20 Mar 27 14:38:18 server [c0118973] __might_sleep+0xa3/0xb0 Mar 27 14:38:18 server [c03fc5ab] lock_kernel+0x2b/0x50 Mar 27 14:38:18 server [c019fc33] reiserfs_delete_inode+0x13/0x100 Mar 27 14:38:18 server [c0172c64] generic_delete_inode+0xa4/0x160 Mar 27 14:38:18 server [c0172f16] iput+0x56/0x80 Mar 27 14:38:18 server [c017f9bd] remove_watch_no_event+0x8d/0x100 Mar 27 14:38:18 server [c01805f9] inotify_ignore+0x49/0x90 Mar 27 14:38:18 server [c018070d] inotify_ioctl+0xcd/0x110 Mar 27 14:38:18 server [c016ac26] do_ioctl+0x76/0x90 Mar 27 14:38:18 server [c016adb9] vfs_ioctl+0x59/0x1c0 Mar 27 14:38:18 server [c016af59] sys_ioctl+0x39/0x60 Mar 27 14:38:18 server [c0102f6b] sysenter_past_esp+0x54/0x75 I've looked through the code and iput can usually be called after releasing spinlocks. It doesn't look trivial to implement though. signature.asc Description: This is a digitally signed message part
Re: x86-64 preemption fix from IRQ and BKL in 2.6.12-rc1-mm2
Am Sonntag, den 27.03.2005, 19:26 +0200 schrieb Andi Kleen: preempt_schedule_irq is not an i386 specific function and seems to take special care of BKL preemption and since reiserfs does use the BKL to do certain things I think this actually might be the problem...? Hmm, preempt_schedule_irq is not in mainline as far as I can see. My patches are always for mainline; i dont do a special patch kit for -mm* PREEMPT_BKL has been in mainline since 2.6.11-rc1, preempt_schedule_irq made it in 2.6.11-rc3. Please look here: http://linux.bkbits.net:8080/linux-2.6/search/?expr=preempt_schedule_irqsearch=ChangeSet+comments For i386 the first change was to switch to preempt_schedule in this code path: http://linux.bkbits.net:8080/linux-2.6/[EMAIL PROTECTED] preempt_schedule takes care of setting PREEMPT_ACTIVE and resetting it afterwards, so I removed that from the assembler code. Then preempt_schedule_irq has been introduced to move the sti/cli back around the call to schedule: http://linux.bkbits.net:8080/linux-2.6/[EMAIL PROTECTED] So in the end the only thing that the patch I proposed was doing is to *additionally* handle the PREEMPT_BKL case so that schedule doesn't accidentally release the BKL semaphore when it shouldn't because we are preempting and nobody explicitly called schedule. Several other archs have done the same. No bug has shown up until the recent -mm kernel where the execution of this code path actually became possible (the jc - jnc fix some lines above). It looks like a unfortunate interaction with some other patches in mm. Andrew, can you disable CONFIG_PREEMPT on x86-64 in mm for now? These things are in 2.6.11 (except that they never got called because of the wrong interrupt flag check in the IRQ handler). Unfortunately I don't have a amd64 machine to play with, so can somebody please check this? How did you generate the crash dumps above then? Well, nobody minds if I play with a webserver in the middle of the night, as long as it works during the day. Shoot me. :) Both servers are running fine since I applied my patch last night. Now that I looked into it I think that it's obviously the correct solution. signature.asc Description: This is a digitally signed message part
[PATCH] Fix preemption off of irq context on x86-64 with PREEMPT_BKL
Hi, > x86_64-fix-config_preempt.patch > x86_64: Fix CONFIG_PREEMPT This patch causes another bug to show up some lines below with CONFIG_PREEMPT_BKL. schedule releases the BKL which it shouldn't do. Call preempt_schedule_irq instead (like for i386). This seems to fix the easily reproducible filesystem errors I've seen (with reiserfs, which heavily relies on the BKL). Signed-off-by: Christophe Saout <[EMAIL PROTECTED]> --- linux-2.6.12-rc1-mm2.orig/arch/x86_64/kernel/entry.S2005-03-24 17:32:22.0 +0100 +++ linux-2.6.12-rc1-mm2/arch/x86_64/kernel/entry.S 2005-03-26 23:40:30.0 +0100 @@ -517,12 +517,7 @@ jnc retint_restore_args bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args - movl $PREEMPT_ACTIVE,threadinfo_preempt_count(%rcx) - sti - call schedule - cli - GET_THREAD_INFO(%rcx) - movl $0,threadinfo_preempt_count(%rcx) + call preempt_schedule_irq jmp exit_intr #endif CFI_ENDPROC signature.asc Description: This is a digitally signed message part
[PATCH] Fix preemption off of irq context on x86-64 with PREEMPT_BKL
Hi, x86_64-fix-config_preempt.patch x86_64: Fix CONFIG_PREEMPT This patch causes another bug to show up some lines below with CONFIG_PREEMPT_BKL. schedule releases the BKL which it shouldn't do. Call preempt_schedule_irq instead (like for i386). This seems to fix the easily reproducible filesystem errors I've seen (with reiserfs, which heavily relies on the BKL). Signed-off-by: Christophe Saout [EMAIL PROTECTED] --- linux-2.6.12-rc1-mm2.orig/arch/x86_64/kernel/entry.S2005-03-24 17:32:22.0 +0100 +++ linux-2.6.12-rc1-mm2/arch/x86_64/kernel/entry.S 2005-03-26 23:40:30.0 +0100 @@ -517,12 +517,7 @@ jnc retint_restore_args bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args - movl $PREEMPT_ACTIVE,threadinfo_preempt_count(%rcx) - sti - call schedule - cli - GET_THREAD_INFO(%rcx) - movl $0,threadinfo_preempt_count(%rcx) + call preempt_schedule_irq jmp exit_intr #endif CFI_ENDPROC signature.asc Description: This is a digitally signed message part
x86-64 preemption fix from IRQ and BKL in 2.6.12-rc1-mm2
Hi, > +x86_64-fix-config_preempt.patch > > x86_64-fix-config_preempt.patch > x86_64: Fix CONFIG_PREEMPT Has this one been stress-tested? I've got the impression that things have become a lot worse. I've been seeing things like these: Mar 25 01:00:48 websrv2 REISERFS: panic (device dm-1): clm-6000: do_balance, fs generation has changed Mar 25 01:00:48 websrv2 Mar 25 01:00:48 websrv2 --- [cut here ] - [please bite here ] - Mar 25 01:00:48 websrv2 Kernel BUG at prints:362 Mar 25 01:00:48 websrv2 invalid operand: [1] PREEMPT Mar 25 01:00:48 websrv2 CPU 0 Mar 25 01:00:48 websrv2 Modules linked in: iptable_nat ipt_MARK iptable_mangle ipt_LOG ipt_multiport ipt_owner ipt_mark ipt_state ipt_REJECT iptable_filter ip_tables twofish serpent blowfish ext3 jbd reiser4 sha256 aes dm_crypt ip_conntrack_irc ip_conntrack_ftp ip_conntrack via_rhine 8139too crc32 Mar 25 01:00:48 websrv2 Pid: 25172, comm: rm Not tainted 2.6.12-rc1-cs1 Mar 25 01:00:48 websrv2 RIP: 0010:[] {reiserfs_panic+211} Mar 25 01:00:48 websrv2 RSP: 0018:81001efe37b8 EFLAGS: 00010292 Mar 25 01:00:48 websrv2 RAX: 0059 RBX: 803fbcac RCX: c100 Mar 25 01:00:48 websrv2 RDX: RSI: 81007d0b31f0 RDI: Mar 25 01:00:48 websrv2 RBP: 81004f960060 R08: 81001efe2000 R09: 0002 Mar 25 01:00:48 websrv2 R10: R11: 80340ef0 R12: 81007f850230 Mar 25 01:00:48 websrv2 R13: 81007f85 R14: R15: 81004f9565d0 Mar 25 01:00:48 websrv2 FS: 2aabaae0() GS:805be800() knlGS:55563dc0 Mar 25 01:00:48 websrv2 CS: 0010 DS: ES: CR0: 8005003b Mar 25 01:00:48 websrv2 CR2: 2aaff008 CR3: 1ebbd000 CR4: 06e0 Mar 25 01:00:48 websrv2 Process rm (pid: 25172, threadinfo 81001efe2000, task 81007d0b31f0) Mar 25 01:00:48 websrv2 Stack: 00300010 81001efe38a8 81001efe37d8 81001c041530 Mar 25 01:00:48 websrv2 81001efe39d8 801d4e42 81007e659a00 0063 Mar 25 01:00:48 websrv2 0063 Mar 25 01:00:48 websrv2 Call Trace:{pathrelse_and_restore+66} {retint_kernel+46} Mar 25 01:00:48 websrv2 {do_balance+39} {do_balance+6901} Mar 25 01:00:48 websrv2 {unfix_nodes+128} {do_balance+10555} Mar 25 01:00:48 websrv2 {reiserfs_cut_from_item+1673} {reiserfs_unlink+362} Mar 25 01:00:48 websrv2 {vfs_unlink+462} {sys_unlink+233} Mar 25 01:00:48 websrv2 {sys_getdents+232} {error_exit+0} Mar 25 01:00:48 websrv2 {system_call+126} Mar 25 01:00:48 websrv2 Mar 25 01:00:48 websrv2 Code: 0f 0b b8 c1 3f 80 ff ff ff ff 6a 01 4d 85 ed 48 c7 c2 40 ba Mar 25 01:00:48 websrv2 RIP {reiserfs_panic+211} RSP or Mar 25 16:39:21 websrv2 VFS: brelse: Trying to free free buffer Mar 25 16:39:21 websrv2 Badness in __brelse at fs/buffer.c:1295 Mar 25 16:39:21 websrv2 Mar 25 16:39:21 websrv2 Call Trace:{__find_get_block+479} {__getblk+37} Mar 25 16:39:21 websrv2 {do_journal_end+2181} {keventd_create_kthread+0} Mar 25 16:39:21 websrv2 {reiserfs_sync_fs+64} {sync_supers+211} Mar 25 16:39:21 websrv2 {wb_kupdate+42} {pdflush+399} Mar 25 16:39:21 websrv2 {wb_kupdate+0} {keventd_create_kthread+0} Mar 25 16:39:21 websrv2 {pdflush+0} {kthread+205} Mar 25 16:39:21 websrv2 {child_rip+8} {keventd_create_kthread+0} Mar 25 16:39:21 websrv2 {kthread+0} {child_rip+0} Fortunately the kernel locked up and there was no data corruption. I've got PREEMPT and PREEMPT_BKL enabled under UP. I just took a look at the change and found this: x86-64 does this (in entry.S): bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args movl $PREEMPT_ACTIVE,threadinfo_preempt_count(%rcx) sti call schedule cli GET_THREAD_INFO(%rcx) movl $0,threadinfo_preempt_count(%rcx) jmp exit_intr while i386 does this: testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ? jz restore_all call preempt_schedule_irq jmp need_resched preempt_schedule_irq is not an i386 specific function and seems to take special care of BKL preemption and since reiserfs does use the BKL to do certain things I think this actually might be the problem...? I'm not saying that this fix is wrong (it is obviously the right fix) but it causes another problem to show up. Unfortunately I don't have a amd64 machine to play with, so can somebody please check this? signature.asc Description: This is a digitally signed message part
x86-64 preemption fix from IRQ and BKL in 2.6.12-rc1-mm2
Hi, +x86_64-fix-config_preempt.patch x86_64-fix-config_preempt.patch x86_64: Fix CONFIG_PREEMPT Has this one been stress-tested? I've got the impression that things have become a lot worse. I've been seeing things like these: Mar 25 01:00:48 websrv2 REISERFS: panic (device dm-1): clm-6000: do_balance, fs generation has changed Mar 25 01:00:48 websrv2 Mar 25 01:00:48 websrv2 --- [cut here ] - [please bite here ] - Mar 25 01:00:48 websrv2 Kernel BUG at prints:362 Mar 25 01:00:48 websrv2 invalid operand: [1] PREEMPT Mar 25 01:00:48 websrv2 CPU 0 Mar 25 01:00:48 websrv2 Modules linked in: iptable_nat ipt_MARK iptable_mangle ipt_LOG ipt_multiport ipt_owner ipt_mark ipt_state ipt_REJECT iptable_filter ip_tables twofish serpent blowfish ext3 jbd reiser4 sha256 aes dm_crypt ip_conntrack_irc ip_conntrack_ftp ip_conntrack via_rhine 8139too crc32 Mar 25 01:00:48 websrv2 Pid: 25172, comm: rm Not tainted 2.6.12-rc1-cs1 Mar 25 01:00:48 websrv2 RIP: 0010:[801cfe13] 801cfe13{reiserfs_panic+211} Mar 25 01:00:48 websrv2 RSP: 0018:81001efe37b8 EFLAGS: 00010292 Mar 25 01:00:48 websrv2 RAX: 0059 RBX: 803fbcac RCX: c100 Mar 25 01:00:48 websrv2 RDX: RSI: 81007d0b31f0 RDI: Mar 25 01:00:48 websrv2 RBP: 81004f960060 R08: 81001efe2000 R09: 0002 Mar 25 01:00:48 websrv2 R10: R11: 80340ef0 R12: 81007f850230 Mar 25 01:00:48 websrv2 R13: 81007f85 R14: R15: 81004f9565d0 Mar 25 01:00:48 websrv2 FS: 2aabaae0() GS:805be800() knlGS:55563dc0 Mar 25 01:00:48 websrv2 CS: 0010 DS: ES: CR0: 8005003b Mar 25 01:00:48 websrv2 CR2: 2aaff008 CR3: 1ebbd000 CR4: 06e0 Mar 25 01:00:48 websrv2 Process rm (pid: 25172, threadinfo 81001efe2000, task 81007d0b31f0) Mar 25 01:00:48 websrv2 Stack: 00300010 81001efe38a8 81001efe37d8 81001c041530 Mar 25 01:00:48 websrv2 81001efe39d8 801d4e42 81007e659a00 0063 Mar 25 01:00:48 websrv2 0063 Mar 25 01:00:48 websrv2 Call Trace:801d4e42{pathrelse_and_restore+66} 8010efe6{retint_kernel+46} Mar 25 01:00:48 websrv2 801bb847{do_balance+39} 801bd315{do_balance+6901} Mar 25 01:00:48 websrv2 801cbd90{unfix_nodes+128} 801be15b{do_balance+10555} Mar 25 01:00:48 websrv2 801d7bf9{reiserfs_cut_from_item+1673} 801bfcfa{reiserfs_unlink+362} Mar 25 01:00:48 websrv2 801873ae{vfs_unlink+462} 801874f9{sys_unlink+233} Mar 25 01:00:48 websrv2 8018a268{sys_getdents+232} 8010f221{error_exit+0} Mar 25 01:00:48 websrv2 8010e906{system_call+126} Mar 25 01:00:48 websrv2 Mar 25 01:00:48 websrv2 Code: 0f 0b b8 c1 3f 80 ff ff ff ff 6a 01 4d 85 ed 48 c7 c2 40 ba Mar 25 01:00:48 websrv2 RIP 801cfe13{reiserfs_panic+211} RSP 81001efe37b8 or Mar 25 16:39:21 websrv2 VFS: brelse: Trying to free free buffer Mar 25 16:39:21 websrv2 Badness in __brelse at fs/buffer.c:1295 Mar 25 16:39:21 websrv2 Mar 25 16:39:21 websrv2 Call Trace:8017787f{__find_get_block+479} 8017a175{__getblk+37} Mar 25 16:39:21 websrv2 801de3d5{do_journal_end+2181} 80147d70{keventd_create_kthread+0} Mar 25 16:39:21 websrv2 801cbf50{reiserfs_sync_fs+64} 8017c0b3{sync_supers+211} Mar 25 16:39:21 websrv2 8015a22a{wb_kupdate+42} 8015ae8f{pdflush+399} Mar 25 16:39:21 websrv2 8015a200{wb_kupdate+0} 80147d70{keventd_create_kthread+0} Mar 25 16:39:21 websrv2 8015ad00{pdflush+0} 80147d2d{kthread+205} Mar 25 16:39:21 websrv2 8010f3d7{child_rip+8} 80147d70{keventd_create_kthread+0} Mar 25 16:39:21 websrv2 80147c60{kthread+0} 8010f3cf{child_rip+0} Fortunately the kernel locked up and there was no data corruption. I've got PREEMPT and PREEMPT_BKL enabled under UP. I just took a look at the change and found this: x86-64 does this (in entry.S): bt $9,EFLAGS-ARGOFFSET(%rsp) /* interrupts off? */ jnc retint_restore_args movl $PREEMPT_ACTIVE,threadinfo_preempt_count(%rcx) sti call schedule cli GET_THREAD_INFO(%rcx) movl $0,threadinfo_preempt_count(%rcx) jmp exit_intr while i386 does this: testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ? jz restore_all call preempt_schedule_irq jmp need_resched preempt_schedule_irq is not an i386 specific function and seems to take special care of BKL preemption and since reiserfs does use the BKL to do certain things I think this actually might be the problem...? I'm not saying that this fix is wrong (it is obviously the right fix) but it causes another problem to show up. Unfortunately I don't have
Re: [0/many] Acrypto - asynchronous crypto layer for linux kernel 2.6
Am Dienstag, den 08.03.2005, 00:08 -0500 schrieb Kyle Moffett: > Did you include support for the new key/keyring infrastructure > introduced > a couple versions ago by David Howells? It allows userspace to create > and > manage various sorts of "keys" in kernelspace. If you create and > register > a few keytypes for various symmetric and asymmetric ciphers, you could > then > take advantage of its support for securely passing keys around in and > out > of userspace. I've written a dm-crypt patch some weeks ago that does what you describe. The crypto information (cipher and key) is added to a keyring and then the device is constructed using a reference to this key. I had some issues with the keyring code (mainly a deadlock problem with crypto module autoloading): http://lkml.org/lkml/2005/2/4/113 I would also like to switch dm-crypt to acrypto once it's accepted into the kernel. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [0/many] Acrypto - asynchronous crypto layer for linux kernel 2.6
Am Dienstag, den 08.03.2005, 00:08 -0500 schrieb Kyle Moffett: Did you include support for the new key/keyring infrastructure introduced a couple versions ago by David Howells? It allows userspace to create and manage various sorts of keys in kernelspace. If you create and register a few keytypes for various symmetric and asymmetric ciphers, you could then take advantage of its support for securely passing keys around in and out of userspace. I've written a dm-crypt patch some weeks ago that does what you describe. The crypto information (cipher and key) is added to a keyring and then the device is constructed using a reference to this key. I had some issues with the keyring code (mainly a deadlock problem with crypto module autoloading): http://lkml.org/lkml/2005/2/4/113 I would also like to switch dm-crypt to acrypto once it's accepted into the kernel. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [PATCH 1/8] lib/sort: Heapsort implementation of sort()
Am Sonntag, den 27.02.2005, 13:25 -0800 schrieb Matt Mackall: > Which kernel? There was an off-by-one for odd array sizes in the > original posted version that was quickly spotted: > > http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc4/2.6.11-rc4-mm1/broken-out/sort-fix.patch > > I've since tested all sizes 1 - 1000 with 100 random arrays each, so > I'm fairly confident it's now fixed. - int i = (num/2 - 1) * size, n = num * size, c, r; + int i = (num/2) * size, n = num * size, c, r; What's probably meant is: ((num - 1)/2) * size `i' must cover half of the array or a little bit more (not less, in case of odd numbers). `i' (before my patch) is the highest address to start with, so that's why it should be ((num + 1)/2 - 1) * size or simpler: ((num - 1)/2) * size Anyway, I was wondering, is there a specific reason you are not using size_t for the offset variables? size is a size_t and the only purpose of the variables i, n, c and r is to be compared or added to the start pointer (also I think it's just ugly to cast size_t down to an int). On system where int is 32 bit but pointers are 64 bit the compiler might need to extend to adjust the size of the operands for the address calculation. Right? Since size_t is unsigned I also had to modify the two loops since I can't check for any of the variables becoming negative. Tested with all kinds of array sizes. Signed-off-by: Christophe Saout <[EMAIL PROTECTED]> diff -Nur linux-2.6.11-rc4-mm1.orig/include/linux/sort.h linux-2.6.11-rc4-mm1/include/linux/sort.h --- linux-2.6.11-rc4-mm1.orig/include/linux/sort.h 2005-03-01 19:45:11.0 +0100 +++ linux-2.6.11-rc4-mm1/include/linux/sort.h 2005-03-01 19:47:13.456097896 +0100 @@ -5,6 +5,6 @@ void sort(void *base, size_t num, size_t size, int (*cmp)(const void *, const void *), - void (*swap)(void *, void *, int)); + void (*swap)(void *, void *, size_t)); #endif diff -Nur linux-2.6.11-rc4-mm1.orig/lib/sort.c linux-2.6.11-rc4-mm1/lib/sort.c --- linux-2.6.11-rc4-mm1.orig/lib/sort.c2005-03-01 19:46:05.0 +0100 +++ linux-2.6.11-rc4-mm1/lib/sort.c 2005-03-01 19:47:55.688677568 +0100 @@ -7,14 +7,14 @@ #include #include -void u32_swap(void *a, void *b, int size) +void u32_swap(void *a, void *b, size_t size) { u32 t = *(u32 *)a; *(u32 *)a = *(u32 *)b; *(u32 *)b = t; } -void generic_swap(void *a, void *b, int size) +void generic_swap(void *a, void *b, size_t size) { char t; @@ -44,17 +44,19 @@ void sort(void *base, size_t num, size_t size, int (*cmp)(const void *, const void *), - void (*swap)(void *, void *, int size)) + void (*swap)(void *, void *, size_t size)) { /* pre-scale counters for performance */ - int i = (num/2) * size, n = num * size, c, r; + size_t i = ((num + 1)/2) * size, n = num * size, c, r; if (!swap) swap = (size == 4 ? u32_swap : generic_swap); /* heapify */ - for ( ; i >= 0; i -= size) { - for (r = i; r * 2 < n; r = c) { + while(i > 0) { + i -= size; + + for (r = i; r * 2 < n; r = c) { c = r * 2; if (c < n - size && cmp(base + c, base + c + size) < 0) c += size; @@ -65,7 +67,9 @@ } /* sort */ - for (i = n - size; i >= 0; i -= size) { + for (i = n; i > 0;) { + i -= size; + swap(base, base + i, size); for (r = 0; r * 2 < i; r = c) { c = r * 2; signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [PATCH 1/8] lib/sort: Heapsort implementation of sort()
Am Sonntag, den 27.02.2005, 13:25 -0800 schrieb Matt Mackall: Which kernel? There was an off-by-one for odd array sizes in the original posted version that was quickly spotted: http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.11-rc4/2.6.11-rc4-mm1/broken-out/sort-fix.patch I've since tested all sizes 1 - 1000 with 100 random arrays each, so I'm fairly confident it's now fixed. - int i = (num/2 - 1) * size, n = num * size, c, r; + int i = (num/2) * size, n = num * size, c, r; What's probably meant is: ((num - 1)/2) * size `i' must cover half of the array or a little bit more (not less, in case of odd numbers). `i' (before my patch) is the highest address to start with, so that's why it should be ((num + 1)/2 - 1) * size or simpler: ((num - 1)/2) * size Anyway, I was wondering, is there a specific reason you are not using size_t for the offset variables? size is a size_t and the only purpose of the variables i, n, c and r is to be compared or added to the start pointer (also I think it's just ugly to cast size_t down to an int). On system where int is 32 bit but pointers are 64 bit the compiler might need to extend to adjust the size of the operands for the address calculation. Right? Since size_t is unsigned I also had to modify the two loops since I can't check for any of the variables becoming negative. Tested with all kinds of array sizes. Signed-off-by: Christophe Saout [EMAIL PROTECTED] diff -Nur linux-2.6.11-rc4-mm1.orig/include/linux/sort.h linux-2.6.11-rc4-mm1/include/linux/sort.h --- linux-2.6.11-rc4-mm1.orig/include/linux/sort.h 2005-03-01 19:45:11.0 +0100 +++ linux-2.6.11-rc4-mm1/include/linux/sort.h 2005-03-01 19:47:13.456097896 +0100 @@ -5,6 +5,6 @@ void sort(void *base, size_t num, size_t size, int (*cmp)(const void *, const void *), - void (*swap)(void *, void *, int)); + void (*swap)(void *, void *, size_t)); #endif diff -Nur linux-2.6.11-rc4-mm1.orig/lib/sort.c linux-2.6.11-rc4-mm1/lib/sort.c --- linux-2.6.11-rc4-mm1.orig/lib/sort.c2005-03-01 19:46:05.0 +0100 +++ linux-2.6.11-rc4-mm1/lib/sort.c 2005-03-01 19:47:55.688677568 +0100 @@ -7,14 +7,14 @@ #include linux/kernel.h #include linux/module.h -void u32_swap(void *a, void *b, int size) +void u32_swap(void *a, void *b, size_t size) { u32 t = *(u32 *)a; *(u32 *)a = *(u32 *)b; *(u32 *)b = t; } -void generic_swap(void *a, void *b, int size) +void generic_swap(void *a, void *b, size_t size) { char t; @@ -44,17 +44,19 @@ void sort(void *base, size_t num, size_t size, int (*cmp)(const void *, const void *), - void (*swap)(void *, void *, int size)) + void (*swap)(void *, void *, size_t size)) { /* pre-scale counters for performance */ - int i = (num/2) * size, n = num * size, c, r; + size_t i = ((num + 1)/2) * size, n = num * size, c, r; if (!swap) swap = (size == 4 ? u32_swap : generic_swap); /* heapify */ - for ( ; i = 0; i -= size) { - for (r = i; r * 2 n; r = c) { + while(i 0) { + i -= size; + + for (r = i; r * 2 n; r = c) { c = r * 2; if (c n - size cmp(base + c, base + c + size) 0) c += size; @@ -65,7 +67,9 @@ } /* sort */ - for (i = n - size; i = 0; i -= size) { + for (i = n; i 0;) { + i -= size; + swap(base, base + i, size); for (r = 0; r * 2 i; r = c) { c = r * 2; signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [PATCH 01/04] Adding cipher mode context information to crypto_tfm
Am Mittwoch, den 09.02.2005, 17:19 -0800 schrieb Andrew Morton: > > It must be > > possible to process more than 2 mappings in softirq context. > > Adding a few more fixmap slots wouldn't hurt anyone. But if you want an > arbitrarily large number of them then no, we cannot do that. > > Possibly one could arrange for the pages to not be in highmem at all. In the case of the LRW use in dm-crypt only plain- and ciphertext need to be accessible through struct page (both are addressed through BIO vectors). The LRW tweaks could simply be kmalloced. So instead of passing "struct scatterlist *tweaksg" we could just pass a "void *tweaksg". Some of the hard work on the generic scatterlist walker would be for nothing then. :( signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [PATCH 01/04] Adding cipher mode context information to crypto_tfm
Am Mittwoch, den 09.02.2005, 17:19 -0800 schrieb Andrew Morton: It must be possible to process more than 2 mappings in softirq context. Adding a few more fixmap slots wouldn't hurt anyone. But if you want an arbitrarily large number of them then no, we cannot do that. Possibly one could arrange for the pages to not be in highmem at all. In the case of the LRW use in dm-crypt only plain- and ciphertext need to be accessible through struct page (both are addressed through BIO vectors). The LRW tweaks could simply be kmalloced. So instead of passing struct scatterlist *tweaksg we could just pass a void *tweaksg. Some of the hard work on the generic scatterlist walker would be for nothing then. :( signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [dm-crypt] Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 20:05 -0800 schrieb Matt Mackall: > Dunno here, seems that having one tool that gave the kernel a key named > "foo" and then telling dm-crypt to use key "foo" is probably not a bad > way to go. Then we don't have stuff like "echo | dmsetup create" > and the like and the key-handling smarts can all be put in one > separate place. > > Getting from here to there might be interesting though. Perhaps we can > teach dm-crypt to understand keys of the form "keyname:"? in > addition to raw keys to keep compatibility. Might even be possible to > push this down into crypt_decode_key() (or a smarter variant of same). I've hacked together a working implementation. It's just a prototype though. The patch requires that the key retention service is activated unconditionally. It would probably be better to introduce #ifdefs and the possibility turn of the cleartext key way in the kernel configuration. I've separated out the crypto configuration from the device-mapper configuration. Several device mappings can now share/reference the same key. This might be useful if you want to span your mapping over several devices (without stacking device mappings on top of each other). So basically the crypto and the device-mapper part of the configuration are in its own "classes" now. It makes even more sense now to rename IV operations to crypto/key operations (which only work on the new structure) like it's done in the LRW patch. This crypto configuration that I separated out can be referenced by the device-mapper target directly (old way) or indirectly by the new key retention service. It holds the cipher, chaining mode and IV mechanisms as well as the key itself. A dm table configuration line then looks like "0 100 crypt key my_key_name 0 /dev/bla 0". And dm-crypt.c is slowly getting large. Thirty functions already. Hmm. --- linux-2.6.11-rc3.orig/drivers/md/dm-crypt.c 2005-02-04 01:00:51.983910752 +0100 +++ linux-2.6.11-rc3/drivers/md/dm-crypt.c 2005-02-04 14:30:47.629117856 +0100 @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -48,14 +49,25 @@ int write; }; -struct crypt_config; +/* + * key and associated crypto information + */ +struct crypt_key { + struct crypt_iv_operations *iv_gen_ops; + char *iv_mode; + void *iv_gen_private; + unsigned int iv_size; + + struct crypto_tfm *tfm; + unsigned int key_size; + u8 key[0]; +}; struct crypt_iv_operations { - int (*ctr)(struct crypt_config *cc, struct dm_target *ti, - const char *opts); - void (*dtr)(struct crypt_config *cc); - const char *(*status)(struct crypt_config *cc); - int (*generator)(struct crypt_config *cc, u8 *iv, sector_t sector); + int (*ctr)(struct crypt_key *ck, const char *opts, char **error_msg); + void (*dtr)(struct crypt_key *ck); + const char *(*status)(struct crypt_key *ck); + int (*generator)(struct crypt_key *ck, u8 *iv, sector_t sector); }; /* @@ -73,18 +85,9 @@ mempool_t *io_pool; mempool_t *page_pool; - /* - * crypto related data - */ - struct crypt_iv_operations *iv_gen_ops; - char *iv_mode; - void *iv_gen_private; sector_t iv_offset; - unsigned int iv_size; - - struct crypto_tfm *tfm; - unsigned int key_size; - u8 key[0]; + struct crypt_key *crypto; + struct key *key; }; #define MIN_IOS256 @@ -121,16 +124,16 @@ * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454 */ -static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv, sector_t sector) +static int crypt_iv_plain_gen(struct crypt_key *ck, u8 *iv, sector_t sector) { - memset(iv, 0, cc->iv_size); + memset(iv, 0, ck->iv_size); *(u32 *)iv = cpu_to_le32(sector & 0x); return 0; } -static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti, - const char *opts) +static int crypt_iv_essiv_ctr(struct crypt_key *ck, const char *opts, + char **error_msg) { struct crypto_tfm *essiv_tfm; struct crypto_tfm *hash_tfm; @@ -139,19 +142,19 @@ u8 *salt; if (opts == NULL) { - ti->error = PFX "Digest algorithm missing for ESSIV mode"; + *error_msg = PFX "Digest algorithm missing for ESSIV mode"; return -EINVAL; } /* Hash the cipher key with the given hash algorithm */ hash_tfm = crypto_alloc_tfm(opts, 0); if (hash_tfm == NULL) { - ti->error = PFX "Error initializing ESSIV hash"; + *error_msg = PFX "Error initializing ESSIV hash"; return -EINVAL; } if (crypto_tfm_alg_type(hash_tfm) != CRYPTO_ALG_TYPE_DIGEST) { - ti->error = PFX "Expected digest algorithm for ESSIV hash"; + *error_msg = PFX "Expected digest algorithm for ESSIV hash"; crypto_free_tfm(hash_tfm); return -EINVAL; } @@ -159,63 +162,63 @@ saltsize = crypto_tfm_alg_digestsize(hash_tfm); salt = kmalloc(saltsize, GFP_KERNEL); if (salt == NULL) { - ti->error = PFX "Error kmallocing salt storage in ESSIV"; + *error_msg = PFX "Error kmallocing salt storage in ESSIV";
Questions about the Linux key retention services (and dm-crypt)
Hi, I was investigating a way to hide the dm-crypt key from device-mapper configuration IOCTLs since the key might accidentally end up somewhere it shouldn't (see other thread). Then I stumbled across the new key retention service. This is exectly what I was looking for. The idea is to add the crypto configuration as a key and then give the device-mapper a reference to that key. So everybody can then safely read and copy the device-mapper configuration without risking to compromise the key. I just hacked something up which seems to work, just a prototype though. (If someone wants to have a look at it: http://www.saout.de/misc/dm-crypt-key-retention-v1.diff ). There are some minor issues and questions: I'd like to create the crypto tfm in the instantiation function (to verify the validity of the cipher and key immediately). Cryptoapi might call modprobe to load a cipher. Then it deadlocks. The reason is the instantiation semaphore (kernel tries to create a keyring for modprobe). Can we somehow get rid of it? The other question I have is how should I describe references to the key. My idea was to use its description. The userspace application adds the key to one of its keyrings and tells dm-crypt its name. dm-crypt calls search_process_keyrings to retrieve the key. The alternative would be to use the key's serial id. What do you think (as designer of the API) would be the better solution? What I like is the key refcounting. The process creating the dm-crypt mapping can put the key in its process keyring and when it exits the keyring is destroyed, so that the key is "floating" (held by dm-crypt). Once the dm-crypt mapping is removed the key is also destroy. No risk of having unused keys hanging around. The problem is that if an application wants to modify the dm-crypt mapping is that it needs to get a reference to the key: With the serial id it would be easy, it just links the key to one of its keyrings and can then destroy the dm-crypt mapping without having to worry that the keys gets destroyed. But if I reference the key by its description, dm-crypt would need to link the key to one of the caller's (which asks dm-crypt for the key) keyrings itself so that the caller can then find the key. This sounds a bit ugly to me. I'd prefer to not keep the keys in the root user keyring to make sure the key gets destroyed when nobody is using it. So which solution is better? Should I use the key description or serial id for references? The downside of using the serial id would be that any application could reference any key, even keys it does not own. The problem with using the description is described above. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Questions about the Linux key retention services (and dm-crypt)
Hi, I was investigating a way to hide the dm-crypt key from device-mapper configuration IOCTLs since the key might accidentally end up somewhere it shouldn't (see other thread). Then I stumbled across the new key retention service. This is exectly what I was looking for. The idea is to add the crypto configuration as a key and then give the device-mapper a reference to that key. So everybody can then safely read and copy the device-mapper configuration without risking to compromise the key. I just hacked something up which seems to work, just a prototype though. (If someone wants to have a look at it: http://www.saout.de/misc/dm-crypt-key-retention-v1.diff ). There are some minor issues and questions: I'd like to create the crypto tfm in the instantiation function (to verify the validity of the cipher and key immediately). Cryptoapi might call modprobe to load a cipher. Then it deadlocks. The reason is the instantiation semaphore (kernel tries to create a keyring for modprobe). Can we somehow get rid of it? The other question I have is how should I describe references to the key. My idea was to use its description. The userspace application adds the key to one of its keyrings and tells dm-crypt its name. dm-crypt calls search_process_keyrings to retrieve the key. The alternative would be to use the key's serial id. What do you think (as designer of the API) would be the better solution? What I like is the key refcounting. The process creating the dm-crypt mapping can put the key in its process keyring and when it exits the keyring is destroyed, so that the key is floating (held by dm-crypt). Once the dm-crypt mapping is removed the key is also destroy. No risk of having unused keys hanging around. The problem is that if an application wants to modify the dm-crypt mapping is that it needs to get a reference to the key: With the serial id it would be easy, it just links the key to one of its keyrings and can then destroy the dm-crypt mapping without having to worry that the keys gets destroyed. But if I reference the key by its description, dm-crypt would need to link the key to one of the caller's (which asks dm-crypt for the key) keyrings itself so that the caller can then find the key. This sounds a bit ugly to me. I'd prefer to not keep the keys in the root user keyring to make sure the key gets destroyed when nobody is using it. So which solution is better? Should I use the key description or serial id for references? The downside of using the serial id would be that any application could reference any key, even keys it does not own. The problem with using the description is described above. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: [dm-crypt] Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 20:05 -0800 schrieb Matt Mackall: Dunno here, seems that having one tool that gave the kernel a key named foo and then telling dm-crypt to use key foo is probably not a bad way to go. Then we don't have stuff like echo key | dmsetup create and the like and the key-handling smarts can all be put in one separate place. Getting from here to there might be interesting though. Perhaps we can teach dm-crypt to understand keys of the form keyname:foo? in addition to raw keys to keep compatibility. Might even be possible to push this down into crypt_decode_key() (or a smarter variant of same). I've hacked together a working implementation. It's just a prototype though. The patch requires that the key retention service is activated unconditionally. It would probably be better to introduce #ifdefs and the possibility turn of the cleartext key way in the kernel configuration. I've separated out the crypto configuration from the device-mapper configuration. Several device mappings can now share/reference the same key. This might be useful if you want to span your mapping over several devices (without stacking device mappings on top of each other). So basically the crypto and the device-mapper part of the configuration are in its own classes now. It makes even more sense now to rename IV operations to crypto/key operations (which only work on the new structure) like it's done in the LRW patch. This crypto configuration that I separated out can be referenced by the device-mapper target directly (old way) or indirectly by the new key retention service. It holds the cipher, chaining mode and IV mechanisms as well as the key itself. A dm table configuration line then looks like 0 100 crypt key my_key_name 0 /dev/bla 0. And dm-crypt.c is slowly getting large. Thirty functions already. Hmm. --- linux-2.6.11-rc3.orig/drivers/md/dm-crypt.c 2005-02-04 01:00:51.983910752 +0100 +++ linux-2.6.11-rc3/drivers/md/dm-crypt.c 2005-02-04 14:30:47.629117856 +0100 @@ -13,6 +13,7 @@ #include linux/mempool.h #include linux/slab.h #include linux/crypto.h +#include linux/key.h #include linux/workqueue.h #include asm/atomic.h #include asm/scatterlist.h @@ -48,14 +49,25 @@ int write; }; -struct crypt_config; +/* + * key and associated crypto information + */ +struct crypt_key { + struct crypt_iv_operations *iv_gen_ops; + char *iv_mode; + void *iv_gen_private; + unsigned int iv_size; + + struct crypto_tfm *tfm; + unsigned int key_size; + u8 key[0]; +}; struct crypt_iv_operations { - int (*ctr)(struct crypt_config *cc, struct dm_target *ti, - const char *opts); - void (*dtr)(struct crypt_config *cc); - const char *(*status)(struct crypt_config *cc); - int (*generator)(struct crypt_config *cc, u8 *iv, sector_t sector); + int (*ctr)(struct crypt_key *ck, const char *opts, char **error_msg); + void (*dtr)(struct crypt_key *ck); + const char *(*status)(struct crypt_key *ck); + int (*generator)(struct crypt_key *ck, u8 *iv, sector_t sector); }; /* @@ -73,18 +85,9 @@ mempool_t *io_pool; mempool_t *page_pool; - /* - * crypto related data - */ - struct crypt_iv_operations *iv_gen_ops; - char *iv_mode; - void *iv_gen_private; sector_t iv_offset; - unsigned int iv_size; - - struct crypto_tfm *tfm; - unsigned int key_size; - u8 key[0]; + struct crypt_key *crypto; + struct key *key; }; #define MIN_IOS256 @@ -121,16 +124,16 @@ * http://article.gmane.org/gmane.linux.kernel.device-mapper.dm-crypt/454 */ -static int crypt_iv_plain_gen(struct crypt_config *cc, u8 *iv, sector_t sector) +static int crypt_iv_plain_gen(struct crypt_key *ck, u8 *iv, sector_t sector) { - memset(iv, 0, cc-iv_size); + memset(iv, 0, ck-iv_size); *(u32 *)iv = cpu_to_le32(sector 0x); return 0; } -static int crypt_iv_essiv_ctr(struct crypt_config *cc, struct dm_target *ti, - const char *opts) +static int crypt_iv_essiv_ctr(struct crypt_key *ck, const char *opts, + char **error_msg) { struct crypto_tfm *essiv_tfm; struct crypto_tfm *hash_tfm; @@ -139,19 +142,19 @@ u8 *salt; if (opts == NULL) { - ti-error = PFX Digest algorithm missing for ESSIV mode; + *error_msg = PFX Digest algorithm missing for ESSIV mode; return -EINVAL; } /* Hash the cipher key with the given hash algorithm */ hash_tfm = crypto_alloc_tfm(opts, 0); if (hash_tfm == NULL) { - ti-error = PFX Error initializing ESSIV hash; + *error_msg = PFX Error initializing ESSIV hash; return -EINVAL; } if (crypto_tfm_alg_type(hash_tfm) != CRYPTO_ALG_TYPE_DIGEST) { - ti-error = PFX Expected digest algorithm for ESSIV hash; + *error_msg = PFX Expected digest algorithm for ESSIV hash; crypto_free_tfm(hash_tfm); return -EINVAL; } @@ -159,63 +162,63 @@ saltsize = crypto_tfm_alg_digestsize(hash_tfm); salt = kmalloc(saltsize, GFP_KERNEL); if (salt == NULL) { - ti-error = PFX Error kmallocing salt storage in ESSIV; +
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 20:05 -0800 schrieb Matt Mackall: > On Thu, Feb 03, 2005 at 03:34:29AM +0100, Christophe Saout wrote: > > The keyring API seems very flexible. You can define your own type of > > keys and give them names. Well, the name is probably irrelevant here and > > should be chosen randomly but it's less likely to collide with someone > > else. > > Dunno here, seems that having one tool that gave the kernel a key named > "foo" and then telling dm-crypt to use key "foo" is probably not a bad > way to go. Then we don't have stuff like "echo | dmsetup create" > and the like and the key-handling smarts can all be put in one > separate place. Yes. I could also change cryptsetup to not mlockall the whole application just because the key is passed down to libdevmapper which does not treat parameters with special care. > Getting from here to there might be interesting though. Perhaps we can > teach dm-crypt to understand keys of the form "keyname:"? in > addition to raw keys to keep compatibility. Might even be possible to > push this down into crypt_decode_key() (or a smarter variant of same). > > Meanwhile, I'd still like to hide the raw key in crypt_status(). Well, I don't. I don't know any tools that actually use the DM_DEVICE_TABLE command except cryptsetup. I don't like to make the interface inconsistent just because there might be an incompetent root sitting in front of the machine. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 20:05 -0800 schrieb Matt Mackall: On Thu, Feb 03, 2005 at 03:34:29AM +0100, Christophe Saout wrote: The keyring API seems very flexible. You can define your own type of keys and give them names. Well, the name is probably irrelevant here and should be chosen randomly but it's less likely to collide with someone else. Dunno here, seems that having one tool that gave the kernel a key named foo and then telling dm-crypt to use key foo is probably not a bad way to go. Then we don't have stuff like echo key | dmsetup create and the like and the key-handling smarts can all be put in one separate place. Yes. I could also change cryptsetup to not mlockall the whole application just because the key is passed down to libdevmapper which does not treat parameters with special care. Getting from here to there might be interesting though. Perhaps we can teach dm-crypt to understand keys of the form keyname:foo? in addition to raw keys to keep compatibility. Might even be possible to push this down into crypt_decode_key() (or a smarter variant of same). Meanwhile, I'd still like to hide the raw key in crypt_status(). Well, I don't. I don't know any tools that actually use the DM_DEVICE_TABLE command except cryptsetup. I don't like to make the interface inconsistent just because there might be an incompetent root sitting in front of the machine. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 17:52 -0800 schrieb Matt Mackall: > > An alternativ would be to use some form of handle to point to the key > > after it has been given to the kernel. But that would require some more > > infrastructure. > > There's been some talk about such infrastructure already. I believe > some pieces of it may already be in place. Yes, you are right. I didn't follow the discussion but it actually looks very promising. The keys in the infrastructure are reference-counted. That's good. The keyrings can be attached to either thread, processes, sessions or users. It seems that it's possible to have floating keys (not attached to any keyring). So we would just need to figure out how to use these keyrings to allow communication with userspace applications. Process keyrings seem to have the advantage that the keyring is dropped when it exits so that all keys that are not in use by the kernel are also dropped. A keyring for the root user would have the problem that if the cryptsetup application aborts in the middle you would end up with old keys lying around forever. The keyring API seems very flexible. You can define your own type of keys and give them names. Well, the name is probably irrelevant here and should be chosen randomly but it's less likely to collide with someone else. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 13:19 -0800 schrieb Matt Mackall: > From looking at the dm_crypt code, it appears that it can be > interrogated to report the current key. Some quick testing shows: > > # dmsetup table /dev/mapper/volume1 > 0 200 crypt aes-plain 0123456789abcdef0123456789abcdef 0 7:0 0 > > Obviously, root can in principle recover this password from the > running kernel but it seems silly to make it so easy. I already tried that. It took me about five minutes using a shell, dd and hexdump to get the key out of the running kernel... > Moreover, it seems this facility exists to support some form of > automated table storage (LVM?). As we don't want anyone/anything > accidentally storing our passwords on disk in the clear, we probably > shouldn't facilitate this. Perhaps we can stick something here like > "" that the dm_crypt constructor can reject. Yes, the reason is that the device-mapper supports on-the-fly modifications of the device. cryptsetup has a command to resize the mapping for example. It can do that without asking for the password again. Features like this are the reason I'm doing this. Userspace tools should be able to assume that they can use the result of a table status command to create a new table with this information. An alternativ would be to use some form of handle to point to the key after it has been given to the kernel. But that would require some more infrastructure. I could imagine something like this: Some kernel infrastructure for key management. It can hold keys which are referenced by some form of handle. The keys are refcounted and wiped if the reference count drops to zero. If you want to create a dm-crypt mapping (or something else that uses some form of cryptographic key) you create a new handle and assign the key. Then you give the handle to dm-crypt which increments reference count. When cryptsetup exits its reference to the key is dropped but the key still has a reference from the active dm-crypt mapping. Later on another application could then safely do something with that handle. As long as an application or in-kernel user references the key it won't be dropped so it's safe for a userspace application to play around with it. BTW: The setkey command also seems to return the keys in use for IPSEC connections. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 13:19 -0800 schrieb Matt Mackall: From looking at the dm_crypt code, it appears that it can be interrogated to report the current key. Some quick testing shows: # dmsetup table /dev/mapper/volume1 0 200 crypt aes-plain 0123456789abcdef0123456789abcdef 0 7:0 0 Obviously, root can in principle recover this password from the running kernel but it seems silly to make it so easy. I already tried that. It took me about five minutes using a shell, dd and hexdump to get the key out of the running kernel... Moreover, it seems this facility exists to support some form of automated table storage (LVM?). As we don't want anyone/anything accidentally storing our passwords on disk in the clear, we probably shouldn't facilitate this. Perhaps we can stick something here like secret that the dm_crypt constructor can reject. Yes, the reason is that the device-mapper supports on-the-fly modifications of the device. cryptsetup has a command to resize the mapping for example. It can do that without asking for the password again. Features like this are the reason I'm doing this. Userspace tools should be able to assume that they can use the result of a table status command to create a new table with this information. An alternativ would be to use some form of handle to point to the key after it has been given to the kernel. But that would require some more infrastructure. I could imagine something like this: Some kernel infrastructure for key management. It can hold keys which are referenced by some form of handle. The keys are refcounted and wiped if the reference count drops to zero. If you want to create a dm-crypt mapping (or something else that uses some form of cryptographic key) you create a new handle and assign the key. Then you give the handle to dm-crypt which increments reference count. When cryptsetup exits its reference to the key is dropped but the key still has a reference from the active dm-crypt mapping. Later on another application could then safely do something with that handle. As long as an application or in-kernel user references the key it won't be dropped so it's safe for a userspace application to play around with it. BTW: The setkey command also seems to return the keys in use for IPSEC connections. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: dm-crypt crypt_status reports key?
Am Mittwoch, den 02.02.2005, 17:52 -0800 schrieb Matt Mackall: An alternativ would be to use some form of handle to point to the key after it has been given to the kernel. But that would require some more infrastructure. There's been some talk about such infrastructure already. I believe some pieces of it may already be in place. Yes, you are right. I didn't follow the discussion but it actually looks very promising. The keys in the infrastructure are reference-counted. That's good. The keyrings can be attached to either thread, processes, sessions or users. It seems that it's possible to have floating keys (not attached to any keyring). So we would just need to figure out how to use these keyrings to allow communication with userspace applications. Process keyrings seem to have the advantage that the keyring is dropped when it exits so that all keys that are not in use by the kernel are also dropped. A keyring for the root user would have the problem that if the cryptsetup application aborts in the middle you would end up with old keys lying around forever. The keyring API seems very flexible. You can define your own type of keys and give them names. Well, the name is probably irrelevant here and should be chosen randomly but it's less likely to collide with someone else. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: swap on dmcrypt crashes machine
Am Dienstag, den 25.01.2005, 18:28 +0100 schrieb Michael Buesch: > I set up swap on an encrypted dmcrypt device. > While stressing swap usage with make -j200 in the > kernel tree, the machine crashes: > > Adding 1461872k swap on /dev/mapper/swap. Priority:-2 extents:1 > [ cut here ] > kernel BUG at drivers/block/ll_rw_blk.c:2193! > invalid operand: [#1] > SMP > Modules linked in: ipv6 ohci_hcd tuner tvaudio msp3400 bttv video_buf > firmware_class btcx_risc nvidia ehci_hcd uhci_hcd usbcore intel_agp agpgart > evdev > CPU:0 > EIP:0060:[]Tainted: P VLI > EFLAGS: 00210002 (2.6.10-ck4-defaultidle) > EIP is at __blk_put_request+0x87/0x91 This seems to be this BUG: BUG_ON(!list_empty(>queuelist)); in drivers/block/ll_rw_blk.c. I don't know how this could be non-empty since scsi_end_request calls if (blk_rq_tagged(req)) blk_queue_end_tag(q, req); just before calling end_that_request_last inside the same spinlock. I don't know anything about tagged request handling but I don't think this is the fault of the dm-crypt driver, perhaps it's stressing the queue in some unexpected way? I have no idea what's going on here. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil
Re: swap on dmcrypt crashes machine
Am Dienstag, den 25.01.2005, 18:28 +0100 schrieb Michael Buesch: I set up swap on an encrypted dmcrypt device. While stressing swap usage with make -j200 in the kernel tree, the machine crashes: Adding 1461872k swap on /dev/mapper/swap. Priority:-2 extents:1 [ cut here ] kernel BUG at drivers/block/ll_rw_blk.c:2193! invalid operand: [#1] SMP Modules linked in: ipv6 ohci_hcd tuner tvaudio msp3400 bttv video_buf firmware_class btcx_risc nvidia ehci_hcd uhci_hcd usbcore intel_agp agpgart evdev CPU:0 EIP:0060:[c0218b8a]Tainted: P VLI EFLAGS: 00210002 (2.6.10-ck4-defaultidle) EIP is at __blk_put_request+0x87/0x91 This seems to be this BUG: BUG_ON(!list_empty(req-queuelist)); in drivers/block/ll_rw_blk.c. I don't know how this could be non-empty since scsi_end_request calls if (blk_rq_tagged(req)) blk_queue_end_tag(q, req); just before calling end_that_request_last inside the same spinlock. I don't know anything about tagged request handling but I don't think this is the fault of the dm-crypt driver, perhaps it's stressing the queue in some unexpected way? I have no idea what's going on here. signature.asc Description: Dies ist ein digital signierter Nachrichtenteil