Re: [PATCH] ext4: fix debug format string warning
On Fri, Apr 09, 2021 at 10:12:05PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann
>
> Using no_printk() for jbd_debug() revealed two warnings:
>
> fs/jbd2/recovery.c: In function 'fc_do_one_pass':
> fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int'
>     argument [-Werror=format=]
>   256 |  jbd_debug(3, "Processing fast commit blk with seq %d");
>       |                                                    ^~~~
> fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
> fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type
>     'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
>  1732 |  jbd_debug(1, "Converting from %d to %d %lld",
>
> The first one was added incorrectly, and was also missing a few newlines
> in debug output, and the second one happened when the type of an
> argument changed.
>
> Reported-by: kernel test robot
> Fixes: d556435156b7 ("jbd2: avoid -Wempty-body warnings")
> Fixes: 6db074618969 ("ext4: use BIT() macro for BH_** state bits")
> Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path")
> Signed-off-by: Arnd Bergmann

Thanks, applied, with one change.

> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index 69f18fe20923..60601c5779f1 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -245,15 +245,15 @@ static int fc_do_one_pass(journal_t *journal,
>
>  		if (err) {
> -			jbd_debug(3, "Fast commit replay: read error");
> +			jbd_debug(3, "Fast commit replay: read error\n");
>  			break;
>  		}
>
> -		jbd_debug(3, "Processing fast commit blk with seq %d");
> +		jbd_debug(3, "Processing fast commit blk with seq\n");

This debug statement isn't adding any real value, so I just removed it.

					- Ted
Re: [PATCH] ext4: fix marking group trimmed if all blocks not trimmed
On Mon, Aug 03, 2020 at 04:17:44PM +, Lazar Beloica wrote:
> When FTRIM is issued on a group, ext4 marks it as trimmed so another FTRIM
> on the same group has no effect. Ext4 marks group as trimmed if at least
> one block is trimmed, therefore it is possible that a group is marked as
> trimmed even if there are blocks in that group left untrimmed.
>
> This patch marks group as trimmed only if there are no more blocks
> in that group to be trimmed.

This patch makes no sense; first of all, the changes below are *not* in
the function ext4_trim_extent(), but rather ext4_trim_all_free().  It
appears that the diff is based off of v5.8-rc2, based on the index
c0a331e, but then I'm not sure how you generated the diff?

Secondly, ext4_trim_all_free(), which is where these two patch hunks
appear:

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index c0a331e..130936b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5346,6 +5346,7 @@ static int ext4_trim_extent(struct super_block *sb, int start, int count,
>  {
>  	void *bitmap;
>  	ext4_grpblk_t next, count = 0, free_count = 0;
> +	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
>  	struct ext4_buddy e4b;
>  	int ret = 0;
>
> @@ -5401,7 +5402,9 @@ static int ext4_trim_extent(struct super_block *sb, int start, int count,
>
>  	if (!ret) {
>  		ret = count;
> -		EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
> +		next = mb_find_next_bit(bitmap, max_blks, max + 1);
> +		if (next == max_blks)
> +			EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
>  	}
>  out:
>  	ext4_unlock_group(sb, group);

... sends discards for blocks in a block group which are freed.  So
setting max_blks to ext4_blocks_count() and then using it as the limit
to mb_find_next_bit() makes no sense.  First of all, next will never be
equal to max_blks, since next is an offset relative to the beginning of
the block group, while max_blks is set to the number of blocks in the
entire file system.

Secondly, mb_find_next_bit() is searching a bitmap, which is a single
file system block (e.g., 4k in a 4k block file system).  So if max_blks
is the number of blocks in (for example) a 10TB file system, this is
going to potentially cause a kernel oops.

How, exactly, did you test this patch?

					- Ted
Re: [PATCH] ext4: move buffer_mapped() to proper position
Thanks, applied, although I rewrote the commit description to make it a
bit clearer:

    fs: prevent BUG_ON in submit_bh_wbc()

    If a device is hot-removed --- for example, when a physical device
    is unplugged from a pcie slot or a nbd device's network is shutdown
    --- this can result in a BUG_ON() crash in submit_bh_wbc().  This
    is because when the block device dies, the buffer heads will have
    their Buffer_Mapped flag get cleared, leading to the crash in
    submit_bh_wbc().

    We had attempted to work around this problem in commit a17712c8
    ("ext4: check superblock mapped prior to committing").
    Unfortunately, it's still possible to hit the
    BUG_ON(!buffer_mapped(bh)) if the device dies between the
    work-around check in ext4_commit_super() and when submit_bh_wbc()
    is finally called:

    Code path:
    ext4_commit_super
      judge if 'buffer_mapped(sbh)' is false, return  <== commit a17712c8
      lock_buffer(sbh)
      ...
      unlock_buffer(sbh)
      __sync_dirty_buffer(sbh,...
        lock_buffer(sbh)
        judge if 'buffer_mapped(sbh)' is false, return  <== added by this patch
        submit_bh(...,sbh)
          submit_bh_wbc(...,sbh,...)

    [100722.966497] kernel BUG at fs/buffer.c:3095!  <== BUG_ON(!buffer_mapped(bh)) in submit_bh_wbc()
    [100722.966503] invalid opcode: [#1] SMP
    [100722.966566] task: 8817e15a9e40 task.stack: c90024744000
    [100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
    [100722.966575] RSP: 0018:c90024747a90 EFLAGS: 00010246
    [100722.966576] RAX: 00620005 RBX: 8818a80603a8 RCX:
    [100722.966576] RDX: 8818a80603a8 RSI: 00020800 RDI: 0001
    [100722.966577] RBP: c90024747ac0 R08: R09: 88207f94170d
    [100722.966578] R10: 000437c8 R11: 0001 R12: 00020800
    [100722.966578] R13: 0001 R14: 0bf9a438 R15: 88195f333000
    [100722.966580] FS: 7fa2eee27700() GS:88203d84() knlGS:
    [100722.966580] CS: 0010 DS: ES: CR0: 80050033
    [100722.966581] CR2: 00f0b008 CR3: 00201a622003 CR4: 007606e0
    [100722.966582] DR0: DR1: DR2:
    [100722.966583] DR3: DR6: fffe0ff0 DR7: 0400
    [100722.966583] PKRU: 5554
    [100722.966583] Call Trace:
    [100722.966588]  __sync_dirty_buffer+0x6e/0xd0
    [100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
    [100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
    [100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
    [100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
    [100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
    [100722.93]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
    [100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
    [100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
    [100722.966682]  __mark_inode_dirty+0x17f/0x350
    [100722.966686]  generic_update_time+0x87/0xd0
    [100722.966687]  touch_atime+0xa9/0xd0
    [100722.966690]  generic_file_read_iter+0xa09/0xcd0
    [100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
    [100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
    [100722.966707]  ? __inode_security_revalidate+0x4f/0x60
    [100722.966709]  __vfs_read+0xec/0x160
    [100722.966711]  vfs_read+0x8c/0x130
    [100722.966712]  SyS_pread64+0x87/0xb0
    [100722.966716]  do_syscall_64+0x67/0x1b0
    [100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25

    To address this, add the check of 'buffer_mapped(bh)' to
    __sync_dirty_buffer().  This also has the benefit of fixing this
    for other file systems.  With this addition, we can drop the
    workaround in ext4_commit_super().

    [ Commit description rewritten by tytso. ]

    Signed-off-by: Xianting Tian
    Link: https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_t...@126.com
    Signed-off-by: Theodore Ts'o

					- Ted
Re: [PATCH] jbd2: fix incorrect code style
On Sat, Jul 18, 2020 at 08:57:37AM -0400, Xianting Tian wrote: > Remove unnecessary blank. > > Signed-off-by: Xianting Tian Thanks, applied. - Ted
Re: ext4: fix spelling typos in ext4_mb_initialize_context
On Wed, Jul 15, 2020 at 11:00:44AM +0800, brookxu wrote: > Fix spelling typos in ext4_mb_initialize_context. > > Signed-off-by: Chunguang Xu Thanks, applied. - Ted
Re: [PATCH] Replace HTTP links with HTTPS ones: Ext4
On Mon, Jul 06, 2020 at 09:03:39PM +0200, Alexander A. Klimov wrote: > Rationale: > Reduces attack surface on kernel devs opening the links for MITM > as HTTPS traffic is much harder to manipulate. Thanks, applied. - Ted
Re: [PATCH] ext4: fix spelling mistakes in extents.c
On Wed, Jun 10, 2020 at 11:19:46PM -0400, Keyur Patel wrote: > Fix spelling issues over the comments in the code. > > requsted ==> requested > deterimined ==> determined > insde ==> inside > neet ==> need > somthing ==> something > > Signed-off-by: Keyur Patel Applied, thanks. - Ted
Re: strace of io_uring events?
On Wed, Jul 15, 2020 at 06:11:30PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 15, 2020 at 07:35:50AM -0700, Andy Lutomirski wrote:
> > > On Jul 15, 2020, at 4:12 AM, Miklos Szeredi wrote:
> > > This thread is to discuss the possibility of stracing requests
> > > submitted through io_uring.  I'm not directly involved in io_uring
> > > development, so I'm posting this out of interest in using strace on
> > > processes utilizing io_uring.
> > >
> > > Is there some existing tracing infrastructure that strace could use
> > > to get async completion events?  Should we be introducing one?

I suspect the best approach to use here is eBPF, since sending
asynchronously to a ring buffer is going to be *way* more efficient
than using the blocking ptrace(2) system call...

					- Ted
Re: [PATCH v2] CodingStyle: Inclusive Terminology
On Wed, Jul 08, 2020 at 12:23:59AM -0700, Dan Williams wrote:
> Linux maintains a coding-style and its own idiomatic set of terminology.
> Update the style guidelines to recommend replacements for the terms
> master/slave and blacklist/whitelist.
>
> Link: http://lore.kernel.org/r/159389297140.2210796.13590142254668787525.st...@dwillia2-desk3.amr.corp.intel.com
> Cc: Jonathan Corbet
> Acked-by: Randy Dunlap
> Acked-by: Dave Airlie
> Acked-by: Kees Cook
> Acked-by: SeongJae Park
> Signed-off-by: Olof Johansson
> Signed-off-by: Chris Mason
> Signed-off-by: Greg Kroah-Hartman
> Signed-off-by: Dan Williams

Signed-off-by: Theodore Ts'o
Re: better patch for linux/bitops.h
On Thu, May 05, 2016 at 05:34:50PM -0400, Sandy Harris wrote:
>
> I completely fail to see why tests or compiler versions should be
> part of the discussion. The C standard says the behaviour in
> certain cases is undefined, so a standard-compliant compiler
> can generate more-or-less any code there.
>
> As long as any of portability, reliability or security are among our
> goals, any code that can give undefined behaviour should be
> considered problematic.

Because compilers have been known not necessarily to obey the specs,
and/or to interpret the specs in a way that not everyone agrees with.
It's also the case that we are *already* disabling certain C
optimizations which are technically allowed by the spec, but which
kernel programmers consider insane (e.g., strict aliasing).  And of
course, memzero_explicit(), which crypto people understand is
necessary, is something that technically compilers are allowed to
optimize away according to the spec.

So trying to write secure kernel code which will work on arbitrary
compilers may well be impossible.  And this is also why people have
said (mostly in jest), "A sufficiently advanced compiler is
indistinguishable from an adversary."  (I assume people will agree
that optimizing away a memset needed to clear secrets from memory
would be considered adversarial, at the very least!)

So this is why I tend to take a much more pragmatic viewpoint on
things.  Sure, it makes sense to pay attention to what the C standard
writers are trying to do to us; but if we need to suppress certain
optimizations to write sane kernel code --- I'm ok with that.  And
this is why using a trust-but-verify approach on a specific set of
compilers and ranges of compiler versions is a really good idea.

					- Ted
Re: [PATCH] ext4: Fix check of dqget() return value in ext4_ioctl_setproject()
On Thu, May 05, 2016 at 02:52:34PM +0200, Jan Kara wrote:
> On Fri 22-04-16 13:36:32, Seth Forshee wrote:
> > On Tue, Mar 29, 2016 at 08:01:03AM -0500, Seth Forshee wrote:
> > > A failed call to dqget() returns an ERR_PTR() and not null.  Fix
> > > the check in ext4_ioctl_setproject() to handle this correctly.
> > >
> > > Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR
> > > interface support")
> > > Cc: sta...@vger.kernel.org # v4.5
> > > Signed-off-by: Seth Forshee
> >
> > Thought I'd check and see if this patch was forgotten, I sent it nearly
> > a month ago but the bug is still present.
>
> Yeah, it seems Ted forgot to pull the fix. Ted? BTW, feel free to add
>
> Reviewed-by: Jan Kara

Applied, thanks.

					- Ted
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
>
> We don't care about UB, we care about gcc, and to a lesser extent
> LLVM and ICC.  If bitops.h doesn't do the right thing, we need to
> fix bitops.h.

I'm going to suggest that we treat the ro[rl]{32,64}() question as
separable from the /dev/random case.  I've sanity checked gcc 5.3.1
and it does the right thing given the small number of constant
arguments given to rotl32() in lib/chacha20.c, and it doesn't hit the
UB case which Jeffrey was concerned about.

This is also code that was previously in crypto/chacha20_generic.c,
and so if there is a bug with some obscure compiler not properly
compiling it down to a rotate instruction, (a) no one is paying me to
make sure the kernel code compiles well on ICC, and (b) it's not
scalable to have each kernel developer try to deal with the vagaries
of compilers.

So from my perspective, the only interesting options for me are (a)
using what had been used before in crypto/chacha20_generic.c, or (b)
reusing what is in bitops.h and letting it be someone else's problem
if some obscure compiler isn't happy with what is in bitops.h.

If we are all agreed that what is in bitops.h is considered valid,
then we can start converting people over to using the version defined
in bitops.h, and if there is some compiler issue we need to work
around, at least we only need to put the workaround in a single header
file.

Cheers,

					- Ted
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
> > +static inline u32 rotl32(u32 v, u8 n)
> > +{
> > +	return (v << n) | (v >> (sizeof(v) * 8 - n));
> > +}
>
> That's undefined behavior when n=0.

Sure, but it's never called with n = 0; I've double checked and the
compiler seems to do the right thing with the above pattern as well.

Hmm, it looks like there is a "standard" version of rotate left and
right defined in include/linux/bitops.h.  So I suspect it would make
sense to use rol32 as defined in bitops.h --- and this is probably
something that we should do for the rest of crypto/*.c, where people
seem to be defining their own version of something like rotl32 (I
copied the contents of crypto/chacha20_generic.c to lib/chacha20, so
this pattern of defining one's own version of rol32 isn't new).

> I think the portable way to do a rotate that avoids UB is the
> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
> instruction.
>
>     static const unsigned int MASK = 31;
>     return (v << (n & MASK)) | (v >> (-n & MASK));
>
> You should also avoid the following because it's not constant time
> due to the branch:
>
>     return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));

Where is this coming from?  I don't see this construct in the patch.

					- Ted
Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG
On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
> > +/*
> > + * crng_init = 0 --> Uninitialized
> > + *	       2 --> Initialized
> > + *	       3 --> Initialized from input_pool
> > + */
> > +static int crng_init = 0;
>
> shouldn't that be an atomic_t ?

The crng_init variable only gets incremented (it progresses from
0->1->2->3) and it's protected by the primary_crng->lock spinlock.
There are a few places where we read it without first taking the lock,
but that's where we depend on it getting incremented and where if we
race with another CPU which has just bumped the crng_init status, it's
not critical.  (Or we can take the lock and then recheck crng_init if
we really need to be sure.)

> > +static void _initialize_crng(struct crng_state *crng)
> > +{
> > +	int i;
> > +	unsigned long rv;
>
> Why do you use unsigned long here? I thought the state[i] is unsigned
> int.

Because it gets filled in via arch_get_random_seed_long() and
arch_get_random_long().  If that means we get 64 bits and we only use
32 bits, that's OK.

> Would it make sense to add the ChaCha20 self test vectors from RFC7539
> here to test that the ChaCha20 works?

We're not doing that for any of the other ciphers and hashes.  We
didn't do that for SHA1, and the chacha20 code where I took this from
didn't check for this as well.  What threat are you most worried
about?  We don't care about chacha20 being exactly chacha20, so long
as it's cryptographically strong.  In fact I think I removed a
potential host order byteswap in the set key operation specifically
because we don't care about interop.

If this is required for FIPS mode, we can add that later.  I was
primarily trying to keep the patch small to be easier for people to
look at, so I've deliberately left off bells and whistles that aren't
strictly needed to show that the new approach is sound.

> > +	if (crng_init++ >= 2)
> > +		wake_up_interruptible(&crng_init_wait);
>
> Don't we have a race here with the crng_init < 3 check in crng_reseed
> considering multi-core systems?

No, because we are holding the primary_crng->lock spinlock.  I'll add
a comment explaining the locking protections which is intended for
crng_init where we declare it.

> > +	if (num < 16 || num > 32) {
> > +		WARN_ON(1);
> > +		pr_err("crng_reseed: num is %d?!?\n", num);
> > +	}
> > +	num_words = (num + 3) / 4;
> > +	for (i = 0; i < num_words; i++)
> > +		primary_crng.state[i+4] ^= tmp[i];
> > +	primary_crng.init_time = jiffies;
> > +	if (crng_init < 3) {
>
> Shouldn't that one be if (crng_init < 3 && num >= 16) ?

I'll just change the above WARN_ON test to be:

	BUG_ON(num < 16 || num > 32);

This really can't happen, and I had it as a WARN_ON with a printk for
debugging purposes in case I was wrong about how the code works.

> > +		crng_init = 3;
> > +		process_random_ready_list();
> > +		wake_up_interruptible(&crng_init_wait);
> > +		pr_notice("random: crng_init 3\n");
>
> Would it make sense to be more descriptive here to allow readers of
> dmesg to understand the output?

Yes, what we're currently printing during the initialization:

	random: crng_init 1
	random: crng_init 2
	random: crng_init 3

was again mostly for debugging purposes.  What I'm thinking about
doing is changing the crng_init 2 and 3 messages to be:

	random: crng fast init done
	random: crng conservative init done

> > +	}
> > +	ret = 1;
> > +out:
> > +	spin_unlock_irqrestore(&primary_crng.lock, flags);
>
> memzero_explicit of tmp?

Good point, I've added a call to memzero_explicit().

> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> > +{
> > +	unsigned long v, flags;
> > +	struct crng_state *crng = &primary_crng;
> > +
> > +	if (crng_init > 2 &&
> > +	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> > +		crng_reseed(&input_pool);
> > +	spin_lock_irqsave(&crng->lock, flags);
> > +	if (arch_get_random_long(&v))
> > +		crng->state[14] ^= v;
> > +	chacha20_block(&crng->state[0], out);
> > +	if (crng->state[12] == 0)
> > +		crng->state[13]++;
>
> What is the purpose to only cover the 2nd 32 bit value of the nonce
> with arch_get_random?
>
> state[12]++? Or why do you increment the nonce?

In Chacha20, its output is a function of the state array, where
state[0..3] is a constant specified by the Chacha20 definition,
state[4..11] is the Key, and state[12..15] is the IV.  The
chacha20_block() function increments state[12] each time it is called,
so state[12] is being used as the block counter.  The increment of
state[13] is used to make state[13] be the upper 32-bits of the block
counter.  We *should* be reseeding more often than 2**32 calls to
chacha20_block(), and the increment is just to be safe in case
something goes wrong and we're not reseeding.

We're using crng->state[14] to contain the output of RDRAND, so this
is where we mix in the contribution from a CPU-level hwrng.
Re: [RFC][PATCH 0/6] /dev/random - a new approach
On Tue, May 03, 2016 at 03:57:15PM +0200, Nikos Mavrogiannopoulos wrote:
>
> I believe their main concern is that they want to protect applications
> which do not check error codes of system calls, when running on a
> kernel which does not provide getrandom(). That way, they have an
> almost impossible task to simulate getrandom() on kernels which do
> not support it.

The whole *point* of creating the getrandom(2) system call is that it
can't be simulated/emulated in userspace.  If it could be, then
there's no reason why the system call should exist.  This is one of
the reasons why we haven't implemented mysql or TLS inside the kernel.
:-)

So if their standard is "we need to simulate getrandom(2) on a kernel
which does not have it", we'll **never** see glibc support for it.  By
definition, this is *impossible*.

What they can do is something which is as good as you can get for
someone who is open-coding /dev/urandom support in userspace.  That
means that you won't be able to (a) tell if the urandom pool has been
adequately initialized right after boot, (b) you will need to somehow
deal with the case where the file descriptors have been exhausted, (c)
or if you are running in a chroot where the system administrator
didn't bother to include /dev/urandom.

About the best you can do is call abort(0), or if you want, you can
let the application author specify some kind of "I want to run in
insecure mode", via some magic glibc setting.  You could probably
default this to "true" without a huge net reduction of security,
because most application authors weren't getting this right anyway.
And then the ones who do care can set some kind of flag saying, "I
promise to check the error return from getrandom(2) as implemented by
glibc".

					- Ted
Re: Why is SECTOR_SIZE = 512 inside kernel ?
On Tue, Aug 18, 2015 at 11:06:47PM +0200, Brice Goglin wrote:
> On 17/08/2015 15:54, Theodore Ts'o wrote:
> >
> > It's cast in stone. There are too many places all over the kernel,
> > especially in a huge number of file systems, which assume that the
> > sector size is 512 bytes. So above the block layer, the sector size
> > is always going to be 512.
>
> Could this be a problem when using pmem/nvdimm devices with
> byte-granularity (no BTT layer)? (hw_sector_size reports
> 512 in this case while we could expect 1 instead).
> Or it just doesn't matter because BTT is the only way to use
> these devices for filesystems like other block devices?

Right now there are very few applications that understand how to use pmem/nvdimm devices as memory.  And even where they do, they will need some kind of file system to provide resource isolation in case more than one application or more than one user wants to use the pmem/nvdimm.  In that case, they will probably mmap a file and then access the nvdimm directly.  In that case, the applications won't be using the block device layer at all, so they won't care about the advertised hw_sector_size.

The challenge with pmem-aware applications is that they need to be able to correctly update their in-memory data structures in such a way that they can correctly recover after an arbitrary power failure.  That means they have to use atomic updates and/or copy-on-write update schemes, and I suspect most application writers just aren't going to be able to get this right.

So for many legacy applications, they will still read in the file "foo", make changes in local memory, and then write the new contents to the file "foo.new", and then rename "foo.new" on top of "foo".  These applications will effectively use nvdimm as super fast flash, and so they will use file systems as file systems.
And since file systems today all use block sizes which are multiples of the traditional 512 byte sector size, again, changing something as fundamental as the kernel's internal sector size doesn't have any real value, at least not as far as pmem/nvdimm support is concerned. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Remove TO DO in jfs_xtree.c
On Mon, Dec 29, 2014 at 04:13:37PM -0600, Dave Kleikamp wrote:
> On 12/27/2014 06:58 PM, nick wrote:
> > Greetings Dave,
> > I am wondering why there is a TO DO above this code:
> > * ToDo: tlocks should be on doubly-linked list, so we can
> > * quickly remove it and add it to the end.
>
> I'm sure the idea was to avoid the for loop needed to find the previous
> entry in the linked list. A doubly-linked list makes it much simpler to
> remove an item from an arbitrary position in the list.

Hi Dave,

Just in case you weren't aware, Nick has been banned from the LKML list for being a troll.  A common troll trick is to send e-mail to a number of individuals with a mailing list (in this case, LKML) cc'ed, in the hopes that people will reply, quoting the troll's words, so they can get around the mailing list ban.

The reason why he has been banned is because he has apparently been desperate to get _some_ patches into the Linux kernel.  Enough so that he has proposed patches which do not compile, and/or were not tested, and/or for which he had no hardware (so he couldn't possibly have tested them).  As a maintainer, you should be aware of his past history, and choose for yourself whether, in the future, you feel you should respond to any e-mail that he might send you.

Regards,

- Ted

P.S.  Personally, the best reason for banning him isn't that he has been wasting maintainers' time, but that he was responding to users who were reporting bugs with nonsensical responses that were actively harmful to users who were looking for help.
Re: Reminder for kernel summit nominations
On Fri, May 09, 2014 at 11:23:19AM -0400, Theodore Ts'o wrote:
> ... please send nominations to:
>
> kernel-sum...@lists.linuxfoundation.org

Argh, sorry, I screwed up the e-mail address.  The correct e-mail address is:

ksummit-disc...@lists.linuxfoundation.org

My apologies,

- Ted
Re: random: Providing a seed value to VM guests
On Thu, May 01, 2014 at 02:06:13PM -0700, Andy Lutomirski wrote:
> I still don't see the point. What does this do better than virtio-rng?

I believe you had been complaining about how complicated it was to set up virtio?  And this complexity is also an issue if we want to use it to initialize the RNG used for the kernel text ASLR --- which has to be done very early in the boot process, and where making something as simple as possible is a Good Thing.

And since we would want to use RDRAND/RDSEED if it is available *anyway*, perhaps in combination with other things, why not use the RDRAND/RDSEED interface?

- Ted
Re: random: Providing a seed value to VM guests
On Thu, May 01, 2014 at 01:32:55PM -0700, Andy Lutomirski wrote:
> On Thu, May 1, 2014 at 1:30 PM, H. Peter Anvin wrote:
> > RDSEED is not synchronous. It is, however, nonblocking.
>
> What I mean is: IIUC it's reasonable to call RDSEED a few times in a
> loop and hope it works. It makes no sense to do that with
> /dev/random.

RDSEED is allowed to return an error if there is insufficient entropy.  So long as the caller understands that this is an emulated instruction, I don't see a problem.

- Ted
Re: random: Providing a seed value to VM guests
On Thu, May 01, 2014 at 12:02:49PM -0700, Andy Lutomirski wrote:
> Is RDSEED really reasonable here? Won't it slow down by several
> orders of magnitude?

That is I think the biggest problem; RDRAND and RDSEED are fast if they are native, but they will involve a VM exit if they need to be emulated.  So the circumstances under which an OS might want to use RDRAND and RDSEED might be quite different if we know they are being emulated.

Using the RDRAND and RDSEED "api" certainly makes sense, at least for x86, but I suspect we might want to use a different way of signalling that a VM guest can use RDRAND and RDSEED if they are running on a CPU which doesn't provide that kind of access.  Maybe a CPUID extended function parameter, if one could be allocated for use by a Linux hypervisor?

- Ted
Re: [PATCH] random: Add "initialized" variable to proc
On Wed, Apr 30, 2014 at 09:05:00PM -0700, H. Peter Anvin wrote:
> Giving the guest a seed would be highly useful, though. There are a
> number of ways to do that; changing the boot protocol is probably
> only useful if Qemu itself boots the kernel as opposed to an in-VM
> bootloader.

So how about simply passing a memory address and an optional offset on the boot command line?  That way the hypervisor can drop the seed in some convenient real memory location, and the kernel can just copy it someplace safe, or in the case of kernel ASLR, the relocator can use it to seed its CRNG, and then after it relocates the kernel, it can crank the CRNG to pass a seed to the kernel's urandom driver.

That way, we don't have to do something which is ACPI or DT dependent.  Maybe there will be embedded architectures where using DT might be more convenient, but this would probably be simplest for KVM/qemu-based VMs, I would think.

- Ted
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Tue, Mar 25, 2014 at 11:32:09PM -0700, Andrew Morton wrote:
> Well, there are always alternatives. For example ext3 could
> preallocate a single transaction_t and a single IO page and fall back
> to synchronous page-at-a-time journal writes. But I can totally see
> that such things are unattractive: heaps of new code which is never
> tested in real life. The page allocator works so damn well that it
> doesn't make sense to implement it.

I'm not sure that there are _always_ solutions.  For example, ext3 still needs to use the buffer cache to do the I/O, and that may require allocations as well.  Fortunately, this was patched around in other ways to avoid livelock issues in the page cleaner in 2013: 84235de394d977

But that's another new user of GFP_NOFAIL (and one added three years after David tried to declare There Shalt Be No New Users of GFP_NOFAIL), and sure, we could probably patch around that by having places where there's no other alternative keep a preallocated batch of pages and/or allocated structures at each code site.  But that's as bad as looping at each code site; in fact, wouldn't it be better, if the allocator wants to avoid looping, to have a separate batch of pages which (ala GFP_ATOMIC) is reserved just for GFP_NOFAIL allocations when all else fails?

(BTW, we have a similar issue with bio's in the block layer, but fortunately, those structures are relatively small, and since they are in their own slab cache, and are constantly being used and then recycled, it's in practice rare that allocations will fail when we are in a desperate low memory situation.  But technically there will be places where we should pass a gfp_t down to the block layers, and use GFP_NOFAIL if we really want to make sure that memory allocations needed to clean pages and/or avoid user data corruption will succeed.)

> Please use NOFAIL ;) The core page allocator will always be able to
> implement this better than callers.

I'll convert jbd2 to use NOFAIL.
:-)

Cheers,

- Ted
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Tue, Mar 25, 2014 at 06:06:17PM -0700, David Rientjes wrote:
> The point is not to add new callers and new code should handle NULL
> correctly, not that we should run around changing current users to just do
> infinite retries. Checkpatch should have nothing to do with that.

My problem with this doctrinaire "there should never be any new users" is that sometimes there *are* worse things than infinite retries.  If the alternative is bringing the entire system down, or livelocking the entire system, or corrupting user data, __GFP_NOFAIL *is* the more appropriate option.

If you try to tell those of us outside of the mm layer, "thou shalt never use __GFP_NOFAIL in new code", and we have some new code where the alternative is worse, we can either open-code the loop, or have some mm hackers and/or checkpatch whine at us.  Andrew has declared that he'd prefer that we not open code the retry loop; if you want to disagree with Andrew, feel free to persuade him otherwise.  If you want to tell me that I should accept user data corruption, I'm going to ignore you (and/or checkpatch).

Regards,

- Ted
Re: [PATCH 2/2] ext4: acl: remove unneeded include of linux/capability.h
On Mon, Mar 24, 2014 at 08:22:12AM +0100, Jakub Sitnicki wrote:
> acl.c has not been (directly) using the interface defined by
> linux/capability.h header since commit 3bd858ab1c451725c07a
> ("Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid
> check"). Remove it.
>
> Signed-off-by: Jakub Sitnicki

Thanks, applied.

- Ted
Re: [PATCH v7 21/22] ext4: Fix typos
On Sun, Mar 23, 2014 at 03:08:47PM -0400, Matthew Wilcox wrote:
> Comment fix only
>
> Signed-off-by: Matthew Wilcox

Thanks, applied to the ext4 git tree.

- Ted
Re: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static
On Sun, Mar 23, 2014 at 03:08:45PM -0400, Matthew Wilcox wrote:
> It's only called within inode.c, so make it static, remove its prototype
> from ext4.h and move it above all of its callers so it doesn't need a
> prototype within inode.c.
>
> Signed-off-by: Matthew Wilcox

Thanks, applied to the ext4 tree.

- Ted
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Sat, Mar 22, 2014 at 10:55:24AM -0700, Andrew Morton wrote:
>
> From: Andrew Morton
> Subject: scripts/checkpatch.pl: __GFP_NOFAIL isn't going away
>
> Revert 7e4915e78992eb ("checkpatch: add warning of future __GFP_NOFAIL use").
>
> There are no plans to remove __GFP_NOFAIL.
>
> __GFP_NOFAIL exists to
>
> a) centralise the retry-allocation-for-ever operation into the core
>    allocator, which is the appropriate implementation site and
>
> b) permit us to identify code sites which aren't handling memory
>    exhaustion appropriately.
>
> Cc: David Rientjes
> Cc: Joe Perches
> Cc: Theodore Ts'o
> Signed-off-by: Andrew Morton

How about also making the following change, which inspired the checkpatch warning?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0437439..d189872 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -58,9 +58,11 @@ struct vm_area_struct;
  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
  * _might_ fail. This depends upon the particular VM implementation.
  *
- * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures. This modifier is deprecated and no new
- * users should be added.
+ * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the
+ * caller cannot handle allocation failures.  Callers are strongly
+ * encouraged not to use __GFP_NOFAIL unless the alternative is worse
+ * than OOM killing some random process (i.e., corruption or loss of
+ * some innocent user's data, etc).
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Sat, Mar 22, 2014 at 01:26:06PM -0400, ty...@mit.edu wrote:
> > Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
> > good.  Adding new retry-for-ever code is not good.

Oh, and BTW --- now that checkpatch.pl flags a warning whenever
GFP_NOFAIL is used, because it is deprecated, patches that convert an
open-coded loop to use GFP_NOFAIL will get flagged by checkpatch, and
this will also incentivize people writing new code, who can't find any
other way to deal with an allocation failure, to simply open code the
loop...

						- Ted
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Sat, Mar 22, 2014 at 10:15:12AM -0700, Andrew Morton wrote:
> > I'll note that since 2011, there has been precious little movement on
> > removing the final few callers of GFP_NOFAIL, and we still have a bit
> > under two dozen of them, including a new one in fs/buffer.c that was
> > added in 2013.
>
> Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
> good.  Adding new retry-for-ever code is not good.

Actually, it wasn't converting an existing loop; it was adding a new
GFP_NOFAIL to fix a reclaim livelock (commit 84235de394d9775bf).

I agree that in an ideal world, we'd get rid of all of these.  But
sometimes the cure can be worse than the disease, and so the whole
"all callers of GFP_NOFAIL are buggy and MUST BE FIXED, and the
maintainers should be shamed into fixing them" attitude is one that I
find a bit odd myself.

						- Ted
Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL
On Fri, Mar 21, 2014 at 01:00:55PM -0700, Andrew Morton wrote:
> > The whole point of __GFP_NOFAIL is to centralise this
> > wait-for-memory-for-ever operation.  So it is implemented in a common
> > (core) place and so that we can easily locate these problematic
> > callers.
>
> is exactly wrong.  Yes, we'd like __GFP_NOFAIL to go away, but it
> cannot go away until buggy callsites such as this one are *fixed*.
> Removing the __GFP_NOFAIL usage simply hides the buggy code from casual
> searchers.

The change to jbd2 was made in July 2010, back when the "we must
exterminate __GFP_NOFAIL at all costs" brigade was in high gear, and
the folks claiming that __GFP_NOFAIL *would* go away, come hell or
high water, were a bit more emphatic.

I'll note that since 2011, there has been precious little movement on
removing the final few callers of GFP_NOFAIL, and we still have a bit
under two dozen of them, including a new one in fs/buffer.c that was
added in 2013.

In any case, __GFP_NOFAIL is in the code comments, so a casual
searcher would find it pretty quickly with a "git grep".

						- Ted
Re: [PATCH 0/6] File Sealing & memfd_create()
On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
> On Thu, Mar 20, 2014 at 4:32 PM, <ty...@mit.edu> wrote:
> > Why not make sealing an attribute of the "struct file", and enforce it
> > at the VFS layer?  That way all file system objects would have access
> > to the sealing interface, and for memfd_shmem, since you can't get
> > another struct file pointing at the object, the security properties
> > would be identical.
>
> Sealing as introduced here is an inode-attribute, not "struct file".
> This is intentional.  For instance, a gfx-client can get a read-only FD
> via /proc/self/fd/ and pass it to the compositor so it can never
> overwrite the contents (unless the compositor has write-access to the
> inode itself, in which case it can just re-open it read-write).

Hmm, good point.  I had forgotten about the /proc/self/fd hole.
Hmm... what if we had a SEAL_PROC which forces the permissions of
/proc/self/fd to be 000?

So if it is a property of the inode, SEAL_WRITE and SEAL_GROW are
basically equivalent to using chattr to set the immutable and
append-only attributes, except for the "you can't undo the seal unless
you have exclusive access to the inode" magic.  That does make it
pretty memfd_create specific and not a very general API, which is a
little unfortunate; hence why I'm trying to explore ways of making it
a bit more generic and hopefully useful for more use cases.

Cheers,

						- Ted
Re: [PATCH 0/6] File Sealing & memfd_create()
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
>
> This series introduces the concept of "file sealing".  Sealing a file
> restricts the set of allowed operations on the file in question.
> Multiple seals are defined and each seal will cause a different set of
> operations to return EPERM if it is set.  The following seals are
> introduced:
>
>  * SEAL_SHRINK: If set, the inode size cannot be reduced
>  * SEAL_GROW: If set, the inode size cannot be increased
>  * SEAL_WRITE: If set, the file content cannot be modified

Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.  Why not make sealing an
attribute of the "struct file", and enforce it at the VFS layer?  That
way all file system objects would have access to the sealing
interface, and for memfd_shmem, since you can't get another struct
file pointing at the object, the security properties would be
identical.

Cheers,

						- Ted
Re: Trusted kernel patchset for Secure Boot lockdown
On Wed, Mar 19, 2014 at 01:16:15PM -0700, Kees Cook wrote:
> UEFI is a red herring in both cases.  This isn't about UEFI, it just
> happens that one of the things that can assert "trusted_kernel" is the
> UEFI Secure Boot path.  Chrome OS would assert "trusted_kernel" by
> passing it on the kernel command line (since it, along with firmware,
> kernel, and root fs are effectively measured).  A
> boot-my-router-from-CD system can assert similarly because the kernel
> is on RO media.

I disagree; it's highly likely, if not certain, that Windows booting
under UEFI Secure Boot is going to be able to do some of the things
that people are proposing we have to prohibit in the name of security.
That's because presumably Windows won't be willing to make certain
usability tradeoffs, and since Microsoft controls the signing certs,
even in the unlikely case that people can leverage these "holes" to
enable a boot sector virus, it seems unlikely that Windows will revoke
its own cert.

The security goals for Windows' Secure Boot are quite a bit weaker
than what ChromeOS is trying to promise; this is why I claim the real
argument is over what the goals are for "trusted boot".

Cheers,

						- Ted
Re: Page cache corruption with mount/unmount of USB with Fat file system
On Thu, Mar 20, 2014 at 12:05:44AM +0530, Nilesh More wrote:
> Hi all,
>
> First of all sorry for the lengthy mail but I did not want to miss out
> on any details.
>
> The following issue that I am reporting occurs when I plugin the USB
> drive on android tablet with USB mount support.  Please note that this
> issue may or may not occur the first time I hotplug the USB drive.  It
> might require couple of hotplug/hotunplug to see this issue.  If
> anybody has/gets any clue about the possible root cause, please let me
> know.

And I'll repeat the comments I made a few weeks ago.  Last time you
reported this, you included a dmesg output which included this:

> [  413.607849] usb 2-1.1: USB disconnect, device number 12
> [  414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode
> #81827: block 328308: comm installd...

Are you seeing the same thing?  If so, the fundamental high-level
issue is that the USB driver is reporting a device disconnect,
presumably of a USB device different from the one that you just
plugged in.

Are you still seeing this?  If so, and if you want to root cause this
issue, it's important to determine why the USB disconnect is
happening, and not to try to comment out the page invalidations which
happen *after* the USB device is reported as disconnected.

If I had to guess, it's probably some hardware issue where plugging in
a USB device that draws too much power destabilizes the USB bus and
causes some other USB device to report a disconnect.

						- Ted
Re: [PATCH v2 0/4] RDSEED support for the Linux kernel
On Mon, Mar 17, 2014 at 04:36:26PM -0700, H. Peter Anvin wrote:
> Changes since version 1:
>
> a. Rebased on top of random.git:dev.
> b. Unbreak the PowerPC build (I had managed to miss that PowerPC had
>    grown archrandom.h support.)
> c. Remove duplicate dummy function definitions in <linux/random.h>.
> d. Add a fourth patch containing a microoptimization: avoid the loop
>    in arch_random_refill() if arch_get_random_seed*() is unavailable.
>
> Comments are, of course, appreciated.
>
> Ted, if you are OK with this could you add this to random.git:dev so
> linux-next can pick it up?

Thanks, applied to the random.git tree.

						- Ted
Re: Q on ioctl BLKGETSIZE
On Tue, Mar 18, 2014 at 01:17:25PM +0100, Ulrich Windl wrote:
>
> Three questions:
> 1) Shouldn't the manual page say that the sector size is always 512
>    bytes, even on new disks with larger sectors?
> 2) Should the real sector size be used for new disks?

The HDD industry is using 512-byte logical sectors (so LBAs are in
units of 512-byte sectors), even for disks with a physical sector size
of 4k (and in the next 4-5 years, they would like to go to 32k
physical sector sizes).

> 3) When using 512-byte sectors, isn't the capacity limited to 2TB
>    (2^31 kB)?

Yes.  That's why we have the BLKGETSIZE64 ioctl.  Example code:

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/lib/ext2fs/getsize.c

						- Ted
Re: [PATCH] virtio-blk: make the queue depth the max supportable by the hypervisor
On Mon, Mar 17, 2014 at 11:12:15AM +1030, Rusty Russell wrote:
>
> Note that with indirect descriptors (which is supported by Almost
> Everyone), we can actually use the full index, so this value is a bit
> pessimistic.  But it's OK as a starting point.

So is this something that can go upstream with perhaps a slight
adjustment in the commit description?

Do you think we need to be able to dynamically adjust the queue depth
after the module has been loaded or the kernel has been booted?  If
so, a hint about the best way to do that would be much appreciated.

Thanks,

						- Ted
Re: [v7] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate
I just noticed I had the v6 version of your patch in my tree.  I've
updated it to be the v7 version, thanks.

						- Ted
Re: flaw in "nf_tables: add reject module for NFPROTO_INET"
On Wed, Feb 12, 2014 at 01:05:54PM -0800, Kees Cook wrote:
>
> I'd be up for it.  It's why I CC'd you, I figured if I'd missed the
> report it would have likely have come from you. :)  Perhaps just start
> by CCing each other, and if others want to get in on the fun too, move
> to a list then?

I'm paying attention to the Linux kernel Coverity reports (as well as
working on my goal of driving the number of e2fsprogs Coverity reports
to zero :-), so feel free to CC me on any reports.

Cheers,

						- Ted
Re: [RFC PATCH 0/5] CPU Jitter RNG
On Tue, Feb 04, 2014 at 11:06:04AM -0800, H. Peter Anvin wrote:
>
> The quantum noise sources there are in a system are generally two
> independent clocks running against each other.  However, independent
> clocks are rare; instead, most clocks are in fact slaved against each
> other using PLLs and similar structures.

One of the things that would be useful for us to understand is where,
in general, a system actually has independent clocks.  For example, I
think (correct me if I'm wrong) that a 2.5" or 3.5" HDD has its own
clock which is separate from the CPU/chipset.  That is actually how
and where we get any entropy; I am not at all convinced that we are
getting any variation from "chaotic air turbulence in the HDD" ---
that paper was published in 1994, and hard drive technologies have
changed quite a bit since then, with extra layers of caching, track
buffers, etc.

However, where a decade ago the ethernet card probably had its own
independent clock crystal/oscillator, I'm going to guess that these
days, with SOCs, and even on laptops where the ethernet device is
part of the chipset, it is probably being driven off the same master
oscillator.

I wonder if there's any way we can figure out, either manually or,
preferably, automatically at boot time, which devices actually have
independent clock oscillators.

						- Ted
Re: [ksummit-attendees] [PATCH] commit: Add -f, --fixes option to add Fixes: line
[ I've removed Junio and the git mailing list from the -cc ]

On Mon, Oct 28, 2013 at 11:41:17PM +, Russell King - ARM Linux wrote:
> 
> I agree too.  This whole thread seems to be about noise, and I too
> thought there was something about not cross-posting between this list
> and any other list.

We talked about not including patches on a theoretical "git subtree maintainers list", for noise control purposes, and I think it would be good to extend this to the ksummit-* mailing lists as well.

I'm not sure there's an easy way to make mailman reject e-mails that contain patches, but I'd request that people please honor this restriction.  If you must, send the patch to the appropriate mailing list, and then include a lkml.org or some other mail archive link to the ksummit-* lists.

- Ted
Re: 3.6.0 kernel - ext4 corruption, or?
On Wed, Oct 24, 2012 at 08:15:37AM -0400, Justin Piszcz wrote:
> 
> I read there was a bug in 3.6.2, is there also one in 3.6.0, or can
> someone help explain this?

The problem which we are currently trying to investigate was reportedly introduced in v3.6.1.  So far that's about all we know; we have two users who have reported it, but I and other ext4 developers haven't been able to reproduce it yet.

> # grep 10.0.0.11 -r /etc
> /etc/posfix.old/master.cf:10.0.0.11:smtp inet n - - -1postscreen
> 
> # ls -l /etc/posfix.old/master.cf
> -rw-r--r-- 1 root root 13236 Oct 28  2011 /etc/posfix.old/master.cf
> 
> # ls -ld /etc/postfix.old
> ls: cannot access /etc/postfix.old: No such file or directory

Looks like you or some script renamed /etc/postfix to /etc/posfix.old as part of some upgrade?

- Ted
Re: [Ext2-devel] ext2 inode size (on-disk)
On Thu, Apr 19, 2001 at 07:55:20AM -0400, Alexander Viro wrote:
> Erm...  Folks, can ->s_inode_size be not a power of 2?  Both
> libext2fs and kernel break in that case.

This was a project that was never completed.  I thought at one point of allowing the inode size to be not a power of 2, but if you do that, you really want to avoid letting an inode cross a block boundary --- for reliability and performance reasons if nothing else.

It may simply be easiest at this point to require that the inode size be a power of two, at least as far as going from 128 to 256 bytes, just for compatibility reasons.  (Although if we do that, the folks who want to use extra space in the inode will come pouring out of the woodwork, and we're going to have to be careful to control who uses what parts of the extended inode.)

In the long run, it probably makes sense to adjust the algorithms to allow for non-power-of-two inode sizes, but require an incompatible filesystem feature flag (so that older kernels and filesystem utilities won't choke when mounting filesystems with non-standard sized inodes).

- Ted
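Ted's block-boundary concern can be checked with simple arithmetic.  A small C sketch (illustrative helper names, not libext2fs code): a power-of-two inode size that divides the block size never straddles a block boundary, while a non-power-of-two size such as 160 bytes does.

```c
#include <assert.h>
#include <stdint.h>

/* Does the inode at `inode_index` in the inode table straddle a block
 * boundary?  (Sketch only; these helpers are not from libext2fs.) */
static int inode_crosses_block(uint32_t inode_size, uint32_t block_size,
                               uint32_t inode_index)
{
    uint64_t start = (uint64_t)inode_index * inode_size;
    uint64_t end = start + inode_size - 1;

    return (start / block_size) != (end / block_size);
}

/* Scan a whole block group's worth of inodes for a straddler. */
static int any_inode_crosses(uint32_t inode_size, uint32_t block_size,
                             uint32_t inodes_per_group)
{
    for (uint32_t i = 0; i < inodes_per_group; i++)
        if (inode_crosses_block(inode_size, block_size, i))
            return 1;
    return 0;
}
```

With 160-byte inodes and 1k blocks, for instance, the seventh inode spans bytes 960-1119 and crosses into the next block; this is exactly the layout a non-power-of-two inode size would have to be taught to avoid.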
Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets
   an INCOMPAT flag, and a 'slightly tarnished' but very clever version
   that is both forward and backward-compatible, along the lines
   suggested by Ted.  Both flavors have the desirable property that old
   versions of fsck with no knowledge of the new index structure can
   remove the indices automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably have a COMPAT feature flag set so that you're forced to use a new enough version of e2fsck.  Otherwise an old e2fsck may end up not noticing corruptions in an index block, which might cause a new kernel to have serious heartburn.

- Ted
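For context, the asymmetry Ted relies on can be sketched in a few lines of C (a simplification with made-up bit values, not the real ext2_fs.h constants): a kernel may mount a filesystem carrying unknown COMPAT bits but must refuse unknown INCOMPAT bits, while e2fsck refuses to touch a filesystem with any feature bit it does not recognize.  That is why a COMPAT flag forces a new-enough e2fsck without locking out old kernels.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits this (hypothetical) kernel/e2fsck pair understands. */
#define KNOWN_COMPAT   0x0007u
#define KNOWN_INCOMPAT 0x0003u

/* Kernels ignore COMPAT bits they don't know about... */
static int kernel_can_mount(uint32_t compat, uint32_t incompat)
{
    (void)compat;               /* unknown COMPAT features are harmless */
    return (incompat & ~KNOWN_INCOMPAT) == 0;
}

/* ...but e2fsck must understand every set bit before it may modify
 * the filesystem, COMPAT ones included. */
static int fsck_can_run(uint32_t compat, uint32_t incompat)
{
    return (compat & ~KNOWN_COMPAT) == 0 &&
           (incompat & ~KNOWN_INCOMPAT) == 0;
}
```

So an old e2fsck that sees a new COMPAT directory-index bit refuses to run, while an old kernel can still mount the filesystem, which is the behavior Ted is after.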
Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
   From: Andreas Dilger <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and
   ".." interception by VFS is a new phenomenon in 2.4 or if this also
   happened in 2.2?  If so, then having these entries on disk will be
   important for 2.2 compatibility, and you don't want to have different
   on-disk formats between 2.2 and 2.4.

Well, you need to have the '.' and '..' there if you want full backwards compatibility.  That's clear.  If you don't care about backwards compatibility, it's important that there be a way to find the parent directory, but there doesn't have to be explicit '.' and '..' entries.  So if Daniel is going to try implementing it both ways, then that's one place where the #ifdef's might get a bit more complicated.

After it's done, we should do some benchmarks comparing it both ways; if the difference is negligible, I'd argue for simply always providing backwards compatibility.  One of the key advantages of ext2/ext3 is its backwards compatibility, and so if it's not too costly to preserve it (as I suspect will be the case), we should try to do so.

- Ted
Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 08:24:08 +0100

   > Is it worth it?  Well, it means you lose an index entry from each
   > directory block, thus reducing your fanout at each node of the tree
   > by 0.7% in the worst case (1k blocksize) and 0.2% if you're using
   > 4k blocksizes.

   I'll leave that up to somebody else - we now have two alternatives,
   the 100%, no-compromise INCOMPAT solution, and the slightly-bruised
   but still largely intact forward compatible solution.  I'll maintain
   both solutions for now so it's just as easy to choose either in the
   end.

Well, the $64,000 question is exactly how much performance does it cost?  My guess is that it will be barely measurable, but only benchmarks will answer that question.

- Ted
Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2
Daniel,

Nice work!  A couple of comments.  If you make the beginning of each index block look like an empty directory block (i.e., the first 8 bytes look like this):

	32 bits: ino == 0
	16 bits: rec_len == blocksize
	16 bits: name_len == 0

... then you will have full backwards compatibility, both for reading *and* writing.  When reading, old kernels will simply ignore the index blocks, since each looks like it has an unpopulated directory entry.  And if the kernel attempts to write into the directory, it will clear the BTREE_FL flag, in which case new kernels won't treat the directory as a tree anymore.  (Running a smart e2fsck which knows about directory trees will be able to restore the tree structure.)

Is it worth it?  Well, it means you lose an index entry from each directory block, thus reducing your fanout at each node of the tree by 0.7% in the worst case (1k blocksize) and 0.2% if you're using 4k blocksizes.

- Ted
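The 8-byte header Ted describes can be sketched in a few lines of C (illustrative helpers, not kernel code; the little-endian on-disk byte order is assumed to match the host here).  An old kernel's directory walk reads rec_len at offset 4 and hops that many bytes, so a single unused entry claiming the whole block hides the index data from it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Write the fake "empty directory" header into the first 8 bytes of
 * an index block: ino == 0 marks the entry unused, and rec_len equal
 * to the block size makes it cover the whole block. */
static void write_fake_dirent(uint8_t *block, uint16_t blocksize)
{
    uint32_t ino = 0;
    uint16_t rec_len = blocksize;
    uint16_t name_len = 0;

    memcpy(block + 0, &ino, 4);
    memcpy(block + 4, &rec_len, 2);
    memcpy(block + 6, &name_len, 2);
}

/* What an old kernel's directory walk effectively does with the first
 * entry: hop by rec_len.  Returns the offset where the walk lands. */
static uint16_t skip_first_dirent(uint16_t blocksize)
{
    uint8_t blk[4096] = { 0 };
    uint16_t rec_len;

    write_fake_dirent(blk, blocksize);
    memcpy(&rec_len, blk + 4, 2);
    return rec_len;             /* == blocksize: whole block skipped */
}
```

Since the hop always lands exactly at the end of the block, a pre-htree kernel sees nothing but one unpopulated entry per index block.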
Re: mke2fs and kernel VM issues
   Date: Fri, 16 Feb 2001 09:48:17 +0000 (GMT)
   From: Alan Cox <[EMAIL PROTECTED]>

   > heavily modifed VA kernel based on 2.2.18.  Is there a kernel which
   > is believed to be a known good kernel? (both 2.2.x and 2.4.x)

   I've not seen the problem on unmodified 2.2.18.  The 2.2.17/18 VM
   does have its problems but not these.  2.2.19pre3 and higher have the
   Andrea VM fixes which have worked wonders for everyone so far.

Note that this only shows up when using mke2fs to create very large filesystems, when you have relatively little memory.  In this particular case, for example, we saw it with a system that had "only" 256 megs of memory, creating a 72 gigabyte filesystem using an 8x9GB RAID configuration.

Some folks at IBM (in the Mylex controller group) have found this problem with 2.2.16, 2.2.18, and with some 2.2.19pre patch (they didn't say exactly which level of the 2.2.19pre patch they were dealing with).  Some folks are claiming that the problem exists under 2.2.18, and we've seen it with our kernel, which is 2.2.18 plus some set of 2.2.19pre* patches.

The problem is that mke2fs issues a *lot* of writes when it is writing the inode table, and apparently the write throttling isn't completely working right under those circumstances.  There is a workaround which easily fixes the problem; if you set the MKE2FS_SYNC environment variable to some value such as 5 or 10, then after writing every 5 or 10 block groups' worth of inode tables, mke2fs will call sync().  This workaround did fix IBM's problem, which lends credence to the theory that the problem is a VM bug related to a lack of sufficient write throttling.

I've in the past considered making MKE2FS_SYNC=10 be the default, but Stephen has requested that I not do this, since it's the best way of showing off this particular VM bug.

- Ted

From IBM/Mylex's bug report:

> The system I used for these tests is a Chardonnay with an AR160
> installed.  The FW is 6.00-07 and the BIOS is 6.01-08.  The system
> has several various kernel/DAC driver boot configurations set up.
> The RAID drive under test is a 3-drive RAID 5 with 8.6GB drives,
> for a total of 17GB.  When one maximum sized partition is created,
> the total number of logical cylinders is 2209.
>
> I also obtained the 2.2.19 patch and upgraded kernel 2.2.18 to 2.2.19.
>
> Note: The first 4 tests fail to at least some extent.  Please read
> each one.
>
> 1. Kernel 2.2.16 and DAC driver 2.2.9 - 128MB of main memory.
>    The system fails to complete the creation of an ext2 file system.
>    The process mke2fs (or mkfs) gets terminated instead.  Subsequent
>    attempts to create an ext2 file system (without rerunning fdisk)
>    fail.
>
> 2. Kernel 2.2.18 and DAC driver 2.2.9 - 128 MB of main memory.
>    The ext2 file system is created, but hundreds of VM error messages
>    scroll up on the screen.  Expanding the swap space to exceed the
>    memory size does not help (I think it might even be worse).
>
>    I also tried running a copy compare script that loads the drives
>    with heavy I/O.  This failed after approximately 20 hours.  The
>    system was effectively locked up with VM error messages scrolling
>    up the screen and all alternate terminal screens.
>
> 3. Kernel 2.2.16 and DAC driver 2.2.10 - 128 MB of main memory.
>    The system fails to complete the creation of an ext2 file system.
>    The process mke2fs (or mkfs) gets terminated instead.  Subsequent
>    attempts to create an ext2 file system (without rerunning fdisk)
>    fail.
>
> 4. Kernel 2.2.19 and DAC driver 2.2.9 - 128 MB of main memory.
>    The system fails to complete the creation of an ext2 file system.
>    The process mke2fs (or mkfs) gets terminated instead.  Subsequent
>    attempts to create an ext2 file system (without rerunning fdisk)
>    DO NOT FAIL.
>
> 5. Kernel 2.2.16 and DAC driver 2.2.9 - 512 MB of main memory.
>    The system completes the creation of an ext2 file system without
>    any errors.
>
> 6. Kernel 2.2.16 and DAC driver 2.2.10 - 512 MB of main memory.
>    The system completes the creation of an ext2 file system without
>    any errors.
>
> 7. Kernel 2.2.18 and DAC driver 2.2.9 - 512 MB of main memory.
>    The system completes the creation of an ext2 file system without
>    any errors.
>
> 8. Kernel 2.2.19 and DAC driver 2.2.9 - 512 MB of main memory.
>    The system completes the creation of an ext2 file system without
>    any errors.
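The effect of the MKE2FS_SYNC workaround described above is easy to model.  This loop (a simplification, not mke2fs source) counts the extra sync() calls that would be issued while writing the inode tables for `ngroups` block groups:

```c
#include <assert.h>

/* MKE2FS_SYNC=N means: flush after every N block groups' worth of
 * inode tables, so dirty buffers can't outrun the VM's writeback.
 * N == 0 (the default) disables the extra flushing entirely. */
static int sync_calls(int ngroups, int sync_interval)
{
    int calls = 0;

    for (int group = 1; group <= ngroups; group++) {
        /* write_inode_tables(group); -- hypothetical placeholder */
        if (sync_interval > 0 && group % sync_interval == 0)
            calls++;            /* sync(); bound the dirty-page pile */
    }
    return calls;
}
```

With MKE2FS_SYNC=10 on a filesystem of a few hundred block groups, one giant burst of dirty inode-table pages is turned into a few dozen bounded ones, which is why the workaround masks the write-throttling bug.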
Re: Lucent Microelectronics Venus Modem, serial 5.05, and Linux 2.4.0
On Sun, Jan 14, 2001 at 08:10:45PM +0100, W. Michael Petullo wrote:
> > In serial.c, you seem to perform a check by writing to a possible
> > modem's interrupt enable register and reading the result.  This
> > seems to be one of the points at which the auto-configuration
> > process occasionally fails.  If I make the following change to this
> > code my modem seems to be auto-detected correctly all of the time:
> 
> 	scratch = serial_inp(info, UART_IER);
> 	serial_outp(info, UART_IER, 0);
> #ifdef __i386__
> 	outb(0xff, 0x080);
> #endif
> 	scratch2 = serial_inp(info, UART_IER);
> 	serial_outp(info, UART_IER, 0x0F);
> #ifdef __i386__
> 	outb(0, 0x080);
> #endif
> -	scratch3 = serial_inp(info, UART_IER);	/* REMOVE */
> +	scratch3 = 0x0f;			/* ADD */
> 	serial_outp(info, UART_IER, scratch);

The problem is that if this doesn't work, there are some serious questions about the correctness of the Lucent Microelectronics Venus modem.  I've forwarded this to someone in the Lucent modem group, who can hopefully look at this (and maybe can ship me sample hardware so I can play with it, although I'd much rather that he tell me how to work around the hardware bug, or tell me that all you need is a firmware upgrade to fix the bug in the modem).

- Ted
Re: Renaming lost+found
On Sun, Jan 28, 2001 at 01:35:44PM -0800, H. Peter Anvin wrote:
> *renamed*, i.e. does the tools (e2fsck) use "/lost+found" by name,
> or by inode?  As far as I know it always uses the same inode number

e2fsck uses /lost+found by name, not by inode.  It will recreate a new lost+found directory if one doesn't exist.  *However*, if the filesystem is badly corrupted, it's possible that when it allocates blocks for the lost+found directory, it might overwrite a data block that might possibly be recoverable if one were truly desperate and using a disk editor to search for keywords.  (This would only happen if part of the inode table had gotten corrupted due to a hardware error --- i.e., such as the anecdotal evidence of DMA units that write garbage to the disk during a power failure, where it is conjectured that the +5V power rail drops below the critical working voltage of the memory faster than the +12V power rail drops below the critical working voltage of the disk drive --- so that the record in the inode table that a certain disk block was in use is erased.)

So if you really dislike lost+found, go ahead and delete it.  It removes a somewhat tiny safeguard, but being able to take advantage of it requires wizard-level skills (there are no tools to do this automatically, since it requires human intuition and a knowledge of what file you might be trying to save).  So it would probably only be used in the case of someone who had 10 years of Ph.D. research that wasn't backed up, and this was the only way they could get the data back.  And although not doing disk backups is grounds for general ridicule, losing ten years of graduate research would probably be regarded by most as cruel and unusual punishment.  But if you're not in a Ph.D. program, it doesn't matter, yes?  (And in any case, we ALL do backups, all the time, religiously and on a regular schedule, RIGHT?  :-)

- Ted
Re: Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
Date: Tue, 16 Jan 2001 20:33:34 +0100 From: Andrea Arcangeli <[EMAIL PROTECTED]> > Of course, this is utterly unsafe on an SMP machines, since access to > the "block" variable isn't protected at all. So the first question is Wrong, it's obviously protected by the inode_lock. And even if it wasn't protected by the inode_lock in 2.2.x inode.c and dcache.c runs globally serialized by the BKL (but it is obviously protected regardless of the BKL). Yes, you're quite right. The fact that you have to have inode_lock before you call try_to_free inodes would would protect the "block" variable. >static struct semaphore block = MUTEX; >if (down_trylock()) { >spin_unlock(_lock); >down(); >spin_lock(_lock); >} The above is overkill (there's no need to use further atomic API, when we can rely on the inode_lock for the locking. It's overcomplex and slower. Actually, looking at the fast path of down_trylock compared to huge mess of code that's currently there, I actually suspect that using down_trylock() would actually be faster, since in the fast path case there would only two assembly language instructions, whereas the code that's currently there is (a) much more complicated, and thus harder to understand, and (b) is many more instructions to execute. Sometimes the simplest approach is the best. > (with the appropriate unlocking code at the end of the function). > > Next question why was this there in the first place? After all, To fix the "inode-max limit reached" faliures that you could reproduce on earlier 2.2.x. (the freed inodes was re-used before the task that freed them had a chance to allocate them for itself) Ah, OK. Well, we're currently tracking down a slow inode leak which is only happening on SMP machines, especially our mailhubs. 
It's gradual, but if you don't reboot the machine before you run out of inodes, it will print the "inode-max limit reached" message, and then shortly after that the entire machine locks up until someone can come in and hit the Big Red Button. Monitoring the machine before it locks up, we noted that the number of inodes in use was continually climbing until the machine died. (Yeah, we could put a reboot command into crontab, but you should only need to do hacks like that on Windows NT machines. :-)

We're running a reasonably recent 2.2 kernel on our machines, and the problem is only showing up on SMP machines, so there's *some* kind of SMP race hiding in the 2.2 inode code.

- Ted
Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)
HJ Lu recently pointed me at a potential locking problem in try_to_free_inodes(), and when I started prodding at it, I found what appears to be another set of SMP locking issues in the dcache code. (But if that were the case, why aren't we seeing huge numbers of complaints? So I'm wondering if I'm missing something.)

Anyway, the first problem which HJ pointed out is in try_to_free_inodes(), which attempts to implement a mutual exclusion with respect to itself as follows:

	if (block) {
		struct wait_queue __wait;

		__wait.task = current;
		add_wait_queue(&wait_inode_freeing, &__wait);
		for (;;) {
			/* NOTE: we rely only on the inode_lock to be sure
			   to not miss the unblock event */
			current->state = TASK_UNINTERRUPTIBLE;
			spin_unlock(&inode_lock);
			schedule();
			spin_lock(&inode_lock);
			if (!block)
				break;
		}
		remove_wait_queue(&wait_inode_freeing, &__wait);
		current->state = TASK_RUNNING;
	}
	block = 1;

Of course, this is utterly unsafe on SMP machines, since access to the "block" variable isn't protected at all. So the first question is why did whoever implemented this do it in this horribly complicated way above, instead of something simple, say like this:

	static struct semaphore block = MUTEX;

	if (down_trylock(&block)) {
		spin_unlock(&inode_lock);
		down(&block);
		spin_lock(&inode_lock);
	}

(with the appropriate unlocking code at the end of the function).

Next question: why was this there in the first place? After all, most of the time try_to_free_inodes() is called with the inode_lock spinlock held, which would act as an automatic mutual exclusion anyway. The only time this doesn't happen is when we call prune_dcache(), where inode_lock is temporarily dropped.
So I took a look at prune_dcache(), and discovered that (a) it's called from multiple places, and (b) it and shrink_dcache_sb() both iterate over dentry_unused and, among other things, try to free dcache structures without any kind of locking to prevent two kernel threads from mucking with the contents of dentry_unused at the same time, and possibly having prune_one_dentry() being called by two processes on the same dentry structure. This should almost certainly cause problems.

So the following patch I think is definitely necessary, assuming that the "block" mutual exclusion in try_to_free_inodes() needed to be there in the first place. I'm not so sure about needing a patch to protect access to dentry_unused in fs/dcache.c, but unless there are some other unstated locking hierarchy rules about when it's safe to call prune_dcache() and shrink_dcache_sb(), I suspect we need to add a lock to protect dentry_unused as well.

Comments?

- Ted

--- fs/inode.c.old	Fri Sep 29 17:37:01 2000
+++ fs/inode.c	Tue Jan 16 09:29:06 2001
@@ -380,36 +380,19 @@
  */
 static void try_to_free_inodes(int goal)
 {
-	static int block;
-	static struct wait_queue * wait_inode_freeing;
+	static struct semaphore block = MUTEX;
 	LIST_HEAD(throw_away);
 
 	/* We must make sure to not eat the inodes while the blocker task
	   sleeps otherwise the blocker task may find none inode available. */
-	if (block)
-	{
-		struct wait_queue __wait;
-
-		__wait.task = current;
-		add_wait_queue(&wait_inode_freeing, &__wait);
-		for (;;)
-		{
-			/* NOTE: we rely only on the inode_lock to be sure
-			   to not miss the unblock event */
-			current->state = TASK_UNINTERRUPTIBLE;
-			spin_unlock(&inode_lock);
-			schedule();
-			spin_lock(&inode_lock);
-			if (!block)
-				break;
-		}
-		remove_wait_queue(&wait_inode_freeing, &__wait);
-		current->state = TASK_RUNNING;
+	if (down_trylock(&block)) {
+		spin_unlock(&inode_lock);
+		down(&block);
+		spin_lock(&inode_lock);
 	}
-	block = 1;
 
 	/*
	 * First stry to just get rid of unused inodes.
	 *
@@ -427,8 +410,7 @@
 	}
 	if (!list_empty(&throw_away))
 		dispose_list(&throw_away);
-	block = 0;
-	wake_up(&wait_inode_freeing);
+	up(&block);
 }
 
 /*
Re: [patch] 2.2.18 PCI_DEVICE_ID_OXSEMI_16PCI954
Date: Sun, 17 Dec 2000 01:51:15 +0100 (CET)
From: Lukasz Trabinski <[EMAIL PROTECTED]>

> OK, you're right, I'm not a driver programmer, but it probably should
> look like this:
>
> +#define PCI_DEVICE_ID_OXSEMI_16PCI954	0x9501

That's correct, yes. The reason why the original poster was having problems was that serial_compat.h defined PCI_VENDOR_ID_OXSEMI and PCI_DEVICE_ID_OXSEMI_16PCI954 only if PCI_VENDOR_ID_OXSEMI was not defined. Since 2.2.18 defines PCI_VENDOR_ID_OXSEMI, but not ..._OXSEMI_16PCI954, the 5.05 serial.c wouldn't compile. When we integrate a newer serial driver into 2.2.19, I'll include the necessary patches to pci_ids.h.

> Why doesn't serial.c from 2.2.18 support 921600 speed? We have to
> patch it...

2.2.18 has an old serial driver that doesn't know about PCI devices. The updated serial driver available at http://serial.sourceforge.net should deal with this correctly.

- Ted
Re: Serial cardbus code.... for testing, please.....
[Apologies if this has been seen already, but as far as I know my first posting to the L-K list apparently never made it out. - Ted]

OK, so I'm currently on the road (San Diego IETF meeting), so I can't really test this very well; when your compile engine is your Vaio laptop, it's really slow and painful to do test builds and test-boot kernels. But I know some people are eager to test it, and would rather have something potentially flaky earlier rather than waiting for something more tested later, so here it is.

Please note, the only testing that I have done to date is Linus's famous "it builds, ship it" regression test suite. So if you're not willing to test a patch which might crash your machine, come back in a day or two after I and others have had a chance to do some testing. (But hey, Linus has released kernels to the entire world without even doing a test compile, so why can't I do it with something as simple as a serial driver? :-)

This version of the patch has a couple of new features over past ones, including sanity check code which should prevent receive_chars() from getting stuck in a tight loop when the serial card is ejected while a port is active. It also has functions correctly labelled with __devinit and __devexit, and will check to see if an entry in the serial PCI table isn't necessary, and ask the user to report the information in that case. (The mailing list will be active within 24 hours; until then, send the information to me instead of the e-mail address listed in the patch.)

The patch also has changes which Kanoj from SGI has been bugging me about constantly, because he needs some changes to support the big-iron MIPS box project which he's working on. I had sat on them because clearly I have a different understanding of "code freeze; critical bugs only" than other folks on the L-K mailing list, but they're included here now.
Linus, I can back out some or all of these changes when I feed the patches to you for merging with 2.4; just let me know what you want and don't want.

- Ted

Release notes
=============

* Add support for PCI hot plugging.
	- Functions should be correctly labeled with __devinit and __devexit now.
	- Set a safety check to prevent infinite loops in receive_chars
	- Added support for removing PCI cards
* Added code to test to see if an entry in the serial driver's PCI table is redundant (i.e., could have been deduced using our heuristics) and asks the user to report the information if so.
* Add support for flow-controlled serial console. (Feature desperately requested to be merged into 2.4.0 by Kanoj Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS box)
* Add new interface, early_serial_setup(), which allows architecture-specific code to set up serial consoles for those platforms where the port and bus information must be dynamically determined at boot. Early_serial_setup() must be called before rs_init(). (Feature desperately requested to be merged into 2.4.0 by Kanoj Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS box)
* Fixed fencepost bug which could cause the serial driver to overrun the flip buffer by a single character if the UART reports an overrun when the flip buffer is nearly full. (Not really a critical problem, because we have slop space in the flip buffer for this purpose, but it really would be good to have this fixed.)
* Add support for the DCI PCCOM8 8-port serial board

Patch generated: on Wed Dec 13 10:39:41 EST 2000 by tytso@trampoline against Linux version 2.4.0

===================================================================
RCS file: Documentation/RCS/serial-console.txt,v
retrieving revision 1.1
diff -u -r1.1 Documentation/serial-console.txt
--- Documentation/serial-console.txt	2000/12/13 15:12:57	1.1
+++ Documentation/serial-console.txt	2000/12/13 15:13:09
@@ -20,9 +20,11 @@
 options:	depend on the driver. For the serial port this defines the
 		baudrate/parity/bits of the port,
-		in the format <baud>PN, where <baud> is the speed,
-		P is parity (n/o/e), and N is bits. Default is
-		9600n8. The maximum baudrate is 115200.
+		in the format <baud>PNF, where <baud> is the speed,
+		P is parity (n/o/e), N is bits, and F is
+		flow control (n/r for none/rtscts). P, N,
+		and F are optional. The default setting is
+		9600n8n. The maximum baudrate is 115200.