Re: [PATCH] ext4: fix debug format string warning

2021-04-09 Thread tytso
On Fri, Apr 09, 2021 at 10:12:05PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Using no_printk() for jbd_debug() revealed two warnings:
> 
> fs/jbd2/recovery.c: In function 'fc_do_one_pass':
> fs/jbd2/recovery.c:256:30: error: format '%d' expects a matching 'int' 
> argument [-Werror=format=]
>   256 | jbd_debug(3, "Processing fast commit blk with seq 
> %d");
>   |  ^~~~
> fs/ext4/fast_commit.c: In function 'ext4_fc_replay_add_range':
> fs/ext4/fast_commit.c:1732:30: error: format '%d' expects argument of type 
> 'int', but argument 2 has type 'long unsigned int' [-Werror=format=]
>  1732 | jbd_debug(1, "Converting from %d to %d %lld",
> 
> The first one was added incorrectly, and was also missing a few newlines
> in debug output, and the second one happened when the type of an
> argument changed.
> 
> Reported-by: kernel test robot 
> Fixes: d556435156b7 ("jbd2: avoid -Wempty-body warnings")
> Fixes: 6db074618969 ("ext4: use BIT() macro for BH_** state bits")
> Fixes: 5b849b5f96b4 ("jbd2: fast commit recovery path")
> Signed-off-by: Arnd Bergmann 

Thanks, applied, with one change.

> diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> index 69f18fe20923..60601c5779f1 100644
> --- a/fs/jbd2/recovery.c
> +++ b/fs/jbd2/recovery.c
> @@ -245,15 +245,15 @@ static int fc_do_one_pass(journal_t *journal,
> 
>  
>   if (err) {
> - jbd_debug(3, "Fast commit replay: read error");
> + jbd_debug(3, "Fast commit replay: read error\n");
>   break;
>   }
>  
> - jbd_debug(3, "Processing fast commit blk with seq %d");
> + jbd_debug(3, "Processing fast commit blk with seq\n");

This debug statement isn't adding any real value, so I just removed
it.

- Ted


Re: [PATCH] ext4: fix marking group trimmed if all blocks not trimmed

2020-08-07 Thread tytso
On Mon, Aug 03, 2020 at 04:17:44PM +, Lazar Beloica wrote:
> When FTRIM is issued on a group, ext4 marks it as trimmed so another FTRIM
> on the same group has no effect. Ext4 marks group as trimmed if at least
> one block is trimmed, therefore it is possible that a group is marked as
> trimmed even if there are blocks in that group left untrimmed.
> 
> This patch marks group as trimmed only if there are no more blocks
> in that group to be trimmed.

This patch makes no sense; first of all, the changes below are *not*
in the function ext4_trim_extent(), but rather ext4_trim_all_free().
It appears that the diff is based off of v5.8-rc2, based on the index
c0a331e, but then I'm not sure how you generated the diff?

Secondly, ext4_trim_all_free(), which is where these two patch hunks
appear:

> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index c0a331e..130936b 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5346,6 +5346,7 @@ static int ext4_trim_extent(struct super_block *sb, int 
> start, int count,
>  {
>   void *bitmap;
>   ext4_grpblk_t next, count = 0, free_count = 0;
> + ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
>   struct ext4_buddy e4b;
>   int ret = 0;
>  
> @@ -5401,7 +5402,9 @@ static int ext4_trim_extent(struct super_block *sb, int 
> start, int count,
>  
>   if (!ret) {
>   ret = count;
> - EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
> + next = mb_find_next_bit(bitmap, max_blks, max + 1);
> + if (next == max_blks)
> + EXT4_MB_GRP_SET_TRIMMED(e4b.bd_info);
>   }
>  out:
>   ext4_unlock_group(sb, group);

The function send discards for blocks in a block group which are
freed.  So setting max_blks to be ext4_blocks_count() and then using
it as the limit to mb_find_next_bit() makes no sense.  First of all
next will never be equal to max_blks, since next is an offset relative
to the beginning of the block group, and max_blks is set number of
blocks in the entire file system.

Secondly, mb_find_next_bit is searching a bitmap, which is a single
file system block (e.g., 4k in a 4k block file system).  So if
max_blks is the the number of blocks in (for example) a 10TB file
system, this is going to potentially cause a kernel oops.

How, exactly did you test this patch?

   - Ted


Re: [PATCH] ext4: move buffer_mapped() to proper position

2020-08-07 Thread tytso
Thanks, applied, although I rewrote the commit description to make it
be a bit more clearer:

fs: prevent BUG_ON in submit_bh_wbc()

If a device is hot-removed --- for example, when a physical device is
unplugged from pcie slot or a nbd device's network is shutdown ---
this can result in a BUG_ON() crash in submit_bh_wbc().  This is
because the when the block device dies, the buffer heads will have
their Buffer_Mapped flag get cleared, leading to the crash in
submit_bh_wbc.

We had attempted to work around this problem in commit a17712c8
("ext4: check superblock mapped prior to committing").  Unfortunately,
it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
device dies between when the work-around check in ext4_commit_super()
and when submit_bh_wbh() is finally called:

Code path:
ext4_commit_super
judge if 'buffer_mapped(sbh)' is false, return <== commit a17712c8
  lock_buffer(sbh)
  ...
  unlock_buffer(sbh)
   __sync_dirty_buffer(sbh,...
lock_buffer(sbh)
judge if 'buffer_mapped(sbh))' is false, return <== 
added by this patch
submit_bh(...,sbh)
submit_bh_wbc(...,sbh,...)

[100722.966497] kernel BUG at fs/buffer.c:3095! <== 
BUG_ON(!buffer_mapped(bh))' in submit_bh_wbc()
[100722.966503] invalid opcode:  [#1] SMP
[100722.966566] task: 8817e15a9e40 task.stack: c90024744000
[100722.966574] RIP: 0010:submit_bh_wbc+0x180/0x190
[100722.966575] RSP: 0018:c90024747a90 EFLAGS: 00010246
[100722.966576] RAX: 00620005 RBX: 8818a80603a8 RCX: 

[100722.966576] RDX: 8818a80603a8 RSI: 00020800 RDI: 
0001
[100722.966577] RBP: c90024747ac0 R08:  R09: 
88207f94170d
[100722.966578] R10: 000437c8 R11: 0001 R12: 
00020800
[100722.966578] R13: 0001 R14: 0bf9a438 R15: 
88195f333000
[100722.966580] FS:  7fa2eee27700() GS:88203d84() 
knlGS:
[100722.966580] CS:  0010 DS:  ES:  CR0: 80050033
[100722.966581] CR2: 00f0b008 CR3: 00201a622003 CR4: 
007606e0
[100722.966582] DR0:  DR1:  DR2: 

[100722.966583] DR3:  DR6: fffe0ff0 DR7: 
0400
[100722.966583] PKRU: 5554
[100722.966583] Call Trace:
[100722.966588]  __sync_dirty_buffer+0x6e/0xd0
[100722.966614]  ext4_commit_super+0x1d8/0x290 [ext4]
[100722.966626]  __ext4_std_error+0x78/0x100 [ext4]
[100722.966635]  ? __ext4_journal_get_write_access+0xca/0x120 [ext4]
[100722.966646]  ext4_reserve_inode_write+0x58/0xb0 [ext4]
[100722.966655]  ? ext4_dirty_inode+0x48/0x70 [ext4]
[100722.93]  ext4_mark_inode_dirty+0x53/0x1e0 [ext4]
[100722.966671]  ? __ext4_journal_start_sb+0x6d/0xf0 [ext4]
[100722.966679]  ext4_dirty_inode+0x48/0x70 [ext4]
[100722.966682]  __mark_inode_dirty+0x17f/0x350
[100722.966686]  generic_update_time+0x87/0xd0
[100722.966687]  touch_atime+0xa9/0xd0
[100722.966690]  generic_file_read_iter+0xa09/0xcd0
[100722.966694]  ? page_cache_tree_insert+0xb0/0xb0
[100722.966704]  ext4_file_read_iter+0x4a/0x100 [ext4]
[100722.966707]  ? __inode_security_revalidate+0x4f/0x60
[100722.966709]  __vfs_read+0xec/0x160
[100722.966711]  vfs_read+0x8c/0x130
[100722.966712]  SyS_pread64+0x87/0xb0
[100722.966716]  do_syscall_64+0x67/0x1b0
[100722.966719]  entry_SYSCALL64_slow_path+0x25/0x25

To address this, add the check of 'buffer_mapped(bh)' to
__sync_dirty_buffer().  This also has the benefit of fixing this for
other file systems.

With this addition, we can drop the workaround in ext4_commit_supper().

[ Commit description rewritten by tytso. ]

Signed-off-by: Xianting Tian 
Link: 
https://lore.kernel.org/r/1596211825-8750-1-git-send-email-xianting_t...@126.com
Signed-off-by: Theodore Ts'o 

- Ted


Re: [PATCH] jbd2: fix incorrect code style

2020-08-05 Thread tytso
On Sat, Jul 18, 2020 at 08:57:37AM -0400, Xianting Tian wrote:
> Remove unnecessary blank.
> 
> Signed-off-by: Xianting Tian 

Thanks, applied.

- Ted


Re: ext4: fix spelling typos in ext4_mb_initialize_context

2020-08-05 Thread tytso
On Wed, Jul 15, 2020 at 11:00:44AM +0800, brookxu wrote:
> Fix spelling typos in ext4_mb_initialize_context.
> 
> Signed-off-by: Chunguang Xu 

Thanks, applied.

- Ted


Re: [PATCH] Replace HTTP links with HTTPS ones: Ext4

2020-08-05 Thread tytso
On Mon, Jul 06, 2020 at 09:03:39PM +0200, Alexander A. Klimov wrote:
> Rationale:
> Reduces attack surface on kernel devs opening the links for MITM
> as HTTPS traffic is much harder to manipulate.

Thanks, applied.

- Ted


Re: [PATCH] ext4: fix spelling mistakes in extents.c

2020-07-29 Thread tytso
On Wed, Jun 10, 2020 at 11:19:46PM -0400, Keyur Patel wrote:
> Fix spelling issues over the comments in the code.
> 
> requsted ==> requested
> deterimined ==> determined
> insde ==> inside
> neet ==> need
> somthing ==> something
> 
> Signed-off-by: Keyur Patel 

Applied, thanks.

- Ted


Re: strace of io_uring events?

2020-07-15 Thread tytso
On Wed, Jul 15, 2020 at 06:11:30PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 15, 2020 at 07:35:50AM -0700, Andy Lutomirski wrote:
> > > On Jul 15, 2020, at 4:12 AM, Miklos Szeredi  wrote:
> > > This thread is to discuss the possibility of stracing requests
> > > submitted through io_uring.   I'm not directly involved in io_uring
> > > development, so I'm posting this out of  interest in using strace on
> > > processes utilizing io_uring.

> > > 
> > > Is there some existing tracing infrastructure that strace could use to
> > > get async completion events?  Should we be introducing one?

I suspect the best approach to use here is use eBPF, since since
sending asyncronously to a ring buffer is going to be *way* more
efficient than using the blocking ptrace(2) system call...

- Ted


Re: [PATCH v2] CodingStyle: Inclusive Terminology

2020-07-08 Thread tytso
On Wed, Jul 08, 2020 at 12:23:59AM -0700, Dan Williams wrote:
> Linux maintains a coding-style and its own idiomatic set of terminology.
> Update the style guidelines to recommend replacements for the terms
> master/slave and blacklist/whitelist.
> 
> Link: 
> http://lore.kernel.org/r/159389297140.2210796.13590142254668787525.st...@dwillia2-desk3.amr.corp.intel.com
> Cc: Jonathan Corbet 
> Acked-by: Randy Dunlap 
> Acked-by: Dave Airlie 
> Acked-by: Kees Cook 
> Acked-by: SeongJae Park 
> Signed-off-by: Olof Johansson 
> Signed-off-by: Chris Mason 
> Signed-off-by: Greg Kroah-Hartman 
> Signed-off-by: Dan Williams 

Signed-off-by: Theodore Ts'o 



Re: better patch for linux/bitops.h

2016-05-05 Thread tytso
On Thu, May 05, 2016 at 05:34:50PM -0400, Sandy Harris wrote:
> 
> I completely fail to see why tests or compiler versions should be
> part of the discussion. The C standard says the behaviour in
> certain cases is undefined, so a standard-compliant compiler
> can generate more-or-less any code there.
> 

> As long as any of portability, reliability or security are among our
> goals, any code that can give undefined behaviour should be
> considered problematic.

Because compilers have been known not necessarily to obey the specs,
and/or interpret the specs in way that not everyone agrees with.  It's
also the case that we are *already* disabling certain C optimizations
which are technically allowed by the spec, but which kernel
programmers consider insane (e.g., strict aliasing).

And of course, memzero_explicit() which crypto people understand is
necessary, is something that technically compilers are allowed to
optimize according to the spec.  So trying to write secure kernel code
which will work on arbitrary compilers may well be impossible.

And which is also why people have said (mostly in jest), "A
sufficiently advanced compiler is indistinguishable from an
adversary."  (I assume people will agree that optimizing away a memset
needed to clear secrets from memory would be considered adversarial,
at the very least!)

So this is why I tend to take a much more pragmatic viewpoint on
things.  Sure, it makes sense to pay attention to what the C standard
writers are trying to do to us; but if we need to suppress certain
optimizations to write sane kernel code --- I'm ok with that.  And
this is why using a trust-but-verify on a specific set of compilers
and ranges of compiler versions is a really good idea

 - Ted


Re: better patch for linux/bitops.h

2016-05-05 Thread tytso
On Thu, May 05, 2016 at 05:34:50PM -0400, Sandy Harris wrote:
> 
> I completely fail to see why tests or compiler versions should be
> part of the discussion. The C standard says the behaviour in
> certain cases is undefined, so a standard-compliant compiler
> can generate more-or-less any code there.
> 

> As long as any of portability, reliability or security are among our
> goals, any code that can give undefined behaviour should be
> considered problematic.

Because compilers have been known not necessarily to obey the specs,
and/or interpret the specs in way that not everyone agrees with.  It's
also the case that we are *already* disabling certain C optimizations
which are technically allowed by the spec, but which kernel
programmers consider insane (e.g., strict aliasing).

And of course, memzero_explicit() which crypto people understand is
necessary, is something that technically compilers are allowed to
optimize according to the spec.  So trying to write secure kernel code
which will work on arbitrary compilers may well be impossible.

And which is also why people have said (mostly in jest), "A
sufficiently advanced compiler is indistinguishable from an
adversary."  (I assume people will agree that optimizing away a memset
needed to clear secrets from memory would be considered adversarial,
at the very least!)

So this is why I tend to take a much more pragmatic viewpoint on
things.  Sure, it makes sense to pay attention to what the C standard
writers are trying to do to us; but if we need to suppress certain
optimizations to write sane kernel code --- I'm ok with that.  And
this is why using a trust-but-verify on a specific set of compilers
and ranges of compiler versions is a really good idea

 - Ted


Re: [PATCH] ext4: Fix check of dqget() return value in ext4_ioctl_setproject()

2016-05-05 Thread tytso
On Thu, May 05, 2016 at 02:52:34PM +0200, Jan Kara wrote:
> On Fri 22-04-16 13:36:32, Seth Forshee wrote:
> > On Tue, Mar 29, 2016 at 08:01:03AM -0500, Seth Forshee wrote:
> > > A failed call to dqget() returns an ERR_PTR() and not null. Fix
> > > the check in ext4_ioctl_setproject() to handle this correctly.
> > > 
> > > Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR 
> > > interface support")
> > > Cc: sta...@vger.kernel.org # v4.5
> > > Signed-off-by: Seth Forshee 
> > 
> > Thought I'd check and see if this patch was forgotten, I sent it nearly
> > a month ago but the bug is still present.
> 
> Yeah, it seems Ted forgot to pull the fix. Ted? BTW, feel free to add
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted


Re: [PATCH] ext4: Fix check of dqget() return value in ext4_ioctl_setproject()

2016-05-05 Thread tytso
On Thu, May 05, 2016 at 02:52:34PM +0200, Jan Kara wrote:
> On Fri 22-04-16 13:36:32, Seth Forshee wrote:
> > On Tue, Mar 29, 2016 at 08:01:03AM -0500, Seth Forshee wrote:
> > > A failed call to dqget() returns an ERR_PTR() and not null. Fix
> > > the check in ext4_ioctl_setproject() to handle this correctly.
> > > 
> > > Fixes: 9b7365fc1c82 ("ext4: add FS_IOC_FSSETXATTR/FS_IOC_FSGETXATTR 
> > > interface support")
> > > Cc: sta...@vger.kernel.org # v4.5
> > > Signed-off-by: Seth Forshee 
> > 
> > Thought I'd check and see if this patch was forgotten, I sent it nearly
> > a month ago but the bug is still present.
> 
> Yeah, it seems Ted forgot to pull the fix. Ted? BTW, feel free to add
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
> 
> We don't care about UB, we care about gcc, and to a lesser extent
> LLVM and ICC.  If bitops.h doesn't do the right thing, we need to
> fix bitops.h.

I'm going to suggest that we treat the ro[rl]{32,64}() question as
separable from the /dev/random case.  I've sanity checked gcc 5.3.1
and it does the right thing given the small number of constant
arguments given to rotl32() in lib/chacha20.c, and it doesn't hit the
UB case which Jeffrey was concerned about.  This is also code that was
previously in crypto/chacha20_generic.c, and so if there is a bug with
some obscure compiler not properly compiling it down to a rotate
instruction, (a) no one is paying me to make sure the kernel code
compiles well on ICC, and (b) it's not scalable to have each kernel
developer try to deal with the vagrancies of compilers.

So from my perspective, the only interesting results for me is (a)
using what had been used before with crypto/chacha20_generic.c, or (b)
reusing what is in bitops.h and letting it be someone else's problem
if some obscure compiler isn't happy with what is in bitops.h

If we are all agreed that what is in bitops.h is considered valid,
then we can start converting people over to using the version defined
in bitops.h, and if there is some compiler issue we need to work
around, at least we only need to put the workaround in a single header
file.

Cheers,
- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 11:29:57AM -0700, H. Peter Anvin wrote:
> 
> We don't care about UB, we care about gcc, and to a lesser extent
> LLVM and ICC.  If bitops.h doesn't do the right thing, we need to
> fix bitops.h.

I'm going to suggest that we treat the ro[rl]{32,64}() question as
separable from the /dev/random case.  I've sanity checked gcc 5.3.1
and it does the right thing given the small number of constant
arguments given to rotl32() in lib/chacha20.c, and it doesn't hit the
UB case which Jeffrey was concerned about.  This is also code that was
previously in crypto/chacha20_generic.c, and so if there is a bug with
some obscure compiler not properly compiling it down to a rotate
instruction, (a) no one is paying me to make sure the kernel code
compiles well on ICC, and (b) it's not scalable to have each kernel
developer try to deal with the vagrancies of compilers.

So from my perspective, the only interesting results for me is (a)
using what had been used before with crypto/chacha20_generic.c, or (b)
reusing what is in bitops.h and letting it be someone else's problem
if some obscure compiler isn't happy with what is in bitops.h

If we are all agreed that what is in bitops.h is considered valid,
then we can start converting people over to using the version defined
in bitops.h, and if there is some compiler issue we need to work
around, at least we only need to put the workaround in a single header
file.

Cheers,
- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
> > +static inline u32 rotl32(u32 v, u8 n)
> > +{
> > +   return (v << n) | (v >> (sizeof(v) * 8 - n));
> > +}
> 
> That's undefined behavior when n=0.

Sure, but it's never called with n = 0; I've double checked and the
compiler seems to do the right thing with the above pattern as well.

Hmm, it looks like there is a "standard" version rotate left and right
defined in include/linux/bitops.h.  So I suspect it would make sense
to use rol32 as defined in bitops.h --- and this is probably something
that we should do for the rest of crypto/*.c, where people seem to be
defininig their own version of something like rotl32 (I copied the
contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
of defining one's own version of rol32 isn't new).

> I think the portable way to do a rotate that avoids UB is the
> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
> instruction.
> 
> static const unsigned int MASK=31;
> return (v<>(-n));
> 
> You should also avoid the following because its not constant time due
> to the branch:
> 
> return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
> 

Where is this coming from?  I don't see this construct in the patch.

- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Wed, May 04, 2016 at 10:40:20AM -0400, Jeffrey Walton wrote:
> > +static inline u32 rotl32(u32 v, u8 n)
> > +{
> > +   return (v << n) | (v >> (sizeof(v) * 8 - n));
> > +}
> 
> That's undefined behavior when n=0.

Sure, but it's never called with n = 0; I've double checked and the
compiler seems to do the right thing with the above pattern as well.

Hmm, it looks like there is a "standard" version rotate left and right
defined in include/linux/bitops.h.  So I suspect it would make sense
to use rol32 as defined in bitops.h --- and this is probably something
that we should do for the rest of crypto/*.c, where people seem to be
defininig their own version of something like rotl32 (I copied the
contents of crypto/chacha20_generic.c to lib/chacha20, so this pattern
of defining one's own version of rol32 isn't new).

> I think the portable way to do a rotate that avoids UB is the
> following. GCC, Clang and ICC recognize the pattern, and emit a rotate
> instruction.
> 
> static const unsigned int MASK=31;
> return (v<>(-n));
> 
> You should also avoid the following because its not constant time due
> to the branch:
> 
> return n == 0 ? v : (v << n) | (v >> (sizeof(v) * 8 - n));
> 

Where is this coming from?  I don't see this construct in the patch.

- Ted


Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
> > +/*
> > + * crng_init =  0 --> Uninitialized
> > + * 2 --> Initialized
> > + * 3 --> Initialized from input_pool
> > + */
> > +static int crng_init = 0;
> 
> shouldn't that be an atomic_t ?

The crng_init variable only gets incremented (it progresses from
0->1->2->3) and it's protected by the primary_crng->lock spinlock.
There are a few places where we read it without first taking the lock,
but that's where we depend on it getting incremented and where if we
race with another CPU which has just bumped the crng_init status, it's
not critical.  (Or we can take the lock and then recheck the crng_init
if we really need to be sure.)

> > +static void _initialize_crng(struct crng_state *crng)
> > +{
> > +   int i;
> > +   unsigned long   rv;
> 
> Why do you use unsigned long here? I thought the state[i] is unsigned int.

Because it gets filled in via arch_get_random_seed_long() and
arch_get_random_log().  If that means we get 64 bits and we only
use 32 bits, that's OK

> Would it make sense to add the ChaCha20 self test vectors from RFC7539 here 
> to 
> test that the ChaCha20 works?

We're not doing that for any of the other ciphers and hashes.  We
didn't do that for SHA1, and the chacha20 code where I took this from
didn't check for this as well.  What threat are you most worried
about.  We don't care about chacha20 being exactly chacha20, so long
as it's cryptographically strong.  In fact I think I removed a
potential host order byteswap in the set key operation specifically
because we don't care and interop.

If this is required for FIPS mode, we can add that later.  I was
primarily trying to keep the patch small to be easier for people to
look at it, so I've deliberately left off bells and whistles that
aren't strictly needed to show that the new approach is sound.

> > +   if (crng_init++ >= 2)
> > +   wake_up_interruptible(_init_wait);
> 
> Don't we have a race here with the crng_init < 3 check in crng_reseed 
> considering multi-core systems?

No, because we are holding the primary_crng->lock spinlock.  I'll add
a comment explaining the locking protections which is intended for
crng_init where we declare it.


> > +   if (num < 16 || num > 32) {
> > +   WARN_ON(1);
> > +   pr_err("crng_reseed: num is %d?!?\n", num);
> > +   }
> > +   num_words = (num + 3) / 4;
> > +   for (i = 0; i < num_words; i++)
> > +   primary_crng.state[i+4] ^= tmp[i];
> > +   primary_crng.init_time = jiffies;
> > +   if (crng_init < 3) {
> 
> Shouldn't that one be if (crng_init < 3 && num >= 16) ?

I'll just change the above WRN_ON test to be:

 BUG_ON(num < 16 || num > 32);

This really can't happen, and I had it as a WARN_ON with a printk for
debugging purpose in case I was wrong about how the code works.

> > +   crng_init = 3;
> > +   process_random_ready_list();
> > +   wake_up_interruptible(_init_wait);
> > +   pr_notice("random: crng_init 3\n");
> 
> Would it make sense to be more descriptive here to allow readers of dmesg to 
> understand the output?

Yes, what we're currently printing during the initialization:

random: crng_init 1
random: crng_init 2
random: crng_init 3

was again mostly for debugging purposes.  What I'm thinking about
doing is changing crng_init 2 and 3 messages to be:

random: crng fast init done
random: crng conservative init done

> > +   }
> > +   ret = 1;
> > +out:
> > +   spin_unlock_irqrestore(_crng.lock, flags);
> 
> memzero_explicit of tmp?

Good point, I've added a call to memzero_explicit().

> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> > +{
> > +   unsigned long v, flags;
> > +   struct crng_state *crng = _crng;
> > +
> > +   if (crng_init > 2 &&
> > +   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> > +   crng_reseed(_pool);
> > +   spin_lock_irqsave(>lock, flags);
> > +   if (arch_get_random_long())
> > +   crng->state[14] ^= v;
> > +   chacha20_block(>state[0], out);
> > +   if (crng->state[12] == 0)
> > +   crng->state[13]++;

> What is the purpose to only cover the 2nd 32 bit value of the nonce with 
> arch_get_random?
> 
> state[12]++? Or why do you increment the nonce?

In Chacha20, its output is a funcrtion of the state array, where
state[0..3] is a constant specified by the Chacha20 definition,
state[4..11] is the Key, and state[12..15] is the IV.  The
chacha20_block() function increments state[12] each time it is called,
so state[12] is being used as the block counter.  The increment of
state[13] is used to make state[13] to be the upper 32-bits of the
block counter.  We *should* be reseeding more often than 2**32 calls
to chacha20_block(), and the increment is just to be safe in case
something goes wronng and we're not reseeding.

We're using crng[14] to be contain the output of RDRAND, so this is
where we mix in the contribution from a CPU-level hwrng. 

Re: [PATCH 1/3] random: replace non-blocking pool with a Chacha20-based CRNG

2016-05-04 Thread tytso
On Tue, May 03, 2016 at 10:50:25AM +0200, Stephan Mueller wrote:
> > +/*
> > + * crng_init =  0 --> Uninitialized
> > + * 2 --> Initialized
> > + * 3 --> Initialized from input_pool
> > + */
> > +static int crng_init = 0;
> 
> shouldn't that be an atomic_t ?

The crng_init variable only gets incremented (it progresses from
0->1->2->3) and it's protected by the primary_crng->lock spinlock.
There are a few places where we read it without first taking the lock,
but that's where we depend on it getting incremented and where if we
race with another CPU which has just bumped the crng_init status, it's
not critical.  (Or we can take the lock and then recheck the crng_init
if we really need to be sure.)

> > +static void _initialize_crng(struct crng_state *crng)
> > +{
> > +   int i;
> > +   unsigned long   rv;
> 
> Why do you use unsigned long here? I thought the state[i] is unsigned int.

Because it gets filled in via arch_get_random_seed_long() and
arch_get_random_log().  If that means we get 64 bits and we only
use 32 bits, that's OK

> Would it make sense to add the ChaCha20 self test vectors from RFC7539 here 
> to 
> test that the ChaCha20 works?

We're not doing that for any of the other ciphers and hashes.  We
didn't do that for SHA1, and the chacha20 code where I took this from
didn't check for this as well.  What threat are you most worried
about.  We don't care about chacha20 being exactly chacha20, so long
as it's cryptographically strong.  In fact I think I removed a
potential host order byteswap in the set key operation specifically
because we don't care and interop.

If this is required for FIPS mode, we can add that later.  I was
primarily trying to keep the patch small to be easier for people to
look at it, so I've deliberately left off bells and whistles that
aren't strictly needed to show that the new approach is sound.

> > +   if (crng_init++ >= 2)
> > +   wake_up_interruptible(_init_wait);
> 
> Don't we have a race here with the crng_init < 3 check in crng_reseed 
> considering multi-core systems?

No, because we are holding the primary_crng->lock spinlock.  I'll add
a comment explaining the locking protections which is intended for
crng_init where we declare it.


> > +   if (num < 16 || num > 32) {
> > +   WARN_ON(1);
> > +   pr_err("crng_reseed: num is %d?!?\n", num);
> > +   }
> > +   num_words = (num + 3) / 4;
> > +   for (i = 0; i < num_words; i++)
> > +   primary_crng.state[i+4] ^= tmp[i];
> > +   primary_crng.init_time = jiffies;
> > +   if (crng_init < 3) {
> 
> Shouldn't that one be if (crng_init < 3 && num >= 16) ?

I'll just change the above WRN_ON test to be:

 BUG_ON(num < 16 || num > 32);

This really can't happen, and I had it as a WARN_ON with a printk for
debugging purpose in case I was wrong about how the code works.

> > +   crng_init = 3;
> > +   process_random_ready_list();
> > +   wake_up_interruptible(_init_wait);
> > +   pr_notice("random: crng_init 3\n");
> 
> Would it make sense to be more descriptive here to allow readers of dmesg to 
> understand the output?

Yes, what we're currently printing during the initialization:

random: crng_init 1
random: crng_init 2
random: crng_init 3

was again mostly for debugging purposes.  What I'm thinking about
doing is changing crng_init 2 and 3 messages to be:

random: crng fast init done
random: crng conservative init done

> > +   }
> > +   ret = 1;
> > +out:
> > +   spin_unlock_irqrestore(_crng.lock, flags);
> 
> memzero_explicit of tmp?

Good point, I've added a call to memzero_explicit().

> > +static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
> > +{
> > +   unsigned long v, flags;
> > +   struct crng_state *crng = _crng;
> > +
> > +   if (crng_init > 2 &&
> > +   time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
> > +   crng_reseed(_pool);
> > +   spin_lock_irqsave(>lock, flags);
> > +   if (arch_get_random_long())
> > +   crng->state[14] ^= v;
> > +   chacha20_block(>state[0], out);
> > +   if (crng->state[12] == 0)
> > +   crng->state[13]++;

> What is the purpose to only cover the 2nd 32 bit value of the nonce with 
> arch_get_random?
> 
> state[12]++? Or why do you increment the nonce?

In Chacha20, its output is a funcrtion of the state array, where
state[0..3] is a constant specified by the Chacha20 definition,
state[4..11] is the Key, and state[12..15] is the IV.  The
chacha20_block() function increments state[12] each time it is called,
so state[12] is being used as the block counter.  The increment of
state[13] is used to make state[13] to be the upper 32-bits of the
block counter.  We *should* be reseeding more often than 2**32 calls
to chacha20_block(), and the increment is just to be safe in case
something goes wronng and we're not reseeding.

We're using crng[14] to be contain the output of RDRAND, so this is
where we mix in the contribution from a CPU-level hwrng. 

Re: [RFC][PATCH 0/6] /dev/random - a new approach

2016-05-03 Thread tytso
On Tue, May 03, 2016 at 03:57:15PM +0200, Nikos Mavrogiannopoulos wrote:
> 
> I believe their main concern is that they want to protect applications
> which do not check error codes of system calls, when running on a
> kernel which does not provide getrandom().  That way, they have an
> almost impossible task to simulate getrandom() on kernel which do not
> support it.

The whole *point* of creating the getrandom(2) system call is that it
can't be simulated/emulated in userspace.  If it can be, then there's
no reason why the system call should exist.  This is one of the
reasons why haven't implemented mysql or TLS inside the kernel.   :-)

So if their standard is "we need to simulate getrandom(2) on a kernel
which does not have it", we'll **never** see glibc support for it.  By
definition, this is *impossible*.

What they can do is do something which is as good as you can get for
someone who is open-coding /dev/urandom support in userspace.  That
means that you won't be able to (a) tell if the urandom pool is has
been adequately initialized right after boot, (b) you will need to
somehow deal with the case where the file descriptors have been
exhausted, (c) or if you are running in a chroot where the system
administrator didn't bother to include /dev/urandom.  About the best
you can do is call abort(0), or if you want, you can let the
application author specify some kind of "I want to run in insecure
mode", via some magic glibc setting.  You could probably default this
to "true" without a huge net reduction of security, because most
application authors weren't getting this right anyway.  And then ones
who do care, can set some kind of flag saying, "I promise to check the
error return from getrandom(2) as implemented by glibc".

 - Ted


Re: [RFC][PATCH 0/6] /dev/random - a new approach

2016-05-03 Thread tytso
On Tue, May 03, 2016 at 03:57:15PM +0200, Nikos Mavrogiannopoulos wrote:
> 
> I believe their main concern is that they want to protect applications
> which do not check error codes of system calls, when running on a
> kernel which does not provide getrandom().  That way, they have an
> almost impossible task to simulate getrandom() on kernel which do not
> support it.

The whole *point* of creating the getrandom(2) system call is that it
can't be simulated/emulated in userspace.  If it can be, then there's
no reason why the system call should exist.  This is one of the
reasons why haven't implemented mysql or TLS inside the kernel.   :-)

So if their standard is "we need to simulate getrandom(2) on a kernel
which does not have it", we'll **never** see glibc support for it.  By
definition, this is *impossible*.

What they can do is do something which is as good as you can get for
someone who is open-coding /dev/urandom support in userspace.  That
means that you won't be able to (a) tell if the urandom pool is has
been adequately initialized right after boot, (b) you will need to
somehow deal with the case where the file descriptors have been
exhausted, (c) or if you are running in a chroot where the system
administrator didn't bother to include /dev/urandom.  About the best
you can do is call abort(0), or if you want, you can let the
application author specify some kind of "I want to run in insecure
mode", via some magic glibc setting.  You could probably default this
to "true" without a huge net reduction of security, because most
application authors weren't getting this right anyway.  And then ones
who do care, can set some kind of flag saying, "I promise to check the
error return from getrandom(2) as implemented by glibc".

 - Ted


Re: Why is SECTOR_SIZE = 512 inside kernel ?

2015-08-18 Thread tytso
On Tue, Aug 18, 2015 at 11:06:47PM +0200, Brice Goglin wrote:
> Le 17/08/2015 15:54, Theodore Ts'o a écrit :
> > 
> > It's cast in stone.  There are too many places all over the kernel,
> > especially in a huge number of file systems, which assume that the
> > sector size is 512 bytes.  So above the block layer, the sector size
> > is always going to be 512.
> 
> Could this be a problem when using pmem/nvdimm devices with
> byte-granularity (no BTT layer)? (hw_sector_size reports
> 512 in this case while we could expect 1 instead).
> Or it just doesn't matter because BTT is the only way to use
> these devices for filesystems like other block devices?

Right now there are very few applications that understand how to use
pmem/nvdimm devices as memory.  And even where they do, they will need
some kind of file system to provide resource isolation in case more
than one application or more than one user wants to use the
pmem/nvdimm.  In that case, they will probably mmap a file and then
access the nvdimm directly.  In that case, the applications won't be
using the block device layer at all, so they won't care about the
advertised hw_sector_Size.

The challenge with pmem-aware applications is that they need to be
able to correctly update their in-memory data structures in such a way
that they can correctly recover after an arbitrary power failure.
That means they have to use atomic updates and/or copy-on-write update
schemes, and I suspect most application writes just aren't going to be
able to get this right.

So for many legacy applications, they will still read in the file
"foo", make changes in local memory, and then write the new contents
to the file "foo.new", and then rename "foo.new" on top of "foo".
These applications will effectively use nvdimm as super fast flash,
and so they will use file systems as file systems.  And since file
systems today all use block sizes which are multiples of the
traditional 512 byte sector size, again, changing something as
fundamental as the kernel's internal sector size doesn't have any real
value, at least not as far as pmem/nvdimm support is concerned.

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Why is SECTOR_SIZE = 512 inside kernel ?

2015-08-18 Thread tytso
On Tue, Aug 18, 2015 at 11:06:47PM +0200, Brice Goglin wrote:
 Le 17/08/2015 15:54, Theodore Ts'o a écrit :
  
  It's cast in stone.  There are too many places all over the kernel,
  especially in a huge number of file systems, which assume that the
  sector size is 512 bytes.  So above the block layer, the sector size
  is always going to be 512.
 
 Could this be a problem when using pmem/nvdimm devices with
 byte-granularity (no BTT layer)? (hw_sector_size reports
 512 in this case while we could expect 1 instead).
 Or it just doesn't matter because BTT is the only way to use
 these devices for filesystems like other block devices?

Right now there are very few applications that understand how to use
pmem/nvdimm devices as memory.  And even where they do, they will need
some kind of file system to provide resource isolation in case more
than one application or more than one user wants to use the
pmem/nvdimm.  In that case, they will probably mmap a file and then
access the nvdimm directly.  In that case, the applications won't be
using the block device layer at all, so they won't care about the
advertised hw_sector_Size.

The challenge with pmem-aware applications is that they need to be
able to correctly update their in-memory data structures in such a way
that they can correctly recover after an arbitrary power failure.
That means they have to use atomic updates and/or copy-on-write update
schemes, and I suspect most application writes just aren't going to be
able to get this right.

So for many legacy applications, they will still read in the file
foo, make changes in local memory, and then write the new contents
to the file foo.new, and then rename foo.new on top of foo.
These applications will effectively use nvdimm as super fast flash,
and so they will use file systems as file systems.  And since file
systems today all use block sizes which are multiples of the
traditional 512 byte sector size, again, changing something as
fundamental as the kernel's internal sector size doesn't have any real
value, at least not as far as pmem/nvdimm support is concerned.

   - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Remove TO DO in jfs_xtree.c

2014-12-29 Thread tytso
On Mon, Dec 29, 2014 at 04:13:37PM -0600, Dave Kleikamp wrote:
> On 12/27/2014 06:58 PM, nick wrote:
> > Greetings Dave,
> > I am wondering why there is a TO DO above this code:
> > * ToDo:  tlocks should be on doubly-linked list, so we can
> > * quickly remove it and add it to the end.
> 
> I'm sure the idea was to avoid the for loop needed to find the previous
> entry in the linked list. A doubly-linked list makes it much simpler to
> remove an item from an arbitrary position in the list.

Hi Dave,

Just in case you weren't aware, Nick has been banned from the LKML
list for being a troll.

A common troll trick is to send e-mail to a number of individuals with
a mailing list (in this case, LKML) cc'ed, in the hopes that people
will reply, quoting the troll's words, so they can get around the
mailing list ban.

The reason why he has been banned is because he has apparently been
desperate to get _some_ patches into the Linux kernel.  Enough so that
he has proposed patches which do not compile, and/or were not tested
and/or for which he had no hardware (so he couldn't possibly have
tested it).

As a maintainer, you should be aware of his past history, and chose
for yourself whether, in the future, you feel you should respond to
any e-mail that he might send you.

Regards,

- Ted

P.S.  Personally, the best reason for banning him isn't that he has
been wasting maintainers' time, but that he was responding to users
who were reporting bugs with nonsensical responses that were actively
harmful to users who were looking for help.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Remove TO DO in jfs_xtree.c

2014-12-29 Thread tytso
On Mon, Dec 29, 2014 at 04:13:37PM -0600, Dave Kleikamp wrote:
 On 12/27/2014 06:58 PM, nick wrote:
  Greetings Dave,
  I am wondering why there is a TO DO above this code:
  * ToDo:  tlocks should be on doubly-linked list, so we can
  * quickly remove it and add it to the end.
 
 I'm sure the idea was to avoid the for loop needed to find the previous
 entry in the linked list. A doubly-linked list makes it much simpler to
 remove an item from an arbitrary position in the list.

Hi Dave,

Just in case you weren't aware, Nick has been banned from the LKML
list for being a troll.

A common troll trick is to send e-mail to a number of individuals with
a mailing list (in this case, LKML) cc'ed, in the hopes that people
will reply, quoting the troll's words, so they can get around the
mailing list ban.

The reason why he has been banned is because he has apparently been
desperate to get _some_ patches into the Linux kernel.  Enough so that
he has proposed patches which do not compile, and/or were not tested
and/or for which he had no hardware (so he couldn't possibly have
tested it).

As a maintainer, you should be aware of his past history, and chose
for yourself whether, in the future, you feel you should respond to
any e-mail that he might send you.

Regards,

- Ted

P.S.  Personally, the best reason for banning him isn't that he has
been wasting maintainers' time, but that he was responding to users
who were reporting bugs with nonsensical responses that were actively
harmful to users who were looking for help.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Reminder for kernel summit nominations

2014-05-09 Thread tytso
On Fri, May 09, 2014 at 11:23:19AM -0400, Theodore Ts'o wrote:
> ... please send nominations to:
> 
>   kernel-sum...@lists.linuxfoundation.org

Argh, sorry, I screwed up the e-mail address.  The correct e-mail
address is.

ksummit-disc...@lists.linuxfoundation.org

My apologies

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Reminder for kernel summit nominations

2014-05-09 Thread tytso
On Fri, May 09, 2014 at 11:23:19AM -0400, Theodore Ts'o wrote:
 ... please send nominations to:
 
   kernel-sum...@lists.linuxfoundation.org

Argh, sorry, I screwed up the e-mail address.  The correct e-mail
address is.

ksummit-disc...@lists.linuxfoundation.org

My apologies

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 02:06:13PM -0700, Andy Lutomirski wrote:
> 
> I still don't see the point.  What does this do better than virtio-rng?

I believe you had been complaining about how complicated it was to set
up virtio?  And this complexity is also an issue if we want to use it
to initialize the RNG used for the kernel text ASLR --- which has to
be done very early in the boot process, and where making something as
simple as possible is a Good Thing.

And since we would want to use RDRAND/RDSEED if it is available
*anyway*, perhaps in combination with other things, why not use the
RDRAND/RDSEED interface?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 01:32:55PM -0700, Andy Lutomirski wrote:
> On Thu, May 1, 2014 at 1:30 PM, H. Peter Anvin  wrote:
> > RDSEED is not synchronous.  It is, however, nonblocking.
> 
> What I mean is: IIUC it's reasonable to call RDSEED a few times in a
> loop and hope it works.  It makes no sense to do that with
> /dev/random.

RDSEED is allowed to return an error if there is insufficient entropy.
So long as the caller understands that this is an emulated
instruction, I don't see a problem.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 12:02:49PM -0700, Andy Lutomirski wrote:
> 
> Is RDSEED really reasonable here?  Won't it slow down by several
> orders of magnitude?

That is I think the biggest problem; RDRAND and RDSEED are fast if
they are native, but they will involve a VM exit if they need to be
emulated.  So when an OS might want to use RDRAND and RDSEED might be
quite different if we know they are being emulated.

Using the RDRAND and RDSEED "api" certainly makes sense, at least for
x86, but I suspect we might want to use a different way of signalling
that a VM guest can use RDRAND and RDSEED if they are running on a CPU
which doesn't provide that kind of access.  Maybe a CPUID extended
function parameter, if one could be allocated for use by a Linux
hypervisor?

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] random: Add "initialized" variable to proc

2014-05-01 Thread tytso
On Wed, Apr 30, 2014 at 09:05:00PM -0700, H. Peter Anvin wrote:
> 
> Giving the guest a seed would be highly useful, though.  There are a
> number of ways to do that; changing the boot protocol is probably
> only useful if Qemu itself bouts the kernel as opposed to an in-VM
> bootloader.

So how about simply passing a memory address and an optional offset on
the boot command line?  That way the hypervisor can drop the seed in
some convenient real memory location, and the kernel can just copy it
someplace safe, or in the case of kernel ASLR, the relocator can use
it to seed its CRNG, and then after it relocates the kernel, it can
crank the CRNG to pass a seed to the kernel's urandom driver.

That way, we don't have to do something which is ACPI or DT dependent.
Maybe there will be embedded architectures where using DT might be
more convenient, but this would probably be simplest for KVM/qumu-based
VM's, I would think.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] random: Add initialized variable to proc

2014-05-01 Thread tytso
On Wed, Apr 30, 2014 at 09:05:00PM -0700, H. Peter Anvin wrote:
 
 Giving the guest a seed would be highly useful, though.  There are a
 number of ways to do that; changing the boot protocol is probably
 only useful if Qemu itself bouts the kernel as opposed to an in-VM
 bootloader.

So how about simply passing a memory address and an optional offset on
the boot command line?  That way the hypervisor can drop the seed in
some convenient real memory location, and the kernel can just copy it
someplace safe, or in the case of kernel ASLR, the relocator can use
it to seed its CRNG, and then after it relocates the kernel, it can
crank the CRNG to pass a seed to the kernel's urandom driver.

That way, we don't have to do something which is ACPI or DT dependent.
Maybe there will be embedded architectures where using DT might be
more convenient, but this would probably be simplest for KVM/qumu-based
VM's, I would think.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 12:02:49PM -0700, Andy Lutomirski wrote:
 
 Is RDSEED really reasonable here?  Won't it slow down by several
 orders of magnitude?

That is I think the biggest problem; RDRAND and RDSEED are fast if
they are native, but they will involve a VM exit if they need to be
emulated.  So when an OS might want to use RDRAND and RDSEED might be
quite different if we know they are being emulated.

Using the RDRAND and RDSEED api certainly makes sense, at least for
x86, but I suspect we might want to use a different way of signalling
that a VM guest can use RDRAND and RDSEED if they are running on a CPU
which doesn't provide that kind of access.  Maybe a CPUID extended
function parameter, if one could be allocated for use by a Linux
hypervisor?

- Ted

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 01:32:55PM -0700, Andy Lutomirski wrote:
 On Thu, May 1, 2014 at 1:30 PM, H. Peter Anvin h...@zytor.com wrote:
  RDSEED is not synchronous.  It is, however, nonblocking.
 
 What I mean is: IIUC it's reasonable to call RDSEED a few times in a
 loop and hope it works.  It makes no sense to do that with
 /dev/random.

RDSEED is allowed to return an error if there is insufficient entropy.
So long as the caller understands that this is an emulated
instruction, I don't see a problem.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: random: Providing a seed value to VM guests

2014-05-01 Thread tytso
On Thu, May 01, 2014 at 02:06:13PM -0700, Andy Lutomirski wrote:
 
 I still don't see the point.  What does this do better than virtio-rng?

I believe you had been complaining about how complicated it was to set
up virtio?  And this complexity is also an issue if we want to use it
to initialize the RNG used for the kernel text ASLR --- which has to
be done very early in the boot process, and where making something as
simple as possible is a Good Thing.

And since we would want to use RDRAND/RDSEED if it is available
*anyway*, perhaps in combination with other things, why not use the
RDRAND/RDSEED interface?

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-26 Thread tytso
On Tue, Mar 25, 2014 at 11:32:09PM -0700, Andrew Morton wrote:
> Well, there are always alternatives.  For example ext3 could
> preallocate a single transaction_t and a single IO page and fall back
> to synchronous page-at-a-time journal writes.  But I can totally see
> that such things are unattractive: heaps of new code which is never
> tested in real life.  The page allocator works so damn well that it
> doesn't make sense to implement it.

I'm not sure that there are _always_ solutions.  For example, ext3
still needs to use buffer cache to do the I/O, and that may require
allocations as well.  Fortunately, this was patched around other ways
to avoid livelock issues in the page cleaner in 2013: 84235de394d977

But that's another new user of GFP_NOFAIL (and one added three years
after David tried to declare There Shalt Be No New Users of
GFP_NOFAIL), and sure, we could probably patch around that by having
places where there's no other alternaive to keep a preallocated batch
of pages and/or allocated structures at each code site.  But that's as
bad as looping at each code site; in fact, wouldn't it be better if
the allocator wants to avoid looping, to have a separate batch of
pages which (ala GFP_ATOMIC) which is reserved for just for GFP_NOFAIL
allocations when all else fails?

(BTW, we have a similar issue with bio's in the block layer, but
fortunately, those structures are relatively small, and since they are
in their own slab cache, and are constantly being used and then
recycled, it's in practice rare that allocations will fail when we are
in a desparate low memory situation.  But technically there will be
places where we should pass a gfp_t down to the block layers, and use
GFP_NOFAIL if we really want to make sure that memory allocations
needed to try to clean pages and/or avoid user data corruption are
needed.)

> Please use NOFAIL ;) The core page allocator will always be able to
> implement this better than callers.

I'll convert jbd2 to use NOFAIL.  :-)

Cheers,

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-26 Thread tytso
On Tue, Mar 25, 2014 at 06:06:17PM -0700, David Rientjes wrote:
> 
> The point is not to add new callers and new code should handle NULL 
> correctly, not that we should run around changing current users to just do 
> infinite retries.  Checkpatch should have nothing to do with that.

My problem with this doctrinaire "there should never be any new users"
is that sometiems there *are* worse things than infinite retries.  If
the alternative is bringing the entire system down, or livelocking the
entire system, or corrupting user data, __GFP_NOFAIL *is* the more
appropriate option.

If you try to tell those of us outside of the mm layer, "thou shalt
never use __GFP_NOFAIL in new code", and we have some new code where
the alternative is worse, we can either open-code the loop, or have
some mm hackers and/or checkpatch whine at us.

Andrew has declared that he'd prefer that we not open code the retry
loop; if you want to disagree with Andrew, feel free to pursuade him
otherwise.  If you want to tell me that I should accept user data
corruption, I'm going to ignore you (and/or checkpatch).

Regards,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-26 Thread tytso
On Tue, Mar 25, 2014 at 06:06:17PM -0700, David Rientjes wrote:
 
 The point is not to add new callers and new code should handle NULL 
 correctly, not that we should run around changing current users to just do 
 infinite retries.  Checkpatch should have nothing to do with that.

My problem with this doctrinaire there should never be any new users
is that sometiems there *are* worse things than infinite retries.  If
the alternative is bringing the entire system down, or livelocking the
entire system, or corrupting user data, __GFP_NOFAIL *is* the more
appropriate option.

If you try to tell those of us outside of the mm layer, thou shalt
never use __GFP_NOFAIL in new code, and we have some new code where
the alternative is worse, we can either open-code the loop, or have
some mm hackers and/or checkpatch whine at us.

Andrew has declared that he'd prefer that we not open code the retry
loop; if you want to disagree with Andrew, feel free to pursuade him
otherwise.  If you want to tell me that I should accept user data
corruption, I'm going to ignore you (and/or checkpatch).

Regards,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-26 Thread tytso
On Tue, Mar 25, 2014 at 11:32:09PM -0700, Andrew Morton wrote:
 Well, there are always alternatives.  For example ext3 could
 preallocate a single transaction_t and a single IO page and fall back
 to synchronous page-at-a-time journal writes.  But I can totally see
 that such things are unattractive: heaps of new code which is never
 tested in real life.  The page allocator works so damn well that it
 doesn't make sense to implement it.

I'm not sure that there are _always_ solutions.  For example, ext3
still needs to use buffer cache to do the I/O, and that may require
allocations as well.  Fortunately, this was patched around other ways
to avoid livelock issues in the page cleaner in 2013: 84235de394d977

But that's another new user of GFP_NOFAIL (and one added three years
after David tried to declare There Shalt Be No New Users of
GFP_NOFAIL), and sure, we could probably patch around that by having
places where there's no other alternaive to keep a preallocated batch
of pages and/or allocated structures at each code site.  But that's as
bad as looping at each code site; in fact, wouldn't it be better if
the allocator wants to avoid looping, to have a separate batch of
pages which (ala GFP_ATOMIC) which is reserved for just for GFP_NOFAIL
allocations when all else fails?

(BTW, we have a similar issue with bio's in the block layer, but
fortunately, those structures are relatively small, and since they are
in their own slab cache, and are constantly being used and then
recycled, it's in practice rare that allocations will fail when we are
in a desparate low memory situation.  But technically there will be
places where we should pass a gfp_t down to the block layers, and use
GFP_NOFAIL if we really want to make sure that memory allocations
needed to try to clean pages and/or avoid user data corruption are
needed.)

 Please use NOFAIL ;) The core page allocator will always be able to
 implement this better than callers.

I'll convert jbd2 to use NOFAIL.  :-)

Cheers,

- Ted

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] ext4: acl: remove unneeded include of linux/capability.h

2014-03-24 Thread tytso
On Mon, Mar 24, 2014 at 08:22:12AM +0100, Jakub Sitnicki wrote:
> acl.c has not been (directly) using the interface defined by
> linux/capability.h header since commit 3bd858ab1c451725c07a
> ("Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid
> check"). Remove it.
> 
> Signed-off-by: Jakub Sitnicki 

Thanks, applied.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 21/22] ext4: Fix typos

2014-03-24 Thread tytso
On Sun, Mar 23, 2014 at 03:08:47PM -0400, Matthew Wilcox wrote:
> Comment fix only
> 
> Signed-off-by: Matthew Wilcox 

Thanks, applied to the ext4 git tree.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static

2014-03-24 Thread tytso
On Sun, Mar 23, 2014 at 03:08:45PM -0400, Matthew Wilcox wrote:
> It's only called within inode.c, so make it static, remove its prototype
> from ext4.h and move it above all of its callers so it doesn't need a
> prototype within inode.c.
> 
> Signed-off-by: Matthew Wilcox 

Thanks, applied to the ext4 tree.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 19/22] ext4: Make ext4_block_zero_page_range static

2014-03-24 Thread tytso
On Sun, Mar 23, 2014 at 03:08:45PM -0400, Matthew Wilcox wrote:
 It's only called within inode.c, so make it static, remove its prototype
 from ext4.h and move it above all of its callers so it doesn't need a
 prototype within inode.c.
 
 Signed-off-by: Matthew Wilcox matthew.r.wil...@intel.com

Thanks, applied to the ext4 tree.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v7 21/22] ext4: Fix typos

2014-03-24 Thread tytso
On Sun, Mar 23, 2014 at 03:08:47PM -0400, Matthew Wilcox wrote:
 Comment fix only
 
 Signed-off-by: Matthew Wilcox matthew.r.wil...@intel.com

Thanks, applied to the ext4 git tree.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] ext4: acl: remove unneeded include of linux/capability.h

2014-03-24 Thread tytso
On Mon, Mar 24, 2014 at 08:22:12AM +0100, Jakub Sitnicki wrote:
 acl.c has not been (directly) using the interface defined by
 linux/capability.h header since commit 3bd858ab1c451725c07a
 (Introduce is_owner_or_cap() to wrap CAP_FOWNER use with fsuid
 check). Remove it.
 
 Signed-off-by: Jakub Sitnicki jsitni...@gmail.com

Thanks, applied.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 10:55:24AM -0700, Andrew Morton wrote:
> 
> From: Andrew Morton 
> Subject: scripts/checkpatch.pl: __GFP_NOFAIL isn't going away
> 
> Revert 7e4915e78992eb ("checkpatch: add warning of future __GFP_NOFAIL use").
> 
> There are no plans to remove __GFP_NOFAIL.
> 
> __GFP_NOFAIL exists to
> 
> a) centralise the retry-allocation-for-ever operation into the core
>allocator, which is the appropriate implementation site and
> 
> b) permit us to identify code sites which aren't handling memory
>exhaustion appropriately.
> 
> Cc: David Rientjes 
> Cc: Joe Perches 
> Cc: Theodore Ts'o 
> Signed-off-by: Andrew Morton 

How about also making the following change which inspired the
checkpatch warning?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0437439..d189872 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -58,9 +58,11 @@ struct vm_area_struct;
  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
  * _might_ fail.  This depends upon the particular VM implementation.
  *
- * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.  This modifier is deprecated and no new
- * users should be added.
+ * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the
+ * caller cannot handle allocation failures.  Callers are strongly
+ * encouraged not to use __GFP_NOFAIL unless the alternative is worse
+ * than OOM killing some random process (i.e., corruption or loss of
+ * some innocent user's data, etc).
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 01:26:06PM -0400, ty...@mit.edu wrote:
> > Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
> > good.  Adding new retry-for-ever code is not good.

Oh, and BTW --- now that checkpatch.pl now flags an warning whenever
GFP_NOFAIL is used, because it is deprecated, patches that convert a
loop to use GFP_NOFAIL will get flagged with checkpatch, and this will
also incentivize people writing new code and who can't find any other
way to deal with an allocation failure to simply open code the loop...

 - Ted




   
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 10:15:12AM -0700, Andrew Morton wrote:
> > I'll note that since 2011, there has been precious little movement on
> > removing the final few callers of GFP_NOFAIL, and we still have a bit
> > under two dozen of them, including a new one in fs/buffer.c that was
> > added in 2013.
> 
> Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
> good.  Adding new retry-for-ever code is not good.

Actually, it wasn't converting an existing loop; it was adding a new
GFP_NOFAIL to fix a reclaim livelock (commit 84235de394d9775bf).

I agree that in ideal world, we'd get rid of all of these.  But
sometimes, the cure can be worse than the disesae, and so the whole
"all callers of GFP_NOFAIL are MUST FIX BUGGGY and the maintainers
should be shamed into fixing it" attitude is one that I find a bit odd
myself.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Fri, Mar 21, 2014 at 01:00:55PM -0700, Andrew Morton wrote:
> 
> The whole point of __GFP_NOFAIL is to centralise this
> wait-for-memory-for-ever operation.  So it is implemented in a common
> (core) place and so that we can easily locate these problematic
> callers.
> 
> is exactly wrong.  Yes, we'd like __GFP_NOFAIL to go away, but it
> cannot go away until buggy callsites such as this one are *fixed*. 
> Removing the __GFP_NOFAIL usage simply hides the buggy code from casual
> searchers.

The change to jbd2 was made in July 2010, back when the "we must
exterminate GFP_NOFAIL at all costs" brigade was in high gear, and the
folks claiming that GFP_FAIL *would* go away, come hell or high water,
was a bit more emphatic.

I'll note that since 2011, there has been precious little movement on
removing the final few callers of GFP_NOFAIL, and we still have a bit
under two dozen of them, including a new one in fs/buffer.c that was
added in 2013.

In any case, __GFP_NOFAIL is in the code comments, so a casual
searcher would find it pretty quickly with a "git grep".

   - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Fri, Mar 21, 2014 at 01:00:55PM -0700, Andrew Morton wrote:
 
 The whole point of __GFP_NOFAIL is to centralise this
 wait-for-memory-for-ever operation.  So it is implemented in a common
 (core) place and so that we can easily locate these problematic
 callers.
 
 is exactly wrong.  Yes, we'd like __GFP_NOFAIL to go away, but it
 cannot go away until buggy callsites such as this one are *fixed*. 
 Removing the __GFP_NOFAIL usage simply hides the buggy code from casual
 searchers.

The change to jbd2 was made in July 2010, back when the we must
exterminate GFP_NOFAIL at all costs brigade was in high gear, and the
folks claiming that GFP_FAIL *would* go away, come hell or high water,
was a bit more emphatic.

I'll note that since 2011, there has been precious little movement on
removing the final few callers of GFP_NOFAIL, and we still have a bit
under two dozen of them, including a new one in fs/buffer.c that was
added in 2013.

In any case, __GFP_NOFAIL is in the code comments, so a casual
searcher would find it pretty quickly with a git grep.

   - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 10:15:12AM -0700, Andrew Morton wrote:
  I'll note that since 2011, there has been precious little movement on
  removing the final few callers of GFP_NOFAIL, and we still have a bit
  under two dozen of them, including a new one in fs/buffer.c that was
  added in 2013.
 
 Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
 good.  Adding new retry-for-ever code is not good.

Actually, it wasn't converting an existing loop; it was adding a new
GFP_NOFAIL to fix a reclaim livelock (commit 84235de394d9775bf).

I agree that in ideal world, we'd get rid of all of these.  But
sometimes, the cure can be worse than the disesae, and so the whole
all callers of GFP_NOFAIL are MUST FIX BUGGGY and the maintainers
should be shamed into fixing it attitude is one that I find a bit odd
myself.

 - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 01:26:06PM -0400, ty...@mit.edu wrote:
  Well.  Converting an existing retry-for-ever caller to GFP_NOFAIL is
  good.  Adding new retry-for-ever code is not good.

Oh, and BTW --- now that checkpatch.pl now flags an warning whenever
GFP_NOFAIL is used, because it is deprecated, patches that convert a
loop to use GFP_NOFAIL will get flagged with checkpatch, and this will
also incentivize people writing new code and who can't find any other
way to deal with an allocation failure to simply open code the loop...

 - Ted




   
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 1/1] fs/reiserfs/journal.c: Remove obsolete __GFP_NOFAIL

2014-03-22 Thread tytso
On Sat, Mar 22, 2014 at 10:55:24AM -0700, Andrew Morton wrote:
 
 From: Andrew Morton a...@linux-foundation.org
 Subject: scripts/checkpatch.pl: __GFP_NOFAIL isn't going away
 
 Revert 7e4915e78992eb (checkpatch: add warning of future __GFP_NOFAIL use).
 
 There are no plans to remove __GFP_NOFAIL.
 
 __GFP_NOFAIL exists to
 
 a) centralise the retry-allocation-for-ever operation into the core
allocator, which is the appropriate implementation site and
 
 b) permit us to identify code sites which aren't handling memory
exhaustion appropriately.
 
 Cc: David Rientjes rient...@google.com
 Cc: Joe Perches j...@perches.com
 Cc: Theodore Ts'o ty...@mit.edu
 Signed-off-by: Andrew Morton a...@linux-foundation.org

How about also making the following change which inspired the
checkpatch warning?

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0437439..d189872 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -58,9 +58,11 @@ struct vm_area_struct;
  * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
  * _might_ fail.  This depends upon the particular VM implementation.
  *
- * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures.  This modifier is deprecated and no new
- * users should be added.
+ * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the
+ * caller cannot handle allocation failures.  Callers are strongly
+ * encouraged not to use __GFP_NOFAIL unless the alternative is worse
+ * than OOM killing some random process (i.e., corruption or loss of
+ * some innocent user's data, etc).
  *
  * __GFP_NORETRY: The VM implementation must not retry indefinitely.
  *

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing & memfd_create()

2014-03-20 Thread tytso
On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
> On Thu, Mar 20, 2014 at 4:32 PM,   wrote:
> > Why not make sealing an attribute of the "struct file", and enforce it
> > at the VFS layer?  That way all file system objects would have access
> > to sealing interface, and for memfd_shmem, you can't get another
> > struct file pointing at the object, the security properties would be
> > identical.
> 
> Sealing as introduced here is an inode-attribute, not "struct file".
> This is intentional. For instance, a gfx-client can get a read-only FD
> via /proc/self/fd/ and pass it to the compositor so it can never
> overwrite the contents (unless the compositor has write-access to the
> inode itself, in which case it can just re-open it read-write).

Hmm, good point.  I had forgotten about the /proc/self/fd hole.
Hmm... what if we have a SEAL_PROC which forces the permissions of
/proc/self/fd to be 000?

So if it is a property of the attribute, SEAL_WRITE and SEAL_GROW is
basically equivalent to using chattr to set the immutable and
append-only attribute, except for the "you can't undo the seal unless
you have exclusive access to the inode" magic.

That does make it pretty memfd_create specific and not a very general
API, which is a little unfortunate; hence why I'm trying to explore
ways of making a bit more generic and hopefully useful for more use
cases.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing & memfd_create()

2014-03-20 Thread tytso
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
> 
> This series introduces the concept of "file sealing". Sealing a file restricts
> the set of allowed operations on the file in question. Multiple seals are
> defined and each seal will cause a different set of operations to return EPERM
> if it is set. The following seals are introduced:
> 
>  * SEAL_SHRINK: If set, the inode size cannot be reduced
>  * SEAL_GROW: If set, the inode size cannot be increased
>  * SEAL_WRITE: If set, the file content cannot be modified

Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.

Why not make sealing an attribute of the "struct file", and enforce it
at the VFS layer?  That way all file system objects would have access
to sealing interface, and for memfd_shmem, you can't get another
struct file pointing at the object, the security properties would be
identical.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Trusted kernel patchset for Secure Boot lockdown

2014-03-20 Thread tytso
On Wed, Mar 19, 2014 at 01:16:15PM -0700, Kees Cook wrote:
> UEFI is a red herring in both cases. This isn't about UEFI, it just
> happens that one of the things that can assert "trusted_kernel" is the
> UEFI Secure Boot path. Chrome OS would assert "trusted_kernel" by
> passing it on the kernel command line (since it, along with firmware,
> kernel, and root fs are effectively measured). A
> boot-my-router-from-CD system can assert similarly because the kernel
> is on RO media.

I disagree; it's highly likely, if not certain that Windows booting
under UEFI secure boot is going to be able to do some of the things
that people are proposing that we have to prohibit in the name of
security.  That's because presumably Windows won't be willing to make
certain usability tradeoffs, and since they control the signing certs,
even in the unlikely case that people can leverage these "holes" to
enable a boot sector virus, it seems unlikely that Windows will revoke
its own cert.

The security goals for Windows' secure boot is quite a bit weaker than
what ChromeOS is trying to promise; this is why I claim the real
argument is over what the goals are for "trusted boot".

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Trusted kernel patchset for Secure Boot lockdown

2014-03-20 Thread tytso
On Wed, Mar 19, 2014 at 01:16:15PM -0700, Kees Cook wrote:
 UEFI is a red herring in both cases. This isn't about UEFI, it just
 happens that one of the things that can assert trusted_kernel is the
 UEFI Secure Boot path. Chrome OS would assert trusted_kernel by
 passing it on the kernel command line (since it, along with firmware,
 kernel, and root fs are effectively measured). A
 boot-my-router-from-CD system can assert similarly because the kernel
 is on RO media.

I disagree; it's highly likely, if not certain that Windows booting
under UEFI secure boot is going to be able to do some of the things
that people are proposing that we have to prohibit in the name of
security.  That's because presumably Windows won't be willing to make
certain usability tradeoffs, and since they control the signing certs,
even in the unlikely case that people can leverage these holes to
enable a boot sector virus, it seems unlikely that Windows will revoke
its own cert.

The security goals for Windows' secure boot is quite a bit weaker than
what ChromeOS is trying to promise; this is why I claim the real
argument is over what the goals are for trusted boot.

Cheers,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing memfd_create()

2014-03-20 Thread tytso
On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
 
 This series introduces the concept of file sealing. Sealing a file restricts
 the set of allowed operations on the file in question. Multiple seals are
 defined and each seal will cause a different set of operations to return EPERM
 if it is set. The following seals are introduced:
 
  * SEAL_SHRINK: If set, the inode size cannot be reduced
  * SEAL_GROW: If set, the inode size cannot be increased
  * SEAL_WRITE: If set, the file content cannot be modified

Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.

Why not make sealing an attribute of the struct file, and enforce it
at the VFS layer?  That way all file system objects would have access
to sealing interface, and for memfd_shmem, you can't get another
struct file pointing at the object, the security properties would be
identical.

Cheers,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/6] File Sealing memfd_create()

2014-03-20 Thread tytso
On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
 On Thu, Mar 20, 2014 at 4:32 PM,  ty...@mit.edu wrote:
  Why not make sealing an attribute of the struct file, and enforce it
  at the VFS layer?  That way all file system objects would have access
  to sealing interface, and for memfd_shmem, you can't get another
  struct file pointing at the object, the security properties would be
  identical.
 
 Sealing as introduced here is an inode-attribute, not struct file.
 This is intentional. For instance, a gfx-client can get a read-only FD
 via /proc/self/fd/ and pass it to the compositor so it can never
 overwrite the contents (unless the compositor has write-access to the
 inode itself, in which case it can just re-open it read-write).

Hmm, good point.  I had forgotten about the /proc/self/fd hole.
Hmm... what if we have a SEAL_PROC which forces the permissions of
/proc/self/fd to be 000?

So if it is a property of the attribute, SEAL_WRITE and SEAL_GROW is
basically equivalent to using chattr to set the immutable and
append-only attribute, except for the you can't undo the seal unless
you have exclusive access to the inode magic.

That does make it pretty memfd_create specific and not a very general
API, which is a little unfortunate; hence why I'm trying to explore
ways of making a bit more generic and hopefully useful for more use
cases.

Cheers,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Page cache corruption with mount/unmount of USB with Fat file system

2014-03-19 Thread tytso
On Thu, Mar 20, 2014 at 12:05:44AM +0530, Nilesh More wrote:
> Hi all,
> 
> First of all sorry for the lengthy mail but I did not want to miss out
> on any details.
> 
> The following issue that I am reporting occurs when I plugin the USB
> drive on android tablet with USB mount support. Please note that this
> issue may or may not occur the first time I hotplug the USB drive. It
> might require couple of hotplug/hotunplug to see this issue. If
> anybody has/gets any clue about the possible root cause, please let me
> know.

And I'll repeat the comments I made a few weeks ago.  Last time you
reported this, you included a dmesg output which included this:

> [ 413.607849] usb 2-1.1: USB disconnect, device number 12
> [ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode
> #81827: block 328308: comm installd...

Are you seeing the same thing?  If so, the fundamental high level
issue is that the USB driver is reporting a device disconnect,
presumably of a USB device different from the one that you just
plugged in.   Are you still seeing this?

If so, it's important if you want to root cause this issue to
determine why the USB disconnect is happening, and not try to comment
out the page invalidations which happens *after* the USB device is
reported as disconnected.

If I had to guess, it's probably some hardware issue where when you
plug in a USB device that draws too much power, it's destablizing the
USB bus and causing some other USB device to report a disconnect.

  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Page cache corruption with mount/unmount of USB with Fat file system

2014-03-19 Thread tytso
On Thu, Mar 20, 2014 at 12:05:44AM +0530, Nilesh More wrote:
 Hi all,
 
 First of all sorry for the lengthy mail but I did not want to miss out
 on any details.
 
 The following issue that I am reporting occurs when I plugin the USB
 drive on android tablet with USB mount support. Please note that this
 issue may or may not occur the first time I hotplug the USB drive. It
 might require couple of hotplug/hotunplug to see this issue. If
 anybody has/gets any clue about the possible root cause, please let me
 know.

And I'll repeat the comments I made a few weeks ago.  Last time you
reported this, you included a dmesg output which included this:

 [ 413.607849] usb 2-1.1: USB disconnect, device number 12
 [ 414.022630] EXT4-fs error (device mmcblk0p20): ext4_readdir:227: inode
 #81827: block 328308: comm installd...

Are you seeing the same thing?  If so, the fundamental high level
issue is that the USB driver is reporting a device disconnect,
presumably of a USB device different from the one that you just
plugged in.   Are you still seeing this?

If so, it's important if you want to root cause this issue to
determine why the USB disconnect is happening, and not try to comment
out the page invalidations which happens *after* the USB device is
reported as disconnected.

If I had to guess, it's probably some hardware issue where when you
plug in a USB device that draws too much power, it's destablizing the
USB bus and causing some other USB device to report a disconnect.

  - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/4] RDSEED support for the Linux kernel

2014-03-18 Thread tytso
On Mon, Mar 17, 2014 at 04:36:26PM -0700, H. Peter Anvin wrote:
> Changes since version 1:
> 
> a. Rebased on top of random.git:dev.
> b. Unbreak the PowerPC build (I had managed to miss that PowerPC had
>grown archrandom.h support.)
> c. Remove duplicate dummy function definitions in .
> d. Add a fourth patch containing a microoptimization: avoid the loop
>in arch_random_refill() if arch_get_random_seed*() is unavailable.
> 
> Comments are, of course, appreciated.
> 
> Ted, if you are OK with this could you add this to random.git:dev so
> linux-next can pick it up?

Thanks, applied to the random.git tree.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Q on ioctl BLKGETSIZE

2014-03-18 Thread tytso
On Tue, Mar 18, 2014 at 01:17:25PM +0100, Ulrich Windl wrote:
> 
> Three questions:
> 1) Shouldn't the manual page says that the sector size of always 512 Bytes, 
> even on new disks with larger sectors?
> 2) Should the real sector size be used for new disks?

The HDD industry is using 512 byte logical sectors (so LBA's are in
units of 512 byte sectors), even for disks with a physical sector size
of 4k (and in the next 4-5 years, they would like to go to 32k
physical sector sizes).

> 3) When using 512-bytes sector size, isn't the capacity limited to 2TB (2^31 
> kB)?

Yes.  That's why we have the BLKGETSIZE64 ioctl.

Example code:

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/lib/ext2fs/getsize.c

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Q on ioctl BLKGETSIZE

2014-03-18 Thread tytso
On Tue, Mar 18, 2014 at 01:17:25PM +0100, Ulrich Windl wrote:
 
 Three questions:
 1) Shouldn't the manual page says that the sector size of always 512 Bytes, 
 even on new disks with larger sectors?
 2) Should the real sector size be used for new disks?

The HDD industry is using 512 byte logical sectors (so LBA's are in
units of 512 byte sectors), even for disks with a physical sector size
of 4k (and in the next 4-5 years, they would like to go to 32k
physical sector sizes).

 3) When using 512-bytes sector size, isn't the capacity limited to 2TB (2^31 
 kB)?

Yes.  That's why we have the BLKGETSIZE64 ioctl.

Example code:

https://git.kernel.org/cgit/fs/ext2/e2fsprogs.git/tree/lib/ext2fs/getsize.c

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2 0/4] RDSEED support for the Linux kernel

2014-03-18 Thread tytso
On Mon, Mar 17, 2014 at 04:36:26PM -0700, H. Peter Anvin wrote:
 Changes since version 1:
 
 a. Rebased on top of random.git:dev.
 b. Unbreak the PowerPC build (I had managed to miss that PowerPC had
grown archrandom.h support.)
 c. Remove duplicate dummy function definitions in linux/random.h.
 d. Add a fourth patch containing a microoptimization: avoid the loop
in arch_random_refill() if arch_get_random_seed*() is unavailable.
 
 Comments are, of course, appreciated.
 
 Ted, if you are OK with this could you add this to random.git:dev so
 linux-next can pick it up?

Thanks, applied to the random.git tree.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] virtio-blk: make the queue depth the max supportable by the hypervisor

2014-03-16 Thread tytso
On Mon, Mar 17, 2014 at 11:12:15AM +1030, Rusty Russell wrote:
> 
> Note that with indirect descriptors (which is supported by Almost
> Everyone), we can actually use the full index, so this value is a bit
> pessimistic.  But it's OK as a starting point.

So is this something that can go upstream with perhaps a slight
adjustment in the commit description?  Do you think we need to be able
to dynamically adjust the queue depth after the module has been loaded
or the kernel has been booted?  If so, anyone a hint about the best
way to do that would be much appreciated.

Thanks,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] virtio-blk: make the queue depth the max supportable by the hypervisor

2014-03-16 Thread tytso
On Mon, Mar 17, 2014 at 11:12:15AM +1030, Rusty Russell wrote:
 
 Note that with indirect descriptors (which is supported by Almost
 Everyone), we can actually use the full index, so this value is a bit
 pessimistic.  But it's OK as a starting point.

So is this something that can go upstream with perhaps a slight
adjustment in the commit description?  Do you think we need to be able
to dynamically adjust the queue depth after the module has been loaded
or the kernel has been booted?  If so, anyone a hint about the best
way to do that would be much appreciated.

Thanks,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v7] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

2014-03-15 Thread tytso
I just noticed I had the v6 version of your patch in my tree.  I've
updated it to be the v7 version, thanks.

 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [v7] ext4: Add support FALLOC_FL_COLLAPSE_RANGE for fallocate

2014-03-15 Thread tytso
I just noticed I had the v6 version of your patch in my tree.  I've
updated it to be the v7 version, thanks.

 - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: flaw in "nf_tables: add reject module for NFPROTO_INET"

2014-02-12 Thread tytso
On Wed, Feb 12, 2014 at 01:05:54PM -0800, Kees Cook wrote:
> 
> I'd be up for it. It's why I CC'd you, I figured if I'd missed the
> report it would have likely have come from you. :) Perhaps just start
> by CCing each other, and if others want to get in on the fun too, move
> to a list then?

I'm paying attention to Linux kernel coverity reports (as well as
working on my goal to drive the number of e2fsprogs coverity reports
to zero :-), so feel free to CC me on any reports.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: flaw in nf_tables: add reject module for NFPROTO_INET

2014-02-12 Thread tytso
On Wed, Feb 12, 2014 at 01:05:54PM -0800, Kees Cook wrote:
 
 I'd be up for it. It's why I CC'd you, I figured if I'd missed the
 report it would have likely have come from you. :) Perhaps just start
 by CCing each other, and if others want to get in on the fun too, move
 to a list then?

I'm paying attention to Linux kernel coverity reports (as well as
working on my goal to drive the number of e2fsprogs coverity reports
to zero :-), so feel free to CC me on any reports.

Cheers,

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] CPU Jitter RNG

2014-02-04 Thread tytso
On Tue, Feb 04, 2014 at 11:06:04AM -0800, H. Peter Anvin wrote:
> 
> The quantum noise sources there are in a system are generally two
> independent clocks running against each other.  However, independent
> clocks are rare; instead, most clocks are in fact slaved against each
> other using PLLs and similar structures.

One of the things that would be useful for us to understand is in
general, where in a system we have independent clocks.  For example, I
think (correct me if I'm wrong), a 2.5" or 3.5" HDD has its own clock
which is separate from the CPU/chipset.  That is actually how and
where we get any entropy; I am not at all convinced that we are
getting any variation from "chaotic air turbulence in the HDD" ---
that paper was published in 1994, and hard drive technologies have
changed quite a bit since then, with extra layers of caching, track
bufers, etc.

However, where a decade ago the ethernet card probably had its own
independent clock crystal/oscillator, I'm going to guess that these
days with SOC's and even on laptops, with ethernet device part of the
chipset, it is probably being driven off the same master oscillator.

I wonder if there's anyway we can either figure out manually, or
preferably, automatically at boot time, which devices actually have
independent clock oscillators.

- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC PATCH 0/5] CPU Jitter RNG

2014-02-04 Thread tytso
On Tue, Feb 04, 2014 at 11:06:04AM -0800, H. Peter Anvin wrote:
 
 The quantum noise sources there are in a system are generally two
 independent clocks running against each other.  However, independent
 clocks are rare; instead, most clocks are in fact slaved against each
 other using PLLs and similar structures.

One of the things that would be useful for us to understand is in
general, where in a system we have independent clocks.  For example, I
think (correct me if I'm wrong), a 2.5 or 3.5 HDD has its own clock
which is separate from the CPU/chipset.  That is actually how and
where we get any entropy; I am not at all convinced that we are
getting any variation from chaotic air turbulence in the HDD ---
that paper was published in 1994, and hard drive technologies have
changed quite a bit since then, with extra layers of caching, track
bufers, etc.

However, where a decade ago the ethernet card probably had its own
independent clock crystal/oscillator, I'm going to guess that these
days with SOC's and even on laptops, with ethernet device part of the
chipset, it is probably being driven off the same master oscillator.

I wonder if there's anyway we can either figure out manually, or
preferably, automatically at boot time, which devices actually have
independent clock oscillators.

- Ted

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ksummit-attendees] [PATCH] commit: Add -f, --fixes option to add Fixes: line

2013-10-28 Thread tytso
[ I've removed Junio and the git mailing list form the -cc ]

On Mon, Oct 28, 2013 at 11:41:17PM +, Russell King - ARM Linux wrote:
> 
> I agree too.  This whole thread seems to be about noise, and I too
> thought there was something about not cross-posting between this list
> and any other list.

We talked about not including patches on a theoretical "git subtree
maintainers list", for noise control purposes, and I think it would be
a good to extend this to the ksummit-* mailing lists as well.

I'm not sure there's an easy way to make mailman reject e-mails that
contain patches, but I'd request that people please honor this
restriction.  If you must, send the patch to the appropriate mailing
list, and then include a lkml.org or some other mail archive link to
the ksummit-* lists.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [ksummit-attendees] [PATCH] commit: Add -f, --fixes commit option to add Fixes: line

2013-10-28 Thread tytso
[ I've removed Junio and the git mailing list form the -cc ]

On Mon, Oct 28, 2013 at 11:41:17PM +, Russell King - ARM Linux wrote:
 
 I agree too.  This whole thread seems to be about noise, and I too
 thought there was something about not cross-posting between this list
 and any other list.

We talked about not including patches on a theoretical git subtree
maintainers list, for noise control purposes, and I think it would be
a good to extend this to the ksummit-* mailing lists as well.

I'm not sure there's an easy way to make mailman reject e-mails that
contain patches, but I'd request that people please honor this
restriction.  If you must, send the patch to the appropriate mailing
list, and then include a lkml.org or some other mail archive link to
the ksummit-* lists.

- Ted
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.6.0 kernel - ext4 corruption, or?

2012-10-24 Thread tytso
On Wed, Oct 24, 2012 at 08:15:37AM -0400, Justin Piszcz wrote:
> 
> I read there was a bug in 3.6.2, is there also one in 3.6.0, or can someone
> help explain this?

The problem which we are currently trying to investigate was
reportedly introduced in v3.6.1.  So far that's about how we know; we
have two users who have reported it, but I and other ext4 developers
haven't been able to reproduce it yet.

> # grep 10.0.0.11 -r /etc
> /etc/posfix.old/master.cf:10.0.0.11:smtp inet  n   -   -
> -1postscreen
> 
> # ls -l /etc/posfix.old/master.cf
> -rw-r--r-- 1 root root 13236 Oct 28  2011 /etc/posfix.old/master.cf
> 
> # ls -ld /etc/postfix.old
> ls: cannot access /etc/postfix.old: No such file or directory
> 

Looks like you or some script renamed /etc/postfix to /etc/posfix.old
as part some upgrade?

   - Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: 3.6.0 kernel - ext4 corruption, or?

2012-10-24 Thread tytso
On Wed, Oct 24, 2012 at 08:15:37AM -0400, Justin Piszcz wrote:
 
 I read there was a bug in 3.6.2, is there also one in 3.6.0, or can someone
 help explain this?

The problem which we are currently trying to investigate was
reportedly introduced in v3.6.1.  So far that's about how we know; we
have two users who have reported it, but I and other ext4 developers
haven't been able to reproduce it yet.

 # grep 10.0.0.11 -r /etc
 /etc/posfix.old/master.cf:10.0.0.11:smtp inet  n   -   -
 -1postscreen
 
 # ls -l /etc/posfix.old/master.cf
 -rw-r--r-- 1 root root 13236 Oct 28  2011 /etc/posfix.old/master.cf
 
 # ls -ld /etc/postfix.old
 ls: cannot access /etc/postfix.old: No such file or directory
 

Looks like you or some script renamed /etc/postfix to /etc/posfix.old
as part some upgrade?

   - Ted

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Ext2-devel] ext2 inode size (on-disk)

2001-04-19 Thread tytso

On Thu, Apr 19, 2001 at 07:55:20AM -0400, Alexander Viro wrote:
>   Erm... Folks, can ->s_inode_size be not a power of 2? Both
> libext2fs and kernel break in that case. 

This was a project that was never completed.  I thought at one point
of allowing the inode size to be not a power of 2, but if you do that,
you really want to avoid letting an inode cross a block boundary ---
for reliability and performance reasons if nothing else.   

It may simply be easiest at this point to require that the inode size
be a power of two, at least as far as going from 128 to 256 bytes,
just for compatibility reasons.  (Although if we do that, the folks
who want to use extra space in the inode will come pooring out of the
woodwork, and we're going to have to careful to control who uses what
parts of the extended inode.)

In the long run, it probably makes sense to adjust the algorithms to
allow for non-power-of-two inode sizes, but require an incompatible
filesystem feature flag (so that older kernels and filesystem
utilities won't choke when mounting filesystems with non-standard
sized inodes.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] ext2 inode size (on-disk)

2001-04-19 Thread tytso

On Thu, Apr 19, 2001 at 07:55:20AM -0400, Alexander Viro wrote:
   Erm... Folks, can -s_inode_size be not a power of 2? Both
 libext2fs and kernel break in that case. 

This was a project that was never completed.  I thought at one point
of allowing the inode size to be not a power of 2, but if you do that,
you really want to avoid letting an inode cross a block boundary ---
for reliability and performance reasons if nothing else.   

It may simply be easiest at this point to require that the inode size
be a power of two, at least as far as going from 128 to 256 bytes,
just for compatibility reasons.  (Although if we do that, the folks
who want to use extra space in the inode will come pooring out of the
woodwork, and we're going to have to careful to control who uses what
parts of the extended inode.)

In the long run, it probably makes sense to adjust the algorithms to
allow for non-power-of-two inode sizes, but require an incompatible
filesystem feature flag (so that older kernels and filesystem
utilities won't choke when mounting filesystems with non-standard
sized inodes.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread tytso

   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets an
   INCOMPAT flag, and a 'slightly tarnished' but very clever version that
   is both forward and backward-compatible, along the lines suggested by
   Ted.  Both flavors have the desireable property that old versions of
   fsck with no knowledge of the new index structure can remove the indices
   automatically, with fsck -y.

Note that in the long run, the fully comatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck.  Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-23 Thread tytso

   From: Daniel Phillips [EMAIL PROTECTED]
   Date: Fri, 23 Feb 2001 00:04:02 +0100

   I resolve not to take a position on this subject, and I will carry
   forward both a 'squeaky clean' backward-compatible version that sets an
   INCOMPAT flag, and a 'slightly tarnished' but very clever version that
   is both forward and backward-compatible, along the lines suggested by
   Ted.  Both flavors have the desireable property that old versions of
   fsck with no knowledge of the new index structure can remove the indices
   automatically, with fsck -y.

Note that in the long run, the fully comatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck.  Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Andreas Dilger <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and ".."
   interception by VFS is a new phenomenon in 2.4 or if this also happened
   in 2.2?  If so, then having these entries on disk will be important
   for 2.2 compatibility, and you don't want to have different on-disk formats
   between 2.2 and 2.4.

Well, you need to have the '.' and '..' there for compatibility if you
for the full backwards compatibility.   That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..'  entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated.  After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility.  One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Daniel Phillips <[EMAIL PROTECTED]>
   Date: Thu, 22 Feb 2001 08:24:08 +0100
   Content-Type: text/plain

   > Is it worth it?  Well, it means you lose an index entry from each
   > directory block, thus reducing your fanout at each node of the tree by a
   > worse case of 0.7% in the worst case (1k blocksize) and 0.2% if you're
   > using 4k blocksizes.

   I'll leave that up to somebody else - we now have two alternatives, the
   100%, no-compromise INCOMPAT solution, and the slightly-bruised but
   still largely intact forward compatible solution.  I'll maintain both
   solutions for now code so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Daniel Phillips [EMAIL PROTECTED]
   Date: Thu, 22 Feb 2001 08:24:08 +0100
   Content-Type: text/plain

Is it worth it?  Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worse case of 0.7% in the worst case (1k blocksize) and 0.2% if you're
using 4k blocksizes.

   I'll leave that up to somebody else - we now have two alternatives, the
   100%, no-compromise INCOMPAT solution, and the slightly-bruised but
   still largely intact forward compatible solution.  I'll maintain both
   solutions for now code so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-22 Thread tytso

   From: Andreas Dilger [EMAIL PROTECTED]
   Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

   One important question as to the disk format is whether the "." and ".."
   interception by VFS is a new phenomenon in 2.4 or if this also happened
   in 2.2?  If so, then having these entries on disk will be important
   for 2.2 compatibility, and you don't want to have different on-disk formats
   between 2.2 and 2.4.

Well, you need to have the '.' and '..' there for compatibility if you
for the full backwards compatibility.   That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..'  entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated.  After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility.  One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread tytso

Daniel,

Nice work!

A couple of comments.  If you make the beginning of each index block
look like a an empty directory block (i.e, the first 8 blocks look like
this):

32 bits: ino == 0
16 bits: rec_len == blocksize
16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing.  When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry.  And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore.  (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).

Is it worth it?  Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worse case of 0.7% in the worst case (1k blocksize) and 0.2% if you're
using 4k blocksizes.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

2001-02-21 Thread tytso

Daniel,

Nice work!

A couple of comments.  If you make the beginning of each index block
look like a an empty directory block (i.e, the first 8 blocks look like
this):

32 bits: ino == 0
16 bits: rec_len == blocksize
16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing.  When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry.  And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore.  (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).

Is it worth it?  Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worse case of 0.7% in the worst case (1k blocksize) and 0.2% if you're
using 4k blocksizes.

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: mke2fs and kernel VM issues

2001-02-16 Thread tytso

   Date: Fri, 16 Feb 2001 09:48:17 + (GMT)
   From: Alan Cox <[EMAIL PROTECTED]>

   > heavily modifed VA kernel based on 2.2.18.  Is there a kernel which is
   > believed to be a known good kernel?  (both 2.2.x and 2.4.x)

   I've not seen the problem on unmodified 2.2.18. The 2.2.17/18 VM does have
   its problems but not these. 2.2.19pre3 and higher have the Andrea VM fixes which
   have worked wonders for everyone so far. 

Note that this only shows up when using mke2fs to create very large
filesystems, and you have relatively little memory.  In this particular
case, for example, we saw it with a system that had "only" 256 megs of
memory, and creating a 72 gigabyte filesystem using a 8x9gb RAID
configuration.

Some folks at IBM (in the Mylex controller group) have found this
problem with 2.2.16, 2.2.18, and with some 2.2.19pre patch (they didn't
say exactly which level of the 2.2.19pre patch they were dealing with).
Some folks claiming that the problem exists under 2.2.18, and we've seen
it with our kernel, which is a 2.2.18 plus some set of 2.2.19pre*
patches.

The problem is that mke2fs issues a *lot* of writes when it is writing
the inode table, and apparently the write throttling isn't completely
working write under those circumstances.  There is a workaround which
easily fixes the problem; if you set the MKE2FS_SYNC environment
variable to some value such as 5 or 10, then after writing every 5 or 10
block groups's worth of inode tables, mke2fs will call sync().  This
workaround did fix IBM's problem, which lends credence to the theory
that the problem is a VM bug related to a lack of sufficient write
throttling.  

I've in the past considered making MKE2FS_SYNC=10 be the default, but
Stephen has requested that I not do this, since it's the best way of
showing off this particular VM bug.

- Ted

>From IBM/Mylex's bug report:

>The system I used for these tests is a Chardonnay with an AR160
>installed. The FW is 6.00-07 and the BIOS is 6.01-08. The system
>has several various kernel/DAC driver boot configurations set up.
>The RAID drive under test is a 3-drive RAID 5 with 8.6GB drives,
>for a total of 17GB. When one maximum sized partition is created,
>the total number of logical cylinders is 2209.
>
>I also obtained the 2.2.19 patch and upgraded kernel 2.2.18 to 2.2.19.
>
>Note: The first 4 tests fail to at least some extent. Please read each one.
>
>1. Kernel 2.2.16 and DAC driver 2.2.9 - 128MB of main memory.
>   The system fails to complete the creation of an ext2 file system.
>   The process mke2fs (or mkfs) gets terminated instead. Subsequent
>   attempts to create an ext2 file (without rerunning fdisk) fail.
>
>2. Kernel 2.2.18 and DAC driver 2.2.9 - 128 MB of main memory.
>   The ext2 file system is created, but hundreds of VM error messages
>   scroll up on the screen. Expanding the swap space to exceed the
>   memory size does not help (I think it might even be worse).
>
>   I also tried running a copy compare script that loads the drives with
>   heavy I/O. This failed after approximately 20 hours. The system was
>   effectively locked up with VM error messages scrolling up the screen
>   and all alternate terminal screens.
>
>3. Kernel 2.2.16 and DAC driver 2.2.10 - 128 MB of main memory.
>   The system fails to complete the creation of an ext2 file system.
>   The process mke2fs (or mkfs) gets terminated instead. Subsequent
>   attempts to create an ext2 file (without rerunning fdisk) fail.
>
>4. Kernel 2.2.19 and DAC driver 2.2.9 - 128 MB of main memory.
>   The system fails to complete the creation of an ext2 file system.
>   The process mke2fs (or mkfs) gets terminated instead. Subsequent
>   attempts to create an ext2 file (without rerunning fdisk) DO NOT
>   FAIL.
>
>5. Kernel 2.2.16 and DAC driver 2.2.9 - 512 MB of main memory.
>   The system completes the creation of an ext2 file system without
>   any errors.
>
>6. Kernel 2.2.16 and DAC driver 2.2.10 - 512 MB of main memory.
>   The system completes the creation of an ext2 file system without
>   any errors.
>
>7. Kernel 2.2.18 and DAC driver 2.2.9 - 512 MB of main memory.
>   The system completes the creation of an ext2 file system without
>   any errors.
>
>8. Kernel 2.2.19 and DAC driver 2.2.9 - 512 MB of main memory.
>   The system completes the creation of an ext2 file system without
>   any errors.
>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: mke2fs and kernel VM issues

2001-02-16 Thread tytso

   Date: Fri, 16 Feb 2001 09:48:17 + (GMT)
   From: Alan Cox [EMAIL PROTECTED]

heavily modifed VA kernel based on 2.2.18.  Is there a kernel which is
believed to be a known good kernel?  (both 2.2.x and 2.4.x)

   I've not seen the problem on unmodified 2.2.18. The 2.2.17/18 VM does have
   its problems but not these. 2.2.19pre3 and higher have the Andrea VM fixes which
   have worked wonders for everyone so far. 

Note that this only shows up when using mke2fs to create very large
filesystems, and you have relatively little memory.  In this particular
case, for example, we saw it with a system that had "only" 256 megs of
memory, and creating a 72 gigabyte filesystem using a 8x9gb RAID
configuration.

Some folks at IBM (in the Mylex controller group) have found this
problem with 2.2.16, 2.2.18, and with some 2.2.19pre patch (they didn't
say exactly which level of the 2.2.19pre patch they were dealing with).
Some folks claiming that the problem exists under 2.2.18, and we've seen
it with our kernel, which is a 2.2.18 plus some set of 2.2.19pre*
patches.

The problem is that mke2fs issues a *lot* of writes when it is writing
the inode table, and apparently the write throttling isn't completely
working write under those circumstances.  There is a workaround which
easily fixes the problem; if you set the MKE2FS_SYNC environment
variable to some value such as 5 or 10, then after writing every 5 or 10
block groups's worth of inode tables, mke2fs will call sync().  This
workaround did fix IBM's problem, which lends credence to the theory
that the problem is a VM bug related to a lack of sufficient write
throttling.  

I've in the past considered making MKE2FS_SYNC=10 be the default, but
Stephen has requested that I not do this, since it's the best way of
showing off this particular VM bug.

- Ted

From IBM/Mylex's bug report:

The system I used for these tests is a Chardonnay with an AR160
installed. The FW is 6.00-07 and the BIOS is 6.01-08. The system
has several various kernel/DAC driver boot configurations set up.
The RAID drive under test is a 3-drive RAID 5 with 8.6GB drives,
for a total of 17GB. When one maximum sized partition is created,
the total number of logical cylinders is 2209.

I also obtained the 2.2.19 patch and upgraded kernel 2.2.18 to 2.2.19.

Note: The first 4 tests fail to at least some extent. Please read each one.

1. Kernel 2.2.16 and DAC driver 2.2.9 - 128MB of main memory.
   The system fails to complete the creation of an ext2 file system.
   The process mke2fs (or mkfs) gets terminated instead. Subsequent
   attempts to create an ext2 file (without rerunning fdisk) fail.

2. Kernel 2.2.18 and DAC driver 2.2.9 - 128 MB of main memory.
   The ext2 file system is created, but hundreds of VM error messages
   scroll up on the screen. Expanding the swap space to exceed the
   memory size does not help (I think it might even be worse).

   I also tried running a copy compare script that loads the drives with
   heavy I/O. This failed after approximately 20 hours. The system was
   effectively locked up with VM error messages scrolling up the screen
   and all alternate terminal screens.

3. Kernel 2.2.16 and DAC driver 2.2.10 - 128 MB of main memory.
   The system fails to complete the creation of an ext2 file system.
   The process mke2fs (or mkfs) gets terminated instead. Subsequent
   attempts to create an ext2 file (without rerunning fdisk) fail.

4. Kernel 2.2.19 and DAC driver 2.2.9 - 128 MB of main memory.
   The system fails to complete the creation of an ext2 file system.
   The process mke2fs (or mkfs) gets terminated instead. Subsequent
   attempts to create an ext2 file (without rerunning fdisk) DO NOT
   FAIL.

5. Kernel 2.2.16 and DAC driver 2.2.9 - 512 MB of main memory.
   The system completes the creation of an ext2 file system without
   any errors.

6. Kernel 2.2.16 and DAC driver 2.2.10 - 512 MB of main memory.
   The system completes the creation of an ext2 file system without
   any errors.

7. Kernel 2.2.18 and DAC driver 2.2.9 - 512 MB of main memory.
   The system completes the creation of an ext2 file system without
   any errors.

8. Kernel 2.2.19 and DAC driver 2.2.9 - 512 MB of main memory.
   The system completes the creation of an ext2 file system without
   any errors.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Lucent Microelectronics Venus Modem, serial 5.05, and Linux 2.4.0

2001-02-08 Thread tytso

On Sun, Jan 14, 2001 at 08:10:45PM +0100, W. Michael Petullo wrote:
> > In serial.c, you seem to perform a check by writing to a possible
> > modem's interrupt enable register and reading the result.  This seems to
> > be one of the points at which the auto-configuration process occasionally
> > fails.  If I make the following change to this code my modem seems to
> > be auto-detected correctly all of the time:
> 
> >scratch = serial_inp(info, UART_IER);
> > serial_outp(info, UART_IER, 0);
> > #ifdef __i386__
> > outb(0xff, 0x080);
> > #endif
> > scratch2 = serial_inp(info, UART_IER);
> > serial_outp(info, UART_IER, 0x0F);
> > #ifdef __i386__
> > outb(0, 0x080);
> > #endif
> > - scratch3 = serial_inp(info, UART_IER); /* REMOVE */
> > + scratch3 = 0x0f/* ADD */
> > serial_outp(info, UART_IER, scratch);

The problem is that if this doesn't work, there are some serious
questions about the correctness of the Lucent Microelectronic Venus
modem.  I've forwarded this to someone in the Lucent Modem group, who
can hopefully look at this (and maybe can ship me a sample hardware so
I can play with it, although I'd much rather that he tell me how to
work around the hardware bug, or tell me that all you need is a
firmware upgrade to fix the bug in the modem).

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Lucent Microelectronics Venus Modem, serial 5.05, and Linux 2.4.0

2001-02-08 Thread tytso

On Sun, Jan 14, 2001 at 08:10:45PM +0100, W. Michael Petullo wrote:
  In serial.c, you seem to perform a check by writing to a possible
  modem's interrupt enable register and reading the result.  This seems to
  be one of the points at which the auto-configuration process occasionally
  fails.  If I make the following change to this code my modem seems to
  be auto-detected correctly all of the time:
 
 scratch = serial_inp(info, UART_IER);
  serial_outp(info, UART_IER, 0);
  #ifdef __i386__
  outb(0xff, 0x080);
  #endif
  scratch2 = serial_inp(info, UART_IER);
  serial_outp(info, UART_IER, 0x0F);
  #ifdef __i386__
  outb(0, 0x080);
  #endif
  - scratch3 = serial_inp(info, UART_IER); /* REMOVE */
  + scratch3 = 0x0f/* ADD */
  serial_outp(info, UART_IER, scratch);

The problem is that if this doesn't work, there are some serious
questions about the correctness of the Lucent Microelectronic Venus
modem.  I've forwarded this to someone in the Lucent Modem group, who
can hopefully look at this (and maybe can ship me a sample hardware so
I can play with it, although I'd much rather that he tell me how to
work around the hardware bug, or tell me that all you need is a
firmware upgrade to fix the bug in the modem).

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Renaming lost+found

2001-01-31 Thread tytso

On Sun, Jan 28, 2001 at 01:35:44PM -0800, H. Peter Anvin wrote:
> *renamed*, i.e. does the tools (e2fsck ) use "/lost+found" by name,
> or by inode?  As far as I know it always uses the same inode number

e2fsck uses /lost+found by name, not by inode.  It will recreate a new
lost+found directory if one doesn't exist.  *However*, if the
filesystem is badly corrupted, it's possible that when it allocates
blocks for the lost+found directory, it might override a datablock
that might possibly be recoverable if one were truly desparate and
using a disk editor to search for keywords.  (This would only happen
if part of the inode table had gotten corrupted due to a hardware
error --- i.e., such as the anecdotal evidence of DMA units that write
garbage to the disk because during a power failure, where it is
conjectured that the +5V power rail drops below the critical working
of the memory faster than +12V power rail drops below the critical
working volutage of the disk drive --- so that the record in the inode
table that a certain disk block was in use is erased.)

So if you really dislike lost+found, go ahead and delete it.  It
removes a somewhat tiny safeguard, but being able to take advantage of
it requires wizard-level skills (there are no tools to do this
automatically, since it requires human intuition and a knowledge of
what file you might be trying to save.)  So it would probably only be
used in the case of someone who had 10 year's of Ph.D. research that
wasn't backed up, and this was the only way they could get the data
back.  And although not doing disk backups is grounds for general
redicule, losing ten years of graduate research would probably be
reguarded by most as cruel and unusual punishment.  But if you're not
in a Ph.D. program, it doesn't matter, yes?  (And in any case, we ALL
do backups, all the time, religiously and on a regular schedule,
RIGHT?  :-)

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)

2001-01-16 Thread tytso

   Date: Tue, 16 Jan 2001 20:33:34 +0100
   From: Andrea Arcangeli <[EMAIL PROTECTED]>

   > Of course, this is utterly unsafe on an SMP machines, since access to
   > the "block" variable isn't protected at all.  So the first question is

   Wrong, it's obviously protected by the inode_lock. And even if it wasn't
   protected by the inode_lock in 2.2.x inode.c and dcache.c runs globally
   serialized by the BKL (but it is obviously protected regardless of the BKL).

Yes, you're quite right.  The fact that you have to have inode_lock
before you call try_to_free inodes would would protect the "block"
variable.

   >static struct semaphore block = MUTEX;
   >if (down_trylock()) {
   >spin_unlock(_lock);
   >down();
   >spin_lock(_lock);
   >}

   The above is overkill (there's no need to use further atomic API,
   when we can rely on the inode_lock for the locking. It's overcomplex
   and slower. 

Actually, looking at the fast path of down_trylock compared to huge mess
of code that's currently there, I actually suspect that using
down_trylock() would actually be faster, since in the fast path case
there would only two assembly language instructions, whereas the code
that's currently there is (a) much more complicated, and thus harder to
understand, and (b) is many more instructions to execute.  

Sometimes the simplest approach is the best.

   > (with the appropriate unlocking code at the end of the function).
   > 
   > Next question why was this there in the first place?  After all,

   To fix the "inode-max limit reached" faliures that you could reproduce on
   earlier 2.2.x. (the freed inodes was re-used before the task that freed them
   had a chance to allocate them for itself)

Ah, OK.  Well, we're currently tracking down a slow inode leak which is
only happening on SMP machines, especially our mailhubs.  It's gradual,
but if you don't reboot the machine before you run out of inodes, it
will print the "inode-max limit reach" message, and then shortly after
that lock up the entire machine locks up until someone can come in and
hit the Big Red Button.  Monitoring the machine before it locks up, we
noted that the number of inodes in use was continually climbing until
the machine died.  (Yeah, we could put a reboot command into crontab,
but you should only need to do hacks like that on Windows NT machines.
:-)

We're running a reasonably recent 2.2 kernel on our machines, and the
problem is only showing up on SMP machines, so there's *some* kind of
SMP race hiding in the 2.2 inode code

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)

2001-01-16 Thread tytso


HJ Lu recently pointed me at a potential locking problem
try_to_free_inodes(), and when I started proding at it, I found what
appears to be another set of SMP locking issues in the dcache code.
(But if that were the case, why aren't we seeing huge numbers of
complaints?  So I'm wondering if I'm missing something.)

Anyway, the first problem which HJ pointed out is in
try_to_free_inodes() which attempts to implement a mutual exclusion with
respect to itself as follows

if (block)
{
struct wait_queue __wait;

__wait.task = current;
add_wait_queue(_inode_freeing, &__wait);
for (;;)
{
/* NOTE: we rely only on the inode_lock to be sure
   to not miss the unblock event */
current->state = TASK_UNINTERRUPTIBLE;
spin_unlock(_lock);
schedule();
spin_lock(_lock);
if (!block)
break;
}
remove_wait_queue(_inode_freeing, &__wait);
current->state = TASK_RUNNING;
}
block = 1;

Of course, this is utterly unsafe on an SMP machines, since access to
the "block" variable isn't protected at all.  So the first question is
why did whoever implemented this do it in this horribly complicated way
above, instead of something simple, say like this:

static struct semaphore block = MUTEX;
if (down_trylock()) {
spin_unlock(_lock);
down();
spin_lock(_lock);
}

(with the appropriate unlocking code at the end of the function).


Next question why was this there in the first place?  After all,
most of the time try_to_free_inodes() is called with the inode_lock
spinlock held, which would act as a automatic mutual exclusion anyway.
The only time this doesn't happen is when we call prune_dcache(), where
inode_lock is temporarily dropped.

So I took a look at prune_dcache(), and discovered that (a) it's called
from multiple places, and (b) it and shrink_dcache_sb() both iterate over
dentry_unused and among other things, tried to free dcache structures
without any kind of locking to prevent two kernel threads to
from mucking with the contents of dentry_unused at the same time, and
possibly having prune_one_dentry() being called by two processes on the
same dentry structure.  This should almost certainly cause problems.

So the following patch I think is definitely necessary, assuming that
the "block" mutual exclusion in try_to_free_inodes() needed to be there
in the first place.  I'm not so sure about needing a patch to protect
access to dentry_unused in fs/dentry.c, but unless there are some
other unstated locking hierarchy rules about when it's safe to call
prune_dcache() and shrink_dcache_sb(), I suspect we need to add a lock
to protect dentry_unused as well.  Comments?

- Ted

--- fs/inode.c.old  Fri Sep 29 17:37:01 2000
+++ fs/inode.c  Tue Jan 16 09:29:06 2001
@@ -380,36 +380,19 @@
  */
 static void try_to_free_inodes(int goal)
 {
-   static int block;
-   static struct wait_queue * wait_inode_freeing;
+   static struct semaphore block = MUTEX;
LIST_HEAD(throw_away);

/* We must make sure to not eat the inodes
   while the blocker task sleeps otherwise
   the blocker task may find none inode
   available. */
-   if (block)
-   {
-   struct wait_queue __wait;
-
-   __wait.task = current;
-   add_wait_queue(_inode_freeing, &__wait);
-   for (;;)
-   {
-   /* NOTE: we rely only on the inode_lock to be sure
-  to not miss the unblock event */
-   current->state = TASK_UNINTERRUPTIBLE;
-   spin_unlock(_lock);
-   schedule();
-   spin_lock(_lock);
-   if (!block)
-   break;
-   }
-   remove_wait_queue(_inode_freeing, &__wait);
-   current->state = TASK_RUNNING;
+   if (down_trylock()) {
+   spin_unlock(_lock);
+   down();
+   spin_lock(_lock);
}
 
-   block = 1;
/*
 * First stry to just get rid of unused inodes.
 *
@@ -427,8 +410,7 @@
}
if (!list_empty(_away))
dispose_list(_away);
-   block = 0;
-   wake_up(_inode_freeing);
+   up();
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Locking problem in 2.2.18/19-pre7? (fs/inode.c and fs/dcache.c)

2001-01-16 Thread tytso


HJ Lu recently pointed me at a potential locking problem
try_to_free_inodes(), and when I started proding at it, I found what
appears to be another set of SMP locking issues in the dcache code.
(But if that were the case, why aren't we seeing huge numbers of
complaints?  So I'm wondering if I'm missing something.)

Anyway, the first problem which HJ pointed out is in
try_to_free_inodes() which attempts to implement a mutual exclusion with
respect to itself as follows

if (block)
{
struct wait_queue __wait;

__wait.task = current;
add_wait_queue(wait_inode_freeing, __wait);
for (;;)
{
/* NOTE: we rely only on the inode_lock to be sure
   to not miss the unblock event */
current-state = TASK_UNINTERRUPTIBLE;
spin_unlock(inode_lock);
schedule();
spin_lock(inode_lock);
if (!block)
break;
}
remove_wait_queue(wait_inode_freeing, __wait);
current-state = TASK_RUNNING;
}
block = 1;

Of course, this is utterly unsafe on an SMP machines, since access to
the "block" variable isn't protected at all.  So the first question is
why did whoever implemented this do it in this horribly complicated way
above, instead of something simple, say like this:

static struct semaphore block = MUTEX;
if (down_trylock(block)) {
spin_unlock(inode_lock);
down(block);
spin_lock(inode_lock);
}

(with the appropriate unlocking code at the end of the function).


Next question why was this there in the first place?  After all,
most of the time try_to_free_inodes() is called with the inode_lock
spinlock held, which would act as a automatic mutual exclusion anyway.
The only time this doesn't happen is when we call prune_dcache(), where
inode_lock is temporarily dropped.

So I took a look at prune_dcache(), and discovered that (a) it's called
from multiple places, and (b) it and shrink_dcache_sb() both iterate over
dentry_unused and among other things, tried to free dcache structures
without any kind of locking to prevent two kernel threads to
from mucking with the contents of dentry_unused at the same time, and
possibly having prune_one_dentry() being called by two processes on the
same dentry structure.  This should almost certainly cause problems.

So the following patch I think is definitely necessary, assuming that
the "block" mutual exclusion in try_to_free_inodes() needed to be there
in the first place.  I'm not so sure about needing a patch to protect
access to dentry_unused in fs/dentry.c, but unless there are some
other unstated locking hierarchy rules about when it's safe to call
prune_dcache() and shrink_dcache_sb(), I suspect we need to add a lock
to protect dentry_unused as well.  Comments?

- Ted

--- fs/inode.c.old  Fri Sep 29 17:37:01 2000
+++ fs/inode.c  Tue Jan 16 09:29:06 2001
@@ -380,36 +380,19 @@
  */
 static void try_to_free_inodes(int goal)
 {
-   static int block;
-   static struct wait_queue * wait_inode_freeing;
+   static struct semaphore block = MUTEX;
LIST_HEAD(throw_away);

/* We must make sure to not eat the inodes
   while the blocker task sleeps otherwise
   the blocker task may find none inode
   available. */
-   if (block)
-   {
-   struct wait_queue __wait;
-
-   __wait.task = current;
-   add_wait_queue(wait_inode_freeing, __wait);
-   for (;;)
-   {
-   /* NOTE: we rely only on the inode_lock to be sure
-  to not miss the unblock event */
-   current-state = TASK_UNINTERRUPTIBLE;
-   spin_unlock(inode_lock);
-   schedule();
-   spin_lock(inode_lock);
-   if (!block)
-   break;
-   }
-   remove_wait_queue(wait_inode_freeing, __wait);
-   current-state = TASK_RUNNING;
+   if (down_trylock(block)) {
+   spin_unlock(inode_lock);
+   down(block);
+   spin_lock(inode_lock);
}
 
-   block = 1;
/*
 * First stry to just get rid of unused inodes.
 *
@@ -427,8 +410,7 @@
}
if (!list_empty(throw_away))
dispose_list(throw_away);
-   block = 0;
-   wake_up(wait_inode_freeing);
+   up(block);
 }
 
 /*
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [patch] 2.2.18 PCI_DEVICE_ID_OXSEMI_16PCI954

2000-12-17 Thread tytso

   Date: Sun, 17 Dec 2000 01:51:15 +0100 (CET)
   From: Lukasz Trabinski <[EMAIL PROTECTED]>

   OK, You have right, I'm not a driver programmer, but it probably should
   look like this:

   +#define PCI_DEVICE_ID_OXSEMI_16PCI954   0x9501

That's correct, yes.  The reason why the original poster was having
problems was that the serial_compat.h defined PCI_VENDOR_ID_OXSEMI
PCI_DEVICE_ID_OXSEMI_16PCI954 if PCI_VENDOR_ID_OXSEMI was not defined.
Since 2.2.18 defines PCI_VENDOR_ID_OXSEMI, but ..._OXSEMI_16PCI954, the
5.05 serial.c wouldn't compile.  

When we integrate a newer serial driver into 2.2.19, I'll include the
necessary patches to pci_ids.h.  

   Why serial.c from the 2.2.18 not support 921600 speed? We have to patch
   it...

2.2.18 has an old serial driver that doesn't know about PCI devices.
The updated serial driver available at http://serial.sourceforge.net
should deal with this correctly.

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: Serial cardbus code.... for testing, please.....

2000-12-14 Thread tytso


[Apologies if this has been seen already, but as far as I know my first
posting to the L-K list  apparently never made it out]

- Ted


OK, so I'm currently at the road (San Diego IETF meeting) so I can't
really test this very well; when your compile engine is your Vaio
laptop, it's really slow and painful to do test builds and test booting
kernels.   

But I know some people are eager to test it, and would rather have
something potentially flaky earlier rather waiting for something more
tested later, so here it is.  Please note, the only testing that I have
done to date is Linus's famous "it builds, ship it" regression test
suite.  So if you're not willing to test a patch which might crash your
machine, come back in a day or two after I and others have had a chance
to do some testing.  (But hey, Linus has released kernels to the entire
world without even doing a test compile, so why can't I do it with
something as simple as a serial driver?  :-)


This version of the patch has a couple of new features over past ones,
including sanity check code which should prevent receive_chars() from
getting stuck in a tight loop when the serial card is ejected while a
port is active.  It also has functions correctly labelled with __devinit
and __devexit, and will check to see if an entry in the serial pci table
isn't necessay, and ask the user to report the information in that case.
(mailing list to be active within 24 hours; until then, send the
information to me instead of the e-mail adddress listed in the patch).

The patch also has changes which Kanoj from SGI has been bugging me
about constantly because he needs some changes to support his big-iron
MIPS box project which he's working on.  I had sat on them because
clearly I have a different understanding of "code freeze; critical bugs
only" than other folks on the L-K mailing list, but they're included
here now.  Linus, I can back out some or all of these changes when I
feed the patches to use for merging with 2.4; just let me know what you
want and don't want.

- Ted


Release notes
=

* Add support for PCI hot plugging.  
- Functions should be correctly labeled with __devinit
and __devexit now.
- Set a safety check to prevent infinite loops in
receive_chars
- Added support for removing PCI cards

* Added code to test to see if an entry in the serial driver's
PCI table is redundant (i.e., could have been deduced
using our hueristics) and asks the user to report the
information if so.

* Add support for flow controled serial console.  (Feature
desperately requested to be merged into 2.4.0 by Kanoj
Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS box)

* Add new interface, early_serial_setup() which allows
architecture specific code to set up serial consoles
for those platforms where the port and bus information
must be dynamically determined by the boot.
Early_serial_setup() must be called before rs_init().
(Feature desperately requested to be merged into 2.4.0
by Kanoj Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS
box)


* Fixed fencepost bug which could cause the serial driver to
overrun the flip buffer by a single character if the
UART reports an overrun when the flip buffer is nearly
full.  (not really a critical problem because we have
slop space in the flip buffer for this purpose, but it
really would be good to have this fixed.)

* Add support for the DCI PCCOM8 8-port serial board


Patch generated: on Wed Dec 13 10:39:41 EST 2000 by tytso@trampoline
against Linux version 2.4.0
 
===
RCS file: Documentation/RCS/serial-console.txt,v
retrieving revision 1.1
diff -u -r1.1 Documentation/serial-console.txt
--- Documentation/serial-console.txt2000/12/13 15:12:57 1.1
+++ Documentation/serial-console.txt2000/12/13 15:13:09
@@ -20,9 +20,11 @@
 
options:depend on the driver. For the serial port this
defines the baudrate/parity/bits of the port,
-   in the format PN, where  is the speed,
-   P is parity (n/o/e), and N is bits. Default is
-   9600n8. The maximum baudrate is 115200.
+   in the format PNF, where  is the speed,
+   P is parity (n/o/e), N is bits, and F is
+   flow control (n/r for none/rtscts).  P, N,
+   and F ar

Re: Serial cardbus code.... for testing, please.....

2000-12-14 Thread tytso


[Apologies if this has been seen already, but as far as I know my first
posting to the L-K list  apparently never made it out]

- Ted


OK, so I'm currently at the road (San Diego IETF meeting) so I can't
really test this very well; when your compile engine is your Vaio
laptop, it's really slow and painful to do test builds and test booting
kernels.   

But I know some people are eager to test it, and would rather have
something potentially flaky earlier rather waiting for something more
tested later, so here it is.  Please note, the only testing that I have
done to date is Linus's famous "it builds, ship it" regression test
suite.  So if you're not willing to test a patch which might crash your
machine, come back in a day or two after I and others have had a chance
to do some testing.  (But hey, Linus has released kernels to the entire
world without even doing a test compile, so why can't I do it with
something as simple as a serial driver?  :-)


This version of the patch has a couple of new features over past ones,
including sanity check code which should prevent receive_chars() from
getting stuck in a tight loop when the serial card is ejected while a
port is active.  It also has functions correctly labelled with __devinit
and __devexit, and will check to see if an entry in the serial pci table
isn't necessay, and ask the user to report the information in that case.
(mailing list to be active within 24 hours; until then, send the
information to me instead of the e-mail adddress listed in the patch).

The patch also has changes which Kanoj from SGI has been bugging me
about constantly because he needs some changes to support his big-iron
MIPS box project which he's working on.  I had sat on them because
clearly I have a different understanding of "code freeze; critical bugs
only" than other folks on the L-K mailing list, but they're included
here now.  Linus, I can back out some or all of these changes when I
feed the patches to use for merging with 2.4; just let me know what you
want and don't want.

- Ted


Release notes
=

* Add support for PCI hot plugging.  
- Functions should be correctly labeled with __devinit
and __devexit now.
- Set a safety check to prevent infinite loops in
receive_chars
- Added support for removing PCI cards

* Added code to test to see if an entry in the serial driver's
PCI table is redundant (i.e., could have been deduced
using our hueristics) and asks the user to report the
information if so.

* Add support for flow controled serial console.  (Feature
desperately requested to be merged into 2.4.0 by Kanoj
Sarcar [EMAIL PROTECTED] for his big-iron MIPS box)

* Add new interface, early_serial_setup() which allows
architecture specific code to set up serial consoles
for those platforms where the port and bus information
must be dynamically determined by the boot.
Early_serial_setup() must be called before rs_init().
(Feature desperately requested to be merged into 2.4.0
by Kanoj Sarcar [EMAIL PROTECTED] for his big-iron MIPS
box)


* Fixed fencepost bug which could cause the serial driver to
overrun the flip buffer by a single character if the
UART reports an overrun when the flip buffer is nearly
full.  (not really a critical problem because we have
slop space in the flip buffer for this purpose, but it
really would be good to have this fixed.)

* Add support for the DCI PCCOM8 8-port serial board


Patch generated: on Wed Dec 13 10:39:41 EST 2000 by tytso@trampoline
against Linux version 2.4.0
 
===
RCS file: Documentation/RCS/serial-console.txt,v
retrieving revision 1.1
diff -u -r1.1 Documentation/serial-console.txt
--- Documentation/serial-console.txt2000/12/13 15:12:57 1.1
+++ Documentation/serial-console.txt2000/12/13 15:13:09
@@ -20,9 +20,11 @@
 
options:depend on the driver. For the serial port this
defines the baudrate/parity/bits of the port,
-   in the format PN, where  is the speed,
-   P is parity (n/o/e), and N is bits. Default is
-   9600n8. The maximum baudrate is 115200.
+   in the format PNF, where  is the speed,
+   P is parity (n/o/e), N is bits, and F is
+   flow control (n/r for none/rtscts).  P, N,
+   and F are optional. 

Re: Serial cardbus code.... for testing, please.....

2000-12-13 Thread tytso


OK, so I'm currently at the road (San Diego IETF meeting) so I can't
really test this very well; when your compile engine is your Vaio
laptop, it's really slow and painful to do test builds and test booting
kernels.   

But I know some people are eager to test it, and would rather have
something potentially flaky earlier rather waiting for something more
tested later, so here it is.  Please note, the only testing that I have
done to date is Linus's famous "it builds, ship it" regression test
suite.  So if you're not willing to test a patch which might crash your
machine, come back in a day or two after I and others have had a chance
to do some testing.  (But hey, Linus has released kernels to the entire
world without even doing a test compile, so why can't I do it with
something as simple as a serial driver?  :-)


This version of the patch has a couple of new features over past ones,
including sanity check code which should prevent receive_chars() from
getting stuck in a tight loop when the serial card is ejected while a
port is active.  It also has functions correctly labelled with __devinit
and __devexit, and will check to see if an entry in the serial pci table
isn't necessay, and ask the user to report the information in that case.
(mailing list to be active within 24 hours; until then, send the
information to me instead of the e-mail adddress listed in the patch).

The patch also has changes which Kanoj from SGI has been bugging me
about constantly because he needs some changes to support his big-iron
MIPS box project which he's working on.  I had sat on them because
clearly I have a different understanding of "code freeze; critical bugs
only" than other folks on the L-K mailing list, but they're included
here now.  Linus, I can back out some or all of these changes when I
feed the patches to use for merging with 2.4; just let me know what you
want and don't want.

- Ted


Release notes
=

* Add support for PCI hot plugging.  
- Functions should be correctly labeled with __devinit
and __devexit now.
- Set a safety check to prevent infinite loops in
receive_chars
- Added support for removing PCI cards

* Added code to test to see if an entry in the serial driver's
PCI table is redundant (i.e., could have been deduced
using our hueristics) and asks the user to report the
information if so.

* Add support for flow controled serial console.  (Feature
desperately requested to be merged into 2.4.0 by Kanoj
Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS box)

* Add new interface, early_serial_setup() which allows
architecture specific code to set up serial consoles
for those platforms where the port and bus information
must be dynamically determined by the boot.
Early_serial_setup() must be called before rs_init().
(Feature desperately requested to be merged into 2.4.0
by Kanoj Sarcar <[EMAIL PROTECTED]> for his big-iron MIPS
box)


* Fixed fencepost bug which could cause the serial driver to
overrun the flip buffer by a single character if the
UART reports an overrun when the flip buffer is nearly
full.  (not really a critical problem because we have
slop space in the flip buffer for this purpose, but it
really would be good to have this fixed.)

* Add support for the DCI PCCOM8 8-port serial board


Patch generated: on Wed Dec 13 10:39:41 EST 2000 by tytso@trampoline
against Linux version 2.4.0
 
===
RCS file: Documentation/RCS/serial-console.txt,v
retrieving revision 1.1
diff -u -r1.1 Documentation/serial-console.txt
--- Documentation/serial-console.txt2000/12/13 15:12:57 1.1
+++ Documentation/serial-console.txt2000/12/13 15:13:09
@@ -20,9 +20,11 @@
 
options:depend on the driver. For the serial port this
defines the baudrate/parity/bits of the port,
-   in the format PN, where  is the speed,
-   P is parity (n/o/e), and N is bits. Default is
-   9600n8. The maximum baudrate is 115200.
+   in the format PNF, where  is the speed,
+   P is parity (n/o/e), N is bits, and F is
+   flow control (n/r for none/rtscts).  P, N,
+   and F are optional.  The default setting is
+   9600n8n. The maximum baudrate is 115200.  
 
 You can specify multiple console= options on the kernel command line.
 Output will appear

Re: Serial cardbus code.... for testing, please.....

2000-12-13 Thread tytso


OK, so I'm currently at the road (San Diego IETF meeting) so I can't
really test this very well; when your compile engine is your Vaio
laptop, it's really slow and painful to do test builds and test booting
kernels.   

But I know some people are eager to test it, and would rather have
something potentially flaky earlier rather waiting for something more
tested later, so here it is.  Please note, the only testing that I have
done to date is Linus's famous "it builds, ship it" regression test
suite.  So if you're not willing to test a patch which might crash your
machine, come back in a day or two after I and others have had a chance
to do some testing.  (But hey, Linus has released kernels to the entire
world without even doing a test compile, so why can't I do it with
something as simple as a serial driver?  :-)


This version of the patch has a couple of new features over past ones,
including sanity check code which should prevent receive_chars() from
getting stuck in a tight loop when the serial card is ejected while a
port is active.  It also has functions correctly labelled with __devinit
and __devexit, and will check to see if an entry in the serial pci table
isn't necessay, and ask the user to report the information in that case.
(mailing list to be active within 24 hours; until then, send the
information to me instead of the e-mail adddress listed in the patch).

The patch also has changes which Kanoj from SGI has been bugging me
about constantly because he needs some changes to support his big-iron
MIPS box project which he's working on.  I had sat on them because
clearly I have a different understanding of "code freeze; critical bugs
only" than other folks on the L-K mailing list, but they're included
here now.  Linus, I can back out some or all of these changes when I
feed the patches to use for merging with 2.4; just let me know what you
want and don't want.

- Ted


Release notes
=

* Add support for PCI hot plugging.  
- Functions should be correctly labeled with __devinit
and __devexit now.
- Set a safety check to prevent infinite loops in
receive_chars
- Added support for removing PCI cards

* Added code to test to see if an entry in the serial driver's
PCI table is redundant (i.e., could have been deduced
using our hueristics) and asks the user to report the
information if so.

* Add support for flow controled serial console.  (Feature
desperately requested to be merged into 2.4.0 by Kanoj
Sarcar [EMAIL PROTECTED] for his big-iron MIPS box)

* Add new interface, early_serial_setup() which allows
architecture specific code to set up serial consoles
for those platforms where the port and bus information
must be dynamically determined by the boot.
Early_serial_setup() must be called before rs_init().
(Feature desperately requested to be merged into 2.4.0
by Kanoj Sarcar [EMAIL PROTECTED] for his big-iron MIPS
box)


* Fixed fencepost bug which could cause the serial driver to
overrun the flip buffer by a single character if the
UART reports an overrun when the flip buffer is nearly
full.  (not really a critical problem because we have
slop space in the flip buffer for this purpose, but it
really would be good to have this fixed.)

* Add support for the DCI PCCOM8 8-port serial board


Patch generated: on Wed Dec 13 10:39:41 EST 2000 by tytso@trampoline
against Linux version 2.4.0
 
===
RCS file: Documentation/RCS/serial-console.txt,v
retrieving revision 1.1
diff -u -r1.1 Documentation/serial-console.txt
--- Documentation/serial-console.txt2000/12/13 15:12:57 1.1
+++ Documentation/serial-console.txt2000/12/13 15:13:09
@@ -20,9 +20,11 @@
 
options:depend on the driver. For the serial port this
defines the baudrate/parity/bits of the port,
-   in the format PN, where  is the speed,
-   P is parity (n/o/e), and N is bits. Default is
-   9600n8. The maximum baudrate is 115200.
+   in the format PNF, where  is the speed,
+   P is parity (n/o/e), N is bits, and F is
+   flow control (n/r for none/rtscts).  P, N,
+   and F are optional.  The default setting is
+   9600n8n. The maximum baudrate is 115200.  
 
 You can specify multiple console= options on the kernel command line.
 Output will appear on all of them. The last

  1   2   >