Re: [PATCH 13/38] ext4: Define usercopy region in ext4_inode_cache slab cache

2018-01-11 Thread Theodore Ts'o
On Wed, Jan 10, 2018 at 06:02:45PM -0800, Kees Cook wrote:
> The ext4 symlink pathnames, stored in struct ext4_inode_info.i_data
> and therefore contained in the ext4_inode_cache slab cache, need
> to be copied to/from userspace.

Symlink operations to/from userspace aren't common or in the hot path,
and when the pathname is stored in i_data, it is limited to at most 60
bytes.  Is it worth it to copy through a bounce buffer so as to
disallow any usercopies into struct ext4_inode_info?
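For illustration, the bounce-buffer pattern being suggested could be
sketched as follows.  This is a userspace analogue, not kernel code: the
struct, the function name, and the sizes are hypothetical, and memcpy()
stands in for copy_to_user().

```c
#include <assert.h>
#include <string.h>

/* hypothetical stand-in mirroring ext4: i_data holds at most 60 bytes */
#define FAKE_I_DATA_BYTES 60

struct fake_inode {
	char i_data[FAKE_I_DATA_BYTES];	/* slab-backed in the real kernel */
};

/*
 * Copy the symlink target out through a stack bounce buffer, so the
 * usercopy (memcpy here, copy_to_user() in the kernel) never reads
 * directly from the slab-cache-backed structure.
 */
static long read_symlink_bounced(const struct fake_inode *inode,
				 char *ubuf, size_t ubuflen)
{
	char bounce[FAKE_I_DATA_BYTES];
	size_t len = strnlen(inode->i_data, sizeof(bounce));

	if (len > ubuflen)
		len = ubuflen;
	memcpy(bounce, inode->i_data, len);	/* slab -> stack */
	memcpy(ubuf, bounce, len);		/* stack -> "userspace" */
	return (long)len;
}
```

Since symlink reads are rare and at most 60 bytes, the extra stack copy
costs almost nothing while keeping the slab cache out of the usercopy
whitelist entirely.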

- Ted


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 12:09:18PM +0100, Dmitry Vyukov wrote:
> On Wed, Jan 17, 2018 at 10:49 AM, Daniel Borkmann  
> wrote:
> > Don't know if there's such a possibility, but it would be nice if we could
> > target fuzzing for specific subsystems in related subtrees directly (e.g.
> > for bpf in bpf and bpf-next trees as one example). Dmitry?
> 
> Hi Daniel,
> 
> It's doable.
> Let's start with one bpf tree. Will it be bpf or bpf-next? Which one
> contains more ongoing work? What's the exact git repo address/branch,
> so that I don't second guess?

As a suggestion, until the bpf subsystem is free from problems that
can be found by Syzkaller in Linus's upstream tree, maybe it's not
worth trying to test individual subsystem trees such as the bpf tree?
After all, there's no point bisecting to see whether the problem lies
in a newly added commit in a development tree, if it turns out the
problem was first introduced years ago in the 4.1 or 3.19 timeframe.

After all, finding these older problems is going to have much higher
value, since these are the sorts of potential security problems that
are worth backporting to real device kernels for Android/ChromeOS, and
for enterprise distro kernels.  So from an "impact to the industry"
perspective, focusing on Linus's tree is going to be far more
productive.  That's a win for the community, and it's a win for those
people on the Syzkaller team who might be going up for promo or
listing their achievements at performance review time.  :-)

This will also give the Syzkaller team more time to make the
automation more intelligent in terms of being able to do the automatic
bisection to find the first guilty commit, labelling the report with
the specific subsystem tree that it came from, etc., etc.

Cheers,

- Ted

P.S.  Something that might be *really* interesting is for those cases
where Syzkaller can find a repro, to test that repro on the various
stable 4.4, 4.9, 3.18, et al. LTS kernels.  This will take fewer
resources than a full bisection, but it will add real value, since
knowing that a bug triggers on an LTS kernel will help prioritize
which reports developers might be more interested in focusing upon,
and it will give them a head start in determining which fixes need to
be backported to which stable kernels.


Re: dangers of bots on the mailing lists was Re: divide error in ___bpf_prog_run

2018-01-17 Thread Theodore Ts'o
On Wed, Jan 17, 2018 at 04:21:13PM -0800, Alexei Starovoitov wrote:
> 
> If syzkaller can only test one tree than linux-next should be the one.

Well, there's been some controversy about that.  The problem is that
it's often not clear whether a report is a long-standing bug, or a bug
which is in a particular subsystem tree --- and if so, *which*
subsystem tree, etc.  So it gets blasted to linux-kernel, and to the
addresses suggested by get_maintainer.pl, which is often not accurate
--- since the location of the crash doesn't necessarily point out
where the problem originated, and hence who should look at the syzbot
report.  And so this has caused some irritation.

> There is some value of testing stable trees, but any developer
> will first ask for a reproducer in the latest, so usefulness of
> reporting such bugs will be limited.

What I suggested was to test Linus's tree, and then when a problem is
found, and syzkaller has a reliable repro, to *then* try to see if it
*also* shows up in the LTS kernels.

- Ted


"virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o
Hi,

I was doing a last minute regression test of the ext4 tree before
sending a pull request to Linus, which I do using gce-xfstests[1], and
I found that networking was broken on GCE on linux-next.  I was using
next-20161209, and after bisecting, I narrowed the commit causing
things to break down to commit 449000102901: "virtio-net: enable
multiqueue by default".  Reverting this commit on top of next-20161209
fixed the problem.

[1] http://thunk.org/gce-xfstests

You can reproduce the problem by building a kernel for Google Compute
Engine --- I use a config such as this [2], and then try to boot the
kernel on a VM.  The way I do this involves booting a test appliance
and then kexec'ing into the kernel to be tested[3], using a 2-CPU
configuration.  (GCE machine type: n1-standard-2)

[2] 
https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9
[3] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

You can then take a look at the serial console using a command such as
"gcloud compute instances get-serial-port-output ", and
you will get something like this (see attached).  The important bit is
that the dhclient command is completely failing to get a response from
the network, from which I deduce that either network send or receive
(or both) is badly affected by the commit in question.

Please let me know if there's anything I can do to help you debug this
further.

Cheers,

- Ted

Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version 
4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 
4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: 
root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0  
fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 
fstestapi=1.3
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x002: 'SSE registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x004: 'AVX registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
xstate_offset[2]:  576, xstate_sizes[2]:  256
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled 
xstate features 0x7, context size is 832 bytes, using 'standard' format.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel 
Variables...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel 
Variables.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static Device 
Nodes in /dev.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device 
Manager...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device 
Manager.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all 
Devices.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for 
Complete Device Initialization...
Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: clean, 
56268/655360 files, 357439/2620928 blocks
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check on 
Root Device.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and 
Kernel File Systems...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and 
Kernel File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to 
make systemd work better on Debian.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random 
Seed...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems 
(Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File 
Systems (Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load/Save Random Seed.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Wait for 
Complete Device Initialization.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Activation of LVM2 
logical volumes...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Copy rules generated 
while the root was ro...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS0.
Dec 11 23:53:20 xfstests-201612120451 

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o
On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote:
> 
> That's unfortunate, of course. It could be a hypervisor or
> a guest kernel bug. ideas:
> - does host have mq capability? how many queues?
> - how about # of msix vectors?
> - after you send something on tx queues,
>   are interrupts arriving on rx queues?
> - is problem rx or tx?
>   set ip and arp manually and send a packet to known MAC,
>   does it get there?

Sorry, I don't know how to debug virtio-net.  Given that it's in a
cloud environment, I also can't set ip addresses manually, since ip
addresses are assigned automatically.

If you can send me a patch, I'm happy to apply it and send you back
results.

I can say that I've had _zero_ problems using pretty much any kernel
from 3.10 to 4.9 using Google Compute Engine.  The commit I referenced
caused things to stop working.  So in terms of regression, this is
definitely a regression, and it's definitely caused by commit
449000102901.  Even if it is a hypervisor "bug", I'm pretty sure I
know what Linus will say if I ask him to revert it.  Linux kernels are
expected to work around hardware bugs, and breaking users just because
hardware is "broken" by some definition is generally not considered
friendly, especially when it has been working for years and years
before some commit "fixed" things.

I would very much like to work with you to fix it, but I will need
your help, since virtio-net doesn't seem to print any informational
messages during the boot sequence, and I don't know the best way to
debug it.

Cheers,

- Ted


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o
On Tue, Dec 13, 2016 at 11:43:00AM +0800, Jason Wang wrote:
> Thanks for reporting this issue. Looks like I blindly set the affinity
> instead of queues during probe. Could you please try the following patch to
> see if it works?

This fixed things, thanks!!

- Ted


> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b425fa1..fe9f772 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev)
> goto free_unregister_netdev;
> }
> 
> -   virtnet_set_affinity(vi);
> +   rtnl_lock();
> +   virtnet_set_queues(vi, vi->curr_queue_pairs);
> +   rtnl_unlock();
> 
> /* Assume link up if device can't report link status,
>otherwise get link status from config. */
> 
> 


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-13 Thread Theodore Ts'o
Jason's patch fixed the issue, so I think we have the proper fix, but
to answer your questions:

On Wed, Dec 14, 2016 at 01:46:44AM +0800, Wei Xu wrote:
> 
> Q1:
> Which distribution are you using for the GCE instance?

The test appliance is based on Debian Jessie.

> Q2:
> Are you running xfs test as an embedded VM case, which means XFS test
> appliance is also a VM inside the GCE instance? Or the kernel is built
> for the instance itself?

No, GCE currently doesn't support running nested VM's (i.e., running
VM's inside GCE).  So the kernel is built for the instance itself.
The way the test appliance works is that it initially boots using the
Debian Jessie default kernel and then we kexec into the kernel under
test.

> Q3:
> Can this bug be reproduced for kvm-xfstests case? I'm trying to set up
> a local test bed if it makes sense.

You definitely can't do it out of the box -- you need to build the
image using "gen-image --networking", and then run "kvm-xfstests -N
shell" as root.  But the bug doesn't reproduce on kvm-xfstests, using
a 4.9 host kernel and linux-next guest kernel.


Cheers,

- Ted


Re: [PATCH 4/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-14 Thread Theodore Ts'o
On Wed, Dec 14, 2016 at 04:10:37AM +0100, Jason A. Donenfeld wrote:
> This duplicates the current algorithm for get_random_int/long, but uses
> siphash24 instead. This comes with several benefits. It's certainly
> faster and more cryptographically secure than MD5. This patch also
> hashes the pid, entropy, and timestamp as fixed width fields, in order
> to increase diffusion.
> 
> The previous md5 algorithm used a per-cpu md5 state, which caused
> successive calls to the function to chain upon each other. While it's
> not entirely clear that this kind of chaining is absolutely necessary
> when using a secure PRF like siphash24, it can't hurt, and the timing of
> the call chain does add a degree of natural entropy. So, in keeping with
> this design, instead of the massive per-cpu 64-byte md5 state, there is
> instead a per-cpu previously returned value for chaining.
> 
> Signed-off-by: Jason A. Donenfeld 
> Cc: Jean-Philippe Aumasson 

The original reason for get_random_int was because the original
urandom algorithms were too slow.  When we moved to chacha20, which is
much faster, I didn't think to revisit get_random_int() and
get_random_long().

One somewhat undesirable aspect of the current algorithm is that we
never change random_int_secret.  So I've been toying with the
following, which is 4 times faster than md5.  (I haven't tried
benchmarking against siphash yet.)

[3.606139] random benchmark!!
[3.606276] get_random_int # cycles: 326578
[3.606317] get_random_int_new # cycles: 95438
[3.607423] get_random_bytes # cycles: 2653388

  - Ted

P.S.  It's interesting to note that siphash24 and chacha20 are both
add-rotate-xor based algorithms.

diff --git a/drivers/char/random.c b/drivers/char/random.c
index d6876d506220..be172ea75799 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1681,6 +1681,38 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+
+static int rand_benchmark(void)
+{
+   cycles_t start,finish;
+   int i, out;
+
+   pr_crit("random benchmark!!\n");
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_int();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int # cycles: %llu\n", finish - start);
+
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_int_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int_new # cycles: %llu\n", finish - start);
+
+   start = get_cycles();
+   for (i = 0; i < 1000; i++) {
+   get_random_bytes(&out, sizeof(out));
+   }
+   finish = get_cycles();
+   pr_err("get_random_bytes # cycles: %llu\n", finish - start);
+   return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2096,10 @@ unsigned int get_random_int(void)
__u32 *hash;
unsigned int ret;
 
+#if 0  // force slow path
if (arch_get_random_int(&ret))
return ret;
+#endif
 
hash = get_cpu_var(get_random_int_hash);
 
@@ -2100,6 +2134,38 @@ unsigned long get_random_long(void)
 }
 EXPORT_SYMBOL(get_random_long);
 
+struct random_buf {
+   __u8 buf[CHACHA20_BLOCK_SIZE];
+   int ptr;
+};
+
+static DEFINE_PER_CPU(struct random_buf, batched_entropy);
+
+static void get_batched_entropy(void *buf, int n)
+{
+   struct random_buf *p;
+
+   p = &get_cpu_var(batched_entropy);
+
+   if ((p->ptr == 0) ||
+   (p->ptr + n >= CHACHA20_BLOCK_SIZE)) {
+   extract_crng(p->buf);
+   p->ptr = 0;
+   }
+   BUG_ON(n > CHACHA20_BLOCK_SIZE);
+   memcpy(buf, p->buf, n);
+   p->ptr += n;
+   put_cpu_var(batched_entropy);
+}
+
+static unsigned int get_random_int_new(void)
+{
+   int ret;
+
+   get_batched_entropy(&ret, sizeof(ret));
+   return ret;
+}
+
 /**
  * randomize_page - Generate a random, page aligned address
  * @start: The smallest acceptable address the caller will take.


Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-14 Thread Theodore Ts'o
On Wed, Dec 14, 2016 at 12:24:43PM +0800, Wei Xu wrote:
> 
> BTW, although this is a guest issue, is there anyway to view the GCE
> host kernel or qemu(if it is) version?

No, there isn't, as far as I know.

 - Ted


Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread Theodore Ts'o
On Fri, Dec 16, 2016 at 03:17:39PM -0500, George Spelvin wrote:
> > That's a nice analysis. Might one conclude from that that hsiphash is
> > not useful for our purposes? Or does it still remain useful for
> > network facing code?
> 
> I think for attacks where the threat is a DoS, it's usable.  The point
> is you only have to raise the cost to equal that of a packet flood.
> (Just like in electronic warfare, the best you can possibly do is force
> the enemy to use broadband jamming.)
> 
> Hash collision attacks just aren't that powerful.  The original PoC
> was against an application that implemented a hard limit on hash chain
> length as a DoS defense, which the attack then exploited to turn it into
> a hard DoS.

What should we do with get_random_int() and get_random_long()?  In
some cases it's being used in performance sensitive areas, and where
anti-DoS protection might be enough.  In others, maybe not so much.

If we rekeyed the secret used by get_random_int() and
get_random_long() frequently (say, every minute or every 5 minutes),
would that be sufficient for current and future users of these
interfaces?
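A minimal sketch of that rekeying scheme, for concreteness.  This is a
userspace analogue with made-up names: rand() plays the role of
get_random_bytes(), and the interval and variables are illustrative
only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define REKEY_INTERVAL 300	/* rekey every five minutes */

static uint64_t secret;		/* the get_random_{int,long}() secret */
static time_t last_rekey;	/* when we last regenerated it */

/*
 * Return the current secret, regenerating it if it is older than
 * REKEY_INTERVAL seconds.  rand() stands in for get_random_bytes().
 */
static uint64_t get_rekeyed_secret(time_t now)
{
	if (last_rekey == 0 || now - last_rekey >= REKEY_INTERVAL) {
		secret = ((uint64_t)rand() << 32) | (uint32_t)rand();
		last_rekey = now;
	}
	return secret;
}
```

The trade-off is the one discussed below: an attacker observing the
output stream only gets at most one interval's worth of material per
key, at the cost of an occasional slow call when the rekey fires.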

- Ted

P.S.  I'll note that my performance figures when testing changes to
get_random_int() were done on a 32-bit x86; Jason, I'm guessing your
figures were using a 64-bit x86 system?  I haven't tried 32-bit ARM
or smaller CPU's (e.g., mips, et al.) that might be more likely to be
used on IoT devices, but I'm worried about those too, of course.



Re: [kernel-hardening] Re: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-17 Thread Theodore Ts'o
On Fri, Dec 16, 2016 at 09:15:03PM -0500, George Spelvin wrote:
> >> - Ted, Andy Lutorminski and I will try to figure out a construction of
> >>   get_random_long() that we all like.

We don't have to find the most optimal solution right away; we can
approach this incrementally, after all.

So long as we replace get_random_{long,int}() with something which is
(a) strictly better in terms of security given today's use of MD5, and
(b) strictly *faster* than the current construction on 32-bit and
64-bit systems, we can do that, and can then try to make it faster
while maintaining some minimum level of security which is sufficient
for all current users of get_random_{long,int}() and which can be
clearly articulated for future users of get_random_{long,int}().

The main worry at this point I have is benchmarking siphash on a
32-bit system.  It may be that simply batching the chacha20 output so
that we're using the urandom construction more efficiently is the
better way to go, since that *does* meet the criterion of strictly
more secure and strictly faster than the current MD5 solution.  I'm
open to using siphash, but I want to see the 32-bit numbers first.

As far as half-siphash is concerned, it occurs to me that the main
problem will be those users who need to guarantee that output can't be
guessed over a long period of time.  For example, if you have a
long-running process, then the output needs to remain unguessable over
potentially months or years, or else you might be weakening the ASLR
protections.  If on the other hand, the hash table or the process will
be going away in a matter of seconds or minutes, the requirements with
respect to cryptographic strength go down significantly.

Now, maybe this doesn't matter that much if we can guarantee (or make
assumptions) that the attacker doesn't have unlimited access to the
output stream of get_random_{long,int}(), or if it's being used in an
anti-DoS use case where it ultimately only needs to be harder than
alternate ways of attacking the system.

Rekeying every five minutes doesn't necessarily help with respect to
ASLR, but it might reduce the amount of the output stream that would
be available to the attacker in order to attack the
get_random_{long,int}() generator, and it also reduces the value of
that attack to only compromising the ASLR for those processes started
within that five minute window.

Cheers,

- Ted

P.S.  I'm using ASLR as an example use case, above; of course we will
need to make similar examinations of the other uses of
get_random_{long,int}().

P.P.S.  We might also want to think about potentially defining
get_random_{long,int}() to be unambiguously strong, and then creating
a get_weak_random_{long,int}() which on platforms where performance
might be a consideration, it uses a weaker algorithm perhaps with some
kind of rekeying interval.


Re: HalfSipHash Acceptable Usage

2016-12-20 Thread Theodore Ts'o
On Mon, Dec 19, 2016 at 06:32:44PM +0100, Jason A. Donenfeld wrote:
> 1) Anything that requires actual long-term security will use
> SipHash2-4, with the 64-bit output and the 128-bit key. This includes
> things like TCP sequence numbers. This seems pretty uncontroversial to
> me. Seem okay to you?

Um, why do TCP sequence numbers need long-term security?  So long as
you rekey every 5 minutes or so, TCP sequence numbers don't need any
more security than that, since even if you break the key used to
generate initial sequence numbers a minute or two later, any pending
TCP connections will have timed out long before.

See the security analysis done in RFC 6528[1], where among other
things, it points out why MD5 is acceptable with periodic rekeying,
although there is the concern that this could break certain heuristics
used when establishing new connections during the TIME-WAIT state.

[1] https://tools.ietf.org/html/rfc6528
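As a sketch of the RFC 6528 construction being cited --- ISN = M +
F(localip, localport, remoteip, remoteport, secretkey), where M is a
clock ticking every 4 microseconds --- here is a userspace toy.  The
PRF here is splitmix64 as an arbitrary stand-in (RFC 6528 itself
discusses MD5), and the helper names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* splitmix64 mixer, standing in for the keyed PRF F of RFC 6528 */
static uint32_t prf_stand_in(uint64_t x)
{
	x += 0x9e3779b97f4a7c15ULL;
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
	x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
	return (uint32_t)(x ^ (x >> 31));
}

/*
 * RFC 6528: ISN = M + F(localip, localport, remoteip, remoteport, key),
 * where M is a clock that ticks once every 4 microseconds.  The secret
 * offset F differs per connection 4-tuple, but for a given 4-tuple the
 * ISN still advances monotonically with the clock.
 */
static uint32_t tcp_isn(uint32_t usec_clock,
			uint32_t laddr, uint16_t lport,
			uint32_t raddr, uint16_t rport,
			uint64_t secret)
{
	uint64_t addrs = ((uint64_t)laddr << 32) | raddr;
	uint64_t ports = ((uint64_t)lport << 16) | rport;
	uint32_t f = prf_stand_in(addrs ^ (ports * 0x100000001ULL) ^ secret);

	return usec_clock / 4 + f;	/* M + F(...) */
}
```

The point of the periodic rekeying discussion above is that `secret`
only has to survive attack for roughly the lifetime of a pending
connection, not indefinitely.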

- Ted


Re: HalfSipHash Acceptable Usage

2016-12-21 Thread Theodore Ts'o
On Wed, Dec 21, 2016 at 01:37:51PM -0500, George Spelvin wrote:
> SipHash annihilates the competition on 64-bit superscalar hardware.
> SipHash dominates the field on 64-bit in-order hardware.
> SipHash wins easily on 32-bit hardware *with enough registers*.
> On register-starved 32-bit machines, it really struggles.

And "with enough registers" includes ARM and MIPS, right?  So the only
real problem is 32-bit x86, and you're right, at that point, only
people who might care are people who are using a space-radiation
hardened 386 --- and they're not likely to be doing high throughput
TCP connections.  :-)

- Ted



Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-21 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 03:49:39AM +0100, Jason A. Donenfeld wrote:
> 
> Funny -- while you guys were sending this back & forth, I was writing
> my reply to Andy which essentially arrives at the same conclusion.
> Given that we're all arriving to the same thing, and that Ted shot in
> this direction long before we all did, I'm leaning toward abandoning
> SipHash for the de-MD5-ification of get_random_int/long, and working
> on polishing Ted's idea into something shiny for this patchset.

here are my numbers comparing siphash (using the first three patches
of the v7 siphash patches) with my batched chacha20 implementation.
The results are taken by running get_random_* 1 times, and then
dividing the numbers by 1 to get the average number of cycles for
the call.  I compiled 32-bit and 64-bit kernels, and ran the results
using kvm:

                    siphash                     batched chacha20
          get_random_int  get_random_long   get_random_int  get_random_long

32-bit         270             278               114             146
64-bit          75              75               106             186

> I did have two objections to this. The first was that my SipHash
> construction is faster.

Well, it's faster on everything except 32-bit x86.  :-P

> The second, and the more
> important one, was that batching entropy up like this means that 32
> calls will be really fast, and then the 33rd will be slow, since it
> has to do a whole ChaCha round, because get_random_bytes must be
> called to refill the batch.

... and this will take 2121 cycles on 64-bit x86, and 2315 cycles on a
32-bit x86.  Which on a 2.3 GHz processor, is just under a
microsecond.  As far as being inconsistent on process startup, I very
much doubt a microsecond is really going to be visible to the user.  :-)

The bottom line is that I think we're really "pixel peeping" at this
point --- which is what obsessed digital photographers will do when
debating the quality of a Canon vs Nikon DSLR by blowing up a photo by
a thousand times, and then trying to claim that this is visible to the
human eye.  Or people who obsess over the frequency response curves
of TH-X00 headphones with Mahogany vs Purpleheart wood, when it's
likely that in a blind head-to-head comparison, most people wouldn't
be able to tell the difference.

I think the main argument for using the batched getrandom approach is
that it is, I would argue, simpler than introducing siphash into the
picture.  On 64-bit platforms it is faster and more consistent, so
it's basically that versus the complexity of adding siphash to the
things that people have to analyze when considering random number
security on Linux.   But it's a close call either way, I think.

  - Ted

P.S.  My benchmarking code

diff --git a/drivers/char/random.c b/drivers/char/random.c
index a51f0ff43f00..41860864b775 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1682,6 +1682,55 @@ static int rand_initialize(void)
 }
 early_initcall(rand_initialize);
 
+static unsigned int get_random_int_new(void);
+static unsigned long get_random_long_new(void);
+
+#define NUM_CYCLES 1
+#define AVG(finish, start) ((unsigned int)(finish - start + NUM_CYCLES/2) / NUM_CYCLES)
+
+static int rand_benchmark(void)
+{
+   cycles_t start,finish;
+   int i, out;
+
+   pr_crit("random benchmark!!\n");
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_int();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_int_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_int_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_long();
+   }
+   finish = get_cycles();
+   pr_err("get_random_long # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_long_new();
+   }
+   finish = get_cycles();
+   pr_err("get_random_long_new (batched chacha20) # cycles: %u\n", AVG(finish, start));
+
+   start = get_cycles();
+   for (i = 0; i < NUM_CYCLES; i++) {
+   get_random_bytes(&out, sizeof(out));
+   }
+   finish = get_cycles();
+   pr_err("get_random_bytes # cycles: %u\n", AVG(finish, start));
+   return 0;
+}
+device_initcall(rand_benchmark);
+
 #ifdef CONFIG_BLOCK
 void rand_initialize_disk(struct gendisk *disk)
 {
@@ -2064,8 +2113,10 @@ unsigned int get_random_int(void)
unsigned int ret;
u64 *chaining;
 
+#if 0  // force slow path
if (arch_get_random_int(&ret))
return ret;
+#endif
 
chaining = &get_cpu_var(get_random_int_chaining);
ret = 

Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-22 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 02:10:33PM +0100, Jason A. Donenfeld wrote:
> On Thu, Dec 22, 2016 at 1:47 PM, Hannes Frederic Sowa
>  wrote:
> > following up on what appears to be a random subject: ;)
> >
> > IIRC, ext4 code by default still uses half_md4 for hashing of filenames
> > in the htree. siphash seems to fit this use case pretty good.
> 
> I saw this too. I'll try to address it in v8 of this series.

This is a separate issue, and this series is getting a bit too
complex.  So I'd suggest pushing this off to a separate change.

Changing the htree hash algorithm is an on-disk format change, and so
we couldn't roll it out until e2fsprogs gets updated and rolled out
pretty broadly.  In fact George sent me patches to add siphash as a
hash algorithm for htree a while back (for both the kernel and
e2fsprogs), but I never got around to testing and applying them,
mainly because while it's technically faster, I had other higher
priority issues to work on --- and see previous comments regarding
pixel peeping.  Improving the hash algorithm by tens or even hundreds
of nanoseconds isn't really going to matter since we only do an htree
lookup on a file creation or cold cache lookup, and the SSD or HDD I/O
times will dominate.  And from the power perspective, saving
microwatts of CPU power isn't going to matter if you're going to be
spinning up the storage device.

- Ted


Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-22 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 07:03:29AM +0100, Jason A. Donenfeld wrote:
> I find this compelling. We'll have one csprng for both
> get_random_int/long and for /dev/urandom, and we'll be able to update
> that silly warning on the comment of get_random_int/long to read
> "gives output of either rdrand quality or of /dev/urandom quality",
> which makes it more useful for other things. It introduces less error
> prone code, and it lets the RNG analysis be spent on just one RNG, not
> two.
> 
> So, with your blessing, I'm going to move ahead with implementing a
> pretty version of this for v8.

Can we do this as a separate series, please?  At this point, it's a
completely separate change from a logical perspective, and we can take
in the change through the random.git tree.

Changes that touch files that are normally changed in several
different git trees leads to the potential for merge conflicts during
the linux-next integration and merge window processes.  Which is why
it's generally best to try to isolate changes as much as possible.

Cheers,

- Ted


Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-22 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 05:16:47PM +0100, Jason A. Donenfeld wrote:
> Could you offer a bit of advice on how to manage dependencies between
> patchsets during merge windows? I'm a bit new to the process.
> 
> Specifically, we how have 4 parts:
> 1. add siphash, and use it for some networking code. to: david miller's 
> net-next

I'd do this first, as one set.  Adding a new file to crypto is
unlikely to cause merge conflicts.

> 2. convert char/random to use siphash. to: ted ts'o's random-next

I'm confused, I thought you had agreed to the batched chacha20
approach?

> 3. move lib/md5.c to static function in crypto/md5.c, remove entry
> inside of linux/cryptohash.h. to: ??'s ??-next

This is cleanup, so it doesn't matter that much when it happens.  md5
changes to crypto is also unlikely to cause conflicts, so we could do
this at the same time as (2), if Herbert (the crypto maintainer) agrees.

> 4. move lib/halfmd4.c to static function in fs/ext/hash.c, remove
> entry inside of linux/cryptohash.c. to: td ts'o's ext-next

This is definitely separate.

One more thing.  Can you add some test cases to lib/siphash.h?
Triggered off of a CONFIG_SIPHASH_REGRESSION_TEST config flag, with
some test inputs and known outputs?  I'm going to need to add a
version of siphash to e2fsprogs, and I want to make sure the userspace
version is implementing the same algorithm as the kernel siphash.
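(As an illustration of the kind of known-answer test being requested, here is a pure-Python SipHash-2-4 sketch checked against the first published reference vector. The function names are invented for this sketch; the kernel's implementation lives in lib/siphash.c.)

```python
M = 0xFFFFFFFFFFFFFFFF  # 64-bit mask


def _rotl(x, b):
    # 64-bit left rotate
    return ((x << b) | (x >> (64 - b))) & M


def siphash24(key: bytes, data: bytes) -> int:
    """SipHash-2-4 of `data` under a 128-bit `key`, as a 64-bit int."""
    assert len(key) == 16
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:], "little")
    # initialization constants: "somepseudorandomlygeneratedbytes"
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & M; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & M; v3 = _rotl(v3, 16); v3 ^= v2
        v0 = (v0 + v3) & M; v3 = _rotl(v3, 21); v3 ^= v0
        v2 = (v2 + v1) & M; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)

    # compress full 8-byte words
    i = 0
    while i + 8 <= len(data):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        sipround(); sipround()
        v0 ^= m
        i += 8
    # final word: leftover bytes plus the message length in the top byte
    m = ((len(data) & 0xFF) << 56) | int.from_bytes(data[i:], "little")
    v3 ^= m
    sipround(); sipround()
    v0 ^= m
    # finalization
    v2 ^= 0xFF
    for _ in range(4):
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

A userspace port can then be checked against the same known-answer vectors as the kernel version, which is exactly the point of the regression-test request above.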

  - Ted


Re: [kernel-hardening] Re: [PATCH v7 3/6] random: use SipHash in place of MD5

2016-12-22 Thread Theodore Ts'o
On Thu, Dec 22, 2016 at 07:08:37PM +0100, Hannes Frederic Sowa wrote:
> I wasn't concerned about performance but more about DoS resilience. I
> wonder how safe half md4 actually is in terms of allowing users to
> generate long hash chains in the filesystem (in terms of length
> extension attacks against half_md4).
> 
> In ext4, is it actually possible that a "disrupter" learns about the
> hashing secret in the way how the inodes are returned during getdents?

They'd have to be a local user, who can execute telldir(3) --- in
which case there are plenty of other denial of service attacks one
could carry out that would be far more devastating.

It might also be an issue if the file system is exposed via NFS, but
again, there are so many other ways an attacker could DoS a NFS server
that I don't think of it as much of a concern.

Keep in mind that the worst someone can do is cause directory inserts to
fail with an ENOSPC, and there are plenty of other ways of doing that
--- such as consuming all of the blocks and inodes in the file system,
for example.

So it's a threat, but not a high priority one as far as I'm concerned.
And if this was a problem in actual practice, users could switch to
the TEA based hash, which should be far harder to attack, and
available today.

- Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 09:50:14AM -0500, David Miller wrote:
> From: Alan Cox 
> Date: Wed, 29 Nov 2017 13:46:12 +
> 
> > I really don't care what the module loading rules end up with and
> > whether we add CAP_SYS_YET_ANOTHER_MEANINGLESS_FLAG but what is
> > actually needed is to properly incorporate it into securiy ruiles
> > for whatever LSM you are using.
> 
> I'm surprised we're not using the SHA1 hashes or whatever we compute
> for the modules to make sure we are loading the foo.ko that we expect
> to be.

We do have signed modules.  But this won't help us if the user is
using a distro kernel which has compiled some module which is known to
be unmaintained, and which everyone in the know *expects* to have 0-day
bugs, such as DCCP.  That's because the DCCP module is signed.

We could fix this by adding to the signature used for module signing
to include the module name, so that the bad guy can't rename dccp.ko
to be ppp.ko, I suppose.

> All of this capability stuff seems to dance a circle around the
> problem rather than fix it.

Half the problem here is that with containers, people are changing the
security model, because they want to let untrusted users have "root",
without really having "root".  Part of the fundamental problem is that
there are some well-meaning, but fundamentally misguided people, who
have been asserting: "Containers are just as secure as VM's".

Well, they are not.  And the sooner people get past this, the better
off they'll be.

- Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 10:58:16AM -0500, David Miller wrote:
> That's not what we're talking about.
> 
> We're talking about making sure that loading "ppp.ko" really gets
> ppp.ko rather than some_other_module.ko renamed to ppp.ko via some
> other mechanism.

Right, and the best solution to this problem is to include in the
signature the name of the module.  (One way of doing this is adding
the module name into the cryptographic checksum which is then
digitally signed by the kernel module key; we would have to bump the
version number of the signature format, obviously.)  This binds the
name of the module into the digital signature so that it can't be
renamed.
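(The idea can be sketched in a few lines. This is an illustrative user-space model only: an HMAC stands in for the kernel's asymmetric module signature, and the key and function names are invented for the sketch.)

```python
import hashlib
import hmac

SECRET = b"illustrative-signing-key"  # stand-in for the real signing key


def sign_module(name: str, payload: bytes) -> bytes:
    # Bind the module *name* into the signed digest, so renaming
    # dccp.ko to ppp.ko invalidates the signature.
    digest = hashlib.sha256(name.encode() + b"\0" + payload).digest()
    return hmac.new(SECRET, digest, hashlib.sha256).digest()


def verify_module(name: str, payload: bytes, sig: bytes) -> bool:
    # Verification recomputes the name-bound digest, so the same
    # payload presented under a different name fails the check.
    return hmac.compare_digest(sign_module(name, payload), sig)
```

The essential property is that the name participates in the digest, not just the module contents.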

- Ted


Re: [PATCH v5 next 1/5] modules:capabilities: add request_module_cap()

2017-11-29 Thread Theodore Ts'o
On Wed, Nov 29, 2017 at 11:28:52AM -0600, Serge E. Hallyn wrote:
> 
> Just to be clear, module loading requires - and must always continue to
> require - CAP_SYS_MODULE against the initial user namespace.  Containers
> in user namespaces do not have that.
> 
> I don't believe anyone has ever claimed that containers which are not in
> a user namespace are in any way secure.

Unless the container performs some action which causes the kernel to
call request_module(), which then loads some kernel module,
potentially containing cr*p unmaintained code which was included when
the distro compiled the world, into the host kernel.

This is an attack vector that doesn't exist if you are using VM's.
And in general, the attack surface of the entire Linux
kernel<->userspace API is far larger than that which is exposed by the
guest<->host interface.

For that reason, containers are *far* more insecure than VM's, since
once the attacker gets root on the guest VM, they then have to attack
the hypervisor interface.  And if you compare the attack surface of
the two, it's pretty clear which is larger, and it's not the
hypervisor interface.

- Ted


Re: [PATCH v2 0/6] kunit: Fix formatting of KUNIT tests to meet the standard

2021-04-18 Thread Theodore Ts'o
On Wed, Apr 14, 2021 at 04:58:03AM -0400, Nico Pache wrote:
> There are few instances of KUNIT tests that are not properly defined.
> This commit focuses on correcting these issues to match the standard
> defined in the Documentation.

The word "standard" seems to be over-stating things.  The
documentation currently states, "they _usually_ have config options
ending in ``_KUNIT_TEST'' (emphasis mine).  I can imagine that there
might be some useful things we can do from a tooling perspective if we
do standardize things, but if you really want to make it a "standard",
we should first update the manpage to say so, and explain why (e.g.,
so that we can easily extract out all of the kunit test modules, and
perhaps paint a vision of what tools might be able to do with such a
standard).

Alternatively, the word "standard" could perhaps be changed to
"convention", which I think more accurately defines how things work at
the moment.

> Nico Pache (6):
>   kunit: ASoC: topology: adhear to KUNIT formatting standard
>   kunit: software node: adhear to KUNIT formatting standard
>   kunit: ext4: adhear to KUNIT formatting standard
>   kunit: lib: adhear to KUNIT formatting standard
>   kunit: mptcp: adhear to KUNIT formatting standard
>   m68k: update configs to match the proper KUNIT syntax

Also, "adhear" is not the correct spelling; the correct spelling is
"adhere" (from the Latin verb "adhaerere", "to stick", as in "to hold
fast or stick by as if by gluing", which then became "to bind oneself
to the observance of a set of rules or standards or practices").

 - Ted


Re: [PATCH v2 3/6] kunit: ext4: adhear to KUNIT formatting standard

2021-04-18 Thread Theodore Ts'o
On Wed, Apr 14, 2021 at 04:58:06AM -0400, Nico Pache wrote:
> Drop 'S' from end of CONFIG_EXT4_KUNIT_TESTS inorder to adhear to the KUNIT
> *_KUNIT_TEST config name format.

Another spelling nit, that should be "in order".

This will break people who have existing .kunitconfig files, but if we
are going to make this a standard (but please document it as a
standard first!), might as well do it sooner rather than later.

Cheers,

- Ted


Maintainers / Kernel Summit 2021 planning kick-off

2021-04-19 Thread Theodore Ts'o
[ Feel free to forward this to other Linux kernel mailing lists as
  appropriate -- Ted ]

This year, the Maintainers and Kernel Summit is currently planned to
be held in Dublin, Ireland, September 27 -- 29th.  Of course, this is
subject to change depending on how much progress the world makes
towards vaccinating the population against the COVID-19 virus, and
whether employers are approving conference travel.  At this point,
there's a fairly good chance that we will need to move to a virtual
conference format, either for one or both of the summits.

As in previous years, the Maintainers Summit is invite-only, where the
primary focus will be process issues around Linux Kernel Development.
It will be limited to 30 invitees and a handful of sponsored
attendees.

The Kernel Summit is organized as a track which is run in parallel
with the other tracks at the Linux Plumbers Conference (LPC), and is
open to all registered attendees of LPC.

Linus has generated a core list of people to be invited to the
Maintainers Summit, and the program committee will be using that list
as a starting point for people to be considered.  People who suggest
topics that should be discussed at the Maintainers Summit will also be
added to the list for consideration.  To make topic suggestions for
the Maintainers Summit, please send e-mail to the
ksum...@lists.linux.dev with a subject prefix of [MAINTAINERS SUMMIT].

(Note: The older ksummit-disc...@lists.linuxfoundation.org list has
been migrated to lists.linux.dev, with the subscriber list and
archives preserved.)

The other job of the program committee will be to organize the program
for the Kernel Summit.  The goal of the Kernel Summit track will be to
provide a forum to discuss specific technical issues that would be
easier to resolve in person than over e-mail.  The program committee
will also consider "information sharing" topics if they are clearly of
interest to the wider development community (i.e., advanced training
in topics that would be useful to kernel developers).

To suggest a topic for the Kernel Summit, please do two things.
First, please tag your e-mail with [TECH TOPIC].  As before, please
use a separate e-mail for each topic, and send the topic suggestions
to the ksummit-discuss list.

Secondly, please create a topic at the Linux Plumbers Conference
proposal submission site and target it to the Kernel Summit track.
For your convenience you can use:

https://bit.ly/lpc21-summit

Please do both steps.  I'll try to notice if someone forgets one or
the other, but your chances of making sure your proposal gets the
necessary attention and consideration are maximized by submitting both
to the mailing list and the web site.

People who submit topic suggestions before June 12th and which are
accepted, will be given free admission to the Linux Plumbers
Conference.

We will be reserving roughly half of the Kernel Summit slots for
last-minute discussions that will be scheduled during the week of
Plumbers, in an "unconference style".  This allows last-minute ideas
that come up to be given slots for discussion.

If you were not subscribed to the kernel@lists.linux-dev mailing
list from last year (or if you had removed yourself from the
ksummit-disc...@lists.linux-foundation.org mailing list after the
previous year's kernel and maintainers' summits), you can
subscribe by sending an e-mail to:

ksummit+subscr...@lists.linux.dev

The mailing list archive is available at:

https://lore.kernel.org/ksummit

The program committee this year is composed of the following people:

Jens Axboe
Arnd Bergmann
Jon Corbet
Greg Kroah-Hartman
Ted Ts'o