Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-13 Thread Rasmus Villemoes
On 11/03/2021 19.02, Linus Torvalds wrote:
> On Wed, Mar 10, 2021 at 5:45 PM Rasmus Villemoes
>  wrote:
>>
>> Hm, gcc does elide the test of the return value, but jumps back to a
>> place where it always loads state from its memory location and does the
>> whole switch(). To get it to jump directly to the code implementing the
>> various do_* helpers it seems one needs to avoid that global variable
>> and instead return the next state explicitly. The below boots, but I
>> still can't see any measurable improvement on ppc.
> 
> Ok. That's definitely the right way to do efficient statemachines that
> the compiler can actually generate ok code for, but if you can't
> measure the difference I guess it isn't even worth doing.

Just for good measure, I now got around to test on x86 as well, where I
thought the speculation stuff might make a difference. However, the
indirect calls through the actions[] array don't actually hurt due to
__noinitretpoline, and even removing that from the __init definition, I
only see about 1.5% difference with that state machine patch applied.

So it doesn't seem worth pursuing. I'll send v3 of the async patches
shortly.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-11 Thread Linus Torvalds
On Wed, Mar 10, 2021 at 5:45 PM Rasmus Villemoes
 wrote:
>
> Hm, gcc does elide the test of the return value, but jumps back to a
> place where it always loads state from its memory location and does the
> whole switch(). To get it to jump directly to the code implementing the
> various do_* helpers it seems one needs to avoid that global variable
> and instead return the next state explicitly. The below boots, but I
> still can't see any measurable improvement on ppc.

Ok. That's definitely the right way to do efficient statemachines that
the compiler can actually generate ok code for, but if you can't
measure the difference I guess it isn't even worth doing.

Thanks for checking, though.

Linus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-11 Thread Luis Chamberlain
On Tue, Mar 09, 2021 at 02:07:36PM -0800, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
> >
> > So add an initramfs_async= kernel parameter, allowing the main init
> > process to proceed to handling device_initcall()s without waiting for
> > populate_rootfs() to finish.
> 
> I like this smaller second version of the patch, but am wondering why
> we even need the parameter.
> 
> It sounds mostly like a "maybe I didn't think of all cases" thing -
> and one that will mean that this code will not see a lot of actual
> test coverage..
> 
> And because of the lack of test coverage, I'd rather reverse the
> meaning, and have the async case on by default (without even the
> Kconfig option), and have the kernel command line purely as a "oops,
> it's buggy, easy to ask people to test if this is what ails them".

If we're going to set this as default it might be good to document on
init.h that components that need content in initramfs need the wait
call.

> What *can* happen early boot outside of firmware loading and usermodehelpers?

*In practice* the only thing I can think of at this time is races with
other rootfs_initcall() calls, granted ordering among these is done at
linker time, but I can't think of a issue with them:

arch/x86/kernel/pci-dma.c:rootfs_initcall(pci_iommu_init);
drivers/iommu/intel/irq_remapping.c:rootfs_initcall(ir_dev_scope_init);
drivers/mfd/sta2x11-mfd.c:rootfs_initcall(sta2x11_mfd_init);
drivers/thunderbolt/nhi.c:rootfs_initcall(nhi_init);

But Cc'ing the maintainers of these components just in case.

  Luis


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-10 Thread Rasmus Villemoes
On 11/03/2021 01.17, Rasmus Villemoes wrote:
> On 09/03/2021 23.16, Linus Torvalds wrote:
>> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>>  wrote:
>>>
>>> So add an initramfs_async= kernel parameter, allowing the main init
>>> process to proceed to handling device_initcall()s without waiting for
>>> populate_rootfs() to finish.
>>
>> Oh, and a completely unrelated second comment about this: some of the
>> initramfs population code seems to be actively written to be slow.
>>
>> For example, I'm not sure why that write_buffer() function uses an
>> array of indirect function pointer actions. Even ignoring the whole
>> "speculation protections make that really slow" issue that came later,
>> it seems to always have been actively (a) slower and (b) more complex.
>>
>> The normal way to do that is with a simple switch() statement, which
>> makes the compiler able to do a much better job. Not just for the
>> state selector - maybe it picks that function pointer approach, but
>> probably these days just direct comparisons - but simply to do things
>> like inline all those "it's used in one place" cases entirely. In
>> fact, at that point, a lot of the state machine changes may end up
>> turning into pure goto's - compilers are sometimes good at that
>> (because state machines have often been very timing-critical).
> 
> FWIW, I tried doing
> 

Hm, gcc does elide the test of the return value, but jumps back to a
place where it always loads state from its memory location and does the
whole switch(). To get it to jump directly to the code implementing the
various do_* helpers it seems one needs to avoid that global variable
and instead return the next state explicitly. The below boots, but I
still can't see any measurable improvement on ppc.

Rasmus

Subject: [PATCH] init/initramfs.c: change state machine implementation

Instead of having write_buffer() rely on the global variable "state",
have each of the do_* helpers return the next state, or the new token
Stop. Also, instead of an array of function pointers, use a switch
statement.

This means all the do_* helpers end up inlined into write_buffer(),
and all the places which return a compile-time constant next state now
compile to a direct jump to that code.

We still need the global variable state for the initial choice within
write_buffer, and we also need to preserve the last non-Stop state
across calls.
---
 init/initramfs.c | 90 
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 1d0fdd05e5e9..ad7e04393acb 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -189,7 +189,8 @@ static __initdata enum state {
GotName,
CopyFile,
GotSymlink,
-   Reset
+   Reset,
+   Stop
 } state, next_state;

 static __initdata char *victim;
@@ -207,17 +208,17 @@ static __initdata char *collected;
 static long remains __initdata;
 static __initdata char *collect;

-static void __init read_into(char *buf, unsigned size, enum state next)
+static int __init read_into(char *buf, unsigned size, enum state next)
 {
if (byte_count >= size) {
collected = victim;
eat(size);
-   state = next;
+   return next;
} else {
collect = collected = buf;
remains = size;
next_state = next;
-   state = Collect;
+   return Collect;
}
 }

@@ -225,8 +226,7 @@ static __initdata char *header_buf, *symlink_buf,
*name_buf;

 static int __init do_start(void)
 {
-   read_into(header_buf, 110, GotHeader);
-   return 0;
+   return read_into(header_buf, 110, GotHeader);
 }

 static int __init do_collect(void)
@@ -238,50 +238,46 @@ static int __init do_collect(void)
eat(n);
collect += n;
if ((remains -= n) != 0)
-   return 1;
-   state = next_state;
-   return 0;
+   return Stop;
+   return next_state;
 }

 static int __init do_header(void)
 {
if (memcmp(collected, "070707", 6)==0) {
error("incorrect cpio method used: use -H newc option");
-   return 1;
+   return Stop;
}
if (memcmp(collected, "070701", 6)) {
error("no cpio magic");
-   return 1;
+   return Stop;
}
parse_header(collected);
next_header = this_header + N_ALIGN(name_len) + body_len;
next_header = (next_header + 3) & ~3;
-   state = SkipIt;
if (name_len <= 0 || name_len > PATH_MAX)
-   return 0;
+   return SkipIt;
if (S_ISLNK(mode)) {
if (body_len > PATH_MAX)
-   return 0;
+   return SkipIt;
collect = collected = symlink_buf;
remains = N_ALIGN(name_len) + body_len;
next_state = GotSymlink;
-   

Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-10 Thread Rasmus Villemoes
On 09/03/2021 23.16, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> Oh, and a completely unrelated second comment about this: some of the
> initramfs population code seems to be actively written to be slow.
> 
> For example, I'm not sure why that write_buffer() function uses an
> array of indirect function pointer actions. Even ignoring the whole
> "speculation protections make that really slow" issue that came later,
> it seems to always have been actively (a) slower and (b) more complex.
> 
> The normal way to do that is with a simple switch() statement, which
> makes the compiler able to do a much better job. Not just for the
> state selector - maybe it picks that function pointer approach, but
> probably these days just direct comparisons - but simply to do things
> like inline all those "it's used in one place" cases entirely. In
> fact, at that point, a lot of the state machine changes may end up
> turning into pure goto's - compilers are sometimes good at that
> (because state machines have often been very timing-critical).

FWIW, I tried doing

--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -401,24 +401,26 @@ static int __init do_symlink(void)
return 0;
 }

-static __initdata int (*actions[])(void) = {
-   [Start] = do_start,
-   [Collect]   = do_collect,
-   [GotHeader] = do_header,
-   [SkipIt]= do_skip,
-   [GotName]   = do_name,
-   [CopyFile]  = do_copy,
-   [GotSymlink]= do_symlink,
-   [Reset] = do_reset,
-};
-
 static long __init write_buffer(char *buf, unsigned long len)
 {
+   int ret;
+
byte_count = len;
victim = buf;

-   while (!actions[state]())
-   ;
+   do {
+   switch (state) {
+   case Start: ret = do_start(); break;
+   case Collect: ret = do_collect(); break;
+   case GotHeader: ret = do_header(); break;
+   case SkipIt: ret = do_skip(); break;
+   case GotName: ret = do_name(); break;
+   case CopyFile: ret = do_copy(); break;
+   case GotSymlink: ret = do_symlink(); break;
+   case Reset: ret = do_reset(); break;
+   }
+   } while (!ret);
+
return len - byte_count;
 }


and yes, all the do_* functions get inlined into write_buffer with some
net space saving:

add/remove: 0/9 grow/shrink: 1/0 up/down: 1696/-2112 (-416)
Function old new   delta
write_buffer 1001796   +1696
actions   32   - -32
do_start  52   - -52
do_reset 112   --112
do_skip  128   --128
do_collect   144   --144
do_symlink   172   --172
do_copy  304   --304
do_header572   --572
do_name  596   --596
Total: Before=5360, After=4944, chg -7.76%

(ppc32 build). But the unpacking still takes the same time. It might be
different on x86.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Rasmus Villemoes
On 09/03/2021 23.16, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> Oh, and a completely unrelated second comment about this: some of the
> initramfs population code seems to be actively written to be slow.
> 
> For example, I'm not sure why that write_buffer() function uses an
> array of indirect function pointer actions. Even ignoring the whole
> "speculation protections make that really slow" issue that came later,
> it seems to always have been actively (a) slower and (b) more complex.
> 
[...]
> Is that likely to be a big part of the costs? No. I assume it's the
> decompression and the actual VFS operations. 

Yes, I have been doing some simple measurements, simply by decompressing
the blob in userspace and comparing to the time to that used by
populate_rootfs(). For both the 6M lz4-compressed blob on my ppc target
and the 26M xz-compressed blob on my laptop, the result is that the
decompression itself accounts for the vast majority of the time - and
for ppc in particular, I don't think there's any spectre slowdown.

So I haven't dared looking into changing the unpack implementation since
it doesn't seem it could buy that much.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Rasmus Villemoes
On 09/03/2021 23.07, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> I like this smaller second version of the patch, but am wondering why
> we even need the parameter.
> 
> It sounds mostly like a "maybe I didn't think of all cases" thing -

That's exactly what it is.

> and one that will mean that this code will not see a lot of actual
> test coverage..

Yeah, that's probably true.

> And because of the lack of test coverage, I'd rather reverse the
> meaning, and have the async case on by default (without even the
> Kconfig option), and have the kernel command line purely as a "oops,
> it's buggy, easy to ask people to test if this is what ails them".

Well, I wasn't bold enough to make it "default y" by myself, but I can
certainly do that and nuke the config option.

> What *can* happen early boot outside of firmware loading and usermodehelpers?

Well, that was what I tried to get people to tell me when I sent the
first version as RFC, and also before that
(https://lore.kernel.org/lkml/19574912-44b4-c1dc-44c3-67309968d...@rasmusvillemoes.dk/).
That you can't think of anything suggests that I have covered the
important cases - which does leave random drivers that poke around the
filesystem on their own, but (a) it would probably be a good thing to
have this flush those out and (b) there's the command line option to
make it boot anyway.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Linus Torvalds
On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
 wrote:
>
> So add an initramfs_async= kernel parameter, allowing the main init
> process to proceed to handling device_initcall()s without waiting for
> populate_rootfs() to finish.

Oh, and a completely unrelated second comment about this: some of the
initramfs population code seems to be actively written to be slow.

For example, I'm not sure why that write_buffer() function uses an
array of indirect function pointer actions. Even ignoring the whole
"speculation protections make that really slow" issue that came later,
it seems to always have been actively (a) slower and (b) more complex.

The normal way to do that is with a simple switch() statement, which
makes the compiler able to do a much better job. Not just for the
state selector - maybe it picks that function pointer approach, but
probably these days just direct comparisons - but simply to do things
like inline all those "it's used in one place" cases entirely. In
fact, at that point, a lot of the state machine changes may end up
turning into pure goto's - compilers are sometimes good at that
(because state machines have often been very timing-critical).

Is that likely to be a big part of the costs? No. I assume it's the
decompression and the actual VFS operations. But when the series is
about how this all takes a long time, and I go "that code really looks
actively pessimally written", maybe it would be a good thing to look
into it?

   Linus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Linus Torvalds
On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
 wrote:
>
> So add an initramfs_async= kernel parameter, allowing the main init
> process to proceed to handling device_initcall()s without waiting for
> populate_rootfs() to finish.

I like this smaller second version of the patch, but am wondering why
we even need the parameter.

It sounds mostly like a "maybe I didn't think of all cases" thing -
and one that will mean that this code will not see a lot of actual
test coverage..

And because of the lack of test coverage, I'd rather reverse the
meaning, and have the async case on by default (without even the
Kconfig option), and have the kernel command line purely as a "oops,
it's buggy, easy to ask people to test if this is what ails them".

What *can* happen early boot outside of firmware loading and usermodehelpers?

Hmm?

  Linus