On Fri, Oct 17, 2025 at 09:44:38AM +0200, Anton Khirnov wrote:
> Hi Olly,
> Quoting Olly Betts (2025-10-13 03:57:33)
> > FWIW, I think the obstacles for being able to reuse this in notmuch
> > are:
> >
> > * The code you'd want to use isn't yet in a stable release version.
> > * The indexer side of the pipe communication would need to be refactored
> > into a library to allow clean external use. It's in separate files
> > and the interface is not complex, but it is currently a
> > project-internal API. Alternatively we could document the protocol
> > used which might be an easier route for use from other languages.
> > * The built-in parsers currently still run in the main omindex process
> > so we'd need a worker module which wraps them.
> > * Similarly, any formats still handled by external programs are
> > currently still run by the main omindex process. The library approach
> > covers most formats now, but really we should wrap the external
> > program handling in a worker module so it runs in a separate process
> > too.
> > * There's not currently sandboxing beyond what the subprocess provides
> > (a crashing filter won't terminate the main process). We could get
> > sandboxing equivalent to what you have by just adjusting the command
> > which is run. It's hard to provide sandboxing out of the box here (as
> > you note there isn't really a portable way to) but we could provide
> > easy hooks for it and implementations for some platforms.
>
> This sounds to me like you agree that a substantial amount of work would
> be needed. And I'm probably not the person to do it, not least because I
> barely know C++ (beyond what it shares with C).
I'm not sure I'd say substantial, but it's not something you can just
pull in as a clean dependency today.
> > However doing so would lose the performance benefits because for every
> > file you need to fork() (which is surprisingly slow for a large process,
> > AIUI because all the page tables need to be copied)
>
> I do not believe this is true for any modern Unix, rather they all use
> copy-on-write pages, so fork() on its own should be very cheap.
The pages are copy-on-write, but the page *tables* for the parent
process presumably still need copying (though it seems the conclusion
that this is the problematic cost here may not be right - see below).
The key point is that fork() performance on Linux can be O(parent
process size) even with recent kernels, to an extent that can limit
indexing throughput. Testing on a machine with 32GB of RAM:
$ uname -a
Linux gemse 6.12.38+deb13-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.12.38-1
(2025-07-16) x86_64 GNU/Linux
$ cat forktest.pl
#!/usr/bin/perl -w
use strict;
use Time::HiRes;
my $MB = shift;
my $blob = 'X' x (1024*1024*$MB); # grow this process to roughly $MB megabytes
my $n = 1000;
my $t = Time::HiRes::time();
# Fork $n children which exit immediately; the parent waits for each in turn.
for (1 .. $n) { if (fork()) { wait() } else { exit } }
print "${MB}MB\t", (Time::HiRes::time() - $t) / $n * 1000, " ms/fork()\n";
$ for s in 32 64 128 256 512 1024 2048 4096 ; do ./forktest.pl $s ; done
32MB 0.855915784835815 ms/fork()
64MB 0.848785877227783 ms/fork()
128MB 0.961197137832642 ms/fork()
256MB 2.54437303543091 ms/fork()
512MB 11.1489939689636 ms/fork()
1024MB 31.718936920166 ms/fork()
2048MB 72.3293099403381 ms/fork()
4096MB 148.031877040863 ms/fork()
$ time pdftotext ~/xapian-badge.pdf
real 0m0.017s
user 0m0.009s
sys 0m0.009s
The last command shows that extracting the text from a small PDF takes
17ms, which is less than the ~32ms it takes just to fork() even a 1GB
process.
With Perl's Proc::FastSpawn (which strace shows uses vfork() on this
machine) the times don't seem to increase with process size.
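For the curious, the Proc::FastSpawn version looks something like this
(a sketch rather than the exact script I ran; /bin/true is just an
arbitrary cheap program to spawn, since spawn() has to exec something):
#!/usr/bin/perl -w
use strict;
use Time::HiRes;
use Proc::FastSpawn;
my $MB = shift;
my $blob = 'X' x (1024*1024*$MB); # grow the parent as before
my $n = 1000;
my $t = Time::HiRes::time();
for (1 .. $n) {
    # spawn() avoids duplicating the parent's address space, so its cost
    # shouldn't depend on how big $blob is.
    my $pid = spawn('/bin/true', ['true']) or die "spawn failed";
    waitpid($pid, 0);
}
print "${MB}MB\t", (Time::HiRes::time() - $t) / $n * 1000, " ms/spawn()\n";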
We ran into this on a client project back in 2016. The private bug
tracker for it is no longer running, but I still have most of the
discussion in notification emails. It turned out they had a small leak
in the indexer code, so the indexer parent process slowly grew in size
and fork() got slower and slower; fixing that leak resolved the problem.
The person who investigated hypothesised that copying the page tables
was the cost, but I don't actually see direct evidence for that
conclusion - the ticket just got closed as resolved once the leak was
spotted and fixed.
Interestingly, if I repeat the loop above I get slightly better scaling
the second time:
32MB 0.866425037384033 ms/fork()
64MB 0.808671951293945 ms/fork()
128MB 0.826326131820679 ms/fork()
256MB 0.900160789489746 ms/fork()
512MB 1.11015009880066 ms/fork()
1024MB 1.49296593666077 ms/fork()
2048MB 35.9647679328918 ms/fork()
4096MB 69.859573841095 ms/fork()
Then reversing the order of the sizes seems to show that going down in
size is even better:
$ for s in 4096 2048 1024 512 256 128 64 32 ; do ./forktest.pl $s ; done
4096MB 58.6394219398499 ms/fork()
2048MB 2.2963171005249 ms/fork()
1024MB 1.61787796020508 ms/fork()
512MB 1.25533103942871 ms/fork()
256MB 1.04365706443787 ms/fork()
128MB 0.949764013290405 ms/fork()
64MB 0.927274942398071 ms/fork()
32MB 0.91487193107605 ms/fork()
That makes me wonder if the cost is actually something like the OS
having to reclaim pages which are currently being used to cache data,
since on a repeat run the pages freed by the previous run probably
won't all have been reused for caching yet. Indexing is I/O intensive,
so it will tend to fill any otherwise unused pages with cached data.
That would seem to only explain the first of each batch of 1000 forks
being slow though. If someone can explain this effect I'd love to know
what's actually going on.
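One way to poke at that hypothesis would be to snapshot the kernel's
reclaim counters around a forktest.pl run and see whether the slow
forks coincide with page-cache reclaim. I haven't tried this yet; a
rough sketch (counter names vary a little between kernel versions):
#!/usr/bin/perl -w
use strict;
# Snapshot reclaim counters (in pages) and free/cache sizes (in kB).
sub snapshot {
    my %v;
    open my $vm, '<', '/proc/vmstat' or die "vmstat: $!";
    while (<$vm>) { $v{$1} = $2 if /^(pgscan_\w+|pgsteal_\w+)\s+(\d+)/ }
    open my $mi, '<', '/proc/meminfo' or die "meminfo: $!";
    while (<$mi>) { $v{$1} = $2 if /^(MemFree|Cached):\s+(\d+)/ }
    return \%v;
}
my $size = shift(@ARGV) // 4096;
my $before = snapshot();
system('./forktest.pl', $size);
my $after = snapshot();
for my $k (sort keys %$after) {
    printf "%-16s %+d\n", $k, $after->{$k} - ($before->{$k} // 0);
}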
> > I wouldn't necessarily argue for blocking merging this now, but it'd
> > be disappointing to see omindex's worker module architecture hammered
> > into a wrong-shaped hole later on.
>
> Which other shape would you prefer?
I was really just responding to this comment in your patch email:
| it should still be possible to extract them as a standalone filter
| compatible with this patchset
This seemed to be suggesting just wrapping the worker modules as a
program which then gets fork+exec-ed for each file by the mechanism in
your patch. While that would "work", it loses many of the benefits.
> If one wants to avoid spawning processes, the obvious option is to talk
> to a persistent daemon over a socket, but making that the only choice
> would substantially complicate the setup.
The worker modules are effectively semi-persistent daemons (started on
demand, restarted if they die). Once this is a library, the code to
manage them would just be in that library.
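To illustrate the general shape (just a toy sketch, not the actual
omindex worker protocol or API - it assumes a hypothetical
./example-filter-worker which reads one filename per line and writes
one line of extracted text back):
#!/usr/bin/perl -w
use strict;
use IPC::Open2;
use IO::Handle;
use POSIX ':sys_wait_h';
$SIG{PIPE} = 'IGNORE'; # notice a dead worker via failed I/O, not SIGPIPE
my ($pid, $to_worker, $from_worker);
sub worker_alive {
    return 0 unless $pid;
    # Reap the worker if it has exited since the last request.
    if (waitpid($pid, WNOHANG) == $pid) { undef $pid; return 0; }
    return 1;
}
sub start_worker {
    if ($to_worker)   { close $to_worker;   undef $to_worker;   }
    if ($from_worker) { close $from_worker; undef $from_worker; }
    $pid = open2($from_worker, $to_worker, './example-filter-worker');
    $to_worker->autoflush(1);
}
sub extract {
    my ($file) = @_;
    start_worker() unless worker_alive(); # start on demand, restart if dead
    print {$to_worker} "$file\n";
    my $reply = <$from_worker>;
    unless (defined $reply) { # worker crashed mid-request
        waitpid($pid, 0);
        undef $pid;
        return undef; # the caller decides whether to retry $file
    }
    chomp $reply;
    return $reply;
}
The point being that the per-file cost is a couple of pipe writes and
reads, and a crashing filter only costs a worker restart.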
> E.g. the simple script from
> patch 7/7 would no longer be anywhere as simple. And in my case, I
> rarely receive more than 100 emails per day, and most of them do not
> have attachments, so I care far more about simplicity than avoiding a
> bit of overhead.
The incremental case is not really the problem though - it's the
initial indexing run where we need to index years (or decades) of mail
archives. That is also likely to be a new user's first experience.
> Also, nothing about my patchset requires forking a process to be the
> only option forever. If there is actual demand for a socket-based
> solution, it should be easy enough to add later.
If you aren't suggesting everything should work via the mechanism in
your patch, I'm happy.
Cheers,
Olly
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]