On 2025-10-12, Anton Khirnov wrote:
> There was some discussion on IRC about reusing Xapian Omega's parsers,
> but after looking at it I concluded that a substantial amount of work
> would be needed to separate them out to make them usable for notmuch in
> a safe manner.
I haven't seen the IRC discussion, and I'm not sure which code you
looked at. Xapian 1.4.x has built-in parsers for some (mostly HTML and
XML-based) formats, and supports running external programs for other
formats.
However, in the development version (which will become Xapian 2.0.x),
omindex has a new architecture here which runs extractors in persistent
subprocesses which it talks to via pipes, and supports using libraries
for the extraction. This reduces overhead while providing
scope for easier and stronger sandboxing (the subprocess extractors are
separate binaries with fairly limited needs and so can run in a much
more restricted environment). This version is a better fit for notmuch.
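
As a rough illustration of the persistent-subprocess idea (not omindex's
actual code or wire protocol, which is project-internal), here is a
minimal Python sketch. The line-based JSON protocol, the
PersistentExtractor class and the whitespace-normalising "extractor"
are all assumptions made up for this example:

```python
import json
import subprocess
import sys

# Hypothetical worker: reads one JSON request per line on stdin and
# writes one JSON reply per line on stdout until stdin is closed.
WORKER_SOURCE = r"""
import json, os, sys
for line in iter(sys.stdin.readline, ""):
    request = json.loads(line)
    # Stand-in for real extraction: just normalise whitespace.
    text = " ".join(request["data"].split())
    reply = {"text": text, "worker_pid": os.getpid()}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class PersistentExtractor:
    """Keeps one worker subprocess alive and reuses it for every request."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )

    def extract(self, data):
        # json.dumps escapes newlines, so each request stays on one line.
        self.proc.stdin.write(json.dumps({"data": data}) + "\n")
        self.proc.stdin.flush()
        line = self.proc.stdout.readline()
        if not line:
            raise RuntimeError("worker exited unexpectedly")
        return json.loads(line)

    def close(self):
        # Closing stdin ends the worker's read loop, so it exits cleanly.
        self.proc.stdin.close()
        self.proc.wait()
```

Calling extract() several times on one PersistentExtractor returns the
same worker_pid each time, which is the point: the process start-up
cost is paid once rather than per document.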
FWIW, I think the obstacles for being able to reuse this in notmuch
are:
* The code you'd want to use isn't yet in a stable release version.
* The indexer side of the pipe communication would need to be refactored
into a library to allow clean external use. It's in separate files
and the interface is not complex, but it is currently a
project-internal API. Alternatively we could document the protocol
used which might be an easier route for use from other languages.
* The built-in parsers currently still run in the main omindex process
so we'd need a worker module which wraps them.
* Similarly, any formats still handled by external programs are
currently still run by the main omindex process. The library approach
covers most formats now, but really we should wrap the external
program handling in a worker module so it runs in a separate process
too.
* There's currently no sandboxing beyond what the subprocess provides
(a crashing filter won't terminate the main process). We could get
sandboxing equivalent to what you have by just adjusting the command
which is run. It's hard to provide sandboxing out of the box here (as
you note, there isn't really a portable way to do so) but we could provide
easy hooks for it and implementations for some platforms.
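
The "adjust the command which is run" idea from the last point could
be sketched roughly as below. This is only an illustration: the hook
table, the function names and the worker name are hypothetical, and the
bubblewrap options shown (--die-with-parent, --unshare-net, --ro-bind)
are just one possible Linux choice that would need tuning for any real
worker:

```python
import sys


def linux_bwrap_prefix(readable_dirs):
    """Build a bubblewrap prefix: no network, read-only view of given dirs."""
    prefix = ["bwrap", "--die-with-parent", "--unshare-net"]
    for directory in readable_dirs:
        prefix += ["--ro-bind", directory, directory]
    return prefix


# Hypothetical per-platform hooks; platforms without an entry run unsandboxed.
SANDBOX_HOOKS = {"linux": linux_bwrap_prefix}


def worker_command(worker_argv, readable_dirs, platform=sys.platform):
    """Return the argv to launch a worker, wrapped in a sandbox if one is known."""
    hook = SANDBOX_HOOKS.get(platform)
    if hook is None:
        return list(worker_argv)
    return hook(readable_dirs) + list(worker_argv)
```

For example, worker_command(["some_worker"], ["/usr"]) yields a bwrap
command line on Linux and the plain worker command elsewhere, so the
code which spawns workers does not need to know about any particular
sandboxing tool.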
Worker modules are easy to write if you have the code to do the
actual extraction already - most of the work here is probably turning
this into a library with a sensible API for external use.
> If someone volunteers to do that in the future, it should still be
> possible to extract them as a standalone filter compatible with this
> patchset.
However, doing so would lose the performance benefits because for every
file you need to fork() (which is surprisingly slow for a large process,
AIUI because all the page tables need to be copied), then load the
filter and reinitialise it from scratch. Persisting the worker modules
means these overheads get amortised over multiple files.
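
To make the amortisation concrete, here is a toy cost model. The
function and the figures used with it are invented purely for
illustration and are not measurements of omindex or notmuch:

```python
def total_cost_us(n_files, startup_us, extract_us, persistent):
    """Total time in microseconds to process n_files.

    startup_us covers fork(), loading the filter and initialising it;
    extract_us is the per-file extraction work itself.
    """
    # A persistent worker pays the start-up cost once; a per-file
    # filter pays it again for every file.
    startups = min(n_files, 1) if persistent else n_files
    return startups * startup_us + n_files * extract_us
```

With an assumed 5000us start-up and 2000us of extraction per file,
1000 files cost 7000000us when spawning a filter per file but
2005000us with a persistent worker.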
There's also potential for locking down some worker modules even
more - not yet implemented, but for libraries which support a file
descriptor as input the main process could open the file and pass a
file descriptor across the pipe and the worker then may not even
need read access to any part of the filing system (depends on the
library - some read support files from e.g. /usr/share).
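
A sketch of what passing a descriptor could look like, as a sketch
under assumptions rather than anything omindex implements. Note that on
Unix, descriptors are passed as SCM_RIGHTS ancillary data, which needs
a Unix domain socket rather than a plain pipe, so this example uses a
socket pair. The function names and the b"extract" message are
hypothetical, and socket.send_fds()/socket.recv_fds() need Python 3.9
or later:

```python
import os
import socket


def send_document(sock, path):
    """Main-process side: open the file and hand the worker a descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        socket.send_fds(sock, [b"extract"], [fd])
    finally:
        # The receiver gets its own duplicate, so ours can be closed.
        os.close(fd)


def receive_and_extract(sock):
    """Worker side: read the document without ever opening a path."""
    msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    with os.fdopen(fds[0], "rb") as document:
        return msg, document.read()
```

In a real setup the two ends of socket.socketpair() would live in
different processes, and the worker would hand the received descriptor
to the extraction library instead of reading the bytes itself.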
I wouldn't necessarily argue for blocking merging this now, but it'd
be disappointing to see omindex's worker module architecture hammered
into a wrong-shaped hole later on.
Cheers,
Olly
_______________________________________________
notmuch mailing list -- [email protected]