Re: [PATCH 0/7] indexing attachment contents

Anton Khirnov Sat, 18 Oct 2025 07:51:41 -0700

Hi Olly,
Quoting Olly Betts (2025-10-13 03:57:33)
> On 2025-10-12, Anton Khirnov wrote:
> > There was some discussion on IRC about reusing Xapian Omega's parsers,
> > but after looking at it I concluded that a substantial amount of work
> > would be needed to separate them out to make them usable for notmuch in
> > a safe manner.
> 
> I haven't seen the IRC discussion, and I'm not sure which code you
> looked at.  Xapian 1.4.x has built-in parsers for some (mostly HTML and
> XML-based) formats, and supports running external programs for other
> formats.


I was looking at git master from about half a year ago, specifically
9a9c3790e.

> However in the development version (which will become Xapian 2.0.x),
> omindex has a new architecture here which runs extractors in persistent
> subprocesses which it talks to via pipes, and supports using libraries
> for the extraction.  This provides reduced overhead while providing
> scope for easier and stronger sandboxing (the subprocess extractors are
> separate binaries with fairly limited needs and so can run in a much
> more restricted environment).  This version is a better fit for notmuch.
> 
> FWIW, I think the obstacles for being able to reuse this in notmuch
> are:
> 
> * The code you'd want to use isn't yet in a stable release version.
> * The indexer side of the pipe communication would need to be refactored
>   into a library to allow clean external use.  It's in separate files
>   and the interface is not complex, but it is currently a
>   project-internal API.  Alternatively we could document the protocol
>   used which might be an easier route for use from other languages.
> * The built-in parsers currently still run in the main omindex process 
>   so we'd need a worker module which wraps them.
> * Similarly, any formats still handled by external programs are
>   currently still run by the main omindex process.  The library approach
>   covers most formats now, but really we should wrap the external
>   program handling in a worker module so it runs in a separate process
>   too.
> * There's not currently sandboxing beyond what the subprocess provides
>   (a crashing filter won't terminate the main process).  We could get
>   sandboxing equivalent to what you have by just adjusting the command
>   which is run.  It's hard to provide sandboxing out of the box here (as
>   you note there isn't really a portable way to) but we could provide
>   easy hooks for it and implementations for some platforms.

This sounds to me like you agree that a substantial amount of work would
be needed. And I'm probably not the person to do it, not least because I
barely know C++ (beyond what it shares with C).

And my other point still stands - while it would be nice to have the
option of using Xapian's indexers, it should be an option, not a
requirement. People should be able to define their own filters, without
having to add them to Xapian.

> Worker modules are easy to write if you have the code to do the
> actual extraction already - most of the work here is probably turning
> this into a library with a sensible API for external use.
> 
> > If someone volunteers to do that in the future, it should still be
> > possible to extract them as a standalone filter compatible with this
> > patchset.
> 
> However doing so would lose the performance benefits because for every
> file you need to fork() (which is surprisingly slow for a large process,
> AIUI because all the page tables need to be copied)

I do not believe this is true for any modern Unix. Rather, they all use
copy-on-write pages, so fork() on its own should be very cheap.
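
For anyone who wants to check this on their own system, below is a
rough sketch (not part of the patchset) that times fork() from a
process holding a configurable amount of touched memory. The function
name and the ballast_mb parameter are made up for illustration, and
the figure it returns covers fork() plus the child's exit and the
parent's wait, so it is only an upper bound on fork() itself.

```python
import os
import time


def mean_fork_seconds(iterations: int = 200, ballast_mb: int = 0) -> float:
    """Return the mean wall-clock time of one fork/exit/wait cycle.

    ballast_mb is a hypothetical knob: it makes the parent hold that many
    megabytes of resident memory, so runs with different sizes can be
    compared.
    """
    ballast = bytearray(ballast_mb * 1024 * 1024)
    # Write to one byte in every 4 KiB step so the pages are really mapped,
    # not just reserved (4 KiB is an assumed page size).
    for offset in range(0, len(ballast), 4096):
        ballast[offset] = 1

    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            # Child: exit at once without running any cleanup handlers.
            os._exit(0)
        # Parent: reap the child so no zombies accumulate.
        os.waitpid(pid, 0)
    elapsed = time.perf_counter() - start

    del ballast
    return elapsed / iterations


if __name__ == "__main__":
    for size in (0, 64, 512):
        print(f"{size:4d} MiB resident: {mean_fork_seconds(100, size):.6f} s")
```

Comparing the numbers for small and large ballast sizes shows how
much the cost depends on process size on a given kernel.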

> then load the filter and reinitialise it from scratch.  Persisting the
> worker modules means these overheads get amortised over multiple
> files.
> 
> There's also potential for locking down some worker modules even
> more - not yet implemented, but for libraries which support a file
> descriptor as input the main process could open the file and pass a
> file descriptor across the pipe and the worker then may not need
> even read access to any part of the filing system (depends on the
> library - some read support files from e.g. /usr/share).
> 
> I wouldn't necessarily argue for blocking merging this now, but it'd
> be disappointing to see omindex's worker module architecture hammered
> into a wrong-shaped hole later on.

Which other shape would you prefer?

If one wants to avoid spawning processes, the obvious option is to talk
to a persistent daemon over a socket, but making that the only choice
would substantially complicate the setup. E.g. the simple script from
patch 7/7 would no longer be anywhere near as simple. And in my case, I
rarely receive more than 100 emails per day, and most of them do not
have attachments, so I care far more about simplicity than avoiding a
bit of overhead.
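
To illustrate what I mean by simplicity, here is a minimal sketch of
the spawn-per-attachment model. It assumes a filter is just a command
that reads the attachment on stdin and writes extracted text to
stdout. That convention and the run_filter_once name are illustrative
assumptions here, not a description of the actual interface in the
patchset.

```python
import subprocess
import sys


def run_filter_once(argv, data: bytes, timeout: float = 30.0) -> str:
    """Spawn a filter command for a single attachment.

    Assumed convention (hypothetical): attachment bytes go in on stdin,
    extracted text comes back on stdout, non-zero exit means "no text".
    """
    proc = subprocess.run(
        argv,
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        # A failing or crashing filter only loses this one attachment.
        return ""
    return proc.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in filter: upper-cases its input instead of parsing a real format.
    toy_filter = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())",
    ]
    print(run_filter_once(toy_filter, b"hello"))
```

There is no state to manage and nothing to keep running between
messages. The price is one process spawn per attachment.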

Also, nothing about my patchset requires forking a process to be the
only option forever. If there is actual demand for a socket-based
solution, it should be easy enough to add later.
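
As a rough idea of what a persistent variant could look like, here is
a sketch that keeps one worker process alive and exchanges
length-prefixed messages with it. It uses pipes rather than a socket
to stay short. The framing (4-byte big-endian length, then payload),
the PersistentFilter class and the toy worker are all invented for
illustration; this is neither Xapian's protocol nor anything in the
patchset.

```python
import struct
import subprocess
import sys

# Hypothetical worker: loops over length-prefixed requests until EOF.
WORKER_SOURCE = r"""
import struct, sys
inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = inp.read(4)
    if len(header) < 4:
        break
    (size,) = struct.unpack("!I", header)
    payload = inp.read(size)
    reply = payload.upper()  # stand-in for real text extraction
    out.write(struct.pack("!I", len(reply)) + reply)
    out.flush()
"""


class PersistentFilter:
    """Keeps a single worker alive and reuses it for many attachments."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )

    def _read_exact(self, count: int) -> bytes:
        buf = b""
        while len(buf) < count:
            chunk = self.proc.stdout.read(count - len(buf))
            if not chunk:
                raise EOFError("worker exited unexpectedly")
            buf += chunk
        return buf

    def extract(self, data: bytes) -> bytes:
        # Send one request: 4-byte length header followed by the payload.
        self.proc.stdin.write(struct.pack("!I", len(data)) + data)
        self.proc.stdin.flush()
        # Read the reply using the same framing.
        (size,) = struct.unpack("!I", self._read_exact(4))
        return self._read_exact(size)

    def close(self) -> None:
        # Closing stdin makes the worker see EOF and leave its loop.
        self.proc.stdin.close()
        self.proc.wait()


if __name__ == "__main__":
    worker = PersistentFilter()
    print(worker.extract(b"first attachment"))
    print(worker.extract(b"second attachment"))
    worker.close()
```

Even this toy version needs framing, lifetime management and error
handling for a dead worker, which is the extra complexity I would
rather not force on everyone.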

Cheers,
-- 
Anton Khirnov
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]
