Hi Olly,

Quoting Olly Betts (2025-10-13 03:57:33)
> On 2025-10-12, Anton Khirnov wrote:
> > There was some discussion on IRC about reusing Xapian Omega's parsers,
> > but after looking at it I concluded that a substantial amount of work
> > would be needed to separate them out to make them usable for notmuch in
> > a safe manner.
>
> I haven't seen the IRC discussion, and I'm not sure which code you
> looked at. Xapian 1.4.x has built-in parsers for some (mostly HTML and
> XML-based) formats, and supports running external programs for other
> formats.

I was looking at git master from about half a year ago, specifically
9a9c3790e.

> However in the development version (which will become Xapian 2.0.x),
> omindex has a new architecture here which runs extractors in persistent
> subprocesses which it talks to via pipes, and supports using libraries
> for the extraction. This provides reduced overhead while providing
> scope for easier and stronger sandboxing (the subprocess extractors are
> separate binaries with fairly limited needs and so can run in a much
> more restricted environment). This version is a better fit for notmuch.
>
> FWIW, I think the obstacles for being able to reuse this in notmuch
> are:
>
> * The code you'd want to use isn't yet in a stable release version.
> * The indexer side of the pipe communication would need to be refactored
>   into a library to allow clean external use. It's in separate files
>   and the interface is not complex, but it is currently a
>   project-internal API. Alternatively we could document the protocol
>   used which might be an easier route for use from other languages.
> * The built-in parsers currently still run in the main omindex process
>   so we'd need a worker module which wraps them.
> * Similarly, any formats still handled by external programs are
>   currently still run by the main omindex process. The library approach
>   covers most formats now, but really we should wrap the external
>   program handling in a worker module so it runs in a separate process
>   too.
> * There's not currently sandboxing beyond what the subprocess provides
>   (a crashing filter won't terminate the main process). We could get
>   sandboxing equivalent to what you have by just adjusting the command
>   which is run. It's hard to provide sandboxing out of the box here (as
>   you note there isn't really a portable way to) but we could provide
>   easy hooks for it and implementations for some platforms.

This sounds to me like you agree that a substantial amount of work would
be needed.
And I'm probably not the person to do it, not least because I barely know
C++ (beyond what it shares with C). And my other point still stands - while
it would be nice to have the option of using Xapian's indexers, it should
be an option, not a requirement. People should be able to define their own
filters, without having to add them to Xapian.

> Worker modules are easy to write if you have the code to do the
> actual extraction already - most of the work here is probably turning
> this into a library with a sensible API for external use.
>
> > If someone volunteers to do that in the future, it should still be
> > possible to extract them as a standalone filter compatible with this
> > patchset.
>
> However doing so would lose the performance benefits because for every
> file you need to fork() (which is surprisingly slow for a large process,
> AIUI because all the page tables need to be copied)

I do not believe this is true for any modern Unix; rather, they all use
copy-on-write pages, so fork() on its own should be very cheap.

> then load the filter and reinitialise it from scratch. Persisting the
> worker modules means these overheads get amortised over multiple
> files.
>
> There's also potential for locking down some worker modules even
> more - not yet implemented, but for libraries which support a file
> descriptor as input the main process could open the file and pass a
> file descriptor across the pipe and the worker then may not need
> even read access to any part of the filing system (depends on the
> library - some read support files from e.g. /usr/share).
>
> I wouldn't necessarily argue for blocking merging this now, but it'd
> be disappointing to see omindex's worker module architecture hammered
> into a wrong-shaped hole later on.

Which other shape would you prefer? If one wants to avoid spawning
processes, the obvious option is to talk to a persistent daemon over a
socket, but making that the only choice would substantially complicate the
setup. E.g.
the simple script from patch 7/7 would no longer be anywhere near as
simple. And in my case, I rarely receive more than 100 emails per day, and
most of them do not have attachments, so I care far more about simplicity
than about avoiding a bit of overhead.

Also, nothing about my patchset requires forking a process to be the only
option forever. If there is actual demand for a socket-based solution, it
should be easy enough to add later.

Cheers,
-- 
Anton Khirnov
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]
