Hi,

this patchset adds support for indexing the contents of attachments, by
piping the payload to an external user-specified program. I have chosen
this approach for two main reasons:
* converting non-text data into text is inherently subjective, and also
  a moving target, so I wanted to avoid hardcoding any conversion policy
  inside libnotmuch
* parsers are notoriously bug-prone and have been subject to vast
  numbers of security issues in the past; since we will index untrusted
  data from the internet, it is highly important that we can sandbox the
  parsing code

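To illustrate the general idea of piping a payload to an external program, here is a minimal Python sketch. It is not the code from this patchset (the real implementation lives in lib/index.cc); the function name `run_filter`, the timeout, and the choice to return an empty string on failure are all assumptions made for the example.

```python
import subprocess
import sys


def run_filter(command, payload, timeout=30):
    """Pipe `payload` (bytes) to `command` and return its stdout as text.

    Hypothetical helper: returns an empty string if the program cannot be
    started, exits with a non-zero status, or exceeds `timeout` seconds.
    """
    try:
        result = subprocess.run(
            command,
            input=payload,              # attachment bytes go to stdin
            stdout=subprocess.PIPE,     # converted text comes back on stdout
            stderr=subprocess.DEVNULL,  # discard diagnostics from the filter
            timeout=timeout,
            check=True,                 # raise on non-zero exit status
        )
    except (OSError, subprocess.SubprocessError):
        return ""
    # Replace undecodable bytes instead of failing on bad output.
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Stand-in "filter" that just upper-cases its input.
    demo = [sys.executable, "-c",
            "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    print(run_filter(demo, b"hello"))
```

Because the conversion runs in a separate process, the command can be wrapped in whatever sandbox is available on the platform without touching the caller.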
There was some discussion on IRC about reusing Xapian Omega's parsers,
but after looking at it I concluded that a substantial amount of work
would be needed to separate them out to make them usable for notmuch in
a safe manner. If someone volunteers to do that in the future, it should
still be possible to extract them as a standalone filter compatible with
this patchset.

Sandboxes are unfortunately highly non-portable - I wrote a sample
Firejail profile because that's what I use, but it is Linux-only.

The example filtering program is currently quite trivial - it delegates
to pdftotext for PDF and elinks or w3m for HTML. We may want to add more
sophistication, handling *office files, archives, autodetecting faulty
mimetypes, etc. I can do some of that and add tests if the basic
approach is approved.

-- 
Anton Khirnov (7):
  bindings/python-cffi: do not use an unbound variable
  Add API for filtering attachments with an external program.
  indexopts: avoid a memleak in the error path
  build: add infrastructure for using close_range()/closefrom()
  index: implement filtering attachments with an external program
  contrib/filter: add a Firejail profile to sandbox filtering programs
  contrib/filter: add an attachment filtering program

 bindings/python-cffi/notmuch2/_build.py    |   5 +
 bindings/python-cffi/notmuch2/_database.py |  22 +-
 compat/Makefile.local                      |   4 +
 compat/closefrom.c                         |  25 ++
 compat/have_close_range.c                  |   8 +
 compat/have_closefrom.c                    |   8 +
 configure                                  |  32 +++
 contrib/filter/filter.py                   |  23 ++
 contrib/filter/firejail.profile            |  41 +++
 doc/man1/notmuch-config.rst                |  37 +++
 lib/config.cc                              |   3 +
 lib/index.cc                               | 276 ++++++++++++++++++++-
 lib/indexopts.c                            |  40 ++-
 lib/notmuch.h                              |  23 ++
 test/T590-libconfig.sh                     |   5 +
 15 files changed, 548 insertions(+), 4 deletions(-)
 create mode 100644 compat/closefrom.c
 create mode 100644 compat/have_close_range.c
 create mode 100644 compat/have_closefrom.c
 create mode 100755 contrib/filter/filter.py
 create mode 100644 contrib/filter/firejail.profile

-- 
2.47.3

Cheers,
-- 
Anton Khirnov
_______________________________________________
notmuch mailing list -- [email protected]
To unsubscribe send an email to [email protected]
