Hi,
this patchset adds support for indexing the contents of attachments, by
piping the payload to an external user-specified program. I have chosen
this approach for two main reasons:
* converting non-text data into text is inherently subjective, and also
  a moving target, so I wanted to avoid hardcoding any conversion policy
  inside libnotmuch
* parsers are notoriously bug-prone and have been subject to vast
  numbers of security issues in the past; since we will index untrusted
  data from the internet, it is highly important that we can sandbox the
  parsing code
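
To illustrate the general mechanism (not the actual code in lib/index.cc),
here is a minimal Python sketch of piping a payload to an external program
and reading the extracted text back. The name run_filter, the timeout, and
the convention of passing the MIME type as the last command-line argument
are assumptions made for this example only.

```python
import subprocess


def run_filter(command, mimetype, payload, timeout=30):
    """Pipe payload (bytes) to an external filter and return its text output.

    Returns None if the filter cannot be started, times out, or exits
    with a non-zero status, so a broken filter never aborts indexing.
    """
    try:
        proc = subprocess.run(
            list(command) + [mimetype],  # hypothetical calling convention
            input=payload,               # payload is written to the child's stdin
            stdout=subprocess.PIPE,      # extracted text is read from its stdout
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            close_fds=True,              # do not leak our open descriptors
                                         # (the Python 3 default, spelled out)
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if proc.returncode != 0:
        return None
    # The output comes from untrusted input, so decode it defensively.
    return proc.stdout.decode("utf-8", errors="replace")
```

Because the caller only sees a command line, the sandbox can be supplied by
the user as a wrapper around the real filter, without libnotmuch knowing
anything about it.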

There was some discussion on IRC about reusing Xapian Omega's parsers,
but after looking at it I concluded that a substantial amount of work
would be needed to separate them out to make them usable for notmuch in
a safe manner. If someone volunteers to do that in the future, it should
still be possible to extract them as a standalone filter compatible with
this patchset.

Sandboxes are unfortunately highly non-portable - I wrote a sample
Firejail profile because that's what I use, but it is Linux-only.

The example filtering program is currently quite trivial - it delegates
to pdftotext for PDF and elinks or w3m for HTML. We may want to add more
sophistication, handling *office files, archives, autodetecting faulty
mimetypes, etc. I can do some of that and add tests if the basic
approach is approved.
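
For readers who want a feel for how small such a filter can be, here is a
rough sketch of a MIME-type dispatcher. It is not the contents of
contrib/filter/filter.py: the interface (MIME type in argv[1], payload on
stdin, text on stdout), the exit codes, and the names CONVERTERS,
pick_command and main are all assumptions, and only w3m is shown for HTML.

```python
import shutil
import subprocess

# MIME type -> candidate commands, tried in order.
# Every command reads the payload on stdin and writes text to stdout.
CONVERTERS = {
    "application/pdf": [["pdftotext", "-", "-"]],
    "text/html": [["w3m", "-dump", "-T", "text/html"]],
}


def pick_command(mimetype, which=shutil.which):
    """Return the first installed converter for mimetype, or None."""
    for cmd in CONVERTERS.get(mimetype, []):
        if which(cmd[0]):
            return cmd
    return None


def main(argv):
    """Hypothetical entry point: argv[1] is the attachment's MIME type."""
    if len(argv) != 2:
        return 2  # usage error
    cmd = pick_command(argv[1])
    if cmd is None:
        return 1  # unsupported type, or no converter installed
    # The child inherits our stdin and stdout, so the payload streams
    # straight through without being buffered in this process.
    return subprocess.call(cmd)
```

A standalone script would end with sys.exit(main(sys.argv)). Supporting
another format then amounts to adding an entry to the table.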

-- 

Anton Khirnov (7):
  bindings/python-cffi: do not use an unbound variable
  Add API for filtering attachments with an external program.
  indexopts: avoid a memleak in the error path
  build: add infrastructure for using close_range()/closefrom()
  index: implement filtering attachments with an external program
  contrib/filter: add a Firejail profile to sandbox filtering programs
  contrib/filter: add an attachment filtering program

 bindings/python-cffi/notmuch2/_build.py    |   5 +
 bindings/python-cffi/notmuch2/_database.py |  22 +-
 compat/Makefile.local                      |   4 +
 compat/closefrom.c                         |  25 ++
 compat/have_close_range.c                  |   8 +
 compat/have_closefrom.c                    |   8 +
 configure                                  |  32 +++
 contrib/filter/filter.py                   |  23 ++
 contrib/filter/firejail.profile            |  41 +++
 doc/man1/notmuch-config.rst                |  37 +++
 lib/config.cc                              |   3 +
 lib/index.cc                               | 276 ++++++++++++++++++++-
 lib/indexopts.c                            |  40 ++-
 lib/notmuch.h                              |  23 ++
 test/T590-libconfig.sh                     |   5 +
 15 files changed, 548 insertions(+), 4 deletions(-)
 create mode 100644 compat/closefrom.c
 create mode 100644 compat/have_close_range.c
 create mode 100644 compat/have_closefrom.c
 create mode 100755 contrib/filter/filter.py
 create mode 100644 contrib/filter/firejail.profile

-- 
2.47.3

Cheers,
-- 
Anton Khirnov