Re: notmuch and public-inbox

2021-04-30 Thread Eric Wong
Carl Worth wrote:
> On Sat, May 01 2021, Eric Wong wrote:
> > I never had the interest in using notmuch since Maildirs are a
> > non-starter with millions of messages with current FSes/OSes.
> 
> What bottleneck are you seeing here?
> 
> I don't have million(s) of messages but I'm getting close with 1.48M
> messages in my current notmuch index.
> 
> I'm not seeing any problematic performance from the filesystem or OS
> myself, so I'm curious what problem you're referring to here.

I assume you have several Maildirs and not just one with 1.48M?

Since I've never actually used notmuch myself, most of my aversion
comes from years of using Maildir sync tools (mbsync,
offlineimap, rsync).  They all struggle with many inodes
and the syscalls + cache required to walk them.

It's the same reason git puts old objects in packfiles rather
than having millions of loose objects.

Furthermore, my MUA (mutt) struggles on a single Maildir when
its size goes over ~50K.  Maildir is fine as a dumping ground
for mairix search results (typically a few dozen/hundred results).

Maildir is better nowadays on FSes with compression and
checksums; the lack of compression and checksumming used to be
another point against it.  On the other hand, syscalls are also
more expensive with CPU vulnerability mitigations.

I've always gzipped my archival mboxes for compression and CRC.
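
A minimal sketch of the gzip-for-compression-and-CRC idea, using only the
Python standard library.  The gzip format stores a CRC-32 of the
uncompressed data in its trailer, so reading an archive back to the end
doubles as an integrity check.  The function names are made up for
illustration; this is not taken from any of the tools mentioned here.

```python
import gzip
import shutil
import zlib


def gzip_mbox(mbox_path):
    """Compress mbox_path to mbox_path + '.gz' and return the new path."""
    gz_path = mbox_path + ".gz"
    with open(mbox_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def verify_gzip(gz_path):
    """Read the whole archive so gzip checks the CRC-32 in the trailer.

    Returns the number of uncompressed bytes, or None if the archive
    is damaged (bad CRC, truncated file, corrupt deflate stream).
    """
    total = 0
    try:
        with gzip.open(gz_path, "rb") as f:
            while True:
                chunk = f.read(1 << 20)
                if not chunk:
                    return total
                total += len(chunk)
    except (OSError, EOFError, zlib.error):
        return None
```

`gzip -t old.mbox.gz` does the same check from the shell.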

My local mirror of all the messages on lore.kernel.org/* is over
14.6M(*) and growing...  (LKML is 4M of that).


(*) 14.6M in the new combined "extindex" format that should be on
lore.kernel.org, soon.  For now, I have an experimental
instance on https://yhbt.net/lore/all/
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org


Re: notmuch and mailing lists

2020-05-02 Thread Eric Wong
Sean Whitton wrote:
> Hello,
> 
> I was wondering whether anyone who previously read mailing lists via
> NNTP has stopped doing this after starting to use notmuch.

Fwiw, I have some slrn spool to Maildir translators here which should
work with notmuch:

Perl: https://yhbt.net/public-inbox.git/tree/scripts/slrnspool2maildir
(It still uses the Email::* namespace, which I'm slowly getting
 rid of in public-inbox due to performance and inactive upstream...)

Ruby: https://lore.kernel.org/lkml/20190104013522.stng6gwauwnr6wbi@starla/
(doesn't do any header rewriting)

> I've not yet used NNTP to read mailing lists myself, but I think there
> are limitations to the way I currently read lists, and was wondering
> whether it is worth exploring the NNTP approach, or trying to come up
> with notmuch-based workflow improvements.

Not directly related to notmuch:

I'm planning on expanding public-inbox to include a local client
tool which can index, search, and optionally cache NNTP and HTTP
messages from any NNTP and public-inbox HTTP instances.  Maybe
IMAP, too...
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch


Re: performance problems with notmuch new

2020-04-29 Thread Eric Wong
David Bremner wrote:
> Franz Fellner writes:
> > mail takes at least 10 seconds, sometimes even more.  It can go into
> > minutes when I get lots of mail (~30...).  When I run it after a
> > reboot I can have breakfast while notmuch starts up...  This is all on
> > spinning rust. I thought of getting an SSD but not in the near future.
> 
> I do have at least one spinning rust configuration with about 300k
> messages, and notmuch new is still fast there.

I've yet to figure out how spinning rust can work well with
giant public-inboxes (git + Xapian + SQLite); but I have
a fair bit of experience with SSDs + Xapian.

But some of my recommendations below come from my experience
with HDDs in the old days, before I used Xapian.

> > What I observe during that time: notmuch doesn't really need much CPU.
> > iotop shows constant read and write with extremely low rates, under
> > 1MB/sec.  So I think it might be an issue in xapian?

Seek times, probably.  `iostat -x 1' can give you very useful
information about I/O queue sizes and wait times for reads and
writes (the `-x' is the good stuff :); `1' means it keeps
outputting every second.

> Just in case one of the xapian experts can suggest some kind of test for
> why you might be seeing this behaviour, I've included the xapian list in
> CC.

Newer Xapian has a DB_NO_SYNC which notmuch could set as an
option.  Users of old Xapian (or on Perl XS bindings) also have
libeatmydata LD_PRELOAD which I end up using all the time:

https://www.flamingspork.com/projects/libeatmydata/

I run `sync' if I have anything important, but I usually
don't ;)   I do set the kernel to flush dirty data in the
background fairly aggressively, though (more below).

For public-inbox v2 hacking in 2018 (indexing LKML archives, ~3M
messages), I found working on a freshly TRIM-ed SSD with plenty
of free space made the SSD firmware happier.  SSDs can get a LOT
slower as they get fuller (so xapian-compact helps, there, too).

SSD quality matters a lot; but even the low-end QLC stuff beats
high-end HDDs in random I/O; but they will slow down more as
they fill up more.

For writes, I set /proc/sys/vm/dirty_background_bytes to 100M or
something reasonably close to what the SSD can write quickly.
Linux tended to hit I/O stalls with lots of dirty data, so
making the kernel flush it sooner tends to help IME.  Maybe newer
kernels do better *shrug*; but it's basically the local storage
version of the network "Bufferbloat" problem.

Flushing dirty data more frequently also frees up more memory
for the kernel to make better caching decisions about
future/current data it needs to read.
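
A small sketch of reading and setting that knob from Python.  The path
is the one named above; treating "100M" as 100 MiB is an assumption, and
writing the real file needs root.  When the kernel is using
dirty_background_ratio instead, this file reads back as 0.

```python
SYSCTL = "/proc/sys/vm/dirty_background_bytes"


def read_dirty_background_bytes(path=SYSCTL):
    """Return the current threshold in bytes (0 means the ratio knob is in use)."""
    with open(path) as f:
        return int(f.read().strip())


def set_dirty_background_bytes(nbytes, path=SYSCTL):
    """Write a new threshold; needs root when path is the real sysctl file."""
    with open(path, "w") as f:
        f.write("%d\n" % nbytes)


if __name__ == "__main__":
    print("current:", read_dirty_background_bytes())
    # As root, assuming 100 MiB is close to what the SSD can write quickly:
    # set_dirty_background_bytes(100 * 1024 * 1024)
```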

notmuch can probably run a background thread (or use liburing)
to do POSIX_FADV_DONTNEED once it's done with a message, too (and
POSIX_FADV_WILLNEED for to-be-indexed messages).  Uncompressed
Maildir messages eat cache space real quick, which means less
cache for Xapian.
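
A rough sketch of that hinting pattern in Python, where os.posix_fadvise
is a thin wrapper around posix_fadvise(2).  It is single-threaded for
brevity, and `index_message` is a hypothetical callback standing in for
the real indexing step; none of this is notmuch code.

```python
import os

HAVE_FADVISE = hasattr(os, "posix_fadvise")  # Unix only


def prefetch(path):
    """Hint that a to-be-indexed message will be read soon."""
    if not HAVE_FADVISE:
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset 0, length 0 means "from the start to the end of the file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)


def index_and_drop(path, index_message):
    """Index one message, then hint that its cached pages can go."""
    with open(path, "rb") as f:
        index_message(f.read())
        if HAVE_FADVISE:
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)


def index_all(paths, index_message):
    """Index messages in order, prefetching the next one each time."""
    paths = list(paths)
    for i, path in enumerate(paths):
        if i + 1 < len(paths):
            prefetch(paths[i + 1])
        index_and_drop(path, index_message)
```

Both calls are only advice; the kernel is free to ignore them.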

public-inbox indexes the v2 inbox format in parallel; but
excessive parallelism still causes I/O contention with SSDs (at
least upper-mid-range ones).  So right now the default limit is
3 indexing processes regardless of CPU count.  Reading from git
is still synchronous atm, but will probably be async in a few
months.  git itself tends to generate decent I/O patterns with
its pack format (but makes posix_fadvise hinting impractical).

Anyways, indexing just under 3 million LKML messages took ~4
hours on a 4-core system built in 2010 with a SATA SSD from 2014.


notmuch modifies DB while iterating?

2018-02-27 Thread Eric Wong
Hello, I'm neither a notmuch user nor proficient in C++.

However, I noticed a bug while working on public-inbox (in Perl)
which shares Xapian thread linking logic with notmuch, and I
suspect notmuch is affected by the same problem as public-inbox.

The problem is in the _merge_threads function in add-message.cc.
While the Xapian::PostingIterator for loser is iterating, the
Xapian DB is being modified by replace_document via
_notmuch_message_sync.

This was causing DatabaseCorruptError exceptions in public-inbox
with my dataset.

I fixed it in public-inbox by stashing docid scalars into a
Perl array while iterating with the PostingIterator, and then
doing lookups + replacements independently of the
PostingIterator by iterating through the Perl array:

https://public-inbox.org/meta/20180227221302.7308-...@80x24.org/raw
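
The shape of that fix, sketched in plain Python with a dict of sets
standing in for the Xapian posting lists (so this is an analogy, not
Xapian API usage; all names are made up).  Python happens to refuse
mutation during iteration with a RuntimeError, which is a much friendlier
failure than a DatabaseCorruptError but illustrates the same hazard.

```python
def merge_threads_while_iterating(by_thread, winner, loser):
    """Broken on purpose: modifies the loser's list while walking it."""
    for docid in by_thread[loser]:  # live iterator over the posting list
        by_thread[loser].discard(docid)  # mutation under the iterator
        by_thread.setdefault(winner, set()).add(docid)


def merge_threads(by_thread, winner, loser):
    """Snapshot the docids first, then modify independently of the iterator."""
    docids = list(by_thread.get(loser, ()))  # the stashed array of docids
    for docid in docids:
        by_thread[loser].discard(docid)
        by_thread.setdefault(winner, set()).add(docid)
    by_thread.pop(loser, None)
    return docids


if __name__ == "__main__":
    threads = {"t1": {1, 2}, "t2": {3, 4, 5}}
    merge_threads(threads, "t1", "t2")
    print(threads)
```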


I initially thought this was a bug in the glass backend, but
I've also hit it with chert.

I have a standalone Perl script to reproduce the bug at
https://yhbt.net/skel.bug.perl and an 81M gzipped dataset which
reproduces the problem at https://yhbt.net/skel.bug.gz

(each line is: MID [REFERENCES-SEPARATED-BY-SPACES])

Usage:

For failure:
  curl https://yhbt.net/skel.bug.gz | zcat | perl -w /path/to/skel.bug.perl

For success:
  curl https://yhbt.net/skel.bug.gz | zcat | \
    BATCH_SIZE=1000 perl -w /path/to/skel.bug.perl


Re: Mail archives in Git using ssoma (Docker image)

2016-09-07 Thread Eric Wong
David Bremner <da...@tethera.net> wrote:
> Eric Wong <e...@80x24.org> writes:
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs.  The config knobs are sorta
> > documented from my announcement to git@vger:
> >
> > https://public-inbox.org/git/20160710004813.ga20...@dcvr.yhbt.net/
> > http://hjrcffqmbrq6wope.onion/git/20160710004813.ga20...@dcvr.yhbt.net/
> >
> > Initial import (w/o spamassassin) was done with
> > scripts/import_vger_from_mbox in the source:
> >
> > torsocks git clone http://hjrcffqmbrq6wope.onion/public-inbox
> > git clone https://public-inbox.org/ public-inbox
> > git clone git://repo.or.cz/public-inbox
> >
> 
> FWIW, I already have a Maildir with a complete (and updated) archive
> of the list (and only that) for use of nmbug. So at the risk of
> putting all eggs in one basket, perhaps public-inbox-watch could
> watch that maildir.

Yes, public-inbox-watch(1) is probably preferable for any subscriber to
start archiving the notmuch list.  I just pushed out some POD manpages
which should probably help (along with the existing INSTALL doc):

   https://public-inbox.org/meta/20160907004907.1479-...@80x24.org/

public-inbox-overview(7) should be a good starting point of ways
to start mirroring/hosting.  Please feel free to ask me directly
and/or m...@public-inbox.org if you need clarification or help.
I'm scatterbrained and tend to omit things when writing
documentation (it's hard to tell what a reader wants to know :x)


Anyways, thanks for notmuch (and being GPL-3.0+)!  I'm not a
user myself(*), but I've found the notmuch source to be a good
place to steal Xapian usage examples from for public-inbox :>



(*) I have trouble with Maildir-only scalability and
still use gzipped mbox for old mail.


Re: Mail archives in Git using ssoma

2016-08-22 Thread Eric Wong
"W. Trevor King" <wk...@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used.  This was
> > crucial for getting git@vger archives imported in a reasonable time.
> 
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages.  LKML is well
into the millions, and I hope public-inbox (and git!) can handle
that one day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as
more messages get imported, since all those files sit in the
index, and the git index format is bad for incremental updates
with big, flat trees.  Big trees are a general problem with git:

I'm now storing blob IDs directly in Xapian and will be
using them more to avoid tree lookups.  tree creation
lookups degrade the same way the index does as they
get bigger.

Currently it's using a 2/38 split of the SHA-1, like git loose
objects; a goal might be to move towards supporting 2/2/36
(or deeper), as Jeff noted substantial object traversal
improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuw...@sigill.intra.peff.net/

Of course, support for 2/38 will be retained for old
archives/messages.
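
To make the 2/38 and 2/2/36 notation concrete, here is a minimal Python
sketch that splits a 40-character hex SHA-1 into path components.  The
function name is made up, and hashing a Message-ID in the demo is an
assumption for illustration only.

```python
import hashlib


def fanout_path(oid_hex, widths=(2,)):
    """Split a hex object ID into directory components.

    widths=(2,) yields 2/38 for a 40-char SHA-1 (the git loose object
    layout); widths=(2, 2) yields 2/2/36.
    """
    parts, pos = [], 0
    for width in widths:
        parts.append(oid_hex[pos:pos + width])
        pos += width
    parts.append(oid_hex[pos:])  # whatever is left is the file name
    return "/".join(parts)


if __name__ == "__main__":
    oid = hashlib.sha1(b"<20160821120852.GA1234@example.com>").hexdigest()
    print(fanout_path(oid))          # 2/38
    print(fanout_path(oid, (2, 2)))  # 2/2/36
```

Each extra level keeps individual trees smaller, at the cost of one more
tree lookup per path.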

>   $ python -m cProfile -o profile import.py notmuch.mbox
>   $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
>   Sun Aug 21 12:56:49 2016    profile
> 
>            101823722 function calls (99078415 primitive calls) in 885.069 seconds
> 
>      Ordered by: cumulative time
>      List reduced from 1145 to 10 due to restriction <10>
> 
>      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>        70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>           1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>           1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>       22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>       22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>       22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>       22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>       22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>       22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing the index is already the slowest, here, in
terms of total time, too.  It might be interesting if you
profiled each *-mda invocation to see the degradation from the
first to last message.

>       22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
> 
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will take
> a similar amount of time:
> 
>   $ time python -c 'print("hello")'
>   hello
> 
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
> 
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
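
For illustration, a minimal Python sketch that builds a git-fast-import
stream with one commit per message.  fast-import reads such a stream on
stdin and writes objects straight into a packfile, so no index or working
tree is involved.  The committer identity, timestamp and paths are
placeholders, and this is not how public-inbox or ssoma-mda actually
structure their commits.

```python
def fast_import_stream(messages, ref="refs/heads/master",
                       ident="Importer <importer@example.com>",
                       when=1471800000):
    """Return fast-import input (bytes) committing each message to ref.

    messages is an iterable of (path, raw_bytes) pairs.  Without a
    "from" line, the first commit starts a new history, so this sketch
    suits an initial import rather than appending to an existing ref.
    """
    out = []
    for mark, (path, raw) in enumerate(messages, 1):
        log = b"import " + path.encode()
        out.append(b"commit " + ref.encode() + b"\n")
        out.append(b"mark :%d\n" % mark)
        out.append(b"committer %s %d +0000\n" % (ident.encode(), when))
        out.append(b"data %d\n" % len(log) + log + b"\n")
        # "inline" puts the blob in the commit itself: M <mode> inline <path>
        out.append(b"M 100644 inline " + path.encode() + b"\n")
        out.append(b"data %d\n" % len(raw) + raw + b"\n")
    return b"".join(out)
```

The result can be piped to `git fast-import` inside a bare repository,
e.g. with subprocess.run(["git", "fast-import"], input=stream, cwd=repo).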


Re: Mail archives in Git using ssoma

2016-08-21 Thread Eric Wong
"W. Trevor King" <wk...@tremily.us> wrote:
> On Fri, Nov 07, 2014 at 11:03:21AM -0800, W. Trevor King wrote:
> > Eric Wong has been working on some tools to store email in a Git
> > repository, and his client-side code is ssoma [1].  I wanted a bit
> > more metadata than the stock ssoma-mda [2], and ended up just
> > writing a ssoma-mda in Python [3]…

Btw, for public-inbox, I'm using git-fast-import now, so imports
are a bit faster and $GIT_DIR/ssoma.index is no longer used.
This was crucial for getting git@vger archives imported in
a reasonable time.

public-inbox-* still keeps ssoma.index up-to-date for backwards
compatibility with ssoma, and will probably do so until 2020 or
later (there'll be a few years of deprecation notices)

So I or someone else needs to update Perl ssoma to use fast-import at
some point, too; and I suggest your python version do the same.


Re: Mail archives in Git using ssoma (Docker image)

2016-08-21 Thread Eric Wong
"W. Trevor King" <wk...@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 12:08:52PM +, Eric Wong wrote:
> > "W. Trevor King" <wk...@tremily.us> wrote:
> > > This is the ssoma archive (with the data in it).  I just set up a
> > > basic HTTP archive (following [1]) based on a Docker image [2] (Gentoo
> > > doesn't package all the Perl dependencies public-inbox needs).
> > 
> > Ugh, that sucks (sorry, not a fan of Docker).
> > 
> > What's missing from Gentoo?
> 
> Gentoo doesn't package (or I couldn't find the package for)
> Encode::MIME::Header or Mail::Thread.  I tried installing things from
> CPAN, but ran into a compile-time error from the ‘cpan’ invocation and
> gave up ;).  I can try and reproduce the error if you're curious, but
> I don't have it handy at the moment.

Encode::MIME::Header is distributed with perl itself on Debian and also
the stock upstream install.  Not sure if there's an option you missed or
disabled.

Which perl version do you use?

perl 5.14 on Debian wheezy even seems to have it.  I actually
still want everything to work on 5.8, since that seems to be
the de-facto baseline in the wild.


Mail::Thread is one .pm, and I'll probably replace it with
something (same algorithm) which can use half the memory by
avoiding wrapper object abstractions (it's probably the biggest
memory hog at the moment).

lib/PublicInbox/Thread.pm already has 3 monkey patches to work around
upstream bugs in Mail::Thread.  It's dead upstream, and not available on
FreeBSD, either.

> > >   $ git config -f srv/notmuch.git/config publicinbox.http http://tremily.us
> > >   $ git config -f srv/notmuch.git/config publicinbox.email notmuch@notmuchmail.org
> > 
> > That should probably be:
> > 
> > ; based on your [3]
> > git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.url http://tremily.us/notmuch
> > 
> > git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.address notmuch@notmuchmail.org
> > 
> > ; this is crucial for all the public-inbox-* tools
> > git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.mainrepo /path/to/notmuch.git
> 
> I was using these in the Dockerfile's CMD:
> 
>   (cd /srv;
>    for NAME in *;
>    do
>      CONF="/srv/${NAME}/config";
>      public-inbox-init "${NAME}" "/srv/${NAME}" $(git config -f "${CONF}" publicinbox.http) $(git config -f "${CONF}" publicinbox.email);
>    done) && …
> 
> Are you saying that I can skip the ~/.public-inbox/config entries
> set up by public-inbox-init if I set publicinbox.{name}.* in the ssoma
> repository's config?  That would be nice.

Erm, sorry, no, I mean ~/.public-inbox/config as the "git config -f"
arg in the above commands.  Your original config was
meaningless in the context of public-inbox itself; I don't
recall public-inbox relying on $GIT_DIR/config much (if at all)
outside of standard git things.

Using ~/.public-inbox/config is required for multi-inbox lookups
(since you normally run MDA w/o args)

You can also override ~/.public-inbox/config by setting the
PI_CONFIG env (like GIT_CONFIG).
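
A small Python sketch of that lookup order: PI_CONFIG if set, else
~/.public-inbox/config, with the result handed to "git config -f".  The
helper names are hypothetical and this only approximates what the
public-inbox tools do internally.

```python
import os


def pi_config_path(environ=None):
    """Return the config file to use: $PI_CONFIG, else the default path."""
    environ = os.environ if environ is None else environ
    default = os.path.expanduser("~/.public-inbox/config")
    return environ.get("PI_CONFIG", default)


def inbox_setting_cmd(name, key, environ=None):
    """Build the git command that reads publicinbox.<name>.<key>."""
    return ["git", "config", "-f", pi_config_path(environ),
            "publicinbox.%s.%s" % (name, key)]


if __name__ == "__main__":
    print(" ".join(inbox_setting_cmd("notmuch", "address")))
```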

> I don't see a point to having {name} in ssoma-config settings though,
> since you're already in a single bucket by that point (using
> publicinbox.{name}.* makes sense in the multi-bucket
> ~/.public-inbox/config).
> 
> > > It's not updating automatically yet, but that will probably look
> > > like:
> > > 
> > > 1. Pull new mbox [4].
> > > 2. Import into notmuch-archives [5].
> > > 3. Re-run public-inbox-index (this could probably be via ‘docker exec …’).
> > > 
> > > But I'll have to test that to confirm.  And ideally we'd be using
> > > ssoma-mda or similar directly, instead of going through mbox, but I'd
> > > rather get the official headers on the stored mail than be efficient
> > > ;).
> > 
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs.
> 
> If I had a Maildir locally, I'd just use procmail and push new
> messages into ssoma-mda.  I'm using the import script because my local
> mail has “how we delivered this to Trevor” headers (which I don't want
> to add) but the downloaded mbox has “how we delivered this to
> notmuch@notmuchmail.org” (which seems like a better fit for a shared
> ssoma repo).

I don't mind extra/different headers.  The majority of messages in
public-inbox.org/git/ were delivered to gmane;
recent ones are delivered to me, and some holes were filled in by
Jeff King's archives.