Re: notmuch and public-inbox
Carl Worth wrote:
> On Sat, May 01 2021, Eric Wong wrote:
> > I never had the interest in using notmuch since Maildirs are a
> > non-starter with millions of messages with current FSes/OSes.
>
> What bottleneck are you seeing here?
>
> I don't have million(s) of messages but I'm getting close with 1.48M
> messages in my current notmuch index.
>
> I'm not seeing any problematic performance from the filesystem or OS
> myself, so I'm curious what problem you're referring to here.

I assume you have several Maildirs and not just one with 1.48M?

I have never actually used notmuch myself; most of my aversion comes
from years of using Maildir sync tools (mbsync, offlineimap, rsync).
They all struggle with many inodes and the syscalls + cache required
to walk them. It's the same reason git puts old objects in packfiles
rather than having millions of loose objects.

Furthermore, my MUA (mutt) struggles on a single Maildir when its
size goes over ~50K. Maildir is fine as a dumping ground for mairix
search results (typically a few dozen/hundred results).

Maildir is better nowadays on FSes with compression and checksums;
the lack of compression and checksumming used to be points against
it. Syscalls are also more expensive with CPU vulnerability
mitigations, though. I've always gzipped my archival mboxes for
compression and CRC.

My local mirror of all the messages on lore.kernel.org/* is over
14.6M(*) and growing... (LKML is 4M of that).

(*) 14.6M in the new combined "extindex" format that should be on
    lore.kernel.org, soon. For now, I have an experimental instance
    on https://yhbt.net/lore/all/
___
notmuch mailing list -- notmuch@notmuchmail.org
To unsubscribe send an email to notmuch-le...@notmuchmail.org
Re: notmuch and mailing lists
Sean Whitton wrote:
> Hello,
>
> I was wondering whether anyone who previously read mailing lists via
> NNTP has stopped doing this after starting to use notmuch.

Fwiw, I have some slrn spool to Maildir translators here which
should work with notmuch:

Perl: https://yhbt.net/public-inbox.git/tree/scripts/slrnspool2maildir
(It still uses the Email::* namespace, which I'm slowly getting rid
 of in public-inbox due to performance and inactive upstream...)

Ruby: https://lore.kernel.org/lkml/20190104013522.stng6gwauwnr6wbi@starla/
(doesn't do any header rewriting)

> I've not yet used NNTP to read mailing lists myself, but I think there
> are limitations to the way I currently read lists, and was wondering
> whether it is worth exploring the NNTP approach, or trying to come up
> with notmuch-based workflow improvements.

Not directly related to notmuch: I'm planning on expanding
public-inbox to include a local client tool which can index, search,
and optionally cache NNTP and HTTP messages from any NNTP and
public-inbox HTTP instances. Maybe IMAP, too...
___
notmuch mailing list
notmuch@notmuchmail.org
https://notmuchmail.org/mailman/listinfo/notmuch
Re: performance problems with notmuch new
David Bremner wrote:
> Franz Fellner writes:
> > mail takes at least 10 seconds, sometimes even more. It can go into
> > minutes when I get lots of mail (~30...). When I run it after a
> > reboot I can have breakfast while notmuch starts up... This is all on
> > spinning rust. I thought of getting an SSD but not in the near future.
>
> I do have at least one spinning rust configuration with about 300k
> messages, and notmuch new is still fast there.

I've yet to figure out how spinning rust can work well with giant
public-inboxes (git + Xapian + SQLite); but I have a fair bit of
experience with SSDs + Xapian. Some of my recommendations below
come from my experience with HDDs in the old days, before I used
Xapian.

> > What I observe during that time: notmuch doesn't really need much CPU.
> > iotop shows constant read and write with extremely low rates, under
> > 1MB/sec. So I think it might be an issue in xapian?

Seek times, probably.

`iostat -x 1' can give you very useful information about I/O queue
sizes and wait times for reads and writes (the `-x' is the good
stuff :), `1' means it keeps outputting every second.

> Just in case one of the xapian experts can suggest some kind of test for
> why you might be seeing this behaviour, I've included the xapian list in
> CC.

Newer Xapian has a DB_NO_SYNC which notmuch could set as an option.

Users of old Xapian (or on Perl XS bindings) also have libeatmydata
LD_PRELOAD which I end up using all the time:
https://www.flamingspork.com/projects/libeatmydata/

I run `sync' if I have anything important, but I usually don't ;)
I do set the kernel to flush dirty data in the background fairly
aggressively, though (more below).

For public-inbox v2 hacking in 2018 (indexing LKML archives, ~3M
messages), I found working on a freshly TRIM-ed SSD with plenty of
free space made the SSD firmware happier. SSDs can get a LOT slower
as they get fuller (so xapian-compact helps, there, too).
SSD quality matters a lot; even the low-end QLC stuff beats high-end
HDDs in random I/O, but it will slow down more as it fills up.

For writes, I set /proc/sys/vm/dirty_background_bytes to 100M or
something reasonably close to what the SSD can write quickly.

Linux tended to hit I/O stalls with lots of dirty data, so making
the kernel flush it sooner tends to help IME. Maybe newer kernels
do better *shrug*; but it's basically the local storage version of
the network "Bufferbloat" problem.

Flushing dirty data more frequently also frees up more memory for
the kernel to make better caching decisions about future/current
data it needs to read.

notmuch can probably run a background thread (or use liburing) to
do POSIX_FADV_DONTNEED once it's done with a message, too (and
POSIX_FADV_WILLNEED for to-be-indexed messages). Uncompressed
Maildir messages eat cache space real quick, which means less cache
for Xapian.

public-inbox indexes the v2 inbox format in parallel; but excessive
parallelism still causes I/O contention with SSDs (at least
upper-mid-range ones). So right now the default limit is 3 indexing
processes regardless of CPU count.

Reading from git is still synchronous atm, but will probably be
async in a few months. git itself tends to generate decent I/O
patterns with its pack format (but makes posix_fadvise hinting
impractical).

Anyways, indexing just under 3 million LKML messages took ~4 hours
on a 4-core system built in 2010 with a SATA SSD from 2014.
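The POSIX_FADV_WILLNEED / POSIX_FADV_DONTNEED idea above can be sketched
in a few lines of Python. This is only an illustration of the hinting
pattern, not notmuch or public-inbox code: the helper names (prefetch,
read_and_drop, index_all) and the index_one callback are hypothetical,
and the hints are advisory, so the kernel is free to ignore them.

```python
import os


def prefetch(path):
    """Hint that `path` will be read soon (POSIX_FADV_WILLNEED)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            # offset=0, length=0 covers the whole file
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)


def read_and_drop(path):
    """Read a message, then hint that its cached pages can be dropped."""
    with open(path, "rb") as fh:
        data = fh.read()
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fh.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
    return data


def index_all(paths, index_one):
    """Feed each message to `index_one` (hypothetical indexer callback),
    prefetching the next message while the current one is processed."""
    paths = list(paths)
    for i, path in enumerate(paths):
        if i + 1 < len(paths):
            prefetch(paths[i + 1])
        index_one(read_and_drop(path))
```

The hasattr() guard keeps the sketch importable on platforms without
posix_fadvise; there it degrades to plain sequential reads.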
notmuch modifies DB while iterating?
Hello, I'm neither a notmuch user nor proficient in C++.

However, I noticed a bug while working on public-inbox (in Perl),
which shares Xapian thread linking logic with notmuch, and I suspect
notmuch is affected by the same problem as public-inbox.

The problem is in the _merge_threads function in add-message.cc.
While the Xapian::PostingIterator for loser is iterating, the Xapian
DB is being modified by replace_document via _notmuch_message_sync.

This was causing DatabaseCorruptError exceptions in public-inbox with
my dataset. I fixed it in public-inbox by stashing docid scalars
into a Perl array while iterating with the PostingIterator, and then
doing lookups + replacements independently of the PostingIterator by
iterating through the Perl array:

https://public-inbox.org/meta/20180227221302.7308-...@80x24.org/raw

I initially thought this was a bug in the glass backend, but I've
also hit it with chert.

I have a standalone Perl script to reproduce the bug at
https://yhbt.net/skel.bug.perl and an 81M gzipped dataset which
reproduces the problem at https://yhbt.net/skel.bug.gz
(each line is: MID [REFERENCES-SEPARATED-BY-SPACES])

Usage:

	For failure:

	curl https://yhbt.net/skel.bug.gz | zcat | perl -w /path/to/skel.bug.perl

	For success:

	curl https://yhbt.net/skel.bug.gz | zcat | \
		BATCH_SIZE=1000 perl -w /path/to/skel.bug.perl
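The "stash the docids first, then modify" fix described above can be
sketched without Xapian. The toy below is an assumption-laden analogy,
not the notmuch or public-inbox code: postings is a plain dict mapping
a thread id to a set of docids, and merge_threads is a hypothetical
name. Iterating the loser's set directly while removing from it would
raise RuntimeError in Python, much as iterating a live posting list
while replacing documents is unsafe.

```python
def merge_threads(postings, winner, loser):
    """Move every docid from thread `loser` to thread `winner`.

    `postings` maps thread id -> set of docids (a toy stand-in for an
    index; all names here are illustrative only).
    """
    if winner == loser:
        return []
    # Pass 1 (read only): snapshot the docids, like stashing them
    # into an array while the iterator is still live.
    stashed = list(postings.get(loser, ()))
    # Pass 2 (write only): no iterator over the index is open now.
    for docid in stashed:
        postings[loser].discard(docid)
        postings.setdefault(winner, set()).add(docid)
    postings.pop(loser, None)
    return stashed
```

The cost is holding one list of docids in memory per merge, which is
small next to the documents themselves.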
Re: Mail archives in Git using ssoma (Docker image)
David Bremner <da...@tethera.net> wrote:
> Eric Wong <e...@80x24.org> writes:
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs. The config knobs are sorta
> > documented from my announcement to git@vger:
> >
> > https://public-inbox.org/git/20160710004813.ga20...@dcvr.yhbt.net/
> > http://hjrcffqmbrq6wope.onion/git/20160710004813.ga20...@dcvr.yhbt.net/
> >
> > Initial import (w/o spamassassin) was done with
> > scripts/import_vger_from_mbox in the source:
> >
> >   torsocks git clone http://hjrcffqmbrq6wope.onion/public-inbox
> >   git clone https://public-inbox.org/ public-inbox
> >   git clone git://repo.or.cz/public-inbox
> >
>
> FWIW, I already have a Maildir with a complete (and updated) archive
> of the list (and only that) for use of nmbug. So at the risk of
> putting all eggs in one basket, perhaps public-inbox-watch could
> watch that maildir.

Yes, public-inbox-watch(1) is probably preferable for any subscriber
to start archiving the notmuch list.

I just pushed out some POD manpages which should probably help
(along with the existing INSTALL doc):

https://public-inbox.org/meta/20160907004907.1479-...@80x24.org/

public-inbox-overview(7) should be a good starting point of ways to
start mirroring/hosting.

Please feel free to ask me directly and/or m...@public-inbox.org if
you need clarification or help. I'm scatterbrained and tend to omit
things when writing documentation (it's hard to tell what a reader
wants to know :x)

Anyways, thanks for notmuch (and being GPL-3.0+)! I'm not a user
myself(*), but I've found the notmuch source to be a good place to
steal Xapian usage examples from for public-inbox :>

(*) I have trouble with Maildir-only scalability and still use
    gzipped mbox for old mail.
Re: Mail archives in Git using ssoma
"W. Trevor King" <wk...@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 06:37:04PM +, Eric Wong wrote:
> > Btw, for public-inbox, I'm using git-fast-import now, so imports are
> > a bit faster and $GIT_DIR/ssoma.index is no longer used. This was
> > crucial for getting git@vger archives imported in a reasonable time.
>
> ssoma-mda imports 22k notmuch messages in around 15 minutes (with
> profiling enabled), and:

In contrast, git@vger is around 300K messages. LKML is well into
the millions, and I hope public-inbox (and git!) can handle that one
day, even on cheap hardware (haven't tried).

One problem I noticed with ssoma-mda is that it gets slower as more
messages get imported, since all those files sit in the index, and
the git index format is bad for incremental updates with big, flat
trees.

Big trees are a general problem with git: I'm now storing blob IDs
directly in Xapian and will be using them more to avoid tree
lookups. Tree creation and lookups degrade the same way the index
does as trees get bigger.

Currently it's using 2/38 of the SHA-1 like git loose objects; a
goal might be to move towards supporting 2/2/36 (or deeper) as Jeff
noted substantial object traversal improvements:

https://public-inbox.org/git/20160805092805.w3nwv2l6jkbuw...@sigill.intra.peff.net/

Of course, support for 2/38 will be retained for old
archives/messages.
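To make the 2/38 versus 2/2/36 layouts concrete, here is a minimal
Python sketch of splitting a 40-character SHA-1 hex string into path
components. The function name fanout_path, its widths parameter, and
the sample Message-ID are hypothetical; hashing a Message-ID to get
the hex string is an assumption made for illustration, since the
message above only says the path comes from "the SHA-1".

```python
import hashlib


def fanout_path(hex_id, widths=(2,)):
    """Split a hex digest into directories of the given widths.

    widths=(2,) gives the 2/38 layout, widths=(2, 2) gives 2/2/36.
    """
    parts = []
    pos = 0
    for width in widths:
        parts.append(hex_id[pos:pos + width])
        pos += width
    parts.append(hex_id[pos:])  # whatever remains is the file name
    return "/".join(parts)


# Hypothetical Message-ID, used only to produce a 40-char hex string.
mid = "20160805092805.example@localhost"
hex_id = hashlib.sha1(mid.encode("utf-8")).hexdigest()
print(fanout_path(hex_id))                 # 2/38 layout
print(fanout_path(hex_id, widths=(2, 2)))  # 2/2/36 layout
```

With 2/38 there are at most 256 top-level trees, so each one grows
without bound as messages accumulate; every extra 2-character level
divides the entries per tree by up to 256.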
> $ python -m cProfile -o profile import.py notmuch.mbox
> $ python -c "import pstats; p=pstats.Stats('profile'); p.sort_stats('cumulative').print_stats(10)"
> Sun Aug 21 12:56:49 2016    profile
>
>          101823722 function calls (99078415 primitive calls) in 885.069 seconds
>
>    Ordered by: cumulative time
>    List reduced from 1145 to 10 due to restriction <10>
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>      70/1    0.002    0.000  885.069  885.069 {built-in method exec}
>         1    0.111    0.111  885.069  885.069 /home/wking/src/notmuch/notmuch-archives.git/import.py:9(<module>)
>         1    0.400    0.400  884.915  884.915 /home/wking/src/notmuch/notmuch-archives.git/import.py:17(import_mbox)
>     22875    0.601    0.000  863.371    0.038 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:362(deliver)
>     22875    8.943    0.000  810.459    0.035 /home/wking/src/notmuch/notmuch-archives.git/ssoma_mda.py:207(append)
>     22875    0.418    0.000  308.353    0.013 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:146(write_tree)
>     22875  307.855    0.013  307.855    0.013 {built-in method git_index_write_tree}
>     22874    0.575    0.000  279.293    0.012 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:238(diff_to_tree)
>     22874  278.501    0.012  278.501    0.012 {built-in method git_diff_tree_to_index}

It looks like writing the index is already the slowest, here, in
terms of total time, too. It might be interesting if you profiled
each *-mda invocation to see the degradation from the first to last
message.

>     22875    0.088    0.000   80.413    0.004 /home/wking/.local/lib64/python3.4/site-packages/pygit2/index.py:99(read)
>
> 38 ms per ssoma delivery is probably fast enough, especially if you

Not even close for me :)

> are invoking ssoma-mda once per message, since process setup will take a
> similar amount of time:
>
>   $ time python -c 'print("hello")'
>   hello
>
>   real    0m0.016s
>   user    0m0.013s
>   sys     0m0.003s
>
> It's possible that fast-import would shave a few ms off the pygit2
> addition (I'm not sure, and maybe pygit2 is faster than fast-import).
> But I doubt it matters enough either way to be worth changing unless
> you are dealing with a really large corpus.

One key feature is fast-import avoids writing an index entirely.
I think pygit2 would have to learn that, too.
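The per-delivery figure and the weight of the index operations can be
rechecked from the profile quoted above. The snippet below only redoes
that arithmetic; it assumes the write_tree and diff_to_tree calls do
not nest inside each other, so their cumulative times can be added.

```python
# Figures copied from the cProfile listing quoted in this thread.
deliveries = 22875
deliver_cumtime = 863.371       # ssoma_mda.py:362(deliver), seconds
write_tree_cumtime = 308.353    # pygit2 index.py:146(write_tree), seconds
diff_to_tree_cumtime = 279.293  # pygit2 index.py:238(diff_to_tree), seconds

# Average wall time of one delivery, in milliseconds.
per_delivery_ms = deliver_cumtime / deliveries * 1000

# Share of delivery time spent in the two index-related calls.
index_share = (write_tree_cumtime + diff_to_tree_cumtime) / deliver_cumtime

print(f"{per_delivery_ms:.1f} ms per delivery")
print(f"{index_share:.0%} of delivery time in write_tree + diff_to_tree")
```

This is an average over the whole import; it says nothing about how
the cost grows from the first message to the last, which is the
degradation a per-invocation profile would show.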
Re: Mail archives in Git using ssoma
"W. Trevor King" <wk...@tremily.us> wrote:
> On Fri, Nov 07, 2014 at 11:03:21AM -0800, W. Trevor King wrote:
> > Eric Wong has been working on some tools to store email in a Git
> > repository, and his client-side code is ssoma [1]. I wanted a bit
> > more metadata than the stock ssoma-mda [2], and ended up just
> > writing a ssoma-mda in Python [3]…

Btw, for public-inbox, I'm using git-fast-import now, so imports are
a bit faster and $GIT_DIR/ssoma.index is no longer used. This was
crucial for getting git@vger archives imported in a reasonable time.

public-inbox-* still keeps ssoma.index up-to-date for backwards
compatibility with ssoma, and will probably do so until 2020 or
later (there'll be a few years of deprecation notices).

So I or someone else needs to update Perl ssoma to use fast-import
at some point, too; and I suggest your python version do the same.
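For a Python ssoma-mda, switching to fast-import mostly means writing
a command stream to the stdin of "git fast-import" instead of touching
an index. The sketch below builds one such "commit" command as bytes.
It is not public-inbox's or ssoma's actual code: the function name
fast_import_commit and its parameters are hypothetical, and it assumes
paths without spaces or special characters (such as hex fan-out paths),
since other paths would need quoting.

```python
import time


def fast_import_commit(path, body, subject, ident, when=None,
                       ref="refs/heads/master", parent=None):
    """Return one git-fast-import 'commit' command as bytes.

    path    -- file name inside the tree, e.g. a 2/38 hex path
    body    -- raw message bytes to store at `path`
    subject -- commit message text
    ident   -- committer as 'Name <email>'
    parent  -- optional commit-ish for the 'from' line; needed when
               appending to a branch that already exists in the repo
    """
    when = int(time.time()) if when is None else when
    msg = subject.encode("utf-8")
    out = [
        b"commit " + ref.encode("utf-8") + b"\n",
        b"committer " + ident.encode("utf-8") + b" %d +0000\n" % when,
        b"data %d\n" % len(msg), msg, b"\n",
    ]
    if parent is not None:
        out.append(b"from " + parent.encode("utf-8") + b"\n")
    out += [
        # 'inline' means the blob content follows as a data command
        b"M 100644 inline " + path.encode("utf-8") + b"\n",
        b"data %d\n" % len(body), body, b"\n",
    ]
    return b"".join(out)
```

A caller would keep one long-running "git fast-import" process open
(for example via subprocess.Popen with stdin=PIPE) and write one such
command per message, passing parent="refs/heads/master^0" for the
first commit when restarting an incremental import.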
Re: Mail archives in Git using ssoma (Docker image)
"W. Trevor King" <wk...@tremily.us> wrote:
> On Sun, Aug 21, 2016 at 12:08:52PM +, Eric Wong wrote:
> > "W. Trevor King" <wk...@tremily.us> wrote:
> > > This is the ssoma archive (with the data in it). I just set up a
> > > basic HTTP archive (following [1]) based on a Docker image [2] (Gentoo
> > > doesn't package all the Perl dependencies public-inbox needs).
> >
> > Ugh, that sucks (sorry, not a fan of Docker).
> >
> > What's missing from Gentoo?
>
> Gentoo doesn't package (or I couldn't find the package for)
> Encode::MIME::Header or Mail::Thread. I tried installing things from
> CPAN, but ran into a compile-time error from the ‘cpan’ invocation and
> gave up ;). I can try and reproduce the error if you're curious, but
> I don't have it handy at the moment.

Encode::MIME::Header is distributed with perl itself on Debian and
also the stock upstream install. Not sure if there's an option you
missed or disabled. Which perl version do you use?

Even perl 5.14 on Debian wheezy seems to have it. I actually still
want everything to work on 5.8, since that seems to be the de-facto
baseline in the wild.

Mail::Thread is one .pm, and I'll probably replace it with something
(same algorithm) which can use half the memory by avoiding wrapper
object abstractions (it's probably the biggest memory hog at the
moment).

lib/PublicInbox/Thread.pm already has 3 monkey patches to work
around upstream bugs in Mail::Thread. It's dead upstream, and not
available on FreeBSD, either.

> > > $ git config -f srv/notmuch.git/config publicinbox.http http://tremily.us
> > > $ git config -f srv/notmuch.git/config publicinbox.email notmuch@notmuchmail.org
> >
> > That should probably be:
> >
> >   ; based on your [3]
> >   git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.url http://tremily.us/notmuch
> >
> >   git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.address notmuch@notmuchmail.org
> >
> >   ; this is crucial for all the public-inbox-* tools
> >   git config -f srv/notmuch.git/config \
> >     publicinbox.notmuch.mainrepo /path/to/notmuch.git
>
> I was using these in the Dockerfile's CMD:
>
>   (cd /srv;
>    for NAME in *;
>    do
>      CONF="/srv/${NAME}/config";
>      public-inbox-init "${NAME}" "/srv/${NAME}" $(git config -f "${CONF}" publicinbox.http) $(git config -f "${CONF}" publicinbox.email);
>    done) && …
>
> Are you saying that I can skip the ~/.public-inbox/config entries
> setup by public-inbox-init if I set publicinbox.{name}.* in the ssoma
> repository's config? That would be nice.

Erm, sorry, no, I mean ~/.public-inbox/config as the "git config -f"
arg in the above commands. Your original config was meaningless in
the context of public-inbox itself; I don't recall public-inbox
relying on $GIT_DIR/config much (if at all) outside of standard git
things.

Using ~/.public-inbox/config is required for multi-inbox lookups
(since you normally run MDA w/o args).

You can also override ~/.public-inbox/config by setting the
PI_CONFIG env (like GIT_CONFIG).

> I don't see a point to having {name} in ssoma-config settings though,
> since you're already in a single bucket by that point (using
> publicinbox.{name}.* makes sense in the multi-bucket
> ~/.public-inbox/config).
>
> > > It's not updating automatically yet, but that will probably look
> > > like:
> > >
> > > 1. Pull new mbox [4].
> > > 2. Import into notmuch-archives [5].
> > > 3. Re-run public-inbox-index (this could probably be via ‘docker exec …’).
> > >
> > > But I'll have to test that to confirm. And ideally we'd be using
> > > ssoma-mda or similar directly, instead of going through mbox, but I'd
> > > rather get the official headers on the stored mail than be efficient
> > > ;).
> >
> > For mirroring existing lists, I started using public-inbox-watch
> > which currently watches Maildirs.
>
> If I had a Maildir locally, I'd just use procmail and push new
> messages into ssoma-mda. I'm using the import script because my local
> mail has “how we delivered this to Trevor” headers (which I don't want
> to add) but the downloaded mbox has “how we delivered this to
> notmuch@notmuchmail.org” (which seems like a better fit for a shared
> ssoma repo).

I don't mind extra/different headers. The majority of messages in
public-inbox.org/git/ were delivered to gmane; recent ones are
delivered to me, and some holes were filled in by Jeff King's
archives.