Hi Dave! Quoth David Mazieres on Apr 23 at 2:00 am: > Gaute Hope <eg at gaute.vetsj.com> writes: > > > A db-tick or a _good_ ctime solution can as far as I can see solve both > > David M's (correct me if I am wrong) and my purposes, as well as > > probably have more use cases in the future. It would even be an > > interesting direct search: show me everything that changed lately, > > sorted. > > I could live with a db-tick scheme. I would prefer a ctime scheme, > since then I can answer questions such as "what has changed in the last > five minutes"? I mean all kinds of other stuff starts to break if your > clock goes backwards on a mail server machine, not the least of which is > that incremental backups will fail silently, so you risk losing your > mail. > > A middle ground might be to use the maximum of two values: 1) the > time-of-day at which notmuch started executing, and 2) the highest ctime > in the database plus 100 microseconds (leaving plenty of slop to store > timestamps as IEEE doubles with 52 significant bits). Since the values > will be Btree-indexed, computing the max plus one will be cheap.
This makes me curious if you've considered how to fit this in to Xapian. The Xapian query syntax supports range queries over document "values", but within the Xapian B-tree, values are stored in docid order, not value order, so Xapian's range query operator is actually a full scan in implementation. I assume it does this so it doesn't have to store both forward and inverse indexes of values. (I spent some time figuring out the layout of the Xapian database and have fairly detailed notes if anyone's curious.) This is still reasonably fast in practice because it's a sequential scan and only requires a few bytes per message, but it's probably not what you'd expect. That said, Xapian does track per-value statistics that would suffice for the particular problem of monotonic time stamps (e.g., Database::get_value_upper_bound). In principle it would be possible to use user metadata or even document terms to support true B-tree range scans by ctime order, but I don't think it's possible to express queries over this using Xapian's query parser. I've written about 90% of a (new) custom query parser for Notmuch that would enable this, but little things like my looming thesis deadline have interfered with me finishing it. > Incidentally, if you are really this paranoid about time stamps, it > should bother you that notmuch's directory timestamps only have one > second granularity. It's not that hard to get a new message delivered > in the same second that notmuch new finished running. In my > synchronizer, I convert st_mtim (a struct timespec) into a double and > keep that plus size in the database to decide if I need to re-hash > files. But for directories, I'm stuck with NOTMUCH_VALUE_TIMESTAMP, > which are quantized to the second. (Ironically, I think > Xapian::sortable_serialize converts time_ts to doubles anyway, so > avoiding st_mtim is not really helping performance.) This is historical (and, I agree, unfortunate). But nobody's complained, so it hasn't been worth changing the libnotmuch interface to support sub-second directory mtimes. However, notmuch new does correctly handle deliveries in the same second it runs. If the wall-clock time when it starts is the same as the on-disk directory mtime, it skips updating the in-database directory mtime at the end. Hence, on the next run, it will still consider the directory out-of-date. It's a bit of a hack, but it's a hack that would be necessary for supporting older file systems even if we did support sub-second timestamps.