Re: [PATCH 1/5] v2writable: fix batch size accounting

2020-08-07 Thread Eric Wong
Eric Wong wrote: > We need to account for whether shard parallelization is > enabled or not, since users of parallelization are expected > to have more RAM. > --- > lib/PublicInbox/V2Writable.pm | 10 -- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff

[PATCH 0/2] Perl <5.22 fixes

2020-08-07 Thread Eric Wong
fileno(DIRHANDLE) didn't work until Perl 5.22, so we'll fall back to using some Inline::C for setting No_COW on btrfs and polling for -watch on *BSD with older Perl. Eric Wong (2): support setting No_COW on Perl <5.22 dir_idle: require Perl 5.22+ for kqueue lib/PublicInbox/DirI

[PATCH 1/2] support setting No_COW on Perl <5.22

2020-08-07 Thread Eric Wong
fileno(DIRHANDLE) only works on Perl 5.22+, so we need to use dirfd(3) ourselves from Inline::C (or rely on chattr(1) being installed). While we're at it, rename `set_nodatacow' to `nodatacow_fd' for consistency with `nodatacow_dir'. --- lib/PublicInbox/Msgmap.pm| 2 +- lib/PublicInbox/NDC_P

[PATCH 2/2] dir_idle: require Perl 5.22+ for kqueue

2020-08-07 Thread Eric Wong
IO::KQueue requires us to use fileno(DIRHANDLE) for setting up kqueue watches. This use of fileno() is only supported since Perl 5.22, so BSD users on older Perl will have to fall back to old polling. This affects users of -watch, currently; but will affect other read-only Xapian users soon. ---

[PATCH] favor `getconf _NPROCESSORS_ONLN` over GNU nproc

2020-08-08 Thread Eric Wong
getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not. However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS 7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'. GNU coreutils (and thus `nproc' or `gnproc') are not installed by default on the *BSDs, so we'll try the opt
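
A minimal sketch of the fallback order described above, not the patch itself. The helper name `online_cpus` and the final fallback of 1 are assumptions made for illustration:

```sh
#!/bin/sh
# Sketch: prefer `getconf _NPROCESSORS_ONLN', then GNU nproc/gnproc.
# `online_cpus' is a hypothetical helper name, not taken from the patch.
online_cpus() {
	n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
	n=$(nproc 2>/dev/null) ||
	n=$(gnproc 2>/dev/null) ||
	n=1
	# some getconf implementations print non-numeric text (e.g. "undefined")
	case "$n" in
	''|*[!0-9]*) n=1 ;;
	esac
	echo "$n"
}

ncpu=$(online_cpus)
echo "online CPUs: $ncpu"
```

The result could then be fed to a `-j'-style option; falling back to 1 is only a conservative guess for systems where none of the three commands work.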

[PATCH 03/14] doc: index: more notes about latest changes

2020-08-09 Thread Eric Wong
With LKML on an HDD, a giant --batch-size of 500m ends up being pretty useful. I was able to index LKML in ~16 hours on a system that had other activity on it. The big downside was it was eating up over 5g of RAM :x. We'll also fix up a duplicated indexBatchSize section, fix formatting around gl

[PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior

2020-08-09 Thread Eric Wong
-index now invokes ->DESTROY like xcpdb does, which is necessary to clean up $INBOX_DIR/msgmap-XXX files. We'll also exit with the expected values for various signals by adding 128 as described in -xcpdb now terminates worker processes and xap
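
The "adding 128" convention can be seen from a plain shell. This is only an illustration of the convention with SIGTERM (signal 15), not code from the patch; it assumes a shell such as dash or bash that reports signal deaths as 128 + signo:

```sh
#!/bin/sh
# A child killed by an unhandled SIGTERM: the parent shell sees 128 + 15.
status=0
sh -c 'kill -TERM $$' || status=$?
echo "unhandled SIGTERM: $status"

# A child that traps SIGTERM (e.g. to clean up temporary files) and then
# exits with the same value, so callers cannot tell the difference.
# 143 = 128 + 15; the trailing `exit 1' is never reached.
trapped=0
sh -c 'trap "exit 143" TERM; kill -TERM $$; exit 1' || trapped=$?
echo "trapped SIGTERM: $trapped"
```

Exiting with the same value from the handler means wrapper scripts can treat a cleaned-up exit and a raw signal death alike.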

[PATCH 07/14] avoid File::Temp::tempfile in more places

2020-08-09 Thread Eric Wong
We can use open(..., undef) natively in Perl in t/import.t In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free. --- lib/PublicInbox/V2Writable.pm | 17 + script/public-inbox-init | 9 - t/import.t| 5 ++---

[PATCH 00/14] more indexing related improvements

2020-08-09 Thread Eric Wong
publicInbox.indexSequentialShard now works incrementally -convert also learned all the options -index learned, so it can be less painful on HDDs. Eric Wong (14): index: require --reindex when using --xapian-only index: --sequential-shard works incrementally doc: index: some more notes

[PATCH 09/14] index: cleanup internal variables

2020-08-09 Thread Eric Wong
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that pub

[PATCH 08/14] admin: use a generic variable name

2020-08-09 Thread Eric Wong
We parse other options, too, not just --max-size --- lib/PublicInbox/Admin.pm | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm index af2b3da9..8a9a81c9 100644 --- a/lib/PublicInbox/Admin.pm +++ b/lib/PublicInbox/Admin.pm

[PATCH 02/14] index: --sequential-shard works incrementally

2020-08-09 Thread Eric Wong
We should never reindex all data in Xapian unless --reindex is specified on the command-line. This means users who put publicInbox.indexSequentialShard in their config file won't have to put up with a full reindex at every invocation, only when they specify --reindex. We'll also cleanup the progr

[PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename

2020-08-09 Thread Eric Wong
Trying to use the newer ->sqlite_backup_to_dbh method doesn't seem worth it, as we'll have to support DBD::SQLite <= 1.60 another decade or more. Dumping 'msgmap-XXX' into $INBOX_DIR can appear a bit confusing to users, so give it a "mm_tmp-$PID-" name to emphasize it's a temporary fil

[PATCH 01/14] index: require --reindex when using --xapian-only

2020-08-09 Thread Eric Wong
This is to avoid user error of a currently undocumented switch; since --xapian-only always goes through the full history at the moment. --- script/public-inbox-index | 3 +++ 1 file changed, 3 insertions(+) diff --git a/script/public-inbox-index b/script/public-inbox-index index 73ca2953..9e0907be 1

[PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge

2020-08-09 Thread Eric Wong
These rarely-used commands have some caveats that needed expanding on. --- Documentation/public-inbox-edit.pod | 14 ++ Documentation/public-inbox-purge.pod | 14 ++ Documentation/public-inbox-xcpdb.pod | 15 +-- 3 files changed, 41 insertions(+), 2 deletions(-

[PATCH 13/14] convert: check ARGV more correctly

2020-08-09 Thread Eric Wong
Instead of silently ignoring excessive args, don't let a user specify an extra directory. Furthermore, we'll support the odd case where BOFH wants to name an $INBOX_DIR to be `0' :P --- script/public-inbox-convert | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/script/pub

[PATCH 11/14] convert: support new -index options

2020-08-09 Thread Eric Wong
Converting v1 inboxes to v2 can be a painful experience on HDD. Some of the new options in the CLI or config file make it less painful. --- Documentation/public-inbox-convert.pod | 19 +++ lib/PublicInbox/Admin.pm | 36 script/public-inbox-convert| 77

[PATCH 12/14] convert: speed up --help

2020-08-09 Thread Eric Wong
Lazy-loading dependencies speeds up --help by several hundred milliseconds and is a huge step towards user-friendliness. --- script/public-inbox-convert | 39 ++--- 1 file changed, 19 insertions(+), 20 deletions(-) diff --git a/script/public-inbox-convert b/script/

[PATCH 10/14] searchidx: use singular `$opt' for consistency with v2

2020-08-09 Thread Eric Wong
The rest of our indexing code uses `$opt' instead of `$opts'. --- lib/PublicInbox/SearchIdx.pm | 22 +++--- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index 7f2447fe..5c39f3d6 100644 --- a/lib/PublicIn

[PATCH 14/14] convert: set No_COW on copied SQLite files

2020-08-09 Thread Eric Wong
We'll use our existing logic and use sqlite_backup_from_file, which appeared in 1.39 (along with sqlite_backup_to_file). --- script/public-inbox-convert | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/script/public-inbox-convert b/script/public-inbox-convert index

Re: [PATCH 03/14] doc: index: more notes about latest changes

2020-08-09 Thread Eric Wong
Kyle Meyer wrote: > Eric Wong writes: > > > For L inboxes, this value is > > multiplied by the number of Xapian shards. Thus a typical v2 > > -inbox with 3 shards will flush every 3 megabytes by default. > > - > > -Default: 1m (one megabyte) > >

[PATCH] v2writable: show newline after "indexing all of .. " message

2020-08-11 Thread Eric Wong
Otherwise things get very confusing when verbosity is enabled :x --- lib/PublicInbox/V2Writable.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 72198a298..14997b48e 100644 --- a/lib/PublicInbox/V2Writable.pm

[PATCH 1/6] xapcmd: simplify sub reference

2020-08-12 Thread Eric Wong
We don't need to fully-qualify when referring to subs in the same namespace, nor do we need to make a SCALAR ref only to dereference it (Yes, still learning Perl :x) --- lib/PublicInbox/Xapcmd.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/Publ

[PATCH 5/6] xcpdb: wire up new index options and --help

2020-08-12 Thread Eric Wong
--sequential-shard also disables the copy parallelism (--jobs), so it can be useful for systems unable to handle parallel random I/O but still want many shards. There was a missing "use strict", too, which is fixed. --- Documentation/public-inbox-xcpdb.pod | 19 +++- lib/PublicInbox/Xapcmd.pm

[PATCH 0/6] xcpdb -index improvements

2020-08-12 Thread Eric Wong
Nothing terribly exciting, since xcpdb isn't really used often. But it'd be bad if it flooded the system with many parallel processes on HDD because -index was configured for many small shards. So it now supports --sequential-shard and all the other index options. Eric Wong (6)

[PATCH 4/6] admin: don't warn when --jobs exceeds shards

2020-08-12 Thread Eric Wong
Established tools like make(1), prove(1) and xargs(1) don't warn when the desired parallelism level can't be met, either. --- lib/PublicInbox/Admin.pm | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm index ce720beb6..d99a00

[PATCH 6/6] v2writable: remove IdxStack import

2020-08-12 Thread Eric Wong
We use IdxStack via log2stack() from SearchIdx, now. --- lib/PublicInbox/V2Writable.pm | 1 - 1 file changed, 1 deletion(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index 72198a298..d99e476aa 100644 --- a/lib/PublicInbox/V2Writable.pm +++ b/lib/PublicInbox/V2Writ

[PATCH 3/6] xapcmd: reduce CPU idling when shards exceeds job count

2020-08-12 Thread Eric Wong
In case there's unbalanced shards AND we're limiting parallelism while using many shards, spawn the next task in the queue ASAP once a task is done, instead of waiting for all tasks to finish before spawning the next batch. Unbalanced shards probably isn't a big issue for most users; however many
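
The scheduling idea can be sketched with xargs(1), which starts the next task as soon as a slot frees up. This is an illustration only, not the Xapcmd.pm change; the four sleep durations stand in for hypothetical unbalanced shards, and `-P' is a GNU/BSD extension rather than POSIX:

```sh
#!/bin/sh
# Four unbalanced "shards" (seconds of work each), at most two jobs at once.
# The three short tasks run back-to-back in one slot while the long task
# occupies the other, instead of waiting for a whole batch to finish.
out=$(printf '%s\n' 3 1 1 1 |
	xargs -n 1 -P 2 sh -c 'sleep "$1"; echo "done $1"' sh)
ndone=$(printf '%s\n' "$out" | grep -c '^done')
echo "$ndone tasks finished"
```

With these made-up durations, the queue drains in roughly 3 seconds; running in fixed batches of two (3+1, then 1+1) would take roughly 4.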

[PATCH 2/6] xcpdb: support --no-fsync from CLI

2020-08-12 Thread Eric Wong
This was omitted in 8b1950055d51d436 :x Fixes: 8b1950055d51d436 ("index+xcpdb: rename `--no-sync' to `--no-fsync'") --- script/public-inbox-xcpdb | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/script/public-inbox-xcpdb b/script/public-inbox-xcpdb index fcd961488..2c91598

[TESTING] WIP - parallel shards on HDD with sequential flush

2020-08-12 Thread Eric Wong
Frequent flushing to save RAM with HDD is horrible with the random writes Xapian tends to do, especially when parallelized. I think just making the Xapian commits in sequence while the random reads + in-memory changes are still parallelized is doable, though... With this, --no-fsync may even be detrim

Re: [TESTING] WIP - parallel shards on HDD with sequential flush

2020-08-12 Thread Eric Wong
Eric Wong wrote: > Waiting to see if it slows down as the Xapian DBs get bigger... It does :<

[PATCH] grok-pull.post_update_hook: favor --sequential-shard for HDD

2020-08-13 Thread Eric Wong
--sequential-shard offers better performance on HDD than -j0 since the on-disk active set can be kept small (with -j $HIGH_NUM). --batch-size can also be helpful for systems with much RAM. --- examples/grok-pull.post_update_hook.sh | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --g

[PATCH] index|compact|xcpdb: support --all switch

2020-08-13 Thread Eric Wong
For -index, this is a convenient way to quickly index all inboxes after a grok-pull. Might as well support it for rarely used commands like -compact and -xcpdb, too. --- Documentation/public-inbox-compact.pod | 10 +- Documentation/public-inbox-index.pod | 8 Documentation/pub

[PATCH] doc: add public-inbox-tuning(7) manpage

2020-08-14 Thread Eric Wong
Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P --- Documentation/public-inbox-tuning.pod| 139 +++

Re: Could public-inbox do something helpful with .mailmap?

2020-08-17 Thread Eric Wong
"Eric W. Biederman" wrote: > > I just dug up some old emails and I got at least one persons current > email address wrong because they have changed their email address > frequently. > > They have an update to their preferred email address in the .mailmap > in the linux-kernel source. Is there a

[PATCH] smsg: handle wide characters in raw mail headers

2020-08-19 Thread Eric Wong
There may be messages in the wild with wide characters in headers which aren't RFC2047-encoded. Assume UTF-8 so those fields can round trip through the `ddd' (doc-data-deflated) column of over.sqlite3. This doesn't affect docdata.glass in Xapian (at least not with Search::Xapian), but it does

[PATCH 02/23] admin: progress shows the inbox being indexed

2020-08-20 Thread Eric Wong
This is helpful with --all, or when multiple inboxes are being indexed. --- lib/PublicInbox/Admin.pm | 3 +++ 1 file changed, 3 insertions(+) diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm index d99a00b4..f5427af7 100644 --- a/lib/PublicInbox/Admin.pm +++ b/lib/PublicInbox/Admin

[PATCH 03/23] compact: support --help/-? and perform lazy loading

2020-08-20 Thread Eric Wong
This probably won't be used much, but --help can still make sense. --- script/public-inbox-compact | 39 +++-- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/script/public-inbox-compact b/script/public-inbox-compact index b5fa0086..a6bb62bd 100755 -

[PATCH 01/23] doc: note -compact and -xcpdb are rarely used

2020-08-20 Thread Eric Wong
Slowly improving the learning curve... --- Documentation/public-inbox-compact.pod | 5 + Documentation/public-inbox-xcpdb.pod | 3 +++ 2 files changed, 8 insertions(+) diff --git a/Documentation/public-inbox-compact.pod b/Documentation/public-inbox-compact.pod index 8e463ab1..4e9b6d9f 1006

[PATCH 00/23] indexing: --skip-docdata + speedups

2020-08-20 Thread Eric Wong
e the default. Eric Wong (23): doc: note -compact and -xcpdb are rarely used admin: progress shows the inbox being indexed compact: support --help/-? and perform lazy loading init: support --help and -? init: support --newsgroup option init: drop -N alias for --skip-artnum search: v2: e

[PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata

2020-08-20 Thread Eric Wong
This is a step towards improving kernel page cache hit rates by relying on over.sqlite3 for document data instead of Xapian. Some micro-optimization to over->get_art was required to maintain performance. --- lib/PublicInbox/Over.pm | 18 +- lib/PublicInbox/SearchView.pm | 10

[PATCH 12/23] searchquery: split off from searchview

2020-08-20 Thread Eric Wong
Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups. --- MANIFEST | 1 + lib/PublicInbox/SearchQuery.pm | 53 ++ lib/PublicInbox/SearchView.pm | 53 ++-

[PATCH 08/23] xapcmd: simplify {reindex} parameter passing

2020-08-20 Thread Eric Wong
No need to localize it, here, since we can just refer to it in the `$opt' hashref. Hopefully this improves readability for others like it does for me. I sometimes wonder if the concept of a stack in high-level languages is even necessary... --- lib/PublicInbox/Xapcmd.pm | 20 +---

[PATCH 09/23] www: reduce long-lived PublicInbox::Search references

2020-08-20 Thread Eric Wong
While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results. --- lib/PublicInbox/Inbox.pm | 11

[PATCH 14/23] smsg: reduce utf8::decode call sites

2020-08-20 Thread Eric Wong
Both callers of load_from_data call utf8::decode, so just do utf8::decode in load_from_data. --- lib/PublicInbox/Over.pm | 1 - lib/PublicInbox/Smsg.pm | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index 2b314882..81b9fca7 1

[PATCH 23/23] search: add mset_to_artnums method

2020-08-20 Thread Eric Wong
We can avoid importing mdocid() in several places by using this method, simplifying callers. --- lib/PublicInbox/ExtMsg.pm | 4 +--- lib/PublicInbox/IMAP.pm | 4 +--- lib/PublicInbox/Mbox.pm | 7 ++- lib/PublicInbox/Search.pm | 6 ++ lib/PublicInbox/SearchView.pm | 6 ++

[PATCH 20/23] smsg: remove from_mitem

2020-08-20 Thread Eric Wong
We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field. --- lib/PublicInbox/Sear

[PATCH 06/23] init: drop -N alias for --skip-artnum

2020-08-20 Thread Eric Wong
It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine. --- Documentation/public-inbox-init.pod | 2 +- script/public-inbox-init| 2 +- t/init.t| 4 ++-- 3 files changed, 4 inser

[PATCH 04/23] init: support --help and -?

2020-08-20 Thread Eric Wong
And speed those up with some lazy loading, too. --- script/public-inbox-init | 79 +--- 1 file changed, 50 insertions(+), 29 deletions(-) diff --git a/script/public-inbox-init b/script/public-inbox-init index 1c8066df..6852f64a 100755 --- a/script/public-inbox-

[PATCH 07/23] search: v2: ensure shards are numerically sorted

2020-08-20 Thread Eric Wong
This seems required to correctly get the NNTP article number from Xapian docid on combined Xapian DBs. The default (ASCII-betical) sorting was only acceptable for -imapd users until somebody hit 11 (or more) shards, which is a rare case. --- lib/PublicInbox/Search.pm | 27
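
The sorting problem is easy to reproduce in a shell. The sketch below only uses made-up shard names 0..10 (eleven shards) and is not the Search.pm code:

```sh
#!/bin/sh
# Eleven shard directory names, 0 through 10.
shards=$(printf '%s\n' 0 1 2 3 4 5 6 7 8 9 10)

# ASCII-betical order puts "10" before "2".
ascii=$(printf '%s\n' "$shards" | LC_ALL=C sort | paste -s -d ' ' -)

# Numeric order keeps shard 10 last.
numeric=$(printf '%s\n' "$shards" | sort -n | paste -s -d ' ' -)

echo "ascii:   $ascii"
echo "numeric: $numeric"
```

With ten or fewer shards (0..9) both orders agree, which matches the note that the default sort only breaks once somebody reaches 11 shards.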

[PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly

2020-08-20 Thread Eric Wong
Numbers are hard :< --- t/nntpd-v2.t | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/t/nntpd-v2.t b/t/nntpd-v2.t index 7fc3447e..1dd992a0 100644 --- a/t/nntpd-v2.t +++ b/t/nntpd-v2.t @@ -1,4 +1,4 @@ # Copyright (C) 2019-2020 all contributors # License: AGPL-3.0+

[PATCH 13/23] search: make qparse_new an internal function

2020-08-20 Thread Eric Wong
We'll probably be reusing it from another package in a future commit. --- lib/PublicInbox/Search.pm | 16 ++-- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index f98513d3..e6200bfb 100644 --- a/lib/PublicInbox/Sear

[PATCH 11/23] search: export mdocid subroutine

2020-08-20 Thread Eric Wong
No need to have awkward globrefs for this. --- lib/PublicInbox/IMAP.pm | 3 +-- lib/PublicInbox/Search.pm | 2 ++ 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm index 3d66f930..562c59d4 100644 --- a/lib/PublicInbox/IMAP.pm +++ b/l

[PATCH 19/23] mbox: avoid Xapian docdata in search results

2020-08-20 Thread Eric Wong
Another place where we can reduce kernel page cache overhead by hitting over.sqlite3 instead of docdata.glass. --- lib/PublicInbox/Mbox.pm | 23 --- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm index a83c0356.

[PATCH 16/23] searchview: speed up search summary by ~10%

2020-08-20 Thread Eric Wong
Instead of loading one article at-a-time from over.sqlite3, we can use SQL to mass-load IN (?,?, ...) all results with a single SQLite query. Despite SQLite being in-process and having no network latency, the reduction in SQL query executions from loading multiple rows at once speeds things up sig

[PATCH 22/23] init+index: support --skip-docdata for Xapian

2020-08-20 Thread Eric Wong
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction). --- Documentation/public-inbox-in

[PATCH 05/23] init: support --newsgroup option

2020-08-20 Thread Eric Wong
We can reduce the need to edit the config file for NNTP group names this way. --- Documentation/public-inbox-config.pod | 2 +- Documentation/public-inbox-init.pod | 25 + script/public-inbox-init | 12 ++-- t/imapd.t | 6

[PATCH 18/23] extmsg: avoid using Xapian docdata

2020-08-20 Thread Eric Wong
Once again, over.sqlite3 contains everything necessary for Message-ID resolution. Also, Xapian may be completely unnecessary with the advent of over.sqlite3, but that's for another time. --- lib/PublicInbox/ExtMsg.pm | 21 ++--- 1 file changed, 10 insertions(+), 11 deletions(-) d

[PATCH 10/23] search: improve comments around constants

2020-08-20 Thread Eric Wong
We'll probably be adding more value columns like THREADID to sort on. --- lib/PublicInbox/Search.pm | 63 +-- 1 file changed, 34 insertions(+), 29 deletions(-) diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm index 4d02a7c1..593040a8 100644 --

[PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3

2020-08-20 Thread Eric Wong
git blob retrieval dominates on these, "&x=t" (nested) is roughly the same due to increased overhead for ->get_percent storage balancing out the mass-loading from SQLite. Atom "&x=A" is sped up slightly and uses less memory in the long-lived response. --- lib/PublicInbox/SearchView.pm | 25 ++

Re: [PATCH 05/23] init: support --newsgroup option

2020-08-20 Thread Eric Wong
Eric Wong wrote: > +Some of the options documented in L > +require editing the config file. Old versions lack the > +C<-n>/C<--newsgroup> parameter While working on this, I realized -n vs -N could be confusing, so I made the abbreviation --ng instead. So I'll squ

Re: what storage system(s) are you using?

2020-08-21 Thread Eric Wong
Konstantin Ryabitsev wrote: > On Wed, Aug 05, 2020 at 03:11:27AM +0000, Eric Wong wrote: > > I've been mostly using ext4 on SSDs since I started public-inbox > > and it works well. > > As you know, I hope to move lore.kernel.org to a system with a hybrid > lvm-cache

[PATCH] searchview: fix mbox.gz downloads for lynx users

2020-08-21 Thread Eric Wong
Unlike w3m and links, the lynx browser seems to require a `name' attribute for `' elements. Maybe some other browsers do, too. The `name' attribute for submit elements doesn't seem to cause any harm for w3m or links, users, either; despite not (AFAIK) being part of historical or current HTML spec

[PATCH 0/5] "mairix -t" workalike for mbox.gz downloads

2020-08-21 Thread Eric Wong
her before or after --reindex. That means BOFHs can upgrade without regard to ordering. Tested with w3m, links, and lynx (I actually split out my lynx fix separately): https://public-inbox.org/meta/20200822004125.9458-...@80x24.org/ TODO: CLI tool support, HTML interface, JMAP, etc...

[PATCH 1/5] searchidxshard: clear $msgref buffer properly

2020-08-21 Thread Eric Wong
Merely assigning `undef' to a scalar does not free the underlying buffer memory of a scalar. --- lib/PublicInbox/SearchIdxShard.pm | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm index 20077e08..75521b43 100

[PATCH 2/5] searchidx: put all shard-related stuff in SearchIdxShard.pm

2020-08-21 Thread Eric Wong
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process. --- lib/PublicInbox/SearchIdx.pm | 34 ---

[PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex

2020-08-21 Thread Eric Wong
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This all

[PATCH 4/5] search: support downloading mboxes results with full thread

2020-08-21 Thread Eric Wong
Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz This requires a "public-inbox-index --reindex" to be usable. --- lib

[PATCH 3/5] searchidx: index THREADID in Xapian

2020-08-21 Thread Eric Wong
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things). --- Documentation/standards.perl | 4 lib/PublicInbox/Over.pm | 2 +- lib/PublicInbox/OverIdx.pm| 18 +- lib/PublicInbox/Search.pm | 4

Re: [PATCH 5/5] mbox: disable "&t" on existing Xapian until full reindex

2020-08-21 Thread Eric Wong
Eric Wong wrote: > Expanding threads via over.sqlite3 for mbox.gz downloads without > Xapian effectively collapsing on the THREADID column leads to > repeated messages getting downloaded. > > To avoid that situation, use a "has_threadid" Xapian metadata > flag that's

Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads

2020-08-21 Thread Eric Wong
Eric Wong wrote: > It requires "public-inbox-index --reindex" to activate; > but PATCH 5/5 makes it safe to upgrade WWW either before > or after --reindex. That means BOFHs can upgrade without > regard to ordering. public-inbox-watch users will need to restart -watch be

[PATCH] index: --sequential-shard checkpoints after each shard

2020-08-22 Thread Eric Wong
There's no reason we'd want Xapian to defer flushing once we've indexed everything belonging to a particular shard. --- lib/PublicInbox/V2Writable.pm | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm index b0148

Re: [PATCH 0/5] "mairix -t" workalike for mbox.gz downloads

2020-08-22 Thread Eric Wong
Kyle Meyer wrote: > Eric Wong writes: > > Eric Wong wrote: > >> It requires "public-inbox-index --reindex" to activate; > >> but PATCH 5/5 makes it safe to upgrade WWW either before > >> or after --reindex. That means BOFHs can upgrade without >

[PATCH] examples: add imapd systemd examples

2020-08-23 Thread Eric Wong
We've got examples for all the other daemons, too! --- examples/public-inbox-imap-onion.socket | 12 +++ examples/public-inbox-imapd.socket | 12 +++ examples/public-inbox-imapd@.service| 43 + examples/public-inbox-imaps.socket | 12 +++ 4 files c

[PATCH] searchidx: croak for Xapian DB open failure

2020-08-23 Thread Eric Wong
croak() can give more context on the failure, and setting `PERL5OPT=-MCarp=verbose' can force a stacktrace. --- lib/PublicInbox/SearchIdx.pm | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm index ade55756..3f2da6ab 1

anybody hit SQLite "database is locked" errors?

2020-08-24 Thread Eric Wong
Hey all, I've been reindexing frequently ahead of THREADED changes while another process is doing grok-pull and triggering SQLite reads in the post_update_hook. This problem causes an exception in the --reindex process because a read-only SQLite process is holding a SHARED lock. In particular, th

Re: [PATCH] examples: add imapd systemd examples

2020-08-24 Thread Eric Wong
Also squashed this in before pushing: diff --git a/MANIFEST b/MANIFEST index d86d3b15..35adc8d3 100644 --- a/MANIFEST +++ b/MANIFEST @@ -83,6 +83,10 @@ examples/nginx_proxy examples/public-inbox-config examples/public-inbox-httpd.socket examples/public-inbox-httpd@.service +examples/public-inbo

[PATCH 1/3] over: skip nodatacow on the journal

2020-08-24 Thread Eric Wong
This file gets truncated anyhow, so it won't fragment. --- lib/PublicInbox/Over.pm | 3 --- 1 file changed, 3 deletions(-) diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm index fba58d17..a2f04117 100644 --- a/lib/PublicInbox/Over.pm +++ b/lib/PublicInbox/Over.pm @@ -21,9 +21,6 @@ s

[PATCH 3/3] over+msgmap: respect WAL journal_mode if set

2020-08-24 Thread Eric Wong
WAL actually seems to have ideal locking characteristics given concurrency problems I'm experiencing with --reindex running in parallel with expensive read-only SQLite queries: Unfortunately, we cannot blindly use WAL while preserving comp

[PATCH 2/3] msgmap: use "CREATE TABLE IF NOT EXISTS"

2020-08-24 Thread Eric Wong
It's fewer queries and matches what we do in OverIdx. --- lib/PublicInbox/Msgmap.pm | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm index 7290959d..5b4cebc1 100644 --- a/lib/PublicInbox/Msgmap.pm

[PATCH 0/3] SQLite-related things

2020-08-24 Thread Eric Wong
t by default (see 3/3 for explanation). Eric Wong (3): over: skip nodatacow on the journal msgmap: use "CREATE TABLE IF NOT EXISTS" over+msgmap: respect WAL journal_mode if set lib/PublicInbox/Msgmap.pm | 29 ++--- lib/PublicInbox/Over.pm| 26

[PATCH] grok-pull.post_update_hook: flock(2) before SQLite check

2020-08-25 Thread Eric Wong
Unlike DBD::SQLite, the sqlite3(1) CLI does not have a default busy timeout enabled, so it easily times out while acquiring a SHARED lock for read-only queries. We can avoid battery-wasting polling from the SQLite timeout handler by relying on flock(2) as we do in our Perl code. Furthermore, this
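
A rough sketch of the locking pattern, assuming the util-linux flock(1) utility is installed. The lock path and the placeholder command are hypothetical; the real hook's lock file and sqlite3 query are not shown in this excerpt:

```sh
#!/bin/sh
# Hypothetical lock file for this demo only.
lock=${TMPDIR:-/tmp}/demo-inbox.lock.$$
: >"$lock"

if command -v flock >/dev/null 2>&1; then
	# flock(1) sleeps in the kernel until the lock is granted, so there
	# is no busy-timeout polling; the read-only query would run here.
	result=$(flock "$lock" sh -c 'echo "lock held; run the read-only query here"')
else
	result='flock(1) not installed; skipping'
fi

rm -f "$lock"
echo "$result"
```

Whether the hook should take the lock shared or exclusive depends on what the writer does, so the sketch leaves flock(1) at its default mode.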

[PATCH] doc: 1.6.0 release notes update

2020-08-25 Thread Eric Wong
A few more things happened, here. --- Documentation/RelNotes/v1.6.0.eml | 83 +++ 1 file changed, 74 insertions(+), 9 deletions(-) diff --git a/Documentation/RelNotes/v1.6.0.eml b/Documentation/RelNotes/v1.6.0.eml index 862e1c68..4f72f352 100644 --- a/Documentation/Re

[PATCH] doc: add some more tuning notes

2020-08-25 Thread Eric Wong
I've learned a thing or three about btrfs in the past few weeks and remembered some old HDD things, too. The Xapian MultiDatabase problem will need to be addressed for 1.7... --- Documentation/public-inbox-index.pod | 12 ++-- Documentation/public-inbox-init.pod | 15 +++ D

Re: anybody hit SQLite "database is locked" errors?

2020-08-25 Thread Eric Wong
Eric Wong wrote: > In particular, the expensive (for LKML) `SELECT COUNT(*) FROM msgmap' > statement in our current examples/grok-pull.post_update_hook.sh > seems to be a culprit. That's even opened with the sqlite3 > `-readonly' flag, it still needs to acquire a S

[PATCH] v2writable: compatibility with SWIG Xapian binding

2020-08-25 Thread Eric Wong
The SWIG binding won't auto-convert IV/UV to PV like the XS Search::Xapian binding would, so workaround that shortcoming for now. Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex") --- lib/PublicInbox/V2Writable.pm | 2 +- 1 file changed, 1 insertion(+), 1 deleti

[PATCH 5/5] msgmap: use v5.10.1

2020-08-26 Thread Eric Wong
We use the defined-or (`//', `//=') operators in 5.10, so require 5.10.1 like the rest of our codebase. Update an outdated comment while we're at it. --- lib/PublicInbox/Msgmap.pm | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msg

[PATCH 0/5] some minor SQLite-related cleanups

2020-08-26 Thread Eric Wong
Dropping some expensive dead code and avoiding some overloaded method names to reduce confusion. Eric Wong (5): over: rename ->connect method to ->dbh over: rename ->disconnect to ->dbh_close over: recent: remove expensive COUNT query over*: use v5.10.1, drop warnings

[PATCH 1/5] over: rename ->connect method to ->dbh

2020-08-26 Thread Eric Wong
`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and D

[PATCH 2/5] over: rename ->disconnect to ->dbh_close

2020-08-26 Thread Eric Wong
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues. --- lib/PublicInbox/Over.pm | 4 ++-- lib/PublicInbox/OverIdx.pm| 6 +++--- lib/PublicInbox/V2Writable.pm | 4 ++-- t/over.t | 2 +-

[PATCH 3/5] over: recent: remove expensive COUNT query

2020-08-26 Thread Eric Wong
As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7 ("www: rework query responses to avoid COUNT in SQLite"), COUNT on many rows is expensive on big SQLite DBs. We've already stopped using that code path long ago in WWW while -imapd and -nntpd never used it. So we'll adjust our remaining
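
A minimal sketch of the general idea, not the code from this patch: rather than running COUNT(*) to find out whether more rows exist, fetch one row beyond the page size. It uses Python's sqlite3 module; the `msgs` table, its `num`/`ts` columns, and both function names are hypothetical and chosen only for illustration.

```python
import sqlite3


def recent_page(dbh, limit):
    """Return (rows, has_more) without running COUNT(*).

    Asking for limit + 1 rows reveals whether another page exists
    while only reading the rows that are about to be displayed.
    """
    cur = dbh.execute(
        "SELECT num, ts FROM msgs ORDER BY ts DESC LIMIT ?", (limit + 1,)
    )
    rows = cur.fetchall()
    return rows[:limit], len(rows) > limit


def demo_db(n):
    """Build a throwaway in-memory DB with n rows (hypothetical schema)."""
    dbh = sqlite3.connect(":memory:")
    dbh.execute("CREATE TABLE msgs (num INTEGER PRIMARY KEY, ts INTEGER)")
    dbh.executemany(
        "INSERT INTO msgs (num, ts) VALUES (?, ?)",
        [(i, 1000 + i) for i in range(1, n + 1)],
    )
    return dbh
```

Calling `recent_page(demo_db(5), 3)` yields the three newest rows plus a flag saying a further page exists. The cost stays proportional to the page size rather than the table size, provided the sort column is indexed.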

[PATCH 4/5] over*: use v5.10.1, drop warnings

2020-08-26 Thread Eric Wong
v5.10.1 lets us use the lighter parent.pm instead of base.pm, and we'll rely on the shebang to enable warnings (or not). While we're in the area, drop a no-longer-necessary import for PublicInbox::Search, since OverIdx doesn't require search. --- lib/PublicInbox/Over.pm| 2 +- lib/PublicInbox

[PATCH] search: allow testing with current xapian.git and 1.5.x

2020-08-26 Thread Eric Wong
A `PI_XAPIAN' environment variable is now exposed for testing purposes. We'll also deal with the removal of `NumberValueRangeProcessor' and use `NumberRangeProcessor' in its place, but continue favoring the old Search::Xapian since that's all that's packaged for Debian 10.x stable. --- lib/Public

[PATCH] git: show more context info on failures

2020-08-27 Thread Eric Wong
I'm seeing "read: Connection timed out" in my syslog from -httpd. The fail() calls in PublicInbox::Git seem to be the only code path of ours which could trigger it... ETIMEDOUT shouldn't happen on pipes, only sockets; and all of our socket operations are non-blocking. So this could be cgit

[PATCH 0/8] mostly watch-related odds and ends

2020-08-27 Thread Eric Wong
entation. I tend to get overwhelmed myself when learning new things, too. Moving the watch-only stuff out of the config manpage seems like a step in the right direction in that regard. Eric Wong (8): watchmaildir: ensure I:/W:/E: prefixes in warnings imaptracker: preserve WAL journal_mode if s

[PATCH 1/8] watchmaildir: ensure I:/W:/E: prefixes in warnings

2020-08-27 Thread Eric Wong
For consistency in output, any URL/path-context-dependent prefixes should have the same prefix as the actual warning which triggered it. --- lib/PublicInbox/WatchMaildir.pm | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicI

[PATCH 2/8] imaptracker: preserve WAL journal_mode if set by user

2020-08-27 Thread Eric Wong
It's no problem for most users to enable WAL, here, since there's only a single process doing both reading and writing (unlike the read-only daemons). However, WAL doesn't work on network filesystems, so it can't be enabled by default. --- lib/PublicInbox/IMAPTracker.pm | 7 ++- 1 file change
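
The behavior described above can be sketched roughly as follows, assuming Python's sqlite3 module rather than the project's Perl/DBI code. The `open_tracker` and `current_mode` names and the `truncate` fallback are assumptions for illustration; the preview does not show what the patch itself falls back to.

```python
import os
import sqlite3
import tempfile

ALLOWED_MODES = {"delete", "truncate", "persist"}


def current_mode(dbh):
    """Return the journal mode SQLite reports for this connection."""
    return dbh.execute("PRAGMA journal_mode").fetchone()[0].lower()


def open_tracker(path, default_mode="truncate"):
    """Open a DB, keeping WAL if the user enabled it, never forcing it."""
    if default_mode not in ALLOWED_MODES:
        raise ValueError("unsupported journal mode: %s" % default_mode)
    dbh = sqlite3.connect(path)
    if current_mode(dbh) != "wal":
        # WAL is recorded in the DB file, so a user's choice survives
        # reopening; only non-WAL databases get our (assumed) default.
        dbh.execute("PRAGMA journal_mode = %s" % default_mode).fetchone()
    return dbh


def demo():
    """Return the modes seen for a user-WAL DB and for a fresh DB."""
    with tempfile.TemporaryDirectory() as tmp:
        wal_path = os.path.join(tmp, "wal.sqlite3")
        user = sqlite3.connect(wal_path)
        user.execute("PRAGMA journal_mode = wal").fetchone()
        user.execute("CREATE TABLE t (x)")
        user.commit()
        user.close()

        kept_dbh = open_tracker(wal_path)
        kept = current_mode(kept_dbh)
        kept_dbh.close()

        new_dbh = open_tracker(os.path.join(tmp, "new.sqlite3"))
        fallback = current_mode(new_dbh)
        new_dbh.close()
        return kept, fallback
```

Checking the mode before setting it is what keeps the choice in the user's hands: a database on a local filesystem can be switched to WAL once by hand, while one on a network filesystem is left on a rollback journal.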

[PATCH 3/8] overidx: inline create_ghost sub

2020-08-27 Thread Eric Wong
There's no need for this to be a separate sub since there's only a single caller. This saves a few kilobytes at least in short-lived processes. --- lib/PublicInbox/OverIdx.pm | 23 ++- t/over.t | 4 ++-- 2 files changed, 12 insertions(+), 15 deletions(-) di

[PATCH 4/8] doc: document graceful shutdown signals

2020-08-27 Thread Eric Wong
Same as the read-only daemons. --- Documentation/public-inbox-watch.pod | 5 + 1 file changed, 5 insertions(+) diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod index bf3c9bd4..34e8c4f2 100644 --- a/Documentation/public-inbox-watch.pod +++ b/Documentati

[PATCH 6/8] watch: imap: only remove \Seen spam

2020-08-27 Thread Eric Wong
This matches the behavior of Maildir `watchspam' handling in not removing unseen messages. NNTP can't match this behavior, since NNTP servers don't store flags, clients do. --- lib/PublicInbox/WatchMaildir.pm | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git
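
As a rough illustration of the rule above (not the Perl code in this patch), the filter below only treats a message as removable spam once it carries the seen flag. The function names and the `(uid, flags)` input shape are hypothetical. The Maildir helper relies on the standard `:2,` info suffix, where `S` marks a seen message.

```python
def maildir_flags(filename):
    """Return the set of Maildir flag letters from a `name:2,FLAGS` filename."""
    _, sep, info = filename.rpartition(":2,")
    return set(info) if sep else set()


def maildir_seen(filename):
    """True if the Maildir filename carries the S (seen) flag."""
    return "S" in maildir_flags(filename)


def removable_imap_spam(entries):
    """Keep only UIDs whose IMAP flags include \\Seen.

    `entries` is an iterable of (uid, flags) pairs; IMAP system flags
    are case-insensitive, so compare in lower case.
    """
    return [
        uid
        for uid, flags in entries
        if any(flag.lower() == "\\seen" for flag in flags)
    ]
```

Unseen messages are left alone in both cases, so a message that landed in a spam folder but was never looked at is not removed.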
