[PATCH 03/14] doc: index: more notes about latest changes

Eric Wong Sun, 09 Aug 2020 19:12:51 -0700

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful.  I was able to index LKML in ~16 hours on a
system that had other activity on it.  The big downside was it
was eating up over 5g of RAM :x.


We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
---
 Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/Documentation/public-inbox-index.pod 
b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
 Available in public-inbox 1.6.0 (PENDING).
 
 =item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).
 
 =head1 FILES
 
-For v1 (ssoma) repositories described in L<public-inbox-v1-format>.
+For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
 All public-inbox-specific files are contained within the
 C<$GIT_DIR/public-inbox/> directory.
 
-v2 inboxes are described in L<public-inbox-v2-format>.
+v2 inboxes are described in L<public-inbox-v2-format(5)>.
 
 =head1 CONFIGURATION
 
@@ -168,40 +173,25 @@ L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
 
 Increase this value on powerful systems to improve throughput at
 the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems.  With SSDs, values above
+C<4m> have little benefit.
 
 For L<public-inbox-v2-format(5)> inboxes, this value is
 multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+when unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.
 
-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes.  The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L<public-inbox-mda(1)>,
-L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.
 
 This option is available in public-inbox 1.6 or later.
 public-inbox 1.5 and earlier used the current default, C<1m>.
 
-For L<public-inbox-v2-format(5)> inboxes, this value is
-multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
 Default: 1m (one megabyte)
 
 =item publicinbox.indexSequentialShard
-=item publicinbox.<inbox_name>.indexSequentialShard
 
 For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
 allows indexing Xapian shards in multiple passes.  This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
 L<public-inbox-init(1)> may be required to ensure individual
 shards are small enough to fit into cache.
 
+Warning: interrupting C<public-inbox-index(1)> while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases.  WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal.  Using C<--reindex>
+will bring everything back up-to-date.
+
 Available in public-inbox 1.6.0 (PENDING).
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
 Default: false, shards are indexed in parallel
 
+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L</publicinbox.indexSequentialShard>,
+but only affect the inbox matching E<lt>nameE<gt>.
+
 =back
 
 =head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk.  This environment is handled directly by Xapian, 
refer to
 Xapian API documentation for more details.
 
 For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
-instead.  Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
-may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
-L<public-inbox-watch(1)> tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C<XAPIAN_FLUSH_THRESHOLD> or
+C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
+L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long and unpredictable
+periods of time during C<--reindex>.
 
 Default: none, uses C<publicinbox.indexBatchSize>
 
--
unsubscribe: one-click, see List-Unsubscribe header
archive: https://public-inbox.org/meta/

[PATCH 03/14] doc: index: more notes about latest changes

Reply via email to