Manybubbles has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/105091


Change subject: Build fixed size chunks when indexing
......................................................................

Build fixed size chunks when indexing

Because indexing leaks memory even while queueing, we need a way to build
fixed size subprocesses for reindexing.  This changes the default behavior
of --buildChunks to do just that.  You can still get the old behavior with
--buildChunks 10total.
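
For example (the chunk size and chunk count below are only illustrative):
 php forceSearchIndex.php --buildChunks 10000    # chunks of at most 10000 page ids
 php forceSearchIndex.php --buildChunks 10total  # 10 chunks sized to cover the whole wiki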

Also, update documentation.

Bug: 59164
Change-Id: I2babb38c1b1e20c8bf121b83ac0ec8ec35bdd1b6
---
M README
M maintenance/forceSearchIndex.php
2 files changed, 32 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/91/105091/1

diff --git a/README b/README
index 15bafd3..b821473 100644
--- a/README
+++ b/README
@@ -34,27 +34,29 @@
 
 Bootstrapping large wikis
 -------------------------
-The --batch-size parameter controls the number documents read from MySQL and indexed into elasticsearch at
-one time.  It defaults to 50 but you should feel free to play with it.  Too low causes transport overhead (sql
-and Elasticsearch) but too high can make the process feel sluggish and can get you into trouble with the
-OOM Killer.
+Since most of the load involved in indexing is parsing the pages in php we provide a few options to split the
+process into multiple processes.  Don't worry too much about the database during this process.  It can generally
+handle more processes indexing than you are likely to be able to spawn.
 
-forceSeachIndex.php accepts the --fromId and --toId parameters which can be used to split up the
-work of bootstrapping the wiki into multiple processes.  Since most of the load on search indexing is on the
-indexing script in the php process you should be able to break the process into multiple chunks and farm
-them out to multiple php processes/machines.  The --buildChunks argument of forceSearchIndex.php will cause
-the script to build invocations of itself that you can splay out to different processes.  For example:
+The --fromId and --toId parameters let you break the job into smaller chunks or restart the process after ctrl-c-ing it.
+
+The --buildChunks parameter lets you either build chunks of a fixed size or a fixed number of chunks.  Since the
+process of indexing leaks an unfortunate amount of memory, we suggest using fixed size chunks:
  rm -rf /tmp/index_log
  mkdir /tmp/index_log
- php forceSearchIndex.php --buildChunks 10 --forceUpdate --skipLinks --indexOnSkip --batch-size 100 |
+ php forceSearchIndex.php --buildChunks 10000 --forceUpdate --skipLinks --indexOnSkip |
    xargs -I{} -t -P4 sh -c 'php {} > /tmp/index_log/$$.log'
 
-forceSearchIndex.php also accepts --queue which can be used to enqueue all of its work onto the job queue
-rather than running in process.  They will then be processed by whatever job queue consumers you have set
-up.  Keep in mind that each one of these jobs can be slow and that adding all of them can cause a significant
-backlog on the queue.  This is really only a good idea if you have a very robust job queue infrastructure.
-This mechanism is more susceptible to the OOM Killer so you should make sure to keep the batch sizes low.
-The default is fine.
+forceSearchIndex.php also accepts --queue which can be used to enqueue all of its work onto the job queue rather
+than running in process.  The jobs will then be processed by whatever job queue consumers you have set up.  This is
+really only a good idea if you have a very robust job queue infrastructure.
+
+The --batch-size parameter controls the number of documents read from MySQL and indexed into Elasticsearch at
+one time.  It defaults to 10 when parsing and 500 when not, and those values work pretty well.  Feel free to play
+with it but remember that setting it too high while parsing will make each batch take _forever_.  This is really
+bad on the job queue because the jobs will just be killed.  When not parsing, a large batch size is pretty efficient
+because Elasticsearch does a good job parallelizing the link counts and index operations that take up the bulk
+of the time when not parsing.  The defaults are generally good for this.
 
 Handling elasticsearch outages
 ------------------------------
diff --git a/maintenance/forceSearchIndex.php b/maintenance/forceSearchIndex.php
index b3de409..b622327 100644
--- a/maintenance/forceSearchIndex.php
+++ b/maintenance/forceSearchIndex.php
@@ -62,8 +62,11 @@
                $this->addOption( 'toId', 'Stop indexing at a specific page_id.  Note useful with --deletes or --from or --to.', false, true );
                $this->addOption( 'deletes', 'If this is set then just index deletes, not updates or creates.', false );
                $this->addOption( 'limit', 'Maximum number of pages to process before exiting the script. Default to unlimited.', false, true );
-               $this->addOption( 'buildChunks', 'Instead of running the script spit out N commands that can be farmed out to ' .
-                       'different processes or machines to rebuild the index.  Works with fromId and toId, not from and to.', false, true );
+               $this->addOption( 'buildChunks', 'Instead of running the script spit out commands that can be farmed out to ' .
+                       'different processes or machines to rebuild the index.  Works with fromId and toId, not from and to.  ' .
+                       'If specified as a number then chunks no larger than that size are spat out.  If specified as a number ' .
+                       'followed by the word "total" without a space between them, then that many chunks will be spat out, sized to ' .
+                       'cover the entire wiki.', false, true );
                $this->addOption( 'forceUpdate', 'Blindly upload pages to Elasticsearch whether or not it already has an up ' .
                        'to date copy.  Not used with --deletes.' );
                $this->addOption( 'queue', 'Rather than perform the indexes in process add them to the job queue.  Ignored for delete.' );
@@ -387,7 +390,7 @@
                return $result;
        }
 
-       private function buildChunks( $chunks ) {
+       private function buildChunks( $buildChunks ) {
                $dbr = $this->getDB( DB_SLAVE );
                if ( $this->toId === null ) {
                        $this->toId = $dbr->selectField( 'page', 'MAX(page_id)' );
@@ -405,7 +408,14 @@
                if ( $fromId === $this->toId ) {
                        $this->error( "Couldn't find any pages to index.  fromId = $fromId = $this->toId = toId.", 1 );
                }
-               $chunkSize = max( 1, ceil( ( $this->toId - $fromId ) / $chunks ) );
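+               // A bare number means fixed size chunks; a number followed by "total" means that many total chunks.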
+               $fixedChunkSize = strpos( $buildChunks, 'total' ) === false;
+               $buildChunks = intval( $buildChunks );
+               if ( $fixedChunkSize ) {
+                       $chunkSize = $buildChunks;
+               } else {
+                       $chunkSize = max( 1, ceil( ( $this->toId - $fromId ) / $buildChunks ) );
+               }
                for ( $id = $fromId; $id < $this->toId; $id = $id + $chunkSize ) {
                        $chunkToId = min( $this->toId, $id + $chunkSize );
                        $this->output( $this->mSelf );

-- 
To view, visit https://gerrit.wikimedia.org/r/105091
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2babb38c1b1e20c8bf121b83ac0ec8ec35bdd1b6
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <never...@wikimedia.org>

_______________________________________________
MediaWiki-commits mailing list
MediaWiki-commits@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
