Manybubbles has uploaded a new change for review. https://gerrit.wikimedia.org/r/105091
Change subject: Build fixed size chunks when indexing
......................................................................

Build fixed size chunks when indexing

Because indexing leaks memory even while queueing we need a way to build
fixed size subprocesses for reindexing. This changes the default behavior
of --buildChunks to do just that. You can still get the old behavior with
--buildChunks 10total.

Also, update documentation.

Bug: 59164
Change-Id: I2babb38c1b1e20c8bf121b83ac0ec8ec35bdd1b6
---
M README
M maintenance/forceSearchIndex.php
2 files changed, 32 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/CirrusSearch refs/changes/91/105091/1

diff --git a/README b/README
index 15bafd3..b821473 100644
--- a/README
+++ b/README
@@ -34,27 +34,29 @@
 Bootstrapping large wikis
 -------------------------
-The --batch-size parameter controls the number documents read from MySQL and indexed into elasticsearch at
-one time. It defaults to 50 but you should feel free to play with it. Too low causes transport overhead (sql
-and Elasticsearch) but too high can make the process feel sluggish and can get you into trouble with the
-OOM Killer.
+Since most of the load involved in indexing is parsing the pages in PHP we provide a few options to split the
+work into multiple processes. Don't worry too much about the database during this process. It can generally
+handle more processes indexing than you are likely to be able to spawn.
 
-forceSeachIndex.php accepts the --fromId and --toId parameters which can be used to split up the
-work of bootstrapping the wiki into multiple processes. Since most of the load on search indexing is on the
-indexing script in the php process you should be able to break the process into multiple chunks and farm
-them out to multiple php processes/machines. The --buildChunks argument of forceSearchIndex.php will cause
-the script to build invocations of itself that you can splay out to different processes. For example:
+The --fromId and --toId parameters let you break the job into smaller chunks or restart the process after ctrl-c-ing it.
+
+The --buildChunks parameter lets you build either chunks of a fixed size or a fixed number of chunks. Since the
+process of indexing leaks an unfortunate amount of memory, we suggest using fixed size chunks:
 
  rm -rf /tmp/index_log
  mkdir /tmp/index_log
- php forceSearchIndex.php --buildChunks 10 --forceUpdate --skipLinks --indexOnSkip --batch-size 100 |
+ php forceSearchIndex.php --buildChunks 10000 --forceUpdate --skipLinks --indexOnSkip |
  xargs -I{} -t -P4 sh -c 'php {} > /tmp/index_log/$$.log'
 
-forceSearchIndex.php also accepts --queue which can be used to enqueue all of its work onto the job queue
-rather than running in process. They will then be processed by whatever job queue consumers you have set
-up. Keep in mind that each one of these jobs can be slow and that adding all of them can cause a significant
-backlog on the queue. This is really only a good idea if you have a very robust job queue infrastructure.
-This mechanism is more susceptible to the OOM Killer so you should make sure to keep the batch sizes low.
-The default is fine.
+forceSearchIndex.php also accepts --queue which can be used to enqueue all of its work onto the job queue rather
+than running in process. The jobs will then be processed by whatever job queue consumers you have set up. This is
+really only a good idea if you have a very robust job queue infrastructure.
+
+The --batch-size parameter controls the number of documents read from MySQL and indexed into Elasticsearch at
+one time. It defaults to 10 when parsing and 500 when not, and those values work pretty well. Feel free to play
+with it but remember that setting it too high while parsing will make each batch take _forever_. That is really
+bad on the job queue because those jobs will just be killed. When not parsing, a large batch size is pretty
+efficient because Elasticsearch does a good job parallelizing the link counts and index operations that take up
+the bulk of the time. The defaults are generally good for this.
 
 Handling elasticsearch outages
 ------------------------------

diff --git a/maintenance/forceSearchIndex.php b/maintenance/forceSearchIndex.php
index b3de409..b622327 100644
--- a/maintenance/forceSearchIndex.php
+++ b/maintenance/forceSearchIndex.php
@@ -62,8 +62,11 @@
 		$this->addOption( 'toId', 'Stop indexing at a specific page_id. Note useful with --deletes or --from or --to.', false, true );
 		$this->addOption( 'deletes', 'If this is set then just index deletes, not updates or creates.', false );
 		$this->addOption( 'limit', 'Maximum number of pages to process before exiting the script. Default to unlimited.', false, true );
-		$this->addOption( 'buildChunks', 'Instead of running the script spit out N commands that can be farmed out to ' .
-			'different processes or machines to rebuild the index. Works with fromId and toId, not from and to.', false, true );
+		$this->addOption( 'buildChunks', 'Instead of running the script spit out commands that can be farmed out to ' .
+			'different processes or machines to rebuild the index. Works with fromId and toId, not from and to. ' .
+			'If specified as a number then chunks no larger than that size are spat out. If specified as a number ' .
+			'followed by the word "total" without a space between them then that many chunks will be spat out, sized ' .
+			'to cover the entire wiki.', false, true );
 		$this->addOption( 'forceUpdate', 'Blindly upload pages to Elasticsearch whether or not it already has an up ' .
 			'to date copy. Not used with --deletes.' );
 		$this->addOption( 'queue', 'Rather than perform the indexes in process add them to the job queue. Ignored for delete.' );
@@ -387,7 +390,7 @@
 		return $result;
 	}
 
-	private function buildChunks( $chunks ) {
+	private function buildChunks( $buildChunks ) {
 		$dbr = $this->getDB( DB_SLAVE );
 		if ( $this->toId === null ) {
 			$this->toId = $dbr->selectField( 'page', 'MAX(page_id)' );
@@ -405,7 +408,14 @@
 		if ( $fromId === $this->toId ) {
 			$this->error( "Couldn't find any pages to index. fromId = $fromId = $this->toId = toId.", 1 );
 		}
-		$chunkSize = max( 1, ceil( ( $this->toId - $fromId ) / $chunks ) );
+		$fixedChunkSize = strpos( $buildChunks, 'total' ) === false;
+		$buildChunks = intval( $buildChunks );
+		print $buildChunks . ' ' . $fixedChunkSize . "\n";
+		if ( $fixedChunkSize ) {
+			$chunkSize = $buildChunks;
+		} else {
+			$chunkSize = max( 1, ceil( ( $this->toId - $fromId ) / $buildChunks ) );
+		}
 		for ( $id = $fromId; $id < $this->toId; $id = $id + $chunkSize ) {
 			$chunkToId = min( $this->toId, $id + $chunkSize );
 			$this->output( $this->mSelf );

-- 
To view, visit https://gerrit.wikimedia.org/r/105091
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2babb38c1b1e20c8bf121b83ac0ec8ec35bdd1b6
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/CirrusSearch
Gerrit-Branch: master
Gerrit-Owner: Manybubbles <never...@wikimedia.org>