On 5/29/2013 1:07 AM, Dotan Cohen wrote:
> In the case of this particular application, reindexing really is
> overly burdensome as the application is performing hundreds of writes
> to the index per minute. How might I gauge how much spare I/O Solr
> could commit to a reindex? All the data that I need is in fact in
> stored fields.
>
> Note that because the social media application that feeds our Solr
> index is global, there are no 'off hours'.
I handle this in a very specific way with my sharded index. This won't
work for all designs, and the precise procedure won't work for SolrCloud.

There is a 'live' and a 'build' core for each of my shards. When I want
to reindex, the program makes a note of my current position for deletes,
reinserts, and new documents. Then I use a DIH full-import from mysql
into the build cores. Once the import is done, I run the update cycle of
deletes, reinserts, and new documents on those build cores, using the
position information noted earlier. Then I swap the cores so the new
index is online.

To adapt this for SolrCloud, I would need to use two collections, and
update a collection alias for what is considered live.

To control the I/O and CPU usage, you might need some kind of throttling
in your update/rebuild application. I don't need any throttling in my
design. Because I'm using DIH, the import only uses a single thread for
each shard on the server. I've got RAID10 for storage and half of the
CPU cores are still available for queries, so it doesn't overwhelm the
server.

The rebuild does lower performance, so I have the other copy of the
index handle queries while the rebuild is underway. When the rebuild is
done on one copy, I run it again on the other copy. Right now I'm
half-upgraded -- one copy of my index is version 3.5.0, the other is
4.2.1. Switching to SolrCloud with sharding and replication would
eliminate this flexibility, unless I maintained two separate clouds.

Thanks,
Shawn
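P.S. For anyone wanting to script the cycle above, here is a rough
sketch of the three HTTP calls involved: the DIH full-import into the
build core, the CoreAdmin SWAP that brings the rebuilt index online, and
the Collections API CREATEALIAS that plays the same role in SolrCloud.
The host/port and the core and collection names ("live", "build") are
placeholders for my setup, not anything Solr requires; the sketch only
constructs the URLs (stdlib urllib) and leaves sending the requests to
you.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed Solr base URL; adjust for your install


def dih_full_import_url(core):
    # Kick off a DataImportHandler full-import into the named (build) core.
    return f"{SOLR}/{core}/dataimport?" + urlencode({"command": "full-import"})


def core_swap_url(core, other):
    # CoreAdmin SWAP exchanges the names of two cores, so the freshly
    # built index answers queries under the 'live' name.
    return f"{SOLR}/admin/cores?" + urlencode(
        {"action": "SWAP", "core": core, "other": other}
    )


def create_alias_url(alias, collection):
    # SolrCloud equivalent: repoint a collection alias at the rebuilt
    # collection instead of swapping cores.
    return f"{SOLR}/admin/collections?" + urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )


if __name__ == "__main__":
    print(dih_full_import_url("build"))
    print(core_swap_url("live", "build"))
    print(create_alias_url("live", "build"))
```

Catching up the build core with the deletes/reinserts/new documents that
arrived during the import is ordinary update traffic against the build
core's /update handler, so it isn't shown here.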