On 5/29/2013 1:07 AM, Dotan Cohen wrote:
> In the case of this particular application, reindexing really is
> overly burdensome as the application is performing hundreds of writes
> to the index per minute. How might I gauge how much spare I/O Solr
> could commit to a reindex? All the data that I need is in fact in
> stored fields.
> 
> Note that because the social media application that feeds our Solr
> index is global, there are no 'off hours'.

I handle this in a very specific way with my sharded index.  This won't
work for all designs, and the precise procedure won't work for SolrCloud.

There is a 'live' core and a 'build' core for each of my shards.  When
I want to reindex, the program records its current position for
deletes, reinserts, and new documents.  Then I use a DIH full-import
from MySQL into the build cores.  Once the import is done, I run the
update cycle of deletes, reinserts, and new documents on those build
cores, using the position information recorded earlier.  Then I swap
the cores so the new index is online.
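
For anyone who wants to script that last step, here's a rough SolrJ
sketch of the swap (the URL and core names are placeholders, and this
assumes the 4.x SolrJ API -- it is not my actual code):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class SwapCores {
    public static void main(String[] args) throws Exception {
        // CoreAdmin requests go to the container URL, not to a core.
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        // The build core was just filled by a DIH full-import, e.g.
        //   http://localhost:8983/solr/s1build/dataimport?command=full-import
        // SWAP is atomic, so queries never see a half-built index.
        CoreAdminRequest.swapCore("s1live", "s1build", server);

        server.shutdown();
    }
}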

To adapt this for SolrCloud, I would need to use two collections and
update a collection alias to point at whichever one is currently live.
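
The Collections API handles the alias part over plain HTTP, so a
sketch like this would re-point the alias after each rebuild (the
alias and collection names here are hypothetical):

import java.io.InputStream;
import java.net.URL;

public class UpdateAlias {
    public static void main(String[] args) throws Exception {
        // CREATEALIAS also re-points an alias that already exists, so
        // running it again after each rebuild is all that's needed.
        String url = "http://localhost:8983/solr/admin/collections"
                   + "?action=CREATEALIAS&name=live&collections=build_b";

        try (InputStream in = new URL(url).openStream()) {
            while (in.read() != -1) { /* drain; an HTTP error status
                                         surfaces as an IOException */ }
        }
    }
}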

To control the I/O and CPU usage, you might need some kind of throttling
in your update/rebuild application.
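
One way to do that, if your updates go through SolrJ, is to index in
fixed-size batches and sleep between them.  This is just a sketch
under those assumptions -- the batch size and pause would need tuning
against your own I/O headroom:

import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ThrottledIndexer {
    private final HttpSolrServer build;
    private final int batchSize;
    private final long pauseMillis;

    public ThrottledIndexer(String buildCoreUrl, int batchSize, long pauseMillis) {
        this.build = new HttpSolrServer(buildCoreUrl);
        this.batchSize = batchSize;
        this.pauseMillis = pauseMillis;
    }

    // Index docs in fixed-size batches, pausing between batches so the
    // rebuild leaves disk and CPU headroom for live queries.
    public void index(List<SolrInputDocument> docs) throws Exception {
        for (int i = 0; i < docs.size(); i += batchSize) {
            build.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
            Thread.sleep(pauseMillis);  // crude but effective throttle
        }
        build.commit();
    }
}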

I don't need any throttling in my design.  Because I'm using DIH, the
import uses only a single thread per shard on the server.  I've got
RAID10 storage, and half of the CPU cores are still available for
queries, so the rebuild doesn't overwhelm the server.

The rebuild does lower performance, so I have the other copy of the
index handle queries while the rebuild is underway.  When the rebuild is
done on one copy, I run it again on the other copy.  Right now I'm
half-upgraded -- one copy of my index is version 3.5.0, the other is
4.2.1.  Switching to SolrCloud with sharding and replication would
eliminate this flexibility, unless I maintained two separate clouds.

Thanks,
Shawn
