Hello all, Our application involves a high index write rate - anywhere from a few dozen to many thousands of docs per sec. The write rate is frequently higher than the read rate (though not always), and our index must be as fresh as possible (we'd like search results to be no more than a couple of seconds out of date). We're considering many approaches to achieving our desired TCO.
We've noted that the indexing process can be quite costly. Our latest POC shards the total index over N machines which effectively distributes the indexing load and keeps refresh and and search response times decent, but to maintain performance during peak write rates, we've had to make N a much larger number than we'd like. One idea we're floating would be to do all the analysis centrally, perhaps on N/4 machines, and then stream the raw tokens and data directly to the read "slaves," who would (hopefully) need to do nothing more than manage segments and readers. We have some very rough math that makes the approach compelling, but before diving in wholesale, we thought we'd ask if anyone else has taken a similar approach. Thoughts? Sincerely, Cass Costello www.stubhub.com