Sounds great Anshum! I also think what you propose might serve an additional purpose beyond CDCR. A persistent/durable queue (esp. Kafka) can also be used to make writes from the external source durable in the search tier (Kafka+SolrCloud) without SolrCloud needing to provide that immediate durability. Today (without Kafka), the immediate durability of an update request to Solr is provided by a distributed UpdateLog. It's worth exploring an option of no UpdateLog[1] -- rely on Kafka for durability. Assuming the client writes directly to the queue, it can return quickly and know the updates won't be lost. Then, on the other side of Kafka, a client can keep Solr up to date.
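A minimal sketch of that flow, using in-memory stand-ins rather than real Kafka or Solr clients. The names `DurableLog`, `Index`, and `drain` are hypothetical and exist only for illustration; they are not part of any Kafka, Solr, or SolrJ API. The producer side only appends and gets an offset back, and a separate consumer applies the records to the index in log order.

```python
from dataclasses import dataclass, field


@dataclass
class DurableLog:
    """Hypothetical in-memory stand-in for one ordered, append-only queue partition."""
    records: list = field(default_factory=list)

    def append(self, doc: dict) -> int:
        # The producer's write is complete once the record is appended.
        # It gets back an offset, not an indexing result.
        self.records.append(doc)
        return len(self.records) - 1


@dataclass
class Index:
    """Hypothetical in-memory stand-in for a search index keyed by document id."""
    docs: dict = field(default_factory=dict)

    def add(self, doc: dict) -> None:
        if "id" not in doc:
            raise ValueError("document has no id")
        self.docs[doc["id"]] = doc


def drain(log: DurableLog, index: Index, start: int = 0):
    """Apply records from offset `start` onward, in log order.

    Returns the next offset to read and the offsets that failed to index.
    """
    failed = []
    for offset in range(start, len(log.records)):
        try:
            index.add(log.records[offset])
        except ValueError:
            # The producer returned long ago, so it never sees this error.
            failed.append(offset)
    return len(log.records), failed


log, index = DurableLog(), Index()
log.append({"id": "1", "title": "first"})
log.append({"title": "bad, no id"})
log.append({"id": "1", "title": "first, updated"})
next_offset, failed = drain(log, index)
```

In this sketch the second record fails only when the consumer applies it, so the failure is recorded on the consumer side and the producer is never told. A real setup would also need to persist `next_offset` so the consumer can resume after a restart.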
Some things are lost with this external queue, whether it is used for CDCR or for what I describe above:
* error notifications when a document can't be indexed (bogus field or bad format).
* any requirements an indexing client might specify for making documents visible (searchable). Perhaps that can be added, with some complexity, by waiting for a commit message to make it completely through.

[1] https://issues.apache.org/jira/browse/SOLR-14778

Also, this short-cuts a lot of complexity in Solr pertaining to the UpdateLog and versions. It provides a total ordering of updates to all replicas, which is a useful property for other things such as synchronizing when segment boundaries occur, which in turn helps make peer-replication-based recovery more efficient.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Fri, Dec 4, 2020 at 11:24 AM Anshum Gupta <[email protected]> wrote:

> Hi everyone,
>
> Large scale Solr installations often require cross data-center replication
> in order to achieve data replication for both, access latency reasons as
> well as disaster recovery. In the past users have either designed their own
> solutions to deal with this or have tried to rely on the now-deprecated
> CDCR.
>
> It would be really good to have support for cross data-center replication
> within Solr, that is offered and supported by the community. This would
> allow the effort around this shared problem to converge.
>
> I’d like to propose a new solution based on my experiences at my day job.
> The key points about this approach:
>
> 1. Uses an external, configurable, messaging system in the middle for
> actual replication/mirroring.
> 2. We offer an abstraction and some default implementations based on
> what we can support and what users really want. An example here would be
> Kafka.
> 3. This would be a separate repository allowing it to have its own
> release cadence. We shouldn’t have to release this with every Solr release
> as the overlap is just limited to SolrJ interactions.
>
> I’ll share a more detailed and evolving document soon with the design for
> everyone else to contribute to but wanted to share this as I’m starting to
> work on this and wanted to avoid parallel efforts towards the same end-goal.
>
> --
> Anshum Gupta
>
