To your second question: no. Solr's restore process works by sending a message to each shard leader telling it to fetch and restore that shard's data. Each shard leader pulls its data from the backup repository (S3 in this case) and then sends copies of that data to any other replicas in the shard.
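The restore itself is just a Collections API request. Here's a minimal sketch of what that request looks like (the host, collection, backup, and repository names are hypothetical; "s3" assumes a repository with that name configured in solr.xml):

```python
from urllib.parse import urlencode

# Hypothetical names: collection "techproducts", backup "nightly",
# and a backup repository named "s3" configured in solr.xml.
params = {
    "action": "RESTORE",
    "name": "nightly",            # name of the backup to restore from
    "collection": "techproducts", # collection to restore into
    "repository": "s3",           # repository defined in solr.xml
    "location": "/backups",       # path/prefix within the repository
}
url = "http://solr:8983/solr/admin/collections?" + urlencode(params)
print(url)
```

Sending a GET to that URL kicks off the leader-driven fetch described above; the client never moves index data itself.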
To use a concrete example: let's say your 5TB collection has three shards with 2 replicas per shard, each on its own Kube pod. Each shard leader will pull its share of the 5TB backup from S3 (roughly ~1.66TB on each of the 3 shard leaders). Once each shard leader has the data, it sends its ~1.66TB to every other replica in the shard it's responsible for. So the total network traffic for this layout would be 5TB between Solr and S3, plus 5TB between Solr nodes within the cluster. With 3 replicas per shard, there would be 10TB of traffic between Solr nodes, and so on.

To complicate the picture a bit: these are upper bounds on the amount of network traffic. Starting in Solr 8.9, backups are smart enough to fetch data incrementally, so unless your restoration target is totally empty, you should see much less Solr<-->S3 and Solr<-->Solr traffic. (The actual amount depends on how similar the current index is to the backed-up copy.)

Hope that helps!

Best,

Jason

On Mon, Nov 1, 2021 at 1:17 PM Houston Putman <[email protected]> wrote:

> To answer your first question, yes, the S3BackupRepository connects
> directly to S3. There is no need to have any shared storage. The next
> version of the Solr Operator (v0.5.0) will actually make this very easy to
> enable on Kubernetes clusters, such as EKS.
>
> I am not sure about the answer to your second question.
>
> - Houston
>
> On Thu, Oct 14, 2021 at 4:00 AM Tomer Y. <[email protected]> wrote:
>
> > Hello,
> >
> > This is the first time I'm sending a message to this user list; any help
> > will be appreciated, and we're also open to (paid) consultancy.
> >
> > We are looking to deploy SolrCloud 8.10 into an EKS cluster.
> > Normally, you'd need a shared volume between all Solr nodes, because
> > every node/pod needs access to the data being restored. This can be
> > solved using any NFS (EFS or File Gateway), or by replicating an EBS
> > volume per node in the cluster and attaching one to each.
> >
> > My question is whether it's possible, using the S3BackupRepository, to
> > skip EFS/File Gateway and have each Solr node communicate directly
> > with S3.
> >
> > If the answer is yes, then a follow-up question: our backup is about
> > 5TB. Does this mean that each of the nodes in the cluster will need to
> > fetch 5TB from S3?
> >
> > Thank you
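For reference, the traffic arithmetic from Jason's example can be sketched as a tiny model (a hedged upper bound: sizes in TB, assuming a full non-incremental restore with the backup split evenly across shard leaders):

```python
def restore_traffic_tb(backup_tb: float, replicas_per_shard: int):
    """Upper-bound network traffic for a full (non-incremental) restore.

    Each shard leader pulls its slice of the backup from S3 (backup_tb
    total across all leaders), then forwards that slice to each of the
    other replicas in its shard.
    """
    s3_traffic = backup_tb
    intra_cluster = backup_tb * (replicas_per_shard - 1)
    return s3_traffic, intra_cluster

print(restore_traffic_tb(5, 2))  # (5, 5)  -> the 2-replica example above
print(restore_traffic_tb(5, 3))  # (5, 10) -> the 3-replica case
```

Incremental restores (Solr 8.9+) would reduce both numbers, depending on how much of the index is already present on the target.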
