[ https://issues.apache.org/jira/browse/SOLR-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871790#comment-17871790 ]

David Smiley commented on SOLR-17391:
-------------------------------------

For collectorExecutor (parallel segment search) – I agree! CC [~cpoerschke] 
[~ishan]
_(an aside: the name of that field is very non-obvious. IMO it should have been 
named searcherCollectorExecutor.)_

This should also improve the "replayUpdatesExecutor" in CoreContainer in a 
minor way, since for that one, queueSize == threads. Previously, if there were 
4 docs to add and 4 threads, it would have queued them all and not run any in 
parallel. In practice there are many more docs in the update log, so all 
available threads do get used once the short queue fills.  I enhanced the test 
on this PR for this case.  Come to think of it, this feature could use a 
nominal queue size of 1, because it's gated by a semaphore.
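
To see the queueSize == threads pitfall concretely, here is a minimal JDK-only 
sketch (plain ThreadPoolExecutor with illustrative sizes, not Solr's 
ExecutorUtil wrapper): threads beyond the core are created only once the queue 
is full, so with a queue as large as the thread budget, early tasks wait behind 
a single core thread instead of running in parallel.

{code:java}
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueBeforeSpawn {
  public static void main(String[] args) throws InterruptedException {
    // 1 core thread, up to 4 threads, queue capacity 4 (queueSize == threads).
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        1, 4, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(4));

    for (int i = 0; i < 4; i++) {
      pool.execute(() -> {
        try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
      });
    }
    Thread.sleep(100); // let the pool settle

    // Threads beyond the core are created only when the queue is FULL,
    // so three of the four tasks wait behind the single core thread.
    System.out.println("pool size: " + pool.getPoolSize());     // 1
    System.out.println("queued:    " + pool.getQueue().size()); // 3
    pool.shutdown();
  }
}
{code}

With a queue size of 1, the fourth task would instead push the pool toward its 
maximum, which is the behavior one usually wants here.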

> Optimize Backup/Restore Operations for Large Collections
> --------------------------------------------------------
>
>                 Key: SOLR-17391
>                 URL: https://issues.apache.org/jira/browse/SOLR-17391
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>    Affects Versions: 9.4, 9.5, 9.4.1, 9.6, 9.6.1
>            Reporter: Hakan Özler
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The backup/restore performance issue was first reported on [the users 
> mailing list|https://lists.apache.org/thread/ssmzg5nhhxdhgz4980opn1vzxs81o9pk].
>  
> We're experiencing performance issues with backup and restore in recent Solr 
> versions, 9.5.0 and 9.6.1. In 9.2.1, we could take a backup of 10TB of data 
> in just an hour and a half. Currently, as of 9.5.0, taking a backup of the 
> collection takes 7 hours! We're unable to make use of disaster recovery 
> effectively and reliably in Solr. Therefore, Solr 9.2.1 remains the most 
> effective choice among the 9.x versions for our use.
> It seems that this is the ticket that introduced the issue:
> 1. https://issues.apache.org/jira/browse/SOLR-16879
> Interestingly, on 9.2.1 we never encountered the throttling problem that this 
> change was introduced to solve. From a devops perspective, we have some 
> details and metrics on these tasks to distinguish the difference between the 
> two versions. Overall disk throughput was 150 MB/s on 9.6.1, while it was 
> 500 MB/s on 9.2.1 during the same backup and restore tasks. In the first 
> image below, the peak on the left represents a backup; in contrast, the 
> second image shows the same backup operation on 9.5.0 using less resource. 
> As you may spot, 9.5.0 seems to be using a fifth of the resources of 9.2.1. 
>  
> !https://i.imgur.com/aSrs8OM.png!
> Image 1.
> !https://i.imgur.com/aSrs8OM.png!
> Image 2.
>  
> Apart from that, while monitoring some relevant metrics during the 
> operations, I had difficulty interpreting the following:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,{code}
> The pool size was 1 although the pool max size is 5. Shouldn't the pool size 
> be 5 instead? As it stands, there is always one task running on a single 
> node, not 5 concurrently, if I'm not mistaken. 
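> For what it's worth, that reading matches standard 
> java.util.concurrent.ThreadPoolExecutor semantics: with corePoolSize 0 and an 
> effectively unbounded work queue, the queue is never "full", so the pool 
> never grows toward its max; a single thread is spawned just to keep the queue 
> draining. A minimal JDK sketch (illustrative sizes, not the actual Solr 
> executor):
> {code:java}
> import java.util.concurrent.LinkedBlockingQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
> 
> public class CoreZeroUnboundedQueue {
>   public static void main(String[] args) throws InterruptedException {
>     // core = 0, max = 5, unbounded queue -- mirroring pool.core/pool.max above.
>     ThreadPoolExecutor pool = new ThreadPoolExecutor(
>         0, 5, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
> 
>     for (int i = 0; i < 5; i++) {
>       pool.execute(() -> {
>         try { Thread.sleep(1000); } catch (InterruptedException ignored) {}
>       });
>     }
>     Thread.sleep(100);
> 
>     // The unbounded queue is never full, so only one thread is ever created.
>     System.out.println("pool.size: " + pool.getPoolSize());     // 1
>     System.out.println("queued:    " + pool.getQueue().size()); // 4
>     pool.shutdown();
>   }
> }
> {code}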
> I was also wondering if the max thread size, which is currently 5 in 9.4+, 
> could be made configurable with either an environment variable or a Java 
> parameter. The part that needs to change seems to be in CoreAdminHandler.java 
> on line 446 [1]. I've made a small adjustment to add a Solr parameter called 
> `solr.maxExpensiveTaskThreads` for those who want to set a different thread 
> count for expensive tasks (a rough sketch of the idea follows the links 
> below). The number given in this parameter must meet the criteria of 
> ThreadPoolExecutor, otherwise an IllegalArgumentException will occur. I've 
> generated a patch [2], and I would love to see someone from the Solr 
> committers take this on and apply it for the upcoming release. Do you think 
> our observation is accurate, and would this patch be feasible to implement?
>  
> 1. 
> [https://github.com/apache/solr/commit/82a847f0f9af18d6eceee18743d636db7a879f3e#diff-5bc3d44ca8b189f44fe9e6f75af8a5510463bdba79ff72a7d0ed190973a32533L446]
> 2. [https://gist.github.com/ozlerhakan/e4d11bddae6a2f89d2c212c220f4c965] 
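> For reference, a minimal JDK-only sketch of the idea (not the actual patch or 
> CoreAdminHandler code; only the property name `solr.maxExpensiveTaskThreads` 
> comes from the gist above):
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> 
> public class ExpensiveTaskPoolConfig {
>   // Illustrative sketch only. Falls back to the current hard-coded 5; a
>   // value < 1 makes the underlying ThreadPoolExecutor constructor throw
>   // IllegalArgumentException, as noted above.
>   static ExecutorService newExpensiveTaskExecutor() {
>     int maxThreads = Integer.getInteger("solr.maxExpensiveTaskThreads", 5);
>     return Executors.newFixedThreadPool(maxThreads);
>   }
> }
> {code}
> The property would then be set at JVM startup, e.g. 
> `-Dsolr.maxExpensiveTaskThreads=10` via SOLR_OPTS.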
>  
> As a follow-up, we managed to back up 3TB of data in 50 minutes with the 
> patch, using `solr.maxExpensiveTaskThreads=5`:
>  
> !https://i.imgur.com/oeCrhLn.png|width=626,height=239!
>  
> I also answered the questions from @Kevin Liang:
> {quote}Was this change tested on a cloud that was also taking active 
> ingest/query requests as the same time as the backup? 
> {quote}
> The test was completed on a SolrCloud 9.6.1 cluster (with the patch) managed 
> by the official Solr Operator on Amazon EKS. Backups are not intended to 
> happen frequently. Instead, we plan to take backups over a certain period of 
> time, so we don't expect intense search traffic in and out during backups.  
>  
> {quote}This performance is really exciting, but I'm curious how much burden 
> it puts on CPU and memory.
> {quote}
> I'd say that Solr was pretty relaxed during the test, based on the CPU usage. 
> It looks like backup and restore are not CPU-intensive tasks. Each node used 
> only one core at a time. 
> !https://i.imgur.com/pEb37nb.png|width=348,height=222!
> !https://i.imgur.com/4aFqJVY.png|width=348,height=238!
> {quote}Also was this just taking a snapshot backup of the segment files or 
> did this also include uploading to S3?
> {quote}
>  
> We're using the recommended backup functionality, where Solr uploads 
> everything to S3 [1]. During backup and restore operations, the relevant 
> metrics looked like this:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 5,{code}
> Without the patch, the metrics indicated the following behavior:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,{code}
>  
> Given that we have the patch, I believe we've returned to the old 9.2.1 
> behavior. Setting the parameter to 1 seems to replicate the current 9.6.1 
> behavior, where the same backup takes 2.5 hours. This is clear: there was one 
> thread/task running for a shard on every Solr node, as each node hosts 5 
> shards of the collection, and there were 4 more tasks in the queue:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.active: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.capacity: 2147483644,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.completed: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.queued: 4{code}
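> (Aside: a tasks.capacity so close to Integer.MAX_VALUE is what an effectively 
> unbounded LinkedBlockingQueue reports, which is consistent with the pool 
> never growing past its core size, as sketched earlier. A quick JDK check, for 
> illustration:)
> {code:java}
> import java.util.concurrent.LinkedBlockingQueue;
> 
> public class QueueCapacityCheck {
>   public static void main(String[] args) {
>     // A LinkedBlockingQueue built without a capacity argument is bounded
>     // only by Integer.MAX_VALUE (2147483647): effectively unbounded, so it
>     // never fills and never triggers growth beyond the core pool size.
>     System.out.println(new LinkedBlockingQueue<>().remainingCapacity());
>   }
> }
> {code}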


