[ https://issues.apache.org/jira/browse/SOLR-17391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871332#comment-17871332 ]

David Smiley commented on SOLR-17391:
-------------------------------------

Hello; thanks for reporting this in such nice detail!  I haven't yet looked at 
this carefully, but I did see that your proposed patch replaces the thread pool 
with a fixed-size one (albeit configurable).  Can't we use a cached pool so 
that we don't hold any threads when there's nothing to do?  Many servers & 
tests are commonly not running any of these "expensive" tasks at any given 
moment.
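A cached pool could keep a configurable ceiling while using zero threads when idle. One sketch of that in plain java.util.concurrent (this assumes nothing about Solr's own ExecutorUtil wrappers, which the real change would presumably go through): set core == max and let core threads time out.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ElasticPoolSketch {
    public static void main(String[] args) {
        int maxThreads = 5; // illustrative; matches the pool.max seen in the metrics below
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,       // core == max: up to 5 concurrent tasks
                60L, TimeUnit.SECONDS,        // idle threads are reclaimed after 60s...
                new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);    // ...including core threads, so the pool drains to 0 when idle
        System.out.println(pool.getPoolSize()); // 0 until work is submitted
        pool.shutdown();
    }
}
```

Unlike the SynchronousQueue-based Executors.newCachedThreadPool, whose thread count is effectively unbounded, this keeps the hard cap while still shrinking to nothing between tasks.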

> Optimize Backup/Restore Operations for Large Collections
> --------------------------------------------------------
>
>                 Key: SOLR-17391
>                 URL: https://issues.apache.org/jira/browse/SOLR-17391
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Backup/Restore
>    Affects Versions: 9.4, 9.5, 9.4.1, 9.6, 9.6.1
>            Reporter: Hakan Özler
>            Priority: Major
>
> The backup/restore performance issue was first reported on [the users 
> mailing list|https://lists.apache.org/thread/ssmzg5nhhxdhgz4980opn1vzxs81o9pk].
>  
> We're experiencing performance issues with backup and restore in recent Solr 
> versions (9.5.0 and 9.6.1). On 9.2.1, we could back up 10 TB of data in just 
> an hour and a half. As of 9.5.0, taking a backup of the same collection takes 
> 7 hours! We're unable to use disaster recovery effectively and reliably in 
> Solr, so 9.2.1 remains the most effective choice among the 9.x versions for 
> our use.
> It seems that this is the ticket causing this issue:
> 1. https://issues.apache.org/jira/browse/SOLR-16879
> Interestingly, we never encountered the throttling problem on 9.2.1 that 
> this change was introduced to solve. From a devops perspective, we have some 
> metrics from these tasks that show the difference between the two versions: 
> overall throughput was 150 MB/s on 9.6.1, versus 500 MB/s on 9.2.1, during 
> the same backup and restore tasks. In the first image below, the peak on the 
> left represents a backup; in the second image, the same backup operation on 
> 9.5.0 uses far less resources. As you may spot, 9.5.0 seems to use a fifth 
> of the resources of 9.2.1. 
>  
> !https://i.imgur.com/aSrs8OM.png!
> Image 1.
> !https://i.imgur.com/aSrs8OM.png!
> Image 2.
>  
> Apart from that, monitoring some relevant metrics during the operations, I 
> had some difficulty interpreting the following metrics:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,{code}
> The pool size was 1 although the pool max size is 5. Shouldn't the pool size 
> be 5, instead? However, there is always one task running on a single node, 
> not 5 concurrently, if I'm not mistaken. 
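That reading appears to be expected ThreadPoolExecutor behavior rather than a bug: with corePoolSize 0 and an unbounded work queue, threads beyond the core are created only when the queue rejects a task, which an unbounded queue never does, so the pool never grows past one worker. A minimal standalone sketch (plain java.util.concurrent, not Solr code) reproducing the metrics above:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSizeDemo {
    public static void main(String[] args) throws InterruptedException {
        // Same shape as the metrics: core 0, max 5, unbounded queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, 5, 60L, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        for (int i = 0; i < 5; i++) {
            pool.execute(() -> {
                try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
            });
        }
        Thread.sleep(200); // let the single worker dequeue its first task
        System.out.println("pool.size = " + pool.getPoolSize());     // 1
        System.out.println("queued    = " + pool.getQueue().size()); // 4
        pool.shutdownNow();
    }
}
```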
> I was also wondering whether the max thread size, currently hard-coded to 5 
> in 9.4+, could be made configurable via either an environment variable or a 
> Java parameter. The part that needs to change seems to be in 
> CoreAdminHandler.java on line 446 [1]. I've made a small adjustment to add a 
> Solr parameter called `solr.maxExpensiveTaskThreads` for those who want a 
> different thread count for expensive tasks. The value given for this 
> parameter must meet ThreadPoolExecutor's criteria, otherwise an 
> IllegalArgumentException will be thrown. I've generated a patch [2], and I 
> would love it if one of the Solr committers would take this on and apply it 
> for the upcoming release. Do you think our observation is accurate, and 
> would this patch be feasible to implement?
>  
> 1. 
> [https://github.com/apache/solr/commit/82a847f0f9af18d6eceee18743d636db7a879f3e#diff-5bc3d44ca8b189f44fe9e6f75af8a5510463bdba79ff72a7d0ed190973a32533L446]
> 2. [https://gist.github.com/ozlerhakan/e4d11bddae6a2f89d2c212c220f4c965] 
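For illustration only (the property name comes from the gist; the actual patch may differ in detail), the change boils down to reading a system property with a default and letting ThreadPoolExecutor's constructor enforce the validity criteria, since it rejects values below 1 with IllegalArgumentException:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExpensiveTaskPoolConfig {
    // Falls back to the hard-coded 5 when -Dsolr.maxExpensiveTaskThreads is unset.
    static int maxExpensiveTaskThreads() {
        return Integer.getInteger("solr.maxExpensiveTaskThreads", 5);
    }

    static ExecutorService newExpensivePool() {
        int n = maxExpensiveTaskThreads();
        // ThreadPoolExecutor's constructor throws IllegalArgumentException
        // for n < 1, which is the "criteria" mentioned above.
        return new ThreadPoolExecutor(n, n, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
    }

    public static void main(String[] args) {
        System.setProperty("solr.maxExpensiveTaskThreads", "3");
        ThreadPoolExecutor pool = (ThreadPoolExecutor) newExpensivePool();
        System.out.println(pool.getMaximumPoolSize()); // 3
        pool.shutdown();
    }
}
```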
>  
> Following up on this, we managed to back up 3 TB of data in 50 minutes with 
> the patch, using `solr.maxExpensiveTaskThreads=5`:
>  
> !https://i.imgur.com/oeCrhLn.png|width=626,height=239!
>  
> I've also answered the questions from @Kevin Liang: 
> {quote}Was this change tested on a cloud that was also taking active 
> ingest/query requests at the same time as the backup? 
> {quote}
> The test was completed on a SolrCloud 9.6.1 cluster with the patch, managed 
> by the official Solr operator on Amazon EKS. The backup strategy is not 
> intended to run frequently; instead, we plan to take backups over a certain 
> period of time, so we don't expect intense search traffic in and out during 
> backups.  
>  
> {quote}This performance is really exciting, but I'm curious how much burden 
> it puts on CPU and memory.
> {quote}
> I'd say that Solr was pretty relaxed during the test, judging by CPU usage. 
> It looks like backup and restore are not CPU-intensive tasks; each node used 
> only one core at a time. 
> !https://i.imgur.com/pEb37nb.png|width=348,height=222!
> !https://i.imgur.com/4aFqJVY.png|width=348,height=238!
> {quote}Also was this just taking a snapshot backup of the segment files or 
> did this also include uploading to S3?
> {quote}
>  
> We're using the recommended backup functionality, where Solr uploads 
> everything to S3 [1]. During the backup and restore operations, the relevant 
> metrics looked like this:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 5,{code}
> Without the patch, the metrics indicated the following behavior:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 5,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,{code}
>  
> With the patch, I believe we've returned to the old 9.2.1 behavior. Setting 
> the parameter to 1 seems to replicate the current 9.6.1 behavior, where the 
> same backup takes 2.5 hours. This makes sense: there was one thread/task 
> running for a shard on every Solr node (each node hosts 5 shards of the 
> collection), with 4 more tasks in the queue:
> {code:java}
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.core: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.max: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.pool.size: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.running: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.active: 1,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.capacity: 2147483644,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.completed: 0,
> ADMIN./admin/cores.threadPool.parallelCoreExpensiveAdminExecutor.tasks.queued: 4{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
