[ https://issues.apache.org/jira/browse/SOLR-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14640575#comment-14640575 ]

Timothy Potter commented on SOLR-7820:
--------------------------------------

Thanks for the feedback ... this actually came up in a production installation 
I worked on ... they had 1.4TB of indexes (oversharded on a node) and that node 
went down. When it came back, Solr decided all shards had to be fully copied 
over because they were too far out-of-date with the leader. The node could 
never recover because they didn't have another 1.4TB of SSD allocated on that 
node. Granted this is an extreme case. The interesting thing here is that node 
wasn't offline for very long, so I was surprised to see it need a full copy.

Part of this is bad design in that they shouldn't have oversharded the nodes as 
much given their space limitations.

I'm wondering if we can compute the space needed for an incoming full index 
copy of a shard and, if that space isn't available, skip the copy. Of course 
that's harder to do when oversharding. But to me that's better than running the 
disk out of space just to keep failing to recover.
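A pre-flight check along those lines could be sketched roughly as follows. This is a minimal illustration, not actual Solr code: it assumes the leader can report its index size in bytes, and both the `hasRoomForFullCopy` method and the 10% headroom figure are hypothetical.

```java
import java.io.File;

// Hypothetical pre-flight check: before starting a full index fetch,
// verify the target volume has room for the leader's reported index size.
public class FreeSpaceCheck {

    // Require some headroom beyond the raw index size (transient files
    // during the fetch, merges, etc.). The 10% figure is an assumption.
    static final double HEADROOM = 1.10;

    /** Returns true if the directory's volume can hold expectedBytes plus headroom. */
    public static boolean hasRoomForFullCopy(File indexDir, long expectedBytes) {
        long usable = indexDir.getUsableSpace(); // free bytes visible to this JVM
        return usable >= (long) (expectedBytes * HEADROOM);
    }

    public static void main(String[] args) {
        // With expectedBytes = 0 the check trivially passes.
        System.out.println(hasRoomForFullCopy(new File("."), 0L)); // prints "true"
    }
}
```

If the check fails, the replica could surface a clear "insufficient disk space to recover" state instead of thrashing until the disk fills.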

I also want to put some more energy into trying to avoid a full copy because in 
my case, the node that went down wasn't out of sync with the leader by more 
than a couple thousand docs per shard, so the fact that Solr wanted to do a 
full copy of 1.4TB of indexes because a few thousand docs were missing sounds 
like the real culprit in my case.

> IndexFetcher should delete the current index directory before downloading the 
> new index when isFullCopyNeeded==true
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-7820
>                 URL: https://issues.apache.org/jira/browse/SOLR-7820
>             Project: Solr
>          Issue Type: Improvement
>          Components: replication (java)
>            Reporter: Timothy Potter
>
> When a replica is trying to recover and its IndexFetcher decides it needs to 
> pull the full index from a peer (isFullCopyNeeded == true), then the existing 
> index directory should be deleted before the full copy is started to free up 
> disk to pull a fresh index, otherwise the server will potentially need 2x the 
> disk space (old + incoming new). Currently, the IndexFetcher removes the 
> index directory after the new one is downloaded; however, once the fetcher 
> decides a full copy is needed, what is the value of the existing index? It's 
> clearly out-of-date and should not serve queries. Since we're deleting data 
> preemptively, maybe this should be an advanced configuration property, only 
> to be used by those that are disk-space constrained (which I'm seeing more 
> and more with people deploying high-end SSDs - they typically don't have 2x 
> the disk capacity required by an index).
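The reordering the issue proposes, delete the stale index first and only then download, could be sketched like this. Again a hypothetical illustration rather than Solr's implementation; `fetchFullIndexInto` stands in for the actual transfer from the leader.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;

// Hypothetical order of operations: when a full copy is unavoidable,
// reclaim the stale index's disk space *before* the download starts,
// so peak usage stays near 1x the index size instead of 2x.
public class DeleteBeforeFetch {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) return;
        try (var paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder())  // children before parents
                 .forEach(p -> p.toFile().delete());
        }
    }

    public static void recoverWithFullCopy(Path indexDir) throws IOException {
        deleteRecursively(indexDir);      // free the old index's space up front
        Files.createDirectories(indexDir);
        // fetchFullIndexInto(indexDir);  // then pull the fresh index from the leader
    }
}
```

The trade-off, which is why the issue suggests gating this behind an advanced configuration property, is that if the fetch subsequently fails, the replica is left with no index at all rather than a stale one.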



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
