On 11/28/2016 9:39 AM, Michael Joyner wrote:
> I'm running out of spacing when trying to restart nodes to get a
> cluster back up fully operational where a node ran out of space during
> an optimize.
>
> It appears to be trying to do a full sync from another node, but
> doesn't take care to check available space before starting downloads
> and doesn't delete the out of date segment files before attempting to
> do the full sync.

If you've run out of space during an optimize, then your Solr install doesn't have enough disk space for proper operation. The recommendation is to have enough disk space to store all your index data three times -- free space should be double the size of all your index data. Typically a merge or optimize will only require double the space, but there are certain worst-case scenarios where it can require triple. I do not know what causes the worst-case situation. This is a Lucene requirement, and Solr is based on Lucene.

The replication feature, which is how SolrCloud accomplishes index recovery, assumes that the existing index must remain online until the new index is fully transferred and available, at which time it will become the live index, and the previous one can be deleted. This feature existed long before SolrCloud did. Standalone mode will not be disappearing anytime soon, so this assumption must remain.

Writing code to decide when the existing index doesn't need to be kept would be somewhat difficult and potentially very fragile. This doesn't mean we won't do it, but I think that's why it hasn't already been done.

Also, we still have that general disk space recommendation already mentioned. If that recommendation is followed, you're not going to run out of disk space due to index recovery.

> It seems to know what size the segments are before they are
> transferred, is there a reason a basic disk space check isn't done for
> the target partition with an immediate abort done if the destination's
> space looks like it would go negative before attempting sync? Is this
> something that can be enabled in the master solrconfig.xml file? This
> would be a lot more useful (IMHO) than waiting for a full sync to
> complete only to run out of space after several hundred gigs of data
> is transferred with automatic cluster recovery failing as a result.

Remembering that the replication feature is NOT limited to use by SolrCloud ... this is not a bad idea.

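To make the proposed check concrete, here is a rough sketch in Python. It is not Solr code and does not reflect how the replication handler is written; the function names and the 5 percent safety margin are hypothetical, chosen only for illustration. It sums the sizes of the files that would be fetched and compares the total against the free space on the target partition, on the assumption that the existing index stays on disk during the transfer:

```python
import shutil


def bytes_needed(file_sizes, margin_pct=5):
    """Bytes required to fetch every file, plus a hypothetical safety margin.

    file_sizes is an iterable of sizes in bytes; margin_pct is an
    illustrative integer percentage, not a Solr setting.
    """
    total = sum(file_sizes)
    return total + (total * margin_pct) // 100


def fetch_fits(file_sizes, free_bytes, margin_pct=5):
    """Return True if the whole fetch fits in free_bytes.

    The existing index is assumed to remain on disk until the new one
    is complete, so only free space counts toward the budget.
    """
    return bytes_needed(file_sizes, margin_pct) <= free_bytes


def free_bytes_on(path):
    """Free space, in bytes, on the partition that holds path."""
    return shutil.disk_usage(path).free
```

A caller would refuse the fetch up front when `fetch_fits(sizes, free_bytes_on(index_dir))` is false, rather than failing partway through the transfer. Here `sizes` and `index_dir` stand in for whatever the caller already knows about the files to be fetched and where they will land.
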
Because the replication handler knows what files must be transferred before an index fetch takes place, it can calculate how much disk space is required, and could return an error response and ignore the command.

The way that SolrCloud uses replication may not work with this, though. SolrCloud replication may work differently than the automated replication that can be set up in standalone mode. I am not sure whether it handles individual files, or simply requests an index fetch.

But, at the risk of repeating myself ... running with so little free disk space is not recommended. The entire problem is avoided by following recommendations.

Thanks,
Shawn