On 11/28/2016 9:39 AM, Michael Joyner wrote:
> I'm running out of space when trying to restart nodes to get a
> cluster back up fully operational where a node ran out of space during
> an optimize.
>
> It appears to be trying to do a full sync from another node, but
> doesn't take care to check available space before starting downloads
> and doesn't delete the out of date segment files before attempting to
> do the full sync.

If you've run out of space during an optimize, then your Solr install
doesn't have enough disk space for proper operation.  The recommendation
is to have enough disk space to store all your index data three times --
free space should be double the size of all your index data.  Typically
a merge or optimize will only require double the space, but there are
certain worst-case scenarios where it can require triple.  I do not know
what causes the worst-case situation.  This is a Lucene requirement, and
Solr is based on Lucene.
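
As a rough illustration of that rule of thumb, here is a minimal Python
sketch. It is not part of Solr or Lucene; the function names and the
idea of walking the index directory are assumptions made for this
example. It measures how much space an index directory uses and reports
how far the free space falls short of double that size.

```python
import os
import shutil


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full_path = os.path.join(root, name)
            if os.path.isfile(full_path):
                total += os.path.getsize(full_path)
    return total


def free_space_shortfall(index_bytes, free_bytes, multiplier=2):
    """Return how many bytes of free space are missing (0 if enough).

    Assumption: multiplier=2 encodes "free space should be double the
    size of all your index data" from the recommendation above.
    """
    return max(0, multiplier * index_bytes - free_bytes)


def check_index_dir(index_dir):
    """Hypothetical helper: compare an index directory to its partition."""
    index_bytes = directory_size(index_dir)
    free_bytes = shutil.disk_usage(index_dir).free
    return free_space_shortfall(index_bytes, free_bytes)
```

A nonzero result from check_index_dir() would mean the partition holding
that directory is below the recommended headroom for a merge or
optimize.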

The replication feature, which is how SolrCloud accomplishes index
recovery, assumes that the existing index must remain online until the
new index is fully transferred and available, at which time it will
become the live index, and the previous one can be deleted.  This
feature existed long before SolrCloud did.  Standalone mode will not be
disappearing anytime soon, so this assumption must remain.  Writing code
to decide when the existing index doesn't need to be kept would be
somewhat difficult and potentially very fragile.  This doesn't mean we
won't do it, but I think that's why it hasn't already been done.

Also, we still have that general disk space recommendation already
mentioned.  If that recommendation is followed, you're not going to run
out of disk space due to index recovery.

> It seems to know what size the segments are before they are
> transferred, is there a reason a basic disk space check isn't done for
> the target partition with an immediate abort done if the destination's
> space looks like it would go negative before attempting sync? Is this
> something that can be enabled in the master solrconfig.xml file? This
> would be a lot more useful (IMHO) than waiting for a full sync to
> complete only to run out of space after several hundred gigs of data
> is transferred with automatic cluster recovery failing as a result.

Remembering that the replication feature is NOT limited to use by
SolrCloud ... this is not a bad idea.  Because the replication handler
knows what files must be transferred before an index fetch takes place,
it can calculate how much disk space is required, and could return an
error response and ignore the command.  The way that SolrCloud uses
replication may not work with this, though.  SolrCloud replication may
work differently than the automated replication that can be set up in
standalone mode.  I am not sure whether it handles individual files, or
simply requests an index fetch.
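
The check described above could look roughly like the following Python
sketch. This is not Solr's replication handler code and does not use
any Solr API; the file list format, the exception name, and the
reserve_bytes parameter are all hypothetical. It only shows the idea of
summing the sizes of the files to be fetched and refusing to start when
the target partition cannot hold them.

```python
import shutil


class InsufficientDiskSpace(Exception):
    """Raised when a planned fetch would not fit on the target partition."""


def preflight_fetch(file_sizes, free_bytes, reserve_bytes=0):
    """Return the bytes needed, or raise before any transfer starts.

    file_sizes: mapping of file name -> size in bytes (hypothetical
                format for the list of files the fetch would transfer).
    free_bytes: free space on the target partition, in bytes.
    reserve_bytes: extra headroom to keep free after the fetch.
    """
    needed = sum(file_sizes.values()) + reserve_bytes
    if needed > free_bytes:
        raise InsufficientDiskSpace(
            "fetch needs %d bytes but only %d are free" % (needed, free_bytes)
        )
    return needed


def preflight_fetch_for_dir(file_sizes, target_dir, reserve_bytes=0):
    """Same check, reading free space from the partition of target_dir."""
    free_bytes = shutil.disk_usage(target_dir).free
    return preflight_fetch(file_sizes, free_bytes, reserve_bytes)
```

Because the existing index stays online until the new one is fully
transferred, the free space seen by such a check already excludes the
old index files, so no extra allowance for them is needed here.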

But, at the risk of repeating myself ... running with so little free
disk space is not recommended.  The entire problem is avoided by
following recommendations.

Thanks,
Shawn
