Upon further investigation, I believe this is a potentially serious
situation for replication servers.

In SOLR-561, the 'create a new index.<time> folder' concept was introduced,
as I understand it, mainly because Windows locks files/folders that are in
use.
I'm not sure why this is a problem, given that it is only the Solr process
itself that is 'using' these files, so the file handles could simply be
closed (by terminating fSyncService or similar), the old directory deleted,
and the process carried on. That is somewhat by-the-by, though, as the code
is out there now.

The real issue that remains is that whenever the slave feels it needs to do
a full copy, any existing index folders are left behind. For large indexes
and/or long-running slaves, this is a path to disk starvation.
From the admittedly little I know about SnapPuller, I've come up with 2
possible solutions:

1. Change the inherent behaviour as described above, so that only 1 index
folder is ever used (i.e. /index unless an explicit index.properties is
specified).
2. Add an optional parameter that tells the SnapPuller to delete all index*
folders in dataDir except the new 'live' one.

I've modified SnapPuller in our test environment for Option 2, and it works
very well. The change adds an optional str parameter to the /replication
request handler configuration in solrconfig.xml:
   {{<str name="autoCleanOldIndexes">true</str>}}
This parameter is really only relevant for slaves, but maybe there's a use
case for masters.
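For reference, here's roughly where it would sit in a slave's replication
handler config (the masterUrl is just a placeholder, and
{{autoCleanOldIndexes}} is the proposed option rather than anything stock
Solr understands today):

   <requestHandler name="/replication" class="solr.ReplicationHandler">
     <lst name="slave">
       <str name="masterUrl">http://master:8983/solr/replication</str>
       <!-- proposed: remove old index.* folders after a successful full copy -->
       <str name="autoCleanOldIndexes">true</str>
     </lst>
   </requestHandler>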

This option is a little bit 'brute force'ish, and not as elegant as Option
1, but it does have the advantage of being completely transparent if
{{autoCleanOldIndexes}} is not specified.
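
In case it helps to see the shape of the change, below is a rough,
standalone sketch of the cleanup step (not the actual patch - the
class/method names, arguments, and the point at which SnapPuller would
invoke it after the new index goes live are all just illustrative):

   import java.io.File;

   public class OldIndexCleaner {

       /**
        * Deletes every index* directory under dataDir except the one
        * currently in use. Intended to run only after the new index
        * directory is live. Directories are compared by name, on the
        * assumption that both live directly under dataDir.
        */
       public static void cleanOldIndexDirectories(File dataDir, File currentIndexDir) {
           File[] entries = dataDir.listFiles();
           if (entries == null) {
               return; // dataDir missing or unreadable - nothing to do
           }
           for (File entry : entries) {
               if (entry.isDirectory()
                       && entry.getName().startsWith("index")
                       && !entry.getName().equals(currentIndexDir.getName())) {
                   deleteRecursively(entry);
               }
           }
       }

       // Minimal recursive delete; File.delete() only removes files and
       // empty directories, so children must be removed first.
       private static void deleteRecursively(File dir) {
           File[] children = dir.listFiles();
           if (children != null) {
               for (File child : children) {
                   deleteRecursively(child);
               }
           }
           dir.delete();
       }
   }

In the real patch this would of course live inside SnapPuller and only run
when {{autoCleanOldIndexes}} is true.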

If the experts in this area feel it is worthwhile, I can create a JIRA issue
for this and an associated SnapPuller patch. Comments most welcome.

Thanks,
Peter




On Sat, Apr 3, 2010 at 3:32 PM, Peter Sturge <peter.stu...@googlemail.com> wrote:

> Hi,
>
> I've got a question regarding replication index.<timestamp> folders - can
> anyone help?
>
> Note: There is a somewhat related thread here:
>
> http://www.lucidimagination.com/search/document/15a740cca17eed56/solr_1_4_replication_index_directories#e4b0af2f321204d7
>
> I have a replica that is pushed fetchindex commands on a periodic basis
> when it's time to replicate (i.e. replication is managed by the server
> application, not by replica polling).
> The master that is sending these fetchindex commands tells the replica to
> replicate one of its cores, but which core it is changes over time.
> This has the effect of the replica periodically saying: 'oh, these files
> are totally different, I'll create a brand new index.<timestamp> folder,
> upload the master's files to it and reload'. On its own, this is absolutely
> fine.
> The problem is that any previous index folders are left lying around - i.e.
> not deleted, so eventually (quickly for large indexes) the replica runs out
> of disk space.
>
> Is there a way to either tell the replica to always 'reuse' the /index
> folder (ideal) regardless of file name/content, or set its deletionPolicy or
> similar so that it deletes any and all 'old' index.* folders and only keeps
> the current one?
>
>
> Many thanks,
> Peter
>
>
>
