Bertrand Delacretaz wrote:
On 4/28/07, Christoph Kiehl <[EMAIL PROTECTED]> wrote:

...Our current solution is to shut down the
repository for a short time, start the RDBMS backup, and copy the index files.
When index file copying is finished we start up the repository again...

Note that the Lucene-based Solr indexer
(http://lucene.apache.org/solr/) has a clever way of allowing online
backups of Lucene indexes, without having to stop anything (or for a
very short time only).

In short, it works like this:

-Solr can be configured to launch a "snapshotter" script at a point in
time when it's not writing anything to the index.

-The script takes a snapshot of the index files using hard links
(won't work on Windows AFAIK), which is very quick on Unixish
platforms.

-Solr waits until the script is done (a few milliseconds I guess) and
resumes indexing.

-Another asynchronous backup script can then copy the snapshot
anywhere, from the hard linked files, without disturbing Solr.
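To make the hard-link trick concrete, here is a minimal JDK-only sketch (the directory layout and the segment file name are made up for illustration, and this is not Solr's actual snapshooter script) of snapshotting an index directory via hard links:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotSketch {

    // Snapshot every file in indexDir by hard-linking it into snapDir.
    // Hard links are metadata-only operations, so this completes in
    // milliseconds regardless of index size; no segment data is copied.
    static void snapshot(Path indexDir, Path snapDir) throws IOException {
        Files.createDirectories(snapDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.createLink(snapDir.resolve(f.getFileName().toString()), f);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake a tiny "index" with one segment file, snapshot it, read it back.
        Path index = Files.createTempDirectory("index");
        Files.writeString(index.resolve("_0.cfs"), "segment data");
        Path snap = index.resolveSibling(index.getFileName() + "-snapshot");
        snapshot(index, snap);
        System.out.println(Files.readString(snap.resolve("_0.cfs")));
    }
}
```

A separate backup process can then copy the snapshot directory at its leisure; the hard links keep the old segment files alive even if the indexer later deletes them.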

This won't help for the RDBMS part, but implementing something similar
might help for online backups of index files.

See http://wiki.apache.org/solr/CollectionDistribution for more
details - the main goal described there is index replication, but it
obviously works for backups as well.

-Bertrand

Slightly off-thread, but relevant to index backup

-----
Sakai has been using Lucene to provide search indexes in a cluster. We have been using a real-time index distribution mechanism where all nodes can take part in the indexing and all nodes can take part in search delivery. With minor modifications it can work as an indexing farm and a searching farm.

It uses a shim just below the index open/close that manages updates to the cluster's local disks, underneath the IndexReaders and IndexWriters.
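A rough JDK-only sketch of what such a shim might do (this is my illustration, not the Sakai code; the paths and the "newer wins" rule are assumptions): before an IndexReader or IndexWriter is opened, pull any missing or newer segment files from the shared store onto the node's local disk:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDiskShim {

    // Hypothetical sync step run just before an index is opened: copy any
    // segment file from the shared store that is missing locally, or whose
    // shared copy is newer, so every cluster node reads a local index.
    static void syncLocal(Path shared, Path local) throws IOException {
        Files.createDirectories(local);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(shared)) {
            for (Path remote : files) {
                Path mine = local.resolve(remote.getFileName().toString());
                boolean stale = !Files.exists(mine)
                        || Files.getLastModifiedTime(remote)
                                .compareTo(Files.getLastModifiedTime(mine)) > 0;
                if (stale) {
                    Files.copy(remote, mine, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a shared store holding one segment this node doesn't have yet.
        Path shared = Files.createTempDirectory("shared");
        Path local = Files.createTempDirectory("local");
        Files.writeString(shared.resolve("_1.cfs"), "new segment");
        syncLocal(shared, local);
        System.out.println(Files.readString(local.resolve("_1.cfs")));
    }
}
```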

I looked at Nutch and the Nutch file system at the time, but unfortunately we had to reject it because, like Solr, it required Unix setup and system commands, and we needed a 100% Java solution that worked out of the box.

It doesn't do MapReduce, but it does put the indexes locally, and all the nodes are up and running all the time.

The relevant parts of the code tree can be found at

The index factory

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java

And the distribution management

which puts segments in zipped form in a shared location, either DB or filesystem

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
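As an illustration of the zipped-segment idea (again a hypothetical sketch, not the JDBCClusterIndexStore code; the file names are invented), publishing one local segment directory to a shared filesystem location might look like:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class SegmentPublisher {

    // Zip one local segment directory into the shared location, where the
    // other cluster nodes can discover and unpack it onto their local disks.
    static void publish(Path segmentDir, Path sharedZip) throws IOException {
        try (OutputStream out = Files.newOutputStream(sharedZip);
             ZipOutputStream zip = new ZipOutputStream(out);
             DirectoryStream<Path> files = Files.newDirectoryStream(segmentDir)) {
            for (Path f : files) {
                zip.putNextEntry(new ZipEntry(f.getFileName().toString()));
                Files.copy(f, zip);
                zip.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake one local segment, publish it, and list the zip's contents.
        Path seg = Files.createTempDirectory("segment");
        Files.writeString(seg.resolve("_0.fdt"), "stored fields");
        Path sharedZip = seg.resolveSibling("segment0.zip");
        publish(seg, sharedZip);
        try (ZipFile zf = new ZipFile(sharedZip.toFile())) {
            System.out.println(zf.entries().nextElement().getName());
        }
    }
}
```

A side benefit of keeping the zipped segments in one central place is exactly the point below: it doubles as a consistent backup of the index.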


It's definitely not a perfect solution, and I can see it needs lots of improvement, but it works in production.

If Jackrabbit looks really good in a cluster (which I am expecting), we may start putting the indexes directly in Jackrabbit and let it manage the distribution; they are not that big in most cases, generally < 10 GB. (The total data set being indexed will go up to 1 TB at some universities.)


The main point being, the central location provides a convenient place for consistent backups of the index (perhaps it's overkill).

Ian
