Bertrand Delacretaz wrote:
On 4/28/07, Christoph Kiehl <[EMAIL PROTECTED]> wrote:
...Our current solution is to shut down the
repository for a short time, start the RDBMS backup, and copy the index
files.
When index file copying is finished, we start up the repository again...
Note that the Lucene-based Solr indexer
(http://lucene.apache.org/solr/) has a clever way of allowing online
backups of Lucene indexes, without having to stop anything (or for a
very short time only).
In short, it works like this:
-Solr can be configured to launch a "snapshotter" script at a point in
time when it's not writing anything to the index.
-The script takes a snapshot of the index files using hard links
(won't work on Windows AFAIK), which is very quick on Unixish
platforms.
-Solr waits until the script is done (a few milliseconds I guess) and
resumes indexing.
-Another asynchronous backup script can then copy the snapshot
anywhere, from the hard linked files, without disturbing Solr.
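The hard-link trick above can be done from pure Java as well, via java.nio.file. This is a minimal sketch of the idea, not Solr's actual snapshotter script (which is a shell script); the class and method names are illustrative:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class IndexSnapshot {

    // Hard-link every file in indexDir into a fresh snapshot directory.
    // No data is copied, so this takes milliseconds even for large
    // indexes; the indexer only needs to pause while it runs. A backup
    // job can later copy the snapshot anywhere at its leisure.
    static Path snapshot(Path indexDir, Path snapshotRoot) throws IOException {
        Path snapDir = snapshotRoot.resolve("snapshot-" + System.currentTimeMillis());
        Files.createDirectories(snapDir);
        try (Stream<Path> files = Files.list(indexDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(file)) {
                    // createLink(link, existing) makes a hard link,
                    // so both directories must be on the same filesystem
                    Files.createLink(snapDir.resolve(file.getFileName()), file);
                }
            }
        }
        return snapDir;
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempDirectory("index");
        Files.writeString(index.resolve("segments_1"), "segment data");
        Path snap = snapshot(index, Files.createTempDirectory("snaps"));
        // The linked file shares the same bytes as the original
        System.out.println(Files.readString(snap.resolve("segments_1")));
    }
}
```

As noted above, hard links are a Unix-filesystem feature, so this won't help on platforms (or filesystems) that lack them.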
This won't help for the RDBMS part, but implementing something similar
might help for online backups of index files.
See http://wiki.apache.org/solr/CollectionDistribution for more
details - the main goal described there is index replication, but it
obviously works for backups as well.
-Bertrand
Slightly off thread, but relevant to index backup
-----
Sakai has been using Lucene to provide search indexes in a cluster. We
have been using a real-time index distribution mechanism where all nodes
can take part in the indexing and all nodes can take part in the search
delivery. With minor modifications it can work as an indexing farm and
a searching farm.
It uses a shim just below the index open/close that manages updates to
the cluster's local disks, just below the IndexReaders and IndexWriters.
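To make the shim idea concrete, here is a hypothetical sketch (the interface and method names are illustrative, not the actual Sakai API): before a node opens the index it pulls new segments from the shared store to local disk, and after a writer closes it publishes the segments it changed.

```java
public class ClusterShimSketch {

    // Hypothetical shim interface sitting just below index open/close.
    interface IndexStorage {
        void syncLocalCopy();    // pull newer segments to this node's local disk
        void publishChanges();   // push updated segments to the shared store
    }

    // All indexing work is bracketed by the shim, so every node's local
    // disk copy stays consistent with the cluster without any Unix tools.
    static void withWriter(IndexStorage storage, Runnable indexingWork) {
        storage.syncLocalCopy();   // make the local index current first
        indexingWork.run();        // ... IndexWriter updates happen here ...
        storage.publishChanges();  // share the new segments with the cluster
    }

    public static void main(String[] args) {
        StringBuilder log = new StringBuilder();
        IndexStorage storage = new IndexStorage() {
            public void syncLocalCopy() { log.append("sync;"); }
            public void publishChanges() { log.append("publish;"); }
        };
        withWriter(storage, () -> log.append("index;"));
        System.out.println(log);
    }
}
```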
I looked at Nutch and the Nutch file system at the time, but
unfortunately we had to reject it because, like Solr, it required Unix
setup and system commands, and we needed a 100% Java solution that worked
out of the box.
It doesn't do MapReduce, but it does put the indexes locally, and all
the nodes are up and running all the time.
The relevant parts of the code tree can be found at:
The index factory:
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java
And the distribution management, which puts segments in zipped form in a
shared location, either DB or filesystem:
https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
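The "segments in zipped form in a shared location" step might look roughly like the following. This is a sketch of the approach using only the JDK, not the linked Sakai code; it targets a shared filesystem path, whereas the real JDBCClusterIndexStore can also write the archive to a database:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class SegmentPublisher {

    // Zip a local index segment directory and write the archive to a
    // shared location, where other cluster nodes (or a backup job) can
    // pick it up. The central copy doubles as a consistent backup.
    static Path publish(Path segmentDir, Path sharedDir) throws IOException {
        Path zipFile = sharedDir.resolve(segmentDir.getFileName() + ".zip");
        try (OutputStream out = Files.newOutputStream(zipFile);
             ZipOutputStream zip = new ZipOutputStream(out);
             Stream<Path> files = Files.list(segmentDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                zip.putNextEntry(new ZipEntry(file.getFileName().toString()));
                Files.copy(file, zip);  // stream the segment file into the archive
                zip.closeEntry();
            }
        }
        return zipFile;
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempDirectory("segment-0001");
        Files.writeString(segment.resolve("_0.cfs"), "segment bytes");
        Path shared = Files.createTempDirectory("shared");
        Path zip = publish(segment, shared);
        System.out.println(Files.exists(zip) && Files.size(zip) > 0);
    }
}
```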
It's definitely not a perfect solution, and I can see it needs lots of
improvement, but it works in production.
If Jackrabbit looks really good in a cluster (which I am expecting), we
may start putting the indexes directly in Jackrabbit and let it manage
the distribution; they are not that big in most cases, generally < 10 GB.
(The total data set being indexed will go up to 1 TB at some universities.)
The main point being, the central location provides a convenient place
for consistent backups of the index (perhaps it's overkill).
Ian