Bertrand Delacretaz wrote:
On 4/28/07, Christoph Kiehl <[EMAIL PROTECTED]> wrote:

...Our current solution is to shut down the
repository for a short time, start the RDBMS backup, and copy the index files.
When index file copying is finished we start up the repository again...

Note that the Lucene-based Solr indexer
(http://lucene.apache.org/solr/) has a clever way of allowing online
backups of Lucene indexes, without having to stop anything (or for a
very short time only).

In short, it works like this:

-Solr can be configured to launch a "snapshotter" script at a point in
time when it's not writing anything to the index.

-The script takes a snapshot of the index files using hard links
(won't work on Windows AFAIK), which is very quick on Unixish
platforms.

-Solr waits until the script is done (a few milliseconds I guess) and
resumes indexing.

-Another asynchronous backup script can then copy the snapshot
anywhere, from the hard linked files, without disturbing Solr.
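To make the hard-link trick concrete, here is a minimal JDK-only sketch (the directory layout and the segment file name are made up for illustration, and this is not Solr's actual snapshooter script) of snapshotting an index directory via hard links:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotSketch {

    // Snapshot every file in indexDir by hard-linking it into snapDir.
    // Hard links are metadata-only operations, so this completes in
    // milliseconds regardless of index size; no segment data is copied.
    static void snapshot(Path indexDir, Path snapDir) throws IOException {
        Files.createDirectories(snapDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir)) {
            for (Path f : files) {
                if (Files.isRegularFile(f)) {
                    Files.createLink(snapDir.resolve(f.getFileName().toString()), f);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake a tiny "index" with one segment file, snapshot it, read it back.
        Path index = Files.createTempDirectory("index");
        Files.writeString(index.resolve("_0.cfs"), "segment data");
        Path snap = index.resolveSibling(index.getFileName() + "-snapshot");
        snapshot(index, snap);
        System.out.println(Files.readString(snap.resolve("_0.cfs")));
    }
}
```

A separate backup process can then copy the snapshot directory at its leisure; the hard links keep the old segment files alive even if the indexer later deletes them.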

This won't help for the RDBMS part, but implementing something similar
might help for online backups of index files.

See http://wiki.apache.org/solr/CollectionDistribution for more
details - the main goal described there is index replication, but it
obviously works for backups as well.

-Bertrand

Slightly off-thread, but relevant to index backup

-----
Sakai has been using Lucene to provide search indexes in a cluster. We have been using a real-time index distribution mechanism where all nodes can take part in the indexing and all nodes can take part in search delivery. With minor modifications it can work as an indexing farm and a searching farm.

It uses a shim just below the index open/close that manages updates to the cluster's local disks, underneath the IndexReaders and IndexWriters.
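A rough JDK-only sketch of what such a shim might do (this is my illustration, not the Sakai code; the paths and the "newer wins" rule are assumptions): before an IndexReader or IndexWriter is opened, pull any missing or newer segment files from the shared store onto the node's local disk:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDiskShim {

    // Hypothetical sync step run just before an index is opened: copy any
    // segment file from the shared store that is missing locally, or whose
    // shared copy is newer, so every cluster node reads a local index.
    static void syncLocal(Path shared, Path local) throws IOException {
        Files.createDirectories(local);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(shared)) {
            for (Path remote : files) {
                Path mine = local.resolve(remote.getFileName().toString());
                boolean stale = !Files.exists(mine)
                        || Files.getLastModifiedTime(remote)
                                .compareTo(Files.getLastModifiedTime(mine)) > 0;
                if (stale) {
                    Files.copy(remote, mine, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a shared store holding one segment this node doesn't have yet.
        Path shared = Files.createTempDirectory("shared");
        Path local = Files.createTempDirectory("local");
        Files.writeString(shared.resolve("_1.cfs"), "new segment");
        syncLocal(shared, local);
        System.out.println(Files.readString(local.resolve("_1.cfs")));
    }
}
```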

I looked at Nutch and the Nutch file system at the time, but unfortunately we had to reject it because, like Solr, it required Unix setup and system commands, and we needed a 100% Java solution that worked out of the box.

It doesn't do MapReduce, but it does put the indexes locally, and all the nodes are up and running all the time.

The relevant parts of the code tree can be found at

The index factory

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/ClusterFSIndexStorage.java

And the distribution management

which puts segments in zipped form in a shared location, either DB or filesystem

https://source.sakaiproject.org/svn//search/trunk/search-impl/impl/src/java/org/sakaiproject/search/index/impl/JDBCClusterIndexStore.java
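As an illustration of the zipped-segment idea (again a hypothetical sketch, not the JDBCClusterIndexStore code; the file names are invented), publishing one local segment directory to a shared filesystem location might look like:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

public class SegmentPublisher {

    // Zip one local segment directory into the shared location, where the
    // other cluster nodes can discover and unpack it onto their local disks.
    static void publish(Path segmentDir, Path sharedZip) throws IOException {
        try (OutputStream out = Files.newOutputStream(sharedZip);
             ZipOutputStream zip = new ZipOutputStream(out);
             DirectoryStream<Path> files = Files.newDirectoryStream(segmentDir)) {
            for (Path f : files) {
                zip.putNextEntry(new ZipEntry(f.getFileName().toString()));
                Files.copy(f, zip);
                zip.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake one local segment, publish it, and list the zip's contents.
        Path seg = Files.createTempDirectory("segment");
        Files.writeString(seg.resolve("_0.fdt"), "stored fields");
        Path sharedZip = seg.resolveSibling("segment0.zip");
        publish(seg, sharedZip);
        try (ZipFile zf = new ZipFile(sharedZip.toFile())) {
            System.out.println(zf.entries().nextElement().getName());
        }
    }
}
```

A side benefit of keeping the zipped segments in one central place is exactly the point below: it doubles as a consistent backup of the index.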


It's definitely not a perfect solution, and I can see it needs lots of improvement, but it works in production.

If Jackrabbit looks really good in a cluster (which I am expecting), we may start putting the indexes directly in Jackrabbit and let it manage the distribution; they are not that big in most cases, generally < 10 GB. (The total data set being indexed will go up to 1 TB at some universities.)


The main point being, the central location provides a convenient place for consistent backups of the index (perhaps it's overkill).

Ian
