Hi,

Thanks to Jukka, the ASF and Google, I have been selected for the Google Summer of Code. Jukka Zitting is mentoring this project, which runs from now until August 21.
Our goal is to create a backup tool for Jackrabbit. Here is the initial project description:

"Implement a tool for backing up and restoring content in an Apache Jackrabbit content repository. In addition to the basic content hierarchies, the tool should be able to efficiently manage binary content, node version histories, custom node types, and namespace mappings. Incremental or selective backups would be a nice addition, but not strictly necessary."

I have started working on the design and I am asking for your feedback to improve it. Please tell me whether it suits your needs, what should be improved, what ideas you have on this subject, and anything else you would like to see. This feedback and design phase should be over in at most two weeks so that I can start coding.

*Project Goals*

We have set up these design goals:
- Back up both data and configuration options.
- Ease of use: a sysadmin should be able to back up and restore content easily.
- Aim for generic operations: when new functionality is added to Jackrabbit, we should not have to update the backup application code.
- Aim for non-blocking operations: the repository should work normally while being backed up, whenever possible.
- Aim for modularity: this will be the first release of the backup tool, and it will certainly evolve.
- Disk space is not an issue for now (it can be addressed in a later release).
- The tool is specific to Jackrabbit, as mysqldump is to MySQL.
*Architecture*

The application is composed of these modules:

- A configuration XML file (part of repository.xml?). It would contain pointers to the resources to back up (content, hierarchies, custom node types, Lucene index file, repository.xml, and so on) in XPath form (relative to the repository.xml file) and as file system paths. Whenever possible, we will gather information from the Jackrabbit configuration files.

- The backup format, which is still to be defined. We would like to be able to use any PersistenceManager, which gives us greater flexibility: for instance, backing up to another DB or to the file system. Using a PersistenceManager as the backup format also allows us to migrate seamlessly between PersistenceManagers. Each supported PersistenceManager will have its own backup strategy (possibly using external tools). For now we plan to support only the ObjectPersistenceManager or the XMLPersistenceManager (which means fetching the directory).

- A GenericConverter class to copy one workspace into another. For the backup, the destination workspace will be a temporary workspace configured with a specific PersistenceManager and a file system. This way we achieve the backup through a pivot format. This class may have other uses in the Jackrabbit project.

- The backup module, which would work according to the following pseudo-code:

    Backup all repository elements (repository.xml for instance)
    List all workspaces
    For each workspace
        Backup specific data (Lucene index file for instance) in working directory
        Copy workspace using GenericConverter to the backup repository
    End
    Save XML System view file.
    Zip all data and move them to a specific folder.
    Delete backup repository

- The restoration module, which would work this way:

    Unzip backup data in the working directory
    Restore repository data
    For each workspace:
        Restore specific data
        Restore content using importXML function.
    End
    Delete working directory.
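The "Zip all data" and "Unzip backup data" steps of the pseudo-code above could be sketched with the JDK's java.util.zip classes. This is only an illustration under assumptions: the class BackupArchiver and its method names are hypothetical and not part of Jackrabbit, the working directory is assumed to be a plain directory on the local file system, and the archive is assumed to live outside that directory.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper (not a Jackrabbit class): packs and unpacks the
// working directory used by the backup and restore modules.
final class BackupArchiver {

    private BackupArchiver() {
    }

    // Zips every regular file under workingDir into archive. Entry names are
    // paths relative to workingDir. Empty directories are not preserved.
    static void zipDirectory(Path workingDir, Path archive) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(workingDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(archive);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use '/' as separator on every platform.
                String name = workingDir.relativize(file).toString()
                        .replace(File.separatorChar, '/');
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip);
                zip.closeEntry();
            }
        }
    }

    // Unzips archive into targetDir, creating parent directories as needed.
    static void unzip(Path archive, Path targetDir) throws IOException {
        Path root = targetDir.toAbsolutePath().normalize();
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path dest = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside targetDir.
                if (!dest.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(dest);
                } else {
                    Files.createDirectories(dest.getParent());
                    Files.copy(zip, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

The backup module would call something like BackupArchiver.zipDirectory(workingDir, archive) once all workspaces have been copied, and the restoration module would start with BackupArchiver.unzip(archive, workingDir). Copying the workspaces themselves is left out here, since the GenericConverter design is still open.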
*NB*

- If possible, we will use the existing Jackrabbit installation to back up the data, or else another installation (depending on load tests and your recommendations). We would back up one workspace at a time into a temporary workspace. Once the data has been backed up, we would destroy and recreate the workspace. Is this feasible in the long run (a backup would be performed once a day), or is it not an advisable strategy?
- All updates and reads are isolated through transactions. Do we need to define a locking strategy? If I am correct, I can read a node even though it is locked, and this is thread-safe. You don't commit an incoherent modification.
- We can ignore transient items for now (the use case we are working on is regular backup; besides, the admin can forbid connections for a specific backup).
- Another approach is to use the system view. But how would we handle Jackrabbit extensions? What do you think?

*Functionalities*

The backup/restore application will allow at least the following parameters to be configured:
From the command line:

- back up only a specific workspace
- back up all workspaces

From the XML configuration file:

- Backup data: one use case would be to create a correctly configured Jackrabbit installation with no data in it. In that case we would leave out the data but still want the configuration parameters.
- List of all specific elements to back up (see above): this way the software is easily extensible and will follow Jackrabbit's evolution.

*Other approaches*

Other approaches have been evaluated:
- Working only at the file level is simple to implement. But this solution would not make it possible to switch between persistence mechanisms. For example, a common use case is that a user has started with some simple PersistenceManager, but then needs to switch to a more efficient one as the usage patterns change. It would be a nice addition if the backup tool could be used to handle such migrations as well as standard backup-restore situations.
- Working at the JCR API level only. The problem with this approach is that the JCR API does not specify any way to import or directly modify version histories or node types.

*Deliverables*

- Two Java applications: backup and restore.
- XML configuration file (one for both)
- Documentation (in the Jackrabbit wiki + Javadoc)
- Maybe some patches to the Jackrabbit project (if needed; for instance, importXML might need some rework).

*Evolution (after GSoC)*

- Add a sanity check of the archive just made.
- Add incremental backup (this would mean a hash code on each node and a test to know whether or not it has changed).

URL to the original Google SoC application:
http://www.deviant-abstraction.net/application-for-google-summer-of-code-2006/

I look forward to your feedback and I am happy to work with all of you on this tool. I will do my best to make this project a success with your help.

Nicolas

My blog! http://www.deviant-abstraction.net !!

On 5/24/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:
Hi,

Good news! We've just been granted a Google Summer of Code project for creating a backup tool for Jackrabbit. Please welcome Nicolas Toper as the student who will be working on the project. I will act as Nicolas' mentor.

Nicolas will soon send more details and his initial design ideas for the tool. Please comment on his work as you would on any external contribution.

See the Google Summer of Code page (http://code.google.com/soc/) for more details on the program.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development