Re: Google Summer of Code project for Jackrabbit

Nicolas Toper Wed, 24 May 2006 04:12:26 -0700

Hi,

Thanks to Jukka, the ASF and Google, I have been selected for the Google
Summer of Code. Jukka Zitting is mentoring this project, which runs from
now until August 21.


Our goal is to create a backup tool for Jackrabbit. Here is the initial
project description:

"Implement a tool for backing up and restoring content in an Apache
Jackrabbit content repository. In addition to the basic content hierarchies,
the tool should be able to efficiently manage binary content, node version
histories, custom node types, and namespace mappings. Incremental or
selective backups would be a nice addition, but not strictly necessary."

I have started working on the design and would like your feedback to
improve it. Please tell me whether it suits your needs, what should be
improved, and any other ideas you have on this subject.


This feedback and design phase should be over in at most two weeks so I can
start coding.



*Project Goals*
We have set up these design goals:

- Back up both data and configuration options.
- Ease of use for sysadmins backing up and restoring content.
- Aim for generic operations: when new functionality is added to Jackrabbit,
  we should not have to update the backup application code.
- Aim for non-blocking operations. Whenever possible, the repository should
  keep working normally while it is being backed up.
- Aim for modularity. This will be the first release of the backup tool, and
  it will certainly evolve.
- Disk space is not an issue for now. (It can be addressed in a later
  release.)
- The tool is specific to Jackrabbit, as mysqldump is to MySQL.




*Architecture*
The application is composed of these modules:

/ An XML configuration file (part of repository.xml?). It would contain
pointers to the resources to back up (content, hierarchies, custom node
types, Lucene index files, repository.xml, and so on) in XPath form (relative
to the repository.xml file) and as file system paths. Whenever possible, we
will gather information from the Jackrabbit configuration files.

/ The backup format is still to be defined. We would like to be able to use
any PersistenceManager, which gives us greater flexibility: for instance,
backing up to another DB or to the file system. Using a PersistenceManager
as the backup format also allows us to migrate seamlessly between
PersistenceManagers. Each supported PersistenceManager will have its own
backup strategy (possibly using external tools). For now we plan to support
only the ObjectPersistenceManager or the XMLPersistenceManager (which means
fetching the directory).

/ A GenericConverter class to copy one workspace into another. For the backup,
the destination workspace will be a temporary workspace configured with a
specific PersistenceManager and a file system. This way we achieve the
backup using a pivot format. This class may have other uses in the
Jackrabbit project.

/ The backup module would work according to the following pseudo-code:

Back up all repository elements (repository.xml for instance)
List all workspaces
For each workspace
    Back up specific data (Lucene index file for instance) in working directory
    Copy workspace using GenericConverter to the backup repository
End
Save XML system view file.
Zip all data and move it to a specific folder.
Delete backup repository

/ The restoration module would work this way:

Unzip backup data in the working directory
Restore repository data
For each workspace:
    Restore specific data
    Restore content using the importXML function.
End
Delete working directory.
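
The zip and unzip steps of the two procedures above could look roughly like
this in Java. This is only a sketch using the standard java.util.zip classes
(Java 8 or later); the class name BackupArchive and its method names are
made up for illustration and are not part of Jackrabbit.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper for the "zip all data" and "unzip backup data" steps. */
final class BackupArchive {

    /** Packs every regular file under workingDir into archive, with relative entry names. */
    static void zip(Path workingDir, Path archive) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(workingDir)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(archive);
             ZipOutputStream zipOut = new ZipOutputStream(out)) {
            for (Path file : files) {
                // Zip entry names use '/' as separator on every platform.
                String name = workingDir.relativize(file).toString().replace('\\', '/');
                zipOut.putNextEntry(new ZipEntry(name));
                Files.copy(file, zipOut);
                zipOut.closeEntry();
            }
        }
    }

    /** Extracts archive into workingDir, rejecting entries that would land outside it. */
    static void unzip(Path archive, Path workingDir) throws IOException {
        Path root = workingDir.toAbsolutePath().normalize();
        try (InputStream in = Files.newInputStream(archive);
             ZipInputStream zipIn = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zipIn.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Refuse names such as "../x" that escape the working directory.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry outside working directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    Files.copy(zipIn, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```

The archive should be written outside the working directory so that it does
not end up inside itself on a later run. Empty directories are not stored by
this sketch.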

*NB*

- If possible, we will use the existing Jackrabbit installation to back up the
data, or otherwise another installation (depending on load tests and your
recommendations). We would back up one workspace at a time into a temporary
workspace. Once the data has been backed up, we would destroy and recreate
the temporary workspace. Is this sustainable in the long run (the backup
would be performed once a day), or is it not an advisable strategy?

- All updates and reads are isolated through transactions. Do we need to
define a locking strategy? If I am correct, I can read a node even when it
is locked, and doing so is thread-safe. An inconsistent modification is
never committed.

- We can ignore transient items for now (the use case we are working on is
regular backups; besides, the admin can forbid connections during a specific
backup).

- Another approach is to use the system view. But how would we handle
Jackrabbit extensions? What do you think?

*Functionalities*
The backup/restore application will allow at least the following parameters
to be configured:

From the command line:

- back up only a specific workspace
- back up all workspaces

From the XML configuration file:

- Back up data (or not): one use case is creating a correctly configured
  Jackrabbit installation with no data in it. In that case we would not want
  the data, but would still want the configuration parameters.
- List of all specific elements to back up (see above): this way the software
  is easily extensible and will follow Jackrabbit's evolution.
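
To make the idea of listing the elements to back up more concrete, here is a
rough sketch of how such a list might be read. The configuration format is
not defined yet, so everything here is an assumption: the <backup> and
<resource> elements, the type attribute ("file" or "xpath") and the class
name BackupConfigReader are hypothetical. Only the standard javax.xml APIs
are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/**
 * Hypothetical reader for an assumed configuration format such as:
 *
 *   <backup>
 *     <resource type="file">repository.xml</resource>
 *     <resource type="xpath">/Repository/Workspace/SearchIndex/param[@name='path']/@value</resource>
 *   </backup>
 */
final class BackupConfigReader {

    private static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /**
     * Returns the file system paths to back up. "file" resources are taken
     * as-is; "xpath" resources are evaluated against the repository
     * configuration to look the path up there.
     */
    static List<String> resolvePaths(String backupConfigXml, String repositoryXml) throws Exception {
        Document config = parse(backupConfigXml);
        Document repository = parse(repositoryXml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String> paths = new ArrayList<>();
        NodeList resources =
                (NodeList) xpath.evaluate("/backup/resource", config, XPathConstants.NODESET);
        for (int i = 0; i < resources.getLength(); i++) {
            Element resource = (Element) resources.item(i);
            String value = resource.getTextContent().trim();
            if ("xpath".equals(resource.getAttribute("type"))) {
                // The expression points into repository.xml; its string value is the path.
                paths.add(xpath.evaluate(value, repository));
            } else {
                paths.add(value);
            }
        }
        return paths;
    }
}
```

With this approach, a new kind of resource only needs a new line in the
configuration file, not a change in the backup code.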

*Other approaches*
Other approaches have been evaluated:

- Working only at the file level is simple to implement, but this solution
  would not make it possible to switch between persistence mechanisms. For
  example, a common use case is that a user starts with a simple
  PersistenceManager but then needs to switch to a more efficient one as the
  usage patterns change. It would be a nice addition if the backup tool could
  handle such migrations as well as standard backup and restore situations.

- Working at the JCR API level only. The problem with this approach is that
  the JCR API does not specify any way to import or directly modify version
  histories or node types.

*Deliverables*
- Two Java applications: backup and restore.
- XML configuration file (one for both)
- Documentation (in the Jackrabbit wiki + Javadoc)
- Possibly some patches to the Jackrabbit project (if needed; for instance,
importXML might need some rework).

*Evolution (after GSOC)*
- Add a sanity check of the archive just made.
- Add incremental backup (this would mean storing a hash code for each node
and testing whether it has changed).
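
The hash test for incremental backup could work along these lines. This is a
sketch only: nodes are represented here by plain maps of property names to
string values as a stand-in for real JCR nodes, SHA-256 is just one possible
choice of hash, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical change detection for incremental backup, based on per-node hashes. */
final class IncrementalBackupSketch {

    /** Hashes a node's properties in sorted name order so the result is stable. */
    static String hashNode(Map<String, String> properties) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> p : new TreeMap<>(properties).entrySet()) {
                digest.update(p.getKey().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0); // separator, so ("ab","c") differs from ("a","bc")
                digest.update(p.getValue().getBytes(StandardCharsets.UTF_8));
                digest.update((byte) 0);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** Returns the paths whose hash is new or differs from the one stored at the previous backup. */
    static Set<String> changedPaths(Map<String, Map<String, String>> nodes,
                                    Map<String, String> previousHashes) {
        Set<String> changed = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> node : nodes.entrySet()) {
            String hash = hashNode(node.getValue());
            if (!hash.equals(previousHashes.get(node.getKey()))) {
                changed.add(node.getKey());
            }
        }
        return changed;
    }
}
```

Note that this only finds new and modified nodes. Detecting deleted nodes
would need a separate pass over the paths stored at the previous backup.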

URL to the original Google SoC Application:
http://www.deviant-abstraction.net/application-for-google-summer-of-code-2006/


I look forward to your feedback and am happy to work with all of you on
this tool. I will do my best to make this project a success with your
help.



Nicolas

My blog! http://www.deviant-abstraction.net !!

On 5/24/06, Jukka Zitting <[EMAIL PROTECTED]> wrote:


Hi,

Good news! We've just been granted a Google Summer of Code project for
creating a backup tool for Jackrabbit. Please welcome Nicolas Toper as
the student who will be working on the project. I will act as Nicolas'
mentor.

Nicolas will soon send more details and his initial design ideas for
the tool. Please comment on his work as you would on any external
contribution.

See the Google Summer of Code page (http://code.google.com/soc/) for
more details on the program.

BR,

Jukka Zitting

--
Yukatan - http://yukatan.fi/ - [EMAIL PROTECTED]
Software craftsmanship, JCR consulting, and Java development
