Dear Lenya devs,

Since our last Lenya meeting in Freiburg I have been seriously thinking about an
SQL-database-based implementation of the content repository (DB-content-repo)
and would like to kindly ask you for your opinions, ideas, hints...

It will only be for Lenya 2 and is still at an early draft stage.

Any honest feedback is highly appreciated.


Basics:
1) available as an alternative to the current file-based repo
2) it should be possible to choose the repository for each publication (no
switching between repositories afterwards)

The idea is basically to "replace" the filesystem-based storage and put all the
content into a DB.
The overall structure of how content is managed/stored in Lenya should not be
changed (only if really necessary).

What we hope to achieve is:
1) better performance with large publications (> 10,000 documents)
2) an easy, fast and flexible way to do queries based on metadata
Example: there is a meta field "categories" that stores the category keys
comma-separated, like "c2,c17,c2006,c33", and we want a link list on the
homepage that lists the latest 5 documents of a certain category ("c17").
The documents can be anywhere in the publication.
With SQL this can easily be done with something like "SELECT uuid FROM metatable
WHERE categories LIKE '%c17,%' LIMIT 5" (see the sketch after this list).
Sure, such a case can be solved with Lucene as well, but I think that there is
much more flexibility to do something like this on-the-fly (maybe by an
author while configuring a ContentUnit)
3) Deactivating and deleting documents takes very long in our large publications
due to the link-checking (as far as I understand what's going on)
- I assume a DB will be much faster at finding all the items WHERE content LIKE
'%lenya-document:112344...%'
4) One problem when going for our cluster environment was the performance of the
shared filesystem (NFS) between the delivery nodes (Lenya instances)
- sharing the same DB won't be an issue
4 a) fits better with modern enterprise environments:
Nowadays there is usually no dedicated filesystem space provided for
applications. Instead, each application gets its share of a NAS, which
cannot cope with the heavy filesystem requirements of Lenya.
5) Maybe a clustered authoring setup will also be easier to achieve then?
6) Better scalability
7) easier deployment
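
To make points 2) and 3) a bit more concrete, here is a rough sketch of what
such queries could look like. Table and column names (metadata_live,
documents_authoring, categories, last_updated, content_text) follow the example
layout further down, but they are only assumptions for illustration, and the
LIMIT syntax is MySQL/PostgreSQL-style:

    -- latest 5 live documents in category "c17" (point 2)
    SELECT uuid
      FROM metadata_live
     WHERE categories LIKE '%c17,%'
     ORDER BY last_updated DESC
     LIMIT 5;

    -- all authoring documents linking to a given document (point 3),
    -- e.g. for the link-check before deactivating/deleting it
    SELECT uuid, language
      FROM documents_authoring
     WHERE content_text LIKE '%lenya-document:<uuid-to-delete>%';

One detail that is still open: with a pattern like '%c17,%' the last entry of
the comma-separated list only matches if the stored value itself ends with a
comma, so the exact matching strategy for the categories field would still have
to be decided.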

The Database structure:
"Documents" in this context are mainly what is currently stored in the uuid
folder with the filename {language}.
1) one database per publication
2) Documents table: one table for the documents in the Authoring area (which
also contains Trash and Archive) and one table for the documents in the Live
area
3) Metadata: one table per metadata set (dcterms, dcelements, ...), one column
per field. Per metadata set there is likewise one table relating to the Live
documents and one to the Authoring documents.
4) Revision handling: based on our experience over the last few years, revisions
are only used in the authoring area, so the idea is that only the current
revisions are stored in the above-mentioned tables. To store the older revisions
of the documents and metadata there is an additional table for each of the
tables mentioned above.

Example table structure:
documents_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
- workflow state
- last updated
- doctype (resource type)
- mimetype
- content_text (text field for the document content if it's a text type)
- content_blob (for binary content: PDF, images, ...)

* = Primary key

metadata_authoring:
-* uuid
-* language
-* area [A(uthoring)|T(rash)|(archiv)E]
-* metasetkey (to identify the metaset)
(- workflow state)?
- last updated
- metafield 1
  :
- metafield n
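
Purely for illustration, the two example tables could look roughly like this in
MySQL-flavoured DDL; the column types are placeholders and the two meta fields
are just made-up examples, nothing here is meant as a final schema:

    CREATE TABLE documents_authoring (
        uuid           VARCHAR(64)  NOT NULL,
        language       VARCHAR(8)   NOT NULL,
        area           CHAR(1)      NOT NULL,   -- A(uthoring) | T(rash) | (archiv)E
        workflow_state VARCHAR(32),
        last_updated   TIMESTAMP,
        doctype        VARCHAR(64),             -- resource type
        mimetype       VARCHAR(64),
        content_text   LONGTEXT,                -- textual document content
        content_blob   LONGBLOB,                -- binary content (PDF, images, ...)
        PRIMARY KEY (uuid, language, area)
    );

    CREATE TABLE metadata_authoring (
        uuid           VARCHAR(64)  NOT NULL,
        language       VARCHAR(8)   NOT NULL,
        area           CHAR(1)      NOT NULL,
        metasetkey     VARCHAR(32)  NOT NULL,   -- identifies the metadata set
        last_updated   TIMESTAMP,
        categories     VARCHAR(255),            -- example meta field 1
        title          VARCHAR(255),            -- example meta field n
        PRIMARY KEY (uuid, language, area, metasetkey)
    );

The *_live tables would have the same layout, and the revision tables from
point 4 would presumably carry the same columns plus a revision number in the
primary key.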


We're looking forward to your valuable feedback and wish you all a Merry
Christmas and a joyful and happy New Year!

Best regards,
 Gerd and Hans
