Re: Streaming / Serializing Big Objects

Emmanuel Lecharny Fri, 08 Sep 2006 14:52:33 -0700

Sorry, it was supposed to be sent to the ML, but I forgot to do a Reply-All.


So here is my answer to Ole :

Some thoughts and clarifications :

Ole Ersoy wrote :

Cool -
OK suppose we had a StateManager.

The StateManager has a decode method on it that reads a persistent file
and creates the directory tree.

The directory tree is totally different. We have many files, including a Master table which contains the entries, and other files which store the indices. This is a choice that can be discussed, but basically, we *never* read the file at start (remember that we could have millions of entries). The cache system (which could perfectly well be something like Hibernate, Prevayler, or any other persistent cache) is loaded on the fly. However, just keep in mind that an LDAP server is not intended to be stopped very often !
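To make "loaded on the fly" concrete, here is a minimal sketch of a cache that starts empty and only fetches an entry the first time it is asked for. The class name, the loader function and the LRU eviction policy are all assumptions made for the illustration; this is not the cache the server actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: nothing is read at startup, entries are fetched on first access.
final class LazyEntryCache<K, V> {
    private final Function<K, V> loader; // stands in for a lookup in the Master table
    private final Map<K, V> cache;

    LazyEntryCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // Access-ordered LinkedHashMap: the least recently used entry is evicted first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key); // cache miss: go to the backend
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    synchronized int size() {
        return cache.size();
    }
}
```

With millions of entries, the point is that startup cost does not depend on the size of the Master table, only on what clients actually request.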

The StateManager's encode method uses a list of
references to directory tree objects
creating a concatenated String of the string
representation of all these objects, and then writes
the string to a file, once all the concatenation is
done.

Objects stored in the Master table (let's call them entries) have this structure :

Entry :
- DistinguishedName (which basically contains two strings), the unique key
- attributes, which are a list of :
  - attribute, each of which is a name plus a list of :
    - values (byte[] or String, or, and this is what we are talking about, a reference to persisted data)
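The structure above could be sketched in Java roughly as follows. All class and field names are hypothetical (this is not the ApacheDS API), and the DN is simplified to a single string rather than the two strings mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an entry; names are illustrative only.
final class EntrySketch {
    /** A value is either held in memory or is a pointer to data persisted elsewhere. */
    interface Value {}

    static final class BinaryValue implements Value {
        final byte[] bytes;
        BinaryValue(byte[] bytes) { this.bytes = bytes.clone(); }
    }

    static final class StringValue implements Value {
        final String text;
        StringValue(String text) { this.text = text; }
    }

    /** Reference to a big value kept outside the entry (file name, blob key, ...). */
    static final class StoredValueRef implements Value {
        final String location;
        final long length;
        StoredValueRef(String location, long length) {
            this.location = location;
            this.length = length;
        }
    }

    final String dn; // the unique key (simplified to one string here)
    final Map<String, List<Value>> attributes = new LinkedHashMap<>();

    EntrySketch(String dn) { this.dn = dn; }

    /** Adds one value to the named attribute, creating the attribute if needed. */
    void add(String attributeName, Value value) {
        attributes.computeIfAbsent(attributeName, k -> new ArrayList<>()).add(value);
    }
}
```

The entry itself stays small whatever the size of the value: a multi-megabyte photo costs only a location string and a length in the Master table.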

I don't really see what a StateManager can bring here. What we just need to do is to store an attribute value somewhere, and be able to send it back to the user, limiting the memory footprint needed to do so to a minimal value (let's say 1024 bytes, for instance). For that, it is enough to store a reference to this persisted data, be it a file name, a key to a blob in a database, or a mail on Google, if we create 10000 Gmail accounts to be able to store 2 Tbytes of data for free :) ...
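As an illustration of the bounded memory footprint, here is a minimal sketch that sends a persisted value back through a fixed 1024-byte buffer. The class and method names are made up for the example; where the InputStream comes from (file, blob, ...) is left open on purpose.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: copies a value of any size using a constant amount of memory.
final class BoundedCopy {
    static final int BUFFER_SIZE = 1024; // the figure used as an example above

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        out.flush();
        return total;
    }
}
```

Whatever the size of the stored value, the server side never holds more than BUFFER_SIZE bytes of it at once.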

Am I getting any warmer?

I can't say. But maybe my explanations are not clear enough :)

I read a little about Prevayler.  It just serializes
all the Java objects that need to be persisted
immediately as it becomes aware of them, I think, and
then keeps them updated as the objects mutate.

We are not really willing to store Java objects, but byte[] or Strings. I know, technically speaking, they are objects :) , but they can also be seen as streams of bytes, which they are, after all !

 So if
the application crashes, on reboot it will read the
persistent files and be back up.

I hope that the backend will be reliable ! Atm, nothing is really done to ensure that we can't lose data if we brutally stop the server, except a flag which forces 'sync-on-write' when modifying the data. But we may have problems, because we don't support transactions. We need to support transactions, and a kind of shadow-page mechanism, à la RDBMS. Still a work in start (can't say work in progress, when we just have a few JIRAs and Confluence pages about it) :)

 To make reboot more
efficient, the persistent files can be managed like I
described above with the StateManager on a clean
shutdown, which I think is what you are describing.

If we shut down the server cleanly, the database is supposed to *always* be in a correct state. And again, when starting the server, we don't load any data, except management data (like the Root DSE). We may store the cache and reload it, to improve the 'warm up' process; that seems a really good idea, but as I said, shutting down an LDAP server should not occur very frequently ...

The reason I mention this is because as the directory
tree mutates, we would not want to persist the entire
tree per mutation, right?  So we would have to either
use relational persistence, or write a single file
just containing the mutation.

If we 'mutate' the directory tree, the cache should be updated accordingly. Basically, if we consider the existing server, what we do is remove the modified data from the cache. Saving a copy of the mutated tree before its mutation is not an option. It is much better to modify a copy of this sub-tree and, when the modification is done, switch the old tree and the new tree. But this is really not easy, as we are not storing trees as a whole, but many trees (one per index file) plus a whole bunch of entries in the master database. This is not a simple matter, and it's difficult to explain, too... There is a kind of explanation here :
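The "modify a copy, then switch" idea on its own can be sketched as below. A flat map stands in for a sub-tree, and every name is an assumption made for the example; it deliberately ignores the hard part described above (many index trees plus the master table that would all have to be switched together).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical sketch: readers always see either the old or the new version, never a mix.
final class CopyThenSwap {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    /** Returns the currently published version; it is never modified in place. */
    Map<String, String> snapshot() {
        return current.get();
    }

    /** Applies the change to a private copy, then publishes the copy atomically. */
    void update(Consumer<Map<String, String>> change) {
        while (true) {
            Map<String, String> old = current.get();
            Map<String, String> copy = new HashMap<>(old);
            change.accept(copy); // may run again if another writer got in first
            if (current.compareAndSet(old, Collections.unmodifiableMap(copy))) {
                return;
            }
        }
    }
}
```

If the modification fails half way, the copy is simply dropped and the old version stays in place, which is what makes this preferable to saving a copy before mutating.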

http://docs.safehaus.org/display/APACHEDS/Backend

That would mean we are in more of an rsync like mode,
where if the server crashes, we load the original
directory tree file + any mutation files.

Yeah, there is definitely something to dig into around this idea. I like that :) This is what OpenLDAP does with its journal (log files).

If the directory shuts down cleanly we encode all the
directory objects to one file and delete all the
"temporary" mutation files.

If the server shuts down cleanly, I think that nothing should be done. If you consider that you have a kind of journal/shadow page that contains all the not yet applied modifications, then the last thing that the server should do is to wait until those modifications are done. When a modification is done, the corresponding journal/shadow page should be removed (or marked as applied). If we have a problem, then, with the support of transactions, we may be able to roll back. A lot of work ...
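A minimal in-memory sketch of that bookkeeping, with invented names: a modification is logged before it is applied, dropped from the journal once applied, and a clean shutdown only has to wait until nothing is pending. A real journal would of course live on disk and be synced, which this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: tracks which logged modifications have not been applied yet.
final class JournalSketch {
    private final Map<Long, String> pending = new LinkedHashMap<>();
    private long nextId = 1;

    /** Records a modification before it is applied and returns its journal id. */
    synchronized long log(String modification) {
        long id = nextId++;
        pending.put(id, modification);
        return id;
    }

    /** Called once the modification has reached the backend. */
    synchronized void markApplied(long id) {
        pending.remove(id);
    }

    /** What would have to be replayed (or rolled back) after a crash, in log order. */
    synchronized List<String> unapplied() {
        return new ArrayList<>(pending.values());
    }

    /** A clean shutdown has nothing left to do once this returns true. */
    synchronized boolean safeToShutDown() {
        return pending.isEmpty();
    }
}
```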

Incidentally, EMF can be used for any type of
serialization: a concatenated file like the one I just
described, XML, relational persistence, etc.  One of
the benefits of EMF is that if for whatever reason
someone wanted to serialize to XML, implementing a
function to do so would be very straightforward.  If
someone wanted to serialize to a relational source,
that's easy too.

We definitely have to implement an RDBMS backend. RDBMSs offer all those mechanisms for free, no need to be hit by the NIH syndrome :) And we also have to remember that we are *not* writing Derby, but ADS !

There's also the EMF Technology project's Object
Constraint Language, which can be used to query the EMF
model... and I would think it would be very useful for
creating directory-like queries and coding the query
API.

Well, a little bit too far for my little brain ..

There's an article on the eclipse site just written on
how to use it.

Good. Let's experiment. Talking is good, reading is great, but writing is better and implementing a solution is a must !

Cheers,
- Ole


Emmanuel

