Comments inline
--- Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
Some thoughts and enlightenment:
Ole Ersoy wrote:
Cool -
OK, suppose we had a StateManager. The StateManager has a decode method on it that reads a persistent file and creates the directory tree.
The directory tree is totally different. We have many files, including a Master table which contains the entries, and other files which store the indices. This is a choice that can be discussed, but basically, we *never* read the files at start (remember that we could have millions of entries). The cache system (which could perfectly well be something like Hibernate, Prevayler, or whatever persistent cache) is loaded on the fly. However, just keep in mind that an LDAP server is not intended to be stopped very often!
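(To make that layout concrete, here is a rough sketch; the class, field, and method names are made up for illustration and are not the actual ApacheDS types, and entries are plain Strings to keep it short:)

  import java.nio.charset.StandardCharsets;
  import java.util.*;

  // Hypothetical sketch of the layout described above: a master table keyed by
  // entry id, plus per-attribute indices mapping a value to the ids of matching
  // entries. Nothing is read at startup; an entry is only deserialized into the
  // in-memory cache when it is actually looked up.
  class SketchBackend {
      private final Map<Long, byte[]> masterTable = new HashMap<>();               // id -> serialized entry
      private final Map<String, Map<String, Set<Long>>> indices = new HashMap<>();  // attr -> value -> ids
      private final Map<Long, String> cache = new HashMap<>();                      // filled lazily

      // Use an index to find the ids of entries whose attribute has this value.
      Set<Long> lookupIds(String attribute, String value) {
          return indices.getOrDefault(attribute, Collections.emptyMap())
                        .getOrDefault(value, Collections.emptySet());
      }

      // Fetch one entry, loading it on the fly from the master table if needed.
      String fetch(long id) {
          String entry = cache.get(id);
          if (entry == null && masterTable.containsKey(id)) {
              entry = new String(masterTable.get(id), StandardCharsets.UTF_8);
              cache.put(id, entry);
          }
          return entry;
      }
  }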
Just a quick terminology clarification - when I say cache I mean in-memory representations, and when I say persisted I mean written to disk. By directory tree I mean all the information that ADS is intended to provide, regardless of precisely how it is persisted or managed. So I think we are on the same page here.
So if all the information were in a DOM-like tree, then something like EMF OCL could be used to query it. This may take up more of a memory footprint, or the queries could be slower, but what if it's just as fast or faster? Then ADS would all of a sudden have a lot more developers working on one of its building blocks.
The StateManager's encode method uses a list of references to directory tree objects, creating a concatenated String of the string representations of all these objects, and then writes the string to a file once all the concatenation is done.
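(A rough sketch of what such a StateManager could look like; the interface and class names below are hypothetical, just to make the decode/encode idea concrete:)

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.List;

  // Hypothetical StateManager: decode() reads the persisted file back in,
  // encode() concatenates the string form of the directory tree objects and
  // writes the whole result to disk in a single write at the end.
  interface StateManager {
      List<String> decode(Path file) throws IOException;
      void encode(Path file, List<?> directoryObjects) throws IOException;
  }

  class SimpleFileStateManager implements StateManager {
      public List<String> decode(Path file) throws IOException {
          return Files.readAllLines(file, StandardCharsets.UTF_8);
      }

      public void encode(Path file, List<?> directoryObjects) throws IOException {
          StringBuilder out = new StringBuilder();
          for (Object o : directoryObjects) {
              out.append(o).append('\n');   // concatenate string representations
          }
          Files.write(file, out.toString().getBytes(StandardCharsets.UTF_8));
      }
  }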
Objects stored in the Master table (let's call them entries) have this structure:
Entry:
- DistinguishedName (which basically contains two strings), the unique key
- attributes, which are a list of:
  - attribute, which is: <a name, a list of:>
    - values (byte[] or String, or - and this is what we are talking about - a reference to a persisted data)
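(A rough Java sketch of that structure; the class names are made up for illustration and the real ApacheDS classes differ:)

  import java.util.List;

  // Hypothetical sketch of the entry layout described above.
  class SketchEntry {
      DistinguishedName dn;              // the unique key
      List<SketchAttribute> attributes;
  }

  class DistinguishedName {
      // "basically two strings" - say an as-typed form and a normalized form
      // (a guess; the exact pair isn't spelled out above)
      String asTyped;
      String normalized;
  }

  class SketchAttribute {
      String name;
      List<SketchValue> values;
  }

  // A value is either held in memory (byte[] or String) or is a reference to
  // data persisted elsewhere (a file name, a blob key in a database, ...).
  class SketchValue {
      byte[] bytes;                      // in-memory binary value, or
      String string;                     // in-memory string value, or
      String persistedReference;         // pointer to where the real data lives
  }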
I don't really see what a StateManager can bring here. What we just need to do is to store an attribute value somewhere, and be able to send it back to the user, limiting the memory footprint needed to do so to a minimal value (let's say 1024 bytes, for instance). If we store a reference to this persisted data - be it a file name, a key to a blob in a database, a mail on Google, if we create 10,000 Gmail accounts to be able to store 2 TB of data for free :) ...
Yeah! Let's go with the GMail one!!!! :-)
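(Joking aside, here is a minimal sketch of the "send it back without holding it all in memory" idea, streaming a persisted value in chunks of the 1024 bytes mentioned above; the class and method names are hypothetical:)

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;

  // Hypothetical helper: copy a persisted attribute value to the client
  // without ever holding more than 1024 bytes of it in memory.
  class ValueStreamer {
      static void send(File persistedValue, OutputStream toClient) throws IOException {
          byte[] buffer = new byte[1024];   // the bounded memory footprint
          try (InputStream in = new FileInputStream(persistedValue)) {
              int read;
              while ((read = in.read(buffer)) != -1) {
                  toClient.write(buffer, 0, read);
              }
          }
      }
  }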
So I think we are thinking pretty much the same thing here, and that's what the StateManager would do. It could even be pluggable, so for instance different state managers for different persistence mechanisms. In the end we are just reading and writing data, and that's the job of the StateManager. Whether it reads it all at once, or a little here and a little there, is up to it.
If a telecommunications company using ADS wants lightning-fast queries, then they would probably love to see ADS restored from and run out of a single file that is in memory for all queries. But if it's an authentication service where queries can take their own sweet time, then maybe the IT dept would rather just get one server with a gigantic drive and have ADS query a persistent data source when it needs stuff, with nothing cached in memory.
So I think we are thinking the same thing; the only question is which solution best minimizes the in-memory footprint, regardless of the size of the cache, while maximizing ease of maintenance and feature development / modularity.
Am I getting any warmer?
I can't say. But maybe my explanations are not clear enough :)
I read a little about Prevayler. It just serializes all the Java objects that need to be persisted immediately as it becomes aware of them, I think, and then keeps them updated as the objects mutate.
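(For reference, Prevayler usage looks roughly like this; the API details here are from memory of its 2.x releases, so treat them as approximate, and the EntryList/AddEntry classes are made up:)

  import java.io.Serializable;
  import java.util.ArrayList;
  import java.util.Date;
  import java.util.List;

  import org.prevayler.Prevayler;
  import org.prevayler.PrevaylerFactory;
  import org.prevayler.Transaction;

  // The "prevalent system" is an ordinary Java object; every change goes
  // through a Transaction, which Prevayler journals to disk before applying,
  // so the state can be rebuilt after a crash.
  class EntryList implements Serializable {
      List<String> entries = new ArrayList<String>();
  }

  class AddEntry implements Transaction {
      private final String entry;
      AddEntry(String entry) { this.entry = entry; }

      public void executeOn(Object prevalentSystem, Date executionTime) {
          ((EntryList) prevalentSystem).entries.add(entry);
      }
  }

  class PrevaylerDemo {
      public static void main(String[] args) throws Exception {
          Prevayler prevayler = PrevaylerFactory.createPrevayler(new EntryList(), "journal-dir");
          prevayler.execute(new AddEntry("cn=example"));   // journaled, then applied
      }
  }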
We are not really willing to store Java objects, but byte[] or Strings. I know, technically speaking, they are objects :), but they can also be seen as streams of bytes, which they are, after all!
Yeah - let's just call them things... that need to be as fast as possible to read and as fast as possible to write.
Of course, ease of development and maintenance should be weighed against the speed considerations.
So if the application crashes, on reboot it will read the persistent files and be back up.
I hope that the backend will be able to be reliable! Atm, there is nothing really done to ensure that we can't lose data if we brutally stop the server, except a flag which forces 'synch-on-write' when modifying the data. But we may have problems, because we don't support transactions. We need to support transactions, and a kind of shadow-pages mechanism, à la RDBMS. Still a work in start (can't say work in progress, when we just have a few JIRAs and Confluence pages about it :)
Prevayler claims to support transactions inherently, just by the very nature of what it does... which makes sense...
To make reboot more efficient, the persistent files can be managed like I described above with the StateManager on a clean shutdown, which I think is what you are describing.
If you shut down the server cleanly, the database is supposed to *always* be in a correct state. And again, when starting the server, we don't load any data, except management data (like the Root DSE). We may store the cache, and reload it, to improve the 'warm up' process; that seems a really good idea, but as I said, shutting down an LDAP server should not occur very frequently...
The reason I mention this is that as the directory tree mutates, we would not want to persist the entire tree per mutation, right? So we would have to either use relational persistence, or write a single file just containing the mutation.
If we 'mutate' the directory tree, the cache should be updated
We mean the persistent source here, right? When I say cache I mean in-memory data...
accordingly. Basically, and if we consider the existing server, what we do is to remove the modified data from the cache. Saving a copy of the mutated tree before its mutation is not an option. It is far better to modify a copy of this sub-tree, and when the modification is done, then switch the old tree and the new tree. But this is really not easy, as we are not storing trees as a whole, but many trees (one per index file) plus a whole bunch of entries in the master database. This is not a simple matter, and it's difficult to explain, too... There is a kind of explanation here:
http://docs.safehaus.org/display/APACHEDS/Backend
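(At its simplest, the "modify a copy, then switch old and new" idea looks like the sketch below, using a plain list as a stand-in for one of those trees; as noted above, the real backend has many trees plus the master table, so this is only the basic shape:)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.atomic.AtomicReference;

  // Hypothetical copy-then-switch holder: readers always see a consistent
  // snapshot, while a writer works on a private copy and publishes it in one
  // atomic switch of the reference.
  class CopyThenSwitch {
      private final AtomicReference<List<String>> current =
              new AtomicReference<List<String>>(new ArrayList<String>());

      List<String> snapshot() { return current.get(); }

      void add(String entry) {
          List<String> copy = new ArrayList<String>(current.get());  // work on a copy
          copy.add(entry);
          current.set(copy);                                          // the old/new switch
      }
  }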
Yes, I see what you are saying with respect to explaining the exact mutation process that is happening.
In the end, though, something changes and we need to capture that change somehow, so that if the server goes down we can get back to the same operating state. We could be doing JDBC transactions right when the server goes down, and if the transaction is not complete then we still can't completely recover, but we come close...
I think our general conversation with respect to journaling, etc. applies here.
That would mean we are in more of an rsync-like mode, where if the server crashes, we load the original directory tree file + any mutation files.
Yeah, there is definitely something to dig into around this idea. I like that :) This is what OpenLDAP is doing with its journal (log files).
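(A minimal sketch of such a mutation journal; the names are hypothetical, and a real implementation would still need the transaction and shadow-page work mentioned earlier:)

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.StandardOpenOption;
  import java.util.Collections;
  import java.util.List;

  // Hypothetical journal: every mutation is appended (and forced to disk) as
  // one line. After a crash, the base file plus a replay of this journal brings
  // the tree back; a clean shutdown rewrites the base file and deletes it.
  class MutationJournal {
      private final Path journal;

      MutationJournal(Path journal) { this.journal = journal; }

      void append(String mutation) throws IOException {
          Files.write(journal,
                      (mutation + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                      StandardOpenOption.CREATE, StandardOpenOption.APPEND,
                      StandardOpenOption.SYNC);                  // "synch-on-write"
      }

      // Everything recorded since the last clean shutdown, in order.
      List<String> replay() throws IOException {
          return Files.exists(journal)
                  ? Files.readAllLines(journal, StandardCharsets.UTF_8)
                  : Collections.<String>emptyList();
      }

      // After a clean shutdown, when all mutations are applied, drop it.
      void clear() throws IOException {
          Files.deleteIfExists(journal);
      }
  }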
If the directory shuts down cleanly, we encode all the directory objects to one file and delete all the "temporary" mutation files.
If the server shuts down cleanly, I think that nothing should be done. If you consider that you have a kind of journal/shadow page that contains all the not-yet-applied modifications, then the last thing that the server should do is to wait until those modifications are done. When a modification is done, then the corresponding journal/shadow page should be removed (or marked as applied). If we have a problem, then, with the support of transactions, we may be able to roll back. A lot of work...
Yeah - Exactly...
That's why I suggested the EMF API, because it already has support for a lot of stuff like that. I need to get some examples worked up ASAP.
Incidentally, EMF can be used for any type of serialization: a concatenated file like the one I just described, XML, relational persistence, etc. One of the benefits of EMF is that if for whatever reason someone wanted to serialize to XML, implementing a function to do so would be very straightforward. If someone wanted to serialize to a relational source, that's easy too.
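(For instance, writing a model object out as XMI - EMF's XML serialization - takes only a few lines. The sketch below assumes a generated entries model that doesn't exist yet, so the EObject passed in is hypothetical:)

  import java.io.IOException;
  import java.util.Collections;

  import org.eclipse.emf.common.util.URI;
  import org.eclipse.emf.ecore.EObject;
  import org.eclipse.emf.ecore.resource.Resource;
  import org.eclipse.emf.ecore.resource.ResourceSet;
  import org.eclipse.emf.ecore.resource.impl.ResourceSetImpl;
  import org.eclipse.emf.ecore.xmi.impl.XMIResourceFactoryImpl;

  // Save any EMF model object as an XMI (XML) file using the standard
  // resource machinery; the root object would come from a generated model.
  class XmiWriter {
      static void save(EObject root, String path) throws IOException {
          ResourceSet resourceSet = new ResourceSetImpl();
          resourceSet.getResourceFactoryRegistry().getExtensionToFactoryMap()
                     .put("xmi", new XMIResourceFactoryImpl());
          Resource resource = resourceSet.createResource(URI.createFileURI(path));
          resource.getContents().add(root);
          resource.save(Collections.emptyMap());
      }
  }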
We definitely have to implement an RDBMS backend. RDBMS offer all those
Dang - the message got truncated... now I gotta go and see what else you said somewhere else... but I think we are pretty much on the same page...
I just sent that stuff out for the sake of awareness mostly...
There needs to be a set of considered options for doing ADS persistence and caching / in-memory storing of directory entries, so I just wanted to make sure I threw EMF out there for the long run as something to check out.
I'll get some examples worked up soon.
Cheers,
- Ole
=== message truncated ===