Hi Asger,

On Fri, Oct 9, 2009 at 4:34 AM, Asger Askov Blekinge
<a...@statsbiblioteket.dk> wrote:
> Hi Mailing list
>
> I had a very interesting discussion with Mathias Razum (Mr. eSciDoc;)
> at the ECDL 2009 conference. He told me that Fedora, whenever you use an
> API function on an object, parses the entire object, including all
> versions of datastreams. This got my interest, and after the conference
> I examined the fedora code to verify his claim. Well, I saw that the
> gist of it was true, and Fedora uses a SAX parser.
>
> Now, I would like to start a discussion about this behaviour, if it is a
> problem, and ways it could be improved. I am really not sure the
> performance hit is in any way a problem, so this might be totally
> redundant.

It creates a problem when the foxml is particularly large.  This most
often occurs when people store a lot of XML as "inline" datastreams,
or when they have a lot of versions.

> First, I am not sure, but I think that the xml storage format does not
> need to be true foxml. As long as we have ObjectSerializers and
> DeSerializers we should be able to use a different storage format
> without changing the behaviour in any way. Is this a viable route?
> Personally, I fear that it is not.

It would be possible to change Fedora so that it stored the objects in
any (e.g., binary, compressed) format.  And there would be *some*
performance advantage to that, but not much...I think the size of the
objects in memory (and the fact that the foxml has to be "fully
deserialized") is the more significant issue.

> Second, and probably more fruitful, we could do some conditional
> parsing. AFAIK, the SAX parser is blazingly fast, if it does not do
> anything when hitting elements. Is this true?

Yes, SAX and XMLPull are both very fast when they can just skip
elements that aren't of interest.
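
To make the "skip what isn't of interest" idea concrete, here's a
minimal sketch using the JDK's SAX API.  It is not Fedora's actual
deserializer: the class and method names (DatastreamIdScanner, scanIds)
are made up for illustration, and it assumes datastream elements carry
an "ID" attribute as they do in foxml.  The handler only reacts to
datastream start tags; everything else, including inline xml content,
falls through the default no-op callbacks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Fedora: collects datastream IDs only.
final class DatastreamIdScanner {

    static List<String> scanIds(String foxml) {
        final List<String> ids = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            factory.newSAXParser().parse(
                new InputSource(new StringReader(foxml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        // Only datastream start tags are handled; all other
                        // events use DefaultHandler's empty implementations.
                        if ("datastream".equals(localName)) {
                            ids.add(atts.getValue("ID"));
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not scan foxml", e);
        }
        return ids;
    }
}
```

Note the parser still has to tokenize the whole document; the saving
comes from not building any objects for the parts we skip.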

> That way, we could parse the basic structure of the document, but not
> the datastreams. When a function then requests a datastream, that
> datastream is parsed, but not before then. If the latest is requested,
> the version list is not parsed, and so on.
> What are your thoughts about this?

There are disk io/memory tradeoffs, but I suspect that this approach
could work really well for most access patterns.  It would be a good
thing to test.  :)  In some cases, particularly when the foxml is very
small, it's better to just read it all at once.  I wonder if such a
scheme could be adaptive.  Say, if the foxml's over 4k (or whatever),
then use this multi-pass scheme.
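
The adaptive part could be as simple as a size check before parsing.
A tiny sketch, where ParseStrategyChooser, Strategy and the 4k
threshold are all placeholders (the right cutoff would have to come
from testing):

```java
// Hypothetical sketch: pick a parsing strategy from the foxml size.
final class ParseStrategyChooser {

    enum Strategy { SINGLE_PASS, MULTI_PASS }

    // Placeholder cutoff; "4k (or whatever)" until measurements say otherwise.
    static final long THRESHOLD_BYTES = 4 * 1024;

    static Strategy choose(long foxmlSizeBytes) {
        // Small objects: read everything at once.
        // Large objects: defer the expensive parts to later passes.
        return foxmlSizeBytes > THRESHOLD_BYTES
                ? Strategy.MULTI_PASS
                : Strategy.SINGLE_PASS;
    }
}
```

The size could come from the file length on disk, so the decision
costs nothing beyond a stat call.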

I think the "big win" with the multi-pass scheme would be not having
to read the inline xml datastream content until it's needed.  In most
real-world cases, just getting the foxml (including the version list
of all the datastreams) without the inline xml content is going to
save the majority of the memory space.  I'm not sure about the value
of not getting the version list on the first pass...trying to avoid
having that info in memory may introduce complexities that aren't
worth it.
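
Roughly what I have in mind, as a sketch with the JDK's StAX pull
parser.  Everything here is an assumption for illustration, not
existing Fedora code: the LazyFoxmlIndex class, its method names, and
the use of a String to stand in for the foxml file on disk.  The first
pass keeps the full version list but no inline content; the content of
one version is only read (via a second pass) when somebody asks for it.
For brevity the second pass returns just the text inside xmlContent
and assumes the inline xml contains no element that is itself named
xmlContent.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical two-pass reader: version list up front, inline content on demand.
final class LazyFoxmlIndex {
    private final String foxml; // stands in for the stored foxml file
    private final Map<String, List<String>> versions = new LinkedHashMap<>();

    LazyFoxmlIndex(String foxml) {
        this.foxml = foxml;
        try { // pass 1: record datastream and version IDs, ignore everything else
            XMLStreamReader r = open();
            String currentDs = null;
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) continue;
                String name = r.getLocalName();
                if ("datastream".equals(name)) {
                    currentDs = r.getAttributeValue(null, "ID");
                    versions.put(currentDs, new ArrayList<>());
                } else if ("datastreamVersion".equals(name) && currentDs != null) {
                    versions.get(currentDs).add(r.getAttributeValue(null, "ID"));
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("first pass failed", e);
        }
    }

    List<String> versionIds(String datastreamId) {
        return versions.getOrDefault(datastreamId, Collections.emptyList());
    }

    // Pass 2, run only on request: text inside xmlContent of one version,
    // or null if that version has no inline content.
    String inlineText(String versionId) {
        try {
            XMLStreamReader r = open();
            boolean inVersion = false;
            boolean inContent = false;
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if ("datastreamVersion".equals(name)) {
                        inVersion = versionId.equals(r.getAttributeValue(null, "ID"));
                    } else if (inVersion && "xmlContent".equals(name)) {
                        inContent = true;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String name = r.getLocalName();
                    if (inContent && "xmlContent".equals(name)) {
                        r.close();
                        return text.toString(); // stop early, skip the rest
                    }
                    if ("datastreamVersion".equals(name)) inVersion = false;
                } else if (event == XMLStreamConstants.CHARACTERS && inContent) {
                    text.append(r.getText());
                }
            }
            r.close();
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("second pass failed", e);
        }
    }

    private XMLStreamReader open() throws XMLStreamException {
        return XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(foxml));
    }
}
```

The tradeoff is visible here: every content request re-reads the file
from the start, so this trades disk io for memory.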

- Chris

_______________________________________________
Fedora-commons-developers mailing list
Fedora-commons-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/fedora-commons-developers