RE: Alt-Design status: XML handling

Victor Mote Thu, 21 Nov 2002 03:45:11 -0800

Peter B. West wrote:

> <quote>
...
> Echoing sentiments recently expressed in this publication, Clark said
> that SAX, though efficient, was very hard to use, and that DOM had
> obvious limitations due to the requirement that the document being
> processed be in memory. He suggested that what was needed was a standard
> "pull API," one that efficiently allowed random access to XML documents.


First, thanks for the update on your work -- I understand what you are doing
a little better. Second, the statement above about random access almost
jumped out at me, because I had exactly the same thought earlier today while
contemplating a thread on the XSL-FO list which discussed processing of long
documents and memory constraints related to them. The closest thing to a
perfect document processing system that I have come across is FrameMaker,
which seems to be able to handle pretty large documents with a pretty small
footprint. I don't know for sure, but it seems to me that the "area tree"
(if you will) is written to disk, and pages can be efficiently jumped to in
an arbitrary manner. The WYSIWYG editor is essentially a viewport on the
portion of the document in memory, which is itself a subset of the disk
document. As you edit the document, I presume that events are sent to
something akin to a layout manager, which has to do something with them.
Now, in our case, we need to not only have random access to the area tree,
but also to the fo tree.

What follows is my feeble attempt to reconcile some of these issues.

The issue with SAX, as I see it, is that because it is one-way, and our
processing is not (I think the standard calls it "non-linear"), we
presumably have to essentially build our own DOM-ish (random access) things
in order to get the job done. I wonder if we don't end up reinventing the
wheel in frustration with that approach. From a cleanliness of design
standpoint at least, it seems much more straightforward to instead use a
DOM-based approach and write chunks of the two DOMs to disk where necessary.
I haven't thought through whether java.io.RandomAccessFile or a regular
database or some other alternative would be the way to go. The LMs can be
totally protected from all of this by abstracting both the FO and Area
Documents -- in other words, they work with abstract nodes on trees and
don't care what was required to make them available.
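
A minimal sketch of what "abstract nodes, storage hidden" could look like,
assuming the java.io.RandomAccessFile route. The names NodeStore and
DiskNodeStore are hypothetical, not existing FOP classes, and a node is
reduced to a string payload purely for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view offered to layout managers: nodes by id, no storage details. */
interface NodeStore extends AutoCloseable {
    int put(String payload) throws IOException;   // returns the new node's id
    String get(int id) throws IOException;        // random access by id
    @Override void close() throws IOException;
}

/** Hypothetical disk-backed store: payloads live in a file, only offsets stay in memory. */
final class DiskNodeStore implements NodeStore {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>();

    DiskNodeStore(File backing) throws IOException {
        this.file = new RandomAccessFile(backing, "rw");
        this.file.setLength(0); // start from an empty backing file
    }

    @Override public int put(String payload) throws IOException {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        long offset = file.length();
        file.seek(offset);            // append at the end of the file
        file.writeInt(bytes.length);  // length prefix, then the payload
        file.write(bytes);
        offsets.add(offset);
        return offsets.size() - 1;
    }

    @Override public String get(int id) throws IOException {
        file.seek(offsets.get(id));   // jump straight to the stored record
        byte[] bytes = new byte[file.readInt()];
        file.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    @Override public void close() throws IOException {
        file.close();
    }
}
```

An in-memory implementation of the same interface could be swapped in for
small documents without the callers noticing, which is the point of the
abstraction.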

Oddly enough, once I have the stability of the DOMs to work from (perhaps
this is more felt than real), an event-based approach seems much more
natural -- like imitating a word processor. In fact, if done properly,
another project could conceivably use FOP as the layout engine for a WYSIWYG
editor. Actually I have been trying to quantify & grasp two processing
models that come to mind: 1) the word-processing model, an event-based
model, and 2) an 18th-century typesetter manually laying out pages, which is
much more of a look-ahead, measure-it-to-see-how-it-fits-before-placing-it
model.

These two models roughly correspond to the two processing models I mentioned
the other day ("I am text, place me somewhere" vs. "I am a page with room,
place something on me"). The second model requires the 2-pass approach. The
first fits either a push or a pull approach (since we can manufacture events
if we need to), the second is definitely pull. When I wrote about those two
models, I was frankly leaning heavily toward the 2nd approach, but I think I
am changing my mind. To explain why, I need to have you forget for a moment
about our SAX-based input (I'll come back to that). Forget also about
performance for a moment, and picture the typesetter setting type one
character at a time, with no thought of what the next character or image
is -- in other words, setting type just like a user sitting at Microsoft
Word does. If the typesetter comes to a concept that messes his previous
work up, he has to yank a line of type out, or perhaps an entire page out,
and replace them. However (and this is the key point), he eventually will
get the job done. In other words, when abstracted this way, the only benefit
to a look-ahead /should be/ performance. Consider our auto table layout
problem. If on the 350th page of the table, I find an item that requires me
to change the width of the columns, which in turn changes the layout of all
350 pages, yes, I am going to burn up a few cycles to accomplish that, but I
/should/ be able to get it done.

So far all I have done is loosely reconciled these two processing models.
The next thing I want to do is to try to compare these two models with FOP's
layout process. If I like the event-based model, then maybe I ought to like
FOP's approach. Let me go first to my 18th-century typesetter. Each time he
has to tear out a line or page of type, he can go back to his manuscript
(his FO document, if you will) to rebuild them. Similarly in a word
processor, I presume that Microsoft Word must have some concept that the 2
lines at the top of page 84 are in the same paragraph as the 3 lines at the
bottom of page 83. Do I have something similar in FOP? What the designer in
me wants is a link between every area in my area tree back to its parent fo
object. Then I know in pretty simple fashion how every item in both trees
relates to any other object in either tree. Since we are using SAX (I
promised I would come back to SAX), I conclude that by definition, we don't
have this. When I get to page 350 of that auto-table layout, I either can't
see the beginning of that table, or I have to store that information some
other way. I then presume that, since we need similar functionality, we have
some surrogate that is probably 1) a real pain to manage, 2) uses just as
much memory as a DOM would, and 3) can't feasibly be segregated and written
to disk if needed (because it is in the area tree??).
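
To make the "link from every area back to its fo object" idea concrete, here
is a rough sketch. FoNode and Area are hypothetical stand-ins, not FOP's
actual classes, and the fields are just enough to show the two-way lookup:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fo-tree node that remembers every area it generated. */
final class FoNode {
    final String name;
    final List<Area> areas = new ArrayList<>();

    FoNode(String name) {
        this.name = name;
    }
}

/** Hypothetical area-tree node holding a back-link to the fo object that produced it. */
final class Area {
    final FoNode generatedBy;
    final int page;
    final String content;

    Area(FoNode generatedBy, int page, String content) {
        this.generatedBy = generatedBy;
        this.page = page;
        this.content = content;
        generatedBy.areas.add(this); // register so the fo node can find its areas
    }
}

final class BackLinkDemo {
    public static void main(String[] args) {
        FoNode paragraph = new FoNode("fo:block");
        new Area(paragraph, 83, "3 lines at the bottom of page 83");
        Area top = new Area(paragraph, 84, "2 lines at the top of page 84");

        // From the area on page 84, walk back to the fo object and list
        // every area that belongs to the same paragraph.
        for (Area sibling : top.generatedBy.areas) {
            System.out.println("page " + sibling.page + ": " + sibling.content);
        }
    }
}
```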

Now I need to reconcile 1) liking an event-based approach, and 2) disliking
SAX. This is actually pretty easy. I can read through a DOM and create
events or something similar. My layout engine doesn't have to know whether a
given event comes from a user sitting at a keyboard or some TreeWalker
stepping through an fo DOM.
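
A small sketch of that idea, using the standard org.w3c.dom.traversal
TreeWalker to turn a DOM into events. LayoutEvents, RecordingLayout and
FoTreeFeeder are hypothetical names. The sketch also assumes the DOM
implementation supports DocumentTraversal, which the parser bundled with
the JDK does:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;

/** Hypothetical event API: the layout side cannot tell who is calling it. */
interface LayoutEvents {
    void startObject(String name);
    void text(String chars);
    void endObject(String name);
}

/** Stand-in for a layout engine: it only records the events it receives. */
final class RecordingLayout implements LayoutEvents {
    final List<String> log = new ArrayList<>();
    @Override public void startObject(String name) { log.add("start " + name); }
    @Override public void text(String chars) { log.add("text " + chars); }
    @Override public void endObject(String name) { log.add("end " + name); }
}

/** Walks an fo DOM and manufactures events from it. */
final class FoTreeFeeder {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    static void feed(Document doc, LayoutEvents sink) {
        // Assumption: the Document also implements DocumentTraversal.
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(),
                NodeFilter.SHOW_ELEMENT | NodeFilter.SHOW_TEXT, null, true);
        visit(walker, sink);
    }

    private static void visit(TreeWalker walker, LayoutEvents sink) {
        Node node = walker.getCurrentNode();
        if (node.getNodeType() == Node.TEXT_NODE) {
            sink.text(node.getNodeValue());
            return;
        }
        sink.startObject(node.getNodeName());
        for (Node child = walker.firstChild(); child != null; child = walker.nextSibling()) {
            visit(walker, sink);
        }
        walker.setCurrentNode(node); // step back up before closing this element
        sink.endObject(node.getNodeName());
    }
}
```

A keyboard handler in an editor could call the same LayoutEvents methods
directly, which is why the layout side needs no knowledge of the DOM.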

Finally, let's come back to performance. So far, I have been talking about
single-character events ("he just typed an 'a' here, lay my line out
again"). But we can also have more efficient bigger events, analogous to
pasting something into a document. Now my TreeWalker says "I have a 35,000
row auto-layout table for you that needs to span the available width of the
page". My Page LM says "Cool, the available width of the page is 4.5 inches."
The TreeWalker crunches some numbers and says, "OK, column 1 needs to be 1
inch wide, column 2 is 1.5 inches, etc. Now, here is the first row." I realize that
this is over-simplified, but I am trying to describe a system that has its
input abstracted.

I question whether SAX is good for FOP's performance. In fact, if it gives
us a clunkier structure, then it almost certainly slows us down. All of the
logic still has to be performed, regardless of the input method, but it must
be performed more slowly if the data is not convenient. I rather think that
the affection for SAX must be because it saves the memory used by the DOM.
It seems to me that writing it to a random-access file when necessary (which
is what I think Peter was suggesting) would be a much better solution.

To conclude, if I were designing this system from scratch, based on what I
know right now, I would:
1. Use DOM for both the fo tree & the area tree.
2. Write them to disk when necessary, hiding all of this from the layout
managers.
3. Use an event-based layout mechanism so that the fo tree doesn't even have
to be there to get layout work done.

I am sure I can be talked out of this by someone smarter, but I wanted to
lay out the whole line of reasoning. My apologies to Peter and anyone else
who may have been working on these points before. I am just now getting
around to them.

After further consideration, my use of "event-based" above may be too
strong. Probably what I mean is more along the lines of API-based. In a
WYSIWYG environment, the event would probably trigger an API action, but
that action could be invoked another way as well. I am too tired to rewrite
it -- I hope you know what I mean.

This final thought is really a question which was briefly addressed during
our recent weekend clarification about the role of the maintenance branch,
and which I wish to apply specifically to the above thoughts. Does or could
the new design give us the ability to (with say, a configuration option)
choose between Layout Philosophy A and B? By this I mean 2 (or more) layout
packages coexisting in the same code base, and sharing common resources that
can be selected (configurable perhaps). If so, then we can play with these
ideas at our leisure, compare them in various ways, transition between them
if necessary, and maybe even keep both to be used in various circumstances.
I think someone (Jeremias perhaps) had indicated that something along these
lines would be possible, but that may have been at a lower level than what I
am discussing here.
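
As a rough illustration of what "choose between Layout Philosophy A and B
with a configuration option" might look like in code, here is a sketch.
Every name in it is hypothetical, the strategies are stubs, and nothing
here reflects how the redesign actually handles this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical common contract that both layout packages would implement. */
interface LayoutStrategy {
    String describe();
}

/** Philosophy A (stub): "I am text, place me somewhere." */
final class EventDrivenLayout implements LayoutStrategy {
    @Override public String describe() {
        return "A: content is pushed at the layout engine as events";
    }
}

/** Philosophy B (stub): "I am a page with room, place something on me." */
final class LookAheadLayout implements LayoutStrategy {
    @Override public String describe() {
        return "B: the page pulls content and measures before placing it";
    }
}

/** Hypothetical registry keyed by the value of a configuration option. */
final class LayoutStrategies {
    private static final Map<String, Supplier<LayoutStrategy>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("A", EventDrivenLayout::new);
        REGISTRY.put("B", LookAheadLayout::new);
    }

    static LayoutStrategy forName(String key) {
        Supplier<LayoutStrategy> supplier = REGISTRY.get(key);
        if (supplier == null) {
            throw new IllegalArgumentException("Unknown layout strategy: " + key);
        }
        return supplier.get();
    }

    public static void main(String[] args) {
        // The key would come from configuration; it is taken from argv here.
        String key = args.length > 0 ? args[0] : "A";
        System.out.println(forName(key).describe());
    }
}
```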

I don't mean to rock the boat. I guess I am kind of like a three-year-old
who asks "why" and "why not" all of the time to the annoyance of all around
him -- I am still trying to learn the basics. Thanks for your patience.

Victor Mote

