Re: Alt-Design status: XML handling
Bertrand Delacretaz wrote: Great work Peter! It makes a lot of sense to use higher-level than SAX events, and thanks for explaining this so clearly. If you allow me a suggestion regarding the structure of the code: maybe using some table-driven stuff instead of the many if statements in FoSimplePageMaster would be more readable? Something like:

    class EventHandler {
        EventHandler(String regionName, boolean discardSpace, boolean required) ...
    }

    /** table of event handlers that must be applied, in order */
    EventHandler[] handlers = {
        new EventHandler(FObjectNames.REGION_BODY, true, true),
        new EventHandler(FObjectNames.REGION_BEFORE, true, false)
    };

...then, in FoSimplePageMaster(...), loop over handlers and let them process the events. I don't know if this applies in general, but it might be clearer to read and less risky to modify.

Bertrand, Sorry this one slipped through the cracks. Some such approach may be a good idea, but I would be loath to call it EventHandler. The whole point about pull parsing is to move away from event handling. I would think of these more as methods with parameters like optional, single or multiple, any. Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, email: [EMAIL PROTECTED]
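Fleshed out, Bertrand's table-driven idea might look roughly like the sketch below. All names here (ChildSpec, checkChildren) are hypothetical illustrations, not actual FOP code; the point is simply that one loop over an ordered table replaces the chain of if statements:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the table-driven idea: an ordered table of child
// specifications replaces per-child if statements in FoSimplePageMaster.
// Names are illustrative only, not actual FOP code.
class ChildSpec {
    final String name;       // expected child element name
    final boolean required;  // must this child be present?

    ChildSpec(String name, boolean required) {
        this.name = name;
        this.required = required;
    }

    /** Table of permitted children, in the order the content model allows. */
    static final ChildSpec[] SIMPLE_PAGE_MASTER = {
        new ChildSpec("region-body", true),
        new ChildSpec("region-before", false),
        new ChildSpec("region-after", false),
        new ChildSpec("region-start", false),
        new ChildSpec("region-end", false),
    };

    /**
     * Walk the actual children against the table. A missing required child,
     * an out-of-order child, or an unknown child fails the check.
     */
    static boolean checkChildren(List<String> children, ChildSpec[] specs) {
        Iterator<String> it = children.iterator();
        String current = it.hasNext() ? it.next() : null;
        for (ChildSpec spec : specs) {
            if (current != null && current.equals(spec.name)) {
                current = it.hasNext() ? it.next() : null;  // consume the match
            } else if (spec.required) {
                return false;  // required child missing
            }
        }
        return current == null;  // any leftover child is unexpected
    }
}
```

Bertrand's sketch also carried a discardSpace flag for whitespace handling; it is omitted here for brevity.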
Re: Alt-Design status: XML handling
Rhett Aultman wrote: But, a pull model can be grafted onto a push model by implementing what amounts to a specialized buffer of the pushed data that accepts pull queries...no? Yes, another alternative is an additional thread with the same duties. See Aleksander Slominski's parser: http://www.extreme.indiana.edu/xgws/papers/xml_push_pull/node3.html -- Oleg Tkachenko eXperanto team Multiconn Technologies, Israel
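The "additional thread" alternative Oleg mentions can be sketched as follows: the push-style producer (in real life, the SAX parser) runs on its own thread and deposits events into a bounded queue, while the consumer pulls them at its own pace. This is a minimal illustration with invented names, not code from any actual parser; the three string "events" stand in for real parser callbacks:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of pull-over-push via a separate producer thread and a bounded
// queue. Illustrative only; a real adapter would push SAX events here
// from ContentHandler callbacks.
class ThreadedPullAdapter {
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(64);

    /** Start the producer thread; here it just pushes three fake events. */
    void startProducer() {
        Thread producer = new Thread(() -> {
            try {
                queue.put("startElement:fo:root");
                queue.put("startElement:fo:page-sequence");
                queue.put("endElement:fo:root");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.setDaemon(true);
        producer.start();
    }

    /** Pull the next event, blocking until the producer supplies one. */
    String next() {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }
}
```

The bounded queue is what throttles the producer: when the consumer falls behind, put() blocks, so events never pile up without limit.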
RE: Alt-Design status: XML handling
Responses below. -Original Message- From: Peter B. West [mailto:[EMAIL PROTECTED]] Sent: Tuesday, November 26, 2002 2:25 AM To: [EMAIL PROTECTED] Subject: Re: Alt-Design status: XML handling This is not a problem for at least the maintenance version of the code. All of the processing is triggered by incoming SAX events, and occurs within the SAX callbacks. These are synchronous events, so the parsing stalls until the callback returns. Page-sequence rendering, e.g., occurs within the endElement() callback of an fo:page-sequence element. True...I did not take synchronous event handling into consideration, although I'm not entirely sure that synchronous event handling is, performance-wise, entirely prudent either...though that's for different reasons. And, I believe, it might be wrong, though I must read the full source text. The push model can be seen as a special case of a pull model in the sense of "Pull everything ASAP, now and until the data is exhausted." But, a pull model can be grafted onto a push model by implementing what amounts to a specialized buffer of the pushed data that accepts pull queries...no? Which is what I have done. Seems like a logical way to implement pull over push.
RE: Alt-Design status: XML handling
-Original Message- From: Peter B. West [mailto:[EMAIL PROTECTED]] Sent: November 26, 2002 3:25 AM To: [EMAIL PROTECTED] Subject: Re: Alt-Design status: XML handling Rhett, To comment on only two aspects of your posting. Rhett Aultman wrote: -Original Message- From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]] Generally, event-driven processing is a pretty good thing. The critical issue with it, though, is the ratio of event production to event processing. If that number is anything greater than 1, then more events are being produced in a stretch of time than can be effectively processed in that stretch of time. Events start to queue up, taking up memory. If it happens enough, the heap starts to get a little too full, the gc runs a little too much, and that causes processing time to suffer even further. Under most circumstances, event-based processing is like using a garden hose to water a bed of flowers. It works just fine. Under more intense cases, though, it can be more like using a garden hose to fill a small container of water, then leaving the hose laying around (spilling water all over the lawn) while the container gets carried off somewhere. Actually, it really matters where the events are coming from. An HTTP server has no control over how many requests it gets, so your description above is apt. But for FOP (disregarding FOPServlet) everything is one process - the XML parser, the formatter, the renderer - so it's ultimately procedural; there may be an internal boundary where an event/callback system is used, but it's all one thread so nothing queues up at all. The only reason to adopt your approach (and I am not saying I don't like it) is because it's easier to understand. Regards, Arved
RE: Alt-Design status: XML handling
Responses below. -Original Message- From: Arved Sandstrom [mailto:[EMAIL PROTECTED]] Sent: Tue 11/26/2002 6:42 PM To: [EMAIL PROTECTED] Cc: Subject: RE: Alt-Design status: XML handling Actually, it really matters where the events are coming from. An HTTP server has no control over how many requests it gets, so your description above is apt. But for FOP (disregarding FOPServlet) everything is one process - the XML parser, the formatter, the renderer - so it's ultimately procedural; there may be an internal boundary where an event/callback system is used, but it's all one thread so nothing queues up at all. Yes...as I said, I caught myself off-guard because I tend to use an event model only when I need to multicast an event or when I need to be able to send events between two threads. With the single thread you're describing, the performance hits I describe aren't an issue. There can be other issues there, but I really don't want to bother because I know they're not relevant. Peter's case for not wanting event-driven processing is much more sound, and I have to say I agree with it.
Re: Alt-Design status: XML handling
Oleg Tkachenko wrote: Peter B. West wrote: Why is it easier for developers to use? Is it because the API is less complex or more easily understood? Not really. As you point out, the SAX API is not all that complex. The problem is that the processing model of SUX is completely inverted. Well, I believe it's more a philosophical question, or a question of programming style. Push vs pull, imperative languages vs declarative languages, etc. - an ancient holy war. One likes to define rules aka sax handlers, another likes to weave a web from if statements, only to be able to control processing order ;) Both pull and push have pros and cons, and it's a pity java still doesn't have a full-fledged pull parsing API (btw, James Clark is working on StAX[1], so it's a matter of time). I don't believe it is only a matter of style. I think the detrimental effects of push for general programming are glaringly obvious. That, I think, rather than catering for simple-minded developers, is what motivated MS' abandonment of SAX. I speak as a long-time anti-MS bigot. You may have come to like writing XSLT that way. It's the only way to write non-hello-world stylesheets in xslt actually. Don't forget, xslt is a declarative language, so probably analogies with java are just irrelevant; they are different beasts. The question is what is good for the fo tree building stuff? Probably you're right, pull is more suitable, but the bad thing is that the real input is a SAX stream, hence we must translate push to pull (funny enough, ms considers this task unfeasible in the XMLReader documentation). I haven't read the documentation, but it may be that they are referring to the infeasibility of moving code built around SAX to an XmlReader environment. Hence the next question is the cost of your interim buffer; what do you think could be its peak and average size? At the moment it is more expensive than it need be; there is no event pool. I am writing one now. It's fairly trivial, as you can imagine. 
The buffer is implemented as a circular buffer, currently of 128 elements, but it has been set at 32, and 64 should be more than enough. The circular buffer places an upper limit on the size, and synchronizes (in a broad sense) the activities of producer (parser) and consumer (tree builder).

    parser:
        until buffer full, write events to buffer
        notify
        wait

    tree builder:
        wait
        until buffer empty, read events from buffer
        notify

In the SAX model, the throttle on parser throughput is the downstream processing that is immediately triggered by the start and end events generated by the parser. In the buffered model, the throttle is the circular buffer and the waits that occur on it. Of course, as I have mentioned recently. And as I also said, the cost of parsing relative to the intensive downstream element processing of FOP is small. If so, isn't it too early to optimize xml handling altogether? What would we benefit from moving from push to pull? Well, sort of automatic validation is a benefit indeed, but I'm not sure it's enough. This is not an optimisation, but a fundamental design decision. It's all or nothing. See the comments about the feasibility of moving from one model to the other. The whole question is context-dependent. If you are engaged in the peephole processing of SUX you may be obliged to use external validation. With top-down processing you have more choice, because your context is travelling with you. btw, what about unexpected content model objects? Will this fail?

    <fo:simple-page-master master-name="default">
        <fo:region-body/>
        <fo:block/>
    </fo:simple-page-master>

Unexpected content models will throw an exception. How that is handled is another question. At the moment, while I am in a debugging phase, most exceptions just propagate up, but all the usual flexibility of the exception system is available for refinement. Don't get me wrong here. I'm not saying that external validation is wrong, merely that with a pull model, the need is reduced. 
There may still be a strong case for it, but not as strong as with SUX. You are right, and that btw allows making external validation optional while still having a reasonable level of validation for free. [1] http://www.jcp.org/en/jsr/detail?id=173 It encourages me greatly that there is so much activity going on in this area. Especially interesting is the Xerces XNI XMLPullParserConfiguration interface. Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
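The parser/tree-builder handshake described in this message can be sketched as a hand-rolled circular buffer with wait/notify. This is a minimal illustration, not the actual Alt-Design code: the real buffer carries XML parser events rather than arbitrary objects, and wraps the interrupt in a RuntimeException only to keep the sketch short:

```java
// Minimal circular buffer mirroring the parser / tree-builder handshake:
// the producer stalls while the buffer is full, the consumer stalls while
// it is empty, and each notifies the other after acting. Illustrative only.
class CircularEventBuffer {
    private final Object[] buf;
    private int head = 0;   // next slot to read
    private int tail = 0;   // next slot to write
    private int count = 0;  // elements currently in the buffer

    CircularEventBuffer(int capacity) {
        buf = new Object[capacity];
    }

    /** Parser side: block while the buffer is full, then deposit an event. */
    synchronized void put(Object event) {
        while (count == buf.length) {
            try {
                wait();                // buffer full - stall the producer
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        buf[tail] = event;
        tail = (tail + 1) % buf.length;
        count++;
        notifyAll();                   // wake a waiting consumer
    }

    /** Tree-builder side: block while the buffer is empty, then take one. */
    synchronized Object get() {
        while (count == 0) {
            try {
                wait();                // buffer empty - stall the consumer
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
        Object event = buf[head];
        buf[head] = null;              // let the event be collected
        head = (head + 1) % buf.length;
        count--;
        notifyAll();                   // wake a waiting producer
        return event;
    }
}
```

The wait() calls sit inside while loops, guarding against spurious wakeups; the bounded capacity is exactly what makes the buffer the throttle in place of SAX's synchronous callbacks.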
Re: Alt-Design status: XML handling
Peter B. West wrote: I don't believe it is only a matter of style. I think the detrimental effects of push for general programming are glaringly obvious. It's just event-driven processing; how could it be detrimental? I haven't read the documentation, but it may be that they are referring to the infeasibility of moving code built around SAX to an XmlReader environment. It's in the Comparing XmlReader to SAX Reader page[1]: The push model can be built on top of the pull model. The reverse is not true. Too categorical a statement, I think. This is not an optimisation, but a fundamental design decision. It's all or nothing. See the comments about the feasibility of moving from one model to the other. If so, we need more opinions from others. [1] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconcomparingxmlreadertosaxreader.asp -- Oleg Tkachenko eXperanto team Multiconn Technologies, Israel
RE: Alt-Design status: XML handling
Completely generalized and probably worthless response below. ;) -Original Message- From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]] Sent: Monday, November 25, 2002 4:01 PM To: [EMAIL PROTECTED] Subject: Re: Alt-Design status: XML handling Peter B. West wrote: I don't believe it is only a matter of style. I think the detrimental effects of push for general programming are glaringly obvious. It's just event-driven processing; how could it be detrimental? I cannot speak for FOP, but I can speak in generalities about this. The difference between event-based and pull-style is roughly the difference between using a garden hose and using a garden hose with one of those spray-gun nozzles on it. In the former case, the water keeps coming out of the hose, pretty much whether you want it to or not. In the latter case, the water comes out only when you want it, but it requires effort on your behalf. When should each be used? Generally, event-driven processing is a pretty good thing. The critical issue with it, though, is the ratio of event production to event processing. If that number is anything greater than 1, then more events are being produced in a stretch of time than can be effectively processed in that stretch of time. Events start to queue up, taking up memory. If it happens enough, the heap starts to get a little too full, the gc runs a little too much, and that causes processing time to suffer even further. Under most circumstances, event-based processing is like using a garden hose to water a bed of flowers. It works just fine. Under more intense cases, though, it can be more like using a garden hose to fill a small container of water, then leaving the hose laying around (spilling water all over the lawn) while the container gets carried off somewhere. Comparatively, if a program decides to pull in more data to process, then there's an opportunity to control the amount that comes in at any given point. 
This means that there's less (or no) need to worry about the rate at which data comes in, since it's turned on and off rather easily. The amount of memory wasted is minimized (yes, I consider a wait queue to be a waste of memory, since it cannot be used for anything more productive), but the downside is that, of course, keeping the data streaming in for long periods of time tends to require continuous effort to tell the pulling system to pull in another chunk, much like how it takes effort to keep the valve open on a hose's spray gun. There has been a time or two in my (admittedly, somewhat short) career as a developer where I've had cause to stop thinking in terms of an event system and instead work with a pull concept, and it was for the reason I gave - when an event source was allowed to generate events at its own pace, and the event handler took too long to process, the events piled up and performance suffered. I'd expect a very similar situation in FOP. SAX processing tends to fire a lot of events, and if FOP does a reasonable amount of processing work relative to the work needed to fire another event, then those events are piling up in memory and wasting space. I can definitely see an argument for a pull-based system. Also, I think that a push model probably isn't going to scale as effectively to larger documents, where a pull system should have more constant performance regardless of document size. Of course, take that with a mine of salt. It's in the Comparing XmlReader to SAX Reader page[1]: The push model can be built on top of the pull model. The reverse is not true. Too categorical a statement, I think. And, I believe, it might be wrong, though I must read the full source text. The push model can be seen as a special case of a pull model in the sense of "Pull everything ASAP, now and until the data is exhausted." 
But, a pull model can be grafted onto a push model by implementing what amounts to a specialized buffer of the pushed data that accepts pull queries...no? If so, we need more opinions from others. My major interests lie in things happening above this layer, so I don't really have too much concern, but I definitely can see a good case for a pull model.
Re: Alt-Design status: XML handling
Oleg Tkachenko wrote: Peter B. West wrote: I don't believe it is only a matter of style. I think the detrimental effects of push for general programming are glaringly obvious. It's just event-driven processing; how could it be detrimental? I may have referred to Dijkstra (R.I.P.) here before. I think it was he who illustrated the importance of appropriate representations by reference to Roman numerals. In the Middle Ages, before the coming of Arabic numerals and zero, long division was considered do-able, though very difficult, and was taught to the foolhardy at universities. As I recall the story, the topic was computer languages, and the moral was: if you use a tool appropriate to the problem you are trying to solve, life will be much easier. As for the selection of a language, so for the selection of a processing model. Event-driven processing is appropriate to event-driven systems. A traffic control system is an event-driven system, as is an operating system; processing an xsl:fo document is not. The variability of xsl:fo processing is constrained within carefully defined hierarchical limits. This shows in the simple-page-master debate. Why has this generally been implemented in violation of the spec, while I picked that violation up the first time I ran against a variant file? The children are determined by the parent, not the other way around. So within an instance of simple-page-master, I expect the first child to be a region-body. Following that, I expect a region-before, but I am not upset if it's not there. Etc. These relationships are quite naturally expressed in a manner that echoes the hierarchical ordering of the document. How is this done with SAX? Nodes are created without context - they just happen. The node must grope around to find its parent, and the virtual tree is constructed from the children up. The parent basically only gets control when its own endElement event occurs. 
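The pull-side discipline Peter describes - the parent expecting its children in order, some required and some optional - might read like the following sketch. The names (ChildStream, expect, parseSimplePageMaster) are hypothetical, not the Alt-Design API, and strings stand in for real parser events:

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of pull-style content-model handling: the parent
// expects each child in document order, marked required or optional.
// Names are illustrative, not the Alt-Design API.
class ChildStream {
    static final boolean REQUIRED = true;
    static final boolean OPTIONAL = false;

    private final Iterator<String> events;
    private String lookahead;  // next child element name, or null at end

    ChildStream(List<String> childNames) {
        events = childNames.iterator();
        lookahead = events.hasNext() ? events.next() : null;
    }

    /**
     * If the next child matches, consume it and return true. If it does
     * not match and was required, fail; otherwise just return false.
     */
    boolean expect(String name, boolean required) {
        if (name.equals(lookahead)) {
            lookahead = events.hasNext() ? events.next() : null;
            return true;
        }
        if (required) {
            throw new IllegalStateException(
                "expected " + name + ", found " + lookahead);
        }
        return false;
    }

    /** Parse simple-page-master's children; true if all were consumed. */
    static boolean parseSimplePageMaster(List<String> children) {
        ChildStream s = new ChildStream(children);
        s.expect("region-body", REQUIRED);    // first child must be region-body
        s.expect("region-before", OPTIONAL);  // the regions that follow may be absent
        s.expect("region-after", OPTIONAL);
        s.expect("region-start", OPTIONAL);
        s.expect("region-end", OPTIONAL);
        return s.lookahead == null;           // e.g. a stray fo:block fails here
    }
}
```

Note how the flow of control follows the hierarchy of the document: the parent drives, and an unexpected child surfaces immediately as an exception rather than as a node groping for its parent in a callback.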
I haven't read the documentation, but it may be that they are referring to the infeasibility of moving code built around SAX to an XmlReader environment. It's in the Comparing XmlReader to SAX Reader page[1]: The push model can be built on top of the pull model. The reverse is not true. Too categorical a statement, I think. Having read the reference, I agree. This is not an optimisation, but a fundamental design decision. It's all or nothing. See the comments about the feasibility of moving from one model to the other. If so, we need more opinions from others. True enough for the HEAD line. But FOP_0-20-0_Alt-Design will continue on the same track. I have been working on it alone for nearly two years now, and for a year before it was even allowed into the code base. Part of what I was doing was pure experiment, which I was prepared to abandon, but much is there because I believe in it, including the pull code. I don't have to persuade a boss, in advance, that my approach is right. I just have to persuade myself. Then I can let the code do the talking. It's called Open Source development. [1] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpguide/html/cpconcomparingxmlreadertosaxreader.asp Given the interest in pull APIs for XML, another advantage of my code is that, when a low-level pull processor becomes available, it can be incorporated into my design with a minimum of fuss for greater efficiency. Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
Re: Alt-Design status: XML handling
Rhett, To comment on only two aspects of your posting. Rhett Aultman wrote: -Original Message- From: Oleg Tkachenko [mailto:[EMAIL PROTECTED]] Generally, event-driven processing is a pretty good thing. The critical issue with it, though, is the ratio of event production to event processing. If that number is anything greater than 1, then more events are being produced in a stretch of time than can be effectively processed in that stretch of time. Events start to queue up, taking up memory. If it happens enough, the heap starts to get a little too full, the gc runs a little too much, and that causes processing time to suffer even further. Under most circumstances, event-based processing is like using a garden hose to water a bed of flowers. It works just fine. Under more intense cases, though, it can be more like using a garden hose to fill a small container of water, then leaving the hose laying around (spilling water all over the lawn) while the container gets carried off somewhere. This is not a problem for at least the maintenance version of the code. All of the processing is triggered by incoming SAX events, and occurs within the SAX callbacks. These are synchronous events, so the parsing stalls until the callback returns. Page-sequence rendering, e.g., occurs within the endElement() callback of an fo:page-sequence element. It's in the Comparing XmlReader to SAX Reader page[1]: The push model can be built on top of the pull model. The reverse is not true. Too categorical a statement, I think. And, I believe, it might be wrong, though I must read the full source text. The push model can be seen as a special case of a pull model in the sense of "Pull everything ASAP, now and until the data is exhausted." But, a pull model can be grafted onto a push model by implementing what amounts to a specialized buffer of the pushed data that accepts pull queries...no? Which is what I have done. Peter -- Peter B. 
West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
Re: Alt-Design status: XML handling
Peter B. West wrote: Why is it easier for developers to use? Is it because the API is less complex or more easily understood? Not really. As you point out, the SAX API is not all that complex. The problem is that the processing model of SUX is completely inverted. Well, I believe it's more a philosophical question, or a question of programming style. Push vs pull, imperative languages vs declarative languages, etc. - an ancient holy war. One likes to define rules aka sax handlers, another likes to weave a web from if statements, only to be able to control processing order ;) Both pull and push have pros and cons, and it's a pity java still doesn't have a full-fledged pull parsing API (btw, James Clark is working on StAX[1], so it's a matter of time). You may have come to like writing XSLT that way. It's the only way to write non-hello-world stylesheets in xslt actually. Don't forget, xslt is a declarative language, so probably analogies with java are just irrelevant; they are different beasts. The question is what is good for the fo tree building stuff? Probably you're right, pull is more suitable, but the bad thing is that the real input is a SAX stream, hence we must translate push to pull (funny enough, ms considers this task unfeasible in the XMLReader documentation). Hence the next question is the cost of your interim buffer; what do you think could be its peak and average size? Of course, as I have mentioned recently. And as I also said, the cost of parsing relative to the intensive downstream element processing of FOP is small. If so, isn't it too early to optimize xml handling altogether? What would we benefit from moving from push to pull? Well, sort of automatic validation is a benefit indeed, but I'm not sure it's enough. The whole question is context-dependent. If you are engaged in the peephole processing of SUX you may be obliged to use external validation. With top-down processing you have more choice, because your context is travelling with you. 
btw, what about unexpected content model objects? Will this fail?

    <fo:simple-page-master master-name="default">
        <fo:region-body/>
        <fo:block/>
    </fo:simple-page-master>

Don't get me wrong here. I'm not saying that external validation is wrong, merely that with a pull model, the need is reduced. There may still be a strong case for it, but not as strong as with SUX. You are right, and that btw allows making external validation optional while still having a reasonable level of validation for free. [1] http://www.jcp.org/en/jsr/detail?id=173 -- Oleg Tkachenko eXperanto team Multiconn Technologies, Israel
Re: Alt-Design status: XML handling
Manuel... Manuel Mall wrote: Peter, thanks for the update and explanation on your Alt-Design. To be honest: I like it. Reminds me very much of my first exposure to programming language processing (compilers) nearly 30 years ago - top-down recursive-descent parsing for Pascal. I still think it's the best parsing model around (beats YACC type stuff by a long way) in terms of ease of development / understanding / use. Recursive descent is like magic, isn't it? I agree that it's a very tidy approach, which I have used a few times. What motivated me here, though, was just the desire to have the flow of processing follow the natural hierarchy of the data. Such an approach starts with a guaranteed basis of algorithmic clarity; the alternative, it seems to me, starts with a guaranteed basis of obscurity. That, certainly, is what I found when I tried to follow the logic trail through the code. The other idea was the old unix principle of the pipeline. Isolate the components and have them communicate via (possibly bi-directional) pipelines of data/commands/events. This doesn't map very cleanly onto the processes that operate on the FO tree and the layout/Area trees, but it was just what I needed to invert the flow of control during FO tree building. Do you have any similar simple / effective ideas for the layout part which, following the discussions on this list, the new FOP design under CVS HEAD seems to struggle most with? There are good reasons why the layout is not susceptible to the same simple solution. I do have a number of ideas to contribute, and when the web site is restored I will be referring to some of the notes I have made and posted there. Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
RE: Alt-Design status: XML handling
Victor Mote wrote: Oleg Tkachenko wrote: I think we should separate the fo tree itself from the process of its building. The fo tree structure is required and I agree with Keiron - it's not a DOM, it's just a tree representation, and I cherish the idea to make it an effectively small structure like saxon's internal tree. But any interim buffers should be avoided as much as possible (well, Peter's buffer seems not to be a burden). This is probably a philosophical difference. It seems to me that the area tree is built on the foundation of the fo tree, and that if we only get a brief glimpse of the fo tree as it goes by, not only does our foundation disappear, but we end up putting all of that weight into the superstructure, which tends to make the whole thing collapse. Oleg: After thinking about this a bit more, I think I confused this issue. I think what you were saying is that the existing FOP FO tree /is/ the lightweight data structure that you like. I see your point, and yes I agree, there is no need to replace it with something heavier. My train of thought was in a different direction -- i.e. how to get that structure written to disk when necessary so that it doesn't all have to be in memory. I (think I) also had a wrong conception of how long the FO tree data persisted. My apologies for the confusion. Victor Mote
Re: Alt-Design status: XML handling
Victor Mote wrote: Victor Mote wrote: Oleg Tkachenko wrote: I think we should separate the fo tree itself from the process of its building. The fo tree structure is required and I agree with Keiron - it's not a DOM, it's just a tree representation, and I cherish the idea to make it an effectively small structure like saxon's internal tree. But any interim buffers should be avoided as much as possible (well, Peter's buffer seems not to be a burden). This is probably a philosophical difference. It seems to me that the area tree is built on the foundation of the fo tree, and that if we only get a brief glimpse of the fo tree as it goes by, not only does our foundation disappear, but we end up putting all of that weight into the superstructure, which tends to make the whole thing collapse. Oleg: After thinking about this a bit more, I think I confused this issue. I think what you were saying is that the existing FOP FO tree /is/ the lightweight data structure that you like. I see your point, and yes I agree, there is no need to replace it with something heavier. My train of thought was in a different direction -- i.e. how to get that structure written to disk when necessary so that it doesn't all have to be in memory. I (think I) also had a wrong conception of how long the FO tree data persisted. My apologies for the confusion. Victor, I will comment at greater length, later, on the issues you have raised, but I want to make some comments on the tree structures here. Most people coming to FOP get confused by the fact that SAX is used for parsing. They think in terms of a SAX/DOM dichotomy, and assume that, because we are using SAX, we have nothing like a DOM. In fact, the FO tree is our DOM, or the first stage of our DOM. In the beginning... the FO tree was always there while the area tree was being built, but Mark Lillywhite did some hacking to restrict the tree to the currently active page sequence. As you point out, the FO tree provides the semantics of the layout. 
The Area tree is an internal representation of the series of marks on the page. If re-flowing is called for, the information from the FO tree is, once again, required. In my opinion, that means that the FO tree has to be cached. To be more precise, the FO tree has to be able to be cached. I envisage the layout engine feeding instructions back to the FO tree concerning subtrees; basically, delete subtree or cache subtree. The layout engine knows whether the layout of a particular page or page sequence is firm or rubbery, and can instruct the FO tree accordingly. Such decisions would be made very carefully in the layout engine. Back in the mists of time, Arved noted that the page numbering problem could be minimised by allowing enough room for the page number worst case. That was a sensible restriction, but it implies a good guess about just what that worst case is going to be. To get that completely right, you need to lay it all out. In any case, if you have the ever-popular Page x of y in your static-content, you need to redo every page anyway. What the initial guess, if correct, circumvents is the need to reflow every page, with all of its nightmarish implications. This is a case for which the min/opt/max expressions of FOP were made. Take a punt about last page number width. Lay out the pages, using optimum. Get to the end, with all page numbers resolved. Go back and reflow lines/paragraphs as necessary, using the full min/max range to avoid page under/overflow. (N.B. This won't entirely remove the need for backup and reflow in other circumstances.) I should point out here that I perceive the need for a third tree - a layout tree. It parallels the layout managers, which themselves form a tree. This is still a vague idea for me, but the layout tree would be the work-in-progress on the area tree. It's necessary because much of the layout happens bottom-up, and at the bottom, layout is occurring which cannot go into the current page. 
Firstly, you don't want to throw away the layout work that you have already done. Secondly, after the page boundary slashes across the layout you have been engaged in, you want to be able to pick up all of the threads again at the beginning of the new page. The layout tree formalises this procedure. Read Jeffrey Kingston's Lout design document for some insight on this. When I talk about the layout engine, I have in mind the process that builds the layout tree, and moves chunks as they are completed into the area tree. Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
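The min/opt/max page-number strategy Peter describes (guess the worst-case width, lay out at optimum, reflow only if the guess was wrong) can be sketched roughly as follows. This is a hypothetical illustration, not FOP code; the class, method names, and the one-digit width are all invented for the example.

```java
public class PageNumberWidthDemo {
    // Hypothetical sketch of the "take a punt about last page number width"
    // idea: reserve room for the widest plausible page number before layout,
    // so that most documents need no reflow once the real count is known.
    // Widths are in abstract font units; 600 per digit is an assumption.
    static final int DIGIT_WIDTH = 600;

    // Worst-case guess: assume the document will not exceed maxPagesGuess.
    static int reservedWidth(int maxPagesGuess) {
        return DIGIT_WIDTH * String.valueOf(maxPagesGuess).length();
    }

    // After the first pass the real page count is known; only if the actual
    // number needs more room than was reserved must pages be reflowed within
    // their min/max range.
    static boolean needsReflow(int actualPages, int maxPagesGuess) {
        return reservedWidth(actualPages) > reservedWidth(maxPagesGuess);
    }

    public static void main(String[] args) {
        System.out.println(needsReflow(87, 999));   // 2 digits fit in 3: false
        System.out.println(needsReflow(1203, 999)); // 4 digits overflow: true
    }
}
```

As the thread notes, this only circumvents the reflow when the guess holds; "Page x of y" in static-content still forces every page to be redone, though not re-laid-out.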
Re: Alt-Design status: XML handling
Peter B. West wrote: [...] STATUS: The XML pull buffering has been working for some considerable time. I have simply been extending the get/expect methods on top of the simpler methods as I have found a requirement for them in building the FO tree. In cases where the DTD is well known and well structured, XML pull is much easier to use than SAX. For example, one can write an XSL stylesheet with templates or with many for-each. SAX is similar to templates, XML pull is similar to for-each. Having worked so much with SAX stuff, I can say that in many cases SAX is effectively a PITA, as DOM is for some, and as XML pull is too for some. If the code proves to be easier to understand and write, it will be easier to fix and maintain, so this option should IMHO be taken into strong consideration. My 2c -- Nicola Ken Barozzi [EMAIL PROTECTED] - verba volant, scripta manent - (discussions get forgotten, just code remains)
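The templates-vs-for-each analogy can be made concrete with a pull loop. The sketch below uses StAX (`XMLStreamReader`), a later standardization of the pull idea that did not exist when this thread was written, so it is an illustration of the model rather than of Peter's actual buffering code: the caller drives the parser, and the document structure is mirrored directly in ordinary control flow instead of being inverted into callbacks.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullDemo {
    // Count "block" elements with a pull loop. With SAX the same task needs
    // a handler class and callback state; here it is a plain while loop.
    static int countBlocks(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int n = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "block".equals(r.getLocalName())) {
                    n++;
                }
            }
            return n;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<flow><block>a</block><block>b</block></flow>";
        System.out.println(countBlocks(xml)); // prints 2
    }
}
```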
RE: Alt-Design status: XML handling
Peter B. West wrote: quote ... Echoing sentiments recently expressed in this publication, Clark said that SAX, though efficient, was very hard to use, and that DOM had obvious limitations due to the requirement that the document being processed be in memory. He suggested that what was needed was a standard pull API, one that efficiently allowed random access to XML documents. First, thanks for the update on your work -- I understand what you are doing a little better. Second, the statement above about random access almost jumped out at me, because I had exactly the same thought earlier today while contemplating a thread on the XSL-FO list which discussed processing of long documents and memory constraints related to them. The closest thing to a perfect document processing system that I have come across is FrameMaker, which seems to be able to handle pretty large documents with a pretty small footprint. I don't know for sure, but it seems to me that the area tree (if you will) is written to disk, and pages can be efficiently jumped to in an arbitrary manner. The WYSIWYG editor is essentially a viewport on the portion of the document in memory, which is itself a subset of the disk document. As you edit the document, I presume that events are sent to something akin to a layout manager, which has to do something with them. Now, in our case, we need to not only have random access to the area tree, but also to the fo tree. What follows is my feeble attempt to reconcile some of these issues. The issue with SAX, as I see it, is that because it is one-way, and our processing is not (I think the standard calls it non-linear), we presumably have to essentially build our own DOM-ish (random access) things in order to get the job done. I wonder if we don't end up reinventing the wheel in frustration with that approach. 
From a cleanliness of design standpoint at least, it seems much more straightforward to instead use a DOM-based approach and write chunks of the two DOMs to disk where necessary. I haven't thought through whether java.io.RandomAccessFile or a regular database or some other alternative would be the way to go. The LMs can be totally protected from all of this by abstracting both the FO and Area Documents -- in other words, they work with abstract nodes on trees and don't care what was required to make them available. Oddly enough, once I have the stability of the DOMs to work from (perhaps this is more felt than real), an event-based approach seems much more natural -- like imitating a word processor. In fact, if done properly, another project could conceivably use FOP as the layout engine for a WYSIWYG editor. Actually, I have been trying to grasp two processing models that come to mind: 1) the word-processing model, an event-based model, and 2) an 18th-century typesetter manually laying out pages, which is much more of a look-ahead, measure-it-to-see-how-it-fits-before-placing-it model. These two models roughly correspond to the two processing models I mentioned the other day (I am text, place me somewhere vs. I am a page with room, place something on me). The second model requires the 2-pass approach. The first fits either a push or a pull approach (since we can manufacture events if we need to), the second is definitely pull. When I wrote about those two models, I was frankly leaning heavily toward the 2nd approach, but I think I am changing my mind. To explain why, I need to have you forget for a moment about our SAX-based input (I'll come back to that). Forget also about performance for a moment, and picture the typesetter setting type one character at a time, with no thought of what the next character or image is -- in other words, setting type just like a user sitting at Microsoft Word does. 
If the typesetter comes to a concept that messes his previous work up, he has to yank a line of type out, or perhaps an entire page out, and replace them. However (and this is the key point), he eventually will get the job done. In other words, when abstracted this way, the only benefit to a look-ahead /should be/ performance. Consider our auto table layout problem. If on the 350th page of the table, I find an item that requires me to change the width of the columns, which in turn changes the layout of all 350 pages, yes, I am going to burn up a few cycles to accomplish that, but I /should/ be able to get it done. So far all I have done is loosely reconciled these two processing models. The next thing I want to do is to try to compare these two models with FOP's layout process. If I like the event-based model, then maybe I ought to like FOP's approach. Let me go first to my 18th-century typesetter. Each time he has to tear out a line or page of type, he can go back to his manuscript (his FO document, if you will) to rebuild them. Similarly in a word processor, I presume that Microsoft Word must have some concept that the 2 lines at the top of page 84 are in the same paragraph as the 3
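The auto table layout problem Victor raises is the classic case where look-ahead trades cycles for reflow. A two-pass sketch of the idea, with invented names and toy integer widths standing in for real layout metrics: pass one scans every cell to find the widest content per column, so pass two can lay out all rows with final, stable column widths and no page ever needs to be redone.

```java
import java.util.List;

public class AutoTableDemo {
    // Pass 1 of a hypothetical auto table layout: find the widest cell in
    // each column across all rows. Each row is an array of cell widths.
    static int[] columnWidths(List<int[]> rows, int columns) {
        int[] widths = new int[columns];
        for (int[] row : rows) {
            for (int c = 0; c < columns; c++) {
                widths[c] = Math.max(widths[c], row[c]); // track widest cell
            }
        }
        return widths;
    }

    public static void main(String[] args) {
        // A wide cell on a late row (25) widens column 1 for every row.
        List<int[]> rows = List.of(new int[]{10, 4}, new int[]{3, 25});
        int[] w = columnWidths(rows, 2);
        System.out.println(w[0] + "," + w[1]); // prints 10,25
    }
}
```

The single-pass alternative is exactly Victor's scenario: commit to widths early, and a wide cell on page 350 forces re-laying-out everything behind it.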
RE: Alt-Design status: XML handling
On Thu, 2002-11-21 at 12:43, Victor Mote wrote: To conclude, if I were designing this system from scratch, based on what I know right now, I would: 1. Use DOM for both the fo tree and the area tree. I don't know whether I would call it a DOM, but the area tree is an independent data structure that contains all the information necessary to render the document. 2. Write them to disk when necessary, hiding all of this from the layout managers. This has already been done for the area tree. I use the CachedRenderPagesModel all the time. If it cannot render the page immediately then it saves it to disk. The layout managers only know about adding areas to a page and then adding the page to the area tree. For rendering it can dispose of the page once rendered; for the awt viewer it could save all pages to disk and load when necessary. As described here (written a long time ago and needs updating): http://xml.apache.org/fop/design/areas.html I don't see why you would need all the fo tree available; each page sequence is independent for the flow, and often each page can be finished before the next page. 3. Use an event-based layout mechanism so that the fo tree doesn't even have to be there to get layout work done. Depends exactly what you mean, but I think that is the general idea; care to implement it? I am sure I can be talked out of this by someone smarter, but I wanted to lay out the whole line of reasoning. My apologies to Peter and anyone else who may have been working on these points before. I am just now getting around to them. After further consideration, my use of event-based above may be too strong. Probably what I mean is more along the lines of API-based. In a WYSIWYG environment, the event would probably trigger an API action, but that action could be invoked another way as well. I am too tired to rewrite it -- I hope you know what I mean. 
This final thought is really a question which was briefly addressed during our recent weekend clarification about the role of the maintenance branch, and which I wish to apply specifically to the above thoughts. Does or could the new design give us the ability to (with say, a configuration option) choose between Layout Philosophy A and B? By this I mean 2 (or more) layout packages coexisting in the same code base, and sharing common resources that can be selected (configurable perhaps). If so, then we can play with these ideas at our leisure, compare them in various ways, transition between them if necessary, and maybe even keep both to be used in various circumstances. I think someone (Jeremias perhaps) had indicated that something along these lines would be possible, but that may have been at a lower level than what I am discussing here. This should be quite simple to do. There would be a basic interface set for the layout managers when being created by the fo tree. We could possibly have a common one for inline objects. The actual layout implementation could then be changed. Again, this will need to be implemented. I don't mean to rock the boat. I guess I am kind of like a three-year-old who asks why and why not all of the time to the annoyance of all around him -- I am still trying to learn the basics. Thanks for your patience. I keep getting the feeling that everyone is saying the current design is wrong and here is a better idea, which turns out to be the same as the current design. When will people start doing it?
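Keiron's point about a basic interface set for the layout managers is essentially the strategy pattern: the fo tree requests layout managers through an interface, and configuration decides which implementation it gets. A minimal sketch, with all names (`LayoutManagerFactory`, the "alt"/"head" config values, the strings) invented for illustration only:

```java
public class LayoutStrategyDemo {
    // Hypothetical interface the fo tree would code against; two layout
    // packages can coexist as long as both implement it.
    interface LayoutManagerFactory {
        String describe();
    }

    // Configuration selects the implementation; the fo tree never knows
    // which layout philosophy is actually behind the interface.
    static LayoutManagerFactory select(String configured) {
        if ("alt".equals(configured)) {
            return () -> "alt-design layout";
        }
        return () -> "head layout"; // default strategy
    }

    public static void main(String[] args) {
        System.out.println(select("alt").describe());  // prints alt-design layout
        System.out.println(select("head").describe()); // prints head layout
    }
}
```

This is what lets two layout philosophies be compared, transitioned between, or kept side by side, as Victor asks.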
Re: Alt-Design status: XML handling
Victor Mote wrote: The issue with SAX as I see it, is that because it is one-way, and our processing is not (I think the standard calls it non-linear), we presumably have to essentially build our own DOM-ish (random access) things in order to get the job done. I think we should separate the fo tree itself from the process of its building. fo tree structure is required and I agree with Keiron - it's not a DOM, it's just a tree representation, and I cherish the idea to make it an effectively small structure like saxon's internal tree. But any interim buffers should be avoided as much as possible (well, Peter's buffer seems not to be a burden). To conclude, if I were designing this system from scratch, based on what I know right now, I would: 1. Use DOM for both the fo tree and the area tree. Bad idea, I believe. DOM is a heavyweight, versatile representation of an xml document (recall entities, pi's etc. nodes), while we need an effective and lightweight structure to hold fo/area tree information. DOM has a lot of synchronization stuff, while our trees are almost read-only actually. Ahh stop, probably you didn't mean w3c DOM? -- Oleg Tkachenko eXperanto team Multiconn Technologies, Israel
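The lightweight-tree idea Oleg contrasts with DOM can be sketched in a few lines: a node is little more than a name code, a parent link, and a child list, with none of DOM's entity/PI/synchronization machinery. This is an illustrative toy in the spirit of Saxon's internal tree, not FOP's actual FO tree; the fields and methods are invented.

```java
import java.util.ArrayList;
import java.util.List;

public class FoNodeDemo {
    // Minimal sketch of a lightweight, almost read-only tree node.
    static final class FoNode {
        final int nameCode;      // index into a shared name table (assumed)
        final FoNode parent;
        final List<FoNode> children = new ArrayList<>();

        FoNode(int nameCode, FoNode parent) {
            this.nameCode = nameCode;
            this.parent = parent;
            if (parent != null) parent.children.add(this); // link into tree
        }

        int depth() {
            return parent == null ? 0 : parent.depth() + 1;
        }
    }

    public static void main(String[] args) {
        FoNode root = new FoNode(0, null);      // e.g. fo:root
        FoNode pageSeq = new FoNode(1, root);   // e.g. fo:page-sequence
        FoNode flow = new FoNode(2, pageSeq);   // e.g. fo:flow
        System.out.println(flow.depth());          // prints 2
        System.out.println(root.children.size());  // prints 1
    }
}
```

Because nodes carry only structure, whole page-sequence subtrees can be dropped or cached cheaply, which is what Mark Lillywhite's restriction to the active page sequence relies on.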
Re: Alt-Design status: XML handling
Oleg, ... Oleg Tkachenko wrote: Peter B. West wrote: taking a very isolated path. My motivation can be summed up in the slogan SAX SUX. I couldn't understand why anyone would persist with it for any complex tasks, e.g. FOP. Actually I cannot say I fully agree with this, because I don't see anything complex in the SAX processing model. And being an xslt fan I'm obviously a push-model fan. ... significant difference makes XmlReader much easier to use for most Microsoft developers that are used to working with firehose (forward-only/read-only) cursors in ADO. Well, let's consider pull model pros and contras: + easy to use by developer + benefits by kind of structural validation + more? Why is it easier for developers to use? Is it because the API is less complex or more easily understood? Not really. As you point out, the SAX API is not all that complex. The problem is that the processing model of SUX is completely inverted. You may have come to like writing XSLT that way. You may be working with very general grammars, and have no other choice. That doesn't make the inverted, inside-out model any more natural for the expression of processes and algorithms. Easier for developers to use means an easier vocabulary for the expression and solution of programming models and problems; it means easier to document, easier to read and understand, easier to maintain and extend (in the sense of adding functionality). - it glues processing to a particular xml structure, which is not so bad for vocabularies with well-defined and stable content models. The question is whether xsl-fo is such a kind of vocabulary. I think it isn't. As a matter of fact, xsl-fo is not even expressible in a dtd or schema, besides the possibility of extensions. I think that a W3C Recommendation qualifies as a well-defined and stable vocabulary. Hmm. Well, you know what I mean. It changes only infrequently, the changes are well-defined, and are going to involve changes, possibly major, to many parts of the code base anyway. 
It certainly cannot be described as a dynamic vocabulary. - is there a performance penalty? I used to think that easy-to-use stuff always costs something. Of course, as I have mentioned recently. And as I also said, the cost of parsing relative to the intensive downstream element processing of FOP is small. Obviously, you would look at optimising that as much as possible. - more? Note also that the structure of the code does its own validation. It generates the simple-page-master subtree according to the content model (region-body,region-before?,region-after?,region-start?,region-end?) That's good, but it's not full-fledged validation, unfortunately. Too much ad-hoc validation is bad, I believe. If we need validation it must be done by a specialized validation module, and validation should not be scattered throughout the whole code. Much of the validation of FOP has to be self-validation anyway, because so many of the constraints are context-dependent. The whole question is context-dependent. If you are engaged in the peephole processing of SUX you may be obliged to use external validation. With top-down processing you have more choice, because your context is travelling with you. Don't get me wrong here. I'm not saying that external validation is wrong, merely that with a pull model, the need is reduced. There may still be a strong case for it, but not as strong as with SUX. And a final question - what's wrong with SAX besides its possible complexity? Isn't that enough? Peter -- Peter B. West [EMAIL PROTECTED] http://www.powerup.com.au/~pbwest/ Lord, to whom shall we go?
RE: Alt-Design status: XML handling
Keiron Liddle wrote: On Thu, 2002-11-21 at 12:43, Victor Mote wrote: To conclude, if I were designing this system from scratch, based on what I know right now, I would: 1. Use DOM for both the fo tree and the area tree. I don't know whether I would call it a DOM but the area tree is an independent data structure that contains all the information necessary to render the document. 2. Write them to disk when necessary, hiding all of this from the layout managers. This has already been done for the area tree. I use the CachedRenderPagesModel all the time. If it cannot render the page immediately then it saves it to disk. The layout managers only know about adding areas to a page and then adding the page to the area tree. For rendering it can dispose of the page once rendered, for awt viewer it could save all pages to disk and load when necessary. As described here (written a long time ago and needs updating): http://xml.apache.org/fop/design/areas.html OK, I just went back and reread it. There is still something I don't understand; I'll get to that in a minute. First, let me say that perhaps the better way for me to learn this would be to follow it in a debugger. I'm not too lazy to do that, and if /these issues/ are working pretty well right now, then that is probably what I should be doing -- just say the word. Here is what (after reading the doc and considering your comments) my thick head doesn't yet grasp -- when we say the page is cached, I understood that to mean that it is immutably laid out, but that it can't be rendered yet because some previous page cannot be finally laid out yet. What I am trying to address is the situation, like the auto table layout, where something I see while trying to lay out page 5000 changes pages 150-4999 as well. I have to now push some rows or lines from 150 to 151, which triggers pushing some lines from 151 to 152, etc. So the first question is whether the cached pages are changeable or unchangeable. 
If changeable, then we should be able to deal with arbitrarily long documents and (assuming some reasonable amount of basic memory) not worry about running out of memory -- constrained only by disk space. The second question that I am trying to grasp is, if the cached pages are changeable, then how do we know enough to change those 4850 pages properly without having our fo tree at hand, unless we duplicate the information from the fo tree in the area tree. I don't see why you would need all the fo tree available, each page sequence is independent for the flow and often each page can be finished before the next page. You are correct for the current use-case. I have jumped a bit past that into trying to make room for other use-cases that might require the fo to be changed and serialized (the WYSIWYG). Setting that issue aside for the moment, let me rephrase the question, because this is really the huge big issue that makes me uneasy with SAX. Don't we need random access to the fo tree for the current page sequence? * If so, then, using SAX, don't we have to duplicate that same information in the area tree to be able to handle rebuilding 4850 pages? * If not, then, in a big-picture way, how do we go about rebuilding 4850 pages? 3. Use an event-based layout mechanism so that the fo tree doesn't even have to be there to get layout work done. Depends exactly what you mean but I think that is the general idea, care to implement it. OK, I see where I was not clear. In my mind, if there is no fo tree to tie the pieces of the area tree to, you basically have to build one. This is why I brought up Word and FrameMaker -- their object models keep the organization of the document (analogous to our fo tree) intact. My theory is that we eventually hurt ourselves by trying to avoid this. The difference is that they have to serialize that organization, and we don't, at least for our current use case. 
However, perhaps because I am still confused about our general strategy for dealing with the ripple-effect of downstream changes, their model seems to be a good one. I am envisioning an area tree that contains no text at all, but whose objects merely point to nodes, offsets, and sizes in the fo tree. So, for example, Line object l contains an array of LineSegment objects, one of whose contents comes from an FOText node, starting at offset 16, size 22. Not only is my text there, but so is most of my font and formatting information. What I have is two different views of the same data -- one that is more structural and the other the specific layout. I have no problem (in our current use-case) with throwing away page-sequences from the fo tree and area tree to free up memory as we go. Does or could the new design give us the ability to (with say, a configuration option) choose between Layout Philosophy A and B? By this I mean 2 (or more) layout packages coexisting in the same code base, and sharing common resources that can be selected
RE: Alt-Design status: XML handling
Oleg Tkachenko wrote: I think we should separate the fo tree itself from the process of its building. fo tree structure is required and I agree with Keiron - it's not a DOM, it's just a tree representation, and I cherish the idea to make it an effectively small structure like saxon's internal tree. But any interim buffers should be avoided as much as possible (well, Peter's buffer seems not to be a burden). This is probably a philosophical difference. It seems to me that the area tree is built on the foundation of the fo tree, and that if we only get a brief glimpse of the fo tree as it goes by, not only does our foundation disappear, but we end up putting all of that weight into the superstructure, which tends to make the whole thing collapse. To conclude, if I were designing this system from scratch, based on what I know right now, I would: 1. Use DOM for both the fo tree and the area tree. Bad idea, I believe. DOM is a heavyweight, versatile representation of an xml document (recall entities, pi's etc. nodes), while we need an effective and lightweight structure to hold fo/area tree information. DOM has a lot of synchronization stuff, while our trees are almost read-only actually. Ahh stop, probably you didn't mean w3c DOM? You and Keiron are right -- this is a classic example of using an implementation where an interface would be much better. When I say DOM, what I should say is some randomly-accessible view of the entire tree. Certainly, if there is a lighter-weight alternative than DOM that works for the task at hand, that is better. Victor Mote
RE: Alt-Design status: XML handling
Peter, thanks for the update and explanation on your Alt-Design. To be honest: I like it. Reminds me very much of my first exposure to programming language processing (Compilers) nearly 30 years ago: top-down recursive-descent parsing for Pascal. I still think it's the best parsing model around (beats YACC-type stuff by a long way) in terms of ease of development / understanding / use. Do you have any similar simple / effective ideas for the layout part which, following the discussions on this list, the new FOP design under CVS HEAD seems to struggle most with? Manuel
Re: Alt-Design status: XML handling
Great work Peter! It makes a lot of sense to use higher-level than SAX events, and thanks for explaining this so clearly. If you allow me a suggestion regarding the structure of the code: maybe using some table-driven stuff instead of the many if statements in FoSimplePageMaster would be more readable? Something like:

    class EventHandler {
        EventHandler(String regionName, boolean discardSpace, boolean required) ...
    }

    /** table of event handlers that must be applied, in order */
    EventHandler[] handlers = {
        new EventHandler(FObjectNames.REGION_BODY, true, true),
        new EventHandler(FObjectNames.REGION_BEFORE, true, false)
    };

...then, in FoSimplePageMaster(...) loop over handlers and let them process the events. I don't know if this applies in general but it might be clearer to read and less risky to modify. -Bertrand
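Bertrand's table-driven sketch can be fleshed out into a runnable form. Since Peter objects (earlier in the thread) that "EventHandler" misnames the pull model, the entry is called `ChildSpec` here; that name, the string region names, and the `consume` method are all invented for this illustration. The table drives a single loop over the simple-page-master content model (region-body, region-before?, region-after?, region-start?, region-end?) instead of a chain of if statements.

```java
import java.util.List;

public class TableDrivenDemo {
    // One table entry per expected child, in content-model order.
    static final class ChildSpec {
        final String name;
        final boolean discardSpace; // carried along as in Bertrand's sketch
        final boolean required;
        ChildSpec(String name, boolean discardSpace, boolean required) {
            this.name = name;
            this.discardSpace = discardSpace;
            this.required = required;
        }
    }

    /** Children of simple-page-master that must be consumed, in order. */
    static final List<ChildSpec> SPECS = List.of(
        new ChildSpec("region-body", true, true),
        new ChildSpec("region-before", true, false),
        new ChildSpec("region-after", true, false),
        new ChildSpec("region-start", true, false),
        new ChildSpec("region-end", true, false));

    // Walk the table, consuming matching children; a missing required child
    // or an out-of-order/unknown child fails the content model.
    static boolean consume(List<String> children) {
        int i = 0;
        for (ChildSpec spec : SPECS) {
            if (i < children.size() && children.get(i).equals(spec.name)) {
                i++;                        // child matches this table entry
            } else if (spec.required) {
                return false;               // required child missing
            }
        }
        return i == children.size();        // no unexpected leftovers
    }

    public static void main(String[] args) {
        System.out.println(consume(List.of("region-body", "region-start"))); // true
        System.out.println(consume(List.of("region-before")));               // false
    }
}
```

Adding a region then means adding one table row rather than another if branch, which is the "less risky to modify" property Bertrand is after.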