Vincent Hennebert wrote:
> Hi,
>
> <snip/>
>> There's another side-effect to tagged PDF: It allows for better text
>> extraction from the document. PDF even describes ways to make
>> round-trips from XML -> PDF -> XML -> PDF if certain conditions were met.
>> However, we don't do that.
>
> Speaking of that, the current code doesn’t insert empty elements (like
> <fo:block/>) into the structure tree. The corresponding StructElem
> object /is/ created, but is not linked to its parent. Actually it’s
> present in the PDF without being referred to by any other object.
> I think this is inconsistent, and actually wrong since that would cause
> a loss of information possibly needed by a round-trip transformation.
> I’m going to change that.

I mean, /at some point/ I’m going to change that... This is easier said
than done. Take the following example:

    <fo:block>
      Before the empty block.
      <fo:block/>
      After the empty block.
    </fo:block>

What basically happens currently is that two text drawing requests are
made to the PDF renderer. The renderer creates the appropriate PDF
stream and registers the pieces of text as children of the structure
element corresponding to the outer block. But nothing happens regarding
the inner empty block, since obviously there’s nothing to draw.

The structure element for the inner empty block can’t be added to the
outer block’s children at creation time: the text pieces are only
registered later, at rendering time, so the empty block would end up
before both of them and the logical order wouldn’t be followed.

From the quick look I had, this is a fundamental limitation of the
current approach. There’s no way to know at which place an empty element
must be inserted into the children list of its parent.

Probably the only way to solve this issue is to integrate the handling
of the logical structure into the whole processing chain, passing the
suitable information from the FO tree to the layout engine to the area
tree to the renderer. This is probably something that should have been
done from the beginning, but it is anything but trivial.

Vincent
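
P.S. To make the last paragraph more concrete, here is a minimal sketch
of the idea. It is not FOP code: StructElem here is a simplified
stand-in, and TextSlot, StructureTreeSketch, addElement, reserveText and
drawText are hypothetical names invented for the illustration. It
assumes that the position of every child (text or element) is fixed
while walking the FO tree, and that an id is carried through layout and
the area tree so that the renderer only has to fill in a slot reserved
earlier.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a PDF structure element (not FOP's real class). */
class StructElem {
    final String role;
    // Children in logical order: either StructElem or TextSlot.
    final List<Object> kids = new ArrayList<>();

    StructElem(String role) { this.role = role; }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(role).append('[');
        for (int i = 0; i < kids.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(kids.get(i));
        }
        return sb.append(']').toString();
    }
}

/** Placeholder for a piece of text whose content only exists after rendering. */
class TextSlot {
    String content; // null until the renderer draws it

    @Override
    public String toString() {
        return content == null ? "<pending>" : "'" + content + "'";
    }
}

class StructureTreeSketch {
    private final Map<Integer, TextSlot> slots = new HashMap<>();
    private int nextId = 0;

    /** FO tree phase: link a child element to its parent right away, even if empty. */
    StructElem addElement(StructElem parent, String role) {
        StructElem child = new StructElem(role);
        parent.kids.add(child);
        return child;
    }

    /** FO tree phase: reserve the position of a text run; the id travels downstream. */
    int reserveText(StructElem parent) {
        TextSlot slot = new TextSlot();
        parent.kids.add(slot);
        slots.put(nextId, slot);
        return nextId++;
    }

    /** Render phase: fill the slot reserved earlier; its position is already fixed. */
    void drawText(int slotId, String text) {
        slots.get(slotId).content = text;
    }

    public static void main(String[] args) {
        StructureTreeSketch tree = new StructureTreeSketch();
        StructElem outer = new StructElem("block");

        // Walk the FO tree of the example above in document order.
        int before = tree.reserveText(outer);
        tree.addElement(outer, "block"); // the empty <fo:block/>
        int after = tree.reserveText(outer);

        // The order of the drawing requests no longer affects the structure.
        tree.drawText(after, "After the empty block.");
        tree.drawText(before, "Before the empty block.");

        System.out.println(outer);
    }
}
```

With this arrangement the empty block sits between the two text pieces
in the outer block’s children, whatever the renderer does. The hard part
in practice, which the sketch glosses over, is getting the id through
the layout engine and the area tree.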