On Tue, Oct 25, 2011 at 5:40 PM, Rob Weir <[email protected]> wrote:
> Is there a list of the complete set of tags you use, or a schema or
> something?

Hmm, I think technically any tag that is valid XHTML is fair game,
but in practice the parsers seem to use a very limited set of tags
(table/td/tr, a, img, p, br, div, b, i, u, hN, ul/li, span).  I'm sure
there are more... and I'm not familiar with most of Tika's parsers!

>> For TIKA-736 in particular, it'd be nice to "reconstruct" each slide
>> so that any text from the master slide/layout is inlined into each
>> slide that uses it, so that the resulting text looks the way it looks
>> when you view the document in OpenOffice.  This is the approach we're
>> working towards in TIKA-712 for PPT/X files.
>
> Text box position is ultimately encoded as x,y coordinates on the
> slide.  So the visual appearance on the slide and the order of the
> text boxes in the document's XML are generally unrelated.  But it
> should be possible to sort the coordinates to get a top-to-bottom,
> left-to-right reading order.  Maybe even with some sensitivity to
> BiDi.
>
> I've certainly seen that use case mentioned by others.

OK that makes sense.  Besides header/footer shared across pages, and
embedded docs, are there other cases where ODF pulls in
cross-referenced text?

On the position sorting, PDFBox works in a similar way, since PDF also
places text (well, glyphs!) at positions and then we sometimes have to
"reconstruct" how those glyphs might translate back into words/lines.

>> I imagine to do this you'd need DOM-like access to the master slide /
>> layout / style, and could then use a SAX-like single pass for the
>> "normal" slides.
>>
>
> Well, you could stream one slide at a time, but we'd need to be able
> to store the complete text contents of each individual slide to do the
> coordinate sort.  But that is not so bad.  Presentations tend to be
> outrageously large based on large images (high color depth, high dpi)
> rather than large amounts of text.

That sounds great, as long as we have random access to the set of
master slides so we can "slip-stream" in any headers/footers/etc.

Thanks!

Mike McCandless

http://blog.mikemccandless.com
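
For illustration only, here is a rough sketch of the per-slide
coordinate sort discussed above: collect all text boxes of one slide,
group them into rows by y, then order each row by x.  It is not Tika
or ODF Toolkit code.  The TextBox class, the reading_order function,
the row_tolerance parameter, and the assumption that y grows downward
are all hypothetical, and the rtl flag is only a crude stand-in for
real BiDi handling.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical holder for one text box on a slide."""
    x: float
    y: float
    text: str


def reading_order(boxes, row_tolerance=5.0, rtl=False):
    """Return boxes sorted top-to-bottom, then left-to-right.

    Boxes whose y differs from the first box of the current row by at
    most row_tolerance are treated as one row (assumption: y grows
    downward).  With rtl=True each row is ordered right-to-left.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])

    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda b: b.x, reverse=rtl))
    return ordered


slide = [
    TextBox(300, 102, "right column"),
    TextBox(10, 100, "left column"),
    TextBox(10, 10, "title"),
    TextBox(10, 400, "footer"),
]
texts = [b.text for b in reading_order(slide)]
```

Only the text boxes of a single slide need to be held in memory at a
time, which matches the point above that the text per slide is small
compared to the images.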
