Re: Tika is waiting for ODFToolkit to improve ODF file format processing

Michael McCandless Wed, 26 Oct 2011 09:30:56 -0700

On Tue, Oct 25, 2011 at 5:40 PM, Rob Weir <robw...@apache.org> wrote:


> Is there a list of the complete set of tags you use, or a schema or something?

Hmm, I think technically any tags that are valid XHTML are fair game,
but in practice the parsers seem to use a very limited set of tags
(table/td/tr, a, img, p, br, div, b, i, u, hN, ul/li, span).  I'm sure
there are more... and I'm not familiar with most of Tika's parsers!

>> For TIKA-736 in particular, it'd be nice to "reconstruct" each slide
>> so that any text from the master slide/layout is inlined into each
>> slide that uses it, so that the resulting text looks the way it looks
>> when you view the document in OpenOffice. This is the approach we're
>> working towards in TIKA-712 for PPT/X files.
>
> Text box position is ultimately encoded as x,y coordinates on the
> slide. So the visual appearance on the slide and the order of the
> text boxes in the document's XML are generally unrelated. But it
> should be possible to sort the coordinates to get a top-to-bottom,
> left-to-right reading order. Maybe even with some sensitivity to
> BiDi.
>
> I've certainly seen that use case mentioned by others.

OK that makes sense.

Besides headers/footers shared across pages, and embedded docs,
are there other cases where ODF pulls in cross-referenced text?

On the position sorting, PDFBox works in a similar way, since PDF
also places text (well, glyphs!) at positions and then we have to
sometimes "reconstruct" how those glyphs might translate back into
words/lines.
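
To make the coordinate sort concrete, here is a minimal sketch in Java.
The TextBox type and readingOrder method are hypothetical names for
illustration, not Tika or ODFToolkit APIs. It assumes each box carries
the x,y of its top-left corner with y growing downward, and it compares
y exactly, so boxes that sit on the same visual row but differ slightly
in y are not grouped together. It also ignores BiDi.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: neither Tika nor ODFToolkit defines these names.
final class ReadingOrderSketch {

    /** A text box with the x,y of its top-left corner (y grows downward). */
    static final class TextBox {
        final double x;
        final double y;
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the boxes' text sorted top-to-bottom, then left-to-right. */
    static List<String> readingOrder(List<TextBox> boxes) {
        // Copy first so the caller's list keeps its document order.
        List<TextBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator.comparingDouble((TextBox b) -> b.y)
                .thenComparingDouble(b -> b.x));

        List<String> out = new ArrayList<>();
        for (TextBox b : sorted) {
            out.add(b.text);
        }
        return out;
    }
}
```

A real implementation would probably want a tolerance on y so that
boxes on roughly the same row sort left-to-right, which is similar to
the line reconstruction problem with PDF glyphs.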

>> I imagine to do this you'd need DOM-like access to the master slide /
>> layout / style, and could then use a SAX-like single pass for the
>> "normal" slides.
>>
>
> Well, you could stream one slide at a time, but we'd need to be able
> to store the complete text contents of each individual slide to do the
> coordinate sort. But that is not so bad. Presentations tend to be
> outrageously large based on large images (high color depth, high dpi)
> rather than large amounts of text.

That sounds great, as long as we have random-access to the set of
master slides so we can "slip-stream" in any headers/footers/etc.
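
Roughly what I have in mind, as a hedged Java sketch: keep the master
slides' text in memory keyed by name, then inline it as each normal
slide streams past. MasterInlineSketch, addMaster and inline are
made-up names, not anything in Tika or ODFToolkit, and putting the
master text ahead of the slide text is a simplification; in practice
the combined boxes would go through the coordinate sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: these names are not Tika or ODFToolkit APIs.
final class MasterInlineSketch {

    // Master-slide text held fully in memory, keyed by master name,
    // so any slide can look up its master at random.
    private final Map<String, List<String>> masterText = new HashMap<>();

    /** Registers the text of one master slide (the DOM-like part). */
    void addMaster(String name, List<String> text) {
        masterText.put(name, new ArrayList<>(text));
    }

    /**
     * Called once per slide during a single streaming pass: returns the
     * master's text followed by the slide's own text. A slide that names
     * an unknown master gets only its own text.
     */
    List<String> inline(String masterName, List<String> slideText) {
        List<String> out = new ArrayList<>(
                masterText.getOrDefault(masterName, Collections.emptyList()));
        out.addAll(slideText);
        return out;
    }
}
```

Only one slide's text needs to be buffered at a time; the map of
masters is the only state that outlives a slide.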

Thanks!

Mike McCandless

http://blog.mikemccandless.com
