RE: XML formats vs. parser libraries (Was: [jira] Resolved: (TIKA-172) New Open Document Parser that emmits structured XHTML content.)

Uwe Schindler Thu, 04 Dec 2008 09:42:51 -0800

Hi,

> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> > Just one question: Is there interest to do the same tag mapping approach
> > for OpenXML (MS Office 2007) files? In my opinion, this is much resource
> > friendlier (because it is only extracting text from an XML file) than the
> > POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> > OpenXML schema with additional external dependencies.
> 
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
> 
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details like users
> complaining why isn't some content in their documents being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
> 
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.


I thought about this when writing the OpenDocumentParser for OpenOffice. As
the mapping was very simple for this type of document (just a tag mapping
approach), the code is very short, as you noted. If the same holds for
OpenXML, I would give it a try (but I suspect M$ made it more complicated
than OpenOffice :-). The cool thing with OpenOffice is that all document
types (spreadsheets, text and presentations) have exactly the same syntax.
Encryption is not possible (as far as I know), and signed documents are no
problem as it's still XML.
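
To make "tag mapping" concrete, here is a minimal sketch of the idea using
only the JDK's SAX API. It is not the actual OpenDocumentParser code: the
class name TagMappingSketch, the method toXhtml and the small mapping table
are hypothetical and only illustrate how element names can be translated
while the text is streamed through without building a DOM tree.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TagMappingSketch {

    // Hypothetical mapping from OpenDocument element names to XHTML names.
    // A real parser would cover more elements and handle heading levels.
    static final Map<String, String> MAPPING = new HashMap<>();
    static {
        MAPPING.put("text:p", "p");
        MAPPING.put("text:h", "h1");
        MAPPING.put("table:table", "table");
        MAPPING.put("table:table-row", "tr");
        MAPPING.put("table:table-cell", "td");
    }

    // Streams the XML once; mapped elements are renamed, unmapped elements
    // are dropped, and all character data is kept.
    static String toXhtml(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        // The factory is not namespace-aware by default, so qName is the
        // prefixed name as written in the file, e.g. "text:p".
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    String mapped = MAPPING.get(qName);
                    if (mapped != null) {
                        out.append('<').append(mapped).append('>');
                    }
                }

                @Override
                public void endElement(String uri, String localName,
                                       String qName) {
                    String mapped = MAPPING.get(qName);
                    if (mapped != null) {
                        out.append("</").append(mapped).append('>');
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // Re-escape the characters that are special in XHTML text.
                    for (int i = start; i < start + length; i++) {
                        char c = ch[i];
                        if (c == '&') {
                            out.append("&amp;");
                        } else if (c == '<') {
                            out.append("&lt;");
                        } else {
                            out.append(c);
                        }
                    }
                }
            });
        return out.toString();
    }
}
```

Because the handler only looks at element names, the same code would serve
text documents, spreadsheets and presentations alike, as long as the mapping
table lists the elements of interest.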

Uwe
