Hi,

> On Thu, Dec 4, 2008 at 10:59 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> > Just one question: Is there interest to do the same tag mapping approach for
> > OpenXML (MS Office 2007) files? In my opinion, this is much resource
> > friendlier (because it is only extracting text from an XML file) than the
> > POI approach of having DOM trees and megabytes of DOM-Tree mappings of the
> > OpenXML schema with additional external dependencies.
>
> I agree that directly mapping things from the underlying XML is
> probably the most straightforward and easy solution for simple text
> extraction.
>
> However, a proper parser library becomes very handy as soon as you
> start implementing more complex things like extracting content from
> possible attachments or handling encryption. Using an external parser
> library also insulates us from a lot of complex details like users
> complaining why isn't some content in their documents being extracted.
> If we implement parsing inside Tika we also need to take on the burden
> of maintaining and supporting that implementation.
>
> In general I'd only implement a parser fully in Tika if the required
> amount of code is small (up to a few hundred lines max) and that code
> covers all the features we need. The current MP3 parser is a good
> example where both requirements are currently satisfied, though if we
> want to start supporting some of the more complex MP3 tagging formats
> I'd definitely go for an external parser library.
I thought about this when writing the OpenDocumentParser for OpenOffice. As the mapping was very simple for this type of document (just a tag mapping approach), the code is very short, as you noted. If the same holds for OpenXML, I would give it a try (but I suspect M$ made it more complicated than OpenOffice :-). The cool thing about OpenOffice is that all document types (spreadsheets, text and presentations) have exactly the same syntax. And encryption is not possible (as far as I know), and signed documents are no problem, as it's still XML.

Uwe
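
For illustration, the tag mapping approach discussed above could be sketched roughly as follows. This is only a minimal sketch and not the actual OpenDocumentParser code: the class name TagMappingSketch, the method toXhtml, the contents of the mapping table and the example namespace are all assumptions made up for this example. It streams the XML through a SAX handler, rewrites the element names found in the table, and keeps only the text inside mapped elements, so no DOM tree is ever built.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch of a SAX-based tag mapping; not Tika's real implementation.
class TagMappingSketch {

    // Assumed mapping from source element local names to output tags.
    static final Map<String, String> MAPPING = new HashMap<String, String>();
    static {
        MAPPING.put("h", "h1");
        MAPPING.put("p", "p");
    }

    static String toXhtml(String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName is populated

        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0; // number of currently open mapped elements

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                String target = MAPPING.get(localName);
                if (target != null) {
                    out.append('<').append(target).append('>');
                    depth++;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                String target = MAPPING.get(localName);
                if (target != null) {
                    out.append("</").append(target).append('>');
                    depth--;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (depth == 0) {
                    return; // text outside mapped elements is dropped
                }
                for (int i = start; i < start + length; i++) {
                    char c = ch[i];
                    // Minimal escaping so the output stays well-formed.
                    if (c == '&') {
                        out.append("&amp;");
                    } else if (c == '<') {
                        out.append("&lt;");
                    } else if (c == '>') {
                        out.append("&gt;");
                    } else {
                        out.append(c);
                    }
                }
            }
        };

        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }
}
```

Unmapped elements nested inside a mapped one (a span inside a paragraph, say) contribute their text but no tag of their own, which is what keeps the mapping table short.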
