> Since it would be read-only, would it just be another option, instead of a
> full replacement?

Yes, think of it like XSSF's eventusermodel. We define an interface for what a user will have to react to, like XSSFSheetXMLHandler's SheetContentsHandler, and we take care of the rest. You can see the current examples for docx [1] and pptx [2] in Tika.

> Would the data model need to be more fully fleshed out to support all the
> corners of the OOXML spec not currently represented?

Not that I'm aware of...but...ymmv. In some cases, scanning for elements like "w:t" is actually more robust than traversing the DOM and relying on known structural relationships. Bug 54849 required us to know to look for SDT at the block level of the document [3]. We wouldn't have hit that if all we cared about were "w:t" or even "sdt" wherever they occurred. The same is true, but at a different structural level, with the Glossary document. There were a handful of other examples that I stumbled upon while working on the SAX parsers in Tika.

> Is there anything at all that could help with the write side without the
> overhead of XMLBeans?

Not that I can think of...that'll be quite some work.

[1] https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java
[2] https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
[3] https://bz.apache.org/bugzilla/show_bug.cgi?id=54849
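
To make the "w:t" point concrete, here is a minimal sketch of the event-style idea using only the JDK's SAX classes. It is not the POI or Tika code: the class name WtTextSketch, the RunTextHandler callback and the inline sample XML are all made up for illustration, and a real extractor would read word/document.xml out of the package rather than a string. The handler reports the text of every "w:t" it sees, so text inside a block-level SDT comes out without the code knowing that SDTs exist.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical names throughout; this is not the POI or Tika API.
class WtTextSketch {
    // WordprocessingML main namespace, normally bound to the "w" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Hypothetical callback, in the spirit of SheetContentsHandler. */
    interface RunTextHandler {
        void text(String chars);
    }

    /** Reports the character content of every w:t element, at any depth. */
    static class WtHandler extends DefaultHandler {
        private final RunTextHandler out;
        private final StringBuilder buf = new StringBuilder();
        private boolean inT = false;

        WtHandler(RunTextHandler out) {
            this.out = out;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inT = true;
                buf.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // SAX may deliver one text node in several chunks, so accumulate.
            if (inT) {
                buf.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inT = false;
                out.text(buf.toString());
            }
        }
    }

    /** Concatenates the text of all w:t elements in the given XML. */
    static String extractText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace + local name
        StringBuilder sb = new StringBuilder();
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), new WtHandler(sb::append));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // One ordinary paragraph, then a paragraph wrapped in a block-level SDT.
        String xml =
            "<w:document xmlns:w=\"" + W_NS + "\"><w:body>"
            + "<w:p><w:r><w:t xml:space=\"preserve\">Hello </w:t></w:r></w:p>"
            + "<w:sdt><w:sdtContent>"
            + "<w:p><w:r><w:t>world</w:t></w:r></w:p>"
            + "</w:sdtContent></w:sdt>"
            + "</w:body></w:document>";
        System.out.println(extractText(xml)); // prints "Hello world"
    }
}
```

The trade-off is the one described above: the handler never needs to learn where SDTs or glossary parts may appear, but it also knows nothing about paragraph or table structure unless the interface is extended with further callbacks.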