RE: bean-free ooxml streaming readers?

Allison, Timothy B. Mon, 10 Apr 2017 12:25:26 -0700

>Since it would be read-only, would it just be another option, instead of a 
>full replacement?


Yes, think of it like XSSF's eventusermodel.  We define an interface for what 
a user will have to react to, like XSSFSheetXMLHandler's SheetContentsHandler, 
and we take care of the rest.  You can see the current examples for docx [1] 
and pptx [2] in Tika.
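
To make the idea concrete, here is a rough sketch of what such a callback 
interface could look like for docx, with a plain SAX handler doing "the rest". 
All names below (ParagraphContentsHandler, DocxBodySaxHandler, EventModelDemo) 
are hypothetical and are not taken from POI or Tika; the sketch uses only the 
JDK's SAX classes and ignores nested paragraphs (e.g. text boxes).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback interface, loosely in the spirit of SheetContentsHandler.
interface ParagraphContentsHandler {
    void startParagraph();
    void text(String text);
    void endParagraph();
}

// Translates raw SAX events into the user-facing callbacks above.
class DocxBodySaxHandler extends DefaultHandler {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
    private final ParagraphContentsHandler out;
    private final StringBuilder buffer = new StringBuilder();
    private boolean inText;

    DocxBodySaxHandler(ParagraphContentsHandler out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!W_NS.equals(uri)) return;
        if ("p".equals(localName)) {
            out.startParagraph();
        } else if ("t".equals(localName)) {
            inText = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so buffer until </w:t>.
        if (inText) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!W_NS.equals(uri)) return;
        if ("t".equals(localName)) {
            inText = false;
            out.text(buffer.toString());
        } else if ("p".equals(localName)) {
            out.endParagraph();
        }
    }
}

class EventModelDemo {
    // User code: only implements the callbacks, never touches SAX directly.
    static List<String> paragraphs(String documentXml) throws Exception {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        ParagraphContentsHandler collector = new ParagraphContentsHandler() {
            public void startParagraph() { current.setLength(0); }
            public void text(String text) { current.append(text); }
            public void endParagraph() { result.add(current.toString()); }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so uri/localName are populated
        factory.newSAXParser().parse(
                new InputSource(new StringReader(documentXml)),
                new DocxBodySaxHandler(collector));
        return result;
    }
}
```

In a real reader the XML would come from the document part inside the OPC 
package rather than from a String; that plumbing is left out here.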

> Would the data model need to be more fully fleshed out to support all the 
> corners of the OOXML spec not currently represented?

Not that I'm aware of...but...YMMV.  In some cases, reading for specific 
elements like "w:t" is actually more robust than traversing the DOM and 
requiring known structural relationships.  Bug 54849 requires us to know to 
look for SDT at the block level of the document [3].  We wouldn't have hit 
that if all we cared about were "w:t" or even "sdt" wherever they occurred.  
The same is true, at a different structural level, of the Glossary document.  
There were a handful of other examples that I stumbled upon while working on 
the SAX parsers in Tika.
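
A simplified illustration of the robustness point, using only the JDK's XML 
APIs: a structural walk that only visits "w:p" children of "w:body" misses 
text wrapped in a block-level "w:sdt", while a SAX read that reacts to every 
"w:t" picks it up.  The class and method names are made up for this sketch, 
and the structural walk is a stand-in, not POI's actual traversal code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextAnywhereDemo {
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Structural walk: only looks at w:p elements that are direct children of w:body.
    static String structuralText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Element root = dbf.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        Element body = (Element) root.getElementsByTagNameNS(W_NS, "body").item(0);
        StringBuilder sb = new StringBuilder();
        for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
            boolean isParagraph = n instanceof Element
                    && W_NS.equals(n.getNamespaceURI())
                    && "p".equals(n.getLocalName());
            if (!isParagraph) continue; // a block-level w:sdt is skipped here
            NodeList texts = ((Element) n).getElementsByTagNameNS(W_NS, "t");
            for (int i = 0; i < texts.getLength(); i++) {
                sb.append(texts.item(i).getTextContent());
            }
        }
        return sb.toString();
    }

    // Event-based read: reacts to every w:t, wherever it occurs in the tree.
    static String streamedText(String xml) throws Exception {
        StringBuilder sb = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private boolean inText;

                    @Override
                    public void startElement(String uri, String local, String qName,
                                             Attributes atts) {
                        if (W_NS.equals(uri) && "t".equals(local)) inText = true;
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inText) sb.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if (W_NS.equals(uri) && "t".equals(local)) inText = false;
                    }
                });
        return sb.toString();
    }
}
```

For a body containing one plain paragraph followed by a paragraph wrapped in 
"w:sdt"/"w:sdtContent", structuralText returns only the first paragraph's 
text, while streamedText returns both.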

> Is there anything at all that could help with the write side without the 
> overhead of XMLBeans?

Not that I can think of...that'll be quite some work. 


[1] 
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java

[2] 
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java
 

[3] https://bz.apache.org/bugzilla/show_bug.cgi?id=54849
