[ 
https://issues.apache.org/jira/browse/TIKA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070391#comment-16070391
 ] 

Tim Allison commented on TIKA-2405:
-----------------------------------

I regret that I haven't had a chance to do a formal evaluation between the 
legacy DOM parser and the new SAX parser.  IIRC, I didn't get around to some 
formatting work in the new SAX parser (putting footnotes in the right 
locations?), but it will be more robust on docs like the one you shared with 
us, and more robust at extracting text: it makes _no assumptions_ about where 
text should be (e.g. TIKA-1130) and extracts everything in document.xml.  It 
will also likely use far less memory, though that is really only a problem in 
practice with huge docs.

In short, yes, I'd move everything over to the new SAX parser.  I also added a 
SAX parser for pptx for the same reasons, with the same caveats.

If you have the time, you could run tika-app.jar against your docx with and 
without SAX, and then run tika-eval's Compare to see whether there are any 
degradations in extracted content, increases in exceptions, etc.  See the 
[wiki|https://wiki.apache.org/tika/TikaEval], 
[slides|http://events.linuxfoundation.org/sites/events/files/slides/ApacheConMiami2017_tallison_v2.pdf]
 and/or [youtube|https://www.youtube.com/watch?v=vRPTPMwI53k].
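
To give a rough idea of the kind of comparison involved, here is a minimal 
sketch in plain Java.  It is not tika-eval's Compare and it calls no Tika API; 
the class name ExtractCompare and its methods are hypothetical, and the two 
sample strings merely stand in for text extracted from the same docx with the 
DOM and the SAX parser.  It counts tokens in each extract, lists tokens that 
appear in one extract but not the other, and reports a simple overlap score.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: a rough stand-in for an extract-vs-extract comparison.
class ExtractCompare {

    // Lower-cases the text and counts each run of letters/digits as a token.
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tok : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!tok.isEmpty()) {
                counts.merge(tok, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Tokens that occur more often in 'a' than in 'b', with the surplus count.
    static Map<String, Integer> missingFrom(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int diff = e.getValue() - b.getOrDefault(e.getKey(), 0);
            if (diff > 0) {
                out.put(e.getKey(), diff);
            }
        }
        return out;
    }

    // Dice-style overlap of the two token multisets, from 0.0 to 1.0.
    static double overlap(Map<String, Integer> a, Map<String, Integer> b) {
        int shared = 0;
        int total = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            shared += Math.min(e.getValue(), b.getOrDefault(e.getKey(), 0));
            total += e.getValue();
        }
        for (int n : b.values()) {
            total += n;
        }
        return total == 0 ? 1.0 : 2.0 * shared / total;
    }

    public static void main(String[] args) {
        // Placeholder strings; in practice these would be the two extracts.
        Map<String, Integer> dom = tokenCounts("Hello world");
        Map<String, Integer> sax = tokenCounts("Hello, world! Footnote 1");

        System.out.println("Only in DOM extract: " + missingFrom(dom, sax));
        System.out.println("Only in SAX extract: " + missingFrom(sax, dom));
        System.out.println("Overlap: " + overlap(dom, sax));
    }
}
```

A non-empty "Only in DOM extract" list would be the first thing to look at, 
since it points at content the SAX run may have dropped.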

I'd be more than happy to walk you through that process.  You have the rare 
opportunity to be the second person in the world to run it. :)

> SAXParseException in text extraction from DOCX file
> ---------------------------------------------------
>
>                 Key: TIKA-2405
>                 URL: https://issues.apache.org/jira/browse/TIKA-2405
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Jorge Spinsanti
>              Labels: sax_docx_fixes
>         Attachments: SAXParseException.docx
>
>
> I got SAXParseException in text extraction from DOCX file (see attachment):
> {code}
> Caused by: org.apache.tika.exception.TikaException: TIKA-198: Illegal 
> IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@1de6e9d6
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 16 more
> Caused by: java.io.IOException: Unable to parse xml bean
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:146)
>       at 
> org.openxmlformats.schemas.wordprocessingml.x2006.main.NumberingDocument$Factory.parse(Unknown
>  Source)
>       at 
> org.apache.poi.xwpf.usermodel.XWPFNumbering.onDocumentRead(XWPFNumbering.java:78)
>       at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.onDocumentRead(XWPFDocument.java:192)
>       at org.apache.poi.POIXMLDocument.load(POIXMLDocument.java:190)
>       at 
> org.apache.poi.xwpf.usermodel.XWPFDocument.<init>(XWPFDocument.java:112)
>       at 
> org.apache.poi.xwpf.extractor.XWPFWordExtractor.<init>(XWPFWordExtractor.java:58)
>       at 
> org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:232)
>       at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:104)
>       at 
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:106)
>       at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 23 more
> Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 37; 
> The encoding declaration is required in the text declaration.
>       at 
> org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
> Source)
>       at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>       at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>       at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown 
> Source)
>       at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown
>  Source)
>       at 
> org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown
>  Source)
>       at 
> org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown 
> Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>       at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
>       at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
>       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:121)
>       at 
> org.apache.poi.util.DocumentHelper.readDocument(DocumentHelper.java:140)
>       at org.apache.poi.POIXMLTypeLoader.parse(POIXMLTypeLoader.java:143)
>       ... 33 more
> {code}
> Text extraction using OpenOffice is successful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
