[ https://issues.apache.org/jira/browse/TIKA-1774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962489#comment-14962489 ]

Nick Burch commented on TIKA-1774:
----------------------------------

This looks like a duplicate of TIKA-1215. See [this comment for why that combination isn't supported|https://issues.apache.org/jira/browse/TIKA-1215?focusedCommentId=13869693&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13869693].

> org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-1774
>                 URL: https://issues.apache.org/jira/browse/TIKA-1774
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>         Environment: Windows 10
> Java 1.8_45
> Apache Tika 1.10
> MS Word 2013
>            Reporter: Steve K
>            Priority: Minor
>
> Create a test document with MS Word 2013 containing just a few paragraphs 
> (lines of text), a table, etc.
> Code example:
>         ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
>         File inputFile = new File("c:\\temp\\test.docx");
>         InputStream stream = TikaInputStream.get(inputFile);
>         AutoDetectParser parser = new AutoDetectParser();
>         Metadata metadata = new Metadata();
>         parser.parse(stream, handler, metadata);
>         System.out.println(handler.toString());
> This will lead to the following Exception:
> Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
>       at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>       at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>       at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>       at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:163)
>       at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:107)
>       at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:93)
>       at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>       at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>       at com.test.TikaTest.main(TikaTest.java:28)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> This exception occurs when using "ToXMLContentHandler" in combination with 
> the BodyContentHandler. Using "ToXMLContentHandler" alone works.
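
For reference, the working variant mentioned in the last sentence of the report (ToXMLContentHandler on its own, without BodyContentHandler) might look like the sketch below. The class name TikaXmlOnly, the toXhtml helper and the command-line path argument are illustrative and not from the report. The sketch reads the file through a plain java.nio stream rather than TikaInputStream, and it assumes tika-core plus the Tika parser modules are on the classpath. Because nothing filters the SAX events, the result is the whole XHTML document (html, head and body), not only the body.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;

public class TikaXmlOnly {

    // Hypothetical helper: parse the stream and return the serialized XHTML.
    // ToXMLContentHandler is used directly, so it receives every SAX event
    // emitted by the parser, including the events for the root html element.
    static String toXhtml(InputStream stream) throws Exception {
        ContentHandler handler = new ToXMLContentHandler();
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is an assumed path to the document, e.g. c:\temp\test.docx
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(toXhtml(stream));
        }
    }
}
```

If only the body is needed, it would have to be extracted from this output afterwards; the linked TIKA-1215 comment covers why wrapping the handler in BodyContentHandler is not supported.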



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
