[ 
https://issues.apache.org/jira/browse/TIKA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194148#comment-15194148
 ] 

Harsh Fatepuria commented on TIKA-1902:
---------------------------------------

I am using BodyContentHandler to get only the <body> part of the 
extracted XHTML.

If I change the handler to:  ContentHandler handler = new ToXMLContentHandler();
I get the full XHTML, including the metadata.
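
One possible workaround, offered only as a sketch and not confirmed anywhere in this issue, is to take the full XHTML string that ToXMLContentHandler produces and cut out the <body> element afterwards. The example below uses only the JDK XML APIs, so it does not depend on Tika. The names BodyExtractor and extractBody are hypothetical, and it assumes the input is well-formed XHTML that declares the http://www.w3.org/1999/xhtml namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Tika: post-processes a full XHTML string.
final class BodyExtractor {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Returns the serialized <body> element of a well-formed XHTML document.
    static String extractBody(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed to look up elements by namespace
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        Node body = doc.getElementsByTagNameNS(XHTML_NS, "body").item(0);
        if (body == null) {
            throw new IllegalArgumentException("no XHTML <body> element found");
        }

        // An identity transform serializes just the <body> subtree.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(body), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the string returned by handler.toString() (assumed shape).
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><meta name=\"author\" content=\"x\"/><title>t</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractBody(xhtml));
    }
}
```

In this sketch the metadata in <head> is dropped after parsing, so the XML handler still sees the namespace declaration on the root element. Whether this avoids the error for the affected files has not been verified here.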

> Error while parsing some files using ContentHandler object (initialized using 
> the BodyContentHandler object) 
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1902
>                 URL: https://issues.apache.org/jira/browse/TIKA-1902
>             Project: Tika
>          Issue Type: Bug
>          Components: handler, parser
>    Affects Versions: 1.12
>         Environment: Java
>            Reporter: Harsh Fatepuria
>              Labels: handler, java, parser, tika
>
> Java Code:
> public static String parseBodyToHTML(String filePath)
>         throws IOException, SAXException, TikaException {
>     ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
>
>     AutoDetectParser parser = new AutoDetectParser();
>     Metadata metadata = new Metadata();
>     try (FileInputStream stream = new FileInputStream(new File(filePath))) {
>         parser.parse(stream, handler, metadata);
>         return handler.toString();
>     }
> }
> While using this function for some files, I get the following error:
> Exception in thread "main" org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
>       at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>       at org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>       at org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:250)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>       at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>       at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:291)
>       at org.apache.tika.parser.pdf.PDF2XHTML.startPage(PDF2XHTML.java:225)
>       at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:437)
>       at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
>       at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
>       at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:148)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:148)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:136)
>       at TTR.TTRAnalysis.parseBodyToHTML(TTRAnalysis.java:39)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
