NullPointerException in tika-app, parsing PDF content
-----------------------------------------------------

                 Key: TIKA-778
                 URL: https://issues.apache.org/jira/browse/TIKA-778
             Project: Tika
          Issue Type: Bug
          Components: gui, parser
    Affects Versions: 1.0
            Reporter: Bastian Mathes


I am trying to extract text from some PDF files with the Tika app. In version 
0.10 the error 
ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
is printed on the command line, but text extraction works and is correct.

In version 1.0 I get the same error message on the command line, but I also 
receive an exception and no text is extracted:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@62bc36ff
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
        at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
        at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
        at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
        at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
        at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
        at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
        at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
        at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
        at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
        at java.awt.Component.processMouseEvent(Component.java:6288)
        at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
        at java.awt.Component.processEvent(Component.java:6053)
        at java.awt.Container.processEvent(Container.java:2041)
        at java.awt.Component.dispatchEventImpl(Component.java:4651)
        at java.awt.Container.dispatchEventImpl(Container.java:2099)
        at java.awt.Component.dispatchEvent(Component.java:4481)
        at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
        at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
        at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
        at java.awt.Container.dispatchEventImpl(Container.java:2085)
        at java.awt.Window.dispatchEventImpl(Window.java:2478)
        at java.awt.Component.dispatchEvent(Component.java:4481)
        at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
        at java.awt.EventQueue.access$000(EventQueue.java:84)
        at java.awt.EventQueue$1.run(EventQueue.java:602)
        at java.awt.EventQueue$1.run(EventQueue.java:600)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
        at java.awt.EventQueue$2.run(EventQueue.java:616)
        at java.awt.EventQueue$2.run(EventQueue.java:614)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
        at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
        at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
        at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
        at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
        at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
        at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
Caused by: java.lang.NullPointerException
        at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
        at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
        at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
        at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
        at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:216)
        at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:112)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 43 more
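
For anyone trying to isolate this: the "Caused by" frames end in the JDK's 
HTML serializer (ToHTMLStream), reached through a TransformerHandler from the 
GUI's content handler, rather than in PDFBox. The sketch below is only an 
illustration of that serializer path using JDK classes. The class name 
HtmlSerializerSketch, the element names and the sample text are made up for 
the example; it is not Tika's actual GUI code, and it does not reproduce the 
NullPointerException. It shows a balanced sequence of SAX events being 
serialized as HTML, which may serve as a baseline when comparing against the 
events that Tika 1.0 emits for the failing files.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: feeds SAX events into the JDK's
// TransformerHandler configured for HTML output.
class HtmlSerializerSketch {

    static String serialize(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        // Select the HTML output method, as the ToHTMLStream frame suggests.
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        // Every startElement below has exactly one matching endElement.
        handler.startElement("", "html", "html", noAttributes);
        handler.startElement("", "body", "body", noAttributes);
        handler.startElement("", "p", "p", noAttributes);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "p", "p");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize("hello"));
    }
}
```

Running the class prints the serialized HTML for the sample paragraph. 
Logging the startElement/endElement calls that reach the GUI handler for one 
of the failing PDFs and comparing them with a balanced sequence like this one 
might help narrow down where the two versions differ.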

I tried the same PDF files with both versions (I can switch back and forth 
between 0.10 and 1.0, and the behavior is stable), and it looks like the exact 
same PDFBox version is bundled in tika-app-0.10.jar and tika-app-1.0.jar. It 
would be great if version 1.0 could do what 0.10 can. Sorry that I cannot 
provide the PDF.
