[ https://issues.apache.org/jira/browse/TIKA-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150570#comment-13150570 ]
Bastian Mathes commented on TIKA-778:
-------------------------------------

Calling the extraction directly on the command line actually works (with or without --html), so the issue is probably not as important as I thought. It is only opening the file from within the Tika application that causes this exception (in 1.0, not in 0.10). I am sending you a PDF via mail.

> NullPointerException in tika-app, parsing PDF content
> -----------------------------------------------------
>
>                 Key: TIKA-778
>                 URL: https://issues.apache.org/jira/browse/TIKA-778
>             Project: Tika
>          Issue Type: Bug
>          Components: gui, parser
>    Affects Versions: 1.0
>            Reporter: Bastian Mathes
>
> I try to extract text from some pdf files with the tika app. In version 0.10 the error
> ERROR - Error: Could not parse predefined CMAP file for '--UCS2'
> is printed on the command line, but text extraction works and is correct.
> In version 1.0 I get the same error message on the command line, but also receive an exception and no text is extracted:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@62bc36ff
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
>     at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:279)
>     at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
>     at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1995)
>     at javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2318)
>     at javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:387)
>     at javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:242)
>     at javax.swing.AbstractButton.doClick(AbstractButton.java:357)
>     at javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:809)
>     at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:850)
>     at java.awt.Component.processMouseEvent(Component.java:6288)
>     at javax.swing.JComponent.processMouseEvent(JComponent.java:3267)
>     at java.awt.Component.processEvent(Component.java:6053)
>     at java.awt.Container.processEvent(Container.java:2041)
>     at java.awt.Component.dispatchEventImpl(Component.java:4651)
>     at java.awt.Container.dispatchEventImpl(Container.java:2099)
>     at java.awt.Component.dispatchEvent(Component.java:4481)
>     at java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4577)
>     at java.awt.LightweightDispatcher.processMouseEvent(Container.java:4238)
>     at java.awt.LightweightDispatcher.dispatchEvent(Container.java:4168)
>     at java.awt.Container.dispatchEventImpl(Container.java:2085)
>     at java.awt.Window.dispatchEventImpl(Window.java:2478)
>     at java.awt.Component.dispatchEvent(Component.java:4481)
>     at java.awt.EventQueue.dispatchEventImpl(EventQueue.java:643)
>     at java.awt.EventQueue.access$000(EventQueue.java:84)
>     at java.awt.EventQueue$1.run(EventQueue.java:602)
>     at java.awt.EventQueue$1.run(EventQueue.java:600)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>     at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:98)
>     at java.awt.EventQueue$2.run(EventQueue.java:616)
>     at java.awt.EventQueue$2.run(EventQueue.java:614)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.security.AccessControlContext$1.doIntersectionPrivilege(AccessControlContext.java:87)
>     at java.awt.EventQueue.dispatchEvent(EventQueue.java:613)
>     at java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:269)
>     at java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:184)
>     at java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:174)
>     at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:169)
>     at java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:161)
>     at java.awt.EventDispatchThread.run(EventDispatchThread.java:122)
> Caused by: java.lang.NullPointerException
>     at com.sun.org.apache.xml.internal.serializer.ToHTMLStream.endElement(ToHTMLStream.java:907)
>     at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerHandlerImpl.endElement(TransformerHandlerImpl.java:273)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.gui.TikaGUI$2.endElement(TikaGUI.java:519)
>     at org.apache.tika.sax.TeeContentHandler.endElement(TeeContentHandler.java:94)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.SecureContentHandler.endElement(SecureContentHandler.java:256)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:273)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:216)
>     at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:112)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:323)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:96)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     ... 43 more
> I tried the same pdf files (and can switch forth and back between version 0.10 and 1.0, this behavior is stable) and it looks like the exact same pdfbox version is inside the tika-app-0.10.jar and tika-app-1.0.jar. It would be great if version 1.0 could do what 0.10 can. Sorry that I cannot provide the pdf.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira