[ https://issues.apache.org/jira/browse/TIKA-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083128#comment-15083128 ]

Tim Allison commented on TIKA-1822:
-----------------------------------

When we can't get the ID for a linked object via POI's {{CharacterRun mscr = field.getMarkSeparatorCharacterRun(r);}}, should we add an annotation for an unknown id (e.g. {{<div class="embedded" id="_UNKNOWN_ID" />}}) or should we skip adding an annotation?

> NullPointerException when parsing a .doc file
> ---------------------------------------------
>
>                 Key: TIKA-1822
>                 URL: https://issues.apache.org/jira/browse/TIKA-1822
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>         Environment: Linux
>            Reporter: Panagiotis Mpailis
>            Assignee: Tim Allison
>         Attachments: npe_example.doc
>
>
> We are using Tika 1.11 to extract text from msword documents, and there are a
> few errors occurring when processing some docs.
> This ticket relates to https://issues.apache.org/jira/browse/TIKA-1733
> however in this case there is an unexpected NullPointerException and not a
> clear indication of the error.
> Processing a saved copy of the document solves the error altogether. A
> difference found between the two documents was that the
> _(HWPFDocument)document.getRange()_ returned different values.
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@58a306e2
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.tika.Tika.parseToString(Tika.java:496)
>     at org.apache.tika.Tika.parseToString(Tika.java:610)
> Caused by: java.lang.NullPointerException
>     at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:311)
>     at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:169)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>     ... 10 more
> {noformat}
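
To make the two options in the comment concrete, here is a minimal Java sketch of each. It does not use POI or Tika classes: the class name {{EmbeddedAnnotationSketch}} and both method names are hypothetical, and the {{objectId}} argument stands in for whatever ID would be read from the mark separator character run, with {{null}} meaning the lookup failed. Building the tag by string concatenation is for illustration only and is not how WordExtractor emits XHTML.

```java
import java.util.Optional;

// Hypothetical illustration only; not Tika's WordExtractor code.
public class EmbeddedAnnotationSketch {

    // Placeholder id taken from the example in the comment.
    public static final String UNKNOWN_ID = "_UNKNOWN_ID";

    // Option 1: always emit an annotation, using a placeholder when the id is missing.
    public static String annotateWithPlaceholder(String objectId) {
        String id = (objectId == null) ? UNKNOWN_ID : objectId;
        return "<div class=\"embedded\" id=\"" + id + "\" />";
    }

    // Option 2: emit nothing when the id is missing.
    public static Optional<String> annotateOrSkip(String objectId) {
        if (objectId == null) {
            return Optional.empty();
        }
        return Optional.of("<div class=\"embedded\" id=\"" + objectId + "\" />");
    }

    public static void main(String[] args) {
        // A null id models the case where no character run (and so no id) is available.
        System.out.println(annotateWithPlaceholder(null));
        System.out.println(annotateOrSkip(null).orElse("(no annotation)"));
        System.out.println(annotateWithPlaceholder("_1234"));
    }
}
```

Either way the null check happens before the ID is dereferenced, which is what avoids the NullPointerException. The placeholder keeps a marker in the output showing that an embedded object was present, while skipping keeps the output free of IDs that cannot be resolved.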