[ 
https://issues.apache.org/jira/browse/TIKA-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083128#comment-15083128
 ] 

Tim Allison commented on TIKA-1822:
-----------------------------------

When we can't get the ID for a linked object via POI's
{{CharacterRun mscr = field.getMarkSeparatorCharacterRun(r);}}, should we
still add an annotation with a placeholder id
(e.g. {{<div class="embedded" id="_UNKNOWN_ID" />}}), or should we skip the
annotation entirely?
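
To make the two options concrete, here is a minimal sketch, not the actual WordExtractor code. It assumes the id has already been pulled out of the character run as a String that is null when the lookup failed. The class name, the {{annotationFor}} method, the {{emitPlaceholder}} flag and the sample id "_1234" are all hypothetical names used only for illustration.

```java
import java.util.Optional;

// Sketch only: hypothetical helper, independent of POI and of Tika's real code.
public class EmbeddedAnnotationSketch {

    // Placeholder id taken from the example in the comment above.
    static final String UNKNOWN_ID = "_UNKNOWN_ID";

    /**
     * Decide what to emit for a linked object.
     *
     * @param linkedObjectId  id recovered from the document, or null if it
     *                        could not be determined (assumption: the caller
     *                        has already null-checked the character run)
     * @param emitPlaceholder true  = option 1, annotate with a placeholder id
     *                        false = option 2, skip the annotation
     */
    static Optional<String> annotationFor(String linkedObjectId, boolean emitPlaceholder) {
        if (linkedObjectId != null && !linkedObjectId.isEmpty()) {
            return Optional.of(div(linkedObjectId));
        }
        return emitPlaceholder ? Optional.of(div(UNKNOWN_ID)) : Optional.empty();
    }

    // Builds the annotation markup; the id is not escaped in this sketch.
    static String div(String id) {
        return "<div class=\"embedded\" id=\"" + id + "\" />";
    }

    public static void main(String[] args) {
        // Known id: annotated in both variants.
        System.out.println(annotationFor("_1234", true).orElse("(skipped)"));
        // Unknown id, option 1: placeholder annotation.
        System.out.println(annotationFor(null, true).orElse("(skipped)"));
        // Unknown id, option 2: no annotation at all.
        System.out.println(annotationFor(null, false).orElse("(skipped)"));
    }
}
```

Either way the null case is handled explicitly, so neither variant can throw a NullPointerException. The placeholder variant keeps a marker that an embedded object was present, while the skip variant avoids emitting an id that cannot be matched to anything.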


> NullPointerException when parsing a .doc file
> ---------------------------------------------
>
>                 Key: TIKA-1822
>                 URL: https://issues.apache.org/jira/browse/TIKA-1822
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.8
>         Environment: Linux
>            Reporter: Panagiotis Mpailis
>            Assignee: Tim Allison
>         Attachments: npe_example.doc
>
>
> We are using Tika 1.11 to extract text from MS Word documents, and a few
> errors occur when processing some of them.
> This ticket relates to https://issues.apache.org/jira/browse/TIKA-1733.
> In this case, however, there is an unexpected NullPointerException rather
> than a clear indication of the error.
> Processing a saved copy of the document solves the error altogether. One
> difference found between the two documents was that
> _(HWPFDocument)document.getRange()_ returned different values.
> {noformat}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@58a306e2
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.tika.Tika.parseToString(Tika.java:496)
>       at org.apache.tika.Tika.parseToString(Tika.java:610)
> Caused by: java.lang.NullPointerException
>       at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:311)
>       at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:169)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>       at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>       ... 10 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
