[ 
https://issues.apache.org/jira/browse/TIKA-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185479#comment-14185479
 ] 

Tim Allison commented on TIKA-1457:
-----------------------------------

It might make sense to test the file against Tika 1.6, or even 1.7-SNAPSHOT.
You can download 1.6 from the regular download
[site|http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.6.jar].  You should
be able to get a 1.7 snapshot
[here|http://repository.apache.org/content/groups/snapshots/org/apache/tika/],
although my requests to that repository are timing out at the moment.

If the file works in 1.6, you should get the fix in the next 4.x release of
Solr (I think).  If the file only works in 1.7, open an issue on Solr asking to
upgrade to Tika 1.7 once it becomes available.
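
One way to run that standalone check outside Solr is a small Java wrapper
around the tika-app jar. This is only a sketch under assumptions: the class
name TikaAppCheck and both file paths are hypothetical, it relies on tika-app's
--metadata option and on a java binary being on the PATH, and it only reports
the exit code, so the console output still has to be read for a stack trace.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs a downloaded tika-app jar against one PDF.
public class TikaAppCheck {

    // Builds the command line for a standalone tika-app run.
    // --metadata asks tika-app to print only the extracted metadata, which is
    // the code path where the quoted stack trace fails (PDFParser.extractMetadata).
    public static List<String> buildCommand(String tikaAppJar, String pdfPath) {
        return Arrays.asList("java", "-jar", tikaAppJar, "--metadata", pdfPath);
    }

    // Runs the command, forwarding its output to this console,
    // and returns the child process exit code.
    public static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: TikaAppCheck <tika-app.jar> <file.pdf>");
            System.exit(2);
        }
        int exit = run(buildCommand(args[0], args[1]));
        System.out.println("tika-app exit code: " + exit);
    }
}
```

Running it once with tika-app-1.5.jar and once with tika-app-1.6.jar against
the same PDF should show whether the exception is specific to 1.5.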

> NullPointerException in tika-app, parsing PDF content
> -----------------------------------------------------
>
>                 Key: TIKA-1457
>                 URL: https://issues.apache.org/jira/browse/TIKA-1457
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>         Environment: OS - Linux Centos 6.5
> Web APP - Tomcat6
> Using Solr 4.10
> Tika Jar
>           * tika-core-1.5.jar
>           * tika-parsers-1.5.jar
>           * tika-xmp-1.5.jar
>           * pdfbox-1.8.4.jar
>            Reporter: Tadeu Alves
>              Labels: bug, parser, solr, tika,text-extraction
>             Fix For: 1.6
>
>
> When I try to extract text from some PDF files with the Tika app 1.5, I get
> the following exception:
> null:org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01
>       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
>       at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>       at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>       at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
>       at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
>       at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
>       at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>       at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>       at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>       at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>       at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>       at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>       at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>       at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>       at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>       at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>       at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>       at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@52cfcf01
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>       at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
>       ... 19 more
> Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
>       at java.lang.String.charAt(String.java:658)
>       at org.apache.pdfbox.util.DateConverter.parseDate(DateConverter.java:680)
>       at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:808)
>       at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:780)
>       at org.apache.pdfbox.util.DateConverter.toCalendar(DateConverter.java:754)
>       at org.apache.pdfbox.cos.COSDictionary.getDate(COSDictionary.java:797)
>       at org.apache.pdfbox.pdmodel.PDDocumentInformation.getModificationDate(PDDocumentInformation.java:232)
>       at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:176)
>       at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:142)
>       at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>       ... 22 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
