[ 
https://issues.apache.org/jira/browse/PDFBOX-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13790135#comment-13790135
 ] 

Timo Boehme commented on PDFBOX-1743:
-------------------------------------

This is only a first analysis:
the document is broken because it declares the wrong font type (the embedded 
font is of a different type).

Nevertheless, using the wrong parser should not lead to an out-of-memory 
error. Either the parser has to check the font type, or the parsing algorithm 
has to implement sanity checks on the data it reads.
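
To illustrate the second option, here is a minimal sketch of such a check. It
is not the FontBox code and not a proposed patch: CffIndexCheck and its
methods are hypothetical names, and I am assuming the allocation in
IndexData.initData is sized from fields read out of the font data. The sketch
reads the header of a CFF INDEX (a two-byte count, a one-byte offSize, then
count + 1 offsets) and rejects it when the declared sizes cannot fit in the
bytes that are actually left, before anything is allocated.

```java
import java.io.IOException;

/** Hypothetical helper for illustration only; not part of FontBox. */
class CffIndexCheck {

    /** Reads a big-endian unsigned integer of size bytes (1 to 4) at pos. */
    static long readUnsigned(byte[] buf, int pos, int size) throws IOException {
        if (pos < 0 || size < 1 || size > 4 || pos > buf.length - size) {
            throw new IOException("Truncated or invalid field at offset " + pos);
        }
        long value = 0;
        for (int i = 0; i < size; i++) {
            value = (value << 8) | (buf[pos + i] & 0xFF);
        }
        return value;
    }

    /**
     * Reads the offset array of a CFF INDEX starting at pos. Throws an
     * IOException instead of allocating when the declared sizes are
     * inconsistent with the remaining input.
     */
    static int[] readIndexOffsets(byte[] buf, int pos) throws IOException {
        int count = (int) readUnsigned(buf, pos, 2);
        if (count == 0) {
            return new int[0]; // an empty INDEX consists of the count only
        }
        int offSize = (int) readUnsigned(buf, pos + 2, 1);
        if (offSize < 1 || offSize > 4) {
            throw new IOException("Invalid offSize " + offSize);
        }
        int tableStart = pos + 3;
        long tableBytes = (long) (count + 1) * offSize;
        // Check against the real input size BEFORE allocating anything.
        if (tableBytes > buf.length - tableStart) {
            throw new IOException("Offset array of " + tableBytes
                    + " bytes exceeds remaining input");
        }
        long dataAvailable = buf.length - tableStart - tableBytes;
        int[] offsets = new int[count + 1];
        long previous = 1; // offsets are 1-based and must not decrease
        for (int i = 0; i <= count; i++) {
            long offset = readUnsigned(buf, tableStart + i * offSize, offSize);
            if (offset < previous || offset - 1 > dataAvailable) {
                throw new IOException("Invalid offset " + offset + " at entry " + i);
            }
            offsets[i] = (int) offset;
            previous = offset;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // A header claiming 65535 entries with 4-byte offsets in a 5-byte input.
        byte[] broken = {(byte) 0xFF, (byte) 0xFF, 4, 0, 0};
        try {
            readIndexOffsets(broken, 0);
            System.out.println("accepted");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a check of this kind, feeding non-CFF bytes to the CFF parser should fail
with an ordinary parse error instead of exhausting the heap. It does not
replace the first option (checking the font type); the two are complementary.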

> OutOfMemoryError in fontbox
> ---------------------------
>
>                 Key: PDFBOX-1743
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1743
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, PDModel
>            Reporter: Stanea Paul
>         Attachments: document1.pdf
>
>
> When trying to index a PDF document in Solr, PDFBox (FontBox) throws a 
> java.lang.OutOfMemoryError: Java heap space exception.
> This is the stack trace:
> SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)
>     at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:382)
>     at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:448)
>     at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:429)
> Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:413)
>     at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:326)
>     at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:234)
>     ... 3 more
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:542)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:411)
>     ... 5 more
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.fontbox.cff.IndexData.initData(IndexData.java:95)
>     at org.apache.fontbox.cff.CFFParser.readIndexData(CFFParser.java:152)
>     at org.apache.fontbox.cff.CFFParser.parse(CFFParser.java:103)
>     at org.apache.pdfbox.pdmodel.font.PDType1CFont.load(PDType1CFont.java:322)
>     at org.apache.pdfbox.pdmodel.font.PDType1CFont.<init>(PDType1CFont.java:104)
>     at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:162)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:92)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:203)
>     at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>     at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:455)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:379)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:335)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:66)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>     at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:127)
>     at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:243)
>     at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:472)
>     ... 6 more



--
This message was sent by Atlassian JIRA
(v6.1#6144)
