[ https://issues.apache.org/jira/browse/PDFBOX-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941632#comment-14941632 ]
Tilman Hausherr edited comment on PDFBOX-3004 at 10/2/15 7:47 PM:
------------------------------------------------------------------

Yes... As I said, the 1.8 extraction is not very good here. I have attached the 2.0 extraction, which is obviously better. However that is a work in progress, and the improvements won't be a part of PDFBox 1.8. But be assured that TIKA and PDFBox projects are BFF (= best friends forever). So these improvements will be a part of TIKA in the future.

was (Author: tilman):
Yes... As I said, the 1.8 extraction is not very good here. I have attached the 2.0 extraction, which is obviously better. However that is a work in progress, and the improvements won't be a part of PDFBox 1.8. But be assured that TIKA and PDFBox projects are BFF (= best friends forever). So these improvement will be a part of TIKA in the future.

> PDF fulltext index fails.
> -------------------------
>
>                 Key: PDFBOX-3004
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3004
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Arkady Zalkowitsch
>         Attachments: Tika-Extract-Error.png, Tika-Meta.png,
>                      not_found-2.0.txt, not_found.pdf, tika-out.txt
>
>
> PDF fulltext indexing fails when the font dictionary contains an entry for
> the font Helvetica and an entry named Encoding whose value does not represent
> a font at all.
> The object in the PDF looks like this:
> {code}
> obj = {
>     "/Fields": [ 12 0 R ],
>     "/DA": "/Helvetica 0 Tf 0 g",
>     "/DR": {
>         "/Font": {
>             "/Helvetica": "11 0 R",
>             "/Encoding": {
>                 "/PDFDocEncoding": "10 0 R"
>             }
>         }
>     },
>     "/NeedAppearances": true
> }
> {code}
> PDFBox tries to parse the "font" named Encoding and fails, but
> PDResources.getFonts() only logs the resulting exception:
> {code}
> try {
>     newFont = PDFontFactory.createFont( (COSDictionary)font );
> } catch (IOException exception) {
>     LOG.error("error while creating a font", exception);
> }
> {code}
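
To make the reported situation concrete, here is a minimal sketch of one way a caller could skip entries under /Font that are not fonts before trying to build a font from them. It does not use PDFBox: the dictionaries are modelled as plain Java maps, and the class FontEntryFilter and its methods looksLikeFont and selectFonts are hypothetical names invented for this illustration. It assumes a font dictionary carries /Type /Font, and that the indirect reference "11 0 R" has already been resolved to such a dictionary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is not PDFBox code or its API.
final class FontEntryFilter {

    // Assumption: a real font dictionary has a "Type" entry equal to "Font".
    static boolean looksLikeFont(Object value) {
        if (!(value instanceof Map)) {
            return false;
        }
        Object type = ((Map<?, ?>) value).get("Type");
        return "Font".equals(type);
    }

    // Keeps only the entries of a /Font resource dictionary that look like fonts.
    static Map<String, Map<?, ?>> selectFonts(Map<String, Object> fontResources) {
        Map<String, Map<?, ?>> fonts = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : fontResources.entrySet()) {
            Object value = entry.getValue();
            if (looksLikeFont(value)) {
                fonts.put(entry.getKey(), (Map<?, ?>) value);
            }
        }
        return fonts;
    }

    public static void main(String[] args) {
        // Stand-in for the resolved object "11 0 R" from the issue.
        Map<String, Object> helvetica = new LinkedHashMap<>();
        helvetica.put("Type", "Font");
        helvetica.put("Subtype", "Type1");
        helvetica.put("BaseFont", "Helvetica");

        // The stray entry from the issue: a dictionary that is not a font.
        Map<String, Object> encoding = new LinkedHashMap<>();
        encoding.put("PDFDocEncoding", "10 0 R");

        Map<String, Object> fontResources = new LinkedHashMap<>();
        fontResources.put("Helvetica", helvetica);
        fontResources.put("Encoding", encoding);

        // Only entries that pass looksLikeFont are kept.
        System.out.println(selectFonts(fontResources).keySet());
    }
}
```

With the dictionary from the issue, the Encoding entry is filtered out before any font construction is attempted, so there is no exception to log. Whether filtering, logging, or rethrowing is the right behaviour for PDResources.getFonts() is a design question this sketch does not settle.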