[ https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-4210.
-----------------------------------
    Resolution: Not A Bug

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument-AcrobatOCR.pdf, Testdokument-AcrobatOCR.txt, Testdokument.pdf
>
>
> I'm using Tika (v1.18, which means pdfbox 2.0.9) to extract the text from a
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font ArialMT",
> pdfbox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a
> custom encoding and the PDF doesn't include a ToUnicode mapping. The font type
> is CID TrueType (this information is provided by "pdffonts").
> "pdftotext" can't extract anything either and only shows an error: `Syntax
> Error: Unknown character collection 'Adobe-ArialMT'`
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether there is a bug in pdfbox or whether the PDF
> producer has to be adjusted to improve the "readability" of the PDF.
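
For illustration only: the following is a minimal sketch, in plain Java with no PDFBox dependency, of why a font with a custom encoding and no ToUnicode mapping can produce one warning per glyph and no extracted text. It is a simplified model of a code-to-Unicode lookup, not PDFBox's implementation. The class `ToUnicodeSketch`, the method `extract`, and the glyph codes are hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model only; this is NOT PDFBox code.
class ToUnicodeSketch {

    // Translates character codes to text through a code-to-Unicode table.
    // A code with no table entry is recorded as a warning and skipped,
    // so it contributes nothing to the returned string.
    static String extract(int[] codes, Map<Integer, String> toUnicode,
                          String fontName, List<String> warnings) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                warnings.add("No Unicode mapping for " + code + " in font " + fontName);
                continue;
            }
            text.append(unicode);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = {43, 72, 79}; // hypothetical glyph codes from a content stream

        // Case 1: the font carries no mapping at all (assumed to mirror the reported file).
        List<String> warnings = new ArrayList<>();
        String text = extract(codes, new HashMap<>(), "ArialMT", warnings);
        System.out.println("text=\"" + text + "\", warnings=" + warnings.size());

        // Case 2: the producer supplies a mapping (hypothetical values).
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(43, "H");
        mapping.put(72, "e");
        mapping.put(79, "l");
        warnings.clear();
        text = extract(codes, mapping, "ArialMT", warnings);
        System.out.println("text=\"" + text + "\", warnings=" + warnings.size());
    }
}
```

In this model the glyph codes alone carry no meaning: unless the producer writes a mapping into the file, an extractor has nothing to translate them with. A viewer that still returns text for such a file is presumably relying on some other source, such as the OCR layer suggested by the name of the Testdokument-AcrobatOCR.pdf attachment.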