[jira] [Closed] (PDFBOX-4210) Unable to extract the text from a PDF ("No Unicode mapping.." warnings)

Tilman Hausherr (Jira) Thu, 26 Dec 2024 07:16:39 -0800


     [ https://issues.apache.org/jira/browse/PDFBOX-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr closed PDFBOX-4210.
-----------------------------------
    Resolution: Not A Bug

> Unable to extract the text from a PDF ("No Unicode mapping.." warnings)
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4210
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4210
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.9
>            Reporter: Aleksandar Putnik
>            Priority: Major
>         Attachments: Testdokument-AcrobatOCR.pdf, 
> Testdokument-AcrobatOCR.txt, Testdokument.pdf
>
>
> I'm using Tika (v1.18, which means PDFBox 2.0.9) to extract the text from a 
> PDF.
> I have a document from which Acrobat Reader (Adobe Acrobat Reader DC) can 
> extract the text (although not with 100% precision).
> Apart from the warnings "WARNING: No Unicode mapping for ... in font ArialMT", 
> PDFBox 2.0.9 doesn't return anything.
> As you can see from the warning, the font in question is ArialMT. It uses a 
> custom encoding and the PDF doesn't include a ToUnicode mapping. The font type 
> is CID TrueType (this info is provided by "pdffonts").
> "pdftotext" can't extract anything either and only shows an error: 
> `Syntax Error: Unknown character collection 'Adobe-ArialMT'`
> The PDF producer (used by the customer) is VintaSoft PDF .NET Plug-in v5.5.
> I would like to determine whether there is a bug in PDFBox or whether the PDF 
> producer has to improve the "readability" of the PDF.
>  
>  
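
For readers unfamiliar with the warning quoted above, here is a minimal
sketch of why a font with a custom encoding and no ToUnicode mapping yields
warnings but no text. It is not PDFBox code: the class ToUnicodeSketch, the
method extract, the character codes, and the warning wording are all
hypothetical and only modeled loosely on the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only; this is NOT how PDFBox is implemented.
public class ToUnicodeSketch {

    /**
     * Maps character codes to text using a code -> Unicode table
     * (a stand-in for a font's ToUnicode mapping). Codes without an
     * entry are skipped and reported in the warnings list.
     */
    public static String extract(int[] codes, Map<Integer, String> toUnicode,
                                 String fontName, List<String> warnings) {
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String unicode = toUnicode.get(code);
            if (unicode == null) {
                // Hypothetical wording, modeled on the warning in the report.
                warnings.add("No Unicode mapping for " + code + " in font " + fontName);
            } else {
                text.append(unicode);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up character codes, as they might appear in a content stream.
        int[] codes = {43, 68, 79};

        // Case 1: the producer supplied a mapping for every code.
        Map<Integer, String> table = new HashMap<>();
        table.put(43, "H");
        table.put(68, "a");
        table.put(79, "l");
        List<String> warningsWithTable = new ArrayList<>();
        System.out.println("with table:    \""
                + extract(codes, table, "ArialMT", warningsWithTable) + "\"");

        // Case 2: no mapping at all, as described for the attached file.
        List<String> warningsWithoutTable = new ArrayList<>();
        System.out.println("without table: \""
                + extract(codes, new HashMap<>(), "ArialMT", warningsWithoutTable) + "\"");
        warningsWithoutTable.forEach(System.out::println);
    }
}
```

In the second case every code is skipped, so the result is an empty string
plus one warning per code, which matches the behaviour the reporter
describes. Under that reading, the missing information would have to come
from the PDF producer (or from OCR), since the extractor has nothing to map
the codes with.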



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
