[jira] [Commented] (PDFBOX-5790) Don't use a predefined CMap if a ToUnicode CMap is present

ASF subversion and git services (Jira) Sun, 24 Mar 2024 23:45:05 -0700


    [ https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830359#comment-17830359 ]


ASF subversion and git services commented on PDFBOX-5790:
---------------------------------------------------------

Commit 1916526 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1916526 ]

PDFBOX-5790: don't use a predefined CMap if a ToUnicode CMap is present

> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5790
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5790
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: p4_fix.pdf
>
>
> The user Luiz Marcelo Modesto reported an issue with text extraction from
> the attached PDF [^p4_fix.pdf]
> {quote}
> Hi everyone,
>     I'm not sure if this is the same as FAQ "How come I am getting 
> gibberish(G38G43G36G51G5) when extracting text?"...
>     I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build 
> 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
>     I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
>   BT
>   /G1F7 6.0 Tf
>   94.871 773.806 Td
>   <004200430044> Tj
>   ET
>     becomes "BCD" in the PDFBox Debugger (the same in qpdfview, Adobe Reader,
> Chrome, ...) but becomes "abc" in the PDFBox text extraction tool.
>     Poppler's pdftotext (version 22.02.0) gives me "BCD" too.
>     The renderers that allow me to copy the text also give me "BCD".
>     It seems that the PDFBox extraction tool follows item "9.10.2 Mapping
> character codes to Unicode values" (ISO 32000-2:2020), but all the others
> choose a different way.
>     Could you help me understand whether the problem is with the PDF file,
> with the renderers, or with the text extraction tool?
> Thank you!
> {quote}
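
The lookup order named in the commit message can be sketched roughly as follows. This is only a minimal illustration, not the code from r1916526: the class `ToUnicodeFirst`, its methods, and both mapping tables are hypothetical, and the table contents are chosen merely to reproduce the "BCD" versus "abc" results described in the report. It also assumes that every character code in the hex string is two bytes long and that unmapped codes are simply skipped.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names are PDFBox API.
class ToUnicodeFirst {

    // Splits a hex string such as "004200430044" into character codes,
    // assuming every code is exactly two bytes (four hex digits) long.
    public static int[] parseCodes(String hex) {
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(i * 4, i * 4 + 4), 16);
        }
        return codes;
    }

    // Maps codes to text. If a ToUnicode map is present it is the only
    // source; the predefined map is consulted only when toUnicode is null.
    public static String extract(int[] codes,
                                 Map<Integer, String> toUnicode,
                                 Map<Integer, String> predefined) {
        Map<Integer, String> source = (toUnicode != null) ? toUnicode : predefined;
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String mapped = (source != null) ? source.get(code) : null;
            if (mapped != null) {
                text.append(mapped);
            }
            // Assumption: codes without a mapping are skipped in this sketch.
        }
        return text.toString();
    }

    public static void main(String[] args) {
        int[] codes = parseCodes("004200430044");

        // Made-up stand-in for the font's ToUnicode CMap.
        Map<Integer, String> toUnicode = new HashMap<>();
        toUnicode.put(0x0042, "B");
        toUnicode.put(0x0043, "C");
        toUnicode.put(0x0044, "D");

        // Made-up stand-in for a predefined CMap that disagrees with it.
        Map<Integer, String> predefined = new HashMap<>();
        predefined.put(0x0042, "a");
        predefined.put(0x0043, "b");
        predefined.put(0x0044, "c");

        System.out.println(extract(codes, toUnicode, predefined)); // prints BCD
        System.out.println(extract(codes, null, predefined));      // prints abc
    }
}
```

With both tables available, the sketch yields "BCD", matching what the other viewers show. The predefined table only comes into play when no ToUnicode table exists.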



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
