[ https://issues.apache.org/jira/browse/PDFBOX-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830359#comment-17830359 ]
ASF subversion and git services commented on PDFBOX-5790:
---------------------------------------------------------

Commit 1916526 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1916526 ]

PDFBOX-5790: don't use a predefined CMap if a ToUnicode CMap is present

> Don't use a predefined CMap if a ToUnicode CMap is present
> ----------------------------------------------------------
>
>                 Key: PDFBOX-5790
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5790
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.31, 4.0.0, 3.0.3 PDFBox
>            Reporter: Andreas Lehmkühler
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>         Attachments: p4_fix.pdf
>
>
> The user Luiz Marcelo Modesto reported an issue with the text extraction of
> the attached pdf [^p4_fix.pdf]
> {quote}
> Hi everyone,
> I'm not sure if this is the same as FAQ "How come I am getting
> gibberish(G38G43G36G51G5) when extracting text?"...
> I'm using PDFBox version 3.0.1 and OpenJDK Runtime Environment (build
> 11.0.22+7-post-Ubuntu-0ubuntu222.04.1).
> I'm trying to understand how this PDF chunk (from p4_fix.pdf attached)
> BT
> /G1F7 6.0 Tf
> 94.871 773.806 Td
> <004200430044> Tj
> ET
> becomes "BCD" on PDFBox Debugger (the same on qpdfview, Adobe Reader,
> Chrome, ...) and becomes "abc" on PDFBox text extraction tool.
> Using the Poppler pdftotext (version 22.02.0) gives me "BCD" too.
> The renders that allow me to copy the text give me "BCD" text.
> It seems that PDFBox extraction tool follows the item "9.10.2 Mapping
> character codes to Unicode values" (ISO 32000-2:2020) but all the others
> choose a different way.
> Could you help me to understand if there is a problem with the PDF file,
> with the renders or with the extract text tool?
> Thank you!
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
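
The rule named in the commit title ("don't use a predefined CMap if a ToUnicode CMap is present") can be illustrated with a small standalone sketch. This is not PDFBox's implementation and uses no PDFBox classes. The class and method names are hypothetical, and both mapping tables are made up so that they reproduce the two outputs quoted in the report ("BCD" and "abc"); the real tables inside p4_fix.pdf are not shown in this message. Emitting U+FFFD for a code missing from the chosen table is also a simplifying assumption.

```java
import java.util.Map;

// Hypothetical illustration only; not PDFBox code.
final class ToUnicodeLookupSketch {

    /** Splits the body of a PDF hex string such as "004200430044" into 2-byte codes. */
    static int[] decodeTwoByteCodes(String hex) {
        if (hex.length() % 4 != 0) {
            throw new IllegalArgumentException("expected a whole number of 2-byte codes");
        }
        int[] codes = new int[hex.length() / 4];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = Integer.parseInt(hex.substring(4 * i, 4 * i + 4), 16);
        }
        return codes;
    }

    /**
     * Maps character codes to text. If a ToUnicode table is present (non-null),
     * it is the only table consulted; the predefined table is used only when
     * the font has no ToUnicode table at all.
     */
    static String extract(int[] codes,
                          Map<Integer, String> toUnicode,
                          Map<Integer, String> predefined) {
        Map<Integer, String> source = (toUnicode != null) ? toUnicode : predefined;
        StringBuilder text = new StringBuilder();
        for (int code : codes) {
            String mapped = source.get(code);
            // Assumption: unmapped codes become the Unicode replacement character.
            text.append(mapped != null ? mapped : "\uFFFD");
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up tables chosen to mirror the outputs described in the report.
        Map<Integer, String> toUnicode = Map.of(0x0042, "B", 0x0043, "C", 0x0044, "D");
        Map<Integer, String> predefined = Map.of(0x0042, "a", 0x0043, "b", 0x0044, "c");

        int[] codes = decodeTwoByteCodes("004200430044"); // from "<004200430044> Tj"
        System.out.println(extract(codes, toUnicode, predefined)); // BCD
        System.out.println(extract(codes, null, predefined));      // abc
    }
}
```

With both tables supplied, the sketch returns "BCD", matching what the reporter saw in the viewers and in pdftotext. Passing null for the ToUnicode table falls back to the predefined table and returns "abc", the output the reporter got from the text extraction tool.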