[ https://issues.apache.org/jira/browse/PDFBOX-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16387439#comment-16387439 ]
Andreas Meier commented on PDFBOX-4141:
---------------------------------------

Thanks for the info, Tilman.

Overriding the characters in writeCharacters will not be a problem. The main question is whether this should be switchable on/off in PDFBox master, since Adobe themselves do some replacements in their text extraction methods.

Regardless of whether replacing/overriding C0/C1 control codes gets implemented, I checked the Adobe Reader output for the C0 and C1 control codes and ended up with the attached list. Note that for some reason U+0007 is converted to U+0009 and U+000B to U+000A; this is easy to overlook. (I double-checked this because I first thought it was a mistake.)

The file is separated into C0 and C1 codes as well as space (U+0020) and DEL (U+007F).

Even if a feature like this is not implemented by default, the mapping list might help some people out there.

> Suppress control characters?
> ----------------------------
>
>                 Key: PDFBOX-4141
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4141
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>            Reporter: Andreas Meier
>            Priority: Minor
>         Attachments: Mapping_default_to_adobe.csv, Test_with_MW.pdf, Test_with_MW.txt, Test_with_MW_AdobeReader_export.txt, Test_with_MW_linux.jpg, Test_without_MW.txt
>
>
> At the moment PDFBox extracts all types of characters.
> Therefore control characters that occur will also be extracted.
> Unfortunately some of these control characters might deform the text.
> For example 'MESSAGE WAITING' (U+0095) [MW].
> I attached some files and a screenshot of how text is printed when MESSAGE WAITING is present.
> Should PDFBox handle this type of character? Maybe suppress them in PDFTextStripper?
> I know that PDFBox works correctly in this case, but a feature to turn off or suppress special characters might produce better output than the default setting, unless some control characters are used for further processing.
> Feedback appreciated.
> What other programs do:
> a) ignore control characters (Okular PDF Viewer - KDE)
> b) exchange them (Adobe Reader wrote a dot "." in place of MW)
> Regards
> Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
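
The switchable behaviour discussed in this comment (keep, suppress, or replace control characters) could be sketched roughly as below. This is only an illustration under stated assumptions: `ControlCharFilter`, `Mode`, and `filter` are hypothetical names and not part of PDFBox, and the replacement table contains only the three substitutions mentioned in this thread, not the full attached CSV. Tab, line feed, and carriage return are left untouched, which is also an assumption.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of PDFBox. */
final class ControlCharFilter {

    /** The on/off "switch": keep everything, drop control codes, or replace them. */
    enum Mode { KEEP, SUPPRESS, REPLACE }

    // Only the substitutions stated in this thread; the attached CSV holds the full list.
    private static final Map<Integer, String> REPLACEMENTS = new HashMap<>();
    static {
        REPLACEMENTS.put(0x0007, "\t"); // U+0007 -> U+0009, as observed in Adobe Reader
        REPLACEMENTS.put(0x000B, "\n"); // U+000B -> U+000A, as observed in Adobe Reader
        REPLACEMENTS.put(0x0095, ".");  // MESSAGE WAITING -> dot, per the issue description
    }

    /** True for C0 (U+0000..U+001F), DEL (U+007F), and C1 (U+0080..U+009F). */
    static boolean isControl(int cp) {
        return cp <= 0x001F || (cp >= 0x007F && cp <= 0x009F);
    }

    static String filter(String text, Mode mode) {
        if (mode == Mode.KEEP) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            // Assumption: ordinary whitespace controls are always preserved.
            if (!isControl(cp) || cp == '\t' || cp == '\n' || cp == '\r') {
                out.appendCodePoint(cp);
            } else if (mode == Mode.REPLACE) {
                String replacement = REPLACEMENTS.get(cp);
                if (replacement != null) {
                    out.append(replacement);
                } else {
                    out.appendCodePoint(cp); // no known mapping: leave unchanged
                }
            }
            // Mode.SUPPRESS: the control character is simply dropped.
        });
        return out.toString();
    }
}
```

A filter like this could be applied to the extracted string before it is written out, for example from an overridden writeCharacters as mentioned above; how exactly it would hook into PDFTextStripper is not shown here.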