[ https://issues.apache.org/jira/browse/PDFBOX-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17011698#comment-17011698 ]
Michael Klink commented on PDFBOX-4549:
---------------------------------------

{quote}[~Giorgy]>Can you predict the obfuscation without text extraction?{quote}

It may be possible to identify and check for hints of some obfuscators, but not universally for all of them.

Furthermore, there is obfuscation by accident. E.g. there are tons and tons of PDFs with text in Indian languages which extract incorrectly, and the reasons for this are not (at least mostly not) a desire for obfuscation but simply limitations (or bugs, if you prefer) of a number of widely used PDF generators.

> No Unicode mapping
> ------------------
>
>                 Key: PDFBOX-4549
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4549
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Sergey Makarov
>            Assignee: Tilman Hausherr
>            Priority: Major
>             Fix For: 2.0.16, 3.0.0 PDFBox
>
>         Attachments: XO_Thames.zip, our_star_wars.pdf
>
>
> Hello, if I try to get text from the PDF (attached), the result is empty
> output and many warnings. The font is attached as well.
> Acrobat Reader opens the file successfully; I can select and copy text
> and save it as text.
> My code:
> {code:java}
> private static void parseOne(String path) throws IOException {
>     String pdfFileInText;
>     PDFTextStripper tStripper;
>     File file = new File(path);
>     tStripper = new PDFTextStripper();
>     MemoryUsageSetting memUsageSetting = MemoryUsageSetting.setupMixed(0,
>             500000000).setTempDir(new File("/home/user/pdfBoxTest/newFiles/"));
>     PDDocument document = PDDocument.load(file, memUsageSetting);
>     if (!document.isEncrypted()) {
>         pdfFileInText = tStripper.getText(document);
>         System.out.print(pdfFileInText);
>     }
>     document.close();
> }{code}
> Error:
> {code:java}
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
> WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames
> May 15, 2019 6:30:01 PM org.apache.pdfbox.pdmodel.font.PDFont <init>
> WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book
> {code}
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
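
For anyone triaging similar reports, the warnings quoted in the issue can be condensed into a per-font list of unmapped CIDs, which makes it easier to see which fonts are affected. The following is only a minimal sketch that parses captured log text with the JDK alone; it does not call PDFBox. The class name UnmappedCidSummary and the method summarize are hypothetical, and the regular expression assumes the exact message wording shown in the log above and font names without whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: groups "No Unicode mapping"
// warnings from a captured log by font name.
public class UnmappedCidSummary {

    // Assumes the message format seen in the quoted log, e.g.
    // "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames"
    private static final Pattern NO_MAPPING = Pattern.compile(
            "No Unicode mapping for CID\\+(\\d+) \\(\\d+\\) in font (\\S+)");

    // Returns font name -> CIDs reported as unmapped, in order of appearance.
    public static Map<String, List<Integer>> summarize(List<String> logLines) {
        Map<String, List<Integer>> byFont = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = NO_MAPPING.matcher(line);
            if (m.find()) {
                byFont.computeIfAbsent(m.group(2), k -> new ArrayList<>())
                      .add(Integer.parseInt(m.group(1)));
            }
        }
        return byFont;
    }

    public static void main(String[] args) {
        // Warning lines copied from the log in the issue description.
        List<String> lines = Arrays.asList(
                "WARNING: Invalid ToUnicode CMap in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+83 (83) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+116 (116) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+97 (97) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+114 (114) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+87 (87) in font HPDFAA+XOThames",
                "WARNING: No Unicode mapping for CID+115 (115) in font HPDFAA+XOThames",
                "WARNING: Invalid ToUnicode CMap in font HPDFAB+DejaVuSansMono,Book");

        // Prints: HPDFAA+XOThames -> [83, 116, 97, 114, 87, 115]
        summarize(lines).forEach((font, cids) ->
                System.out.println(font + " -> " + cids));
    }
}
```

The "Invalid ToUnicode CMap" lines are deliberately ignored here; they name the font but carry no CID, so they would need a second pattern if they should be counted too.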