Darren,

FDnC Red wrote
> How in the world do I extract text from this PDF? It all comes out as
> gibberish non-printable characters, except when I choose "Copy With
> Formatting" in Acrobat. I've tried several different techniques on the
> web including iText SimpleExtractionStrategy,
> TopToBottomExtractionStrategy, LocationTextExtractionStrategyEx, and other
> tools as well. Adobe's "Copy With Formatting" somehow decodes the text in
> this PDF.

To expound on Paulo's reply:

Paulo Soares-4 wrote
> The text is not extractable and not even Acrobat can do it. When I export
> the file to a word doc it places an image, not text.

iText text extraction uses the mechanisms explained in the PDF specification ISO 32000-1, section 9.10 "Extraction of Text Content", especially sub-section 9.10.2 "Mapping Character Codes to Unicode Values". These mechanisms require that a simple font (your document uses simple fonts)

* provides a ToUnicode CMap mapping character codes to the respective Unicode values, or
* uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding, or
* has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font.

The specification then states

> If these methods fail to produce a Unicode value, there is no way to
> determine what the character code represents in which case a conforming
> reader may choose a character code of their choosing.

Your fonts neither provide a ToUnicode map nor use one of the predefined encodings, and their Differences arrays look like this:

[2, /g51, /g85, /g82, /g77, /g72, /g70, /g87, /g48, /g68, /g81, /g88, /g79, /g76, /g71, /g74, /g54, /g83, /g73, /g86, /g38, /g56, /g43, /g36, /g44, /g3, /g9, /g40, /g55, /g53, /g50, /g49, /g15, /g47, /g24, /g19, /g90, /g75, /g25, /g37, /g20, /g28, /g23, /g21, /g11, /g12, /g16, /g27, /g22, /g45, /g41]

These /gNN names are not among the standard names mentioned above. Thus, as far as the PDF specification is concerned, there is no way to determine what the character codes represent.

So neither the normal copy & paste of Adobe Reader/Acrobat nor the text extraction of iText (or most other PDF libraries) produces anything sensible for your PDF: in this situation they simply return the bytes from the content stream as character codes without any further ado (a fallback which does work well for quite a number of documents).

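To make the three conditions above a bit more tangible, here is a minimal Python sketch of the check. It is not iText code: the font is modelled as a plain dict, names are plain strings with a leading "/", and SAMPLE_STANDARD_NAMES is only a tiny illustrative stand-in for the real standard Latin and Symbol name sets. All function and variable names are made up for this example.

```python
# Hypothetical, simplified stand-ins for PDF objects: a font is a plain dict,
# PDF names are strings with a leading "/". This is not an iText API.
PREDEFINED_ENCODINGS = {"/MacRomanEncoding", "/MacExpertEncoding", "/WinAnsiEncoding"}

# Tiny sample only; the real standard Latin and Symbol name sets are far larger.
SAMPLE_STANDARD_NAMES = {"/space", "/A", "/B", "/a", "/b", "/one", "/two", "/comma"}


def expand_differences(differences):
    """Expand a Differences array into {character code: glyph name}.

    An integer sets the current code; each name that follows is assigned
    the current code, which is then incremented by one.
    """
    mapping = {}
    code = 0
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            mapping[code] = item
            code += 1
    return mapping


def unicode_mapping_defined(font, standard_names=SAMPLE_STANDARD_NAMES):
    """Return True if one of the three conditions listed above holds."""
    if "ToUnicode" in font:
        return True
    encoding = font.get("Encoding")
    if isinstance(encoding, str):
        return encoding in PREDEFINED_ENCODINGS
    if isinstance(encoding, dict):
        names = expand_differences(encoding.get("Differences", [])).values()
        return all(name in standard_names for name in names)
    return False


# A font shaped like the ones in the PDF in question (array shortened).
font = {"Encoding": {"Differences": [2, "/g51", "/g85", "/g82"]}}
print(expand_differences(font["Encoding"]["Differences"]))
print(unicode_mapping_defined(font))  # False: /gNN names are not standard names
```

With the shortened array, character code 2 maps to /g51, code 3 to /g85, and code 4 to /g82, and the check fails because none of these names is a standard one.
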
Thus Paulo's "The text is not extractable".

That being said, though, text extractors can try harder: e.g. the embedded font programs in their native data may contain or imply their own mapping from glyph code to Unicode, or more character names may be known, or the extractor may apply OCR to the characters from the fonts.

In the case of your document I assume "Copy With Formatting" successfully extracts the text by either

* relying on the character names: the NN in the /gNN names for ASCII characters is the ASCII code minus 29; or
* comparing the embedded fonts with existing fonts on the computer, as the names of the fonts from which the embedded fonts are subsets are known.

Regards, Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Extracting-Text-tp4660444p4660446.html
Sent from the iText - General mailing list archive at Nabble.com.
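
P.S. The first guess (NN is the ASCII code minus 29) is easy to try out. The following Python sketch is only a heuristic for this particular document, not anything the PDF specification or Acrobat is known to do; the function name guess_char and the restriction to printable ASCII are my own assumptions.

```python
import re

# Assumption taken from the observation above: NN == ASCII code - 29.
OFFSET = 29


def guess_char(glyph_name):
    """Guess the character for a /gNN glyph name, or None if it does not fit."""
    match = re.fullmatch(r"/?g(\d+)", glyph_name)
    if match is None:
        return None
    code = int(match.group(1)) + OFFSET
    # Only accept printable ASCII; anything else is left undecided.
    return chr(code) if 32 <= code < 127 else None


# The first seven names from the Differences array of the document.
names = ["/g51", "/g85", "/g82", "/g77", "/g72", "/g70", "/g87"]
print("".join(guess_char(name) for name in names))  # Project
```

The first seven names decode to "Project", and /g3 decodes to a space, which supports the guess for this file. Other PDFs with /gNN names may well use a different numbering.
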