Darren,

FDnC Red wrote
> How in the world do I extract text from this PDF?  It all comes out as
> gibberish non-printable characters, except when I choose "Copy With
> Formatting" in Acrobat.  I've tried several different techniques on the
> web including iText SimpleExtractionStrategy,
> TopToBottomExtractionStrategy, LocationTextExtractionStrategyEx, and other
> tools as well. Adobe's "Copy With Formatting" somehow decodes the text in
> this PDF.

To expand on Paulo's reply:


Paulo Soares-4 wrote
> The text in not extractable and not even Acrobat can do it. When I export
> the file to a word doc it places an image, not text.

iText text extraction uses the mechanisms described in the PDF specification
ISO 32000-1, section 9.10 "Extraction of Text Content", especially
sub-section 9.10.2 "Mapping Character Codes to Unicode Values".

These mechanisms require that a simple font (your document uses simple
fonts)

* provides a ToUnicode CMap mapping character codes to the respective
Unicode value or
* uses one of the predefined encodings MacRomanEncoding, MacExpertEncoding,
or WinAnsiEncoding, or
* has an encoding whose Differences array includes only character names
taken from the Adobe standard Latin character set and the set of named
characters in the Symbol font.

The specification then states

> If these methods fail to produce a Unicode value, there is no way to
> determine what the character code represents in which case a conforming
> reader may choose a character code of their choosing.
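
As a rough illustration, the three conditions listed above can be sketched as
a small check. This is only a sketch, not how iText implements it: the font is
modelled as a plain Python dict, the function name has_unicode_mapping is made
up for this example, and KNOWN_GLYPH_NAMES is a tiny placeholder subset (the
full character sets are listed in Annex D of ISO 32000-1).

```python
PREDEFINED_ENCODINGS = {"MacRomanEncoding", "MacExpertEncoding", "WinAnsiEncoding"}

# Placeholder subset for illustration only; the real standard Latin and
# Symbol name lists are much longer (see ISO 32000-1, Annex D).
KNOWN_GLYPH_NAMES = {"space", "comma", "one", "two", "A", "B", "C", "a", "b", "c"}


def has_unicode_mapping(font):
    """Return True if one of the three conditions for a simple font holds.

    `font` is a hypothetical dict standing in for a PDF font dictionary,
    with optional keys "ToUnicode" and "Encoding".
    """
    # Condition 1: a ToUnicode CMap is present.
    if font.get("ToUnicode") is not None:
        return True

    encoding = font.get("Encoding")

    # Condition 2: the encoding is one of the predefined encoding names.
    if isinstance(encoding, str):
        return encoding in PREDEFINED_ENCODINGS

    # Condition 3: every name in the Differences array is a known name.
    # Integers in the array are starting codes, not names, so skip them.
    if isinstance(encoding, dict):
        names = [e for e in encoding.get("Differences", []) if isinstance(e, str)]
        return bool(names) and all(name in KNOWN_GLYPH_NAMES for name in names)

    return False
```

A font whose Differences array holds names such as g51 or g85 fails all three
tests here, so under these rules no Unicode value can be determined for it.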

Your fonts neither provide a ToUnicode map nor use one of the predefined
encodings, and their Differences arrays look like this:

[2,  /g51, /g85, /g82, /g77, /g72, /g70, /g87, /g48, /g68, /g81, /g88, /g79,
/g76, /g71, /g74, /g54, /g83, /g73, /g86, /g38, /g56, /g43, /g36, /g44, /g3,
/g9, /g40, /g55, /g53, /g50, /g49, /g15, /g47, /g24, /g19, /g90, /g75, /g25,
/g37, /g20, /g28, /g23, /g21, /g11, /g12, /g16, /g27, /g22, /g45, /g41]

These /gNN names are not among the mentioned standard names.

Thus, as far as the PDF specification is concerned, there is no way to
determine what the character code represents.

So neither the normal copy & paste of Adobe Reader/Acrobat nor the text
extraction of iText (or most other PDF libraries) produces anything sensible
for your PDF: in this situation they simply return the bytes from the
content stream as character codes without any further ado (a fallback which
indeed does work well for quite a number of documents).

Thus Paulo's "The text in not extractable".

That being said, though, text extractors can try harder: e.g. the embedded
font programs may, in their native data, contain or imply their own mapping
from glyph code to Unicode, or more character names may be known, or the
extractor may apply OCR to the glyphs from the fonts.

In the case of your document I assume "Copy With Formatting" successfully
extracts the text by either

* relying on the character names: the NN in the /gNN names for ASCII
characters is the ASCII code minus 29; or
* comparing the embedded fonts with existing fonts on the computer, as the
names of the fonts from which the embedded fonts are subsets are known.
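
To illustrate the first option, here is a minimal sketch that builds a
code-to-character map from the Differences array quoted above, assuming the
"NN = ASCII code minus 29" pattern holds for every entry. The function names
and the plain-list representation of the array (names written without the
leading slash) are made up for this example; this is not what Acrobat is
known to do, only one way the observation could be used.

```python
import re

GLYPH_NAME = re.compile(r"^g(\d+)$")
OFFSET = 29  # assumption taken from the observation: NN = ASCII code - 29

# The Differences array from the document, names without the leading "/".
DIFFERENCES = [
    2, "g51", "g85", "g82", "g77", "g72", "g70", "g87", "g48", "g68", "g81",
    "g88", "g79", "g76", "g71", "g74", "g54", "g83", "g73", "g86", "g38",
    "g56", "g43", "g36", "g44", "g3", "g9", "g40", "g55", "g53", "g50", "g49",
    "g15", "g47", "g24", "g19", "g90", "g75", "g25", "g37", "g20", "g28",
    "g23", "g21", "g11", "g12", "g16", "g27", "g22", "g45", "g41",
]


def build_code_map(differences):
    """Map character codes to characters using the gNN naming pattern."""
    code_map = {}
    code = 0
    for entry in differences:
        if isinstance(entry, int):
            # An integer sets the code for the next name in the array.
            code = entry
            continue
        match = GLYPH_NAME.match(entry)
        if match:
            code_map[code] = chr(int(match.group(1)) + OFFSET)
        # Each name consumes one code, whether or not it matched.
        code += 1
    return code_map


def decode(raw, code_map):
    """Decode content-stream bytes; unknown codes become U+FFFD."""
    return "".join(code_map.get(byte, "\ufffd") for byte in raw)


CODE_MAP = build_code_map(DIFFERENCES)
print(decode(bytes([2, 3, 4, 5, 6, 7, 8]), CODE_MAP))  # prints "Project"
```

With this map, code 2 (/g51) becomes "P" and code 26 (/g3) becomes a space,
whereas returning the raw bytes 2 to 8 as characters yields only
non-printable control characters.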

Regards,   Michael




