I'm trying to extract the characters from page 41 of:

www.irs.gov/pub/irs-pdf/i1040.pdf

I used the attached ExtractPageContentSorted.java and its member function, at_page, where reader was produced from www.irs.gov/pub/irs-pdf/i1040.pdf and pageNum was 41. However, I only managed to produce the output shown in the 2nd attachment, i1040p41.txt. The characters shown in i1040p41.txt are nothing like what appears on page 41 of i1040.pdf.

Since the at_page member function essentially does what Listing 15.27 in the book does:

http://itextpdf.com/examples/iia.php?id=296

I had expected the characters to come out OK. I also tried other text extractors:

http://poppler.freedesktop.org/

which showed similar garbage characters.

What can be done to *properly* extract the text characters from page 41 of i1040.pdf?

TIA.

-regards,
Larry

package lje;

import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import lje.OpPage;

public class ExtractPageContentSorted implements OpPage {

    public void pre_pages(PrintWriter out) {
    }

    public void at_page(PdfReader reader, int pageNum, PrintWriter out) throws IOException {
        out.println(PdfTextExtractor.getTextFromPage(reader, pageNum));
    }
}

***Page:41 @ &