Hello,
I know next to nothing about the PDF document format. I am using
PDFBox to read the text out of PDF files that contain recipes. The
PDFs are created on a Fujitsu ScanSnap S1300i document scanner. The
software that creates the PDF files is called ABBYY FineReader. The
PDF files themselves are readable, but when I use the following code
to extract text:
    try (PDDocument document = PDDocument.load(file))
    {
        // Instantiate the PDFTextStripper class
        PDFTextStripper pdfStripper = new PDFTextStripper();
        // Retrieve the text from the PDF document
        String text = pdfStripper.getText(document);
        System.out.println(text);
    }
    catch (InvalidPasswordException ipe)
    {
        JOptionPane.showMessageDialog(null, ipe.toString(),
            "Invalid Password", JOptionPane.INFORMATION_MESSAGE);
    }
    catch (IOException ioe)
    {
        JOptionPane.showMessageDialog(null, ioe.toString(),
            "IO Error", JOptionPane.INFORMATION_MESSAGE);
    }
Often fractions like:
1/2 teaspoon ground red pepper
end up being parsed as:
V2 teaspoon ground red pepper
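As a stopgap while I look for the real cause, the misread tokens could at least be flagged after extraction. Below is a rough sketch using only the Java standard library. It assumes the misread always shows up as a capital V followed by a single digit standing alone as a word, which is a guess based on the one example above. FractionCheck, looksMisread and repair are names I made up for illustration; they are not part of PDFBox.

```java
import java.util.regex.Pattern;

class FractionCheck {
    // Assumption: a misread "1/2" appears as "V2", i.e. 'V' plus one digit
    // (2-9) as a whole word. Real files may contain other misread patterns.
    private static final Pattern SUSPECT = Pattern.compile("\\bV([2-9])\\b");

    // Returns true if the line contains a token that looks like a misread fraction.
    static boolean looksMisread(String line) {
        return SUSPECT.matcher(line).find();
    }

    // Replaces each suspect token "Vn" with "1/n". This is only a guess at the
    // intended text, so the result should be reviewed by hand.
    static String repair(String line) {
        return SUSPECT.matcher(line).replaceAll("1/$1");
    }

    public static void main(String[] args) {
        String line = "V2 teaspoon ground red pepper";
        if (looksMisread(line)) {
            System.out.println(repair(line));
        }
    }
}
```

This only papers over the symptom, and it would wrongly rewrite any genuine "V2" in a recipe, so I would rather find out why the text comes out this way in the first place.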
I've read a brief description of what a PDF document should look like:
https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-text-streams/
When I search through the PDF file, I can see Tj sequences, but the
values before them are not surrounded by parentheses.
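From what I have read since, PDF strings can be written either as literal strings in parentheses or as hexadecimal strings in angle brackets, so the operands I am seeing may simply be hex strings. Here is a minimal sketch of decoding such an operand into raw bytes, assuming that is what my files contain (I have not confirmed this). HexOperand and decode are hypothetical names, not PDFBox classes, and the sample operand is made up. Turning the bytes into characters still depends on the font's encoding, which this sketch does not handle.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class HexOperand {
    // Decodes a PDF hexadecimal string such as "<48656C6C6F>" into raw bytes.
    static byte[] decode(String operand) {
        String s = operand.trim();
        // "<<" opens a dictionary, not a hex string, so reject it.
        if (!s.startsWith("<") || !s.endsWith(">") || s.startsWith("<<")) {
            throw new IllegalArgumentException("not a hex string: " + operand);
        }
        // Whitespace between the brackets is ignored.
        String digits = s.substring(1, s.length() - 1).replaceAll("\\s+", "");
        // An odd number of digits is treated as if a trailing 0 were present.
        if (digits.length() % 2 != 0) {
            digits += "0";
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < digits.length(); i += 2) {
            out.write(Integer.parseInt(digits.substring(i, i + 2), 16));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up sample operand; the bytes happen to be ASCII here, which
        // will not be true in general for fonts with custom encodings.
        byte[] bytes = decode("<48656C6C6F>");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII));
    }
}
```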
Can someone suggest either:
1) What the problem might be
2) What steps I can take to get closer to understanding it
Thanks for your help.
Best Regards,
Jason Mazzotta