> On 31.01.2018 at 02:51, jason mazzotta <[email protected]> wrote:
>
> Hello,
> I know next to nothing about the PDF document format. I am using
> pdfbox to read the text out of PDF files that contain recipes. The
> PDFs are created on a Fujitsu ScanSnap S1300i document scanner. The
> software that creates the PDF files is called ABBYY FineReader. The
> PDF files themselves are readable, but when I use the following code
> to extract text:
>
>     try (PDDocument document = PDDocument.load(file))
>     {
>         // Instantiate PDFTextStripper class
>         PDFTextStripper pdfStripper = new PDFTextStripper();
>
>         // Retrieving text from PDF document
>         String text = pdfStripper.getText(document);
>         System.out.println(text);
>     }
>     catch (InvalidPasswordException ipe)
>     {
>         JOptionPane.showMessageDialog(null, ipe.toString(),
>             "Invalid Password", JOptionPane.INFORMATION_MESSAGE);
>     }
>     catch (IOException ioe)
>     {
>         JOptionPane.showMessageDialog(null, ioe.toString(),
>             "IO Error", JOptionPane.INFORMATION_MESSAGE);
>     }
>
> Often fractions like:
>
> 1/2 teaspoon ground red pepper
>
> end up being parsed as:
>
> V2 teaspoon ground red pepper

It's very likely that the OCR recognized "1/" as "V". When you look at
the PDF, you see the scanned image and so you read "1/2". The OCR
software adds the recognized text as an invisible layer, and text
extraction uses that layer, which is why you get "V2". You might get
better results if you scan with a higher DPI setting.

BR
Maruan

> I've read a brief description of what a PDF document should look like:
>
> https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-text-streams/
>
> When I search through the PDF file, I can see Tj sequences, but the
> values before them are not surrounded by parentheses.
>
> Can someone suggest either
>
> 1) What the problem might be
> 2) What steps I can take to get closer to under
>
> Thanks for your help.
>
> Best Regards,
>
> Jason Mazzotta

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
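
If rescanning at a higher DPI is not practical, one possible workaround is to post-process the string returned by pdfStripper.getText(document). The sketch below is only a heuristic and is not part of PDFBox: the class name FractionFixer, the method fixFractions, and the list of unit words are all assumptions made for illustration. It rewrites "V" plus a single digit as "1/" plus that digit, but only when a measurement unit follows, so that a token such as "V8 juice" is left alone.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: repairs "1/" misread as "V" by OCR.
class FractionFixer {

    // "V" followed by one digit 2-9, directly before a unit word.
    // The unit list is an assumption; extend it to match your recipes.
    private static final Pattern MISREAD_FRACTION = Pattern.compile(
        "\\bV([2-9])(?=\\s+(?i:teaspoons?|tablespoons?|cups?|tsp|tbsp|ounces?|oz|pounds?|lbs?)\\b)");

    // Returns the text with matching "V<digit>" tokens rewritten as "1/<digit>".
    static String fixFractions(String text) {
        return MISREAD_FRACTION.matcher(text).replaceAll("1/$1");
    }

    public static void main(String[] args) {
        // In the original program this would be the result of pdfStripper.getText(document).
        String extracted = "V2 teaspoon ground red pepper";
        System.out.println(fixFractions(extracted));
    }
}
```

This only covers misreads of the form shown in the question. Other OCR errors in the hidden text layer would need their own rules, so improving the scan quality remains the more reliable fix.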

