> On 31.01.2018 at 02:51, jason mazzotta <[email protected]> wrote:
> 
> Hello,
>     I know next to nothing about the PDF document format.  I am using
> pdfbox to read the text out of PDF files that contain recipes.  The
> PDFs are created on a Fujitsu ScanSnap S1300i document scanner.  The
> software that creates the PDF files is called ABBYY FineReader.  The
> PDF files themselves are readable, but when I use the following code
> to extract text:
> 
> 
> try (PDDocument document = PDDocument.load(file))
>      {
>          //Instantiate PDFTextStripper class
>          PDFTextStripper pdfStripper = new PDFTextStripper();
> 
>          //Retrieving text from PDF document
>          String text = pdfStripper.getText(document);
>          System.out.println(text);
> 
>      }
>      catch(InvalidPasswordException ipe)
>      {
>          JOptionPane.showMessageDialog(null, ipe.toString(),
>                  "Invalid Password", JOptionPane.INFORMATION_MESSAGE);
>      }
>      catch(IOException ioe)
>      {
>          JOptionPane.showMessageDialog(null, ioe.toString(),
>                  "IO Error", JOptionPane.INFORMATION_MESSAGE);
>      }
> 
> Often fractions like:
> 
> 1/2 teaspoon ground red pepper
> 
> end up being parsed as:
> 
> V2 teaspoon ground red pepper

It is very likely that the OCR has recognized "1/" as "V".

When you look at the PDF you see the scanned image and thus read 1/2. The OCR
adds the recognized text as an invisible layer on top of that image. Text
extraction reads that layer, not the image, and so you get V2.

You might get better results if you scan with a higher DPI setting.
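
If rescanning is not practical, the extracted text could also be patched up
afterwards. The following is only a rough sketch, not part of PDFBox: the
class name OcrFractionFixer and the method fixLeadingFractions are made up
for illustration, and it assumes that a "V" directly followed by a single
digit 2-9 at the start of a line is really a misread "1/<digit>".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class: repairs one specific OCR misread.
public class OcrFractionFixer {

    // Assumption: a line starting with "V" + one digit (2-9) + whitespace
    // was originally "1/" + that digit, e.g. "V2 teaspoon" -> "1/2 teaspoon".
    private static final Pattern MISREAD_FRACTION =
            Pattern.compile("(?m)^(?<indent>[ \\t]*)V(?<digit>[2-9])(?=\\s)");

    public static String fixLeadingFractions(String text) {
        // Named groups keep the replacement readable and unambiguous.
        return MISREAD_FRACTION.matcher(text)
                .replaceAll("${indent}1/${digit}");
    }

    public static void main(String[] args) {
        String extracted = "V2 teaspoon ground red pepper";
        // prints "1/2 teaspoon ground red pepper"
        System.out.println(fixLeadingFractions(extracted));
    }
}
```

In the code quoted above this would be applied to the result of
pdfStripper.getText(document) before printing. It is a heuristic: an
ingredient line that legitimately starts with something like "V8 juice"
would be changed too, so the result should be checked by hand.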

BR
Maruan

> 
> I've read a brief description of what a PDF document should look like:
> 
> https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-text-streams/
> 
> When I search through the PDF file, I can see Tj sequences, but the
> values before them are not surrounded by parentheses.
> 
> Can someone suggest either
> 
> 1)  What the problem might be
> 2)  What steps I can take to get closer to under
> 
> Thanks for your help.
> 
> Best Regards,
> 
> Jason Mazzotta
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 

