You might want to experiment with different -psm values; we use 1 by default.

Also, which version of Tesseract? I think I got mine from 
(https://github.com/UB-Mannheim/tesseract/wiki), version:

tesseract 3.05.00dev
leptonica-1.73
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : libtiff 
4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
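
To try several -psm values in one go, something like the following Python 
sketch could work. It assumes tesseract is on the PATH and uses the 
single-dash -psm flag that 3.x builds accept (4.x and later spell it --psm). 
The helper names, the output naming scheme, and the choice of modes to try are 
illustrative assumptions, not anything Tika does itself.

```python
import subprocess


def build_tesseract_cmd(image_path, output_base, psm,
                        exe="tesseract", psm_flag="-psm"):
    """Build the argument list for one Tesseract run.

    Tesseract 3.x takes: tesseract <image> <outputbase> -psm <N>
    and writes its text to <outputbase>.txt.
    """
    return [exe, image_path, output_base, psm_flag, str(psm)]


def sweep_psm(image_path, output_prefix, psm_values=(1, 3, 4, 6),
              run=subprocess.run):
    """Run Tesseract once per page segmentation mode.

    Each run writes <output_prefix>-psm<N>.txt so the results can be
    compared side by side. Returns {psm: process return code}.
    The `run` parameter exists only so the sweep can be dry-run in tests.
    """
    results = {}
    for psm in psm_values:
        output_base = f"{output_prefix}-psm{psm}"
        cmd = build_tesseract_cmd(image_path, output_base, psm)
        results[psm] = run(cmd).returncode
    return results


if __name__ == "__main__":
    # Hypothetical file names; substitute your own image and output prefix.
    print(sweep_psm("scan.tiff", "scan"))
```

Comparing the resulting .txt files by eye is usually the quickest way to see 
which mode suits a tabular scan like an invoice.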



From: Gordon Schneider [mailto:schneid...@transampiping.com]
Sent: Tuesday, July 19, 2016 11:22 AM
To: 'user@tika.apache.org' <user@tika.apache.org>
Subject: RE: Extract Text from a TIFF image

I installed tesseract on my PC. I ran tesseract on its own using the following 
command:

tesseract.exe x:/java/PDFBox/Maxfield-1.tiff x:/java/PDFBox/Maxfield-1

The results are in the attached file. They are not as clean as the results 
Timothy got. I am closer to where I want to be, but obviously I am still a 
number of steps from my ideal solution. How can I get the same results Timothy 
got?

Thanks

Gord


From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: July 18, 2016 2:25 PM
To: user@tika.apache.org
Subject: RE: Extract Text from a TIFF image

You'll need to set up Tesseract to run optical character recognition (OCR). 
While Tika has an integration with Tesseract, Tesseract itself is not bundled 
with the app.

See https://wiki.apache.org/tika/TikaOCR
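
As a rough way to check whether OCR is actually producing anything, the Python 
sketch below runs tika-app in plain-text mode and tests whether the output is 
empty. It assumes java is on the PATH and that the jar is named 
tika-app-1.13.jar (a guess based on the version mentioned in this thread); the 
function names are made up for illustration. Empty output for an image is only 
a hint that OCR did not run, for example because Tika could not find 
Tesseract, not proof.

```python
import subprocess


def build_tika_text_cmd(tika_jar, image_path):
    """Build the tika-app command that prints plain-text content (--text)."""
    return ["java", "-jar", tika_jar, "--text", image_path]


def has_extracted_text(text):
    """Return True if the output holds anything besides whitespace."""
    return bool(text.strip())


def extract_text(tika_jar, image_path, run=subprocess.run):
    """Run tika-app on one file and return whatever it wrote to stdout.

    The `run` parameter exists only so the call can be stubbed in tests.
    """
    completed = run(build_tika_text_cmd(tika_jar, image_path),
                    capture_output=True, text=True)
    return completed.stdout


if __name__ == "__main__":
    # Hypothetical paths; point these at your own jar and image.
    text = extract_text("tika-app-1.13.jar", "scan.tiff")
    if has_extracted_text(text):
        print(text)
    else:
        print("No text came back; check that Tesseract is installed and on the PATH.")
```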

For kicks, I ran this through Tika+Tesseract; this is the output you get once 
you've set up Tesseract:

SUPPLIER: 3177  Invoice Date Description Amount Discount Net Amount 015-28339 
06/08/2015 21,318.54 0.00 21,318.54 C15-28837 06/04/2015 1,529.75 0.00 1,529.75 
01528978 06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06/04/2015 1,182.85 
0.00 1,182.85 015-28439 06/01/2015 1,113.86 0.00 1,113.86 C15-29707 06/11/2015 
886.84 0.00 886.64 C15-28978-02 06/04/2015 526.91 0.00 526.91 01529385 
06/09/2015 199.29 0.00 199.29 C15~28439~01 06/03/2015 157.34 0.00 157.34 
C15-28670 06/03/2015 136.52 0.00 136.52 C15-28314-01 06/03/2015 132.81 0.00 
132.81 015-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #: 83077 Cheque Date 7/14/2015 28,506.32 0.00 28,506.32  SUPPLIER: 
3177  Invoice Date Description Amount Discount Net Amount C15-28339 06/08/2015 
21,318.54 0.00 21,318.54 015-28837 06/04/2015 1,529.75 0.00 1,529.75 015-28978 
06/04/2015 1,238.18 0.00 1,238.18 015-28978-01 06I04/2015 1 ,18285 0.00 
1,182.85 C15-28439 06/01/2015 1,113.86 0.00 1,113.86 015-29707 06l11/2015 
886.64 0.00 886.64 C15-28978~02 06/04/2015 526.91 0.00 526.91 015-29385 
06/09/2015 199.29 0.00 199.29 C15-28439-01 06/03/2015 157.34 0.00 157.34 
015-28670 06/03/2015 136.52 0.00 136.52 015-28314-01 06/03/2015 132.81 0.00 
132.81 C15-28576 06/02/2015 61.26 0.00 61.26 015-29413 06/11/2015 22.37 0.00 
22.37 Cheque #1 83077 Check Daie: 7/14/2015 28,506.32 0.00 28,506.32  07142015 
MMDDYYYY  TWENTY-EIGHT THOUSAND FIVE HUNDRED SIX CAD AND 32/ 100 $ 
"******28,506.32  Trans Am Piping Canada

From: Gordon Schneider [mailto:schneid...@transampiping.com]
Sent: Monday, July 18, 2016 4:05 PM
To: 'user@tika.apache.org' <user@tika.apache.org>
Subject: Extract Text from a TIFF image

I have tried using the GUI for tika-app-1.13 but it shows nothing. I can see 
the metadata but that does not give me the information I need. I have attached 
the file.

Maybe it is not possible to extract the text. If so, what should I be looking 
for to tell me that it cannot extract the text?

Thanks


Gordon Schneider
403-236-0601
Trans Am Piping Products Ltd.
