Hi,
We are using Tesseract 4.0 on Debian x64:

tesseract 4.00.00alpha
 leptonica-1.74.4
  libpng 1.5.4 : zlib 1.2.11

We have observed that several words have their bbox set to 0 0 3400 4400.

These are the dimensions of the entire page (legal paper size), not of the individual words.

The words themselves are correctly extracted, and the next words on the same
line have proper coordinates set.
We depend on accurate bbox values for further processing of the extracted
text.

The problem is more acute when using PSM 6 (Single block of text).

However, it also happens with PSM 3, though very rarely.

Here is sample output:

    <span class='ocr_line' id='line_1_13' title="bbox 996 1642 2404 1701; baseline 0 0; x_size 76.34156; x_descenders 18.341558; x_ascenders 17.596659">
    <span class='ocrx_word' id='word_1_48' title='bbox 0 0 3400 4400; x_wconf 0'>PLAINTIFF&#39;S</span>
    <span class='ocrx_word' id='word_1_49' title='bbox 996 1642 2404 1701; x_wconf 96'>SUPPLEMENTAL</span>
    <span class='ocrx_word' id='word_1_50' title='bbox 0 0 3400 4400; x_wconf 95'>RESPONSE</span>
    <span class='ocrx_word' id='word_1_51' title='bbox 0 0 3400 4400; x_wconf 95'>TO</span>

The error in the line above is the page-sized bbox on a word, for example:
"title='bbox 0 0 3400 4400; x_wconf 0'>PLAINTIFF&#39;S</span>"
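
For anyone who wants to check their own output for the same symptom, here is a minimal Python sketch that flags words whose bbox equals the full-page box. The function name find_suspect_words and the regular expression are illustrative only and not part of Tesseract. It assumes the word spans use single-quoted attributes in the order shown above (class, id, title) and that the page box is known in advance; a real hOCR parser would be more robust.

```python
import re

# Matches an ocrx_word span as it appears in the sample output above and
# captures the word id, the four bbox numbers, and the word text.
# Assumption: attributes are single-quoted and ordered class, id, title.
WORD_RE = re.compile(
    r"<span\s+class='ocrx_word'\s+id='(?P<id>[^']+)'\s+"
    r"title='bbox\s+(?P<x0>\d+)\s+(?P<y0>\d+)\s+(?P<x1>\d+)\s+(?P<y1>\d+);[^']*'>"
    r"(?P<text>.*?)</span>",
    re.DOTALL,
)


def find_suspect_words(hocr, page_bbox):
    """Return (word_id, text) for every word whose bbox equals page_bbox."""
    suspects = []
    for match in WORD_RE.finditer(hocr):
        bbox = tuple(int(match.group(k)) for k in ("x0", "y0", "x1", "y1"))
        if bbox == page_bbox:
            suspects.append((match.group("id"), match.group("text")))
    return suspects
```

Calling find_suspect_words(hocr_text, (0, 0, 3400, 4400)) on the output above would report the words carrying the page-sized box. It would not catch a word like SUPPLEMENTAL, whose bbox equals the bbox of the whole line instead.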

When we try PSM 3 for the same file, we get correct results as follows:
    <p class='ocr_par' id='par_1_8' lang='eng' title="bbox 557 1642 2845 1774">
     <span class='ocr_line' id='line_1_14' title="bbox 999 1642 2400 1689; baseline 0 -1; x_size 60.657276; x_descenders 15.164319; x_ascenders 15.164319">
     <span class='ocrx_word' id='word_1_49' title='bbox 999 1642 1398 1689; x_wconf 90'>PLAINTIFF’S</span>
     <span class='ocrx_word' id='word_1_50' title='bbox 1416 1643 1938 1689; x_wconf 95'>SUPPLEMENTAL</span>
     <span class='ocrx_word' id='word_1_51' title='bbox 1959 1643 2292 1689; x_wconf 95'>RESPONSE</span>
     <span class='ocrx_word' id='word_1_52' title='bbox 2311 1643 2400 1689; x_wconf 96'>TO</span>

It seems that even the coordinates of the word SUPPLEMENTAL are different in
the case of PSM 3.


Our command line for PSM 6 mode is as follows:

tesseract -psm 6 -l eng <infile> <outfile> --oem 1 ~/tess.conf

Content of tess.conf
-------------------------------
tessedit_create_hocr 1

Has anyone come across this problem? What should I do to fix this?

Any help appreciated.

thanks,
Sreenath
