[tesseract-ocr] Re: Need of bounding boxes coordinates of individual letters from image in hocr format

Juanjo Serrano Lloria Tue, 10 Mar 2020 03:14:03 -0700

Hi,
Have you tried enabling the makebox config?

Example on the command line:
tesseract isis_0153.png isis_0153 makebox hocr
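
Since the question uses pytesseract, here is a minimal sketch of the same idea from Python. It assumes that pytesseract.image_to_boxes returns text in the makebox format: one line per character with the glyph, then left, bottom, right, top and page number, measured from the bottom-left corner of the image. The names CharBox, parse_box_text and char_boxes_from_image are only illustrative. The parser flips the y axis so the values line up with hOCR bboxes, which are measured from the top-left corner.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    char: str
    x0: int  # left
    y0: int  # top (top-left origin, as in hOCR)
    x1: int  # right
    y1: int  # bottom (top-left origin, as in hOCR)


def parse_box_text(box_text: str, image_height: int) -> List[CharBox]:
    """Convert makebox-style lines into hOCR-style character boxes."""
    boxes = []
    for line in box_text.splitlines():
        # Split from the right so the glyph is kept whole in parts[0].
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, left, bottom, right, top, _page = parts
        boxes.append(
            CharBox(
                char,
                int(left),
                image_height - int(top),     # flip y: bottom-left -> top-left
                int(right),
                image_height - int(bottom),
            )
        )
    return boxes


def char_boxes_from_image(path: str, lang: str = "eng") -> List[CharBox]:
    """Run Tesseract through pytesseract and return per-character boxes."""
    # Imported here so the parser above works without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        box_text = pytesseract.image_to_boxes(image, lang=lang)
        return parse_box_text(box_text, image.height)
```

Calling char_boxes_from_image("isis_0153.png") should then give one CharBox per recognized character.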


On Tuesday, 10 March 2020 at 10:47:09 (UTC+1), Preetilatha Ramalingam 
wrote:
>
> Hi,
>
>    I can get the bounding box coordinates of words in hOCR format 
> using the function 
> pytesseract.image_to_pdf_or_hocr(imge,lang='eng',extension='hocr'), 
> which gives the output below.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
>     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
>  <head>
>   <title></title>
>   <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
>   <meta name='ocr-system' content='tesseract 5.0.0-alpha-635-g90405' />
>   <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par 
> ocr_line ocrx_word ocrp_wconf'/>
>  </head>
>  <body>
>   <div class='ocr_page' id='page_1' title='image "/tmp/tess_dt1rxsus.PNG"; 
> bbox 0 0 500 250; ppageno 0'>
>    <div class='ocr_carea' id='block_1_1' title="bbox 38 30 464 220">
>     <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 38 30 464 220">
>      <span class='ocr_line' id='line_1_1' title="bbox 77 30 420 94; 
> baseline 0 -14; x_size 64; x_descenders 14; x_ascenders 16">
>       <span class='ocrx_word' id='word_1_1' title='bbox 77 30 236 94; 
> x_wconf 89'>Noisy</span>
>       <span class='ocrx_word' id='word_1_2' title='bbox 251 30 420 94; 
> x_wconf 90'>image</span>
>      </span>
>      <span class='ocr_line' id='line_1_2' title="bbox 166 105 332 150; 
> baseline 0 0; x_size 52.686958; x_descenders 7.6869564; x_ascenders 11">
>       <span class='ocrx_word' id='word_1_3' title='bbox 166 105 219 150; 
> x_wconf 91'>to</span>
>       <span class='ocrx_word' id='word_1_4' title='bbox 235 105 332 150; 
> x_wconf 92'>test</span>
>      </span>
>      <span class='ocr_line' id='line_1_3' title="bbox 38 170 464 220; 
> baseline 0 0; x_size 57.686958; x_descenders 7.6869564; x_ascenders 16">
>       <span class='ocrx_word' id='word_1_5' title='bbox 38 171 296 220; 
> x_wconf 84'>Tesseract</span>
>       <span class='ocrx_word' id='word_1_6' title='bbox 312 170 464 220; 
> x_wconf 92'>OCR</span>
>      </span>
>     </p>
>    </div>
>   </div>
>  </body>
> </html>
>
> But I need the bounding box coordinates of the individual letters in 
> each word. Below is the desired output.
>
>
> <div class='ocr_page' id='page_1' title='image "choices.png"; bbox 0 0 293 
> 90; ppageno 0'>
>
>    <div class='ocr_carea' id='block_1_1' title="bbox 16 18 270 71">
>
>     <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 16 18 270 71">
>
>      <span class='ocr_line' id='line_1_1' title="bbox 16 18 270 71; baseline 
> -0.012 0; x_size 68.5; x_descenders 17.125; x_ascenders 17.125">
>
>       <span class='ocrx_word' id='word_1_1' title='bbox 16 18 206 71; x_wconf 
> 42'>
>
>        <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 
> 99.041275'>B</span>
>
>        <span class='ocrx_cinfo' title='x_bboxes 49 20 76 71; x_conf 
> 99.038635'>A</span>
>
>        <span class='ocrx_cinfo' title='x_bboxes 84 19 107 70; x_conf 
> 98.950821'>S</span>
>
>        <span class='ocrx_cinfo' title='x_bboxes 117 19 139 69; x_conf 
> 91.848969'>O</span>
>
>        <span class='ocrx_cinfo' title='x_bboxes 148 19 174 70; x_conf 
> 99.027092'>B</span>
>
>        <span class='ocrx_cinfo' title='x_bboxes 181 18 206 69; x_conf 
> 98.989304'>C</span>
>
>       </span>
>
>       <span class='ocrx_word' id='word_1_2' title='bbox 242 18 270 68; 
> x_wconf 88'>
>
>        <span class='ocrx_cinfo' title='x_bboxes 242 18 270 68; x_conf 
> 98.37661'>6</span>
>
>       </span>
>
>      </span>
>
>     </p>
>
>    </div>
>
>   </div>
>
>
> Please help me solve this.
>
>
>
> Thanks
>
>
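
For the hOCR layout shown in the quoted message, a minimal sketch follows. It assumes a Tesseract build that has the hocr_char_boxes config variable (as far as I know, 4.1 and later) and that it emits ocrx_cinfo spans like the ones in the desired output. The helper names hocr_with_char_boxes and parse_char_boxes are made up for illustration, and the regular expression only targets that exact span layout.

```python
import html
import re
from typing import List, Tuple

# Matches spans such as:
# <span class='ocrx_cinfo' title='x_bboxes 16 19 42 71; x_conf 99.04'>B</span>
CINFO_RE = re.compile(
    r"<span\s+class=['\"]ocrx_cinfo['\"]\s+"
    r"title=['\"]x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);\s*"
    r"x_conf\s+([\d.]+)['\"]\s*>(.*?)</span>",
    re.DOTALL,
)

CharInfo = Tuple[str, Tuple[int, int, int, int], float]


def parse_char_boxes(hocr: str) -> List[CharInfo]:
    """Return (character, (x0, y0, x1, y1), confidence) for each ocrx_cinfo span."""
    results = []
    for x0, y0, x1, y1, conf, text in CINFO_RE.findall(hocr):
        bbox = (int(x0), int(y0), int(x1), int(y1))
        # Characters such as & arrive HTML-escaped, so unescape them.
        results.append((html.unescape(text), bbox, float(conf)))
    return results


def hocr_with_char_boxes(image, lang: str = "eng") -> str:
    """Ask Tesseract for hOCR with per-character spans (assumes hocr_char_boxes exists)."""
    # Imported here so the parser above works without pytesseract installed.
    import pytesseract

    data = pytesseract.image_to_pdf_or_hocr(
        image, lang=lang, config="-c hocr_char_boxes=1", extension="hocr"
    )
    return data.decode("utf-8")
```

With these, parse_char_boxes(hocr_with_char_boxes(imge)) would give one tuple per recognized letter. If the installed Tesseract does not know hocr_char_boxes, the list simply comes back empty because no ocrx_cinfo spans are present.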

