Hi all, there is a request [1] to restore word confidence information (x_wconf) in the hOCR output [2]. (The output changed in version 3.02, and x_wconf was removed.)
I want to follow the hOCR spec [3], but I am not sure I have understood it correctly. I tried to contact Thomas Breuel (editor of the hOCR spec), but he has not responded yet. I also checked the output of cuneiform-linux (1.1.0) and ocropus (0.6), but neither provides word confidence information. So I implemented it to the best of my knowledge (see the patch at issue 748).

There are 2 changes compared to the 3.01 hOCR output:
1. x_wconf is no longer a "small negative amount" (as far as I saw, from 0 to -7), but an integer from 0 to 100 (%).
2. x_wconf is included in the title of class='ocrx_word', together with the bbox info.

I would like to know if:
- somebody has a better idea/understanding of how the hOCR spec intends x_wconf to be implemented
- this breaks any tools (and how to fix it)

I attached the hOCR output for phototest.tif from the above-mentioned tools for comparison.

Thanks for your feedback.

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=748
[2] http://en.wikipedia.org/wiki/HOCR
[3] http://docs.google.com/View?docid=dfxcv4vc_67g844kf

--
Zdenko
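To make the second change concrete, here is a minimal sketch of how a downstream tool might read x_wconf from the title of an ocrx_word element. The sample markup is an assumption for illustration only (semicolon-separated properties in the title, with class appearing before title); it is not copied from the patch or from real Tesseract output, and the function names are hypothetical.

```python
import re

# Hypothetical sample: the exact title layout is an assumption, not patch output.
SAMPLE = (
    "<span class='ocrx_word' id='word_1' "
    "title='bbox 36 92 96 116; x_wconf 87'>This</span> "
    "<span class='ocrx_word' id='word_2' "
    "title='bbox 109 92 129 116; x_wconf 93'>is</span>"
)

# Matches ocrx_word spans; assumes the class attribute precedes the title attribute.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]([^'\"]*)['\"][^>]*>(.*?)</span>",
    re.S,
)


def parse_title(title):
    """Split a title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


def word_confidences(hocr):
    """Return a list of (text, bbox, confidence) for each ocrx_word span.

    confidence is an int (expected 0..100), or None if x_wconf is absent.
    """
    words = []
    for title, text in WORD_RE.findall(hocr):
        props = parse_title(title)
        bbox = tuple(int(v) for v in props.get("bbox", []))
        conf = int(props["x_wconf"][0]) if "x_wconf" in props else None
        words.append((text, bbox, conf))
    return words


if __name__ == "__main__":
    for text, bbox, conf in word_confidences(SAMPLE):
        print(text, bbox, conf)
```

A tool that splits the title on ";" and looks properties up by name, as above, should tolerate x_wconf being added next to bbox. A tool that treats the whole title as a bare bbox string is the kind that could break.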
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
This is a lot of 12 point text to test the ocr code and see if it works on all types of file format.
The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.
This is a lot of 12 point text to test the ocr code and see if it works on all types of file format.
The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox. The quick brown dog jumped over the lazy fox.

