On 05.06.2010 14:57, Jimmy O'Regan wrote:
> On Saturday, June 5, 2010, zdpo <[email protected]> wrote:
>
>> Dear Sriranga,
>>
>> your box file is wrong (for tesseract 3.0 and >r319). It does not match
>> the description in "Make Box Files" on
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract.
>>
>> BTW: I am not aware of any tool that supports this new box format (for
>> multipage tif).
>>
>
> It shouldn't matter. The code is supposed to accept the old style too,
> provided that the number of pages is set to zero, which is determined
> by the image reading code, which doesn't work on Windows.
>
> If it fails on Linux, then I'd consider it a bug.
>
/usr/local/bin/tesseract slk.arial.001.tif slk.arial.001 makebox batch.nochop
created a slk.arial.001.box file with 6 columns (the last one containing 0).
When I run:
/usr/local/bin/unicharset_extractor slk.arial.001.box
the output is OK. When I convert it to the 2.x format with
awk '{print $1" "$2" "$3" "$4" "$5}' <slk.arial.001.box >slk.arial.002.box
and run:
/usr/local/bin/unicharset_extractor slk.arial.002.box
I get errors:
Extracting unicharset from slk.arial.002.box
Box file format error on line 1 ignored
...
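
To see which format a given box file is in, one can count the fields per line. This is only a minimal sketch, assuming whitespace-separated fields and no box whose character is itself a space; box_columns is a hypothetical helper name, not part of tesseract. Five fields per line would suggest the 2.x format, six the 3.0 format.

```shell
# Hypothetical helper: print the distinct field counts found in a box file.
# Assumes whitespace-separated fields (awk's default field splitting).
box_columns() {
    awk '{ print NF }' "$1" | sort -u
}

# Usage (file name taken from the commands above):
#   box_columns slk.arial.001.box
```

If more than one number is printed, the file mixes lines of different lengths, which would also be worth checking before training.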
Anyway, if Sriranga's tesseract 3.0 produced the old format, then something
is wrong in his (Windows) installation/compilation process. Or maybe he
simply mixed up outputs from tesseract 2.x and 3.0...
Zd.