Specifying different dictionary files [was: Getting usable source files from traineddata files]

Nick White Tue, 17 Apr 2012 07:37:56 -0700

On Mon, Apr 16, 2012 at 06:38:01PM +0200, zdenko podobny wrote:
> I think 3.02 will provide a solution for these cases: you can use more than one
> language for OCR, e.g. you can run something like this:
> 
> tesseract image output -l grc+ell


Ah, that's a very good idea, and will indeed be useful. However, for
my use case (a script which is mostly the same, but with additions,
and an older version of the language), it would be useful to use
only one set of dictionary files (rather than, presumably, the union
of grc & ell in the above example).

I wonder if there's any good way of integrating this functionality
into tesseract; I imagine changing the dictionary files wouldn't be
a particularly unusual thing to want to do, as the mapping between
dictionaries and scripts is not going to be 1:1.

As a workaround, one could probably unpack the traineddata, remove
the dictionary files (and add different ones if appropriate), then
repack it. But ideally I think it would be good to be able to
specify different dictionary files on the command line (preferably
as UTF-8 word-per-line files, which would be converted into a DAWG
in memory if needed).
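
For what it's worth, the unpack/remove/repack workaround might look
roughly like the sketch below. It assumes the combine_tessdata and
wordlist2dawg tools from the tesseract training tools are on the PATH,
and that the dictionaries live in the word-dawg and freq-dawg
components; I haven't checked that list against every version, so
treat it as a guess. The function name plan_dictionary_swap and the
("run" / "delete") step format are made up for illustration. It only
builds the list of steps and doesn't run anything.

```python
from pathlib import Path

# Assumption: these are the dictionary components inside a traineddata
# file. Check what "combine_tessdata -u" actually unpacks for your version.
DICT_SUFFIXES = ("word-dawg", "freq-dawg")


def plan_dictionary_swap(traineddata, workdir, wordlist=None):
    """Return an ordered list of steps for replacing the dictionaries.

    Each step is ("run", argv) or ("delete", path). Nothing is executed
    here, so the plan can be inspected before acting on it.
    """
    traineddata = Path(traineddata)
    lang = traineddata.stem  # "grc" for grc.traineddata
    # combine_tessdata takes a path prefix that ends in "<lang>."
    prefix = str(Path(workdir) / lang) + "."

    # 1. Unpack every component to <workdir>/<lang>.<component>
    steps = [("run", ["combine_tessdata", "-u", str(traineddata), prefix])]

    # 2. Remove the unpacked dictionary components
    for suffix in DICT_SUFFIXES:
        steps.append(("delete", prefix + suffix))

    # 3. Optionally build a new word DAWG from a UTF-8 word-per-line file
    if wordlist is not None:
        steps.append(("run", ["wordlist2dawg", str(wordlist),
                              prefix + "word-dawg", prefix + "unicharset"]))

    # 4. Repack whatever is left into <workdir>/<lang>.traineddata
    steps.append(("run", ["combine_tessdata", prefix]))
    return steps
```

Calling plan_dictionary_swap("tessdata/grc.traineddata", "work",
wordlist="words.txt") gives the steps in order; each "run" entry can
be passed to subprocess.run() and each "delete" entry to os.remove()
once the plan looks right.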

