* Janusz S. Bień <jsb...@mimuw.edu.pl>, 2010-03-05, 06:30:
[...]
ocrodjvu does indeed crash, but on the garbage-in-garbage-out principle. If
you run ocrodjvu with the --debug option, you'll see that the resulting hOCR
files don't contain anything legible. In fact, the hOCR for page 2 also
contains some control characters, which completely break HTML parsing and
lead to the crash.

I cannot do much about this, except making the error message more
helpful.

> You can skip the faulty page and continue processing.

No, that would be wrong. I cannot (programmatically) distinguish between
exceptions caused by a faulty OCR engine and those caused by a real
ocrodjvu bug. Certainly I *don't* want to continue processing when the
latter are raised.

That said, if you insist on ignoring exceptions, you can easily achieve that with a simple shell script like:

cp in.djvu out.djvu
djvused -e remove-txt -s out.djvu
for p in $(seq 1 $(djvused -e n out.djvu))
do
    ocrodjvu -p $p --in-place --render=all --engine=cuneiform --language=pol out.djvu
done
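
If you also want to know which pages were skipped, the loop above could be wrapped roughly as follows. This is only a sketch: run_pages and ocr_page are hypothetical helper names, not part of ocrodjvu, and it assumes ocrodjvu exits with a non-zero status when it crashes on a page.

```shell
#!/bin/sh
# Hypothetical helper: run a per-page command for pages 1..N and
# report the pages for which it exited with a non-zero status.
# Usage: run_pages TOTAL COMMAND [ARGS...]
# The page number is appended as the last argument of COMMAND.
run_pages() {
    total=$1
    shift
    failed=""
    p=1
    while [ "$p" -le "$total" ]; do
        # Keep going on failure, but remember the page number.
        if ! "$@" "$p"; then
            failed="$failed $p"
        fi
        p=$((p + 1))
    done
    echo "failed pages:${failed:- none}"
}

# Hypothetical wrapper around the ocrodjvu call from the loop above.
ocr_page() {
    ocrodjvu -p "$1" --in-place --render=all --engine=cuneiform --language=pol out.djvu
}
```

It would then be called as: run_pages "$(djvused -e n out.djvu)" ocr_page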

--
Jakub Wilk

