Bug#572522: ocrodjvu: new problem with cuneiform engine

Jakub Wilk Fri, 05 Mar 2010 03:32:28 -0800

* Janusz S. Bień <jsb...@mimuw.edu.pl>, 2010-03-05, 06:30:
[...]

ocrodjvu indeed crashes, but on the garbage-in-garbage-out principle. If
you run ocrodjvu with the --debug option, you'll see that resulting hOCR
files don't contain anything legible. In fact, the hOCR for page 2 also
contains some control characters, which completely break HTML parsing,
leading to a crash.


I cannot do much about this, except making the error message more
helpful.


> You can skip the faulty page and continue processing.


No, that would be wrong. I cannot (programmatically) distinguish between
exceptions caused by a faulty OCR engine and those caused by a real
ocrodjvu bug. Certainly I *don't* want to continue processing when the
latter ones are raised.

That said, if you insist on ignoring exceptions, you can easily achieve that with a simple shell script like:


cp in.djvu out.djvu
# -s saves the file; without it the removal of the text layer is not written
djvused -e remove-txt -s out.djvu
for p in $(seq 1 $(djvused -e n out.djvu))
do
    ocrodjvu -p "$p" --in-place --render=all --engine=cuneiform --language=pol out.djvu
done
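
The loop above silently moves on when a page fails. If you also want to
know which pages ended up without OCR, a variant along these lines might
help. This is only a sketch: ocr_each_page is a hypothetical helper name
(not part of ocrodjvu), and it assumes that ocrodjvu exits with a non-zero
status when it crashes on a page.

```shell
# Sketch: OCR each page separately and report the pages that failed.
# ocr_each_page is a hypothetical helper, not something ocrodjvu provides.
ocr_each_page() {
    file=$1
    failed=""
    # "djvused -e n" prints the number of pages in the document.
    count=$(djvused -e n "$file") || return 2
    p=1
    while [ "$p" -le "$count" ]
    do
        # Assumption: a crash on this page yields a non-zero exit status.
        # The page is recorded and the loop continues with the next one.
        if ! ocrodjvu -p "$p" --in-place --render=all --engine=cuneiform --language=pol "$file"
        then
            failed="$failed $p"
        fi
        p=$((p + 1))
    done
    if [ -n "$failed" ]
    then
        echo "pages without OCR:$failed" >&2
        return 1
    fi
}
```

After copying the file and removing the old text layer as before, you
would call it as "ocr_each_page out.djvu". Keep in mind that this still
cannot tell a faulty OCR engine from a real ocrodjvu bug; it only lists
the pages worth a closer look.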

--
Jakub Wilk
