Hello,
thanks for the hocr2pdf tool; it's really useful and I use it a lot. I
have now discovered that hocr2pdf seems to convert PNG images to JPEG
before embedding them, and I would like to suggest that it not do so.
I reduce the file size of scanned documents (TIFFs) with:
convert document.tiff -level 25% -colors 64 document.png
Since the documents are mostly black and white, such a file is a lot
smaller than the corresponding JPEG, and the compression is even lossless.
After OCR'ing, when I use hocr2pdf to create a PDF, the image is
converted to JPEG, however, and the file becomes larger. This can be
seen by running pdfimages on the PDF. PDF does support losslessly
compressed images (Flate/Deflate streams, the same compression that PNG
uses), so it would save me a lot of disk space if that were used
instead of the conversion to JPEG.
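As a rough illustration of why Deflate suits this kind of scan, here is
a small Python sketch using only the standard zlib module. The page it
builds is made up (make_page, WIDTH and HEIGHT are hypothetical
stand-ins, not my real documents), so the numbers say nothing about
hocr2pdf itself. It only shows that a mostly white page with a few dark
bars shrinks a lot and comes back byte for byte.

```python
import zlib

# Hypothetical stand-in for a scanned page: 8-bit grayscale, mostly white,
# with a few black "text" bars. Real scans will compress differently.
WIDTH, HEIGHT = 400, 300


def make_page(width, height):
    """Build raw pixel data: white background with a dark bar every 20 rows."""
    rows = []
    for y in range(height):
        row = bytearray([255]) * width              # white background
        if y % 20 < 3:                              # three dark rows per band
            row[40:width - 40] = bytes(width - 80)  # black pixels (value 0)
        rows.append(bytes(row))
    return b"".join(rows)


raw = make_page(WIDTH, HEIGHT)
packed = zlib.compress(raw, 9)      # Deflate, as used by PNG and PDF Flate streams
restored = zlib.decompress(packed)

print(f"raw: {len(raw)} bytes, deflated: {len(packed)} bytes")
print("lossless round trip:", restored == raw)
```

A JPEG of the same page would not round-trip exactly, which is the
other reason I would prefer the lossless path.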
My command lines are:
tesseract document.tiff document -l deu hocr
hocr2pdf -i document.png -o document.pdf < document.html
pdfimages -list document.pdf
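To check the encoding without reading the table by eye, something like
the following Python sketch could be used. It assumes the table printed
by "pdfimages -list" has a header row containing a column named "enc"
followed by a line of dashes, as poppler's pdfimages prints on my
system; other versions may differ. The function names image_encodings
and list_encodings are my own, not part of any tool.

```python
import subprocess


def image_encodings(listing):
    """Return the 'enc' column value for each image row of a pdfimages -list table."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    if "enc" not in header:
        raise ValueError("no 'enc' column found in listing header")
    enc_index = header.index("enc")
    encodings = []
    for ln in lines[1:]:
        if set(ln.strip()) <= {"-", " "}:   # skip the dashed separator line
            continue
        fields = ln.split()
        if len(fields) > enc_index:
            encodings.append(fields[enc_index])
    return encodings


def list_encodings(pdf_path):
    """Run pdfimages -list on pdf_path and return the encoding of each image."""
    result = subprocess.run(
        ["pdfimages", "-list", pdf_path],
        capture_output=True, text=True, check=True,
    )
    return image_encodings(result.stdout)

# Example (requires pdfimages to be installed):
#   print(list_encodings("document.pdf"))
```

For a PDF produced by the commands above, I would expect this to report
"jpeg" for the embedded page image.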
Regards,
Marcel