[Wikisource-l] pdftotext

Lars Aronsson Thu, 22 Nov 2012 01:31:52 -0800

Is the "pdftotext" program used when extracting
the OCR text layer from a PDF file?


For this book,
http://fr.wikisource.org/wiki/Livre:Liste_provisoire_des_noms_destines.pdf
it seems that "pdftotext -raw" would produce
a better result than the current text layer.

If you download the source PDF file and run
pdftotext with and without the -raw option, you
will see two differences. First, some heavily
boldfaced words come out letter-spaced without -raw
(H e l l o) but intact with -raw (Hello).
Second, the column separation differs on some pages,
e.g. page 81 (De Roster--Herborn), where Dyck
is followed by E (with -raw) or G (without -raw).
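
For anyone who wants to repeat the comparison, here is a
minimal Python sketch. It assumes pdftotext is on the PATH
and relies only on its -f, -l and -raw options, plus "-" as
the output name to write to stdout. The function names are
hypothetical, and the PDF page number that holds printed
page 81 is an assumption you would have to check yourself.

```python
import difflib
import subprocess


def pdftotext_cmd(pdf_path, page, raw):
    """Build the pdftotext command line for one page, written to stdout."""
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if raw:
        cmd.append("-raw")
    # A final "-" asks pdftotext to write the text to stdout.
    return cmd + [pdf_path, "-"]


def extract_page(pdf_path, page, raw):
    """Run pdftotext and return the extracted text of one page."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page, raw),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def diff_modes(default_text, raw_text):
    """Return a unified diff between the default and the -raw output."""
    return list(difflib.unified_diff(
        default_text.splitlines(),
        raw_text.splitlines(),
        fromfile="default",
        tofile="-raw",
        lineterm="",
    ))


# Hypothetical usage (file name and page number are placeholders):
#   default = extract_page("book.pdf", 81, raw=False)
#   raw = extract_page("book.pdf", 81, raw=True)
#   print("\n".join(diff_modes(default, raw)))
```

Lines prefixed with "-" in the diff come from the default
mode and lines prefixed with "+" from -raw, so both the
letter-spaced words and the changed column order should show
up as paired removals and additions.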

The man page for pdftotext says -raw is deprecated,
but I don't understand why, since it produces the
better result here.


--
  Lars Aronsson (l...@aronsson.se)
  Projekt Runeberg - fri nordisk litteratur - http://runeberg.org/


_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
