Is the "pdftotext" program used when extracting the OCR text layer from a PDF file?
In this book, http://fr.wikisource.org/wiki/Livre:Liste_provisoire_des_noms_destines.pdf, it seems that using "pdftotext -raw" would produce a better result than the current one.

If you download the source PDF file and run pdftotext with and without the -raw option, you will see two differences. First, some heavily boldfaced words come out as "H e l l o" without -raw and as "Hello" with -raw. Second, the column separation differs on some pages, e.g. page 81 (De Roster--Herborn), where Dyck is followed by E with -raw or by G without -raw.

The man page for pdftotext says -raw is deprecated, but I don't understand why, as it produces the better result here.

-- 
Lars Aronsson (l...@aronsson.se)
Projekt Runeberg - fri nordisk litteratur - http://runeberg.org/

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
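For anyone who wants to reproduce the comparison described in the message, here is a rough sketch in Python. It assumes pdftotext (from poppler-utils or xpdf) is on the PATH and that the PDF has been saved locally under the hypothetical name Liste_provisoire_des_noms_destines.pdf. It also assumes that "page 81" is the PDF page index, which may not match the printed page number. The helper names (pdftotext_cmd, extract_page, compare_modes) are made up for illustration; this is not how the Wikisource extraction is known to work.

```python
import difflib
import subprocess


def pdftotext_cmd(pdf_path, page, raw):
    """Build a pdftotext command line for one page, writing text to stdout."""
    # -f / -l select the first and last page; "-" as output means stdout.
    cmd = ["pdftotext", "-f", str(page), "-l", str(page)]
    if raw:
        # -raw keeps the text in content-stream order.
        cmd.append("-raw")
    cmd += [pdf_path, "-"]
    return cmd


def extract_page(pdf_path, page, raw):
    """Run pdftotext and return the extracted text of one page."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path, page, raw),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def compare_modes(pdf_path, page):
    """Return a unified diff between default and -raw output for one page."""
    default_lines = extract_page(pdf_path, page, raw=False).splitlines()
    raw_lines = extract_page(pdf_path, page, raw=True).splitlines()
    diff = difflib.unified_diff(
        default_lines, raw_lines,
        fromfile="default", tofile="raw", lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    # Hypothetical local file name; page 81 is taken from the message above.
    print(compare_modes("Liste_provisoire_des_noms_destines.pdf", 81))
```

The diff should make both effects visible at once: spaced-out bold words and any change in the order in which the columns are read.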