Some thoughts/questions in your text below.

Michel

2009/9/3, Jed Rothwell <jedrothw...@gmail.com>:
> Steve uncovered a real treasure trove of stuff from EPRI! I urge everyone to
> read it. Good for Steve and good for EPRI.
>
> HOWEVER, the Acrobat files from the EPRI site are peculiar. They are bigger
> than they need to be, and they are not "searchable" (not text under image
> format). Some have text under the image which is all wrong. I guess someone
> did not know how to set the Acrobat parameters. Acrobat is an unfriendly
> program with lousy documentation so people often get it wrong.
>
> Anyway, I converted the papers to conventional text-under-image format and
> reduced the size. I am also eliminating some of the noise (speckles and
> dots) and correcting some of the worst OCR errors in the underlying text. I can
> display and correct the underlying text separately with ABBYY, and then
> reassemble the Acrobat file.

For PDFs with low-fidelity underlying OCR, in my experience, saving them as pure image files and then re-OCRing them with the latest version of Acrobat often improves the OCR quality considerably. Also, did you know you can batch-OCR any number of PDFs at a time? I think it would take only a few days of automated computer work to make your whole collection of many thousands of CF and peripheral papers searchable.

A thought regarding copyright issues: rather than seeking upload permission for every single paper, would there be a big risk in uploading everything and then removing only those papers the copyright holders ask you to remove? For those, how about functioning like a real library, where library card holders can download copyrighted material?

> I have just about finished the last paper and I
> will upload them tomorrow.
>
> I will upload a better version of the NSF/EPRI book as well as the TR-
> series.
>
> - Jed
>