On Fri, Dec 26, 2008 at 5:40 AM, Luiz Augusto <lugu...@gmail.com> wrote:
> On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen <nos...@vyznev.net> wrote:
>
>> Luiz Augusto wrote:
>> >
>> > I'm asking because I have approximately 30 GB of public domain scans in
>> > .pdf format to upload to Commons in the coming months (see
>> > http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
>> > for further information) and because I fully agree with the reasons
>> > listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
>>
>> Assuming that these are scanned documents that haven't been vectorized,
>> have you considered converting them to DjVu format? Not only does
>> Wikimedia currently have better support for it than PDF, but you might
>> realize some file size savings. Apparently, there's software out there
>> to more or less automate it.

Large batches of scans should be converted to DjVu, as it is a better
format. PDF support will be useful for small tasks where the person
already has a PDF (or it is already uploaded to Commons) and doesn't
want to learn lots of tools before seeing results. That is, PDF support
will make Wikisource more accessible.

> Someone asked about it on en.wikisource and I replied with this:
> http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130
>
> DjVu (or at least all conversion tools/configuration options that I've tried
> in the past months, including the LizardTech Document Express Enterprise
> pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
> downloaded from Google Book Search I get a low-quality file (70 dpi or
> 150 dpi per page), but if I extract the images from the same .pdf file using
> Adobe Acrobat Pro 8 I get a 600 dpi JPEG for each page (OCR software
> normally recommends using 300 dpi images).

My understanding is that the compression is optional, and that DjVu's
lossy compression is much better than the equivalent lossy compression
in PDF. I think it is the free PDF-to-image extraction tools that are
causing your problems.

>> Of course, that doesn't in any way preclude or remove the need for
>> _also_ improving our PDF support.
>
> Surely :)
>
>> But PDF, as common and useful as it
>> is, might not be the optimal format here.
>
> Well, all digitized works from all libraries that I know of (from Europe,
> the United States and Brazil) are available only in .pdf format. The
> Internet Archive is the only one to make both .pdf and .djvu available for
> the same book (the .djvu version from IA is also a low-quality file, but
> at least it is delivered with high-quality OCR embedded in the .djvu file,
> thanks to some closed-source, paid OCR software [Abbyy FineReader, I believe]).

I have found the DjVu files from IA to be of appropriate quality,
especially for transcription purposes. The PDFs are usually much
larger and not much better in quality.

--
John Vandenberg

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l