Re: [Wikitech-l] Extension:Pdfhandler

John Vandenberg Thu, 25 Dec 2008 15:47:20 -0800

On Fri, Dec 26, 2008 at 5:40 AM, Luiz Augusto <lugu...@gmail.com> wrote:
> On Thu, Dec 25, 2008 at 3:52 PM, Ilmari Karonen <nos...@vyznev.net> wrote:
>
>> Luiz Augusto wrote:
>> >
>> > I'm asking because I have approximately 30 GB of public domain scans in
>> > .pdf format to upload to Commons over the next few months (see
>> > http://en.wikisource.org/w/index.php?oldid=928004#Royal_Society_Digital_Archive_only_for_3_Months_FREE
>> > for further information) and because I fully agree with the reasons
>> > listed at https://bugzilla.wikimedia.org/show_bug.cgi?id=11215#c3
>>
>> Assuming that these are scanned documents that haven't been vectorized,
>> have you considered converting them to DjVu format?  Not only does
>> Wikimedia currently have better support for it than PDF, but you might
>> realize some file size savings.  Apparently, there's software out there
>> to more or less automate it.


Large batches of scans should be converted to DjVu, as it is a better
format for scanned pages.  PDF support will be useful for small tasks
where the person already has a PDF (or it is already uploaded to
Commons) and doesn't want to learn lots of tools before seeing
results.  That is, PDF support will make Wikisource more accessible.
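
For the batch case, something like the following is a minimal sketch of
what I have in mind. It assumes the open-source pdf2djvu command-line
tool is installed and on the PATH, and that its -d (resolution) and -o
(output file) options behave as I understand them; check the man page
before relying on it. The function names and the 300 dpi default are my
own choices for illustration, not anything prescribed by the tool.

```python
from pathlib import Path
import shutil
import subprocess


def build_command(pdf_path, dpi=300):
    """Return the pdf2djvu argument list for one PDF.

    The .djvu output is written next to the input file.
    """
    pdf_path = Path(pdf_path)
    out_path = pdf_path.with_suffix(".djvu")
    return ["pdf2djvu", "-d", str(dpi), "-o", str(out_path), str(pdf_path)]


def convert_directory(directory, dpi=300):
    """Convert every *.pdf in `directory`, skipping files already converted."""
    if shutil.which("pdf2djvu") is None:
        raise RuntimeError("pdf2djvu not found on PATH")
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        if pdf_path.with_suffix(".djvu").exists():
            continue  # already converted on an earlier run
        subprocess.run(build_command(pdf_path, dpi), check=True)
```

Calling convert_directory("scans") would then process a whole directory
in one go, and can be re-run safely after an interruption.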

> Someone asked it on en.wikisource and I've replied with this:
> http://en.wikisource.org/w/index.php?title=Wikisource:Scriptorium&diff=prev&oldid=928130
>
> DjVu (or at least all conversion tools/configuration options that I've tried
> in the past months, including the LizardTech Document Express Enterprise
> pdf2djvu and png2djvu options) is a lossy format. If I convert a .pdf
> downloaded from Google Book Search I will get a low quality file (70 dpi or
> 150 dpi per page), but if I extract the images from the same .pdf file using
> Adobe Acrobat Pro 8 I will get a 600 dpi JPEG for each page (OCR software
> normally recommends using 300 dpi images).

My understanding is that lossy compression in DjVu is optional, and
that its lossy compression is much better than the equivalent lossy
compression in PDF.

I think it is the free PDF-to-image extraction tools that are causing
your problems.
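
One way to check what a tool really gave you is to compare the pixel
width of an extracted page image with the page width recorded in the
PDF. A PDF page is measured in points, 72 to the inch by default, so
the effective resolution is just pixels divided by inches. The page
size and pixel widths below are made-up numbers for illustration, not
measurements from any Google Book Search file.

```python
POINTS_PER_INCH = 72  # default PDF user-space units per inch


def effective_dpi(pixel_width, page_width_points):
    """Effective resolution of a page image, given the PDF page width in points."""
    page_width_inches = page_width_points / POINTS_PER_INCH
    return pixel_width / page_width_inches


# Hypothetical 6-inch-wide page (432 pt):
print(effective_dpi(3600, 432))  # image extracted at full resolution
print(effective_dpi(420, 432))   # same page after a low-resolution render
```

If the two figures differ widely for the same page, the loss happened
in the extraction or rendering step rather than in the DjVu encoding.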

>> Of course, that doesn't in any way preclude or remove the need for
>> _also_ improving our PDF support.
>
>
> Surely :)
>
>
>> But PDF, as common and useful as it
>> is, might not be the optimal format here.
>>
>
> Well, all digitized works from all libraries that I know of (in Europe,
> the United States and Brazil) are available only in .pdf format. The
> Internet Archive is the only one that makes both .pdf and .djvu available
> for the same book (the .djvu version from IA is also a low quality file,
> but it is at least delivered with high-quality OCR embedded in the .djvu
> file, thanks to some closed-source, paid OCR software [ABBYY FineReader,
> I believe]).

I have found the DjVu files from IA to be of appropriate quality,
especially for transcription purposes.  The PDFs are usually much
larger, and not much better in quality.

--
John Vandenberg

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
