2016-05-08 14:42 GMT+02:00 Don Osborn <d...@bisharat.net>:

> Some earlier posts in this thread made the observation that PDF is for
> presentation not archiving.

I tend to disagree. PDFs are widely used for archiving, and for that purpose it does not matter how the file was generated: it is only meant to be a facsimile, possibly with the same value as the original (printed) paper. The initial digital file is just a working draft with no legal value in most cases.
That's why PDF files can contain a digital signature: to give them the same value as the original paper. The initial digital draft has no value, even if it is easier to search. Many (many!) laws and treaties in the world are kept only as PDF, and not all of them are searchable as plain text unless some OCR has been run (often followed by correction of its output). The original papers (which have legal value) are kept in museums or official national libraries and are no longer freely accessible to the public; that's why facsimile PDFs are created to make them accessible (possibly digitally signed by the official library or some national authority).

Lots of organisations archive their legal papers only as PDF and recycle the original paper. This is authorized by national laws, provided they insert a verifiable signature certifying the date. No alteration of the content is then authorized, as these PDFs become the new original (except for adding new digital signatures, or possibly dropping some of them other than the initial dated one; the security of that initial signature may have weakened over time, in which case the legitimate right holder needs to add new, stronger signatures, and the history of signatures is kept).

Being able to search in a PDF is a distinct goal, not meant directly for archiving but for using PDFs on their own as *working* documents. For archives, however, the ability to search may be provided by separate data (with no legal standing) stored in the archive index, alongside the unaltered (and legally valid) PDF.

PDFs are not meant to be used for presentation (there are much better ways to present the content and *adapt* it to the audience or presentation medium). But presentation is also a different goal from being able to search the document.
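To make the "separate data in the archive index" idea above a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the JSON layout, the file names, and the function names `add_to_index`, `search` and `is_unaltered` are hypothetical, and the search text is assumed to come from OCR or manual transcription done elsewhere. The SHA-256 digest only detects changes to the file; it is not a legal digital signature.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_index(index_path):
    """Load the sidecar index, or return an empty one if it does not exist yet."""
    index_path = Path(index_path)
    if not index_path.exists():
        return {}
    return json.loads(index_path.read_text(encoding="utf-8"))


def add_to_index(index_path, pdf_path, search_text):
    """Store search text and a digest for a PDF in the index.

    The PDF itself is only read, never written to.
    """
    index = load_index(index_path)
    entry = {"sha256": sha256_of(pdf_path), "text": search_text}
    index[Path(pdf_path).name] = entry
    Path(index_path).write_text(
        json.dumps(index, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return entry


def search(index_path, term):
    """Return the names of archived PDFs whose index text contains the term."""
    needle = term.casefold()
    index = load_index(index_path)
    return sorted(
        name for name, entry in index.items() if needle in entry["text"].casefold()
    )


def is_unaltered(index_path, pdf_path):
    """Check that the archived PDF still matches the digest recorded in the index."""
    entry = load_index(index_path).get(Path(pdf_path).name)
    return entry is not None and entry["sha256"] == sha256_of(pdf_path)
```

The point of the design is that searching only ever touches the index file, so the archived PDF (and whatever signatures it carries) stays byte-for-byte identical.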
A PDF is just a collection of rendered pages (possibly at a limited resolution, where rendered characters may be a bit fuzzy or some non-meaningful color distinctions may be deliberately dropped), to be used "as is" and meant to be read by human eyes (even enabling accurate OCR is not a goal of this format). When producing the PDF, the human editor can choose to reduce the resolution, reduce the colorspace and so on, if this helps reduce the storage size and helps archiving, or helps protect the author's rights.

E.g. there are different PDF versions for free online editions of newspapers, where the text may be too fuzzy to be read. But there are versions for subscribers with much better quality (and possibly fewer ads), kept in archives if needed, but still not really meant to be searchable as plain text. In fact the producer may want to limit searchability so that readers have to look at the pages directly and see the embedded advertising boxes, even if these are not directly related to what is being searched for. The producer may provide only a limited plain-text index for some headings, but not for the content itself: readers have to scan it visually, so they cannot completely ignore the surrounding context.

The producer of the PDF therefore chooses among the different options, according to their goals for the document. For legal use there are some goals to follow, but these do not (most often) include the need to perform plain-text searches. This may mean that some OCR or human work will be needed later in order to index the document, but this operation may be limited by the author's rights, and users assume their own responsibility if they make a false interpretation when using only automated tools. PDFs are meant to be read and interpreted by humans, not machines.