Reading Linearized Acrobat PDF files

Williams Richard M (Contractor) Wed, 15 Mar 2000 22:41:03 -0800
I am using a servlet to stream out the bytes of a PDF file to a user's
browser. No problem there, it's just streaming out a file. The problem is in
dealing with what Acrobat refers to as "Linearized" files. When served by a
Netscape Enterprise server, these "linearized" files display the first page
of the PDF file in the Acrobat Reader while subsequent pages are still being
loaded. When displayed using my servlet as the file source, the entire
document must download before I can see the first page.

If I understand the specification correctly (relevant portions of the PDF
spec appear below), the sequence that occurs is this:

1. Browser requests the PDF file.
2. Server begins delivery of the requested file over a persistent HTTP
connection. (Servlet works okay here)
3. Browser recognizes the MIME type application/pdf and launches Adobe
Acrobat Reader, passing the incoming stream to it.
4. Acrobat Reader interprets the message at the beginning of the byte
stream, identifying the file as a "Linearized" PDF.
5. Based on the "size of the file, the data rate of the channel, and the
overhead cost of a transaction," the Acrobat client, having received enough
information to display the first page, requests that the persistent
connection be dropped and then requests "chunks" of the file. (Servlet
fails this test)
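
The "chunks" in step 5 would presumably arrive as HTTP/1.1 byte-range
requests, i.e. a "Range: bytes=..." header on a new request. Whether missing
range support is what separates the servlet from the Netscape server is an
assumption; nothing in this post confirms it. Below is a minimal sketch of
resolving such a header against a known file length. The class ByteRange and
its methods are hypothetical names, it handles only a single range, and it
returns null for anything it cannot use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: resolves a single-range HTTP Range header against a file length. */
final class ByteRange {
    final long start; // first byte offset, inclusive
    final long end;   // last byte offset, inclusive

    // Accepts "bytes=A-B", "bytes=A-" and "bytes=-N"; multi-range headers do not match.
    private static final Pattern SINGLE_RANGE = Pattern.compile("^bytes=(\\d*)-(\\d*)$");

    private ByteRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /** Number of bytes covered by this range. */
    long length() {
        return end - start + 1;
    }

    /** Value suitable for a Content-Range response header. */
    String contentRange(long fileLength) {
        return "bytes " + start + "-" + end + "/" + fileLength;
    }

    /**
     * Returns the resolved range, or null if the header is absent, malformed,
     * multi-range, or unsatisfiable. Simplification: a real server would
     * distinguish "no range" (send everything) from "unsatisfiable" (status 416).
     */
    static ByteRange parse(String header, long fileLength) {
        if (header == null || fileLength <= 0) {
            return null;
        }
        Matcher m = SINGLE_RANGE.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String first = m.group(1);
        String last = m.group(2);
        try {
            if (first.isEmpty()) {
                if (last.isEmpty()) {
                    return null; // "bytes=-" carries no information
                }
                long suffix = Long.parseLong(last); // "bytes=-N": the last N bytes
                if (suffix == 0) {
                    return null;
                }
                return new ByteRange(Math.max(0, fileLength - suffix), fileLength - 1);
            }
            long start = Long.parseLong(first);
            if (start >= fileLength) {
                return null; // starts past the end of the file
            }
            long end = last.isEmpty()
                    ? fileLength - 1
                    : Math.min(Long.parseLong(last), fileLength - 1);
            return end < start ? null : new ByteRange(start, end);
        } catch (NumberFormatException e) {
            return null; // offsets too large for a long
        }
    }
}
```

In a servlet, one possible use is to call
ByteRange.parse(req.getHeader("Range"), fileLength) and, when the result is
non-null, answer with status 206 (HttpServletResponse.SC_PARTIAL_CONTENT), a
Content-Range header built from contentRange(), and only the requested
bytes. Sending "Accept-Ranges: bytes" on the initial full response tells the
client that range requests are supported.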

Does anyone have experience with this? I've looked at all the obvious
problems, and the received files, their sizes, and the file length as set by
res.setContentLength() are all okay.

How can the persistent connection be aborted? Is this possible with a
servlet approach?

Is my mental model as derived from the PDF specification correct?


Relevant portions of the Adobe PDF 1.3 Specification document (copyright
Adobe, Inc.)
==========================================================
9.1 Introduction
A linearized PDF file is one that has been organized in a special way to
enable efficient incremental access in a network environment. The file is
valid PDF in all respects, and it is compatible with all existing viewers
and other PDF applications. Enhanced viewers can recognize that a PDF file
has been linearized and can take advantage of that organization to enhance
viewing performance."
...
...
...
"5. When a PDF file is initially accessed (say, by following a URL
hyperlink from some other document), the file type is not known to the
client. Therefore, the client initiates a transaction to retrieve the
entire document, then inspects the MIME tag of the response as it arrives.
Only at that point is the document known to be PDF. Additionally, the
length of the document becomes known at that time.

6. The client can abort a response while it is still in progress, if it
decides that the remainder of the data is not of any immediate interest.
How quickly the abort takes effect depends on round-trip time and server
responsiveness. In HTTP, aborting the transaction requires closing the
connection, which will interfere with the strategy of caching the open
connection between transactions.
...
...
...
9.5.1 Opening at the first page
As indicated earlier, when a document is initially accessed, a request is
issued to retrieve the entire file, starting at the beginning.
Consequently, linearized PDF is organized so that all the data required to
display the first page is at the beginning of the file. This includes all
resources that are referenced from the first page, whether or not they are
also referenced from other pages.

The first page is usually but not necessarily page 0. If the Catalog
contains an OpenAction that specifies opening at some page other than page
0, that page will be the one physically located at the beginning of the
document. Thus, opening a document at the default place (rather than a
specific destination) requires simply waiting for the first page data to
arrive; no additional transactions are required.

In an ordinary PDF viewer, opening a document requires first positioning to
the end to obtain the startxref line. Since a linearized PDF file has the
first page's cross-reference table at the beginning, reading the startxref
line is not necessary. All that is required is to verify that the file
length given in the Linearized dictionary at the beginning of the file
matches the actual length of the file, indicating that no updates have been
appended to the PDF file.

The Primary Hint Stream is located either before or after the First Page
objects. This means that it will also be retrieved as part of the initial
sequential read of the file. The client is expected to interpret and retain
all the information in the hint tables. They are reasonably compact and are
not designed to be obtained from the file in random pieces.

The client must now decide whether to continue reading the remainder of the
document sequentially or to abort the initial transaction and access
subsequent pages using separate transactions requesting byte ranges. This
decision is a function of the size of the file, the data rate of the
channel, and the overhead cost of a transaction.
================================================================
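
The length check described in 9.5.1 can also be run on the server side, for
example to confirm that a file being served is still an intact linearized
PDF. The sketch below makes some assumptions: the key names /Linearized and
/L come from the PDF specification's linearization parameter dictionary
rather than from the excerpt above, the caller passes in enough leading
bytes to contain that whole dictionary, and LinearizedCheck with its methods
are hypothetical names. It is a text scan, not a real PDF parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compares the declared /L length with the actual file length. */
final class LinearizedCheck {
    // "/L" followed by whitespace and an integer; "/Linearized" cannot match
    // because whitespace is required directly after "/L".
    private static final Pattern LENGTH_ENTRY = Pattern.compile("/L\\s+(\\d+)");

    /** Returns the declared file length, or -1 if no linearization dictionary is found. */
    static long declaredLength(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary content cannot break decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        int marker = text.indexOf("/Linearized");
        if (marker < 0) {
            return -1;
        }
        int begin = text.lastIndexOf("<<", marker); // start of the enclosing dictionary
        int end = text.indexOf(">>", marker);       // end of the enclosing dictionary
        if (begin < 0 || end < 0) {
            return -1;
        }
        Matcher m = LENGTH_ENTRY.matcher(text.substring(begin, end));
        if (!m.find()) {
            return -1;
        }
        try {
            return Long.parseLong(m.group(1));
        } catch (NumberFormatException e) {
            return -1; // value too large for a long
        }
    }

    /** True if the head declares a linearized file whose length matches the actual length. */
    static boolean isIntactLinearized(byte[] head, long actualLength) {
        long declared = declaredLength(head);
        return declared >= 0 && declared == actualLength;
    }
}
```

A mismatch would suggest that updates were appended after linearization, in
which case a viewer would be expected to fall back to reading the file the
ordinary way.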
