I am using a servlet to stream out the bytes of a PDF file to a user's browser. No problem there, it's just streaming out a file. The problem is in dealing with what Acrobat refers to as "Linearized" files.

When served by a Netscape Enterprise Server, these "linearized" files display the first page of the PDF file in the Acrobat Reader while subsequent pages are still being loaded. When my servlet is the file source, the entire document must download before I can see the first page.

If I understand the specification correctly (relevant portions of the PDF spec appear below), the sequence that occurs is this:

1. Browser requests the PDF file.
2. Server begins delivery of the requested file over a persistent HTTP connection. (Servlet works okay here.)
3. Browser recognizes the MIME type of application/pdf and launches Acrobat Reader, passing the incoming stream to it.
4. Acrobat Reader interprets the message at the beginning of the byte stream, identifying the file as a "Linearized" PDF.
5. Based on the "size of the file, the data rate of the channel, and the overhead cost of a transaction," the Acrobat client requests that the persistent connection be dropped and then requests "chunks" of the file, having received enough information to display the first page. (Servlet fails this test.)

Does anyone have experience with this? I've looked at all the obvious problems, and the received files, their sizes, and the file length as set by res.setContentLength() are all okay.

How can the persistent connection be aborted? Is this possible with a servlet approach? Is my mental model, as derived from the PDF specification, correct?

Relevant portions of the Adobe PDF 1.3 Specification document (copyright Adobe, Inc.)
==========================================================

9.1 Introduction

"A linearized PDF file is one that has been organized in a special way to enable efficient incremental access in a network environment.
The file is valid PDF in all respects, and it is compatible with all existing viewers and other PDF applications. Enhanced viewers can recognize that a PDF file has been linearized and can take advantage of that organization to enhance viewing performance."

...

"5. When a PDF file is initially accessed (say, by following a URL hyperlink from some other document), the file type is not known to the client. Therefore, the client initiates a transaction to retrieve the entire document, then inspects the MIME tag of the response as it arrives. Only at that point is the document known to be PDF. Additionally, the length of the document becomes known at that time.

6. The client can abort a response while it is still in progress, if it decides that the remainder of the data is not of any immediate interest. How quickly the abort takes effect depends on round-trip time and server responsiveness. In HTTP, aborting the transaction requires closing the connection, which will interfere with the strategy of caching the open connection between transactions.

...

9.5.1 Opening at the first page

As indicated earlier, when a document is initially accessed, a request is issued to retrieve the entire file, starting at the beginning. Consequently, linearized PDF is organized so that all the data required to display the first page is at the beginning of the file. This includes all resources that are referenced from the first page, whether or not they are also referenced from other pages.

The first page is usually but not necessarily page 0. If the Catalog contains an OpenAction that specifies opening at some page other than page 0, that page will be the one physically located at the beginning of the document. Thus, opening a document at the default place (rather than a specific destination) requires simply waiting for the first page data to arrive; no additional transactions are required.

In an ordinary PDF viewer, opening a document requires first positioning to the end to obtain the startxref line. Since a linearized PDF file has the first page's cross-reference table at the beginning, reading the startxref line is not necessary. All that is required is to verify that the file length given in the Linearized dictionary at the beginning of the file matches the actual length of the file, indicating that no updates have been appended to the PDF file.

The Primary Hint Stream is located either before or after the First Page objects. This means that it will also be retrieved as part of the initial sequential read of the file. The client is expected to interpret and retain all the information in the hint tables. They are reasonably compact and are not designed to be obtained from the file in random pieces.

The client must now decide whether to continue reading the remainder of the document sequentially or to abort the initial transaction and access subsequent pages using separate transactions requesting byte ranges. This decision is a function of the size of the file, the data rate of the channel, and the overhead cost of a transaction."
================================================================
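
The "separate transactions requesting byte ranges" in the excerpt would presumably reach a servlet as HTTP/1.1 requests carrying a Range header (for example "Range: bytes=0-1023"), to be answered with status 206 and a Content-Range header. The following is only a minimal sketch of that idea in plain Java, with no servlet API dependency. The class and method names (ByteRangeSketch, parseRange, contentRange, copyRange) are hypothetical, it handles a single range only, and whether this is what Acrobat actually needs from the servlet is an assumption, not something confirmed here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical helper; all names here are illustrative assumptions.
final class ByteRangeSketch {

    /**
     * Parses a single-range header such as "bytes=0-1023", "bytes=500-"
     * or "bytes=-200" against a file of the given length.
     * Returns {first, last} (both inclusive), or null if the header is
     * absent, malformed, multi-range, or cannot be satisfied.
     */
    static long[] parseRange(String header, long fileLength) {
        if (header == null || !header.startsWith("bytes=") || header.indexOf(',') >= 0) {
            return null;
        }
        String spec = header.substring("bytes=".length()).trim();
        int dash = spec.indexOf('-');
        if (dash < 0) {
            return null;
        }
        String firstPart = spec.substring(0, dash).trim();
        String lastPart = spec.substring(dash + 1).trim();
        try {
            long first;
            long last;
            if (firstPart.isEmpty()) {
                // Suffix form: the final N bytes of the file.
                if (lastPart.isEmpty()) {
                    return null;
                }
                long suffix = Long.parseLong(lastPart);
                if (suffix <= 0) {
                    return null;
                }
                first = Math.max(0, fileLength - suffix);
                last = fileLength - 1;
            } else {
                first = Long.parseLong(firstPart);
                // An open-ended or overlong range is clamped to the end of the file.
                last = lastPart.isEmpty()
                        ? fileLength - 1
                        : Math.min(Long.parseLong(lastPart), fileLength - 1);
            }
            if (first < 0 || first > last || first >= fileLength) {
                return null;
            }
            return new long[] {first, last};
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Builds the value of a Content-Range response header. */
    static String contentRange(long first, long last, long fileLength) {
        return "bytes " + first + "-" + last + "/" + fileLength;
    }

    /** Copies bytes first..last (inclusive) of the file to the output stream. */
    static void copyRange(RandomAccessFile file, long first, long last, OutputStream out)
            throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = last - first + 1;
        file.seek(first);
        while (remaining > 0) {
            int n = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                break; // file is shorter than expected
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
    }
}
```

In a servlet's doGet(), one would presumably pass req.getHeader("Range") and the file length to parseRange(). If the result is non-null, the response would set status 206 (HttpServletResponse.SC_PARTIAL_CONTENT), a Content-Range header from contentRange(), and a content length of last - first + 1 before calling copyRange(). Otherwise it would send the whole file as it does now, possibly adding "Accept-Ranges: bytes" so the client knows range requests are allowed.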
Reading Linearized Acrobat PDF files
Williams Richard M (Contractor) Wed, 15 Mar 2000 22:41:03 -0800
