> On 9 Jan 2015, at 05:33, Leonard Rosenthol <[email protected]> wrote:
>
> On 1/9/15, 12:25 AM, "John Hewson" <[email protected]> wrote:
>
>> We have some support for incremental update in PDFBox already, but I
>> don’t see any reason why that should be limited by sharing objects. A hash
>> map of COS objects in COSDocument is sufficient to track any update
>> state specific to an individual COS object in a given document and has the
>> added benefit of keeping document state out of COS object classes.
>
> Such an implementation can work for simple things, but at some point you
> will run into its limitations. But since it’s working now and no one is
> banging on your door to fix…

Could you shed some light on what you’d expect those limitations to be in
general? For example, multi-gigabyte files could present some issues. Out of
interest, what would you consider to be the definitive “tricky” cases for a
PDF parser and COS-level implementation focused on server-side usage?

>> Alternatively, should we wish to store document state inside COS objects,
>> then we would have all the information necessary to generate a meaningful
>> error should an incremental update be attempted on a COS object which
>> belongs to another document. In this case the solution is for the user to
>> clone() the relevant COS object - this feels natural.
>
> Yup - that makes sense and sounds like a good solution.
>
>
>> PDFBox doesn’t store the object number and revision in its COS object
>> classes, so that’s not a problem for us. These numbers are instead stored
>> in a hash map inside COSDocument. That means that each COS object is
>> independent of a specific COSDocument, with the exception of the
>> backing stream for a COSStream. I realise that this might be unusual.
>
> It’s all about design requirements. So far you haven’t had a requirement
> that has required you to be able to navigate the object model in such a
> way that this design has failed you. In other implementations, such as
> Acrobat/Reader, it wouldn’t work.
>
>
>
>> Currently we don’t do on-demand decryption, but if we did, then the
>> backing stream which is passed to COSStream could handle this.
>
> That would work for streams, but not for strings. I suspect that today
> you decrypt each string as it is read and the in-memory representation of
> such is always un-encrypted. You’d need to add this to strings as well if
> you wanted to enable this feature in the future. (admittedly probably
> not a requirement for server-side solutions, but a huge deal for desktop
> and mobile!)

Yes, we decrypt all strings up-front.
As you say, that’s not efficient, though we could solve the problem in the
same manner which I suggested for COSStream by giving each COSString a
backing InputStream which knows its own decryption key.

>
>> No, because the data has been erased. Calling close() on a COSDocument
>> loops through a hash map of every COS object from that document and
>> clears its contents. We’re in the process of figuring out why exactly that
>> is and if it is necessary for objects other than COSStream.
>
> Given your model, I agree that clearing the contents doesn’t seem like the
> right thing to do. But it would be useful to have a flag on the object
> about the owning document being closed.

Great - we’ll probably move ahead with doing something along those lines.

>
>> What I’m proposing is a fairly unexciting change to COSDocument’s close()
>> method, but it’s yielded a useful discussion - assuming that we’re now all
>> on the same page :)
>>
>
> Excellent discussion. Appreciate your taking the time to explain some of
> PDFBox’s inner workings to me.
>
> Leonard
>
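
For anyone following the thread, here is a rough sketch of how the ideas discussed above might fit together: update state kept in a map on the document side, a meaningful error when an update is attempted on an object owned by another document, and a "closed" flag instead of clearing contents. This is only an illustration under stated assumptions and not PDFBox code: SketchObject, SketchDocument and all of their methods are hypothetical names, and keying the map by identity (IdentityHashMap) is a choice made here rather than something taken from the discussion.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical stand-in for a COS object; NOT the PDFBox class. */
final class SketchObject {
    private final String value;
    // Assumption: the object remembers which document owns it.
    private SketchDocument owner;

    SketchObject(String value) { this.value = value; }

    String value() { return value; }
    SketchDocument owner() { return owner; }
    void setOwner(SketchDocument owner) { this.owner = owner; }

    /** True once the owning document has been closed; the contents stay intact. */
    boolean isOwnerClosed() { return owner != null && owner.isClosed(); }

    /** Returns a detached copy that another document can adopt. */
    SketchObject copy() { return new SketchObject(value); }
}

/** Hypothetical stand-in for a document; NOT the PDFBox class. */
final class SketchDocument {
    // Per-document update state lives here, keyed by object identity,
    // so it stays out of the object classes themselves.
    private final Map<SketchObject, Boolean> needsUpdate = new IdentityHashMap<>();
    private boolean closed;

    /** Adopts an unowned object and starts tracking its update state. */
    void add(SketchObject obj) {
        if (obj.owner() != null && obj.owner() != this) {
            throw new IllegalStateException(
                "object belongs to another document; copy() it first");
        }
        obj.setOwner(this);
        needsUpdate.put(obj, Boolean.FALSE);
    }

    /** Flags an owned object for the next incremental update. */
    void markForUpdate(SketchObject obj) {
        if (obj.owner() != this) {
            throw new IllegalStateException(
                "object belongs to another document; copy() it first");
        }
        needsUpdate.put(obj, Boolean.TRUE);
    }

    boolean needsUpdate(SketchObject obj) {
        return Boolean.TRUE.equals(needsUpdate.get(obj));
    }

    boolean isClosed() { return closed; }

    /** Sets a flag rather than clearing the contents of every object. */
    void close() { closed = true; }
}
```

In this sketch, sharing an object between two documents means calling copy() and adding the copy to the second document. After close(), an object can still report its value, and callers can check isOwnerClosed() before relying on anything that needs the document to be open.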
