> On 9 Jan 2015, at 05:33, Leonard Rosenthol <[email protected]> wrote:
> 
> On 1/9/15, 12:25 AM, "John Hewson" <[email protected]> wrote:
> 
> 
>> We have some support for incremental update in PDFBox already, but I
>> don’t see any reason why that should be limited by sharing objects. A hash
>> map of COS objects in COSDocument is sufficient to track any update
>> state specific to an individual COS object in a given document and has the
>> added benefit of keeping document state out of COS object classes.
> 
> Such an implementation can work for simple things, but at some point you 
> will run into its limitations.  But since it’s working now and no one is 
> banging on your door to fix…

Could you shed some light on what you’d expect those limitations to be
in general? e.g. Multi-gigabyte files could present some issues.

Out of interest, what would you consider to be the definitive “tricky” cases
for a PDF parser and COS-level implementation focused on server-side usage?

>> Alternatively, should we wish to store document state inside COS objects,
>> then we would have all the information necessary to generate a meaningful
>> error should an incremental update be attempted on a COS object which
>> belongs to another document. In this case the solution is for the user to
>> clone() the relevant COS object - this feels natural.
> 
> Yup - that makes sense and sounds like a good solution.
> 
> 
>> PDFBox doesn’t store the object number and revision in its COS object
>> classes, so that’s not a problem for us. These numbers are instead stored 
>> in a hash map inside COSDocument. That means that each COS object is
>> independent of a specific COSDocument, with the exception of the
>> backing stream for a COSStream. I realise that this might be unusual.
> 
> It’s all about design requirements.  So far you haven’t had a requirement 
> that has required you to be able to navigate the object model in such a 
> way that this design has failed you.  In other implementations, such as 
> Acrobat/Reader, it wouldn’t work. 
> 
> 
> 
>> Currently we don’t do on-demand decryption, but if we did, then the
>> backing stream which is passed to COSStream could handle this.
> 
> That would work for streams, but not for strings.  I suspect that today 
> you decrypt each string as it is read and the in-memory representation of 
> such is always un-encrypted.  You’d need to add this to strings as well if 
> you wanted to enable this feature in the future.   (admittedly probably 
> not a requirement for server-side solutions, but a huge deal for desktop 
> and mobile!)

Yes, we decrypt all strings up-front. As you say, that’s not efficient, though
we could solve the problem in the same manner which I suggested for
COSStream by giving each COSString a backing InputStream which knows
its own decryption key.
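
A minimal sketch of that idea, to make it concrete. None of this is existing
PDFBox API: LazyDecryptedString is a hypothetical stand-in for a COSString
that keeps the bytes as read from the file and only decrypts them on first
access. It assumes AES in CBC mode and that the per-object key and IV have
already been derived by the security handler.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.GeneralSecurityException;

// Hypothetical class for illustration only; not part of PDFBox.
final class LazyDecryptedString {
    private final byte[] encrypted; // bytes exactly as read from the file
    private final byte[] key;       // per-object key, assumed already derived
    private final byte[] iv;        // initialisation vector, assumed already known
    private byte[] plain;           // cached plaintext, filled on first access

    LazyDecryptedString(byte[] encrypted, byte[] key, byte[] iv) {
        this.encrypted = encrypted.clone();
        this.key = key.clone();
        this.iv = iv.clone();
    }

    // The "backing InputStream which knows its own decryption key".
    private InputStream openDecrypted() throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(new ByteArrayInputStream(encrypted), cipher);
    }

    // Decrypts on the first call and returns a copy of the cached result afterwards.
    synchronized byte[] getBytes() throws IOException, GeneralSecurityException {
        if (plain == null) {
            try (InputStream in = openDecrypted()) {
                plain = in.readAllBytes();
            }
        }
        return plain.clone();
    }
}
```

Strings which are never read are then never decrypted, at the cost of holding
the key material alongside each string.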

> 
>> No, because the data has been erased. Calling close() on a COSDocument
>> loops through a hash map of every COS object from that document and
>> clears its contents. We’re in the process of figuring out why exactly that
>> is and if it is necessary for objects other than COSStream. 
> 
> Given your model, I agree that clearing the contents doesn’t seem like the 
> right thing to do.  But it would be useful to have a flag on the object 
> about the owning document being closed.

Great - we’ll probably move ahead with doing something along those lines.
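
Roughly, and only as a sketch of one possible shape: the Sketch* classes below
are hypothetical and are not the real COSDocument/COSBase/COSStream. The
assumption is that close() marks each object instead of clearing it, so
in-memory values stay readable while anything that needs the backing file
fails with a clear error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a COS object; not PDFBox API.
class SketchObject {
    private final String value;
    private boolean ownerClosed;

    SketchObject(String value) { this.value = value; }

    void markOwnerClosed() { ownerClosed = true; }
    boolean isOwnerClosed() { return ownerClosed; }

    // In-memory contents survive the owning document being closed.
    String getValue() { return value; }
}

// Hypothetical stand-in for an object which depends on the backing file.
class SketchStream extends SketchObject {
    SketchStream(String value) { super(value); }

    String read() {
        if (isOwnerClosed()) {
            throw new IllegalStateException("owning document has been closed");
        }
        return getValue();
    }
}

// Hypothetical stand-in for the document which owns the objects.
class SketchDocument implements AutoCloseable {
    private final List<SketchObject> objects = new ArrayList<>();

    <T extends SketchObject> T add(T object) {
        objects.add(object);
        return object;
    }

    // Flags every object rather than erasing its contents.
    @Override
    public void close() {
        for (SketchObject object : objects) {
            object.markOwnerClosed();
        }
    }
}
```

With this shape, a string or name read from a closed document can still be
copied into another document, and only stream reads are rejected.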

> 
>> What I’m proposing is a fairly unexciting change to COSDocument’s close()
>> method, but it’s yielded a useful discussion - assuming that we’re now all
>> on the same page :)
>> 
> 
> Excellent discussion.  Appreciate your taking the time to explain some of 
> PDFBox’s inner workings to me.
> 
> Leonard
> 
