Re: [pdf-devel] Object layer API

jemarch Mon, 16 Feb 2009 15:19:56 -0800

Hey Michael.

Many thanks for your comments! Quite useful.


Here I address most of the questions: I should think about the rest of
them.

   > I just committed in the trunk the first draft of the object layer
   > public API. It is documented in the "Object Layer" chapter of the
   > reference manual.

   Can you post a version online that doesn't contain so many pages?
   The object layer chapter alone contains over 20 pages, which makes
   it very annoying to read -- a single downloadable file would be
   nice.

I though that the info format is quite convenient. You can generate an
info file and navigate into it using Emacs or any info reader, and you
can send suggested improvements as patches.

   pdf_obj_doc_new
    - Why not always create an info dictionary by default?

FDF files don't have an info dictionary, for instance.

    - The "&doc" parameter should be "*doc".

Fixed.

   pdf_obj_doc_open
    - Would it be better to take an open stream, rather than a filesystem
      and path?

Why? PDF documents are stored in files, one document per file,
regardless the kind of file system providing the file.

    - What's the point of header_string?
       - The client probably doesn't care about the header -- they just
         want the library to open a valid PDF.

The user may want to open non-pdf file such as a FDF file, that uses
the "%FDF-" header. Also, we may want to introduce new headers such as
"%GNUPDF-".

       - There may be more than one acceptable string, e.g.
         "%PDF-" and "%!PS-Adobe-N.n PDF-"

What about:

@item header_string
A string like @code{"%PDF-"}. If this parameter is @code{NULL} then
the library will try to recognize the header in the file.

and:

@deftypefun pdf_status_t pdf_obj_doc_get_header (pdf_obj_doc_t @var{doc}, 
pdf_char_t *...@var{header})

@table @strong
@item Parameters
@table @var
@item doc
An object document.
@item header
A pointer to a string. After a successful call to this function
@var{header} is pointing to a static @code{NULL}-terminated buffer
containing the header of the file, or @code{NULL}.
@end table
@item Returns
A PDF status value:
@table @code
@item PDF_OK
@item PDF_ERROR
@end table
@item Usage example
@example
pdf_text_t file_path;
pdf_char_t *header;
pdf_obj_doc_t doc;

/* Open a document */
pdf_text_new_from_unicode ("/foo/bar.pdf",
                           12,
                           PDF_TEXT_UTF8,
                           &file_path);

pdf_obj_doc_open (PDF_OBJ_DOC_OPEN_DO_REPAIR,
                  NULL, /* Use the default filesystem */
                  file_path,
                  NULL, /* Guess the type of the file from the header */
                  &doc);

/* Get the header of the file */
pdf_obj_doc_get_header (doc, &header);
@end example
@end table
@end deftypefun

   pdf_obj_doc_close
    - Wouldn't the handle become invalid on PDF_OK also?

Fixed.

   pdf_obj_doc_save
    - Why wouldn't you want an xref table?  It should probably be written
      by default.

The cross reference table in FDF files is optional, for instance.
But we could write down the xref table by default and provide a flag
to not write it. Sounds good for you?

    - Why should _SAVE_FULL imply _SAVE_DO_GC?  Both flags could be given
      if desired.

Agreed. Fixed.

    - If file==NULL, do we try to keep the same inode number?

Hm? The file associated with the document is a pdf_fsys open file
implemented by some fsys filesystem, so the existence of inode numbers
is a mere hipothesys at this level... I am not sure if I understand
the question.

      Is the file corrupted after an error?

It depends on the specific error. We should build a list of possible
errors returned by pdf_obj_doc_save.

    - Should there be a separate GC function instead of a flag?

Hm, I like this suggestion.

The GC phase is usually invoked just before to save the document, but
a separate call would be quite convenient. Anyway, at the same time I
would keep the flag for doc_save.

What about:

@deftypefun pdf_status_t pdf_obj_doc_gc (pdf_obj_doc_t @var{doc})

Call the garbage collection on @var{doc}, removing non-reachable
objects.

@table @strong
@item Parameters
@table @var
@var doc
An object document.
@end table
@item Returns
A PDF status value:
@table @code
@item PDF_OK
@item PDF_ERROR
@end table
@item Usage example
@example
XXX
@end example
@end table
@end deftypefun

    - Why does the client need to care about the header string?
       - If they specify an old version, do we ensure the output is
         compatible with that version?  What do we do with unrecognized
         versions?
       - I think we should let the user specify some kind of compatibility
         level, and let the library handle details like this.

I believe that the compatibility check should be done at the document
layer level. The client may want to create PDF-like files to store a
collection of objects (as a temporary storage) or something like that.

    - How do we track modifications when _SAVE_FULL is not used?

Hm, I think that a SAVE_FULL implies to discard any previous content
in the PDF file.

   [3.2.3]

   pdf_obj_doc_get_id
    - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t
      with string type.

But then that object would be associated with a document.

   pdf_obj_doc_set_dirty
       "If the dirty flag of a document is set then its contents are saved
        before the document is closed."
    - Does this mean doc_close will write a file?  That doesn't seem right.

The dirty flag is useful to track if the document needs
saving.

What is wrong with implicitly saving the document in doc_close before
to actually close the document?

   [3.3.1]

   pdf_obj_gen_t
    - Should be described as "non-negative", not "positive".

Fixed.

   [3.3.2]

   pdf_obj_copy
    - If copy_indirect is false, what happens when an indirect object is
      seen?

What about:

@item PDF_FALSE 
The indirectly referenced objects in @var{source} are
not copied to @var{dest_doc}. This means that the copied object will
lack these entries.

    - Where does the object go?  I.e., if you throw away whatever's
      returned in *dest, does dest_doc still have some reference to it?

Yes. The copied object is registered in the destination document.

    - If an object didn't have a back-reference to a document, something
      like pdf_obj_dup would be sufficient.

The current API assumes that every object is associated with a
document. If we change it to allow "free" objects, then 'pdf_obj_dup'
would be enough, right.

    - If a stream is copied, does closing the original file invalidate it?
      (if not, where's the data stored?)

No. The data is stored in a temporary file containing the stream
dictionary and the stream data.

   pdf_obj_destroy
       "This function is a nop if obj is a direct scalar type or the Null
        object."
    - Why would it be a nop?  Scalar objects need to be freed too.

My idea is to implement a pdf_obj_t variable as a struct like:

struct pdf_obj_s
{
  int type;

  union {
    pdf_i32_t number;
    pdf_real_t real;
    void *pointer; /* For direct containers and indirect objects */
  } value;
};

typedef union pdf_obj_s pdf_obj_t;

Then, direct scalar types (and the null object) would be allocated in
the stack. Thus pdf_obj_destroy would be a nop. Anyway I agree in that
it may not be wise to make the client aware of that implementation
detail.

   pdf_obj_enum
    - Non-iterable types should return an error.

We could return PDF_FALSE in that case.

    - Iterating over a stream doesn't make any sense.  If the client
      wants to iterate over its dictionary, they should provide the
      dictionary object.

Hm, considering that it is possible to get the stream dictionary, I
think that I agree in this point.

    - Will iteration stop as soon as PDF_FALSE is returned?

I would say that lazy evaluation on the value returned by the callback
is in order: it does not make a lot of sense to continue the iteration
if something went wrong. So I would say that yes.

    - Why is a callback used?  This will often be annoying to work with,
      especially given C's lack of closures (users will need to define and
      populate a struct, cast client_data to it, etc.).  It will also make
      interactions with other languages more difficult, e.g. if a client
      wants to throw an exception.

The alternative would be to use iterators... . Callbacks tend
to be somewhat inconvenient. A 'pdf_obj_iterator_t' would make it, or
we could even use a pdf_list_t to return a list of pointers to
objects, and thus we could use pdf_list_iterator_t. What do you think?

   pdf_obj_equal
    - Should be "pdf_obj_equal_p".

Yes. Fixed.

       "If they are indirect, they share the same generation number."
    - We should also verify that they point to the same object.

Yes. Fixed.

       "If they are non-scalar, they reference the same value. "
    - What does "the same value" mean?  Does obj1 have to be the same
      pointer as obj2, or just point to a similar array/dict?

The first. I changed the wording to: "If they are non-scalar, they
reference the same object.".

    - Will we compare stream contents?  This would be expensive.

No.

   pdf_obj_get_doc
    - PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too).

PDF_EINVOBJ is more explicit, but to use PDF_EBADDATA would be more
coherent with the rest of the API. I don't have strong feeling about
it.

   pdf_obj_get_type
       "The returned value should be interpreted as a enum pdf_obj_type_e."
    - Why not use that type as the return value?

Fixed.

    - enum pdf_obj_type_e doesn't include an INDIRECT type.  Do we return
      the type of whatever it references?  If so,
       - What if we need to seek to find it, and get an error?
       - What if the indirect object points to another indirect object?
         How many levels do we go before giving up?

PDF_OBJ_REF was missing. I just added it to pdf_obj_type_e.

   pdf_obj_get_id
    - When is an object ID assigned?  If the user creates an indirect
      object, but hasn't saved the file, do we already know the ID?

Yes. An object ID is allocated by the document in object-creation
time.

   pdf_obj_compressed_p
    - Why would the client care about this?

It may be convenient, but I am not able to come with an useful example
at the moment. The Acrobat SDK provides a similar function.

    - Why does the object need to know where it came from?  I'd expect this
      (i.e. the location of an object identified by an ID and generation)
      to be a property of the document.

It is a property of the document, but since the object should be
associated with a document it is possible to get it using
pdf_obj_compressed_p. We could move this function to
pdf_obj_doc_compressed_p and use the ON+GN of the object, but it would
be less convenient, I think.


   [3.3.5]
    - The document doesn't explain what object collections are, but you
      stated in your email:
   >   The object documents support the notion of "object collections",
   >   that are eventually translated into one or more linked object
   >   streams.

    - Why would the client need to manage them?

Because the underlying object streams are useful to store related
objects that will need each other. The object layer does not have
enough information to decide what objects are "related" and thus be
able to maintain the locality principle.

For instance, the document layer may decide to store all the objects
containing the contents of a single page in an "object collection",
that will translate into one (or several linked) compressed object
stream.

   [3.3.6 -- 3.3.14]

    - Why does each _new method have an indirect flag?  Another option
      would be to let the user create a direct object, and then create a
      reference to it when necessary (and we could also turn direct objects
      into indirect objects when writing, where the spec requires it).
    - If an indirect object is created, will the pdf_obj_*_value methods
      resolve the reference to return the value?
      (if so, my comments on pdf_obj_get_type apply here too)

The indirect flag is an indication to pdf_obj_*_new to "register" the
object into the document's xref.

The pdf_obj_*_value would not require to resolve the reference: the
object structure should contain its value (scalar) or a pointer to it
(non-scalar).

   pdf_obj_null_new
    - Why is the null constructor different from the rest?

There is only one null object (conceptually speaking) and it cannot be
indirect.

   pdf_obj_boolean_new
    - It seems silly to associate a boolean with a document, or allow
      indirect boolean objects if indirect nulls aren't possible.

It is perfectly possible to create an indirect boolean value in a
document, such as:

10 0 obj
true
endobj

On the other hand, the null object cannot be indirect.

   pdf_obj_name_new
    - Presumably the name is null-terminated, so this should be stated.

Fixed.

   pdf_obj_name_value, _string_value
    - It should be possible to get at the value without copying it.

Using a const pointer, you mean?

   pdf_obj_string_hex_p, _hex_set
    - I don't see any reason for this flag -- it's trivial to decide
      between escaped or hex strings when writing, and there should be no
      reason to care when reading.

How would you decide? Both representations have its advantages, and
the client should decide which one to use.


   pdf_obj_array_insert
    - Clarify that index==pdf_obj_array_size() inserts at the end.

Done.

    - I'm not sure why an object would know whether it's in a container,
      but PDF_EINVOBJ implies it does.

A direct object contained in more than one container does not make
sense.

   pdf_obj_array_size
    - How are errors handled?  If we're resolving indirect refs, they'd
      need to be returned; 

The function does not follow indirect references.

      otherwise errors are only possible when the object isn't an
      array, and we can leave that case undefined or return zero.

I changed the prototype to:

@deftypefun pdf_status_t pdf_obj_array_size (pdf_obj_t @var{array}, pdf_size_t 
@var{size})

returning PDF_EINVOBJ if ARRAY is not an array object.

   pdf_obj_array_remove
    - Is this useful? (maybe adding _find and using _remove_at would be
      better, but I'm not sure why you'd want to search within an
      array.)
 
Hm, I would also prefer the _find + _remove_at schema: it is more
general.

   pdf_obj_stream_new
    - There's no reason for indirect_p if it has to be PDF_TRUE.

Fixed.

    - source_length is redundant since attrs_dict must contain the length
      (if we have both, we'll have to explain which takes precedence and
      whether attrs_dict will be modified by the library to add a /Length
      element).

Note that the base stream @var{stm} contains unfiltered data, and that
@var{source_length} is the length in octects of the unfiltered
data. If the stream dictionary contains a filter chain, we usually
does not know the length of the filtered data until we have actually
filtered it. The Length entry in @var{attrs_dict} is the lenght of the
filtered data.

    - Maybe it should be possible to pass a callback function that will
      open and return a stream when necessary (since it may be impractical
      to open all streams ahead of time).
       - Or this type of callback could be handled at the stream layer, by
         creating a dummy stream type that executes callbacks to do the
         real work.

Since 'pdf_obj_stream_new' internally creates an intermediate
representation of the stream in disk, @var{stm} is used just to get
the contents. You can get a reading stream using
'pdf_obj_stream_open_stm'.

   pdf_obj_stream_length
    - "length" should be a pointer.

Fixed.

   The usage of "_get_" seems inconsistent, e.g. pdf_obj_get_type vs.
   pdf_obj_integer_value.

Agreed. I am fixing the API to use a more coherent schema.

I committed the agreed changes to the trunk.

-- 
Jose E. Marchesi  <jema...@gnu.org>
                  http://www.jemarch.net
GNU Project       http://www.gnu.org

Re: [pdf-devel] Object layer API

Reply via email to