Hey Michael. Many thanks for your comments! Quite useful.
Here I address most of the questions: I should think about the rest of them. > I just committed in the trunk the first draft of the object layer > public API. It is documented in the "Object Layer" chapter of the > reference manual. Can you post a version online that doesn't contain so many pages? The object layer chapter alone contains over 20 pages, which makes it very annoying to read -- a single downloadable file would be nice. I though that the info format is quite convenient. You can generate an info file and navigate into it using Emacs or any info reader, and you can send suggested improvements as patches. pdf_obj_doc_new - Why not always create an info dictionary by default? FDF files don't have an info dictionary, for instance. - The "&doc" parameter should be "*doc". Fixed. pdf_obj_doc_open - Would it be better to take an open stream, rather than a filesystem and path? Why? PDF documents are stored in files, one document per file, regardless the kind of file system providing the file. - What's the point of header_string? - The client probably doesn't care about the header -- they just want the library to open a valid PDF. The user may want to open non-pdf file such as a FDF file, that uses the "%FDF-" header. Also, we may want to introduce new headers such as "%GNUPDF-". - There may be more than one acceptable string, e.g. "%PDF-" and "%!PS-Adobe-N.n PDF-" What about: @item header_string A string like @code{"%PDF-"}. If this parameter is @code{NULL} then the library will try to recognize the header in the file. and: @deftypefun pdf_status_t pdf_obj_doc_get_header (pdf_obj_doc_t @var{doc}, pdf_char_t *...@var{header}) @table @strong @item Parameters @table @var @item doc An object document. @item header A pointer to a string. After a successful call to this function @var{header} is pointing to a static @code{NULL}-terminated buffer containing the header of the file, or @code{NULL}. @end table @item Returns A PDF status value: @table @code @item PDF_OK @item PDF_ERROR @end table @item Usage example @example pdf_text_t file_path; pdf_char_t *header; pdf_obj_doc_t doc; /* Open a document */ pdf_text_new_from_unicode ("/foo/bar.pdf", 12, PDF_TEXT_UTF8, &file_path); pdf_obj_doc_open (PDF_OBJ_DOC_OPEN_DO_REPAIR, NULL, /* Use the default filesystem */ file_path, NULL, /* Guess the type of the file from the header */ &doc); /* Get the header of the file */ pdf_obj_doc_get_header (doc, &header); @end example @end table @end deftypefun pdf_obj_doc_close - Wouldn't the handle become invalid on PDF_OK also? Fixed. pdf_obj_doc_save - Why wouldn't you want an xref table? It should probably be written by default. The cross reference table in FDF files is optional, for instance. But we could write down the xref table by default and provide a flag to not write it. Sounds good for you? - Why should _SAVE_FULL imply _SAVE_DO_GC? Both flags could be given if desired. Agreed. Fixed. - If file==NULL, do we try to keep the same inode number? Hm? The file associated with the document is a pdf_fsys open file implemented by some fsys filesystem, so the existence of inode numbers is a mere hipothesys at this level... I am not sure if I understand the question. Is the file corrupted after an error? It depends on the specific error. We should build a list of possible errors returned by pdf_obj_doc_save. - Should there be a separate GC function instead of a flag? Hm, I like this suggestion. The GC phase is usually invoked just before to save the document, but a separate call would be quite convenient. Anyway, at the same time I would keep the flag for doc_save. What about: @deftypefun pdf_status_t pdf_obj_doc_gc (pdf_obj_doc_t @var{doc}) Call the garbage collection on @var{doc}, removing non-reachable objects. @table @strong @item Parameters @table @var @var doc An object document. @end table @item Returns A PDF status value: @table @code @item PDF_OK @item PDF_ERROR @end table @item Usage example @example XXX @end example @end table @end deftypefun - Why does the client need to care about the header string? - If they specify an old version, do we ensure the output is compatible with that version? What do we do with unrecognized versions? - I think we should let the user specify some kind of compatibility level, and let the library handle details like this. I believe that the compatibility check should be done at the document layer level. The client may want to create PDF-like files to store a collection of objects (as a temporary storage) or something like that. - How do we track modifications when _SAVE_FULL is not used? Hm, I think that a SAVE_FULL implies to discard any previous content in the PDF file. [3.2.3] pdf_obj_doc_get_id - Rather than pdf_char_t*/pdf_size_t pairs, we could use a pdf_obj_t with string type. But then that object would be associated with a document. pdf_obj_doc_set_dirty "If the dirty flag of a document is set then its contents are saved before the document is closed." - Does this mean doc_close will write a file? That doesn't seem right. The dirty flag is useful to track if the document needs saving. What is wrong with implicitly saving the document in doc_close before to actually close the document? [3.3.1] pdf_obj_gen_t - Should be described as "non-negative", not "positive". Fixed. [3.3.2] pdf_obj_copy - If copy_indirect is false, what happens when an indirect object is seen? What about: @item PDF_FALSE The indirectly referenced objects in @var{source} are not copied to @var{dest_doc}. This means that the copied object will lack these entries. - Where does the object go? I.e., if you throw away whatever's returned in *dest, does dest_doc still have some reference to it? Yes. The copied object is registered in the destination document. - If an object didn't have a back-reference to a document, something like pdf_obj_dup would be sufficient. The current API assumes that every object is associated with a document. If we change it to allow "free" objects, then 'pdf_obj_dup' would be enough, right. - If a stream is copied, does closing the original file invalidate it? (if not, where's the data stored?) No. The data is stored in a temporary file containing the stream dictionary and the stream data. pdf_obj_destroy "This function is a nop if obj is a direct scalar type or the Null object." - Why would it be a nop? Scalar objects need to be freed too. My idea is to implement a pdf_obj_t variable as a struct like: struct pdf_obj_s { int type; union { pdf_i32_t number; pdf_real_t real; void *pointer; /* For direct containers and indirect objects */ } value; }; typedef union pdf_obj_s pdf_obj_t; Then, direct scalar types (and the null object) would be allocated in the stack. Thus pdf_obj_destroy would be a nop. Anyway I agree in that it may not be wise to make the client aware of that implementation detail. pdf_obj_enum - Non-iterable types should return an error. We could return PDF_FALSE in that case. - Iterating over a stream doesn't make any sense. If the client wants to iterate over its dictionary, they should provide the dictionary object. Hm, considering that it is possible to get the stream dictionary, I think that I agree in this point. - Will iteration stop as soon as PDF_FALSE is returned? I would say that lazy evaluation on the value returned by the callback is in order: it does not make a lot of sense to continue the iteration if something went wrong. So I would say that yes. - Why is a callback used? This will often be annoying to work with, especially given C's lack of closures (users will need to define and populate a struct, cast client_data to it, etc.). It will also make interactions with other languages more difficult, e.g. if a client wants to throw an exception. The alternative would be to use iterators... . Callbacks tend to be somewhat inconvenient. A 'pdf_obj_iterator_t' would make it, or we could even use a pdf_list_t to return a list of pointers to objects, and thus we could use pdf_list_iterator_t. What do you think? pdf_obj_equal - Should be "pdf_obj_equal_p". Yes. Fixed. "If they are indirect, they share the same generation number." - We should also verify that they point to the same object. Yes. Fixed. "If they are non-scalar, they reference the same value. " - What does "the same value" mean? Does obj1 have to be the same pointer as obj2, or just point to a similar array/dict? The first. I changed the wording to: "If they are non-scalar, they reference the same object.". - Will we compare stream contents? This would be expensive. No. pdf_obj_get_doc - PDF_EBADDATA could be used instead of PDF_EINVOBJ (elsewhere too). PDF_EINVOBJ is more explicit, but to use PDF_EBADDATA would be more coherent with the rest of the API. I don't have strong feeling about it. pdf_obj_get_type "The returned value should be interpreted as a enum pdf_obj_type_e." - Why not use that type as the return value? Fixed. - enum pdf_obj_type_e doesn't include an INDIRECT type. Do we return the type of whatever it references? If so, - What if we need to seek to find it, and get an error? - What if the indirect object points to another indirect object? How many levels do we go before giving up? PDF_OBJ_REF was missing. I just added it to pdf_obj_type_e. pdf_obj_get_id - When is an object ID assigned? If the user creates an indirect object, but hasn't saved the file, do we already know the ID? Yes. An object ID is allocated by the document in object-creation time. pdf_obj_compressed_p - Why would the client care about this? It may be convenient, but I am not able to come with an useful example at the moment. The Acrobat SDK provides a similar function. - Why does the object need to know where it came from? I'd expect this (i.e. the location of an object identified by an ID and generation) to be a property of the document. It is a property of the document, but since the object should be associated with a document it is possible to get it using pdf_obj_compressed_p. We could move this function to pdf_obj_doc_compressed_p and use the ON+GN of the object, but it would be less convenient, I think. [3.3.5] - The document doesn't explain what object collections are, but you stated in your email: > The object documents support the notion of "object collections", > that are eventually translated into one or more linked object > streams. - Why would the client need to manage them? Because the underlying object streams are useful to store related objects that will need each other. The object layer does not have enough information to decide what objects are "related" and thus be able to maintain the locality principle. For instance, the document layer may decide to store all the objects containing the contents of a single page in an "object collection", that will translate into one (or several linked) compressed object stream. [3.3.6 -- 3.3.14] - Why does each _new method have an indirect flag? Another option would be to let the user create a direct object, and then create a reference to it when necessary (and we could also turn direct objects into indirect objects when writing, where the spec requires it). - If an indirect object is created, will the pdf_obj_*_value methods resolve the reference to return the value? (if so, my comments on pdf_obj_get_type apply here too) The indirect flag is an indication to pdf_obj_*_new to "register" the object into the document's xref. The pdf_obj_*_value would not require to resolve the reference: the object structure should contain its value (scalar) or a pointer to it (non-scalar). pdf_obj_null_new - Why is the null constructor different from the rest? There is only one null object (conceptually speaking) and it cannot be indirect. pdf_obj_boolean_new - It seems silly to associate a boolean with a document, or allow indirect boolean objects if indirect nulls aren't possible. It is perfectly possible to create an indirect boolean value in a document, such as: 10 0 obj true endobj On the other hand, the null object cannot be indirect. pdf_obj_name_new - Presumably the name is null-terminated, so this should be stated. Fixed. pdf_obj_name_value, _string_value - It should be possible to get at the value without copying it. Using a const pointer, you mean? pdf_obj_string_hex_p, _hex_set - I don't see any reason for this flag -- it's trivial to decide between escaped or hex strings when writing, and there should be no reason to care when reading. How would you decide? Both representations have its advantages, and the client should decide which one to use. pdf_obj_array_insert - Clarify that index==pdf_obj_array_size() inserts at the end. Done. - I'm not sure why an object would know whether it's in a container, but PDF_EINVOBJ implies it does. A direct object contained in more than one container does not make sense. pdf_obj_array_size - How are errors handled? If we're resolving indirect refs, they'd need to be returned; The function does not follow indirect references. otherwise errors are only possible when the object isn't an array, and we can leave that case undefined or return zero. I changed the prototype to: @deftypefun pdf_status_t pdf_obj_array_size (pdf_obj_t @var{array}, pdf_size_t @var{size}) returning PDF_EINVOBJ if ARRAY is not an array object. pdf_obj_array_remove - Is this useful? (maybe adding _find and using _remove_at would be better, but I'm not sure why you'd want to search within an array.) Hm, I would also prefer the _find + _remove_at schema: it is more general. pdf_obj_stream_new - There's no reason for indirect_p if it has to be PDF_TRUE. Fixed. - source_length is redundant since attrs_dict must contain the length (if we have both, we'll have to explain which takes precedence and whether attrs_dict will be modified by the library to add a /Length element). Note that the base stream @var{stm} contains unfiltered data, and that @var{source_length} is the length in octects of the unfiltered data. If the stream dictionary contains a filter chain, we usually does not know the length of the filtered data until we have actually filtered it. The Length entry in @var{attrs_dict} is the lenght of the filtered data. - Maybe it should be possible to pass a callback function that will open and return a stream when necessary (since it may be impractical to open all streams ahead of time). - Or this type of callback could be handled at the stream layer, by creating a dummy stream type that executes callbacks to do the real work. Since 'pdf_obj_stream_new' internally creates an intermediate representation of the stream in disk, @var{stm} is used just to get the contents. You can get a reading stream using 'pdf_obj_stream_open_stm'. pdf_obj_stream_length - "length" should be a pointer. Fixed. The usage of "_get_" seems inconsistent, e.g. pdf_obj_get_type vs. pdf_obj_integer_value. Agreed. I am fixing the API to use a more coherent schema. I committed the agreed changes to the trunk. -- Jose E. Marchesi <jema...@gnu.org> http://www.jemarch.net GNU Project http://www.gnu.org