[boost] Serialization library review

Matthias Troyer Sun, 24 Nov 2002 11:47:26 -0800

After spending more time with the serialization library, implementing an archive
for the XDR format and considering how I could read my old archive files using this
library instead of my own one, I can now give a more detailed review of the
serialization library. I will keep those parts that I discussed in more detail
previously short and focus on some new issues.

To start with a summary, I see problems in the following areas:

1. Compilation problems - minor problems
2. Code quality - solvable problems
3. Does not work on all platforms - solvable problem
4. Interface design: there are some show-stoppers here for now
a) primitive types: code is not portable at the moment
b) performance: need improved methods for large data sets
c) binary archives: design problems in current version
d) small objects: optimization could easily be achieved by a traits class
instead of reimplementing all the std-library serializaing for all
small object classes individually
e) serialization template specialization versus free functions?
5. Overall design: I want a better factorization of the library into
a basic library and separate support for pointers. Also, one needs to
be able to override archive preamble, and functions called for
i) new class encountered ii) start/end of an object serialization
reading the version number, ...

I value all the work that Robert put into this library and am willing to help
improve it further. During this review I had some fruitful exchange of comments
with Robert, which solved all of the compilation problem. I thus hope that it
will be possible to get the library into a better shape soon. For the moment
my overall vote is however to have another round of review after the interface
and implementations have been improved. I vote for urging Robert to improve the
library (and am willing to contribute to the improvements), but have to vote a clear
NO to this version since I want to have another review of an improved version.

Below I will elaborate in more details on the issues summarized above:

1. Compilation problems
=======================

Most compilation problems have been fixed in a revised version serialization7.zip
that Robert sent me last week. I appreciate his efforts in getting these things
fixed quickly. There is just one minor and one major problem left as far as I can see: In boost/void_cast.hpp there is a typename missing on line 148:
is_polymorphic<typename remove_const<FROM>::type >,
^^^^^^^^

The shared_ptr demo does not compile:

boost/shared_ptr.hpp:297: `boost::detail::shared_count boost::shared_ptr<A>::pn' is private
demo_shared_ptr.cpp:183: within this context

Why do you want to access a private variable here?

Otherwise I have not encountered any problems

2. Code quality
===============

I do not want to go into details at this stage since I believe that a redesign
of some aspects of the library is needed first. However code fragments such as:
line 95-96 of archive.cpp seem unacceptable to me:

// note breaking a rule here - is this a problem on some platform
is.read(const_cast<char *>(s.data()), size);

3. Does not work on all platforms
=================================

Running the test program on MacOS X 10.2 (using gcc 3.1) gives:

*** testing native binary archive
../../boost/serialization/archive.hpp:986: failed assertion `is.good()'
Abort

It thus seems there are problems with the binary streams. Note that under
Darwin (FreeBSD in general?), the binary specification is just ignored as
can be read on the man page for fopen:

The mode string can also include the letter ``b'' either as a third char-
acter or as a character between the characters in any of the two-charac-
ter strings described above. This is strictly for compatibility with
ISO/IEC 9899:1990 (``ISO C89'') and has no effect; the ``b'' is ignored.

These problems need to be sorted out, but I do not consider them urgent since
a redesign of the binary archives is urgently needed beforehand.

4a. Interface design: primitive types
=====================================

There are two questions here: a discussion which types to use as primitive types
and problems an incompatibilities of the current version

4a.i) which primitve types to use
---------------------------------

We should discuss whether to use short, int, long ... as the primitive types
or int8_t, int16_t, int32_t, int64_t. The latter makes it easier to write
portable archives, the former seems more natural. I can accept both choices but
we should not mix the two as is done now

4a.ii) problems with the current primitive types
------------------------------------------------

There are two problems:

* A serialization of bool is missing - easy to fix

* The code will not compile on platforms where long is 64-bit:

virtual basic_oarchive & operator<<(long _Val) = 0;
virtual basic_oarchive & operator<<(int64_t _Val) = 0;

The signature of these two functions is identical when int64_t is typedef-ed
to be long. See discussion under 4a.i)

4b. Interface design: performance: need improved methods for large data sets
======================================================================== ====

As mentioned in previous posts, additional functions e.g. load_array and save_array
need to be added to allow efficient serialization of large data sets. The
default version could just use operator<< or operator>> as in:

virtual void save_array(const int* p, std::size_t n) {
for (std::size_t i=0;i<n;++i)
*this << p[i];
}

and this would thus not incur any extra coding work for people not interested.
Serialization of containers such as std::vector, or ublas or mtl vectors
and matrices can make use of this extra function transparent to the user so that
the interface would also not become harder to understand for the library user.

4c. Interface design: binary archives
=====================================

I see two problems that can easily be fixed:

4c.i) Problems with the current version
---------------------------------------

Currently the library wants to protect me from problems with binary
incompatibilities by checking if the sizes of the primitive types are
the same on the platform on which I read or wrote. I believe this to be
a misfeature. Consider two platforms

A: 32-bit long, 64-bit long long, int64_t typedef-ed as long long
B: 64-bit long, int64_t typedef-ed as long

Currently when I try to read an archive from platform A on platform B, the library
aborts because of incompatible primitive types. However, as a power use, knowing about
portability problems I did not use long or long long in my codes, but always int64_t,
and should thus have no problems (ignoring byte order issues). The library should
thus allow me to read the file although the primitive types are different!

Considering this problem, which we face daily (we transfer files between
Linux PCs, Macintosh, SUN and Alpha workstations, Cray, HP, Hitachi and Fujitsu
supercomputers) has made us choose int*_t as the primitive types for serialization
in our library (which is in use since 1994). See 4.a.i).

4c.ii) Change of the binary archive class design
------------------------------------------------

I propose a change to the design of the native binary archives. While implementing
operator<< to just call write_binary is common to all native binary archive classes,
the specific implementation of write_binary can be different. I thus propose to
factor out the operator<< implementations into a new base class basic_obarchive,
which still contains write_binary as pure virtual function. From this class we can derive:

stream_obarchive: serializing into a stream as done now
file_obarchive: serialization into a file using fwrite
buffer_obarchive: serialization into a memory buffer using e.g. memcpy
...

and similar for the input archives.

4d. Interface design: small objects
===================================

I have mentioned this in a previous post. Instead of requiring the user to
reimplement the serialization of standard containers for all small object types
for which the versioning and pointer system should be bypassed, a traits class
can be added and the optimized serialization of all containers of small objects
implemented in the library. Note that the traits class needs to be specialized
only for those objects for which the user wants to optimize serialization, while
no effort is required at all if the standard serialization method is to be used.

4e. Interface design: serialization template specialization versus free functions?
======================================================================== ==========

There are two ways to implement non-intrusive serialization. One is via
specialization of the serialization class, the other via free functions called
e.g. serialization_{save,load,version}.

I myself would give preference to free functions but can live with either variant.
The current library is however not consistent since

* serialization of normal classes goes via specialization of the serialization<T>
class

* serialization of template classes goes via overloading of the free function
serialization_detail::save_template(), ...

This is unacceptable and a consistent method should be found.

5. Overall design:
==================

Finally, before being able to vote yes to an inclusion in boost I would
need to see some changes to the global design of the library.

5.a) Factor out the pointer serialization features
--------------------------------------------------

It should be straightforward to factor out the serialization of pointers
and to split the library into archives for the serialization of basic types,
and an add-on for the serialization of pointers. That way users who do not
need the elaborate pointer serialization mechanism can do so and not incur
the performance penalty (in terms of memory and speed) of to this feature, while
in cases when it is needed we can add it on to he archive.

5.b) Allow overriding of preamble, etc.
---------------------------------------

I would like to have more control over some aspects that are currently hardcoded
into the library:

* writing/reading the preamble
* obtaining the version number of a class
* starting/finishing writing an object
* a new type is encountered

The motivation is very simple: We have hundreds of gigabytes of data lying around
in tens of thousands of files that could easily be read by the serialization archive
if there were not too small differences:

i) I wrote a different preamble
ii) I only wrote one version number for all classes in the archive instead of separate
version numbers for each class
iii) no information was written when a new class was encountered

Since otherwise the interface is nearly identical (many classes contain a load
and a save function, albeit with a different name for the archive classes), changing
all my codes over to a boost::serialization library would be easy if it weren't for
the three issues above.

CONCLUSIONS:
============

Since these are major changes I would like to see a new review after they
are implemented and thus vote NO for now. However I am willing to help Robert with
implementing the changes, improving the library, and am willing to discuss further.

Best regards,

Matthias

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost

[boost] Serialization library review

Reply via email to