[boost] Serialization Library Review

Robert Ramey Sun, 24 Nov 2002 09:39:53 -0800

Fellow Boosters:

Serialization Discussion Summary


Well, I haven't actually counted up the votes but
the consensus seems pretty clear that it shouldn't
be accepted into boost as is.

Of course I'm disappointed.

Now the question becomes whether it's possible to
make changes such that it would be acceptable.  In
order to discuss this question, I would like to
first review the objections, provide my assessment
of their validity and relate what I might be
willing to do, if anything, in order to address
them.

Below I will try to classify the objections by group
and address them this way.  This will necessitate
my paraphrasing, summarizing and/or generalizing other people's
arguments and observations.  If this doesn't
quite exactly capture your point of view, please
be patient. It's difficult to address 2 weeks of
posts on a very arcane subject in a reasonable length
summary.

I will proceed in sequence from the less controversial
to the more controversial, and will sometimes refer to emails
posted to the list.

1. Program bugs and documentation errors
========================================
The most comprehensive list of errors was provided by
Pavel Vozenilek and Gennadiy Rozental (Item 5).
http://aspn.activestate.com/ASPN/Mail/Message/boost/1442277
Alberto Barbati took me to task for the wide character
implementation.  Many minor and easily addressed errors
were pointed out and have been rolled in without my
posting any comment.  Errors include things like
unnecessary file inclusions and some extraneous
code.  Note that some useless code was included to get
some compilers to behave.  I use MSVC 7 as my main system.
In these cases comments should be included so as to
distinguish these cases from other errors.

2. Implementation Improvements
==============================
Most posts didn't address implementation very much apart
from errors detected.

a) There have been improvements suggested that would improve
the quality of the code without changing its basic
functionality. These are more effort to address but
are basically non-controversial.  The most comprehensive
list is again in the Rozental email above (Items 2, 4).  These
basically divide library files into smaller pieces so that
there is less file dependency and things compile faster.
Also Item 3 factors the header writing/reading out of
basic_[i|o]archive to make the system more suitable for
certain applications that didn't occur to me when I started.
I don't see any problem in addressing these except for the work
involved. I've already made a couple of splits.  The library was
improved but it turned out to be more effort than I anticipated.

b) The explanation of the 3 layers in the Rozental post was
very illuminating to me.  I developed the system in an
evolutionary way such that I lost sight of what I was actually
doing.  I used partial ordering as a hack around a
compiler problem and left it at that.  So I am willing
to look at this and make the layering more formal and explicit
and explain this in the documentation.  It should have been clear
from the documentation that something was going on that I
had overlooked. This was the only post that actually explained
what was missing.

3. Documentation Improvements
=============================
a) correction of errors
b) improvement in some of the explanations included.
c) explanation of the three layered implementation and how
it affects the usage of the library for users of different
compilers.
d) better explanation of why archive and serialization have
been made separate concepts and what this implies.

e) New sections
i) serialization of polymorphic pointers including the two
methods supported.
ii) Better explanation of "Large numbers of small objects"
including suggested hacks for special cases.

f) New rationale
i) why volatile types are not supported.  (system should handle
this in a more graceful manner)
ii) The "registration" issue. (see below)
iii) Why "describe" is not included. (see below)
iv) Why XML is not included. (see below)

4. New features
===============
Many posts suggested new features.

a) Many of these really focused on certain usages of the library
rather than the library itself, e.g. XDR
binary archives are easily implementable but I didn't do it. Things like
this are better implemented by someone with expertise in
the particular problem domain.

b) Some posts suggested alterations in the library interface to
permit a specific usage of the library.  These were of the character
of "required to implement bracketing for XML" etc.  My view
is that it is dangerous to alter the library interface to
accommodate speculation about future directions.  I would
much prefer that someone take a copy of the library and develop
his new "thing", making the necessary changes in the library.
The new "thing" then gets judged on its own merits taking
into account the impact that it would have on the library.
I don't think this places any extra burden on the inventor
of the new "thing", and it prevents us from making the library
harder to use in order to support features that aren't
added yet.  By the way, this exact situation has already come up.
I wanted to stay away from shared_ptr<T> issues but was forced
to confront them as the only way to guarantee exception safety
in certain instances. To implement non-intrusive serialization
of shared_ptr<T> I had to gain access to a variable classified
as private in the class shared_count.  I made changes in my
local system and tested the new system - worked great.
(See demo_shared_ptr). Then I contacted Peter Dimov to request
a change that would permit the serialization library to gain
access to this variable.  So far he hasn't consented to do so.
But it's easier to make the case that he should make the change
when I have demonstrated exactly what the utility of the change
is and that it will have the intended benefit.
(note: I notice that shared_ptr is undergoing some rework.
I am hopeful that an accommodation to permit implementation of
serialization of shared_ptr can be made.)

c) Other posts suggested features that would conflict with the
stated goals of the library and/or tie together things
that I strived to keep logically separate.  Attempts to
suggest extensions to support XML and address issues in
the "registration" system fall into this category.
Also, these included things such as mixing in other concepts
such as reflection (describe).  I feel that one of the big
reasons that the library has arrived at the point it has
is that for the most part it doesn't include stuff that
doesn't belong there.  I have strived to make it as "simple
as necessary but no simpler".  I know that this debate
can never be truly settled but I believe that many people
will mostly support this idea - except for their own pet extension.
Actually, I feel the most worthy improvements are those that make the
library smaller.  Rozental's suggestion that the archive_header
be optional fits into this category.  This encourages the
usage of the library for uses beyond data storage.

d) The one suggested new feature that has real merit in my view is
the implementation of consistency checking when building
in debug mode.  I had considered this before but it seemed
a little too much work for the initial proof of concept.
This would add extra information to the archive to
better detect cases where load/save functions are not
symmetric.  Normally it is not very difficult to detect
these errors by inspection.  But once versioning starts
it becomes easier to make a mistake that is hard to find.
Of course, when I decided to defer this, another problem
arose.  It would change the internal structure of the
archive system.  What about archives already "out there"?
That is, the archive class needs to support versioning
for the same reasons that user classes need to.  So I decided
to add this information to the archive header. This is
an interesting and useful feature but not an urgent one.
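
One way such a debug-mode check might work is to write a type tag in
front of every value and verify it on load.  The sketch below is only
an illustration under that assumption; checked_writer, checked_reader
and tag_of are hypothetical names, not part of the library, and only
int and double are covered.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical tags; only int and double are covered in this sketch.
inline const char* tag_of(int)    { return "int"; }
inline const char* tag_of(double) { return "double"; }

// Writes a type tag in front of every value.
class checked_writer {
    std::ostream& os_;
public:
    explicit checked_writer(std::ostream& os) : os_(os) {}
    template<class T>
    checked_writer& operator<<(const T& t) {
        os_ << tag_of(t) << ' ' << t << ' ';
        return *this;
    }
};

// Reads the tag back and refuses to continue on a mismatch.
class checked_reader {
    std::istream& is_;
public:
    explicit checked_reader(std::istream& is) : is_(is) {}
    template<class T>
    checked_reader& operator>>(T& t) {
        std::string tag;
        is_ >> tag;
        if (tag != tag_of(t))
            throw std::runtime_error("load/save mismatch: archive holds " + tag);
        is_ >> t;
        return *this;
    }
};
```

A load function that reads a double where the save function wrote an
int then fails at the first mismatched value instead of silently
producing garbage.  The cost is a larger archive, which is why this
would be a debug-only option.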

5. Special Problems
===================

5.1 "std::exception" - this is being studied on another thread
so I won't spend a lot of time on this here.  My view is
that attempting to "do something useful" for exceptions we
know nothing about is a hopeless and pointless quest. I
would prefer that uncaught exceptions just abort the program
with an indication of the type of exception thrown. Having
said that, this is really a small issue for me, so I
derived archive exceptions from std::exception just to
not have to argue the point.  I'm willing to leave this as
it is or use the "boost approved" method whatever that might be.

5.2 "registration" - A brief recap:

The current system invokes reading and writing of data variables
in exactly the same sequence.  Thus, when data is written
we know its type and when data is read we know its type because
we count on the rule that save and load functions access
data in exactly the same sequence.  There is no need for
"registration" of any kind.

gotcha - except when reading back polymorphic pointers.  While
reading, we know only the type of the base class but not which
variation should actually be read back from storage.  

Well, seems easy enough: each time we write/read data of a specific
type, we append a record to a table which contains a reference
to an object that invokes the proper save/load function. The table
built on reading is exactly the same table built on writing.
When it's time to write/read a polymorphic pointer we first write/read
the index in the table corresponding to the type.  Then on reading
we invoke the appropriate deserializer from the table.  Very simple
and very fast.

gotcha - except it can occur that we write/read a polymorphic pointer
of a type that has never been written/read to/from the archive before.
This could be handled during writing by adding a new entry to the table,
but, upon reading,  there is no way of knowing what type corresponds
to the last table entry.  The obvious fix - don't do that.  Just make
sure that all polymorphic pointers correspond to objects whose class
has already been recorded in the table.  How to do that?  Just serialize
a NULL pointer to the most derived class.

This is analogous to forward declaration of a class name in C++ source
code.  It's a pain, but it's a small price to pay to address the problem
compared to other methods.
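
The index scheme can be sketched in a few lines.  This is an
illustration only, not the library's code: TypeTable, forward_declare,
save_pointer and load_pointer are hypothetical names, the archive is
reduced to a text stream, and forward_declare<T>() stands in for
"serialize a NULL pointer to the most derived class".

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <typeindex>
#include <typeinfo>
#include <vector>

struct Shape {
    virtual ~Shape() {}
    virtual void save(std::ostream& os) const = 0;
    virtual void load(std::istream& is) = 0;
};
struct Circle : Shape {
    int r = 0;
    void save(std::ostream& os) const override { os << r << ' '; }
    void load(std::istream& is) override { is >> r; }
};
struct Square : Shape {
    int side = 0;
    void save(std::ostream& os) const override { os << side << ' '; }
    void load(std::istream& is) override { is >> side; }
};

// Per-archive table; writer and reader must fill it in the same order.
class TypeTable {
    std::vector<std::type_index> ids_;
    std::vector<std::function<std::unique_ptr<Shape>()> > makers_;
public:
    template<class T>
    void forward_declare() {
        ids_.push_back(std::type_index(typeid(T)));
        makers_.push_back([] { return std::unique_ptr<Shape>(new T); });
    }
    // Position of the dynamic type of s in the table.
    std::size_t index_of(const Shape& s) const {
        for (std::size_t i = 0; i < ids_.size(); ++i)
            if (ids_[i] == std::type_index(typeid(s))) return i;
        throw std::runtime_error("type was never forward declared");
    }
    std::unique_ptr<Shape> make(std::size_t i) const { return makers_.at(i)(); }
};

// Write the table index, then the object's own data.
void save_pointer(std::ostream& os, const TypeTable& t, const Shape& s) {
    os << t.index_of(s) << ' ';
    s.save(os);
}

// Read the index, create the matching derived object, then load its data.
std::unique_ptr<Shape> load_pointer(std::istream& is, const TypeTable& t) {
    std::size_t i = 0;
    is >> i;
    std::unique_ptr<Shape> p = t.make(i);
    p->load(is);
    return p;
}
```

Note that type_info is used here only to find the index on the writing
side; nothing derived from it ever reaches the archive, which holds
just a small integer.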

This does have the annoyance that somewhere in the code we have to 
write NULL pointers of each derived class of polymorphic pointers.
This is seen as an intolerable burden by several commentators.

The alternative proposed is that all types used be "registered" so that
archives can contain a class id that can be mapped to serializer/deserializer
functors.  Writing/reading polymorphic pointers would then consist of
writing/reading the class id before the class data itself.  The table
of class id <-> serializer is maintained globally in all programs
that use the serialization rather than built on the fly as data
is serialized/deserialized.
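
For comparison, a minimal sketch of what such a global table might
look like.  Everything here is hypothetical (registry, Registrar, the
string class ids and the text archive); it is meant only to show the
shape of the proposal, not any particular submission.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <string>

struct Animal {
    virtual ~Animal() {}
    virtual std::string class_id() const = 0;  // programmer-assigned id
    virtual void save(std::ostream& os) const = 0;
    virtual void load(std::istream& is) = 0;
};

typedef std::function<std::unique_ptr<Animal>()> Factory;

// The global table: class id -> factory. A function-local static avoids
// depending on the initialization order of other globals.
std::map<std::string, Factory>& registry() {
    static std::map<std::string, Factory> table;
    return table;
}

// Constructing a Registrar<T> adds T to the global table.
template<class T>
struct Registrar {
    explicit Registrar(const std::string& id) {
        registry()[id] = [] { return std::unique_ptr<Animal>(new T); };
    }
};

struct Dog : Animal {
    int age = 0;
    std::string class_id() const override { return "Dog"; }
    void save(std::ostream& os) const override { os << age << ' '; }
    void load(std::istream& is) override { is >> age; }
};
static Registrar<Dog> register_dog("Dog");  // runs before main()

// The class id goes into the archive ahead of the class data.
void save_pointer(std::ostream& os, const Animal& a) {
    os << a.class_id() << ' ';
    a.save(os);
}

std::unique_ptr<Animal> load_pointer(std::istream& is) {
    std::string id;
    is >> id;
    std::map<std::string, Factory>::iterator it = registry().find(id);
    if (it == registry().end())
        throw std::runtime_error("unregistered class id: " + id);
    std::unique_ptr<Animal> p = it->second();
    p->load(is);
    return p;
}
```

The lookup is a map search rather than a vector index, and the id
strings must be unique across every header that might be combined in
one program.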

I hope this fairly summarizes the differing views.

For purposes of this discussion, I'm going to refer to the method
currently used in the serialization library as "forward declaration"
and to the proposed method as "global registration".

Advantages of forward declaration
=================================
a) does not require permanent global data structures, all required
structures are temporary and exist only until the archive is destructed.
b) minimal requirements for the programmer.  It only requires extra
effort to "forward declare" pointers of derived types of polymorphic
base class types not otherwise previously serialized.  In practice
these are usually few in number.  The burden is similar to that of
forward declaring class names in a C++ program.
c) does not depend on a portable implementation of type_info.  type_info
is used as a convenience in implementation, but it in no way interacts
with archives or users of the library.  It is strictly a local implementation
device.
d) does not require any notion of class registration.
e) does not require another module to implement boost acceptable
class registration.
f) very fast - class ids are stored in a vector and accessed by index
g) very simple

Disadvantages of forward declaration
====================================
a) Just as in forward class name declarations, it's a pain in the neck to
have to specify in advance types you might want to serialize.

Advantages of global registration
=================================
a) you don't have to forward declare anything.  Just include the
class registration in some sort of static variable and just using
the class will make it available for serialization as a derived
type.  Thus, just including header/library code automatically
registers the required types.

Disadvantages of global registration
====================================
a) requires a global structure maintained by the program
b) requires a system for assigning type identifiers.
I believe that it has been shown that no automatic system for creating
such type identifiers can be guaranteed to always function correctly.
To be really correct, this system should permit any combination
of header files - that is, the type identifier should be guaranteed
not to conflict with any other type identifier.  In practice
this turns out not to be as easy as it sounds.  I believe it has
been shown that no system based on type_info can offer such a
guarantee.  All known systems that offer such a guarantee
require that the programmer explicitly assign a type id
to each class he might ever use.  It also turns out that for
many applications this is only a theoretical issue, as for
applications within a single organization, class name uniqueness
can be enforced and the problem can be avoided.  But it has
to be explicitly acknowledged and addressed.  This issue is
sufficiently ambiguous that calls are now being made for
a registration system as a separate task.
This is discussed at length in the following post:
c) slower, as mapping from class id -> deserializer is not a vector
but some sort of map
d) would require more documentation to explain the above

Red Herrings
============
a) it has been alleged that the forward declaration method
will preclude archives written by one program being read
by another.  I don't know what the basis for this belief was
but in any case it is demonstrably false.
http://aspn.activestate.com/ASPN/Mail/Message/1354725
b) it has been alleged that the forward declaration method
is incompatible with plug-ins.  This has been demonstrated
to be false. http://aspn.activestate.com/ASPN/Mail/Message/1384779
and example 

Is it possible to reconcile these differing points of view?
===========================================================
I was not at all prepared for the stridency of the arguments surrounding
this issue.  I truly regret using the terms "registration" and
"register_type". I believe this has contributed to what I see as a huge
amount of misunderstanding as to where type ids fit into this
system.  I also confess that I had recognized that someone
might object to the "forward declaration" so I sort of de-emphasized it
in the hope that it might pass "under the radar" with just a little
grumbling.  BIG MISTAKE. Oh well.

After much thought, I have come to conclude that it is possible
to accommodate the global register idea in a way that
does not complicate the system too much.  Basically it would
create a "function" implemented by the preprocessor that looks like

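#define SERIALIZATION_GLOBAL_REGISTER(T)\
        boost::serialization::instanciate<T>(#T);

// the preprocessor cannot overload a macro on argument count,
// so the two-argument form needs its own name
#define SERIALIZATION_GLOBAL_REGISTER_AS(T, NAME)\
        boost::serialization::instanciate<T>(#NAME);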

This would create serializer/deserializer instances
and add them to the appropriate static table with the key
as the class name or other selected name.  The coexistence
of this key scheme alongside the sequential index
scheme would complicate the serialization/deserialization
code somewhat, but it would be possible.  In general I am
reluctant to complicate a pretty clean implementation
to add a feature so that we can "have it both ways".
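
A self-contained sketch of how a macro of this kind might add entries
to a static table.  The names (global_table, instantiate and the
SKETCH_ macros) are made up for the illustration and the table stores
only a type_index where the real thing would store
serializer/deserializer objects.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>
#include <utility>

// Hypothetical static table: key -> type recorded under that key.
inline std::map<std::string, std::type_index>& global_table() {
    static std::map<std::string, std::type_index> table;
    return table;
}

// Hypothetical registration object; constructing it adds one entry.
template<class T>
struct instantiate {
    explicit instantiate(const char* key) {
        global_table().insert(
            std::make_pair(std::string(key), std::type_index(typeid(T))));
    }
};

// Helpers that give each registration object a unique variable name.
#define SKETCH_JOIN_(a, b) a##b
#define SKETCH_JOIN(a, b) SKETCH_JOIN_(a, b)

// #T turns the macro argument into a string literal, so the default
// key is the class name as written.
#define SKETCH_GLOBAL_REGISTER(T) \
    static instantiate<T> SKETCH_JOIN(sketch_reg_, __LINE__)(#T)

// Variant taking an explicit key, passed as a string literal.
#define SKETCH_GLOBAL_REGISTER_AS(T, NAME) \
    static instantiate<T> SKETCH_JOIN(sketch_reg_, __LINE__)(NAME)

struct Widget {};
struct Gadget {};

SKETCH_GLOBAL_REGISTER(Widget);
SKETCH_GLOBAL_REGISTER_AS(Gadget, "acme.Gadget");
```

The registration happens during static initialization, so merely
linking the translation unit that contains the macro call is enough
to make the type known.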

Advantages
==========
a) might satisfy some people
b) might fit well with a "describe" facility below.

Disadvantages
=============
a) extra work
b) complicates the code somewhat
c) depends upon the preprocessor - I find this distasteful but
opinions may differ.
d) it draws the notion of global registration into the
serialization system - something I would rather not do.
e) requires more documentation

5.3 "describe"

Jens Maurer's system included the notion of "describe". In this
system user class serialization would be specified as
 
struct user_type {
  int i;
  template<class Desc>
  void describe(Desc & d) { d & i; }
};

Mapped to our nomenclature, Desc corresponds to either an
input or output archive and & is an overloaded operator
that corresponds to our operator overload of << and >>

One thing immediately obvious is the specification of
only one line for both input and output, thereby eliminating
a source of errors easily committed in our system of save/load.
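
To make the idea concrete, here is a small sketch in which one
describe function drives both directions.  text_writer and text_reader
are hypothetical stand-ins for an output and an input archive; they
are not taken from either library.

```cpp
#include <cassert>
#include <sstream>

// Output side: operator& writes the value.
class text_writer {
    std::ostream& os_;
public:
    explicit text_writer(std::ostream& os) : os_(os) {}
    template<class T>
    text_writer& operator&(const T& t) { os_ << t << ' '; return *this; }
};

// Input side: the same operator reads the value.
class text_reader {
    std::istream& is_;
public:
    explicit text_reader(std::istream& is) : is_(is) {}
    template<class T>
    text_reader& operator&(T& t) { is_ >> t; return *this; }
};

struct user_type {
    int i = 0;
    double d = 0.0;
    // One member list serves both saving and loading, so the two
    // cannot drift apart. It cannot be const, since loading
    // goes through the same function.
    template<class Desc>
    void describe(Desc& desc) { desc & i & d; }
};
```

Calling describe with a text_writer saves the members and calling it
with a text_reader restores them in the same order.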

On the other hand:

a) it doesn't address base classes
b) it doesn't address versioning
c) it doesn't use const for the save function. This turns out
to be critical for detecting certain types of errors and to
avoid multiple save/load of objects referred to multiple times.

So what would it take to implement "describe" in our system?

What occurs to me is using the pre-processor to change

DESCRIBE(
        class_name_id(A [,"A"])         // default class id is class name string
        [version = 0],
        base_classes(
                B,C,...
        )
        member_variables(
                x [(xversion = 0)],     // first version where member used
                y [(yversion = 0)],
                ...
        )
)

into something like

GLOBAL_REGISTER(A, "A")
void serialize<A>::load(basic_iarchive & ar, A & a, version_type v)
{
        ar >> base_object<B>(a);
        ar >> base_object<C>(a);
        if(v >= xversion) ar >> a.x;
        if(v >= yversion) ar >> a.y;
        ...
}
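
Written out by hand, the version gating such a macro would have to
generate might look like the following.  This is a simplified sketch:
the archive is a plain text stream, the version number is stored in
front of the data, and x_version, y_version and current_version are
names chosen for the illustration.

```cpp
#include <cassert>
#include <sstream>

struct A {
    int x = 0;
    int y = 0;   // member added in version 1
};

const int x_version = 0;        // first version that stores x
const int y_version = 1;        // first version that stores y
const int current_version = 1;  // version written by this program

// Saving always writes the current layout, prefixed by its version.
void save(std::ostream& os, const A& a) {
    os << current_version << ' ' << a.x << ' ' << a.y << ' ';
}

// Loading reads only the members that existed when the data was written.
void load(std::istream& is, A& a) {
    int v = 0;
    is >> v;
    if (v >= x_version) is >> a.x;
    if (v >= y_version) is >> a.y;
}
```

An archive written by version 0 holds no y, so loading it leaves y at
its default value.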

Advantages:
a) provides save/load symmetry
b) would not complicate the serialization system
c) it's conceptually distinct from the serialization system

Disadvantages:
a) would require usage of the preprocessor
b) might be difficult to implement

Of course, there might be better ways to implement this.

Implementation of such a thing would not be a trivial exercise.
But it would not require any changes to the serialization
system itself.

Ultimately, this is about creating meta data for C++ classes.
i.e. C++ (compile-time) reflection.  For the serialization
system we implemented pieces of this.  e.g. base_object<T>(*this)
registers class hierarchy information in order to permit runtime
casting of void * .

I was reluctant to embark on a "whole new thing" (reflection) without
first having finished the thing I started out with (serialization).
Basically, I didn't feel I could do "describe" right, so I felt it should
be left for later.


5.4 "XML"

XML is going to be a lot more than just writing <classname> .. </classname>
tags in the output data.  Some issues involved are

a) Using a system of C++ reflection to get the names of variables and classes
b) generating XML schema from some sort of DESCRIBE facility as above.  This
basically inserts the reflection information into the XML archive
c) the current library keeps track of data stored by writing class-id and
object ids for classes and objects used multiple times. XML containing
this type of information might not be useful to other programs that
don't contain this serialization system which would defeat the purpose of XML.
d) The current system makes a clear distinction between "how a type
is serialized" (save/load) and how primitive types are encoded to bytes
(archive).  This is a powerful design and conceptual feature that I have
insisted on maintaining.  Some suggestions for implementing XML couple
the save/load to the archive, e.g. ar << "name1" << variable1 << 
which would largely prohibit serializing to more than one type of archive
and would break one of the most important features of the library.

I keep hearing "That's not a problem if you add virtual function this" or
"It should be easy" or even "if it's not easy, it calls into question the
whole design".  Maybe. Other proposed designs don't address this any
better than we do here.  Of course, if it is easy, we can expect an
XML implementation of serialization to be submitted soon.  I can
tell you for a fact that it would not be easy for me.

I don't think anyone can guarantee that XML is doable without actually
doing it.  So I'm disinclined to add anything in the library "because
it will permit implementation of XML".  If someone submits an XML
implementation that can be approved by boost members and that
requires some changes to the serialization and/or library then
that can be considered. But I'm disinclined to make changes
on speculation that they might be useful.

5.5 "Superfast I/O"

There have been requests to add more primitive virtual functions to
basic_[i|o]archive in order to permit increased efficiency.  Specifically,
the idea is to add for each primitive type a virtual function to permit
override of a C array of that type.  So that the native binary
archive could implement

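// N is the array length; shown as pseudocode, since a virtual
// function cannot itself be a template
basic_oarchive & basic_oarchive::operator<<(const double (&t)[N])
{
        write_binary(&t[0], sizeof(t));
        return *this;
}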

and run at maximum speed with just one virtual call.  I understand
and sympathize as I hate wasting computer time myself. But at that
point we are defining a special implementation for a combination
of data type (double) and archive type.  I'm reluctant to go down this
road.  What about vector<double>, might it benefit from special treatment
as well? Pretty soon everyone and his brother wants to add his own special
primitive type and the logical transparency is compromised. I would argue that
these are very special cases that can best be handled in the time honored way -
the software hack. So the above would be implemented for this very special
case like this

template<class T, int N>
class fast_array
{
        T t[N];
public:
        void save(basic_oarchive & ar) const
        {
                // take the fast path only for the native binary archive
                if(boarchive * b = dynamic_cast<boarchive *>(&ar))
                        b->write_binary(t, sizeof(t));
                else
                        ar << t;
        }
};

OK, it's a hack.  But I prefer to handle special cases in the user code
where required rather than starting the library on a path whereby
the logical transparency is slowly eaten away by adding more and more
special cases.
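
A runnable version of the same hack, with toy archive classes standing
in for the real ones.  out_archive, binary_out_archive and
text_out_archive are hypothetical and exist only for this sketch,
which is limited to arrays of double.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy archive interface: one virtual call per primitive value.
class out_archive {
public:
    virtual ~out_archive() {}
    virtual void put(double d) = 0;
};

// Native binary archive; also offers a bulk write.
class binary_out_archive : public out_archive {
public:
    std::vector<char> bytes;
    void put(double d) override { write_binary(&d, sizeof d); }
    void write_binary(const void* p, std::size_t n) {
        const char* c = static_cast<const char*>(p);
        bytes.insert(bytes.end(), c, c + n);
    }
};

// Text archive; no bulk write is possible here.
class text_out_archive : public out_archive {
public:
    std::string text;
    void put(double d) override { text += std::to_string(d) + ' '; }
};

template<std::size_t N>
struct fast_array {
    double t[N];
    void save(out_archive& ar) const {
        // The hack: detect the binary archive and write the whole
        // array in one call; otherwise go element by element.
        if (binary_out_archive* b = dynamic_cast<binary_out_archive*>(&ar))
            b->write_binary(t, sizeof t);
        else
            for (std::size_t i = 0; i < N; ++i) ar.put(t[i]);
    }
};
```

The special case lives entirely in fast_array; neither archive class
had to learn anything about arrays.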

6.0 Where do we go from here?
=============================

I haven't done an exact count, but it seems that the library would
not meet the approval of boost in its current form.  On the other hand,
it seems that a significant number of boosters like much about the library.

I am very interested in getting this into boost for a number of reasons:
a) It would be personally gratifying to me to earn the approval of those
whom I would like to think are my peers.
b) it would be good for my career as a contract software developer.
c) I believe this is a good implementation of a facility which would be
a valuable contribution to the boost library.
d) it has me by the throat.

Suppose I were to make the changes described in sections 1, 2, and 3 above.
Would it be approved by boost?  If I could be assured that it would be,
I would be willing to make those changes. I guess it would take me 90 days.

Suppose that the package could not be accepted into boost unless it
included "describe" functionality and guaranteed implementation of XML.
These are two almost completely separate packages which should be separated
from serialization.  I am not prepared to do the work as I see it as orthogonal
to serialization itself.  It is my belief that doing either one correctly
is a large undertaking which I am not prepared to commit to.
So if this is the position of the majority of boost members, I'm done.

If the decision were on a knife edge, I might be willing to include implementation
of "global_registration" via the macro mechanism described above.  It's not so much
the work I object to but rather the fact that it disturbs the simplicity and clarity
of the current implementation. Also I think this should be considered in conjunction
with C++ compile-time reflection which is way too big to be included.

7.0 Miscellanea
===============
Boost is really missing a key thing to finish off a package like this as well
as other boost packages.

Boost needs designated "platform gurus". One for each platform - gcc 3.1, msvc7,
msvc6, comeau, borland, metrowerks, etc.  Once a library has been accepted
into boost, the "platform gurus" would build and run the tests on the platform
for which they are responsible and report back to the library developer.
a) Given the current state of the C++ language and the compilers that implement it,
there is really no other way to guarantee that all libraries work on all platforms.
b) It would be much more efficient than having the library developer going around
trying to beg users of different compilers to help out.
c) The system would be much more efficient in that "platform gurus" would know about
the quirks of their particular platform while library developers and maintainers can't
really be expected to understand all the quirks of all the platforms.  It's the
M(libraries) * N(platforms) problem in a different form.
d) This effort would be applied only after a library is otherwise approved.
Given that the number of libraries actually approved per year is small, being a
"platform guru" wouldn't be an unreasonable burden.
e) "platform gurus" would be making a valuable contribution even though they don't have
the time to design, build, and get approved a new library contribution.
f) Such a valuable contribution would justify adding these people to that elite
group known as "boost library contributor" along with the web site picture.  The
"boost library contributor" title is incredibly valuable currency.  A little bit spent
in this direction would improve boost code a lot.  Of course, the currency is valuable
because it has been wisely spent - so far - so don't go overboard with this idea.

8.0
===

By the way, I vote for approval of the library.

Robert Ramey