Fellow Boosters: Serialization Discussion Summary
Well, I haven't actually counted up the votes, but the consensus seems pretty clear that it shouldn't be accepted into boost as is. Of course I'm disappointed. Now the question becomes whether it's possible to make changes such that it would be acceptable.

In order to discuss this question, I would like to first review the objections, provide my assessment of their validity, and relate what I might be willing to do, if anything, in order to address them.

Below I will try to classify the objections by group and address them this way. This will necessitate my paraphrasing, summarizing and/or generalizing other people's arguments and observations. If this doesn't quite exactly capture your point of view, please be patient. It's difficult to address 2 weeks of posts on a very arcane subject in a reasonable length summary.

I will proceed in sequence from the less controversial to the more controversial, and will sometimes refer to emails posted to the list.

1. Program bugs and documentation errors
========================================

The most comprehensive lists of errors were provided by Pavel Vozenilek and Gennadiy Rozental (Item 5).

http://aspn.activestate.com/ASPN/Mail/Message/boost/1442277

Alberto Barbati took me to task for the wide character implementation.

Many minor and easily addressed errors were pointed out and have been rolled in without my posting any comment. Errors include things like inclusion of unnecessary files and some extraneous code.

Note that some useless code was included to get some compilers to behave. I use MSVC 7 as my main system. In these cases comments should be included so as to distinguish them from other errors.

2. Implementation Improvements
==============================

Most posts didn't address implementation very much apart from errors detected.

a) There have been improvements suggested that would improve the quality of the code without changing its basic functionality.
These are more effort to address but are basically non-controversial. The most comprehensive list is again in the Rozental email above (Items 2, 4). These basically divide library files into smaller pieces so that there is less file dependency and things compile faster. Also, item (3) factors the header writing/reading out of basic_[i|o]archive to make the system more suitable for certain applications that didn't occur to me when I started. I don't see any problem in addressing these except for the work involved. I've already made a couple of splits. The library was improved, but it turned out to be more effort than I anticipated.

b) The explanation of the 3 layers in the Rozental post was very illuminating to me. I developed the system in an evolutionary way such that I lost sight of what I was actually doing. I used partial ordering as a hack around a compiler problem and left it at that. So I am willing to look at this and make the layering more formal and explicit and explain this in the documentation. It should be clear from the documentation that something was going on that I had overlooked. This was the only post that actually explained what was missing.

3. Documentation Improvements
=============================

a) correction of errors
b) improvement in some of the explanations included.
c) explanation of the three layered implementation and how it affects the usage of the library for users of different compilers.
d) better explanation of why archive and serialization have been made separate concepts and what this implies.
e) New sections
   i) serialization of polymorphic pointers including the two methods supported.
   ii) Better explanation of "Large numbers of small objects" including suggested hacks for special cases.
f) New rationale
   i) why volatile types are not supported. (system should handle this in a more graceful manner)
   ii) The "registration" issue.
   (see below)
   iii) Why "describe" is not included (see below)
   iv) Why XML is not included (see below)

4. New features
===============

Many posts suggested new features.

a) Many of these really focused on certain usages of the library that were provided rather than the library itself, e.g. XDR binary archives are easily implementable but I didn't do it. Things like this are better implemented by someone with expertise in the particular problem domain.

b) Some posts suggested alterations in the library interface to permit a specific usage of the library. These were of the character "required to implement bracketing for XML" etc. My view is that it is dangerous to alter the library interface to accommodate speculation about future directions. I would much prefer that someone take a copy of the library and develop his new "thing", making necessary changes in the library. The new "thing" then gets judged on its own merits, taking into account the impact that it would have on the library. I don't think this places any extra burden on the inventor of the new "thing", and it prevents us from making the library harder to use in order to support features that aren't added yet.

By the way, this exact situation has already come up. I wanted to stay away from shared_ptr<T> issues but was forced to confront them as the only way to guarantee exception safety in certain instances. To implement non-intrusive serialization of shared_ptr<T> I had to gain access to a variable classified as private in the class shared_count. I made changes in my local system and tested the new system - worked great. (See demo_shared_ptr.) Then I contacted Peter Dimov to request a change that would permit the serialization library to gain access to this variable. So far he hasn't consented to do so. But it's easier to make the case that he should make the change when I have demonstrated exactly what the utility of the change is and that it will have the intended benefit. (note: I notice that shared_ptr is undergoing some rework.
I am hopeful that an accommodation to permit implementation of serialization of shared_ptr can be made.)

c) Other posts suggested features that would conflict with the stated goals of the library and/or tie together things that I strived to keep logically separate. Attempts to suggest extensions to support XML and address issues in the "registration" system fall into this category. These also included things such as mixing in other concepts such as reflection (describe).

I feel that one of the big reasons that the library has arrived at the point it has is that for the most part it doesn't include stuff that doesn't belong there. I have strived to make it "as simple as necessary but no simpler". I know that this debate can never be truly settled, but I believe that many people will mostly support this idea - except for their own pet extension.

Actually, I feel the most worthy improvements are those that make the library smaller. Rozental's suggestion that the archive_header be optional fits into this category. This encourages the usage of the library for uses beyond data storage.

d) The one suggested new feature that has real merit in my view is the implementation of consistency checking when building in debug mode. I had considered this before, but it seemed a little too much work for the initial proof of concept. This would add extra information to the archive to better detect cases where load/save functions are not symmetric. Normally it is not very difficult to detect these errors by inspection. But once versioning starts, it becomes easier to make a mistake that is hard to find.

Of course, when I decided to defer this, another problem arose. It would change the internal structure of the archive system. What about archives already "out there"? That is, the archive class needed to support versioning for the same reasons that user classes needed to. So I decided to add this information to the archive header.

This is an interesting and useful feature but not an urgent one.
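To give a feel for what such debug-mode consistency checking might look like, here is a minimal sketch. It is not the library's code: checked_oarchive, checked_iarchive, type_tag and mismatch_detected are hypothetical names invented for this illustration, and the text format is an assumption. Each value is written with a tag naming its type, and the load side refuses to continue when the tag it finds differs from the type it was asked to read.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical per-type tags; a real system would need one for every primitive.
template<class T> struct type_tag;
template<> struct type_tag<int>    { static const char * name() { return "int"; } };
template<> struct type_tag<double> { static const char * name() { return "double"; } };

// Hypothetical output archive: writes "<tag> <value> " for each item.
class checked_oarchive {
    std::ostream & os_;
public:
    explicit checked_oarchive(std::ostream & os) : os_(os) {}
    template<class T>
    checked_oarchive & operator<<(const T & t) {
        os_ << type_tag<T>::name() << ' ' << t << ' ';
        return *this;
    }
};

// Hypothetical input archive: verifies the stored tag before reading the value.
class checked_iarchive {
    std::istream & is_;
public:
    explicit checked_iarchive(std::istream & is) : is_(is) {}
    template<class T>
    checked_iarchive & operator>>(T & t) {
        std::string tag;
        is_ >> tag;
        if (tag != type_tag<T>::name())
            throw std::runtime_error("load expected "
                + std::string(type_tag<T>::name())
                + " but archive holds " + tag);
        is_ >> t;
        return *this;
    }
};

// Saves (int, double) but loads (int, int); returns true if that is caught.
inline bool mismatch_detected() {
    std::stringstream ss;
    checked_oarchive oa(ss);
    oa << 1 << 2.5;
    checked_iarchive ia(ss);
    int i = 0, j = 0;
    try {
        ia >> i;          // tags agree, value is read
        assert(i == 1);
        ia >> j;          // archive holds a double here: asymmetric load
    } catch (const std::runtime_error &) {
        return true;
    }
    return false;
}
```

In a real build the tags would presumably be written only in debug mode, which is exactly why the archive header would have to record whether they are present.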
5. Special Problems
===================

5.1 "std::exception" - this is being studied on another thread so I won't spend a lot of time on this here. My view is that attempting to "do something useful" for exceptions we know nothing about is a hopeless and pointless quest. I would prefer that uncaught exceptions just abort the program with an indication of the type of exception thrown. Having said that, this is really a small issue for me, so I derived archive exceptions from std::exception just to not have to argue the point. I'm willing to leave this as it is or use the "boost approved" method, whatever that might be.

5.2 "registration" - A brief recap:

The current system invokes reading and writing of data variables in exactly the same sequence. Thus, when data is written we know its type, and when data is read we know its type because we count on the rule that save and load functions access data in exactly the same sequence. There is no need for "registration" of any kind.

gotcha - except when reading back polymorphic pointers. While reading, we know only the type of the base class but not which variation should actually be read back from storage.

Well, seems easy enough: each time we write/read data of a specific type, we append a record to a table which contains a reference to an object that invokes the proper save/load function. The table built on reading is exactly the same as the table built on writing. When it's time to write/read a polymorphic pointer, we first write/read the index in the table corresponding to the type. Then on reading we invoke the appropriate deserializer from the table. Very simple and very fast.

gotcha - except it can occur that we write/read a polymorphic pointer of a type that has never been written/read to/from the archive before. This could be handled during writing by adding a new entry to the table, but, upon reading, there is no way of knowing what type corresponds to the last table entry.

The obvious fix - don't do that.
Just make sure that all polymorphic pointers correspond to objects whose class has already been recorded in the table. How to do that? Just serialize a NULL pointer to the most derived class. This is analogous to forward declaration of a class name in C++ source code. It's a pain, but it's a small price to pay to address the problem compared to other methods.

This does have the annoyance that somewhere in the code we have to write NULL pointers of each derived class of polymorphic pointers. This is seen as an intolerable burden by several commentators.

The alternative proposed is that all types used be "registered" so that archives can contain a class id that can be mapped to serializer/deserializer functors. Writing/reading polymorphic pointers would then consist of writing/reading the class id before the class data itself. The table of class id <-> serializer is maintained globally in all programs that use the serialization library rather than built on the fly as data is serialized/deserialized.

I hope this fairly summarizes the differing views. For purposes of this discussion, I'm going to refer to the method currently used in the serialization library as "forward declaration" and the proposed method as "global registration".

Advantages of forward declaration
=================================

a) does not require permanent global data structures; all required structures are temporary and exist only until the archive is destructed.
b) minimal requirements for the programmer. It only requires extra effort to "forward declare" pointers of derived types of polymorphic base classes not otherwise previously serialized. In practice these are usually few in number. The burden is similar to that of forward declaring class names in a C++ program.
c) does not depend on a portable implementation of type_info. type_info is used as a convenience in implementation, but it in no way interacts with archives or users of the library. It is strictly a local implementation device.
d) does not require any notion of class registration.
e) does not require another module to implement boost acceptable class registration.
f) very fast - class ids are stored in a vector and accessed by index
g) very simple

Disadvantages of forward declaration
====================================

a) Just as in forward class name declarations, it's a pain in the neck to have to specify in advance types you might want to serialize.

Advantages of global registration
=================================

a) you don't have to forward declare anything. Just include the class registration in some sort of static variable and just using the class will make it available for serialization as a derived type. Thus, just including header/library code automatically registers the required types.

Disadvantages of global registration
====================================

a) requires a global structure maintained by the program
b) requires a system for assigning type identifiers. I believe that it has been shown that no automatic system for creating such type identifiers can be guaranteed to always function correctly. To be really correct, this system should permit any combination of header files - that is, the type identifier should be guaranteed not to conflict with any other type identifier. In practice this turns out not to be as easy as it sounds. I believe it has been shown that no system based on type_info can offer such a guarantee. All known systems that offer such a guarantee require that the programmer explicitly assign a type id to each class he might ever use. It also turns out that for many applications this is only a theoretical issue, as for applications within a single organization, class name uniqueness can be enforced and the problem can be avoided. But it has to be explicitly acknowledged and addressed. This issue is sufficiently ambiguous that calls are now being made for a registration system as a separate task.
This is discussed at length in the following post:
c) slower, as the mapping from class id -> deserializer is not a vector but some sort of map
d) would require more documentation to explain the above

Red Herrings
============

a) it has been alleged that the forward declaration method will preclude archives written by one program being read by another. I don't know what the basis for this belief was, but in any case it is demonstrably false.

http://aspn.activestate.com/ASPN/Mail/Message/1354725

b) it has been alleged that the forward declaration method is incompatible with plug-ins. This has been demonstrated to be false.

http://aspn.activestate.com/ASPN/Mail/Message/1384779 and example

Is it possible to reconcile these differing points of view ?
============================================================

I was not at all prepared for the stridency of the arguments surrounding this issue. I truly regret using the terms "registration" and "register_type". I believe this has contributed to what I see as a huge amount of misunderstanding as to where type ids fit into this system. I also confess that I had recognized that someone might object to the "forward declaration", so I sort of de-emphasized it in the hope that it might pass "under the radar" with just a little grumbling. BIG MISTAKE. Oh well.

After much thought, I have come to conclude that it is possible to accommodate the global register idea in a way that does not complicate the system too much. Basically it would create a "function" implemented by the preprocessor that looks like

    // key defaults to the class name; #T turns the argument into a string
    #define SERIALIZATION_GLOBAL_REGISTER(T)\
        boost::serialization::instanciate<T>(#T);

    // key chosen by the user; the preprocessor can't overload a macro on
    // argument count, so this form needs its own name (name not settled)
    #define SERIALIZATION_GLOBAL_REGISTER_AS(T, NAME)\
        boost::serialization::instanciate<T>(NAME);

This would create serializer/deserializer instances and add them to the appropriate static table with the key as the class name or other selected name.
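As a rough illustration of how a macro of this kind might fill a static table keyed by name, here is a minimal sketch. Everything in it is an assumption made for the example rather than anything in the library: base, derived, registry, registrar, create_by_name and SKETCH_GLOBAL_REGISTER are hypothetical names, and a plain factory function stands in for the serializer/deserializer instances.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical polymorphic base; stands in for whatever a pointer is read back as.
struct base {
    virtual ~base() {}
    virtual std::string who() const = 0;
};

typedef base * (*factory_fn)();

// The global table. A function-local static sidesteps problems with the
// order in which static objects in different files are initialized.
inline std::map<std::string, factory_fn> & registry() {
    static std::map<std::string, factory_fn> table;
    return table;
}

template<class T>
base * make() { return new T; }

// Constructing a registrar<T> adds T to the table under the given key.
template<class T>
struct registrar {
    explicit registrar(const char * name) {
        assert(name != 0);
        registry()[name] = &make<T>;
    }
};

// Hypothetical macro: one static registrar per class, keyed by the class name.
#define SKETCH_GLOBAL_REGISTER(T) \
    static registrar<T> registrar_for_##T(#T);

struct derived : base {
    std::string who() const { return "derived"; }
};
SKETCH_GLOBAL_REGISTER(derived)

// What reading a class id from an archive would boil down to:
// look the key up in the map and invoke whatever is stored there.
inline base * create_by_name(const std::string & name) {
    std::map<std::string, factory_fn>::iterator it = registry().find(name);
    if (it == registry().end())
        return 0;   // unknown key: nothing was registered under this name
    return it->second();
}
```

Note that the lookup goes through a map keyed by string rather than a vector index, and that the table lives for the whole program rather than for the life of one archive. Those are the two costs listed under the disadvantages of global registration.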
The coexistence of this key scheme alongside the sequential index scheme would complicate the serialization/deserialization code somewhat, but it would be possible. In general I am reluctant to complicate a pretty clean implementation to add a feature so that we can "have it both ways".

Advantages
==========

a) might satisfy some people
b) might fit well with a "describe" facility below.

Disadvantages
=============

a) extra work
b) complicates the code somewhat
c) depends upon the preprocessor - I find this distasteful but opinions may differ.
d) it draws the notion of global registration into the serialization system - something I would rather not do.
e) requires more documentation

5.3 "describe"

Jens Maurer's system included the notion of "describe". In this system user class serialization would be specified as

    struct user_type {
        int i;
        template<class Desc>
        void describe(Desc & d)
        {
            d & i;
        }
    };

Mapped to our nomenclature, Desc corresponds to either an input or output archive, and & is an overloaded operator that corresponds to our operator overloads of << and >>.

One thing immediately obvious is the specification of only one line for both input and output, thereby eliminating a source of errors easily committed in our system of save/load. On the other hand,

a) it doesn't address base classes
b) it doesn't address versioning
c) it doesn't use const for the save function. This turns out to be critical for detecting certain types of errors and to avoid multiple save/load of objects referred to multiple times.

So what would it take to implement "describe" in our system?

What occurs to me is using the preprocessor to change

    DESCRIBE(
        class_name_id(A [,"A"])  // default class id is class name string
        [version = 0],
        base_classes( B,C,... )
        member_variables(
            x [(xversion = 0)],  // first version where member used
            y [(yversion = 0)],
            ...
        )
    )

into something like

    GLOBAL_REGISTER(A, "A")
    void serialize<A>::load(basic_iarchive & ar, A & a, version_type v)
    {
        ar >> base_object<B>(a);
        ar >> base_object<C>(a);
        if(v >= xversion)
            ar >> a.x;
        if(v >= yversion)
            ar >> a.y;
        ...
    }

Advantages:
a) provides save/load symmetry
b) would not complicate the serialization system
c) it's conceptually distinct from the serialization system

Disadvantages:
a) would require usage of the preprocessor
b) might be difficult to implement

Of course, there might be better ways to implement this. Implementation of such a thing would not be a trivial exercise. But it would not require any changes to the serialization system itself.

Ultimately, this is about creating meta data for C++ classes, i.e. C++ (compile-time) reflection. For the serialization system we implemented pieces of this, e.g. base_object<T>(*this) registers class hierarchy information in order to permit runtime casting of void *. I was reluctant to embark on a "whole new thing" (reflection) without first having finished the thing I started out with (serialization). Basically, I didn't feel I could do "describe" right, so I felt it should be left for later.

5.4 "XML"

XML is going to be a lot more than just writing <classname> .. </classname> tags in the output data. Some issues involved are

a) Using a system of C++ reflection to get the names of variables and classes
b) generating XML schema from some sort of DESCRIBE facility as above. This basically inserts the reflection information into the XML archive
c) the current library keeps track of data stored by writing class ids and object ids for classes and objects used multiple times. XML containing this type of information might not be useful to other programs that don't contain this serialization system, which would defeat the purpose of XML.
d) The current system makes a clear distinction between "how a type is serialized" (save/load) and how primitive types are encoded to bytes (archive).
This is a powerful design and conceptual feature that I have insisted on maintaining. Some suggestions for implementing XML couple the save/load to the archive, e.g.

    ar << "name1" << variable1 <<

which would largely prohibit serializing to more than one type of archive and would break one of the most important features of the library.

I keep hearing "That's not a problem if you add virtual function this" or "It should be easy" or even "if it's not easy, it calls into question the whole design". Maybe. Other proposed designs don't address this any better than we do here. Of course, if it is easy, we can expect an XML implementation of serialization to be submitted soon. I can tell you for a fact that it would not be easy for me.

I don't think anyone can guarantee that XML is doable without actually doing it. So I'm disinclined to add anything to the library "because it will permit implementation of XML". If someone submits an XML implementation that can be approved by boost members and that requires some changes to the serialization system and/or library, then that can be considered. But I'm disinclined to make changes on speculation that they might be useful.

5.5 "Superfast I/O"

There have been requests to add more primitive virtual functions to basic_[i|o]archive in order to permit increased efficiency. Specifically, the idea is to add for each primitive type a virtual function to permit override for a C array of that type, so that the native binary archive could implement

    // for an array of some length N
    basic_oarchive & basic_oarchive::operator<<(const double (&t)[N])
    {
        write_binary(&t[0], sizeof(t));
        return *this;
    }

and run at maximum speed with just one virtual call.

I understand and sympathize, as I hate wasting computer time myself. But at that point we are defining a special implementation for a combination of data type (double) and archive type. I'm reluctant to go down this road. What about vector<double>, might it benefit from special treatment as well?
Pretty soon everyone and his brother wants to add his own special primitive type and the logical transparency is compromised. I would argue that these are very special cases that can best be handled in the time honored way - the software hack. So the above would be implemented for this very special case like this

    template<class T, int N>
    class fast_array {
        T t[N];
    public:
        void save(basic_oarchive & ar) const
        {
            // boarchive is the native binary output archive
            boarchive * bar = dynamic_cast<boarchive *>(&ar);
            if(NULL != bar)
                bar->write_binary(t, sizeof(t));  // one raw block write
            else
                ar << t;                          // any other archive: usual path
        }
    };

OK, it's a hack. But I prefer to handle special cases in the user code where required rather than starting the library on a path whereby the logical transparency is slowly eaten away by adding more and more special cases.

6.0 Where do we go from here ?
==============================

I haven't done an exact count, but it seems that the library would not meet the approval of boost in its current form. On the other hand, it seems that a significant number of boosters like much about the library.

I am very interested in getting this into boost for a number of reasons

a) It would be personally gratifying to me to earn the approval of those whom I would like to think are my peers.
b) it would be good for my career as a contract software developer.
c) I believe this is a good implementation of a facility which would be a valuable contribution to the boost library.
d) it has me by the throat.

Suppose I were to make the changes described in sections 1, 2, and 3 above. Would it be approved by boost? If I could be assured that it would be, I would be willing to make those changes. I guess it would take me 90 days.

Suppose that the package could not be accepted into boost unless it included "describe" functionality and guaranteed implementation of XML. These are two almost completely separate packages which should be separated from serialization. I am not prepared to do the work as I see it as orthogonal to serialization itself.
It is my belief that doing either one correctly is a large undertaking which I am not prepared to commit to. So if this is the position of the majority of boost members, I'm done.

If the decision were on a knife edge, I might be willing to include implementation of "global_registration" via the macro mechanism described above. It's not so much the work I object to but rather the fact that it disturbs the simplicity and clarity of the current implementation. Also, I think this should be considered in conjunction with C++ compile-time reflection, which is way too big to be included.

7.0 Miscellanea
===============

Boost is really missing a key thing to finish off a package like this as well as other boost packages. Boost needs designated "platform gurus", one for each platform - gcc 3.1, msvc7, msvc6, comeau, borland, metrowerks, etc. Once a library has been accepted into boost, the "platform gurus" would build and run the tests on the platform for which they are responsible and report back to the library developer.

a) Given the current state of the C++ language and the compilers that implement it, there is really no other way to guarantee that all libraries work on all platforms.
b) It would be much more efficient than having the library developer going around trying to beg users of different compilers to help out.
c) The system would be much more efficient in that "platform gurus" would know about the quirks of their particular platform, while library developers and maintainers can't really be expected to understand all the quirks of all the platforms. It's the M(libraries) * N(platforms) problem in a different form.
d) This effort would be applied only after a library is otherwise approved. Given that the number of libraries actually approved per year is small, being a "platform guru" wouldn't be an unreasonable burden.
e) "platform gurus" would be making a valuable contribution even though they don't have the time to design, build, and get approved a new library contribution.
f) Such a valuable contribution would justify adding these people to that elite group known as "boost library contributor" along with the web site picture. The "boost library contributor" title is incredibly valuable currency. A little bit spent in this direction would improve boost code a lot. Of course, the currency is valuable because it has been wisely spent - so far - so don't go overboard with this idea.

8.0
===

By the way, I vote for approval of the library.

Robert Ramey