[boost] Re: Serialization Submission version 6
Robert Ramey wrote:

> 1) changing the state of the stream while serializing.
>
> My implementation initialized the stream and never contemplated that the
> same stream might be used for other things. That is, that serialized data
> might be "embedded" as part of a larger stream. Apparently this is an
> issue for some people. I don't see it as a large issue, but it is easy to
> address.

In fact the issue is so easy to address that I don't understand why we are still discussing it :) If you are willing to accept my solution, please say so immediately, so we won't waste any more time.

> One method of storing/recovering the data is to use a sequence of
> characters or wide characters. That is a C++ stream. This has some major
> benefits:
>
> a) All the code required to convert any C++ datatype into characters or
> wide characters exists and is part of the standard library and is
> guaranteed to work.

This is not true, and I proved it to you with a code snippet in a recent post of mine. The standard *does not* provide a way to output (i.e., to write to a disk file) a stream of wide characters. You can put wide characters into a wide stream, but you will always obtain a file of "narrow" characters, obtained through a "degenerate conversion" as explicitly specified in the standard.

Moreover, I have very bad news. I just found that the C++ implementation shipped with .NET is not conformant on this point. Consider the following program:

    #include <fstream>

    int main()
    {
        std::wofstream out("test.txt", std::ios::binary);
        out << L"I owe you \x20ac 1\n"; // \x20ac is the Euro sign
        return 0;
    }

On .NET with STLport you get the incorrect, but ANSI-conforming, result:

    "I owe you ¬ 1"

'¬' being the character with code 0xac. On .NET with its native STL implementation you get

    "I owe you "

The program chokes when writing the Euro sign and leaves the stream in the "failed" state :( Here Microsoft seems to have really screwed something up.
> Another observation:
>
> I note that my test.cpp program includes wchar_t member variables
> initialized to values in excess of 256. The system doesn't seem to lose
> any information in storing/loading to a stream with classic locale.
> I double checked. I have functions in both char and wchar_t versions of
> text archives to handle both strings of chars and wstrings. This created
> a couple of problems. The most obvious was what to do about strings
> containing embedded blanks and other punctuation. Single characters such
> as a space were also a problem. First I implemented them as a sequence of
> short integers. That worked fine, but I was concerned that it wasted
> space, was slow, and was inconvenient for debugging. So I made special
> functions for i/o of string and wstring which just write a string length
> and then stream out the string buffer as binary. So I never have the
> problem that Unicode or locale or anything else interferes with my
> serialization. This is a side effect of the fact that the usage of the
> stream was carefully limited to the purpose at hand.

You should triple check, then. Following my previous example, this program:

    #include <cassert>
    #include <fstream>
    #include <string>
    // plus the header from the submission under review that declares
    // boost::woarchive and boost::wiarchive

    int main()
    {
        std::wstring outs(L"I owe you \x20ac 1"), ins;
        {
            std::wofstream out("test.txt", std::ios::binary);
            boost::woarchive ar(out);
            ar << outs;
        }
        {
            std::wifstream in("test.txt", std::ios::binary);
            boost::wiarchive ar(in);
            ar >> ins;
        }
        assert(outs == ins);
        return 0;
    }

fails on at least two platforms (.NET/native STL and .NET/STLport), in two different ways.

> Of course this raises the question why support wstreams at all? We're not
> using its advantages (unless we have a lot of unicode text to store) and
> it doubles the required space.

Let's replace wide streams and archives with narrow ones in the previous example.
The program indeed runs successfully on both STLport and the .NET native STL, but let's have a look at the archive file:

---begin file
22 serialization::archive 1 0 1 13 73 32 111 119 101 32 121 111 117 32 8364 32 49
---end file

This alternative requires from 2 to 6 (six!) bytes per Unicode character. Even up to 12 if you use surrogates, which becomes 8 if your wchar_t is 32 bits wide (:o another platform-specific issue has leaked in!). If I had lots of Unicode strings I would have no doubt about which is the better solution.

I hope you realize that Unicode output is a lot more complex than it seems. I am just asking you to allow the programmer to avoid overriding the locale, which can still be the default option. Am I asking too much?

Alberto Barbati

___
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
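
The per-character sizes quoted in the message above can be checked with a little arithmetic. What follows is only a sketch, assuming each code unit is stored as a decimal number followed by one separator byte, as in the archive dump shown; the function names are invented for illustration and are not part of the submitted library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Number of decimal digits needed to print v.
std::size_t decimal_digits(std::uint32_t v) {
    std::size_t n = 1;
    while (v >= 10) { v /= 10; ++n; }
    return n;
}

// Assumption: one code unit costs its decimal digits plus one separator byte.
std::size_t unit_bytes(std::uint32_t unit) { return decimal_digits(unit) + 1; }

// Cost of one code point when wchar_t is 16 bits wide (UTF-16 code units).
std::size_t text_bytes_wchar16(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) return unit_bytes(cp);
    const std::uint32_t v = cp - 0x10000;  // split into a surrogate pair
    return unit_bytes(0xD800 + (v >> 10)) + unit_bytes(0xDC00 + (v & 0x3FF));
}

// Cost of one code point when wchar_t is 32 bits wide.
std::size_t text_bytes_wchar32(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    return unit_bytes(cp);
}

// Cost of the same code point encoded as UTF-8, for comparison.
std::size_t utf8_bytes(std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Total cost of a string under a per-code-point cost function.
std::size_t total_bytes(const std::u32string& s,
                        std::size_t (*cost)(std::uint32_t)) {
    std::size_t n = 0;
    for (char32_t c : s) n += cost(static_cast<std::uint32_t>(c));
    return n;
}
```

Under these assumptions the thirteen-character test string costs 47 bytes as decimal text (not counting the length prefix) against 15 bytes as UTF-8, and the worst cases come out at 6, 12 and 8 bytes as stated.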
[boost] Re: Serialization Submission version 6
>From: Alberto Barbati <[EMAIL PROTECTED]>

>This solution does not address the objections in my last post in the
>original thread. You seem really concerned about this. We could meet in
>the middle with this solution, instead:

There are a couple of issues here.

1) Changing the state of the stream while serializing.

My implementation initialized the stream and never contemplated that the same stream might be used for other things. That is, that serialized data might be "embedded" as part of a larger stream. Apparently this is an issue for some people. I don't see it as a large issue, but it is easy to address. Basically, I can capture the stream state when it is attached to an archive and restore the stream state when the archive is closed. My previous example only considered the locale as the stream state. After perusing the Boost documentation I found the concept of "io state savers". I realize that there are other aspects of the stream state that might need saving as well. As time permits I will look into this.

2) The function of the stream in serialization.

The archive implementation is a means to store serialized data. Any number of derivations from the base class are possible and even useful. The archive concept as represented by the base classes basic_[i|o]archive presumes no particular method of data storage. Its only requirement is that it be able to store data in a sequence and recover data in the same sequence.

One method of storing/recovering the data is to use a sequence of characters or wide characters. That is a C++ stream. This has some major benefits:

a) All the code required to convert any C++ datatype into characters or wide characters exists and is part of the standard library and is guaranteed to work.

b) All the file handling is implemented as well.

The fact that streams include lots of facilities such as locale, punctuation, etc. is not relevant to our usage in text archives.
We only need to be concerned about this to the extent that they might create portability problems. Hence I set the locale for all streams used for serialization to "classic". This is the correct thing to do. The purpose of serialization is to restore the C++ data types that we started with. We don't want the data altered by the stream processing.

Another way of saying this is that streams have lots of facilities for dealing with human-readable text in a country/language-independent way. In this context the text is dealt with only by our serialization system, so these facilities should not be used.

>> Another observation:
>>
>> I note that my test.cpp program includes wchar_t member variables initialized
>> to values in excess of 256.
>> The system doesn't seem to lose any information in storing/loading to a stream
>> with classic locale.

I double checked. I have functions in both char and wchar_t versions of text archives to handle both strings of chars and wstrings. This created a couple of problems. The most obvious was what to do about strings containing embedded blanks and other punctuation. Single characters such as a space were also a problem. First I implemented them as a sequence of short integers. That worked fine, but I was concerned that it wasted space, was slow, and was inconvenient for debugging. So I made special functions for i/o of string and wstring which just write a string length and then stream out the string buffer as binary. So I never have the problem that Unicode or locale or anything else interferes with my serialization. This is a side effect of the fact that the usage of the stream was carefully limited to the purpose at hand.

Of course this raises the question: why support wstreams at all? We're not using its advantages (unless we have a lot of Unicode text to store) and it doubles the required space.

In summary my view is that:

a) locale doesn't matter to us as long as we always use the same one
b) leave the stream in the state you found it.

Robert Ramey
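
The "capture the stream state and restore it" idea from point 1 of the message above can be sketched with a small RAII guard. This is an illustration only, not code from the submission: locale_guard and write_classic are made-up names, and only the locale is saved. Boost's io state savers (for example boost::io::ios_locale_saver in <boost/io/ios_state.hpp>) offer a ready-made equivalent.

```cpp
#include <cassert>
#include <ios>
#include <locale>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: imbues a temporary locale and puts the previous one
// back when the guard goes out of scope.
template <class CharT, class Traits = std::char_traits<CharT> >
class locale_guard {
public:
    locale_guard(std::basic_ios<CharT, Traits>& stream,
                 const std::locale& temporary)
        : stream_(stream),
          saved_(stream.imbue(temporary)) {}  // imbue returns the old locale

    ~locale_guard() { stream_.imbue(saved_); }

    locale_guard(const locale_guard&) = delete;
    locale_guard& operator=(const locale_guard&) = delete;

private:
    std::basic_ios<CharT, Traits>& stream_;
    std::locale saved_;
};

// Example use: write one number in the classic ("C") locale without
// disturbing the locale the caller had set on the stream.
void write_classic(std::ostream& out, double value) {
    locale_guard<char> guard(out, std::locale::classic());
    out << value;
}
```

An archive built this way would hold one guard for as long as it is attached to the stream, so that an "embedded" archive leaves the surrounding stream as it found it.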
[boost] Re: Serialization Submission version 6
Robert Ramey wrote:

> register_cross_program_class_identifier(const char *id="T")

An alternative could be to use register_type<> as it is, but augment the serialization traits class to provide a

    const char* serialization::get_cross_program_class_identifier();

This solution has the advantage that the identifier string can be physically located near the load/save/version functions (which are usually near the class itself).

Alberto Barbati
[boost] Re: Serialization Submission version 6
Vahan Margaryan wrote:

> Eric Woodruff wrote:
>> type_info is not portable in the slightest.
>
> I realize that. I just pointed out that it's not so convenient to have
> user-supplied string ids because of the template classes.

As pointed out by Robert, the user-supplied string id could be made optional. For the lazy user we might imagine a default value obtained in some programmatic way, for example a possibly pre-processed type_info::name().

Alberto Barbati
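
One way to picture an optional identifier with a programmatic default is a traits class whose primary template falls back to type_info::name(). The sketch below is hypothetical: class_identifier, point and box are invented names, not the submission's interface, and the default is only as portable as the compiler's type_info::name().

```cpp
#include <cassert>
#include <string>
#include <typeinfo>

namespace sketch {

// Hypothetical traits class. Default: fall back to typeid(T).name(), which
// is implementation-defined and so only safe within a single compiler.
template <class T>
struct class_identifier {
    static std::string get() { return typeid(T).name(); }
};

}  // namespace sketch

// Two user types for the demonstration.
struct point { int x, y; };
template <class T> struct box { T value; };

namespace sketch {

// Explicit, portable id for a plain class.
template <>
struct class_identifier<point> {
    static std::string get() { return "point"; }
};

// One partial specialization covers every box<T> by composing the id of T,
// so no id has to be invented per instantiation.
template <class T>
struct class_identifier<box<T> > {
    static std::string get() {
        return "box<" + class_identifier<T>::get() + ">";
    }
};

}  // namespace sketch
```

The partial specialization is aimed at the template-class objection raised in the quoted text: the user writes one id per template, not one per instantiation.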
[boost] Re: Serialization Submission version 6
Robert Ramey wrote:

> I believe I have found an acceptable resolution to the "registration"
> conundrum.

(Note: I consider this "registration" topic a different issue from my registry class proposal. This one relates to "identification" of user classes, while mine is just an issue of factoring responsibilities among library classes.)

I agree that this solution of the "identification" issue is quite right and would be very beneficial to the overall usefulness of the library. Bravo to Robert and Vladimir.

> I conjure up something like (pseudo code):
>
> register_cross_program_class_identifier(const char *id="T")

The perfect place for this function could be as a method of my registry class, don't you think? ;)

Alberto Barbati
[boost] Re: Serialization Submission version 6
Vladimir Prus wrote:

> Robert Ramey wrote:
>
>> Now I remember why I included this.
>>
>> Suppose that an archive is created where the default locale is a
>> Spanish-speaking country, where the number 123 thousand is written
>> 123.000
>>
>> The archive is sent to another country where the default locale is an
>> English-speaking country, where the string 123.000 means 123
>>
>> That's why I set the locale to classic.

I'm well aware of this issue. See below.

> is it a good idea to change stream locale without user's consent. Maybe,
> archive should create *their own* (i/o)stream, sharing streambuffer with
> the stream the user has passed, and with appropriately modified locale?

In my opinion the archive should not change the stream locale without the programmer's consent. The main reason for this is that she may indeed want to use her own locale, for example to allow Unicode output. Overriding only num_put/num_get (and why not ctype also?) is not a nice solution; in my opinion, it's just a hack. Moreover, I can imagine a brave programmer who is aware that her serialized data will not be read by any application except hers and decides to have the text output follow her native language conventions.

In the end, between the two possibilities:

1) override the locale (entirely or partially), reducing the programmer's freedom to customize the output but guaranteeing perfectly portable output;

2) not override the locale, leaving to the programmer the complete responsibility to set the right one that satisfies her specific requirements, with the risk that she messes things up;

I vote without doubt for number 2.

To be paranoid, on output we could write in each archive a magic number like 12345.678; on input we try to read it, and if it doesn't match the magic number, we issue an error. I know that this hack won't catch 100% of the problems, but it will catch most of them and is no less safe than writing the sizeof() of the basic types as we are currently doing.

Alberto Barbati
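
The magic-number check proposed in the message above might look roughly like this. It is a sketch under assumptions: the helper names are invented, and comma_decimal is a stand-in facet for a locale that writes ',' as the decimal point, used so the example does not depend on which named locales are installed.

```cpp
#include <cassert>
#include <cmath>
#include <ios>
#include <istream>
#include <locale>
#include <ostream>
#include <sstream>

// The probe value suggested in the message above.
const double locale_probe = 12345.678;

// Writer side: emit the probe using whatever locale the stream carries.
void write_probe(std::ostream& out) {
    const std::streamsize old_precision = out.precision(8);  // enough digits
    out << locale_probe << ' ';
    out.precision(old_precision);  // leave the stream as we found it
}

// Reader side: true if the probe reads back unchanged under the reader's
// locale.
bool probe_matches(std::istream& in) {
    double seen = 0.0;
    if (!(in >> seen)) return false;
    return std::fabs(seen - locale_probe) < 1e-6;
}

// Illustrative facet standing in for a locale with ',' as decimal point.
struct comma_decimal : std::numpunct<char> {
    char do_decimal_point() const override { return ','; }
};

// Round trip: write with writer_locale, read back with reader_locale.
bool round_trip_ok(const std::locale& writer_locale,
                   const std::locale& reader_locale) {
    std::ostringstream out;
    out.imbue(writer_locale);
    write_probe(out);
    std::istringstream in(out.str());
    in.imbue(reader_locale);
    return probe_matches(in);
}
```

With matching locales on both sides the probe survives; an archive written with the comma facet and read with the classic locale is rejected, which is the kind of mismatch the check is meant to catch.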
Re: [boost] Re: Serialization Submission version 6
"William E. Kempf" <[EMAIL PROTECTED]> writes:

> David Abrahams said:
>> "William E. Kempf" <[EMAIL PROTECTED]> writes:
>>
>> Not all uses of serialization depend on that.
>
> Most of the cases I've ever had a need for do. Either it's being used to
> persist data, or it's being used to do IPC. Are the other uses really
> prevalent in your experience?

No, I didn't claim they were. I'm just saying that they exist. Where applicable, it's important to be able to avoid a lot of extra work writing registration code.

>> Which implementations? Boost.Python v2 /depends/ on
>> type_info::name() returning a usefully-unique string on many
>> platforms (depending on their dynamic linking model). I've never
>> seen an example of a platform which "often give you strings that
>> are *not* unique", but if they're out there, I need to know about
>> them.
>
> OK, I may be wrong here. However, it was my understanding that many
> implementations returned nothing more than the type's name, minus the
> namespace, which would mean you could easily get non-unique names.
> I had even heard a rumor once that there was a compiler that always
> returned a null string, in order to save space in the executable,
> but I can't tell you what compiler that was supposed to be or verify
> that it was anything more than a bad rumor.
>
> Regardless, however, you have to admit that all of this *IS* allowed
> by the standard, making reliance on this behavior shaky even if
> you could confirm that all current implementations do something
> useful here.

Yep, it's legally shaky. I think it's relatively portable, practically speaking.

>>> Can type_info::name() be useful? Yes, provided the implementation
>>> did something useful, but it's not portable, and not useful for
>>> the task at hand.
>>
>> There are lots of tasks you can do with a serialization library,
>> and I submit that a reasonable proportion of those tasks can take
>> advantage of type_info::name() on a useful number of compilers.
>
> If type_info::name() doesn't return a unique string?

No, but show me one example which doesn't. My point is that most compilers do emit unique strings for all practical purposes. Most even do something human-readable.

>>> BTW, there's a LOT to be said for specifically supplying an
>>> identifier when implementing a persistence/serialization library, even
>>> though it means tedious busy work. Specifically, it allows you to
>>> ensure the id is valid across multiple programs, regardless of how the
>>> implementation might auto-magically generate an identifier. I'd
>>> recommend choosing a "large integer" representation instead of a
>>> string, however, since it will take less space to represent
>>> externally. The GUID type is actually a fairly good choice here.
>>
>> Maybe it would make sense to use Steve Dewhurst's typeof()
>> technique. At least that could help reduce the number of user-supplied
>> identifiers needed.
>
> That may well be worth looking into, but I'm not familiar with the
> technique. I know he generates unique integer ids (or at least, I think I
> know that), but will the generation produce consistent ids across
> application runs?

Yes. It's a compile-time encoding. You have to register the UDTs, but composite types such as T and T (*)(U,V) get unique type ids automatically.

--
David Abrahams
[EMAIL PROTECTED] * http://www.boost-consulting.com
Boost support, enhancements, training, and commercial distribution
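
To make the idea of a compile-time encoding concrete, here is a rough sketch. It is not Dewhurst's actual technique or code: the names (code_list, type_code), the tag values 100 and 101, and the base codes are all assumptions chosen for illustration. Base types are registered by hand, while pointer and two-argument function-pointer types derive their codes from their parts, so the result is fixed at compile time and cannot vary between runs.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// A compile-time list of small integer codes.
template <std::uint32_t... Cs> struct code_list {};

// Concatenation of two code lists.
template <class A, class B> struct concat;
template <std::uint32_t... As, std::uint32_t... Bs>
struct concat<code_list<As...>, code_list<Bs...> > {
    using type = code_list<As..., Bs...>;
};
template <class A, class B>
using concat_t = typename concat<A, B>::type;

// Primary template left undefined: an unregistered type fails to compile.
template <class T> struct type_code;
template <class T>
using type_code_t = typename type_code<T>::type;

// Registered base types (codes chosen arbitrarily for this sketch).
template <> struct type_code<int>    { using type = code_list<1>; };
template <> struct type_code<double> { using type = code_list<2>; };

// A user-defined type has to be registered once.
struct widget {};
template <> struct type_code<widget> { using type = code_list<1000>; };

// Composite: pointer = tag 100 followed by the pointee's code.
template <class T>
struct type_code<T*> {
    using type = concat_t<code_list<100>, type_code_t<T> >;
};

// Composite: two-argument function pointer = tag 101, then R, U, V.
template <class R, class U, class V>
struct type_code<R (*)(U, V)> {
    using type = concat_t<code_list<101>,
                 concat_t<type_code_t<R>,
                 concat_t<type_code_t<U>, type_code_t<V> > > >;
};
```

In this sketch the codes stay distinct only as long as the registered base codes are distinct and avoid the tag values; a real implementation would need to cover many more type constructors.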
Re: [boost] Re: Serialization Submission version 6
David Abrahams said:
> "William E. Kempf" <[EMAIL PROTECTED]> writes:
>
>> David Abrahams said:
>>> "Eric Woodruff" <[EMAIL PROTECTED]> writes:
>>> type_info is not portable in the slightest.
>>>
>>> There are lots of applications where that doesn't matter. And with a
>>> little postprocessing, the type_info::name() produced by most
>>> compilers could easily be normalized into a common format.
>>
>> The trouble is that serialization requires an identifier that's
>> persistent across application runs.
>
> Not all uses of serialization depend on that.

Most of the cases I've ever had a need for do. Either it's being used to persist data, or it's being used to do IPC. Are the other uses really prevalent in your experience?

>> type_info by itself doesn't help here, because you can't persist the
>> type_info instance even if it were guaranteed to compare across runs
>> (obviously it's not). type_info::name() is so underspecified that you
>> can't be sure it won't give you the same string for every type. More
>> importantly, in practice (i.e. implementations do this),
>> type_info::name() will often give you strings that are *not* unique,
>> rendering it worthless for this domain.
>
> Which implementations? Boost.Python v2 /depends/ on type_info::name()
> returning a usefully-unique string on many platforms (depending on their
> dynamic linking model). I've never seen an example of a platform which
> "often give you strings that are *not* unique", but if they're out
> there, I need to know about them.

OK, I may be wrong here. However, it was my understanding that many implementations returned nothing more than the type's name, minus the namespace, which would mean you could easily get non-unique names. I had even heard a rumor once that there was a compiler that always returned a null string, in order to save space in the executable, but I can't tell you what compiler that was supposed to be or verify that it was anything more than a bad rumor.

Regardless, however, you have to admit that all of this *IS* allowed by the standard, making reliance on this behavior shaky even if you could confirm that all current implementations do something useful here.

>> Can type_info::name() be useful? Yes, provided the implementation did
>> something useful, but it's not portable, and not useful for the task
>> at hand.
>
> There are lots of tasks you can do with a serialization library, and I
> submit that a reasonable proportion of those tasks can take advantage of
> type_info::name() on a useful number of compilers.

If type_info::name() doesn't return a unique string?

>> BTW, there's a LOT to be said for specifically supplying an
>> identifier when implementing a persistence/serialization library, even
>> though it means tedious busy work. Specifically, it allows you to
>> ensure the id is valid across multiple programs, regardless of how the
>> implementation might auto-magically generate an identifier. I'd
>> recommend choosing a "large integer" representation instead of a
>> string, however, since it will take less space to represent
>> externally. The GUID type is actually a fairly good choice here.
>
> Maybe it would make sense to use Steve Dewhurst's typeof()
> technique. At least that could help reduce the number of user-supplied
> identifiers needed.

That may well be worth looking into, but I'm not familiar with the technique. I know he generates unique integer ids (or at least, I think I know that), but will the generation produce consistent ids across application runs? If not, regardless of whether or not serialization can be useful without that, I would claim it cripples the serialization library enough to not be a valid solution.

William E. Kempf
Re: [boost] Re: Serialization Submission version 6
"William E. Kempf" <[EMAIL PROTECTED]> writes:

> David Abrahams said:
>> "Eric Woodruff" <[EMAIL PROTECTED]> writes:
>>
>>> type_info is not portable in the slightest.
>>
>> There are lots of applications where that doesn't matter. And with a
>> little postprocessing, the type_info::name() produced by most
>> compilers could easily be normalized into a common format.
>
> The trouble is that serialization requires an identifier that's
> persistent across application runs.

Not all uses of serialization depend on that.

> type_info by itself doesn't help here, because you can't persist the
> type_info instance even if it were guaranteed to compare across runs
> (obviously it's not). type_info::name() is so underspecified that
> you can't be sure it won't give you the same string for every type.
> More importantly, in practice (i.e. implementations do this),
> type_info::name() will often give you strings that are *not* unique,
> rendering it worthless for this domain.

Which implementations? Boost.Python v2 /depends/ on type_info::name() returning a usefully-unique string on many platforms (depending on their dynamic linking model). I've never seen an example of a platform which "often give you strings that are *not* unique", but if they're out there, I need to know about them.

> Can type_info::name() be useful? Yes, provided the implementation did
> something useful, but it's not portable, and not useful for the task at
> hand.

There are lots of tasks you can do with a serialization library, and I submit that a reasonable proportion of those tasks can take advantage of type_info::name() on a useful number of compilers.

> BTW, there's a LOT to be said for specifically supplying an
> identifier when implementing a persistence/serialization library,
> even though it means tedious busy work. Specifically, it allows you
> to ensure the id is valid across multiple programs, regardless of
> how the implementation might auto-magically generate an identifier.
> I'd recommend choosing a "large integer" representation instead of a
> string, however, since it will take less space to represent
> externally. The GUID type is actually a fairly good choice here.

Maybe it would make sense to use Steve Dewhurst's typeof() technique. At least that could help reduce the number of user-supplied identifiers needed.

--
David Abrahams
[EMAIL PROTECTED] * http://www.boost-consulting.com
Boost support, enhancements, training, and commercial distribution
Re: [boost] Re: Serialization Submission version 6
David Abrahams said:
> "Eric Woodruff" <[EMAIL PROTECTED]> writes:
>
>> type_info is not portable in the slightest.
>
> There are lots of applications where that doesn't matter. And with a
> little postprocessing, the type_info::name() produced by most
> compilers could easily be normalized into a common format.

The trouble is that serialization requires an identifier that's persistent across application runs. type_info by itself doesn't help here, because you can't persist the type_info instance even if it were guaranteed to compare across runs (obviously it's not). type_info::name() is so underspecified that you can't be sure it won't give you the same string for every type. More importantly, in practice (i.e. implementations do this), type_info::name() will often give you strings that are *not* unique, rendering it worthless for this domain.

Can type_info::name() be useful? Yes, provided the implementation did something useful, but it's not portable, and not useful for the task at hand.

BTW, there's a LOT to be said for specifically supplying an identifier when implementing a persistence/serialization library, even though it means tedious busy work. Specifically, it allows you to ensure the id is valid across multiple programs, regardless of how the implementation might auto-magically generate an identifier. I'd recommend choosing a "large integer" representation instead of a string, however, since it will take less space to represent externally. The GUID type is actually a fairly good choice here.

William E. Kempf
Re: [boost] Re: Serialization Submission version 6
- Original Message -
Eric Woodruff wrote:

> type_info is not portable in the slightest.

I realize that. I just pointed out that it's not so convenient to have user-supplied string ids because of the template classes.

Regards,
-Vahan

> "Vahan Margaryan" <[EMAIL PROTECTED]> wrote in message
> news:003401c28bee$7fbc4f40$4f09a8c0@;lan.mosaic.am...
> - Original Message -
> From: "Robert Ramey" <[EMAIL PROTECTED]>
> Sent: Thursday, November 14, 2002 5:45 PM
> Subject: Re: [boost] Serialization Submission version 6
>
> > register_cross_program_class_identifier(const char *id="T")
> >
> > This would be invoked for each class declaration. Now we have
> > a portable id associated with each class - exactly what we need.
> > Polymorphic pointers would archive this tag and use it
> > to determine the proper class to construct on loading.
> >
> > The default class identifier would be the text representation of the
> > class name.
> > (note: in general not necessarily the same as type_info.name() )
> > which is going to be sufficient for almost all cases.
>
> The problem that usually arises from this is having to make up class ids
> for template classes. type_info does this for you.
>
> Regards,
> -Vahan
Re: [boost] Re: Serialization Submission version 6
"Eric Woodruff" <[EMAIL PROTECTED]> writes:

> type_info is not portable in the slightest.

There are lots of applications where that doesn't matter. And with a little postprocessing, the type_info::name() produced by most compilers could easily be normalized into a common format.

--
David Abrahams
[EMAIL PROTECTED] * http://www.boost-consulting.com
Boost support, enhancements, training, and commercial distribution
[boost] Re: Serialization Submission version 6
type_info is not portable in the slightest.

"Vahan Margaryan" <[EMAIL PROTECTED]> wrote in message
news:003401c28bee$7fbc4f40$4f09a8c0@;lan.mosaic.am...
- Original Message -
From: "Robert Ramey" <[EMAIL PROTECTED]>
Sent: Thursday, November 14, 2002 5:45 PM
Subject: Re: [boost] Serialization Submission version 6

> register_cross_program_class_identifier(const char *id="T")
>
> This would be invoked for each class declaration. Now we have
> a portable id associated with each class - exactly what we need.
> Polymorphic pointers would archive this tag and use it
> to determine the proper class to construct on loading.
>
> The default class identifier would be the text representation of the
> class name.
> (note: in general not necessarily the same as type_info.name() )
> which is going to be sufficient for almost all cases.

The problem that usually arises from this is having to make up class ids for template classes. type_info does this for you.

Regards,
-Vahan
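
The quoted proposal, in which a polymorphic pointer archives a tag that is later used to decide which class to construct, can be pictured with a small factory registry. Everything here is an illustrative assumption (shape, circle, square, class_registry and its member names); it shows the general mechanism only, not the library's implementation.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// A small polymorphic hierarchy for the demonstration.
struct shape {
    virtual ~shape() = default;
    virtual std::string describe() const = 0;
};
struct circle : shape {
    std::string describe() const override { return "circle"; }
};
struct square : shape {
    std::string describe() const override { return "square"; }
};

// Hypothetical registry: maps a user-supplied, program-independent id to a
// factory that constructs the matching derived class.
class class_registry {
public:
    template <class Derived>
    void register_class(const std::string& id) {
        factories_[id] = [] { return std::unique_ptr<shape>(new Derived); };
    }

    // Loading side: the id read from the archive selects the class to build.
    std::unique_ptr<shape> create(const std::string& id) const {
        auto it = factories_.find(id);
        if (it == factories_.end())
            throw std::runtime_error("unregistered class id: " + id);
        return it->second();
    }

private:
    std::map<std::string, std::function<std::unique_ptr<shape>()> > factories_;
};
```

On saving, the archive would write the id string ahead of the object's data; on loading, it would read the id, call create(), and then let the new object read its own members.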
[boost] Re: Serialization Submission version 6
>Dirk Gerrits wrote:

>IIRC the old persistence library defined facilities for input and output
>using the RFC-1014 XDR: External Data Representation Standard. The new
>serialization library doesn't seem to include such archive classes and
>leaves it up to the user to write these.

>Now I don't mean to dispute the decision, but I'd just like to know what
>the rationale for it was.

Well, I've been busy.

Seriously, I personally had no interest in implementing XDR.

a) It didn't add any more portability than using a text file.
b) Using a text stream permitted standard library code to address the mapping between machines - for free and guaranteed correct.
c) In my opinion it wouldn't be any faster.

Of course all sorts of objections were raised to these views. Worse, everyone wanted their own pet archive format.

So I did the natural thing - I punted. The data storage is completely factored out. I'm waiting for any one of those who told me how easy it is to make a portable XDR archiver to submit a derivation of basic_[i|o]archive.

This factoring out actually is a great thing. The data storage itself doesn't even have to be a stream-like object. It could be something more interesting like a pipe to another machine or whatever.

Robert Ramey
[boost] Re: Serialization Submission version 6
IIRC the old persistence library defined facilities for input and output using the RFC-1014 XDR: External Data Representation Standard. The new serialization library doesn't seem to include such archive classes and leaves it up to the user to write these.

Now I don't mean to dispute the decision, but I'd just like to know what the rationale for it was.

Secondly, I have a nit about the documentation. In the reference section about the 'Definition of New Archive Formats' it is said that: 'The archive format is specified by implementing the virtual functions of the base class.' I think it would be clearer if the documentation also listed all the virtual functions that one should implement, so that the user wouldn't have to read archive.hpp. Perhaps a simple example should be provided. Say, a rot13 text stream or something.

Regards,
Dirk Gerrits