On Monday 06 January 2003 03:52 am, Alberto Barbati wrote: > I hope that in front of a full-blown and working > library I might get more attention.
Since the Boost list has gotten (IMO) a little bit out of hand, I just haven't seen the previous mail. Actually, it was lucky I came across this one... Anyway, I have a few comments on this since I'm currently working on something very similar. As a first comment, I think that such facets are pretty useful. However, from just glancing at the implementation and the interfaces I have a few points which may be worth for consideration: - The 'state' argument for the various facet functions is normally accessed as a plain integer in your code. However, eg. 'std::mbstate_t' is not an integer at all. In fact, it is implementation defined and can thus be something entirely different on each platform. I'd suggest to provide some from of accessor functions to read or write objects of this type, possibly in the form of some traits class. This could then be used to cope with differences and/or more complex organization of the state type. - Falling back to UTF-16 in cases where the Unicode characters may not fit into the words is one possible approach to deal with the characters. Another approach is to just indicate an error. For example, I know that certain XML-files I'm using exclusively store ASCII characters and there is no need to use a different internal character type than 'char'. If I ever come across a non-ASCII character I don't really want to have it encoded in something like UTF-8 but I want to fail reading the file (and possibly retry reading using a bigger character type). I would appreciate if such a possibility would be incorporated, too (this is, however, nothing I feel too strongly about). - In the context I want to use the facets [at least those I'm implementing] I don't really want to bother about the details of the encoding. That is, I just want to 'imbue()' an appropriate 'std::locale' object to the stream and have the facet figure out what encoding is used, especially when reading from some file. The basic idea is that each file is either started by a byte order mark from which UTF-16BE or UTF16LE can be deduced or it is in UTF-8. The encoding could eg. be stored in typical 'std::mbstate_t' objects (these are often a struct with a 'wchar_t' and a count or something like this). To enable something like this, it would be helpful if the actual conversion functions were actually separated from the facets: The facets would just call the corresponding conversion functions. The actual conversion is, apart from the state argument, entirely stateless and there is no need to bind it to any facet. - The conversion functions seem to assume that there are input sequence consisting of more than one character passed to it. I don't think this is guaranteed at all. At the very minimum, the facets should defend against being called with too small buffers (at least in 'do_out()' of 'utf16_utf32' the constant 3 is substracted from a pointer without prior check that the input area consists of at least three characters, thereby possibly invoking undefined behavior). I think that it is acceptable behavior of a facet user (like eg. a file stream buffer) to feed individual characters to the 'do_*()' functions or at least to increase the buffers char by char (I think that at least one of the file stream buffer implementations actually does just that in certain cases!). - My understanding of 'encoding()' is that '-1' indicates that "shift states" are used and that a special termination sequence returning the state into some initial state is possibly needed. That is, although the 'state' variable is used, the UTF-* encodings are considered to be stateless and there is no need to 'unshift()' anything because the 'do_out()' function will eventually write all what is necessary if it gets a big enough output buffer. That is, the 'encoding()' members should probably return a '0' or a positive integer for all UTF-* encodings since the encodings are essentially stateless. My personal feeling is that UTF-16 is, however, a variable width encoding, even if the internal encoding is also UTF-16: A perverted user of the facet might legally present the indivual words out of order to the facet of a fixed width encoding, eg. in a first pass all odd words and in a second all even words. This would rip surrogate sequences apart causing unexpected errors. The code itself is pretty readable. However, I haven't made a thorough review of it [yet]. -- <mailto:[EMAIL PROTECTED]> <http://www.dietmar-kuehl.de/> Phaidros eaSE - Easy Software Engineering: <http://www.phaidros.com/> _______________________________________________ Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost