Dietmar Kuehl wrote:
Alberto Barbati wrote:

One can use a char traits class different from
std::char_traits<T>, that defines a suitable state type.

This is not really viable due to 27.8.1.1 paragraph 4:

  An instance of basic_filebuf behaves as described in lib.filebuf
  provided traits::pos_type is fpos<traits::state_type>. Otherwise the
  behavior is undefined.
Thanks for pointing that out, I missed it. However, it's not really a problem: you can add a pos_type typedef to your test_traits, like this:

template <typename T>
struct test_traits : public std::char_traits<T>
{
    typedef boost::uint32_t            state_type;
    typedef std::fpos<boost::uint32_t> pos_type;
};

It would be possible to create a conversion stream buffer (which is
probably a good idea anyway) which removes this requirement, but even
then things don't really work out: streams using different character
traits are not compatible with the normal streams. I haven't worked
much with wide characters and don't know how important it is to have,
e.g., the possibility of using a wide character file stream and one of
the standard wide character streams (e.g. 'std::wcout') be interchangeable.
I think it is crucial that the library supports 'std::mbstate_t',
although this will require platform-specific stuff. It should be
factored out and documented such that porting to a new platform consists
basically of looking up how 'std::mbstate_t' is defined.
That's a better argument. I will think about it. As I said, I'm definitely not against adding accessors to mbstate_t; I just have to think about the best way to do it.

I forgot to say in my previous post that this version of the library
only supports platforms where type char is exactly 8 bits. This
assumption is required because I have to work at the octet level while
reading from/writing to a stream.

I don't see why this would be required, however. This would only be
necessary if you try to cast a sequence of 'char's into a sequence
of 'wchar_t's. Converting between these is also possible in a portable
way (well, at least portable across platforms with identical size of
'wchar_t' even if 'char' has different sizes).
The problem is that if char is not exactly 8 bits, then I cannot be sure that the underlying implementation reads from a file 8 bits at a time. Please correct me if I'm wrong on this point. That requirement is essential for the UTF-8 encoding.

Such a decision is very strong, I know. Yet, one of the main problems with
the acceptance of Unicode as a standard is that there are too many
applications around that use only a subset of it. For example, one of
the first pieces of feedback I got, at the beginning of this work, was "I don't
need to handle surrogates, could you provide an optimized facet for that
case?". The answer was "Yes, I could, but I won't".

As I said, I don't have strong feelings about this (and I have
implemented such a facet myself already anyway...). However, note that
I requested something quite different: I definitely want to detect if
a character cannot be represented using the internally used character.
In fact, I would like to see this happen even for a 16-bit internal type,
because UTF-16 processing is considerably more complex than UCS-2
processing and I can see people falling into the trap of testing only
cases where UCS-2 is used. That is, the implicit choice of using UTF-16
is actually a pretty dangerous one, IMO.
I know it's dangerous, but I prefer it that way. I would like this to be "The UTF Library", not just some "conversion library". I also want to support the Unicode standard to its full extent. Supporting a conversion not covered by Unicode, just because someone finds it useful, does not go in that direction. If this position were to stop my proposal from being accepted into Boost, I would simply withdraw it.

There already exists a facility to select the correct facet according to
the byte order mark. It's very simple to use:

    std::wifstream file("MyFile", std::ios_base::binary);
    boost::utf::imbue_detect_from_bom(file);

that's it.
I have seen this possibility and I disagree that it is very simple to use
for several reasons:

- There is at least one implementation which does not allow changing
  the locale after the file was opened. This is a reasonable
  restriction which seems to be covered by the standard (I thought
  otherwise myself but haven't found any statement supporting a
  different view).  Thus, changing the code conversion facet without
  closing the file may or may not be possible. Closing and reopening
  a file may also be impossible for certain kinds of files.

I guess you are referring to 27.8.1.4, clause 19 (the description of filebuf::imbue):

"Note: This may require reconversion of previously converted characters. This in turn may require the implementation to be able to reconstruct the original contents of the file."

That may indeed be a problem. In my humble opinion, the use of "may" is quite unfortunate... it seems that an implementation need not reconvert previously converted characters, and the standard leaves unspecified (not even "undefined" nor "implementation-defined") what happens if the implementation cannot perform the reconstruction.

How is imbue implemented in the implementation you were mentioning?

- Your approach assumes a seekable stream which is not necessarily
  the case: At least on UNIXes I can open a file stream to read from
  a named pipe which is definitely non-seekable. Adjusting the state
  internally can avoid the need to do any seeking, although admittedly
  at the cost of some complexity encapsulated by the facet.
We don't really need to seek, anyway. Once the BOM is extracted and detected, I could just imbue the correct facet. Seeking back, to let the implementation extract the BOM itself, is just overzealous, and I agree that it is kind of silly.

The above two lines cause undefined behavior according to the C++
standard. Correspondingly 'ptr1 < ptr2' is defined if and only if
'ptr1' and 'ptr2' are pointers of the same type (not counting
cv-qualification) and point *into* the same array object or one behind
the last element. If this condition does not hold, the expression
'ptr1 < ptr2' causes undefined behavior, too.

At first sight, this restriction seems to be pretty esoteric but it is
actually not: On a segmented architecture, 'ptr - 3' might result in a
pointer which looks as if it points to the end of a segment. This in
turn means that 'ptr - 3 < ptr' does not necessarily hold if 'ptr'
points to one of the first two positions in an array object.
You're right on everything here. I'll add the check.

You are right, UTF-* encodings are in fact stateless. However, for a
reason too complex to describe here (it will be written in the
documentation for sure!) the facets that use UTF-16 internally need to
be state-dependent in order to support surrogates.
I don't think so. UTF-16 is a multi-byte encoding, but a stateless one.
... and I don't see how library issue #76 changes anything about this!
In fact, if it does, it is probably broken and the resolution needs
fixing, not the code using it. The cost of turning an encoding into a
stateful multi-byte encoding is likely to be something you called
"brain dead" in your article: namely, doing the conversion
one external character at a time. All other encodings, i.e. fixed-width
encodings and stateless multi-byte encodings, can do much better.
I think it is time for that complex explanation I was talking about.
The UTF facets I wrote do not involve one encoding, but two encodings each: one on the internal side and one on the external side:

external sequence (bytes)
|
external encoding (UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
|
Unicode scalar value (abstract characters)
|
internal encoding (UTF-16, UTF-32)
|
internal characters (wchar_t or whatever)

The external encoding is never a problem.

If the internal encoding is UTF-32, everything's fine, because to each Unicode scalar value there corresponds at most one internal character.
In fact, this is just a validation step: a few scalar values are invalid and generate an error; all the others are simply mapped identically to internal characters.

The problems arise for the internal UTF-16 encoding, where each valid Unicode scalar value is returned as either one *or* two internal characters (a surrogate pair).

But issue #76 explicitly requires that a codecvt facet must be able to convert internal characters *one at a time*. So what should I do when I encounter a Unicode scalar value that requires a surrogate pair but the implementation requested one single character? Simple: I output the first surrogate and store the second surrogate in the state. In the next call I return the second surrogate.

This explains why I need shift states. I challenge you to find a better way to achieve this kind of processing without violating issue #76. On the other hand, issue #76 is not under discussion; its rationale is solid as iron.

There is one more problem: some implementations require the codecvt facet to consume at least one external character whenever it produces at least one character. In my opinion, the standard does not allow this assumption. I posted a DR to comp.std.c++ hoping the LWG will add explicit wording about the issue. However, until implementations are fixed, we have to deal with them. Fortunately, I was able to find a way to always consume at least one character for each character produced, without a great loss of performance, while still providing under #ifdef the more optimal code for implementations that can handle it.

Please notice that the extra cost of handling this very complex case is negligible if the source sequence does not contain characters that need such complex handling.

Alberto


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
