Hi,

I just uploaded a new version of the UTF library here: http://groups.yahoo.com/group/boost/files/utf/. The changes are:

1) Added missing typename keywords and used BOOST_DEDUCED_TYPENAME in every applicable place
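For readers unfamiliar with the issue, here is a small sketch (my own illustration, not code from the library) of where the typename keyword is required; BOOST_DEDUCED_TYPENAME simply expands to typename on conforming compilers and to nothing on older ones that reject it:

```cpp
#include <vector>

// Inside a template, a name that depends on a template parameter is not
// assumed to name a type: the typename keyword (or BOOST_DEDUCED_TYPENAME)
// is required. This function is only an illustration, not library code.
template <class Container>
typename Container::value_type first_element(const Container& c)
{
    typename Container::const_iterator it = c.begin();  // typename needed here
    return *it;
}
```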

2) Added safety checks on buffer size. (Thanks to Dietmar Kuehl)

3) Now the state type is not assumed to be an integer type. In order to access the state two unqualified free functions get_state() and set_state() are used instead. File utf_config.hpp provides a default *non-portable* implementation that relies on reinterpret_cast, which should be specialized for each platform. (Thanks to Dietmar Kuehl)
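To illustrate the idea (this is a sketch under my own naming assumptions, not the actual contents of utf_config.hpp), a non-portable default could look like this; I use std::memcpy rather than reinterpret_cast to sidestep alignment issues, but the portability caveat is the same:

```cpp
#include <cwchar>    // std::mbstate_t
#include <cstring>   // std::memcpy

// Non-portable default: pretend the opaque conversion state stores a
// plain integer. This only works if mbstate_t is large enough and the
// platform attaches no other meaning to its bytes -- which is exactly
// why the library asks for a per-platform specialization.
inline unsigned int get_state(const std::mbstate_t& s)
{
    unsigned int value;
    std::memcpy(&value, &s, sizeof(value));
    return value;
}

inline void set_state(std::mbstate_t& s, unsigned int value)
{
    s = std::mbstate_t();                    // zero the whole state first
    std::memcpy(&s, &value, sizeof(value));
}
```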

The suite now compiles correctly with gcc on Cygwin, yet it fails to link: the linker complains about missing wchar_t specializations. Can anyone help me with this?

It also seems that gcc does not provide specializations of the library classes (basic_filebuf, char_traits, etc.) for character types other than char and wchar_t. Could anyone confirm this? This could be a problem if the user wants to use UTF-32 facets but their wchar_t is only 16 bits wide. I can easily provide an implementation of char_traits for implementations lacking it. Should I do so?
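To make the offer concrete, a minimal traits class for a 32-bit character type could look roughly like this (the type and class names are mine, chosen only for illustration; a real char_traits specialization would also need int_type, off_type, pos_type, state_type and the eof-related functions):

```cpp
#include <cstddef>

typedef unsigned int utf32_t;  // assumed 32-bit code unit type

// Minimal subset of the char_traits interface for utf32_t,
// sufficient to show the shape of the missing specialization.
struct utf32_traits
{
    typedef utf32_t char_type;

    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }

    static std::size_t length(const char_type* s)
    {
        std::size_t n = 0;
        while (s[n] != 0) ++n;
        return n;
    }

    static int compare(const char_type* a, const char_type* b, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
};
```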

Alberto Barbati wrote:
> Dietmar Kuehl wrote:
>> Alberto Barbati wrote:
>>> The problem is that if char does not have 8 bits, then I cannot be
>>> sure that the underlying implementation reads from a file 8 bits at
>>> a time. Please correct me if I'm wrong on this point. That
>>> requirement is essential for the UTF-8 encoding.
Does anyone have any comments on this? I don't have access to any implementation where char has more than 8 bits, so I cannot verify it myself.

>> There already exists a facility to select the correct facet according
>> to the byte order mark. It's very simple to use:
>>
>>     std::wifstream file("MyFile", std::ios_base::binary);
>>     boost::utf::imbue_detect_from_bom(file);
>>
>> that's it.

> I have seen this possibility and I disagree that it is very simple to
> use for several reasons:
>
> - There is at least one implementation which does not allow changing
>   the locale after the file was opened. This is a reasonable
>   restriction which seems to be covered by the standard (I thought
>   otherwise myself but haven't found any statement supporting a
>   different view). Thus, changing the code conversion facet without
>   closing the file may or may not be possible. Closing and reopening
>   a file may also be impossible for certain kinds of files.

I guess you are referring to 27.8.1.4, clause 19 (the description of the function basic_filebuf::imbue):

"Note: This may require reconversion of previously converted characters. This in turn may require the implementation to be able to reconstruct the original contents of the file."

That may indeed be a problem. In my humble opinion, the use of "may" is quite unfortunate: it seems that an implementation need not reconvert previously converted characters, and the standard leaves unspecified (not even "undefined" or "implementation-defined") what happens if the implementation cannot perform the reconstruction.

How is imbue() implemented in the implementation you were mentioning? I looked deeper into the question.

Of the three implementations I checked (VS.Net/Dinkumware, STLport, and a gcc 3.2 prerelease), none of them implements clause 19; gcc even has an explicit comment about this. All of them allow imbue() in the middle of a file. Which implementation were you talking about?
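For anyone who wants to repeat the check, this is roughly the kind of probe I mean (a sketch, with a throw-away file name of my own choosing): read part of a file, imbue a new locale on the underlying buffer, and keep reading.

```cpp
#include <fstream>
#include <locale>
#include <string>

// Probe: does this library accept imbue() in the middle of a file?
// The file name and contents are arbitrary test data.
int probe_imbue_mid_file()
{
    {
        std::ofstream out("imbue_probe.txt", std::ios_base::binary);
        out << "hello world";
    }

    std::ifstream in("imbue_probe.txt", std::ios_base::binary);
    std::string first(5, '\0');
    in.read(&first[0], 5);                         // read "hello"

    in.rdbuf()->pubimbue(std::locale::classic());  // imbue mid-file

    std::string rest;
    std::getline(in, rest);                        // read " world"
    return first == "hello" && rest == " world";
}
```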

I am considering writing a mega-facet that automatically adapts to the file encoding according to the BOM. It could easily be done for UTF-32, as the conversion code is already factored out of the facet classes (split into the files utfXX_algo.hpp and utf32_strategy.hpp). I plan to do the same factoring for the UTF-16 facets as well; it is already done for the facet utf8_utf16. However, please bear in mind that such a facet cannot be as performant as the small dedicated ones, because each of the do_in/do_out/do_length functions has to be a large switch over the several implementations, and that switch has to be executed on every one of the several calls to do_XXX made for each character.
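The performance concern can be seen in a sketch like the following (the enum and function names are mine, and only the UTF-32 branches are filled in): every code point decoded pays for the dispatch that a dedicated facet avoids.

```cpp
#include <cstddef>

typedef unsigned int code_point;

// Hypothetical dispatch over the encodings a BOM-driven facet would
// have to support; only the UTF-32 branches are implemented here.
enum utf_encoding { enc_utf8, enc_utf16le, enc_utf16be, enc_utf32le, enc_utf32be };

// Decode one code point from p (n bytes available). Returns the number
// of bytes consumed, or 0 on error/unsupported input. The switch runs
// on every single call -- that is the per-character overhead.
std::size_t decode_one(utf_encoding enc, const unsigned char* p,
                       std::size_t n, code_point& out)
{
    switch (enc) {
    case enc_utf32le:
        if (n < 4) return 0;
        out = p[0] | (p[1] << 8) | (p[2] << 16)
            | (static_cast<code_point>(p[3]) << 24);
        return 4;
    case enc_utf32be:
        if (n < 4) return 0;
        out = (static_cast<code_point>(p[0]) << 24)
            | (p[1] << 16) | (p[2] << 8) | p[3];
        return 4;
    default:        // UTF-8/UTF-16 branches omitted in this sketch
        return 0;
    }
}
```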

BTW, this mega-facet is fine when reading from a file. How should it behave when writing? Would it be acceptable to return an error until an encoding is chosen? In fact, reading *and* writing the same Unicode file at the same time is IMHO a sure recipe for disaster, unless writing always occurs at the end of the file with std::ios_base::app.

I am considering adding stream classes, derived from the std::basic_* classes (or maybe from the boost::filesystem classes?), as a convenience. What do you think?
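As a sketch of what such a convenience class could do on open (the helper below is my own illustration, not the library's imbue_detect_from_bom): sniff the byte order mark first, then imbue the matching facet.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Illustration only: detect the byte order mark at the start of a file.
// A convenience stream class could call something like this from its
// constructor, then imbue the facet matching the detected encoding.
enum bom_kind { bom_none, bom_utf8, bom_utf16le, bom_utf16be };

bom_kind detect_bom(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios_base::binary);
    unsigned char b[3] = { 0, 0, 0 };
    in.read(reinterpret_cast<char*>(b), 3);
    std::size_t got = static_cast<std::size_t>(in.gcount());

    if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return bom_utf8;
    if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return bom_utf16le;  // a full check would read 4 bytes to rule out UTF-32LE
    if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return bom_utf16be;
    return bom_none;
}
```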

Alberto


_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost