On Wed, Nov 17, 2010 at 08:23:58PM +0100, Stephan Witt wrote:
> Am 16.11.2010 um 18:00 schrieb Enrico Forestieri:
> > Moral: either we stick an "#undef isdigit" in our code or we switch
> > to iswdigit(). However, in this case, some locale expert should clarify
> > under what conditions the output of iswdigit() differs from that of
> > isdigit().
> 
> I had to think about this further...
> 
> 1. isdigit() is only one example for the problems with wchar_t.
> The same problem exists with isalnum() et. al. 
> * isalnum() in output_xhtml.cpp, function cleanAttr(docstring const & str)
> * or isalpha() in InsetRef::getFormattedCmd()
> 
> 2. iswdigit() et. al. depends on the current locale.
> Shouldn't the locale depend on the document language? 
> Then it would be iswdigit_l() etc...
> 
> 3. in lstrings.cpp is a isDigit(char_type) implementation...
> There is some use of ucs4_to_qchar() and this again has a comment saying it's 
> a hack.
> If it's correct to use isDigit() then hasDigit can use that.
> 
> 4. Other numerals like 1/3 or "roman numeral one thousand" M etc. should be
> classified as digits as well.
> 
> I don't know what the correct solution would be...
> I'd use neither the "#undef isdigit" nor the range check with "less then 
> 0x80".
> I'd use some iswctype() or iswctype_l() solution. 
> And all uses of non-wchar_t ctype function for char_type arguments should be 
> verified.

There's a bit of confusion about this problem. I suggest that we stick
to a common ground by using the posix standard.

According to posix:
http://www.opengroup.org/onlinepubs/9699919799/functions/isdigit.html
http://www.opengroup.org/onlinepubs/9699919799/functions/iswdigit.html

both isdigit() and iswdigit() are locale dependent. However, passing
as argument of isdigit() an int not representable as an unsigned char
or equal to the value of EOF, triggers undefined behavior (which includes
the possibility of a crash).

So, both isdigit() and iswdigit() return the same output when the argument
is less than 256 or EOF, but only iswdigit() can be safely used for other
values. Given that, I would say that your patch was indeed correct.

Strictly speaking, we should use iswdigit_l() to be able to specify
the wanted locale, which is the locale of the language being spelled,
possibly different from the current locale of the process and also
different from the locale of the document language.

Now things start becoming complicated, maybe too much complicated.
For this reason, I would stick to the isdigit() test with a range
check on the passed argument. In this way we only test for the
arabic digits, avoiding problems whatever the locale involved in
the test.

As regards other uses of isdigit(), I think that in our code we only
need to know whether a char is an arabic digit and not some kind of
Aegean number. Moreover, if we recognize (in some locale) a char which
is not an arabic digit as a digit, maybe we would even trigger some
bug...

As regards the other isxxx() tests, I suggest to audit them on an as
needed basis, according to what the posix standard says:
http://www.opengroup.org/onlinepubs/9699919799/mindex.html
(click on the "Alphabetic index" link at the bottom for accessing the
relevant man page).

-- 
Enrico

Reply via email to