Re: Rename std.ctype to std.ascii?
On 2011-06-16 12:51, Jouko Koski wrote:
> "Jonathan M Davis" wrote:
> > On 2011-06-14 11:53, Jouko Koski wrote:
> > > I would not consider it being good idea to include this kind of ascii-only utilities in the standard-ish library.
> >
> > For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII.
>
> Do we really need a common library utility for such a bounded domain? I would vote dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities to their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common use ascii-only utility will be used inevitably in places where it shouldn't.
>
> > std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It deals with unicode characters, but it returns false for everything with them which returns a bool, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not.
>
> That is what (or ) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance.

You actually do get a performance loss for a number of functions. They do tend to shortcut on ASCII in many cases, but they tend to become too large to be inlined, and if all you care about is ASCII, even if there are unicode characters in the string (which is common enough in domains that have nothing to do with English - e.g. regular expressions), you take a performance hit for all characters which aren't ASCII.

There are also a number of functions which arguably don't make much sense to try and turn into unicode functions (e.g. isDigit) but are heavily used. Another fun one is isWhite vs isUniWhite. In most cases, you _don't_ care about unicode whitespace, and it is definitely more expensive to call isUniWhite than isWhite, because there are a _lot_ of extraneous whitespace characters in unicode.

std.ctype/std.ascii is _not_ going away. Too many people find those functions to be useful. I grant you that too many programmers don't worry about unicode when they should, but there are so many issues surrounding the proper handling of unicode that programmers aren't going to get it right unless they're actually trying to get it right. D provides a lot of the tools to make unicode mostly work correctly out of the box, but it's still complicated enough that you can't expect it to "just work" without programmers having some clue of what they're doing. And forcing people to come up with their own functions for basic ASCII operations (which pretty much every other programming language has) isn't going to help any.

- Jonathan M Davis
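To make the isWhite vs isUniWhite cost difference concrete, here is a minimal sketch in C (used only for illustration; this is not the std.ctype or std.uni source). The function names are hypothetical. The ASCII check is one comparison plus one range test, while a check against the Unicode White_Space property has to consider several scattered code points and ranges.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASCII whitespace: space, plus tab, LF, VT, FF, CR (0x09..0x0D). */
bool is_ascii_white(uint32_t cp) {
    return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D);
}

/* Code points carrying the Unicode White_Space property.
   Hypothetical helper; a real library may use tables instead. */
bool is_unicode_white(uint32_t cp) {
    if (is_ascii_white(cp)) {
        return true; /* ASCII shortcut */
    }
    return cp == 0x85 || cp == 0xA0 || cp == 0x1680 ||
           (cp >= 0x2000 && cp <= 0x200A) ||
           cp == 0x2028 || cp == 0x2029 || cp == 0x202F ||
           cp == 0x205F || cp == 0x3000;
}
```

Even with the ASCII shortcut in front, every non-ASCII code point falls through to the longer chain of comparisons, which is the per-character cost described above.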
Re: Rename std.ctype to std.ascii?
"Jonathan M Davis" wrote: On 2011-06-14 11:53, Jouko Koski wrote: I would not consider it being good idea to include this kind of ascii-only utilities in the standard-ish library. For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII. Do we really need a common library utility for such a bounded domain? I would vote dropping ascii-only std.ctype altogether. Those who know and ensure that they are dealing with ascii-only, ebcdic-only or whatever-only representations can easily write their own utilities to their particular domains - maybe even better optimized than std.ctype because the domain may be even more restricted. A common use ascii-only utility will be used inevitably in places where it shouldn't. std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It deals with unicode characters, but it returns false for everything with them which returns a bool, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not. That is what (or ) utilities do when the default locale setting is in effect. Some other posters seem to suggest that a more generalized library module does this, too, without losing performance. -- Jouko
Re: Rename std.ctype to std.ascii?
On Tue, 14 Jun 2011 10:20:48 +0100, Jonathan M Davis wrote:
> So, given the arguably poor name of ctype and the fact that std.ctype does not actually match ctype.h's behavior, unless someone comes up with a really good reason not to fairly soon, I'm going to schedule std.ctype for deprecation and put the properly camelcased functions in std.ascii.

I reckon this is the best option.
Re: Rename std.ctype to std.ascii?
On 2011-06-14 11:53, Jouko Koski wrote:
> "Jonathan M Davis" wrote:
> > So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does.
>
> I would not consider it being good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used quarter of a century ago when ascii-only utilities were first suggested to the international C standardization committee.

For some classes of operations, it makes perfect sense to be checking for ASCII characters only. For others, it's just people not worrying about internationalization like they should be. For instance, format strings don't care about unicode as far as their escape sequences go. %a, %d, etc. are all pure ASCII. So, worrying about unicode with them just wouldn't make sense.

In most cases, isDigit working on the arabic numerals 0 through 9 is _exactly_ what people want and need. But if you were to try and make it more unicode-friendly, would Greek or Chinese numbers count as digits? Maybe, maybe not. It gets much more complicated. In some cases, all you care about with isUpper or toUpper is ASCII. In others, you want it to deal with unicode (and probably locales as well) properly.

std.ctype/std.ascii deals with ASCII for those situations where you really do only care about ASCII. It accepts unicode characters, but every function which returns a bool returns false for them, and it never tries to change their case. std.uni actually deals with unicode and worries about things like whether a unicode character is uppercase or not. They're for two different use cases. Most of Phobos should be dealing with unicode (e.g. pretty much everything in std.string should be using the std.uni functions rather than the std.ascii functions if there's a function which is in both), but there are cases where unicode doesn't matter, and you might as well have the efficiency available of just dealing with ASCII. Ultimately, it's up to the programmer to do the right thing.

- Jonathan M Davis
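The format-string case can be sketched as follows, in C for illustration only. The function name and the simplified grammar are assumptions: it recognises only "%" followed by a single ASCII letter, treats "%%" as a literal percent sign, and ignores flags, width and precision. Because every byte of a multi-byte UTF-8 sequence is 0x80 or above, none of them can be mistaken for '%' or a letter, so the scanner needs no unicode awareness at all.

```c
#include <stddef.h>

/* Counts simplified "%x" conversions in a UTF-8 format string.
   Hypothetical helper: no flags, width, or precision are parsed. */
size_t count_conversions(const char *fmt) {
    size_t count = 0;
    for (size_t i = 0; fmt[i] != '\0'; ++i) {
        if (fmt[i] != '%') {
            continue; /* non-ASCII bytes are simply skipped here */
        }
        char next = fmt[i + 1];
        if (next == '%') {
            ++i; /* "%%" is a literal percent sign */
        } else if ((next >= 'a' && next <= 'z') ||
                   (next >= 'A' && next <= 'Z')) {
            ++count;
            ++i; /* consume the conversion letter */
        }
    }
    return count;
}
```

A string such as "Größe: %d cm" is handled correctly even though it contains non-ASCII text, since only the ASCII bytes matter to the scanner.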
Re: Rename std.ctype to std.ascii?
On 14.06.2011 21:29, Timon Gehr wrote:
> Daniel Gibson wrote:
> > On 14.06.2011 20:58, Andrej Mitrovic wrote:
> > > Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?
> >
> > I haven't looked at either implementation, but on ASCII everything is really simple: isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc.
> > For Unicode this is most probably *much* harder (=> more expensive).
> >
> > Cheers,
> > - Daniel
>
> The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough.
>
> Timon

OK. I just looked at the implementation and it seems like there are ASCII-shortcuts in all those unicode functions. So I agree with Andrej, std.ctype isn't really needed.

Cheers,
- Daniel
Re: Rename std.ctype to std.ascii?
Daniel Gibson wrote:
> On 14.06.2011 20:58, Andrej Mitrovic wrote:
> > Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?
>
> I haven't looked at either implementation, but on ASCII everything is really simple: isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc.
> For Unicode this is most probably *much* harder (=> more expensive).
>
> Cheers,
> - Daniel

The implementation of toUniLower shortcuts on ASCII characters. I don't expect it to be any slower if not for inlineability. And if somebody really needs the speed, I feel manually writing if('A' <= c && c <= 'Z') c+=32; (or similar) is just good enough.

Timon
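The "shortcut on ASCII" pattern mentioned here can be sketched like this, in C for illustration only. This is not the toUniLower source; both function names are hypothetical, and the slow path is a deliberately tiny stand-in that covers only the Latin-1 capitals U+00C0..U+00DE (skipping U+00D7, the multiplication sign) where a real library would consult full Unicode case tables.

```c
#include <stdint.h>

/* Stand-in slow path: Latin-1 capitals only. A real implementation
   would consult complete Unicode case-mapping data. */
uint32_t slow_to_lower(uint32_t cp) {
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) {
        return cp + 32;
    }
    return cp;
}

/* ASCII is answered with two comparisons; everything else is
   delegated to the (potentially large) slow path. */
uint32_t to_lower_fast_path(uint32_t cp) {
    if (cp < 0x80) {
        return (cp >= 0x41 && cp <= 0x5A) ? cp + 32 : cp; /* 'A'..'Z' */
    }
    return slow_to_lower(cp);
}
```

The inlineability concern is visible in the shape of the code: the ASCII branch is small, but the function as a whole also has to carry the call into the slow path.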
Re: Rename std.ctype to std.ascii?
On 14.06.2011 20:58, Andrej Mitrovic wrote:
> Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?

I haven't looked at either implementation, but on ASCII everything is really simple: isalpha, isdigit, isupper and islower are just simple checks of whether the value is between two values, tolower(dchar c) is just return isupper(c) ? c+32 : c; etc.
For Unicode this is most probably *much* harder (=> more expensive).

Cheers,
- Daniel
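Spelled out, the range checks described above might look like the following minimal sketch. It is written in C for illustration, the helper names are hypothetical (they are neither the std.ctype nor the ctype.h implementations), and it assumes an ASCII-compatible character set.

```c
#include <stdbool.h>

/* Each predicate is one or two range comparisons. */
bool ascii_is_upper(int c) { return c >= 'A' && c <= 'Z'; }
bool ascii_is_lower(int c) { return c >= 'a' && c <= 'z'; }
bool ascii_is_digit(int c) { return c >= '0' && c <= '9'; }
bool ascii_is_alpha(int c) { return ascii_is_upper(c) || ascii_is_lower(c); }

/* In ASCII, upper- and lowercase letters differ by exactly 32. */
int ascii_to_lower(int c) { return ascii_is_upper(c) ? c + 32 : c; }
```

Values outside the ASCII letter and digit ranges fail every predicate and pass through ascii_to_lower unchanged.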
Re: Rename std.ctype to std.ascii?
Why does std.ctype exist anyway? Can't you use std.uni for both ASCII and UTF? Or is there some overhead in using the uni functions?
Re: Rename std.ctype to std.ascii?
"Jonathan M Davis" wrote: So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does. I would not consider it being good idea to include this kind of ascii-only utilities in the standard-ish library. It might be best to rename the module to std.ascii_for_insular_yankees_others_keep_away so that nobody would use it by accident. This way the name would also remind us about the historical terms which were used quarter of a century ago when ascii-only utilities were first suggested to the intenational C standardization committee. -- Jouko
Re: Rename std.ctype to std.ascii?
On 2011-06-14 02:51, David Nadlinger wrote:
> On 6/14/11 11:20 AM, Jonathan M Davis wrote:
> > On 2011-06-14 01:51, David Nadlinger wrote:
> > > But the functions in do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale:
> > > http://www.i18nguy.com/unicode/turkish-i18n.html
> > >
> > > This is probably another reason why it shouldn't be called std.ctype…
> >
> > From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!
>
> Oh, I was probably a bit unclear – what I meant is that it affects you also if you use only ASCII input, since toupper('i') == 221 when your locale is tr_TR.ISO-8859-9.

Yes, but the result is extended ASCII, so it doesn't affect anything which only deals with pure ASCII. ctype.h deals with extended ASCII, so locales actually affect what it's doing. std.ctype only deals in pure ASCII, so it wouldn't do anything which would result in a non-ASCII character, and so locales shouldn't matter at all. However, if you _do_ want to bring locales into it, then a locale like tr_TR.ISO-8859-9 is not going to be able to operate purely in ASCII, since the uppercase value of i is 221, which is extended ASCII.

So, yes I understood. It's just that as far as I can tell, locales don't matter if you're completely restricting yourself to ASCII like std.ctype does. And std.ctype is not going to try and deal with locales at this point (and likely not ever). I think that that is far better left to unicode. The Turkish locale is a great example of why you _want_ to be dealing with unicode when dealing with locales. std.ctype is for when you're specifically restricting yourself to ASCII (which sometimes can be very useful - e.g. with formatting strings or regex strings where all of the special characters are ASCII; using unicode functions would just make them slower at no benefit and would risk changing behavior based on locale if you brought locales into it). If you're not restricting yourself to ASCII, then std.uni is the way to go.

- Jonathan M Davis
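A locale-independent, ASCII-only uppercase conversion of the kind described here can be sketched as below. This is C for illustration with hypothetical function names, not the std.ctype code. It never consults the locale, so 'i' always maps to 'I', and bytes above 127 (such as 0xFD, the dotless i in ISO-8859-9) are left untouched instead of being mapped to another extended-ASCII value.

```c
#include <stddef.h>

/* Pure-ASCII uppercase: only 'a'..'z' are changed. */
int ascii_to_upper(int c) {
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}

/* Uppercases the ASCII letters of a NUL-terminated byte string in place. */
void ascii_upper_in_place(char *s) {
    for (size_t i = 0; s[i] != '\0'; ++i) {
        unsigned char b = (unsigned char)s[i];
        s[i] = (char)ascii_to_upper(b);
    }
}
```

Because the result never leaves the 0 through 127 range for ASCII input, the Turkish-locale behaviour quoted above (toupper('i') == 221) cannot occur with this approach.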
Re: Rename std.ctype to std.ascii?
On 6/14/11 11:20 AM, Jonathan M Davis wrote:
> On 2011-06-14 01:51, David Nadlinger wrote:
> > But the functions in do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale:
> > http://www.i18nguy.com/unicode/turkish-i18n.html
> >
> > This is probably another reason why it shouldn't be called std.ctype…
>
> From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

Oh, I was probably a bit unclear – what I meant is that it affects you also if you use only ASCII input, since toupper('i') == 221 when your locale is tr_TR.ISO-8859-9.

David
Re: Rename std.ctype to std.ascii?
On 2011-06-14 01:51, David Nadlinger wrote:
> On 6/14/11 8:23 AM, Jonathan M Davis wrote:
> > > What is your definition for ASCII character?
> > >
> > > Most of the functions (or macros) are locale dependent, see setlocale() and. And there is the, too.
> > >
> > > While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.
> >
> > ??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.
>
> But the functions in do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale:
> http://www.i18nguy.com/unicode/turkish-i18n.html
>
> This is probably another reason why it shouldn't be called std.ctype…

From the looks of it, that affects extended ASCII but not ASCII (since the Turkish uppercase I isn't even in ASCII). It's definitely a great link though. Thanks!

It may be that we'll want to improve std.uni to deal with locales in some manner (either by providing new functions which handle them or altering the current ones to handle them), but std.ctype is pure ASCII. And while I don't see how locales can affect pure ASCII, ctype.h appears to actually deal with extended ASCII rather than just ASCII (where locales _do_ matter). So, all in all, std.ctype definitely has different behavior than ctype.h, which makes the name std.ctype that much worse.

So, given the arguably poor name of ctype and the fact that std.ctype does not actually match ctype.h's behavior, unless someone comes up with a really good reason not to fairly soon, I'm going to schedule std.ctype for deprecation and put the properly camelcased functions in std.ascii.

- Jonathan M Davis
Re: Rename std.ctype to std.ascii?
On 6/14/11 8:23 AM, Jonathan M Davis wrote:
> > What is your definition for ASCII character?
> >
> > Most of the functions (or macros) are locale dependent, see setlocale() and. And there is the, too.
> >
> > While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.
>
> ??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

But the functions in do. And there can be some locale-dependent problems even if you use only ASCII, the most prominent being the different handling of »i« in the Turkish locale:
http://www.i18nguy.com/unicode/turkish-i18n.html

This is probably another reason why it shouldn't be called std.ctype…

David
Re: Rename std.ctype to std.ascii?
On Jun 14, 11 14:23, Jonathan M Davis wrote:
> On 2011-06-13 22:48, Jouko Koski wrote:
> > "Jonathan M Davis" wrote:
> > > std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).
> >
> > What is your definition for ASCII character?
> >
> > Most of the functions (or macros) are locale dependent, see setlocale() and. And there is the, too.
> >
> > While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.
>
> ??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.
>
> - Jonathan M Davis

std.ctype does not, but does. (which could be another reason it shouldn't be called std.ctype.)
Re: Rename std.ctype to std.ascii?
On 2011-06-13 22:48, Jouko Koski wrote:
> "Jonathan M Davis" wrote:
> > std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).
>
> What is your definition for ASCII character?
>
> Most of the functions (or macros) are locale dependent, see setlocale() and . And there is the , too.
>
> While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming.

??? std.ctype does _nothing_ with localization. And even if it did, that doesn't change what ASCII is. ASCII is made up of the values 0 through 127. And honestly, I have no clue how _those_ characters could be affected by locale. Extended-ASCII might be, but I wouldn't think that ASCII would be. Regardless, std.ctype does nothing with locale.

- Jonathan M Davis
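The definition of ASCII used in this reply (the values 0 through 127) translates directly into a one-comparison test. The following is a minimal sketch in C, for illustration only, with hypothetical helper names; it also shows a check over a whole byte string, which is one way to confirm that ASCII-only functions are appropriate for some input.

```c
#include <stdbool.h>
#include <stddef.h>

/* A byte is ASCII exactly when it lies in 0..127. */
bool is_ascii_byte(unsigned char b) {
    return b <= 0x7F;
}

/* True when every byte of a NUL-terminated string is ASCII. */
bool is_all_ascii(const char *s) {
    for (size_t i = 0; s[i] != '\0'; ++i) {
        if (!is_ascii_byte((unsigned char)s[i])) {
            return false;
        }
    }
    return true;
}
```

Nothing in the test depends on the current locale; the byte values 128 and above are where "extended ASCII" encodings and locales start to differ.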
Re: Rename std.ctype to std.ascii?
"Jonathan M Davis" wrote: std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters). What is your definition for ASCII character? Most of the functions (or macros) are locale dependent, see setlocale() and . And there is the , too. While the C standardized ways of doing things might not be most appropriate approach in D domain, we must not base our design decisions on deficient analysis. "I just want this text uppercase" is one of the hardest things in the _world_. The problem is not just the header or package naming. -- Jouko
Re: Rename std.ctype to std.ascii?
Come to think of it, I think I had a note in a todo somewhere that said "post a feature request to change ctype to ascii". It's a good standard name.
Re: Rename std.ctype to std.ascii?
I'm all for it. I've never liked ctype, and I got lost trying to find ascii functions since I didn't know where to look. The first time I saw ctype I thought it was a collection of C type aliases.. heh.
Re: Rename std.ctype to std.ascii?
On 2011-06-13 18:43, Jose Armando Garcia wrote:
> On Mon, Jun 13, 2011 at 10:28 PM, Jonathan M Davis wrote:
> > std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).
> >
> > std.uni, on the other hand, operates on characters just like std.ctype does, but it extends its charter to unicode characters (e.g. it has isUniUpper which _does_ work on unicode characters, unlike std.ctype's isupper).
> >
> > The thing is that aside from those familiar with C/C++, most programmers are likely to find the module name ctype to be rather uninformative. If they're looking for something like isdigit, they're not terribly likely to go looking at std.ctype first. And I'm not sure that std.ascii will be all that much more obvious to them, but it fits in much better with std.uni. std.ascii gets the character functions which operate only on ASCII characters, and std.uni gets the character functions which operate on unicode characters in addition to ASCII characters.
> >
> > I don't think that the change of module name is enough of an improvement to merit changing the name just because ctype is arguably bad. However, as it turns out, _no_ function in std.ctype is properly camelcased, and many of them return int instead of bool (which the C functions they're modeled after do but which is not particularly D-like and can cause problems when you actually _need_ them to return bool). And it has been made very clear in past discussions in this newsgroup that the consensus is that we prefer that Phobos functions follow Phobos' naming conventions (which means camelcasing) rather than matching the casing of functions in other languages. So, all of the functions in std.ctype need to be renamed.
> >
> > I now have a pull request which creates properly camelcased versions of all of them ( https://github.com/D-Programming-Language/phobos/pull/101 ). The thing is though that because _every_ function in std.ctype is renamed, the cost of renaming the entire module (as far as people updating their code to use functions such as isDigit instead of isdigit goes) is essentially the same as just renaming the functions in-place. In either case, the old functions will go through the full deprecation process before they're actually gone, so no one's code will suddenly break because of the changes, but any code that uses the old functions will eventually have to be changed to use the properly named ones. And since the cost to making those changes is essentially the same whether we replace the whole std.ctype module or whether we replace all of its functions, I'm wondering whether it would be worthwhile to take this opportunity to rename std.ctype?
> >
> > I don't think that the name change is enough of an improvement to do it if it's going to break everyone's code, but given that fixing all of its functions gives us a perfect opportunity to rename it at no additional cost, I feel that the question should be posed.
> >
> > Should we rename std.ctype to std.ascii? Or should we just keep the old name, which is familiar to C programmers?
> >
> > - Jonathan M Davis
>
> or deprecate std.ctype and create a new std.ascii.

Well, yes. That's what would be happening. All of the old functions would be in std.ctype and put on the deprecation path, while the new std.ascii would have the new, properly camelcased functions in it. But what that's effectively doing is renaming std.ctype to std.ascii. It's just that std.ctype will stick around with its old functions until it's gone through the full deprecation cycle.

- Jonathan M Davis
Re: Rename std.ctype to std.ascii?
On Mon, Jun 13, 2011 at 10:28 PM, Jonathan M Davis wrote:
> std.ctype is modeled after C's ctype.h. It has functions for operating on characters - particularly functions which indicate the type of a character (I believe that ctype stands for character type, so that makes sense). For instance, isdigit will tell you whether a particular character is a digit. It only works on ASCII characters (non-ASCII characters return false for functions like isdigit and functions like toupper do nothing to non-ASCII characters).
>
> std.uni, on the other hand, operates on characters just like std.ctype does, but it extends its charter to unicode characters (e.g. it has isUniUpper which _does_ work on unicode characters, unlike std.ctype's isupper).
>
> The thing is that aside from those familiar with C/C++, most programmers are likely to find the module name ctype to be rather uninformative. If they're looking for something like isdigit, they're not terribly likely to go looking at std.ctype first. And I'm not sure that std.ascii will be all that much more obvious to them, but it fits in much better with std.uni. std.ascii gets the character functions which operate only on ASCII characters, and std.uni gets the character functions which operate on unicode characters in addition to ASCII characters.
>
> I don't think that the change of module name is enough of an improvement to merit changing the name just because ctype is arguably bad. However, as it turns out, _no_ function in std.ctype is properly camelcased, and many of them return int instead of bool (which the C functions they're modeled after do but which is not particularly D-like and can cause problems when you actually _need_ them to return bool). And it has been made very clear in past discussions in this newsgroup that the consensus is that we prefer that Phobos functions follow Phobos' naming conventions (which means camelcasing) rather than matching the casing of functions in other languages. So, all of the functions in std.ctype need to be renamed.
>
> I now have a pull request which creates properly camelcased versions of all of them ( https://github.com/D-Programming-Language/phobos/pull/101 ). The thing is though that because _every_ function in std.ctype is renamed, the cost of renaming the entire module (as far as people updating their code to use functions such as isDigit instead of isdigit goes) is essentially the same as just renaming the functions in-place. In either case, the old functions will go through the full deprecation process before they're actually gone, so no one's code will suddenly break because of the changes, but any code that uses the old functions will eventually have to be changed to use the properly named ones. And since the cost to making those changes is essentially the same whether we replace the whole std.ctype module or whether we replace all of its functions, I'm wondering whether it would be worthwhile to take this opportunity to rename std.ctype?
>
> I don't think that the name change is enough of an improvement to do it if it's going to break everyone's code, but given that fixing all of its functions gives us a perfect opportunity to rename it at no additional cost, I feel that the question should be posed.
>
> Should we rename std.ctype to std.ascii? Or should we just keep the old name, which is familiar to C programmers?
>
> - Jonathan M Davis

or deprecate std.ctype and create a new std.ascii.
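The int-versus-bool point in the quoted post can be illustrated from the C side. C's isdigit is only specified to return a nonzero int for a match, not necessarily 1, so code that compares the result against 1 or against another classification result can misbehave. The sketch below is C for illustration; is_digit_bool is a hypothetical wrapper name, not a std.ascii function.

```c
#include <ctype.h>
#include <stdbool.h>

/* Wraps isdigit so callers always get a proper true/false value.
   The range guard keeps the argument inside pure ASCII, which also
   avoids passing isdigit a value it is not defined for. */
bool is_digit_bool(int c) {
    return c >= 0 && c <= 127 && isdigit(c) != 0;
}
```

With a bool result, an expression such as is_digit_bool(a) == is_digit_bool(b) compares truth values, which is not guaranteed for the raw int results.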