Re: Range of chars (narrow string ranges)

Jonathan M Davis via Digitalmars-d Tue, 28 Apr 2015 09:51:09 -0700

On Tuesday, 28 April 2015 at 09:11:10 UTC, Chris wrote:

Would it be much work to show example code or even an experimental module that gets rid of auto-decoding, so we could see what would be affected in general and how actual code we have would be affected by it?
The topic keeps coming up again and again, and while I'm in favor of anything that enhances performance, I'm afraid of having to refactor large chunks of my code. However, this fear may be unfounded, but I would need some examples to visualize the problem.

Honestly, most code won't care. If we just switched out all of the auto-decoding right now, pretty much anything using only ASCII would just work, and most anything that's trying to manipulate ASCII characters in a Unicode string will just work, whereas code that's specifically manipulating Unicode characters might have problems (e.g. comparing front with a dchar will no longer have the same result, since front would just be the first code unit rather than necessarily the first code point). Since most Phobos range-based functions which operate on strings are special-cased on strings already, many of them would continue to just work (e.g. find returns the same range type as what's passed to it even if it's given a string, so it might just work with the change, or it might need to be tweaked slightly). Those that don't would then generally either need to call encode on an argument to make it match the string type in the cases where string types mix (e.g. "foo".find("fo"d) would need to call encode on "fo"d to make it a string for comparison), or the caller would need to use std.utf.byDchar or std.uni.byGrapheme to operate on code points or graphemes rather than code units.
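To make the front example concrete, here is a minimal sketch. It assumes std.utf.byCodeUnit is a fair stand-in for what a string would look like as a range with auto-decoding removed; that is my approximation for illustration, not a statement of how the change would actually be implemented.

```d
import std.range : ElementType, front, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    // 'é' is U+00E9, which UTF-8 encodes as the two code units 0xC3 0xA9.
    string s = "état";

    // Today: a string is a range of dchar, so front is a decoded code point.
    static assert(is(ElementType!string == dchar));
    assert(s.front == 'é');
    assert(s.walkLength == 4);      // four code points

    // Approximation of "no auto-decoding": front is only the first code unit.
    auto units = s.byCodeUnit;
    static assert(is(Unqual!(ElementType!(typeof(units))) == char));
    assert(units.front == 0xC3);    // not 'é' any more
    assert(units.length == 5);      // five UTF-8 code units

    // Asking for code points explicitly brings back the old result.
    assert(s.byDchar.front == 'é');
}
```

An ASCII-only string would pass the same comparisons on either path, which is why most code shouldn't notice.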

The two biggest places in Phobos that would potentially have problems are functions that special-cased strings but still used front and those which have to return a new range type. e.g. filter would be a good example, because it's forced to return a new range type. Right now, it would filter on dchars, but with the change, it would filter on the code unit type (most typically char). If you're filtering on ASCII characters, it wouldn't matter aside from the fact that the resulting range would have an element type of char rather than dchar, but if you're filtering on Unicode characters, it wouldn't work anymore. For situations like that, you'd be forced to use std.utf.byDchar or std.uni.byGrapheme. However, since most string code tends to operate on substrings rather than characters, I don't know how common it even is to use a function like filter on a string (as opposed to a range of strings). Such code might actually be fairly rare.
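A rough sketch of the filter case, again assuming byCodeUnit approximates a string without auto-decoding (the constant eAcute is just a name picked for this example):

```d
import std.algorithm : equal, filter;
import std.range : ElementType;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

void main()
{
    enum dchar eAcute = 'é';   // U+00E9, two code units in UTF-8
    string s = "naïve café";

    // Today: filter sees decoded code points, so a non-ASCII predicate works.
    auto decoded = s.filter!(c => c != eAcute);
    static assert(is(ElementType!(typeof(decoded)) == dchar));
    assert(decoded.equal("naïve caf"));

    // On code units, an ASCII predicate still behaves; only the element
    // type of the resulting range changes from dchar to char.
    auto noSpaces = s.byCodeUnit.filter!(c => c != ' ');
    static assert(is(Unqual!(ElementType!(typeof(noSpaces))) == char));
    assert(noSpaces.equal("naïvecafé".byCodeUnit));

    // A non-ASCII predicate silently stops matching: no single UTF-8
    // code unit in s equals U+00E9.
    assert(s.byCodeUnit.filter!(c => c == eAcute).empty);

    // byDchar restores the code point behavior.
    assert(s.byDchar.filter!(c => c != eAcute).equal("naïve caf"));
}
```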

So, there _are_ a few functions which stop working the same way in a potentially silent manner if we just made it so that front didn't autodecode anymore. However, in general, because Phobos almost always special-cases strings, calls to Phobos functions probably wouldn't need to change in most cases, and when they do, a call to byDchar would restore the old behavior. But of course, we'd want to do the transition in a way that didn't result in silent behavioral changes that would break code, even though in most cases, it wouldn't matter, because most code will be operating on ASCII strings even if the strings themselves contain Unicode - e.g. unicodeString.find(asciiString) is far more common than unicodeString.find(otherUnicodeString).
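As an illustration of the common case, this sketch shows what find does today with an ASCII needle in a Unicode haystack, plus the mixed-encoding call mentioned earlier. It only demonstrates current behavior; that these calls would keep working unchanged without auto-decoding is the expectation described above, not something the code proves.

```d
import std.algorithm : find;

void main()
{
    string haystack = "größe: 42";   // Unicode haystack

    // ASCII needle: the result is a slice of the original string,
    // so the caller never sees individual code points.
    auto r = haystack.find(": ");
    static assert(is(typeof(r) == string));
    assert(r == ": 42");

    // Mixed string types: a dstring needle in a string haystack works today.
    auto m = "foo".find("fo"d);
    static assert(is(typeof(m) == string));
    assert(m == "foo");
}
```

Since UTF-8 is self-synchronizing, a valid needle can only match at code point boundaries of a valid haystack, so a substring search like this gives the same answer whether it compares code units or code points.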

I suspect that the code that's at the greatest risk is code that checks for is(Unqual!(ElementType!Range) == dchar) to operate on strings and wrapper ranges around strings, since it would then only match the cases where byDchar had been used. In general though, the code that's going to run into the most trouble is user code that contains range-based functions similar to what you might find in Phobos rather than code that's simply using the Phobos functions like startsWith and find - i.e. if you're writing range-based code that worries about doing stuff like special-casing strings or which specifically needs to operate on code points, then you're going to have to make changes, whereas to a great extent, if all you're doing is passing strings to Phobos functions, your code will tend to just work.
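A small sketch of that kind of check. The template yieldsCodePoints is a hypothetical name invented for this example, and byCodeUnit again stands in for a string without auto-decoding:

```d
import std.range : ElementEncodingType, ElementType, isInputRange;
import std.traits : Unqual;
import std.utf : byCodeUnit, byDchar;

// Hypothetical constraint of the kind user code might contain.
enum bool yieldsCodePoints(R) =
    isInputRange!R && is(Unqual!(ElementType!R) == dchar);

// Matches a plain string today, but only because of auto-decoding:
// the encoding type of a string is still char.
static assert(yieldsCodePoints!string);
static assert(is(Unqual!(ElementEncodingType!string) == char));

// A code unit range does not match...
static assert(!yieldsCodePoints!(typeof(byCodeUnit(""))));

// ...whereas a range produced by byDchar still does.
static assert(yieldsCodePoints!(typeof(byDchar(""))));

void main() {}
```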

To actually see what the impact would be, we'd have to just change Phobos, I think, and then see what the impact was on user code. It could be surprising how much or how little it affects things, though in most cases, I expect that it'll mean that code will just work. And if we really wanted to do that, we could create a version flag that turned off autodecoding and version the changes in Phobos appropriately to see what we got. In many cases, if we simply made sure that Phobos functions which special-cased strings didn't use front directly but instead didn't care whether they were operating on ranges of char, wchar, or dchar, then we wouldn't even need to version anything (e.g. find could easily be made to work that way if it doesn't already), but some functions (like filter) would need to be versioned differently.

So, maybe what we need to do to start is to just go through Phobos and make as many functions as possible not care about whether they're dealing with strings as ranges of char, wchar, or dchar. And at least then, we'd minimize how much code would have to be versioned differently if we were to test out getting rid of autodecoding with versioning.


- Jonathan M Davis
