[Issue 14519] Get rid of unicode validation in string processing
https://issues.dlang.org/show_bug.cgi?id=14519

Iain Buclaw changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           Priority|P1      |P4

--
https://issues.dlang.org/show_bug.cgi?id=14519

Vladimir Panteleev changed:

           What     |Removed |Added
----------------------------------------------------------------------------
           Component|dmd     |druntime

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #39 from Vladimir Panteleev ---
*** Issue 22473 has been marked as a duplicate of this issue. ***

--
https://issues.dlang.org/show_bug.cgi?id=14519

Walter Bright changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           See Also|        |https://issues.dlang.org/show_bug.cgi?id=20134

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #38 from Martin Nowak ---
(In reply to Vladimir Panteleev from comment #36)
> Question, is there any overhead in actually verifying the validity of UTF-8
> streams, or is all overhead related to error handling (i.e. inability to be
> nothrow)?

I think it's fairly measurable b/c you need to add lots of additional checks
and branches (though highly predictable ones).
While my initial decode implementation
https://github.com/MartinNowak/phobos/blob/1b0edb728c/std/utf.d#L577-L651
was transmogrified into 200 lines in the meantime
https://github.com/dlang/phobos/blob/acafd848d8/std/utf.d#L1167-L1369,
you can still use it to benchmark validation.
I did run a lot of benchmarks when introducing that function, and the code
path for decoding just remains slow, even with the throwing code path moved
out of the normal control flow.

--
https://issues.dlang.org/show_bug.cgi?id=14519

Jack Stouffer changed:

           What|Removed |Added
----------------------------------------------------------------------------
             CC|        |j...@jackstouffer.com

--- Comment #37 from Jack Stouffer ---
This entire discussion is moot unless you get Andrei on board with a breaking
change to a very fundamental part of the language.

--
https://issues.dlang.org/show_bug.cgi?id=14519

Vladimir Panteleev changed:

           What    |Removed |Added
----------------------------------------------------------------------------
           See Also|        |https://issues.dlang.org/show_bug.cgi?id=14919

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #36 from Vladimir Panteleev ---
Question, is there any overhead in actually verifying the validity of UTF-8
streams, or is all overhead related to error handling (i.e. inability to be
nothrow)?

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #35 from Jonathan M Davis ---
(In reply to Martin Nowak from comment #32)
> Summary:
>
> We should adopt a new model of unicode validations.
> The current one where every string processing function decodes unicode
> characters and performs validation causes too much overhead.
> A better alternative would be to perform unicode validation once when
> reading raw data (ubyte[]) and then assume any char[]/wchar[]/dchar[] is a
> valid unicode string.
> Invalid encodings introduced by string processing algorithms are programming
> bugs and thus do not warrant runtime checks in release builds.

Exactly.

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #34 from Vladimir Panteleev ---
(In reply to Martin Nowak from comment #31)
> BTW, this is what I already wrote in comment 23. Not sure why you only
> partially quoted my answer to suggest a contradiction.

Err, well, to be fair, you did not state this clearly in comment 23, which is
why I asked for a clarification. I was not trying to maliciously nitpick your
words, I just tried to understand your point.

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #33 from Sobirari Muhomori ---
Removing autodecoding is good, but this issue is about making autodecode
nothrow @nogc.

--
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #32 from Martin Nowak ---
Summary:

We should adopt a new model of unicode validations.
The current one where every string processing function decodes unicode
characters and performs validation causes too much overhead.
A better alternative would be to perform unicode validation once when reading
raw data (ubyte[]) and then assume any char[]/wchar[]/dchar[] is a valid
unicode string.
Invalid encodings introduced by string processing algorithms are programming
bugs and thus do not warrant runtime checks in release builds.

Also see https://github.com/D-Programming-Language/druntime/pull/1279

--
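[Editorial sketch] The validate-once model described in comment #32 amounts to
a single pass over the raw bytes at the I/O boundary, after which downstream
string code may assume validity. This C sketch is illustrative only (the
function name and layout are my own, not druntime code); it rejects the usual
ill-formed cases: bad lead bytes, truncated or malformed continuations,
overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>

/* One-pass UTF-8 well-formedness check over raw input (hypothetical name).
 * Returns true iff s[0..n) is valid UTF-8. */
bool utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t len;
        unsigned long cp, min; /* decoded code point and its minimum for len */
        if (c < 0x80) { i++; continue; }                      /* ASCII fast path */
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return false;            /* stray continuation or invalid lead */
        if (i + len > n) return false;                  /* truncated sequence */
        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false; /* bad continuation */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return false;                     /* overlong encoding */
        if (cp > 0x10FFFF) return false;                /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* UTF-16 surrogate */
        i += len;
    }
    return true;
}
```

Note how every multi-byte sequence costs several data-dependent branches; this
is the "lots of additional checks and branches" overhead comment #38 refers
to, and why paying it once at input time is attractive.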
https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #31 from Martin Nowak ---
(In reply to Martin Nowak from comment #30)
> Well, b/c they contain delimited binary and ASCII data, you'll have to find
> those delimiters, then validate and cast the ASCII part to a string, and can
> then use std.string functions.

BTW, this is what I already wrote in comment 23. Not sure why you only
partially quoted my answer to suggest a contradiction.

--
https://issues.dlang.org/show_bug.cgi?id=14519

Martin Nowak changed:

           What   |Removed                  |Added
----------------------------------------------------------------------------
           Summary|[Enh] foreach on strings |Get rid of unicode
                  |should return            |validation in string
                  |replacementDchar rather  |processing
                  |than throwing            |

--
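[Editorial sketch] The issue's original summary proposed yielding
replacementDchar (U+FFFD) for invalid sequences instead of throwing, which is
what would let decoding be nothrow and @nogc. A minimal C illustration of that
error-handling strategy (function name and details are my assumptions, not
Phobos code): on any malformed sequence, consume one byte and return U+FFFD,
so the caller never needs an exception path.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu /* U+FFFD, Unicode replacement character */

/* Decode one code point from s[*i..n); on any invalid sequence, consume a
 * single byte and yield U+FFFD instead of signalling an error. */
uint32_t decode_or_replace(const unsigned char *s, size_t n, size_t *i)
{
    unsigned char c = s[*i];
    size_t len;
    uint32_t cp, min;
    if (c < 0x80) { (*i)++; return c; }                    /* ASCII */
    if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else { (*i)++; return REPLACEMENT_CHAR; }    /* invalid lead byte */
    if (*i + len > n) { (*i)++; return REPLACEMENT_CHAR; } /* truncated */
    for (size_t k = 1; k < len; k++) {
        if ((s[*i + k] & 0xC0) != 0x80) { (*i)++; return REPLACEMENT_CHAR; }
        cp = (cp << 6) | (s[*i + k] & 0x3F);
    }
    *i += len;
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return REPLACEMENT_CHAR; /* overlong, out of range, or surrogate */
    return cp;
}
```

Since no error ever escapes, a loop over such a decoder has no throwing path
at all; the trade-off, as discussed above, is that the validation branches
themselves remain on the hot path.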