Re: auto-decoding
On Sunday, 1 April 2018 at 02:44:32 UTC, Uknown wrote:
> If you want to stop auto-decoding, you can use std.string.representation like this:
>
>     import std.string : representation;
>     auto no_decode = some_string.representation;
>
> Now no_decode won't be auto-decoded, and you can use it in place of some_string. You can also use std.uni to decode by graphemes instead.

.representation gives you an immutable(ubyte)[] for a string (a const(ubyte)[] for a const(char)[]). What you typically want is to keep working with chars; for this you can use std.utf.byCodeUnit, which gives you a range of char code units that is not auto-decoded:
https://dlang.org/phobos/std_utf.html#byCodeUnit

There's also this good article: https://tour.dlang.org/tour/en/gems/unicode
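To make the difference between the two approaches concrete, here is a minimal sketch. The sample string "B\u00E4r" and the variable names are just for illustration; the element types asserted are the ones I'd expect from Phobos, so treat this as a sketch rather than a reference.

```d
import std.range.primitives : ElementType;
import std.string : representation;
import std.utf : byCodeUnit;

void main()
{
    // "Bär" with a precomposed U+00E4: 4 UTF-8 code units ('B', 0xC3, 0xA4, 'r').
    string s = "B\u00E4r";

    // representation reinterprets the string as raw bytes and keeps its qualifier.
    auto bytes = s.representation;
    static assert(is(typeof(bytes) == immutable(ubyte)[]));
    assert(bytes == [0x42, 0xC3, 0xA4, 0x72]);

    // byCodeUnit wraps the string in a range whose elements are still chars.
    auto units = s.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == immutable(char)));
    assert(units.length == 4);

    // Used directly as a range, the same string is auto-decoded to dchar.
    static assert(is(ElementType!string == dchar));
}
```

Which one to pick depends on what the code downstream expects: raw bytes suit hashing or I/O, while byCodeUnit keeps the "this is text" typing.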
Re: auto-decoding
On Sunday, 1 April 2018 at 01:19:08 UTC, auto wrote:
> What is auto decoding and why is it a problem?

Auto-decoding is tied to the UTF representation of Unicode strings. In D, `char[]` and `string` represent UTF8 strings, `wchar[]` and `wstring` represent UTF16 strings and `dchar[]` and `dstring` represent UTF32 strings. You need to know how UTF works in order to understand auto-decoding. Since in practice most code deals with UTF8, I'll explain with respect to that.

Essentially, the problem comes down to the fact that not all Unicode characters are representable by a single 8-bit `char` (for UTF8). Only the ASCII stuff is represented the "normal" way, one `char` per character. UTF8 uses the fact that the highest bit of a `char` is never set in ASCII: for a non-ASCII character, the high bits of the first `char` tell how many more `char`s that character is encoded in. You can watch this video for a better understanding [0].

By default though, if one were to traverse a `char[]` one `char` at a time looking for characters, they would get unexpected results with non-ASCII data. Auto-decoding tries to solve this by automatically applying the algorithm to decode the `char`s into Unicode "code points" whenever a string is used as a range. This is where my knowledge ends though. I'll give you pros and cons of auto-decoding.

Pros:
* It makes Unicode string handling much easier for beginners.
* Much less effort in general, it seems to "just work™"

Cons:
* It makes string handling slow by default.
* It may be the wrong thing, since you may not want Unicode code points, but graphemes instead.
* Auto-decoding throws exceptions on reaching invalid UTF sequences, so all string handling code in general throws exceptions.

If you want to stop auto-decoding, you can use std.string.representation like this:

    import std.string : representation;
    auto no_decode = some_string.representation;

Now no_decode won't be auto-decoded, and you can use it in place of some_string. You can also use std.uni to decode by graphemes instead.

You should also read this blog post: https://jackstouffer.com/blog/d_auto_decoding_and_you.html
And this forum post: https://forum.dlang.org/post/eozguhavggchzzruz...@forum.dlang.org

[0]: https://www.youtube.com/watch?v=MijmeoH9LT4
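The three levels mentioned in the post (code units, code points, graphemes) can be told apart with a small sketch. The sample string, an 'e' followed by a combining accent, is my own choice for illustration and is not from the thread; it is picked because all three counts differ.

```d
import std.range : front, walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): displays as one "é".
    string s = "e\u0301";

    // .length counts UTF-8 code units: 1 for 'e' plus 2 for U+0301.
    assert(s.length == 3);

    // Range primitives auto-decode, so front yields a dchar code point...
    static assert(is(typeof(s.front) == dchar));
    // ...and walking the string counts code points.
    assert(s.walkLength == 2);

    // byCodeUnit opts out of decoding and sees the 3 code units again.
    assert(s.byCodeUnit.walkLength == 3);

    // byGrapheme groups code points into user-perceived characters.
    assert(s.byGrapheme.walkLength == 1);
}
```

This is also why "it may be the wrong thing" is on the cons list: the auto-decoded count of 2 matches neither the storage size nor what a user would call one character.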
auto-decoding
What is auto decoding and why is it a problem?
Re: Auto-decoding
On Saturday, 15 July 2017 at 18:47:25 UTC, Joakim wrote:
> On Saturday, 15 July 2017 at 18:14:48 UTC, aberba wrote:
>> So what is the current plan? :)
>
> Andrei has talked about having a non-auto-decoding path for those who know what they're doing and actively choose that path, while keeping auto-decoding the default, so as not to break existing code. Jack has been submitting PRs for this, but it is probably tedious work, so progress is slow and I don't know how much more remains to be done:
> https://github.com/dlang/phobos/pulls?q=is%3Apr+auto-decoding+is%3Aclosed

The idea is that once DIP1000 has matured, more focus will be given to compiler support for reference counting, with the aim of improving the @nogc experience. One example is DIP1008 for @nogc exceptions [1], but another one that is important in this context is RCString [2]. The idea is that RCString will be a new opt-in string type without auto-decoding and without the GC.

Another idea in play is `version(NoAutoDecode)`:
https://github.com/dlang/phobos/pull/5513
However, it's unfortunately still unclear whether that could result in a working solution.

[1] https://github.com/dlang/DIPs/blob/master/DIPs/DIP1008.md
[2] https://github.com/dlang/phobos/pull/4878
Re: Auto-decoding
On Saturday, 15 July 2017 at 18:14:48 UTC, aberba wrote:
> On Saturday, 15 July 2017 at 05:54:32 UTC, ag0aep6g wrote:
>> On 07/15/2017 06:21 AM, bauss wrote:
>>> [...]
>>
>> 1) Drop two elements from "Bär". With auto-decoding you get "r", which is nice. Without auto-decoding you get [0xA4, 'r'] where 0xA4 is the second half of the encoding of 'ä'. You have to know your Unicode to understand what is going on there.
>> [...]
>
> So what is the current plan? :)

Andrei has talked about having a non-auto-decoding path for those who know what they're doing and actively choose that path, while keeping auto-decoding the default, so as not to break existing code. Jack has been submitting PRs for this, but it is probably tedious work, so progress is slow and I don't know how much more remains to be done:
https://github.com/dlang/phobos/pulls?q=is%3Apr+auto-decoding+is%3Aclosed
Re: Auto-decoding
On 07/15/2017 08:14 PM, aberba wrote:
> So what is the current plan? :)

As far as I'm aware, there's no concrete plan to change anything. We just gotta deal with auto-decoding for the time being.
Re: Auto-decoding
On Saturday, 15 July 2017 at 05:54:32 UTC, ag0aep6g wrote:
> On 07/15/2017 06:21 AM, bauss wrote:
>> [...]
>
> 1) Drop two elements from "Bär". With auto-decoding you get "r", which is nice. Without auto-decoding you get [0xA4, 'r'] where 0xA4 is the second half of the encoding of 'ä'. You have to know your Unicode to understand what is going on there.
> [...]

So what is the current plan? :)
Re: Auto-decoding
On 07/15/2017 06:21 AM, bauss wrote:
> I understand what it is and how it works, but I don't understand how it solves any problems. Could someone give an example of when auto-decoding actually is useful in contrast to not using it?

1) Drop two elements from "Bär". With auto-decoding you get "r", which is nice. Without auto-decoding you get [0xA4, 'r'] where 0xA4 is the second half of the encoding of 'ä'. You have to know your Unicode to understand what is going on there.

2) Search for 'ä' (one wchar/dchar) in the `string` "Bär". With auto-decoding, you pop the 'B' and then there's your 'ä'. Without auto-decoding, you can't find 'ä', because "Bär" doesn't have a single element that matches 'ä'. You have to search for "ä" (two `char`s) instead.

The goal of auto-decoding was to make it so that you don't have to think about Unicode all the time when processing strings. Instead you could think in terms of "characters". But auto-decoding falls flat on that goal, which is why it's disliked. You still have to think about Unicode stuff for correctness (combining characters, graphemes), and now you also have to worry about the performance of auto-decoding.
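Both cases can be checked with a few lines. This is only a sketch of the two examples above: the \u00E4 escapes are used so that 'ä' is guaranteed to be the precomposed code point, and the names s, rest, needle and found are made up for illustration.

```d
import std.algorithm.searching : canFind;
import std.range : drop;
import std.utf : byCodeUnit;

void main()
{
    // "Bär" with a precomposed U+00E4, stored as 'B', 0xC3, 0xA4, 'r'.
    string s = "B\u00E4r";

    // 1) Drop two elements.
    // Auto-decoded: elements are code points, so 'B' and 'ä' are dropped.
    assert(s.drop(2) == "r");

    // Not decoded: elements are code units, so only 'B' and 0xC3 are dropped.
    auto rest = s.byCodeUnit.drop(2);
    assert(rest.length == 2);
    assert(rest[0] == 0xA4 && rest[1] == 'r');

    // 2) Search for 'ä' as a single element.
    dchar needle = '\u00E4';

    // Auto-decoded: the second code point matches.
    assert(s.canFind(needle));

    // Not decoded: foreach over char visits code units, and none equals 0xE4.
    bool found = false;
    foreach (char c; s)
        if (c == needle)
            found = true;
    assert(!found);

    // The needle has to be the two-char string "ä" instead.
    assert("\u00E4".length == 2);
    assert(s.canFind("\u00E4"));
}
```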
Auto-decoding
I understand what it is and how it works, but I don't understand how it solves any problems. Could someone give an example of when auto-decoding actually is useful in contrast to not using it?

Just trying to get an understanding of what exactly its purpose is. I did read https://jackstouffer.com/blog/d_auto_decoding_and_you.html

But I still feel like there's not a clear explanation of what issues exist when you don't have it. If I need to be more clear, just let me know.