Re: Today's programming challenge - How's your Range-Fu ?
On Mon, Apr 20, 2015 at 06:03:49PM +, John Colvin via Digitalmars-d wrote:
> On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> > To measure the columns needed to print a string, you'll need the
> > number of graphemes. (d|)?string.length gives you the number of code
> > units.
>
> Even that's not really true. In the end it's up to the font and layout
> engine to decide how much space anything takes up. Unicode doesn't
> play nicely with the idea of text as a grid of rows and fixed-width
> columns of characters, although quite a lot can (and is, see urxvt for
> example) be shoe-horned in.

Yeah, even the grapheme count does not necessarily tell you how wide the printed string really is. The characters in the CJK block are usually rendered with fonts that are, on average, twice as wide as your typical Latin/Cyrillic character, so even applications like urxvt that shoehorn proportional-width fonts into a text grid render CJK characters as two columns rather than one. Because of this, I actually wrote a function at one time to determine the width of a given Unicode character (i.e., zero, single, or double) as displayed in urxvt. Obviously, this is no help if you need to wrap lines rendered with a proportional font. And it doesn't even attempt to work correctly with bidi text.

This is why I said at the beginning that wrapping a line of text is a LOT harder than it sounds. A function that only takes a string as input does not have the necessary information to do this correctly in all use cases. The current wrap() function doesn't even do it correctly modulo the information available: it doesn't handle combining diacritics and zero-width characters properly. In fact, it doesn't even handle control characters properly, except perhaps for \t and \n. There are so many things wrong with the current wrap() function (and many other string-processing functions in Phobos) that it makes it look like a joke when we claim that D provides Unicode correctness out-of-the-box.
The only use case where wrap() gives the correct result is when you stick with pre-Unicode Latin strings to be displayed on a text console. As such, I don't really see the general utility of wrap() as it currently stands, and I question its value in Phobos, as opposed to an actually more useful implementation that, for instance, correctly implements the Unicode line-breaking algorithm. T -- It said to install Windows 2000 or better, so I installed Linux instead.
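A character-width function of the sort described above might look roughly like this in D. This is an illustrative sketch only: the name displayWidth is my own, and the handful of wide ranges below is a crude approximation, nowhere near the full Unicode EastAsianWidth table that a real implementation (or urxvt) would consult:

```d
import std.uni : isMark;

// Approximate column width of a single code point in a terminal grid:
// 0 for combining marks and zero-width space, 2 for a few common East
// Asian wide/fullwidth blocks, 1 for everything else.
int displayWidth(dchar c)
{
    if (isMark(c) || c == '\u200B')      // combining marks, zero-width space
        return 0;
    // a few major wide blocks; the real EastAsianWidth data is much larger
    if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo
        (c >= 0x2E80 && c <= 0xA4CF) ||  // CJK radicals .. Yi
        (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
        (c >= 0xF900 && c <= 0xFAFF) ||  // CJK compatibility ideographs
        (c >= 0xFF00 && c <= 0xFF60))    // fullwidth forms
        return 2;
    return 1;
}

void main()
{
    import std.algorithm : map, sum;
    assert(displayWidth('a') == 1);
    assert(displayWidth('\u0301') == 0); // combining acute: zero columns
    assert(displayWidth('漢') == 2);     // CJK ideograph: two columns
    // summing per code point gives the printed width of the whole string
    assert("e\u0301漢".map!displayWidth.sum == 3);
}
```

Note that even this only covers the monospace-terminal case; as said above, it is no help for proportional fonts or bidi text.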
Re: Today's programming challenge - How's your Range-Fu ?
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.

Even that's not really true. In the end it's up to the font and layout engine to decide how much space anything takes up. Unicode doesn't play nicely with the idea of text as a grid of rows and fixed-width columns of characters, although quite a lot can (and is, see urxvt for example) be shoe-horned in.
Re: Today's programming challenge - How's your Range-Fu ?
> This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.

There are three things that you need to be aware of when handling Unicode: code units, code points and graphemes. In general the length of one says nothing about the length of the others, except for UTF-32, which has a 1:1 mapping between code units and code points. In this thread, we were discussing the relationship between code points and graphemes. Your examples, however, apply to the relationship between code units and code points.

To measure the columns needed to print a string, you'll need the number of graphemes. (d|)?string.length gives you the number of code units.

If you normalize a string (in the sequence of characters/code points sense, not object.string) to NFC, it will decompose every precomposed character in the string (like é, a single code point), establish a defined order between the combining characters and then recompose a selected few graphemes (like é). This way é always ends up as a single code point in NFC. There are dozens of other combinations where you'll still have an n:1 mapping between code points and graphemes left after normalization. An example was given already in this thread: putting an arrow over a Latin letter is typical in math and is always more than one code point.
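The three levels can be observed directly in D; a minimal sketch using the Phobos pieces discussed in this thread (std.range.walkLength counts the elements of the auto-decoded code point range, std.uni.byGrapheme groups them into grapheme clusters):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "é" written as 'e' plus U+0301, the combining acute accent
    string s = "e\u0301";

    // code units: UTF-8 bytes -- 'e' is one byte, U+0301 is two
    assert(s.length == 3);

    // code points: decoding the string yields two dchars
    assert(s.walkLength == 2);

    // graphemes: the pair renders as one user-perceived character
    assert(s.byGrapheme.walkLength == 1);
}
```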
Re: Today's programming challenge - How's your Range-Fu ?
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
> > Yes, again and again I encountered length related bugs with Unicode
> > characters. Normalization is not 100% reliable.
>
> I think it is 100% reliable, it just doesn't make the problems go away.
> It just guarantees that two strings normalized to the same form are
> binary equal iff they are equal in the Unicode sense. Nothing about
> columns or string length or grapheme count.

The problem is not normalization as such, the problem is with string (as opposed to dstring):

import std.uni : normalize, NFC;

void main()
{
    dstring de_one = "é";
    dstring de_two = "e\u0301";
    assert(de_one.length == 1);
    assert(de_two.length == 2);

    string e_one = "é";
    string e_two = "e\u0301";
    string random = "ab";
    assert(e_one.length == 2);
    assert(e_two.length == 3);
    assert(e_one.length == random.length);
    assert(normalize!NFC(e_one).length == 2);
    assert(normalize!NFC(e_two).length == 2);
}

This can lead to subtle bugs, cf. length of random and e_one. You have to convert everything to dstring to get the "expected" result. However, this is not always desirable.
Re: Today's programming challenge - How's your Range-Fu ?
> Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable.

I think it is 100% reliable, it just doesn't make the problems go away. It just guarantees that two strings normalized to the same form are binary equal iff they are equal in the Unicode sense. Nothing about columns or string length or grapheme count.
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
> > Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Yes. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
>
> I don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.

Yes, again and again I encountered length related bugs with Unicode characters. Normalization is not 100% reliable. I don't know anyone who works with non-English characters who doesn't have problems with Unicode-related issues sometimes.
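The point that combining characters survive normalization can be shown in a few lines of D: NFC composes 'e' + combining acute into the single code point U+00E9, but a base letter with no precomposed form keeps its combining mark. A minimal sketch:

```d
import std.range : walkLength;
import std.uni : normalize, NFC;

void main()
{
    // 'e' + U+0301 has a precomposed form, so NFC yields one code point (é)
    assert(normalize!NFC("e\u0301").walkLength == 1);

    // there is no precomposed "x with acute", so the combining mark
    // survives NFC and the string stays two code points long
    assert(normalize!NFC("x\u0301").walkLength == 2);
}
```

So any code that iterates "characters" still has to handle combining marks even on fully normalized input.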
Re: Today's programming challenge - How's your Range-Fu ?
On 2015-04-20 08:04, Nick B wrote: Perhaps a new Unicode standard, could start that way as well ? https://xkcd.com/927/ -- /Jacob Carlborg
Re: Today's programming challenge - How's your Range-Fu ?
On 19/04/15 22:58, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf

This is not a very accurate depiction of Unicode. For example:

> And, moreover, BOM is meaningless without mentioning of encoding. So we have to specify encoding anyway.

No. BOM is what lets you auto-detect the encoding. If you know you will be using UTF-8, 16 or 32, but not which of them, the BOM will tell you which it is. That is its entire purpose, in fact. There, pretty much, goes point #1. And then:

> Unicode contains at least “writing direction” control symbols (LTR is U+200E and RTL is U+200F) which role is IDENTICAL to the role of codepage-switching symbols with the associated disadvantages.

That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere invisible characters with defined directionality. Cutting them away from a substring would not invalidate your text more than cutting away actual text would under the same conditions. In any case, unlike page-switching symbols, it would only affect your display, not your understanding of the text. So point #2 is out.

He has some valid arguments under point #3, but also lots of !(@#&$ nonsense. He is right, I think, that denoting units with separate code points makes no sense, but the rest of his arguments seem completely off. For example, asking Latin and Cyrillic to share the same region merely because some letters look alike makes no sense, implementation-wise.

Points #4, #5, #6 and #7 are the same point. The main objection I have there is his assumption that the situation is, somehow, worse than it was. Yes, if you knew your encoding was Windows-1255, you could assume the text is Hebrew. Or Yiddish. And this, I think, is one of the encodings with the least number of languages riding on it. Windows-1256 has Arabic, Persian, Urdu and others. Windows-1252 has the entire Western European script.
As pointed out elsewhere in this thread, Spanish and French treat case folding of accented letters differently. Also, we see that the solution he thinks would work better actually doesn't. People living in France don't switch to a QWERTY keyboard when they want to type English. They type English with their AZERTY keyboard. There simply is no automatic way to tell what language something is typed in without a human telling you (or applying content-based heuristics).

Microsoft Word stores, for each letter, the keyboard language it was typed with. This causes great problems when copying to other editors, performing searches, or simply trying to get bidirectional text to appear correctly. The problem is so bad that phone numbers where the prefix appears after the actual number are not considered bad form or unusual, even in official PR material or when sending resumes.

In fact, the only time you can count on someone to switch keyboards is when they need to switch to a language with a different alphabet. No Russian speaker will type English using the Russian layout, even if what she has to say happens to use letters with the same glyphs. You simply do not plan that much ahead.

The point I'm driving at is that just because someone posted some rant on the Internet doesn't mean it's correct. When someone says something is broken, always ask them what they suggest instead.

Shachar
Re: Today's programming challenge - How's your Range-Fu ?
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:
> On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:
> > Perhaps Unicode needs to be rebuilt from the ground up?
>
> alas, it's too late. now we'll live with that "unicode" crap for many years.

Perhaps. Or perhaps not. This community got together under Walter's and Andrei's leadership to build a new programming language on the pillars of the old. Perhaps a new Unicode standard could start that way as well?
Re: Today's programming challenge - How's your Range-Fu ?
On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:
> Perhaps Unicode needs to be rebuilt from the ground up?

alas, it's too late. now we'll live with that "unicode" crap for many years.
Re: Today's programming challenge - How's your Range-Fu ?
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
> On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
> it's not crazy, it's just broken in all possible ways:
> http://file.bestmx.net/ee/articles/uni_vs_code.pdf

Ketmar, great link, and a really good argument about the problems with Unicode. Quote from 'Instead of Conclusion':

"Yes. This is the root of Unicode misdesign. They mixed up two mutually exclusive approaches. They blended badly two different abstraction levels: the textual level which corresponds to a language idea and the graphical level which does not care of a language, yet cares of writing direction, subscripts, superscripts and so on. In other words we need two different Unicodes built on these two opposite principles, instead of the one built on an insane mix of controversial axioms."

End quote. Perhaps Unicode needs to be rebuilt from the ground up?
Re: Today's programming challenge - How's your Range-Fu ?
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote: On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote: é might be obvious, but Unicode isn't just for writing European prose. it is also to insert pictures of the animals into text. There's other uses for unicode? 🐧
Re: Today's programming challenge - How's your Range-Fu ?
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
> é might be obvious, but Unicode isn't just for writing European prose.

it is also to insert pictures of the animals into text.

> Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.

it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Re: Today's programming challenge - How's your Range-Fu ?
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
> U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.

That's probably right. It is in fact a major feat to have the world adopt a new standard wholesale, but there are also difficult "semiotic" issues when you encode symbols, and different languages view symbols differently (e.g. is "ä" an "a", or do you have two unique letters in the alphabet?). Take "å": it can represent a unit (ångström), or a letter with a circle above it, or a unique letter in the alphabet. The letter "æ" can be seen as a combination of "ae" or as a unique letter. And we can expect languages, signs and practices to evolve over time too. How can you normalize encodings without normalizing writing practice and natural language development? That would be beyond the mandate of a Unicode standard organization...
Re: Today's programming challenge - How's your Range-Fu ?
On 19/04/15 10:51, Abdulhaq wrote:
> On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
> > Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters.
>
> Yes, Arabic is similar too.

Actually, the Arabic presentation forms serve a slightly different purpose. In Hebrew, the presentation forms are mostly for Biblical text, where certain decorations are usually done. For Arabic, the main reason for the presentation forms is shaping. Almost every Arabic letter can be written in up to four different forms (alone, start of word, middle of word and end of word). This means that Arabic has 28 letters, but over 100 different shapes for those letters. These days, when the font can do the shaping, the 28 letters suffice. During the DOS days, you needed to actually store those glyphs somewhere, which means that you needed to allocate a number to them.

In Hebrew, some letters also have a final form. Since the numbers are so significantly smaller, however (22 letters, 5 of which have final forms), Hebrew keyboards actually have all 27 letters on them. Going strictly by the "Unicode way", one would be expected to spell שלום with U05DE as the last letter, and let the shaping engine figure out that it should use the final form (or add a ZWNJ). Since all Hebrew code charts contained a final form Mem, however, you actually spell it with U05DD in the end, and it is considered a distinct letter.

Shachar
Re: Today's programming challenge - How's your Range-Fu ?
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
> On 18/04/15 21:40, Walter Bright wrote:
> > I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
>
> A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute) comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.
>
> This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly.
>
> Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters. The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed above, is only there for legacy.
>
> Shachar, or shall I say שחר

Yes, Arabic is similar too.
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:
> On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
> > \u0301 is the "combining acute accent" [1].
> > [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.

é might be obvious, but Unicode isn't just for writing European prose. Uses for combining characters include (but are *nowhere* near limited to) mathematical notation, where the combinatorial explosion of possible combinations that still belong to one grapheme cluster ("character" is a familiar but misleading word when talking about Unicode) would trivially become an insanely (more atoms than in the universe levels of) large number of characters. Unicode is a nightmarish system in some ways, but considering how incredibly difficult the problem it solves is, it's actually not too crazy.
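The math-notation case can be made concrete in D. A combining arrow over a letter (vector notation) has no precomposed code point, so it remains two code points even after NFC, yet it is a single grapheme cluster; a minimal sketch:

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

void main()
{
    // vector notation: 'v' followed by U+20D7 COMBINING RIGHT ARROW ABOVE
    string v = "v\u20D7";

    // no precomposed "v with arrow" exists, so even NFC keeps two code points
    assert(normalize!NFC(v).walkLength == 2);

    // yet it is one grapheme cluster, i.e. a single printed symbol
    assert(v.byGrapheme.walkLength == 1);
}
```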
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:
> On 4/18/15 4:35 AM, Jacob Carlborg wrote:
> > On 2015-04-18 12:27, Walter Bright wrote:
> > > That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
> >
> > This code snippet demonstrates the problem:
> >
> > import std.stdio;
> >
> > void main ()
> > {
> >     dstring a = "e\u0301";
> >     dstring b = "é";
> >     assert(a != b);
> >     assert(a.length == 2);
> >     assert(b.length == 1);
> >     writeln(a, " ", b);
> > }
> >
> > If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
> >
> > \u0301 is the "combining acute accent" [1].
> >
> > [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>
> Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei

Normalisation can allow some simplifications, sometimes, but knowing whether it will or not requires a lot of a priori knowledge about the input as well as the normalisation form.
Re: Today's programming challenge - How's your Range-Fu ?
On 18/04/15 21:40, Walter Bright wrote:
> I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.

A lot of areas in Unicode are due to pre-Unicode legacy. I'm guessing here, but looking at the code points, é (U00e9 - Latin small letter E with acute) comes from Latin-1, which is designed to follow ISO-8859-1. U0301 (Combining acute accent) comes from "Combining diacritical marks". The way I understand things, Unicode would really prefer to use U0065+U0301 rather than U00e9. Because of legacy systems, and because they would rather have the ISO-8859 code pages be 1:1 mappings, rather than 1:n mappings, they introduced code points they really would rather do without.

This also explains the "presentation forms" code pages (e.g. http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be glyphs, rather than code points. Due to legacy reasons, it was not possible to simply discard them. They received code points, with a warning not to use these code points directly.

Also, notice that some letters can only be achieved using multiple code points. Hebrew diacritics, for example, do not, typically, have a composite form. My name fully spelled (which you rarely would do), שַׁחַר, cannot be represented with less than 6 code points, despite having only three letters.

The last paragraph isn't strictly true. You can use UFB2C + U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. You would be using the presentation form which, as pointed above, is only there for legacy.

Shachar, or shall I say שחר
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote: However, I think Walter's goal here is to match the original wrap() functionality. Yes, although the overarching goal is: Minimize Need For Using GC In Phobos and the method here is to use ranges rather than having to allocate string temporaries.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote: Take it up with the Unicode consortium. :-) I see nobody knows :-)
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, Apr 17, 2015 at 08:44:51PM +, Panke via Digitalmars-d wrote: > On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote: > >On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: > > > >>Well, talk is cheap, so here's a working implementation of the > >>non-Unicode-correct line wrapper that uses ranges and does not > >>allocate: > > > >there is some... inconsistency: `std.string.wrap` adds final "\n" to > >string. ;-) but i always hated it for that. > > A range of lines instead of inserted \n would be a good API as well. Indeed, that would be even more useful, then you could just do .joiner("\n") to get the original functionality. However, I think Walter's goal here is to match the original wrap() functionality. Perhaps the prospective wrapped() function could be implemented in terms of a byWrappedLines() function which does return a range of wrapped lines. T -- The volume of a pizza of thickness a and radius z can be described by the following formula: pi zz a. -- Wouter Verhelst
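A byWrappedLines of the sort discussed might look roughly like this. Everything here is hypothetical: the name, the greedy strategy, and the allocating string[] return (a range-based, non-allocating version is the actual goal); it also counts code units via .length, so it has exactly the Unicode problems this thread is about. It is only meant to show the API shape:

```d
import std.algorithm : splitter;

// Greedy word wrap: split on single spaces and pack words into lines of
// at most `width` code units, returning the lines as a slice-able array.
string[] byWrappedLines(string text, size_t width)
{
    string[] lines;
    string cur;
    foreach (word; text.splitter(' '))
    {
        if (cur.length == 0)
            cur = word;                       // first word on the line
        else if (cur.length + 1 + word.length <= width)
            cur ~= " " ~ word;                // word still fits
        else
        {
            lines ~= cur;                     // line full: emit and start anew
            cur = word;
        }
    }
    if (cur.length)
        lines ~= cur;
    return lines;
}

void main()
{
    auto lines = byWrappedLines("the quick brown fox jumps", 10);
    assert(lines == ["the quick", "brown fox", "jumps"]);
}
```

The original wrap()-style output would then just be the lines joined with "\n", e.g. via std.algorithm.joiner.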
Re: Today's programming challenge - How's your Range-Fu ?
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote: [...] > >When we don't know provenance of incoming data, we have to assume the > >worst and run normalization to be sure that we got it right. > > I'm not arguing against the existence of the Unicode standard, I'm > saying I can't figure any justification for standardizing different > encodings of the same thing. Take it up with the Unicode consortium. :-) T -- Tech-savvy: euphemism for nerdy.
Re: Today's programming challenge - How's your Range-Fu ?
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote: > >On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d > >wrote: > >>On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote: > >>>One possible solution would be to modify std.uni.graphemeStride to > >>>not allocate, since it shouldn't need to do so just to compute the > >>>length of the next grapheme. > >> > >>That should be done. There should be a fixed maximum codepoint count > >>to graphemeStride. > > > >Why? Scanning a string for a grapheme of arbitrary length does not > >need allocation since you're just reading data. Unless there is some > >required intermediate representation that I'm not aware of? > > If there's no need for allocation at all, why does it allocate? This > should be fixed. AFAICT, the only reason it allocates is because it shares the same underlying implementation as byGrapheme. There's probably a way to fix this, I just don't have the time right now to figure out the code. T -- Маленькие детки - маленькие бедки.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
> > On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
> > > \u0301 is the "combining acute accent" [1].
> > > [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> >
> > I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
>
> Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data.

Data entry should be handled by the driver program, not a universal interchange format.

> When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right.

I'm not arguing against the existence of the Unicode standard, I'm saying I can't figure any justification for standardizing different encodings of the same thing.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
> On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
> > On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> > > One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme.
> >
> > That should be done. There should be a fixed maximum codepoint count to graphemeStride.
>
> Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of?

If there's no need for allocation at all, why does it allocate? This should be fixed.
Re: Today's programming challenge - How's your Range-Fu ?
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote: > >One possible solution would be to modify std.uni.graphemeStride to > >not allocate, since it shouldn't need to do so just to compute the > >length of the next grapheme. > > That should be done. There should be a fixed maximum codepoint count > to graphemeStride. Why? Scanning a string for a grapheme of arbitrary length does not need allocation since you're just reading data. Unless there is some required intermediate representation that I'm not aware of? T -- "How are you doing?" "Doing what?"
Re: Today's programming challenge - How's your Range-Fu ?
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/18/2015 4:35 AM, Jacob Carlborg wrote: > >\u0301 is the "combining acute accent" [1]. > > > >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm > > I won't deny what the spec says, but it doesn't make any sense to have > two different representations of eacute, and I don't know why anyone > would use the two code point version. Well, *somebody* has to convert it to the single code point eacute, whether it's the human (if the keyboard has a single key for it), or the code interpreting keystrokes (the user may have typed it as e + combining acute), or the program that generated the combination, or the program that receives the data. When we don't know provenance of incoming data, we have to assume the worst and run normalization to be sure that we got it right. The two code-point version may also arise from string concatenation, in which case normalization has to be done again (or possibly from the point of concatenation, given the right algorithms). T -- Mediocrity has been pushed to extremes.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 4:35 AM, Jacob Carlborg wrote: \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm I won't deny what the spec says, but it doesn't make any sense to have two different representations of eacute, and I don't know why anyone would use the two code point version.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote: One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme. That should be done. There should be a fixed maximum codepoint count to graphemeStride.
Re: Today's programming challenge - How's your Range-Fu ?
> Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Yes. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei

I don't think so. The thing is, even after normalization we have to deal with combining characters because in all normalization forms there will be combining characters left after normalization.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
> On 2015-04-18 12:27, Walter Bright wrote:
> > That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
>
> This code snippet demonstrates the problem:
>
> import std.stdio;
>
> void main ()
> {
>     dstring a = "e\u0301";
>     dstring b = "é";
>     assert(a != b);
>     assert(a.length == 2);
>     assert(b.length == 1);
>     writeln(a, " ", b);
> }
>
> If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
>
> \u0301 is the "combining acute accent" [1].
>
> [1] http://www.fileformat.info/info/unicode/char/0301/index.htm

Isn't this solved commonly with a normalization pass? We should have a normalizeUTF() that can be inserted in a pipeline. Then the rest of Phobos doesn't need to mind these combining characters. -- Andrei
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
> On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via Digitalmars-d wrote:
> > On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
> > > On 2015-04-18 12:27, Walter Bright wrote:
> > > > That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
> > >
> > > This code snippet demonstrates the problem:
> > >
> > > import std.stdio;
> > >
> > > void main ()
> > > {
> > >     dstring a = "e\u0301";
> > >     dstring b = "é";
> > >     assert(a != b);
> > >     assert(a.length == 2);
> > >     assert(b.length == 1);
> > >     writeln(a, " ", b);
> > > }
> > >
> > > If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same.
> > >
> > > \u0301 is the "combining acute accent" [1].
> > >
> > > [1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> >
> > Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
>
> Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function. Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy.
Using byGrapheme, within the current state of Phobos, is still the best bet as to correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character"). Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates", there is only "allocates" or "@nogc". Which makes it unusable in @nogc code. One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme. T This is why on OS X I always normalized strings to composed. However, there are always issues with Unicode, because, as you said, the layman's notion of what a character is is not the same as Unicode's. I wrote a utility function that uses byGrapheme and byCodePoint. It's a bit of an overhead, but I always get the correct length and character access (e.g. if txt.startsWith("é")).
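A minimal sketch of such a utility (graphemeCount is a hypothetical name, not a Phobos function) can be built directly on byGrapheme; note that, as discussed, byGrapheme may occasionally allocate:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

// Hypothetical helper: counts user-perceived characters, not code units.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    assert("e\u0301".length == 3);         // three UTF-8 code units...
    assert(graphemeCount("e\u0301") == 1); // ...but one perceived character
}
```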
Re: Today's programming challenge - How's your Range-Fu ?
Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function. 1. Problem: Normalization is not closed under almost all operations. E.g. concatenating two normalized strings does not guarantee the result is in normalized form. 2. Problem: Some unicode algorithms e.g. string comparison require a normalization step. It doesn't matter which form you use, but you have to pick one. Now we could say that all strings passed to phobos have to be normalized as (say) NFC and that phobos function thus skip the normalization.
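Problem 1 can be demonstrated in a few lines: each operand is normalized on its own, yet the concatenation is not (a sketch using std.uni's eager normalize):

```d
import std.uni : normalize, NFC;

void main()
{
    string a = "e";      // already NFC-normal by itself
    string b = "\u0301"; // a lone combining acute accent is also NFC-normal
    string c = a ~ b;    // but the concatenation is no longer in NFC
    assert(normalize!NFC(c) == "é"); // NFC composes it into U+00E9
    assert(c != normalize!NFC(c));   // so c itself was not normalized
}
```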
Re: Today's programming challenge - How's your Range-Fu ?
On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via Digitalmars-d wrote: > On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote: > >On 2015-04-18 12:27, Walter Bright wrote: > > > >>That doesn't make sense to me, because the umlauts and the accented > >>e all have Unicode code point assignments. > > > >This code snippet demonstrates the problem: > > > >import std.stdio; > > > >void main () > >{ > >dstring a = "e\u0301"; > >dstring b = "é"; > >assert(a != b); > >assert(a.length == 2); > >assert(b.length == 1); > >writefln(a, " ", b); > >} > > > >If you run the above code all asserts should pass. If your system > >correctly supports Unicode (works on OS X 10.10) the two printed > >characters should look exactly the same. > > > >\u0301 is the "combining acute accent" [1]. > > > >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm > > Yep, this was the cause of some bugs I had in my program. The thing is > you never know, if a text is composed or decomposed, so you have to be > prepared that "é" has length 2 or 1. On OS X these characters are > automatically decomposed by default. So if you pipe it through the > system an "é" (length=1) automatically becomes "e\u0301" (length=2). > Same goes for file names on OS X. I've had to find a workaround for > this more than once. Wait, I thought the recommended approach is to normalize first, then do string processing later? Normalizing first will eliminate inconsistencies of this sort, and allow string-processing code to use a uniform approach to handling the string. I don't think it's a good idea to manually deal with composed/decomposed issues within every individual string function. Of course, even after normalization, you still have the issue of zero-width characters and combining diacritics, because not every language has precomposed characters handy. 
Using byGrapheme, within the current state of Phobos, is still the best bet as to correctly counting the number of printed columns as opposed to the number of "characters" (which, in the Unicode definition, does not always match the layman's notion of "character"). Unfortunately, byGrapheme may allocate, which fails Walter's requirements. Well, to be fair, byGrapheme only *occasionally* allocates -- only for input with unusually long sequences of combining diacritics -- for normal use cases you'll pretty much never have any allocations. But the language can't express the idea of "occasionally allocates", there is only "allocates" or "@nogc". Which makes it unusable in @nogc code. One possible solution would be to modify std.uni.graphemeStride to not allocate, since it shouldn't need to do so just to compute the length of the next grapheme. T -- Just because you survived after you did it, doesn't mean it wasn't stupid!
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote: On 2015-04-18 14:25, Gary Willoughby wrote: byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here? How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected: foreach (e ; "e\u0301".byGrapheme) writeln(e);

void main()
{
    import std.stdio;
    import std.uni;
    foreach (e ; "e\u0301".byGrapheme)
        writeln(e[]);
}
Re: Today's programming challenge - How's your Range-Fu ?
On 2015-04-18 14:25, Gary Willoughby wrote: byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here? How is byGrapheme supposed to be used? I tried this but it doesn't do what I expected: foreach (e ; "e\u0301".byGrapheme) writeln(e); -- /Jacob Carlborg
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote: On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote: On 2015-04-18 12:27, Walter Bright wrote: That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments. This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writefln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm Yep, this was the cause of some bugs I had in my program. The thing is you never know, if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once. byGrapheme to the rescue: http://dlang.org/phobos/std_uni.html#byGrapheme Or is this unsuitable here?
Re: Today's programming challenge - How's your Range-Fu ?
Also another issue is that lower case letters and upper case might have different size requirements or look different depending on where in the word they are located. For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases. -- Paulo While true, it does not affect wrap (the algorithm) as far as I can see.
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote: On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote: On 4/18/2015 12:58 AM, John Colvin wrote: On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote: On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column. Code points aren't equivalent to characters. They're not the same thing in most European languages, I know a bit of German, for what characters is that not true? Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all. See also: http://unicode.org/reports/tr15/#Norm_Forms Also another issue is that lower case letters and upper case might have different size requirements or look different depending on where in the word they are located. For example, German ß and SS, Greek σ and ς. I know Turkish also has similar cases. -- Paulo
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote: On 2015-04-18 12:27, Walter Bright wrote: That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments. This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writefln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm Yep, this was the cause of some bugs I had in my program. The thing is you never know if a text is composed or decomposed, so you have to be prepared that "é" has length 2 or 1. On OS X these characters are automatically decomposed by default. So if you pipe it through the system an "é" (length=1) automatically becomes "e\u0301" (length=2). Same goes for file names on OS X. I've had to find a workaround for this more than once.
Re: Today's programming challenge - How's your Range-Fu ?
On 2015-04-18 12:27, Walter Bright wrote: That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments. This code snippet demonstrates the problem: import std.stdio; void main () { dstring a = "e\u0301"; dstring b = "é"; assert(a != b); assert(a.length == 2); assert(b.length == 1); writefln(a, " ", b); } If you run the above code all asserts should pass. If your system correctly supports Unicode (works on OS X 10.10) the two printed characters should look exactly the same. \u0301 is the "combining acute accent" [1]. [1] http://www.fileformat.info/info/unicode/char/0301/index.htm -- /Jacob Carlborg
Re: Today's programming challenge - How's your Range-Fu ?
That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments. Yes, but you may have perfectly fine unicode text where the combined form is used. Actually there is a normalization form for unicode that requires the combined form. To be fully correct phobos needs to handle that as well.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 1:26 AM, Panke wrote: On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote: On 4/18/2015 12:58 AM, John Colvin wrote: On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote: On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column. Code points aren't equivalent to characters. They're not the same thing in most European languages, I know a bit of German, for what characters is that not true? Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café That doesn't make sense to me, because the umlauts and the accented e all have Unicode code point assignments.
Re: Today's programming challenge - How's your Range-Fu ?
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote: On 4/18/2015 12:58 AM, John Colvin wrote: On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote: On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column. Code points aren't equivalent to characters. They're not the same thing in most European languages, I know a bit of German, for what characters is that not true? Umlauts, if combined characters are used. Also words that still have their accents left after import from foreign languages. E.g. Café Getting all unicode correct seems a daunting task with a severe performance impact, esp. if we need to assume that a string might have any normalization form or none at all. See also: http://unicode.org/reports/tr15/#Norm_Forms
Re: Today's programming challenge - How's your Range-Fu ?
On 4/18/2015 12:58 AM, John Colvin wrote: On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote: On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column. Code points aren't equivalent to characters. They're not the same thing in most European languages, I know a bit of German, for what characters is that not true? never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others. The first order of business is making wrap() work with ranges, and otherwise work the same as it always has (it's one of the oldest Phobos functions). There are different standard levels of Unicode support. The lowest level is working correctly with code points, which is what wrap() does. Going to a higher level of support comes after range support. I know little about combining characters. You obviously know much more; do you want to take charge of this function?
Re: Today's programming challenge - How's your Range-Fu ?
On 2015-04-18 09:58, John Colvin wrote: Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. For that we have std.ascii. -- /Jacob Carlborg
Re: Today's programming challenge - How's your Range-Fu ?
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote: On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column. Code points aren't equivalent to characters. They're not the same thing in most European languages, never mind the rest of the world. If we have a line-wrapping algorithm in phobos that works by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning. Code points are a useful chunk size for some tasks and completely insufficient for others.
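The distinction John is drawing can be made concrete with a three-way count over a single decomposed character:

```d
import std.uni : byGrapheme;
import std.range : walkLength;

void main()
{
    string s = "e\u0301"; // 'e' + combining acute accent
    assert(s.length == 3);                // code units (UTF-8)
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // graphemes (what the reader sees)
}
```

A wrap() that counts code points would treat this as two columns even though it renders as one.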
Re: Today's programming challenge - How's your Range-Fu ?
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote: There's also the question of what to do with bidi markings: how do you handle counting the columns in that case? Which BiDi marking are you referring to? LRM/RLM and friends? If so, don't worry: the interface, as described, is incapable of properly handling BiDi anyways. The proper way to handle BiDi line wrapping is this. First you assign a BiDi level to each character (at which point the markings are, effectively, removed from the input, so there goes your problem). Then you calculate the glyph's width until the line limit is reached, and then you reorder each line according to the BiDi levels you calculated earlier. As can be easily seen, this requires transitioning BiDi information that is per-paragraph across the line break logic, pretty much mandating multiple passes on the input. Since the requested interface does not allow that, proper BiDi line breaking is impossible with that interface. I'll mention that not everyone takes that as a serious problem. Windows' text control, for example, calculates line breaks on the text, and then runs the BiDi algorithm on each line individually. Few people notice this. Then again, people have already grown used to BiDi text being scrambled. Shachar
Re: Today's programming challenge - How's your Range-Fu ?
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote: On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate: there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that. A range of lines instead of inserted \n would be a good API as well.
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote: > Well, talk is cheap, so here's a working implementation of the > non-Unicode-correct line wrapper that uses ranges and does not allocate: there is some... inconsistency: `std.string.wrap` adds final "\n" to string. ;-) but i always hated it for that.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote: On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote: On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate: awesome! Please make a pull request for this so you get proper credit! Doesn't that mean I have to add the autodecoding workarounds first? Before it gets pulled, yes, meaning that the element type of front() should match the element encoding type of Range. There's also an issue with firstindent and indent being the same range type as 'range', which is not practical as Range is likely a voldemort type. I suggest making them simply of type 'string'. I don't see any point to make them ranges. A unit test with an input range is needed, and one with some multibyte unicode encodings.
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote: > On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: > >Well, talk is cheap, so here's a working implementation of the > >non-Unicode-correct line wrapper that uses ranges and does not > >allocate: > > awesome! Please make a pull request for this so you get proper credit! Doesn't that mean I have to add the autodecoding workarounds first? T -- Life is too short to run proprietary software. -- Bdale Garbee
Re: Today's programming challenge - How's your Range-Fu ?
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote: So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). It'd be good enough to duplicate the existing behavior, which is to treat decoded unicode characters as one column.
Re: Today's programming challenge - How's your Range-Fu ?
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote: Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate: awesome! Please make a pull request for this so you get proper credit!
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] > So either you have to throw out all pretenses of Unicode-correctness > and just stick with ASCII-style per-character line-wrapping, or you > have to live with byGrapheme with all the complexity that it entails. > The former is quite easy to write -- I could throw it together in a > couple o' hours max, but the latter is a pretty big project (cf. > Unicode line-breaking algorithm, which is one of the TR's). [...] Well, talk is cheap, so here's a working implementation of the non-Unicode-correct line wrapper that uses ranges and does not allocate:

import std.range.primitives;

/**
 * Range version of $(D std.string.wrap).
 *
 * Bugs:
 * This function does not conform to the Unicode line-breaking algorithm. It
 * does not take into account zero-width characters, combining diacritics,
 * double-width characters, non-breaking spaces, and bidi markings. Strings
 * containing these characters therefore may not be wrapped correctly.
 */
auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
                R indent = null, in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    import std.algorithm.iteration : map, joiner;
    import std.range : chain;
    import std.uni;

    alias CharType = ElementType!R;

    // Returns: Wrapped lines.
    struct Result
    {
        private R range, indent;
        private size_t maxCols, tabSize;
        private size_t currentCol = 0;
        private R curIndent;
        bool empty = true;
        bool atBreak = false;

        this(R _range, R _firstindent, R _indent, size_t columns,
             size_t tabsize)
        {
            this.range = _range;
            this.curIndent = _firstindent.save;
            this.indent = _indent;
            this.maxCols = columns;
            this.tabSize = tabsize;
            empty = _range.empty;
        }

        @property CharType front()
        {
            if (atBreak)
                return '\n'; // should implicit convert to wider characters
            else if (!curIndent.empty)
                return curIndent.front;
            else
                return range.front;
        }

        void popFront()
        {
            if (atBreak)
            {
                // We're at a linebreak.
                atBreak = false;
                currentCol = 0;

                // Start new line with indent
                curIndent = indent.save;
                return;
            }
            else if (!curIndent.empty)
            {
                // We're iterating over an initial indent.
                curIndent.popFront();
                currentCol++;
                return;
            }

            // We're iterating over the main range.
            range.popFront();
            if (range.empty)
            {
                empty = true;
                return;
            }

            if (range.front == '\t')
                currentCol += tabSize;
            else if (isWhite(range.front))
            {
                // Scan for next word boundary to decide whether or not to
                // break here.
                R tmp = range.save;
                assert(!tmp.empty);
                size_t col = currentCol;

                // Find start of next word
                while (!tmp.empty && isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }

                // Remember start of next word so that if we need to break, we
                // won't introduce extraneous spaces to the start of the new
                // line.
                R nextWord = tmp.save;
                while (!tmp.empty && !isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }
                assert(tmp.empty || isWhite(tmp.front));

                if (col > maxCols)
                {
                    // Word wrap needed. Move current range position to
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote: [...] > -- > All problems are easy in retrospect. Argh, my Perl script doth mock me! T -- Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
Re: Today's programming challenge - How's your Range-Fu ?
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote: > Challenge level - Moderately easy > > Consider the function std.string.wrap: > > http://dlang.org/phobos/std_string.html#.wrap > > It takes a string as input, and returns a GC allocated string that is > word-wrapped. It needs to be enhanced to: > > 1. Accept a ForwardRange as input. > 2. Return a lazy ForwardRange that delivers the characters of the > wrapped result one by one on demand. > 3. Not allocate any memory. > 4. The element encoding type of the returned range must be the same as > the element encoding type of the input. This is harder than it looks at first sight, actually. Mostly thanks to the complexity of Unicode... you need to identify zero-width, normal-width, and double-width characters, combining diacritics, various kinds of spaces (e.g. cannot break on non-breaking space) and treat them accordingly. Which requires decoding. (Well, in theory std.uni could be enhanced to work directly with encoded data, but right now it doesn't. In any case this is outside the scope of this challenge, I think.) Unfortunately, the only reliable way I know of currently that can deal with the spacing of Unicode characters correctly is to segment the input with byGrapheme, which currently is GC-dependent. So this fails (3). There's also the question of what to do with bidi markings: how do you handle counting the columns in that case? Of course, if you forego Unicode correctness, then you *could* just word-wrap on a per-character basis (i.e., every character counts as 1 column), but this also makes the resulting code useless as far as dealing with general Unicode data is concerned -- it'd only work for ASCII, and various character ranges inherited from the old 8-bit European encodings. 
Not to mention, line-breaking in Chinese encodings cannot work as prescribed anyway, because the rules are different (you can break anywhere at a character boundary except punctuation -- there is no such concept as a space character in Chinese writing). Same applies for Korean/Japanese. So either you have to throw out all pretenses of Unicode-correctness and just stick with ASCII-style per-character line-wrapping, or you have to live with byGrapheme with all the complexity that it entails. The former is quite easy to write -- I could throw it together in a couple o' hours max, but the latter is a pretty big project (cf. Unicode line-breaking algorithm, which is one of the TR's). T -- All problems are easy in retrospect.
Today's programming challenge - How's your Range-Fu ?
Challenge level - Moderately easy Consider the function std.string.wrap: http://dlang.org/phobos/std_string.html#.wrap It takes a string as input, and returns a GC allocated string that is word-wrapped. It needs to be enhanced to: 1. Accept a ForwardRange as input. 2. Return a lazy ForwardRange that delivers the characters of the wrapped result one by one on demand. 3. Not allocate any memory. 4. The element encoding type of the returned range must be the same as the element encoding type of the input.
Re: Special Type Challenge
On Sunday, 8 February 2015 at 19:59:40 UTC, Meta wrote: On Sunday, 8 February 2015 at 13:06:08 UTC, Marc Schütz wrote: No, `alias this` convert from the type it is declared in to another type. `opImplicitCast` would be declared in the destination type. So like this? struct Type1 { string str; } struct Type2 { string str; Type2 opImplicitCast(Type1 t) { return Type2(t.str); } } void main() { Type2 t2 = Type1("Hello, World!"); } Instead of the opposite? Yes. It's almost like a constructor, but would be called wherever an explicit call to the constructor is required today:

struct S
{
    this(int x) {}
}

S test1()
{
    return 10;    // not ok
    return S(10); // works
}

struct T
{
    static typeof(this) opImplicitCast(int x)
    {
        return typeof(this)(x);
    }
}

T test1()
{
    return 10;    // ok
    return T(10); // not sure whether this should work or not
}

Because it's so similar to a constructor (it always needs to be static and return `typeof(this)`), maybe a special syntax can be used:

struct S { @implicit this(int x) {} }
Re: Special Type Challenge
On Sunday, 8 February 2015 at 13:06:08 UTC, Marc Schütz wrote: No, `alias this` convert from the type it is declared in to another type. `opImplicitCast` would be declared in the destination type. So like this? struct Type1 { string str; } struct Type2 { string str; Type2 opImplicitCast(Type1 t) { return Type2(t.str); } } void main() { Type2 t2 = Type1("Hello, World!"); } Instead of the opposite?
Re: Special Type Challenge
On Saturday, 7 February 2015 at 20:53:48 UTC, Meta wrote: On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote: On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei Alexandrescu wrote: On 2/6/15 8:28 PM, Jonathan Marler wrote: Do you know if D might support that later or if there's a reason for not supporting it? It's deliberate following the C++ experience. -- Andrei Hasn't there been a debate about a hypothetical `opImplicitCast()`? The default would still be off, but you can opt-in by defining said method. `alias this` is pretty much `opImplicitCast`, but it currently has a few holes, such as http://forum.dlang.org/post/xcnwuneclebuyqcjb...@forum.dlang.org. It's not always appropriate in every situation though, as you get full-on subtyping along with implicit conversion. No, `alias this` convert from the type it is declared in to another type. `opImplicitCast` would be declared in the destination type.
Re: Special Type Challenge
On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote: Hasn't there been a debate about a hypothetical `opImplicitCast()`? The default would still be off, but you can opt-in by defining said method. How else would it be done? I am still confused as to what the reasons behind not having implicit conversion are.
Re: Special Type Challenge
On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote: On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei Alexandrescu wrote: On 2/6/15 8:28 PM, Jonathan Marler wrote: Do you know if D might support that later or if there's a reason for not supporting it? It's deliberate following the C++ experience. -- Andrei Hasn't there been a debate about a hypothetical `opImplicitCast()`? The default would still be off, but you can opt-in by defining said method. `alias this` is pretty much `opImplicitCast`, but it currently has a few holes, such as http://forum.dlang.org/post/xcnwuneclebuyqcjb...@forum.dlang.org. It's not always appropriate in every situation though, as you get full-on subtyping along with implicit conversion.
Re: Special Type Challenge
On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei Alexandrescu wrote: On 2/6/15 8:28 PM, Jonathan Marler wrote: Do you know if D might support that later or if there's a reason for not supporting it? It's deliberate following the C++ experience. -- Andrei Hasn't there been a debate about a hypothetical `opImplicitCast()`? The default would still be off, but you can opt-in by defining said method.
Re: Special Type Challenge
On 2/6/15 8:28 PM, Jonathan Marler wrote: Do you know if D might support that later or if there's a reason for not supporting it? It's deliberate following the C++ experience. -- Andrei
Re: Special Type Challenge
On Saturday, 7 February 2015 at 03:48:24 UTC, Adam D. Ruppe wrote: b = -256; that won't fit in a byte btw. Whoops, I typed that example too fast :) The rest of the assignment stuff is easy. I'd prolly even do it with a template: this(T)(T t) { this.opAssign(t); } // for construction Byte opAssign(T)(T t) if(T.sizeof == 1) { // for other assignment data_holder = cast(typeof(data_holder)) t; } and that should do it. This code is almost exactly what I came up with. But what about the function problem? void echo(Byte b); echo('c'); // won't work I can't figure out a way to make this work (without making echo a template). If D supported implicit conversions from built-in types to user types then it could work. Do you know if D might support that later or if there's a reason for not supporting it?
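Assembling the pieces from this exchange into a compilable sketch (the `ubyte data_holder` layout is an assumption based on the snippets above; the point is that `echo(Byte('c'))` works while `echo('c')` cannot):

```d
struct Byte
{
    ubyte data_holder;

    // accept any 1-byte type: char, byte, ubyte, and their qualified variants
    this(T)(T t) if (T.sizeof == 1) { this.opAssign(t); }

    ref Byte opAssign(T)(T t) if (T.sizeof == 1)
    {
        data_holder = cast(typeof(data_holder)) t; // note: paren closes the type
        return this;
    }
}

static assert(Byte.sizeof == 1); // requirement 1 of the challenge

void echo(Byte b) {}

void main()
{
    Byte b;
    b = 'a';
    b = cast(byte) 1;
    b = cast(ubyte) 1;
    echo(Byte('c'));  // fine: explicit construction
    // echo('c');     // error: D has no implicit construction char -> Byte
}
```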
Re: Special Type Challenge
On Saturday, 7 February 2015 at 01:55:27 UTC, Jonathan Marler wrote: b = -256; that won't fit in a byte btw. The rest of the assignment stuff is easy. I'd prolly even do it with a template: this(T)(T t) { this.opAssign(t); } // for construction Byte opAssign(T)(T t) if(T.sizeof == 1) { // for other assignment data_holder = cast(typeof(data_holder)) t; } and that should do it. Byte[] barr; barr = "teststring"; barr = [0,1,2,3]; } Language won't let you do these though.
Re: Special Type Challenge
On Saturday, 7 February 2015 at 02:12:08 UTC, Jakob Ovrum wrote: Byte echo(Byte b) { return b; } b = echo('a'); b = echo(cast(const char)'a'); b = echo(cast(immutable char)'a'); b = echo(cast(byte)1); b = echo(cast(const byte)1); b = echo(cast(immutable byte)1); b = echo(cast(ubyte)1); b = echo(cast(const ubyte)1); b = echo(cast(immutable ubyte)1); Byte[] barr; barr = "teststring"; barr = [0,1,2,3]; } These are not possible as D does not support implicit construction. It's a bit odd that D supports implicit conversions from user-defined types to built-in types but not the reverse. What's the reason for this? Is it done on purpose or is this just a hole that might be filled in later?
Re: Special Type Challenge
On Saturday, 7 February 2015 at 01:55:27 UTC, Jonathan Marler wrote: Byte b; b = 0; b = 1; b = 255; b = -256; b = 'a'; b = cast(const char)'a'; b = cast(immutable char)'a'; b = cast(byte)1; b = cast(const byte)1; b = cast(immutable byte)1; b = cast(ubyte)1; b = cast(const ubyte)1; b = cast(immutable ubyte)1; These are possible with opAssign (which should be accompanied with corresponding constructor(s)). Byte echo(Byte b) { return b; } b = echo('a'); b = echo(cast(const char)'a'); b = echo(cast(immutable char)'a'); b = echo(cast(byte)1); b = echo(cast(const byte)1); b = echo(cast(immutable byte)1); b = echo(cast(ubyte)1); b = echo(cast(const ubyte)1); b = echo(cast(immutable ubyte)1); Byte[] barr; barr = "teststring"; barr = [0,1,2,3]; } These are not possible as D does not support implicit construction.
Special Type Challenge
I'm wondering if the following is possible in D. I tried and failed but maybe someone else will be able to pull it off. // Challenge: Create a type named "Byte" that, // 1. Uses 1 byte of memory // 2. Can be used as an argument to a non-template function // 3. Handles implicit conversion from any 1-byte type // 4. Compiles and passes the following unittest unittest { Byte b; b = 0; b = 1; b = 255; b = -256; b = 'a'; b = cast(const char)'a'; b = cast(immutable char)'a'; b = cast(byte)1; b = cast(const byte)1; b = cast(immutable byte)1; b = cast(ubyte)1; b = cast(const ubyte)1; b = cast(immutable ubyte)1; Byte echo(Byte b) { return b; } b = echo('a'); b = echo(cast(const char)'a'); b = echo(cast(immutable char)'a'); b = echo(cast(byte)1); b = echo(cast(const byte)1); b = echo(cast(immutable byte)1); b = echo(cast(ubyte)1); b = echo(cast(const ubyte)1); b = echo(cast(immutable ubyte)1); Byte[] barr; barr = "teststring"; barr = [0,1,2,3]; }
Pull Request Challenge for D
Hi all, 1) I found this and somehow like the idea; maybe it's also a way forward for us ;) https://huntingbears.nl/2015/01/11/cpan-pull-request-challenge-for-january-datetimeformatepoch/ 2) As D upload/download statistics are always in the forums, I also like this page and hope it can be inspiring ;) http://stats.cpantesters.org/uploads.html regards, anton oks
Function Pointer Challenge
In a library I was writing I needed the following: write code that takes a function pointer/delegate and an array of strings, and calls the function by parsing each string into one of the function's arguments. If the function has a return value, the code also returns the function's return value after the call. I had written this functionality before in .NET (C# specifically) using .NET's runtime reflection. The code looks nice, but runtime reflection has poor performance. Using D's compile-time features I was able to write a D template function that implemented this in about 10 lines of code:

import std.conv : to;
import std.format : format;
import std.traits : isCallable, ParameterTypeTuple, ReturnType;

ReturnType!Function call(Function)(Function func, const char[][] args...)
    if (isCallable!Function)
{
    alias Args = ParameterTypeTuple!Function;
    if (args.length != Args.length)
        throw new Exception(format("Expected %d arguments but got %d",
                                   Args.length, args.length));
    Args argsTuple;
    foreach (i, Arg; Args)
    {
        argsTuple[i] = to!Arg(args[i]); // parse each string into its parameter type
    }
    return func(argsTuple);
}

Here's a unit test to demonstrate its usage:

unittest
{
    import std.stdio : writeln, writefln;

    void voidFunction() { writeln("[Test] Called voidFunction()"); }
    void randomFunction(int i, uint u, string s, char c)
    {
        writefln("[Test] Called randomFunction(%s, %s, \"%s\", '%s')", i, u, s, c);
    }
    ulong echoUlong(ulong value)
    {
        writefln("[Test] Called echoUlong(%s)", value);
        return value;
    }

    (&voidFunction).call();
    (&randomFunction).call("-1000", "567", "HelloWorld!", "?");
    string passedValue = "123456789";
    ulong returnValue = (&echoUlong).call(passedValue);
    writefln("[Test] echoUlong(%s) = %s", passedValue, returnValue);
    try
    {
        (&randomFunction).call("wrong number of args");
        assert(0);
    }
    catch (Exception e)
    {
        writefln("[Test] Caught %s: '%s'", typeid(e), e.msg);
    }
    writeln("[Test] Success");
}

I think this challenge does a great job of demonstrating D's compile-time power. I couldn't think of a way to do this in C without doing some type of code generation.
The reason I needed this functionality was because I was writing a remote procedure call type of library, where the functions being called were known at compile time, but the arguments (passed over a socket) had to be processed at runtime. I was wondering if anyone had good solutions to this problem in other languages. I was very pleased with the D solution, but I predict that solutions in other languages are going to be much uglier.
Re: Challenge: write a really really small front() for UTF8
Oops, in the last return I forgot "+ s[3]"
Re: Challenge: write a really really small front() for UTF8
http://goo.gl/4RSWhr Only 3 ifs.
Re: Challenge: write a really really small front() for UTF8
On 2014-03-25 11:42, dennis luehring wrote: On 25.03.2014 11:38, Nick Sabalausky wrote: On 3/25/2014 4:00 AM, Iain Buclaw wrote: On 25 March 2014 00:04, Daniel N wrote: On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising. I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread (for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window. 4) It's tied to one piece of hardware. No Thankee. void doStuff() { if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d the extra branch could kill the performance benefit if doStuff is too small void function() doStuff; void main() { auto supportCpuFeatureX = detectCpuFeatureX(); if (supportCpuFeatureX) doStuff = &doStuff_FeatureX; else doStuff = &doStuff_Fallback; }
Re: Challenge: write a really really small front() for UTF8
On 24-Mar-2014 23:53, Dmitry Olshansky wrote: On 24-Mar-2014 01:22, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei I had to join the party at some point. This seems like 25 instructions: http://goo.gl/N7sHtK Interestingly gdc-4.8 produces better results. http://goo.gl/1R7GMs -- Dmitry Olshansky
Re: Challenge: write a really really small front() for UTF8
dchar front(char[] s)
{
    dchar c = s[0];
    if (!(c & 0x80)) return c;   // ASCII fast path
    byte b = (c >> 4) & 3;       // lead nibble: 0/1 for 2-byte, 2 for 3-byte, 3 for 4-byte
    b += !b;                     // map 0 -> 1, so b = number of continuation bytes
    c &= 63 >> b;                // keep only the payload bits of the lead byte
    char* p = s.ptr;
    do
    {
        p++;
        c = c << 6 | *p & 63;    // append 6 payload bits per continuation byte
    } while (--b);
    return c;
}
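A few spot checks of the decoder in the post above, one for each sequence length (the function is repeated verbatim so the snippet is self-contained; note it performs no bounds or validity checking, by design of the challenge):

```d
dchar front(char[] s)
{
    dchar c = s[0];
    if (!(c & 0x80)) return c;
    byte b = (c >> 4) & 3;
    b += !b;
    c &= 63 >> b;
    char* p = s.ptr;
    do { p++; c = c << 6 | *p & 63; } while (--b);
    return c;
}

void main()
{
    assert(front("A".dup) == 'A');     // 1 byte
    assert(front("é".dup) == 'é');     // 2 bytes: 0xC3 0xA9
    assert(front("€".dup) == '€');     // 3 bytes: 0xE2 0x82 0xAC
    assert(front("😀".dup) == '😀');   // 4 bytes: 0xF0 0x9F 0x98 0x80
}
```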
Re: Challenge: write a really really small front() for UTF8
On Tuesday, 25 March 2014 at 10:42:59 UTC, dennis luehring wrote: void doStuff() { if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d the extra branch could kill the performance benefit if doStuff is too small You'd simply have to hoist the condition outside the inner loop. Furthermore, the branch prediction would never fail; only unpredictable branches are terrible.
Re: Challenge: write a really really small front() for UTF8
On 25.03.2014 11:38, Nick Sabalausky wrote: On 3/25/2014 4:00 AM, Iain Buclaw wrote: On 25 March 2014 00:04, Daniel N wrote: On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising. I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread (for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window. 4) It's tied to one piece of hardware. No Thankee. void doStuff() { if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d the extra branch could kill the performance benefit if doStuff is too small
Re: Challenge: write a really really small front() for UTF8
On 3/25/2014 4:00 AM, Iain Buclaw wrote: On 25 March 2014 00:04, Daniel N wrote: On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising. I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread(for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window. 4) It's tied to one piece of hardware. No Thankee. bool supportCpuFeatureX; void main() { supportCpuFeatureX = detectCpuFeatureX(); doStuff(); } void doStuff() { if(supportCpuFeatureX) doStuff_FeatureX(); else doStuff_Fallback(); } > dmd -inline blah.d
Re: Challenge: write a really really small front() for UTF8
On 25 March 2014 00:04, Daniel N wrote: > On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: >> >> I'm currently too busy to submit a complete solution, but please feel free >> to use my idea if you think it sounds promising. > > > I now managed to dig up my old C source... but I'm still blocked by dmd not > accepting the 'pext' instruction... > > 1) I know my solution is not directly comparable to the rest in this > thread(for many reasons). > 2) It's of course trivial to add a fast path for ascii... if desired. > 3) It throws safety and standards out the window. > 4) It's tied to one piece of hardware. No Thankee.
Re: Challenge: write a really really small front() for UTF8
On 24.03.2014 17:44, Andrei Alexandrescu wrote: On 3/24/14, 5:51 AM, w0rp wrote: On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable. I would strongly advise to at least offer an option Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness. Andrei

b"\255".decode("utf-8", errors="strict")  # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore")  # empty string, invalid sequence removed

I think there should be a base range for UTF8 iteration, with policy-based error handling (like in Python), and some variants that derive from this base UTF8 range with different error behavior; one of these becomes the Phobos standard (the default parameter), so it's still switchable.
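The policy idea could be sketched along these lines. `OnError` and `decodeAll` are made-up names, not Phobos API, and the decoder is deliberately minimal (no overlong/range checks); the point is one decoding loop with the error behavior chosen at compile time:

```d
enum OnError { throwException, replace, ignore }

// Hypothetical policy-based decoder, mirroring Python's errors= argument.
dchar[] decodeAll(OnError policy = OnError.throwException)(const(char)[] s)
{
    dchar[] result;
    size_t i = 0;
    outer: while (i < s.length)
    {
        ubyte b0 = s[i];
        // sequence length from the lead byte (0 = invalid lead byte)
        int len = b0 < 0x80           ? 1
                : (b0 & 0xE0) == 0xC0 ? 2
                : (b0 & 0xF0) == 0xE0 ? 3
                : (b0 & 0xF8) == 0xF0 ? 4 : 0;
        if (len == 0 || i + len > s.length)
        {
            static if (policy == OnError.throwException)
                throw new Exception("invalid UTF-8");   // like errors="strict"
            else
            {
                i++;                                    // resync: skip one byte
                static if (policy == OnError.replace)
                    result ~= '\uFFFD';                 // like errors="replace"
                continue;                               // ignore: just drop it
            }
        }
        dchar c = len == 1 ? b0 : b0 & (0x7F >> len);   // payload bits of lead byte
        foreach (j; 1 .. len)
        {
            if ((s[i + j] & 0xC0) != 0x80)              // bad continuation byte
            {
                static if (policy == OnError.throwException)
                    throw new Exception("invalid UTF-8");
                else
                {
                    i++;
                    static if (policy == OnError.replace)
                        result ~= '\uFFFD';
                    continue outer;
                }
            }
            c = c << 6 | (s[i + j] & 0x3F);
        }
        result ~= c;
        i += len;
    }
    return result;
}

void main()
{
    assert(decodeAll!(OnError.ignore)("ab\xFFc") == "abc"d);
    assert(decodeAll!(OnError.replace)("ab\xFFc") == "ab\uFFFDc"d);
}
```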
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote: I'm currently too busy to submit a complete solution, but please feel free to use my idea if you think it sounds promising. I now managed to dig up my old C source... but I'm still blocked by dmd not accepting the 'pext' instruction... 1) I know my solution is not directly comparable to the rest in this thread (for many reasons). 2) It's of course trivial to add a fast path for ascii... if desired. 3) It throws safety and standards out the window.

#include <cstdint>
#include <x86intrin.h>

char32_t front(const char* inp)
{
    static const uint32_t mask[32] = {
        0b0111'1111'0000'0000'0000'0000'0000'0000, // 1-byte (ASCII)
        0b0000'0000'0000'0000'0000'0000'0000'0000, // invalid (continuation byte as lead)
        0b0001'1111'0011'1111'0000'0000'0000'0000, // 2-byte sequence
        0b0000'1111'0011'1111'0011'1111'0000'0000, // 3-byte sequence
        0b0000'0111'0011'1111'0011'1111'0011'1111, // 4-byte sequence
    };
    uint32_t rev = __builtin_bswap32(reinterpret_cast<const uint32_t*>(inp)[0]);
    uint32_t len = __lzcnt32(~rev); // count of leading 1 bits in the lead byte
    return _pext_u32(rev, mask[len]);
}

This is what clang 3.4 generated:

## BB#0:
    pushq   %rbp
Ltmp2:
    .cfi_def_cfa_offset 16
Ltmp3:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp4:
    .cfi_def_cfa_register %rbp
    movl    (%rdi), %eax
    bswapl  %eax
    movl    %eax, %ecx
    notl    %ecx
    lzcntl  %ecx, %ecx
    leaq    __ZZ5frontPKcE4mask(%rip), %rdx
    pextl   (%rdx,%rcx,4), %eax, %eax
    popq    %rbp
    ret
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote: On 3/24/14, 2:02 AM, monarch_dodra wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error. Andrei That sounds good to me (I think). Better than a straight up assert anyways... ...which is what the code you submitted does. Also, I think relying on bounds checking is wrong: disabling bounds checking means you are OK with memory corruption for *wrong code*. I don't think an invalid string is wrong code. 'front' should do an actual "if", and "assert(0)" when needed. This should have no impact on performance BTW, if done with pointer extraction ("slice.ptr[1]"), and in a trusted context. Of course, this is all assuming we are asserting at all.
Re: Challenge: write a really really small front() for UTF8
On 24-Mar-2014 01:22, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei I had to join the party at some point. This seems like 25 instructions: http://goo.gl/N7sHtK -- Dmitry Olshansky
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 16:17:52 UTC, Andrei Alexandrescu wrote: On 3/24/14, 12:53 AM, Chris Williams wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei http://goo.gl/TaZTNB Nice! Why assert(ret != 0)? Andrei It should be -1. I wanted to make sure that bsr() wasn't called on zero, but of course that's not the correct way to test it when I'm about to ~ it.
Re: Challenge: write a really really small front() for UTF8
Whoops, wrong "original" version. That was the one before I understood the game. Here mine is fixed (but up to 180 lines - 16 labels = 164 instructions): http://goo.gl/wjIAVm On Monday, 24 March 2014 at 17:14:11 UTC, dnspies wrote: It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does). Going back to my original version (with Andrei's trick to avoid letting the loop unroll), I get dnspies_orig 159 - 16 labels = 143 instructions However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize). MFortin1_slice 137 - 13 labels = 124 instructions http://goo.gl/JAgVK1 On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote: On 2014-03-24 06:08:30 +0000, "safety0ff" said: Everything seems to work in your corrected versions: http://dpaste.dzfl.pl/3b32535559ba Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization. Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: http://goo.gl/QjnA2k implementation / line count of main in assembly: andralex 285 - 9 labels = 276 instructions MFortin1 135 - 13 labels = 122 instructions MFortin2 150 - 14 labels = 136 instructions safety0ff -- not inlined (too big?) -- dnspies 161 - 15 labels = 146 instructions CWilliams 233 - 25 labels = 208 instructions For compactness, my first implementation seems to be the winner, both with and without inlining. That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_].
So I added this line to my two implementations: if (1+tailLength >= s.length) return dchar.init; Now let's see what happens with those changes: http://goo.gl/XPCGYE implementation / line count of main in assembly: MFortin1Check 103 - 11 labels = 92 instructions MFortin2Check 135 - 13 labels = 122 instructions Now I'm baffled. Adding a test makes the code shorter? It actually makes the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: http://goo.gl/E2Q0Yu implementation / line count of main in assembly: andralex 384 - 9 labels = 375 instructions MFortin1 173 - 19 labels = 154 instructions MFortin2 182 - 18 labels = 164 instructions safety0ff -- not inlined -- dnspies -- not inlined -- CWilliams 229 - 23 labels = 206 instructions MFortin1Check 211 - 24 labels = 187 instructions MFortin2Check 218 - 21 labels = 197 instructions So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lie at the end of the string, so I'd favor MFortin1Check.
Re: Challenge: write a really really small front() for UTF8
It seems like mine wasn't being inlined because I had carelessly replaced char[] with const(char)[] in the function signature (I don't know why that should make a difference, but it does). Going back to my original version (with Andrei's trick to avoid letting the loop unroll), I get dnspies_orig 159 - 16 labels = 143 instructions However, MFortin1 can be made to do better by taking successive slices instead of indexing (and commenting out the check, since it no longer helps optimize). MFortin1_slice 137 - 13 labels = 124 instructions http://goo.gl/JAgVK1 On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote: On 2014-03-24 06:08:30 +0000, "safety0ff" said: Everything seems to work in your corrected versions: http://dpaste.dzfl.pl/3b32535559ba Andrei's version doesn't compile on recent compiler versions due to goto skipping initialization. Ok, so let's check what happens in actual usage. In other words, what happens when front is inlined and used in a loop. I'm counting the number of assembly lines of this main function for looping through each possible input using the various implementations: http://goo.gl/QjnA2k implementation / line count of main in assembly: andralex 285 - 9 labels = 276 instructions MFortin1 135 - 13 labels = 122 instructions MFortin2 150 - 14 labels = 136 instructions safety0ff -- not inlined (too big?) -- dnspies 161 - 15 labels = 146 instructions CWilliams 233 - 25 labels = 208 instructions For compactness, my first implementation seems to be the winner, both with and without inlining. That said, I think front should only assume a non-empty char[] and thus should check the length before accessing anything after the first byte to protect against incomplete multi-byte sequences such as [0x1000_].
So I added this line to my two implementations: if (1+tailLength >= s.length) return dchar.init; Now let's see what happens with those changes: http://goo.gl/XPCGYE implementation / line count of main in assembly: MFortin1Check 103 - 11 labels = 92 instructions MFortin2Check 135 - 13 labels = 122 instructions Now I'm baffled. Adding a test makes the code shorter? It actually makes the standalone functions longer by 8 instructions (as expected), but the inlined version is shorter. My guess is that the optimizer creates something entirely different and it turns out that this different version optimises better after inlining. That said, my test "main" function isn't very representative of the general case because the length is statically known by the optimizer. Let's see what it does when the length is not statically known: http://goo.gl/E2Q0Yu implementation / line count of main in assembly: andralex 384 - 9 labels = 375 instructions MFortin1 173 - 19 labels = 154 instructions MFortin2 182 - 18 labels = 164 instructions safety0ff -- not inlined -- dnspies -- not inlined -- CWilliams 229 - 23 labels = 206 instructions MFortin1Check 211 - 24 labels = 187 instructions MFortin2Check 218 - 21 labels = 197 instructions So again MFortin1 is the winner for compactness. Still, I maintain that we ought to at least check the length of the array for multi-byte sequences to protect against incomplete sequences that could lie at the end of the string, so I'd favor MFortin1Check.
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 16:41:02 UTC, John Colvin wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei On a bigendian machine with loose alignment requirements (1 byte), you can do this Same again, but for little-endian, 18 instructions: http://goo.gl/jlrweQ

uint front(char[] s)
{
    if(!(s[0] & 0b1000_0000)) return s[0]; // handle ASCII
    assert(s[0] & 0b0100_0000);
    if(s[0] & 0b0010_0000)
    {
        if(s[0] & 0b0001_0000)
        {
            assert(s.length >= 4 && !(s[0] & 0b1000)
                && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111);
            return swapEndian(*(cast(dchar*)(s.ptr)));
        }
        assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
        return swapEndian(*(cast(dchar*)(s.ptr))) >> 8;
    }
    assert(s.length >= 2 && s[1] <= 0b1011_1111);
    return swapEndian(*(cast(wchar*)(s.ptr)));
}
Re: Challenge: write a really really small front() for UTF8
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei On a bigendian machine with loose alignment requirements (1 byte), you can do this, which is down to 13 instructions on x86 (which is of course meaningless, what with it being the wrong endianness):

uint front(char[] s)
{
    if(!(s[0] & 0b1000_0000)) return s[0]; // handle ASCII
    assert(s[0] & 0b0100_0000);
    if(s[0] & 0b0010_0000)
    {
        if(s[0] & 0b0001_0000)
        {
            assert(s.length >= 4 && !(s[0] & 0b1000)
                && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111 && s[3] <= 0b1011_1111);
            return *(cast(dchar*)(s.ptr));
        }
        assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
        return *(cast(dchar*)(s.ptr)) >> 8;
    }
    assert(s.length >= 2 && s[1] <= 0b1011_1111);
    return *(cast(wchar*)(s.ptr));
}

http://goo.gl/Kf6RZJ There may be architectures that can benefit from this.
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 5:51 AM, w0rp wrote: On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? Having a fast "front" is fine and all, but if it means your program asserting in release (or worse, silently corrupting memory) just because the client was trying to read a bad text file, I'm unsure this is acceptable. I would strongly advise to at least offer an option Options are fine for functions etc. But front would need to find an all-around good compromise between speed and correctness. Andrei
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 3:30 AM, dnspies wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Ok, I managed to make it smaller. I think this is the smallest so far with only 23 instructions (no loop unrolling in this one): http://goo.gl/RKF5Vm This is nice because even the checks are relatively cheap. It's quite difficult to understand tho :o). -- Andrei
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu wrote: On 3/24/14, 2:02 AM, monarch_dodra wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error. Ignoring UTF errors is a lossy operation and has the potential problem of irreversible data loss. For example, consider a program which needs to process Windows-1251 files: it would need to read the file from disk, convert to UTF-8, process it, convert back to Windows-1251, and save it back to disk. If a bug causes the UTF-8 conversion step to be accidentally skipped, then all Unicode data in that file is lost. I think UTF-8 decoding operations should, by default, throw on UTF-8 errors. Ignoring UTF-8 errors should only be done explicitly, with the programmer's consent.
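The irreversibility argument can be seen in a tiny sketch (`std.utf.toUTF8` is real Phobos; the Windows-1251 scenario is illustrative):

```d
import std.utf : toUTF8;

void main()
{
    // Suppose a Windows-1251 byte 0xFF ('я') was fed to a lenient UTF-8
    // decoder that silently substituted U+FFFD. Re-encoding cannot
    // recover the original byte:
    dchar[] repaired = ['a', '\uFFFD', 'b'];
    string back = toUTF8(repaired);
    assert(back == "a\uFFFDb"); // the Windows-1251 data is gone for good
}
```

With throw-by-default decoding, the skipped-conversion bug would surface as an exception at the first bad byte instead of as silent data loss.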
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 2:02 AM, monarch_dodra wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Before we roll this out, could we discuss a strategy/guideline in regards to detecting and handling invalid UTF sequences? I think std.array.front should return the invalid dchar on error, and popFront should attempt to resync on error. Andrei
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 1:56 AM, dnspies wrote: On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); } Sorry, I misunderstood. We only want the x's in the output. Here's my revised solution http://goo.gl/PL729J That turned out quite large. -- Andrei
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 1:06 AM, dnspies wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei Here's mine (the loop gets unrolled): dchar front(const char[] s) { assert(s.length > 0); byte c = s[0]; dchar res = cast(ubyte)c; if(c >= 0) { return res; } c <<= 1; assert(c < 0); for(int i = 1; i < 4; i++) { assert(i < s.length); ubyte d = s[i]; assert((d >> 6) == 0b10); res = (res << 8) | d; c <<= 1; if(c >= 0) return res; } assert(false); } http://goo.gl/HBSv5T - looks nice! For minimum size replace the assert(false) with one inside the loop: http://goo.gl/eWs0ft Andrei
Re: Challenge: write a really really small front() for UTF8
On 3/24/14, 12:53 AM, Chris Williams wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei http://goo.gl/TaZTNB Nice! Why assert(ret != 0)? Andrei
Re: Challenge: write a really really small front() for UTF8
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei http://forum.dlang.org/post/fsgdviklnbugdzjsg...@forum.dlang.org
Re: Challenge: write a really really small front() for UTF8
On Monday, 24 March 2014 at 15:32:09 UTC, John Colvin wrote: On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote: Here's a baseline: http://goo.gl/91vIGc. Destroy! Andrei I've probably missed something, but here's 15 instructions: http://goo.gl/d3Mgj0 Whoops, yeah, that's completely broken.