Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread H. S. Teoh via Digitalmars-d
On Mon, Apr 20, 2015 at 06:03:49PM +, John Colvin via Digitalmars-d wrote:
> On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> >To measure the columns needed to print a string, you'll need the
> >number of graphemes. (d|)?string.length gives you the number of code
> >units.
> 
> Even that's not really true. In the end it's up to the font and layout
> engine to decide how much space anything takes up. Unicode doesn't
> play nicely with the idea of text as a grid of rows and fixed-width
> columns of characters, although quite a lot can (and is, see urxvt for
> example) be shoe-horned in.

Yeah, even the grapheme count does not necessarily tell you how wide the
printed string really is. The characters in the CJK block are usually
rendered with fonts that are, on average, twice as wide as your typical
Latin/Cyrillic character, so even applications like urxvt that shoehorn
proportional-width fonts into a text grid render CJK characters as two
columns rather than one.

Because of this, I actually wrote a function at one time to determine
the width of a given Unicode character (i.e., zero, single, or double)
as displayed in urxvt. Obviously, this is no help if you need to wrap
lines rendered with a proportional font. And it doesn't even attempt to
work correctly with bidi text.
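A minimal sketch of such a width function (hypothetical name; the ranges are deliberately incomplete, a real version would mirror the full Unicode East Asian Width and combining-mark tables):

```d
import std.stdio;

// Hypothetical sketch: estimate the terminal column width (0, 1 or 2)
// of a code point for a urxvt-style character grid. Illustrative only.
int displayWidth(dchar c)
{
    // combining diacritical marks occupy no column of their own
    if (c >= 0x0300 && c <= 0x036F)
        return 0;
    // a few common double-width blocks
    if ((c >= 0x1100 && c <= 0x115F) ||  // Hangul Jamo
        (c >= 0x2E80 && c <= 0x9FFF) ||  // CJK radicals .. unified ideographs
        (c >= 0xAC00 && c <= 0xD7A3) ||  // Hangul syllables
        (c >= 0xFF00 && c <= 0xFF60))    // fullwidth forms
        return 2;
    return 1;
}

void main()
{
    assert(displayWidth('a') == 1);      // Latin: one column
    assert(displayWidth('\u0301') == 0); // combining acute: zero columns
    assert(displayWidth('\u6F22') == 2); // 漢: two columns
    writeln("ok");
}
```

Even with complete tables this only answers the question for a fixed-grid terminal; proportional fonts and bidi remain out of reach for any function that sees only the string.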

This is why I said at the beginning that wrapping a line of text is a
LOT harder than it sounds. A function that only takes a string as input
does not have the necessary information to do this correctly in all use
cases. The current wrap() function doesn't even do it correctly modulo
the information available: it doesn't handle combining diacritics and
zero-width characters properly. In fact, it doesn't even handle control
characters properly, except perhaps for \t and \n. There are so many
things wrong with the current wrap() function (and many other
string-processing functions in Phobos) that it makes it look like a joke
when we claim that D provides Unicode correctness out-of-the-box.

The only use case where wrap() gives the correct result is when you
stick with pre-Unicode Latin strings to be displayed on a text console.
As such, I don't really see the general utility of wrap() as it
currently stands, and I question its value in Phobos, as opposed to an
actually more useful implementation that, for instance, correctly
implements the Unicode line-breaking algorithm.


T

-- 
It said to install Windows 2000 or better, so I installed Linux instead.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread John Colvin via Digitalmars-d

On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need 
the number of graphemes. (d|)?string.length gives you the 
number of code units.


Even that's not really true. In the end it's up to the font and 
layout engine to decide how much space anything takes up. Unicode 
doesn't play nicely with the idea of text as a grid of rows and 
fixed-width columns of characters, although quite a lot can (and 
is, see urxvt for example) be shoe-horned in.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Panke via Digitalmars-d
This can lead to subtle bugs, cf. length of random and e_one. 
You have to convert everything to dstring to get the "expected" 
result. However, this is not always desirable.


There are three things that you need to be aware of when handling 
Unicode: code units, code points and graphemes.


In general the length of one guarantees nothing about the length 
of the other, except for UTF-32, which has a 1:1 mapping between 
code units and code points.


In this thread, we were discussing the relationship between code 
points and graphemes. Your examples, however, apply to the 
relationship between code units and code points.


To measure the columns needed to print a string, you'll need the 
number of graphemes. (d|)?string.length gives you the number of 
code units.


If you normalize a string (in the sequence-of-characters/code-points 
sense, not object.string) to NFC, it will decompose every 
precomposed character in the string (like é, a single code point), 
establish a defined order between the combining characters, and 
then recompose a selected few graphemes (like é). This way é 
always ends up as a single code point in NFC. There are dozens of 
other combinations where you'll still have an n:1 mapping between 
code points and graphemes left after normalization.


Example given already in this thread: putting an arrow over a 
Latin letter is typical in math and is always more than one 
code point.
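The three counts can be seen directly in D (a minimal sketch; byGrapheme assumes a reasonably recent std.uni):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // Latin letter with a combining arrow above (U+20D7): one grapheme,
    // two code points, four UTF-8 code units.
    string s = "x\u20D7";
    assert(s.length == 4);                // code units (UTF-8)
    assert(s.walkLength == 2);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```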




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Chris via Digitalmars-d

On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:


Yes, again and again I encountered length related bugs with 
Unicode characters. Normalization is not 100% reliable.


I think it is 100% reliable; it just doesn't make the problems 
go away. It guarantees that two strings normalized to the 
same form are binary equal iff they are equal in the Unicode 
sense. It says nothing about columns, string length or grapheme count.


The problem is not normalization as such, the problem is with 
string (as opposed to dstring):


import std.uni : normalize, NFC;
void main() {

  dstring de_one = "é";
  dstring de_two = "e\u0301";

  assert(de_one.length == 1);
  assert(de_two.length == 2);

  string e_one = "é";
  string e_two = "e\u0301";

  string random = "ab";

  assert(e_one.length == 2);
  assert(e_two.length == 3);
  assert(e_one.length == random.length);

  assert(normalize!NFC(e_one).length == 2);
  assert(normalize!NFC(e_two).length == 2);
}

This can lead to subtle bugs, cf. length of random and e_one. You 
have to convert everything to dstring to get the "expected" 
result. However, this is not always desirable.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Panke via Digitalmars-d


Yes, again and again I encountered length related bugs with 
Unicode characters. Normalization is not 100% reliable.


I think it is 100% reliable; it just doesn't make the problems go 
away. It guarantees that two strings normalized to the same 
form are binary equal iff they are equal in the Unicode sense. 
It says nothing about columns, string length or grapheme count.
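That guarantee, and only that guarantee, can be checked directly (minimal sketch):

```d
import std.uni : normalize, NFC;

void main()
{
    string decomposed  = "e\u0301"; // 'e' + combining acute accent
    string precomposed = "é";       // U+00E9
    assert(decomposed != precomposed); // binary different...
    // ...but equal once both are brought to the same normalization form
    assert(normalize!NFC(decomposed) == normalize!NFC(precomposed));
}
```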


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Chris via Digitalmars-d

On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
Isn't this solved commonly with a normalization pass? We 
should have a normalizeUTF() that can be inserted in a 
pipeline.


Yes.

Then the rest of Phobos doesn't need to mind these combining 
characters. -- Andrei


I don't think so. The thing is, even after normalization we 
have to deal with combining characters, because every 
normalization form leaves some combining characters in place.


Yes, again and again I have encountered length-related bugs with 
Unicode characters. Normalization is not 100% reliable. I don't 
know anyone who works with non-English characters who doesn't 
run into Unicode-related issues sometimes.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Jacob Carlborg via Digitalmars-d

On 2015-04-20 08:04, Nick B wrote:


Perhaps a new Unicode standard could start that way as well?


https://xkcd.com/927/

--
/Jacob Carlborg


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-20 Thread Shachar Shemesh via Digitalmars-d

On 19/04/15 22:58, ketmar wrote:

On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:

it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf



This is not a very accurate depiction of Unicode.

For example:
And, moreover, BOM is meaningless without mentioning of encoding. So we 
have to specify encoding anyway.


No. The BOM is what lets you auto-detect the encoding. If you know 
the text is UTF-8, UTF-16 or UTF-32, but not which, the BOM will 
tell you which it is. That is its entire purpose, in fact.


There, pretty much, goes point #1.
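That auto-detection can be sketched like this (hypothetical helper name; note that the four-byte UTF-32 BOMs must be checked before the two-byte UTF-16 ones, because the UTF-32LE BOM begins with the UTF-16LE one):

```d
// Hypothetical sketch of BOM-based detection among the UTF encodings.
string detectBom(const(ubyte)[] data)
{
    if (data.length >= 4 && data[0] == 0xFF && data[1] == 0xFE
        && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32LE";
    if (data.length >= 4 && data[0] == 0x00 && data[1] == 0x00
        && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32BE";
    if (data.length >= 3 && data[0] == 0xEF && data[1] == 0xBB
        && data[2] == 0xBF)
        return "UTF-8";
    if (data.length >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16LE";
    if (data.length >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16BE";
    return "unknown";
}

void main()
{
    assert(detectBom([0xEF, 0xBB, 0xBF, 0x68]) == "UTF-8");
    assert(detectBom([0xFF, 0xFE, 0x68, 0x00]) == "UTF-16LE");
    assert(detectBom([0x68, 0x69]) == "unknown");
}
```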

And then:
Unicode contains at least “writing direction” control symbols (LTR is 
U+200E and RTL is U+200F) which role is IDENTICAL to the role of 
codepage-switching symbols with the associated disadvantages.


That's just ignorance of how the UBA (TR#9) works. LRM and RLM are mere 
invisible characters with defined directionality. Cutting them away from 
a substring would not invalidate your text more than cutting away actual 
text would under the same conditions. In any case, unlike page switching 
symbols, it would only affect your display, not your understanding of 
the text.


So point #2 is out.

He has some valid arguments under point #3, but also lots of !(@#&$ 
nonsense. He is right, I think, that denoting units with separate code 
points makes no sense, but the rest of his arguments seem completely 
off. For example, asking Latin and Cyrillic to share the same region 
merely because some letters look alike makes no sense, implementation-wise.



Points #4, #5, #6 and #7 are the same point. The main objection I have 
there is his assumption that the situation is, somehow, worse than it 
was. Yes, if you knew your encoding was Windows-1255, you could assume 
the text is Hebrew.


Or Yiddish.

And this, I think, is one of the encodings with the fewest 
languages riding on it. Windows-1256 has Arabic, Persian, Urdu and 
others. Windows-1252 has the entire Western European script. As pointed 
out elsewhere in this thread, Spanish and French treat case folding of 
accented letters differently.


Also, we see that the solution he thinks would work better actually 
doesn't. People living in France don't switch to a QWERTY keyboard when 
they want to type English. They type English with their AZERTY keyboard. 
There simply is no automatic way to tell what language something is 
typed in without a human telling you (or applying content-based heuristics).


Microsoft Word stores, for each letter, the keyboard language 
it was typed with. This causes great problems when copying to other 
editors, performing searches, or simply trying to get bidirectional text 
to appear correctly. The problem is so bad that phone numbers where the 
prefix appears after the actual number are not considered bad form or 
unusual, even in official PR material or when sending resumes.


In fact, the only time you can count on someone to switch keyboards is 
when they need to switch to a language with a different alphabet. No 
Russian speaker will type English using the Russian layout, even if what 
she has to say happens to use letters with the same glyphs. You simply 
do not plan that much ahead.


The point I'm driving at is that just because someone posted a rant on 
the Internet doesn't mean it's correct. When someone says something is 
broken, always ask them what they suggest instead.


Shachar


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread Nick B via Digitalmars-d

On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:

On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:


Perhaps Unicode needs to be rebuilt from the ground up?


alas, it's too late. now we'll live with that "unicode" crap 
for many years.


Perhaps. Or perhaps not. This community got together under Walter's 
and Andrei's leadership to build a new programming language on 
the pillars of the old.

Perhaps a new Unicode standard could start that way as well?



Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread ketmar via Digitalmars-d
On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:

> Perhaps Unicode needs to be rebuilt from the ground up?

alas, it's too late. now we'll live with that "unicode" crap for many 
years.

signature.asc
Description: PGP signature


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread Nick B via Digitalmars-d

On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:

On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:



it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf


Ketmar

Great link, and a really good argument about the problems with 
Unicode.


Quote from 'Instead of Conclusion':

Yes. This is the root of Unicode misdesign. They mixed up two 
mutually exclusive approaches. They blended badly two different 
abstraction levels: the textual level which corresponds to a 
language idea and the graphical level which does not care of a 
language, yet cares of writing direction, subscripts, superscripts 
and so on.

In other words we need two different Unicodes built on these two 
opposite principles, instead of the one built on an insane mix of 
controversial axioms.

End quote.

Perhaps Unicode needs to be rebuilt from the ground up?


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread weaselcat via Digitalmars-d

On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:

On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:

é might be obvious, but Unicode isn't just for writing 
European prose.


it is also to insert pictures of the animals into text.


There are other uses for Unicode?
🐧


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread ketmar via Digitalmars-d
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:

> é might be obvious, but Unicode isn't just for writing European prose.

it is also to insert pictures of the animals into text.

> Unicode is a nightmarish system in some ways, but considering how
> incredibly difficult the problem it solves is, it's actually not too
> crazy.

it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf

signature.asc
Description: PGP signature


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread via Digitalmars-d

On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
U0065+U0301 rather than U00e9. Because of legacy systems, and 
because they would rather have the ISO-8859 code pages be 1:1 
mappings, rather than 1:n mappings, they introduced code points 
they really would rather do without.


That's probably right. It is in fact a major feat to have the 
world adopt a new standard wholesale, but there are also 
difficult "semiotic" issues when you encode symbols and different 
languages view symbols differently (e.g. is "ä" an "a" or do you 
have two unique letters in the alphabet?)


Take "å", it can represent a unit (ångström) or a letter with a 
circle above it, or a unique letter in the alphabet. The letter 
"æ" can be seen as a combination of "ae" or a unique letter.


And we can expect languages, signs and practices to evolve over 
time too. How can you normalize encodings without normalizing 
writing practice and natural language development? That would be 
beyond the mandate of a unicode standard organization...


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread Shachar Shemesh via Digitalmars-d

On 19/04/15 10:51, Abdulhaq wrote:

On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:

On 18/04/15 21:40, Walter Bright wrote:



Also, notice that some letters can only be achieved using multiple
code points. Hebrew diacritics, for example, do not, typically, have a
composite form. My name fully spelled (which you rarely would do),
שַׁחַר, cannot be represented with less than 6 code points, despite
having only three letters.



Yes Arabic is similar too



Actually, the Arabic presentation forms serve a slightly different 
purpose. In Hebrew, the presentation forms are mostly for Biblical 
text, where certain decorations are usually done.


For Arabic, the main reason for the presentation forms is shaping. 
Almost every Arabic letter can be written in up to four different forms 
(alone, start of word, middle of word and end of word). This means that 
Arabic has 28 letters, but over 100 different shapes for those letters. 
These days, when the font can do the shaping, the 28 letters suffice. 
During the DOS days, you needed to actually store those glyphs 
somewhere, which means that you needed to allocate a number to them.


In Hebrew, some letters also have a final form. Since the numbers are so 
significantly smaller, however (22 letters, 5 of which have final 
forms), Hebrew keyboards actually have all 27 letters on them. Going 
strictly by the "Unicode way", one would be expected to spell שלום with 
U05DE as the last letter, and let the shaping engine figure out that it 
should use the final form (or add a ZWNJ). Since all Hebrew code charts 
contained a final-form Mem, however, you actually spell it with U05DD at 
the end, and it is considered a distinct letter.
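The distinct-letter point is visible at the code-point level (a small sketch, using escapes to keep the source unambiguous):

```d
void main()
{
    // שלום as Hebrew keyboards produce it: shin, lamed, vav, final mem.
    // The last letter is final mem U+05DD, a code point distinct from
    // regular mem U+05DE.
    dstring shalom = "\u05E9\u05DC\u05D5\u05DD";
    assert(shalom.length == 4);
    assert(shalom[3] == '\u05DD'); // the final form, not U+05DE
}
```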


Shachar


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread Abdulhaq via Digitalmars-d

On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:

On 18/04/15 21:40, Walter Bright wrote:


I'm not arguing against the existence of the Unicode standard, 
I'm
saying I can't figure any justification for standardizing 
different

encodings of the same thing.



A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points: é (U00e9, 
Latin small letter E with acute) comes from the Latin-1 block, 
which is designed to follow ISO-8859-1. U0301 (combining acute 
accent) comes from "Combining Diacritical Marks".


The way I understand things, Unicode would really prefer to use 
U0065+U0301 rather than U00e9. Because of legacy systems, and 
because they would rather have the ISO-8859 code pages be 1:1 
mappings, rather than 1:n mappings, they introduced code points 
they really would rather do without.


This also explains the "presentation forms" code pages (e.g. 
http://www.unicode.org/charts/PDF/UFB00.pdf). These were 
intended to be glyphs, rather than code points. Due to legacy 
reasons, it was not possible to simply discard them. They 
received code points, with a warning not to use these code 
points directly.


Also, notice that some letters can only be achieved using 
multiple code points. Hebrew diacritics, for example, do not, 
typically, have a composite form. My name fully spelled (which 
you rarely would do), שַׁחַר, cannot be represented with less 
than 6 code points, despite having only three letters.


The last paragraph isn't strictly true. You can use UFB2C + 
U05B7 for the first letter instead of U05E9 + U05C2 + U05B7. 
You would be using the presentation form which, as pointed 
out above, is only there for legacy.


Shachar
or shall I say
שחר


Yes Arabic is similar too



Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread John Colvin via Digitalmars-d

On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:

On 4/18/2015 4:35 AM, Jacob Carlborg wrote:

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


I won't deny what the spec says, but it doesn't make any sense 
to have two different representations of eacute, and I don't 
know why anyone would use the two code point version.


é might be obvious, but Unicode isn't just for writing European 
prose. Uses for combining characters includes (but is *nowhere* 
near to limited to) mathematical notation, where the 
combinatorial explosion of possible combinations that still 
belong to one grapheme cluster (character is a familiar but 
misleading word when talking about Unicode) would trivially 
become an insanely (more atoms than in the universe levels of) 
large number of characters.


Unicode is a nightmarish system in some ways, but considering how 
incredibly difficult the problem it solves is, it's actually not 
too crazy.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-19 Thread John Colvin via Digitalmars-d
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu 
wrote:

On 4/18/15 4:35 AM, Jacob Carlborg wrote:

On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the 
accented e

all have Unicode code point assignments.


This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
dstring a = "e\u0301";
dstring b = "é";
assert(a != b);
assert(a.length == 2);
assert(b.length == 1);
writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your 
system
correctly supports Unicode (works on OS X 10.10) the two 
printed

characters should look exactly the same.

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


Isn't this solved commonly with a normalization pass? We should 
have a normalizeUTF() that can be inserted in a pipeline. Then 
the rest of Phobos doesn't need to mind these combining 
characters. -- Andrei


Normalisation can allow some simplifications, sometimes, but 
knowing whether it will or not requires a lot of a priori 
knowledge about the input as well as the normalisation form.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Shachar Shemesh via Digitalmars-d

On 18/04/15 21:40, Walter Bright wrote:


I'm not arguing against the existence of the Unicode standard, I'm
saying I can't figure any justification for standardizing different
encodings of the same thing.



A lot of areas in Unicode are due to pre-Unicode legacy.

I'm guessing here, but looking at the code points: é (U00e9, Latin 
small letter E with acute) comes from the Latin-1 block, which is designed 
to follow ISO-8859-1. U0301 (combining acute accent) comes from 
"Combining Diacritical Marks".


The way I understand things, Unicode would really prefer to use 
U0065+U0301 rather than U00e9. Because of legacy systems, and because 
they would rather have the ISO-8859 code pages be 1:1 mappings, rather 
than 1:n mappings, they introduced code points they really would rather 
do without.


This also explains the "presentation forms" code pages (e.g. 
http://www.unicode.org/charts/PDF/UFB00.pdf). These were intended to be 
glyphs, rather than code points. Due to legacy reasons, it was not 
possible to simply discard them. They received code points, with a 
warning not to use these code points directly.


Also, notice that some letters can only be achieved using multiple code 
points. Hebrew diacritics, for example, do not, typically, have a 
composite form. My name fully spelled (which you rarely would do), שַׁחַר, 
cannot be represented with less than 6 code points, despite having only 
three letters.


The last paragraph isn't strictly true. You can use UFB2C + U05B7 for 
the first letter instead of U05E9 + U05C2 + U05B7. You would be using 
the presentation form which, as pointed out above, is only there for legacy.


Shachar
or shall I say
שחר


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote:

However, I think Walter's goal here is to match the original wrap()
functionality.


Yes, although the overarching goal is:

Minimize Need For Using GC In Phobos

and the method here is to use ranges rather than having to allocate string 
temporaries.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:

Take it up with the Unicode consortium. :-)


I see nobody knows :-)



Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 17, 2015 at 08:44:51PM +, Panke via Digitalmars-d wrote:
> On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
> >On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
> >
> >>Well, talk is cheap, so here's a working implementation of the
> >>non-Unicode-correct line wrapper that uses ranges and does not
> >>allocate:
> >
> >there is some... inconsistency: `std.string.wrap` adds final "\n" to
> >string. ;-) but i always hated it for that.
> 
> A range of lines instead of inserted \n would be a good API as well.

Indeed, that would be even more useful, then you could just do
.joiner("\n") to get the original functionality.

However, I think Walter's goal here is to match the original wrap()
functionality.

Perhaps the prospective wrapped() function could be implemented in terms
of a byWrappedLines() function which does return a range of wrapped
lines.
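A rough eager stand-in for that idea (byWrappedLines is the hypothetical name from the post; this sketch counts code units, so it has exactly the Unicode problems discussed elsewhere in this thread):

```d
import std.algorithm : splitter;
import std.array : join;

// Greedy word wrapper returning the wrapped lines. The proposed
// wrapped() would then just be byWrappedLines(...).join("\n").
string[] byWrappedLines(string text, size_t width)
{
    string[] lines;
    string cur;
    foreach (word; text.splitter(' '))
    {
        if (cur.length == 0)
            cur = word;
        else if (cur.length + 1 + word.length <= width)
            cur ~= " " ~ word;  // word still fits on the current line
        else
        {
            lines ~= cur;       // flush the full line, start a new one
            cur = word;
        }
    }
    if (cur.length)
        lines ~= cur;
    return lines;
}

void main()
{
    auto ls = byWrappedLines("the quick brown fox jumps", 10);
    assert(ls == ["the quick", "brown fox", "jumps"]);
    // original wrap()-style output, final "\n" included:
    assert(ls.join("\n") ~ "\n" == "the quick\nbrown fox\njumps\n");
}
```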


T

-- 
The volume of a pizza of thickness a and radius z can be described by the 
following formula: pi zz a. -- Wouter Verhelst


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
[...]
> >When we don't know provenance of incoming data, we have to assume the
> >worst and run normalization to be sure that we got it right.
> 
> I'm not arguing against the existence of the Unicode standard, I'm
> saying I can't figure any justification for standardizing different
> encodings of the same thing.

Take it up with the Unicode consortium. :-)


T

-- 
Tech-savvy: euphemism for nerdy.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
> >On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d 
> >wrote:
> >>On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> >>>One possible solution would be to modify std.uni.graphemeStride to
> >>>not allocate, since it shouldn't need to do so just to compute the
> >>>length of the next grapheme.
> >>
> >>That should be done. There should be a fixed maximum codepoint count
> >>to graphemeStride.
> >
> >Why? Scanning a string for a grapheme of arbitrary length does not
> >need allocation since you're just reading data. Unless there is some
> >required intermediate representation that I'm not aware of?
> 
> If there's no need for allocation at all, why does it allocate? This
> should be fixed.

AFAICT, the only reason it allocates is because it shares the same
underlying implementation as byGrapheme. There's probably a way to fix
this, I just don't have the time right now to figure out the code.
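For reference, the call under discussion behaves like this (sketch; graphemeStride returns the length, in code units, of the grapheme starting at the given index):

```d
import std.uni : graphemeStride;

void main()
{
    string s = "e\u0301x"; // 'e' + combining acute accent, then 'x'
    // the first grapheme spans 3 UTF-8 code units: 'e' (1) + U+0301 (2)
    assert(graphemeStride(s, 0) == 3);
    // the next grapheme is just 'x'
    assert(graphemeStride(s, 3) == 1);
}
```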


T

-- 
Маленькие детки - маленькие бедки.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:

On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:

On 4/18/2015 4:35 AM, Jacob Carlborg wrote:

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


I won't deny what the spec says, but it doesn't make any sense to have
two different representations of eacute, and I don't know why anyone
would use the two code point version.


Well, *somebody* has to convert it to the single code point eacute,
whether it's the human (if the keyboard has a single key for it), or the
code interpreting keystrokes (the user may have typed it as e +
combining acute), or the program that generated the combination, or the
program that receives the data.


Data entry should be handled by the driver program, not a universal interchange 
format.




When we don't know provenance of
incoming data, we have to assume the worst and run normalization to be
sure that we got it right.


I'm not arguing against the existence of the Unicode standard, I'm saying I 
can't figure any justification for standardizing different encodings of the same 
thing.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:

On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:

On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:

One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next grapheme.


That should be done. There should be a fixed maximum codepoint count
to graphemeStride.


Why? Scanning a string for a grapheme of arbitrary length does not need
allocation since you're just reading data. Unless there is some required
intermediate representation that I'm not aware of?


If there's no need for allocation at all, why does it allocate? This should be 
fixed.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
> >One possible solution would be to modify std.uni.graphemeStride to
> >not allocate, since it shouldn't need to do so just to compute the
> >length of the next grapheme.
> 
> That should be done. There should be a fixed maximum codepoint count
> to graphemeStride.

Why? Scanning a string for a grapheme of arbitrary length does not need
allocation since you're just reading data. Unless there is some required
intermediate representation that I'm not aware of?


T

-- 
"How are you doing?" "Doing what?"


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
> >\u0301 is the "combining acute accent" [1].
> >
> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> 
> I won't deny what the spec says, but it doesn't make any sense to have
> two different representations of eacute, and I don't know why anyone
> would use the two code point version.

Well, *somebody* has to convert it to the single code point eacute,
whether it's the human (if the keyboard has a single key for it), or the
code interpreting keystrokes (the user may have typed it as e +
combining acute), or the program that generated the combination, or the
program that receives the data. When we don't know provenance of
incoming data, we have to assume the worst and run normalization to be
sure that we got it right.

The two code-point version may also arise from string concatenation, in
which case normalization has to be done again (or possibly from the
point of concatenation, given the right algorithms).
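The concatenation case is easy to demonstrate (minimal sketch):

```d
import std.uni : normalize, NFC;

void main()
{
    string head = "e";      // NFC on its own
    string tail = "\u0301"; // combining acute, also NFC on its own
    string joined = head ~ tail;
    assert(joined.length == 3);                // still decomposed
    assert(normalize!NFC(joined).length == 2); // recomposed to é (U+00E9)
}
```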


T

-- 
Mediocrity has been pushed to extremes.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 4:35 AM, Jacob Carlborg wrote:

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


I won't deny what the spec says, but it doesn't make any sense to have two 
different representations of eacute, and I don't know why anyone would use the 
two code point version.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:

One possible solution would be to modify std.uni.graphemeStride to not
allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.


That should be done. There should be a fixed maximum codepoint count to 
graphemeStride.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Tobias Pankrath via Digitalmars-d
Isn't this solved commonly with a normalization pass? We should 
have a normalizeUTF() that can be inserted in a pipeline.


Yes.

Then the rest of Phobos doesn't need to mind these combining 
characters. -- Andrei


I don't think so. The thing is, even after normalization we have 
to deal with combining characters because in all normalization 
forms there will be combining characters left after normalization.
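A minimal sketch of that point: for a base character with no precomposed form (x + combining acute; assuming, as I believe, that Unicode defines no precomposed x-acute), NFC leaves the combining mark in place:

```d
import std.uni : normalize, NFC;

void main()
{
    // No precomposed "x with acute" exists, so NFC cannot combine the pair;
    // the string is still two code points after normalization.
    string s = "x\u0301";
    assert(normalize!NFC(s) == s);
    assert(s.length == 3); // 1 code unit for 'x' + 2 for U+0301 in UTF-8
}
```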




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Andrei Alexandrescu via Digitalmars-d

On 4/18/15 4:35 AM, Jacob Carlborg wrote:

On 2015-04-18 12:27, Walter Bright wrote:


That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.


This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
 dstring a = "e\u0301";
 dstring b = "é";
 assert(a != b);
 assert(a.length == 2);
 assert(b.length == 1);
writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system
correctly supports Unicode (works on OS X 10.10) the two printed
characters should look exactly the same.

\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


Isn't this solved commonly with a normalization pass? We should have a 
normalizeUTF() that can be inserted in a pipeline. Then the rest of 
Phobos doesn't need to mind these combining characters. -- Andrei




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Chris via Digitalmars-d

On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via 
Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
wrote:

>On 2015-04-18 12:27, Walter Bright wrote:
>
>>That doesn't make sense to me, because the umlauts and the 
>>accented

>>e all have Unicode code point assignments.
>
>This code snippet demonstrates the problem:
>
>import std.stdio;
>
>void main ()
>{
>dstring a = "e\u0301";
>dstring b = "é";
>assert(a != b);
>assert(a.length == 2);
>assert(b.length == 1);
>writeln(a, " ", b);
>}
>
>If you run the above code all asserts should pass. If your 
>system
>correctly supports Unicode (works on OS X 10.10) the two 
>printed

>characters should look exactly the same.
>
>\u0301 is the "combining acute accent" [1].
>
>[1] 
>http://www.fileformat.info/info/unicode/char/0301/index.htm


Yep, this was the cause of some bugs I had in my program. The 
thing is
you never know, if a text is composed or decomposed, so you 
have to be
prepared that "é" has length 2 or 1. On OS X these characters 
are
automatically decomposed by default. So if you pipe it through 
the
system an "é" (length=1) automatically becomes "e\u0301" 
(length=2).
Same goes for file names on OS X. I've had to find a 
workaround for

this more than once.


Wait, I thought the recommended approach is to normalize first, 
then do

string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code 
to use a
uniform approach to handling the string. I don't think it's a 
good idea
to manually deal with composed/decomposed issues within every 
individual

string function.

Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not 
every

language has precomposed characters handy.

Using byGrapheme, within the current state of Phobos, is still 
the best
bet as to correctly counting the number of printed columns as 
opposed to
the number of "characters" (which, in the Unicode definition, 
does not

always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.

Well, to be fair, byGrapheme only *occasionally* allocates -- 
only for
input with unusually long sequences of combining diacritics -- 
for
normal use cases you'll pretty much never have any allocations. 
But the
language can't express the idea of "occasionally allocates", 
there is
only "allocates" or "@nogc". Which makes it unusable in @nogc 
code.


One possible solution would be to modify std.uni.graphemeStride 
to not
allocate, since it shouldn't need to do so just to compute the 
length of

the next grapheme.


T


This is why on OS X I always normalized strings to composed. 
However, there are always issues with Unicode, because, as you 
said, the layman's notion of what a character is is not the same 
as Unicode's. I wrote a utility function that uses byGrapheme and 
byCodePoint. It's a bit of an overhead, but I always get the 
correct length and character access (e.g. if txt.startsWith("é")).


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Tobias Pankrath via Digitalmars-d
Wait, I thought the recommended approach is to normalize first, 
then do

string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code 
to use a
uniform approach to handling the string. I don't think it's a 
good idea
to manually deal with composed/decomposed issues within every 
individual

string function.



1. Problem: Normalization is not closed under almost all 
operations. E.g. concatenating two normalized strings does not 
guarantee the result is in normalized form.


2. Problem: Some unicode algorithms e.g. string comparison 
require a normalization step. It doesn't matter which form you 
use, but you have to pick one.


Now we could say that all strings passed to phobos have to be 
normalized as (say) NFC and that phobos function thus skip the 
normalization.
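A minimal sketch of the comparison problem: raw equality fails across normalization forms, but comparing after a common normalization (NFC here, though any single form works) succeeds:

```d
import std.uni : normalize, NFC;

void main()
{
    string decomposed  = "e\u0301"; // 'e' + combining acute
    string precomposed = "\u00E9";  // é as a single code point
    assert(decomposed != precomposed);                // raw comparison fails
    assert(normalize!NFC(decomposed) == precomposed); // equal after NFC
}
```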


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread H. S. Teoh via Digitalmars-d
On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via Digitalmars-d wrote:
> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
> >On 2015-04-18 12:27, Walter Bright wrote:
> >
> >>That doesn't make sense to me, because the umlauts and the accented
> >>e all have Unicode code point assignments.
> >
> >This code snippet demonstrates the problem:
> >
> >import std.stdio;
> >
> >void main ()
> >{
> >dstring a = "e\u0301";
> >dstring b = "é";
> >assert(a != b);
> >assert(a.length == 2);
> >assert(b.length == 1);
> >writeln(a, " ", b);
> >}
> >
> >If you run the above code all asserts should pass. If your system
> >correctly supports Unicode (works on OS X 10.10) the two printed
> >characters should look exactly the same.
> >
> >\u0301 is the "combining acute accent" [1].
> >
> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> 
> Yep, this was the cause of some bugs I had in my program. The thing is
> you never know, if a text is composed or decomposed, so you have to be
> prepared that "é" has length 2 or 1. On OS X these characters are
> automatically decomposed by default. So if you pipe it through the
> system an "é" (length=1) automatically becomes "e\u0301" (length=2).
> Same goes for file names on OS X. I've had to find a workaround for
> this more than once.

Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code to use a
uniform approach to handling the string. I don't think it's a good idea
to manually deal with composed/decomposed issues within every individual
string function.

Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not every
language has precomposed characters handy.

Using byGrapheme, within the current state of Phobos, is still the best
bet as to correctly counting the number of printed columns as opposed to
the number of "characters" (which, in the Unicode definition, does not
always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.

Well, to be fair, byGrapheme only *occasionally* allocates -- only for
input with unusually long sequences of combining diacritics -- for
normal use cases you'll pretty much never have any allocations. But the
language can't express the idea of "occasionally allocates", there is
only "allocates" or "@nogc". Which makes it unusable in @nogc code.

One possible solution would be to modify std.uni.graphemeStride to not
allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.


T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Jakob Ovrum via Digitalmars-d

On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:

On 2015-04-18 14:25, Gary Willoughby wrote:


byGrapheme to the rescue:

http://dlang.org/phobos/std_uni.html#byGrapheme

Or is this unsuitable here?


How is byGrapheme supposed to be used? I tried this but it 
doesn't do what I expected:


foreach (e ; "e\u0301".byGrapheme)
writeln(e);


void main()
{
import std.stdio;
import std.uni;

foreach (e ; "e\u0301".byGrapheme)
writeln(e[]);
}


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Jacob Carlborg via Digitalmars-d

On 2015-04-18 14:25, Gary Willoughby wrote:


byGrapheme to the rescue:

http://dlang.org/phobos/std_uni.html#byGrapheme

Or is this unsuitable here?


How is byGrapheme supposed to be used? I tried this but it doesn't do 
what I expected:


foreach (e ; "e\u0301".byGrapheme)
writeln(e);

--
/Jacob Carlborg


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Gary Willoughby via Digitalmars-d

On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg 
wrote:

On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the 
accented e

all have Unicode code point assignments.


This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
   dstring a = "e\u0301";
   dstring b = "é";
   assert(a != b);
   assert(a.length == 2);
   assert(b.length == 1);
   writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your 
system correctly supports Unicode (works on OS X 10.10) the 
two printed characters should look exactly the same.


\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


Yep, this was the cause of some bugs I had in my program. The 
thing is you never know, if a text is composed or decomposed, 
so you have to be prepared that "é" has length 2 or 1. On OS X 
these characters are automatically decomposed by default. So if 
you pipe it through the system an "é" (length=1) automatically 
becomes "e\u0301" (length=2). Same goes for file names on OS X. 
I've had to find a workaround for this more than once.


byGrapheme to the rescue:

http://dlang.org/phobos/std_uni.html#byGrapheme

Or is this unsuitable here?


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Tobias Pankrath via Digitalmars-d
Also another issue is that lower case letters and upper case 
might have different size requirements or look different 
depending on where in the word they are located.


For example, German ß and SS, Greek σ and ς. I know Turkish 
also has similar cases.


--
Paulo


While true, it does not affect wrap (the algorithm) as far as I 
can see.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Paulo Pinto via Digitalmars-d

On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:

On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:

On 4/18/2015 12:58 AM, John Colvin wrote:

On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of 
Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or 
you have to
live with byGrapheme with all the complexity that it 
entails. The former
is quite easy to write -- I could throw it together in a 
couple o' hours
max, but the latter is a pretty big project (cf. Unicode 
line-breaking

algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, 
which is to treat

decoded unicode characters as one column.


Code points aren't equivalent to characters. They're not the 
same thing in most

European languages,


I know a bit of German, for what characters is that not true?


Umlauts, if combined characters are used. Also words that still 
have their accents left after import from foreign languages. 
E.g. Café


Getting all unicode correct seems a daunting task with a severe 
performance impact, esp. if we need to assume that a string 
might have any normalization form or none at all.


See also: http://unicode.org/reports/tr15/#Norm_Forms


Also another issue is that lower case letters and upper case 
might have different size requirements or look different 
depending on where in the word they are located.


For example, German ß and SS, Greek σ and ς. I know Turkish also 
has similar cases.


--
Paulo


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Chris via Digitalmars-d

On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:

On 2015-04-18 12:27, Walter Bright wrote:

That doesn't make sense to me, because the umlauts and the 
accented e

all have Unicode code point assignments.


This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
dstring a = "e\u0301";
dstring b = "é";
assert(a != b);
assert(a.length == 2);
assert(b.length == 1);
writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your 
system correctly supports Unicode (works on OS X 10.10) the two 
printed characters should look exactly the same.


\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm


Yep, this was the cause of some bugs I had in my program. The 
thing is you never know, if a text is composed or decomposed, so 
you have to be prepared that "é" has length 2 or 1. On OS X these 
characters are automatically decomposed by default. So if you 
pipe it through the system an "é" (length=1) automatically 
becomes "e\u0301" (length=2). Same goes for file names on OS X. 
I've had to find a workaround for this more than once.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Jacob Carlborg via Digitalmars-d

On 2015-04-18 12:27, Walter Bright wrote:


That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.


This code snippet demonstrates the problem:

import std.stdio;

void main ()
{
dstring a = "e\u0301";
dstring b = "é";
assert(a != b);
assert(a.length == 2);
assert(b.length == 1);
writeln(a, " ", b);
}

If you run the above code all asserts should pass. If your system 
correctly supports Unicode (works on OS X 10.10) the two printed 
characters should look exactly the same.


\u0301 is the "combining acute accent" [1].

[1] http://www.fileformat.info/info/unicode/char/0301/index.htm

--
/Jacob Carlborg


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Panke via Digitalmars-d
That doesn't make sense to me, because the umlauts and the 
accented e all have Unicode code point assignments.


Yes, but you may have perfectly fine unicode text where the 
combined form is used. Actually there is a normalization form for 
unicode that requires the combined form. To be fully correct 
phobos needs to handle that as well.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 1:26 AM, Panke wrote:

On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:

On 4/18/2015 12:58 AM, John Colvin wrote:

On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:

So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to write -- I could throw it together in a couple o' hours
max, but the latter is a pretty big project (cf. Unicode line-breaking
algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, which is to treat
decoded unicode characters as one column.


Code points aren't equivalent to characters. They're not the same thing in most
European languages,


I know a bit of German, for what characters is that not true?


Umlauts, if combined characters are used. Also words that still have their
accents left after import from foreign languages. E.g. Café


That doesn't make sense to me, because the umlauts and the accented e all have 
Unicode code point assignments.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Panke via Digitalmars-d

On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:

On 4/18/2015 12:58 AM, John Colvin wrote:

On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of 
Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or 
you have to
live with byGrapheme with all the complexity that it 
entails. The former
is quite easy to write -- I could throw it together in a 
couple o' hours
max, but the latter is a pretty big project (cf. Unicode 
line-breaking

algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, which 
is to treat

decoded unicode characters as one column.


Code points aren't equivalent to characters. They're not the 
same thing in most

European languages,


I know a bit of German, for what characters is that not true?


Umlauts, if combined characters are used. Also words that still 
have their accents left after import from foreign languages. E.g. 
Café


Getting all unicode correct seems a daunting task with a severe 
performance impact, esp. if we need to assume that a string might 
have any normalization form or none at all.


See also: http://unicode.org/reports/tr15/#Norm_Forms


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Walter Bright via Digitalmars-d

On 4/18/2015 12:58 AM, John Colvin wrote:

On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:

So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to write -- I could throw it together in a couple o' hours
max, but the latter is a pretty big project (cf. Unicode line-breaking
algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, which is to treat
decoded unicode characters as one column.


Code points aren't equivalent to characters. They're not the same thing in most
European languages,


I know a bit of German, for what characters is that not true?


never mind the rest of the world. If we have a line-wrapping
algorithm in phobos that works by code points, it needs a large "THIS IS ONLY
FOR SIMPLE ENGLISH TEXT" warning.

Code points are a useful chunk size for some tasks and completely insufficient
for others.


The first order of business is making wrap() work with ranges, and otherwise 
work the same as it always has (it's one of the oldest Phobos functions).


There are different standard levels of Unicode support. The lowest level is 
working correctly with code points, which is what wrap() does. Going to a higher 
level of support comes after range support.


I know little about combining characters. You obviously know much more, do you 
want to take charge of this function?


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread Jacob Carlborg via Digitalmars-d

On 2015-04-18 09:58, John Colvin wrote:


Code points aren't equivalent to characters. They're not the same thing
in most European languages, never mind the rest of the world. If we have
a line-wrapping algorithm in phobos that works by code points, it needs
a large "THIS IS ONLY FOR SIMPLE ENGLISH TEXT" warning.


For that we have std.ascii.

--
/Jacob Carlborg


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-18 Thread John Colvin via Digitalmars-d

On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of 
Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or 
you have to
live with byGrapheme with all the complexity that it entails. 
The former
is quite easy to write -- I could throw it together in a 
couple o' hours
max, but the latter is a pretty big project (cf. Unicode 
line-breaking

algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, which 
is to treat decoded unicode characters as one column.


Code points aren't equivalent to characters. They're not the same 
thing in most European languages, never mind the rest of the 
world. If we have a line-wrapping algorithm in phobos that works 
by code points, it needs a large "THIS IS ONLY FOR SIMPLE ENGLISH 
TEXT" warning.


Code points are a useful chunk size for some tasks and completely 
insufficient for others.
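The distinction can be shown with three counts over the same string: code units (.length), code points (walkLength over the auto-decoded range), and graphemes (byGrapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "e\u0301";                 // one visible character
    assert(s.length == 3);                // UTF-8 code units
    assert(s.walkLength == 2);            // code points (auto-decoding)
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```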


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Shachar Shemesh via Digitalmars-d

On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote:

There's also the question of what to do with bidi markings: how do you
handle counting the columns in that case?



Which BiDi marking are you referring to? LRM/RLM and friends? If so, 
don't worry: the interface, as described, is incapable of properly 
handling BiDi anyways.


The proper way to handle BiDi line wrapping is this. First you assign a 
BiDi level to each character (at which point the markings are, 
effectively, removed from the input, so there goes your problem). Then 
you calculate the glyph's width until the line limit is reached, and 
then you reorder each line according to the BiDi levels you calculated 
earlier.


As can be easily seen, this requires transitioning BiDi information that 
is per-paragraph across the line break logic, pretty much mandating 
multiple passes on the input. Since the requested interface does not 
allow that, proper BiDi line breaking is impossible with that interface.


I'll mention that not everyone take that as a serious problem. Window's 
text control, for example, calculates line breaks on the text, and then 
runs the BiDi algorithm on each line individually. Few people notice 
this. Then again, people have already grown used to BiDi text being 
scrambled.


Shachar


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Panke via Digitalmars-d

On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via 
Digitalmars-d wrote:



Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not 
allocate:


there is some... inconsistency: `std.string.wrap` adds final 
"\n" to

string. ;-) but i always hated it for that.


A range of lines instead of inserted \n would be a good API as 
well.


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread ketmar via Digitalmars-d
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:

> Well, talk is cheap, so here's a working implementation of the
> non-Unicode-correct line wrapper that uses ranges and does not allocate:

there is some... inconsistency: `std.string.wrap` adds final "\n" to 
string. ;-) but i always hated it for that.



Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Walter Bright via Digitalmars-d

On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:

On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:

On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:

Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:


awesome! Please make a pull request for this so you get proper credit!


Doesn't that mean I have to add the autodecoding workarounds first?


Before it gets pulled, yes, meaning that the element type of front() should 
match the element encoding type of Range.


There's also an issue with firstindent and indent being the same range type as 
'range', which is not practical as Range is likely a voldemort type. I suggest 
making them simply of type 'string'. I don't see any point to make them ranges.


A unit test with an input range is needed, and one with some multibyte unicode 
encodings.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
> On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
> >Well, talk is cheap, so here's a working implementation of the
> >non-Unicode-correct line wrapper that uses ranges and does not
> >allocate:
> 
> awesome! Please make a pull request for this so you get proper credit!

Doesn't that mean I have to add the autodecoding workarounds first?


T

-- 
Life is too short to run proprietary software. -- Bdale Garbee


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Walter Bright via Digitalmars-d

On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:

So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to write -- I could throw it together in a couple o' hours
max, but the latter is a pretty big project (cf. Unicode line-breaking
algorithm, which is one of the TR's).


It'd be good enough to duplicate the existing behavior, which is to treat 
decoded unicode characters as one column.




Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Walter Bright via Digitalmars-d

On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:

Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not allocate:


awesome! Please make a pull request for this so you get proper credit!



Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> So either you have to throw out all pretenses of Unicode-correctness
> and just stick with ASCII-style per-character line-wrapping, or you
> have to live with byGrapheme with all the complexity that it entails.
> The former is quite easy to write -- I could throw it together in a
> couple o' hours max, but the latter is a pretty big project (cf.
> Unicode line-breaking algorithm, which is one of the TR's).
[...]

Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not allocate:

import std.range.primitives;

/**
 * Range version of $(D std.string.wrap).
 *
 * Bugs:
 * This function does not conform to the Unicode line-breaking algorithm. It
 * does not take into account zero-width characters, combining diacritics,
 * double-width characters, non-breaking spaces, and bidi markings. Strings
 * containing these characters therefore may not be wrapped correctly.
 */
auto wrapped(R)(R range, in size_t columns = 80, R firstindent = null,
                R indent = null, in size_t tabsize = 8)
    if (isForwardRange!R && is(ElementType!R : dchar))
{
    import std.algorithm.iteration : map, joiner;
    import std.range : chain;
    import std.uni;

    alias CharType = ElementType!R;

    // Returns: Wrapped lines.
    struct Result
    {
        private R range, indent;
        private size_t maxCols, tabSize;

        private size_t currentCol = 0;
        private R curIndent;
        bool empty = true;
        bool atBreak = false;

        this(R _range, R _firstindent, R _indent, size_t columns,
             size_t tabsize)
        {
            this.range = _range;
            this.curIndent = _firstindent.save;
            this.indent = _indent;
            this.maxCols = columns;
            this.tabSize = tabsize;

            empty = _range.empty;
        }

        @property CharType front()
        {
            if (atBreak)
                return '\n';    // should implicitly convert to wider characters
            else if (!curIndent.empty)
                return curIndent.front;
            else
                return range.front;
        }

        void popFront()
        {
            if (atBreak)
            {
                // We're at a linebreak.
                atBreak = false;
                currentCol = 0;

                // Start new line with indent
                curIndent = indent.save;
                return;
            }
            else if (!curIndent.empty)
            {
                // We're iterating over an initial indent.
                curIndent.popFront();
                currentCol++;
                return;
            }

            // We're iterating over the main range.
            range.popFront();
            if (range.empty)
            {
                empty = true;
                return;
            }

            if (range.front == '\t')
                currentCol += tabSize;
            else if (isWhite(range.front))
            {
                // Scan for next word boundary to decide whether or not to
                // break here.
                R tmp = range.save;
                assert(!tmp.empty);

                size_t col = currentCol;

                // Find start of next word
                while (!tmp.empty && isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }

                // Remember start of next word so that if we need to break, we
                // won't introduce extraneous spaces to the start of the new
                // line.
                R nextWord = tmp.save;

                while (!tmp.empty && !isWhite(tmp.front))
                {
                    col++;
                    tmp.popFront();
                }
                assert(tmp.empty || isWhite(tmp.front));

                if (col > maxCols)
                {
                    // Word wrap needed. Move current range position to

Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
> -- 
> All problems are easy in retrospect.

Argh, my Perl script doth mock me!


T

-- 
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder


Re: Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread H. S. Teoh via Digitalmars-d
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
> Challenge level - Moderately easy
>
> Consider the function std.string.wrap:
> 
>   http://dlang.org/phobos/std_string.html#.wrap
> 
> It takes a string as input, and returns a GC allocated string that is
> word-wrapped. It needs to be enhanced to:
> 
> 1. Accept a ForwardRange as input.
> 2. Return a lazy ForwardRange that delivers the characters of the
> wrapped result one by one on demand.
> 3. Not allocate any memory.
> 4. The element encoding type of the returned range must be the same as
> the element encoding type of the input.

This is harder than it looks at first sight, actually. Mostly thanks to
the complexity of Unicode... you need to identify zero-width,
normal-width, and double-width characters, combining diacritics, various
kinds of spaces (e.g. cannot break on non-breaking space) and treat them
accordingly.  Which requires decoding.  (Well, in theory std.uni could
be enhanced to work directly with encoded data, but right now it
doesn't. In any case this is outside the scope of this challenge, I
think.)

Unfortunately, the only reliable way I know of currently that can deal
with the spacing of Unicode characters correctly is to segment the input
with byGrapheme, which currently is GC-dependent. So this fails (3).

There's also the question of what to do with bidi markings: how do you
handle counting the columns in that case?

Of course, if you forego Unicode correctness, then you *could* just
word-wrap on a per-character basis (i.e., every character counts as 1
column), but this also makes the resulting code useless as far as
dealing with general Unicode data is concerned -- it'd only work for
ASCII, and various character ranges inherited from the old 8-bit
European encodings. Not to mention, line-breaking in Chinese encodings
cannot work as prescribed anyway, because the rules are different (you
can break anywhere at a character boundary except punctuation -- there
is no such concept as a space character in Chinese writing). Same
applies for Korean/Japanese.

So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to write -- I could throw it together in a couple o' hours
max, but the latter is a pretty big project (cf. Unicode line-breaking
algorithm, which is one of the TR's).
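For the "forego Unicode correctness" option, here is a minimal sketch of the per-character approach (hypothetical code: it counts every char as one column and breaks only on ASCII spaces; it also allocates, so it still fails requirement 3 — the lazy non-allocating range version is where the real work is):

```d
import std.algorithm : splitter;
import std.stdio : writeln;

// Hypothetical helper: greedy word-wrap assuming 1 column per char.
// Not Unicode-correct; for illustration only.
string[] naiveWrap(string text, size_t width)
{
    string[] lines;
    string cur;
    foreach (word; text.splitter(' '))
    {
        if (cur.length == 0)
            cur = word;
        else if (cur.length + 1 + word.length <= width)
            cur ~= " " ~ word;
        else { lines ~= cur; cur = word; }
    }
    if (cur.length) lines ~= cur;
    return lines;
}

void main()
{
    writeln(naiveWrap("the quick brown fox jumps", 10));
}
```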


T

-- 
All problems are easy in retrospect.


Today's programming challenge - How's your Range-Fu ?

2015-04-17 Thread Walter Bright via Digitalmars-d

Challenge level - Moderately easy

Consider the function std.string.wrap:

  http://dlang.org/phobos/std_string.html#.wrap

It takes a string as input, and returns a GC allocated string that is 
word-wrapped. It needs to be enhanced to:


1. Accept a ForwardRange as input.
2. Return a lazy ForwardRange that delivers the characters of the wrapped result 
one by one on demand.

3. Not allocate any memory.
4. The element encoding type of the returned range must be the same as the 
element encoding type of the input.




Re: Special Type Challenge

2015-02-09 Thread via Digitalmars-d

On Sunday, 8 February 2015 at 19:59:40 UTC, Meta wrote:

On Sunday, 8 February 2015 at 13:06:08 UTC, Marc Schütz wrote:
No, `alias this` converts from the type it is declared in to 
another type. `opImplicitCast` would be declared in the 
destination type.


So like this?

struct Type1
{
    string str;
}

struct Type2
{
    string str;

    Type2 opImplicitCast(Type1 t)
    {
        return Type2(t.str);
    }
}

void main()
{
    Type2 t2 = Type1("Hello, World!");
}

Instead of the opposite?


Yes. It's almost like a constructor, but would be called wherever 
an explicit call to the constructor is required today:


struct S {
    this(int x) {}
}

S test1() {
    return 10;    // not ok
    return S(10); // works
}

struct T {
    static typeof(this) opImplicitCast(int x) {
        return typeof(this)(x);
    }
}

T test2() {
    return 10;    // ok
    return T(10); // not sure whether this should
                  // work or not
}

Because it's so similar to a constructor (it always needs to be 
static and return `typeof(this)`), maybe a special syntax can be 
used:


struct S {
    @implicit this(int x) {}
}


Re: Special Type Challenge

2015-02-08 Thread Meta via Digitalmars-d

On Sunday, 8 February 2015 at 13:06:08 UTC, Marc Schütz wrote:
No, `alias this` converts from the type it is declared in to 
another type. `opImplicitCast` would be declared in the 
destination type.


So like this?

struct Type1
{
    string str;
}

struct Type2
{
    string str;

    Type2 opImplicitCast(Type1 t)
    {
        return Type2(t.str);
    }
}

void main()
{
    Type2 t2 = Type1("Hello, World!");
}

Instead of the opposite?


Re: Special Type Challenge

2015-02-08 Thread via Digitalmars-d

On Saturday, 7 February 2015 at 20:53:48 UTC, Meta wrote:

On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote:
On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei 
Alexandrescu wrote:

On 2/6/15 8:28 PM, Jonathan Marler wrote:
Do you know if D might support that later or if there's a 
reason for not

supporting it?


It's deliberate following the C++ experience. -- Andrei


Hasn't there been a debate about a hypothetical 
`opImplicitCast()`? The default would still be off, but you 
can opt-in by defining said method.


`alias this` is pretty much `opImplicitCast`, but it currently 
has a few holes, such as 
http://forum.dlang.org/post/xcnwuneclebuyqcjb...@forum.dlang.org. 
It's not always appropriate in every situation though, as you 
get full-on subtyping along with implicit conversion.


No, `alias this` converts from the type it is declared in to 
another type. `opImplicitCast` would be declared in the 
destination type.
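To illustrate the direction difference, a minimal sketch (hypothetical names, not from the thread): `alias this` lives in the *source* type, whereas the hypothetical `opImplicitCast` would live in the destination type.

```d
struct Target { string str; }

struct Source
{
    string str;

    // `alias this` gives implicit conversion *from* Source *to* Target --
    // the opposite direction of the proposed opImplicitCast.
    Target toTarget() { return Target(str); }
    alias toTarget this;
}

void main()
{
    Target t = Source("hi"); // implicit, via alias this
    assert(t.str == "hi");
}
```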


Re: Special Type Challenge

2015-02-07 Thread Tofu Ninja via Digitalmars-d

On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote:
Hasn't there been a debate about a hypothetical 
`opImplicitCast()`? The default would still be off, but you can 
opt-in by defining said method.


How else would it be done? I am still confused as to what the 
reasons behind not having implicit conversion are.


Re: Special Type Challenge

2015-02-07 Thread Meta via Digitalmars-d

On Saturday, 7 February 2015 at 19:38:10 UTC, Marc Schütz wrote:
On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei 
Alexandrescu wrote:

On 2/6/15 8:28 PM, Jonathan Marler wrote:
Do you know if D might support that later or if there's a 
reason for not

supporting it?


It's deliberate following the C++ experience. -- Andrei


Hasn't there been a debate about a hypothetical 
`opImplicitCast()`? The default would still be off, but you can 
opt-in by defining said method.


`alias this` is pretty much `opImplicitCast`, but it currently 
has a few holes, such as 
http://forum.dlang.org/post/xcnwuneclebuyqcjb...@forum.dlang.org. 
It's not always appropriate in every situation though, as you get 
full-on subtyping along with implicit conversion.


Re: Special Type Challenge

2015-02-07 Thread via Digitalmars-d
On Saturday, 7 February 2015 at 05:27:39 UTC, Andrei Alexandrescu 
wrote:

On 2/6/15 8:28 PM, Jonathan Marler wrote:
Do you know if D might support that later or if there's a 
reason for not

supporting it?


It's deliberate following the C++ experience. -- Andrei


Hasn't there been a debate about a hypothetical 
`opImplicitCast()`? The default would still be off, but you can 
opt-in by defining said method.


Re: Special Type Challenge

2015-02-06 Thread Andrei Alexandrescu via Digitalmars-d

On 2/6/15 8:28 PM, Jonathan Marler wrote:

Do you know if D might support that later or if there's a reason for not
supporting it?


It's deliberate following the C++ experience. -- Andrei


Re: Special Type Challenge

2015-02-06 Thread Jonathan Marler via Digitalmars-d

On Saturday, 7 February 2015 at 03:48:24 UTC, Adam D. Ruppe wrote:

wrote:

 b = -256;


that won't fit in a byte btw.


Woops, I typed that example too fast :)



The rest of the assignment stuff is easy. I'd prolly even do it 
with a template:


this(T)(T t) { this.opAssign(t); } // for construction
Byte opAssign(T)(T t) if(T.sizeof == 1) { // for other assignment
    data_holder = cast(typeof(data_holder)) t;
    return this;
}

and that should do it.


This code is almost exactly what I came up with.  But what about 
the function problem?


void echo(Byte b);

echo('c'); // won't work

I can't figure out a way to make this work (without making echo a 
template).  If D supported implicit conversions from built-in 
types to user types then it could work.  Do you know if D might 
support that later or if there's a reason for not supporting it?




Re: Special Type Challenge

2015-02-06 Thread Adam D. Ruppe via Digitalmars-d
On Saturday, 7 February 2015 at 01:55:27 UTC, Jonathan Marler 
wrote:

  b = -256;


that won't fit in a byte btw.


The rest of the assignment stuff is easy. I'd prolly even do it 
with a template:


this(T)(T t) { this.opAssign(t); } // for construction
Byte opAssign(T)(T t) if(T.sizeof == 1) { // for other assignment
    data_holder = cast(typeof(data_holder)) t;
    return this;
}

and that should do it.


  Byte[] barr;

  barr = "teststring";
  barr = [0,1,2,3];
}


Language won't let you do these though.


Re: Special Type Challenge

2015-02-06 Thread Jonathan Marler via Digitalmars-d

On Saturday, 7 February 2015 at 02:12:08 UTC, Jakob Ovrum wrote:

 Byte echo(Byte b)
 {
   return b;
 }
 b = echo('a');
 b = echo(cast(const char)'a');
 b = echo(cast(immutable char)'a');
 b = echo(cast(byte)1);
 b = echo(cast(const byte)1);
 b = echo(cast(immutable byte)1);
 b = echo(cast(ubyte)1);
 b = echo(cast(const ubyte)1);
 b = echo(cast(immutable ubyte)1);

 Byte[] barr;

 barr = "teststring";
 barr = [0,1,2,3];
}


These are not possible as D does not support implicit 
construction.


It's a bit odd that D supports implicit conversions from 
user-defined types to built-in types but not the reverse.  What's 
the reason for this?  Is it done on purpose or is this just a 
hole that might be filled in later?


Re: Special Type Challenge

2015-02-06 Thread Jakob Ovrum via Digitalmars-d
On Saturday, 7 February 2015 at 01:55:27 UTC, Jonathan Marler 
wrote:

  Byte b;

  b = 0;
  b = 1;
  b = 255;
  b = -256;

  b = 'a';
  b = cast(const char)'a';
  b = cast(immutable char)'a';
  b = cast(byte)1;
  b = cast(const byte)1;
  b = cast(immutable byte)1;
  b = cast(ubyte)1;
  b = cast(const ubyte)1;
  b = cast(immutable ubyte)1;


These are possible with opAssign (which should be accompanied 
with corresponding constructor(s)).



  Byte echo(Byte b)
  {
return b;
  }
  b = echo('a');
  b = echo(cast(const char)'a');
  b = echo(cast(immutable char)'a');
  b = echo(cast(byte)1);
  b = echo(cast(const byte)1);
  b = echo(cast(immutable byte)1);
  b = echo(cast(ubyte)1);
  b = echo(cast(const ubyte)1);
  b = echo(cast(immutable ubyte)1);

  Byte[] barr;

  barr = "teststring";
  barr = [0,1,2,3];
}


These are not possible as D does not support implicit 
construction.


Special Type Challenge

2015-02-06 Thread Jonathan Marler via Digitalmars-d
I'm wondering if the following is possible in D.  I tried and 
failed but maybe someone else will be able to pull it off.


// Challenge: Create a type named "Byte" that,
//   1. Uses 1 byte of memory
//   2. Can be used as an argument to a non-template function
//   3. Handles implicit conversion from any 1-byte type
//   4. Compiles and passes the following unittest
unittest
{
  Byte b;

  b = 0;
  b = 1;
  b = 255;
  b = -256;

  b = 'a';
  b = cast(const char)'a';
  b = cast(immutable char)'a';
  b = cast(byte)1;
  b = cast(const byte)1;
  b = cast(immutable byte)1;
  b = cast(ubyte)1;
  b = cast(const ubyte)1;
  b = cast(immutable ubyte)1;

  Byte echo(Byte b)
  {
return b;
  }
  b = echo('a');
  b = echo(cast(const char)'a');
  b = echo(cast(immutable char)'a');
  b = echo(cast(byte)1);
  b = echo(cast(const byte)1);
  b = echo(cast(immutable byte)1);
  b = echo(cast(ubyte)1);
  b = echo(cast(const ubyte)1);
  b = echo(cast(immutable ubyte)1);

  Byte[] barr;

  barr = "teststring";
  barr = [0,1,2,3];
}
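For reference, a hedged partial attempt (all names hypothetical): a templated constructor plus opAssign covers the direct assignments in the unittest, but the echo(...) calls still fail because D performs no implicit construction at function-call boundaries.

```d
struct Byte
{
    ubyte data;

    // Accept any 1-byte type, plus integer arguments (truncated by the cast).
    this(T)(T t) { opAssign(t); }

    ref Byte opAssign(T)(T t)
        if (T.sizeof == 1 || is(T : long))
    {
        data = cast(ubyte) t;
        return this;
    }
}

static assert(Byte.sizeof == 1);

unittest
{
    Byte b;
    b = 'a';          // works: direct assignment
    b = cast(byte) 1; // works
    // b = echo('a'); // still impossible: no implicit construction
}
```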


Pull Request Challenge for D

2015-01-29 Thread notna via Digitalmars-d

Hi all

1) found this and somehow like that idea. maybe it's also a way 
forward for us ;)
   
https://huntingbears.nl/2015/01/11/cpan-pull-request-challenge-for-january-datetimeformatepoch/


2) as D upload/download statistics are always in the forums, I 
also like this page and hope it can be inspiring ;)

   http://stats.cpantesters.org/uploads.html

regards
 anton oks


Function Pointer Challenge

2014-04-16 Thread Jonathan Marler via Digitalmars-d

In a library I was writing I was in need of the following:

Write code that takes a function pointer/delegate and an array of 
strings, and calls the function by parsing each string into the 
corresponding argument of the function.  If the function has a 
return value, the code should also return the function's return 
value after the call.


I had written this functionality before in .NET (C# specifically) 
using .NET's runtime reflection.  The code looks nice but runtime 
reflection has poor performance.  Using D's compile-time features 
I was able to write a D template function that implemented this 
in about 10 lines of code.


ReturnType!Function call(Function)(Function func, const char[][] args...)
if (isCallable!Function)
{
  alias Args = ParameterTypeTuple!Function;

  if(args.length != Args.length)
    throw new Exception(format("Expected %d arguments but got %d",
Args.length, args.length));

  Args argsTuple;

  foreach(i, Arg; Args) {
    argsTuple[i] = to!Arg(args[i]);
  }

  return func(argsTuple);
}

Here's a unit test to demonstrate its usage:

unittest
{
  void voidFunction()
  {
writeln("[Test] Called voidFunction()");
  }
  void randomFunction(int i, uint u, string s, char c)
  {
writefln("[Test] Called randomFunction(%s, %s, \"%s\", 
'%s')", i, u, s, c);

  }
  ulong echoUlong(ulong value)
  {
writefln("[Test] Called echoUlong(%s)", value);
return value;
  }

 (&voidFunction).call();
 (&randomFunction).call("-1000", "567", "HelloWorld!", "?");

  string passedValue = "123456789";
  ulong returnValue = (&echoUlong).call(passedValue);
  writefln("[Test] echoUlong(%s) = %s", passedValue, returnValue);

  try {
(&randomFunction).call("wrong number of args");
assert(0);
  } catch(Exception e) {
writefln("[Test] Caught %s: '%s'", typeid(e), e.msg);
  }

  writeln("[Test] Success");
}

I think this challenge does a great job in demonstrating D's 
compile-time power. I couldn't think of a way to do this in C 
without doing some type of code generation.  The reason I needed 
this functionality was because I was writing a remote procedure 
call type of library, where the function's being called were 
known at compile time, but the arguments (passed over a socket) 
had to be processed at runtime.  I was wondering if anyone had 
good solutions to this problem in other languages.  I was very 
pleased with the D solution but I predict that solutions in other 
languages are going to be much uglier.


Re: Challenge: write a really really small front() for UTF8

2014-03-26 Thread Temtaime

Oops, in the last return I forgot "+ s[3]"


Re: Challenge: write a really really small front() for UTF8

2014-03-26 Thread Temtaime

http://goo.gl/4RSWhr

Only 3 ifs.


Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread Piotr Szturmaj

On 2014-03-25 11:42, dennis luehring wrote:

On 25.03.2014 11:38, Nick Sabalausky wrote:

On 3/25/2014 4:00 AM, Iain Buclaw wrote:

On 25 March 2014 00:04, Daniel N  wrote:

On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:


I'm currently too busy to submit a complete solution, but please
feel free
to use my idea if you think it sounds promising.



I now managed to dig up my old C source... but I'm still blocked by
dmd not
accepting the 'pext' instruction...

1) I know my solution is not directly comparable to the rest in this
thread(for many reasons).
2) It's of course trivial to add a fast path for ascii... if desired.
3) It throws safety and standards out the window.




4) It's tied to one piece of hardware.

No Thankee.

void doStuff() {
  if(supportCpuFeatureX)
  doStuff_FeatureX();
  else
  doStuff_Fallback();
}

  > dmd -inline blah.d


the extra branch could kill the performance benefit if doStuff is too small


void function() doStuff;

void main() {
auto supportCpuFeatureX = detectCpuFeatureX();

if (supportCpuFeatureX)
doStuff = &doStuff_FeatureX;
else
doStuff = &doStuff_Fallback;
}



Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread Dmitry Olshansky

On 24-Mar-2014 23:53, Dmitry Olshansky wrote:

On 24-Mar-2014 01:22, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


I had to join the party at some point.
This seems like 25 instructions:
http://goo.gl/N7sHtK



Interestingly gdc-4.8 produces better results.
http://goo.gl/1R7GMs

--
Dmitry Olshansky


Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread anonymous

dchar front(char[] s) {
  dchar c = s[0];
  if (!(c & 0x80))
return c;
  byte b = (c >> 4) & 3;
  b += !b;
  c &= 63 >> b;
  char *p = s.ptr;
  do {
p++;
c = c << 6 | *p & 63;
  } while(--b);
  return c;
}


Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread Daniel N

On Tuesday, 25 March 2014 at 10:42:59 UTC, dennis luehring wrote:

void doStuff() {
 if(supportCpuFeatureX)
 doStuff_FeatureX();
 else
 doStuff_Fallback();
}

 > dmd -inline blah.d


the extra branch could kill the performance benefit if doStuff 
is too small


you'd simply have to hoist the condition outside the inner loop.

Furthermore the branch prediction would never fail, only 
unpredictable branches are terrible.
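A sketch of the hoisting Daniel describes (function names borrowed from the quoted example; bodies are placeholders):

```d
bool supportCpuFeatureX;

void doStuff_FeatureX(ubyte b)  { /* fast path */ }
void doStuff_Fallback(ubyte b)  { /* portable path */ }

void processAll(ubyte[] data)
{
    // One perfectly predictable branch, outside the hot loop,
    // instead of one branch per element inside it.
    if (supportCpuFeatureX)
        foreach (b; data) doStuff_FeatureX(b);
    else
        foreach (b; data) doStuff_Fallback(b);
}
```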


Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread dennis luehring

On 25.03.2014 11:38, Nick Sabalausky wrote:

On 3/25/2014 4:00 AM, Iain Buclaw wrote:

On 25 March 2014 00:04, Daniel N  wrote:

On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:


I'm currently too busy to submit a complete solution, but please feel free
to use my idea if you think it sounds promising.



I now managed to dig up my old C source... but I'm still blocked by dmd not
accepting the 'pext' instruction...

1) I know my solution is not directly comparable to the rest in this
thread(for many reasons).
2) It's of course trivial to add a fast path for ascii... if desired.
3) It throws safety and standards out the window.




4) It's tied to one piece of hardware.

No Thankee.

void doStuff() {
  if(supportCpuFeatureX)
  doStuff_FeatureX();
  else
  doStuff_Fallback();
}

  > dmd -inline blah.d


the extra branch could kill the performance benefit if doStuff is too small


Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread Nick Sabalausky

On 3/25/2014 4:00 AM, Iain Buclaw wrote:

On 25 March 2014 00:04, Daniel N  wrote:

On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:


I'm currently too busy to submit a complete solution, but please feel free
to use my idea if you think it sounds promising.



I now managed to dig up my old C source... but I'm still blocked by dmd not
accepting the 'pext' instruction...

1) I know my solution is not directly comparable to the rest in this
thread(for many reasons).
2) It's of course trivial to add a fast path for ascii... if desired.
3) It throws safety and standards out the window.




4) It's tied to one piece of hardware.

No Thankee.



bool supportCpuFeatureX;

void main() {
supportCpuFeatureX = detectCpuFeatureX();

doStuff();
}

void doStuff() {
if(supportCpuFeatureX)
doStuff_FeatureX();
else
doStuff_Fallback();
}

> dmd -inline blah.d



Re: Challenge: write a really really small front() for UTF8

2014-03-25 Thread Iain Buclaw
On 25 March 2014 00:04, Daniel N  wrote:
> On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
>>
>> I'm currently too busy to submit a complete solution, but please feel free
>> to use my idea if you think it sounds promising.
>
>
> I now managed to dig up my old C source... but I'm still blocked by dmd not
> accepting the 'pext' instruction...
>
> 1) I know my solution is not directly comparable to the rest in this
> thread(for many reasons).
> 2) It's of course trivial to add a fast path for ascii... if desired.
> 3) It throws safety and standards out the window.
>


4) It's tied to one piece of hardware.

No Thankee.


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread dennis luehring

On 24.03.2014 17:44, Andrei Alexandrescu wrote:

On 3/24/14, 5:51 AM, w0rp wrote:

On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Before we roll this out, could we discuss a strategy/guideline in
regards to detecting and handling invalid UTF sequences?

Having a fast "front" is fine and all, but if it means your program
asserting in release (or worst, silently corrupting memory) just
because the client was trying to read a bad text file, I'm unsure this
is acceptable.


I would strongly advise to at least offer an option


Options are fine for functions etc. But front would need to find an
all-around good compromise between speed and correctness.

Andrei



b"\255".decode("utf-8", errors="strict") # UnicodeDecodeError
b"\255".decode("utf-8", errors="replace") # replacement character used
b"\255".decode("utf-8", errors="ignore") # Empty string, invalid
sequence removed.

I think there should be a base range for UTF-8 iteration, with 
policy-based error handling (like in Python), and some variants that 
derive from this base UTF-8 range with different error behavior - and 
one of these becomes the Phobos default (as a default parameter), so 
it's still switchable.
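A sketch of that policy idea (all names hypothetical), mirroring Python's errors= parameter with a compile-time policy; it leans on std.utf.decodeFront, and whether decodeFront advances the range before throwing is not relied upon here:

```d
enum OnError { throwException, replace }

// Hypothetical decoder: yields the next code point, applying the chosen
// policy when the input is not valid UTF-8.
dchar decodeNext(OnError policy = OnError.replace)(ref const(char)[] s)
{
    import std.utf : decodeFront, UTFException;

    static if (policy == OnError.throwException)
        return s.decodeFront(); // std.utf throws on invalid input
    else
    {
        try
            return s.decodeFront();
        catch (UTFException)
        {
            if (s.length) s = s[1 .. $]; // resync: skip the bad byte
            return '\uFFFD';             // U+FFFD replacement character
        }
    }
}
```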





Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Daniel N

On Monday, 24 March 2014 at 12:21:55 UTC, Daniel N wrote:
I'm currently too busy to submit a complete solution, but 
please feel free to use my idea if you think it sounds 
promising.


I now managed to dig up my old C source... but I'm still blocked 
by dmd not accepting the 'pext' instruction...


1) I know my solution is not directly comparable to the rest in 
this thread(for many reasons).
2) It's of course trivial to add a fast path for ascii... if 
desired.

3) It throws safety and standards out the window.

#include <cstdint>
#include <x86intrin.h>

char32_t front(const char* inp)
{
  // Payload-bit masks, indexed by the number of leading 1-bits of the
  // first byte (0 = ASCII, 2..4 = multi-byte sequence length).
  static const uint32_t mask[32] =
  {
  0b0111'1111'0000'0000'0000'0000'0000'0000, // ASCII: 7 payload bits
  0,                                         // lone continuation byte
  0b0001'1111'0011'1111'0000'0000'0000'0000, // 2-byte sequence
  0b0000'1111'0011'1111'0011'1111'0000'0000, // 3-byte sequence
  0b0000'0111'0011'1111'0011'1111'0011'1111, // 4-byte sequence
  };
  uint32_t rev = __builtin_bswap32(reinterpret_cast<const uint32_t*>(inp)[0]);

  uint32_t len = __lzcnt32(~rev);

  return _pext_u32(rev, mask[len]);
}

This is what clang 3.4 generated:
## BB#0:
pushq   %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq    %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
movl    (%rdi), %eax
bswapl  %eax
movl    %eax, %ecx
notl    %ecx
lzcntl  %ecx, %ecx
leaq    __ZZ5frontPKcE4mask(%rip), %rdx
pextl   (%rdx,%rcx,4), %eax, %eax
popq    %rbp
ret



Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread monarch_dodra
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu 
wrote:

On 3/24/14, 2:02 AM, monarch_dodra wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Before we roll this out, could we discuss a strategy/guideline 
in

regards to detecting and handling invalid UTF sequences?


I think std.array.front should return the invalid dchar on 
error, and popFront should attempt to resync on error.


Andrei


That sounds good to me (I think). Better than a straight up 
assert anyways...


...which is what the code you submitted does.

Also, I think relying on bounds checking is wrong: disabling 
bounds checking means you accept memory corruption for *wrong 
code*, and I don't think an invalid string is wrong code. 'front' 
should do an actual "if", and "assert(0)" when needed. This should 
have no impact on performance, BTW, if done with pointer 
extraction ("slice.ptr[1]") in a trusted context.


Of course, this is all assuming we are asserting at all.
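A sketch of the pattern monarch_dodra describes (hypothetical helper): validate explicitly, then index through .ptr in a @trusted function, so release-mode bounds-check removal is not what stands between bad input and memory corruption.

```d
dchar secondByte(const(char)[] s) @trusted
{
    // Explicit check: bad input hits assert(0) even in release builds...
    if (s.length < 2)
        assert(0, "truncated UTF-8 sequence");
    // ...while .ptr indexing skips the now-redundant bounds check.
    return s.ptr[1];
}
```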


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Dmitry Olshansky

On 24-Mar-2014 01:22, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


I had to join the party at some point.
This seems like 25 instructions:
http://goo.gl/N7sHtK

--
Dmitry Olshansky


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Chris Williams
On Monday, 24 March 2014 at 16:17:52 UTC, Andrei Alexandrescu 
wrote:

On 3/24/14, 12:53 AM, Chris Williams wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


http://goo.gl/TaZTNB


Nice! Why assert(ret != 0)?

Andrei


It should be -1. I wanted to make sure that bsr() wasn't called 
on zero, but of course that's not the correct way to test it when 
I'm about to ~ it.


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread dnspies
Whoops, wrong "original" version.  That was the one before I 
understood the game. Here mine is fixed (but up to 180 lines - 16 
labels = 164 instructions):


http://goo.gl/wjIAVm

On Monday, 24 March 2014 at 17:14:11 UTC, dnspies wrote:
It seems like mine wasn't being inlined because I had 
carelessly replaced char[] with const(char)[] in the function 
signature (I don't know why that should make a difference, but 
it does)


Going back to my original version (with Andrei's trick to avoid 
letting the loop unroll), I get

dnspies_orig   159 - 16 labels = 143 instructions

However, MFortin1 can be made to do better by taking successive 
slices instead of indexing (and commenting out the check, since 
it no longer helps optimize).

MFortin1_slice 137 - 13 labels = 124 instructions

http://goo.gl/JAgVK1

On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
On 2014-03-24 06:08:30 +, "safety0ff" 
 said:



Everything seems to work in your corrected versions:
http://dpaste.dzfl.pl/3b32535559ba

Andrei's version doesn't compile on recent compiler versions 
due to goto skipping initialization.


Ok, so let's check what happens in actual usage. In other 
words, what happens when front is inlined and used in a loop. 
I'm counting the number of assembly lines of this main 
function for looping through each possible input using the 
various implementations:


http://goo.gl/QjnA2k

implementation   line count of main in assembly
andralex         285 -  9 labels = 276 instructions
MFortin1         135 - 13 labels = 122 instructions
MFortin2         150 - 14 labels = 136 instructions
safety0ff        -- not inlined (too big?) --
dnspies          161 - 15 labels = 146 instructions
CWilliams        233 - 25 labels = 208 instructions

For compactness, my first implementation seems to be the 
winner, both with and without inlining.


That said, I think front should only assume a non-empty char[] 
and thus should check the length before accessing anything 
after the first byte to protect against incomplete multi-byte 
sequences such as [0x1000_]. So I added this line to my 
two implementations:


  if (1+tailLength >= s.length) return dchar.init;

Now let's see what happens with those changes:

http://goo.gl/XPCGYE

implementation   line count of main in assembly
MFortin1Check    103 - 11 labels =  92 instructions
MFortin2Check    135 - 13 labels = 122 instructions

Now I'm baffled. Adding a test makes the code shorter? It 
actually makes the standalone functions longer by 8 
instructions (as expected), but the inlined version is 
shorter. My guess is that the optimizer creates something 
entirely different and it turns out that this different 
version optimises better after inlining.


That said, my test "main" function isn't very representative 
of the general case because the length is statically known by 
the optimizer. Let's see what it does when the length is not 
statically known:


http://goo.gl/E2Q0Yu

implementation   line count of main in assembly
andralex         384 -  9 labels = 375 instructions
MFortin1         173 - 19 labels = 154 instructions
MFortin2         182 - 18 labels = 164 instructions
safety0ff        -- not inlined --
dnspies          -- not inlined --
CWilliams        229 - 23 labels = 206 instructions
MFortin1Check    211 - 24 labels = 187 instructions
MFortin2Check    218 - 21 labels = 197 instructions

So again MFortin1 is the winner for compactness. Still, I 
maintain that we ought to at least check the length of the 
array for multi-byte sequences to protect against incomplete 
sequences that could lie at the end of the string, so I'd 
favor MFortin1Check.




Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread dnspies
It seems like mine wasn't being inlined because I had carelessly 
replaced char[] with const(char)[] in the function signature (I 
don't know why that should make a difference, but it does)


Going back to my original version (with Andrei's trick to avoid 
letting the loop unroll), I get

dnspies_orig   159 - 16 labels = 143 instructions

However, MFortin1 can be made to do better by taking successive 
slices instead of indexing (and commenting out the check, since 
it no longer helps optimize).

MFortin1_slice 137 - 13 labels = 124 instructions

http://goo.gl/JAgVK1

On Monday, 24 March 2014 at 13:39:45 UTC, Michel Fortin wrote:
On 2014-03-24 06:08:30 +, "safety0ff" 
 said:



Everything seems to work in your corrected versions:
http://dpaste.dzfl.pl/3b32535559ba

Andrei's version doesn't compile on recent compiler versions 
due to goto skipping initialization.


Ok, so let's check what happens in actual usage. In other 
words, what happens when front is inlined and used in a loop. 
I'm counting the number of assembly lines of this main function 
for looping through each possible input using the various 
implementations:


http://goo.gl/QjnA2k

implementation   line count of main in assembly
andralex         285 -  9 labels = 276 instructions
MFortin1         135 - 13 labels = 122 instructions
MFortin2         150 - 14 labels = 136 instructions
safety0ff        -- not inlined (too big?) --
dnspies          161 - 15 labels = 146 instructions
CWilliams        233 - 25 labels = 208 instructions

For compactness, my first implementation seems to be the 
winner, both with and without inlining.


That said, I think front should only assume a non-empty char[] 
and thus should check the length before accessing anything 
after the first byte to protect against incomplete multi-byte 
sequences such as [0x1000_]. So I added this line to my two 
implementations:


   if (1+tailLength >= s.length) return dchar.init;

Now let's see what happens with those changes:

http://goo.gl/XPCGYE

implementation   line count of main in assembly
MFortin1Check    103 - 11 labels =  92 instructions
MFortin2Check    135 - 13 labels = 122 instructions

Now I'm baffled. Adding a test makes the code shorter? It 
actually makes the standalone functions longer by 8 instructions 
(as expected), but the inlined version is shorter. My guess is 
that the optimizer creates something entirely different and it 
turns out that this different version optimises better after 
inlining.


That said, my test "main" function isn't very representative of 
the general case because the length is statically known by the 
optimizer. Let's see what it does when the length is not 
statically known:


http://goo.gl/E2Q0Yu

implementation   line count of main in assembly
andralex         384 -  9 labels = 375 instructions
MFortin1         173 - 19 labels = 154 instructions
MFortin2         182 - 18 labels = 164 instructions
safety0ff        -- not inlined --
dnspies          -- not inlined --
CWilliams        229 - 23 labels = 206 instructions
MFortin1Check    211 - 24 labels = 187 instructions
MFortin2Check    218 - 21 labels = 197 instructions

So again MFortin1 is the winner for compactness. Still, I 
maintain that we ought to at least check the length of the 
array for multi-byte sequences to protect against incomplete 
sequences that could lie at the end of the string, so I'd favor 
MFortin1Check.




Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread John Colvin

On Monday, 24 March 2014 at 16:41:02 UTC, John Colvin wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


On a bigendian machine with loose alignment requirements (1 
byte), you can do this


Same again, but for little-endian, 18 instructions:

http://goo.gl/jlrweQ

uint front(char[] s)
{
  if(!(s[0] & 0b1000_0000)) return s[0]; // handle ASCII
  assert(s[0] & 0b0100_0000);

  if(s[0] & 0b0010_0000)
  {
    if(s[0] & 0b0001_0000)
    {
      assert(s.length >= 4 && !(s[0] & 0b1000)
             && s[1] <= 0b1011_1111
             && s[2] <= 0b1011_1111
             && s[3] <= 0b1011_1111);
      return swapEndian(*(cast(dchar*)(s.ptr)));
    }
    assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
    return swapEndian(*(cast(dchar*)(s.ptr))) >> 8;
  }

  assert(s.length >= 2 && s[1] <= 0b1011_1111);
  return swapEndian(*(cast(wchar*)(s.ptr)));
}


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread John Colvin
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


On a bigendian machine with loose alignment requirements (1 
byte), you can do this, which is down to 13 instructions on x86 
(which is of course meaningless, what with it being the wrong 
endianess):


uint front(char[] s) {
  if(!(s[0] & 0b1000_0000)) return s[0]; // handle ASCII
  assert(s[0] & 0b0100_0000);

  if(s[0] & 0b0010_0000)
  {
    if(s[0] & 0b0001_0000)
    {
      assert(s.length >= 4 && !(s[0] & 0b1000)
             && s[1] <= 0b1011_1111
             && s[2] <= 0b1011_1111
             && s[3] <= 0b1011_1111);
      return *(cast(dchar*)(s.ptr));
    }
    assert(s.length >= 3 && s[1] <= 0b1011_1111 && s[2] <= 0b1011_1111);
    return *(cast(dchar*)(s.ptr)) >> 8;
  }

  assert(s.length >= 2 && s[1] <= 0b1011_1111);
  return *(cast(wchar*)(s.ptr));
}

http://goo.gl/Kf6RZJ


There may be architectures that can benefit from this.
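The instruction-count win in both versions comes from loading the whole sequence in one unaligned read; on little-endian hosts, `swapEndian` then puts the bytes back into string order. A portable C sketch of that load (the helper name `load_be` is invented here; a real implementation would use a single unaligned load plus a byte-swap intrinsic):

```c
#include <stdint.h>
#include <string.h>

/* Pack the first n (1..4) bytes of s into a uint32_t with the first
   byte most significant, i.e. what an unaligned load followed by
   swapEndian produces on a little-endian machine. */
static uint32_t load_be(const unsigned char *s, int n)
{
    uint32_t v = 0;
    memcpy(&v, s, (size_t)n);          /* unaligned-safe load */
    unsigned char b[4];
    memcpy(b, &v, 4);                  /* recover bytes in memory order */
    v = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
      | ((uint32_t)b[2] << 8) | b[3];  /* first byte -> most significant */
    return v >> (8 * (4 - n));         /* right-align short sequences */
}
```

With the bytes packed this way, a 3-byte sequence is just the 4-byte load shifted right by 8, which is exactly what the `>> 8` in the code above does.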


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 5:51 AM, w0rp wrote:

On Monday, 24 March 2014 at 09:02:19 UTC, monarch_dodra wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Before we roll this out, could we discuss a strategy/guideline in
regards to detecting and handling invalid UTF sequences?

Having a fast "front" is fine and all, but if it means your program
asserting in release (or worse, silently corrupting memory) just
because the client was trying to read a bad text file, I'm unsure this
is acceptable.


I would strongly advise to at least offer an option


Options are fine for functions etc. But front would need to find an 
all-around good compromise between speed and correctness.


Andrei



Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 3:30 AM, dnspies wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Ok, I managed to make it smaller.  I think this is the smallest so far
with only 23 instructions (no loop unrolling in this one):

http://goo.gl/RKF5Vm


This is nice because even the checks are relatively cheap. It's quite 
difficult to understand tho :o). -- Andrei


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Vladimir Panteleev
On Monday, 24 March 2014 at 16:31:42 UTC, Andrei Alexandrescu 
wrote:

On 3/24/14, 2:02 AM, monarch_dodra wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Before we roll this out, could we discuss a strategy/guideline 
in

regards to detecting and handling invalid UTF sequences?


I think std.array.front should return the invalid dchar on 
error, and popFront should attempt to resync on error.


Ignoring UTF errors is a lossy operation and has the potential 
problem of irreversible data loss. For example, consider a 
program which needs to process Windows-1251 files: it would need 
to read the file from disk, convert to UTF-8, process it, convert 
back to Windows-1251, and save it back to disk. If a bug causes 
the UTF-8 conversion step to be accidentally skipped, then all 
Unicode data in that file is lost.


I think UTF-8 decoding operations should, by default, throw on 
UTF-8 errors. Ignoring UTF-8 errors should only be done 
explicitly, with the programmer's consent.
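This amounts to making the error policy explicit at the call site. A hypothetical C sketch of a validating decoder that reports malformed input rather than silently substituting (the name `utf8_decode_checked` and its error convention are invented here; the caller then decides whether to throw, replace with U+FFFD, or skip):

```c
#include <stddef.h>
#include <stdint.h>

/* Validating decode: returns the code point, or -1 on malformed input,
   leaving the error policy to the caller. */
static int32_t utf8_decode_checked(const unsigned char *s, size_t len)
{
    if (len == 0) return -1;
    if (s[0] < 0x80) return s[0];
    size_t n;                          /* expected sequence length */
    uint32_t cp;
    if      ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    else return -1;                    /* stray continuation / bad lead */
    if (len < n) return -1;            /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    if (cp > 0x10FFFF) return -1;      /* outside Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return -1;  /* surrogates */
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[n]) return -1;     /* overlong encoding */
    return (int32_t)cp;
}
```

Note that rejecting overlong encodings and surrogates is part of what "throw on UTF-8 errors" has to mean; a decoder that only checks continuation-byte prefixes still accepts invalid input.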


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 2:02 AM, monarch_dodra wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Before we roll this out, could we discuss a strategy/guideline in
regards to detecting and handling invalid UTF sequences?


I think std.array.front should return the invalid dchar on error, and 
popFront should attempt to resync on error.


Andrei



Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 1:56 AM, dnspies wrote:

On Monday, 24 March 2014 at 08:06:53 UTC, dnspies wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Here's mine (the loop gets unrolled):

dchar front(const char[] s) {
  assert(s.length > 0);
  byte c = s[0];
  dchar res = cast(ubyte)c;
  if(c >= 0) {
return res;
  }
  c <<= 1;
  assert(c < 0);
  for(int i = 1; i < 4; i++) {
assert(i < s.length);
ubyte d = s[i];
assert((d >> 6) == 0b10);
res = (res << 8) | d;
c <<= 1;
if(c >= 0)
  return res;
  }
  assert(false);
}


Sorry, I misunderstood.  We only want the x's in the output. Here's my
revised solution

http://goo.gl/PL729J


That turned out quite large. -- Andrei
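The first loop above accumulates whole bytes, `10` prefix bits included; "only the x's" means keeping just the payload bits of each byte. A hedged C sketch of that payload-only accumulation, mirroring the shift-the-lead-byte loop structure (assumes well-formed input; `front_xbits` is an invented name):

```c
#include <stdint.h>

/* Accumulate only the payload ("x") bits of a UTF-8 sequence,
   counting leading 1-bits of the lead byte to find its length. */
static uint32_t front_xbits(const unsigned char *s)
{
    unsigned char c = s[0];
    if (c < 0x80) return c;            /* ASCII fast path */
    int n = 0;                         /* leading 1-bits = sequence length */
    while (c & 0x80) { n++; c = (unsigned char)(c << 1); }
    uint32_t res = c >> n;             /* payload bits of the lead byte */
    for (int i = 1; i < n; i++)
        res = (res << 6) | (s[i] & 0x3F);  /* 6 payload bits per byte */
    return res;
}
```

The per-byte `& 0x3F` mask and the 6-bit (rather than 8-bit) shift are the only changes from the packed-bytes version, but they cost extra instructions, which is presumably why the revised solution grew.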



Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 1:06 AM, dnspies wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


Here's mine (the loop gets unrolled):

dchar front(const char[] s) {
  assert(s.length > 0);
  byte c = s[0];
  dchar res = cast(ubyte)c;
  if(c >= 0) {
    return res;
  }
  c <<= 1;
  assert(c < 0);
  for(int i = 1; i < 4; i++) {
    assert(i < s.length);
    ubyte d = s[i];
    assert((d >> 6) == 0b10);
    res = (res << 8) | d;
    c <<= 1;
    if(c >= 0)
      return res;
  }
  assert(false);
}


http://goo.gl/HBSv5T - looks nice! For minimum size replace the 
assert(false) with one inside the loop: http://goo.gl/eWs0ft


Andrei



Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread Andrei Alexandrescu

On 3/24/14, 12:53 AM, Chris Williams wrote:

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


http://goo.gl/TaZTNB


Nice! Why assert(ret != 0)?

Andrei


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread John Colvin

On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


http://forum.dlang.org/post/fsgdviklnbugdzjsg...@forum.dlang.org


Re: Challenge: write a really really small front() for UTF8

2014-03-24 Thread John Colvin

On Monday, 24 March 2014 at 15:32:09 UTC, John Colvin wrote:
On Sunday, 23 March 2014 at 21:23:18 UTC, Andrei Alexandrescu 
wrote:

Here's a baseline: http://goo.gl/91vIGc. Destroy!

Andrei


I've probably missed something, but here's 15 instructions: 
http://goo.gl/d3Mgj0


woops yeah, that's completely broken.

