On Tuesday, 21 April 2015 at 13:06:22 UTC, JohnnyK wrote:
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need
the
On Monday, 20 April 2015 at 19:24:01 UTC, Panke wrote:
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need
the number of graphemes. (d|)?string.length gives you the
On 2015-04-20 08:04, Nick B wrote:
Perhaps a new Unicode standard could start that way as well?
https://xkcd.com/927/
--
/Jacob Carlborg
On Saturday, 18 April 2015 at 17:04:54 UTC, Tobias Pankrath wrote:
Isn't this solved commonly with a normalization pass? We
should have a normalizeUTF() that can be inserted in a
pipeline.
Yes.
Then the rest of Phobos doesn't need to mind these combining
characters. -- Andrei
I don't
Yes, again and again I encountered length related bugs with
Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems go
away. It just guarantees that two strings normalized to the same
form are binary equal iff they are equal in
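To make the normalization point concrete, here is a minimal sketch. It assumes std.uni.normalize from Phobos with the NFC and NFD forms; the variable names are only illustrative. It shows that normalizing makes the two spellings of é compare equal, while the normalized string still has more than one code unit:

```d
import std.uni;

unittest
{
    // Two spellings of the same user-perceived character.
    string decomposed  = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT
    string precomposed = "\u00E9";  // U+00E9, the single code point é

    // Without normalization the strings differ byte for byte.
    assert(decomposed != precomposed);

    // After normalizing both to the same form they are binary equal.
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);

    // Normalization does not make .length a character count:
    // U+00E9 still takes two UTF-8 code units.
    assert(normalize!NFC(decomposed).length == 2);
}
```

The block can be checked with something like `dmd -unittest -main`.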
On Monday, 20 April 2015 at 11:04:58 UTC, Panke wrote:
Yes, again and again I encountered length related bugs with
Unicode characters. Normalization is not 100% reliable.
I think it is 100% reliable, it just doesn't make the problems
go away. It just guarantees that two strings normalized
This can lead to subtle bugs, cf. length of random and e_one.
You have to convert everything to dstring to get the expected
result. However, this is not always desirable.
There are three things that you need to be aware of when handling
Unicode: code units, code points and graphemes.
In
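The three levels can be seen side by side in a short sketch. It assumes std.uni.byGrapheme and std.range.walkLength from Phobos, and it relies on walkLength decoding a narrow string into code points; treat it as an illustration rather than a recommendation:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // 'e' followed by U+0301: one user-perceived character.
    string  s8  = "e\u0301";
    wstring s16 = "e\u0301"w;
    dstring s32 = "e\u0301"d;

    // .length counts code units of the respective encoding.
    assert(s8.length  == 3); // 'e' is 1 byte, U+0301 is 2 bytes in UTF-8
    assert(s16.length == 2);
    assert(s32.length == 2);

    // Walking a narrow string decodes it, so this counts code points.
    assert(s8.walkLength == 2);

    // byGrapheme groups code points into grapheme clusters.
    assert(s8.byGrapheme.walkLength == 1);
}
```

Note that even the dstring reports a length of 2 here, so converting to dstring only helps when every character is a single code point.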
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need
the number of graphemes. (d|)?string.length gives you the
number of code units.
Even that's not really true. In the end it's up to the font and
layout engine to decide how much
On Mon, Apr 20, 2015 at 06:03:49PM +, John Colvin via Digitalmars-d wrote:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need the
number of graphemes. (d|)?string.length gives you the number of code
units.
Even that's not
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
This can lead to subtle bugs, cf. length of random and e_one.
You have to convert everything to dstring to get the
expected result. However, this is not always desirable.
There are three things that you need to be aware of when
handling
On Monday, 20 April 2015 at 03:39:54 UTC, ketmar wrote:
On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:
Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that unicode crap for many
years.
Perhaps. or perhaps not. This community got
On Monday, 20 April 2015 at 18:03:50 UTC, John Colvin wrote:
On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
To measure the columns needed to print a string, you'll need
the number of graphemes. (d|)?string.length gives you the
number of code units.
Even that's not really true.
Why?
On 19/04/15 22:58, ketmar wrote:
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf
This is not a very accurate depiction of Unicode.
For example:
And, moreover, BOM is meaningless
On Saturday, 18 April 2015 at 16:01:20 UTC, Andrei Alexandrescu wrote:
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code snippet
On 19/04/15 10:51, Abdulhaq wrote:
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
On 18/04/15 21:40, Walter Bright wrote:
Also, notice that some letters can only be achieved using multiple
code points. Hebrew diacritics, for example, do not, typically, have a
composite
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
U+0065 U+0301 rather than U+00E9. Because of legacy systems, and
because they would rather have the ISO-8859 code pages be 1:1
mappings, rather than 1:n mappings, they introduced code points
they really would rather do without.
On Sunday, 19 April 2015 at 02:20:01 UTC, Shachar Shemesh wrote:
On 18/04/15 21:40, Walter Bright wrote:
I'm not arguing against the existence of the Unicode standard, I'm
saying I can't figure any justification for standardizing different
encodings of the same thing.
A lot of areas in
On Saturday, 18 April 2015 at 17:50:12 UTC, Walter Bright wrote:
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the combining acute accent [1].
[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense
to have two
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
é might be obvious, but Unicode isn't just for writing European prose.
it is also to insert pictures of the animals into text.
Unicode is a nightmarish system in some ways, but considering how
incredibly difficult the problem it solves
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
é might be obvious, but Unicode isn't just for writing
European prose.
it is also to insert pictures of the animals into text.
There's other uses for unicode?
On Sunday, 19 April 2015 at 19:58:28 UTC, ketmar wrote:
On Sun, 19 Apr 2015 07:54:36 +, John Colvin wrote:
it's not crazy, it's just broken in all possible ways:
http://file.bestmx.net/ee/articles/uni_vs_code.pdf
Ketmar
Great link, and a really good argument about the problems with
On Mon, 20 Apr 2015 01:27:36 +, Nick B wrote:
Perhaps Unicode needs to be rebuilt from the ground up?
alas, it's too late. now we'll live with that unicode crap for many
years.
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code snippet demonstrates the problem:
import std.stdio;
void main ()
{
dstring a = "e\u0301";
dstring b = "é";
assert(a !=
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code snippet demonstrates the problem:
import std.stdio;
void
On 4/18/2015 1:26 AM, Panke wrote:
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
On 4/18/2015 12:58 AM, John Colvin wrote:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw
Also another issue is that lower case letters and upper case
might have different size requirements or look different
depending on where on the word they are located.
For example, German ß and SS, Greek σ and ς. I know Turkish
also has similar cases.
--
Paulo
While true, it does not
On Saturday, 18 April 2015 at 11:52:52 UTC, Chris wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code
On 2015-04-18 14:25, Gary Willoughby wrote:
byGrapheme to the rescue:
http://dlang.org/phobos/std_uni.html#byGrapheme
Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it doesn't do
what I expected:
foreach (e ; "e\u0301".byGrapheme)
writeln(e);
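For readers with the same question, a minimal sketch of one way to use byGrapheme follows. It assumes the Phobos std.uni API, where each element is a std.uni.Grapheme value rather than a string, slicing it with [] gives a range over its code points, and .length gives the number of code points it holds:

```d
import std.algorithm.comparison : equal;
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    string s = "e\u0301"; // 'e' + combining acute accent

    // The whole string is a single grapheme cluster.
    assert(s.byGrapheme.walkLength == 1);

    foreach (g; s.byGrapheme)
    {
        // g is a Grapheme, not a string: slice it to reach its code points.
        assert(g.length == 2);
        assert(g[].equal("e\u0301"));
    }
}
```

Printing g[] rather than g itself is probably closer to what the loop above was meant to show, since g[] exposes the code points of the cluster.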
--
Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code to use a
uniform approach to handling the string. I don't think it's a good idea
to manually deal
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts
On Saturday, 18 April 2015 at 08:26:12 UTC, Panke wrote:
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
On 4/18/2015 12:58 AM, John Colvin wrote:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So
On Saturday, 18 April 2015 at 12:48:53 UTC, Jacob Carlborg wrote:
On 2015-04-18 14:25, Gary Willoughby wrote:
byGrapheme to the rescue:
http://dlang.org/phobos/std_uni.html#byGrapheme
Or is this unsuitable here?
How is byGrapheme supposed to be used? I tried this but it
doesn't do what I
That doesn't make sense to me, because the umlauts and the
accented e all have Unicode code point assignments.
Yes, but you may have perfectly fine unicode text where the
combined form is used. Actually there is a normalization form for
unicode that requires the combined form. To be fully
On Sat, Apr 18, 2015 at 11:52:50AM +, Chris via Digitalmars-d wrote:
On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented
e all have Unicode code point assignments.
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the combining acute accent [1].
[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any sense to have two
different representations of eacute, and I don't know why anyone
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to not
allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.
That should be done. There should be a fixed maximum codepoint count to
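For context, a small sketch of how std.uni.graphemeStride can be used to step through a string one grapheme at a time. The helper name countGraphemes is hypothetical and not part of Phobos, and the sketch says nothing about whether the call allocates internally, which is the open question above:

```d
import std.uni : graphemeStride;

/// Hypothetical helper: counts grapheme clusters by repeatedly asking
/// graphemeStride for the code-unit length of the next cluster.
size_t countGraphemes(const(char)[] s)
{
    size_t n = 0;
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        ++n;
    return n;
}

unittest
{
    // 'A' + U+030A COMBINING RING ABOVE forms one cluster of 3 code units.
    string city = "A\u030Arhus";
    assert(graphemeStride(city, 0) == 3);
    assert(countGraphemes(city) == 5);
    assert(countGraphemes("") == 0);
}
```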
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the combining acute accent [1].
[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I won't deny what the spec says, but it doesn't make any
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it shouldn't need to do so just to compute the
length of the next
On 4/18/15 4:35 AM, Jacob Carlborg wrote:
On 2015-04-18 12:27, Walter Bright wrote:
That doesn't make sense to me, because the umlauts and the accented e
all have Unicode code point assignments.
This code snippet demonstrates the problem:
import std.stdio;
void main ()
{
dstring a =
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Apr 18, 2015 at 10:50:18AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 4:35 AM, Jacob Carlborg wrote:
\u0301 is the combining acute accent [1].
[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
I
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible solution would be to modify std.uni.graphemeStride to
not allocate, since it
Isn't this solved commonly with a normalization pass? We should
have a normalizeUTF() that can be inserted in a pipeline.
Yes.
Then the rest of Phobos doesn't need to mind these combining
characters. -- Andrei
I don't think so. The thing is, even after normalization we have
to deal with
On Fri, Apr 17, 2015 at 08:44:51PM +, Panke via Digitalmars-d wrote:
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line
On Sat, Apr 18, 2015 at 11:37:27AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 11:29 AM, H. S. Teoh via Digitalmars-d wrote:
On Sat, Apr 18, 2015 at 10:53:04AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 6:27 AM, H. S. Teoh via Digitalmars-d wrote:
One possible
On Sat, Apr 18, 2015 at 11:40:08AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/18/2015 11:28 AM, H. S. Teoh via Digitalmars-d wrote:
[...]
When we don't know provenance of incoming data, we have to assume the
worst and run normalization to be sure that we got it right.
I'm not arguing
On 4/18/2015 1:22 PM, H. S. Teoh via Digitalmars-d wrote:
Take it up with the Unicode consortium. :-)
I see nobody knows :-)
On 18/04/15 21:40, Walter Bright wrote:
I'm not arguing against the existence of the Unicode standard, I'm
saying I can't figure any justification for standardizing different
encodings of the same thing.
A lot of areas in Unicode are due to pre-Unicode legacy.
I'm guessing here, but looking
On 4/18/2015 1:32 PM, H. S. Teoh via Digitalmars-d wrote:
However, I think Walter's goal here is to match the original wrap()
functionality.
Yes, although the overarching goal is:
Minimize Need For Using GC In Phobos
and the method here is to use ranges rather than having to allocate
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with
On 4/18/2015 12:58 AM, John Colvin wrote:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping,
On 2015-04-18 09:58, John Colvin wrote:
Code points aren't equivalent to characters. They're not the same thing
in most European languages, never mind the rest of the world. If we have
a line-wrapping algorithm in phobos that works by code points, it needs
a large THIS IS ONLY FOR SIMPLE
On Saturday, 18 April 2015 at 08:18:46 UTC, Walter Bright wrote:
On 4/18/2015 12:58 AM, John Colvin wrote:
On Friday, 17 April 2015 at 18:41:59 UTC, Walter Bright wrote:
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of
Challenge level - Moderately easy
Consider the function std.string.wrap:
http://dlang.org/phobos/std_string.html#.wrap
It takes a string as input, and returns a GC allocated string that is
word-wrapped. It needs to be enhanced to:
1. Accept a ForwardRange as input.
2. Return a lazy
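As a reference point for the challenge, this is a brief sketch of how the existing eager std.string.wrap behaves, assuming the Phobos signature that takes the string and a column count. It is not the lazy range-based version the challenge asks for:

```d
import std.string : wrap;

unittest
{
    // The result is a newly allocated string, including a final newline.
    assert(wrap("a short string", 7) == "a short\nstring\n");

    // wrap does not break inside a word; a word longer than the column
    // limit is placed on its own line.
    assert(wrap("a short string", 4) == "a\nshort\nstring\n");
}
```

A lazy replacement would presumably have to decide whether to keep that trailing newline for compatibility.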
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
So either you have to throw out all pretenses of Unicode-correctness
and just stick with ASCII-style per-character line-wrapping, or you
have to live with byGrapheme with all the complexity that it entails.
The
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not allocate:
awesome! Please make a pull request for this so you get proper credit!
On 4/17/2015 9:59 AM, H. S. Teoh via Digitalmars-d wrote:
So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
awesome! Please
On Fri, Apr 17, 2015 at 09:59:40AM -0700, H. S. Teoh via Digitalmars-d wrote:
[...]
--
All problems are easy in retrospect.
Argh, my Perl script doth mock me!
T
--
Windows: the ultimate triumph of marketing over technology. -- Adrian von Bidder
On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
Challenge level - Moderately easy
Consider the function std.string.wrap:
http://dlang.org/phobos/std_string.html#.wrap
It takes a string as input, and returns a GC allocated string that is
word-wrapped. It
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not allocate:
there is some... inconsistency: `std.string.wrap` adds final \n to
string. ;-) but i
On 4/17/2015 11:46 AM, H. S. Teoh via Digitalmars-d wrote:
On Fri, Apr 17, 2015 at 11:44:52AM -0700, Walter Bright via Digitalmars-d wrote:
On 4/17/2015 11:17 AM, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper
On Friday, 17 April 2015 at 19:44:41 UTC, ketmar wrote:
On Fri, 17 Apr 2015 11:17:30 -0700, H. S. Teoh via Digitalmars-d wrote:
Well, talk is cheap, so here's a working implementation of the
non-Unicode-correct line wrapper that uses ranges and does not
allocate:
there is some...
On 17/04/15 19:59, H. S. Teoh via Digitalmars-d wrote:
There's also the question of what to do with bidi markings: how do you
handle counting the columns in that case?
Which BiDi marking are you referring to? LRM/RLM and friends? If so,
don't worry: the interface, as described, is incapable
64 matches