Re: Regarding hex strings

2012-10-22 Thread H. S. Teoh
On Mon, Oct 22, 2012 at 01:14:21PM +0200, Dejan Lekic wrote:
> >
> >If you want vastly human readable, you want heredoc hex syntax,
> >something like this:
> >
> > ubyte[] = x"<<END
> > 32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
> > 32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
> > 2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
> > 74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
> > 6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
> > 74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
> > 20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
> > 22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
> > 20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
> > 6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
> > END";
> >
> 
> Having a heredoc syntax for hex-strings that produce ubyte[] arrays
> is confusing for people who would (naturally) expect a string from a
> heredoc string. It is not named hereDOC for no reason. :)

What I meant was, a syntax similar to heredoc, not an actual heredoc,
which would be a string.


T

-- 
Knowledge is that area of ignorance that we arrange and classify. -- Ambrose 
Bierce


Re: Regarding hex strings

2012-10-22 Thread Simen Kjaeraas

On 2012-10-18 02:45, bearophile wrote:


So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];


That syntax is already taken, though.

Still, I see no reason for x"..." not to return ubyte[].

--
Simen


Re: Regarding hex strings

2012-10-22 Thread Dejan Lekic

On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:

(Repost)

hex strings are useful, but I think they were invented in D1 
when strings were convertible to char[]. But today they are an 
array of immutable UTF-8, so I think this default type is not 
so useful:


void main() {
string data1 = x"A1 B2 C3 D4"; // OK
immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression 
("\xa1\xb2\xc3\xd4") of type string to ubyte[]



Generally I want to use hex strings to put binary data in a 
program, so usually it's a ubyte[] or uint[].


So I have to use something like:

auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);


So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

Bye,
bearophile


+1 on this one
I also like the x[ ... ] literal because it makes it obvious that 
we are dealing with an array.
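For reference, the "oct!" precedent mentioned later in this thread suggests what a library-side equivalent might look like. The sketch below is illustrative only: `hexBytes` is a hypothetical name, not an actual Phobos API, and whether the CTFE of that era could handle large inputs is exactly the memory concern raised elsewhere in this thread.

```d
import std.ascii : isHexDigit;

// Hypothetical CTFE helper: parse a whitespace-separated hex string into
// a ubyte[] at compile time, in the spirit of the existing oct! template.
ubyte[] hexBytes(string s)()
{
    ubyte[] result;
    int nibbles = 0;
    ubyte current = 0;
    foreach (c; s)
    {
        if (c == ' ' || c == '\t' || c == '\r' || c == '\n')
            continue;                       // ignore layout whitespace
        assert(isHexDigit(c), "invalid hex digit in literal");
        ubyte v = ('0' <= c && c <= '9') ? cast(ubyte)(c - '0')
                : ('a' <= c && c <= 'f') ? cast(ubyte)(c - 'a' + 10)
                : cast(ubyte)(c - 'A' + 10);
        current = cast(ubyte)((current << 4) | v);
        if (++nibbles == 2)                 // two hex digits = one byte
        {
            result ~= current;
            nibbles = 0;
            current = 0;
        }
    }
    assert(nibbles == 0, "odd number of hex digits in literal");
    return result;
}

void main()
{
    enum data = hexBytes!"A1 B2 C3 D4";     // evaluated entirely at compile time
    static assert(data == [0xA1, 0xB2, 0xC3, 0xD4]);
}
```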


Re: Regarding hex strings

2012-10-22 Thread Dejan Lekic


If you want vastly human readable, you want heredoc hex syntax,
something like this:

ubyte[] = x"<<END
32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
...
END";

Having a heredoc syntax for hex-strings that produce ubyte[] 
arrays is confusing for people who would (naturally) expect a 
string from a heredoc string. It is not named hereDOC for no 
reason. :)


Re: Regarding hex strings

2012-10-20 Thread Nick Sabalausky
On Sat, 20 Oct 2012 14:05:21 -0700
"H. S. Teoh"  wrote:

> On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
> > On Sat, 20 Oct 2012 14:59:27 +0200
> > "foobar"  wrote:
> > > On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
> > > wrote:
> > > >
> > > > Maybe. Just an example of a real world code:
> > > >
> > > > Arrays:
> > > > https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> > > >
> > > > vs
> > > >
> > > > Hex strings:
> > > > https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> > > >
> > > > By the way, current code isn't affected by the topic issue.
> > > 
> > > I personally find the former more readable but I guess there 
> > > would always be someone to disagree. As they say, YMMV.
> > 
> > Honestly, I can't imagine how anyone wouldn't find the latter vastly
> > more readable.
> 
> If you want vastly human readable, you want heredoc hex syntax,
> something like this:
> 
>   ubyte[] = x"<<END
>   32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
>   32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
>   2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
>   74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
>   6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
>   74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
>   20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
>   22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
>   20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
>   6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
>   END";
> 
> (I just made that syntax up, so the details are not final, but you get
> the idea.) I would propose supporting this in D, but then D already
> has way too many different ways of writing strings, some of
> questionable utility, so I will refrain.
> 
> Of course, the above syntax might actually be implementable with a
> suitable mixin template that takes a compile-time string. Maybe we
> should lobby for such a template to go into Phobos -- that might
> motivate people to fix CTFE in dmd so that it doesn't consume
> unreasonable amounts of memory when the size of CTFE input gets
> moderately large (see other recent thread on this topic).
> 

Can't you already just do this?:

auto blah = x"
32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
32 3d 34 2e 20 54 68 65 72 65 66 6f 72 65 2c 20
2b 2c 20 2a 2c 20 61 6e 64 20 5e 20 61 72 65 20
74 68 65 20 73 61 6d 65 20 6f 70 65 72 61 74 69
6f 6e 2e 0a 22 36 34 30 4b 20 6f 75 67 68 74 20
74 6f 20 62 65 20 65 6e 6f 75 67 68 22 20 2d 2d
20 42 69 6c 6c 20 47 2e 2c 20 31 39 38 34 2e 20
22 54 68 65 20 49 6e 74 65 72 6e 65 74 20 69 73
20 6e 6f 74 20 61 20 70 72 69 6d 61 72 79 20 67
6f 61 6c 20 66 6f 72 20 50 43 20 75 73 61 67 65
";

I thought all string literals in D accepted embedded newlines?



Re: Regarding hex strings

2012-10-20 Thread foobar

On Saturday, 20 October 2012 at 21:16:44 UTC, foobar wrote:

On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:
On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky 
wrote:

On Sat, 20 Oct 2012 14:59:27 +0200
"foobar"  wrote:
> On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
> Shelomovskij

> wrote:
> >
> > Maybe. Just an example of a real world code:
> >
> > Arrays:
> > 
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> >
> > vs
> >
> > Hex strings:
> > 
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> >
> > By the way, current code isn't affected by the topic 
> > issue.
> 
> I personally find the former more readable but I guess 
> there would always be someone to disagree. As they say, YMMV.


Honestly, I can't imagine how anyone wouldn't find the latter 
vastly more readable.


If you want vastly human readable, you want heredoc hex syntax,
something like this:

ubyte[] = x"<<END
...
END";

(I just made that syntax up, so the details are not final, but 
you get the idea.) I would propose supporting this in D, but 
then D already has way too many different ways of writing 
strings, some of questionable utility, so I will refrain.

Of course, the above syntax might actually be implementable with 
a suitable mixin template that takes a compile-time string. 
Maybe we should lobby for such a template to go into Phobos -- 
that might motivate people to fix CTFE in dmd so that it doesn't 
consume unreasonable amounts of memory when the size of CTFE 
input gets moderately large (see other recent thread on this 
topic).


T


Yeah, I like this. I'd prefer brackets over quotes, but it's not 
a big deal as the quotes in the above are not very noticeable. 
It should look distinct from textual strings.

As you said, this could/should be implemented as a template.

Vote++


Re: Regarding hex strings

2012-10-20 Thread foobar

On Saturday, 20 October 2012 at 21:03:20 UTC, H. S. Teoh wrote:

On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:

On Sat, 20 Oct 2012 14:59:27 +0200
"foobar"  wrote:
> On Saturday, 20 October 2012 at 10:51:25 UTC, Denis 
> Shelomovskij

> wrote:
> >
> > Maybe. Just an example of a real world code:
> >
> > Arrays:
> > 
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> >
> > vs
> >
> > Hex strings:
> > 
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> >
> > By the way, current code isn't affected by the topic issue.
> 
> I personally find the former more readable but I guess there 
> would always be someone to disagree. As they say, YMMV.


Honestly, I can't imagine how anyone wouldn't find the latter 
vastly more readable.


If you want vastly human readable, you want heredoc hex syntax,
something like this:

ubyte[] = x"<<END
...
END";

(I just made that syntax up, so the details are not final, but 
you get the idea.) I would propose supporting this in D, but 
then D already has way too many different ways of writing 
strings, some of questionable utility, so I will refrain.

Of course, the above syntax might actually be implementable with 
a suitable mixin template that takes a compile-time string. 
Maybe we should lobby for such a template to go into Phobos -- 
that might motivate people to fix CTFE in dmd so that it doesn't 
consume unreasonable amounts of memory when the size of CTFE 
input gets moderately large (see other recent thread on this 
topic).


T


Yeah, I like this. I'd prefer brackets over quotes, but it's not 
a big deal as the quotes in the above are not very noticeable. 
It should look distinct from textual strings.

As you said, this could/should be implemented as a template.

Vote++


Re: Regarding hex strings

2012-10-20 Thread Nick Sabalausky
On Fri, 19 Oct 2012 15:07:09 +0200
"foobar"  wrote:

> On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:
> > On Thu, 18 Oct 2012 12:11:13 +0200
> > "foobar"  wrote:
> >> 
> >> How often are large binary blobs literally spelled in the 
> >> source code (as opposed to just being read from a file)?
> >
> >
> > Frequency isn't the issue. The issues are "*Is* it ever 
> > needed?" and "When it is needed, is it useful enough?" The 
> > answer to both is most certainly "yes". (Remember, D is 
> > supposed to be usable as a systems language, it's not merely 
> > a high-level-app-only language.)
> 
> Any real-world use cases to support this claim?

I've used it. And Denis just posted an example of where it was used to
make code far more readable.

> Does C++ have such a feature?

It does not. As one consequence off the top of my head, including binary
data into GBA homebrew became more of an awkward bloated mess than it
needed to be.

> My limited experience with kernels is that this feature is not 
> needed.

"I haven't needed it" isn't remotely sufficient to demonstrate that
something doesn't "pull its own weight".

> The solution we used for this was to define an extern 
> symbol and load it with a linker script (the binary data was of 
> course stored in separate files).
> 

Yuck!

s/solution/workaround/

> >
> > Keep in mind, the question "Does it pull its own weight?" is 
> > for
> > adding new features, not for going around gutting the language
> > just because we can.
> 
> Ok, I grant you that, but remember that the whole thread started 
> because the feature _doesn't_ work, so let's rephrase - is it worth 
> the effort to fix this feature?
> 

The only bug is that it tries to validate it as UTF contrary to the
spec. Making it *not* try to validate it sounds like a very minor
effort. I think you're blowing it out of proportion.

And yes, I think it's definitely worth it.

> >
> >> In any case, I'm not opposed to such a utility library, in 
> >> fact I think it's a rather good idea and we already have a 
> >> precedent with "oct!"
> >> I just don't think this belongs as a built-in feature in the 
> >> language.
> >
> > I think monarch_dodra's test proves that it definitely needs to 
> > be
> > built-in.
> 
> It proves that DMD has bugs that should be fixed, nothing more.

Right so let's jettison x"..." just because *someday* CTFE might become
good enough that we can bring the feature back. How does that make
any sense?

We already have it, it basically works (aside from only a fairly
trivial issue). *When* CTFE is good enough to replace it, *then* we can
have a sane debate about actually doing so. Until then, "Let's get
rid of x"..." because it can be done in the library" is a pointless
argument because at least for now it's NOT TRUE.



Re: Regarding hex strings

2012-10-20 Thread H. S. Teoh
On Sat, Oct 20, 2012 at 04:39:28PM -0400, Nick Sabalausky wrote:
> On Sat, 20 Oct 2012 14:59:27 +0200
> "foobar"  wrote:
> > On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij
> > wrote:
> > >
> > > Maybe. Just an example of a real world code:
> > >
> > > Arrays:
> > > https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> > >
> > > vs
> > >
> > > Hex strings:
> > > https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> > >
> > > By the way, current code isn't affected by the topic issue.
> > 
> > I personally find the former more readable but I guess there 
> > would always be someone to disagree. As they say, YMMV.
> 
> Honestly, I can't imagine how anyone wouldn't find the latter vastly
> more readable.

If you want vastly human readable, you want heredoc hex syntax,
something like this:

ubyte[] = x"<<END
32 2b 32 3d 34 2e 20 32 2a 32 3d 34 2e 20 32 5e
...
END";

Re: Regarding hex strings

2012-10-20 Thread Nick Sabalausky
On Sat, 20 Oct 2012 14:59:27 +0200
"foobar"  wrote:
> On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
> wrote:
> >
> > Maybe. Just an example of a real world code:
> >
> > Arrays:
> > https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110
> >
> > vs
> >
> > Hex strings:
> > https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130
> >
> > By the way, current code isn't affected by the topic issue.
> 
> I personally find the former more readable but I guess there 
> would always be someone to disagree. As they say, YMMV.

Honestly, I can't imagine how anyone wouldn't find the latter vastly
more readable.



Re: Regarding hex strings

2012-10-20 Thread Nick Sabalausky
On Fri, 19 Oct 2012 20:46:06 +0200
> 
> For general purpose binary data (i.e. _not_ UTF encoded Unicode 
> text) I also _already_ said IMO should be either stored as 
> ubyte[]

Problem is, x"..." is FAR better syntax for that.

> or better yet their own types that would ensure the 
> correct invariants for the data type, be it audio, video, or just 
> a different text encoding.

Using x"..." doesn't prevent anyone from doing that:

auto a = SomeAudioType(x"...");

> 
> In neither case the hex-string is relevant IMO. In the former it 
> potentially violates the type's invariant and in the latter we 
> already have array literals.
> 
> Using a malformed _string_ to initialize ubyte[] IMO is simply 
> less readable. How did that article call such features, "WAT"?

The only thing ridiculous about x"..." is that somewhere along the
line it was decided that it must be a string instead of the arbitrary
binary data that it *is*.



Re: Regarding hex strings

2012-10-20 Thread monarch_dodra

On Friday, 19 October 2012 at 03:14:54 UTC, Marco Leise wrote:


Hehe, I assume most of the regulars know this: DMD used to
use a garbage collector that is disabled. Memory just isn't
freed! Also it has copy on write semantics during CTFE:

int bug6498(int x)
{
int n = 0;
while (n < x)
++n;
return n;
}
static assert(bug6498(10_000_000)==10_000_000);

--> Fails with an 'out of memory' error.

http://d.puremagic.com/issues/show_bug.cgi?id=6498

So, as strange as it sounds, for now try not to write often or
into large blocks. Using this knowledge I was sometimes able
to bring down the memory consumption considerably by caching
recurring concatenations of two strings or to!string calls.

That said, appending single elements to an array may actually
be better than using a fixed-sized one and have DMD duplicate
it on every write. :p

Please remember to give Don a cookie when he manages to change
the compiler to modify in-place where appropriate.


I should have read your post in more detail. I thought you were 
saying that allocations are never freed, but it is indeed more 
than that: Every write allocates.


I just spent the last hour trying to "optimize" my code, only to 
realize that at its "simplest" (Walk the string counting 
elements), I run out of memory :/


Can't do much more about it at this point.
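The caching trick Marco describes above can be sketched roughly as follows; `buildTable` and its contents are purely illustrative, and how much memory hoisting the repeated work actually saves depends on the copy-on-write CTFE behaviour of the dmd of that era:

```d
import std.conv : to;

// Build a string at compile time, hoisting the recurring prefix out of
// the loop so the same value is not re-evaluated on every iteration.
string buildTable(int n)
{
    string result;
    immutable prefix = "value = ";   // computed once, reused each iteration
    foreach (i; 0 .. n)
        result ~= prefix ~ to!string(i) ~ "\n";
    return result;
}

enum table = buildTable(3);   // the enum forces CTFE
static assert(table == "value = 0\nvalue = 1\nvalue = 2\n");
```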



Re: Regarding hex strings

2012-10-20 Thread foobar
On Saturday, 20 October 2012 at 10:51:25 UTC, Denis Shelomovskij 
wrote:

18.10.2012 12:58, foobar wrote:

IMO, this is a redundant feature that complicates the language 
for no benefit and should be deprecated.
strings already have an escape sequence for specifying 
code-points "\u" and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

So basically this feature gains us nothing.



Maybe. Just an example of a real world code:

Arrays:
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

vs

Hex strings:
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

By the way, current code isn't affected by the topic issue.


I personally find the former more readable but I guess there 
would always be someone to disagree. As they say, YMMV.


Re: Regarding hex strings

2012-10-20 Thread Denis Shelomovskij

18.10.2012 12:58, foobar wrote:

IMO, this is a redundant feature that complicates the language for no
benefit and should be deprecated.
strings already have an escape sequence for specifying code-points "\u"
and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

So basically this feature gains us nothing.



Maybe. Just an example of a real world code:

Arrays:
https://github.com/D-Programming-Language/druntime/blob/fc45de1d089a1025df60ee2eea66ba27ee0bd99c/src/core/sys/windows/dll.d#L110

vs

Hex strings:
https://github.com/denis-sh/hooking/blob/69105a24d77fcb6eca701282a16dd5ec7311c077/tlsfixer/ntdll.d#L130

By the way, current code isn't affected by the topic issue.

--
Денис В. Шеломовский
Denis V. Shelomovskij


Re: Regarding hex strings

2012-10-19 Thread foobar

On Friday, 19 October 2012 at 18:46:07 UTC, foobar wrote:

On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:

On 19/10/12 16:07, foobar wrote:
On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston 
wrote:


We can still have both (assuming the code points are 
valid...):

string foo = "\ua1\ub2\uc3"; // no .dup


That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4


Come on, "assuming the code points are valid". It says so 4 
lines above!


It isn't the same.
Hex strings are the raw bytes, eg UTF-8 code units (ie, it 
includes the high bits that indicate the length of each char).

\u makes dchars.

"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
non-zero bytes.


Yes, the \u requires code points and not code-units for a 
specific UTF encoding, which you are correct in pointing out 
are four hex digits and not two.
This is a very reasonable choice to prevent/reduce Unicode 
encoding errors.


http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex 
data. The hex data need not form valid UTF characters."


I _already_ said that I consider this a major semantic bug, as 
it violates the principle of least surprise - the programmer's 
expectation that the D string types, which are Unicode according 
to the spec, would, well, actually contain _valid_ Unicode and 
_not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for 
_strings_ - you can use _valid_ code-points (not code units) in 
hex form.


For general purpose binary data (i.e. _not_ UTF encoded Unicode 
text) I also _already_ said IMO should be either stored as 
ubyte[] or better yet their own types that would ensure the 
correct invariants for the data type, be it audio, video, or 
just a different text encoding.


In neither case the hex-string is relevant IMO. In the former 
it potentially violates the type's invariant and in the latter 
we already have array literals.


Using a malformed _string_ to initialize ubyte[] IMO is simply 
less readable. How did that article call such features, "WAT"?


I just re-checked and to clarify string literals support _three_ 
escape sequences:

\x__ - a single byte (two hex digits)
\u____ - two bytes (four hex digits)
\U________ - four bytes (eight hex digits)

So raw bytes _can_ be directly specified and I hope the compiler 
still verifies the string literal is valid Unicode.
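A quick sketch of the three escape forms side by side (the byte values follow from the UTF-8 encoding rules):

```d
void main()
{
    // \x__ inserts one raw byte (a UTF-8 code unit); these two bytes are
    // the valid UTF-8 encoding of U+00A1.
    string a = "\xC2\xA1";

    // \u____ inserts one code point, which the compiler encodes as UTF-8.
    string b = "\u00A1";

    // \U________ takes a full 32-bit code point, here U+0001F34E.
    string c = "\U0001F34E";

    assert(a == b);        // identical bytes: 0xC2 0xA1
    assert(c.length == 4); // .length counts UTF-8 code units, not characters
}
```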





Re: Regarding hex strings

2012-10-19 Thread foobar

On Friday, 19 October 2012 at 15:07:44 UTC, Don Clugston wrote:

On 19/10/12 16:07, foobar wrote:

On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:


We can still have both (assuming the code points are 
valid...):

string foo = "\ua1\ub2\uc3"; // no .dup


That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4


Come on, "assuming the code points are valid". It says so 4 
lines above!


It isn't the same.
Hex strings are the raw bytes, eg UTF-8 code units (ie, it 
includes the high bits that indicate the length of each char).

\u makes dchars.

"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two 
non-zero bytes.


Yes, the \u requires code points and not code-units for a 
specific UTF encoding, which you are correct in pointing out are 
four hex digits and not two.
This is a very reasonable choice to prevent/reduce Unicode 
encoding errors.


http://dlang.org/lex.html#HexString states:
"Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters."


I _already_ said that I consider this a major semantic bug, as it 
violates the principle of least surprise - the programmer's 
expectation that the D string types, which are Unicode according 
to the spec, would, well, actually contain _valid_ Unicode and 
_not_ arbitrary binary data.
Given the above, the design of \u makes perfect sense for 
_strings_ - you can use _valid_ code-points (not code units) in 
hex form.


For general purpose binary data (i.e. _not_ UTF encoded Unicode 
text) I also _already_ said IMO should be either stored as 
ubyte[] or better yet their own types that would ensure the 
correct invariants for the data type, be it audio, video, or just 
a different text encoding.


In neither case the hex-string is relevant IMO. In the former it 
potentially violates the type's invariant and in the latter we 
already have array literals.


Using a malformed _string_ to initialize ubyte[] IMO is simply 
less readable. How did that article call such features, "WAT"?


Re: Regarding hex strings

2012-10-19 Thread Don Clugston

On 19/10/12 16:07, foobar wrote:

On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:


We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup


That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4


Come on, "assuming the code points are valid". It says so 4 lines above!


It isn't the same.
Hex strings are the raw bytes, eg UTF-8 code units (ie, it includes the 
high bits that indicate the length of each char).

\u makes dchars.

"\u00A1" is not the same as x"A1" nor is it x"00 A1". It's two non-zero 
bytes.
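Concretely, Don's point can be checked with a minimal sketch:

```d
void main()
{
    // "\u00A1" stores the UTF-8 encoding of code point U+00A1, which is
    // the two bytes 0xC2 0xA1 - not the single byte 0xA1 that x"A1"
    // would denote.
    string u = "\u00A1";
    assert(u.length == 2);
    assert(u[0] == 0xC2 && u[1] == 0xA1);
}
```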


Re: Regarding hex strings

2012-10-19 Thread foobar

On Friday, 19 October 2012 at 13:19:09 UTC, Don Clugston wrote:


We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup


That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4


Come on, "assuming the code points are valid". It says so 4 lines 
above!


Re: Regarding hex strings

2012-10-19 Thread Don Clugston

On 18/10/12 17:43, foobar wrote:

On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:

On 18/10/12 10:58, foobar wrote:

On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:

On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]

hex strings are useful, but I think they were invented in D1 when
strings were convertible to char[]. But today they are an array of
immutable UTF-8, so I think this default type is not so useful:

void main() {
    string data1 = x"A1 B2 C3 D4"; // OK
    immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression
("\xa1\xb2\xc3\xd4") of type string to ubyte[]

[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted. If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[],
NOT string.


T


IMO, this is a redundant feature that complicates the language for no
benefit and should be deprecated.
strings already have an escape sequence for specifying code-points "\u"
and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1, 0xB2, 0xC3, 0xD4];

So basically this feature gains us nothing.


That is not the same. Array literals are not the same as string
literals, they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have
died without a resolution; people got hung up about trailing null
characters without apparently noticing the more important issue of the
dup).


I don't see how that detail is relevant to this discussion as I was not
arguing against string literals or array literals in general.

We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup


That doesn't compile.
Error: escape hex sequence has 2 hex digits instead of 4


Re: Regarding hex strings

2012-10-19 Thread foobar

On Friday, 19 October 2012 at 00:14:18 UTC, Nick Sabalausky wrote:

On Thu, 18 Oct 2012 12:11:13 +0200
"foobar"  wrote:


How often are large binary blobs literally spelled in the 
source code (as opposed to just being read from a file)?



Frequency isn't the issue. The issues are "*Is* it ever needed?" 
and "When it is needed, is it useful enough?" The answer to both 
is most certainly "yes". (Remember, D is supposed to be usable 
as a systems language, it's not merely a high-level-app-only 
language.)


Any real-world use cases to support this claim? Does C++ have 
such a feature?
My limited experience with kernels is that this feature is not 
needed. The solution we used for this was to define an extern 
symbol and load it with a linker script (the binary data was of 
course stored in separate files).




Keep in mind, the question "Does it pull its own weight?" is for 
adding new features, not for going around gutting the language 
just because we can.


Ok, I grant you that, but remember that the whole thread started 
because the feature _doesn't_ work, so let's rephrase - is it 
worth the effort to fix this feature?




In any case, I'm not opposed to such a utility library, in 
fact I think it's a rather good idea and we already have a 
precedent with "oct!"
I just don't think this belongs as a built-in feature in the 
language.


I think monarch_dodra's test proves that it definitely needs to 
be built-in.


It proves that DMD has bugs that should be fixed, nothing more.


Re: Regarding hex strings

2012-10-18 Thread Jonathan M Davis
On Friday, October 19, 2012 07:29:46 Marco Leise wrote:
> On Thu, 18 Oct 2012 21:03:01 -0700, Jonathan M Davis wrote:
> > On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
> > > Memory just isn't freed!
> > 
> > That was my understanding, but the last time that I said that, Brad
> > Roberts
> > said that it wasn't true, and that we should stop spreading that FUD, so I
> > don't know what the exact situation is, but it sounds like if that was
> > true in the past, it's not true now. Regardless, it's clear that dmd
> > still uses too much memory in many cases, especially when code uses a lot
> > of templates or CTFE.
> > 
> > - Jonathan M Davis
> 
> He called it FUD?

I don't think that he used quite that term, but his point was that I shouldn't 
be saying that, because it wasn't true, and so I was spreading incorrect 
information (that and the fact that he was tired of people spreading that 
incorrect information, IIRC). I can't find the exact post at the moment though.

> I guess we can meet somewhere in the middle. Btw. did
> I mix up Don and Brad in the last post ? Who is working on the
> memory management ?

I don't think that you mixed anyone up. Don works primarily on CTFE. Brad 
works primarily on the auto tester and other infrastructure required for the 
dmd/Phobos folks to do what they do.

- Jonathan M Davis


Re: Regarding hex strings

2012-10-18 Thread Marco Leise
On Thu, 18 Oct 2012 21:03:01 -0700, Jonathan M Davis wrote:

> On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
> > Memory just isn't freed!
> 
> That was my understanding, but the last time that I said that, Brad Roberts 
> said that it wasn't true, and that we should stop spreading that FUD, so I 
> don't know what the exact situation is, but it sounds like if that was true 
> in 
> the past, it's not true now. Regardless, it's clear that dmd still uses too 
> much memory in many cases, especially when code uses a lot of templates or 
> CTFE.
> 
> - Jonathan M Davis

He called it FUD? Without trying to sound too patronizing, most D
programmers would really only notice DMD's memory footprint
when they use CTFE features. It is always Pegged, ctRegex, etc.
that make the issue come up, never basic code. And preloading
the Boehm collector showed that gigabytes of CTFE memory usage
can still be brought down to a few hundred MB [citation
needed]. I guess we can meet somewhere in the middle. Btw. did
I mix up Don and Brad in the last post ? Who is working on the
memory management ?

-- 
Marco



Re: Regarding hex strings

2012-10-18 Thread Jonathan M Davis
On Friday, October 19, 2012 05:14:44 Marco Leise wrote:
> Hehe, I assume most of the regulars know this: DMD used to
> use a garbage collector that is disabled.

Yes, but it didn't use it for long, because it made performance worse, and 
Walter didn't have the time to spend fixing it, so it was disabled. Presumably, 
someone will take the time to improve it at some point and then it will be re-
enabled.

> Memory just isn't freed!

That was my understanding, but the last time that I said that, Brad Roberts 
said that it wasn't true, and that we should stop spreading that FUD, so I 
don't know what the exact situation is, but it sounds like if that was true in 
the past, it's not true now. Regardless, it's clear that dmd still uses too 
much memory in many cases, especially when code uses a lot of templates or 
CTFE.

- Jonathan M Davis


Re: Regarding hex strings

2012-10-18 Thread Marco Leise
On Thu, 18 Oct 2012 16:31:57 +0200, "monarch_dodra" wrote:

> On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:
> > monarch_dodra:
> >
> >> hex! was a very good idea actually, imo.
> >
> > It must scale up to "real world" usages. Try it with a program
> > composed of 3 modules each one containing a 100 KB long string.
> > Then try it with a program with two hundred of medium sized
> > literals, and let's see compilation times and binary sizes.
> >
> > Bye,
> > bearophile
> 
> Hum... The compilation is pretty fast actually, about 1 second, 
> provided it doesn't choke.
> 
> It works for strings up to a length of 400 lines @ 80 chars per 
> line, which results in approximately 16K of data. After that, I 
> get a DMD out of memory error.
> 
> DMD memory usage spikes quite quickly. To compile those 400 lines 
> (16K), I use 800MB of memory (!). If I reach about 1GB, then it 
> crashes.
> 
> I tried using a refAppender instead of ret~, but that changed 
> nothing.
> 
> Kind of weird it would use that much memory though...
> 
> Also, the memory doesn't get released. I can parse a 1x400-line 
> string, but if I try to parse 3 of them, DMD will choke on the 
> second one. :(

Hehe, I assume most of the regulars know this: DMD used to
use a garbage collector that is disabled. Memory just isn't
freed! Also it has copy-on-write semantics during CTFE:

int bug6498(int x)
{
int n = 0;
while (n < x)
++n;
return n;
}
static assert(bug6498(10_000_000)==10_000_000);

--> Fails with an 'out of memory' error.

http://d.puremagic.com/issues/show_bug.cgi?id=6498

So, as strange as it sounds, for now try not to write often or
into large blocks. Using this knowledge I was sometimes able
to bring down the memory consumption considerably by caching
recurring concatenations of two strings or to!string calls.

That said, appending single elements to an array may actually
be better than using a fixed-size one and having DMD duplicate
it on every write. :p
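For what it's worth, here is a minimal sketch of the caching trick mentioned above (the function and names are hypothetical, not from any real project): memoize `to!string` results so a recurring conversion is only performed once inside a CTFE loop.

```d
import std.conv : to;

// Hypothetical sketch: cache to!string results instead of recomputing
// them on every loop pass, mirroring the memory-saving trick above.
string table(int[] ids)
{
    string[int] cache;                 // memoized conversions
    string s;
    foreach (id; ids)
    {
        auto p = id in cache;
        string txt = p ? *p : (cache[id] = to!string(id));
        s ~= txt ~ "," ~ txt ~ "\n";   // each id's text is built only once
    }
    return s;
}

static assert(table([7, 7, 42]) == "7,7\n7,7\n42,42\n");
```

The `static assert` forces the function through CTFE, which is exactly where the duplicated allocations hurt.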

Please remember to give Don a cookie when he manages to change
the compiler to modify in-place where appropriate.

-- 
Marco



Re: Regarding hex strings

2012-10-18 Thread Nick Sabalausky
On Thu, 18 Oct 2012 12:11:13 +0200
"foobar"  wrote:
> 
> How often are large binary blobs literally spelled in the source 
> code (as opposed to just being read from a file)?


Frequency isn't the issue. The issues are "*Is* it ever needed?" and
"When it is needed, is it useful enough?" The answer to both is most
certainly "yes". (Remember, D is supposed to be usable as a systems
language, it's not merely a high-level-app-only language.)

Keep in mind, the question "Does it pull its own weight?" is for
adding new features, not for going around gutting the language
just because we can.

> In any case, I'm not opposed to such a utility library, in fact I 
> think it's a rather good idea and we already have a precedent 
> with "oct!"
> I just don't think this belongs as a built-in feature in the 
> language.

I think monarch_dodra's test proves that it definitely needs to be
built-in.



Re: Regarding hex strings

2012-10-18 Thread bearophile

Nick Sabalausky:


Big +1

Having the language expect x"..." to always be a string (let 
alone a *valid UTF* string) is just insane. It's just too damn 
useful for arbitrary binary data.


I'd like an opinion on such topics from one of the D bosses 
:-)


Bye,
bearophile


Re: Regarding hex strings

2012-10-18 Thread Nick Sabalausky
On Wed, 17 Oct 2012 19:49:43 -0700
"H. S. Teoh"  wrote:

> On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
> [...]
> > hex strings are useful, but I think they were invented in D1 when
> > strings were convertible to char[]. But today they are an array of
> > immutable UTF-8, so I think this default type is not so useful:
> > 
> > void main() {
> > string data1 = x"A1 B2 C3 D4"; // OK
> > immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
> > }
> > 
> > 
> > test.d(3): Error: cannot implicitly convert expression
> > ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
> [...]
> 
> Yeah I think hex strings would be better as ubyte[] by default.
> 
> More generally, though, I think *both* of the above lines should be
> equally accepted.  If you write x"A1 B2 C3" in the context of
> initializing a string, then the compiler should infer the type of the
> literal as string, and if the same literal occurs in the context of,
> say, passing a ubyte[], then its type should be inferred as ubyte[],
> NOT string.
> 

Big +1

Having the language expect x"..." to always be a string (let alone a
*valid UTF* string) is just insane. It's just too damn useful for
arbitrary binary data.




Re: Regarding hex strings

2012-10-18 Thread Jonathan M Davis
On Thursday, October 18, 2012 21:09:14 Kagamin wrote:
> Your keyboard doesn't have ready unicode values for all
> characters either.

So? That doesn't make it so that it's not valuable to be able to input the 
values in hexadecimal instead of as actual unicode characters. Heck, if you 
want a specific character, I wouldn't trust copying the characters anyway, 
because it's far too easy to have two characters which look really similar but 
are different (e.g. there are multiple types of angle brackets in unicode), 
whereas with the numbers you can be sure. And with some characters (e.g. 
unicode whitespace characters), it generally doesn't make sense to enter the 
characters directly.
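As a small illustration of the look-alike problem (the two code points below are just an example pair I picked, not from the original post): U+2329 and U+3008 render almost identically, and only the numeric escapes make it obvious which one the source actually contains.

```d
// Two visually similar "angle bracket" code points, written as escapes
// so the difference is visible in the source text itself.
enum a = "\u2329"; // LEFT-POINTING ANGLE BRACKET
enum b = "\u3008"; // LEFT ANGLE BRACKET (CJK)
static assert(a != b); // they compare unequal despite looking alike
```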

Regardless, my point is that both approaches can be useful, so it's good to be 
able to do both. If you prefer to put the unicode characters in directly, then 
do that, but others may prefer the other way. Personally, I've done both.

- Jonathan M Davis


Re: Regarding hex strings

2012-10-18 Thread Kagamin
Your keyboard doesn't have ready unicode values for all 
characters either.


Re: Regarding hex strings

2012-10-18 Thread Jonathan M Davis
On Thursday, October 18, 2012 15:56:50 Kagamin wrote:
> On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
> > Have you actually ever written code that requires using code
> > points? This feature is a *huge* convenience for when you do.
> > Just compare:
> > 
> > string nihongo1 = x"e697a5 e69cac e8aa9e";
> > string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
> > ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8,
> > 0xaa, 0x9e];
> 
> You should use unicode directly here; that's the whole point of
> supporting it.
> string nihongo = "日本語";

It's a nice feature, but there are plenty of cases where it makes more sense 
to use the unicode values rather than the characters themselves (e.g. your 
keyboard doesn't have the characters in question). It's valuable to be able to 
do it both ways.

- Jonathan M Davis


Re: Regarding hex strings

2012-10-18 Thread foobar

On Thursday, 18 October 2012 at 14:29:57 UTC, Don Clugston wrote:

On 18/10/12 10:58, foobar wrote:

On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:

On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
hex strings are useful, but I think they were invented in D1 when
strings were convertible to char[]. But today they are an array of
immutable UTF-8, so I think this default type is not so useful:

void main() {
   string data1 = x"A1 B2 C3 D4"; // OK
   immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}

test.d(3): Error: cannot implicitly convert expression
("\xa1\xb2\xc3\xd4") of type string to ubyte[]

[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted.  If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[],
NOT string.


T


IMO, this is a redundant feature that complicates the language for no
benefit and should be deprecated.
strings already have an escape sequence for specifying code-points "\u"
and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.


That is not the same. Array literals are not the same as string 
literals; they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to 
have died without a resolution; people got hung up on trailing 
null characters without apparently noticing the more important 
issue of the dup).


I don't see how that detail is relevant to this discussion as I 
was not arguing against string literals or array literals in 
general.


We can still have both (assuming the code points are valid...):
string foo = "\ua1\ub2\uc3"; // no .dup
and:
ubyte[3] goo = [0xa1, 0xb2, 0xc3]; // implicit .dup


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 13:15:55 UTC, bearophile wrote:

monarch_dodra:


hex! was a very good idea actually, imo.


It must scale up to "real world" usages. Try it with a program
composed of 3 modules each one containing a 100 KB long string.
Then try it with a program with two hundred medium-sized
literals, and let's see compilation times and binary sizes.

Bye,
bearophile


Hum... The compilation is pretty fast actually, about 1 second, 
provided it doesn't choke.


It works for strings up to a length of 400 lines @ 80 chars per 
line, which results in approximately 16K of data. After that, I 
get a DMD out of memory error.


DMD memory usage spikes quite quickly. To compile those 400 lines 
(16K), I use 800MB of memory (!). If I reach about 1GB, then it 
crashes.


I tried using a refAppender instead of ret~, but that changed 
nothing.


Kind of weird it would use that much memory though...

Also, the memory doesn't get released. I can parse a 1x400-line 
string, but if I try to parse 3 of them, DMD will choke on the 
second one. :(


Re: Regarding hex strings

2012-10-18 Thread Don Clugston

On 18/10/12 10:58, foobar wrote:

On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:

On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]

hex strings are useful, but I think they were invented in D1 when
strings were convertible to char[]. But today they are an array of
immutable UTF-8, so I think this default type is not so useful:

void main() {
string data1 = x"A1 B2 C3 D4"; // OK
immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression
("\xa1\xb2\xc3\xd4") of type string to ubyte[]

[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted.  If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
string.


T


IMO, this is a redundant feature that complicates the language for no
benefit and should be deprecated.
strings already have an escape sequence for specifying code-points "\u"
and for ubyte arrays you can simply use:
immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.


That is not the same. Array literals are not the same as string 
literals; they have an implicit .dup.
See my recent thread on this issue (which unfortunately seems to have 
died without a resolution; people got hung up on trailing null 
characters without apparently noticing the more important issue of the dup).




Re: Regarding hex strings

2012-10-18 Thread Kagamin

On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:
Have you actually ever written code that requires using code 
points? This feature is a *huge* convenience for when you do. 
Just compare:


string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
0xaa, 0x9e];


You should use unicode directly here; that's the whole point of 
supporting it.

string nihongo = "日本語";


Re: Regarding hex strings

2012-10-18 Thread bearophile

monarch_dodra:


hex! was a very good idea actually, imo.


It must scale up to "real world" usages. Try it with a program
composed of 3 modules each one containing a 100 KB long string.
Then try it with a program with two hundred medium-sized
literals, and let's see compilation times and binary sizes.

Bye,
bearophile


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 11:26:13 UTC, monarch_dodra wrote:


NOT a final version.


With more correct UTF string support. In theory, non-ASCII 
characters are illegal, but checking makes for safer code and 
better diagnostics.


//
ubyte[] decode(string s)
{
ubyte[] ret;
while(s.length)
{
while( s.front == ' ' || s.front == '_' )
{
s.popFront();
if (!s.length) assert(0, text("Premature end of string."));

}

dchar c1 = s.front;
if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, "."));

c1 = std.ascii.toUpper(c1);

s.popFront();
if (!s.length) assert(0, text("Premature end of string after ", c1, "."));


dchar c2 = s.front;
if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " after ", c1, "."));

c2 = std.ascii.toUpper(c2);
s.popFront();

ubyte val;
if('0' <= c2 && c2 <= '9') val += (c2 - '0');
if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
ret ~= val;
}
return ret;
}
//


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 11:24:04 UTC, monarch_dodra wrote:
hex! was a very good idea actually, imo. I'll post my current 
impl in the next post.




//
import std.stdio;
import std.conv;
import std.ascii;


template hex(string s)
{
enum hex = decode(s);
}


template hex(ulong ul)
{
enum hex = decode(ul);
}

ubyte[] decode(string s)
{
ubyte[] ret;
size_t p;
while(p < s.length)
{
while( s[p] == ' ' || s[p] == '_' )
{
++p;
if (p == s.length) assert(0, text("Premature end of string at index ", p, "."));

}

char c1 = s[p];
if (!std.ascii.isHexDigit(c1)) assert(0, text("Unexpected character ", c1, " at index ", p, "."));

c1 = cast(char)std.ascii.toUpper(c1);

++p;
if (p == s.length) assert(0, text("Premature end of string after ", c1, "."));


char c2 = s[p];
if (!std.ascii.isHexDigit(c2)) assert(0, text("Unexpected character ", c2, " at index ", p, "."));

c2 = cast(char)std.ascii.toUpper(c2);
++p;


ubyte val;
if('0' <= c2 && c2 <= '9') val += (c2 - '0');
if('A' <= c2 && c2 <= 'F') val += (c2 - 'A' + 10);
if('0' <= c1 && c1 <= '9') val += ((c1 - '0')*16);
if('A' <= c1 && c1 <= 'F') val += ((c1 - 'A' + 10)*16);
ret ~= val;
}
return ret;
}

ubyte[] decode(ulong ul)
{
//NOTE: This is not efficient AT ALL (push front),
//but it is CTFE, so we can live with it for now ^^
//I'll optimize it if I try to push it
ubyte[] ret;
while(ul)
{
ubyte t = ul%256;
ret = t ~ ret;
ul /= 256;
}
return ret;
}
//

NOT a final version.


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 10:39:46 UTC, monarch_dodra wrote:


Yeah, that makes sense too. I'll try to toy around on my end 
and see if I can write a "hex".


That was actually relatively easy!

Here is some usecase:

//
void main()
{
enum a = hex!"01 ff 7f";
enum b = hex!0x01_ff_7f;
ubyte[] c = hex!"0123456789abcdef";
immutable(ubyte)[] bearophile1 = hex!"A1 B2 C3 D4";
immutable(ubyte)[] bearophile2 = hex!0xA1_B2_C3_D4;

a.writeln();
b.writeln();
c.writeln();
bearophile1.writeln();
bearophile2.writeln();
}
//

And corresponding output:

//
[1, 255, 127]
[1, 255, 127]
[1, 35, 69, 103, 137, 171, 205, 239]
[161, 178, 195, 212]
[161, 178, 195, 212]
//

hex! was a very good idea actually, imo. I'll post my current 
impl in the next post.


That said, I don't know if I'd deprecate x"", as it serves a 
different role, as you have already pointed out, in that it 
*will* validate the code points.


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 10:17:06 UTC, foobar wrote:

On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:

On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:

The docs say:
http://dlang.org/lex.html

Hex strings allow string literals to be created using hex 
data. The hex data need not form valid UTF characters.<




This is especially a good reason to remove this feature as it 
breaks the principle of least surprise and I consider it a 
major bug, not a feature.


I expect D's strings, which are by definition Unicode, to _only_ 
ever allow _valid_ Unicode. It makes no sense whatsoever to 
allow this nasty back-door. Other text encodings should be 
either stored and treated as binary data (ubyte[]) or, better 
yet, stored in their own types that will ensure those encodings' 
invariants.


Yeah, that makes sense too. I'll try to toy around on my end and 
see if I can write a "hex".


Re: Regarding hex strings

2012-10-18 Thread foobar

On Thursday, 18 October 2012 at 10:11:14 UTC, foobar wrote:

On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:

The docs say:
http://dlang.org/lex.html

Hex strings allow string literals to be created using hex 
data. The hex data need not form valid UTF characters.<




This is especially a good reason to remove this feature as it 
breaks the principle of least surprise and I consider it a major 
bug, not a feature.


I expect D's strings, which are by definition Unicode, to _only_ 
ever allow _valid_ Unicode. It makes no sense whatsoever to 
allow this nasty back-door. Other text encodings should be either 
stored and treated as binary data (ubyte[]) or, better yet, stored 
in their own types that will ensure those encodings' invariants.


Re: Regarding hex strings

2012-10-18 Thread foobar

On Thursday, 18 October 2012 at 10:05:06 UTC, bearophile wrote:

The docs say:
http://dlang.org/lex.html

Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters.<


But this code:


void main() {
immutable ubyte[4] data = x"F9 04 C1 E2";
}



Gives me:

temp.d(2): Error: Outside Unicode code space

Are the docs correct?

--

foobar:

Seems to me this is in the same ballpark as the built-in 
complex numbers. Sure it's nice to be able to write "4+5i" 
instead of "complex(4,5)" but how frequently do you actually 
ever need the _literals_ even in complex computational heavy 
code?


Compared to "oct!5151151511", one problem with code like this 
is that binary blobs are sometimes large, so supporting a x"" 
syntax is better:


immutable ubyte[4] data = hex!"F9 04 C1 E2";

Bye,
bearophile


How often are large binary blobs literally spelled in the source 
code (as opposed to just being read from a file)?
In any case, I'm not opposed to such a utility library, in fact I 
think it's a rather good idea and we already have a precedent 
with "oct!"
I just don't think this belongs as a built-in feature in the 
language.


Re: Regarding hex strings

2012-10-18 Thread foobar

On Thursday, 18 October 2012 at 09:42:43 UTC, monarch_dodra wrote:

On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:


IMO, this is a redundant feature that complicates the language 
for no benefit and should be deprecated.
strings already have an escape sequence for specifying 
code-points "\u" and for ubyte arrays you can simply use:

immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.


Have you actually ever written code that requires using code 
points? This feature is a *huge* convenience for when you do. 
Just compare:


string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
0xaa, 0x9e];


BTW, your data2 doesn't compile.


I didn't try to compile it :) I just rewrote bearophile's example 
with 0x prefixes.


How often do you actually need to write code-point _literals_ in 
your code?
I'm not arguing that it isn't convenient. My question would 
rather be Andrei's "does it pull its own weight?", meaning: is 
the added complexity in the language, and having more than one 
way of doing something, worth that convenience?


Seems to me this is in the same ballpark as the built-in complex 
numbers. Sure it's nice to be able to write "4+5i" instead of 
"complex(4,5)" but how frequently do you actually ever need the 
_literals_ even in complex computational heavy code?


Re: Regarding hex strings

2012-10-18 Thread bearophile

The docs say:
http://dlang.org/lex.html

Hex strings allow string literals to be created using hex data. 
The hex data need not form valid UTF characters.<


But this code:


void main() {
immutable ubyte[4] data = x"F9 04 C1 E2";
}



Gives me:

temp.d(2): Error: Outside Unicode code space

Are the docs correct?

--

foobar:

Seems to me this is in the same ballpark as the built-in 
complex numbers. Sure it's nice to be able to write "4+5i" 
instead of "complex(4,5)" but how frequently do you actually 
ever need the _literals_ even in complex computational heavy 
code?


Compared to "oct!5151151511", one problem with code like this is 
that binary blobs are sometimes large, so supporting a x"" syntax 
is better:


immutable ubyte[4] data = hex!"F9 04 C1 E2";

Bye,
bearophile


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 00:45:12 UTC, bearophile wrote:

(Repost)

hex strings are useful, but I think they were invented in D1 
when strings were convertible to char[]. But today they are an 
array of immutable UTF-8, so I think this default type is not 
so useful:


void main() {
string data1 = x"A1 B2 C3 D4"; // OK
immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression 
("\xa1\xb2\xc3\xd4") of type string to ubyte[]


[SNIP]

Bye,
bearophile


The conversion can't be done *implicitly*, but you can still get 
your code to compile:


//
void main() {
immutable(ubyte)[] data2 =
cast(immutable(ubyte)[]) x"A1 B2 C3 D4"; // OK!
}
//

It's a bit ugly, and I agree it should work natively, but it is a 
workaround.
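The cast can also be tucked away in a tiny helper so call sites stay clean (the `bytes` name below is made up for illustration, not an existing API):

```d
// Hypothetical helper hiding the cast from an x"" literal to ubyte[].
immutable(ubyte)[] bytes(string s) pure nothrow
{
    return cast(immutable(ubyte)[]) s; // reinterpret the UTF-8 payload
}

void main()
{
    immutable(ubyte)[] data2 = bytes(x"A1 B2 C3 D4");
    immutable(ubyte)[] expected = [0xA1, 0xB2, 0xC3, 0xD4];
    assert(data2 == expected);
}
```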


Re: Regarding hex strings

2012-10-18 Thread monarch_dodra

On Thursday, 18 October 2012 at 08:58:57 UTC, foobar wrote:


IMO, this is a redundant feature that complicates the language 
for no benefit and should be deprecated.
strings already have an escape sequence for specifying 
code-points "\u" and for ubyte arrays you can simply use:

immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.


Have you actually ever written code that requires using code 
points? This feature is a *huge* convenience for when you do. 
Just compare:


string nihongo1 = x"e697a5 e69cac e8aa9e";
string nihongo2 = "\ue697a5\ue69cac\ue8aa9e";
ubyte[] nihongo3 = [0xe6, 0x97, 0xa5, 0xe6, 0x9c, 0xac, 0xe8, 
0xaa, 0x9e];


BTW, your data2 doesn't compile.


Re: Regarding hex strings

2012-10-18 Thread foobar

On Thursday, 18 October 2012 at 02:47:42 UTC, H. S. Teoh wrote:

On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
hex strings are useful, but I think they were invented in D1 when
strings were convertible to char[]. But today they are an array of
immutable UTF-8, so I think this default type is not so useful:

void main() {
string data1 = x"A1 B2 C3 D4"; // OK
immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression
("\xa1\xb2\xc3\xd4") of type string to ubyte[]

[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted.  If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[],
NOT string.


T


IMO, this is a redundant feature that complicates the language 
for no benefit and should be deprecated.
strings already have an escape sequence for specifying 
code-points "\u" and for ubyte arrays you can simply use:

immutable(ubyte)[] data2 = [0xA1 0xB2 0xC3 0xD4];

So basically this feature gains us nothing.



Re: Regarding hex strings

2012-10-17 Thread H. S. Teoh
On Thu, Oct 18, 2012 at 02:45:10AM +0200, bearophile wrote:
[...]
> hex strings are useful, but I think they were invented in D1 when
> strings were convertible to char[]. But today they are an array of
> immutable UFT-8, so I think this default type is not so useful:
> 
> void main() {
> string data1 = x"A1 B2 C3 D4"; // OK
> immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
> }
> 
> 
> test.d(3): Error: cannot implicitly convert expression
> ("\xa1\xb2\xc3\xd4") of type string to ubyte[]
[...]

Yeah I think hex strings would be better as ubyte[] by default.

More generally, though, I think *both* of the above lines should be
equally accepted.  If you write x"A1 B2 C3" in the context of
initializing a string, then the compiler should infer the type of the
literal as string, and if the same literal occurs in the context of,
say, passing a ubyte[], then its type should be inferred as ubyte[], NOT
string.


T

-- 
Who told you to swim in Crocodile Lake without life insurance??


Regarding hex strings

2012-10-17 Thread bearophile

(Repost)

hex strings are useful, but I think they were invented in D1 when 
strings were convertible to char[]. But today they are an array 
of immutable UTF-8, so I think this default type is not so useful:


void main() {
string data1 = x"A1 B2 C3 D4"; // OK
immutable(ubyte)[] data2 = x"A1 B2 C3 D4"; // error
}


test.d(3): Error: cannot implicitly convert expression 
("\xa1\xb2\xc3\xd4") of type string to ubyte[]



Generally I want to use hex strings to put binary data in a 
program, so usually it's a ubyte[] or uint[].


So I have to use something like:

auto data3 = cast(ubyte[])(x"A1 B2 C3 D4".dup);


So maybe the following literals are more useful in D2:

ubyte[] data4 = x[A1 B2 C3 D4];
uint[]  data5 = x[A1 B2 C3 D4];
ulong[] data6 = x[A1 B2 C3 D4 A1 B2 C3 D4];

Bye,
bearophile
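[Editor's note: lacking such typed literals, something close to data4/data5/data6 can already be approximated with a CTFE-able template in the spirit of the hex! experiments earlier in the thread. A rough sketch; the hexArray name and the big-endian packing are my own assumptions, not an existing API:]

```d
// Hypothetical sketch: parse space-separated hex digits into bytes,
// then pack T.sizeof bytes per element (big-endian) at compile time.
T[] hexArray(T)(string s)
{
    ubyte[] bytes;
    ubyte cur;
    int half;
    foreach (c; s)
    {
        if (c == ' ') continue;
        // '0'-'9' map to 0-9; letters are lowercased and map to 10-15
        ubyte v = cast(ubyte)(c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10);
        cur = cast(ubyte)((cur << 4) | v);
        if (++half == 2) { bytes ~= cur; cur = 0; half = 0; }
    }
    T[] ret;
    foreach (i; 0 .. bytes.length / T.sizeof)
    {
        T acc = 0;
        foreach (j; 0 .. T.sizeof)
            acc = cast(T)((acc << 8) | bytes[i * T.sizeof + j]);
        ret ~= acc;
    }
    return ret;
}

enum ubyte[] data4 = hexArray!ubyte("A1 B2 C3 D4");
enum uint[]  data5 = hexArray!uint("A1 B2 C3 D4");
static assert(data4 == [0xA1, 0xB2, 0xC3, 0xD4]);
static assert(data5 == [0xA1B2C3D4u]);
```

Whether such a template scales to the multi-hundred-KB literals discussed above is exactly the open question raised in this thread.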