LLVM IR influence on compiler debugging

2012-06-28 Thread bearophile

This is a very easy-to-read article about the design of LLVM:
http://www.drdobbs.com/architecture-and-design/the-design-of-llvm/240001128

It explains what the IR is:

The most important aspect of its design is the LLVM Intermediate 
Representation (IR), which is the form it uses to represent code 
in the compiler. LLVM IR [...] is itself defined as a first 
class language with well-defined semantics.<


In particular, LLVM IR is both well specified and the only 
interface to the optimizer. This property means that all you 
need to know to write a front end for LLVM is what LLVM IR is, 
how it works, and the invariants it expects. Since LLVM IR has a 
first-class textual form, it is both possible and reasonable to 
build a front end that outputs LLVM IR as text, then uses UNIX 
pipes to send it through the optimizer sequence and code 
generator of your choice. It might be surprising, but this is 
actually a pretty novel property to LLVM and one of the major 
reasons for its success in a broad range of different 
applications. Even the widely successful and relatively 
well-architected GCC compiler does not have this property: its 
GIMPLE mid-level representation is not a self-contained 
representation.<


That IR has a great effect on making it simpler to debug the 
compiler. I think this is important (and it partially 
explains why Clang was created so quickly):


Compilers are very complicated, and quality is important, 
therefore testing is critical. For example, after fixing a bug 
that caused a crash in an optimizer, a regression test should be 
added to make sure it doesn't happen again. The traditional 
approach to testing this is to write a .c file (for example) 
that is run through the compiler, and to have a test harness 
that verifies that the compiler doesn't crash. This is the 
approach used by the GCC test suite, for example. The problem 
with this approach is that the compiler consists of many 
different subsystems and even many different passes in the 
optimizer, all of which have the opportunity to change what the 
input code looks like by the time it gets to the previously 
buggy code in question. If something changes in the front end or 
an earlier optimizer, a test case can easily fail to test what 
it is supposed to be testing. By using the textual form of LLVM 
IR with the modular optimizer, the LLVM test suite has highly 
focused regression tests that can load LLVM IR from disk, run it 
through exactly one optimization pass, and verify the expected 
behavior. Beyond crashing, a more complicated behavioral test 
wants to verify that an optimization is actually performed. 
[...] While this might seem like a really trivial example, this 
is very difficult to test by writing .c files: front ends often 
do constant folding as they parse, so it is very difficult and 
fragile to write code that makes its way downstream to a 
constant folding optimization pass. Because we can load LLVM IR 
as text and send it through the specific optimization pass we're 
interested in, then dump out the result as another text file, it 
is really straightforward to test exactly what we want, both for 
regression and feature tests.<


Bye,
bearophile
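
To make the workflow concrete, here is the kind of self-contained .ll 
regression test the article describes (a sketch assuming LLVM's opt and 
FileCheck tools; the pass name and RUN-line syntax are from that era of 
LLVM):

; RUN: opt < %s -instcombine -S | FileCheck %s
; A test focused on exactly one pass: instcombine should fold the add of zero.
define i32 @test(i32 %x) {
  %r = add i32 %x, 0
  ret i32 %r
; CHECK: @test
; CHECK: ret i32 %x
}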


Re: D runtime in os x dylib

2012-06-28 Thread John Colvin

On Wednesday, 27 June 2012 at 10:15:37 UTC, Jacob Carlborg wrote:

On 2012-06-27 02:35, John Colvin wrote:

On Tuesday, 26 June 2012 at 18:31:34 UTC, John Colvin wrote:

On Monday, 25 June 2012 at 18:55:22 UTC, Jacob Carlborg wrote:


Are you using the latest sources of DMD?


Yes I am.


sorry, my mistake, I replaced the wrong dmd file. It compiles 
fine with the up-to-date dmd.


Good, is that working any better?


Yes, it appears the bug is fixed in the current druntime. Thanks 
for your help.


Re: Raw binary(to work without OS) in D

2012-06-28 Thread bearophile

Paulo Pinto:

The huge amount of documentation made available at BUILD time, 
plus Windows 8 to play with?


WinRT (aka Metro) is COM based.

Instead of me dumping a list of links, what do you want to know 
exactly?


OK. No need for a list.

Bye,
bearophile


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Sean Kelly
On Jun 28, 2012, at 9:45 AM, Iain Buclaw wrote:
> 
> Wouldn't it be useful if the compiler had diagnostics for all implicit
> allocations it makes (i.e. closures, array literals)?  Similar to that
> of the -vtls switch. These are things you may want to avoid in a
> freestanding environment (no link to the C runtime).

Yes it would.  I guess the question is how to expose this.  Generally speaking 
though, array append type operations allocate, AA insertions allocate, and 
non-scope delegate use allocates.  I think that's it.
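
For reference, a small D sketch of the implicit allocations being listed 
here (what exactly allocates can vary by compiler and flags; this is just 
the usual set):

import std.stdio;

void main()
{
    int[] a = [1, 2, 3];        // array literal: allocated on the GC heap
    a ~= 4;                     // append: may (re)allocate
    int[string] aa;
    aa["answer"] = 42;          // AA insertion: allocates
    int captured = 10;
    auto dg = (int y) => y + captured; // non-scope delegate capturing a local:
                                       // a closure is typically allocated
    writeln(a, " ", aa, " ", dg(1));
}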

Re: Raw binary(to work without OS) in D

2012-06-28 Thread Paulo Pinto

On Thursday, 28 June 2012 at 13:39:54 UTC, bearophile wrote:

Paulo Pinto:


as Metro is also native code.


Are you sure? Do you have a reference on this?

Bye,
bearophile


The huge amount of documentation made available at BUILD time, 
plus Windows 8 to play with?


WinRT (aka Metro) is COM based.

Instead of me dumping a list of links, what do you want to know 
exactly?


--
Paulo


Re: Raw binary(to work without OS) in D

2012-06-28 Thread David Nadlinger
On Thursday, 28 June 2012 at 14:35:24 UTC, Andrei Alexandrescu 
wrote:

On 6/28/12 10:07 AM, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:
I think just exposing them via .sig and .exp might be the way 
to go?


sig is easy to confuse with sign


.mantissa and .exp


.exp might potentially lead to confusion with std.math.exp now 
that UFCS is in place. Not that this would be a huge deal, but since 
the new property would be used only very rarely anyway, I don't think 
going for a longer name like .exponent or even 
rawExponent/exponentBits/… would be a problem.


David
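
A tiny illustration of the clash mentioned above: with UFCS, x.exp already 
has a meaning today, so a hypothetical built-in .exp property would compete 
with it (sketch only; no such property exists):

import std.math : exp;

void main()
{
    double x = 1.0;
    auto y = x.exp;   // via UFCS this is std.math.exp(x), i.e. e^1
    assert(y > 2.71 && y < 2.72);
}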


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Iain Buclaw
On 22 June 2012 07:35, Walter Bright  wrote:
> On 6/21/2012 11:07 PM, Mehrdad wrote:
>>
>> On Thursday, 21 June 2012 at 19:44:40 UTC, Walter Bright wrote:
>>>
>>> On 6/21/2012 9:39 AM, Alex wrote:

 is it possible to use D to write code to work without an OS?
 like I do it with gcc.
>>>
>>>
>>> Sure. But you'll need to thoroughly understand how the D runtime startup
>>> code
>>> works, so you can adjust as necessary.
>>
>>
>> Good luck getting the C-runtime part of the "D runtime" right..
>
>
> It's not that hard. But there's a lot of detail to learn & take care of.
>

Wouldn't it be useful if the compiler had diagnostics for all implicit
allocations it makes (i.e. closures, array literals)?  Similar to that
of the -vtls switch. These are things you may want to avoid in a
freestanding environment (no link to the C runtime).

-- 
Iain Buclaw

*(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: Raw binary(to work without OS) in D

2012-06-28 Thread David Nadlinger

On Thursday, 28 June 2012 at 15:28:10 UTC, Don Clugston wrote:
There's an oddity, though: the type of X.significand would be 
dependent on the type of X […]


I don't think this is a problem at all – for example, the type 
of T.init depends on T as well…


David
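
A two-line illustration of the parallel drawn here (the type of T.init 
already varies with T, just as a hypothetical X.significand would):

// The type of .init depends on the type it is asked of:
static assert(is(typeof(float.init) == float));
static assert(is(typeof(real.init)  == real));

void main() {}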


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Jens Mueller
Don Clugston wrote:
> On 28/06/12 17:00, Jens Mueller wrote:
> >Andrei Alexandrescu wrote:
> >>On 6/28/12 10:07 AM, Roman D. Boiko wrote:
> >>>On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:
> I think just exposing them via .sig and .exp might be the way to go?
> >>>
> >>>sig is easy to confuse with sign
> >>
> >>.mantissa and .exp
> >
> >Letting the compiler define these properties is a solution. I thought
> >Don is looking for something more general. But maybe this isn't needed
> >here. Don't know. But using mantissa should be discouraged.
> >I suggest calling them
> >.significand and .exponent
> >
> >significand is preferred over mantissa by IEEE FP committee. I think
> >it's fine to spell them out. There won't be much code using them anyway.
> >
> >Jens
> 
> 
> Yes, adding new properties would be the easiest way from a CTFE
> perspective; that way, they are endian-ness independent. It's a bit
> niche, but then again adding a special case for this in CTFE is
> niche as well. Maybe it would be the best approach.

Sounds good then.

> With naming, I'm inclined to agree, but the funny thing is that we
> have X.mant_dig as the number of digits in the significand.

You could add a deprecated alias to X.mant_dig and provide a new name.
We should adopt IEEE's vocabulary where possible.

> There's an oddity, though: the type of X.significand would be
> dependent on the type of X (and for the non-existent quadruple
> float, it would be the non-existent ucent type!)

But this is no problem, is it?

> Would it include the implicit bit of an 80-bit x87 real (the silly bit)?

Not sure what the silly bit is. You mean the bit that is implicitly
always 1, don't you? mant_dig says 24 for a float, which means it is
included when counting the bits. Then, for consistency, it should be included.

Jens
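
For reference, a quick check of the existing property discussed here; 
mant_dig counts the full significand width including the leading integer 
bit (values follow IEEE 754 binary32/binary64; real.mant_dig is 
platform-dependent):

import std.stdio;

static assert(float.mant_dig == 24);   // 23 stored fraction bits + 1 implicit bit
static assert(double.mant_dig == 53);  // 52 stored fraction bits + 1 implicit bit

void main()
{
    // On x86 this prints 64: the 80-bit x87 format stores its leading
    // integer bit explicitly, and mant_dig counts it too.
    writeln(real.mant_dig);
}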


Re: Productions users

2012-06-28 Thread Graham Fawcett
On Thursday, 28 June 2012 at 16:02:43 UTC, Jakob Bornecrantz 
wrote:

On Thursday, 28 June 2012 at 07:38:19 UTC, Andrea Fontana wrote:
In production it's just a way to say "completed, not still in 
pre-alpha/alpha/beta/testing phase". Usable. Working. Public :)


No difference between commercial, open source, free, etc ...


Software is never completed, only abandoned.

Also, a lot of software is being used by the general public but 
still has the Alpha/Beta tag. But I think "Usable. Working. 
Public" is a good definition.


Cheers, Jakob.


Let me suggest a three-part rephrase of the original question 
(because I'm personally interested in how people are using D 
lately, and less interested in the meta discussion!):


- What have you written in D lately that you're proud of?

- Who benefits from your program, and how?

- If your program is open-source, where do you publish the code?

If we post answers, perhaps someone would be gracious enough to 
collect them on the wiki.


Graham



Re: Productions users

2012-06-28 Thread Jakob Bornecrantz

On Thursday, 28 June 2012 at 07:38:19 UTC, Andrea Fontana wrote:
In production it's just a way to say "completed, not still in 
pre-alpha/alpha/beta/testing phase". Usable. Working. Public :)


No difference between commercial, open source, free, etc ...


Software is never completed, only abandoned.

Also, a lot of software is being used by the general public but 
still has the Alpha/Beta tag. But I think "Usable. Working. 
Public" is a good definition.


Cheers, Jakob.


Re: Raw binary(to work without OS) in D

2012-06-28 Thread bearophile

Don Clugston:

There's an oddity, though: the type of X.significand would be 
dependent on the type of X (and for the non-existent quadruple 
float, it would be the non-existent ucent type!)


But ucent is in the D specs, and I think its name is already 
somewhere in the compiler, while "quadruple" (or "qfloat") is not 
yet in the D specs:


http://en.wikipedia.org/wiki/Quadruple_precision

Bye,
bearophile


Re: standard ranges

2012-06-28 Thread Roman D. Boiko
My point (and the reason I somehow hijacked this thread) is that 
such functionality would be useful for random access over narrow 
strings. Currently random access is missing.


Also, this approach fits nicely if random access is needed for 
Unicode characters, not just code points.


I don't see much practical value in proposals to reconsider the 
current string implementation. Arguments have been presented that 
this improves consistency, but as Andrei replied, consistency 
itself has many dimensions. The proposed changes have not been 
described in detail, but from what I understood, they would make 
common use cases more verbose.


In contrast, I do see value in Select / Rank indexing. Does 
anybody agree?





Re: Raw binary(to work without OS) in D

2012-06-28 Thread Don Clugston

On 28/06/12 17:00, Jens Mueller wrote:

Andrei Alexandrescu wrote:

On 6/28/12 10:07 AM, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:

I think just exposing them via .sig and .exp might be the way to go?


sig is easy to confuse with sign


.mantissa and .exp


Letting the compiler define these properties is a solution. I thought
Don is looking for something more general. But maybe this isn't needed
here. Don't know. But using mantissa should be discouraged.
I suggest calling them
.significand and .exponent

significand is preferred over mantissa by IEEE FP committee. I think
it's fine to spell them out. There won't be much code using them anyway.

Jens



Yes, adding new properties would be the easiest way from a CTFE 
perspective; that way, they are endian-ness independent. It's a bit 
niche, but then again adding a special case for this in CTFE is niche as 
well. Maybe it would be the best approach.


With naming, I'm inclined to agree, but the funny thing is that we have 
X.mant_dig as the number of digits in the significand.


There's an oddity, though: the type of X.significand would be dependent 
on the type of X (and for the non-existent quadruple float, it would be 
the non-existent ucent type!)

Would it include the implicit bit of an 80-bit x87 real (the silly bit)?



Re: standard ranges

2012-06-28 Thread Roman D. Boiko
On Thursday, 28 June 2012 at 14:34:03 UTC, Andrei Alexandrescu 
wrote:
Well of course I've exaggerated a bit. My point is that 
mentioning "200 ns!!!" sounds to the uninformed ear as good as 
"2000 ns" or "20 ns", i.e. "an amount of time so short by human 
scale, it must mean fast". You need to compare e.g. against 
random access in an array etc.


Andrei


I have no benchmarks for plain array access on the same machine 
and compiler that the authors used. However, it looks like at most 
two cache misses happen. If that is true, we may charge roughly 100 ns 
for each memory access plus computation. I would claim that most of 
that time goes to memory access, since the same algorithms take 35-50 ns 
for smaller arrays (up to 4 Mbit, which is about 512 KB), but I'm 
not sure that my conclusions are definitely true.


Also, I made a mistake in another post. I should have said that 
it is possible to address arrays of up to 2^64 code units, but 
benchmarks are provided for data sizes in bits (i.e., up to 
1GBit).


Asymptotically, the algorithms should require slightly smaller space 
overhead for bigger arrays: space complexity is O(N/log N). But 
memory address resolution may become slower. This is true for 
both Rank/Select algorithms and raw array access.


Again, please note that the price is paid only once per code unit 
resolution (for Select) or code point calculation (for Rank). 
Subsequent nearby accesses should be very cheap.


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Jens Mueller
Andrei Alexandrescu wrote:
> On 6/28/12 10:07 AM, Roman D. Boiko wrote:
> >On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:
> >>I think just exposing them via .sig and .exp might be the way to go?
> >
> >sig is easy to confuse with sign
> 
> .mantissa and .exp

Letting the compiler define these properties is a solution. I thought
Don is looking for something more general. But maybe this isn't needed
here. Don't know. But using mantissa should be discouraged.
I suggest calling them
.significand and .exponent

significand is preferred over mantissa by IEEE FP committee. I think
it's fine to spell them out. There won't be much code using them anyway.

Jens


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Timon Gehr

On 06/28/2012 04:35 PM, Andrei Alexandrescu wrote:

On 6/28/12 10:07 AM, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:

I think just exposing them via .sig and .exp might be the way to go?


sig is easy to confuse with sign


.mantissa and .exp

Andrei


Imho, it should be either

.mantissa and .exponent

or

.mant and .exp


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Andrei Alexandrescu

On 6/28/12 10:07 AM, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:

I think just exposing them via .sig and .exp might be the way to go?


sig is easy to confuse with sign


.mantissa and .exp

Andrei


Re: standard ranges

2012-06-28 Thread Andrei Alexandrescu

On 6/28/12 8:57 AM, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu wrote:

On 6/28/12 5:58 AM, Roman D. Boiko wrote:
Pedantically speaking, sheer timings say nothing without the
appropriate baselines.

Andrei


I used results of benchmarks for two such algorithms, which I like most,
taken from here:

Vigna, S. (2008). "Broadword implementation of rank/select queries".
Experimental Algorithms: 154–168.

http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0


Numbers should be valid for some C/C++ code executed on a machine that
already existed back in 2008. I'm not sure there is a good baseline to
compare. One option would be to benchmark random access to code points
in a UTF-32 string. I also don't know about any D implementations of
these algorithms, thus cannot predict how they would behave against
dstring random access.

But your statement that these timings say nothing is not fair, because
they can be used to conclude that this speed should be enough for most
practical use cases, especially if those use cases are known.


Well of course I've exaggerated a bit. My point is that mentioning "200 
ns!!!" sounds to the uninformed ear as good as "2000 ns" or "20 ns", 
i.e. "an amount of time so short by human scale, it must mean fast". You 
need to compare e.g. against random access in an array etc.



Andrei


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Roman D. Boiko

On Thursday, 28 June 2012 at 14:04:37 UTC, Mehrdad wrote:
I think just exposing them via .sig and .exp might be the way 
to go?


sig is easy to confuse with sign


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Mehrdad

On Thursday, 28 June 2012 at 14:02:37 UTC, Jens Mueller wrote:

Good luck! I'm looking forward to your solution.

Jens


I think just exposing them via .sig and .exp might be the way to 
go?


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Jens Mueller
Don Clugston wrote:
> On 28/06/12 15:31, Jens Mueller wrote:
> >Andrei Alexandrescu wrote:
> >>On 6/22/12 7:41 AM, Don Clugston wrote:
> >>>I think the main thing that's still done in C is the floating point
> >>>formatting.
> >>
> >>Would be great if a contributor could translate FP parsing and
> >>formatting code into D. Then we can use it in CTFE. I need it badly
> >>for some function tabulation code.
> >
> >I think formatting cannot be done such that it is CTFE-able. I tried
> >implementing a less-inefficient version. As far as I can tell at some
> >point you need to extract the significand and the exponent. This is done
> >by some "unsafe" cast which is not allowed in CTFE. I don't know a way
> >to do it in a CTFE-compatible way.
> >
> >Jens
> 
> Yeah, I think I will have to find a way of allowing it. But it's
> difficult to see a clean way of doing it.

Good luck! I'm looking forward to your solution.

Jens


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Don Clugston

On 28/06/12 15:31, Jens Mueller wrote:

Andrei Alexandrescu wrote:

On 6/22/12 7:41 AM, Don Clugston wrote:

I think the main thing that's still done in C is the floating point
formatting.


Would be great if a contributor could translate FP parsing and
formatting code into D. Then we can use it in CTFE. I need it badly
for some function tabulation code.


I think formatting cannot be done such that it is CTFE-able. I tried
implementing a less-inefficient version. As far as I can tell at some
point you need to extract the significand and the exponent. This is done
by some "unsafe" cast which is not allowed in CTFE. I don't know a way
to do it in a CTFE-compatible way.

Jens


Yeah, I think I will have to find a way of allowing it. But it's 
difficult to see a clean way of doing it.


Re: standard ranges

2012-06-28 Thread Christophe Travert
Timon Gehr , dans le message (digitalmars.D:170884), a écrit :
>> And I would say I am also entitled to say strings are not normal
>> ranges, since they define length, but have isLength as true,
> 
> hasLength as false.
Of course, my mistake.

> They define length, but it is not part of the range interface.
> 
> It is analogous to the following:
> [...]

I consider this bad design.

>> and define opIndex and opSlice,
> 
> [] and [..] operate on code units, but for a random access range as
> defined by Phobos, they would not.

A bidirectional range of dchar with additional methods of a random 
access range of char. That is what I call schizophrenic.

>> but are not RandomAccessRanges.
>>
>> The fact that isDynamicArray!(char[]) is true, but
>> isRandomAccessRange is not is just another aspect of the schizophrenia.
>> The behavior of a templated function on a string will depend on which
>> was used as a guard.
>>
> No, it won't.

Take the isRandomAccessRange specialization of an algorithm in Phobos, 
replace the guard by isDynamicArray, and you are very likely to change 
the behavior, if you do not simply break the function.

> When read carefully, the conclusion says that code compatibility is
> important only a couple sentences before it says that breaking code for
> the fun of it may be a good thing.

It was intended as a side-note, not a conclusion. Sorry for not being 
clear.

>> newcomers are troubled by this problem,  and I think it is important.
> 
> Newcomers sometimes become seasoned D programmers. Sometimes they know
> what Unicode is about even before that.

I knew what Unicode was before coming to D. But, strings being arrays, I 
suspected myString.front would return the same as myString[0], i.e., a 
char, and that it was my job to make sure my algorithms were valid for 
UTF-8 encoding if I wanted to support it. Most of the time, in languages 
without such UTF-8 support, they are, without much trouble. Code units 
matter more than code points most of the time.

> The language is consistent here. The library treats some language
> features specially. It is not the language that is "confusing". The
> whole reason to introduce the library behaviour is probably based on
> similar reasoning as given in your post.

OK, I should have said the standard library is inconsistent (with the 
language).

> The special casing has not caused me any trouble, and sometimes it was 
> useful.

Of course, I can live with that.


Re: Raw binary(to work without OS) in D

2012-06-28 Thread bearophile

Paulo Pinto:


as Metro is also native code.


Are you sure? Do you have a reference on this?

Bye,
bearophile


Re: Raw binary(to work without OS) in D

2012-06-28 Thread Jens Mueller
Andrei Alexandrescu wrote:
> On 6/22/12 7:41 AM, Don Clugston wrote:
> >I think the main thing that's still done in C is the floating point
> >formatting.
> 
> Would be great if a contributor could translate FP parsing and
> formatting code into D. Then we can use it in CTFE. I need it badly
> for some function tabulation code.

I think formatting cannot be done such that it is CTFE-able. I tried
implementing a less-inefficient version. As far as I can tell at some
point you need to extract the significand and the exponent. This is done
by some "unsafe" cast which is not allowed in CTFE. I don't know a way
to do it in a CTFE-compatible way.

Jens
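
A minimal run-time sketch of the extraction described here, reinterpreting 
the bits of a double (the pointer cast is exactly the kind of operation 
CTFE rejects; the names and the unhandled special cases make this 
illustrative only):

import std.stdio;

// Split a double into its raw exponent and significand fields.
// Not CTFE-able: the reinterpreting cast below is rejected in CTFE.
// (Subnormals, zero, infinity and NaN are not special-cased here.)
void split(double x, out int exponent, out ulong significand)
{
    ulong bits = *cast(ulong*) &x;                        // the "unsafe" cast
    exponent    = cast(int)((bits >> 52) & 0x7FF) - 1023; // unbias the 11-bit exponent
    significand = bits & ((1UL << 52) - 1);               // 52 stored fraction bits
}

void main()
{
    int e;
    ulong m;
    split(1.5, e, m);
    writeln(e, " ", m);   // 0 and 2^51 (the single fraction bit of 1.5)
}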


Re: standard ranges

2012-06-28 Thread Roman D. Boiko
Timings should not be very different from random access in any 
UTF-32 string implementation, because of the design of these 
algorithms:


* only operations on 64-bit aligned words are performed 
(addition, multiplication, bitwise and shift operations)


* there is no branching except at the very top level for very 
large array sizes


* data is stored in a way that makes the algorithms cache-oblivious, 
IIRC. The authors claim that very few cache misses are necessary 
(1-2 per random access).


* after determining the code unit index for some code point index, 
further access is performed as usual inside an array, so in 
order to perform slicing it is only necessary to calculate code unit 
indices for its start and end.


* original data arrays are not modified (unlike for compact 
representations of dstring, for example).
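
Not the broadword Rank/Select structures referred to here, but a toy 
sampled index in D that makes the code-point-to-code-unit mapping concrete 
(CodePointIndex and K are made-up names; a real succinct structure is far 
more compact and branch-free):

import std.utf : stride, decode;

/// Toy index: remember the code-unit offset of every K-th code point, so
/// random access by code point index walks at most K-1 strides from a sample.
struct CodePointIndex(size_t K = 64)
{
    string text;
    size_t[] samples;   // samples[i] = code-unit index of code point i*K

    this(string s)
    {
        text = s;
        size_t cu = 0, cp = 0;
        while (cu < s.length)
        {
            if (cp % K == 0)
                samples ~= cu;
            cu += stride(s, cu);   // width in code units of this code point
            ++cp;
        }
    }

    // Code point number cpIndex, in roughly O(K) steps instead of O(cpIndex).
    dchar opIndex(size_t cpIndex)
    {
        size_t cu = samples[cpIndex / K];
        foreach (skip; 0 .. cpIndex % K)
            cu += stride(text, cu);
        return decode(text, cu);
    }
}

unittest
{
    auto idx = CodePointIndex!4("a\u00E9\u20AC\U0001D11Eb");
    assert(idx[1] == '\u00E9');      // the second code point, two code units in
    assert(idx[3] == '\U0001D11E');  // a code point outside the BMP
    assert(idx[4] == 'b');
}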


Re: standard ranges

2012-06-28 Thread Timon Gehr

On 06/28/2012 11:28 AM, Christophe Travert wrote:

Jonathan M Davis , dans le message (digitalmars.D:170872), a écrit :

On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:

"Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :

completely consistent with regards to how it treats strings. The _only_
inconsistencies are between the language and the library - namely how
foreach iterates on code units by default and the fact that while the
language defines length, slicing, and random-access operations for
strings, the library effectively does not consider strings to have them.



char[] is not treated as an array by the library


Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and
char[] works with the functions in std.array. It's just that they're all
special-cased appropriately to handle narrow strings properly. What it doesn't
do is treat char[] as a range of char.


and is not treated as a RandomAccessRange.


All arrays are treated as RandomAccessRanges, except for char[] and
wchar[]. So I think I am entitled to say that strings are not treated as
arrays.


"Not treated like other arrays", is the strongest claim that can be
made there.


And I would say I am also entitled to say strings are not normal
ranges, since they define length, but have isLength as true,


hasLength as false. They define length, but it is not part of the range
interface.

It is analogous to the following:

class charArray : ForwardRange!dchar {
    /* interface ForwardRange!dchar */
    dchar front();
    bool empty();
    void popFront();
    charArray save();

    /* other methods, not part of the range interface */
    size_t length();
    char opIndex(size_t i);
    charArray opSlice(size_t a, size_t b);
}


and define opIndex and opSlice,


[] and [..] operate on code units, but for a random access range as
defined by Phobos, they would not.


but are not RandomAccessRanges.

The fact that isDynamicArray!(char[]) is true, but
isRandomAccessRange is not is just another aspect of the schizophrenia.
The behavior of a templated function on a string will depend on which
was used as a guard.



No, it won't.



Which is what I already said.


That is a second inconsistency, and it would be avoided if string were a

struct.

No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing
for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a
struct to represent them is irrelevant. And if you can't do those operations
in O(1), then they can't be random access ranges.


I never said strings should support length and slicing. I even said
they should not. foreach is inconsistent with the way strings are
treated in phobos, but opIndex, opSlice and length are inconsistent too.
string[0] and string.front do not even return the same thing.

Please read my post a little bit more carefully before
answering them.



This is very impolite.

On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:

Slicing is provided for convenience, but not as opSlice, since it is not 
O(1), but as a method with a separate name.




About the rest of your post, I basically say the same as you in shorter
terms, except that I am in favor of changing things (but I didn't even
say they should be changed in my conclusion).



When read carefully, the conclusion says that code compatibility is
important only a couple sentences before it says that breaking code for
the fun of it may be a good thing.


newcomers are troubled by this problem,  and I think it is important.


Newcomers sometimes become seasoned D programmers. Sometimes they know
what Unicode is about even before that.


They will make mistakes when using both array and range functions on
strings in the same algorithm, or when using array functions without
knowing about utf8 encoding issues (the fact that array functions are
also valid range functions if not for strings does not help). But I also
think experienced programmers can be affected, because of inattention,
reusing code written by inexperienced programmers, or inappropriate
template guard usage.


In the ASCII-7 subset, UTF-8 strings are actually random access, and
iterating a UTF-8 string by code point is safe if you are e.g. just
going to treat some ASCII characters specially.

I don't care much whether or not (bad?) code handles Unicode correctly,
but it is important that code correctly documents whether or not it
does so, and to what extent it does. The new std.regex has good Unicode
support, and to enable that, it had to add some pretty large tables to
Phobos, the functionality of which is not exposed to the library user
as of now. It is therefore safe to say that many/most existing D
programs do not handle the whole Unicode standard correctly.

Unicode has to be _actively_ supported. There are distinct issues that
are hard to abstract away efficiently. Treating an Unicode string as a
range of code points is not solving them. (dchar[] indexing is still
not guaranteed to give back the 'i'

Re: standard ranges

2012-06-28 Thread Roman D. Boiko
On Thursday, 28 June 2012 at 12:29:14 UTC, Andrei Alexandrescu 
wrote:

On 6/28/12 5:58 AM, Roman D. Boiko wrote:
Pedantically speaking, sheer timings say nothing without the 
appropriate baselines.


Andrei


I used results of benchmarks for two such algorithms, which I 
like most, taken from here:


Vigna, S. (2008). "Broadword implementation of rank/select 
queries". Experimental Algorithms: 154–168.


http://en.wikipedia.org/wiki/Succinct_data_structure#cite_ref-vigna2008broadword_6-0

Numbers should be valid for some C/C++ code executed on a machine 
that already existed back in 2008. I'm not sure there is a good 
baseline to compare. One option would be to benchmark random 
access to code points in a UTF-32 string. I also don't know about 
any D implementations of these algorithms, thus cannot predict 
how they would behave against dstring random access.


But your statement that these timings say nothing is not fair, 
because they can be used to conclude that this speed should be 
enough for most practical use cases, especially if those use 
cases are known.


Re: standard ranges

2012-06-28 Thread Andrei Alexandrescu

On 6/28/12 5:28 AM, Christophe Travert wrote:

As a more general comment, I think having a consistent language is a very
important goal to achieve when designing a language. It makes everything
simpler, from language design to user through compiler and library
development. It may not be too late for D.


In a way it's too late for any language in actual use. The "fog of 
language design" makes it nigh impossible to design a language/library 
combo that is perfectly consistent, not to mention the fact that 
consistency itself has many dimensions, some of which may be in competition.


We'd probably do things a bit differently if we started from scratch. As 
things are, D's strings have a couple of quirks but are very apt for 
good and efficient string manipulation where index computation in the 
code unit realm is combined with the range of code points realm. I 
suppose people who have an understanding of UTF don't have difficulty 
using D's strings. Above all, alea jacta est and there's little we can 
do about that save for inventing a time machine.



Andrei



Re: standard ranges

2012-06-28 Thread Andrei Alexandrescu

On 6/28/12 5:58 AM, Roman D. Boiko wrote:

Pedantically speaking, it is possible to index a string with about
50-51% memory overhead to get random access in O(1) time.
Best-performing algorithms can do random access in about 35-50
nanoseconds per operation for strings up to tens of megabytes. For
bigger strings (tested up to 1GB) or when some other memory-intensive
calculations are performed simultaneously, random access takes up to 200
nanoseconds due to memory-access resolution process.


Pedantically speaking, sheer timings say nothing without the appropriate 
baselines.


Andrei


Re: standard ranges

2012-06-28 Thread Roman D. Boiko

On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
Pedantically speaking, it is possible to index a string with 
about 50-51% memory overhead to get random access in O(1) time. 
Best-performing algorithms can do random access in about 35-50 
nanoseconds per operation for strings up to tens of megabytes. 
For bigger strings (tested up to 1GB) or when some other 
memory-intensive calculations are performed simultaneously, 
random access takes up to 200 nanoseconds due to memory-access 
resolution process.
This would support both random access to characters by their code 
point index in a string and determining code point index by code 
unit index.


If only the former is needed, space overhead decreases to 25% for 
1K and <15% for 16K-1G string sizes (measured in number of code 
units, which is twice the number of bytes for wstring). Strings 
up to 2^64 code units would be supported.


This would also improve access speed significantly (by 10% for 
small strings and about twice for large).




Re: standard ranges

2012-06-28 Thread Christophe Travert
"David Nadlinger" , dans le message (digitalmars.D:170875), a écrit :
> On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
>>> char[] is not treated as an array by the library, and is not 
>>> treated as a RandomAccessRange. That is a second 
>>> inconsistency, and it would be avoided if string were a struct.
>>
>> So, it looked to me like you were saying that making string a 
>> struct would
>> make it so that it was a random access range, which would mean 
>> implementing
>> length, opSlice, and opIndex.
> 
> I think he meant that the problem would be solved because people 
> would be less likely to expect it to be a random access range in 
> the first place.

Yes.



Re: standard ranges

2012-06-28 Thread Roman D. Boiko

On Thursday, 28 June 2012 at 10:02:59 UTC, Roman D. Boiko wrote:

On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
Pedantically speaking, it is possible to index a string with 
about 50-51% memory overhead to get random access in O(1) 
time. Best-performing algorithms can do random access in about 
35-50 nanoseconds per operation for strings up to tens of 
megabytes. For bigger strings (tested up to 1GB) or when some 
other memory-intensive calculations are performed 
simultaneously, random access takes up to 200 nanoseconds due 
to memory-access resolution process.


Just a remark, indexing would take O(N) operations and N/B 
memory transfers where N = str.length and B is the size of the cache 
buffer.


That being said, I would be against switching away from representing 
strings as arrays. Such a switch would hardly help us solve any problems 
of practical importance significantly better than they can be solved 
with the current design.


However, a struct could be created for indexing, which I mentioned 
in two previous posts, to give efficient random access for narrow 
strings (and arbitrary variable-length data stored consecutively 
in arrays) without any significant overhead.


Respective algorithms are called Rank and Select, and there exist 
many variations of them (with different trade-offs, but some of 
them are arguably better than others).


I have investigated this question quite deeply in the last two 
weeks, because similar algorithms would be useful in my DCT 
project. If nobody else implements them before me, I will 
eventually do that myself. It is just a matter of finding some 
free time, likely a week or two.


Re: standard ranges

2012-06-28 Thread Roman D. Boiko

On Thursday, 28 June 2012 at 09:58:02 UTC, Roman D. Boiko wrote:
Pedantically speaking, it is possible to index a string with 
about 50-51% memory overhead to get random access in O(1) time. 
Best-performing algorithms can do random access in about 35-50 
nanoseconds per operation for strings up to tens of megabytes. 
For bigger strings (tested up to 1GB) or when some other 
memory-intensive calculations are performed simultaneously, 
random access takes up to 200 nanoseconds due to memory-access 
resolution process.


Just a remark, indexing would take O(N) operations and N/B memory 
transfers where N = str.length and B is the size of the cache buffer.


Re: standard ranges

2012-06-28 Thread Roman D. Boiko

On Thursday, 28 June 2012 at 05:10:43 UTC, Jonathan M Davis wrote:

On Thursday, June 28, 2012 08:59:32 Gor Gyolchanyan wrote:
Currently strings below dstring are only applicable in ForwardRange 
and below, but not RandomAccessRange as they should be.


Except that they shouldn't be, because you can't do random access on a 
narrow string in O(1). If you can't index or slice a range in O(1), it 
has no business having those operations. The same goes for length. 
That's why narrow strings do not have any of those operations as far as 
ranges are concerned. Having those operations in anything worse than 
O(1) violates the algorithmic complexity guarantees that ranges are 
supposed to provide, which would seriously harm the efficiency of 
algorithms which rely on them. It's the same reason why std.container 
defines the algorithmic complexity of all the operations in 
std.container. If you want a random-access range which is a string 
type, you need dchar[], const(dchar)[], or dstring. That is very much 
on purpose and would not change even if strings were structs.

- Jonathan M Davis


Pedantically speaking, it is possible to index a string with 
about 50-51% memory overhead to get random access in O(1) time. 
Best-performing algorithms can do random access in about 35-50 
nanoseconds per operation for strings up to tens of megabytes. 
For bigger strings (tested up to 1GB) or when some other 
memory-intensive calculations are performed simultaneously, 
random access takes up to 200 nanoseconds due to memory-access 
resolution process.


Re: standard ranges

2012-06-28 Thread David Nadlinger

On Thursday, 28 June 2012 at 09:49:19 UTC, Jonathan M Davis wrote:
char[] is not treated as an array by the library, and is not 
treated as a RandomAccessRange. That is a second 
inconsistency, and it would be avoided if string were a struct.


So, it looked to me like you were saying that making string a 
struct would make it so that it was a random access range, 
which would mean implementing length, opSlice, and opIndex.


I think he meant that the problem would be solved because people 
would be less likely to expect it to be a random access range in 
the first place.


What troubles me most with having is(string == immutable(char)[]) 
is that it more or less precludes us from adding small string 
optimizations, etc. in the future…


David


Re: standard ranges

2012-06-28 Thread Jonathan M Davis
On Thursday, June 28, 2012 09:28:52 Christophe Travert wrote:
> I never said strings should support length and slicing. I even said
> they should not. foreach is inconsistent with the way strings are
> treated in phobos, but opIndex, opSlice and length are inconsistent too.
> string[0] and string.front do not even return the same thing.
> 
> Please read my post a little bit more carefully before
> answering them.

You said this:

> char[] is not treated as an array by the library, and is not treated as 
> a RandomAccessRange. That is a second inconsistency, and it would be 
> avoided if string were a struct.

So, it looked to me like you were saying that making string a struct would 
make it so that it was a random access range, which would mean implementing 
length, opSlice, and opIndex.

- Jonathan M Davis


Re: standard ranges

2012-06-28 Thread Christophe Travert
Jonathan M Davis , dans le message (digitalmars.D:170872), a écrit :
> On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
>> "Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
>> > completely consistent with regards to how it treats strings. The _only_
> > inconsistencies are between the language and the library - namely how
>> > foreach iterates on code units by default and the fact that while the
>> > language defines length, slicing, and random-access operations for
>> > strings, the library effectively does not consider strings to have them.
> 
>> char[] is not treated as an array by the library
> 
> Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and 
> char[] works with the functions in std.array. It's just that they're all 
> special-cased appropriately to handle narrow strings properly. What it 
> doesn't 
> do is treat char[] as a range of char.
> 
>> and is not treated as a RandomAccessRange.

All arrays are treated as RandomAccessRanges, except for char[] and 
wchar[]. So I think I am entitled to say that strings are not treated as 
arrays. And I would say I am also entitled to say strings are not normal 
ranges, since they define length, but have isLength as true, and define 
opIndex and opSlice, but are not RandomAccessRanges.

The fact that isDynamicArray!(char[]) is true, but 
isRandomAccessRange is not is just another aspect of the schizophrenia. 
The behavior of a templated function on a string will depend on which 
was used as a guard.

> 
> Which is what I already said.
> 
>> That is a second inconsistency, and it would be avoided if string were a 
> struct.
> 
> No, it wouldn't. It is _impossible_ to implement length, slicing, and 
> indexing 
> for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a 
> struct to represent them is irrelevant. And if you can't do those operations 
> in O(1), then they can't be random access ranges.

I never said strings should support length and slicing. I even said 
they should not. foreach is inconsistent with the way strings are 
treated in phobos, but opIndex, opSlice and length are inconsistent too. 
string[0] and string.front do not even return the same thing.

Please read my post a little bit more carefully before 
answering them.

About the rest of your post, I basically say the same as you in shorter 
terms, except that I am in favor of changing things (but I didn't even 
say they should be changed in my conclusion).

newcomers are troubled by this problem, and I think it is important. 
They will make mistakes when using both array and range functions on 
strings in the same algorithm, or when using array functions without 
knowing about utf8 encoding issues (the fact that array functions are 
also valid range functions if not for strings does not help). But I also 
think experienced programmers can be affected, because of inattention, 
reusing code written by inexperienced programmers, or inappropriate 
template guard usage.

As a more general comment, I think having a consistent language is a very 
important goal to achieve when designing a language. It makes everything 
simpler, from language design to user through compiler and library 
development. It may not be too late for D.

-- 
Christophe


Re: standard ranges

2012-06-28 Thread Jonathan M Davis
On Thursday, June 28, 2012 08:05:19 Christophe Travert wrote:
> "Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
> > completely consistent with regards to how it treats strings. The _only_
> > inconsistencies are between the language and the library - namely how
> > foreach iterates on code units by default and the fact that while the
> > language defines length, slicing, and random-access operations for
> > strings, the library effectively does not consider strings to have them.

> char[] is not treated as an array by the library

Phobos _does_ treat char[] as an array. isDynamicArray!(char[]) is true, and 
char[] works with the functions in std.array. It's just that they're all 
special-cased appropriately to handle narrow strings properly. What it doesn't 
do is treat char[] as a range of char.

> and is not treated as a RandomAccessRange.

Which is what I already said.

> That is a second inconsistency, and it would be avoided if string were a 
struct.

No, it wouldn't. It is _impossible_ to implement length, slicing, and indexing 
for UTF-8 and UTF-16 strings in O(1). Whether you're using an array or a 
struct to represent them is irrelevant. And if you can't do those operations 
in O(1), then they can't be random access ranges.

The _only_ thing that using a struct for narrow strings fixes is the 
inconsistencies with foreach (it would then use dchar just like all of the 
range stuff does), and slicing, indexing, and length wouldn't be on it, 
eliminating the oddity of them existing but not considered to exist by range-
based functions. It _would_ make things somewhat nicer for newbies, but it 
would not give you one iota more of functionality. Narrow strings would still 
be bidirectional ranges but not random-access ranges, and you would still have to 
operate on the underlying array to operate on strings efficiently.

If we were to start from scratch, it probably would be better to go with a 
struct type for strings, but it would break far too much code for far too 
little benefit at this point. You need to understand the unicode stuff 
regardless - like the difference between code units and code points. So, if 
anything, the fact that strings are treated inconsistently and are treated as 
ranges of dchar - which confuses so many newbies - is arguably a _good_ thing 
in that it forces newbies to realize and understand the unicode issues 
involved rather than blindly using strings in a horribly inefficient manner as 
would inevitably occur with a struct string type.

So, no, the situation is not exactly ideal, and yes, a struct string type 
might have been a better solution, but I think that many of the folks who are 
pushing for a struct string type are seriously overestimating the problems 
that it would solve. Yes, it would make the language and library more 
consistent, but that's it. You'd still have to use strings in essentially the 
same way that you do now. It's just that you wouldn't have to explicitly use 
dchar with foreach, and you'd have to get at the property which returned the 
underlying array in order to operate on the code units as you need to do in 
many functions to make your code appropriately efficient rather than simply 
using the string that way directly by not using its range-based functions. 
There is a difference, but it's a lot smaller than many people seem to think.

- Jonathan M Davis
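
A compact check of the behaviour described above, in terms of the actual 
traits (a sketch; the values shown match how Phobos special-cases narrow 
strings):

import std.traits : isDynamicArray;
import std.range;   // isRandomAccessRange, hasLength, ElementType, front, walkLength

// The library sees an array...
static assert(isDynamicArray!(char[]));
// ...but not a random-access range, and .length is not part of the range API.
static assert(!isRandomAccessRange!(char[]));
static assert(!hasLength!(char[]));
// As a range, the element type is dchar, not char.
static assert(is(ElementType!(char[]) == dchar));

void main()
{
    string s = "h\u00E9llo";
    static assert(is(typeof(s[0]) == immutable(char))); // indexing gives a code unit
    static assert(is(typeof(s.front) == dchar));        // the range front is a code point
    assert(s.length == 6);       // code units: the é takes two in UTF-8
    assert(s.walkLength == 5);   // code points, counted through the range interface
}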


Re: standard ranges

2012-06-28 Thread Christophe Travert
"Jonathan M Davis" , dans le message (digitalmars.D:170852), a écrit :
> completely consistent with regards to how it treats strings. The _only_ 
> inconsistencies are between the language and the library - namely how foreach 
> iterates on code units by default and the fact that while the language 
> defines 
> length, slicing, and random-access operations for strings, the library 
> effectively does not consider strings to have them.

char[] is not treated as an array by the library, and is not treated as 
a RandomAccessRange. That is a second inconsistency, and it would be 
avoided if string were a struct.

I won't repeat arguments that were already said, but if it matters, to 
me, things should be such that:

 - string is a druntime defined struct, with an underlying 
immutable(char)[]. It is a BidirectionalRange of dchar. Slicing is 
provided for convenience, but not as opSlice, since it is not O(1), but 
as a method with a separate name. Direct access to the underlying 
char[]/ubyte[] is provided.

 - similar structs are provided to hold underlying const(char)[] and 
char[]

 - similar structs are provided for wstring

 - dstring is a druntime defined alias to dchar[] or a struct with the 
same functionalities, for consistency with narrow strings being structs.

 - All those structs may be provided as a template.
struct string(T = immutable(char)) {...}
alias string!(immutable(wchar)) wstring;
alias string!(immutable(dchar)) dstring;

string!(const(char)) and string!(char) ... are the other types of 
strings.

 - this string template could also be defined as a wrapper to convert 
any range of char/wchar into a range of dchar. That does not need to be 
in druntime. Only types necessary for string literals should be in 
druntime.

 - string should not be convertible to char*. Use toStringz to interface 
with C code, or the underlying char[] if you know it is 
zero-terminated, at your own risk. Only string literals need to be 
convertible to char*, and I would say that they should be 
zero-terminated only when they are directly used as char*, to allow the 
compiler to optimize memory.

 - char /may/ disappear in favor of ubyte (or the contrary, or one could 
alias the other), if there is no other need to keep separate types than 
having strings that are different from ubyte[]. Only dchar is necessary, 
and it could just be called char.

That is ideal to me. Of course, I understand code compatibility is 
important, and compromises have to be made. The current situation is a 
compromise, but I don't like it because it is a WAT for every newcomer. 
But the last point, for example, would bring no more than code breakage. 
Such code breakage might help us find bugs, however...

-- 
Christophe
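
A minimal sketch of the kind of struct outlined above, purely illustrative 
(the name String, the exposed .data member and the exact method set are 
assumptions, not an actual druntime proposal): a bidirectional range of 
dchar with no opIndex/opSlice, but direct access to the underlying code 
units.

import std.utf : decode, stride, strideBack;

struct String
{
    immutable(char)[] data;   // the underlying code units, exposed directly

    @property bool empty() { return data.length == 0; }

    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);       // decode the first code point
    }

    void popFront() { data = data[stride(data, 0) .. $]; }

    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);       // decode the last code point
    }

    void popBack() { data = data[0 .. $ - strideBack(data, data.length)]; }

    @property String save() { return String(data); }
}

unittest
{
    import std.algorithm : equal;
    auto s = String("h\u00E9llo");
    assert(equal(s, "h\u00E9llo"));   // iterated as code points (dchar)
    assert(s.data.length == 6);       // while the code units stay right there
}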


Re: Productions users

2012-06-28 Thread Jonathan M Davis
On Thursday, June 28, 2012 09:29:14 Alex Rønne Petersen wrote:
> On 27-06-2012 23:31, Jonathan M Davis wrote:
> > On Wednesday, June 27, 2012 23:00:58 nazriel wrote:
> >> On Wednesday, 27 June 2012 at 08:53:14 UTC, Andrea Fontana wrote:
> >>> I think it would be useful to add on dlang.org a section to
> >>> show how d is used in production. I can't find any page about
> >>> it. It seems an academic-only programming language!
> >> 
> >> What do you mean by production?
> >> Open source projects? Freeware applications?
> >> Do commercial projects count?
> > 
> > I would have expected "in production" to _only_ mean commercial projects.
> > 
> > - Jonathan M Davis
> 
> I think it would be a mistake to only highlight commercial users. As
> Tobias pointed out, there are many non-profit organizations running on
> open source software that are well-known.

Oh, I wasn't suggesting that we only highlight commercial projects. I was just 
saying that I expected the term "in production" to refer to commercial 
projects specifically. Whether we want to highlight major open source projects 
and/or other non-commercial projects is another matter entirely - though I 
suspect that commercial projects would generally carry more weight than other 
types of projects in terms of convincing people that D is being used seriously 
in the real world.

- Jonathan M Davis


Re: Productions users

2012-06-28 Thread Andrea Fontana
In production it's just a way to say "completed, not still in 
pre-alpha/alpha/beta/testing phase". Usable. Working. Public :)


No difference between commercial, open source, free, etc ...

On Wednesday, 27 June 2012 at 21:33:20 UTC, Jonathan M Davis 
wrote:

On Wednesday, June 27, 2012 23:00:58 nazriel wrote:
On Wednesday, 27 June 2012 at 08:53:14 UTC, Andrea Fontana 
wrote:

> I think it would be useful to add on dlang.org a section to
> show how d is used in production. I can't find any page about
> it. It seems an academic-only programming language!

What do you mean by production?
Open source projects? Freeware applications?
Do commercial projects count?


I would have expected "in production" to _only_ mean commercial 
projects.


- Jonathan M Davis





Re: Productions users

2012-06-28 Thread Andrea Fontana
If you take a good project/library/service, on its homepage 
(not the wiki) there's always a list of production projects (that 
means: "it's not currently in development but it's public and 
completed") that use it.


For example, mongodb. On its homepage there's a list of 
production users and a link named "more production users" that 
points here: 
http://www.mongodb.org/display/DOCS/Production+Deployments


Check rails: http://rubyonrails.org/   it has "who is already on 
rails?" on its homepage


Another one? http://hadoop.apache.org/  on its homepage: "Who Uses 
Hadoop?"

Amazon AWS also has this section on its homepage.

If I visit dlang.org I think: "OK, nice language, but does it work in 
the real world for real projects or is it just a toy language?"


My company uses D for a "natural language parser" I've written 
for our internal search engine (our users search - in Italian - 
"restaurants in the province of Venice open on Valentine's Day" 
and my parser translates that phrase into a query).


On Wednesday, 27 June 2012 at 21:56:30 UTC, Tobias Pankrath wrote:
On Wednesday, 27 June 2012 at 21:33:20 UTC, Jonathan M Davis 
wrote:

On Wednesday, June 27, 2012 23:00:58 nazriel wrote:
On Wednesday, 27 June 2012 at 08:53:14 UTC, Andrea Fontana 
wrote:

> I think it would be useful to add on dlang.org a section to
> show how d is used in production. I can't find any page 
> about

> it. It seems an academic-only programming language!

What do you mean by production?
Open source projects? Freeware applications?
Do commercial projects count?


I would have expected "in production" to _only_ mean 
commercial projects.


- Jonathan M Davis


I wouldn't. But that is probably a definition thing. If I'd 
written, say, Wikipedia in it, that would qualify as production 
use, too.




Re: Productions users

2012-06-28 Thread Alex Rønne Petersen

On 27-06-2012 23:31, Jonathan M Davis wrote:

On Wednesday, June 27, 2012 23:00:58 nazriel wrote:

On Wednesday, 27 June 2012 at 08:53:14 UTC, Andrea Fontana wrote:

I think it would be useful to add on dlang.org a section to
show how d is used in production. I can't find any page about
it. It seems an academic-only programming language!


What do you mean by production?
Open source projects? Freeware applications?
Do commercial projects count?


I would have expected "in production" to _only_ mean commercial projects.

- Jonathan M Davis



I think it would be a mistake to only highlight commercial users. As 
Tobias pointed out, there are many non-profit organizations running on 
open source software that are well-known.


--
Alex Rønne Petersen
a...@lycus.org
http://lycus.org